Ray on Golem troubleshooting
What is going on with my cluster - status and dashboard
Problem
Often you will find yourself wondering "How is my cluster doing?"
Solution
Ray offers two tools to inspect the cluster state: the status command and the dashboard.
Ray status shows information about the nodes that make up the cluster, the total resources of the cluster, and their current usage.
The ray status command needs to be executed on the head node:
ray exec golem-cluster.yaml 'ray status'
Ray On Golem webserver
Not starting webserver, as it's already running
Fetched IP: 192.168.0.3
======== Autoscaler status: 2023-11-28 08:09:48.391906 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.head.default
1 ray.worker.default
Pending:
ray.worker.default, 5 launching
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
2.0/2.0 CPU
0B/5.34GiB memory
0B/2.61GiB object_store_memory
Demands:
{'CPU': 1.0}: 385+ pending tasks/actors
Shared connection to 192.168.0.3 closed.
The Ray dashboard visualizes the nodes and various other aspects of a running cluster, like the specific jobs and function calls executed by each node.
Run ray dashboard golem-cluster.yaml to forward your cluster's dashboard to http://localhost:8265. Keep the command running in the background so that the tunnel stays active for as long as you need it.
For a more thorough description of the dashboard's amazing secrets, check out the Ray docs.
Ray on Golem's log files
Problem
When you encounter issues while running one of the various Ray commands that utilize Ray on Golem, it may not always be immediately obvious what the problem is from the output of the command itself.
Where possible, we try to display a meaningful error message - e.g. when ray up fails to start the Ray on Golem webserver, we display the last few lines of its logs.
If that's not enough to suggest a way to fix the issue, investigating the logs may provide the required details.
Solution
Ray on Golem writes its logs to the following files (in a default, platform-dependent application directory):
- ~/.local/share/ray_on_golem/webserver.log - cluster manager log - basic Ray on Golem logs.
- ~/.local/share/ray_on_golem/webserver_debug.log - cluster manager debug log - more detailed Ray on Golem logs.
- ~/.local/share/ray_on_golem/yagna.log - Golem node (yagna) logs.
If you're unsure where that directory is, run ray-on-golem status to find out which location is in use on your system.
Given these, you can either:
- Look at the webserver.log yourself, as shown below - our aim is for you to be able to diagnose and resolve the most straightforward issues on your own.
- Share the webserver_debug.log and yagna.log with us on the #Ray on Golem Discord channel - we will be more than happy to assist.
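For the first option, printing the most recent log entries is usually enough - a minimal sketch, assuming the default log location listed above:
# show the last 50 lines of the basic Ray on Golem log
tail -n 50 ~/.local/share/ray_on_golem/webserver.log
# switch to the debug log when you need more detail
tail -n 50 ~/.local/share/ray_on_golem/webserver_debug.log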
Starting over with a clean slate
Problem
It may happen that something goes wrong and you wish to start over with a clean slate.
Solution
The first thing to try is ray down - it should be enough to clear the situation, but sadly that isn't always the case.
You can use ray-on-golem stop to completely stop all Ray on Golem components. The next ray up will start them again in a fresh state. Check out more details about stopping and starting Ray on Golem.
It might also be a good idea to clean up Ray's configuration cache files:
rm /tmp/ray-config-*
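Putting the above together, a complete reset could look like this - a sketch, assuming the golem-cluster.yaml file used throughout this guide:
# tear the cluster down
ray down golem-cluster.yaml
# stop all Ray on Golem components
ray-on-golem stop
# remove Ray's cached cluster configuration
rm /tmp/ray-config-*
# start everything again in a fresh state
ray up golem-cluster.yaml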
How can I resolve the node not found error?
Problem
Sometimes, ray up might time out with a note saying that there is no node available.
It might look like this:
ray_on_golem.client.exceptions.RayOnGolemClientError: Couldn't create node: {"error": "Creating activities timeout reached"}
Solution
This means that there are not enough providers on the network.
If you are running Ray on Golem on the testnet (the payment_network: "holesky" property in the cluster yaml), most likely all the nodes are busy with requests from other users.
We are preparing a tool to check providers' availability. Another solution would be to move to mainnet - we are also working on enabling this option.
For now, the best course of action is to report the situation on the #Ray on Golem Discord channel.
Usually, the testnet isn't busy for a long time - it might be enough to wait a couple of minutes.
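If you want to double-check which payment network your cluster is configured for, a quick look at the yaml is enough (assuming the golem-cluster.yaml file name used in the examples):
# print the payment network setting from the cluster yaml
grep payment_network golem-cluster.yaml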
Additional packages required on the cluster
Problem
Your app might need non-standard packages to run on a Ray on Golem cluster. The default VM image includes nothing besides the bare minimum.
In such a case, the output of ray submit may look like this:
Traceback (most recent call last):
File "/root/dds-with-ray.py", line 49, in <module>
results = get_lots_of_deals()
File "/root/dds-with-ray.py", line 42, in get_lots_of_deals
results = ray.get(result_ids)
File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ModuleNotFoundError): ray::get_deal() (pid=871, ip=192.168.0.3)
File "/root/dds-with-ray.py", line 30, in get_deal
from endplay.dds import calc_dd_table
ModuleNotFoundError: No module named 'endplay'
Shared connection to 192.168.0.3 closed.
Error: Command failed:
Solution
Note that even if you have the needed dependencies installed and your app runs in your local environment, you still need to tell the Ray on Golem cluster which packages it needs.
The best way to do this is by adding an appropriate pip install command to setup_commands in the cluster yaml file. Check out the cluster yaml article for more information.
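As an illustration, making the endplay module from the traceback above available on every node could look roughly like this - a sketch of the relevant cluster yaml fragment, to be merged with whatever setup_commands entries your file already contains:
# golem-cluster.yaml (fragment)
setup_commands:
  # install the extra packages your app imports, on every node
  - pip install endplay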
Passing arguments to your Ray script fails
Problem
Often, you need to pass arguments to the Ray script you are submitting:
ray submit golem-cluster.yaml examples/simple-task.py -n 20
While it looks correct, in this particular case it won't have the intended effect of passing the parameter to the simple-task.py script. Instead, it will cause Ray to unexpectedly ask for the node's password:
Checking External environment settings
Ray On Golem webserver
Not starting webserver, as it's already running
Fetched IP: 192.168.0.3
root@192.168.0.3's password:
Solution
ray submit and other Ray commands have their own arguments. If your script's arguments share the same names, Ray gets confused.
In this example, -n stands for the cluster name, which we don't support yet - hence the unexpected request for the SSH password.
The solution is to precede your arguments with the double-dash symbol (--):
ray submit golem-cluster.yaml examples/simple-task.py -- -n 20
This informs Ray that everything after the double dash is not to be interpreted, but instead passed as-is to the executed script.
Head node not found
Problem
Sometimes, you might get a RuntimeError: Head node of cluster (golem-cluster) not found! error message when attempting to run a Ray CLI command.
Solution
This means that you are trying to run a command that requires a cluster to be up (like ray submit), but the cluster is not alive.
The simplest explanation might be that you haven't successfully run ray up. Check out the logs to see what happened.
Another explanation might be that the cluster was launched but has since stopped or crashed - please share the logs with us to help us prevent this from happening in the future.
If you see nothing interesting in the logs, you might want to start over.
Feel free to reach out to us on the #Ray on Golem Discord channel - we will be more than happy to assist.
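As a quick recovery path, you can bring the cluster up again and retry the failing command - a minimal sketch, using the yaml and example script from this guide:
# (re)create the cluster and watch the output for errors
ray up golem-cluster.yaml
# retry the command that needed a live cluster
ray submit golem-cluster.yaml examples/simple-task.py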