Here are the supported PyTorch Lightning versions:

| Ray Lightning | PyTorch Lightning |
|---|---|
| 0.1 | 1.4.1 |

## PyTorch Distributed Data Parallel Plugin on Ray
The `RayPlugin` provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.
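For orientation, here is a minimal end-to-end sketch of wiring the plugin into a standard Lightning `Trainer`. The toy `LinearModel`, the random dataset, and the exact `RayPlugin` arguments (`num_workers`, `use_gpu`) are illustrative assumptions; check the plugin's docstring for the signature in your installed version.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

from ray_lightning import RayPlugin


# Tiny stand-in model, used only to show how the plugin is passed to the Trainer.
class LinearModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


train_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32
)

# Assumed arguments; Ray launches and manages the worker processes.
plugin = RayPlugin(num_workers=2, use_gpu=False)

trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
trainer.fit(LinearModel(), train_loader)
```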
Because Ray is used to launch processes, instead of the same script being called multiple times, you can use this plugin even in environments where you normally could not launch DDP training yourself:

- Jupyter Notebooks, Google Colab, Kaggle
- Calling `fit` or `test` multiple times in the same script
## Multi-node Distributed Training
Using the same examples above, you can run distributed training on a multi-node cluster with just two simple steps.

1) [Use Ray's cluster launcher](https://docs.ray.io/en/master/cluster/launcher.html) to start a Ray cluster: `ray up my_cluster_config.yaml`.

2) [Execute your Python script on the Ray cluster](https://docs.ray.io/en/master/cluster/commands.html#running-ray-scripts-on-the-cluster-ray-submit): `ray submit my_cluster_config.yaml train.py`. This will `rsync` your training script to the head node and execute it on the Ray cluster.

You no longer have to set environment variables or configurations and run your training script on every single node.
## Multi-node Training from your Laptop
Ray provides the capability to run multi-node and GPU training entirely from your laptop through the [Ray Client](https://docs.ray.io/en/master/cluster/ray-client.html).

You can follow the instructions [here](https://docs.ray.io/en/master/cluster/ray-client.html) to set up the cluster.
Then, add this line to the beginning of your script to connect to the cluster:
```python
import ray

# Replace with the appropriate host and port of your cluster's head node.
ray.init("ray://<head_node_host>:10001")
```
Now you can run your training script on your laptop, but it will execute as if your laptop had all the resources of the cluster, essentially providing you with an **infinite laptop**.

**Note:** When using Ray Client, you must disable checkpointing and logging for your `Trainer` by setting `checkpoint_callback` and `logger` to `False`.
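For example, a minimal sketch of a Ray Client-friendly `Trainer`; the `RayPlugin` arguments are assumed, as in the sketch above:

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# With Ray Client, keep checkpointing and logging off on the Trainer.
trainer = pl.Trainer(
    max_epochs=1,
    plugins=[RayPlugin(num_workers=2)],  # assumed arguments
    checkpoint_callback=False,
    logger=False,
)
```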
## Horovod Plugin on Ray
Or if you prefer to use Horovod as the distributed training protocol, use the `HorovodRayPlugin` instead.
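A rough sketch of the swap is shown below, reusing the toy `LinearModel` and `train_loader` from the DDP example above; the `HorovodRayPlugin` argument names here (`num_hosts`, `num_slots`, `use_gpu`) are assumptions, so check the plugin's docstring for the exact signature.

```python
import pytorch_lightning as pl
from ray_lightning import HorovodRayPlugin

# LinearModel and train_loader are the toy objects defined in the DDP sketch above.
# Argument names below are assumptions about the Horovod plugin's signature.
plugin = HorovodRayPlugin(num_hosts=1, num_slots=2, use_gpu=False)

trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
trainer.fit(LinearModel(), train_loader)
```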
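The package also covers model-parallel sharded training on Ray. The original example is elided in this excerpt, so the following is only a sketch under the assumption that the plugin is exposed as `RayShardedPlugin` with `RayPlugin`-style arguments, again reusing the toy model and dataloader from the DDP sketch above.

```python
import pytorch_lightning as pl
from ray_lightning import RayShardedPlugin

# Assumed class name and arguments; consult the package's API reference.
plugin = RayShardedPlugin(num_workers=2, use_gpu=False)

trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
trainer.fit(LinearModel(), train_loader)
```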
See the [PyTorch Lightning docs](https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html#sharded-training) for more information on sharded training.
## Hyperparameter Tuning with Ray Tune
`ray_lightning` also integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed model training. You can run multiple PyTorch Lightning training runs in parallel, each with a different hyperparameter configuration, and each training run parallelized by itself. All you have to do is move your training code to a function, pass the function to `tune.run`, and make sure to add the appropriate callback (either `TuneReportCallback` or `TuneReportCheckpointCallback`) to your PyTorch Lightning `Trainer`.
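A condensed sketch of that workflow is below. The import path for `TuneReportCallback`, the `RayPlugin` arguments, and the `resources_per_trial` bookkeeping are all assumptions that vary across versions, so treat this as the shape of the code rather than the exact API.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

from ray import tune
from ray_lightning import RayPlugin
# Assumed import path; depending on the version, the Tune callbacks may instead
# live under ray.tune.integration.pytorch_lightning.
from ray_lightning.tune import TuneReportCallback


class TuneModel(pl.LightningModule):
    """Tiny model that logs a metric for the Tune callback to report."""

    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)


def train_fn(config):
    loader = DataLoader(
        TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32
    )
    # Report the logged metric back to Tune when training finishes.
    callbacks = [TuneReportCallback({"loss": "train_loss"}, on="train_end")]
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=callbacks,
        plugins=[RayPlugin(num_workers=2)],  # assumed arguments
    )
    trainer.fit(TuneModel(config["lr"]), loader)


analysis = tune.run(
    train_fn,
    metric="loss",
    mode="min",
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=4,
    # Placeholder resource request: reserve CPUs for the trial plus the Ray
    # workers it launches; the exact bookkeeping differs between versions.
    resources_per_trial={"cpu": 1, "extra_cpu": 2},
)
print("Best hyperparameters found were:", analysis.best_config)
```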
## FAQ
> RaySGD already has a [PyTorch Lightning integration](https://docs.ray.io/en/master/raysgd/raysgd_ptl.html). What's the difference between this integration and that?