This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Commit 9dc6aaa

Readme update (#72)
* wip
* update readme
* update
1 parent f2ffd82 commit 9dc6aaa


README.md

Lines changed: 46 additions & 8 deletions
@@ -7,11 +7,35 @@ Once you add your plugin to the PyTorch Lightning Trainer, you can parallelize t
This library also comes with an integration with [Ray Tune](tune.io) for distributed hyperparameter tuning experiments.

# Table of Contents
1. [Installation](#installation)
2. [PyTorch Lightning Compatibility](#pytorch-lightning-compatibility)
3. [PyTorch Distributed Data Parallel Plugin on Ray](#pytorch-distributed-data-parallel-plugin-on-ray)
4. [Multi-Node Distributed Training](#multi-node-distributed-training)
5. [Multi-Node Training from your Laptop](#multi-node-training-from-your-laptop)
6. [Horovod Plugin on Ray](#horovod-plugin-on-ray)
7. [Model Parallel Sharded Training on Ray](#model-parallel-sharded-training-on-ray)
8. [Hyperparameter Tuning with Ray Tune](#hyperparameter-tuning-with-ray-tune)
9. [FAQ](#faq)

## Installation
You can install Ray Lightning via `pip`:

`pip install ray_lightning`

Or, to install the latest master branch:

`pip install git+https://github.com/ray-project/ray_lightning#ray_lightning`

## PyTorch Lightning Compatibility
Here are the supported PyTorch Lightning versions:

| Ray Lightning | PyTorch Lightning |
|---|---|
| 0.1 | 1.4.1 |

## PyTorch Distributed Data Parallel Plugin on Ray
The `RayPlugin` provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.

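The README's full usage example falls outside the lines changed in this commit. As a minimal, self-contained sketch of how the plugin is typically wired into a `Trainer` (the `ToyModel`, the random dataset, and the exact `RayPlugin` constructor arguments shown here are illustrative assumptions, not taken from this diff):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray_lightning import RayPlugin


class ToyModel(pl.LightningModule):
    """Tiny stand-in LightningModule used only to illustrate the plugin wiring."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


train_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)

# Assumed constructor arguments; check the RayPlugin docstring for the exact names.
plugin = RayPlugin(num_workers=4, num_cpus_per_worker=1, use_gpu=False)

# Pass the plugin to the Trainer; Ray launches and manages the DDP worker processes.
trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
trainer.fit(ToyModel(), train_loader)
```

The only Ray-specific piece in this sketch is the `plugins=[plugin]` argument; everything else is standard PyTorch Lightning.
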
@@ -35,6 +59,26 @@ Because Ray is used to launch processes, instead of the same script being called
- Jupyter Notebooks, Google Colab, Kaggle
- Calling `fit` or `test` multiple times in the same script

## Multi-node Distributed Training
Using the same examples above, you can run distributed training on a multi-node cluster in just two steps.
1) [Use Ray's cluster launcher](https://docs.ray.io/en/master/cluster/launcher.html) to start a Ray cluster: `ray up my_cluster_config.yaml`.
2) [Execute your Python script on the Ray cluster](https://docs.ray.io/en/master/cluster/commands.html#running-ray-scripts-on-the-cluster-ray-submit): `ray submit my_cluster_config.yaml train.py`. This will `rsync` your training script to the head node and execute it on the Ray cluster.

You no longer have to manually set environment variables or configuration and run your training script on every single node.

## Multi-node Training from your Laptop
Ray provides the ability to run multi-node and GPU training entirely from your laptop through [Ray Client](https://docs.ray.io/en/master/cluster/ray-client.html).

You can follow the instructions [here](https://docs.ray.io/en/master/cluster/ray-client.html) to set up the cluster.
Then, add these lines to the beginning of your script to connect to the cluster:
```python
import ray

# Replace with the appropriate host and port of your cluster's head node.
ray.init("ray://<head_node_host>:10001")
```
Now you can run your training script on your laptop, but have it execute as if your laptop had all the resources of the cluster, essentially providing you with an **infinite laptop**.

**Note:** When using Ray Client, you must disable checkpointing and logging for your `Trainer` by setting `checkpoint_callback` and `logger` to `False`.

## Horovod Plugin on Ray
Or if you prefer to use Horovod as the distributed training protocol, use the `HorovodRayPlugin` instead.

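The Horovod example itself sits outside the hunks shown in this diff. As a rough sketch of the swap, reusing the hypothetical `ToyModel` and `train_loader` from the `RayPlugin` sketch above (the `num_hosts`, `num_slots`, and `use_gpu` argument names are assumptions; consult the `HorovodRayPlugin` docstring for the exact names):

```python
import pytorch_lightning as pl
from ray_lightning import HorovodRayPlugin

# Assumed constructor arguments: one host with four Horovod worker slots, CPU only.
plugin = HorovodRayPlugin(num_hosts=1, num_slots=4, use_gpu=False)

# Same Trainer wiring as with RayPlugin; only the plugin changes.
trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
trainer.fit(ToyModel(), train_loader)
```
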
@@ -73,13 +117,6 @@ trainer.fit(ptl_model)
```
See the [PyTorch Lightning docs](https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html#sharded-training) for more information on sharded training.

## Hyperparameter Tuning with Ray Tune
`ray_lightning` also integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed model training. You can run multiple PyTorch Lightning training runs in parallel, each with a different hyperparameter configuration and each itself parallelized. All you have to do is move your training code to a function, pass the function to `tune.run`, and make sure to add the appropriate callback (either `TuneReportCallback` or `TuneReportCheckpointCallback`) to your PyTorch Lightning Trainer.

@@ -125,6 +162,7 @@ analysis = tune.run(

print("Best hyperparameters found were: ", analysis.best_config)
```
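Only the tail of the README's Tune example is visible in this hunk. As a rough, self-contained sketch of the pattern described above, using an illustrative toy model, metric name, search space, and resource request that are assumptions rather than the README's actual example (the `TuneReportCallback` used here is the one shipped with Ray in `ray.tune.integration.pytorch_lightning`):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback
from ray_lightning import RayPlugin


class ToyTuneModel(pl.LightningModule):
    """Tiny stand-in model that logs a validation loss for Tune to optimize."""

    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)


def make_loader():
    return DataLoader(
        TensorDataset(torch.randn(128, 32), torch.randn(128, 1)), batch_size=32)


def train_fn(config):
    # Report the logged "val_loss" back to Tune at the end of each validation epoch.
    callback = TuneReportCallback({"loss": "val_loss"}, on="validation_end")
    trainer = pl.Trainer(
        max_epochs=2,
        num_sanity_val_steps=0,
        callbacks=[callback],
        # Each trial is itself data-parallel across Ray workers.
        plugins=[RayPlugin(num_workers=2, use_gpu=False)])
    trainer.fit(ToyTuneModel(lr=config["lr"]), make_loader(), make_loader())


analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=4,
    metric="loss",
    mode="min",
    # Reserve capacity for the extra workers each trial launches; the exact
    # resource spec varies across Ray versions ("extra_cpu" is one older form).
    resources_per_trial={"cpu": 1, "extra_cpu": 2})

print("Best hyperparameters found were: ", analysis.best_config)
```
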
## FAQ
> RaySGD already has a [PyTorch Lightning integration](https://docs.ray.io/en/master/raysgd/raysgd_ptl.html). What's the difference between this integration and that?
