
Commit 5596293

Thomas Mulc committed
fixed errors in README and updated README with Beginner Tutorial Information
1 parent 9788e2b

1 file changed: +19 -10 lines

README.md

Lines changed: 19 additions & 10 deletions
@@ -4,27 +4,36 @@ Currently, there are few examples of distributed TensorFlow code. Further, the
This is a collection of examples to help you get started with distributed computing in TensorFlow, and it can also act as boilerplate code. Many of the examples focus on implementing well-known distributed training schemes, such as those available in [Distributed Keras](https://github.com/cerndb/dist-keras), which are discussed in the author's [blog post](http://joerihermans.com/ramblings/distributed-deep-learning-part-1-an-introduction/). The official Distributed TensorFlow guide can be found [here](https://www.tensorflow.org/deploy/distributed).

Almost all the examples can be run on a single machine with a CPU.

## Beginner Tutorial

See the Beginner Tutorial folder for notebooks demonstrating the core concepts used in distributed TensorFlow; the rest of the examples assume an understanding of this material. A minimal server sketch follows the list below.

* `Servers.ipynb` -- basics of TensorFlow servers
* `Parameter Sever.ipynb` -- everything about parameter servers
* `Local then Global Variable.ipynb` -- creates a graph locally and then makes global copies of the variables. Useful for graphs that do local updates before pushing global updates (e.g. DOWNPOUR, ADAG, etc.)
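
The notebooks cover TensorFlow servers, parameter servers, and local-versus-global variables. The following is a minimal sketch of the server and parameter-server basics, assuming the TF 1.x API; the job names, ports, and toy variable are illustrative and not taken from the notebooks.

```python
# Minimal sketch (TF 1.x): define a two-task cluster and start its servers.
# Ports, job names, and the toy variable are illustrative only.
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})

# Both servers are started in this one process so the sketch is self-contained;
# in a real deployment each task calls tf.train.Server in its own process.
ps_server = tf.train.Server(cluster, job_name="ps", task_index=0)
worker_server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables live on the parameter server; the update op runs on the worker.
with tf.device("/job:ps/task:0"):
    w = tf.Variable(0.0, name="w")
with tf.device("/job:worker/task:0"):
    increment = tf.assign_add(w, 1.0)

# A session connected to the worker's target executes ops across the cluster.
with tf.Session(worker_server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(increment))  # 1.0
```
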
## Training Examples
The complete list of examples is below. The asynchronous examples are *easier* than the synchronous ones, so people getting started should have a complete understanding of those before moving on to the synchronous examples. The first example, `Non-Distributed Setup`, shows the basic learning problem we want to solve distributively; this example should be familiar to everyone since it doesn't use any distributed code. The second example, `Distributed Setup`, shows the same problem being solved with distributed code (i.e. with one parameter server and one worker); a minimal sketch of such a setup follows the list below.

* `Non-Distributed Setup`
* `Distributed Setup`
* `HogWild` (Asynchronous SGD)
* `DOWNPOUR` **TODO**
* `ADAG` (Asynchronous Distributed Adaptive Gradients) **WIP**
* `Synchronous SGD`
* `Synchronous SGD different learning rates`
* `SDAG` (Synchronous Distributed Adaptive Gradients) **WIP**
* `Multiple GPUs Single Machine`
* Dynamic SGD **TODO**
* Asynchronous Elastic Averaging SGD (AEASGD) **TODO**
* Asynchronous Elastic Averaging Momentum SGD (AEAMSGD) **TODO**
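
To give a rough picture of what the `Distributed Setup` example involves, here is a minimal sketch of one parameter server plus one asynchronous worker, assuming the TF 1.x API; the addresses, the toy loss, and the constants are illustrative and not the repository's code.

```python
# Minimal sketch (TF 1.x): one parameter server, one worker, async updates.
# Addresses, the toy loss, and constants are illustrative only.
import tensorflow as tf

JOB_NAME = "worker"   # set to "ps" in the parameter-server process
TASK_INDEX = 0

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})
server = tf.train.Server(cluster, job_name=JOB_NAME, task_index=TASK_INDEX)

if JOB_NAME == "ps":
    server.join()  # the parameter server only hosts the shared variables
else:
    # replica_device_setter pins variables to the ps job and ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % TASK_INDEX,
            cluster=cluster)):
        w = tf.Variable(5.0)
        loss = tf.square(w)  # toy objective: drive w toward 0
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

    # Each worker pushes updates to the shared variables without waiting
    # for the others (asynchronous, HogWild-style training).
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(TASK_INDEX == 0)) as sess:
        for _ in range(100):
            _, current_loss = sess.run([train_op, loss])
        print("final loss:", current_loss)
```
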

## Running Training Examples

All the training examples (except the non-distributed example) live in their own folders. To run one, move to the example's directory and run its bash script.

```bash
cd <example_name>/
```
@@ -46,18 +55,18 @@ sudo pkill python
## Links
* [Official Documentation](https://www.tensorflow.org/deploy/distributed)
* [Threads and Queues](https://www.tensorflow.org/programmers_guide/threading_and_queues)
* [More TensorFlow Documentation](https://www.tensorflow.org/api_guides/python/train#Distributedexecution)

## Glossary
* [Server](https://www.tensorflow.org/api_docs/python/tf/train/Server) -- encapsulates a Session target and belongs to a cluster
* [Coordinator](https://www.tensorflow.org/api_docs/python/tf/train/Coordinator) -- coordinates threads (see the sketch after this list)
* [Session Manager](https://www.tensorflow.org/api_docs/python/tf/train/SessionManager) -- restores a session, initializes variables, and coordinates threads
* [Supervisor](https://www.tensorflow.org/api_docs/python/tf/train/Supervisor) -- good for threads; combines a Coordinator, a Saver, and a Session Manager, and is more capable than a Session Manager alone
* [Session Creator](https://www.tensorflow.org/api_docs/python/tf/train/SessionCreator) -- a factory for creating a session
* [Monitored Session](https://www.tensorflow.org/api_docs/python/tf/train/MonitoredSession) -- a Session-like object that handles initialization, hooks, and recovery
* [Monitored Training Session](https://www.tensorflow.org/api_docs/python/tf/train/MonitoredTrainingSession) -- the only distributed solution for synchronous optimization
* [Sync Replicas](https://www.tensorflow.org/api_docs/python/tf/train/SyncReplicasOptimizer) -- an optimizer wrapper for synchronous optimization
* [Scaffold](https://www.tensorflow.org/api_docs/python/tf/train/Scaffold) -- holds lots of meta training settings and is passed to the Session creator
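
For the thread-related entries above, this is a minimal sketch of how a Coordinator is typically used, assuming the TF 1.x API; the worker function and thread count are illustrative.

```python
# Minimal sketch (TF 1.x): a Coordinator stopping a pool of Python threads.
# The worker function and thread count are illustrative only.
import threading
import tensorflow as tf

coord = tf.train.Coordinator()

def worker(coord, worker_id):
    step = 0
    while not coord.should_stop():
        step += 1                 # stand-in for real work
        if step >= 1000:
            coord.request_stop()  # ask every thread to finish up

threads = [threading.Thread(target=worker, args=(coord, i)) for i in range(4)]
for t in threads:
    t.start()
coord.join(threads)  # block until all threads have stopped
```
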

### Hooks
* [Stop Hook](https://www.tensorflow.org/api_docs/python/tf/train/StopAtStepHook) -- a hook that requests that training stop at a given step (used in the sketch below)
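
To show how a few of these pieces fit together, here is a minimal sketch of synchronous optimization with `SyncReplicasOptimizer`, run inside a `MonitoredTrainingSession` that stops via `StopAtStepHook`, assuming the TF 1.x API; the single-machine cluster and toy loss are illustrative and not the repository's code.

```python
# Minimal sketch (TF 1.x): SyncReplicasOptimizer + MonitoredTrainingSession
# + StopAtStepHook. The cluster and toy loss are illustrative only.
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2224"],
                                "worker": ["localhost:2225"]})
ps_server = tf.train.Server(cluster, job_name="ps", task_index=0)
worker_server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(5.0)
    loss = tf.square(w)
    global_step = tf.train.get_or_create_global_step()

    # Aggregate gradients from all replicas before applying a single update.
    opt = tf.train.SyncReplicasOptimizer(
        tf.train.GradientDescentOptimizer(0.1),
        replicas_to_aggregate=1, total_num_replicas=1)
    train_op = opt.minimize(loss, global_step=global_step)

hooks = [opt.make_session_run_hook(is_chief=True),  # sync-replica bookkeeping
         tf.train.StopAtStepHook(last_step=100)]    # request stop at step 100

# MonitoredTrainingSession handles initialization, hooks, and recovery.
with tf.train.MonitoredTrainingSession(master=worker_server.target,
                                       is_chief=True, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```
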
