Currently, there are few examples of distributed TensorFlow code.

This is a collection of examples to help you get started with distributed computing in TensorFlow, and the examples can act as boilerplate code. Many of them focus on implementing well-known distributed training schemes, such as those available in [Distributed Keras](https://github.com/cerndb/dist-keras) and discussed in the author's [blog post](http://joerihermans.com/ramblings/distributed-deep-learning-part-1-an-introduction/). The official Distributed TensorFlow guide can be found [here](https://www.tensorflow.org/deploy/distributed).

Almost all the examples can be run on a single machine with a CPU.
## Beginner Tutorial

See the Beginner Tutorial folder for notebooks demonstrating the core concepts used in distributed TensorFlow; the rest of the examples assume an understanding of this material. A short code sketch of the server basics follows the list below.

* `Servers.ipynb` -- basics of TensorFlow servers
* `Parameter Sever.ipynb` -- everything about parameter servers
* `Local then Global Variable.ipynb` -- creates a graph locally, then makes global copies of the variables. Useful for graphs that do local updates before pushing global updates (e.g. DOWNPOUR, ADAG, etc.).
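
Here is a minimal sketch of the server and parameter-server ideas the notebooks cover, using the TensorFlow 1.x API; the job names, ports, and variable names are illustrative and not taken from the notebooks:

```python
import tensorflow as tf

# Describe the cluster: one parameter server and one worker on the same machine.
# Job names and ports here are illustrative.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# In a real deployment each task runs in its own process; both servers are
# started in this one process only so the snippet runs on a single machine.
ps_server = tf.train.Server(cluster, job_name="ps", task_index=0)
worker_server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Pin a variable to the parameter server and an update op to the worker.
with tf.device("/job:ps/task:0"):
    counter = tf.Variable(0.0, name="shared_counter")
with tf.device("/job:worker/task:0"):
    increment = tf.assign_add(counter, 1.0)

# Connect a session to the worker's server; it reaches the ps through the cluster.
with tf.Session(target=worker_server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(increment))  # 1.0
```

A standalone parameter-server process would normally just create its `tf.train.Server` and call `server.join()` so it serves variables indefinitely.
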
## Training Examples

The complete list of examples is below. The asynchronous examples are *easier* than the synchronous ones, so people getting started should have a complete understanding of the asynchronous examples before moving on to the synchronous ones. The first example, `Non-Distributed Setup`, shows the basic learning problem we want to solve in a distributed way; it should be familiar to everyone since it doesn't use any distributed code. The second example, `Distributed Setup`, shows the same problem being solved with distributed code (i.e. with one parameter server and one worker).
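
As a rough sketch of the kind of setup `Distributed Setup` describes (one parameter server, one worker), `tf.train.replica_device_setter` can place the variables on the ps job automatically; the toy model, addresses, and constants below are made up for illustration and are not the repository's actual example:

```python
import tensorflow as tf

# One parameter server and one worker (illustrative addresses). For a
# single-machine demo the ps server is started in this same process; in a
# real setup it would be a separate process that calls .join().
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2230"],
    "worker": ["localhost:2231"],
})
tf.train.Server(cluster, job_name="ps", task_index=0)
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter pins variables to /job:ps and other ops to this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    # Toy linear regression standing in for the basic learning problem.
    x = tf.placeholder(tf.float32, shape=[None, 1])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    w = tf.Variable(tf.zeros([1, 1]))
    b = tf.Variable(tf.zeros([1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)

# MonitoredTrainingSession handles variable initialization for us.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    for _ in range(100):
        sess.run(train_op, feed_dict={x: [[1.0]], y: [[2.0]]})
```

In the asynchronous schemes, several such worker processes would each run this training loop against the same parameter server.
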
* [Session Manager](https://www.tensorflow.org/api_docs/python/tf/train/SessionManager) -- restores sessions, initializes variables, and coordinates threads
* [Supervisor](https://www.tensorflow.org/api_docs/python/tf/train/Supervisor) -- good for threads; combines a Coordinator, Saver, and Session Manager, and is preferable to using a Session Manager alone
* [Session Creator](https://www.tensorflow.org/api_docs/python/tf/train/SessionCreator) -- factory for creating a session
* [Monitored Training Session](https://www.tensorflow.org/api_docs/python/tf/train/MonitoredTrainingSession) -- the only one of these session utilities that supports distributed synchronous optimization
* [Sync Replicas](https://www.tensorflow.org/api_docs/python/tf/train/SyncReplicasOptimizer) -- wraps an optimizer for synchronous optimization
* [Scaffold](https://www.tensorflow.org/api_docs/python/tf/train/Scaffold) -- holds many of the meta training settings and is passed to the session creator
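
To make the relationship between the optimizer wrapper and the session concrete, here is a sketch of the usual wiring of `SyncReplicasOptimizer` with `MonitoredTrainingSession`. The loss, addresses, and replica counts are placeholders; with a single worker the counts are 1, while with N workers they would typically be N:

```python
import tensorflow as tf

# Single-machine demo cluster (illustrative); both servers run in this process.
cluster = tf.train.ClusterSpec({"ps": ["localhost:2240"],
                                "worker": ["localhost:2241"]})
tf.train.Server(cluster, job_name="ps", task_index=0)
server = tf.train.Server(cluster, job_name="worker", task_index=0)
is_chief = True  # exactly one worker must be the chief

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(5.0)
    loss = tf.square(w)  # toy loss standing in for a real model
    global_step = tf.train.get_or_create_global_step()

    # Wrap a plain optimizer so gradients from the replicas are aggregated
    # before a single update is applied (synchronous optimization).
    sync_opt = tf.train.SyncReplicasOptimizer(
        tf.train.GradientDescentOptimizer(0.1),
        replicas_to_aggregate=1,
        total_num_replicas=1)
    train_op = sync_opt.minimize(loss, global_step=global_step)

# The hook manages the sync token queue; the chief also runs the init ops.
sync_hook = sync_opt.make_session_run_hook(is_chief)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=is_chief,
                                       hooks=[sync_hook]) as sess:
    for _ in range(10):
        sess.run(train_op)
```
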
### Hooks

* [Stop Hook](https://www.tensorflow.org/api_docs/python/tf/train/StopAtStepHook) -- hook that requests that training stop at a given step
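
A small example of the stop hook in a local (non-distributed) `MonitoredTrainingSession`; the toy variable and step count are placeholders:

```python
import tensorflow as tf

# Toy graph: the hook, not the loop bound, decides when training stops.
w = tf.Variable(3.0)
loss = tf.square(w)
global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

# Request a stop once the global step reaches 100.
stop_hook = tf.train.StopAtStepHook(last_step=100)

with tf.train.MonitoredTrainingSession(hooks=[stop_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```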