Poseidon is a scalable open-source framework for large-scale distributed deep learning on CPU/GPU clusters. Initially released in January 2015 along with Petuum v1.0 as an application under the Bösen parameter server, it is now being refactored as a stand-alone application for users who are primarily interested in deep learning. We describe the system architecture of Poseidon and several distribution strategies for fast parallel training of deep learning models in the following arXiv paper:

Hao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, and Eric Xing. Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines. arXiv preprint, 2015.

If you are coming from the main Petuum wiki, please note that Poseidon is installed separately from the other Petuum applications. Do continue to follow this wiki for instructions.

Poseidon builds upon the Caffe framework (http://caffe.berkeleyvision.org/), and extends it with distributed, multi-machine capability. If you have a cluster with multiple GPU-equipped machines, you can now take advantage of all of them while still enjoying the familiar interface of Caffe!

News

(New) CUDA 7.5 and cudnn R3 are supported!

(New) Updated performance results for accelerated training of AlexNet and GoogLeNet!

Multi-GPU training and cuDNN R2 are now supported!

Multi-GPU Training

To enable multi-GPU training, you need to specify the GPU device IDs in the starting script. For example, suppose you are going to train GoogLeNet on 2 machines, each of which has two GPUs with device IDs 0 and 1, for a total of 4 GPUs:

  1. First set the machine IPs and ports in the localserver.

  2. Then specify device = [0, 1] in examples/googlenet/run_local.py, or, if you prefer the bash script, set device="0,1" and num_app_threads=2 in examples/googlenet/train_googlenet.sh (see the sketch after this list).

  3. Start the script.
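
For concreteness, here is a minimal sketch of the device configuration from step 2. Only the `device = [0, 1]`, `device="0,1"`, and `num_app_threads=2` values come from the steps above; the rest of examples/googlenet/run_local.py is omitted, and anything else shown is an assumption rather than the actual script contents.

```python
# Sketch of the device-related setting in examples/googlenet/run_local.py.
# Everything except the `device` value itself is illustrative.

# GPU device IDs to use on each machine. With 2 machines and devices
# [0, 1] on each, training runs on 4 GPUs in total.
device = [0, 1]

# Bash alternative (examples/googlenet/train_googlenet.sh), per step 2:
#   device="0,1"        # comma-separated GPU device IDs
#   num_app_threads=2   # one application thread per GPU
```

After editing the configuration and the machine IPs/ports from step 1, launching the chosen script (for example, running examples/googlenet/run_local.py with Python) starts distributed training on all listed machines.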

The log will show that both GPUs on each machine are enabled for training.