# Home
Poseidon is a scalable open-source framework for large-scale distributed deep learning on CPU/GPU clusters. Initially released in January 2015 along with Petuum v1.0 as an application under the Bösen parameter server, it is now released as a stand-alone application for users who are primarily interested in deep learning.
If you are coming from the main Petuum wiki, please note that Poseidon is installed separately from the other Petuum applications. Do continue to follow this wiki for instructions.
Poseidon builds upon the Caffe framework (http://caffe.berkeleyvision.org/), and extends it with distributed, multi-machine capability. If you have a cluster with multiple GPU-equipped machines, you can now take advantage of all of them while still enjoying the familiar interface of Caffe!
(New) CUDA 7.5 and cuDNN R3 are supported!
(New) Updated performance results for accelerating the training of AlexNet and GoogLeNet!
Multi-GPU training and cuDNN R2 are now supported!
We have tested Poseidon on GCC v4.7, v4.8, CUDA v6.5 and v7.5, cuDNN v2 and v3. If you are having trouble compiling, please try using CUDA v7.5 and cuDNN v3.
Download Poseidon by running the following commands:

```
git clone https://github.com/petuum/poseidon.git
cd poseidon
```
Install third-party prerequisites by running the following commands under the poseidon directory:

```
sudo apt-get -y install g++ make python-dev libxml2-dev libxslt-dev git zlibc zlib1g zlib1g-dev libbz2-1.0 libbz2-dev
git clone https://github.com/petuum/third_party.git
cd third_party
make -j2
cd ..
```
Next, compile the Bösen key-value store, which enables distributed multi-machine support in Caffe:
```
cd ps
make
cd ..
```
Now, we are ready to compile Caffe. Install the Caffe prerequisites. If you are using Ubuntu 14.04, we have provided a script to do so:
```
sh scripts/setup_third_party.sh
```
Enter `Y` when you are asked `Do you want to continue [Y/n]?`.
After setting up the Caffe prerequisites, create the Caffe configuration file:
```
cp Makefile.config.example Makefile.config
```
You may want to configure `Makefile.config` for your environment; see the instructions inside the file for details. By default, `Makefile.config` assumes that GPU support is available; CPU-only users will need to modify it. Finally, build Caffe using

```
make all -j4
```
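If you are building for CPU only, make the change in `Makefile.config` before running `make all`. The usual switch is the standard Caffe `CPU_ONLY` option; the exact line in Poseidon's `Makefile.config.example` may differ slightly, so treat this as a sketch:

```
# In Makefile.config: uncomment to build without GPU/CUDA support
CPU_ONLY := 1
```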
See the performance section below for performance numbers.
Note: Some users may encounter an issue involving the Boost library and the CUDA 7.0 compiler, which produces compile errors in .cuo files. We are working on fixing this issue.
Petuum uses `ssh` (and `mpirun`, which invokes `ssh`) to coordinate tasks on different machines, even if you are only using a single machine. This requires password-less key-based authentication on all machines you are going to use (Petuum will fail if a password prompt appears).

If you don't already have an SSH key, generate one via

```
ssh-keygen
```
You'll then need to add your public key to each machine by appending your public key file `~/.ssh/id_rsa.pub` to `~/.ssh/authorized_keys` on that machine. If your home directory is on a shared filesystem visible to all machines, simply run

```
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```

If the machines do not have a shared filesystem, you need to upload your public key to each machine and then append it as described above (see the sketch below).

Note: Password-less authentication can fail if `~/.ssh/authorized_keys` does not have the correct permissions. To fix this, run `chmod 600 ~/.ssh/authorized_keys`.
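For the no-shared-filesystem case, one convenient way to push the key to every worker is `ssh-copy-id` (assumed to be installed; the host names below are placeholders for your machines):

```
# Appends ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each remote machine.
for host in worker1 worker2 worker3; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"
done
```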
Data can be in different formats, such as LevelDB, LMDB, or raw image files; the format is specified in the model configuration files (see below for more details). We use the CIFAR-10 dataset as an example.
Caffe already provides scripts for downloading and converting the CIFAR-10 data. Simply run the following commands:
```
# download dataset
sh data/cifar10/get_cifar10.sh
# convert to leveldb format
sh examples/cifar10/create_cifar10.sh
```
The script creates the LevelDB dataset directories `examples/cifar10/cifar10_test_leveldb` and `examples/cifar10/cifar10_train_leveldb`, and the dataset image mean `examples/cifar10/mean.binaryproto`.
In the distributed setting you sometimes need to partition the whole dataset into M pieces, where M is the total number of clients (machines). Each client will be in charge of one piece. (Note: You can skip this step if you are using one client or multiple clients with a shared file system.)
To partition the CIFAR-10 LevelDB training data, use the script `scripts/partition_data.sh`: edit the script to set `DB_PATH` to the path of the CIFAR-10 training LevelDB (i.e. `examples/cifar10/cifar10_train_leveldb`) and `NUM_PARTITIONS` to the number of partitions (e.g. 2), then run:

```
sh scripts/partition_data.sh
```

After that there should be NUM_PARTITIONS partitioned datasets, `cifar10_train_leveldb_k` (where k is the partition index), in the same location as the CIFAR-10 training LevelDB. Now you can distribute the LevelDBs to the clients (client k will read data from `cifar10_train_leveldb_k`, so make sure the index k is consistent with the client id you specified in `$PETUUM_ROOT/machinefiles/localserver`).
Repeat this process for the test data `examples/cifar10/cifar10_test_leveldb` (see the sketch below).
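As a concrete illustration (the variable syntax inside `scripts/partition_data.sh` is assumed to be plain shell assignments, and the absolute paths are examples), partitioning the training and test LevelDBs into 2 pieces might look like this:

```
# Edit scripts/partition_data.sh so that (illustrative values):
#   DB_PATH=/absolute/path/to/poseidon/examples/cifar10/cifar10_train_leveldb
#   NUM_PARTITIONS=2
sh scripts/partition_data.sh

# Then point DB_PATH at the test LevelDB and run the script again:
#   DB_PATH=/absolute/path/to/poseidon/examples/cifar10/cifar10_test_leveldb
sh scripts/partition_data.sh
```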
Training a neural network involves several configuration files. Some examples are provided in `./examples/`, but you will need to configure them for your environment. We explain how to do so using the CIFAR-10 image classification task as a running example:
- Network definition file: `./examples/cifar10/cifar10_quick_train_test.prototxt`
  - This defines the model architecture by specifying the neural layers one by one. Please refer to the Caffe Layer Tutorial for the details of layer definition.
  - The grammar is almost the same as that of the original Caffe, except that we introduce an additional parameter for data layers (specifically, for data layers with type `DATA` and `IMAGE_DATA`): `shared_file_system`. If the clients share a file system and access the same copy of the data, set `shared_file_system` to `true`, and the clients will read the data you specified in `source`. Otherwise (and by default), `shared_file_system` is set to `false`, and client k will append `_k` to the `source` and read the corresponding data piece. For example, if you set `source: "POSEIDON_ROOT/examples/cifar10/cifar10_train_leveldb"` and `shared_file_system: false`, then client 0 will read training data from `POSEIDON_ROOT/examples/cifar10/cifar10_train_leveldb_0`. LevelDB does not support concurrent access by multiple clients, so if your data is stored in LevelDB and you are using more than one client, you should set `shared_file_system` to `false` and partition/distribute the dataset (refer to the *Data Partitioning* section). A sketch of a complete data layer is shown after this list.
  - Note that since we are running in the distributed setting, absolute paths are required when specifying `source` and `mean_file`. Replace `POSEIDON_ROOT` according to your setting.
- Solver configuration file: `./examples/cifar10/cifar10_quick_solver.prototxt`
  - This specifies the optimization configuration for model training, including the GPU/CPU mode, solver type, learning rate, etc. The grammar is exactly the same as in the original Caffe; please refer to the Caffe Solver Tutorial for details.
  - Note: GPU mode is the default. If you are using CPU mode, you will need to change `solver_mode: GPU` to `solver_mode: CPU`.
  - Again, make sure to use absolute paths when specifying `net` and `snapshot_prefix`. (Replace `POSEIDON_ROOT` according to your setting.)
- Execution script: `./examples/cifar10/run_local.py`
  - Some system parameters are specified in the script:
    - `solver`: (absolute) path of the solver configuration file.
    - `num_table_threads`: equal to the number of worker threads per client + 1. (Multiple GPUs on a single machine are not currently supported, so set this to `1` if using a GPU; multi-GPU per machine will be supported soon.)
    - `table_staleness`: staleness of the parameter server tables (suggested value: 0).
    - `num_rows_per_table`: number of rows per parameter server table; this relates to how tables are partitioned in the parameter server. Simply set it to `1` if the model size is not too big (e.g. #params <= 250M).
    - `svb`: set to true to use sufficient vector broadcast (SVB), which can reduce the communication cost of fully-connected layers; not recommended for small networks.
  - Replace `POSEIDON_ROOT` according to your setting.
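For reference, here is a hedged sketch of a data layer carrying the `shared_file_system` field. The layer name, batch size, and the exact placement of `shared_file_system` are illustrative assumptions; consult the provided `cifar10_quick_train_test.prototxt` for the authoritative layout:

```
layers {
  name: "cifar"
  type: DATA              # shared_file_system also applies to IMAGE_DATA layers
  top: "data"
  top: "label"
  data_param {
    # Absolute path required; replace POSEIDON_ROOT with your install path.
    source: "POSEIDON_ROOT/examples/cifar10/cifar10_train_leveldb"
    batch_size: 100       # illustrative value
    # Assumed placement: with false (the default), client k reads
    # cifar10_train_leveldb_k instead of the path given in source.
    shared_file_system: false
  }
}
```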
To start training, specify the paths in the above files according to your setting, then run:

```
python examples/cifar10/train_cifar10.py
```
When training a model, you will see output like this:
```
I1126 20:40:16.731869 29444 solver.cpp:294] Iteration 100, loss: 1.64066, time: 141.762
I1126 20:40:16.731956 29444 solver.cpp:318] Train net output #0: loss = 1.64066 (* 1 = 1.64066 loss)
...
I1126 20:42:27.374675 29444 solver.cpp:456] Iteration 200, Testing net (#0)
I1126 20:42:27.375102 29444 solver.cpp:486] Test net output #0: accuracy = 0.4592
I1126 20:42:27.375174 29444 solver.cpp:486] Test net output #1: loss = 1.49516 (* 1 = 1.49516 loss)
```
For each training iteration, `loss` is the training loss evaluated at thread 0 on client 0. For the output of the testing phase, score 0 is the accuracy and score 1 is the testing loss. Again, both scores are evaluated at thread 0 on client 0.
After training is completed, there will be some output files in `output/cifar10/`. Among these files, `cifar-netoutputs` records the loss and accuracy scores of each iteration; these values are averaged across all threads, and are thus more accurate than the on-the-fly outputs.
The Caffe app runs in the background and outputs its progress to standard error. If you need to terminate the app before it finishes, just run

```
./scripts/kill_caffe.py <petuum_ps_hostfile>
```
## Feature Extraction
A trained neural network model can be used to extract features from images. Here we use the pre-trained BVLC Reference CaffeNet for demonstration.
Run the following command to download the pre-trained model:

```
./scripts/download_model_binary.py models/bvlc_reference_caffenet
```
Then prepare data and other input files as follows:
```
# 1. Make a temporary folder to store things into
mkdir examples/_temp
# 2. Generate a list of the files to process
find `pwd`/examples/images -type f -exec echo {} \; > examples/_temp/temp.txt
# 3. Add line number to the end of each line, so that we can know
#    which image each extracted feature corresponds to
cat examples/_temp/temp.txt | awk '{printf("%s %d\n", $0, NR)}' > examples/_temp/file_list.txt
# 4. Download the mean image of the ILSVRC dataset. We will subtract
#    the mean image from the dataset
sh data/ilsvrc12/get_ilsvrc_aux.sh
# 5. Copy the network definition
cp examples/feature_extraction/imagenet_val.prototxt examples/_temp
```
Note: The above commands assume you are using one client, or multiple clients with a shared file system. If you are using multiple clients that do not share a file system, you need to partition the image data across the clients, run the above commands on each client, rename `examples/_temp/file_list.txt` on each client c to `examples/_temp/file_list.txt_c` (where `c` is the client index), and set `shared_file_system` to `false` in `examples/_temp/imagenet_val.prototxt`.
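For example, on client 1 the rename is simply (a minimal sketch; the index must match the client id in your machine file):

```
mv examples/_temp/file_list.txt examples/_temp/file_list.txt_1
```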
Edit `examples/_temp/imagenet_val.prototxt` to change the paths for your setting. Specifically, on the lines with `source:` and `mean_file:`, replace `POSEIDON_ROOT` with the full path to `bosen/app/caffe/`. Now we can start extracting features by running:

```
sh scripts/extract_features.sh
```
The features are stored in LevelDBs `examples/_temp/features_c_t`, where `c` is the index of the client and `t` is the index of the thread; i.e., the features extracted by different threads are stored in separate LevelDBs. You can read the records using the data structure `Datum`, where `Datum::float_data` holds the features and `Datum::label` is the corresponding image id (i.e. the line number you added to `file_list.txt`). A minimal reading example is sketched below.
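For illustration, here is a hedged Python sketch of reading one of these feature LevelDBs; it assumes the `leveldb` Python package and the Caffe Python protobuf bindings (`caffe_pb2`) are installed, and uses a placeholder path:

```
import leveldb
from caffe.proto import caffe_pb2  # generated protobuf bindings shipped with Caffe

# Placeholder path: features written by client 0, thread 0.
db = leveldb.LevelDB('examples/_temp/features_0_0')

for key, value in db.RangeIter():
    datum = caffe_pb2.Datum()
    datum.ParseFromString(bytes(value))   # each record is a serialized Datum
    features = list(datum.float_data)     # the extracted feature vector
    image_id = datum.label                # line number from file_list.txt
    print(image_id, len(features))
```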
After finishing the extraction, run

```
./scripts/kill_feature_extractor.py ../../machinefiles/localserver
```

to terminate the app.
To enable multi-GPU training, one needs to specify the GPU device IDs in the starting script. For example, suppose you are going to train GoogLeNet using 2 machines, each of which has two GPUs with device IDs 0 and 1, for a total of 4 GPUs.
- First, set the machine IPs and ports in the `localserver` machine file (see the sketch below).
- Then specify `device = [0, 1]` in `examples/googlenet/run_local.py`, or, if you prefer the bash script, specify the device IDs as `device="0,1"` and set `num_app_threads=2` in `example/googlenet/train_googlent.sh`.
- Start the script.
The log will show that both GPUs are enabled for training on every machine.
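For illustration only, a Petuum machine file typically lists one entry per line in the form `id ip port`; the exact format is defined by Petuum, and the addresses below are placeholders for the two machines in this example:

```
0 192.168.1.101 9999
1 192.168.1.102 9999
```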
We will update this section frequently with the latest performance results we have achieved.
- Objective: We train AlexNet using the ILSVRC 2012 dataset.
- Environment: Throughput is measured on a distributed GPU cluster, every node of which is equipped with one K20 GPU card and 40 Gigabit Ethernet (GbE). Training data are partitioned and saved on the local HDD of each node. cuDNN R2 is enabled.
- Setting: See the net prototxt and solver. The training script and PS settings are provided here.
The following figure shows Poseidon's speedup of throughput when training AlexNet using different settings of staleness values and number of nodes. When using 1 node, the performance of the original Caffe is reported.
On our cluster, when training AlexNet with 8 nodes, Poseidon takes only 1 day to converge (compared to 5 - 7 days on the single machine Caffe), and achieves 56.5% top-1 accuracy on the validation set.
The following figures show how the validation error decreases with training time and with the number of iterations. When using 1 node, the performance of the original Caffe is reported.
- Objective: We train GoogLeNet using the ILSVRC 2012 dataset.
- Environment: Throughput is measured on a distributed GPU cluster, every node of which is equipped with one K20 GPU card and 40 Gigabit Ethernet (GbE). Training data are partitioned and saved on the local HDD of each node. cuDNN R2 is enabled.
- Setting: See the net prototxt and solver. The training script and PS settings are provided here.
The following figure shows Poseidon's speedup of throughput when training GoogLeNet using different settings of staleness values and number of nodes, compared to single machine Caffe.
When training GoogLeNet with 8 nodes, Poseidon takes less than 48 hours to reach 50% top-1 accuracy and less than 75 hours to reach 57% top-1 accuracy, finally achieving 67.1% top-1 accuracy. This is about a 4x speedup over single-machine Caffe, which usually takes 15-20 days to converge, as shown in the following figures.
- Objective and dataset: We train a CNN using all available images in ImageNet, comprising 14,197,087 labeled images from 21,841 categories. We randomly split the whole set into two parts, using the first 7.1 million images for training and the remainder for testing. The whole dataset is about 3.2 TB, with 1.6 TB for training and 1.6 TB for testing.
- Environment: We train the CNN with full data parallelism on a GPU cluster with 8 nodes, every node of which is equipped with one K20 GPU card and 40 Gigabit Ethernet (GbE). Training data are partitioned and saved on the local HDD of each node. cuDNN R2 is enabled.
- Settings: The network and solver configurations will be released soon.
The following table compares our results with those of previous work on ImageNet 22K, in terms of experimental settings, machine resources, training time, and train/test accuracy.