This project contains a set of functions to help distribute the execution of Matlab functions on a cluster.
-
The cluster (i.e. "the server") and the main workstation (i.e., "the client") should to be able to access some shared disk space. It is usually possible to do so by mounting a network folder on both stations. If the server and client do not share the same network, mounting can be performed over ssh with sshfs (Linux, MacOS, Windows)
-
The client should have an SSH software installed. Both ssh, for linux, and PuTTY, for Windows, are currently handled.
-
The client should be able to connect to the cluster without having to type a password. This is usually managed by registering RSA keys on the cluster. On linux, this is done by
- Generate a set of public and private keys
ssh-keygen -t rsa
- Register the key with the cluster
ssh-copy-id login@cluster
-
Matlab should be installed on the server, with an infinite (or large) number of licenses.
- For now, we only support clusters managed with Sun Grid Engine. It should also work with its forks (OGS, ...) but they were not tested. We plan on supporting other queuing systems, such as PBS, in the future?
First an option structure must be generated with the distribute_default
function. Beforehand, a number of mandatory options should be manually set (else, jobs wil be run locally).
Here is an example of a typical configuration:
opt = struct;
opt.server.ip = 'cluster.university.ac.uk';
opt.server.login = 'me';
opt.server.folder = '/home/me/distribute';
opt.client.folder = '/Users/me/distribute';
opt.matlab.bin = '/share/apps/matlab';
opt.translate = {'/Users/me/mydata' '/home/me/data'};
opt = distribute_default(opt);
Several additional options can be set in order to specify the cluster configuration more precisely, or to load matlab packages at runtime. Here is the complete list:
mode - Parallelisation mode: 'qsub'/'parfor'/'for' - ['for']
verbose - Speak during processing - [false]
server.ip - IP adress (or alias name) of the cluster - ['' = (server == client)]
server.login - Login with which to connect - ['']
server.source - Files to source on server side - [try to find bashrc and/or bash_profile]
server.folder - Shared folder for writing data, scripts, etc. - ['~/.distribute']
client.source - Files to source on server side - [auto]
client.workers - Number of local workers - [auto]
client.folder - Shared folder for writing data, scripts, etc. - ['~/.distribute']
ssh.type - SSH software to use 'ssh'/'putty' - [try to detect]
ssh.bin - Path to the ssh binary - [try to detect]
ssh.opt - SSH options - ['-x']
sched.sub - Path to the submit binary - [try to detect]
sched.stat - Path to the stat binary - [try to detect]
sched.acct - Path to the acct binary - [try to detect]
sched.type - Type of scheduler 'sge'/'pbs' - [try to detect]
job.batch - Submit jobs as a batch (force same mem for all) - [true]
job.mem - (Initial) Max memory usage by a single job - ['2G']
job.est_mem - Estimate max memory usage from previous runs - [true]
job.sd - Amount of extra memory to add to estimated max memory - [0.1]
job.use_dummy - Uses a dummy job to decide when job have finished - [false]
matlab.bin - Path to matlab binary - [try to detect]
matlab.add - Paths to add to Matlab path - [{}]
matlab.addsub - Paths to add to Matlab path, with subdirectories - [{}]
matlab.opt - Commandline options to pass to matlab - ['-nojvm -nodesktop -nosplash -singleCompThread']
spm.path - Path to SPM - []
spm.toolboxes - List of SPM toolboxes to add to Matlab path - [{}]
translate - Cell array of size 2xN with translation between client and - [{client.folder server.folder}].
server paths. Example:
{'/home/me/' '/mnt/users/me/' ;
'/shared/data/' '/mnt/shared/data'}
restrict - Restrict translation to a class: 'char'/'file_array' - ['']
clean - Clean tmp data when finished - [true]
clean_init - Initially clean tmp data - [false]
The main function is distribute
. Its syntax, is quite straightforward: it takes the option structure, a function name or handle, and the list of arguments to pass to the function. Arguments that should be sliced (i.e., iterated over), should be preceded by 'iter'
. Arguments that should be sliced and are both inputs and outputs (in particular, structure arrays) should be preceded by 'inplace'
. The option structure is returned because some functionalities (RAM usage estimation) need this structure to be updated.
FORMAT [opt, out1, ...] = distribute(opt, func, ('iter'/'inplace'), arg1, ...)
opt - Option structure. See 'help distribute_default'.
func - Matlab function to apply (string or function handle)
arg - Arguments of the function
> Arguments that should be iterated over should be preceded
with 'iter'
> Arguments that should be iterated in place (i.e. the output
replaces the input) should be preceded with 'inplace'
out - Output of the function. Each one is a cell array of outputs,
unless some arguments were 'inplace'. In this case, the
function should return inplace arguments first, in the same
order as their input order.
To parallelise locally, one can also replace the option structure with the number of workers. In this case, no option structure is returned:
FORMAT [out1, ...] = distribute(nworkers, func, ('iter'/'inplace'), arg1, ...)
To loop without parallelisation, one can also remove entirely the option argument. In this case, no option structure is returned:
FORMAT [out1, ...] = distribute(func, ('iter'/'inplace'), arg1, ...)
Let us give a first use case, were we have a list of pairs of arrays that should be summed:
% Initialise arrays
N = 10;
DIM = [5 5];
a = cell(1,N);
b = cell(1,N);
for i=1:N
a{i} = randn(DIM);
b{i} = randn(DIM);
end
% Local processing
true_c = cellfun(@plus, a, b, 'UniformOutput', false);
% Distributed processing
[opt, dist_c] = distribute(opt, 'plus', 'iter', a, 'iter', b);
Let us now apply a distributed process to a structure array, which is both input and output
% Initialise structure array
N = 10;
f = cell(1,N);
a = struct('f', f);
% Set the value 3 in all fields 'f'
[opt, a] = distribute(dist, @setfield, 'inplace', a, 'f', 3);
We intend to allow:
- job batching, where a single job processes several "subjects".
- optimising cluster use by choosing between local and distributed processing based on the cluster load.
- distributing scripts/binaries on top of Matlab functions. This can be helpful for working with compiled Matlab scripts.
This software was developed by Mikael Brudfors and Yaël Balbastre in John Ashburner's Computational Anatomy Group at the Wellcome Centre for Human Neuroimaging in UCL.
If you encounter any difficulty, please send an email to y.balbastre
or mikael.brudfors.15
at ucl.ac.uk
This software is released under the GNU General Public License version 3 (GPL v3). As a result, you may copy, distribute and modify the software as long as you track changes/dates in source files. Any modifications to or software including (via compiler) GPL-licensed code must also be made available under the GPL along with build & install instructions.