
Commit 80d7edc

Initial commit
0 parents  commit 80d7edc

Some content is hidden: large commits have some content hidden by default.

41 files changed: +2168 -0 lines changed

Diff for: .github/workflows/deploy.yml

+40
@@ -0,0 +1,40 @@
name: deploy-documentation

# Only run this when the main branch changes
on:
  push:
    branches:
    - main
    # Only run if edits in DS-documentation
    paths:
    - documentation/DS-documentation/**
    - .github/workflows/deploy.yml

# This job installs dependencies, builds the book, and pushes it to `gh-pages`
jobs:
  deploy-book:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2

    # Install dependencies
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8

    - name: Install dependencies
      run: |
        pip install jupyter-book

    # Build the book
    - name: Build the book
      run: |
        jupyter-book build documentation/DS-documentation/

    # Push the book's HTML to github-pages
    - name: GitHub Pages action
      uses: peaceiris/[email protected]
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./documentation/DS-documentation/_build/html

Diff for: LICENSE

+39
@@ -0,0 +1,39 @@
Distributed-Something is distributed under the following BSD-style license:

Copyright © 2022 Broad Institute, Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

3. Neither the name of the Broad Institute, Inc. nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED “AS IS.” BROAD MAKES NO EXPRESS OR IMPLIED
REPRESENTATIONS OR WARRANTIES OF ANY KIND REGARDING THE SOFTWARE AND
COPYRIGHT, INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, CONFORMITY WITH ANY
DOCUMENTATION, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER
DEFECTS, WHETHER OR NOT DISCOVERABLE. IN NO EVENT SHALL BROAD, THE
COPYRIGHT HOLDERS, OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF, HAVE REASON TO KNOW, OR IN
FACT SHALL KNOW OF THE POSSIBILITY OF SUCH DAMAGE.

If, by operation of law or otherwise, any of the aforementioned
warranty disclaimers are determined inapplicable, your sole remedy,
regardless of the form of action, including, but not limited to,
negligence and strict liability, shall be replacement of the software
with an updated version if one exists.

Diff for: README.md

+70
@@ -0,0 +1,70 @@
# Distributed-Something

Run encapsulated Docker containers that do... something in the Amazon Web Services (AWS) infrastructure.
We are interested in scientific image analysis, so we have used it for [CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler), [Fiji](https://github.com/CellProfiler/Distributed-Fiji), and [BioFormats2Raw](https://github.com/CellProfiler/Distributed-OmeZarrMaker).
You can use it for whatever you want!

## Documentation

Full documentation is available on our [Documentation Website](https://distributedscience.github.io/Distributed-Something).

## Overview

This code is an example of how to use AWS distributed infrastructure for running anything Dockerized.
The configuration of the AWS resources is done using boto3 and the AWS CLI.
The worker is written in Python and is encapsulated in a Docker container.
Four AWS components are minimally needed to run distributed jobs:

1. An SQS queue
2. An ECS cluster
3. An S3 bucket
4. A spot fleet of EC2 instances

All of them can be managed individually through the AWS Management Console.
However, this code helps you get started quickly and run a job autonomously once everything is configured correctly.
The code runs a script that links all these components and prepares the infrastructure to run a distributed job.
When the job is completed, the code can also stop resources and clean up components.
It also adds logging and alarms via CloudWatch, helping the user troubleshoot runs and destroy stuck machines.
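The linking that the setup script performs can be pictured with a short boto3 sketch. This is illustrative only and not part of this commit; the resource names are placeholders and the real script does considerably more (spot fleet, CloudWatch, cleanup).

```python
import boto3

region = "us-east-1"
sqs = boto3.client("sqs", region_name=region)
ecs = boto3.client("ecs", region_name=region)
s3 = boto3.client("s3", region_name=region)

# 1. SQS queue that will hold the jobs
queue_url = sqs.create_queue(QueueName="DistributedSomethingQueue")["QueueUrl"]

# 2. ECS cluster that will place the Docker containers
ecs.create_cluster(clusterName="default")

# 3. S3 bucket for inputs/outputs (must already exist; raises if it doesn't)
s3.head_bucket(Bucket="your-bucket-name")

# 4. The spot fleet of EC2 instances is requested later, in Step 3
print("Queue ready:", queue_url)
```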
## Running the code

### Step 1

Edit the config.py file with all the relevant information for your job.
Then start creating the basic AWS resources by running the following script:

    $ python3 run.py setup

This script initializes the resources in AWS.
Note that the Docker registry is built separately; you can modify the worker code to build your own.
Any time you modify the worker code, you need to update the Docker registry using the Makefile inside the worker directory.

### Step 2

After the first script runs successfully, the job can be submitted with the following command:

    $ python3 run.py submitJob files/exampleJob.json

Running the script uploads the tasks that are configured in the JSON file.
You have to customize the exampleJob.json file with information that makes sense for your project.
You'll want to figure out which information is generic and which information makes each job unique.
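As a purely illustrative sketch of what a job file could contain (the real fields are whatever your adapted worker expects; none of these names come from this commit), exampleJob.json might be generated like this:

```python
import json

# Hypothetical job description: field names are illustrative only
job = {
    "shared_settings": {          # values common to every task in this batch
        "output_bucket": "your-bucket-name",
    },
    "groups": [                   # one entry per task; each becomes one SQS message
        {"input": "plate1/well_A01"},
        {"input": "plate1/well_A02"},
    ],
}

with open("files/exampleJob.json", "w") as f:
    json.dump(job, f, indent=2)
```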
### Step 3

After submitting the job to the queue, we can add computing power to process all tasks in AWS.
This code starts a fleet of spot EC2 instances that will run the worker code.
The worker code is encapsulated in Docker containers, and the code uses ECS services to place those containers on the EC2 instances.
All of this is automated with the following command:

    $ python3 run.py startCluster files/exampleFleet.json

After the cluster is ready, the code informs you that everything is set up and saves the spot fleet identifier in a file for further reference.
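Conceptually, starting the cluster asks EC2 for spot capacity based on the fleet file. A minimal boto3 sketch of that idea, assuming for illustration that the fleet file holds a complete SpotFleetRequestConfig (this is not the project's actual implementation):

```python
import json
import boto3

with open("files/exampleFleet.json") as f:
    fleet_config = json.load(f)   # assumed to be a SpotFleetRequestConfig dict

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.request_spot_fleet(SpotFleetRequestConfig=fleet_config)
print("Spot fleet requested:", response["SpotFleetRequestId"])
```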
### Step 4

When the cluster is up and running, you can monitor progress using the following command:

    $ python3 run.py monitor files/APP_NAMESpotFleetRequestId.json

The file APP_NAMESpotFleetRequestId.json is created after the cluster is set up in Step 3.
It is important to keep this monitor running if you want to automatically shut down computing resources when there are no more tasks in the queue (recommended).
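The heart of such a monitor is a polling loop over the queue's counters. A simplified, illustrative sketch (the queue name and polling interval are placeholders; the real monitor also scales down the fleet and cleans up resources):

```python
import time
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.get_queue_url(QueueName="DistributedSomethingQueue")["QueueUrl"]

while True:
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # "Available"
            "ApproximateNumberOfMessagesNotVisible",  # "In Flight"
        ],
    )["Attributes"]
    available = int(attrs["ApproximateNumberOfMessages"])
    in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])
    print(f"Available: {available}  In flight: {in_flight}")
    if available == 0 and in_flight == 0:
        break  # queue drained; safe to shut down computing resources
    time.sleep(60)
```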
See our [full documentation](https://distributedscience.github.io/Distributed-Something) for more information about each step of the process.

![Distributed-Something](documentation/DS-documentation/images/Distributed-Something_chronological_overview.png)

Diff for: config.py

+42
@@ -0,0 +1,42 @@
# Constants (User configurable)

APP_NAME = 'DistributedSomething' # Used to generate derivative names unique to the application.

# DOCKER REGISTRY INFORMATION:
DOCKERHUB_TAG = 'user/distributed-something:sometag'

# AWS GENERAL SETTINGS:
AWS_REGION = 'us-east-1'
AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
AWS_BUCKET = 'your-bucket-name'

# EC2 AND ECS INFORMATION:
ECS_CLUSTER = 'default'
CLUSTER_MACHINES = 3
TASKS_PER_MACHINE = 1
MACHINE_TYPE = ['m4.xlarge']
MACHINE_PRICE = 0.10
EBS_VOL_SIZE = 30 # In GB. Minimum allowed is 22.

# DOCKER INSTANCE RUNNING ENVIRONMENT:
DOCKER_CORES = 4 # Number of software processes to run inside a docker container
CPU_SHARES = DOCKER_CORES * 1024 # ECS computing units assigned to each docker container (1024 units = 1 core)
MEMORY = 15000 # Memory assigned to the docker container in MB
SECONDS_TO_START = 3*60 # Wait before the next process is initiated to avoid memory collisions

# SQS QUEUE INFORMATION:
SQS_QUEUE_NAME = APP_NAME + 'Queue'
SQS_MESSAGE_VISIBILITY = 1*60 # Timeout (secs) for messages in flight (average time to be processed)
SQS_DEAD_LETTER_QUEUE = 'arn:aws:sqs:some-region:111111100000:DeadMessages'

# LOG GROUP INFORMATION:
LOG_GROUP_NAME = APP_NAME

# REDUNDANCY CHECKS
CHECK_IF_DONE_BOOL = 'False' # True or False - should it check if there are a certain number of non-empty files and delete the job if yes?
EXPECTED_NUMBER_FILES = 7 # What is the number of files that trigger skipping a job?
MIN_FILE_SIZE_BYTES = 1 # What is the minimal number of bytes an object should be to "count"?
NECESSARY_STRING = '' # Is there any string that should be in the file name to "count"?

# PUT ANYTHING SPECIFIC TO YOUR PROGRAM DOWN HERE
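For illustration only, a worker could apply the redundancy-check settings above roughly as follows; the helper below is a sketch under those assumptions, not part of this commit:

```python
import boto3
import config  # the config.py shown above

def job_already_done(prefix):
    """Sketch: skip a job if enough sufficiently large outputs already exist in AWS_BUCKET."""
    if config.CHECK_IF_DONE_BOOL.upper() != 'TRUE':
        return False
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(Bucket=config.AWS_BUCKET, Prefix=prefix)
    matching = [
        obj for obj in response.get('Contents', [])
        if obj['Size'] >= config.MIN_FILE_SIZE_BYTES
        and config.NECESSARY_STRING in obj['Key']
    ]
    return len(matching) >= config.EXPECTED_NUMBER_FILES
```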
Diff for: documentation/DS-documentation/SQS_QUEUE_information.md

+41
@@ -0,0 +1,41 @@
# SQS QUEUE Information

This is in-depth information about the configurable components in SQS QUEUE INFORMATION, a section in [Step 1: Configuration](step_1_configuration.md) of running Distributed CellProfiler.

## SQS_QUEUE_NAME

**SQS_QUEUE_NAME** is the name of the queue where all of your jobs are sent. (A queue is exactly what it sounds like: a list of things waiting their turn. A job represents one complete run through a CellProfiler pipeline, though each job may involve any number of images; e.g., an analysis run may require thousands of jobs, each making one complete CellProfiler run on a single image, while making an illumination correction may be a single job that iterates through thousands of images to produce a single output file.) You want a name that is descriptive enough to distinguish it from other queues. We usually name our queues based on the project and the step or pipeline goal. An example may be something like Hepatocyte_Differentiation_Illum or Lipid_Droplet_Analysis.

## SQS_DEAD_LETTER_QUEUE

**SQS_DEAD_LETTER_QUEUE** is the name of the queue where all the jobs that failed to run are sent. If everything goes perfectly, this will always remain empty. If jobs in the queue fail multiple times (our default is 10), they are moved to the dead-letter queue, which is not used to initiate jobs. The dead-letter queue therefore effectively functions as a log so you can see whether any of your jobs failed. It is different from your other queue in that machines do not try to pull jobs from it. Pro tip: each member of our team maintains their own dead-letter queue so we don't have to worry about finding messages if multiple people are running jobs at the same time. We use names like DeadMessages_Erin.

If all of your jobs end up in your dead-letter queue, there are many different places you could have a problem. Hopefully, you'll keep an eye on the logs in CloudWatch (the part of AWS used for monitoring what all your other AWS services are doing) after starting a run and catch the issue before all of your jobs fail multiple times.

If a single job ends up in your dead-letter queue while the rest of your jobs complete successfully, it is likely that an image is corrupted (a corrupted image is one that has failed to save properly or has been damaged so that it will not open). This is true whether your pipeline processes a single image at a time (such as in analysis runs where you're interested in cellular measurements on a per-image basis) or many images at a time (such as when making an illumination correction image on a per-plate basis). This is the major reason why we have the dead-letter queue: you certainly don't want to pay for your cluster to indefinitely attempt to process a corrupted image. Keeping an eye on your CloudWatch logs wouldn't necessarily help you catch this kind of error, because you could have tens or hundreds of successful jobs run before an instance pulls the job for the corrupted image, or the corrupted image could be thousands of images into an illumination correction run.
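For reference, the relationship between a queue and its dead-letter queue is expressed in SQS as a redrive policy. A minimal boto3 sketch, with illustrative names and the 10-failure threshold mentioned above (not the project's actual setup code):

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# The dead-letter queue itself is an ordinary queue that no machine pulls jobs from
dlq_url = sqs.create_queue(QueueName="DeadMessages_Erin")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Jobs received 10 times without being deleted get moved to the dead-letter queue
sqs.create_queue(
    QueueName="Hepatocyte_Differentiation_Illum",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "10"}
        )
    },
)
```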
## SQS_MESSAGE_VISIBILITY

**SQS_MESSAGE_VISIBILITY** controls how long jobs are hidden after being pulled by a machine to run. Jobs must be visible (i.e. not hidden) in order to be pulled by a Docker and therefore run. In other words, the time you enter in SQS_MESSAGE_VISIBILITY is how long a job is given a chance to complete before it is unhidden and made available to be started by a different copy of CellProfiler. It's quite important to set this time correctly; we typically say to estimate 1.5x how long the job typically takes to run (or your best guess, if you're not sure). To understand why, and the consequences of setting an incorrect time, let's look more carefully at the SQS queue.

The SQS queue has two categories: "Messages Available" and "Messages In Flight". Each message is a job, and regardless of the category it's in, the jobs all remain in the same queue. In effect, "In Flight" means currently hidden and "Available" means not currently hidden.

When you submit your Config file to AWS, it creates your queue in SQS, but that queue starts out empty. When you submit your Jobs file to AWS, it puts all of your jobs into the queue under "Messages Available". When you submit your Fleet file to AWS, it 1) creates machines in EC2, 2) ECS puts Docker containers on those instances, and 3) those instances look in "Messages Available" in SQS for jobs to run.

Once a Docker has pulled a job, that job moves from "Available" to "In Flight". It remains hidden ("In Flight") for the duration of time set in SQS_MESSAGE_VISIBILITY and then becomes visible again ("Available"). Jobs are hidden so that multiple machines don't process the same job at the same time. If the job completes successfully, the Docker tells the queue to delete that message.

If the job completes but is not successful (e.g. CellProfiler errors), the Docker tells the queue to move the job from "In Flight" to "Available" so another Docker (with a different copy of CellProfiler) can attempt to complete the job.

If the SQS_MESSAGE_VISIBILITY is too short, a job will become unhidden even though it is still being (hopefully successfully) run by the Docker that originally picked it up. This means another Docker may come along and start the same job, and you end up paying for unnecessary compute time because both Dockers will continue to run the job until they each finish.

If the SQS_MESSAGE_VISIBILITY is too long, you can end up wasting time and money waiting for the job to become available again after a crash, even when the rest of your analysis is done. If anything causes a job to stop mid-run (e.g. CellProfiler crashes, the instance crashes, or the instance is removed by AWS because you are outbid), that job stays hidden until the set time has elapsed. If a Docker instance goes to the queue and doesn't find any visible jobs, it does not try to run any more jobs in that copy of CellProfiler, limiting the effective computing power of that Docker. Therefore, some or all of your instances may hang around doing nothing (but costing money) until the job is visible again. When in doubt, it is better to have your SQS_MESSAGE_VISIBILITY set too long than too short because, while crashes can happen, it is rare that AWS takes small machines from your fleet, though we do notice it happening with larger machines.

There is not an easy way to see whether you have selected the appropriate amount of time for your SQS_MESSAGE_VISIBILITY on your first run through a brand-new pipeline. To confirm that multiple Dockers didn't run the same job, after the jobs are complete you need to manually go through each log in CloudWatch and figure out how many times you got the word "SUCCESS" in each log. (This may be reasonable to do on an illumination correction run where you have a single job per plate, but it's not so reasonable if you're running an analysis pipeline on thousands of individual images.) To confirm that multiple Dockers are never processing the same job, you can keep an eye on your queue and make sure that you never have more jobs "In Flight" than the number of copies of CellProfiler that you have running; likewise, if your timeout is too short, it may seem like too few jobs are "In Flight" even though the CPU usage on all your machines is high.

Once you have run a pipeline once, you can check the execution time (either by noticing how long after you started your jobs the first jobs begin to finish, or by checking the logs of individual jobs and noting the start and end times). You will then have an accurate idea of roughly how long that pipeline needs to execute and can set your message visibility accordingly. You can even do this on the fly while jobs are currently processing; the updated visibility time won't affect the jobs already out for processing (i.e. if the time was set to 3 hours and you change it to 1 hour, the jobs already processing will remain hidden for 3 hours or until finished), but any job that begins processing AFTER the change will use the new visibility timeout setting.
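Adjusting the visibility timeout on the fly, as described above, comes down to a single SQS call. An illustrative boto3 sketch (the queue name and the timeout value are placeholders):

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.get_queue_url(QueueName="DistributedSomethingQueue")["QueueUrl"]

# e.g. jobs take ~40 minutes, so set roughly 1.5x that: 60 minutes
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"VisibilityTimeout": str(60 * 60)},  # seconds
)
```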
## Example SQS Queue

![Sample_SQS_Queue](images/Sample_SQS_Queue.png)

This is an example of an SQS queue. You can see that there is one active queue with 64 jobs in it. In this example, we are running a fleet of 32 instances, each with a single Docker, so at this moment (right after starting the fleet) there are 32 jobs "In Flight" and 32 jobs that are still "Available". You can also see that many lab members have their own dead-letter queues, which are, fortunately, all currently empty.

Diff for: documentation/DS-documentation/_config.yml

+34
@@ -0,0 +1,34 @@
# Book settings
# For your DS implementation, you will need to update author, repository:url:, and html:baseurl:

# Learn more at https://jupyterbook.org/customize/config.html
title: DS Documentation
author: Broad Institute
copyright: "2022"

# Only build files that are in the ToC
only_build_toc_files: true

# Force re-execution of notebooks on each build.
# See https://jupyterbook.org/content/execute.html
execute:
  execute_notebooks: force

# Information about where the book exists on the web
repository:
  url: https://github.com/distributedscience/distributed-something
  branch: main # Which branch of the repository should be used when creating links (optional)
  path_to_book: documentation/DS-documentation

html:
  baseurl: distributedscience.github.io
  use_repository_button: true
  use_issues_button: true
  use_edit_page_button: true
  comments:
    hypothesis: true

parse:
  myst_enable_extensions:
  # Only required if you use html <img>
  - html_image

Diff for: documentation/DS-documentation/_toc.yml

+30
@@ -0,0 +1,30 @@
# Table of contents
# Learn more at https://jupyterbook.org/customize/toc.html

format: jb-book
root: introduction
parts:
- caption: FAQ
  chapters:
  - file: overview
  - file: overview_2
  - file: costs
- caption: Adapting Distributed-Something to a new application
  chapters:
  - file: customizing_DS
  - file: implementing_DS
  - file: troubleshooting_implementation
- caption: Running an application made by Distributed-Something as an end user
  chapters:
  - file: step_0_prep
  - file: step_1_configuration
    sections:
    - file: SQS_QUEUE_information
  - file: step_2_submit_jobs
  - file: step_3_start_cluster
  - file: step_4_monitor
- caption: Technical guides
  chapters:
  - file: troubleshooting_runs
  - file: hygiene
  - file: versions
0 commit comments
