
Commit 334371d

weiji14 and yellowcap authored

Document how to run the datacube pipeline with a batch job (#97)

* 🚚 Move datacube batch job instructions from main README.md to docs

  Migrate the instructions on how to run a batch job from the main README.md file to docs/data_datacube.md. Added a placeholder section on how to create a single datacube for an MGRS tile.

* Document datacube script for single MGRS tile

* Improve datacube docs

Co-authored-by: Daniel Wiesmann <[email protected]>

1 parent d703174 commit 334371d

3 files changed: +70 -50 lines changed

README.md

Lines changed: 0 additions & 50 deletions
@@ -100,53 +100,3 @@ To generate embeddings from the pretrained model's encoder on 1024 images
More options can be found using `python trainer.py fit --help`, or at the
[LightningCLI docs](https://lightning.ai/docs/pytorch/2.1.0/cli/lightning_cli.html).

### Running the datacube pipeline

This section describes how to run the data pipeline on AWS Batch Spot instances using
a [fetch-and-run](https://aws.amazon.com/blogs/compute/creating-a-simple-fetch-and-run-aws-batch-job/)
approach.

#### Prepare docker image in ECR

Build the Docker image and push it to an ECR repository.

```bash
ecr_repo_id=12345
cd batch
docker build -t $ecr_repo_id.dkr.ecr.us-east-1.amazonaws.com/fetch-and-run .

aws ecr get-login-password --profile clay --region us-east-1 | docker login --username AWS --password-stdin $ecr_repo_id.dkr.ecr.us-east-1.amazonaws.com

docker push $ecr_repo_id.dkr.ecr.us-east-1.amazonaws.com/fetch-and-run:latest
```

#### Prepare AWS Batch

To prepare Batch, we need to create a compute environment, a job queue, and a job
definition.

Example configurations for the compute environment and the job definition are
provided in the `batch` directory.

The `submit.py` script contains a loop for submitting jobs to the queue. An
alternative to these individual job submissions would be to use array jobs, but
for now the individual submissions are simpler and failures are easier to track.

#### Create ZIP file with the package to execute

Package the model and the inference script into a zip file. The `datacube.py`
script is the one that will be executed on the instances.

Put the scripts in a zip file and upload the zip package to S3 so that
the Batch fetch-and-run job can use it.

```bash
zip -FSrj "batch-fetch-and-run.zip" ./scripts/* -x "scripts/*.pyc"

aws s3api put-object --bucket clay-fetch-and-run-packages --key "batch-fetch-and-run.zip" --body "batch-fetch-and-run.zip"
```

#### Submit job

We can now submit a batch job to run the pipeline. The `submit.py` file
provides an example of how to submit jobs in Python.

docs/_toc.yml

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ parts:
    file: installation
- caption: Data Preparation
  chapters:
  - title: Creating datacubes
    file: data_datacube
  - title: Benchmark dataset labels
    file: data_labels
- caption: Running the model

docs/data_datacube.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Creating datacubes

## How to create a datacube

The `datacube.py` script collects Sentinel-2, Sentinel-1, and DEM data over individual MGRS tiles. The source list of MGRS tiles to be processed is provided in an input file with MGRS geometries. Each run of the script collects data for one of the MGRS tiles in the source file. The tile to be processed is selected by the row index number provided as input. The MGRS tile ID is expected to be in the `name` property of the input file.

For the target MGRS tile, the script loops through the years between 2017 and 2023 in random order. For each year, it searches for the least cloudy Sentinel-2 scene. Based on the date of the selected Sentinel-2 scene, it searches for the Sentinel-1 scenes that are the closest match to that date, within a maximum of +/- 3 days of difference. It includes multiple Sentinel-1 scenes until the full MGRS tile is covered. If no matching Sentinel-1 scenes can be found, the script moves on to the next year. The script stops once matching datasets have been collected for 3 different years. Finally, the script also selects the intersecting part of the Copernicus Digital Elevation Model (DEM).

The script then downloads the full Sentinel-2 scene and matches the datacube with the corresponding Sentinel-1 and DEM data. The scene-level data is then split into smaller chips of a fixed size of 512x512 pixels. The Sentinel-2, Sentinel-1, and DEM bands are packed together into a single TIFF file for each chip. These are saved locally and synced to an S3 bucket at the end of the script. The bucket name can be specified as input.

For testing and debugging, the data size can be reduced by specifying a pixel window using the `subset` parameter. Data will then be requested only for the specified pixel window. This reduces the data size considerably, which speeds up processing during testing.

The example run below searches for data for the geometry with row index 1 in a local MGRS sample file, for a 1000x1000 pixel window.

```bash
python datacube.py --sample /home/user/Desktop/mgrs_sample.fgb --bucket "my-bucket" --subset "1000,1000,2000,2000" --index 1
```
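
To build datacubes for several tiles locally, the same call can simply be looped over row indices. This is only a minimal sketch, assuming the same sample file and bucket as above and that row indices 0 through 2 exist in the sample file:

```bash
# Build one (subsetted) datacube per MGRS tile for the first three rows of the
# sample file. Flags are the same as in the single-tile example above.
for index in 0 1 2; do
    python datacube.py \
        --sample /home/user/Desktop/mgrs_sample.fgb \
        --bucket "my-bucket" \
        --subset "1000,1000,2000,2000" \
        --index $index
done
```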

## Running the datacube pipeline as a batch job

This section describes how to containerize the data pipeline and run it on AWS Batch Spot instances using
a [fetch-and-run](https://aws.amazon.com/blogs/compute/creating-a-simple-fetch-and-run-aws-batch-job/)
approach.

### Prepare docker image in ECR

Build the Docker image and push it to an ECR repository.

```bash
ecr_repo_id=12345
cd batch
docker build -t $ecr_repo_id.dkr.ecr.us-east-1.amazonaws.com/fetch-and-run .

aws ecr get-login-password --profile clay --region us-east-1 | docker login --username AWS --password-stdin $ecr_repo_id.dkr.ecr.us-east-1.amazonaws.com

docker push $ecr_repo_id.dkr.ecr.us-east-1.amazonaws.com/fetch-and-run:latest
```
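
If the target repository does not exist yet, it can be created first. A minimal sketch, assuming the repository name `fetch-and-run` and the same `clay` profile and `us-east-1` region used above:

```bash
# Create the ECR repository that the image will be pushed to
# (skip this step if the repository already exists).
aws ecr create-repository \
    --repository-name fetch-and-run \
    --profile clay \
    --region us-east-1
```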

### Prepare AWS Batch

To prepare Batch, we need to create a compute environment, a job queue, and a job
definition.

Example configurations for the compute environment and the job definition are
provided in the `batch` directory.

The `submit.py` script contains a loop for submitting jobs to the queue. An
alternative to these individual job submissions would be to use array jobs, but
for now the individual submissions are simpler and failures are easier to track.
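
The Batch resources themselves can be created from those example configurations with the AWS CLI. The following is only a sketch: the file names `compute-environment.json` and `job-definition.json`, the queue name `fetch-and-run-queue`, and the compute environment name `fetch-and-run-env` are placeholders that should be adjusted to match the actual configs in the `batch` directory.

```bash
# Create the compute environment and job definition from the example configs,
# then attach a job queue to the compute environment. Names below are
# placeholders; adjust them to the configs in the batch directory.
aws batch create-compute-environment --cli-input-json file://compute-environment.json
aws batch register-job-definition --cli-input-json file://job-definition.json
aws batch create-job-queue \
    --job-queue-name fetch-and-run-queue \
    --state ENABLED \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=fetch-and-run-env
```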

### Create ZIP file with the package to execute

Package the model and the inference script into a zip file. The `datacube.py`
script is the one that will be executed on the instances.

Put the scripts in a zip file and upload the zip package to S3 so that
the Batch fetch-and-run job can use it.

```bash
zip -FSrj "batch-fetch-and-run.zip" ./scripts/* -x "scripts/*.pyc"

aws s3api put-object --bucket clay-fetch-and-run-packages --key "batch-fetch-and-run.zip" --body "batch-fetch-and-run.zip"
```
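
To confirm the package is in place before submitting jobs, the upload can be checked against the same bucket and key:

```bash
# Verify that the zipped package exists in the bucket used by fetch-and-run.
aws s3api head-object \
    --bucket clay-fetch-and-run-packages \
    --key "batch-fetch-and-run.zip"
```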

### Submit job

We can now submit a batch job to run the pipeline. The `submit.py` file
provides an example of how to submit jobs in Python.
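
For reference, a single submission can also be made directly from the CLI. This is only a sketch of what one job might look like: the queue and job definition names are the placeholders from the previous step, and the `BATCH_FILE_TYPE`/`BATCH_FILE_S3_URL` environment variables are those expected by the fetch-and-run wrapper described in the linked AWS blog post.

```bash
# Submit one datacube job through the fetch-and-run job definition. The
# environment variables point the fetch-and-run entrypoint at the zip package
# uploaded above; remaining datacube.py arguments go into the command list.
aws batch submit-job \
    --job-name datacube-mgrs-index-1 \
    --job-queue fetch-and-run-queue \
    --job-definition fetch-and-run \
    --container-overrides '{
        "command": ["datacube.py", "--bucket", "my-bucket", "--index", "1"],
        "environment": [
            {"name": "BATCH_FILE_TYPE", "value": "zip"},
            {"name": "BATCH_FILE_S3_URL", "value": "s3://clay-fetch-and-run-packages/batch-fetch-and-run.zip"}
        ]
    }'
```

The `submit.py` script does roughly the equivalent from Python, looping over the tiles so that one such job is submitted per MGRS geometry.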
