Allow ClayDataModule to load GeoTIFF files directly from s3 #92

weiji14 · 2023-12-19T07:47:50Z

Similar to the work done in #85 on the GeoTIFFDataPipeModule, this PR lets ClayDataModule load GeoTIFF files from an s3 bucket. It also includes a few minor tweaks to align both LightningDataModules.

Implementation uses torchdata's S3FileLister to get the files, but instead of returning an iterator, a list is returned.

TODO:

* Allow ClayDataModule to load GeoTIFF files directly from s3
* Rename datacube["path"] to datacube["source_url"] to match #86 (Rename embeddings file to include MGRS code and store GeoTIFF source_url)
* Implement predict dataloader
* Add a unit test

Continuing on from #91, this PR is part 2/3 of working towards generating new embeddings from the model developed at #47.

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).
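The try-except fallback described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `get_source_url` is hypothetical, and the sketch assumes `chip_path` is either a `pathlib.Path` (local file) or a plain `str` (s3 URL).

```python
from pathlib import Path


def get_source_url(chip_path):
    """Return an absolute local path as a str, or the input unchanged for URLs.

    Hypothetical helper for illustration; not the actual ClayDataModule code.
    """
    try:
        # pathlib.Path has an .absolute() method, which must be *called*.
        return str(chip_path.absolute())
    except AttributeError:
        # Plain strings such as "s3://bucket/chip.tif" have no .absolute(),
        # so fall back to the string itself.
        return chip_path


local = get_source_url(Path("chip.tif"))
remote = get_source_url("s3://bucket/chip.tif")
```

By contrast, `getattr(chip_path, "absolute", chip_path)` returns the bound method object for a `Path` rather than the path itself, which is why the getattr approach did not work.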

Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.
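The point about comparing sorted outputs can be illustrated with a small sketch. The helper name `same_sources` and the example batches are hypothetical; the only thing taken from this PR is that each datacube dictionary carries a `source_url` key.

```python
def same_sources(batch_a, batch_b):
    """Order-insensitive comparison of the source_url entries of two batches.

    Hypothetical helper: two data modules may yield the same files in a
    different order, so the lists are sorted before being compared.
    """
    return sorted(batch_a["source_url"]) == sorted(batch_b["source_url"])


# Two made-up batches holding the same files in a different order
batch_from_module_a = {"source_url": ["s3://bucket/b.tif", "s3://bucket/a.tif"]}
batch_from_module_b = {"source_url": ["s3://bucket/a.tif", "s3://bucket/b.tif"]}
match = same_sources(batch_from_module_a, batch_from_module_b)
```

Comparing the unsorted lists directly would make the test depend on file listing or shuffling order, which is not guaranteed to be stable across the two modules.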

Not just testing one, but two different LightningDataModules now!

Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
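One way to apply these two settings from Python is via environment variables, as sketched below. This assumes the variables are set before GDAL (or a wrapper such as rasterio) opens any dataset; where exactly this PR sets them is not shown here.

```python
import os

# Skip listing sibling files in the directory when opening a GeoTIFF,
# which avoids extra requests against the s3 bucket.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

# Merge adjacent HTTP range requests into a single request.
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"
```

The same values could equally be exported in the shell or a job configuration; whether they help in practice depends on the files and the network, so the gain is worth measuring.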

weiji14 · 2023-12-19T08:27:15Z

src/datamodule.py

@@ -80,13 +80,16 @@ def __getitem__(self, idx):
        cube = self.read_chip(chip_path)

        # remove nans and convert to tensor
-        cube["pixels"] = torch.nan_to_num(torch.as_tensor(data=cube["pixels"]), nan=0.0)
+        cube["pixels"] = torch.as_tensor(data=cube["pixels"], dtype=torch.float16)


Heads up @srmsoumya: I've removed the NaN-to-0 replacement here, since the new batch of GeoTIFF files shouldn't have NaNs anymore per #68.

* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket

  Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

* 🚚 Rename datacube's path key to source_url

  Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

* 🚑 Use try-except to get absolute chip_path or fallback to str

  The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).

* ✨ Implement predict_dataloader for ClayDataModule

  Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

* ✅ Add parametrized test for checking ClayDataModule

  Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.

* 📝 Edit docstrings in test_datamodule.py to be more generic

  Not just testing one, but two different LightningDataModules now!

* 🔧 Add GDAL environment variables that might help with s3 loading

  Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.

weiji14 added 2 commits December 19, 2023 20:11

✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket

5634533

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

🚚 Rename datacube's path key to source_url

34f4c9e

Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

weiji14 added the data-pipeline Pull Requests about the data pipeline label Dec 19, 2023

weiji14 self-assigned this Dec 19, 2023

weiji14 added 3 commits December 19, 2023 21:09

🚑 Use try-except to get absolute chip_path or fallback to str

d0ffdde

The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).

✨ Implement predict_dataloader for ClayDataModule

7409e50

Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

weiji14 marked this pull request as ready for review December 19, 2023 08:25

weiji14 added 2 commits December 19, 2023 21:29

📝 Edit docstrings in test_datamodule.py to be more generic

71bcaf6

Not just testing one, but two different LightningDataModules now!

🔧 Add GDAL environment variables that might help with s3 loading

734435c

Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.

weiji14 mentioned this pull request Dec 19, 2023

Let ClayDataModule return same spatiotemporal fields as GeoTIFFDataModule #91

Merged

4 tasks


weiji14 merged commit df9aff5 into main Dec 19, 2023
1 check passed

weiji14 deleted the predict-from-s3 branch December 19, 2023 08:40
