-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow ClayDataModule to load GeoTIFF files directly from s3 #92
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule
The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).
Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.
Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.
Not just testing one, but two different LightningDataModules now!
Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
4 tasks
weiji14
commented
Dec 19, 2023
@@ -80,13 +80,16 @@ def __getitem__(self, idx): | |||
cube = self.read_chip(chip_path) | |||
|
|||
# remove nans and convert to tensor | |||
cube["pixels"] = torch.nan_to_num(torch.as_tensor(data=cube["pixels"]), nan=0.0) | |||
cube["pixels"] = torch.as_tensor(data=cube["pixels"], dtype=torch.float16) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heads up @srmsoumya that I've remove the NaN to 0 clipping here, since the new batch of GeoTIFF files shouldn't have NaNs anymore per #68.
brunosan
pushed a commit
that referenced
this pull request
Dec 27, 2023
* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85. * 🚚 Rename datacube's path key to source_url Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule * 🚑 Use try-except to get absolute chip_path or fallback to str The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs). * ✨ Implement predict_dataloader for ClayDataModule Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled. * ✅ Add parametrized test for checking ClayDataModule Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted. * 📝 Edit docstrings in test_datamodule.py to be more generic Not just testing one, but two different LightningDataModules now! * 🔧 Add GDAL environment variables that might help with s3 loading Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Similar to work done in #85 on the GeoTIFFDataPipeModule, this PR implements similar functionality in ClayDataModule to load GeoTIFF files from an s3 bucket. Plus a few more minor tweaks to align both LightningDataModules.
Implementation uses
torchdata
's S3FileLister to get the files, but instead of returning an iterator, a list is returned.TODO:
datacube["path"]
todatacube["source_url"]
to match Rename embeddings file to include MGRS code and store GeoTIFF source_url #86Continuing on from #91, this PR is part 2/3 of working towards generating new embeddings from the model developed at #47.