Fixes issue with model reconstruction of the upper half of the image & saves model checkpoint in s3 #193

srmsoumya · 2024-03-26T05:24:24Z

This PR fixes an issue where the model reconstructed only the bottom half of the image during validation, and stores model checkpoints in S3.

- Adds a `shuffle` argument to ClayModule that is set to False by default
- Logs model checkpoints to an AWS S3 bucket
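As a rough illustration of a `shuffle` flag that defaults to False, here is a minimal plain-Python sketch. The `iter_batches` helper and the toy `chips` list are hypothetical stand-ins and are not taken from the ClayModule code in this PR.

```python
import random


def iter_batches(samples, batch_size, shuffle=False, seed=None):
    """Yield lists of samples; order is randomized only when shuffle=True."""
    order = list(samples)
    if shuffle:
        # A seeded Random instance keeps the shuffle reproducible.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


chips = list(range(8))  # hypothetical stand-ins for image chips

# Default behaviour: samples come out in their original order.
default_batches = list(iter_batches(chips, batch_size=4))

# Opt-in shuffling, as one might request for training and validation runs.
shuffled_batches = list(iter_batches(chips, batch_size=4, shuffle=True, seed=0))
```

With the default, every epoch sees the samples in the same order; passing `shuffle=True` changes only the order, not which samples are seen.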

Fixes #156 #138


yellowcap

Looks good except for some strange test errors 🐈

half of the image & saves model checkpoint in s3 (#193)

- Fix issue with not shuffling during validation run. Use shuffle=True while training & validation.
- Log to devseed-gaia account of wandb & save checkpoints on s3.
- Update params for v0.2 model run
  - Lr -> 1e-4 to 1e-5
  - Data -> Size: 256 x 256, patchsize: 16
  - Log checkpoints to s3
  - Save model params along with optimizer & epoch state
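To illustrate why saving model params together with optimizer and epoch state matters for resuming a run, here is a minimal sketch using only the standard library. The function names, the dictionary keys, and the toy values are assumptions for illustration; the actual PR relies on the training framework's own checkpointing and writes to S3 rather than a local temporary directory.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(path, model_state, optimizer_state, epoch):
    """Write model parameters, optimizer state and epoch to a single file."""
    payload = {
        "model_state": model_state,
        "optimizer_state": optimizer_state,
        "epoch": epoch,
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)


def load_checkpoint(path):
    """Read a checkpoint back so training can continue where it stopped."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-ins for real tensors and optimizer buffers (hypothetical values).
model_state = {"encoder.weight": [0.1, -0.2, 0.3]}
optimizer_state = {"lr": 1e-5, "step": 1200}

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = Path(tmp) / "epoch-3.ckpt"
    save_checkpoint(ckpt_path, model_state, optimizer_state, epoch=3)
    restored = load_checkpoint(ckpt_path)

# With the epoch stored, a resumed run knows where to pick up.
next_epoch = restored["epoch"] + 1
```

A checkpoint that holds only the model weights can be used for inference, but resuming training also needs the optimizer state and the epoch counter.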

srmsoumya · 2024-04-19T11:51:24Z

@weiji14 I am getting an error when creating a conda-lock.yml file with the new dependency.

Encountered problems while solving:
  - package pytorch-2.1.0-cuda120py38h1932296_301 requires cuda-version >=12.0,<13, but none of the providers can be installed

For now, I have merged this branch with main, as we need to develop v1 on top of v0.2. We can fix the issues with conda-lock & do a v0.2 release next week.

Remove the `--platform linux-64` flag since unified lockfile is for linux-64, osx-64 and osx-arm64 as of #164. Also re-locking the conda-lock.yml file after 2a9ef9d/#193.

weiji14 · 2024-04-21T21:14:48Z

> @weiji14 I am getting an error when creating a conda-lock.yml file with the new dependency.
> Encountered problems while solving:
>   - package pytorch-2.1.0-cuda120py38h1932296_301 requires cuda-version >=12.0,<13, but none of the providers can be installed

Hmm, did you run `conda-lock lock --mamba --file environment.yml --with-cuda=12.0`? I get the same error you got without the `--with-cuda=12.0` flag. For reference, my conda-lock/mamba versions are:

$ conda-lock --version
conda-lock, version 2.5.6
$ mamba --version
mamba 1.5.8
conda 24.3.0
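For anyone scripting the re-lock step, the command quoted above could be assembled as follows. This is only a sketch: the `build_relock_command` helper is hypothetical, and the flags are exactly the ones mentioned in this comment, nothing more.

```python
import shlex


def build_relock_command(env_file="environment.yml", cuda_version="12.0"):
    """Return the conda-lock invocation as an argument list."""
    return [
        "conda-lock",
        "lock",
        "--mamba",
        "--file",
        env_file,
        # Without this flag the solver fails on the cuda-version constraint.
        f"--with-cuda={cuda_version}",
    ]


command = build_relock_command()
command_line = shlex.join(command)
print(command_line)
```

On a machine with conda-lock installed, the list in `command` can be passed to `subprocess.run(command, check=True)` to perform the actual re-lock.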

I'll patch this up at #225, and also update the docs slightly under the Note section in https://clay-foundation.github.io/model/installation.html#advanced about re-locking the conda-lock.yml file.


SRM added 4 commits March 11, 2024 13:03

Update params for v0.2 model run

e25d46f

- Lr -> 1e-4 to 1e-5
- Data -> Size: 256 x 256, patchsize: 16
- Log checkpoints to s3
- Save model params along with optimizer & epoch state

Log to devseed-gaia account of wandb & save checkpoints on s3.

aa3321b

Fix issue with not shuffling during validation run. Use shuffle=True …

beee502

…while training & validation.

Merge branch 'main' into clay-v0.2-run

2d5d22a

srmsoumya requested a review from yellowcap March 26, 2024 05:24

[pre-commit.ci] auto fixes from pre-commit.com hooks

fcbc871

for more information, see https://pre-commit.ci

This was referenced Mar 26, 2024

Upper image does not train? #156

Closed

Unable to continue training from checkpoint. #138

Closed

yellowcap approved these changes Mar 26, 2024


SRM added 2 commits April 19, 2024 15:53

Add s3fs to environment.yml

bbfe00e

Merge branch 'main' into clay-v0.2-run

270dbc7

srmsoumya closed this Apr 19, 2024

weiji14 mentioned this pull request Apr 21, 2024

Update instructions to re-lock conda-lock.yml file #225

Merged
