Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade UK to data sampler #276

Merged
merged 19 commits into from
Dec 17, 2024
Merged

Upgrade UK to data sampler #276

merged 19 commits into from
Dec 17, 2024

Conversation

dfulu
Copy link
Member

@dfulu dfulu commented Nov 13, 2024

This PR upgrades the UK part datamodule to ocf-data-sampler. This works towards #273

Notes:

  • The datamodules for wind and pvsite have not been updated
  • This breaks PVNet for wind and pvsites due to an update in the target variable shape, and in the time units coming from data-sampler
  • I have also stripped out the exponential weighted losses. We haven't used them in a very long time if ever
  • Also stripped out parts of model training test routine, we only ever use train and validate
  • There are also a couple of minor bug fixes and clean ups that struck me along the way
  • Update tests

@dfulu dfulu requested a review from Sukh-P November 13, 2024 13:25
@Sukh-P
Copy link
Member

Sukh-P commented Nov 15, 2024

More of a general question, looking through am I right in thinking that we are (for the UK regional stuff at least) moving from saving batches (batch_size number of samples saved together) to saving each individual sample in separate .pt files, does the extra writes (a write to .pt for each sample now) and reads impact batch/sample creation time / training times significantly or is it negligible?

Copy link
Member

@Sukh-P Sukh-P left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great work, thanks for adding this in!

@dfulu
Copy link
Member Author

dfulu commented Nov 15, 2024

More of a general question, looking through am I right in thinking that we are (for the UK regional stuff at least) moving from saving batches (batch_size number of samples saved together) to saving each individual sample in separate .pt files, does the extra writes (a write to .pt for each sample now) and reads impact batch creation time / training times significantly or is it negligible?

Yes I did change it so we would save samples. I honestly don't know how it impacts speed, I haven't measured it. I can only say I haven't noticed much difference between how long it took before and what it takes now. But by saving samples we cut down a lot of RAM usage. Previously, we were loading a bunch of batches, splitting the batches into samples, shuffling the samples, and recombining back into batches on the fly. So a lot more samples had to be stored in memory at a given time. This was to make sure the same samples weren't always in a batch together. I do think that saving them as individual samples makes more sense. I think it is hard to evaluate performance-wise since it will depend on the hardware and since we plan to change the data structure we save to disk, it would soon become outdated. So honestly I don't know about speed, but it does cut down RAM usage, which might allow us to use more workers for a given machine size

@Sukh-P
Copy link
Member

Sukh-P commented Nov 20, 2024

Starting to use this code to test out the Site Torch Dataset pipeline, whilst doing that I thought it would be good to update the config example parameters in the datamodule folder which mention "batch" to "samples" to align with the new save_samples.py script

In fact more generally that example config folder needs updating to match the Configuration data model now in ocf-data-sampler

@peterdudfield
Copy link
Contributor

is it good to merge this? Make sure @Sukh-P is working from that branch then?

@dfulu
Copy link
Member Author

dfulu commented Nov 25, 2024

Well there is still the example configs to update, but yeh then we could merge into the dev-data-sampler branch

@Sukh-P Sukh-P merged commit 82b6009 into dev-data-sampler Dec 17, 2024
3 checks passed
@Sukh-P Sukh-P deleted the data_sampler branch December 18, 2024 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants