Upgrade UK to data sampler #276
Conversation
More of a general question: looking through, am I right in thinking that (for the UK regional stuff at least) we are moving from saving batches (batch_size samples saved together) to saving each individual sample in a separate .pt file? Do the extra writes (one .pt write per sample now) and reads impact batch/sample creation time or training time significantly, or is it negligible?
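For context, a minimal sketch of the two on-disk layouts being compared (illustrative only, not the actual pipeline code; the file names and the contents of a sample are placeholders):

```python
import torch

# A "sample" here is just a dict of tensors for illustration; real samples
# in the pipeline carry the actual model inputs/targets.
samples = [{"x": torch.randn(4, 64, 64), "y": torch.randn(16)} for _ in range(8)]

# Previous style: one .pt file holds a whole batch, so one write per batch.
torch.save(samples, "batch_000.pt")

# New style: one .pt file per sample, so more (but smaller) writes and reads.
for i, sample in enumerate(samples):
    torch.save(sample, f"{i:08d}.pt")
```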
Really great work, thanks for adding this in!
Yes, I did change it so we save samples. I honestly don't know how it impacts speed; I haven't measured it. I can only say I haven't noticed much difference between how long it took before and how long it takes now. But by saving samples we cut down a lot of RAM usage. Previously, we were loading a bunch of batches, splitting the batches into samples, shuffling the samples, and recombining them back into batches on the fly, so many more samples had to be held in memory at a given time. This was to make sure the same samples weren't always in a batch together. I do think that saving them as individual samples makes more sense. It is hard to evaluate performance-wise, since it will depend on the hardware, and since we plan to change the data structure we save to disk, any benchmark would soon become outdated. So honestly I don't know about speed, but it does cut down RAM usage, which might allow us to use more workers for a given machine size.
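A rough sketch of the per-sample loading pattern described above, assuming samples are dicts of tensors written with torch.save (class and directory names are illustrative, not the repo's actual code):

```python
import glob
import os
import torch
from torch.utils.data import Dataset, DataLoader


class PreSavedSampleDataset(Dataset):
    """Illustrative dataset that reads one pre-saved sample .pt file per index."""

    def __init__(self, sample_dir: str):
        self.paths = sorted(glob.glob(os.path.join(sample_dir, "*.pt")))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> dict:
        # Only this one sample is loaded into memory by the worker handling it.
        return torch.load(self.paths[idx])


# Shuffling and batching are delegated to the DataLoader, so samples end up in
# different batches each epoch without holding many batches in RAM at once,
# and the lower per-worker memory footprint leaves room for more workers.
loader = DataLoader(
    PreSavedSampleDataset("samples/"),
    batch_size=8,
    shuffle=True,
    num_workers=4,
)
```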
Starting to use this code to test out the Site Torch Dataset pipeline. While doing that, I thought it would be good to update the example config parameters in the datamodule folder that mention "batch" so they say "samples" instead, to align with the new naming. In fact, more generally, that example config folder needs updating to match the current Configuration data model.
Is it good to merge this? Should we make sure @Sukh-P is working from that branch then?
Well, there are still the example configs to update, but yeah, after that we could merge this in.
This PR upgrades the UK part of the datamodule to use ocf-data-sampler. This works towards #273.
Notes: