Process Met Office extended NWP data #4

Open · 8 tasks
zakwatts opened this issue Jan 2, 2025 · 5 comments

zakwatts commented Jan 2, 2025

Previously, the Met Office extended NWP data was processed for the purpose of running the 2023 backtest, so that year was prioritised and processed before the earlier years had finished downloading.

The 2017-2024 data needs to be re-processed and uploaded to a GCP disk so it can be used for training a DA PVNet model. (Additional init times may have been downloaded since, so it is likely worth re-processing everything again.)

Steps:

  1. Identify the location of the data and confirm that the desired data exists and has been downloaded (on Leonardo). Availability can be checked via Dagster or at the specific location given in the data availability Google Sheet. The location of the data can be found here. Worth checking with @devsjc that this is still the correct location.
  2. Convert the raw Met Office NWP data into unzipped zarr format; the script `unzip_mo.py` can be used for this.
  3. Merge the data into yearly zarrs; the script `combine_proc_zarrs_mo.py` can be used for this (a rough merge sketch is included after this list).
  4. Validate the data via visualisation and testing (size, variables, NaNs, ...). There are some scripts and notebooks in this repo to help with this, and a validation sketch also follows this list.
  5. Upload the yearly zarrs to the Google Storage bucket gs://solar-pv-nowcasting-data/NWP/UK_Met_Office under the name UKV_extended_v2.
  6. Create a new disk on GCP which is a duplicate of uk-all-inputs-v2; the duplicate can be called uk-all-inputs-v3. Given we are adding possibly 8 TB of new data, the disk size will need to be increased to avoid running out of storage space when transferring data.
  7. Mount the new disk to a VM in read/write mode.
  8. Move the new data from Google Storage to the GCP disk. The new data should go in a folder named UKV_v9 to maintain the naming convention, and the respective README should be updated accordingly.

Steps 1-3 are done on Leonardo, where the raw data is located.
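
For step 3, a minimal sketch of what the yearly merge could look like, assuming the unzipped per-init-time zarrs share the same variable/step/x/y coordinates. The paths, directory layout and chunk sizes below are illustrative assumptions, not the ones used by `combine_proc_zarrs_mo.py`:

```python
"""Sketch of merging per-init-time zarrs into one yearly zarr (step 3).
Paths and chunking are illustrative, not taken from combine_proc_zarrs_mo.py."""
from glob import glob

import xarray as xr

year = 2022
# Hypothetical layout: one zarr per init time under a per-year directory.
paths = sorted(glob(f"/mnt/storage_x/nwp/mo_extended/{year}/*.zarr"))

datasets = [xr.open_zarr(p) for p in paths]

# Concatenate along init_time and write a single yearly zarr.
ds_year = xr.concat(datasets, dim="init_time").sortby("init_time")
ds_year = ds_year.chunk({"init_time": 1, "step": -1, "x": -1, "y": -1})

# Drop chunk encodings inherited from the source zarrs so the new
# chunking is used when writing.
for var in ds_year.variables:
    ds_year[var].encoding.pop("chunks", None)

ds_year.to_zarr(f"/mnt/storage_x/nwp/mo_extended/UKV_{year}.zarr", mode="w")
```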
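
For step 4, a rough validation sketch: open a merged yearly zarr and check dimensions, variables and NaN fraction. The path and the spot-check approach are assumptions for illustration, separate from the scripts and notebooks already in this repo:

```python
"""Sketch of basic validation checks on a merged yearly zarr (step 4).
The path is illustrative."""
import numpy as np
import xarray as xr

ds = xr.open_zarr("/mnt/storage_x/nwp/mo_extended/UKV_2022.zarr")

print(dict(ds.sizes))               # dimension sizes: init_time, step, x, y, ...
print(list(ds["variable"].values))  # channel names packed into the UKV array

# Spot-check a few random init times for NaNs rather than loading everything.
rng = np.random.default_rng(0)
for i in rng.choice(ds.sizes["init_time"], size=5, replace=False):
    sample = ds["UKV"].isel(init_time=int(i)).compute()
    nan_frac = float(np.isnan(sample).mean())
    print(f"init_time index {i}: NaN fraction = {nan_frac:.4f}")
```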

Possible Errors and Issues along the way

  • Leonardo goes down, stopping data processing tasks.
  • Worker/thread settings not being optimal, or other users of Leonardo taking up resources.
  • Slow data transfer speeds out of Leonardo.
  • See "Issues and Important Considerations for NWP Processing" in the README of this repo for more.
  • Duplicating the GCP disk means the duplicate needs to be renamed to differentiate it.
  • The duplicate disk needs to be increased in size to fit the new data and avoid running out of space.
  • Leonardo running out of storage on the disk that is being worked on.

It's worth noting that due to the file sizes, this process can take a long time.


zakwatts commented Jan 3, 2025

[Screenshot 2025-01-03 at 14:53:08]

On Leonardo, Storage B is full (100%) and Storage C is close to full (99%).


zakwatts commented Jan 3, 2025

[Screenshot 2025-01-03 at 15:03:20]

It appears we are still missing lots of Met Office UK Extended NWP data (using init times from `/mnt/storage_b/nwp/ceda/uk`).

@devsjc, for when you are back, is there another location where this data might be kept? I was under the impression that it was downloaded, but maybe we still need to pull it.


zakwatts commented Jan 3, 2025

A Dagster partition fill was run on the 17th of December to download 2024, but it failed, possibly due to the storage_b disk becoming full.

peterdudfield assigned devsjc and unassigned felix-e-h-p on Jan 3, 2025
@felix-e-h-p

> [Screenshot 2025-01-03 at 15:03:20] Appears we are still missing lots of Met Office UK Extended NWP data. (Using init times from `/mnt/storage_b/nwp/ceda/uk`)
>
> @devsjc, for when you are back, is there another location where this data might be kept? I was under the impression that it was downloaded but maybe we still need to pull it.

Just to highlight this further, a few errors are notable when executing `save_samples.py`.

Brief structural overview of both the 2022 and 2023 data (worth noting also that some keys are not present for certain dates):

Structure for year 2022:
Available keys: ['UKV', 'init_time', 'step', 'variable', 'x', 'y']

Attributes: {}

Array shapes:
UKV: (12, 232, 51, 704, 548)
init_time: (232,)
step: (51,)
variable: (12,)
x: (548,)
y: (704,)

Step index analysis:
Total steps: 51
Unique steps: 51
Min step: 0.0
Max step: 54.0
Has gaps: False
Step differences: [1.0, 3.0]

Structure for year 2023:
Available keys: ['UKV', 'init_time', 'step', 'variable', 'x', 'y']

Attributes: {}

Array shapes:
UKV: (12, 2759, 51, 704, 548)
init_time: (2759,)
step: (51,)
variable: (12,)
x: (548,)
y: (704,)

Step index analysis:
Total steps: 51
Unique steps: 51
Min step: 0.0
Max step: 54.0
Has gaps: False
Step differences: [1.0, 3.0]
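
For reference, an overview like the one above can be produced by inspecting the zarr group directly. A rough sketch, assuming an illustrative path for the yearly zarrs (the step values print in whatever units they are stored in, hours in the output above):

```python
"""Sketch of a structural overview of a yearly zarr. The path is illustrative."""
import numpy as np
import zarr

store = zarr.open("/mnt/storage_x/nwp/mo_extended/UKV_2023.zarr", mode="r")

print("Available keys:", sorted(store.array_keys()))
print("Attributes:", dict(store.attrs))

print("Array shapes:")
for name in store.array_keys():
    print(f"  {name}: {store[name].shape}")

# Summarise the step index and the spacing between consecutive steps.
steps = store["step"][:]
diffs = np.unique(np.diff(np.sort(steps)))
print("Step index analysis:")
print("  Total steps:", steps.size)
print("  Unique steps:", np.unique(steps).size)
print("  Min step:", steps.min(), "Max step:", steps.max())
print("  Step differences:", diffs.tolist())
```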

@felix-e-h-p

Regarding the specific errors:

ValueError: num_samples=0 (2023 validation error)
The first 5 days of 2023 have no data, i.e. the earliest data point in 2023 is at January 6th, 12:00:00.

KeyError: "not all values found in index 'step'" (Cross year error)
Caused by structural changes in the data between years: pre-2023 there are approx. 230-250 init_times per year at irregular intervals; 2023 has 2759 init_times at more regular 3-hour intervals; 2024 has 1500 init_times at consistent 3-hour intervals.

Step indexing seems to fail when handling the different temporal structures across years. I think that, fundamentally, this structural break is causing the dataloader to fail when it has to work across what are effectively different data formats.
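
One way to make the mismatch visible, and a possible workaround, is to compare the 'step' index of each yearly zarr and restrict to the common subset before sampling across years. A hedged sketch only (the paths are illustrative, and this is not how `save_samples.py` currently handles it):

```python
"""Sketch: diagnose the cross-year 'step' mismatch and align to common steps.
Paths are illustrative assumptions."""
from functools import reduce

import numpy as np
import xarray as xr

years = [2022, 2023, 2024]
datasets = {
    y: xr.open_zarr(f"/mnt/storage_x/nwp/mo_extended/UKV_{y}.zarr") for y in years
}

# Report each year's step index so any mismatch is visible.
for y, ds in datasets.items():
    print(y, "steps:", ds["step"].values)

# Intersection of steps present in every year.
common_steps = reduce(
    np.intersect1d, (ds["step"].values for ds in datasets.values())
)
print("common steps:", common_steps)

# Selecting only the common steps avoids the "not all values found in
# index 'step'" KeyError when concatenating or sampling across years.
aligned = [ds.sel(step=common_steps) for ds in datasets.values()]
```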
