Process Met Office extended NWP data #4

Open · 8 tasks
zakwatts opened this issue Jan 2, 2025 · 5 comments

zakwatts commented Jan 2, 2025

Previously, the Met Office extended NWP data was processed for the purpose of running the 2023 backtest, so that year was prioritised and processed before the earlier years had finished downloading.

The 2017-2024 data needs to be re-processed and uploaded to a GCP disk so it can be used for training a DA PVNet model. (Additional init times may have been downloaded since, so it is likely worth re-processing everything again.)

Steps:

  1. Identify the location of the data and confirm that the desired data exists and has been downloaded (on Leonardo). Availability can be checked via Dagster or at the specific location given in the data availability Google Sheet. The location of the data can be found here. Worth checking with @devsjc that this is still the correct location.
  2. Convert the raw Met Office NWP data into unzipped zarr format; the script `unzip_mo.py` can be used for this.
  3. Merge the data into yearly zarrs; the script `combine_proc_zarrs_mo.py` can be used for this (a rough merge sketch is included after this list).
  4. Validate the data via visualisation and testing (size, variables, NaNs, ...). There are some scripts and notebooks in this repo to help with this, and a validation sketch also follows this list.
  5. Upload the yearly zarrs to the Google Storage bucket gs://solar-pv-nowcasting-data/NWP/UK_Met_Office under the name UKV_extended_v2.
  6. Create a new disk on GCP which is a duplicate of uk-all-inputs-v2; the duplicate can be called uk-all-inputs-v3. Given we are adding possibly 8 TB of new data, the disk size will need to be increased to avoid running out of storage space when transferring data.
  7. Mount the new disk to a VM in read/write mode.
  8. Move the new data from Google Storage to the GCP disk. The new data should go in a folder named UKV_v9 to maintain the naming convention, and the respective README should be updated accordingly.

Steps 1-3 are done on Leonardo, where the raw data is located.
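
For step 3, a minimal sketch of what the yearly merge could look like, assuming the unzipped per-init-time zarrs share the same variable/step/x/y coordinates. The paths, directory layout and chunk sizes below are illustrative assumptions, not the ones used by `combine_proc_zarrs_mo.py`:

```python
"""Sketch of merging per-init-time zarrs into one yearly zarr (step 3).
Paths and chunking are illustrative, not taken from combine_proc_zarrs_mo.py."""
from glob import glob

import xarray as xr

year = 2022
# Hypothetical layout: one zarr per init time under a per-year directory.
paths = sorted(glob(f"/mnt/storage_x/nwp/mo_extended/{year}/*.zarr"))

datasets = [xr.open_zarr(p) for p in paths]

# Concatenate along init_time and write a single yearly zarr.
ds_year = xr.concat(datasets, dim="init_time").sortby("init_time")
ds_year = ds_year.chunk({"init_time": 1, "step": -1, "x": -1, "y": -1})

# Drop chunk encodings inherited from the source zarrs so the new
# chunking is used when writing.
for var in ds_year.variables:
    ds_year[var].encoding.pop("chunks", None)

ds_year.to_zarr(f"/mnt/storage_x/nwp/mo_extended/UKV_{year}.zarr", mode="w")
```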
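
For step 4, a rough validation sketch: open a merged yearly zarr and check dimensions, variables and NaN fraction. The path and the spot-check approach are assumptions for illustration, separate from the scripts and notebooks already in this repo:

```python
"""Sketch of basic validation checks on a merged yearly zarr (step 4).
The path is illustrative."""
import numpy as np
import xarray as xr

ds = xr.open_zarr("/mnt/storage_x/nwp/mo_extended/UKV_2022.zarr")

print(dict(ds.sizes))               # dimension sizes: init_time, step, x, y, ...
print(list(ds["variable"].values))  # channel names packed into the UKV array

# Spot-check a few random init times for NaNs rather than loading everything.
rng = np.random.default_rng(0)
for i in rng.choice(ds.sizes["init_time"], size=5, replace=False):
    sample = ds["UKV"].isel(init_time=int(i)).compute()
    nan_frac = float(np.isnan(sample).mean())
    print(f"init_time index {i}: NaN fraction = {nan_frac:.4f}")
```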

Possible Errors and Issues along the way

  • Leonardo goes down, stopping data processing tasks.
  • Worker/thread settings not being optimal, or other users of Leonardo taking up resources.
  • Slow data transfer speeds out of Leonardo.
  • See "Issues and Important Considerations for NWP Processing" in the README of this repo for more.
  • Duplicating the GCP disk means the duplicate needs to be renamed to differentiate it.
  • The duplicate disk needs to be increased in size to fit the new data and avoid running out of space.
  • Leonardo running out of storage on the disk that is being worked on.

It's worth noting that due to the file sizes, this process can take a long time.


zakwatts commented Jan 3, 2025

[Screenshot 2025-01-03 at 14:53:08]

On Leonardo, Storage B is full (100%) and Storage C is close to full (99%).


zakwatts commented Jan 3, 2025

[Screenshot 2025-01-03 at 15:03:20]

It appears we are still missing lots of Met Office UK Extended NWP data (using init times from `/mnt/storage_b/nwp/ceda/uk`).

@devsjc, for when you are back, is there another location where this data might be kept? I was under the impression that it was downloaded, but maybe we still need to pull it.


zakwatts commented Jan 3, 2025

A Dagster partition fill was run on the 17th of December to download 2024, but it failed, possibly due to the storage_b disk becoming full.

peterdudfield assigned devsjc and unassigned felix-e-h-p on Jan 3, 2025
@felix-e-h-p

> [Screenshot 2025-01-03 at 15:03:20] Appears we are still missing lots of Met Office UK Extended NWP data. (Using init times from `/mnt/storage_b/nwp/ceda/uk`)
>
> @devsjc, for when you are back, is there another location where this data might be kept? I was under the impression that it was downloaded but maybe we still need to pull it.

Just to highlight this further, a few errors are notable when executing `save_samples.py`.

Brief structural overview of both the 2022 and 2023 data (worth noting also that some keys are not present for certain dates):

Structure for year 2022:
Available keys: ['UKV', 'init_time', 'step', 'variable', 'x', 'y']

Attributes: {}

Array shapes:
UKV: (12, 232, 51, 704, 548)
init_time: (232,)
step: (51,)
variable: (12,)
x: (548,)
y: (704,)

Step index analysis:
Total steps: 51
Unique steps: 51
Min step: 0.0
Max step: 54.0
Has gaps: False
Step differences: [1.0, 3.0]

Structure for year 2023:
Available keys: ['UKV', 'init_time', 'step', 'variable', 'x', 'y']

Attributes: {}

Array shapes:
UKV: (12, 2759, 51, 704, 548)
init_time: (2759,)
step: (51,)
variable: (12,)
x: (548,)
y: (704,)

Step index analysis:
Total steps: 51
Unique steps: 51
Min step: 0.0
Max step: 54.0
Has gaps: False
Step differences: [1.0, 3.0]
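
For reference, an overview like the one above can be produced by inspecting the zarr group directly. A rough sketch, assuming an illustrative path for the yearly zarrs (the step values print in whatever units they are stored in, hours in the output above):

```python
"""Sketch of a structural overview of a yearly zarr. The path is illustrative."""
import numpy as np
import zarr

store = zarr.open("/mnt/storage_x/nwp/mo_extended/UKV_2023.zarr", mode="r")

print("Available keys:", sorted(store.array_keys()))
print("Attributes:", dict(store.attrs))

print("Array shapes:")
for name in store.array_keys():
    print(f"  {name}: {store[name].shape}")

# Summarise the step index and the spacing between consecutive steps.
steps = store["step"][:]
diffs = np.unique(np.diff(np.sort(steps)))
print("Step index analysis:")
print("  Total steps:", steps.size)
print("  Unique steps:", np.unique(steps).size)
print("  Min step:", steps.min(), "Max step:", steps.max())
print("  Step differences:", diffs.tolist())
```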

@felix-e-h-p

Regarding the specific errors:

ValueError: num_samples=0 (2023 validation error)
The first 5 days of 2023 have no data, i.e. the earliest data point in 2023 is at January 6th, 12:00:00.

KeyError: "not all values found in index 'step'" (Cross year error)
Caused by structural changes in the data between years: pre-2023 there are approx. 230-250 init_times per year at irregular intervals; 2023 has 2759 init_times at more regular 3-hour intervals; 2024 has 1500 init_times at consistent 3-hour intervals.

Step indexing seems to fail when handling the different temporal structures across years. I think that, fundamentally, this structural break is causing the dataloader to fail when it has to work across what are effectively different data formats.
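
One way to make the mismatch visible, and a possible workaround, is to compare the 'step' index of each yearly zarr and restrict to the common subset before sampling across years. A hedged sketch only (the paths are illustrative, and this is not how `save_samples.py` currently handles it):

```python
"""Sketch: diagnose the cross-year 'step' mismatch and align to common steps.
Paths are illustrative assumptions."""
from functools import reduce

import numpy as np
import xarray as xr

years = [2022, 2023, 2024]
datasets = {
    y: xr.open_zarr(f"/mnt/storage_x/nwp/mo_extended/UKV_{y}.zarr") for y in years
}

# Report each year's step index so any mismatch is visible.
for y, ds in datasets.items():
    print(y, "steps:", ds["step"].values)

# Intersection of steps present in every year.
common_steps = reduce(
    np.intersect1d, (ds["step"].values for ds in datasets.values())
)
print("common steps:", common_steps)

# Selecting only the common steps avoids the "not all values found in
# index 'step'" KeyError when concatenating or sampling across years.
aligned = [ds.sel(step=common_steps) for ds in datasets.values()]
```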
