ML pipeline to make samples #62

Open
1 of 4 tasks
peterdudfield opened this issue Feb 21, 2025 · 8 comments
Labels
contributions-welcome Good issue for open-source contribution

Comments

@peterdudfield
Contributor

peterdudfield commented Feb 21, 2025

Detailed Description

Following on from #1 I wanted to write this issue.

We currently have lots of NWP data and PVLive data. It's too much to fit into memory, so we have to cut it down ready for ML experiments. The way we've done this in the past is to create samples of data. These are smaller chunks of the data that contain specific data for a certain time (and space). These then get batched up in a dataloader and the ML model can then train on them.

So we want to build a pipeline for making these samples (most of the work is done in ocf-data-sampler).
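
To make the idea of a "sample" concrete, here is a minimal sketch using only the Python standard library. It is not how ocf-data-sampler works internally; the toy half-hourly series, the helper names `make_sample` and `batches`, and the window of 60 minutes of history plus 480 minutes of forecast horizon are illustrative assumptions.

```python
import random
from datetime import datetime, timedelta


def make_sample(series, t0, start_minutes, end_minutes, resolution_minutes):
    """Cut the window [t0 + start, t0 + end] out of a timestamp -> value mapping."""
    step = timedelta(minutes=resolution_minutes)
    t = t0 + timedelta(minutes=start_minutes)
    end = t0 + timedelta(minutes=end_minutes)
    values = []
    while t <= end:
        values.append(series[t])
        t += step
    return {"t0": t0, "values": values}


def batches(samples, batch_size):
    """Group samples into lists of at most batch_size, as a dataloader would."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]


# Toy stand-in for the full dataset: two days of half-hourly values.
start = datetime(2025, 1, 1)
series = {start + timedelta(minutes=30 * i): float(i) for i in range(96)}

# A t0 is usable only if the whole window around it exists in the data.
valid_t0s = [
    t for t in series
    if t - timedelta(minutes=60) in series and t + timedelta(minutes=480) in series
]

rng = random.Random(0)  # fixed seed so the selection is reproducible
t0s = rng.sample(valid_t0s, 8)
samples = [make_sample(series, t0, -60, 480, 30) for t0 in t0s]
batched = list(batches(samples, 4))
```

Each sample holds only the small window around its own `t0`, so the full dataset never has to sit in memory at training time.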

Context

  • NWP = numerical weather prediction data
  • PVLive = national solar generation data
  • We have GFS on S3 and we have been collecting Met Office Global data
  • @jcamier @siddharth7113 and others have been working on this already
  • ocf-data-sampler is a Python library used to create samples from large datasets.

Possible Implementation

There are lots of ways to do this, but the suggestion is to use ocf-data-sampler.

You start with a data configuration (see below) that tells ocf-data-sampler what to load and other specific bits.

It would be really great to

  1. Create a configuration for this project, perhaps starting with GFS and PVLive, and then adding Met Office later
  2. Using this configuration, run ocf-data-sampler and make some samples
  3. Make a script for step 2 so that others can use it
  4. Save the samples, maybe in S3, so others can use them
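
Steps 2 to 4 could be wrapped in a small script along these lines. This is a sketch under stated assumptions, not the project's actual script: `save_samples` is a hypothetical helper, it assumes the dataset supports `len()` and integer indexing, and it uses `pickle` only as a stand-in file format. A toy list stands in for the ocf-data-sampler dataset, because the constructor arguments for that class are not given in this issue.

```python
import pickle
import tempfile
from pathlib import Path


def save_samples(dataset, output_dir, num_samples):
    """Write the first num_samples items of an indexable dataset, one file each."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(min(num_samples, len(dataset))):
        path = output_dir / f"sample_{i:06d}.pkl"
        with path.open("wb") as f:
            pickle.dump(dataset[i], f)
        paths.append(path)
    return paths


# Toy stand-in dataset (assumption): a real run would pass the dataset object
# built from the data configuration instead of this list.
toy_dataset = [{"t0_index": i, "values": [i, i + 1, i + 2]} for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    written = save_samples(toy_dataset, Path(tmp) / "samples", num_samples=3)
    written_names = [p.name for p in written]
    # Read one sample back to check the round trip.
    with written[0].open("rb") as f:
        first_loaded = pickle.load(f)
```

Writing one file per sample keeps the output directory easy to copy to shared storage such as S3 afterwards, and lets others load individual samples without the raw data.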

The ocf-data-sampler class we recommend using is PVNetUKRegionalDataset, but there are a few things that might need adding like

@peterdudfield peterdudfield added enhancement New feature or request and removed enhancement New feature or request labels Feb 21, 2025
@peterdudfield peterdudfield added the contributions-welcome Good issue for open-source contribution label Feb 21, 2025
@peterdudfield
Contributor Author

peterdudfield commented Feb 21, 2025

Example of config, not tested

general:
  description: Example config for producing PVNet samples
  name: example_config

input_data:
  gsp:
    # Path to GSP data in zarr format
    # e.g. gs://solar-pv-nowcasting-data/PV/GSP/v7/pv_gsp.zarr
    zarr_path: PLACEHOLDER.zarr #TODO update, pull from s3
    interval_start_minutes: -60
    # Specified for intraday currently
    interval_end_minutes: 480
    time_resolution_minutes: 30
    # Random value from the list below will be chosen as the delay when dropout is used
    # If set to null no dropout is applied. Only values before t0 are dropped out for GSP.
    # Values after t0 are assumed as targets and cannot be dropped.
    dropout_timedeltas_minutes: null
    dropout_fraction: 0 # Fraction of samples with dropout

  nwp:
    gfs:
      provider: gfs
      # Path to GFS NWP data in zarr format
      # e.g. (UKV example path, for format only) gs://solar-pv-nowcasting-data/NWP/UK_Met_Office/UKV_intermediate_version_7.zarr
      # n.b. It is not necessary to use multiple or any NWP data. These entries can be removed
      zarr_path: PLACEHOLDER.zarr #TODO update, pull from s3
      interval_start_minutes: -60
      # Specified for intraday currently
      interval_end_minutes: 480
      time_resolution_minutes: 60
      channels: #TODO this needs updating
        - t # 2-metre temperature
        - dswrf # downwards short-wave radiation flux
        - dlwrf # downwards long-wave radiation flux
        - hcc # high cloud cover
        - mcc # medium cloud cover
        - lcc # low cloud cover
        - sde # snow depth water equivalent
        - r # relative humidity
        - vis # visibility
        - si10 # 10-metre wind speed
        - wdir10 # 10-metre wind direction
        - prate # precipitation rate
        # These variables exist in CEDA training data but not in the live MetOffice service
        - hcct # height of convective cloud top, meters above surface. NaN if no clouds
        - cdcb # height of lowest cloud base > 3 oktas
        - dpt # dew point temperature
        - prmsl # mean sea level pressure
        - h # geometrical? (maybe geopotential?) height
      image_size_pixels_height: 24
      image_size_pixels_width: 24
      dropout_timedeltas_minutes: [-360]
      dropout_fraction: 1.0 # Fraction of samples with dropout
      max_staleness_minutes: null
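
Since the config above is untested, a quick sanity check before running the sampler may help. The sketch below is not part of ocf-data-sampler: `steps_per_sample` and `placeholder_paths` are hypothetical helpers, the config is written as a plain dict rather than loaded from YAML, and the step count assumes both ends of the interval are included.

```python
def steps_per_sample(start_minutes, end_minutes, resolution_minutes):
    """Number of time steps in one sample, assuming both interval ends are included."""
    span = end_minutes - start_minutes
    if span % resolution_minutes != 0:
        raise ValueError("interval is not a whole number of time steps")
    return span // resolution_minutes + 1


def placeholder_paths(node, prefix=""):
    """Return dotted keys of every zarr_path that still contains PLACEHOLDER."""
    found = []
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            found.extend(placeholder_paths(value, path))
        elif key == "zarr_path" and "PLACEHOLDER" in str(value):
            found.append(path)
    return found


# The timing fields from the example config, copied into a dict.
config = {
    "input_data": {
        "gsp": {
            "zarr_path": "PLACEHOLDER.zarr",
            "interval_start_minutes": -60,
            "interval_end_minutes": 480,
            "time_resolution_minutes": 30,
        },
        "nwp": {
            "gfs": {
                "zarr_path": "PLACEHOLDER.zarr",
                "interval_start_minutes": -60,
                "interval_end_minutes": 480,
                "time_resolution_minutes": 60,
            },
        },
    },
}

gsp = config["input_data"]["gsp"]
gfs = config["input_data"]["nwp"]["gfs"]
gsp_steps = steps_per_sample(
    gsp["interval_start_minutes"], gsp["interval_end_minutes"], gsp["time_resolution_minutes"]
)
gfs_steps = steps_per_sample(
    gfs["interval_start_minutes"], gfs["interval_end_minutes"], gfs["time_resolution_minutes"]
)
unresolved = placeholder_paths(config)
```

Under these assumptions each sample would carry 19 half-hourly GSP values and 10 hourly GFS steps, and `unresolved` lists the two `zarr_path` entries that still need real locations.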

@siddharth7113
Contributor

Thanks @peterdudfield, this clears up a lot of things! I now have an idea of how things would look, and a lot of the code I wrote for the previous issue could be reused here. I will start working on it.

@peterdudfield
Contributor Author

> Thanks @peterdudfield, this clears up a lot of things! I now have an idea of how things would look, and a lot of the code I wrote for the previous issue could be reused here. I will start working on it.

That's great, please do reach out if anything else is confusing. Happy to help clarify things.

@leoheim

leoheim commented Mar 4, 2025

Hi @peterdudfield ,

I’m really interested in this issue and would love to contribute. Is there any part of the task still open that I could help with?

@Jiya873

Jiya873 commented Mar 4, 2025

Hi @peterdudfield,

I'm interested in contributing to this project and would love to get involved. Could you guide me on where to start? I'm particularly keen on understanding how the data pipeline is set up and how I can help in making the data samples more manageable for ML training.

Looking forward to your guidance.

@peterdudfield
Contributor Author

Might have to ask @siddharth7113 for an update, and whether it's working or not.

@siddharth7113
Contributor

> Hi @peterdudfield,
>
> I'm interested in contributing to this project and would love to get involved. Could you guide me on where to start? I'm particularly keen on understanding how the data pipeline is set up and how I can help in making the data samples more manageable for ML training.
>
> Looking forward to your guidance.

> Hi @peterdudfield,
>
> I’m really interested in this issue and would love to contribute. Is there any part of the task still open that I could help with?

Hi @Jiya873 & @leoheim,

Thank you for your interest in the issue. Right now a PR (openclimatefix/ocf-data-sampler#199) is already open for the GFS functionality; once that is implemented, I think it will be easier to deal with this issue.
Other people, including @alirashidAR and @jcamier, are working on the Met Office data and PVLive implementation. This is something of a meta issue where people are already working, but if anything new comes up, I will ping you here.

@leoheim

leoheim commented Mar 4, 2025

Hi @siddharth7113,

Thank you for the update! Feel free to ping me if anything comes up!

Projects
Status: In Progress
Development

No branches or pull requests

4 participants