
ML pipeline to make samples #62

@peterdudfield

Detailed Description

Following on from #1, I wanted to write this issue.

We currently have lots of NWP data and PVLive data. It's too much to load into memory, so we have to cut it down ready for ML experiments. The way we've done this in the past is to create samples of data. These are smaller chunks of the data that contain specific data for a certain time (and space). The samples then get batched up in a dataloader, and the ML model can train from them.
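For context, here is a minimal sketch of that batching step, assuming the samples have already been saved to disk one file per sample. The `SampleDataset` class, the `.pt` file format, and the directory layout are all assumptions for illustration, not part of this issue:

```python
# Minimal sketch of batching pre-made samples with PyTorch.
# Assumes samples were saved as individual .pt files, one per sample;
# the file format and directory layout here are illustrative only.
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


class SampleDataset(Dataset):
    """Loads pre-made samples from disk, one file per sample."""

    def __init__(self, sample_dir: str):
        self.paths = sorted(Path(sample_dir).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        return torch.load(self.paths[idx])


dataloader = DataLoader(SampleDataset("samples/"), batch_size=32, shuffle=True)
for batch in dataloader:
    ...  # train the ML model on each batch
```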

So we want to build a pipeline for making these samples (most of the work is already done in ocf-data-sampler).

Context

  • NWP = numerical weather prediction data
  • PVLive = national solar generation data
  • We have GFS on S3, and we have been collecting Metoffice Global data
  • @jcamier @siddharth7113 and others have been working on this already
  • ocf-data-sampler is a Python library used to create samples from large datasets.

Possible Implementation

There are lots of ways to do this, but there's a suggestion to use ocf-data-sampler.

You start with a data configuration (see the illustrative sketch below) that tells ocf-data-sampler what to load, plus other specifics.
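As a stand-in for the attached configuration, here is a rough sketch of what a GFS + PVLive config might look like. The key names and values are assumptions and should be checked against ocf-data-sampler's `Configuration` model and the example configs in the repo:

```python
# Illustrative only: the real schema is defined by ocf-data-sampler's
# Configuration model, so these key names are assumptions to verify
# against the library's docs and example configs before use.
import yaml

config = {
    "general": {
        "name": "gfs_pvlive_samples",
        "description": "GFS NWP + PVLive national solar generation",
    },
    "input_data": {
        "nwp": {
            "gfs": {
                "zarr_path": "s3://<bucket>/gfs.zarr",  # placeholder path
                "provider": "gfs",
                "channels": ["t2m", "dswrf"],  # assumed channel names
            },
        },
        "gsp": {
            "zarr_path": "s3://<bucket>/pvlive.zarr",  # placeholder path
        },
    },
}

with open("gfs_pvlive_config.yaml", "w") as f:
    yaml.safe_dump(config, f)
```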

It would be really great to:

  1. Create a configuration for this project. Perhaps start with GFS and PVLive, and then add Metoffice later.
  2. Using this configuration, run ocf-data-sampler and make some samples.
  3. Make a script for 2. so that others can use it (see the sketch after this section).
  4. Save the samples, maybe in S3, so others can use them.

The ocf-data-sampler class we recommend using is PVNetUKRegionalDataset, but there are a few things that might need adding.
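A rough sketch of what the script in step 3 could look like, assuming PVNetUKRegionalDataset takes a config path and is indexable. The import path, constructor signature, and sample format are assumptions and should be checked against the ocf-data-sampler source:

```python
# Rough sketch of a sample-making script. The import path and
# constructor signature for PVNetUKRegionalDataset are assumptions;
# check them against the ocf-data-sampler source before use.
from pathlib import Path

import torch
from ocf_data_sampler.torch_datasets.datasets.pvnet_uk import PVNetUKRegionalDataset

OUTPUT_DIR = Path("samples")
OUTPUT_DIR.mkdir(exist_ok=True)

dataset = PVNetUKRegionalDataset("gfs_pvlive_config.yaml")

for i in range(len(dataset)):
    sample = dataset[i]  # one chunk of NWP + PVLive data for a time/space
    torch.save(sample, OUTPUT_DIR / f"{i:08d}.pt")
```

Step 4 could then be a separate copy of the output directory to S3 (e.g. with the AWS CLI or fsspec), so the samples only need to be made once.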
