Write L2 to file and leave all additional variable derivation for the L2toL3 step #255
Conversation
This reverts commit af09818.
```diff
@@ -87,20 +87,20 @@ def join_l3():
         exit()

     # Get hourly, daily and monthly datasets
     print('Resampling L3 data to hourly, daily and monthly resolutions...')
```
I don't see why we should have a specific join L2 script that resamples and generates three different intermediate L2(A) products.

`getL3` is also supposed to be a merge function with resampling:

```python
NSE_L3 = resample(
    hard_merge([
        NSE_hist_L2,
        NSE_tx_L2,
        NSE_raw_L2,
    ])
)
```
I agree. It was first separated to respect the distinction, in the previous version, between `getl3` and `joinl3`.

It also has the advantage of making a clear distinction between the data product levels:

- `get_l2` produces `L2_tx` and `L2_raw`
- `join_l2` produces `L2_merged`
- `get_l3` produces `L3`
- `join_l3` produces `L3_merged`

I find it confusing that a script called `getL3` produces the `L2_merged` product as an intermediate.

Also, I have two comments about the example you give above:

- That's another clear case where we need to do it step-wise: `NSE_tx_L2` and `NSE_raw_L2` can be combined with `combine_first` because they are from the same physical station. The result is `NSE_L2_merged`. Only after that can we merge it with another, distinct physical station (historical or v3) using time blocks. But...
- The historical stations come out as level 3 products (they already have derived variables such as turbulent fluxes and estimated coordinates). So we need to first turn `NSE_L2_merged` into `NSE_L3`, and then it can be merged with `NSE_hist_L3`. (See the sketch below.)
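A minimal sketch of that step-wise order, assuming the inputs are xarray `Dataset` objects; the file paths are illustrative, and `to_l3` and `hard_merge` stand in for `process.L2toL3.toL3` and the time-block merge discussed above:

```python
import xarray as xr

# Hypothetical inputs for one physical station plus its historical record.
nse_tx_l2 = xr.open_dataset("level_2/tx/NSE_hour.nc")
nse_raw_l2 = xr.open_dataset("level_2/raw/NSE_hour.nc")
nse_hist_l3 = xr.open_dataset("level_3/hist/NSE_hour.nc")

# Step 1: same physical station, so combine_first fills gaps in the
# transmitted (tx) record with values from the raw record.
nse_l2_merged = nse_tx_l2.combine_first(nse_raw_l2)

# Step 2: derive the L3 variables (e.g. turbulent fluxes) before mixing
# with historical stations, which already arrive as L3 products.
nse_l3 = to_l3(nse_l2_merged)

# Step 3: distinct physical stations are merged by time blocks
# (hard_merge), not by variable-wise filling.
nse_l3_merged = hard_merge([nse_hist_l3, nse_l3])
```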
To your comment:

> I don't see why we should have a specific join L2 script that resamples and generates three different intermediate L2(A) products.

The now updated reply is:

- `get_l2` produces `L2_tx` and `L2_raw`
- `join_l2` produces `L2_merged`
- `get_l2tol3` produces `L3` from `L2_merged`
- `join_l3` produces `L3_merged`

We cannot merge those functionalities any further because they take different inputs (see the signature sketch below):

- `get_l2` takes a config file as input (different for tx and raw)
- `join_l2` takes two files with the same `station_id` as input
- `get_l2tol3` takes a single file (or alternatively a single `station_id`) as input
- `join_l3` takes either multiple files or a `site_id` as input

This structure respects the different levels and makes the different functions rather independent of each other. `get_l3` is not used at the moment and is also impractical because it takes a single config file as input (either from `tx` or `raw`) and doesn't include the merging of L2 datasets before running L2toL3.

Does that answer your comment?
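A hypothetical sketch of the four entry points, only to illustrate the differing inputs; the parameter names and types are assumptions, not the final CLI:

```python
from pathlib import Path
from typing import Optional

def get_l2(config_file: Path, inpath: Path, outpath: Path) -> None:
    """Process one source (tx or raw) to L2; each source has its own config file."""

def join_l2(file_a: Path, file_b: Path, outpath: Path) -> None:
    """Merge two L2 files that share the same station_id."""

def get_l2tol3(l2_file: Path, outpath: Path) -> None:
    """Derive L3 variables from a single (merged) L2 file."""

def join_l3(files: Optional[list[Path]], site_id: Optional[str], outpath: Path) -> None:
    """Merge multiple L3 files, or everything belonging to one site_id."""
```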
I just pushed changes to the processing performed in the …

Also, I updated: …

I will tackle the CLI scripts now, specifically looking at …
New commit that tackles … I've made changes now that mean you can specify the level of processing (and the outputted file) with the `-l` input variable:

```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2
```

Where …

```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l3 -l 3
```

Where … I have also renamed Baptiste's …
Now … (see `pypromice/src/pypromice/process/l2_to_l3.py`, lines 55 to 69 at cc6bfb2).

Effectively, the Level 2 .nc file is loaded directly as the Level 2 property of the `aws` object (i.e. …). The only thing I am not sure about is the efficiency of re-loading the NetCDF file; a sketch of this reload step follows below. I'm curious to see what @ladsmund thinks about this. Please feel free to revert the change if you have something else in mind.
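A minimal sketch of that reload step, assuming the `AWS` class from `pypromice.process.aws`; the constructor arguments, attribute name, method name, and file path are all assumptions:

```python
import xarray as xr
from pypromice.process.aws import AWS  # assumed import path

aws = AWS("aws-l0/tx/config/NUK_U.toml", "aws-l0/tx")  # hypothetical constructor args

# Re-load the previously written L2 file and attach it as the object's
# Level 2 property, so the L2-to-L3 step starts from the merged file.
aws.L2 = xr.open_dataset("aws-l2/NUK_U/NUK_U_hour.nc")  # hypothetical path

aws.getL3()  # assumed method that runs process.L2toL3.toL3 on aws.L2
```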
Hi @PennyHow, thanks for the suggestions. Although your …
I try to summarize these constraints and make a draft of a … I hope that also clarifies the level definition for @ladsmund. As a side note, since we'll need to update …
I also wondered about the purpose of the get_l3 script. As I see it, it is a script that processes all data from an AWS through the pipeline. So far, it has only been from a single data source such as tx or raw. It is not clear to me whether we are on the same page with respect to the processing pipeline and how to invoke which steps. As I see it, the current pipeline is something like:

Version 1: …

After this pull request, it could be something like:

Version 2: …

Or, if we still need the L3 tx files for backwards compatibility:

Version 3: …

And maybe also with the historical data: …
I am not sure about how we should interpret L3 vs Level3, both now and in the future. I think Baptiste has a good point about using site vs station names as references for datasets. I could imagine an approach where we have multiple types of l3 products (sketched below):

- Data source level …
- Station level …
- Site level …
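A hypothetical layout for those three product types, pieced together from this thread; the directory names are assumptions. Data-source products come straight from one input stream, station products merge tx and raw for one physical station, and site products additionally merge distinct physical stations such as historical ones:

```python
# Illustrative only; the real layout is still under discussion.
l3_products = {
    "data_source": ["level_3/tx/NUK_U_hour.nc", "level_3/raw/NUK_U_hour.nc"],
    "station": ["level_3/stations/NUK_U_hour.nc"],  # tx + raw via combine_first
    "site": ["level_3/sites/NUK_U_hour.nc"],  # stations + historical via time blocks
}
```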
I added flexibility into the get_l3 function to also allow for L0-to-L2 processing. This can be defined with the `-l` flag:

```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2
```

So yes, it is still called "get_l3" currently, but it can also be used for the functionality you described in get_l2. I'm happy to rename it; I just want to keep the option of L0-to-L3 processing for development or other uses that we may need in the future. The reason I am hesitant about your configuration is that a lot of the post-processing functionality (i.e. rounding, re-formatting, resampling) is re-written from the pypromice.process.aws module into the CLI script. By having two instances of this functionality, we have to update and maintain it in both places, which is a lot more work for us.
So currently this is what I think we are aiming for. With the current pypromice modules/CLI scripts, it should look like this: …

I think we are close!
There are a lot of new commits here now, but most of them are associated with debugging a new Action for testing the …

Main changes: …

To-do: …

And also to see what you guys think about these changes. Please feel free to modify.
…les out in L2toL3, trying not to re-write sorted tx file if already sorted
…rging function, attribute management and use site_id and list_station_id
I have now made an update of aws-operational-processing that uses the functions from this PR. The code has been running on glacio01 and posting the level_2 and level_3 data on GitHub (if cloning …). The comparison between the … I'll be rebasing downstream PRs onto this one, and we can take the discussion to the next one, which is #252.
Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?
I need to adapt the scripts first now that I have a clearer idea about the structure. |
```diff
@@ -0,0 +1,193 @@
+#!/usr/bin/env python3
```
I recommend avoiding the use of imperative forms in module names, as it can be confusing. For example:

```python
import write
write.write(args)
```

Instead, I suggest using a gerund form like `writing` or a descriptive noun phrase such as `dataset_writer` or `dataset_export`. This makes the purpose of the module clearer and more intuitive.
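For contrast, a sketch of a call site in the suggested noun-phrase style; the module and function names here are illustrative, not existing pypromice API:

```python
from pypromice.process import dataset_writer  # hypothetical module name

# The call now states both the action and the subject,
# instead of the repetitive write.write(args).
dataset_writer.write_netcdf(ds, outpath)  # hypothetical function
```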
Sorry, that was me. Feel free to change.

The idea of this new version is that:

1. L2 data files are written into `level_2/raw` and `level_2/tx` folders by get_l2 (just like it was done for the L3 data previously). One consequence is that this low-latency level 2 `tx` data can be posted very fast on THREDDS for showcase and fieldwork, and processed into BUFR files.
2. L2 `tx` and `raw` files are merged using join_l2 (just like it was done for the L3 data previously). Resampling to hourly, daily and monthly values is done here, but could be left for a later stage.
3. get_l3 is now a script that loads the merged L2 file and runs `process.L2toL3.toL3`. This will allow more variables to be derived into `L3` and historical data to be appended once the L3 data is processed.

A sketch tying these three steps together follows below.
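A minimal end-to-end sketch of that flow; the function name `toL3` follows this thread's reference to `process.L2toL3.toL3`, but the import path, file paths, and the use of `combine_first` for the join step are assumptions:

```python
import xarray as xr
from pypromice.process.L2toL3 import toL3  # assumed import path

# Step 1: get_l2 has written one L2 file per source (paths illustrative).
l2_tx = xr.open_dataset("level_2/tx/NUK_U_hour.nc")
l2_raw = xr.open_dataset("level_2/raw/NUK_U_hour.nc")

# Step 2: join_l2 merges the same-station sources (resampling to hourly,
# daily and monthly would also happen at this stage).
l2_merged = l2_tx.combine_first(l2_raw)

# Step 3: get_l3 derives the additional L3 variables from the merged L2 file.
l3 = toL3(l2_merged)
```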