
Kvalobs importer #39

Merged · 67 commits · Dec 13, 2024
Conversation

@Lun4m (Collaborator) commented Nov 28, 2024

  • Dump Kvalobs tables
  • Import Kvalobs dumps
    • Figure out what to do with sensor and level.
      If they are missing in the Obsinn message, it seems that Kvalobs inserts default values ('0' and 0 respectively). In contrast, in Lard we insert NULLs, so we might have a mismatch for the same timeseries.
    • Figure out where to get FromTime when we need to add a new timeseries in Lard
    • Confirm that non-scalar parameters in Kvalobs and Stinfosys are the same
  • Still need to clean up some commits that got carried over from the KDVH importer branch

This was linked to issues Nov 28, 2024
@Lun4m (Collaborator, Author) commented Dec 2, 2024

Ok, everything seems to work, but I'm still unsure about what I've left unchecked in the original comment.
Here's what I've done:

  • sensor and level: if a kvalobs/kdvh timeseries has sensor and level equal to (0,0), we first check if the label exists in LARD as is, otherwise we check if the values are both NULL. Kvalobs uses default values if these are not provided (except for text data).
    However, this could be problematic if we first migrate the data and, afterward, ingest a message from Obsinn with an explicit (0,0) sensor/level.
  • fromtime/totime: kvalobs has a station_metadata table where you can find these values for a subset of timeseries. For the others, we create a "bare" (id only) timeseries in LARD.
  • scalar params: in the implementation, I assumed that what is saved in text_data is marked as non-scalar in stinfosys, which might not be the case. So I probably should write some code that checks this?
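The sensor/level fallback described above can be sketched as a pure function that produces the label variants to try, in order. This is a minimal sketch, not the actual LARD schema: the `Label` struct, its fields, and the station/param IDs used below are illustrative assumptions; pointers stand in for SQL NULLs.

```go
package main

import "fmt"

// Label is a hypothetical timeseries label as stored in LARD.
// Sensor and Level are pointers so that nil can represent SQL NULL.
type Label struct {
	StationID int32
	ParamID   int32
	Sensor    *int32
	Level     *int32
}

// lookupCandidates returns the label variants to check against LARD,
// in order. If Kvalobs reports the default (0, 0) sensor/level pair,
// we also try the NULL/NULL variant, since LARD stores NULLs when
// Obsinn omitted these fields.
func lookupCandidates(l Label) []Label {
	candidates := []Label{l}
	if l.Sensor != nil && l.Level != nil && *l.Sensor == 0 && *l.Level == 0 {
		// Fallback: same station/param, but sensor and level unset (NULL).
		candidates = append(candidates, Label{StationID: l.StationID, ParamID: l.ParamID})
	}
	return candidates
}

func main() {
	zero := int32(0)
	cands := lookupCandidates(Label{StationID: 18700, ParamID: 211, Sensor: &zero, Level: &zero})
	fmt.Println(len(cands)) // prints 2: the (0,0) label plus the NULL/NULL fallback
}
```

As noted above, this ordering is still ambiguous if an explicit (0,0) Obsinn message arrives after migration; the sketch only encodes the lookup order, not a resolution of that conflict.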

Also, there were some comments in the original script mentioning that observations from the same timeseries could be found in both data and text_data tables, but I haven't yet investigated this (maybe it was about cloud types?).

And I still don't understand the purpose of histkvalobs, can someone enlighten me?

@Lun4m Lun4m marked this pull request as ready for review December 2, 2024 13:22
@Lun4m Lun4m requested review from ketilt and intarga December 2, 2024 13:23
@Lun4m Lun4m force-pushed the kvalobs_importer branch 2 times, most recently from 7af2fa1 to 1393ec3, December 2, 2024 13:28
@intarga (Member) commented Dec 2, 2024

I believe histkvalobs is another instance of kvalobs specifically for running checks on old data, like a month or a year after the data is ingested.

@intarga (Member) commented Dec 2, 2024

> sensor and level

That's really annoying... I guess we'll have to reconcile these timeseries later with a script. If you make an issue for that, I'm happy to leave it there; I don't see that we can do any better right now.

> fromtime/totime

I think that's fine. I guess it will be up to the content managers to correct any of this if the information wasn't available.

> scalar params

I think you answered your own question there 😛. It would be a good sanity check.

@ketilt (Collaborator) commented Dec 2, 2024

histkvalobs is indeed a separate deployment of kvalobs, sometimes used to run QC on a set of historical data. The actual database histkvalobsdb, however, contains about 99% of the data we want to import from "the kvalobses". That's because kvalobsdb itself only stores a couple of months of recent data before INSERTing them into histkvalobsdb and DELETEing them from itself.

fromtime/totime: we've probably discussed it before, but this is crucial information for most end users. It's painstakingly slow to get de facto times out of kvalobs. KDVH runs a routine job to cache recent estimates, and so does ODA. Maybe someone made that kvalobs table to work around the issue when doing analysis. Stinfosys metadata (a set of obsinn tables) also covers only a subset of timeseries and describes "what we think/want/expect", even when that is de facto incorrect due to errors of any kind. Frost will require fromtime for every timeseries, so we need to cover that need somewhere.

@Lun4m (Collaborator, Author) commented Dec 2, 2024

> I believe histkvalobs is another instance of kvalobs specifically for running checks on old data, like a month or a year after the data is ingested.

So it can potentially flag/correct data a month or a year later? And where are the results of these checks saved? I think I left a comment somewhere that I could only see data starting from June 2023 in histkvalobs (maybe all the data before that timestamp was "good"?).

@Lun4m (Collaborator, Author) commented Dec 2, 2024

> histkvalobs is indeed a separate deployment of kvalobs, sometimes used to run QC on a set of historical data. The actual database histkvalobsdb, however, contains about 99% of the data we want to import from "the kvalobses". That's because kvalobsdb itself only stores a couple of months of recent data before INSERTing them into histkvalobsdb and DELETEing them from itself.

That's what I remembered from a previous conversation (don't know with whom), but when I log into the main kvalobs database I can see data from 2006, so maybe I'm looking in the wrong place (or maybe I have hist and normal mixed up)? 🤔 🤔

@intarga (Member) left a comment

Did we have a conversation about the dump format?

CSV is an odd choice in some ways because it's a text format, so we have to take a performance hit on converting floats to strings and back, and risk loss in the conversion.

Using something like gob would spare us this, and we'd be able to get rid of the code for converting CSV rows to go objects. Only downside as far as I can see would be that gob isn't a widely supported format, but that would only affect us if we want to rewrite the importer in a different language while keeping the dumps, which seems unlikely

Review thread on migrations/kdvh/dump/dump.go (outdated, resolved)
@Lun4m (Collaborator, Author) commented Dec 2, 2024

> Did we have a conversation about the dump format?
>
> CSV is an odd choice in some ways because it's a text format, so we have to take a performance hit on converting floats to strings and back, and risk loss in the conversion.
>
> Using something like gob would spare us this, and we'd be able to get rid of the code for converting CSV rows to go objects. Only downside as far as I can see would be that gob isn't a widely supported format, but that would only affect us if we want to rewrite the importer in a different language while keeping the dumps, which seems unlikely

Not really; since we were using CSV for KDVH, I thought it made sense to use CSV for Kvalobs too.
I'm not sure if the CSV parsing is a performance problem, but I can look into gob or other packages.
By the way, there's already some potential data loss (?), since obsvalues are stored as float64 in Kvalobs and as decimals with different precisions in KDVH, while in LARD we only have float32.
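That float64-to-float32 narrowing happens regardless of the dump format, as a quick sketch shows (the 0.1 value is just an illustrative reading, e.g. 0.1 mm of precipitation):

```go
package main

import "fmt"

func main() {
	// Kvalobs stores obsvalues as float64, LARD columns are float32,
	// so anything beyond ~7 significant digits is lost on insert.
	v := 0.1
	fmt.Println(float64(float32(v)) == v) // prints false: the roundtrip changes the value
	fmt.Printf("%g\n", float64(float32(v))) // shows the float32-rounded value
}
```

This is orthogonal to the CSV-roundtrip concern: the loss occurs at insert time, not during serialization.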

@intarga (Member) commented Dec 2, 2024

If you know the bottleneck is elsewhere, then since you've already written the code to use CSV, I don't think we should change it. But I figured I should bring it up for posterity.

About f32, the reasoning behind that is that we don't expect any of our instruments to have precision beyond the ~6 significant figures f32 offers, and it's a huge space saving. We can change this if you feel strongly. With CSV, my concern was more that I've seen weird drift with CSV roundtripping, where the mantissa gets much bigger, but now that I think about it, that shouldn't be an issue here given you're using the same ser/de implementation on both ends.

@intarga (Member) commented Dec 2, 2024

Should we merge flags.kvdata (why is this called kvdata btw? weird mix of kvkafka and kldata?) and flags.old_databases? In theory they're the same thing, the Kafka messages we're reading come from kvalobs too.

@Lun4m (Collaborator, Author) commented Dec 3, 2024

> Should we merge flags.kvdata (why is this called kvdata btw? weird mix of kvkafka and kldata?) and flags.old_databases? In theory they're the same thing, the Kafka messages we're reading come from kvalobs too.

That table was created specifically for the kvkafka "checked" queue. My original idea was to insert directly there, but then you said we should have the observations in public.data. Should I keep inserting the corrected value into public.data while also inserting the full observation+flags into kvdata? They are basically the same table indeed 😅

@intarga (Member) commented Dec 3, 2024

I think we should insert into both, yes.

The purpose of ingesting kvkafka-checked and migrating the flags here is that Vegar wants us to have the kvalobs flags so he can turn off KDVH before Confident is prod-ready. That table will be dropped once kvalobs is deprecated. I have said at length that I think this is a terrible idea and a waste of resources, but I don't make the decision.

> the corrected value into public.data

This is wrong, we should be inserting the original, not corrected. (I know KDVH doesn't have the original, so we have to take corrected for that, but that's not our fault)

@Lun4m (Collaborator, Author) commented Dec 3, 2024

> scalar params
>
> I think you answered your own question there 😛. It would be a good sanity-check.

Just done implementing this check and, of course, there's only partial overlap 😞
But it's not too bad! Except for param IDs 2751, 2752, 2753, and 2754 (some sort of cloud type that for some reason is treated as scalar, while the other cloud types are strings), I can simply use the Stinfosys definition to decide how to parse the observations.

EDIT: I was wrong, I was only checking against the text labels; in reality, Kvalobs does not store a lot of the params that Stinfosys marks as non-scalar. And except for the ones I mentioned above, the others match.
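The decision rule described above could be sketched like this. It is only a sketch of the special-casing, assuming the check is "trust the Stinfosys scalar flag, except for the four cloud-type params"; the function name and the non-exception param IDs used in the example are hypothetical.

```go
package main

import "fmt"

// Cloud-type params that Kvalobs treats as scalars even though the
// other cloud types are strings; IDs taken from the discussion above.
var scalarCloudTypes = map[int32]bool{2751: true, 2752: true, 2753: true, 2754: true}

// parseAsScalar decides how to parse an observation: follow the
// Stinfosys `scalar` flag, except for the four special cloud-type params.
func parseAsScalar(paramID int32, stinfosysScalar bool) bool {
	if scalarCloudTypes[paramID] {
		return true
	}
	return stinfosysScalar
}

func main() {
	fmt.Println(parseAsScalar(2751, false)) // prints true: cloud-type exception
	fmt.Println(parseAsScalar(305, false))  // prints false: follows the Stinfosys flag
}
```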

@Lun4m Lun4m merged commit 93038b6 into trunk Dec 13, 2024
1 check passed
@Lun4m Lun4m deleted the kvalobs_importer branch December 13, 2024 11:20
Successfully merging this pull request may close these issues:

  • Cfailed is a tag/string
  • Data migration script from kvalobs and KDVH

3 participants