Kvalobs importer #39
Conversation
Ok, everything seems to work, but I'm still unsure about what I've left unchecked in the original comment.
Also, there were some comments in the original script mentioning that observations from the same timeseries could be found in both kvalobs and histkvalobs. And I still don't understand the purpose of histkvalobs, can someone enlighten me?
Force-pushed 7af2fa1 to 1393ec3
I believe histkvalobs is another instance of kvalobs specifically for running checks on old data, like a month or a year after the data is ingested.
That's really annoying... I guess we'll have to reconcile these timeseries later with a script. If you make an issue for that, I'm happy to leave it there; I don't see that we can do any better right now.
I think that's fine. I guess it will be up to the content managers to correct any of this if the information wasn't available.
I think you answered your own question there 😛. It would be a good sanity check.
histkvalobs is indeed a separate deployment of kvalobs, sometimes used to run QC on a set of historical data. The actual database, histkvalobsdb, however, contains about 99% of the data we want to import from "the kvalobses". That's because kvalobsdb itself only stores a couple of months of recent data before INSERTing them into histkvalobsdb and DELETEing them from itself.

fromtime/totime: we've probably discussed this before, but it's crucial information for most end users. It's painstakingly slow to get de facto times out of kvalobs; KDVH runs a routine job to cache recent estimates, and so does ODA. Maybe someone made that kvalobs table to work around the issue when doing analysis. Stinfosys metadata (a set of obsinn tables) also covers only a subset of timeseries and describes "what we think/want/expect", even when that is de facto incorrect due to errors of any kind. Frost will require fromtime for every timeseries, so we need to cover that need somewhere.
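For context, a minimal sketch of what computing de facto fromtime/totime straight from the data table could look like; the table name, column names, connection string, and IDs below are assumptions for illustration, not the actual kvalobs schema:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical connection string; not a real kvalobs instance.
	db, err := sql.Open("postgres", "host=localhost dbname=kvalobs sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Scanning the data table per timeseries is the expensive part, which is
	// why KDVH and ODA cache these estimates with routine jobs.
	var from, to sql.NullTime
	err = db.QueryRow(
		`SELECT MIN(obstime), MAX(obstime) FROM data WHERE stationid = $1 AND paramid = $2`,
		18700, 211, // hypothetical station and parameter IDs
	).Scan(&from, &to)
	if err != nil {
		panic(err)
	}
	fmt.Println("de facto fromtime:", from.Time, "totime:", to.Time)
}
```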
So it can potentially flag/correct data a month or a year later? And where are the results of these checks saved? I think I left a comment somewhere that I could only see data starting from June 2023 in histkvalobs (maybe all the data before that timestamp was "good"?).
That's what I remembered from a previous conversation (don't know with whom), but when I log into the main kvalobs database I can see data from 2006, so maybe I'm looking in the wrong place (or maybe I have hist and normal mixed up)? 🤔 🤔
Force-pushed f8daf6f to 0324b19
Did we have a conversation about the dump format?
CSV is an odd choice in some ways because it's a text format, so we take a performance hit converting floats to strings and back, and risk precision loss in the conversion.
Using something like gob would spare us this, and we'd be able to get rid of the code for converting CSV rows to Go objects. The only downside, as far as I can see, is that gob isn't a widely supported format, but that would only affect us if we wanted to rewrite the importer in a different language while keeping the dumps, which seems unlikely.
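To make the comparison concrete, here's a minimal sketch of what a gob-based dump could look like, using a hypothetical Obs struct; the field names are illustrative, not the importer's actual types:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"time"
)

// Obs is a made-up row type standing in for whatever the importer dumps.
type Obs struct {
	Obstime   time.Time
	Original  float64
	Corrected float64
	Flags     string
}

func main() {
	obs := []Obs{{Obstime: time.Now().UTC(), Original: 12.3, Corrected: 12.3, Flags: "70000"}}

	// Encode: floats are written in binary form, so there is no
	// float -> string -> float round trip as with CSV.
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(obs); err != nil {
		panic(err)
	}

	// Decode straight back into Go values; no per-row string parsing needed.
	var decoded []Obs
	if err := gob.NewDecoder(&buf).Decode(&decoded); err != nil {
		panic(err)
	}
	fmt.Println(decoded[0].Original == obs[0].Original) // true
}
```

The CSV path instead needs a FormatFloat/ParseFloat (or equivalent) per field, which is where the conversion cost and the rounding concern come from.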
Not really; since we were using CSV for KDVH, I thought it made sense to use CSV for Kvalobs too.
If you know the bottleneck is elsewhere, then since you've already written the code to use CSV, I don't think we should change it. But I figured I should bring it up for posterity.

About f32: the reasoning behind that is that we don't expect any of our instruments to have precision beyond the ~6 significant figures f32 offers, and it's a huge space saving. We can change this if you feel strongly.

With CSV my concern was more that I've seen a weird drift with CSV round-tripping, where the mantissa gets much bigger, but now that I think about it, that shouldn't be an issue here given you're using the same ser/de implementation on both ends.
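For the record, a small sketch of the round-trip behaviour in question; the values are made up and this only demonstrates strconv, not the importer's actual serialization code:

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	original := float32(271.35)

	// Shortest decimal string that parses back to the same float32
	// ('g' format, precision -1, bitSize 32).
	s := strconv.FormatFloat(float64(original), 'g', -1, 32)

	parsed, err := strconv.ParseFloat(s, 32)
	if err != nil {
		panic(err)
	}
	roundTripped := float32(parsed)

	fmt.Println(s, roundTripped == original) // "271.35 true"
	// Drift only appears if the writer uses a fixed precision (e.g. %.2f) or
	// the reader parses at a different bit size than the writer formatted for.
}
```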
Should we merge
That table was created specifically for the kvkafka "checked" queue. My original idea was to insert directly there, but then you said we should have the observations in
I think we should insert into both, yes. The purpose of ingesting kvkafka-checked and migrating the flags here is that Vegar wants us to have the kvalobs flags so he can turn off KDVH before Confident is prod-ready. That table will be dropped once kvalobs is deprecated. I have said at length that I think this is a terrible idea and a waste of resources, but I don't make the decision.
This is wrong; we should be inserting the original, not the corrected value. (I know KDVH doesn't have the original, so we have to take corrected for that, but that's not our fault)
Just finished implementing this check and, of course, there's only partial overlap 😞 EDIT: I was wrong, I was only checking against the text labels, but in reality Kvalobs does not store a lot of the params that stinfosys marks as non-scalar. Apart from the ones I mentioned above, the others match.
Force-pushed 829866a to 71ce7a3
Force-pushed ba8111b to 19a35b2
If they are missing in the Obsinn message, it seems that Kvalobs inserts default values ('0' and 0, respectively). In contrast, in Lard we insert NULLs, so we might have a mismatch for the same timeseries.
FromTime when we need to add a new timeseries in Lard
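If we end up reconciling these, one possible approach (just a hedged sketch, assuming the fields in question really do use '0' and 0 to mean "not provided") would be to normalize Kvalobs' defaults to NULLs before comparing against or inserting into Lard:

```go
package main

import (
	"database/sql"
	"fmt"
)

// normalizeText maps Kvalobs' assumed default "0" to a SQL NULL.
func normalizeText(v string) sql.NullString {
	if v == "0" {
		return sql.NullString{}
	}
	return sql.NullString{String: v, Valid: true}
}

// normalizeNumeric maps Kvalobs' assumed default 0 to a SQL NULL.
// This is only safe for fields where 0 cannot be a legitimate value.
func normalizeNumeric(v float64) sql.NullFloat64 {
	if v == 0 {
		return sql.NullFloat64{}
	}
	return sql.NullFloat64{Float64: v, Valid: true}
}

func main() {
	fmt.Println(normalizeText("0").Valid, normalizeNumeric(0).Valid)    // false false
	fmt.Println(normalizeText("FX").Valid, normalizeNumeric(2.5).Valid) // true true
}
```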