
Kvalobs importer #39

Merged · 67 commits · Dec 13, 2024
Conversation

@Lun4m (Collaborator) commented Nov 28, 2024

  • Dump Kvalobs tables
  • Import Kvalobs dumps
    • Figure out what to do with sensor and level.
      If they are missing in the Obsinn message, it seems that Kvalobs inserts default values ('0' and 0 respectively). In contrast, in Lard we insert NULLs, so we might have a mismatch for the same timeseries.
    • Figure out where to get FromTime when we need to add a new timeseries in Lard
    • Confirm that non-scalar parameters in Kvalobs and Stinfosys are the same
  • Still need to clean up some commits that got carried over from the KDVH importer branch

This was linked to issues Nov 28, 2024
@Lun4m (Collaborator, Author) commented Dec 2, 2024

Ok, everything seems to work, but I'm still unsure about what I've left unchecked in the original comment.
Here's what I've done:

  • sensor and level: if a kvalobs/kdvh timeseries has sensor and level equal to (0,0), we first check if the label exists in LARD as is, otherwise we check if the values are both NULL. Kvalobs uses default values if these are not provided (except for text data).
    However, this could be problematic if we first migrate the data and, afterward, ingest a message from Obsinn with an explicit (0,0) sensor/level.
  • fromtime/totime: kvalobs has a station_metadata table where you can find these values for a subset of timeseries. For the others, we create a "bare" (id only) timeseries in LARD.
  • scalar params: in the implementation, I assumed that what is saved in text_data is marked as non-scalar in stinfosys, which might not be the case. So I probably should write some code that checks this?
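The sensor/level fallback described above can be sketched as a pure function that produces the label variants to try, in order. This is a minimal sketch, not the actual LARD schema: the `Label` struct, its fields, and the station/param IDs used below are illustrative assumptions; pointers stand in for SQL NULLs.

```go
package main

import "fmt"

// Label is a hypothetical timeseries label as stored in LARD.
// Sensor and Level are pointers so that nil can represent SQL NULL.
type Label struct {
	StationID int32
	ParamID   int32
	Sensor    *int32
	Level     *int32
}

// lookupCandidates returns the label variants to check against LARD,
// in order. If Kvalobs reports the default (0, 0) sensor/level pair,
// we also try the NULL/NULL variant, since LARD stores NULLs when
// Obsinn omitted these fields.
func lookupCandidates(l Label) []Label {
	candidates := []Label{l}
	if l.Sensor != nil && l.Level != nil && *l.Sensor == 0 && *l.Level == 0 {
		// Fallback: same station/param, but sensor and level unset (NULL).
		candidates = append(candidates, Label{StationID: l.StationID, ParamID: l.ParamID})
	}
	return candidates
}

func main() {
	zero := int32(0)
	cands := lookupCandidates(Label{StationID: 18700, ParamID: 211, Sensor: &zero, Level: &zero})
	fmt.Println(len(cands)) // prints 2: the (0,0) label plus the NULL/NULL fallback
}
```

As noted above, this ordering is still ambiguous if an explicit (0,0) Obsinn message arrives after migration; the sketch only encodes the lookup order, not a resolution of that conflict.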

Also, there were some comments in the original script mentioning that observations from the same timeseries could be found in both data and text_data tables, but I haven't yet investigated this (maybe it was about cloud types?).

And I still don't understand the purpose of histkvalobs, can someone enlighten me?

@Lun4m Lun4m marked this pull request as ready for review December 2, 2024 13:22
@Lun4m Lun4m requested review from ketilt and intarga December 2, 2024 13:23
@Lun4m Lun4m force-pushed the kvalobs_importer branch 2 times, most recently from 7af2fa1 to 1393ec3, December 2, 2024 13:28
@intarga (Member) commented Dec 2, 2024

I believe histkvalobs is another instance of kvalobs specifically for running checks on old data, like a month or a year after the data is ingested.

@intarga (Member) commented Dec 2, 2024

> sensor and level

That's really annoying... I guess we'll have to reconcile these timeseries later with a script. If you make an issue for that, I'm happy to leave it there; I don't see that we can do any better right now.

> fromtime/totime

I think that's fine. I guess it will be up to the content managers to correct any of this if the information wasn't available.

> scalar params

I think you answered your own question there 😛. It would be a good sanity check.

@ketilt (Collaborator) commented Dec 2, 2024

histkvalobs is indeed a separate deployment of kvalobs, sometimes used to run QC on a set of historical data. The actual database histkvalobsdb, however, contains about 99% of the data we want to import from "the kvalobses". That's because kvalobsdb itself only stores a couple of months of recent data before INSERTing them into histkvalobsdb and DELETEing them from itself.

fromtime/totime: we've probably discussed it before, but this is crucial information for most end users. It's painstakingly slow to get de facto times out of kvalobs. KDVH runs a routine job to cache recent estimates, and so does ODA. Maybe someone made that kvalobs table to work around the issue when doing analysis. Stinfosys metadata (a set of obsinn tables) also covers only a subset of timeseries and describes "what we think/want/expect", even when that is de facto incorrect due to errors of any kind. Frost will require fromtime for every timeseries, so we need to cover that need somewhere.

@Lun4m (Collaborator, Author) commented Dec 2, 2024

> I believe histkvalobs is another instance of kvalobs specifically for running checks on old data, like a month or a year after the data is ingested.

So it can potentially flag/correct data a month or a year later? And where are the results of these checks saved? I think I left a comment somewhere that I could only see data starting from June 2023 in histkvalobs (maybe all the data before that timestamp was "good"?).

@Lun4m (Collaborator, Author) commented Dec 2, 2024

> histkvalobs is indeed a separate deployment of kvalobs, sometimes used to run QC on a set of historical data. The actual database histkvalobsdb, however, contains about 99% of the data we want to import from "the kvalobses". That's because kvalobsdb itself only stores a couple of months of recent data before INSERTing them into histkvalobsdb and DELETEing them from itself.

That's what I remembered from a previous conversation (don't know with whom), but when I log into the main kvalobs database I can see data from 2006, so maybe I'm looking in the wrong place (or maybe I have hist and normal mixed up)? 🤔 🤔

@intarga (Member) left a comment

Did we have a conversation about the dump format?

CSV is an odd choice in some ways because it's a text format, so we have to take a performance hit on converting floats to strings and back, and risk loss in the conversion.

Using something like gob would spare us this, and we'd be able to get rid of the code for converting CSV rows to go objects. Only downside as far as I can see would be that gob isn't a widely supported format, but that would only affect us if we want to rewrite the importer in a different language while keeping the dumps, which seems unlikely

Review thread on migrations/kdvh/dump/dump.go (outdated, resolved)
@Lun4m (Collaborator, Author) commented Dec 2, 2024

> Did we have a conversation about the dump format?
>
> CSV is an odd choice in some ways because it's a text format, so we have to take a performance hit on converting floats to strings and back, and risk loss in the conversion.
>
> Using something like gob would spare us this, and we'd be able to get rid of the code for converting CSV rows to go objects. Only downside as far as I can see would be that gob isn't a widely supported format, but that would only affect us if we want to rewrite the importer in a different language while keeping the dumps, which seems unlikely

Not really; since we were using CSV for KDVH, I thought it made sense to use CSV for Kvalobs too.
I'm not sure if the CSV parsing is a performance problem, but I can look into gob or other packages.
By the way, there's already some potential data loss (?), since obsvalues are stored as float64 in Kvalobs and as decimals with different precisions in KDVH, while in LARD we only have float32.
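That float64-to-float32 narrowing happens regardless of the dump format, as a quick sketch shows (the 0.1 value is just an illustrative reading, e.g. 0.1 mm of precipitation):

```go
package main

import "fmt"

func main() {
	// Kvalobs stores obsvalues as float64, LARD columns are float32,
	// so anything beyond ~7 significant digits is lost on insert.
	v := 0.1
	fmt.Println(float64(float32(v)) == v) // prints false: the roundtrip changes the value
	fmt.Printf("%g\n", float64(float32(v))) // shows the float32-rounded value
}
```

This is orthogonal to the CSV-roundtrip concern: the loss occurs at insert time, not during serialization.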

@intarga (Member) commented Dec 2, 2024

If you know the bottleneck is elsewhere, then since you've already written the code to use CSV, I don't think we should change it. But I figured I should bring it up for posterity.

About f32, the reasoning behind that is that we don't expect any of our instruments to have precision beyond the ~6 significant figures f32 offers, and it's a huge space saving. We can change this if you feel strongly. With CSV, my concern was more that I've seen weird drift with CSV roundtripping, where the mantissa gets much bigger, but now that I think about it, that shouldn't be an issue here given you're using the same ser/de implementation on both ends.

@intarga (Member) commented Dec 2, 2024

Should we merge flags.kvdata (why is this called kvdata btw? weird mix of kvkafka and kldata?) and flags.old_databases? In theory they're the same thing, the Kafka messages we're reading come from kvalobs too.

@Lun4m (Collaborator, Author) commented Dec 3, 2024

> Should we merge flags.kvdata (why is this called kvdata btw? weird mix of kvkafka and kldata?) and flags.old_databases? In theory they're the same thing, the Kafka messages we're reading come from kvalobs too.

That table was created specifically for the kvkafka "checked" queue. My original idea was to insert directly there, but then you said we should have the observations in public.data. Should I keep inserting the corrected value into public.data while also inserting the full observation+flags into kvdata? They are basically the same table indeed 😅

@intarga (Member) commented Dec 3, 2024

I think we should insert into both, yes.

The purpose of ingesting kvkafka-checked and migrating the flags here is that Vegar wants us to have the kvalobs flags so he can turn off KDVH before Confident is prod-ready. That table will be dropped once kvalobs is deprecated. I have said at length that I think this is a terrible idea and a waste of resources, but I don't make the decision.

> the corrected value into public.data

This is wrong, we should be inserting the original, not corrected. (I know KDVH doesn't have the original, so we have to take corrected for that, but that's not our fault)

@Lun4m (Collaborator, Author) commented Dec 3, 2024

> scalar params
>
> I think you answered your own question there 😛. It would be a good sanity-check.

Just done implementing this check and, of course, there's only partial overlap 😞
But it's not too bad! Except for param IDs 2751, 2752, 2753, and 2754 (some sort of cloud type that for some reason is treated as scalar, while the other cloud types are strings), I can simply use the Stinfosys definition to decide how to parse the observations.

EDIT: I was wrong, I was only checking against the text labels; in reality, Kvalobs does not store a lot of the params that Stinfosys marks as non-scalar. And except for the ones I mentioned above, the others match.
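The decision rule described above could be sketched like this. It is only a sketch of the special-casing, assuming the check is "trust the Stinfosys scalar flag, except for the four cloud-type params"; the function name and the non-exception param IDs used in the example are hypothetical.

```go
package main

import "fmt"

// Cloud-type params that Kvalobs treats as scalars even though the
// other cloud types are strings; IDs taken from the discussion above.
var scalarCloudTypes = map[int32]bool{2751: true, 2752: true, 2753: true, 2754: true}

// parseAsScalar decides how to parse an observation: follow the
// Stinfosys `scalar` flag, except for the four special cloud-type params.
func parseAsScalar(paramID int32, stinfosysScalar bool) bool {
	if scalarCloudTypes[paramID] {
		return true
	}
	return stinfosysScalar
}

func main() {
	fmt.Println(parseAsScalar(2751, false)) // prints true: cloud-type exception
	fmt.Println(parseAsScalar(305, false))  // prints false: follows the Stinfosys flag
}
```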

@Lun4m Lun4m merged commit 93038b6 into trunk Dec 13, 2024
1 check passed
@Lun4m Lun4m deleted the kvalobs_importer branch December 13, 2024 11:20
Successfully merging this pull request may close these issues:

  • Cfailed is a tag/string
  • Data migration script from kvalobs and KDVH

3 participants