feat: add tasks for writing parquet tables. #1076

bpblanken · 2025-04-10T16:04:47Z

Adds methods to de-enumerate our enums back to strings while preserving the annotations table structure.
Adds methods to support camelCasing of structs.
Adds exports of the new_variants, new_entries, new_clinvar_variants, and new_transcripts parquets.
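
For readers unfamiliar with the camelCasing step, here is a minimal sketch of the idea in plain Python. It is not the code in this PR (which operates on Hail structs); the helper names `snake_to_camel` and `camelcase_keys`, and the sample row with its `variant_id` and `motif_feature_id` fields, are hypothetical and only illustrate renaming nested field names without changing the structure.

```python
from typing import Any


def snake_to_camel(name: str) -> str:
    """Convert a snake_case field name to camelCase."""
    head, *rest = name.split('_')
    return head + ''.join(part[:1].upper() + part[1:] for part in rest)


def camelcase_keys(value: Any) -> Any:
    """Recursively rename dict keys, descending into nested dicts and lists."""
    if isinstance(value, dict):
        return {snake_to_camel(k): camelcase_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [camelcase_keys(v) for v in value]
    return value


# Hypothetical row; plain dicts and lists stand in for nested structs and arrays.
row = {
    'variant_id': '1-123-A-T',
    'sorted_motif_feature_consequences': [{'motif_feature_id': 'ENSM0001'}],
}
camel_row = camelcase_keys(row)
```

Only the keys change; values and nesting are left as they were.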


hanars · 2025-05-08T15:24:15Z

v03_pipeline/lib/tasks/exports/misc.py

+    formatting_annotation_names = {
+        fa.__name__ for fa in dataset_type.formatting_annotation_fns(reference_genome)
+    }
+    if 'sorted_motif_feature_consequences' in formatting_annotation_names:


Is there a reason why the generic part of the unmapping function you wrote for unmap_reference_dataset_annotation_enums wouldn't work here?

bpblanken · 2025-05-09

Yeah, these are manual because we enumerated them manually, both in the annotations and in the globals. Building this function was effectively just reversing an existing function, and at the time I thought it was easier to be explicit.
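
As a rough illustration of what "reversing an existing function" can look like when done explicitly, here is a sketch in plain Python. It is not the PR's implementation: the lookup list `CONSEQUENCE_TERMS`, the field names `consequence_term_ids` and `consequence_terms`, and both helper functions are assumptions made for the example. Missing ids are propagated rather than looked up.

```python
# Hypothetical enum lookup; in the pipeline these values would come from the table globals.
CONSEQUENCE_TERMS = ['TF_binding_site_variant', 'TFBS_ablation']


def unmap_ids(ids, enum_values):
    """Map integer ids back to strings; a missing (None) id or list stays missing."""
    if ids is None:
        return None
    return [None if i is None else enum_values[i] for i in ids]


def deenumerate_consequence(consequence):
    """Explicitly reverse the enumeration of one field, keeping the other fields as-is."""
    out = {k: v for k, v in consequence.items() if k != 'consequence_term_ids'}
    out['consequence_terms'] = unmap_ids(
        consequence['consequence_term_ids'],
        CONSEQUENCE_TERMS,
    )
    return out


example = deenumerate_consequence(
    {'motif_feature_id': 'ENSM0001', 'consequence_term_ids': [1, 0]},
)
```

A generic version would instead walk every enumerated field using the globals; the explicit form trades that reuse for readability.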

hanars · 2025-05-08T15:26:48Z

v03_pipeline/lib/tasks/exports/write_new_clinvar_variants_parquet.py

+
+
+@luigi.util.inherits(BaseLoadingRunParams)
+class WriteNewClinvarVariantsParquetTask(BaseWriteParquetTask):


Just confirming that, per our discussion, this is not the approach we will be using for the MVP. It is okay to leave it in for now and take it out later, once the real approach is implemented.

hanars · 2025-05-08T15:28:37Z

v03_pipeline/lib/tasks/exports/write_new_entries_parquet.py

+            strict=True,
+        ):
+            mt = hl.read_matrix_table(remapped_and_subsetted_callset_task.path)
+            ht = compute_callset_family_entries_ht(


Isn't this already called in the main loading task we run? Why not make this task dependent on that task and reuse its table?

bpblanken · 2025-05-09

I decided to use the subsetted tables as the source rather than the project tables because we didn't have a clean existing way to select families out of the project table (the existing method removes families, and our migration plan will use the entire project tables with no need to subset). It made more sense to use this for now: I don't think the compute difference will be noticeable, and it will make it easier to deprecate the project table.

hanars · 2025-05-09

My only concern is that, in addition to supporting this task for new projects/families, we will also need to programmatically migrate all data from the existing project tables. If we write a task here that uses the project table, it would be very easy to reuse it for the migration. However, if you think that subsetting is easier for loading and would rather write and support a totally independent task for the migration, that's fine too.

bpblanken · 2025-05-09

Yeah, the plan was to write a totally independent task for the migration, as the two would be functionally different.

bpblanken added 30 commits April 9, 2025 16:30

Formatting function enums

b482487

improve function name

f1a5a6c

handle lookups that propagate missing

18a38c9

format

e006df0

ruff

27d2464

rename func

345a619

second func

f4c611c

start test

6931585

finish tests

e5880c9

Merge branch 'main' of github.com:broadinstitute/seqr-loading-pipelines into benb/deenumerate_for_export

2937808

ruff

2c60e10

Merge branch 'main' of github.com:broadinstitute/seqr-loading-pipelines into benb/deenumerate_for_export

e08ce89

Add camelcase

21e2d95

add camelcase

2a05241

ruff

236d5bb

missing init py

135d0b3

move functions to export

085e886

mostly functioning entries task

f2165ce

import

4880195

improve private reference datasets logic

f28ec6a

first pass

359d773

progress

2d4420f

tests passing

3964c8f

Merge branch 'benb/add_key_to_pipeline' into benb/deenumerate_for_export

6f40918

key

8b67aa7

annotations table

752b123

test transcripts

fcaba23

closer

dd0c024

missed some tests

23ee29a

Merge branch 'benb/add_key_to_pipeline' of github.com:broadinstitute/seqr-loading-pipelines into benb/deenumerate_for_export

b9b87a0

bpblanken and others added 24 commits April 24, 2025 15:05

sort it

125ea95

ruff

9528d3c

bugfixes

bb5aab3

update test to new format

9d9a138

ruff

1469298

v03

ac967b6

merge

7b594ac

merge

bd1a037

Merge branch 'main' of github.com:broadinstitute/seqr-loading-pipelines into benb/deenumerate_for_export

0de3398

no longer used

58b3355

Merge branch 'main' of github.com:broadinstitute/seqr-loading-pipelines into benb/deenumerate_for_export

944b102

lint

ae807c2

formatting

026c1ba

special case the export

6b172ca

Merge branch 'main' of github.com:broadinstitute/seqr-loading-pipelines into benb/deenumerate_for_export

6277649

remove gene/map

6f4b736

ruff

5af48f5

add new annotations

585724f

print

459e510

ruff

c19c16b

Merge branch 'main' of github.com:broadinstitute/seqr-loading-pipelines into benb/deenumerate_for_export

e972898

canonical is not a float

ca2daf5

add new clinvar variants parquet export (#1092)

1b4b265

feat: grch37 SNV_INDEL export (#1095)

ccbc80d

* support grch37 * enable grch37

bpblanken changed the title ~~feat: deenumerate for export~~ feat: add tasks for writing parquet tables. May 7, 2025

bpblanken marked this pull request as ready for review May 7, 2025 14:09

bpblanken requested a review from a team as a code owner May 7, 2025 14:09

feat: alphabetize nested field (#1096)

c25dbc6

* Support alphabetization of nested field * better sorting * Update misc_test.py * ruff

hanars reviewed May 8, 2025

hanars approved these changes May 9, 2025
