Skip to content

feat: add tasks for writing parquet tables. #1076

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 65 commits into
base: main
Choose a base branch
from

Conversation

bpblanken
Copy link
Collaborator

@bpblanken bpblanken commented Apr 10, 2025

  1. adds methods to de-enumerate our enums back to strings, but preserve the annotations table structure.
  2. adds methods to support camelCasing of structs.
  3. adds exports of new_variants, new_entries, new_clinvar_variants and new_transcripts parquets.

@bpblanken bpblanken changed the title feat: deenumerate for export feat: add tasks for writing parquet tables. May 7, 2025
@bpblanken bpblanken marked this pull request as ready for review May 7, 2025 14:09
@bpblanken bpblanken requested a review from a team as a code owner May 7, 2025 14:09
* Support alphabetization of nested field

* better sorting

* Update misc_test.py

* ruff
formatting_annotation_names = {
fa.__name__ for fa in dataset_type.formatting_annotation_fns(reference_genome)
}
if 'sorted_motif_feature_consequences' in formatting_annotation_names:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why the generic part of the unmapping function you wrote for unmap_reference_dataset_annotation_enums wouldn't work here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the logic behind these being manual was we enumerated them manually, both in the annotations and in the globals. Building this function was effectively just reversing an existing function. I thought at the time that it was easier to just be explicit.



@luigi.util.inherits(BaseLoadingRunParams)
class WriteNewClinvarVariantsParquetTask(BaseWriteParquetTask):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just confirming that, per our discussion, this is not the approach we are going to be using for MVP. Okay to leave in for now and take out later when the real approach is implemented

strict=True,
):
mt = hl.read_matrix_table(remapped_and_subsetted_callset_task.path)
ht = compute_callset_family_entries_ht(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this already called in the main loading task we run? Why not make this task dependent on that task and reuse this table?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the decision to use the subsetted tables as the source rather than the project tables because we didn't have a clean existing way to select families out of the project table (the existing method removes families, and our migration plan will use the entire project tables with no need to subset). I thought it made more sense to use this for now as I don't actually think the compute difference will be noticeable, and it will more easily allow us to deprecate the project table.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is that in addition to supporting this task for new projects/families, we will also need to be able to programmatically migrate all data from the existing project tables. If we write a task here that uses that project table it would be very easy to reuse it to do the migration. However, if you think that using the subsetting is easier for the loading and would rather write and support a totally independent task for the migration thats fine too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the plan was to write a totally independent task for the migration, as the two would be functionally different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants