feat: add tasks for writing parquet tables. #1076
base: main
Conversation
…es into benb/deenumerate_for_export
…seqr-loading-pipelines into benb/deenumerate_for_export
* support grch37
* enable grch37
* Support alphabetization of nested field
* better sorting
* Update misc_test.py
* ruff
formatting_annotation_names = {
    fa.__name__ for fa in dataset_type.formatting_annotation_fns(reference_genome)
}
if 'sorted_motif_feature_consequences' in formatting_annotation_names:
Is there a reason why the generic part of the unmapping function you wrote for unmap_reference_dataset_annotation_enums wouldn't work here?
Yeah, the logic behind these being manual was that we enumerated them manually, both in the annotations and in the globals. Building this function was effectively just reversing an existing function, so I thought at the time it was easier to just be explicit.
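To make the map/unmap symmetry being discussed concrete, here is a minimal pure-Python sketch of the two directions. All names and enum values here are invented for illustration; the actual pipeline stores the enums in Hail table globals and this is not the seqr-loading-pipelines implementation.

```python
# Hypothetical enum, standing in for an enum list stored in the table globals.
PATHOGENICITY_ENUM = ['Benign', 'Likely_benign', 'Pathogenic']

def map_annotation_enum(value, enum):
    # Forward direction, applied when the table was written: string -> index.
    return enum.index(value)

def unmap_annotation_enum(index, enum):
    # Reverse direction discussed above: index -> string.
    return enum[index]
```

The unmapping step is just an index lookup into the ordered enum list, which is why it can be written generically once the enum globals are available.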
@luigi.util.inherits(BaseLoadingRunParams)
class WriteNewClinvarVariantsParquetTask(BaseWriteParquetTask):
Just confirming that, per our discussion, this is not the approach we are going to use for the MVP. Okay to leave it in for now and take it out later when the real approach is implemented.
    strict=True,
):
    mt = hl.read_matrix_table(remapped_and_subsetted_callset_task.path)
    ht = compute_callset_family_entries_ht(
Isn't this already called in the main loading task we run? Why not make this task depend on that task and reuse the table?
I made the decision to use the subsetted tables as the source rather than the project tables because we didn't have a clean existing way to select families out of the project table (the existing method removes families, and our migration plan will use the entire project tables with no need to subset). I thought it made more sense to use this for now, as I don't think the compute difference will actually be noticeable, and it will make it easier to deprecate the project table.
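The distinction above between the existing helper (which removes families) and what this task needs (selecting families) can be sketched in plain Python. This is a hedged illustration with invented function names and a flat list-of-dicts in place of the real Hail tables, not the pipeline's implementation.

```python
# Hypothetical sketch: entries are rows keyed by family_id.
def remove_families(entries, family_ids):
    # Existing behavior: drop the given families from a project table.
    return [e for e in entries if e['family_id'] not in family_ids]

def select_families(entries, family_ids):
    # Needed behavior: keep only the given families.
    return [e for e in entries if e['family_id'] in family_ids]
```

The two are complements of each other over the same predicate, which is why the lack of a clean "select" path pushed this task toward the already-subsetted tables.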
My only concern is that, in addition to supporting this task for new projects/families, we will also need to be able to programmatically migrate all data from the existing project tables. If we write a task here that uses the project table, it would be very easy to reuse it for the migration. However, if you think that using the subsetting is easier for the loading and would rather write and support a totally independent task for the migration, that's fine too.
Yeah, the plan was to write a totally independent task for the migration, as the two would be functionally different.
new_variants, new_entries, new_clinvar_variants, and new_transcripts parquets.