Empty sample/genotype-data in single samples of a pedigree cause filters to crash #2201

Closed
Nicolai-vKuegelgen opened this issue Jan 21, 2025 · 3 comments · Fixed by #2235
Labels
bug Something isn't working

Comments

Nicolai-vKuegelgen commented Jan 21, 2025

Describe the bug
In a case/pedigree with multiple samples, some variants may not have usable data from all samples (either due to missing coverage or, for example, on the Y chromosome). In VCF files this can be encoded either as a single "." for that sample in the sample/genotype block or with individual missing values for each defined FORMAT field. In the TSV file used for data import to Varfish, the sample-specific data (genotype column) can, in principle, also be empty for single samples (i.e. """sample_2""": {} ). However, in cases like this the variant filtration will fail for the whole set of variants (SNVs or SVs).
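To make the empty-sample case concrete, here is a minimal sketch of what such a genotype column might look like once the TSV quoting has been undone. The sample names, the field names (`gt`, `ad`, `dp`) and the helper `samples_without_data` are illustrative assumptions, not Varfish or mehari code.

```python
import json

# Hypothetical genotype column for one variant, after TSV unescaping.
# sample_2 has no usable data, so its entry is an empty object.
genotype_column = '{"sample_1": {"gt": "0/1", "ad": 12, "dp": 30}, "sample_2": {}}'


def samples_without_data(genotype_json: str) -> list[str]:
    """Return the names of samples whose genotype entry is empty."""
    genotype = json.loads(genotype_json)
    return [name for name, fields in genotype.items() if not fields]


print(samples_without_data(genotype_column))  # ['sample_2']
```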

To Reproduce
Steps to reproduce the behavior:

  1. Generate a TSV file with empty genotype data for a single variant in a single sample (e.g. using mehari on a VCF with a "." in the sample block).
  2. Import this case to Varfish
  3. Attempt variant filtration
  4. See error

Expected behavior
Given that some samples may not have any usable information for some variants, ideally the variant filtration should be able to deal with missing data.
Alternatively, import of variants with missing data for even a single sample should be rejected, so that filtration cannot fail because of it.
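As a rough illustration of the first option (filtration that tolerates missing data), a per-sample filter could treat an empty genotype entry as "no information" instead of raising. This is only a sketch: the function name, the `dp` field and the `missing_passes` switch are hypothetical and do not reflect how Varfish implements its filters.

```python
def passes_min_depth(genotype, sample, min_dp, missing_passes=False):
    """Check a minimum-depth criterion for one sample of one variant.

    `genotype` maps sample name -> dict of genotype fields (assumed shape).
    A sample with an empty or absent entry has no depth value; in that case
    the result is decided by `missing_passes` instead of raising an error.
    """
    fields = genotype.get(sample) or {}
    dp = fields.get("dp")
    if dp is None:
        return missing_passes
    return dp >= min_dp


genotype = {"sample_1": {"gt": "0/1", "dp": 30}, "sample_2": {}}
print(passes_min_depth(genotype, "sample_1", 10))
print(passes_min_depth(genotype, "sample_2", 10))
print(passes_min_depth(genotype, "sample_2", 10, missing_passes=True))
```

Whether a sample without data should pass or fail a criterion is a policy decision; the point is only that the decision is made explicitly rather than by an exception.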

Additional context
This could be fixed by never writing empty sample/genotype-data into the tsv files used for varfish import, see mehari issue 672

Contributor

stolpeo commented Jan 23, 2025

For the Y chromosome, there is no standardized output for the genotype:

  • Dragen outputs GT as ./. and other data as .
  • GATK outputs GT mostly as ./. (except when it reports noise), and other data as 0
  • Varfish annotator converts the . of other data to 0
  • mehari converts the . of other data to -1
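The conventions listed above could be captured in a small lookup, sketched here under the assumption that the producing tool is known for each record. The names `MISSING_MARKERS` and `looks_missing` are made up for illustration. Note that 0 is ambiguous, since it can also be a genuine measurement, so the check can only say that a value might be a placeholder.

```python
# Placeholder values for non-GT fields, per tool, as listed above.
MISSING_MARKERS = {
    "dragen": ".",
    "gatk": 0,
    "varfish-annotator": 0,
    "mehari": -1,
}


def looks_missing(value, producer):
    """Return True if `value` equals the producer's missing-data placeholder.

    For producers that use 0 this cannot distinguish a placeholder from a
    real zero, so treat a True result as a hint, not as proof.
    """
    return value == MISSING_MARKERS[producer]


print(looks_missing(".", "dragen"))
print(looks_missing(-1, "mehari"))
print(looks_missing(0, "mehari"))
```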

Contributor

stolpeo commented Feb 11, 2025

RCA

When starting an SV query, the variants from the database are written out to a temporary VCF file that is then submitted to the varfish-server-worker. However, for some variants the genotype dictionary for an individual can be empty. If that happens for a CNV entry, the conversion fails. By default, expected fields that are not in the genotype dictionary fall back to ".", but the cn entry is cast with int(), which cannot parse "." and so makes the conversion fail.

        coerce_db_to_vcf.get(key, lambda x: x)(
            record.genotype[sample].get(key, ".")
        )

        coerce_db_to_vcf = {
            "cn": lambda x: int(x),
        }
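The failure can be reproduced in isolation. The sketch below reuses the two snippets above and wraps them in a hypothetical helper, `to_vcf_value`; the `genotype` dictionary stands in for `record.genotype` and is made-up data.

```python
# Coercion table as quoted above: cn is always passed through int().
coerce_db_to_vcf = {
    "cn": lambda x: int(x),
}

# Stand-in for record.genotype; sample_2 has no data at all.
genotype = {"sample_1": {"cn": 1}, "sample_2": {}}


def to_vcf_value(sample, key):
    """Look up one field, defaulting to '.', and apply the coercion."""
    raw = genotype[sample].get(key, ".")
    return coerce_db_to_vcf.get(key, lambda x: x)(raw)


print(to_vcf_value("sample_1", "cn"))
try:
    to_vcf_value("sample_2", "cn")
except ValueError as exc:
    # int(".") raises ValueError, which aborts the whole export.
    print("conversion failed:", exc)
```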

Contributor

stolpeo commented Feb 11, 2025

Solution

Allow . as cn value.

        coerce_db_to_vcf = {
            "cn": lambda x: x if x == "." else int(x),
        }
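A quick way to check the patched coercion is to format one sample's fields the way a VCF sample column would be written. The helper `format_sample` and the choice of keys (`gt`, `cn`) are assumptions for illustration, not the actual export code.

```python
# Patched coercion: pass the missing marker through, cast everything else.
coerce_db_to_vcf = {
    "cn": lambda x: x if x == "." else int(x),
}


def format_sample(fields, keys=("gt", "cn")):
    """Join one sample's values into a VCF-style colon-separated string."""
    values = (
        coerce_db_to_vcf.get(key, lambda x: x)(fields.get(key, "."))
        for key in keys
    )
    return ":".join(str(value) for value in values)


print(format_sample({"gt": "0/1", "cn": "1"}))  # 0/1:1
print(format_sample({}))                        # .:.
```

With the guard in place, an empty genotype dictionary yields "." for every field instead of raising.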
