In a normal SampleData file we guarantee that if the genotypes contain -1 for missing data, the last item in the alleles array is None (so that site.alleles[missing_genotype] is None). This is done when adding the alleles during add_site, here:
Since we don't use add_site for an sgkit file, this no longer holds. In other words, SgkitSampleData files can have missing data while the last item of the allele list is not None. This could easily become a source of subtle bugs. I don't know how we should correct it. Perhaps override the site.alleles accessor in the SgkitSampleData class?
import sgkit as sg
import tsinfer
import tskit

# Simulate a small phased dataset in which about half the calls are missing
ds = sg.simulate_genotype_call_dataset(n_variant=7, n_sample=3, missing_pct=.5, phased=True)
# Use the first listed allele at each site as the ancestral allele
ds.update({'variant_ancestral_allele': ds['variant_allele'][:, 0]})
sg.save_dataset(ds, "output.zarr", mode="w")
sd = tsinfer.SgkitSampleData(path="output.zarr")
for v in sd.variants():
    if tskit.MISSING_DATA in v.genotypes:
        print("This allele list should have a `None` as the last item:", v.alleles)
Giving
This allele list should have a `None` as the last item: [b'A', b'G']
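To spell out why this matters, here is a minimal pure-Python sketch that does not use tsinfer. The `decode` helper and the allele lists are made up for illustration. The only fact borrowed from the libraries is that the missing-data genotype value is -1. Because -1 is also a valid Python index for the last list item, a lookup without the trailing None does not fail; it quietly returns the last real allele.

```python
MISSING_DATA = -1  # same value as tskit.MISSING_DATA


def decode(alleles, genotypes):
    """Map genotype indices to allele states by plain list indexing (hypothetical helper)."""
    return [alleles[g] for g in genotypes]


genotypes = [0, 1, MISSING_DATA]

# With the trailing None, the missing genotype decodes to None as intended.
with_none = decode(["A", "G", None], genotypes)

# Without it, -1 indexes the last real allele, so missing data is
# silently reported as "G" and no error is raised.
without_none = decode(["A", "G"], genotypes)

print(with_none)
print(without_none)
```

Any downstream code that indexes alleles by genotype in this way would presumably hit the second case with an SgkitSampleData file.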
This is a little complicated, because sg_sd.sites_alleles[:] is taken straight from the sgkit file, and will include empty strings if e.g. some sites have 4 alleles and others have 2.
I suggest that in an SgkitSampleData file, we:
- do not change the sd.sites_alleles zarr data
- when we iterate over sites, report site.alleles without the trailing "" entries in the zarr array. Moreover, we ignore the presence/absence of missing data in the genotypes for that site, and either report the alleles without a None appended (which could cause user errors) or simply append a None all the time.
- when iterating over variants, we know whether or not there is missing data in the genotypes, so in the variant.alleles list (which can already differ from site.alleles, e.g. if recode_ancestral=True), we retain the behaviour seen in a "normal" SampleData file, which is to append None only if there is missing data at that site (this could also include the case of a missing ancestral allele, I guess)?
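A rough sketch of the suggestion above, in plain Python rather than against the real SgkitSampleData class. The function names `site_alleles` and `variant_alleles`, and the `always_append_none` flag, are hypothetical and only illustrate the two behaviours. The sketch assumes a row of the padded allele array is available as a list of strings, with "" used as padding.

```python
MISSING_DATA = -1  # same value as tskit.MISSING_DATA


def site_alleles(padded, always_append_none=False):
    """Return a copy of the alleles with trailing "" padding removed.

    The padded input is not modified. If always_append_none is True,
    a None is appended regardless of the genotypes.
    """
    alleles = list(padded)
    while alleles and alleles[-1] == "":
        alleles.pop()
    if always_append_none:
        alleles.append(None)
    return alleles


def variant_alleles(padded, genotypes):
    """Append None only when the genotypes at this site contain missing data."""
    alleles = site_alleles(padded)
    if MISSING_DATA in genotypes:
        alleles.append(None)
    return alleles


padded_row = ["A", "G", "", ""]  # a biallelic site padded to four alleles
print(site_alleles(padded_row))
print(variant_alleles(padded_row, [0, MISSING_DATA, 1]))
print(variant_alleles(["A", "C", "G", "T"], [0, 3, 2]))
```

The variant-level function keeps the "normal" SampleData guarantee, while the site-level function leaves open the choice between never appending None and always appending it.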
Just a note here that SgkitSampleData is a temporary way to let us use data in sgkit format internally, without breaking too much code. We're not going to propose it as an interface for users to work with, and we will ultimately remove the SampleData interface entirely.
The add_site code referred to in the report above is tsinfer/tsinfer/formats.py, line 1892 in 7db1d38.