@@ -32,7 +32,7 @@ document. However, for the moment we'll just use a pre-generated dataset:

``` {code-cell} ipython3
import zarr
- ds = zarr.load("_static/example_data.vcz")
+ ds = zarr.open("_static/example_data.vcz")
```

This is what the genotypes stored in that datafile look like:
@@ -44,7 +44,7 @@ assert all(len(np.unique(a)) == len(a) for a in ds['variant_allele'])
assert any([np.sum(g) == 1 for g in ds['call_genotype']]) # at least one singleton
assert any([np.sum(g) == 0 for g in ds['call_genotype']]) # at least one non-variable

- alleles = ds['variant_allele'].astype(str)
+ alleles = ds['variant_allele'][:].astype(str)
sites = np.arange(ds['call_genotype'].shape[0])
print(" " * 22, "Site:", " ".join(str(x) for x in range(8)), "\n")
for sample in range(ds['call_genotype'].shape[1]):
@@ -76,11 +76,11 @@ and not used for inference (with a warning given).
import tsinfer

# For this example take the REF allele (index 0) as ancestral
- ancestral_alleles = ds['variant_allele'][:,0].astype(str)
+ ancestral_allele = ds['variant_allele'][:,0].astype(str)
# This is just a numpy array, set the last site to an unknown value, for demo purposes
- ancestral_alleles[-1] = "."
+ ancestral_allele[-1] = "."

- vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_alleles)
+ vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_allele)
```

The `VariantData` object is a lightweight wrapper around the .vcz file.
@@ -98,14 +98,40 @@ Additionally, during the inference step, additional sites can be flagged as not
inference, for example if they are deemed unreliable (this is done
via the `exclude_positions` parameter).

+ Sites which are not used for inference will
+ still be included in the final tree sequence, with mutations at those sites being placed
+ onto branches by {meth}`parsimony<tskit.Tree.map_mutations>`.
+
### Masks

- Sites which are not used for inference will still be included in the final tree sequence, with
- mutations at those sites being placed onto branches by parsimony. However, it is also possible
- to completely exclude sites and samples from the final tree sequence, by specifing a `site_mask`
- and/or a `sample_mask` when creating the `VariantData` object. Such sites or samples will be
- completely omitted from both inference and the final tree sequence. This can be useful, for
- example, to reduce the amount of computation required for an inference.
+ It is also possible to *completely* exclude sites and samples, by specifying a boolean
+ `site_mask` and/or a `sample_mask` when creating the `VariantData` object. Sites or samples with
+ a mask value of `True` will be completely omitted both from inference and the final tree sequence.
+ This can be useful, for example, if your VCF file contains multiple chromosomes (in which case
+ `tsinfer` will need to be run separately on each chromosome) or if you wish to select only a subset
+ of the chromosome for inference (e.g. to reduce computational load). If a `site_mask` is provided,
+ note that the ancestral alleles array only specifies alleles for the unmasked sites.
+
+ Below, for instance, is an example of including only sites up to position six in the contig
+ labelled "chr1" in the `example_data.vcz` file:
+
+ ``` {code-cell}
+ import numpy as np
+
+ # mask out any sites not associated with the contig named "chr1"
+ # (for demonstration: all sites in this .vcz file are from "chr1" anyway)
+ chr1_index = np.where(ds.contig_id[:] == "chr1")[0]
+ site_mask = ds.variant_contig[:] != chr1_index
+ # also mask out any sites with a position >= 6
+ site_mask[ds.variant_position[:] >= 6] = True
+
+ smaller_vdata = tsinfer.VariantData(
+     "_static/example_data.vcz",
+     ancestral_allele=ancestral_allele[site_mask == False],
+     site_mask=site_mask,
+ )
+ print(f"The `smaller_vdata` object returns data for only {smaller_vdata.num_sites} sites")
+ ```

### Topology inference

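The masking logic added in the last hunk can be tried out without the `.vcz` file. The sketch below is not part of the commit: it uses small made-up NumPy arrays as hypothetical stand-ins for `contig_id`, `variant_contig`, `variant_position` and the ancestral allele array, and it assumes (as the new documentation text states) that a mask value of `True` means "omit this site". It only shows how the boolean mask and the matching subset of ancestral alleles line up; it does not call `tsinfer`.

```python
import numpy as np

# Hypothetical stand-ins for arrays stored in a .vcz file (all values invented).
contig_id = np.array(["chr1", "chr2"])
variant_contig = np.array([0, 0, 0, 1, 0, 0])    # per-site index into contig_id
variant_position = np.array([1, 3, 5, 2, 6, 9])  # per-site position
ancestral_allele = np.array(["A", "C", "G", "T", "A", "."])

# True means "drop this site": start by masking every site not on chr1.
chr1_index = np.flatnonzero(contig_id == "chr1")[0]
site_mask = variant_contig != chr1_index
# Also mask out any site with a position >= 6.
site_mask |= variant_position >= 6

# When a site mask is supplied, the ancestral alleles cover only the unmasked sites.
kept_alleles = ancestral_allele[~site_mask]

print(site_mask)
print(kept_alleles)
```

Taking the first element of the index array gives a scalar `chr1_index`, so the comparison does not rely on broadcasting against a length-one array, and `~site_mask` is the usual NumPy spelling of `site_mask == False`.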