Commit a98a019

hyanwong authored and mergify[bot] committed
Add example of excluding chromosomes using site masks
And switch to the preferred `zarr.open` instead of `zarr.load`.
1 parent 1c355ae commit a98a019

3 files changed

+39
-13
lines changed

docs/_static/example_data.vcz/contig_id/.zarray

+2 -2
@@ -9,8 +9,8 @@
 "id": "blosc",
 "shuffle": 1
 },
-"dtype": "<U1",
-"fill_value": null,
+"dtype": "<U4",
+"fill_value": "",
 "filters": null,
 "order": "C",
 "shape": [
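
The commit does not say why the dtype was widened from `<U1` to `<U4`. One plausible reading is that a one-character Unicode dtype cannot hold a four-character contig name such as "chr1". The NumPy-only sketch below illustrates that general behaviour; the array names are hypothetical and nothing here reads the actual `.vcz` file.

```python
import numpy as np

# Hypothetical contig names; the real values live in contig_id in the .vcz file
contig_names = ["chr1"]

# A fixed-width one-character dtype silently truncates longer strings
narrow = np.array(contig_names, dtype="<U1")
# A four-character dtype stores the name in full
wide = np.array(contig_names, dtype="<U4")

# Comparisons against the full name only succeed for the wide array
narrow_matches = narrow == "chr1"
wide_matches = wide == "chr1"

print(narrow, narrow_matches)
print(wide, wide_matches)
```

If the stored names were truncated in this way, a comparison such as `ds.contig_id[:] == "chr1"` (used later in this commit) would find no matching contig.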
12 Bytes
Binary file not shown.

docs/usage.md

+37 -11
@@ -32,7 +32,7 @@ document. However, for the moment we'll just use a pre-generated dataset:
 
 ```{code-cell} ipython3
 import zarr
-ds = zarr.load("_static/example_data.vcz")
+ds = zarr.open("_static/example_data.vcz")
 ```
 
 This is what the genotypes stored in that datafile look like:
@@ -44,7 +44,7 @@ assert all(len(np.unique(a)) == len(a) for a in ds['variant_allele'])
 assert any([np.sum(g) == 1 for g in ds['call_genotype']]) # at least one singleton
 assert any([np.sum(g) == 0 for g in ds['call_genotype']]) # at least one non-variable
 
-alleles = ds['variant_allele'].astype(str)
+alleles = ds['variant_allele'][:].astype(str)
 sites = np.arange(ds['call_genotype'].shape[0])
 print(" " * 22, "Site:", " ".join(str(x) for x in range(8)), "\n")
 for sample in range(ds['call_genotype'].shape[1]):
@@ -76,11 +76,11 @@ and not used for inference (with a warning given).
 import tsinfer
 
 # For this example take the REF allele (index 0) as ancestral
-ancestral_alleles = ds['variant_allele'][:,0].astype(str)
+ancestral_allele = ds['variant_allele'][:,0].astype(str)
 # This is just a numpy array, set the last site to an unknown value, for demo purposes
-ancestral_alleles[-1] = "."
+ancestral_allele[-1] = "."
 
-vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_alleles)
+vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_allele)
 ```
 
 The `VariantData` object is a lightweight wrapper around the .vcz file.
@@ -98,14 +98,40 @@ Additionally, during the inference step, additional sites can be flagged as not
 inference, for example if they are deemed unreliable (this is done
 via the `exclude_positions` parameter).
 
+Sites which are not used for inference will
+still be included in the final tree sequence, with mutations at those sites being placed
+onto branches by {meth}`parsimony<tskit.Tree.map_mutations>`.
+
 ### Masks
 
-Sites which are not used for inference will still be included in the final tree sequence, with
-mutations at those sites being placed onto branches by parsimony. However, it is also possible
-to completely exclude sites and samples from the final tree sequence, by specifing a `site_mask`
-and/or a `sample_mask` when creating the `VariantData` object. Such sites or samples will be
-completely omitted from both inference and the final tree sequence. This can be useful, for
-example, to reduce the amount of computation required for an inference.
+It is also possible to *completely* exclude sites and samples, by specifing a boolean
+`site_mask` and/or a `sample_mask` when creating the `VariantData` object. Sites or samples with
+a mask value of `True` will be completely omitted both from inference and the final tree sequence.
+This can be useful, for example, if your VCF file contains multiple chromosomes (in which case
+`tsinfer` will need to be run separately on each chromosome) or if you wish to select only a subset
+of the chromosome for inference (e.g. to reduce computational load). If a `site_mask` is provided,
+note that the ancestral alleles array only specifies alleles for the unmasked sites.
+
+Below, for instance, is an example of including only sites up to position six in the contig
+labelled "chr1" in the `example_data.vcz` file:
+
+```{code-cell}
+import numpy as np
+
+# mask out any sites not associated with the contig named "chr1"
+# (for demonstration: all sites in this .vcz file are from "chr1" anyway)
+chr1_index = np.where(ds.contig_id[:] == "chr1")[0]
+site_mask = ds.variant_contig[:] != chr1_index
+# also mask out any sites with a position >= 6
+site_mask[ds.variant_position[:] >= 6] = True
+
+smaller_vdata = tsinfer.VariantData(
+    "_static/example_data.vcz",
+    ancestral_allele=ancestral_allele[site_mask == False],
+    site_mask=site_mask,
+)
+print(f"The `smaller_vdata` object returns data for only {smaller_vdata.num_sites} sites")
+```
 
 ### Topology inference
 
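
The masking logic added to `docs/usage.md` in this commit can be tried without `tsinfer` or a `.vcz` file. The following is a minimal sketch using plain NumPy arrays as hypothetical stand-ins for `contig_id`, `variant_contig`, `variant_position` and the ancestral allele array. The values are invented for illustration and include a second contig, unlike the example dataset.

```python
import numpy as np

# Hypothetical stand-ins for arrays stored in a .vcz file (values are invented)
contig_id = np.array(["chr1", "chr2"])            # contig names
variant_contig = np.array([0, 0, 0, 1, 1, 0])     # per-site index into contig_id
variant_position = np.array([2, 4, 5, 1, 3, 8])   # per-site position on its contig
ancestral_allele = np.array(["A", "C", "G", "T", "A", "C"])

# Index of the contig named "chr1" (taken as a scalar)
chr1_index = np.flatnonzero(contig_id == "chr1")[0]

# True means "exclude this site": first mask everything not on chr1 ...
site_mask = variant_contig != chr1_index
# ... then also mask any site with a position >= 6
site_mask |= variant_position >= 6

# The ancestral allele array must cover only the unmasked sites
kept_ancestral_allele = ancestral_allele[~site_mask]

print(site_mask)
print(kept_ancestral_allele)
```

Here `~site_mask` is the boolean inverse of the mask, equivalent to the `site_mask == False` comparison used in the diff. The filtered array has one entry per unmasked site, matching the note in the added documentation that the ancestral alleles array only specifies alleles for the unmasked sites.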