Commit 37645ef

Large scale inference docs
1 parent e3b2155 commit 37645ef

File tree

7 files changed: +405 −10 lines changed

Diff for: docs/_static/ancestor_grouping.png

31.1 KB
Diff for: docs/_toc.yml

+1

@@ -14,6 +14,7 @@ parts:
 - caption: Inference
   chapters:
   - file: inference
+  - file: large_scale
 - caption: Interfaces
   chapters:
   - file: api

Diff for: docs/api.rst

+26

@@ -16,6 +16,11 @@ File formats
 Sample data
 +++++++++++

+.. autoclass:: tsinfer.VariantData
+    :members:
+    :inherited-members:
+
+
 .. autoclass:: tsinfer.SampleData
     :members:
     :inherited-members:

@@ -60,6 +65,27 @@ Running inference

 .. autofunction:: tsinfer.post_process

+*****************
+Batched inference
+*****************
+
+.. autofunction:: tsinfer.match_ancestors_batch_init
+
+.. autofunction:: tsinfer.match_ancestors_batch_groups
+
+.. autofunction:: tsinfer.match_ancestors_batch_group_partition
+
+.. autofunction:: tsinfer.match_ancestors_batch_group_finalise
+
+.. autofunction:: tsinfer.match_ancestors_batch_finalise
+
+.. autofunction:: tsinfer.match_samples_batch_init
+
+.. autofunction:: tsinfer.match_samples_batch_partition
+
+.. autofunction:: tsinfer.match_samples_batch_finalise
+
+
 *****************
 Container classes
 *****************

Diff for: docs/inference.md

+1 −1

@@ -300,4 +300,4 @@ The final phase of a `tsinfer` inference consists of a number steps:
    section
 2. Describe the structure of the output tree sequences; how the
    nodes are mapped, what the time values mean, etc.
-:::
+:::

Diff for: docs/large_scale.md

+136 (new file)

@@ -0,0 +1,136 @@

---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.12
    jupytext_version: 1.9.1
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
:::{currentmodule} tsinfer
:::

(sec_large_scale)=

# Large Scale Inference

tsinfer scales well and has been successfully used with datasets up to half a
million samples. Here we detail considerations and tips for each step of the
inference process to help you scale up your analysis. A snakemake pipeline
which implements this parallelisation scheme is available at
https://github.com/benjeffery/tsinfer-snakemake.
(sec_large_scale_ancestor_generation)=

## Data preparation

For large scale inference the data must be in [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec)
format, read by the {class}`VariantData` class. [bio2zarr](https://github.com/sgkit-dev/bio2zarr)
is recommended for conversion from VCF. [sgkit](https://github.com/sgkit-dev/sgkit) can then
be used to perform initial filtering.
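
As a minimal sketch of this step (the file names and the name of the
ancestral-allele array are illustrative assumptions, not fixed conventions):

```python
# Shell: convert a VCF to VCF Zarr with bio2zarr, e.g.
#   vcf2zarr convert chr20.vcf.gz chr20.vcz

import tsinfer

# Open the VCF Zarr dataset. We assume the dataset contains a per-site
# ancestral allele array under this (hypothetical) name.
vdata = tsinfer.VariantData("chr20.vcz", ancestral_state="variant_ancestral_allele")
print(vdata.num_samples, vdata.num_sites)
```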

## Ancestor generation

Ancestor generation is generally the fastest step in inference and is not yet
parallelised out-of-core in tsinfer. However, it scales well on machines with
many cores and hyperthreading via the `num_threads` argument to
{meth}`generate_ancestors`. The limiting factor is often that the
entire genotype array for the contig being inferred needs to fit in RAM;
this is the high-water mark for memory usage in tsinfer.
Note the `genotype_encoding` argument: setting this to
{class}`tsinfer.GenotypeEncoding.ONE_BIT` reduces the memory footprint of
the genotype array by a factor of 8, for a surprisingly small increase in
runtime. With this encoding, the RAM needed is roughly
`num_sites * num_samples * ploidy / 8` bytes.
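
A sketch of this step, continuing from the `vdata` object above (the thread
count and output path are illustrative):

```python
import tsinfer

# Generate ancestors with the bit-packed genotype encoding, cutting the
# in-RAM genotype array by a factor of 8.
ancestor_data = tsinfer.generate_ancestors(
    vdata,                       # VariantData from the previous step
    path="chr20.ancestors",      # store the ancestor data on disk
    num_threads=32,
    genotype_encoding=tsinfer.GenotypeEncoding.ONE_BIT,
)
```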

## Ancestor matching

Ancestor matching is one of the more time consuming steps of inference. It
proceeds in groups, progressively growing the tree sequence with younger
ancestors. At each stage the parallelism is limited to the number of ancestors
in the current group, as an ancestor cannot be matched until all of the older
ancestors it might inherit from have been matched in an earlier group. For a
typical human data set the number of ancestors per group varies from single
digits up to approximately the number of samples.
The plot below shows the number of ancestors matched in each group for a typical
human data set:

```{figure} _static/ancestor_grouping.png
:width: 80%
```

There are five tsinfer API methods that can be used to parallelise ancestor
matching.

Initially {meth}`match_ancestors_batch_init` should be called to
set up the batch matching and to determine the groupings of ancestors.
This method writes a file `metadata.json` to the `work_dir` that contains
a JSON-encoded dictionary with configuration for later steps, and a key
`ancestor_grouping` which is a list of dictionaries, each containing the
list of ancestors in that group (key: `ancestors`) and a proposed partitioning of
those ancestors into sets that can be matched in parallel (key: `partitions`).
The dictionary is also returned by the method.
The partitioning is controlled by the `min_work_per_job` and `max_num_partitions`
arguments. Ancestors are placed in a partition until the sum of their lengths exceeds
`min_work_per_job`, at which point a new partition is started. However, the number of partitions
is not allowed to exceed `max_num_partitions`. It is suggested to set `max_num_partitions`
to around 3-4x the number of worker nodes available, and `min_work_per_job` to around
2,000,000 for a typical human data set.
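
A sketch of the initialisation call; `work_dir`, `min_work_per_job` and
`max_num_partitions` are as described above, while the remaining argument
names are assumptions for illustration (check the API documentation):

```python
import tsinfer

metadata = tsinfer.match_ancestors_batch_init(
    work_dir="ancestors_work",
    sample_data_path="chr20.vcz",                # assumed argument name
    ancestral_state="variant_ancestral_allele",  # assumed argument name
    ancestor_data_path="chr20.ancestors",        # assumed argument name
    min_work_per_job=2_000_000,
    max_num_partitions=64,
)
num_groups = len(metadata["ancestor_grouping"])
```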

Each group is then matched in turn, either by calling {meth}`match_ancestors_batch_groups`
to match without partitioning, or by calling {meth}`match_ancestors_batch_group_partition`
many times in parallel followed by a single call to {meth}`match_ancestors_batch_group_finalise`.
Each call to {meth}`match_ancestors_batch_groups` or {meth}`match_ancestors_batch_group_finalise`
outputs a tree sequence to `work_dir`, which is then used when matching the next group. The length
of the `ancestor_grouping` list in the metadata dictionary determines the group numbers that these
methods will need to be called for, and the length of the `partitions` list in each group determines
the number of calls to {meth}`match_ancestors_batch_group_partition` that are needed (if any),
as the sketch below shows.
93+
{meth}`match_ancestors_batch_groups` matches groups, without partitioning, from
94+
`group_index_start` (inclusively) to `group_index_end` (exclusively). Combining
95+
many groups into one call reduces the overhead from job submission and start
96+
up times, but note on job failure the process can only be resumed from the
97+
last `group_index_end`.
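
For example, one job might match the first ten groups in a single call (the
values here are illustrative):

```python
# Match groups 0-9 without partitioning, using num_threads cores.
ts = tsinfer.match_ancestors_batch_groups(
    work_dir="ancestors_work",
    group_index_start=0,
    group_index_end=10,
    num_threads=8,
)
```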

To match a single group in parallel, call {meth}`match_ancestors_batch_group_partition`
once for each partition listed in the `ancestor_grouping[group_index]['partitions']` list,
incrementing `partition_index` each time. This will match the ancestors, placing the match data in
the `work_dir`. Once all partitions are complete, a single call to
{meth}`match_ancestors_batch_group_finalise` will then insert the matches and
output the tree sequence to `work_dir`.
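
A sketch of the partitioned flow for one group; in practice each partition call
would run as a separate cluster job rather than in a loop:

```python
group_index = 11  # illustrative group number
num_partitions = len(metadata["ancestor_grouping"][group_index]["partitions"])

# These calls are independent and can run in parallel across jobs.
for partition_index in range(num_partitions):
    tsinfer.match_ancestors_batch_group_partition(
        work_dir="ancestors_work",
        group_index=group_index,
        partition_index=partition_index,
    )

# Once every partition has finished, combine the match data for the group.
ts = tsinfer.match_ancestors_batch_group_finalise(
    work_dir="ancestors_work",
    group_index=group_index,
)
```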

At any point the process can be resumed from the last successfully completed call to
{meth}`match_ancestors_batch_groups`, as the tree sequences in `work_dir` checkpoint the
progress.

Finally, after the last group is matched, call {meth}`match_ancestors_batch_finalise` to
combine the groups into a single tree sequence.
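
For example (assuming `work_dir` is the only required argument):

```python
# Combine all matched groups into the final ancestors tree sequence.
ancestors_ts = tsinfer.match_ancestors_batch_finalise(work_dir="ancestors_work")
```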

The partitioning in `metadata.json` does not have to be used for every group. As the early groups are
not matching against a large tree sequence, it is often faster not to partition the first half of the
groups, depending on job set-up and queueing time on your cluster; see the sketch below.
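
A sketch of that mixed strategy (the midpoint split is illustrative only):

```python
# Run the first half of the groups unpartitioned in one job...
num_groups = len(metadata["ancestor_grouping"])
tsinfer.match_ancestors_batch_groups(
    work_dir="ancestors_work",
    group_index_start=0,
    group_index_end=num_groups // 2,
    num_threads=8,
)
# ...then use the partitioned flow shown above for the remaining groups.
```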

Calls to {meth}`match_ancestors_batch_group_partition` will only use a single core, but
{meth}`match_ancestors_batch_groups` will use as many cores as `num_threads` is set to.
Therefore this value, and the cluster resources requested, should be scaled with the number of
ancestors, which can be read from the metadata dictionary.


## Sample matching

Sample matching is far simpler than ancestor matching, as it is essentially the same as matching a
single group of ancestors. There are three API methods that work together to enable distributed
sample matching.
{meth}`match_samples_batch_init` should be called to set up the batch matching and to determine the
groupings of samples. Similarly to {meth}`match_ancestors_batch_init`, it has `min_work_per_job` and
`max_num_partitions` arguments to control the level of parallelism. The method writes a file
`metadata.json` to the directory `work_dir` that contains a JSON-encoded dictionary with
configuration for later steps. This is also returned by the call. The `num_partitions` key in
this dictionary is the number of times {meth}`match_samples_batch_partition` will need
to be called, with each partition index passed as the `partition_index` argument. These calls can
happen in parallel and write match data to the `work_dir`, which is then used by
{meth}`match_samples_batch_finalise` to output the final tree sequence.
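
A sketch of the whole sample-matching flow, mirroring the ancestor-matching
examples above; the dataset and ancestor tree sequence argument names are
assumptions for illustration:

```python
import tsinfer

metadata = tsinfer.match_samples_batch_init(
    work_dir="samples_work",
    sample_data_path="chr20.vcz",                # assumed argument name
    ancestral_state="variant_ancestral_allele",  # assumed argument name
    ancestor_ts_path="chr20.ancestors.trees",    # assumed argument name
    min_work_per_job=2_000_000,
    max_num_partitions=64,
)

# Independent: submit one cluster job per partition.
for partition_index in range(metadata["num_partitions"]):
    tsinfer.match_samples_batch_partition(
        work_dir="samples_work",
        partition_index=partition_index,
    )

# After all partitions complete, write the final tree sequence.
final_ts = tsinfer.match_samples_batch_finalise(work_dir="samples_work")
```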

Diff for: tsinfer/formats.py

+1 −1

@@ -2308,7 +2308,7 @@ class VariantData(SampleData):
     the inference process will have ``inferred_ts.num_samples`` equal to double
     the number returned by ``VariantData.num_samples``.

-    :param Union(str, zarr.hierarchy.Group) path_or_zarr: The input dataset in
+    :param Union(str, zarr.Group) path_or_zarr: The input dataset in
         `VCF Zarr <https://github.com/sgkit-dev/vcf-zarr-spec>`_ format.
         This can either a path to the Zarr dataset saved on disk, or the
         Zarr object itself.
