@@ -16,14 +16,14 @@ kernelspec:

(sec_large_scale)=

- # Large Scale Inference
+ # Large scale inference

Generally, for up to a few thousand samples a single multi-core machine
- can infer a tree seqeunce in a few days. However, tsinfer has been
- successfully used with datasets up to half a million samples, where
- ancestor and sample matching can take several CPU-years.
+ can infer a tree sequence in a few days, hours, or even minutes.
+ However, _tsinfer_ has been successfully used with datasets up to half a million
+ samples, where ancestor and sample matching can take several CPU-years.
At this scale inference must be scaled across many machines.
- tsinfer provides specific APIs to enable this.
+ _Tsinfer_ provides specific APIs to enable this.
Here we detail considerations and tips for each step of the
inference process to help you scale up your analysis. A snakemake pipeline
which implements this parallelisation scheme is available as
@@ -34,8 +34,8 @@ which implements this parallelisation scheme is available as
## Data preparation

For large scale inference the data must be in [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec)
- format, read by the {class}`VariantData` class. [bio2zarr](https://github.com/sgkit-dev/bio2zarr)
- is recommended for conversion from VCF. [sgkit](https://github.com/sgkit-dev/sgkit) can then
+ format, read by the {class}`VariantData` class. [Bio2zarr](https://github.com/sgkit-dev/bio2zarr)
+ is recommended for conversion from VCF, and [sgkit](https://github.com/sgkit-dev/sgkit) can then
be used to perform initial filtering.
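
As a rough sketch of this workflow (the file names and the
`variant_ancestral_allele` array name here are illustrative; your dataset
will need its own ancestral allele information added before inference):

```python
# Hypothetical sketch: convert a compressed VCF to VCF Zarr with the
# bio2zarr command-line tool, e.g.
#
#   vcf2zarr convert chr20.vcf.gz chr20.vcz
#
# (for very large VCFs the multi-step explode/encode workflow may be
# preferable), then open the result for inference with tsinfer.
import tsinfer

vdata = tsinfer.VariantData(
    "chr20.vcz",
    # Name of a Zarr array holding the ancestral allele at each site;
    # this must have been added to the dataset beforehand.
    ancestral_state="variant_ancestral_allele",
)
```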
:::{todo}
@@ -45,19 +45,20 @@ An upcoming tutorial will detail conversion from VCF to a VCF Zarr suitable for
## Ancestor generation
- Ancestor generation is generally the fastest step in inference and is not yet
+ Ancestor generation is generally the fastest step in inference. It is not yet
parallelised out-of-core in tsinfer and must be performed on a single machine.
However it scales well on machines with
many cores and hyperthreading via the `num_threads` argument to
{meth}`generate_ancestors`. The limiting factor is often that the
entire genotype array for the contig being inferred needs to fit in RAM.
This is the high-water mark for memory usage in tsinfer.
- Note the `genotype_encoding` argument, setting this to
- {class}`tsinfer.GenotypeEncoding.ONE_BIT` reduces the memory footprint of
- the genotype array by a factor of 8, for a surprisingly small increase in
- runtime. With this encoding, the RAM needed is roughly
- `num_sites * num_samples * ploidy / 8 bytes.` However this encoding
- only supports biallelic sites, with no missingness.
+
+ If your data consists of only biallelic sites, with no missingness,
+ the `genotype_encoding` argument can be set to
+ {class}`tsinfer.GenotypeEncoding.ONE_BIT`, which reduces the memory footprint of
+ the genotype array by a factor of 8, such that the RAM needed is roughly
+ `num_sites * num_samples * ploidy / 8 bytes`. This memory optimisation
+ results in a surprisingly small increase in runtime.
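
For example, a minimal sketch of this step, assuming `vdata` is the
{class}`VariantData` object from the data preparation step and that every
site is biallelic with no missing genotypes (the dataset dimensions and
thread count below are illustrative):

```python
import tsinfer

# Back-of-envelope RAM estimate for the one-bit genotype encoding:
# num_sites * num_samples * ploidy / 8 bytes.
num_sites, num_samples, ploidy = 1_000_000, 100_000, 2
ram_bytes = num_sites * num_samples * ploidy / 8
print(f"~{ram_bytes / 1024**3:.0f} GiB needed")  # ~23 GiB

ancestors = tsinfer.generate_ancestors(
    vdata,
    num_threads=32,  # scales well across many cores and hyperthreads
    genotype_encoding=tsinfer.GenotypeEncoding.ONE_BIT,
)
```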
## Ancestor matching