Commit 8ef3511

hyanwong authored and mergify[bot] committed
Tidy some wording in the large-scale docs
1 parent eb60ee5 commit 8ef3511

File tree

1 file changed: +15 −14 lines changed


docs/large_scale.md

Lines changed: 15 additions & 14 deletions
@@ -16,14 +16,14 @@ kernelspec:
 
 (sec_large_scale)=
 
-# Large Scale Inference
+# Large scale inference
 
 Generally, for up to a few thousand samples a single multi-core machine
-can infer a tree seqeunce in a few days. However, tsinfer has been
-successfully used with datasets up to half a million samples, where
-ancestor and sample matching can take several CPU-years.
+can infer a tree sequence in a few days, hours, or even minutes.
+However, _tsinfer_ has been successfully used with datasets up to half a million
+samples, where ancestor and sample matching can take several CPU-years.
 At this scale inference must be scaled across many machines.
-tsinfer provides specific APIs to enable this.
+_Tsinfer_ provides specific APIs to enable this.
 Here we detail considerations and tips for each step of the
 inference process to help you scale up your analysis. A snakemake pipeline
 which implements this parallelisation scheme is available as
@@ -34,8 +34,8 @@ which implements this parallelisation scheme is available as
 ## Data preparation
 
 For large scale inference the data must be in [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec)
-format, read by the {class}`VariantData` class. [bio2zarr](https://github.com/sgkit-dev/bio2zarr)
-is recommended for conversion from VCF. [sgkit](https://github.com/sgkit-dev/sgkit) can then
+format, read by the {class}`VariantData` class. [Bio2zarr](https://github.com/sgkit-dev/bio2zarr)
+is recommended for conversion from VCF, and [sgkit](https://github.com/sgkit-dev/sgkit) can then
 be used to perform initial filtering.
 
 :::{todo}
@@ -45,19 +45,20 @@ An upcoming tutorial will detail conversion from VCF to a VCF Zarr suitable for
 
 ## Ancestor generation
 
-Ancestor generation is generally the fastest step in inference and is not yet
+Ancestor generation is generally the fastest step in inference. It is not yet
 parallelised out-of-core in tsinfer and must be performed on a single machine.
 However it scales well on machines with
 many cores and hyperthreading via the `num_threads` argument to
 {meth}`generate_ancestors`. The limiting factor is often that the
 entire genotype array for the contig being inferred needs to fit in RAM.
 This is the high-water mark for memory usage in tsinfer.
-Note the `genotype_encoding` argument, setting this to
-{class}`tsinfer.GenotypeEncoding.ONE_BIT` reduces the memory footprint of
-the genotype array by a factor of 8, for a surprisingly small increase in
-runtime. With this encoding, the RAM needed is roughly
-`num_sites * num_samples * ploidy / 8 bytes.` However this encoding
-only supports biallelic sites, with no missingness.
+
+If your data consists of only biallelic sites, with no missingness,
+the `genotype_encoding` argument can be set to
+{class}`tsinfer.GenotypeEncoding.ONE_BIT` which reduces the memory footprint of
+the genotype array by a factor of 8, such that the RAM needed is roughly
+`num_sites * num_samples * ploidy / 8 bytes`. This memory optimisation
+results in a surprisingly small increase in runtime.
 
 ## Ancestor matching
 
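The RAM formula in the hunk above is easy to sanity-check with a back-of-envelope calculation. The helper below is an illustrative sketch, not part of tsinfer: it compares the default one-byte-per-genotype footprint with the `ONE_BIT` packed encoding.

```python
# Back-of-envelope RAM estimate for the genotype array discussed above.
# With ONE_BIT encoding each genotype takes one bit, so the array needs
# roughly num_sites * num_samples * ploidy / 8 bytes, versus one byte
# per genotype with the default encoding.

def genotype_array_bytes(num_sites, num_samples, ploidy, one_bit=False):
    """Approximate size in bytes of the in-RAM genotype array."""
    genotypes = num_sites * num_samples * ploidy
    return genotypes // 8 if one_bit else genotypes

# Example: 1 million sites, 100,000 diploid samples.
default_gb = genotype_array_bytes(10**6, 10**5, 2) / 10**9
one_bit_gb = genotype_array_bytes(10**6, 10**5, 2, one_bit=True) / 10**9
print(f"default: {default_gb:.0f} GB, ONE_BIT: {one_bit_gb:.0f} GB")
# → default: 200 GB, ONE_BIT: 25 GB
```

This shows why the factor-of-8 saving matters at half-a-million-sample scale: it can be the difference between fitting a contig in RAM on one machine and not.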
0 commit comments
