Commit

Nits
benjeffery committed Feb 6, 2025
1 parent d039461 commit 6c02df1
Showing 2 changed files with 8 additions and 7 deletions.
11 changes: 6 additions & 5 deletions docs/large_scale.md
@@ -45,7 +45,8 @@ Note the `genotype_encoding` argument, setting this to
{class}`tsinfer.GenotypeEncoding.ONE_BIT` reduces the memory footprint of
the genotype array by a factor of 8, for a surprisingly small increase in
runtime. With this encoding, the RAM needed is roughly
- `num_sites * num_samples * ploidy / 8 bytes.`
+ `num_sites * num_samples * ploidy / 8 bytes.` However this encoding
+ only supports biallelic sites, with no missingness.

## Ancestor matching

@@ -57,7 +58,7 @@ of a sample must be matched in an earlier group. For a typical human data set
the number of samples per group varies from single digits up to approximately
the number of samples.
The plot below shows the number of ancestors matched in each group for a typical
- human data set:
+ human data set, earlier groups are older ancestors:

```{figure} _static/ancestor_grouping.png
:width: 80%
@@ -103,9 +104,9 @@ the `working_dir`. Once all are complete a single call to
{meth}`match_ancestors_batch_group_finalise` will then insert the matches and
output the tree sequence to `work_dir`.

- At anypoint the process can be resumed from the last successfully completed call to
- {meth}`match_ancestors_batch_groups`. As the tree sequences in `work_dir` checkpoint the
- progress.
+ Each call to {meth}`match_ancestors_batch_groups` and {meth}`match_ancestors_batch_group_finalise` results in a tree sequence being written to `work_dir`.
+ These tree sequences are essentially checkpoints from with the batch matching workflow
+ can be resumed on job failure.

Finally after the final group, call {meth}`match_ancestors_batch_finalise` to
combine the groups into a single tree sequence.
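
Review note: the memory estimate quoted in the first hunk (`num_sites * num_samples * ploidy / 8` bytes for `ONE_BIT`) can be sanity-checked with a few lines of arithmetic. The sketch below is not part of this commit or of tsinfer. The helper names and the dataset dimensions are hypothetical, and it assumes the default encoding uses one byte per genotype, as implied by the stated factor-of-8 reduction.

```python
def one_bit_genotype_bytes(num_sites: int, num_samples: int, ploidy: int = 2) -> int:
    """Approximate bytes for a genotype array stored at one bit per genotype."""
    num_genotypes = num_sites * num_samples * ploidy
    # Round up so a partially filled final byte is still counted.
    return (num_genotypes + 7) // 8


def eight_bit_genotype_bytes(num_sites: int, num_samples: int, ploidy: int = 2) -> int:
    """Approximate bytes at one byte per genotype (assumed default encoding)."""
    return num_sites * num_samples * ploidy


# Hypothetical dataset: one million sites, 100,000 diploid samples.
sites, samples, ploidy = 1_000_000, 100_000, 2
one_bit = one_bit_genotype_bytes(sites, samples, ploidy)
eight_bit = eight_bit_genotype_bytes(sites, samples, ploidy)

print(f"ONE_BIT:   {one_bit / 1e9:.1f} GB")
print(f"EIGHT_BIT: {eight_bit / 1e9:.1f} GB")
```

For these illustrative dimensions the estimate is about 25 GB with one-bit encoding versus about 200 GB at one byte per genotype. This counts only the genotype array, not any other memory the process uses.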
4 changes: 2 additions & 2 deletions tsinfer/inference.py
@@ -647,7 +647,7 @@ def match_ancestors_batch_init(
:param int min_work_per_job: The minimum amount of work (as a count of genotypes) to
allocate to a single parallel job. If the amount of work in a group of ancestors
exceeds this level it will be broken up into parallel partitions, subject to
- the constriant of `max_num_partitions`.
+ the constraint of `max_num_partitions`.
:param int max_num_partitions: The maximum number of partitions to split a group of
ancestors into. Useful for limiting the number of jobs in a workflow to
avoid job overhead. Defaults to 1000.
@@ -1189,7 +1189,7 @@ def match_samples_batch_init(
:param int min_work_per_job: The minimum amount of work (as a count of
genotypes) to allocate to a single parallel job. If the amount of work in
a group of samples exceeds this level it will be broken up into parallel
- partitions, subject to the constriant of `max_num_partitions`.
+ partitions, subject to the constraint of `max_num_partitions`.
:param int max_num_partitions: The maximum number of partitions to split a
group of samples into. Useful for limiting the number of jobs in a
workflow to avoid job overhead. Defaults to 1000.
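
Review note: the two docstrings touched above describe how `min_work_per_job` and `max_num_partitions` interact. The sketch below shows one plausible reading of that description. It is an assumption for illustration only, not tsinfer's actual partitioning code, and the function name `sketch_num_partitions` is hypothetical. Only the default of 1000 comes from the docstring.

```python
def sketch_num_partitions(
    group_work: int,
    min_work_per_job: int,
    max_num_partitions: int = 1000,
) -> int:
    """Illustrative partition count for one group (NOT tsinfer's implementation).

    group_work is the amount of work in the group, as a count of genotypes.
    """
    if min_work_per_job <= 0 or max_num_partitions <= 0:
        raise ValueError("min_work_per_job and max_num_partitions must be positive")
    if group_work <= min_work_per_job:
        # Not enough work to justify splitting: a single job handles the group.
        return 1
    # Give each job at least min_work_per_job genotypes, capped by the limit.
    return max(1, min(group_work // min_work_per_job, max_num_partitions))


for work in (500, 10_000, 10_000_000):
    print(work, sketch_num_partitions(work, min_work_per_job=1_000))
```

Under this reading, small groups stay in one job, mid-sized groups split in proportion to their work, and very large groups are capped at `max_num_partitions` jobs.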

0 comments on commit 6c02df1
