From 6c02df19f502be6ff5b9833ceb6f17a8b2818789 Mon Sep 17 00:00:00 2001
From: Ben Jeffery
Date: Thu, 6 Feb 2025 13:53:48 +0000
Subject: [PATCH] Nits

---
 docs/large_scale.md  | 11 ++++++-----
 tsinfer/inference.py |  4 ++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/docs/large_scale.md b/docs/large_scale.md
index 16c76d52..eb4dca59 100644
--- a/docs/large_scale.md
+++ b/docs/large_scale.md
@@ -45,7 +45,8 @@ Note the `genotype_encoding` argument, setting this to
 {class}`tsinfer.GenotypeEncoding.ONE_BIT` reduces the memory footprint of
 the genotype array by a factor of 8, for a surprisingly small increase in
 runtime. With this encoding, the RAM needed is roughly
-`num_sites * num_samples * ploidy / 8 bytes.`
+`num_sites * num_samples * ploidy / 8 bytes.` However this encoding
+only supports biallelic sites, with no missingness.
 
 ## Ancestor matching
 
@@ -57,7 +58,7 @@ of a sample must be matched in an earlier group. For a typical human data set
 the number of samples per group varies from single digits up to approximately
 the number of samples. The plot below shows the number of ancestors matched in
 each group for a typical
-human data set:
+human data set, earlier groups are older ancestors:
 
 ```{figure} _static/ancestor_grouping.png
 :width: 80%
@@ -103,9 +104,9 @@ the `working_dir`. Once all are complete a single call to
 {meth}`match_ancestors_batch_group_finalise` will then insert the matches and
 output the tree sequence to `work_dir`.
 
-At anypoint the process can be resumed from the last successfully completed call to
-{meth}`match_ancestors_batch_groups`. As the tree sequences in `work_dir` checkpoint the
-progress.
+Each call to {meth}`match_ancestors_batch_groups` and {meth}`match_ancestors_batch_group_finalise` results in a tree sequence being written to `work_dir`.
+These tree sequences are essentially checkpoints from with the batch matching workflow
+can be resumed on job failure.
 
 Finally after the final group, call {meth}`match_ancestors_batch_finalise` to
 combine the groups into a single tree sequence.
diff --git a/tsinfer/inference.py b/tsinfer/inference.py
index 4d5e5864..858a91fd 100644
--- a/tsinfer/inference.py
+++ b/tsinfer/inference.py
@@ -647,7 +647,7 @@ def match_ancestors_batch_init(
     :param int min_work_per_job: The minimum amount of work (as a count of genotypes) to
         allocate to a single parallel job. If the amount of work in a group of ancestors
         exceeds this level it will be broken up into parallel partitions, subject to
-        the constriant of `max_num_partitions`.
+        the constraint of `max_num_partitions`.
     :param int max_num_partitions: The maximum number of partitions to split a
         group of ancestors into. Useful for limiting the number of jobs in a workflow to
         avoid job overhead. Defaults to 1000.
@@ -1189,7 +1189,7 @@ def match_samples_batch_init(
     :param int min_work_per_job: The minimum amount of work (as a count of
         genotypes) to allocate to a single parallel job. If the amount of work in a
         group of samples exceeds this level it will be broken up into parallel
-        partitions, subject to the constriant of `max_num_partitions`.
+        partitions, subject to the constraint of `max_num_partitions`.
     :param int max_num_partitions: The maximum number of partitions to split a
         group of samples into. Useful for limiting the number of jobs in a workflow to
         avoid job overhead. Defaults to 1000.
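
Note on the memory estimate in the `docs/large_scale.md` hunk: the formula `num_sites * num_samples * ploidy / 8` bytes can be sanity-checked with a quick calculation. The sketch below is plain arithmetic and does not call tsinfer; the function name `one_bit_genotype_ram_bytes`, the default ploidy of 2, and the example data set size are assumptions made for illustration only.

```python
def one_bit_genotype_ram_bytes(num_sites: int, num_samples: int, ploidy: int = 2) -> float:
    """Rough RAM estimate for a one-bit-per-genotype array.

    Hypothetical helper: it only evaluates the formula quoted in the docs
    (num_sites * num_samples * ploidy / 8 bytes). Default ploidy is an assumption.
    """
    # One bit per haploid genotype, eight bits per byte.
    return num_sites * num_samples * ploidy / 8


# Illustrative sizes only, not taken from the patch.
estimate = one_bit_genotype_ram_bytes(num_sites=1_000_000, num_samples=100_000)
print(f"{estimate / 1e9:.1f} GB")
```

Under the same formula, an encoding that used a full byte per genotype would need eight times as much, which matches the "factor of 8" quoted in the documentation text.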