---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.12
    jupytext_version: 1.9.1
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

:::{currentmodule} tsinfer
:::

(sec_large_scale)=

# Large Scale Inference

tsinfer scales well and has been successfully used with datasets of up to half a
million samples. Here we detail considerations and tips for each step of the
inference process to help you scale up your analysis. A Snakemake pipeline
which implements this parallelisation scheme is available at https://github.com/benjeffery/tsinfer-snakemake.

## Data preparation

For large scale inference the data must be in [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec)
format, read by the {class}`VariantData` class. [bio2zarr](https://github.com/sgkit-dev/bio2zarr)
is recommended for conversion from VCF. [sgkit](https://github.com/sgkit-dev/sgkit) can then
be used to perform initial filtering.
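
For example (a minimal sketch; the `.vcz` path and the name of the
ancestral state array are placeholders for your own data):

```python
import tsinfer

# Assumes "chr20.vcz" was produced from VCF by bio2zarr, e.g. with
# something like: vcf2zarr convert chr20.vcf.gz chr20.vcz
# "ancestral_state" names an array in the Zarr store giving the
# ancestral allele at each site; the field name here is an assumption.
vdata = tsinfer.VariantData("chr20.vcz", ancestral_state="ancestral_state")
print(vdata.num_samples, vdata.num_sites)
```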

(sec_large_scale_ancestor_generation)=

## Ancestor generation

Ancestor generation is generally the fastest step in inference and is not yet
parallelised out-of-core in tsinfer. However, it scales well on machines with
many cores and hyperthreading via the `num_threads` argument to
{meth}`generate_ancestors`. The limiting factor is often that the
entire genotype array for the contig being inferred needs to fit in RAM;
this is the high-water mark for memory usage in tsinfer.
Note the `genotype_encoding` argument: setting this to
{class}`tsinfer.GenotypeEncoding.ONE_BIT` reduces the memory footprint of
the genotype array by a factor of 8, for a surprisingly small increase in
runtime. With this encoding, the RAM needed is roughly
`num_sites * num_samples * ploidy / 8` bytes.
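
A minimal sketch (the input object and file names follow the example
above; the arguments shown are described in the API reference):

```python
import tsinfer

vdata = tsinfer.VariantData("chr20.vcz", ancestral_state="ancestral_state")

# With ONE_BIT encoding the genotype array needs roughly
# num_sites * num_samples * ploidy / 8 bytes of RAM, e.g. for
# 1_000_000 sites and 100_000 diploid samples:
#   1_000_000 * 100_000 * 2 / 8 bytes = 25 GB
ancestors = tsinfer.generate_ancestors(
    vdata,
    path="chr20.ancestors",  # persist the ancestors to disk
    num_threads=32,          # scale to the cores available
    genotype_encoding=tsinfer.GenotypeEncoding.ONE_BIT,
)
```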

## Ancestor matching

Ancestor matching is one of the more time-consuming steps of inference. It
proceeds in groups, progressively growing the tree sequence with younger
ancestors. At each stage the parallelism is limited to the number of ancestors
in the current group: an ancestor can only be matched once all the older
ancestors it could possibly inherit from have been matched in an earlier
group. For a typical human data set the number of ancestors per group varies
from single digits up to approximately the number of samples.
The plot below shows the number of ancestors matched in each group for a typical
human data set:

```{figure} _static/ancestor_grouping.png
:width: 80%
```

There are five tsinfer API methods that can be used to parallelise ancestor
matching.

Initially {meth}`match_ancestors_batch_init` should be called to
set up the batch matching and to determine the groupings of ancestors.
This method writes a file `metadata.json` to the `work_dir` that contains
a JSON-encoded dictionary with configuration for later steps, and a key
`ancestor_grouping` which is a list of dictionaries, each containing the
list of ancestors in that group (key: `ancestors`) and a proposed partitioning of
those ancestors into sets that can be matched in parallel (key: `partitions`).
The dictionary is also returned by the method.
The partitioning is controlled by the `min_work_per_job` and `max_num_partitions`
arguments. Ancestors are placed in a partition until the sum of their lengths exceeds
`min_work_per_job`, at which point a new partition is started. However, the number of
partitions is not allowed to exceed `max_num_partitions`. It is suggested to set
`max_num_partitions` to around 3-4x the number of worker nodes available, and
`min_work_per_job` to around 2,000,000 for a typical human data set.

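For example (a sketch: apart from `work_dir`, `min_work_per_job` and
`max_num_partitions`, which are described above, the argument names here
are assumptions; check the API reference for the exact signature):

```python
import tsinfer

metadata = tsinfer.match_ancestors_batch_init(
    work_dir="chr20_work",              # batch state and checkpoints go here
    sample_data_path="chr20.vcz",       # assumed argument name
    ancestral_state="ancestral_state",  # assumed argument name
    ancestors_path="chr20.ancestors",   # assumed argument name
    min_work_per_job=2_000_000,         # suggested value for human data
    max_num_partitions=400,             # ~3-4x the worker nodes, assuming ~100
)
print(len(metadata["ancestor_grouping"]), "groups to match")
```
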
Each group is then matched in turn, either by calling {meth}`match_ancestors_batch_groups`
to match without partitioning, or by calling {meth}`match_ancestors_batch_group_partition`
many times in parallel followed by a single call to {meth}`match_ancestors_batch_group_finalise`.
Each call to {meth}`match_ancestors_batch_groups` or {meth}`match_ancestors_batch_group_finalise`
outputs the tree sequence to `work_dir`, where it is then used by the next group. The length of
the `ancestor_grouping` list in the metadata dictionary determines the group numbers that these
methods will need to be called for, and the length of the `partitions` list in each group
determines the number of calls to {meth}`match_ancestors_batch_group_partition` that are
needed (if any).

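The overall control flow looks roughly like the following sequential
sketch (on a real cluster each call would be a submitted job, and any
argument names beyond those described in this section are assumptions):

```python
import tsinfer

work_dir = "chr20_work"

for group_index, group in enumerate(metadata["ancestor_grouping"]):
    if len(group["partitions"]) <= 1:
        # Match the whole group in one multi-threaded call.
        tsinfer.match_ancestors_batch_groups(
            work_dir, group_index, group_index + 1, num_threads=32
        )
    else:
        # On a cluster these single-core calls would run in parallel.
        for partition_index in range(len(group["partitions"])):
            tsinfer.match_ancestors_batch_group_partition(
                work_dir, group_index, partition_index
            )
        # Combine the partitions and write this group's tree sequence.
        tsinfer.match_ancestors_batch_group_finalise(work_dir, group_index)

# Combine all groups into a single ancestors tree sequence.
ancestors_ts = tsinfer.match_ancestors_batch_finalise(work_dir)
```
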
{meth}`match_ancestors_batch_groups` matches groups, without partitioning, from
`group_index_start` (inclusive) to `group_index_end` (exclusive). Combining
many groups into one call reduces the overhead from job submission and start-up
times, but note that on job failure the process can only be resumed from the
last `group_index_end`.

To match a single group in parallel, call {meth}`match_ancestors_batch_group_partition`
once for each partition listed in the `ancestor_grouping[group_index]['partitions']` list,
incrementing `partition_index` each time. This will match the ancestors, placing the match
data in the `work_dir`. Once all partitions are complete, a single call to
{meth}`match_ancestors_batch_group_finalise` will then insert the matches and
output the tree sequence to `work_dir`.

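For example, on a single large machine the partition calls for one group
could be dispatched with a process pool (a sketch; on a cluster each call
would instead be its own submitted job):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import tsinfer

group_index = 30  # hypothetical group with many partitions
num_partitions = len(metadata["ancestor_grouping"][group_index]["partitions"])

# Each partition call uses a single core, so run them side by side.
match_partition = partial(
    tsinfer.match_ancestors_batch_group_partition, "chr20_work", group_index
)
with ProcessPoolExecutor() as pool:
    list(pool.map(match_partition, range(num_partitions)))

# Insert the matches and write this group's tree sequence to work_dir.
tsinfer.match_ancestors_batch_group_finalise("chr20_work", group_index)
```
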
At any point the process can be resumed from the last successfully completed call to
{meth}`match_ancestors_batch_groups`, as the tree sequences written to `work_dir`
checkpoint the progress.

Finally, after the last group, call {meth}`match_ancestors_batch_finalise` to
combine the groups into a single tree sequence.

The partitioning in `metadata.json` does not have to be used for every group. As the early
groups are not matching against a large tree sequence, it is often faster not to partition
the first half of the groups, depending on job set-up and queueing time on your cluster.

Calls to {meth}`match_ancestors_batch_group_partition` will only use a single core, but
{meth}`match_ancestors_batch_groups` will use as many cores as `num_threads` is set to.
This value, and the cluster resources requested, should therefore be scaled with the number
of ancestors being matched, which can be read from the metadata dictionary.
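
For instance, a job script might size its thread request from the group
sizes in the metadata (a hypothetical heuristic, not part of the tsinfer
API):

```python
# Hypothetical sizing heuristic: one thread per ~250 ancestors in the
# largest group being matched, capped at the cores on a worker node.
MAX_CORES = 64
largest = max(len(g["ancestors"]) for g in metadata["ancestor_grouping"])
num_threads = min(MAX_CORES, max(1, largest // 250))
```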

## Sample matching

Sample matching is far simpler than ancestor matching, as it is essentially the same as
matching a single group of ancestors. There are three API methods that work together to
enable distributed sample matching.
{meth}`match_samples_batch_init` should be called to set up the batch matching and to
determine the groupings of samples. Like {meth}`match_ancestors_batch_init`, it has
`min_work_per_job` and `max_num_partitions` arguments to control the level of parallelism.
The method writes a file `metadata.json` to the directory `work_dir` that contains a
JSON-encoded dictionary with configuration for later steps. This is also returned by the
call. The `num_partitions` key in this dictionary is the number of times
{meth}`match_samples_batch_partition` will need to be called, with each partition index as
the `partition_index` argument. These calls can happen in parallel and write match data to
the `work_dir`, which is then used by {meth}`match_samples_batch_finalise` to output the
final tree sequence.
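
Putting this together (a sketch; argument names other than `work_dir`,
`min_work_per_job`, `max_num_partitions` and `partition_index` are
assumptions, so check the API reference for the exact signature):

```python
import tsinfer

metadata = tsinfer.match_samples_batch_init(
    work_dir="chr20_samples_work",
    sample_data_path="chr20.vcz",               # assumed argument name
    ancestor_ts_path="chr20.ancestors.trees",   # assumed argument name
    min_work_per_job=2_000_000,
    max_num_partitions=400,
)

# These calls are independent: run them as parallel cluster jobs.
for partition_index in range(metadata["num_partitions"]):
    tsinfer.match_samples_batch_partition(
        work_dir="chr20_samples_work",
        partition_index=partition_index,
    )

# Combine the match data into the final inferred tree sequence.
ts = tsinfer.match_samples_batch_finalise("chr20_samples_work")
ts.dump("chr20.inferred.trees")
```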