diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index b7c47ca..a763d15 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -181,7 +181,7 @@ For example, limit the output to 100 sequences:
       --output-metadata subsampled_metadata.tsv
 
 Random sampling is easy to define but can expose sampling bias in some datasets.
-Consider uniform sampling to reduce sampling bias.
+Consider another sampling method to reduce sampling bias.
 
 Uniform sampling
 ----------------
@@ -242,6 +242,39 @@ is noted in the output:
 This is automatically enabled. To force the command to exit with an error in
 these situations, use ``--no-probabilistic-sampling``.
 
+Weighted sampling
+-----------------
+
+``--group-by-weights`` can be specified in addition to ``--group-by`` to allow
+different target sizes per group. For example, target twice the amount of
+sequences from Asia compared to other regions. First, create a file
+``weights.tsv``:
+
+.. code-block::
+
+    region weight
+    Asia 2
+    default 1
+    ...
+
+The format specifications are described in ``augur filter`` docs for
+``--group-by-weights``.
+
+Add the option by using ``--group-by-weights weights.tsv`` in the command:
+
+.. code-block:: bash
+
+    augur filter \
+      --sequences data/sequences.fasta \
+      --metadata data/metadata.tsv \
+      --min-date 2012 \
+      --exclude exclude.txt \
+      --group-by region year month \
+      --group-by-weights weights.tsv \
+      --subsample-max-sequences 100 \
+      --output-sequences subsampled_sequences.fasta \
+      --output-metadata subsampled_metadata.tsv
+
 Caveats
 -------
 
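
Note on the ``weights.tsv`` example added in this change: the snippet does not make the column separator visible. The shell sketch below shows one way to create the file. It assumes tab-separated columns (inferred from the ``.tsv`` extension, not confirmed by the diff) and writes only the header and the two rows shown; the rows elided by ``...`` are left out.

```shell
# Assumption: columns are separated by a literal tab, as the .tsv extension suggests.
# printf interprets \t and \n, so no editor-specific tab handling is needed.
printf 'region\tweight\nAsia\t2\ndefault\t1\n' > weights.tsv

# Sanity check: print the number of tab-separated fields on each line.
# Every line should report 2 fields if the separators were written correctly.
awk -F'\t' '{ print NF }' weights.tsv
```

The resulting file can then be passed as ``--group-by-weights weights.tsv``, as in the command shown in the diff. The ``augur filter`` docs for ``--group-by-weights`` remain the authority on the exact format.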