From 096d3619344266db3a893bfd85d8b6b69bdf520c Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Mon, 19 Aug 2024 13:45:01 -0700
Subject: [PATCH 1/5] Fix typo

---
 src/guides/bioinformatics/filtering-and-subsampling.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index f6fa8f77..dd06037e 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -51,7 +51,7 @@ To drop such strains, you can pass the filename to ``--exclude``:
 Subsampling within ``augur filter``
 ===================================
 
-Another common filtering operation is subsetting of data to a achieve a more
+Another common filtering operation is subsetting of data to achieve a more
 even spatio-temporal distribution or to cut-down data set size to more
 manageable numbers. The filter command allows you to select a specific number of
 sequences from specific groups, for example one sequence per month from each

From 7475ecb8b7d4c7f3158011b5b5cfa92573cfdf13 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Fri, 16 Aug 2024 14:23:29 -0700
Subject: [PATCH 2/5] Clarify filtering docs

Reword some text and add an example for --query.
---
 .../filtering-and-subsampling.rst             | 56 ++++++++++++-------
 1 file changed, 35 insertions(+), 21 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index dd06037e..7fc54993 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -11,8 +11,9 @@ sample data.
 Filtering
 =========
 
-The filter command allows you to select various subsets of your input data for
-different types of analysis. A simple example use of this command would be
+``augur filter`` provides the flexibility to choose different subsets of input
+data for various types of analysis. A simple example would be to select all
+sequences with a collection date in 2012 or later using ``--min-date 2012``:
 
 .. code-block:: bash
 
@@ -23,30 +24,43 @@ different types of analysis. A simple example use of this command would be
      --output-sequences filtered_sequences.fasta \
      --output-metadata filtered_metadata.tsv
 
-This command will select all sequences with collection date in 2012 or later.
-The filter command has a large number of options that allow flexible filtering
-for many common situations. One such use-case is the exclusion of sequences that
-are known to be outliers (e.g. because of sequencing errors, cell-culture
-adaptation, ...). These can be specified in a separate text file (e.g.
-``exclude.txt``):
+There are several options that allow flexible filtering for many common
+situations. Below are additional examples.
 
-.. code-block::
+- Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation)
+  using ``--exclude``. First, create a text file ``exclude.txt`` with one line
+  per sequence ID:
 
-   BRA/2016/FC_DQ75D1
-   COL/FLR_00034/2015
-   ...
+  .. code-block::
 
-To drop such strains, you can pass the filename to ``--exclude``:
+      BRA/2016/FC_DQ75D1
+      COL/FLR_00034/2015
+      ...
 
-.. code-block:: bash
+  Add the option by using ``--exclude exclude.txt`` in the command:
 
-   augur filter \
-     --sequences data/sequences.fasta \
-     --metadata data/metadata.tsv \
-     --min-date 2012 \
-     --exclude exclude.txt \
-     --output-sequences filtered_sequences.fasta \
-     --output-metadata filtered_metadata.tsv
+  .. code-block:: bash
+
+      augur filter \
+        --sequences data/sequences.fasta \
+        --metadata data/metadata.tsv \
+        --min-date 2012 \
+        --exclude exclude.txt \
+        --output-sequences filtered_sequences.fasta \
+        --output-metadata filtered_metadata.tsv
+
+- Include sequences from a specific region using ``--query``:
+
+  .. code-block:: bash
+
+      augur filter \
+        --sequences data/sequences.fasta \
+        --metadata data/metadata.tsv \
+        --min-date 2012 \
+        --exclude exclude.txt \
+        --query 'region == "Asia"' \
+        --output-sequences filtered_sequences.fasta \
+        --output-metadata filtered_metadata.tsv
 
 Subsampling within ``augur filter``
 ===================================

From 3adb7659783fd473c1c83b2b664076e9e4e801a4 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Fri, 16 Aug 2024 12:40:10 -0700
Subject: [PATCH 3/5] Describe the order of operations for filtering options

Note that I'm introducing new terminology here: "preliminary" vs.
"subsampling" vs. "force-inclusive" filtering options. These are clearly
distinct in the order of operations, making these labels helpful for
explaining that process.

For "preliminary", I had considered a term such as "exclusive" to better
contrast with "force-inclusive". However, the expression syntax used for
options in this category can be both exclusive (--exclude-where
region!=Asia) and inclusive (--min-date 2012). This is also why
"inclusive" is not a sufficient name for the "force-inclusive" category.

Co-authored-by: James Hadfield <hadfield.james@gmail.com>
---
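
The order of operations described in this patch can be sketched with a toy
Python model. This is only an illustration under stated assumptions, not
augur's implementation: the function name `filter_strains`, the
strain-to-year record layout, and the use of `random.sample` for the
subsampling step are all hypothetical.

```python
import random


def filter_strains(records, min_year, exclude, max_sequences, include, seed=0):
    """Toy model of the order of operations; not augur's actual code.

    records: dict mapping strain name -> collection year (hypothetical layout).
    """
    # 1. Preliminary options: keep strains that match the criteria.
    kept = [
        strain
        for strain, year in records.items()
        if year >= min_year and strain not in exclude
    ]

    # 2. Subsampling: applied only to strains that passed step 1.
    rng = random.Random(seed)
    if len(kept) > max_sequences:
        kept = rng.sample(kept, max_sequences)

    # 3. Force-inclusive options: added back regardless of steps 1 and 2.
    forced = [s for s in include if s in records and s not in kept]
    return sorted(kept + forced)


records = {
    "Wuhan/Hu-1/2019": 2019,
    "A/2020": 2020,
    "B/2020": 2020,
    "C/2021": 2021,
    "D/2021": 2021,
}
result = filter_strains(
    records,
    min_year=2020,
    exclude={"D/2021"},
    max_sequences=2,
    include=["Wuhan/Hu-1/2019"],
)
print(result)
```

In this sketch the force-included strain is appended after subsampling, so
the output can exceed `max_sequences`, and the excluded strain never
reaches the subsampling step.
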
 .../filtering-and-subsampling.rst             | 123 +++++++++++++++---
 1 file changed, 107 insertions(+), 16 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index 7fc54993..a4e7dc29 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -8,24 +8,79 @@ sample data.
 .. contents:: Table of Contents
    :local:
 
-Filtering
-=========
+Overview
+========
 
 ``augur filter`` provides the flexibility to choose different subsets of input
-data for various types of analysis. A simple example would be to select all
-sequences with a collection date in 2012 or later using ``--min-date 2012``:
+data for various types of analysis. There are several options which can be
+categorized based on the information source and selection method.
+
+Information source:
+
+- **Metadata-based** options work with information available from
+  ``--metadata``.
+- **Sequence-based** options work with information available from
+  ``--sequences`` or ``--sequence-index``.
+
+Selection method:
+
+- **Preliminary** options work by selecting or dropping sequences that match
+  certain criteria.
+- **Subsampling** options work by selecting sequences using rules based on final
+  output size. These are applied after all preliminary options and before any
+  force-inclusive options.
+- **Force-inclusive** options work by ensuring sequences that match certain
+  criteria are always included in the output, ignoring all other filter options.
+
+.. list-table:: Categories for filter options
+   :header-rows: 1
+   :stub-columns: 1
+
+   * -
+     - Metadata-based
+     - Sequence-based
+   * - Preliminary
+     - * ``--min-date``
+       * ``--max-date``
+       * ``--exclude-ambiguous-dates-by``
+       * ``--exclude``
+       * ``--exclude-where``
+       * ``--query``
+     - * ``--min-length``
+       * ``--max-length``
+       * ``--non-nucleotide``
+
+   * - Subsampling
+     - * ``--subsample-max-sequences``
+       * ``--group-by``
+       * ``--sequences-per-group``
+       * ``--probabilistic-sampling``
+       * ``--no-probabilistic-sampling``
+       * ``--priority``
+     - *None*
+
+   * - Force-inclusive
+     - * ``--include``
+       * ``--include-where``
+     - *None*
+
+Preliminary & force-inclusive selection
+=======================================
+
+A common filtering operation is to select sequences according to rules on
+individual sequence attributes. Examples:
+
+- Select all sequences with a collection date in 2012 or later using
+  ``--min-date 2012``:
 
-.. code-block:: bash
-
-   augur filter \
-     --sequences data/sequences.fasta \
-     --metadata data/metadata.tsv \
-     --min-date 2012 \
-     --output-sequences filtered_sequences.fasta \
-     --output-metadata filtered_metadata.tsv
+  .. code-block:: bash
 
-There are several options that allow flexible filtering for many common
-situations. Below are additional examples.
+     augur filter \
+       --sequences data/sequences.fasta \
+       --metadata data/metadata.tsv \
+       --min-date 2012 \
+       --output-sequences filtered_sequences.fasta \
+       --output-metadata filtered_metadata.tsv
 
 - Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation)
   using ``--exclude``. First, create a text file ``exclude.txt`` with one line
@@ -62,8 +117,44 @@ situations. Below are additional examples.
         --output-sequences filtered_sequences.fasta \
         --output-metadata filtered_metadata.tsv
 
-Subsampling within ``augur filter``
-===================================
+  .. tip::
+
+      ``--query 'region == "Asia"'`` is functionally equivalent to ``--exclude-where
+      region!=Asia``. However, ``--query`` allows for more complex expressions such
+      as ``--query '(region in {"Asia", "Europe"}) & (coverage >= 0.95)'``.
+
+      ``--query 'region == "Asia"'`` is **not** equivalent to ``--include-where
+      region=Asia`` since force-inclusive options ignore other filter options
+      (i.e. ``--min-date`` and ``--exclude`` in the example above).
+
+Force-inclusive options work similarly, and override all other filtering
+options. Example:
+
+- Include specific sequences (e.g. root sequence) using ``--include``. First,
+  create a text file ``include.txt`` with one line per sequence ID:
+
+  .. code-block::
+
+      Wuhan/Hu-1/2019
+      ...
+
+  Add the option by using ``--include include.txt`` in the command:
+
+  .. code-block:: bash
+
+      augur filter \
+        --sequences data/sequences.fasta \
+        --metadata data/metadata.tsv \
+        --min-date 2020 \
+        --include include.txt \
+        --output-sequences filtered_sequences.fasta \
+        --output-metadata filtered_metadata.tsv
+
+  ``Wuhan/Hu-1/2019`` will still be included even if it does not pass the filter
+  ``--min-date 2020``.
+
+Subsampling
+===========
 
 Another common filtering operation is subsetting of data to achieve a more
 even spatio-temporal distribution or to cut-down data set size to more

From 7e17f40784293a4eb83c7801780f7197c19e5dba Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Thu, 15 Aug 2024 14:55:08 -0700
Subject: [PATCH 4/5] Clarify --sequences-per-group example

---
 src/guides/bioinformatics/filtering-and-subsampling.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index a4e7dc29..b8e0c702 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -158,9 +158,9 @@ Subsampling
 
 Another common filtering operation is subsetting of data to achieve a more
 even spatio-temporal distribution or to cut-down data set size to more
-manageable numbers. The filter command allows you to select a specific number of
-sequences from specific groups, for example one sequence per month from each
-country:
+manageable numbers. The filter command allows you to partition the data into
+groups based on column values and sample uniformly. For example, target one
+sequence per month from each country:
 
 .. code-block:: bash
 

From 5779a70fc3b8b5cc4c6bcdbf4b01a7360e6dd236 Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Thu, 15 Aug 2024 14:57:14 -0700
Subject: [PATCH 5/5] Add sections on sampling methods

Reword the subsampling introduction with *what* it is, followed by
examples on *why* paired with *how*.

This also allows future sampling methods such as weighted sampling to be
added by simply including a new section.
---
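
The arithmetic behind the "Probabilistic sampling" and "Caveats" sections
added in this patch can be checked with a small Python sketch. It follows
only the numbers given in the guide text; the helper names
`sequences_per_group` and `uniform_sample_size` are hypothetical, and
augur's actual calculation may differ in detail.

```python
def sequences_per_group(max_sequences, n_groups):
    """Per-group target implied by a total target (toy model)."""
    return max_sequences / n_groups


def uniform_sample_size(available_per_group, max_sequences):
    """Total output when each group contributes at most the per-group target."""
    target = int(sequences_per_group(max_sequences, len(available_per_group)))
    # A group with fewer sequences than the target contributes all it has.
    return sum(min(count, target) for count in available_per_group.values())


# Caveats example: 200 sequences in 2023, 100 in 2024, target of 300.
available = {"2023": 200, "2024": 100}
total = uniform_sample_size(available, 300)
print(total)

# Probabilistic sampling example: 5 countries x 24 months = 120 groups.
rate = sequences_per_group(100, 5 * 24)
print(f"{rate:.2f}")
```

The first value shows why the output can fall short of the target: the
per-group target is computed from the number of groups alone, so the
smaller group cannot fill its share. The second value is below 1, which is
the situation where a whole number of sequences per group has to be
determined randomly.
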
 .../filtering-and-subsampling.rst             | 89 +++++++++++++++++--
 1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index b8e0c702..e6d642d8 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -156,11 +156,56 @@ options. Example:
 Subsampling
 ===========
 
-Another common filtering operation is subsetting of data to achieve a more
-even spatio-temporal distribution or to cut-down data set size to more
-manageable numbers. The filter command allows you to partition the data into
-groups based on column values and sample uniformly. For example, target one
-sequence per month from each country:
+Another common filtering operation is **subsampling**: selection of data using
+rules based on output size rather than individual sequence attributes. The sections
+below cover the sampling methods supported by ``augur filter``, followed by caveats:
+
+.. contents::
+   :local:
+
+Random sampling
+---------------
+
+The simplest scenario is a reduction of dataset size to more manageable numbers.
+For example, limit the output to 100 sequences:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --subsample-max-sequences 100 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
+Random sampling is easy to define but carries over any sampling bias already
+present in the input. Consider uniform sampling to reduce that bias.
+
+Uniform sampling
+----------------
+
+``--group-by`` allows you to partition the data into groups based on column
+values and sample uniformly. For example, sample evenly across countries over
+time:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --group-by country year month \
+     --subsample-max-sequences 100 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
+An alternative to ``--subsample-max-sequences`` is ``--sequences-per-group``.
+This is useful if you care less about total sample size and more about having
+a fixed number of sequences from each group. For example, target one sequence
+per month from each country:
 
 .. code-block:: bash
 
@@ -174,6 +219,40 @@ sequence per month from each country:
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
 
+Probabilistic sampling
+----------------------
+
+It is possible to encounter situations in uniform sampling where the number of
+groups exceeds the target sample size. For example, consider a command with
+groups defined by ``--group-by country year month`` and target sample size
+defined by ``--subsample-max-sequences 100``. If the input contains data from 5
+countries over a span of 24 months, that could result in 120 groups.
+
+The only way to target 100 sequences from 120 groups is to apply **probabilistic
+sampling**, which randomly determines a whole number of sequences per group. This
+is noted in the output:
+
+.. code-block:: text
+
+   WARNING: Asked to provide at most 100 sequences, but there are 120 groups.
+   Sampling probabilistically at 0.83 sequences per group, meaning it is
+   possible to have more than the requested maximum of 100 sequences after
+   filtering.
+
+Probabilistic sampling is enabled by default. To force the command to exit with
+an error in these situations instead, use ``--no-probabilistic-sampling``.
+
+Caveats
+-------
+
+For these sampling methods, the number of targeted sequences per group does not
+take into account the actual number of sequences available in the input data.
+For example, consider a dataset with 200 sequences available from 2023 and 100
+sequences available from 2024. ``--group-by year --subsample-max-sequences 300``
+is equivalent to ``--group-by year --sequences-per-group 150``. This will take
+150 sequences from 2023 and all 100 sequences from 2024 for a total of 250
+sequences, which is less than the target of 300.
+
 Subsampling using multiple ``augur filter`` commands
 ====================================================