From 096d3619344266db3a893bfd85d8b6b69bdf520c Mon Sep 17 00:00:00 2001 From: Victor Lin <13424970+victorlin@users.noreply.github.com> Date: Mon, 19 Aug 2024 13:45:01 -0700 Subject: [PATCH 1/5] Fix typo --- src/guides/bioinformatics/filtering-and-subsampling.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst index f6fa8f77..dd06037e 100644 --- a/src/guides/bioinformatics/filtering-and-subsampling.rst +++ b/src/guides/bioinformatics/filtering-and-subsampling.rst @@ -51,7 +51,7 @@ To drop such strains, you can pass the filename to ``--exclude``: Subsampling within ``augur filter`` =================================== -Another common filtering operation is subsetting of data to a achieve a more +Another common filtering operation is subsetting of data to achieve a more even spatio-temporal distribution or to cut-down data set size to more manageable numbers. The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each From 7475ecb8b7d4c7f3158011b5b5cfa92573cfdf13 Mon Sep 17 00:00:00 2001 From: Victor Lin <13424970+victorlin@users.noreply.github.com> Date: Fri, 16 Aug 2024 14:23:29 -0700 Subject: [PATCH 2/5] Clarify filtering docs Reword some text and add an example for --query. --- .../filtering-and-subsampling.rst | 56 ++++++++++++------- 1 file changed, 35 insertions(+), 21 deletions(-) diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst index dd06037e..7fc54993 100644 --- a/src/guides/bioinformatics/filtering-and-subsampling.rst +++ b/src/guides/bioinformatics/filtering-and-subsampling.rst @@ -11,8 +11,9 @@ sample data. Filtering ========= -The filter command allows you to select various subsets of your input data for -different types of analysis. 
A simple example use of this command would be +``augur filter`` provides the flexibility to choose different subsets of input +data for various types of analysis. A simple example would be to select all +sequences with a collection date in 2012 or later using ``--min-date 2012``: .. code-block:: bash @@ -23,30 +24,43 @@ different types of analysis. A simple example use of this command would be --output-sequences filtered_sequences.fasta \ --output-metadata filtered_metadata.tsv -This command will select all sequences with collection date in 2012 or later. -The filter command has a large number of options that allow flexible filtering -for many common situations. One such use-case is the exclusion of sequences that -are known to be outliers (e.g. because of sequencing errors, cell-culture -adaptation, ...). These can be specified in a separate text file (e.g. -``exclude.txt``): +There are several options that allow flexible filtering for many common +situations. Below are additional examples. -.. code-block:: +- Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation) + using ``--exclude``. First, create a text file ``exclude.txt`` with one line + per sequence ID: - BRA/2016/FC_DQ75D1 - COL/FLR_00034/2015 - ... + .. code-block:: -To drop such strains, you can pass the filename to ``--exclude``: + BRA/2016/FC_DQ75D1 + COL/FLR_00034/2015 + ... -.. code-block:: bash + Add the option by using ``--exclude exclude.txt`` in the command: - augur filter \ - --sequences data/sequences.fasta \ - --metadata data/metadata.tsv \ - --min-date 2012 \ - --exclude exclude.txt \ - --output-sequences filtered_sequences.fasta \ - --output-metadata filtered_metadata.tsv + .. 
code-block:: bash + + augur filter \ + --sequences data/sequences.fasta \ + --metadata data/metadata.tsv \ + --min-date 2012 \ + --exclude exclude.txt \ + --output-sequences filtered_sequences.fasta \ + --output-metadata filtered_metadata.tsv + +- Include sequences from a specific region using ``--query``: + + .. code-block:: bash + + augur filter \ + --sequences data/sequences.fasta \ + --metadata data/metadata.tsv \ + --min-date 2012 \ + --exclude exclude.txt \ + --query 'region=="Asia"' \ + --output-sequences filtered_sequences.fasta \ + --output-metadata filtered_metadata.tsv Subsampling within ``augur filter`` =================================== From 3adb7659783fd473c1c83b2b664076e9e4e801a4 Mon Sep 17 00:00:00 2001 From: Victor Lin <13424970+victorlin@users.noreply.github.com> Date: Fri, 16 Aug 2024 12:40:10 -0700 Subject: [PATCH 3/5] Describe the order of operations for filtering options Note that I'm introducing new terminology here: "preliminary" vs. "subsampling" vs. "force-inclusive" filtering options. These are clearly distinct in the order of operations, making these labels helpful for explaining that process. For "preliminary", I had considered a term such as "exclusive" to better contrast with "force-inclusive". However, the expression syntax used for options in this category can be both exclusive (--exclude-where region!=Asia) and inclusive (--min-date 2012). This is also why "inclusive" is not a sufficient name for the "force-inclusive" category. Co-authored-by: James Hadfield --- .../filtering-and-subsampling.rst | 123 +++++++++++++++--- 1 file changed, 107 insertions(+), 16 deletions(-) diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst index 7fc54993..a4e7dc29 100644 --- a/src/guides/bioinformatics/filtering-and-subsampling.rst +++ b/src/guides/bioinformatics/filtering-and-subsampling.rst @@ -8,24 +8,79 @@ sample data. ..
contents:: Table of Contents :local: -Filtering -========= +Overview +======== ``augur filter`` provides the flexibility to choose different subsets of input -data for various types of analysis. A simple example would be to select all -sequences with a collection date in 2012 or later using ``--min-date 2012``: +data for various types of analysis. There are several options which can be +categorized based on the information source and selection method. + +Information source: + +- **Metadata-based** options work with information available from + ``--metadata``. +- **Sequence-based** options work with information available from + ``--sequences`` or ``--sequence-index``. + +Selection method: + +- **Preliminary** options work by selecting or dropping sequences that match + certain criteria. +- **Subsampling** options work by selecting sequences using rules based on final + output size. These are applied after all preliminary options and before any + force-inclusive options. +- **Force-inclusive** options work by ensuring sequences that match certain + criteria are always included in the output, ignoring all other filter options. + +.. list-table:: Categories for filter options + :header-rows: 1 + :stub-columns: 1 + + * - + - Metadata-based + - Sequence-based + * - Preliminary + - * ``--min-date`` + * ``--max-date`` + * ``--exclude-ambiguous-dates-by`` + * ``--exclude`` + * ``--exclude-where`` + * ``--query`` + - * ``--min-length`` + * ``--max-length`` + * ``--non-nucleotide`` + + * - Subsampling + - * ``--subsample-max-sequences`` + * ``--group-by`` + * ``--sequences-per-group`` + * ``--probabilistic-sampling`` + * ``--no-probabilistic-sampling`` + * ``--priority`` + - *None* + + * - Force-inclusive + - * ``--include`` + * ``--include-where`` + - *None* + +Preliminary & force-inclusive selection +======================================= + +A common filtering operation is to select sequences according to rules on +individual sequence attributes. 
Examples: + +- Select all sequences with a collection date in 2012 or later using + ``--min-date 2012``: -.. code-block:: bash - - augur filter \ - --sequences data/sequences.fasta \ - --metadata data/metadata.tsv \ - --min-date 2012 \ - --output-sequences filtered_sequences.fasta \ - --output-metadata filtered_metadata.tsv + .. code-block:: bash -There are several options that allow flexible filtering for many common -situations. Below are additional examples. + augur filter \ + --sequences data/sequences.fasta \ + --metadata data/metadata.tsv \ + --min-date 2012 \ + --output-sequences filtered_sequences.fasta \ + --output-metadata filtered_metadata.tsv - Exclude outliers (e.g. because of sequencing errors, cell-culture adaptation) using ``--exclude``. First, create a text file ``exclude.txt`` with one line @@ -62,8 +117,44 @@ situations. Below are additional examples. --output-sequences filtered_sequences.fasta \ --output-metadata filtered_metadata.tsv -Subsampling within ``augur filter`` -=================================== + .. tip:: + + ``--query 'region=="Asia"'`` is functionally equivalent to ``--exclude-where + region!=Asia``. However, ``--query`` allows for more complex expressions such + as ``--query '(region in ["Asia", "Europe"]) & (coverage >= 0.95)'``. + + ``--query 'region=="Asia"'`` is **not** equivalent to ``--include-where + region=Asia`` since force-inclusive options ignore other filter options + (i.e. ``--min-date`` and ``--exclude`` in the example above). + +Force-inclusive options work similarly, and override all other filtering +options. Example: + +- Include specific sequences (e.g. root sequence) using ``--include``. First, + create a text file ``include.txt`` with one line per sequence ID: + + .. code-block:: + + Wuhan/Hu-1/2019 + ... + + Add the option by using ``--include include.txt`` in the command: + + ..
code-block:: bash + + augur filter \ + --sequences data/sequences.fasta \ + --metadata data/metadata.tsv \ + --min-date 2020 \ + --include include.txt \ + --output-sequences filtered_sequences.fasta \ + --output-metadata filtered_metadata.tsv + + ``Wuhan/Hu-1/2019`` will still be included even if it does not pass the filter + ``--min-date 2020``. + +Subsampling +=========== Another common filtering operation is subsetting of data to achieve a more even spatio-temporal distribution or to cut-down data set size to more From 7e17f40784293a4eb83c7801780f7197c19e5dba Mon Sep 17 00:00:00 2001 From: Victor Lin <13424970+victorlin@users.noreply.github.com> Date: Thu, 15 Aug 2024 14:55:08 -0700 Subject: [PATCH 4/5] Clarify --sequences-per-group example --- src/guides/bioinformatics/filtering-and-subsampling.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst index a4e7dc29..b8e0c702 100644 --- a/src/guides/bioinformatics/filtering-and-subsampling.rst +++ b/src/guides/bioinformatics/filtering-and-subsampling.rst @@ -158,9 +158,9 @@ Subsampling Another common filtering operation is subsetting of data to achieve a more even spatio-temporal distribution or to cut-down data set size to more -manageable numbers. The filter command allows you to select a specific number of -sequences from specific groups, for example one sequence per month from each -country: +manageable numbers. The filter command allows you to partition the data into +groups based on column values and sample uniformly. For example, target one +sequence per month from each country: .. 
code-block:: bash From 5779a70fc3b8b5cc4c6bcdbf4b01a7360e6dd236 Mon Sep 17 00:00:00 2001 From: Victor Lin <13424970+victorlin@users.noreply.github.com> Date: Thu, 15 Aug 2024 14:57:14 -0700 Subject: [PATCH 5/5] Add sections on sampling methods Reword the subsampling introduction with *what* it is, followed by examples on *why* paired with *how*. This also allows future sampling methods such as weighted sampling to be added by simply including a new section. --- .../filtering-and-subsampling.rst | 89 +++++++++++++++++-- 1 file changed, 84 insertions(+), 5 deletions(-) diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst index b8e0c702..e6d642d8 100644 --- a/src/guides/bioinformatics/filtering-and-subsampling.rst +++ b/src/guides/bioinformatics/filtering-and-subsampling.rst @@ -156,11 +156,56 @@ options. Example: Subsampling =========== -Another common filtering operation is subsetting of data to achieve a more -even spatio-temporal distribution or to cut-down data set size to more -manageable numbers. The filter command allows you to partition the data into -groups based on column values and sample uniformly. For example, target one -sequence per month from each country: +Another common filtering operation is **subsampling**: selection of data using +rules based on output size rather than individual sequence attributes. These are +the sampling methods supported by ``augur filter`` and a final section for caveats: + +.. contents:: + :local: + +Random sampling +--------------- + +The simplest scenario is a reduction of dataset size to more manageable numbers. +For example, limit the output to 100 sequences: + +.. 
code-block:: bash + + augur filter \ + --sequences data/sequences.fasta \ + --metadata data/metadata.tsv \ + --min-date 2012 \ + --exclude exclude.txt \ + --subsample-max-sequences 100 \ + --output-sequences subsampled_sequences.fasta \ + --output-metadata subsampled_metadata.tsv + +Random sampling is easy to define but can expose sampling bias in some datasets. +Consider uniform sampling to reduce sampling bias. + +Uniform sampling +---------------- + +``--group-by`` allows you to partition the data into groups based on column +values and sample uniformly. For example, sample evenly across countries over +time: + +.. code-block:: bash + + augur filter \ + --sequences data/sequences.fasta \ + --metadata data/metadata.tsv \ + --min-date 2012 \ + --exclude exclude.txt \ + --group-by country year month \ + --subsample-max-sequences 100 \ + --output-sequences subsampled_sequences.fasta \ + --output-metadata subsampled_metadata.tsv + +An alternative to ``--subsample-max-sequences`` is ``--sequences-per-group``. +This is useful if you care less about total sample size and more about having +a fixed number of sequences from each group. For example, target one sequence +per month from each country: .. code-block:: bash @@ -174,6 +219,40 @@ sequence per month from each country: --output-sequences subsampled_sequences.fasta \ --output-metadata subsampled_metadata.tsv +Probabilistic sampling +---------------------- + +It is possible to encounter situations in uniform sampling where the number of +groups exceeds the target sample size. For example, consider a command with +groups defined by ``--group-by country year month`` and target sample size +defined by ``--subsample-max-sequences 100``. If the input contains data from 5 +countries over a span of 24 months, that could result in 120 groups. + +The only way to target 100 sequences from 120 groups is to apply **probabilistic +sampling** which randomly determines a whole number of sequences per group. 
This +is noted in the output: + +.. code-block:: text + + WARNING: Asked to provide at most 100 sequences, but there are 120 groups. + Sampling probabilistically at 0.83 sequences per group, meaning it is + possible to have more than the requested maximum of 100 sequences after + filtering. + +This is automatically enabled. To force the command to exit with an error in +these situations, use ``--no-probabilistic-sampling``. + +Caveats +------- + +For these sampling methods, the number of targeted sequences per group does not +take into account the actual number of sequences available in the input data. +For example, consider a dataset with 200 sequences available from 2023 and 100 +sequences available from 2024. ``--group-by year --subsample-max-sequences 300`` +is equivalent to ``--group-by year --sequences-per-group 150``. This will take +150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 +sequences, which is less than the target of 300. + Subsampling using multiple ``augur filter`` commands ====================================================