Merge pull request #1372 from nextstrain/docs-v3
ivan-aksamentov authored Jan 9, 2024
2 parents 3367917 + c88bff4 commit 9908d2d
Showing 48 changed files with 538 additions and 441 deletions.
151 changes: 86 additions & 65 deletions CHANGELOG.md

Large diffs are not rendered by default.

13 changes: 3 additions & 10 deletions README.md
@@ -16,16 +16,9 @@ by Nextstrain team
</a>
</p>

<p align="center">
<a href="https://raw.githubusercontent.com/nextstrain/nextclade/master/docs/assets/ui.gif" target="_blank" rel="noopener noreferrer" alt="Link to animated screenshot of the application, showcasing the user interface on main page">
<img
width="100%"
height="auto"
src="https://raw.githubusercontent.com/nextstrain/nextclade/master/docs/assets/ui.gif"
alt="Animated screenshot of the application, showcasing the user interface on main page"
/>
</a>
</p>
| <video controls autoplay loop muted src="https://github.com/nextstrain/nextclade/assets/9403403/9bf0bab5-b7ee-4161-96a6-23e76ddb56b4" width="680"></video> |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Brief demonstration of Nextclade Web. Large version is <a target="_blank" href="https://github.com/nextstrain/nextclade/assets/9403403/9bf0bab5-b7ee-4161-96a6-23e76ddb56b4">here</a>. |

<p align="center">
<a target="_blank" rel="noopener noreferrer" href="LICENSE">
17 changes: 17 additions & 0 deletions docs/dev/docs-meta.md
@@ -0,0 +1,17 @@
### Categories of Nextclade users

| | Nextclade Web users | Nextclade CLI users | Dataset authors | Software developers |
|-----------------------------------------------------------|------------------------------|-------------------------------------------------|---------------------------------------|----------------------|
| Skills | Basic web users | Advanced users, CLI | Experienced users, CLI, phylo | Software development |
| Relation to datasets | They use datasets implicitly | They use datasets explicitly (download and run) | They create and maintain datasets | Mixed |
| Need nextstrain/nextclade repo | | | ||
| Need nextstrain/nextclade_data repo | | |||
| Tools needed | | | Tools to create and maintain datasets | |
| Need Web docs (on RTD) |||||
| Need CLI docs (on RTD) |||||
| Need algo docs (on RTD) |||||
| Need input files docs (on RTD) |||||
| Need output files docs (on RTD) |||||
| Need dataset user docs (on RTD) |||||
| Need dataset curation docs (in nextstrain/nextclade_data) |||||
| Need software dev docs (in nextstrain/nextclade) |||||
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -52,7 +52,7 @@ Nextclade is a part of `Nextstrain <https://nextstrain.org>`_, an open-source pr

user/migration-v3

user/nextclade-web
user/nextclade-web/index
user/nextclade-cli/index

user/datasets
10 changes: 5 additions & 5 deletions docs/user/algorithm/02-translation.md
@@ -1,15 +1,15 @@
# 2. Translation

In order to detect changes in viral proteins, aminoacid sequences (peptides) need to be computed from the nucleotide sequence regions corresponding to [coding sequences (CDS)](https://en.wikipedia.org/wiki/Coding_region). This process is called [translation](<https://en.wikipedia.org/wiki/Translation_(biology)>). Protein sequences then need to be aligned, in order to make them comparable, similarly to how it's done with nucleotide sequences.
In order to detect changes in viral proteins, amino acid sequences (peptides) need to be computed from the nucleotide sequence regions corresponding to [coding sequences (CDS)](https://en.wikipedia.org/wiki/Coding_region). This process is called [translation](<https://en.wikipedia.org/wiki/Translation_(biology)>). Peptide sequences then need to be aligned, in order to make them comparable, similarly to how it's [done](./01-sequence-alignment) with nucleotide sequences.

Nextclade performs translation separately for every CDS. CDS are specified in a genome annotation file, previously called [Gene map](../terminology.html#gene-map), and can consist of multiple segments that correspond to ranges in the genome that are combined into a contiguous CDS. The list of CDS to be considered for translation is configurable in [Nextclade CLI](../nextclade-cli) and if it's not specified, all CDS found in the annotation are translated.
Nextclade performs translation separately for every CDS. CDS are specified in a [genome annotation file](../input-files/03-genome-annotation.md), previously called [Gene map](../terminology.html#gene-map), and can consist of multiple segments that correspond to ranges in the genome that are combined into a contiguous CDS. The list of CDS to be considered for translation is configurable in [Nextclade CLI](../nextclade-cli) and if it's not specified, all CDS found in the annotation are translated.

For each coding sequence in the annotation, Nextclade extracts the corresponding sequence from the nucleotide alignment, and then generates peptides by taking every triplet of nucleotides (codon) and translating it into a corresponding aminoacid. It then aligns the resulting peptides against the corresponding reference peptides (translated from reference sequence), using the same alignment algorithm as for nucleotide sequences.
For each coding sequence in the annotation, Nextclade extracts the corresponding sequence from the nucleotide alignment, and then generates peptides by taking every triplet of nucleotides (codon) and translating it into a corresponding amino acid. It then aligns the resulting peptides against the corresponding reference peptides (translated from reference sequence), using the same alignment algorithm as for nucleotide sequences.

This step only runs if an annotation is provided.
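
The extraction and codon-by-codon translation described above can be sketched roughly as follows. This is a minimal illustration, not Nextclade's actual implementation: the function names `extract_cds` and `translate`, the use of the standard genetic code, and the handling of gap codons and unknown codons (`-` and `X`) are all assumptions made for this example. Segments are assumed to be 1-based inclusive ranges on the aligned sequence.

```python
from itertools import product

# Standard genetic code (assumption: the sketch ignores alternative codon tables).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_cds(aligned_seq, segments):
    """Concatenate CDS segments (hypothetical 1-based inclusive ranges)
    taken from an aligned nucleotide sequence into one contiguous CDS."""
    return "".join(aligned_seq[start - 1:end] for start, end in segments)


def translate(cds):
    """Translate a CDS codon by codon.

    A fully deleted codon becomes '-', and any codon that is not in the
    table (for example one containing N or a partial gap) becomes 'X'.
    Trailing nucleotides that do not fill a codon are ignored.
    """
    peptide = []
    usable_length = len(cds) - len(cds) % 3
    for i in range(0, usable_length, 3):
        codon = cds[i:i + 3]
        if codon == "---":
            peptide.append("-")
        else:
            peptide.append(CODON_TABLE.get(codon, "X"))
    return "".join(peptide)


# A CDS made of two segments that are joined before translation.
cds = extract_cds("ATGAAGGCCTAA", [(1, 3), (7, 12)])
print(translate(cds))
```

In the real pipeline the resulting peptide is then aligned against the reference peptide; that alignment step is not shown here.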

### Results

The translation step results in aligned [Peptide](../terminology.html#peptide) sequences, which are being produced in the form of fasta files, one per CDS.
The translation step results in aligned [peptide](../terminology.html#peptide) sequences, which are [produced](../output-files/03-translations) as FASTA files, one file per CDS.

These files are written by [Nextclade CLI](../nextclade-cli) and can be downloaded in the "Download" dialog of [Nextclade Web](../nextclade-web).
These files are written by [Nextclade CLI](../nextclade-cli) and can be downloaded in the "Export" dialog of [Nextclade Web](../nextclade-web).
6 changes: 3 additions & 3 deletions docs/user/algorithm/03-mutation-calling.md
@@ -6,14 +6,14 @@ In order to detect nucleotide mutations, aligned nucleotide sequences are compar

- Nucleotide deletions ("gaps"): nucleotide was present in the reference sequence, but is not present in the query sequence. These are indicated by the "`-`" character in the alignment sequence. They are shown in sequence views in [Nextclade Web](../nextclade-web) as dark-grey markers. In the output files deletions are represented as numeric ranges, signifying the start and end of the deleted fragment (for example: `21765-21770`)

- Nucleotide insertions: additional nucleotides in the query sequence that were not present in the reference sequence. They are stripped from the alignment and reported in a separate output file, showing the position in the reference after which the insertion occurred and the fragment that was inserted. `22030:ACT` would indicate that the query sequence has the three bases `ACT` inserted between position `22030` and `22031` in the reference sequence (the indices are 1-based).
- Nucleotide insertions: additional nucleotides in the query sequence that were not present in the reference sequence. They are stripped from the alignment and reported separately, showing the position in the reference after which the insertion occurred and the fragment that was inserted. `22030:ACT` would indicate that the query sequence has the three bases `ACT` inserted between position `22030` and `22031` in the reference sequence (the indices are 1-based).
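
The two notations described in the list above can be illustrated with a short sketch. It is not Nextclade's code: the helper names `deletion_ranges` and `parse_insertion` are hypothetical, positions are assumed to be 1-based with inclusive range ends, and the sketch always writes deletions as `start-end`, even for a single deleted position.

```python
def deletion_ranges(aligned_query):
    """Collect contiguous runs of '-' in an aligned query sequence
    as 'start-end' strings (assumed 1-based, inclusive)."""
    ranges = []
    start = None
    for pos, char in enumerate(aligned_query, start=1):
        if char == "-":
            if start is None:
                start = pos  # a new deleted fragment begins here
        elif start is not None:
            ranges.append(f"{start}-{pos - 1}")
            start = None
    if start is not None:
        # The sequence ends inside a deletion.
        ranges.append(f"{start}-{len(aligned_query)}")
    return ranges


def parse_insertion(text):
    """Split an insertion such as '22030:ACT' into the reference position
    after which the fragment is inserted and the inserted fragment."""
    position, fragment = text.split(":")
    return int(position), fragment


print(deletion_ranges("AC--ACG-"))   # ['3-4', '8-8']
print(parse_insertion("22030:ACT"))  # (22030, 'ACT')
```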

Nextclade also gathers and reports other useful statistics, such as the number of contiguous ranges of `N` (missing) and non-ACGTN (ambiguous) nucleotides, as well as the total counts of substituted, deleted, missing and ambiguous nucleotides. You can find this information in the results table of [Nextclade Web](../nextclade-web) and in the output files of [Nextclade CLI](../nextclade-cli).

Similarly, aminoacid mutations and statistics are gathered from the aligned peptides obtained after translation. This step only runs if a genome annotation is provided.
Similarly, amino acid mutations and statistics are gathered from the aligned peptides obtained after [translation](./02-translation). This step only runs if a [genome annotation](../input-files/03-genome-annotation) is provided.

### Results

The nucleotide mutations can be viewed in "Sequence view" column of the results table in [Nextclade Web](../nextclade-web). Switching "Sequence view" to a particular gene will show mutations in the corresponding peptide.

The mutation calling step results in a set of mutations and various practical metrics for each sequence. They are produced as a part of the analysis results JSON, CSV and TSV files in [Nextclade CLI](../nextclade-cli) and in the "Download" dialog of [Nextclade Web](../nextclade-web).
The mutation calling step results in a set of mutations and various practical metrics for each sequence. They are produced as a part of the analysis results [JSON](../output-files/05-results-json), [CSV and TSV files](../output-files/04-results-tsv) in [Nextclade CLI](../nextclade-cli) and in the "Download" dialog of [Nextclade Web](../nextclade-web).
32 changes: 17 additions & 15 deletions docs/user/algorithm/05-phylogenetic-placement.md
@@ -23,39 +23,41 @@ D = M_{ref} + M_{query} - 2 M_{agree} - M_{disagree} - M_{unknown}

where

- ``$` D `$`` is the resulting distance metric
- ``$ D $`` is the resulting distance metric

- ``$` M_{ref} `$`` is the total number of mutations in the reference node
- ``$ M_{ref} $`` is the total number of mutations in the reference node

- ``$` M_{query} `$`` is the total number of mutations in the query sequence
- ``$ M_{query} $`` is the total number of mutations in the query sequence

- ``$` M_{agree} `$`` is the number of exact mutations is shared between the reference node and the query sequence
- ``$ M_{agree} $`` is the number of mutations shared exactly between the reference node and the query sequence

- ``$` M_{disagree} `$`` is the number of mutations at the same position in the reference node and the query sequence, but where the states are different. This is where the reference node and the query sequence disagree
- ``$ M_{disagree} $`` is the number of mutations at the same position in the reference node and the query sequence, but where the states are different. This is where the reference node and the query sequence disagree

- ``$` M_{unknown} `$`` is number of undetermined (sites) - sites that are mutated in the reference node but are missing in the query sequence. For these we can't tell whether the reference node agrees with the query sequence
- ``$ M_{unknown} $`` is the number of undetermined sites: sites that are mutated in the reference node but are missing in the query sequence. For these we can't tell whether the reference node agrees with the query sequence

The nearest reference node is then chosen as the one having the lowest distance metric ``$` D `$``.
The nearest reference node is then chosen as the one having the lowest distance metric ``$ D $``.
If multiple candidate attachment nodes with the same distance exist, Nextclade can use a "placement prior" to pick the most likely node based on its prevalence in the overall sequence data.
Note that this option exists only when such placement information is coded into the reference tree of the dataset.
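
As a rough illustration of the distance metric and the choice of the nearest node, consider the sketch below. It is written under stated assumptions and is not Nextclade's implementation: mutations are modelled as dictionaries mapping a position to the mutated state (relative to the reference sequence), missing query sites are a set of positions, the names `placement_distance` and `nearest_node` are hypothetical, and ties are resolved by simply keeping the first node found rather than by a placement prior.

```python
def placement_distance(node_muts, query_muts, query_missing):
    """D = M_ref + M_query - 2*M_agree - M_disagree - M_unknown."""
    m_ref = len(node_muts)
    m_query = len(query_muts)
    m_agree = m_disagree = m_unknown = 0
    for pos, state in node_muts.items():
        if pos in query_missing:
            m_unknown += 1   # node is mutated, query has no data here
        elif pos in query_muts:
            if query_muts[pos] == state:
                m_agree += 1     # same position, same state
            else:
                m_disagree += 1  # same position, different state
    return m_ref + m_query - 2 * m_agree - m_disagree - m_unknown


def nearest_node(nodes, query_muts, query_missing):
    """Return the name of the node with the lowest distance.
    On ties, min() keeps the first candidate (no placement prior here)."""
    return min(
        nodes,
        key=lambda name: placement_distance(nodes[name], query_muts, query_missing),
    )


# Hypothetical toy data: three candidate nodes and one query sequence.
nodes = {
    "A": {10: "T"},
    "B": {10: "T", 20: "G", 30: "A"},
    "C": {10: "T", 40: "G"},
}
query_muts = {10: "T", 20: "C", 40: "G"}
query_missing = {30}

print(placement_distance(nodes["B"], query_muts, query_missing))  # 2
print(nearest_node(nodes, query_muts, query_missing))             # C
```

For node `B` the terms are 3 + 3 - 2·1 - 1 - 1 = 2: one agreeing mutation (position 10), one disagreeing (position 20) and one undetermined (position 30).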

This operation is repeated for each query sequence, until all of them are placed onto the tree.

Other query sequences are never considered as targets for the initial placement such that information derived from the placement on the reference tree (see for example [clade assignment](06-clade-assignment.md)) does not depend on other query sequences.
Note, however, that Nextclade now supports a greedy type of tree-building performed at the final step of the analysis that will consider relation-ships between query sequences (see [tree building](#tree-building)).
Other query sequences are never considered as targets for the initial placement, so that information derived from the placement on the reference tree (see for example [clade assignment](06-clade-assignment)) does not depend on other query sequences. Note, however, that Nextclade now supports a greedy type of tree-building performed at the final step of the analysis that will consider relationships between query sequences (see [tree building](#tree-building)).

Mutations that separate the query sequence and the nearest node in the reference tree are designated "private mutations". Mutations that are the same in the query sequence and in the nearest node are called "shared mutations".

Sequencing errors and sequence assembly problems are expected to give rise to more private mutations than usual. Thus, an excess of such mutations is a useful [quality control (QC) metric](07-quality-control.md). In addition to the overall number of such private mutations, Nextclade also assesses whether they cluster in specific regions of the genome, as such clusters give more fine grained indications of potential quality issues.
Sequencing errors and sequence assembly problems are expected to give rise to more private mutations than usual. Thus, an excess of such mutations is a useful [quality control (QC) metric](07-quality-control.md). In addition to the overall number of such private mutations, Nextclade also assesses whether they cluster in specific regions of the genome, as such clusters give more fine-grained indications of potential quality issues.

### Tree building
After all query sequences have been assigned their initial placement on the tree, Nextclade will resolve local phylogenetic relationships between the query sequences and refine the tree (optional in CLI, data set specific in the web app).
Nextclade sorts query sequences by the number of mutations to their closest reference node and will start refining their attachment positions starting with the queries closest to the reference tree.
For each query sequence, it will check whether there are some mutations shared with branches in the immediate neighborhood.
If such mutations exist, the corresponding branches will be split to optimally position the query, or, if all mutations on a branch are shared with another branch, the query will be moved along this branch to a new position.

After all query sequences have been assigned their initial placement on the tree, Nextclade will resolve local phylogenetic relationships between the query sequences and refine the tree (optional in CLI, dataset-specific in the web app).

Nextclade sorts query sequences by the number of mutations to their closest reference node and will start refining their attachment positions starting with the queries closest to the reference tree. For each query sequence, it will check whether there are some mutations shared with branches in the immediate neighborhood. If such mutations exist, the corresponding branches will be split to optimally position the query, or, if all mutations on a branch are shared with another branch, the query will be moved along this branch to a new position.

This procedure is repeated until no further local improvement is possible and a new node corresponding to the query (along with necessary internal nodes) is added to the tree.

The position of the next sequence will now be refined on the tree with the previous sequences already attached at their refined positions, gradually building up the phylogenetic structure among the query sequences.
Such a greedy tree-building approach works the diversity of the population is well represented by the reference tree and remaining diversity among the query sequences is small.

This greedy tree-building approach works when the diversity of the population is well represented by the reference tree and the remaining diversity among the query sequences is small.

### Known limitations


1 comment on commit 9908d2d

@vercel vercel bot commented on 9908d2d Jan 9, 2024

Successfully deployed to the following URLs:

nextclade – ./

nextclade-git-master-nextstrain.vercel.app
nextclade.vercel.app
nextclade-nextstrain.vercel.app