feat: mutations relative to arbitrary node (extended) #1492

ivan-aksamentov · 2024-06-18T15:11:12Z

Extension of #1454 (based on top of its branch)

This PR includes the following changes:

- make specification of ref nodes of interest more flexible
  - allow many-to-many relationships between query samples and searched nodes of interest (meaning each query can have a different ref node as a target)
  - allow for multiple tree search algorithms
- allow defining a default node of interest, to be selected by default in the dropdown in the web app
- separate 2 concerns:
  - tree search: finding nodes of interest on the tree
  - mutation search: calling mutations relative to the nodes of interest found previously
- CSV/TSV now additionally emit the "relativeMutations['{name}'].nodeName" column, to identify the matched node name for each row (node names can now vary across samples)
- Similarly, the mutation tooltip in the web app now displays that node's name

Here I mostly describe changes to the node search. The mutation calling code is largely the same as in #1454 and it runs for each node that has been found.

Inputs

The extensions JSON snippet is to be placed under .meta.extensions.nextclade.ref_nodes as before.

Properties:

default: string, optional. Set default search to display in the web app dropdown. Should correspond to one of the search[].name fields or one of the special values __root__ for reference sequence (default), __parent__ for nearest node (private mutations).
search: array of objects, optional. Each object describes one search. Each search corresponds to an entry in the "Relative to" dropdown in the web app and a set of CSV/TSV columns relativeMutations['searchName']. Note that these names no longer need to correspond to node names.
- search[].name: required unique identifier of the search entry
- search[].displayName, search[].description: optional friendly name and description to be displayed in the UI (dropdown)
- search[].criteria: array of objects, optional. One or multiple search criteria. Criteria should be written such that, during a search run, only one criterion matches a given pair of query and node. If there are multiple matches, then one (unspecified) match is taken and a warning is emitted.
  - search[].criteria[].qry: object, optional, describing properties of query samples to select for this search
    - search[].criteria[].qry.clade: array of strings, optional. Query clades to consider for this search. At least one match is necessary for sample to match.
    - search[].criteria[].qry.cladeNodeAttrs: optional mapping from name of the clade-like attr to a list of searched values for this attr. At least one match is necessary for sample to match.
  - search[].criteria[].node: object, optional, describing properties of ref node to search, as well as search algorithm. All of the properties should match.
    - search[].criteria[].node.name: array of strings, optional. Searched node names. At least one match.
    - search[].criteria[].node.clade: array of strings, optional. Searched node clades. At least one match is necessary for node to match.
    - search[].criteria[].node.cladeNodeAttrs: optional mapping from name of the clade-like attr to a list of searched values for this attr. At least one match is necessary for node to match.
    - search[].criteria[].node.searchAlgo: string, optional. Search algorithm to use
      - full (default): simple loop over all nodes until first match is found
      - ancestor-earliest: start with the current sample and traverse the graph against edge directions, looking for matching nodes, until it reaches root node. The result is the last encountered matching node.
      - ancestor-latest: start with the current sample and traverse the graph against edge directions, looking for matching nodes. The first match is the result.
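To make the difference between the three `searchAlgo` values concrete, here is a minimal sketch on a toy tree. It is not Nextclade's implementation: `ToyNode`, `find_node`, `ancestors` and the parent-index representation are hypothetical. The sketch also assumes that the starting node itself is tested.

```rust
use std::str::FromStr;

/// The three `searchAlgo` values from the property list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchAlgo {
  Full,
  AncestorEarliest,
  AncestorLatest,
}

impl FromStr for SearchAlgo {
  type Err = String;

  fn from_str(s: &str) -> Result<Self, Self::Err> {
    match s {
      "full" => Ok(Self::Full),
      "ancestor-earliest" => Ok(Self::AncestorEarliest),
      "ancestor-latest" => Ok(Self::AncestorLatest),
      other => Err(format!("unknown searchAlgo: '{other}'")),
    }
  }
}

/// Hypothetical toy node: index of the parent (None for the root) and a clade label.
pub struct ToyNode {
  pub parent: Option<usize>,
  pub clade: &'static str,
}

/// Iterates over node indices from `start` (inclusive, an assumption) up to the root.
fn ancestors(nodes: &[ToyNode], start: usize) -> impl Iterator<Item = usize> + '_ {
  std::iter::successors(Some(start), move |&i| nodes[i].parent)
}

/// Returns the index of the node selected by `algo`, if any node matches.
pub fn find_node(
  nodes: &[ToyNode],
  start: usize,
  algo: SearchAlgo,
  matches: impl Fn(&ToyNode) -> bool,
) -> Option<usize> {
  match algo {
    // Simple loop over all nodes, first match wins
    SearchAlgo::Full => nodes.iter().position(|n| matches(n)),
    // Walk towards the root, first match wins
    SearchAlgo::AncestorLatest => ancestors(nodes, start).find(|&i| matches(&nodes[i])),
    // Walk all the way to the root, last match (closest to the root) wins
    SearchAlgo::AncestorEarliest => ancestors(nodes, start).filter(|&i| matches(&nodes[i])).last(),
  }
}
```

On a chain of four nodes with clades 20A (root), 23A, 23A, 23A, a search for clade 23A starting at the last node selects index 1 with `ancestor-earliest` and index 3 (the start itself) with `ancestor-latest`.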

Examples

The branch with the same name exists in the data repo: nextstrain/nextclade_data#212. This allows using this data by adding the URL param shortcut ?dataset-server=gh to this PR's deployment of Nextclade Web.

Script to embed configs into trees: scripts/migrate_006_add_rel_muts.py. Feel free to add/remove/modify entries on the branch. The changes will be reflected in the linked examples

(looks like the script might be breaking clade-like attrs; to be investigated)

Example config snippets

Reproduce existing functionality of feat: mutations relative to arbitrary node #1454: search for a specific node by name, among all nodes (not necessarily ancestral) for all query samples


{
  "ref_nodes": {
    "default": "__root__",
    "search": [
      {
        "name": "A/Massachusetts/18/2022",
        "displayName": "A/Massachusetts/18/2022",
        "description": "Isolate first used in vaccine for SH season 2024",
        "criteria": [
          {
            "node": [
              {
                "name": ["A/Massachusetts/18/2022"],
                "searchAlgo": "full"
              }
            ]
          }
        ]
      }
    ]
  }
}

Reproduce existing functionality of feat: mutations relative to arbitrary node #1454: search for a specific node by name, among all nodes (not necessarily ancestral) for query samples matching certain criteria (in this case same clade)

This requires some work when preparing the dataset: you need to find the exact node name for each criterion and, if node names change, adjust the search descriptions accordingly.


{
  "ref_nodes": {
    "default": "__root__",
    "search": [
      {
        "name": "BA.2.86",
        "displayName": "BA.2.86 (23I)",
        "description": "Full search by name: NODE_0000659 (only for samples having clade 23I)",
        "criteria": [
          {
            "qry": [{"clade": ["23I"]}],
            "node": [{"name": ["NODE_0000659"], "searchAlgo": "full"}]
          }
        ]
      },
      {
        "name": "XBB.1.5",
        "displayName": "XBB.1.5 (23A)",
        "description": "Full search by name: XBB.1.5 (only for samples having clade 23A)",
        "criteria": [
          {
            "qry": [{"clade": ["23A"]}],
            "node": [{"name": ["XBB.1.5"], "searchAlgo": "full"}]
          }
        ]
      },
      {
        "name": "BA.5",
        "displayName": "BA.5 (22B)",
        "description": "Full search by name: NODE_0000862 (only for samples having clade 22B)",
        "criteria": [
          {
            "qry": [{"clade": ["22B"]}],
            "node": [{"name": ["NODE_0000862"], "searchAlgo": "full"}]
          }
        ]
      }
    ]
  }
}

New functionality: filter query samples by clade and find the earliest ancestor node with the same clade (note the "searchAlgo": "ancestor-earliest").

This is similar to the previous example, but there is no need to look up node names in advance: the ancestral search happens in Nextclade. The desired clades still need to be listed, though.

For this particular use-case - finding the clade founder for each clade - we may consider embedding it into Nextclade, without the need for search descriptions.


{
 "name": "clade-founder-by-clade",
 "displayName": "Clade founder (by node clade)",
 "description": "Mutations relative to founder of clade (earliest ancestor by clade)",
 "criteria": [
   {
     "qry": [{"clade": ["20A"]}],
     "node": [{"clade": ["20A"], "searchAlgo": "ancestor-earliest"}]
   },
   {
     "qry": [{"clade": ["23A"]}],
     "node": [{"clade": ["23A"], "searchAlgo": "ancestor-earliest"}]
   },
   {
     "qry": [{"clade": ["23B"]}],
     "node": [{"clade": ["23B"], "searchAlgo": "ancestor-earliest"}]
   }
 ]
}
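How a sample is routed to one criterion within a search like the one above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `Criterion` and `select_criterion` are hypothetical, only `qry.clade` is modelled, and an empty clade list is assumed to match every sample (mirroring the first example, where `qry` is omitted "for all query samples").

```rust
/// Hypothetical, simplified criterion: clades accepted on the query side
/// and the clade searched for on the node side.
pub struct Criterion {
  pub qry_clades: Vec<String>,
  pub node_clade: String,
}

/// Picks the criterion that applies to a sample with the given clade.
/// Returns the chosen criterion (if any) and the warnings produced on the way.
pub fn select_criterion<'a>(
  criteria: &'a [Criterion],
  sample_clade: &str,
) -> (Option<&'a Criterion>, Vec<String>) {
  // "At least one match is necessary for sample to match"
  let matched: Vec<&Criterion> = criteria
    .iter()
    .filter(|c| c.qry_clades.is_empty() || c.qry_clades.iter().any(|q| q.as_str() == sample_clade))
    .collect();

  // Several matching criteria: take one of them and warn
  let mut warnings = Vec::new();
  if matched.len() > 1 {
    warnings.push(format!(
      "{} criteria match clade '{sample_clade}'; taking one of them",
      matched.len()
    ));
  }

  (matched.first().copied(), warnings)
}
```

Here the first matching criterion is taken when several match. The description above leaves the choice unspecified, so callers should not rely on the order.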

You might come up with more examples and use-cases. The query-to-node mapping is now many-to-many, which allows for great flexibility, and the search algos allow traversing the tree in different ways.

Possible improvements

- Hardcode clade founder search, such that it is always performed, without additional specification, like nearest node search (and private mutations).
- Use regex match for strings, instead of strict equality.
- Add more tree search algos.
- Use found nodes not only for mutation calling, but possibly for other things. In which case, we might need to also specify purpose (or multiple purposes) for each search object.
- Match nodes by arbitrary node attributes or branch attributes (such as labels). I need to know more about what exactly these properties look like in JSON.

Resolves #1477

When new nodes are placed onto a ref node, sometimes we create an intermediate new node as a copy of the existing ref node. They are all named the same - `{node_key}_internal`. This is insufficient to provide uniqueness when multiple new nodes are placed onto the same ref node. To address that, here I change names to also include query name and index.

…om permissions-policy

Having both `feature-policy` and `permissions-policy` causes a "yellow error" error/warning (warnor? errning?) because it's [deprecated](https://scotthelme.co.uk/goodbye-feature-policy-and-hello-permissions-policy/) (as if no one is using old browsers which don't know about it).

Same for the `interest-cohort` entry in `permissions-policy`, which requires a feature flag enabled in Chrome to be functional:

> Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.

Let's remove the outdated `feature-policy` header and remove the `interest-cohort` entry from `permissions-policy`.

Still an "A" score on both

- https://securityheaders.com/?q=clades.nextstrain.org&followRedirects=on
- https://observatory.mozilla.org/analyze/clades.nextstrain.org

even though we allow `unsafe-eval` in order to be able to run wasm.

Security headers being a bit of a confusing mess nowadays (to put it lightly), which is always good for additional feeling of security of course! I am not even sure why I am doing this anymore. To get an "A" score I guess.



ivan-aksamentov · 2024-06-20T17:28:36Z

After discussion with Richard, I added the "hardcoded" (built-in) search for clade founder nodes and calling mutations relative to these nodes.

- start with the query node
- search for the earliest (closest to root) ancestor node with the same clade as the query
- let's call that node the "clade founder" node
- call mutations relative to it
- display these mutations when the dropdown points to the "Clade founder" entry
- do the same thing for custom clade-like node attributes in the dataset (e.g. we have "Pango lineage" for SC2, "Subclade" for flu, "G-clade" for RSV etc.)

This avoids explicit, repetitive ref node configuration in the ref tree. This search is equivalent to manually adding a criterion like this for every clade:

{"qry": {"clade": [clade_x]}, "node": {"clade": [clade_x], "searchAlgo":"ancestor-earliest"}}
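The equivalence can be illustrated by generating that criterion for a list of clades. This is only a rough sketch: the function names are made up, and plain string formatting is used instead of a JSON library, so clade names are assumed to contain no characters that need JSON escaping.

```rust
/// Builds the JSON text of one `ancestor-earliest` criterion for a clade.
/// Assumption: `clade` contains no characters requiring JSON escaping.
pub fn founder_criterion_json(clade: &str) -> String {
  format!(
    r#"{{"qry": {{"clade": ["{clade}"]}}, "node": {{"clade": ["{clade}"], "searchAlgo": "ancestor-earliest"}}}}"#
  )
}

/// Builds a JSON array with one such criterion per clade.
pub fn founder_criteria_json(clades: &[&str]) -> String {
  let items: Vec<String> = clades.iter().map(|c| founder_criterion_json(c)).collect();
  format!("[{}]", items.join(", "))
}
```

The built-in clade founder search makes this kind of generated configuration unnecessary, which is the point of the change.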

The idea is similar to nearest node/private mutations and can be considered a continuation of it: we find the nearest node in order to do tree placement and to call private mutations. Here we continue searching from the nearest node and ascend the tree towards the root, until we find the node which is closest to the root and which has the same clade or clade-like attribute as the query.

In TSV this information is emitted into columns:

founderMuts['{name}'].nodeName
founderMuts['{name}'].substitutions
founderMuts['{name}'].deletions
founderMuts['{name}'].aaSubstitutions
founderMuts['{name}'].aaDeletions

where {name} is clade for built-in clades or a name of the custom clade-like attribute (e.g. Nextclade_pango etc.).
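As a small illustration of this naming pattern, the column names for one `{name}` can be derived like this. The helper `founder_muts_columns` is hypothetical; only the column names themselves come from the list above.

```rust
/// Returns the five `founderMuts` column names for a given clade or clade-like attribute name.
pub fn founder_muts_columns(name: &str) -> Vec<String> {
  ["nodeName", "substitutions", "deletions", "aaSubstitutions", "aaDeletions"]
    .iter()
    .map(|field| format!("founderMuts['{name}'].{field}"))
    .collect()
}
```

For example, `founder_muts_columns("subclade")` starts with `founderMuts['subclade'].nodeName`.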

P.S. This is a prototype, the names are placeholders. The names and everything else are up for discussion of course.

ivan-aksamentov · 2024-06-20T17:30:24Z

packages/nextclade/src/graph/search.rs

+/// Starting from a given node, traverse the graph backwards (against direction of edges) until reaching the root,
+/// and return the last node which fulfills a given predicate condition, if any
+pub fn graph_find_backwards_last<N, E, D, F, R>(
+  graph: &Graph<N, E, D>,
+  start: GraphNodeKey,
+  mut predicate: F,
+) -> Result<Option<R>, Report>
+where
+  N: GraphNode,
+  E: GraphEdge,
+  F: FnMut(&Node<N>) -> Option<R>,
+{
+  let mut current = graph
+    .get_node(start)
+    .wrap_err("In graph_find_backwards_last(): When retrieving starting node")?;
+
+  // Last match seen so far. Note: the starting node itself is not tested, only its ancestors.
+  let mut found = None;
+
+  loop {
+    // A node without inbound edges is the root: stop there
+    let edge_keys = current.inbound();
+    if edge_keys.is_empty() {
+      break;
+    }
+
+    let edge = take_exactly_one(edge_keys)
+      .wrap_err("In graph_find_backwards_last(): multiple parent nodes are not currently supported")?;
+
+    let parent_key = graph.get_edge(*edge)?.source();
+    let parent = graph
+      .get_node(parent_key)
+      .wrap_err("In graph_find_backwards_last(): When retrieving parent node")?;
+
+    // Keep overwriting, so that the match closest to the root wins
+    let result = predicate(parent);
+    if let Some(result) = result {
+      found = Some(result);
+    }
+
+    current = parent;
+  }
+
+  Ok(found)
+}


This is the code which performs "ancestor-earliest" kind of search in the graph

If `skipAsReference` is set on a clade-like node attribute description (an entry in the Auspice JSON `.meta.extensions.nextclade.clade_node_attrs[]` array), then this attribute will not participate in clade founder node search, nor in mutation calling relative to these nodes.
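The effect of `skipAsReference` on the founder search can be sketched on a toy tree. This is a simplified illustration, not the code from this PR: `AttrDesc`, `AttrNode` and `founder_nodes` are hypothetical, attributes are plain string pairs, and the starting node is assumed to be a candidate founder itself.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified description of one clade-like attribute.
pub struct AttrDesc {
  pub name: &'static str,
  pub skip_as_reference: bool,
}

/// Hypothetical toy tree node: parent index (None for the root) and attribute values.
pub struct AttrNode {
  pub parent: Option<usize>,
  pub attrs: BTreeMap<&'static str, &'static str>,
}

/// For every attribute that is not skipped, finds the node closest to the root,
/// among `start` and its ancestors, that has the same attribute value as `start`.
pub fn founder_nodes(
  nodes: &[AttrNode],
  descs: &[AttrDesc],
  start: usize,
) -> BTreeMap<&'static str, usize> {
  let mut founders = BTreeMap::new();
  for desc in descs.iter().filter(|d| !d.skip_as_reference) {
    // The value the starting node has for this attribute, if any
    let Some(value) = nodes[start].attrs.get(desc.name) else {
      continue;
    };
    // Walk towards the root, remembering the last node with the same value
    let mut founder = start;
    let mut current = Some(start);
    while let Some(i) = current {
      if nodes[i].attrs.get(desc.name) == Some(value) {
        founder = i;
      }
      current = nodes[i].parent;
    }
    founders.insert(desc.name, founder);
  }
  founders
}
```

Attributes with `skip_as_reference: true` simply produce no entry in the result, so no mutations would be called relative to them.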

ivan-aksamentov · 2024-06-21T08:59:58Z

Clade-like attributes can now be conditionally excluded from founder node search: 7c60909 (#1492)

And the datasets on the sibling branch in data repo are configured like so:
https://github.com/nextstrain/nextclade_data/blob/598bd5c46eb8f3b038ba2843b1d5d0bbae4e3340/scripts/migrate_006_add_rel_muts.py#L102-L141

huddlej · 2024-06-24T19:15:37Z

@ivan-aksamentov This has gotten really fancy quickly, so I haven't been able to test out the various combinations of input definitions available. I did check the subclade-specific amino acid substitutions for recent H3N2 HA sequences using this branch's deployment of the web UI. I compared these subclade-specific annotations to the "derived haplotypes" we currently produce for seasonal flu trees and everything matched the way I expected. This is exactly the main use I have for this kind of relative mutations functionality, so I'd be super happy to have this merged.

Also, thank you for getting this alternate "hardcoded" functionality implemented so quickly!

Adds a prototype script that produces derived haplotype strings per record from a given Nextclade annotations file with columns for clade and mutations relative to each clade. The derived haplotypes produced with this script could eventually replace the haplotypes we build from the mutation-annotated trees and allow us to calculate haplotype frequencies from all available data instead of a subset of data used to build a tree. Related to #130 Depends on nextstrain/nextclade#1492

ivan-aksamentov · 2024-06-24T19:21:26Z

@huddlej Thanks for testing!

> I did check the subclade-specific amino acid substitutions for recent H3N2 HA sequences using this branch's deployment of the web UI.

~~Which dropdown entry or which TSV columns have you tested? The "Subclade founder" entry?~~

I now see here that it's founderMuts['subclade']!

Let me know if you have any further suggestions. Richard decided to test this branch a little bit more, so there's still time.

huddlej · 2024-06-24T19:23:39Z

That's right! The founderMuts['subclade'].aaSubstitutions is exactly what I needed, although I can see potential value in the other related columns...
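For readers who want to pull that column out of a Nextclade TSV, a minimal sketch follows. It is an assumption-laden illustration: `tsv_column` is a hypothetical helper, fields are split on tabs without any quoting support, and the mutation strings in the usage note are made-up sample values.

```rust
/// Extracts the values of one named column from TSV text, one entry per data row.
/// Returns None if the header is missing or does not contain the column.
/// Assumption: fields contain no quoted tabs or newlines.
pub fn tsv_column<'a>(tsv: &'a str, column: &str) -> Option<Vec<&'a str>> {
  let mut lines = tsv.lines();
  let header = lines.next()?;
  // Position of the requested column in the header row
  let idx = header.split('\t').position(|h| h == column)?;
  // Rows that are too short yield an empty value
  Some(lines.map(|line| line.split('\t').nth(idx).unwrap_or("")).collect())
}
```

Calling `tsv_column(text, "founderMuts['subclade'].aaSubstitutions")` on a file with that header yields one string per sample, for example a comma-separated list such as `HA1:N145S,HA1:K189R` (illustrative values only).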


ivan-aksamentov and others added 25 commits June 12, 2024 15:30

feat(web): add toggles for insertions and frame shifts markers

1fda656

Resolves #1477

Merge remote-tracking branch 'origin/master' into feat/web-seq-markers

f206800

Merge pull request #1486 from nextstrain/fix/prevent-duplicate-node-n…

b5a311b

…ames

fix(web): correctly display 'updated at' time of datasets

57d53df

The bug is the inverted `.filter()` which filtered out all non-empty entries

fix(web): styling of details & summary in markdown content

42f3aee

Resolves #1487

Merge pull request #1489 from nextstrain/fix/web-updated-at

9d7c283

feat: add ancestral search

f94b2e3

feat: add multiple criteria per search

b09abd1

feat: allow multiple criteria per node and query

46566c8

feat: actually run the search

29170fd

fix: infinite loop

2330352

fix: field name mismatch

60e2a70

Merge pull request #1491 from nextstrain/dependabot/npm_and_yarn/pack…

27d1c32

…ages/nextclade-web/auspice-2.55.0

feat: adjust private nuc mutation search for the new format of ref nodes

60732f5

fix: apply query criteria correctly

3f9e11f

refactor: split modules for relative mutations - for nuc and aa

5f18529

feat: adjust private aa mutation search for the new format of ref nodes

02ffd3b

feat: add node name to search result

2654b2a

feat: remove old node search format, adjust code for the new format

348e244

feat: adjust web ui for the new ref node desc format

6c6cf9f

feat: add ref node name to csv output

455efe2

feat(web): add ref node name to mut tooltip

dda188f

ivan-aksamentov mentioned this pull request Jun 18, 2024

feat: mutations relative to arbitrary node (extended) nextstrain/nextclade_data#212

Closed

fix(web): infinite loading due to uninitialized atom

aacb3e9

fix: correctly display aa mutations for clade-like attr founders

966fcd3

ivan-aksamentov added 3 commits June 20, 2024 19:17

fix: emit clade founder mutations into csv

39a4785

feat(web): add founder muts checkbox into csv column config

b9610f1

feat: reorder csv columns

5e168a8

ivan-aksamentov added 2 commits June 21, 2024 09:52

fix(web): description and behavior of clade column checkboxes

8f5da60

fix: buffer overflow when inserting csv headers

e62022c

ivan-aksamentov added 2 commits June 21, 2024 10:50

refactor: lint

f5fcd80

huddlej mentioned this pull request Jun 24, 2024

Summarize haplotype coverage by titer references using frequencies per haplotype from all available data nextstrain/seasonal-flu#173

Merged

5 tasks

rneher self-requested a review June 28, 2024 14:05

rneher approved these changes Jun 28, 2024

ivan-aksamentov merged commit ef5ca4a into feat/mutations-relative-to-node Jun 28, 2024
19 of 20 checks passed

ivan-aksamentov deleted the feat/mutations-relative-to-node-ext branch June 28, 2024 14:16

ivan-aksamentov added a commit that referenced this pull request Jun 30, 2024

feat(web): add help tips for gene and ref node dropdowns

97366e9

Follow up for #1492

This was referenced Jun 30, 2024

feat(web): add help tips for gene and ref node dropdowns #1496

Merged

docs: add docs for clade founders and relative mutations #1497

Merged

add ref nodes to nextstrain/sars-cov-2/wuhan-hu-1/proteins dataset nextstrain/nextclade_data#198

Merged

ivan-aksamentov mentioned this pull request Jul 10, 2024

fix: crash when generating CSV/TSV output #1507

Merged
