feat: mutations relative to arbitrary node (extended) #1492

@ivan-aksamentov ivan-aksamentov commented Jun 18, 2024

Extension of #1454 (based on top of its branch)

Resolves #991 #1237 #1142

This PR includes the following changes:

  • make the specification of ref nodes of interest more flexible
    • allow many-to-many relationships between query samples and searched nodes of interest (meaning each query can have a different ref node as its target)
    • allow for multiple tree search algorithms
  • allow defining a default node of interest, to be selected by default in the web app dropdown
  • separate 2 concerns:
    • tree search: finding nodes of interest on the tree
    • mutation search: calling mutations relative to the nodes of interest found previously
  • CSV/TSV now additionally emits a "relativeMutations['{name}'].nodeName" column, to identify the matched node name for each row (it can now vary across samples)
  • Similarly, the mutation tooltip in the web app now displays that node's name

Here I mostly describe changes to the node search. The mutation calling code is largely the same as in #1454, and it runs for each node that has been found.

Inputs

The extensions JSON snippet is to be placed under .meta.extensions.nextclade.ref_nodes as before.

Properties:

  • default: string, optional. Set default search to display in the web app dropdown. Should correspond to one of the search[].name fields or one of the special values __root__ for reference sequence (default), __parent__ for nearest node (private mutations).

  • search: array of objects, optional. Each object describes one search. Each search corresponds to an entry in the "Relative to" dropdown in the web app and a set of CSV/TSV columns relativeMutations['searchName']. Note that these names no longer need to correspond to node names.

    • search[].name: required unique identifier of the search entry
    • search[].displayName, search[].description: optional friendly name and description to be displayed in the UI (dropdown)
    • search[].criteria: array of objects, optional. One or multiple search criteria. Criteria should be described such that during search run only one criterion matches a pair of query and node. If there are multiple matches, then one (unspecified) match is taken and a warning is emitted.
      • search[].criteria[].qry: object, optional, describing properties of query samples to select for this search
        • search[].criteria[].qry.clade: array of strings, optional. Query clades to consider for this search. At least one match is necessary for sample to match.
        • search[].criteria[].qry.cladeNodeAttrs: optional mapping from name of the clade-like attr to a list of searched values for this attr. At least one match is necessary for sample to match.
      • search[].criteria[].node: object, optional, describing properties of ref node to search, as well as search algorithm. All of the properties should match.
        • search[].criteria[].node.name: array of strings, optional. Searched node names. At least one match.
        • search[].criteria[].node.clade: array of strings, optional. Searched node clades. At least one match is necessary for node to match.
        • search[].criteria[].node.cladeNodeAttrs: optional mapping from name of the clade-like attr to a list of searched values for this attr. At least one match is necessary for node to match.
        • search[].criteria[].node.searchAlgo: string, optional. Search algorithm to use
          • full (default): simple loop over all nodes until first match is found
          • ancestor-earliest: start with the current sample and traverse the graph against edge directions, looking for matching nodes, until it reaches root node. The result is the last encountered matching node.
          • ancestor-latest: start with the current sample and traverse the graph against edge directions, looking for matching nodes. The first match is the result.
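
To make the difference between the three searchAlgo values concrete, here is a minimal sketch on a toy tree. It is not the Nextclade implementation: the `Tree` type (a vector of parent indices), the `find_*` function names and the example tree are all made up for illustration, and, like `graph_find_backwards_last` in this PR, the ancestor searches test only the ancestors of the start node, not the start node itself.

```rust
/// Toy tree (hypothetical): `parent[i]` is the index of node i's parent, `None` for the root.
pub struct Tree {
  pub parent: Vec<Option<usize>>,
  pub clade: Vec<&'static str>,
}

/// "full": loop over all nodes and return the first one that matches.
pub fn find_full(tree: &Tree, pred: impl Fn(usize) -> bool) -> Option<usize> {
  (0..tree.parent.len()).find(|&i| pred(i))
}

/// "ancestor-latest": walk from `start` towards the root; the first matching ancestor wins.
pub fn find_ancestor_latest(tree: &Tree, start: usize, pred: impl Fn(usize) -> bool) -> Option<usize> {
  let mut current = tree.parent[start];
  while let Some(i) = current {
    if pred(i) {
      return Some(i);
    }
    current = tree.parent[i];
  }
  None
}

/// "ancestor-earliest": walk all the way to the root; the last matching ancestor wins.
pub fn find_ancestor_earliest(tree: &Tree, start: usize, pred: impl Fn(usize) -> bool) -> Option<usize> {
  let mut found = None;
  let mut current = tree.parent[start];
  while let Some(i) = current {
    if pred(i) {
      found = Some(i);
    }
    current = tree.parent[i];
  }
  found
}

/// Hypothetical chain of 5 nodes: 0 (root, 20A) <- 1 (23A) <- 2 (23A) <- 3 (23A) <- 4 (23A)
pub fn example_tree() -> Tree {
  Tree {
    parent: vec![None, Some(0), Some(1), Some(2), Some(3)],
    clade: vec!["20A", "23A", "23A", "23A", "23A"],
  }
}
```

On the example chain, starting from node 4 and matching on clade 23A, ancestor-latest stops at node 3 (the nearest matching ancestor), while ancestor-earliest keeps going and ends up at node 1 (the matching ancestor closest to the root).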

Examples

The branch with the same name in the data repo: nextstrain/nextclade_data#212. This makes it possible to use this data by adding the URL param shortcut ?dataset-server=gh to this PR's deployment of Nextclade Web.

Script to embed configs into trees: scripts/migrate_006_add_rel_muts.py. Feel free to add/remove/modify entries on the branch. The changes will be reflected in the linked examples.

(looks like the script might be breaking clade-like attrs; to be investigated)

Example config snippets

  • Reproduce existing functionality of feat: mutations relative to arbitrary node #1454: search for a specific node by name, among all nodes (not necessarily ancestral) for all query samples

    {
      "ref_nodes": {
        "default": "__root__",
        "search": [
          {
            "name": "A/Massachusetts/18/2022",
            "displayName": "A/Massachusetts/18/2022",
            "description": "Isolate first used in vaccine for SH season 2024",
            "criteria": [
              {
                "node": [
                  {
                    "name": ["A/Massachusetts/18/2022"],
                    "searchAlgo": "full"
                  }
                ]
              }
            ]
          }
        ]
      }
    }
  • Reproduce existing functionality of feat: mutations relative to arbitrary node #1454: search for a specific node by name, among all nodes (not necessarily ancestral) for query samples matching certain criteria (in this case same clade)

    This requires some work during dataset preparation: the exact node name needs to be found for each criterion, and if node names change, the search descriptions need to be adjusted accordingly.

    {
      "ref_nodes": {
        "default": "__root__",
        "search": [
          {
            "name": "BA.2.86",
            "displayName": "BA.2.86 (23I)",
            "description": "Full search by name: NODE_0000659 (only for samples having clade 23I)",
            "criteria": [
              {
                "qry": [{"clade": ["23I"]}],
                "node": [{"name": ["NODE_0000659"], "searchAlgo": "full"}]
              }
            ]
          },
          {
            "name": "XBB.1.5",
            "displayName": "XBB.1.5 (23A)",
            "description": "Full search by name: XBB.1.5 (only for samples having clade 23A)",
            "criteria": [
              {
                "qry": [{"clade": ["23A"]}],
                "node": [{"name": ["XBB.1.5"], "searchAlgo": "full"}]
              }
            ]
          },
          {
            "name": "BA.5",
            "displayName": "BA.5 (22B)",
            "description": "Full search by name: NODE_0000862 (only for samples having clade 22B)",
            "criteria": [
              {
                "qry": [{"clade": ["22B"]}],
                "node": [{"name": ["NODE_0000862"], "searchAlgo": "full"}]
              }
            ]
          }
        ]
      }
    }
  • New functionality: filter query samples by clade and find the earliest ancestor node with the same clade (note the "searchAlgo": "ancestor-earliest").

    This is similar to the previous example, but there is no need to look up node names in advance: the ancestral search happens in Nextclade. The desired clades still need to be listed though.

    For this particular use-case - finding the clade founder for each clade - we may consider embedding it into Nextclade without the need for search descriptions.

    {
     "name": "clade-founder-by-clade",
     "displayName": "Clade founder (by node clade)",
     "description": "Mutations relative to founder of clade (earliest ancestor by clade)",
     "criteria": [
       {
         "qry": [{"clade": ["20A"]}],
         "node": [{"clade": ["20A"], "searchAlgo": "ancestor-earliest"}]
       },
       {
         "qry": [{"clade": ["23A"]}],
         "node": [{"clade": ["23A"], "searchAlgo": "ancestor-earliest"}]
       },
       {
         "qry": [{"clade": ["23B"]}],
         "node": [{"clade": ["23B"], "searchAlgo": "ancestor-earliest"}]
       }
     ]
    }
  • You might come up with more examples and use-cases. The query-to-node mapping is now many-to-many, which allows for great flexibility, and the search algos allow traversing the tree in different ways.
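
The matching rules from the Properties section ("at least one match" within each list, "all of the properties should match" across a node description, and a warning when several criteria match) can be sketched as below. This is only an illustration under assumptions: the `NodeCriterion` and `NodeInfo` types are simplified stand-ins that omit cladeNodeAttrs and searchAlgo, and picking the first matching criterion is an arbitrary choice (the description above leaves it unspecified).

```rust
/// Hypothetical, simplified mirror of the `node` part of a criterion.
/// `None` means "no constraint on this property".
pub struct NodeCriterion {
  pub name: Option<Vec<String>>,
  pub clade: Option<Vec<String>>,
}

/// Hypothetical subset of what is known about a tree node.
pub struct NodeInfo {
  pub name: String,
  pub clade: String,
}

/// An absent list places no constraint; a present list needs at least one equal value.
fn any_of(wanted: &Option<Vec<String>>, actual: &str) -> bool {
  wanted
    .as_ref()
    .map_or(true, |vals| vals.iter().any(|v| v.as_str() == actual))
}

/// All specified properties have to match for the node to match.
pub fn node_matches(criterion: &NodeCriterion, node: &NodeInfo) -> bool {
  any_of(&criterion.name, &node.name) && any_of(&criterion.clade, &node.clade)
}

/// Returns the index of the chosen criterion (here: the first match, an arbitrary choice)
/// and a flag telling whether a "multiple matches" warning should be emitted.
pub fn pick_criterion(criteria: &[NodeCriterion], node: &NodeInfo) -> (Option<usize>, bool) {
  let hits: Vec<usize> = criteria
    .iter()
    .enumerate()
    .filter(|(_, c)| node_matches(c, node))
    .map(|(i, _)| i)
    .collect();
  (hits.first().copied(), hits.len() > 1)
}
```

Strict string equality is used here, in line with the "Possible improvements" note about regex matching not being available yet.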

Possible improvements

  • Hardcode clade founder search, such that it is always performed, without additional specification, like nearest node search (and private mutations).
  • Use regex match for strings, instead of strict equality.
  • Add more tree search algos.
  • Use found nodes not only for mutation calling, but possibly for other things. In which case, we might need to also specify purpose (or multiple purposes) for each search object.
  • Match nodes by arbitrary node attributes or branch attributes (such as labels). I need to know more about what exactly these properties look like in JSON.

ivan-aksamentov and others added 25 commits June 12, 2024 15:30
When new nodes are placed onto a ref node, sometimes we create an intermediate new node as a copy of the existing ref node. They are all named the same - `{node_key}_internal`. This is insufficient to provide uniqueness when multiple new nodes are placed onto the same ref node.

To address that, here I change names to also include query name and index.
…om permissions-policy

Having both `feature-policy` and `permissions-policy` causes a "yellow error" (warnor? errning?) because it's [deprecated](https://scotthelme.co.uk/goodbye-feature-policy-and-hello-permissions-policy/) (as if no one is using old browsers which don't know about it)

Same for the `interest-cohort` entry in `permissions-policy`, which requires a feature flag enabled in Chrome to be functional:

> Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.

Let's remove the outdated `feature-policy` header and remove `interest-cohort` entry from `permissions-policy`.

Still an "A" score on both
- https://securityheaders.com/?q=clades.nextstrain.org&followRedirects=on
- https://observatory.mozilla.org/analyze/clades.nextstrain.org
even though we allow `unsafe-eval` in order to be able to run wasm.

Security headers being a bit of a confusing mess nowadays (to put it lightly), which is always good for additional feeling of security of course! I am not even sure why I am doing this anymore. To get an "A" score I guess.

The bug is the inverted `.filter()` which filtered out all non-empty entries

Bumps [auspice](https://github.com/nextstrain/auspice) from 2.54.3 to 2.55.0.
- [Release notes](https://github.com/nextstrain/auspice/releases)
- [Changelog](https://github.com/nextstrain/auspice/blob/master/CHANGELOG.md)
- [Commits](nextstrain/auspice@v2.54.3...v2.55.0)

---
updated-dependencies:
- dependency-name: auspice
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

vercel bot commented Jun 18, 2024


Name Status Preview Updated (UTC)
nextclade ✅ Ready (Inspect) Visit Preview Jun 21, 2024 9:07am


ivan-aksamentov commented Jun 20, 2024

After discussion with Richard, I added the "hardcoded" (built-in) search for clade founder nodes and calling mutations relative to these nodes.

  • start with the query node
  • search for earliest (closest to root) ancestor node with the same clade as the query
  • let's call that node "clade founder" node
  • call mutations relative to it
  • display these mutations when dropdown points to "Clade founder" entry
  • do the same thing for custom clade-like node attributes in the dataset (e.g. we have "Pango lineage" for SC2, "Subclade" for flu, "G-clade" for RSV etc.)

This avoids explicit, repetitive ref node configuration in the ref tree. This search is the same as manually adding a criterion like this for every clade:

{"qry": {"clade": [clade_x]}, "node": {"clade": [clade_x], "searchAlgo":"ancestor-earliest"}}

The idea is similar to nearest node/private mutations, and can be considered a continuation of it: we find the nearest node in order to do tree placement and to call private mutations. Here we continue searching from the nearest node, ascending the tree towards the root, and take the node which is closest to the root and which has the same clade or clade-like attribute as the query.
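
A rough sketch of this built-in founder search, applied to the clade and to every clade-like attribute at once. The `FounderNode` type, the `founder_node` helper and the convention of storing the built-in clade under the key "clade" are assumptions made for this example only. As in `graph_find_backwards_last` from this PR, only ancestors of the start node are tested.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal node: parent index plus clade-like attributes
/// (the built-in clade is stored under the key "clade" here, for simplicity).
pub struct FounderNode {
  pub parent: Option<usize>,
  pub attrs: BTreeMap<String, String>,
}

/// Convenience constructor for the example.
pub fn founder_node(parent: Option<usize>, attrs: &[(&str, &str)]) -> FounderNode {
  FounderNode {
    parent,
    attrs: attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
  }
}

/// For every attribute of the `start` node, returns the index of the ancestor
/// closest to the root which carries the same value of that attribute.
/// Attributes for which no ancestor matches are absent from the result.
pub fn find_founders(nodes: &[FounderNode], start: usize) -> BTreeMap<String, usize> {
  let mut founders = BTreeMap::new();
  for (key, value) in &nodes[start].attrs {
    let mut current = nodes[start].parent;
    while let Some(i) = current {
      if nodes[i].attrs.get(key) == Some(value) {
        // Matches found later in the walk are closer to the root, so they overwrite earlier ones
        founders.insert(key.clone(), i);
      }
      current = nodes[i].parent;
    }
  }
  founders
}
```

Note that the founder can differ per attribute: a sample may share its clade with a node high up on the tree while its lineage was introduced much closer to the tip.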

In TSV this information is emitted into columns:

founderMuts['{name}'].nodeName
founderMuts['{name}'].substitutions
founderMuts['{name}'].deletions
founderMuts['{name}'].aaSubstitutions
founderMuts['{name}'].aaDeletions

where {name} is clade for built-in clades or a name of the custom clade-like attribute (e.g. Nextclade_pango etc.).
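
For downstream scripts that need to pick these columns out of a TSV header, the names can be generated from the attribute name. A small sketch; the function name `founder_muts_columns` is made up, and only the five column suffixes listed above are assumed.

```rust
/// Builds the five `founderMuts` column names listed above for one clade or clade-like attribute.
pub fn founder_muts_columns(name: &str) -> Vec<String> {
  ["nodeName", "substitutions", "deletions", "aaSubstitutions", "aaDeletions"]
    .iter()
    .map(|field| format!("founderMuts['{}'].{}", name, field))
    .collect()
}
```

For example, `founder_muts_columns("Nextclade_pango")` yields `founderMuts['Nextclade_pango'].nodeName` as its first entry.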

P.S. This is a prototype, the names are placeholders. The names and everything else are up for discussion of course.

Comment on lines +46 to +87
/// Starting from a given node, traverse the graph backwards (against direction of edges) until reaching the root,
/// and return the last node which fulfills a given predicate condition, if any
pub fn graph_find_backwards_last<N, E, D, F, R>(
  graph: &Graph<N, E, D>,
  start: GraphNodeKey,
  mut predicate: F,
) -> Result<Option<R>, Report>
where
  N: GraphNode,
  E: GraphEdge,
  F: FnMut(&Node<N>) -> Option<R>,
{
  let mut current = graph
    .get_node(start)
    .wrap_err("In graph_search_backwards(): When retrieving starting node")?;

  let mut found = None;

  loop {
    let edge_keys = current.inbound();
    if edge_keys.is_empty() {
      break;
    }

    let edge = take_exactly_one(edge_keys)
      .wrap_err("In graph_search_backwards(): multiple parent nodes are not currently supported")?;

    let parent_key = graph.get_edge(*edge)?.source();
    let parent = graph
      .get_node(parent_key)
      .wrap_err("In graph_search_backwards(): When retrieving parent node")?;

    let result = predicate(parent);
    if let Some(result) = result {
      found = Some(result);
    }

    current = parent;
  }

  Ok(found)
}

This is the code which performs the "ancestor-earliest" kind of search in the graph.

If `skipAsReference` is set on a clade-like node attribute description (an entry in the Auspice JSON `.meta.extensions.nextclade.clade_node_attrs[]` array), then this attribute will not participate in clade founder node search, nor in mutation calling relative to these nodes.
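
In other words, the set of attributes used for founder search is the list of clade-like attribute descriptions minus the ones flagged with `skipAsReference`. A minimal sketch of that filter, assuming a trimmed-down `CladeNodeAttrDesc` type (the real entries carry more fields, and the Rust field name `skip_as_reference` is a guess at how the JSON key would be mapped):

```rust
/// Hypothetical, trimmed-down description of a clade-like attribute.
pub struct CladeNodeAttrDesc {
  pub name: String,
  /// Assumed to correspond to `skipAsReference` in the JSON.
  pub skip_as_reference: bool,
}

/// Names of the attributes which take part in founder search and in mutation calling relative to founders.
pub fn attrs_for_founder_search(descs: &[CladeNodeAttrDesc]) -> Vec<&str> {
  descs
    .iter()
    .filter(|desc| !desc.skip_as_reference)
    .map(|desc| desc.name.as_str())
    .collect()
}
```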

ivan-aksamentov commented Jun 21, 2024

Clade-like attributes can now be conditionally excluded from founder node search: 7c60909 (#1492)

And the datasets on the sibling branch in data repo are configured like so:
https://github.com/nextstrain/nextclade_data/blob/598bd5c46eb8f3b038ba2843b1d5d0bbae4e3340/scripts/migrate_006_add_rel_muts.py#L102-L141


huddlej commented Jun 24, 2024

@ivan-aksamentov This has gotten really fancy quickly, so I haven't been able to test out the various combinations of input definitions available. I did check the subclade-specific amino acid substitutions for recent H3N2 HA sequences using this branch's deployment of the web UI. I compared these subclade-specific annotations to the "derived haplotypes" we currently produce for seasonal flu trees and everything matched the way I expected. This is exactly the main use I have for this kind of relative mutations functionality, so I'd be super happy to have this merged.

Also, thank you for getting this alternate "hardcoded" functionality implemented so quickly!

huddlej added a commit to nextstrain/seasonal-flu that referenced this pull request Jun 24, 2024
Adds a prototype script that produces derived haplotype strings per
record from a given Nextclade annotations file with columns for clade
and mutations relative to each clade. The derived haplotypes produced
with this script could eventually replace the haplotypes we build from
the mutation-annotated trees and allow us to calculate haplotype
frequencies from all available data instead of a subset of data used to
build a tree.

Related to #130
Depends on nextstrain/nextclade#1492
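
To illustrate the kind of per-record processing such a script could do with the new columns, here is a sketch that combines a clade label with the mutations relative to that clade's founder. Everything in it is an assumption: the actual seasonal-flu script is not shown here, the function name `derived_haplotype` is made up, the output format (`clade:mut1-mut2`) is invented for the example, and the mutation column is assumed to be a comma-separated list.

```rust
/// Joins a clade label and a comma-separated list of mutations into one haplotype string.
/// The "clade:mut1-mut2" format is hypothetical; a record without mutations maps to the bare clade label.
pub fn derived_haplotype(clade: &str, aa_substitutions: &str) -> String {
  let muts: Vec<&str> = aa_substitutions
    .split(',')
    .map(str::trim)
    .filter(|m| !m.is_empty())
    .collect();

  if muts.is_empty() {
    clade.to_owned()
  } else {
    format!("{}:{}", clade, muts.join("-"))
  }
}
```

Because this works on one TSV row at a time, it can be applied to all available records rather than only to the subset of sequences that is included in a tree.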

ivan-aksamentov commented Jun 24, 2024

@huddlej Thanks for testing!

> I did check the subclade-specific amino acid substitutions for recent H3N2 HA sequences using this branch's deployment of the web UI.

Which dropdown entry or which TSV columns have you tested? The "Subclade founder" entry?

I now see here that it's founderMuts['subclade']!

Let me know if you have any further suggestions. Richard decided to test this branch a little bit more, so there's still time.


huddlej commented Jun 24, 2024

That's right! The founderMuts['subclade'].aaSubstitutions is exactly what I needed, although I can see potential value in the other related columns...

@rneher rneher self-requested a review June 28, 2024 14:05
@ivan-aksamentov ivan-aksamentov merged commit ef5ca4a into feat/mutations-relative-to-node Jun 28, 2024
19 of 20 checks passed
@ivan-aksamentov ivan-aksamentov deleted the feat/mutations-relative-to-node-ext branch June 28, 2024 14:16
ivan-aksamentov added a commit that referenced this pull request Jun 30, 2024