Streamtrees #1902

jameshadfield · 2024-11-14T07:45:48Z

This is the first prototype of a long-running idea of mine (and others) as a way of displaying big trees where our typical approach is either too slow and/or we run out of pixels. This sketch was from many years ago:

This prototype implements streams by allowing branch labels to cut the tree into monophylies / paraphylies and visualising each as streamgraphs. While the motivation was for this to display huge trees which Auspice can't currently display, it works and is useful (and fun) for smaller trees as well. Long-term I think this would be a good starting view into very large or diverse datasets.
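
A minimal sketch of the partitioning idea, not the code in this PR. It assumes an Auspice-style tree of nested dicts where labels live under `branch_attrs.labels`; the function name `partition_into_streams` and the toy tree are hypothetical. Each tip is assigned to the nearest labelled branch above it, so a stream whose subtree contains another labelled branch ends up paraphyletic.

```python
from collections import defaultdict


def partition_into_streams(node, label_key, current=None, streams=None):
    """Group tip names by the nearest labelled ancestor branch (hypothetical helper)."""
    if streams is None:
        streams = defaultdict(list)
    # A label on the branch leading to this node starts a new stream.
    label = node.get("branch_attrs", {}).get("labels", {}).get(label_key)
    if label is not None:
        current = label
    children = node.get("children", [])
    if not children:
        # Tips above any label are collected under the key None.
        streams[current].append(node["name"])
    for child in children:
        partition_into_streams(child, label_key, current, streams)
    return dict(streams)


# Toy tree: 20B is nested inside 20A, so 20A becomes a paraphyletic stream.
tree = {
    "name": "root",
    "children": [
        {"name": "tip_A"},
        {
            "name": "n1",
            "branch_attrs": {"labels": {"clade": "20A"}},
            "children": [
                {"name": "tip_B"},
                {
                    "name": "n2",
                    "branch_attrs": {"labels": {"clade": "20B"}},
                    "children": [{"name": "tip_C"}, {"name": "tip_D"}],
                },
            ],
        },
    ],
}

streams = partition_into_streams(tree, "clade")
```

In this toy case `tip_A` sits above every label and so belongs to no stream, which is one way a basal part of the tree can end up drawn as ordinary branches.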

How to use

nextstrain.org review app

Any dataset with a branch label (apart from amino acid) can have streamtrees. The dropdown in the sidebar can change the label used to partition the tree, and there's a toggle to go between streamtrees and normal tree view. My general view is that the appropriate partitioning of sample-sets will be best done in Augur, either algorithmically or manually. (There's also scope for dynamic partitioning in Auspice via genotype or color-by, but that's not implemented here.)

Suggested testing datasets

EBOV
Influenza
Large 23k-tip ncov dataset. In particular scrubbing time / filtering is much much faster here than when viewing the normal tree.

Known caveats / bugs

The tree must be temporal. You can change between temporal & divergence metrics, so long as temporal information exists.
Only rectangular trees.
Streams with only internal nodes cause a crash.
Performance hasn't been optimised at all. Specifically, toggling between streamtrees & normal trees is very slow.
Nested streams which aren't very ladder-like don't zoom well most of the time, and changing back to the normal view often gets the viewport completely wrong. This is fixable, but requires a rethinking & rewrite of how I calculate the vertical position of nodes.
Only categorical color-bys work. Genotype colorings don't work.
The vertical space within a single stream is consistent (i.e. half height = half the samples in that window) but it's not consistent between streams. The window size is also not consistent between streams.
The curves drawn around streams are an off-the-shelf curve generator and aren't quite right.
Dashed lines are no good.

Future directions

Release the current prototype after feedback, UI improvements & bug-fixes.
Auto-toggling between streamtrees & normal trees. Currently you have to manually do this and it's very slow, but the idea is that as you zoom the streams are replaced by either more fine-grained streams or the normal tree view.
Allow JSONs to encode streams in a more compressed format (to allow massive trees) and then the switching of streams to normal trees involves a fetch of the associated dataset-JSON for that particular stream.
Dynamic partitioning which isn't simple cuts in the tree. E.g. see certain mutations as individual streams with the lines between the streams representing the flows between states. (Maybe.)
On a recent call we discussed collapsing identical sequences (tips) in order to better view large trees. One option here is making the tip size scale with the number of samples represented by the tip (expressing multiple colours / sampling dates is harder). Another option is to use streams (but be aware that n streams will always be a lot slower than n tips).

Screenshots

Screen.Recording.2024-11-14.at.8.43.25.PM.mov

Screen.Recording.2024-11-14.at.8.40.30.PM.mov

trvrb · 2024-11-14T22:02:21Z

This is so cool @jameshadfield! I can give more detailed feedback soon, but I wanted to highlight just a few things on initial perusal:

https://nextstrain-s-nextstrain-ginawj.herokuapp.com/ncov/gisaid/global/all-time is throwing an error, while https://nextstrain-s-nextstrain-ginawj.herokuapp.com/ncov/gisaid/europe/all-time is working. https://nextstrain-s-nextstrain-ginawj.herokuapp.com/seasonal-flu/h3n2/ha also not working.
it would often be quite helpful to be able to label streams with their branch label. This would help with many clade views especially when coloring not by clade. I'd think this would work in a very similar fashion as branch labels on the tree.
I see why the layout is a non-trivial problem, see https://nextstrain-s-nextstrain-ginawj.herokuapp.com/ncov/gisaid/global/6m and compare 24C in yellow to 24F in orange. But I'd think you could re-compute the y-values and get a properly hierarchical layout that doesn't involve crossing dashed lines.
The binning strategy should get looked at
How do we support radial, unrooted and scatter trees?
We’ll really need some sort of augur clades --automatic option for this to be truly useful for large phylogenies.

trvrb · 2024-11-15T00:36:16Z

(After conversation with James...)

We want to be sure this viz approach is useful for being able to read evolutionary / epidemiological stories from the genetic data. If we construct streams from the clade branch label then it's clear that we can describe evolutionary stories. For example from https://nextstrain-s-nextstrain-ginawj.herokuapp.com/ncov/gisaid/europe/all-time we can see the standard pandemic story of initial variants, then VOCs and the sweep of Delta and then the emergence of Omicron on a long branch, and so forth.

And we can do things like color by S1 mutations, as we often want to understand what's driving clade success.

But in general I'd be trying to think about how a streamtree view would enable proper reading of stories like:

dispersal of Zika into Brazil and then into the Americas: https://nextstrain.org/zika
dispersal of mpox lineage B.1 through the world: https://nextstrain.org/mpox/lineage-B.1

My guess is that streamtrees would work quite well for the epidemiological cases, but only really when "clades" are created that correspond well to geographic transitions (note that tracking geography this way was where Pango lineages got their start). This could be literally creating branch labels to mark geographic transitions from augur traits and giving each transition a unique name. I think this will probably be better than trying to label branches as "USA", etc. and have them converge to the same streams. (This doesn't have to be a primary use case of the streamtrees, but I think it's worth considering while scoping out the initial remit.)

trvrb · 2024-11-15T00:39:51Z

Related to the above, I think the biggest design decision here is to enforce creating streams from existing branch labels. This effectively pushes the problem of what streams to allow for into augur and machinery like augur clades. This is as opposed to dynamically sizing streams in Auspice based on tree size (perhaps with UI for how granular to make them). But that said, even if we at some point really wanted dynamic streams in Auspice, it's still going to be best to start with the simple branch label strategy as this is the necessary prerequisite anyway.

trvrb · 2024-11-20T00:37:41Z

@jameshadfield --- A follow-up thought here while on the topic of remit and structure for this feature. It would be amazing to be able to have https://nextstrain.org/nextclade/sars-cov-2/ but rather than each Pango lineage having a single circle, they would each have a stream.

I think this could effectively be hacked into your current setup by creating a branch label on the branch immediately leading to each tip with the Pango lineage label and then throwing in a set of 1 to 100 representative strains from this Pango lineage which would each possess necessary metadata of collection date, S1 mutations, region, etc... These 1-100 representative strains would not have any tree structure and would just be a comb / polytomy replacing the single existing tip. The number of strains per Pango lineage would be input so that more frequent lineages get more strains and consequently wider streams.
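
A rough sketch of that hack, assuming Auspice-style nested dicts. The helper names, the `pango` label key, the linear scaling of strain counts, and the example counts are all assumptions made for illustration; nothing here is an existing Augur or Nextclade command.

```python
def strains_per_lineage(counts, min_n=1, max_n=100):
    """Scale observed counts linearly so the most frequent lineage gets max_n strains."""
    biggest = max(counts.values())
    return {
        lineage: max(min_n, round(max_n * n / biggest))
        for lineage, n in counts.items()
    }


def expand_tip_to_polytomy(tip, strains, label_key="pango"):
    """Replace a single lineage tip with a labelled branch leading to a comb of strains."""
    lineage = tip["name"]
    return {
        "name": f"{lineage}_stream",  # hypothetical naming scheme for the new internal node
        "branch_attrs": {"labels": {label_key: lineage}},
        # No tree structure among the strains: they are all direct children.
        "children": [
            {"name": s["name"], "node_attrs": s["node_attrs"]} for s in strains
        ],
    }


# Made-up counts, purely to show the scaling.
quotas = strains_per_lineage({"BA.2": 5000, "XBB.1.5": 1000, "B.1.1": 10})

node = expand_tip_to_polytomy(
    {"name": "BA.2"},
    [
        {"name": "strain_1", "node_attrs": {"num_date": {"value": 2022.1}}},
        {"name": "strain_2", "node_attrs": {"num_date": {"value": 2022.3}}},
    ],
)
```

Because each strain keeps its own `node_attrs`, any coloring can be applied to the resulting stream, which is the argument for a bag of discrete strains over a more compact encoding.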

My main reason to bring this up: is a standard Auspice JSON with specific branch labels and polytomies of discrete strains the best way to encode this? I think so? There would be more efficient ways to encode this than a bag of strains if we had a single coloring to worry about, but if we want to allow different colorings, then treating this as a set of discrete strains with metadata is probably the way to go.

I would imagine this scenario of collapsed tip distributions / streams to be a common one. We have the analogous issue with unique seasonal influenza HA haplotypes.

I don't know if augur clades is the right place to stuff this sort of operation? Actually... one strategy would be to take a Nextclade reference tree (like the SARS-CoV-2 lineage tree above) and decorate it with additional input sequences also using Nextclade, and this way not perform full phylogenetic / TreeTime inference.

trvrb · 2024-11-20T00:42:40Z

[@jameshadfield] On a recent call we discussed collapsing identical sequences (tips) in order to better view large trees. One option here is making the tip size scale with the number of samples represented by the tip (expressing multiple colours / sampling dates is harder). Another option is to use streams (but be aware that n streams will always be a lot slower than n tips).

As you say, if you have the logic to collapse to streams you could explore the ability to collapse to circles whose area is proportional to n and whose color is the simple merged color logic we use for phylogeographic uncertainty. This would allow further scaling. But I think fair to start with just streams as getting this working is definitely the more complex avenue.
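
For the collapsed-circle option, a small sketch of the geometry. Area proportional to n means the radius grows with the square root of n. The count-weighted RGB average below is only a stand-in for the merged color; it is an assumption, not a description of how Auspice blends colors for phylogeographic uncertainty. Both function names and the `unit_radius` parameter are hypothetical.

```python
import math


def tip_radius(n, unit_radius=3.0):
    """Radius of a circle whose area is n times the area of a single-sample tip."""
    return unit_radius * math.sqrt(n)


def merged_color(counts_by_color):
    """Count-weighted mean of RGB tuples (assumed blend, for illustration only)."""
    total = sum(counts_by_color.values())
    return tuple(
        round(sum(rgb[i] * n for rgb, n in counts_by_color.items()) / total)
        for i in range(3)
    )


radius_for_four = tip_radius(4)
blend = merged_color({(255, 0, 0): 3, (0, 0, 255): 1})
```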

jameshadfield · 2024-12-18T00:59:26Z

Thanks for the feedback @trvrb - much appreciated.

I'm currently reworking this PR to both improve the code and address the shortcomings identified by Trevor and myself. If anyone has further feedback please provide it over the coming fortnight.

genehack · 2024-12-18T01:22:47Z

[@jameshadfield] If anyone has further feedback please provide it over the coming fortnight.

That's the next two weeks for those of you who only speak Merrikun! 🤣

joverlee521 · 2025-02-06T19:58:18Z

I haven't looked at the code, just jotting down my experience clicking around the stream-tree review app.

I really appreciate the warning notification when trying to switch stream branches within a subtree. It's also helpful that the stream tree toggle is disabled when it's not compatible with other options. I was expecting the branch labels control to control the stream-tree display, but realized this wasn't true after reading the stream-tree info panel and seeing the separate stream-tree branch label dropdown.

Suggestions:

Could the hover panel on a stream show the counts within the category, similar to how the frequencies panel shows time point + percentage?
I don't have specific feedback on how to improve this, but wanted to note that I didn't realize the stream-tree was displayed when filtering down to a small number of samples (e.g. dengue filtered to American Samoa).

Noticed bugs:

Changing the color-by while in the stream-tree view hides the stream name display within the stream
If the filter leads to displaying only a single stream, the branch leading to the stream is no longer clickable to zoom in on the stream (e.g. dengue filtered to denv1).
Clicking on "Zoom to selected" when viewing a single stream leads to a blank graph.

trvrb · 2025-02-06T21:42:08Z

So cool! Some viz / UI feedback from kicking the tires a bit more:

I know previous "experimental" features lacked URL support, but I do think it's worth building it in here. One of the primary use cases is being able to work with large trees, so having the ability to link directly to a large tree with streamtrees toggled on seems highly valuable.
I've found the "clade" view to be quite informative of clade relationships (as perhaps expected). It's easier for me to see that the most recent LP.8.1 viruses descend from JN.1 viruses when viewed as a streamtree rather than a fully branching tree. Compare these two views:

That said, the crossing lines here do detract from the ability to just read the plot. I know this is algorithmically complicated and so I don't know if it's truly feasible, but I feel that fewer lines crossing would be preferred to keeping things in approximately the same place when toggling. Right now, there's a lot of jumping when toggling (which is okay).
The bandwidth of the KDE smoothing seems perhaps a bit too wide? Here, there are 338 tips in the 20B clade, but just three observations in June and July 2021. Even so, the stream for 20B is about 25% as wide in July 2021 as it is at the peak in July and August 2020, when there are 63 observations. Since 3/63 is roughly 5%, I'd expect something closer to 5% than 25%. I realize however that making the KDEs too narrow will cause weirdly bumpy streams and so some smoothing is good. My impression is that there's just a bit too much at the moment.

The interaction between a branch and its stream is challenging. Here's a dengue example:

The discontinuity between branch and stream for DENV4 is particularly striking.

I'd wonder about taking the branch all the way to the middle of the stream. If the stream is plotted on top of the branch (on the z axis) then I could imagine opacity taking care of the issue. This would effectively give streams a minimum width (according to the branch width, which depends on sample count but is scaled non-linearly). I think this is probably okay. It would give visual continuity and would give a better "handle" to click on streams.

The ability to select "NONE" under "stream-tree branch label" feels over-loaded. The user should just toggle off via the UI toggle if they want to remove the streams.

Starting without a stream often feels quite weird. Here's ncov

where it seems strange that basal clade 19A isn't a stream when every other clade is.

Mpox lineage B.1 is particularly striking here. Having a stream for this populous basal lineage is particularly relevant.

A couple bugs:

Changing coloring while streamtrees are toggled on results in stream labels landing behind streams themselves.

The x-axis changes in a surprising way when toggling on streams. Here's ncov, where with streams on the x-axis expands to almost 2026.

trvrb · 2025-02-06T21:56:39Z

Related to the above comments about epidemiological / phylogeographic inference, at this point I'd imagine a strategy of building branch label annotations into augur traits. If every state transition gets a label, this could be used to readily surface these phylogeography stories for things like Ebola and Zika without having to do anything more than we're already doing. Currently, showing streams of "clades" emphasizes clade structure (as desired); I believe this small trick would appropriately emphasize geographic structure (when desired).

Before doing this work in augur traits I would suggest to just hack together this labeling for Ebola and/or Zika to see how it pans out.
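
One way such a hack could look, as a sketch only. It assumes an Auspice-style tree where the inferred trait is stored at `node_attrs[trait]["value"]`. The `transition` label key and the `State.N` naming scheme are invented here so that repeated introductions into the same place get distinct names; this is not an existing augur traits feature.

```python
from collections import Counter


def label_transitions(node, trait, label_key, parent_state=None, seen=None):
    """Add a uniquely named branch label wherever the trait changes from parent to child."""
    if seen is None:
        seen = Counter()
    state = node.get("node_attrs", {}).get(trait, {}).get("value")
    if parent_state is not None and state is not None and state != parent_state:
        seen[state] += 1
        labels = node.setdefault("branch_attrs", {}).setdefault("labels", {})
        # e.g. "Colombia.2" for the second inferred introduction into Colombia
        labels[label_key] = f"{state}.{seen[state]}"
    for child in node.get("children", []):
        # Nodes without an inferred state inherit the parent's state for comparison.
        next_state = state if state is not None else parent_state
        label_transitions(child, trait, label_key, next_state, seen)
    return node


def _attrs(country):
    return {"country": {"value": country}}


# Toy tree with two separate Brazil -> Colombia transitions.
tree = {
    "name": "root",
    "node_attrs": _attrs("Brazil"),
    "children": [
        {
            "name": "n1",
            "node_attrs": _attrs("Colombia"),
            "children": [{"name": "tip_1", "node_attrs": _attrs("Colombia")}],
        },
        {
            "name": "n2",
            "node_attrs": _attrs("Brazil"),
            "children": [{"name": "n3", "node_attrs": _attrs("Colombia")}],
        },
    ],
}

label_transitions(tree, trait="country", label_key="transition")
```

The two introductions end up labelled `Colombia.1` and `Colombia.2`, so each becomes its own stream rather than being merged into one "Colombia" stream.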

This work isn't blocking merging this PR

huddlej

Some stream-of-consciousness thoughts and testing today:

Viewing the H3N2 HA 2y tree, I was surprised that the default stream annotation was “clade” when the default coloring and branch labeling was “subclade”. Will we have a separate option to control the default stream field or will it default to the same default as the branch label field?

In the same tree view above, I didn’t know what the thick vertical bars on the left represented at first. I’m also not sure why there are gaps in the trunk of the tree leading to the most recent samples.

I like that the tooltip for stream trees tells me why I can’t enable them (e.g., when I’ve enabled exploded trees).

What if we could explode the stream trees? Would that simplify some of the issues with the branch crossings in the current view?

“Explode tree by” has a “None” option to disable that view. The “second tree” has “REMOVE” to disable that view. “Branch labels” has “none” as an option. “Stream trees” has “NONE” and a toggle button. I like the toggle button best, as a way to turn something on/off, but maybe it's better to have a consistent interface across the different tree viewing options?

The option to color by proposed subclade and stream by subclade is really cool for flu. I like how this view communicates the total number of samples in each subclade and overall at a specific timepoint. Unlike the frequencies panel, this view shows how new clades have emerged against the background of existing clades. I think you mentioned this before, @jameshadfield, but is the height of the streams at any given timepoint proportional to the number of samples across all streams? For example, if I’m comparing J.2.2 to J.2 in the screenshot below, is it fair to say that J.2.2 at its peak had <50% of samples that J.2 had around the same time?

Toggling between time and divergence views or filtering by anything is super smooth!

I notice that branch labels don’t appear at the base of clades. This lack of a label makes it trickier to pick specific subtrees for linking in reports.

I like how coloring streams by genotype shows the recurrence of specific substitutions through time against different phylogenetic backgrounds.

It’s also cool to see a stream tree by clade colored by titer measurements against a specific reference virus. The tree view is much more informative about the number of samples and antigenic distance through time than the frequency panel. For this reason, I’d second Trevor’s request to have a URL toggle for the stream trees, so we could include these views in narratives.

jameshadfield · 2025-02-07T00:43:53Z

[@huddlej] is the height of the streams at any given timepoint proportional to the number of samples across all streams? For example, if I’m comparing J.2.2 to J.2 in the screenshot below, is it fair to say that J.2.2 at its peak had <50% of samples that J.2 had around the same time?

Kind-of. Starting with a normal rectangular tree, let's define the vertical separation between tips as 1 unit. For stream trees, each sample gets a KDE centered around its date/div with some bandwidth, and the height of this KDE at the center is 1 unit. So the answer to your question is generally yes. But since both the bandwidth and pivot spacing differ between streams, when you look at the peak of a stream you're seeing a superposition of samples, some of which were centered at that peak and some of which weren't, and the contribution of the latter to the peak is hard to reason about. Would be great to chat more about this 1:1, including how to choose pivots and bandwidth.
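
To make the "1 unit at the center" idea concrete, here is a small sketch. It assumes a Gaussian kernel, which is not stated above; the function name `stream_height` and the example dates are hypothetical. Each kernel is left unnormalised so that a lone sample contributes exactly 1 unit at its own date.

```python
import math


def stream_height(pivots, sample_dates, bandwidth):
    """Stream height at each pivot: sum of Gaussian bumps, each of height 1 at its centre."""
    return [
        sum(math.exp(-0.5 * ((p - d) / bandwidth) ** 2) for d in sample_dates)
        for p in pivots
    ]


pivots = [2021.0, 2021.1, 2021.2]

# One sample: exactly 1 unit tall at its own date.
one = stream_height(pivots, [2021.0], bandwidth=0.1)

# Two nearby samples: the height at 2021.0 is 1 unit from the sample centred
# there plus a partial contribution from the sample centred at 2021.1.
two = stream_height(pivots, [2021.0, 2021.1], bandwidth=0.1)
```

The second case shows why peaks are hard to read: the height at a pivot mixes whole units from samples centred there with fractions of units from neighbours, and the size of those fractions depends on the bandwidth.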

huddlej · 2025-02-07T18:19:06Z

Thank you, @jameshadfield! I had forgotten these details from the last time we talked.

[@jameshadfield] ...since both the bandwidth and pivot spacing differs between streams when you look at the peak of a stream you're seeing a superposition of samples some of which were centered at that peak and some of which aren't...

I remember now that I thought it would be easier to interpret the streams if all streams used the same pivot spacing and bandwidth. But, I can imagine how that might be a challenge to implement. I'm down to talk about this whenever you like.

trvrb · 2025-02-16T22:49:51Z

Without really diving in, I would expect a single pivot spacing and KDE bandwidth to work across streams. Bumpiness to streams should be an attribute of the absolute amount of data (number of tips) available. I'm not sure it makes sense to have the stream with 100 tips have a different bandwidth than the stream with 1000 tips for this reason. The 100 tip stream should be more bumpy.
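
A toy illustration of that expectation, again assuming Gaussian kernels of height 1. The bandwidth, pivot grid, sample spacings, and the `bumpiness` measure are all made up for the example. With one shared bandwidth and one shared pivot grid, the sparsely sampled stream comes out bumpy and the densely sampled one comes out smooth.

```python
import math


def stream_height(pivots, dates, bandwidth):
    """Sum of unnormalised Gaussian kernels evaluated at each pivot."""
    return [
        sum(math.exp(-0.5 * ((p - d) / bandwidth) ** 2) for d in dates)
        for p in pivots
    ]


def bumpiness(heights):
    """Relative drop from the tallest to the shortest pivot (0 = perfectly flat)."""
    return (max(heights) - min(heights)) / max(heights)


# Shared by every stream (values chosen only for this example).
BANDWIDTH = 0.02
PIVOTS = [0.3 + 0.005 * i for i in range(81)]  # interior grid from 0.3 to 0.7

sparse_dates = [0.1 * i for i in range(11)]    # 11 samples spread over one unit of time
dense_dates = [0.01 * i for i in range(101)]   # 101 samples over the same span

sparse = bumpiness(stream_height(PIVOTS, sparse_dates, BANDWIDTH))
dense = bumpiness(stream_height(PIVOTS, dense_dates, BANDWIDTH))
```

Here the bumpiness follows from the amount of data alone, and heights are directly comparable between the two streams because they share the same kernel and pivots.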

jameshadfield added experiment PRs which may never be merged preview on nextstrain.org labels Nov 14, 2024

nextstrain-bot temporarily deployed to auspice-james-stream-tr-37i7jd November 14, 2024 07:46 Inactive

nextstrain-bot mentioned this pull request Nov 14, 2024

[bot] [DO NOT MERGE] Test Auspice PR 1902 nextstrain/nextstrain.org#1075

Draft

jameshadfield mentioned this pull request Dec 18, 2024

Prototype stream graph summaries of big trees #1866

Closed

jameshadfield force-pushed the james/stream-trees branch from acbc70c to e617e12 Compare February 3, 2025 03:57

Streamtrees

b1b2f77

jameshadfield force-pushed the james/stream-trees branch from e617e12 to b1b2f77 Compare February 6, 2025 01:20

huddlej reviewed Feb 7, 2025
