Streamtrees #1902
This is so cool @jameshadfield! I can give more detailed feedback soon, but I wanted to highlight just a few things on initial perusal:
(After conversation with James...) We want to be sure this viz approach is useful for being able to read evolutionary / epidemiological stories from the genetic data. If we construct streams from the ![]() And we can do things like color by S1 mutations like we often want to understand what's driving clade success. ![]() But in general I'd be trying to think about how a streamtree view would enable proper reading of stories like:
My guess is that streamtrees would work quite well for the epidemiological cases, but only really when "clades" are created that correspond well to geographic transitions (note that this tracking geography was where Pango lineages got their start). This could be literally creating branch labels to mark geographic transitions from |
Related to the above, I think the biggest design decision here is to enforce creating streams from existing branch labels. This effectively pushes the problem of what streams to allow for into augur and machinery like |
@jameshadfield --- A follow up thought here while on the topic of remit and structure for this feature. It would be amazing to be able to have https://nextstrain.org/nextclade/sars-cov-2/ but rather than each Pango lineage having a single circle, they would each have a stream. ![]() I think this could effectively be hacked into your current setup by creating a branch label on the branch immediately leading to each tip with the Pango lineage label and then throwing in a set of 1 to 100 representative strains from this Pango lineage which would each possess necessary metadata of collection date, S1 mutations, region, etc... These 1-100 representative strains would not have any tree structure and would just be a comb / polytomy replacing the single existing tip. The number of strains per Pango lineage would be input so that more frequent lineages get more strains and consequently wider streams. My main reason to bring this up: is a standard Auspice JSON with specific branch labels and polytomies of discrete strains the best way to encode this? I think so? There would be more efficient ways to encode this than a bag of strains if we had a single coloring to worry about, but if we want to allow different colorings, then treating this as a set of discrete strains with metadata is probably the way to go. I would imagine this scenario of collapsed tip distributions / streams to be a common one. We have the analogous issue with unique seasonal influenza HA haplotypes. I don't know if |
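To make the encoding question concrete, here is a rough sketch of what "a branch label plus a polytomy of representative strains" could look like as an Auspice-style node. The label key `pango`, the helper names, the 1 to 100 scaling rule, and the choice to date the internal node at its earliest tip are all assumptions for illustration, not anything implemented in this PR.

```python
def representatives(counts, max_n=100):
    """Scale raw lineage counts to between 1 and max_n representative strains."""
    biggest = max(counts.values())
    return {lin: max(1, round(max_n * c / biggest)) for lin, c in counts.items()}


def lineage_polytomy(lineage, strains, label_key="pango"):
    """Build one labelled branch with an unstructured comb of tips below it.

    `strains` is a list of dicts with hypothetical keys name, num_date, region.
    """
    children = [
        {
            "name": s["name"],
            "node_attrs": {
                "num_date": {"value": s["num_date"]},
                "region": {"value": s["region"]},
            },
        }
        for s in strains
    ]
    return {
        "name": f"NODE_{lineage}",
        # The branch label is what a stream would be keyed on.
        "branch_attrs": {"labels": {label_key: lineage}},
        # Assumption: place the internal node at the earliest tip date.
        "node_attrs": {"num_date": {"value": min(s["num_date"] for s in strains)}},
        "children": children,
    }
```

Because every tip is still a discrete strain with its own metadata, any coloring (region, S1 mutations, etc.) would work without a special-purpose encoding.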
As you say, if you have the logic to collapse to streams, you could explore the ability to collapse to circles whose area is proportional to n and whose color follows the simple merged-color logic we use for phylogeographic uncertainty. This would allow further scaling. But I think it's fair to start with just streams, as getting this working is definitely the more complex avenue.
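For the circle option, "area proportional to n" means the radius grows with the square root of n. A minimal sketch, where `area_per_tip` is a hypothetical scaling constant:

```python
import math


def circle_radius(n, area_per_tip=1.0):
    """Radius of a circle whose area is n * area_per_tip."""
    return math.sqrt(n * area_per_tip / math.pi)
```

Under this rule, a collapsed clade with four times as many tips gets a circle with twice the radius.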
Thanks for the feedback @trvrb - much appreciated. I'm currently reworking this PR to both improve the code and address the shortcomings identified by Trevor and myself. If anyone has further feedback please provide it over the coming fortnight.
That's the next two weeks for those of you who only speak Merrikun! 🤣
acbc70c to e617e12
e617e12 to b1b2f77
I haven't looked at the code, just jotting down my experience clicking around the stream-tree review app. I really appreciate the warning notification when trying to switch stream branches within a subtree. It's also helpful that the stream tree toggle is disabled when it's not compatible with other options. I was expecting the branch labels control to control the stream-tree display, but realized this wasn't true after reading the stream-tree info panel and seeing the separate stream-tree branch label dropdown. Suggestions:
Noticed bugs:
So cool! Some viz / UI feedback from kicking the tires a bit more:
The discontinuity between branch and stream for DENV4 is particularly striking. I'd wonder about taking the branch all the way to the middle of the stream. If the stream is plotted on top of the branch (on the z axis) then I could imagine opacity taking care of the issue. This would effectively give streams a minimum width (according to the branch width, which depends on sample count but is scaled non-linearly). I think this is probably okay. It would give visual continuity and would give a better "handle" to click on streams.
![]() where it seems strange that basal clade 19A isn't a stream when every other clade is. Mpox lineage B.1 is particularly striking here. Having a stream for this populous basal lineage is particularly relevant. ![]() A couple bugs:
Related to above comments about epidemiological / phylogeographic inference, at this point I'd imagine a strategy of building in branch label annotations into Before doing this work in This work isn't blocking merging this PR |
Some stream-of-consciousness thoughts and testing today:
Viewing the H3N2 HA 2y tree, I was surprised that the default stream annotation was “clade” when the default coloring and branch labeling was “subclade”. Will we have a separate option to control the default stream field or will it default to the same default as the branch label field?

In the same tree view above, I didn’t know what the thick vertical bars on the left represented at first. I’m also not sure why there are gaps in the trunk of the tree leading to the most recent samples.
I like that the tooltip for stream trees tells me why I can’t enable them (e.g., when I’ve enabled exploded trees).
What if we could explode the stream trees? Would that simplify some of the issues with the branch crossings in the current view?
“Explode tree by” has a “None” option to disable that view. The “second tree” has “REMOVE” to disable that view. “Branch labels” has “none” as an option. “Stream trees” has “NONE” and a toggle button. I like the toggle button best, as a way to turn something on/off, but maybe it's better to have a consistent interface across the different tree viewing options?
The option to color by proposed subclade and stream by subclade is really cool for flu. I like how this view communicates the total number of samples in each subclade and overall at a specific timepoint. Unlike the frequencies panel, this view shows how new clades have emerged against the background of existing clades. I think you mentioned this before, @jameshadfield, but is the height of the streams at any given timepoint proportional to the number of samples across all streams? For example, if I’m comparing J.2.2 to J.2 in the screenshot below, is it fair to say that J.2.2 at its peak had <50% of samples that J.2 had around the same time?

Toggling between time and divergence views or filtering by anything is super smooth!
I notice that branch labels don’t appear at the base of clades. This lack of a label makes it trickier to pick specific subtrees for linking in reports.
I like how coloring streams by genotype shows the recurrence of specific substitutions through time against different phylogenetic backgrounds.

It’s also cool to see a stream tree by clade colored by titer measurements against a specific reference virus. The tree view is much more informative about the number of samples and antigenic distance through time than the frequency panel. For this reason, I’d second Trevor’s request to have a URL toggle for the stream trees, so we could include these views in narratives.

Kind of. Starting with a normal rectangular tree, let's define the vertical separation between tips as 1 unit. For stream trees, each sample gets a KDE centered around its date/div with some bandwidth, and the height of this KDE at the center is 1 unit. So the answer to your question is generally yes. But since both the bandwidth and the pivot spacing differ between streams, when you look at the peak of a stream you're seeing a superposition of samples, some of which were centered at that peak and some of which weren't, and the contribution of the latter to the peak is hard to reason about. Would be great to chat more about this 1:1, including how to choose pivots and bandwidth.
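A small sketch of that height calculation may help when reasoning about peaks. It assumes a Gaussian kernel (the comment above only says "KDE") scaled so that each sample contributes exactly 1 unit at its own date; the function name and arguments are hypothetical, not Auspice's actual code.

```python
import math


def stream_height(sample_dates, pivots, bandwidth):
    """Height of one stream at each pivot.

    Each sample adds a Gaussian bump with peak height 1 (not unit area),
    so a lone sample is as tall as one tip in a rectangular tree.
    """
    heights = []
    for p in pivots:
        h = sum(math.exp(-0.5 * ((p - d) / bandwidth) ** 2) for d in sample_dates)
        heights.append(h)
    return heights
```

With samples at 2020.0 and 2020.1 and a bandwidth of 0.1, the height at pivot 2020.0 is 1 from the first sample plus roughly 0.61 from the neighbouring one, which is the superposition effect described above.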
Thank you, @jameshadfield! I had forgotten these details from the last time we talked.
I remember now that I thought it would be easier to interpret the streams if all streams used the same pivot spacing and bandwidth. But I can imagine how that might be a challenge to implement. I'm down to talk about this whenever you like.
Without really diving in, I would expect a single pivot spacing and KDE bandwidth to work across streams. Bumpiness of streams should be an attribute of the absolute amount of data (number of tips) available. I'm not sure it makes sense for the stream with 100 tips to have a different bandwidth than the stream with 1000 tips, for this reason. The 100-tip stream should be more bumpy.
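As a sketch of what a single pivot spacing and bandwidth across streams could look like: build one pivot grid spanning all streams, then evaluate every stream on it with the same kernel. The Gaussian kernel with peak height 1 and all names here are assumptions for illustration.

```python
import math


def shared_pivots(streams, spacing):
    """One evenly spaced pivot grid covering the date range of every stream."""
    lo = min(min(dates) for dates in streams.values())
    hi = max(max(dates) for dates in streams.values())
    n = int(math.floor((hi - lo) / spacing)) + 1
    return [lo + i * spacing for i in range(n)]


def heights_on_shared_grid(streams, spacing, bandwidth):
    """Evaluate every stream on the same pivots with the same bandwidth."""
    pivots = shared_pivots(streams, spacing)
    heights = {
        name: [
            sum(math.exp(-0.5 * ((p - d) / bandwidth) ** 2) for d in dates)
            for p in pivots
        ]
        for name, dates in streams.items()
    }
    return pivots, heights
```

With shared settings, heights are directly comparable between streams at any pivot: a stream with twice the samples at the same dates is exactly twice as tall everywhere, and any extra bumpiness in a small stream comes only from having less data.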
This is the first prototype of a long-running idea of mine (and others) as a way of displaying big trees where our typical approach is either too slow and/or we run out of pixels. This sketch was from many years ago:

This prototype implements streams by allowing branch labels to cut the tree into monophylies / paraphylies and visualising each as streamgraphs. While the motivation was for this to display huge trees which Auspice can't currently display, it works and is useful (and fun) for smaller trees as well. Long-term I think this would be a good starting view into very large or diverse datasets.
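For readers who want a mental model of the partitioning, here is a minimal sketch (not the code in this PR) that walks an Auspice-style tree and assigns each tip to the stream of its nearest labelled ancestor. The function name is hypothetical, and the nested-dict layout with `branch_attrs.labels` is assumed to follow the Auspice JSON format.

```python
def assign_streams(node, label_key, current=None, out=None):
    """Map every tip name to the label value of its nearest labelled ancestor.

    Tips above any labelled branch map to None. A labelled branch nested
    inside another starts a new stream, which leaves the outer stream
    paraphyletic.
    """
    if out is None:
        out = {}
    label = node.get("branch_attrs", {}).get("labels", {}).get(label_key)
    if label is not None:
        current = label
    children = node.get("children", [])
    if not children:
        out[node["name"]] = current
    for child in children:
        assign_streams(child, label_key, current, out)
    return out
```

Each distinct value in the result would then correspond to one streamgraph.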
How to use
nextstrain.org review app
Any dataset with a branch label (apart from amino acid) can have streamtrees. The dropdown in the sidebar can change the label used to partition the tree, and there's a toggle to go between streamtrees and normal tree view. My general view is that the appropriate partitioning of sample-sets will be best done in Augur, either algorithmically or manually. (There's also scope for dynamic partitioning in Auspice via genotype or color-by, but that's not implemented here.)
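To check which labels a dataset would offer in that dropdown, something like the following sketch could be run over the tree of an Auspice JSON. It assumes branch labels live under `branch_attrs.labels` and that the amino acid label uses the key `aa`; the function name is hypothetical.

```python
def stream_label_keys(node, excluded=("aa",)):
    """Collect branch label keys used anywhere in the tree, minus excluded ones."""
    keys = set(node.get("branch_attrs", {}).get("labels", {}))
    for child in node.get("children", []):
        keys |= stream_label_keys(child, excluded)
    return keys - set(excluded)
```

An empty result would mean the dataset has no label that could be used to partition the tree into streams.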
Suggested testing datasets
Known caveats / bugs
Future directions
Screenshots
Screen.Recording.2024-11-14.at.8.43.25.PM.mov
Screen.Recording.2024-11-14.at.8.40.30.PM.mov