-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate why sample nodes attach to deep roots / ultimate ancestor #903
Comments
I think it would also be interesting to count the number of parent=ultimate ancestor edges (and total span) for samples. I think it's clear that we don't expect many samples to connect to the root (a max of 1, for a binary tree), so I don't see much value in comparing back to the source trees here. |
After some discussion, we think it would be useful (a) to see if these deep-rooted samples (especially the ultimate-ancestor ones) are concentrated at the flanks, and (b) whether the switch to the root is caused by running out of ancestral haplotype. |
It's interesting this doesn't happen when using the true site times, if I'm reading the second panel correctly |
Here's the plot we get if we change to It reduces the number of root-attached samples a lot when we first use freq order, but actually increases them when we use the true time order?? And although the sample->grandMRCA edges are removed (3rd row), we still get a lot of sample->root ancestors (i.e. after simplification, 4th row). So I'm not sure that building longer ancestors is going to fix the problem. Also not much effect once we iterate (bottom panel) |
I adjusted the plots above to add the ancestor lengths too. It's clear that the sample->grandMRCA nodes when true times are used (2nd row, which is as good as we are going to get by repeated iteration) are not caused by running out of ancestral material, as all of the ancestors are of full length. I wonder if this effect is caused by the fact that using the true time of each site breaks the "nestedness" assumption of the ancestor builder logic? E.g. if an adjacent site is older than the focal site, but at lower freq, it's still treated as potentially contributing an ancestral state to the haplotype. |
Hmm, that is interesting. We need to really investigate what's happening here when we order sites by time. Maybe a slightly different hueristic is called for. |
So we can partially solve the problem by setting derived states when building the ancestral haplotype only if both the time of the node associated with the mutation is older than the focal node and the frequency of that mutation is greater than the frequency of the mutation at the focal node. We can alter the Python generator to do this by using Line 53 in fa6c364
(and calling We can then say Line 222 in fa6c364
This will only change the ancestor builder at sites to the left or right of the focal site which are old-but-low-frequency. In this case, (a) the generated ancestor will take the ancestral state, even if a majority of focal samples have the derived state (b) the sample set which serves as a list of samples to track will get whittled down to a slightly different subset of samples. This doesn't actually improve the redating, though (was spearman's r = 0.951 on relating, now 0.950 - no real difference). I'll do some testing on how it affects the iterative procedure in #877 (comment) and also what happens when we combine this with building full-length ancestors. |
Brilliant - if we can characterise these bad haplotypes we can fix them! I'd rather not use mismatch in ancestor matching unless we have to (various reasons) |
OK, so I think I know what's going on now. Imagine you are trying to match a sample. You have basically the correct ancestral haplotype against which to match, and are happily trundling e.g. rightwards along the sample, copying from the correct ancestral haplotype. Now assume that the haplotype is wrongly reconstructed in one position, and has a "bad site": in particular, the sample has a Since we are not using the mismatch parameter, we are forced to switch to copying from another haplotype. But there aren't any similar haplotypes with a You might hope that after that root jump, the HMM would switch back to the original haplotype from which you were copying, in which case the mismatch parameter should allow us to soak up the ancestor-building-error quite nicely. However, the wrongly-built ancestor will be affected by the bad site, and will be kicking out some of the samples from the set it's using for reconstruction, so the haplotype reconstruction to the right of the bad site will also be a bit wacky, and you might not resume copying correctly from the same haplotype. You might hope that the "buffer" that we use would deal with this reasonably well: it would require 2 sites in a row to show this pattern before we kick out the wrong samples from the sample set. |
How about mispolarisation and/or sequencing error @petrelharp ? |
That's the next thing I'm working on... |
Well, assuming I'm mispolarising correctly, that's not doing it either:
|
In trees from real data, we often find sample nodes attached via long branches to deep in the trees, often to the root. We should investigate why this is happening.
I wondered at first if these deep-rooted samples would be partially removed if we get the sites in the correct order, which seems to produce much better dating results. But some experimentation seems to suggest this doesn't help much (and unsurprisingly, neither does a round of dating + reinference). The plot below shows that regardless of the sample size, very few sample nodes attach to the root (solid lines). But as soon as any inference is involved, even with the true site dates (dot-dashed lines) we get roughly the same distribution of samples attached to their local root as other inference attempts (dotted, dashed lines)
The text was updated successfully, but these errors were encountered: