Split GISAID profile to "six-month" and "all-time" builds #910
The above commit swaps from |
The above commit replicates the For example you can easily see the recent prevalence of BA.4 and BA.5 in South Africa and the distribution across the country with BA.4 in Gauteng and BA.5 in KwaZulu-Natal (https://nextstrain.org/staging/multiple-timespans/ncov/gisaid/africa/1m?c=emerging_lineage&f_country=South%20Africa&r=division): Here, I'd plan to default to |
This is looking great Trevor - I really like this! I actually think the 1-month views on Nextstrain could be pretty interesting. But of course, adding more builds means more time and more money, so that needs to be balanced. Still, it might make quick investigation of growing/changing trends a little easier.
Trial run has completed with datasets available as: Things seem to look generally good. Though a quick note on compute time / resources. Our standard GISAID builds with 6 regional endpoints run on 36 CPUs with 70Gb of memory and the This trial run used the same resources to run 21 regional endpoints. It took ~7hr. So almost but not quite 3X as expected. When merging this PR, we should increase resources for these rebuild jobs to use 96 CPUs and specify 8 CPUs for In using 96 CPUs, we'll still have wasted resources as the |
Okay. After the level of Slack discussion surrounding I've also adjusted the |
The above two commits fix some oversights that I had in copying the GISAID Trial runs are up at: GISAID, Open. I believe this PR is now ready to be merged.
Really happy with these changes - it's great to have builds which are (temporally) uniformly sampled as well as ones focused on the recent (6m) situation.
However, we need to update the description markdown to reflect this (it currently states "Our primary global analysis subsamples to ~600 genomes per continental region with ~400 from the previous 4 months and ~200 from before this."). We can make 2 markdowns (6m vs all) and set this per-build via `config["builds"][<build>]["description"]`.
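As a minimal sketch of the per-build lookup suggested above: the helper name `description_for`, the build names, and the file paths are all hypothetical, and the real workflow may resolve the description differently.

```python
# Hypothetical config; only the config["builds"][<build>]["description"]
# shape comes from the comment above. Paths and build names are made up.
config = {
    "builds": {
        "global_6m": {"description": "descriptions/6m.md"},
        "global_all-time": {"description": "descriptions/all-time.md"},
        "custom_build": {},  # no per-build description set
    }
}


def description_for(config, build_name, default):
    """Return the per-build description file, falling back to a default."""
    build = config.get("builds", {}).get(build_name, {})
    return build.get("description", default)


print(description_for(config, "global_6m", "descriptions/default.md"))
print(description_for(config, "custom_build", "descriptions/default.md"))
```

The fallback keeps builds that never set a description working unchanged.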
Thanks for the catch @jameshadfield! I went through and revised both the GISAID and the open I list out the available views like so:
It should look better to table them out as:
However, this currently renders as: I'm going to leave this in list form for this PR, but perhaps we can update Auspice's I've kicked off new trial runs that should fix descriptions. But I now believe this is ready to merge.
@trvrb I just realized that when we merge this PR, we'll need to ping GISAID to let them know the new canonical paths for the "nextregions" files. I assume we want to direct folks to "all-time" builds for now, so we don't break workflows for people who have used these data as background for full-pandemic builds. But, it could make sense to switch to a shorter time period for "nextregions" data eventually. |
Good catch @huddlej! It's a little unfortunate to have to pick one or the other. Ideally GISAID could offer "Global / 6m", "Global / all time", etc. as options rather than just swapping out the 7 existing files. I'll propose this, but if it's just the 7, then updating paths to correspond to the "all time" outputs seems best.
@tsibley: I'm running into a final small issue with this PR that I don't know immediately how to implement a simple fix for. If you look at https://nextstrain.org/staging/ncov/open/trial/multiple-timespans/africa/6m you'll see: The first link is then to https://data.nextstrain.org/files/ncov/open/africa_6m/metadata.tsv.xz rather than the desired https://data.nextstrain.org/files/ncov/open/africa/6m/metadata.tsv.xz. Here,
in rule I know this should be a simple change to regex Edit: I'm realizing we also need to update |
After thinking about the above issue a bit further, my proposal is to just use
This directly names what the file is, in the same way as
The question I’m asking myself reading this is whether we even want the dataset JSON(s) to be in the directory or not? My reading of |
For those playing at home, @trvrb, @jameshadfield, and I had a lengthy discussion about URLs/names in a Slack thread. @trvrb I can update One thing I want to call out clearly is that this PR does represent effectively a breaking change for the https://data.nextstrain.org/files/ncov/open/ data, as the old names will no longer be updated. We should figure out how we want to handle that (even if it's to explicitly decide the breakage is ok). |
Thanks @tsibley! Given the level of uncertainty regarding file names (and how this interacts with the planned dataset API for nextstrain.org), I think we should make the least breaking change with this PR. I'd see this as:
I'm choosing This will allow us to more seamlessly transition to updated file names when we're ready where we can continue to provide the previous file URLs and then include on top of these additional split |
@trvrb Sounds good. Testing changes for that locally and ran into a small complication because of the |
Avoids (for now) changes that would break downstream usage like external builds or other analyses based on the Open data and the fetches GISAID makes to re-serve the files themselves. 6m builds are, per @trvrb, "effectively the same files as we're currently providing (with subsampling targeting recent viruses)."¹ In the future, once we work out naming more generally and other APIs, we'll provide files for the all-time builds too.

The build description template is updated to handle the new build names. The templating method changed to make it easier to support dynamic template vars.

Since the `build_description` rule runs for _all_ builds, not just our own, it's important that we maintain backwards compat. This will mostly maintain it except in a slight edge case where `$BUILD` will now be substituted in addition to `${BUILD}`.

The `upload` rule is expected to get less usage outside of our own builds, but I believe it does get some, so it will maintain backwards compat behaviour (as long as someone's current build names don't already match our new ones).

¹ #910 (comment)
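The `$BUILD` versus `${BUILD}` edge case can be illustrated with Python's standard `string.Template`, which substitutes both forms. This is only a sketch under the assumption that the new templating behaves similarly; the commit message does not say which mechanism the workflow uses, and `render_description` is a hypothetical helper.

```python
from string import Template


def render_description(template_text, **template_vars):
    """Fill in template vars, leaving unknown placeholders untouched."""
    # safe_substitute() does not raise on placeholders with no value,
    # which matters when a description contains other dollar signs.
    return Template(template_text).safe_substitute(template_vars)


text = "Braced: ${BUILD}; bare: $BUILD; unknown: $OTHER; escaped: $$BUILD"
rendered = render_description(text, BUILD="africa_6m")
print(rendered)
# Braced: africa_6m; bare: africa_6m; unknown: $OTHER; escaped: $BUILD
```

Under this scheme, a description that needs a literal `$BUILD` would have to write `$$BUILD`.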
Pushed. |
After deploying these builds for the first time to nextstrain.org, we'll want to set up redirects from the unqualified names to the 6m builds (e.g. ncov/open/global → ncov/open/global/6m).
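A rough sketch of that redirect rule follows. The function name, the set of region slugs, and the idea of expressing the rule in Python are all assumptions for illustration; nextstrain.org's actual redirect mechanism is not described in this thread.

```python
# Assumed region slugs for the 7 regional builds; only "global" and
# "africa" appear verbatim in this thread.
REGIONS = {
    "global", "africa", "asia", "europe",
    "north-america", "oceania", "south-america",
}
SOURCES = {"open", "gisaid"}


def redirect_target(path, default_timespan="6m"):
    """Map an unqualified dataset path to its 6m build, else return None."""
    parts = path.strip("/").split("/")
    is_unqualified = (
        len(parts) == 3
        and parts[0] == "ncov"
        and parts[1] in SOURCES
        and parts[2] in REGIONS
    )
    if is_unqualified:
        return "/".join(parts + [default_timespan])
    return None  # already qualified, or not a path this rule covers


print(redirect_target("ncov/open/global"))     # ncov/open/global/6m
print(redirect_target("ncov/open/global/6m"))  # None
```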
This commit splits the existing regional builds "global", "africa", etc. in the "nextstrain-gisaid" profile into "six-month" builds that focus subsampling on the previous six months and "all-time" builds that subsample evenly across time. This uses the new relative dates functionality in "augur filter" and "augur frequencies" to make these subsampling strategies easier to implement and more obvious.

The general subsampling logic is cleaned up in a few ways:

1. North America and Oceania are subsampled and traits reconstructed at the "division" level, while Africa, Asia, Europe and South America are subsampled and traits reconstructed at the "country" level. Previously this behavior had been inconsistent between subsampling, traits, etc.
2. For global builds, all regions are now sampled at equal frequency except for Oceania which is 33%. Previous overemphasis on Europe and North America is no longer justified.
3. There is a consistent 4:1 emphasis on recent vs early samples for the "six-month" builds and a consistent 4:1 emphasis on focal vs context for the regional builds.

Frequencies timespans are set to match subsampling ranges.

The description.md footer text is updated to describe this split and to provide a table of links for region x time period combinations.
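To make the 4:1 emphasis concrete, here is a small arithmetic sketch. The total of 1000 sequences is a made-up number, and `split_by_ratio` is a hypothetical helper; the actual targets live in the profile's subsampling config.

```python
def split_by_ratio(total, major=4, minor=1):
    """Split a sample budget into (major, minor) parts at a major:minor ratio."""
    major_count = total * major // (major + minor)
    return major_count, total - major_count


# Hypothetical budget of 1000 sequences for a "six-month" build:
recent, early = split_by_ratio(1000)
print(recent, early)  # 800 200

# The same ratio applied to focal vs context in a regional build:
focal, context = split_by_ratio(1000)
print(focal, context)  # 800 200
```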
Follow the same logic from Nextstrain GISAID and split Nextstrain open to produce "6m" targets that focus subsampling on the previous 6 months as well as "all-time" targets that subsample evenly since pandemic start. Remove subsampling_ranges.smk as it's no longer referenced.
To compensate for doubling build targets from 7 regional builds to 14 regional builds, this commit doubles computational resources from 36 CPUs to 72 CPUs. These specific CPU numbers are keyed to AWS EC2 instance sizes. A c5.9xlarge is 36 CPUs, a c5.12xlarge is 48 CPUs and a c5.18xlarge is 72 CPUs. We should be picking one of these and not a number in between.

Finally, this reduces `--set-threads tree` from 16 to 8. There are often close to 7 trees that want to run simultaneously. With 36 CPUs, we'd get situations where 2 trees were taking up 32 CPUs, leaving 4 idle. With this commit, we'll have 72 CPUs and want to simultaneously run 14 trees. If trees are each 8 CPUs, this should better fit into resources.
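The packing argument above is simple integer arithmetic, sketched here with a hypothetical helper `concurrent_trees`. It ignores every other rule competing for CPUs, so it is only a rough model of the scheduler.

```python
def concurrent_trees(total_cpus, threads_per_tree):
    """Return (trees that can run at once, CPUs left idle by tree jobs)."""
    running = total_cpus // threads_per_tree
    idle = total_cpus - running * threads_per_tree
    return running, idle


print(concurrent_trees(36, 16))  # (2, 4): the old setup, 4 CPUs stranded
print(concurrent_trees(72, 16))  # (4, 8): more CPUs alone still strands 8
print(concurrent_trees(72, 8))   # (9, 0): 8 threads per tree divides 72 evenly
```

With 14 trees wanted and 9 slots, some trees still queue, but no CPUs are stranded between tree jobs.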
Tells Snakemake what the `prefix` wildcard's literal value is, preventing Snakemake from interpreting part of the build name as the prefix. When Snakemake misinterprets the build name, this causes key errors downstream that are difficult to debug.
…constraint Avoids accidentally treating the fixed string as a regex which could lead to very weird Snakemake DAG issues when matching the "prefix" wildcard.
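Snakemake wildcard constraints are regular expressions, so a fixed string used as a constraint should be escaped first. The sketch below shows the failure mode with plain `re`; the prefix value is hypothetical and no Snakemake API is called.

```python
import re

# Hypothetical prefix containing a regex metacharacter (".").
prefix = "ncov.test"

# Unescaped, "." matches any character, so an unintended name also matches.
unescaped_matches_other = re.fullmatch(prefix, "ncov_test") is not None

# Escaped, only the literal prefix matches.
constraint = re.escape(prefix)
escaped_matches_other = re.fullmatch(constraint, "ncov_test") is not None
escaped_matches_literal = re.fullmatch(constraint, "ncov.test") is not None

print(unescaped_matches_other)  # True
print(escaped_matches_other)    # False
print(escaped_matches_literal)  # True
```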
I've rebased this PR onto |
Description of proposed changes
This PR splits the existing regional builds `global`, `africa`, etc. in the `nextstrain-gisaid` profile into `six-month` builds that focus subsampling on the previous six months and `all-time` builds that subsample evenly across time. This uses the new relative dates functionality in `augur filter` to make these subsampling strategies easier to implement and more obvious.

Frequencies timespans are set to match subsampling ranges.
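To illustrate what a relative date such as `6M` means, here is a standard-library sketch that resolves it to a calendar cutoff. The accepted format (a count followed by `D`, `W`, `M` or `Y`) and the end-of-month clamping are assumptions made for this example; Augur's own parsing may differ.

```python
import calendar
import datetime
import re


def relative_min_date(spec, today):
    """Resolve a relative date like "6M" to a date that far before `today`."""
    match = re.fullmatch(r"(\d+)([DWMY])", spec)
    if match is None:
        raise ValueError(f"unsupported relative date: {spec!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "D":
        return today - datetime.timedelta(days=amount)
    if unit == "W":
        return today - datetime.timedelta(weeks=amount)
    # Months and years: step back whole calendar months.
    months_back = amount if unit == "M" else 12 * amount
    total_months = today.year * 12 + (today.month - 1) - months_back
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    # Clamp the day so that e.g. May 31 minus 6 months gives Nov 30.
    day = min(today.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)


print(relative_min_date("6M", datetime.date(2022, 6, 15)))  # 2021-12-15
print(relative_min_date("6M", datetime.date(2022, 5, 31)))  # 2021-11-30
```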
The general subsampling logic is cleaned up in a few ways:

- There is a consistent 4:1 emphasis on recent vs early samples for the `six-month` builds and a consistent 4:1 emphasis on focal vs context for the regional builds.

Trial runs can be seen at:
I would plan to mirror changes to open profile once we're happy with things here.
Blocking issues
There are some blocking issues however before this PR can be merged. These are:
1. `--min-date` and `--max-date` augur#889

   We need `augur filter --min-date 6M`, `augur frequencies --min-date 6M`, etc. for this PR to function. This needs to be merged and a new version of Augur released. Currently, the `filter` work is merged to `master` and for `frequencies`, I've just specified so that this will function. Once `augur frequencies` has been updated this can get swapped to `min_date: "6M"`. This issue also needs to be fully closed by a new Augur release.

2. I named builds as `global_six-months`, `africa_six-months`, etc. because `global_6m` doesn't work. If we can generally fix things to allow numbers in build names, then the PR can be updated accordingly.

Todo
- Update `build_description` rule and `_get_upload_inputs()` to handle the new `build_name`s during templating / in remote file names (per @trvrb's comment below and subsequent discussion).
- Ping GISAID about updated URLs (per @huddlej's comment). No longer necessary.
- Decide how to handle the breaking change of new URLs/filenames under https://data.nextstrain.org/files/ncov/open/. We're preserving existing URLs for now.
Testing
I've tested locally.
Release checklist
- Update `docs/src/reference/change_log.md` in this pull request to document these changes by the date they were added.