Consider ingesting Corona Data Scraper or COVID Tracking testing data #112
Comments
Jingjing confirms that the number of unique counties is in the 3k range, so we'll go ahead and make this a new indicator.
Here is the list of counties that we currently forecast...
Coverage includes all states at the state level, but not all states are represented at the county level. Publish both, but note in the Limitations section that not all states provide county-level information.
Proposed signals, after talking to Roni:
[1] In a perfect world this would be the same as case counts, but at the moment we have two sources of case data that give slightly different values, so we'd rather take this value from the same source we're getting the denominator from.
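A minimal sketch of the proposed ratio, assuming hypothetical `positive` and `tested` columns (not the pipeline's actual schema); footnote [1] is why both columns must come from the same source:

```python
import pandas as pd

def pct_positive(df: pd.DataFrame) -> pd.Series:
    """Percent of tests (or people tested) that came back positive.

    Both `positive` and `tested` are assumed to come from the same
    source: mixing a numerator from one cases feed with a denominator
    from another would bias the ratio (footnote [1] above).
    """
    return 100 * df["positive"] / df["tested"]
```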
Added a pipeline here. Haven't compared the number of positive tests [1] with JHU or USAFacts yet.
Correlations are next!
Comparison to JHU is available here; there are a variety of different kinds and magnitudes of discrepancies, but the pipeline is faithful to what CDS is publishing, so we'll go forward with it as an indicator.
Unassigned cases were added to the raw dataset, so the pipeline needs to be updated. Corona Data Scraper migrated from the old cds codebase to li (a new name?) this week. As mentioned in the engineering team meeting, they changed the format of the data and the location metadata JSON file. It seems there are more problems with their new release: one is stated above, and another important one is mentioned by another group here. We might want to hold off on this source temporarily until the data becomes more stable.
Source: Corona Data Scraper
Signal names and start dates:
- Start from 2020-03-03
- Start from 2020-03-07 (restriction: tested ≥ 50)
CDS closed a bunch of issues 24 hours ago, so we can return to work on this signal. New bugs, but that's to be expected.
Bug fixed.
- Set export start date for confirmed_ signals to 2020-02-20.
- For pct_positive: require #tested incidence ≥ 50.
- For tested signals: export through the most recent days.
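A sketch of what those export rules might look like, with illustrative column and signal names (the real pipeline's schema may differ):

```python
import pandas as pd

CONFIRMED_START = pd.Timestamp("2020-02-20")
MIN_TESTED = 50  # pct_positive requires #tested incidence >= 50

def filter_for_export(df: pd.DataFrame, signal: str) -> pd.DataFrame:
    """Apply the export restrictions described above (illustrative only)."""
    if signal.startswith("confirmed_"):
        # confirmed_ signals start no earlier than 2020-02-20
        df = df[df["timestamp"] >= CONFIRMED_START]
    elif signal == "pct_positive":
        # suppress pct_positive where the testing volume is too small
        df = df[df["tested_incidence"] >= MIN_TESTED]
    return df
```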
@jingjtang @krivard I'm so happy we're pursuing this! I was about to create a new issue exactly on this, but wisely I decided to search first. Can I get an update on where we are in terms of finalizing these signals?
Re-run correlations just to double-check that the changes in data format after the last correlations run didn't trash the signal. If it's still good, we can publish in the next release.
They updated their location metadata again. I updated the supporting mapping files in the pipeline, but noticed a new problem.
According to the comparison, I did not change the pipeline: we generate the state-level report based on county-level data for confirmed cases; the only special case is at the state level. The correlation analysis results are shown below, correlating jhu-csse confirmed_incidence_prop averaged from 2020-08-17 to 2020-08-20 at the county level against cds pct_positive from 2020-08-10 to 2020-08-13.
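A rough sketch of that lagged comparison, assuming both frames carry `geo_id`, `timestamp`, and `value` columns; the use of Spearman rank correlation here is an assumption, not necessarily what the actual analysis used:

```python
import pandas as pd

def lagged_correlation(jhu: pd.DataFrame, cds: pd.DataFrame) -> float:
    """Correlate county-level JHU confirmed_incidence_prop (averaged over
    2020-08-17..2020-08-20) against CDS pct_positive (averaged over
    2020-08-10..2020-08-13)."""
    jhu_avg = (jhu[jhu["timestamp"].between("2020-08-17", "2020-08-20")]
               .groupby("geo_id")["value"].mean())
    cds_avg = (cds[cds["timestamp"].between("2020-08-10", "2020-08-13")]
               .groupby("geo_id")["value"].mean())
    # align on counties present in both sources
    both = pd.concat([jhu_avg, cds_avg], axis=1, join="inner",
                     keys=["jhu", "cds"])
    return both["jhu"].corr(both["cds"], method="spearman")
```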
It looks like the correlations are still good, so we should go ahead and schedule this for inclusion in the 1.9 release. The instability of the location metadata is annoying, but we've accounted for the need to document small tweaks like that by publishing a changelog for the API (https://cmu-delphi.github.io/delphi-epidata/api/covidcast_changelog.html). I think we can handle ongoing changes through that mechanism; we just need to be prepared to check the signal regularly and implement fixes. Next deadlines:
@krivard @jingjtang I just reviewed the data in the CDS "latest" csv file. There were #tests reports from ~11 states, which I compared with the JHU reports for these states. I assume you studied the discrepancies, so this may not be new to you; I just wanted to point them out just in case.
@RoniRos Yes. According to their code, they take the cases/deaths data from JHU-CSSE/NYTimes/xx.gov and cross-check them. Most of the values are the same as JHU's, which is expected. However, Corona Data Scraper makes its own decision after cross-checking, which might be why there are discrepancies for some states. Here is an example.
@jingjtang @krivard Yes, I see that, thank you. But the discrepancies in CO and ND are so huge that I think they are worth investigating specifically. E.g. in ND it's ~220,000 vs. ~540,000.
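A hypothetical screen for that kind of disagreement; the 20% tolerance is arbitrary, and the inputs are assumed to be per-state test totals indexed by state:

```python
import pandas as pd

def flag_discrepancies(cds: pd.Series, jhu: pd.Series,
                       tol: float = 0.2) -> pd.DataFrame:
    """Return states where the two sources' test totals differ by more
    than `tol`, relative to the larger value. The ND example above
    (~220,000 vs ~540,000) has a relative gap of ~0.59 and is flagged."""
    both = pd.concat([cds, jhu], axis=1, join="inner", keys=["cds", "jhu"])
    rel_gap = (both["cds"] - both["jhu"]).abs() / both.max(axis=1)
    return both[rel_gap > tol]
```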
Thanks @jingjtang , that's good progress. At today's team lead meeting we decided we want to move forward with the '#tested' and 'pct_positive' signals, but only after making sure we know exactly what they measure. Can you please do the same thing you did for CO, also for:
Do you have time to do this today? If so, we could make a go/no-go decision by tomorrow, while Kathryn is still here. If not, that's fine, we can make the decision next week.
@ryantibs @krivard : @jingjtang and I conferred further this evening. Here is my understanding of the current state of (global) confusion about what types of testing are being reported:
What we are going to do:
If it turns out CDS consistently reports people, we can move ahead with (appropriately named) signals. Otherwise, we need to decide what we want to do: the modeling group can still decide to use the 'hybrid' %positivity estimates. As Jingjing showed, they still have a high correlation with incidence rates. I just don't think we should publish them without being able to explain exactly what they are. Your thoughts welcome.
There's a dichotomy here between faithfully reproducing data from some source (like CTP or CDS) and publishing definitionally pure signals. If we want to publish a Corona Data Scraper mirror, we should mirror CDS, warts and all. If we want to publish a #tests or #tested signal, we probably will not be able to do that by mirroring a single source, and may have to consider a fusion signal. We already do this with combining the more-reliable USAFacts cases/deaths with the less-reliable-but-includes-Puerto-Rico JHU cases/deaths.
I want to emphasize that in both cases, we can explain exactly what the signal is -- in the former, it's an exact mirror of CDS. We should 100% call out that CDS reports #tests in some regions and #tested in others, and that this prevents meaningful comparison between regions. We should probably not include CDS in the map. Beyond that, we should direct people to the CDS documentation.
The next question to resolve then is: Do we want to publish a CDS mirror at all? Do we have internal or external users who want it? cc @ryantibs
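For concreteness, a minimal sketch of the fusion pattern mentioned above (prefer one source, backfill locations it lacks from the other); this is not the production combiner, and the column names are assumptions:

```python
import pandas as pd

def fuse(usafacts: pd.DataFrame, jhu: pd.DataFrame) -> pd.DataFrame:
    """Take the more-reliable USAFacts rows everywhere they exist, and
    fall back to JHU only for locations USAFacts lacks (e.g. Puerto
    Rico). Both frames are assumed to have a `geo_id` column."""
    fallback = jhu[~jhu["geo_id"].isin(usafacts["geo_id"])]
    return pd.concat([usafacts, fallback], ignore_index=True)
```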
@RoniRos @krivard @ryantibs After investigating the dashboards of states and counties:
I like the idea of adding value by figuring out what's what and then publishing synthesized signals, e.g. one for each type of counting methodology, and one hybrid that covers a superset of the locations and maybe does some harmonizing. Obviously this will require more investigation and thinking, so it should be discussed and put somewhere on our priority stack. As for mirroring CDS, let's wait for @ryantibs's answer and CAN's answers.
Sorry for the late reply here. I think Roni and I already talked about it this morning and discussed the value of mirroring, but let me know if you still need me to weigh in on anything else.
We decided we would like to go ahead with mirroring CDS for now.
If we're mirroring, then we should not construct or publish a derived pct_positive signal of our own.
Based on the team leads discussion this morning, we should put CDS mirroring on indefinite hold. Should this issue be suspended?
Yes -- I'll close it for now, and we can reopen at a later date if we find it is needed. |
The Corona Data Scraper project produces the testing data that's used by Covid Act Now. They seem to have testing data for specific counties they scrape. They're also connected to the COVID Atlas.
We should investigate their testing data and see if they report enough counties to make it worthwhile to ingest.
Meanwhile, COVID Tracking has an API for state-level testing data, which forecasting is already using even though it's not in our API.
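For reference, a sketch of pulling that state-level series from the COVID Tracking Project's v1 API; the endpoint and field names below follow its published schema at the time, but should be verified before use:

```python
import requests

URL = "https://api.covidtracking.com/v1/states/daily.json"

def fetch_state_testing():
    """Fetch daily state-level testing rows from the COVID Tracking
    Project (field names assumed from its v1 schema)."""
    rows = requests.get(URL, timeout=30).json()
    return [
        {"date": r["date"], "state": r["state"],
         "positive": r.get("positive"),
         "total_tests": r.get("totalTestResults")}
        for r in rows
    ]
```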