
Consider ingesting Corona Data Scraper or COVID Tracking testing data #112

Closed · capnrefsmmat opened this issue Jun 26, 2020 · 34 comments

Labels: API addition (New signals) · modeling (Must coordinate with Modeling team) · Triage (Nominate for inclusion in the next release)

capnrefsmmat (Contributor) commented Jun 26, 2020

The Corona Data Scraper project produces the testing data that's used by Covid Act Now. They seem to have testing data for specific counties they scrape. They're also connected to the COVID Atlas.

We should investigate their testing data and see if they report enough counties to make it worthwhile to ingest.

Meanwhile, COVID Tracking has an API for state-level testing data, which forecasting is already using even though it's not in our API.

@capnrefsmmat capnrefsmmat changed the title Consider ingesting Corona Data Scraper testing data Consider ingesting Corona Data Scraper or COVID Tracking testing data Jul 2, 2020
@krivard krivard added modeling Must coordinate with Modeling team Triage Nominate for inclusion in the next release API addition New signals labels Jul 8, 2020
jingjtang (Contributor) commented Jul 21, 2020

Here is an EDA of the number of locations with testing data available.
[Figures: CDS testing volume at county level and at state level]

3189 unique counties in total (including PR).
731 unique counties have testing data available.

krivard (Contributor) commented Jul 21, 2020

Jingjing confirms that the number of unique counties is in the 3k range, so we'll go ahead and make this a new indicator.

jsharpna (Contributor) commented:

Here is the list of counties that we currently forecast...
fips_pred.xlsx

krivard (Contributor) commented Jul 23, 2020

Coverage over all states at state level, but not all states are represented at the county level. Publish both, but note in the Limitations section that not all states provide county-level information.

krivard (Contributor) commented Jul 23, 2020

Proposed signals, after talking to Roni:

  • Number of positive tests[1]
  • Number of tests
  • Test positivity rate

[1] In a perfect world this would be the same as case counts, but at the moment we have two sources of cases data that give slightly different values. We'd rather get this value from the same source we're getting the denominator from.
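The three proposed signals are linked by a simple identity: the positivity rate is positives divided by total tests. A minimal sketch in Python (the function name and validation are illustrative, not the pipeline's actual code):

```python
def pct_positive(positive_tests: int, total_tests: int) -> float:
    """Test positivity rate, as a percentage of all tests performed.

    Using the numerator and denominator from the same source avoids
    the mismatch described in footnote [1].
    """
    if total_tests <= 0:
        raise ValueError("no tests reported")
    return 100.0 * positive_tests / total_tests
```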

jingjtang (Contributor) commented Jul 24, 2020

Added a pipeline here. Haven't compared the number of positive tests with JHU or USAFacts yet.

krivard (Contributor) commented Jul 24, 2020

Correlations are next!

krivard (Contributor) commented Jul 27, 2020

Comparison to JHU is available here; there are a variety of kinds and magnitudes of discrepancies, but it's faithful to what CDS is publishing, so we'll go forward with it as an indicator.

jingjtang (Contributor) commented:

Correlating CDS pct_positive from 2020-07-10 to 2020-07-13 against jhu-csse: confirmed_incidence_prop averaged from 2020-07-17 to 2020-07-20.

  • State Level: [figures: state_raw, state_smoothed]
  • County Level: [figures: county_raw, county_smoothed]
  • MSA Level: [figures: msa_raw, msa_smoothed]

krivard (Contributor) commented Jul 28, 2020

  • API documentation - see DETAILS.md
  • Upload historical data going back to February for confirmed cases, back to March for test positivity rate, with wip names
  • Get final signal names approved by Roni
  • Missingness (needs research)

jingjtang (Contributor) commented Jul 30, 2020

Unassigned cases were added to the raw dataset. The pipeline needs to be updated.
Weird dates exist in the raw dataset. For example, today is 07-30, but they provide data for California (only) for 07-31. I have already posted an issue to their GitHub repo.

Corona Data Scraper migrated from the old cds to li (a new name?) this week. As mentioned in the engineering team meeting, they changed the format of the data and the location metadata JSON file. It seems more problems have appeared in their new release: one is stated above, and another important one is mentioned by another group here.

We might want to hold off on this source temporarily until the data becomes more stable.

jingjtang (Contributor) commented Jul 30, 2020

Source: cds-test

Signal Names:
Start from 2020-02-20

  • wip_confirmed_incidence_num
  • wip_confirmed_incidence_prop
  • wip_confirmed_cumulative_num
  • wip_confirmed_cumulative_prop
  • wip_confirmed_7dav_incid_num
  • wip_confirmed_7dav_incid_prop
  • wip_confirmed_7dav_cumul_num
  • wip_confirmed_7dav_cumul_prop

Start from 2020-03-03

  • wip_tested_incidence_num
  • wip_tested_incidence_prop
  • wip_tested_cumulative_num
  • wip_tested_cumulative_prop
  • wip_tested_7dav_incid_num
  • wip_tested_7dav_incid_prop
  • wip_tested_7dav_cumul_num
  • wip_tested_7dav_cumul_prop

Start from 2020-03-07 (restriction: tested ≥ 50)

  • wip_raw_pct_positive
  • wip_smoothed_pct_positive
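The 7dav variants above are 7-day trailing averages of the daily signals, and pct_positive is only reported where at least 50 tests were run. A minimal sketch of both steps (helper names are illustrative, not the pipeline's actual code):

```python
def seven_day_average(daily):
    """Trailing 7-day mean (the 7dav variants); the first few entries
    average over however many days are available so far."""
    out = []
    for i in range(len(daily)):
        window = daily[max(0, i - 6): i + 1]
        out.append(sum(window) / len(window))
    return out

def restricted_pct_positive(positives, tested, min_tested=50):
    """pct_positive per day, reported (non-None) only where enough
    tests were run to make the rate meaningful."""
    return [100.0 * p / t if t >= min_tested else None
            for p, t in zip(positives, tested)]
```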

krivard (Contributor) commented Aug 10, 2020

CDS closed a bunch of issues 24 hours ago, so we can return to work on this signal.

New bugs, but that's to be expected.

jingjtang (Contributor) commented Aug 10, 2020

> CDS closed a bunch of issues 24 hours ago, so we can return to work on this signal.
> New bugs, but that's to be expected.

Bug fixed.

Set the export start date for confirmed_ signals to 2020-02-20.
Set the export start date for tested and pct_positive to 2020-03-15.

For pct_positive: #tested incidence num ≥ 50.

For tested signals through the most recent days:
~126 MSAs available. 20200809_msa_list.xlsx
~643 Counties available. 20200809_county_list.xlsx

ryantibs (Member) commented Sep 6, 2020

@jingjtang @krivard I'm so happy we're pursuing this! I was about to create a new issue exactly on this, but wisely I decided to search first.

Can I get an update on where we are in terms of finalizing these signals?

krivard (Contributor) commented Sep 9, 2020

Re-run correlations just to double-check that the changes in data format after the last correlations run didn't trash the signal. If it's still good, we can publish in the next release.

jingjtang (Contributor) commented Sep 10, 2020

They updated their location metadata again. I updated the supporting mapping files in the pipeline, but noticed a new problem.

  • The number of cases at the state level does not match the aggregation we get from the county level, especially in CT, DE, ... (filed an issue with the CDS group to ask for the exact meaning of their unassigned cases)
  • The unassigned data does not seem to match the states well. unassigned.xlsx

Based on this comparison, I did not change the pipeline: the state-level report for confirmed cases is generated from county-level data. The only special case is state-level tested, where we use the state-level data directly.
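The county-to-state aggregation used for confirmed cases reduces to summing each state's county rows, keyed by FIPS prefix. A hedged sketch (a hypothetical helper, not the actual pipeline code):

```python
from collections import defaultdict

def aggregate_counties_to_states(county_counts):
    """Sum county-level counts up to the state level.

    Keys are 5-digit county FIPS codes; the first two digits identify
    the state, so grouping on that prefix gives state totals.
    """
    totals = defaultdict(int)
    for fips, count in county_counts.items():
        totals[fips[:2]] += count
    return dict(totals)
```

Comparing these derived state totals with the source's own state rows is exactly how the CT/DE mismatch above shows up.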

The correlation analysis results are shown below:
Correlating jhu-csse: confirmed_incidence_prop averaged from 2020-08-17 to 2020-08-20 at county level against cds: confirmed_incidence_prop from 2020-08-10 to 2020-08-13

  • County Level
  • MSA Level
  • State Level

Correlating jhu-csse: confirmed_incidence_prop averaged from 2020-08-17 to 2020-08-20 at county level against cds: pct_positive from 2020-08-10 to 2020-08-13

  • County Level
  • MSA Level
  • State Level

jingjtang (Contributor) commented:

They have tested data for Puerto Rico at the county level but not at the state level.

krivard (Contributor) commented Sep 14, 2020

It looks like the correlations are still good, so we should go ahead and schedule this for inclusion in the 1.9 release. The instability of the location metadata is annoying, but we've accounted for the need to document small tweaks like that by publishing a changelog for the API (https://cmu-delphi.github.io/delphi-epidata/api/covidcast_changelog.html). I think we can handle ongoing changes through that mechanism, we just need to be prepared to check the signal regularly and implement fixes.

Next deadlines:

  • Final source and signal name approval - let's try to get this done by EOD Friday so I can be here for it. I have the candidate list of signal names here, though I'll be expanding them since we can now use up to 64 characters. What are the candidate source names, cds and corona-data-scraper? Any others?
  • 1.9 demo on Tuesday 22 September - WIP signals in API, custom map svelte. @jsharpna to work with @adamperer on final details to go out with the demo.
  • API documentation due Monday 28 September - add commits to this pull request

RoniRos (Member) commented Sep 17, 2020

@krivard @jingjtang I just reviewed the data in the CDS "latest" csv file. There were #tests reports from ~11 states, which I compared with the JHU reports for these states:
OR, NY, FL were identical or almost so, which I think means that CDS took it from JHU, yes?
IL, MA, NH, TN were maybe 15-20% off, some up and some down.
CO, ND were way off, with JHU reporting 50% more tested for CO and 150% more for ND.

I assume you studied the discrepancies, so this may not be new to you; I just wanted to point them out in case.
I tried to capture them in color in the following spreadsheet: cds-latest.xlsx

jingjtang (Contributor) commented:

@RoniRos Yes. According to their code, they take the cases/deaths data from JHU-CSSE/NYTimes/xx.gov and cross-check them. Most of them are the same as JHU's, which is expected. However, Corona Data Scraper makes its own decisions after cross-checking, which might be the reason why there are discrepancies for some states. Here is an example.

RoniRos (Member) commented Sep 17, 2020

@jingjtang @krivard Yes, I see that, thank you. But the discrepancies in CO and ND are so huge that I think they are worth investigating specifically. E.g. in ND it's ~220,000 vs. ~540,000.
I just searched ND Dept. of Health and found this dashboard. From it, it is clear that the 540k measures tests, whereas the 220k measures individuals. Can you investigate CO similarly? And the other states that have a 15%-20% discrepancy?

jingjtang (Contributor) commented:

@RoniRos For CO, 797,493 measures people and 1,153,853 measures tests, which means CDS also reports testing at the individual (people) level.

RoniRos (Member) commented Sep 17, 2020

Thanks @jingjtang , that's good progress. At today's team lead meeting we decided we want to move forward with the '#tested' and 'pct_positive' signals, but only after making sure we know exactly what they measure. Can you please do the same thing you did for CO, also for:

  • All the other states listed in CoronaDataScraper? Additionally, for those states where CDS numbers are identical or almost-identical to JHU's numbers: are these counting tests or people being tested?
  • 1-2 example counties from each of these states.

Do you have time to do this today? If so, we could make a go/no-go decision by tomorrow, while Kathryn is still here. If not, that's fine, we can make the decision next week.

RoniRos (Member) commented Sep 18, 2020

@ryantibs @krivard : @jingjtang and I conferred further this evening. Here is my understanding of the current state of (global) confusion about what types of testing are being reported:

  • CovidTrackingProject (CTP) did the best job of sorting them out. According to them, there are currently at least three different ways of counting testing: Specimen (aka 'tests'), people, and encounters (which means unique people in a given day). The differences can be quite large, e.g. in ND it's a factor of 2.5 between #specimen and #people. As you can imagine, different states report different subsets of those, and for some it's not even clear what they are reporting. Unfortunately, CTP does not provide all the measures reported by each state. Rather, it picks only one, preferring tests/encounters to people. They tell you which they chose, but they don't provide everything. And they only report by state, not by county.

  • CoronaDataScraper (CDS) reports on ~10 states, and on all the counties of most of these states. As Jingjing investigated, they get their data from JHU and other places. Regretfully, they do not indicate which type of counting is being reported. For some of the states they report numbers identical to JHU's (which also doesn't say what it is reporting), and for other states they report much smaller numbers, where it's clear that CDS reports people and JHU reports tests. Jingjing sent them a few questions about other discrepancies, but has not heard back.

  • CovidActNow (CAN) told us that they get their data from CDS. They may have looked into this issue and have better knowledge of who reports what. We can ask them.

What we are going to do:

  • Jingjing is investigating the dashboards of more states and counties, to determine what type of measure CDS reports.
  • Jingjing will also send me the questions she couldn't get answered by CDS, and I will forward them to CAN.

If it turns out CDS consistently reports people, we can move ahead with (appropriately named) signals.

Otherwise, we need to decide what we want to do:
My view is that we should not contribute to the confusion, but rather work to resolve it. I think we should support multiple signals, one for each type of measurement for which there are a meaningful number of reporting locations. We should not provide a single %positivity signal that is based in some locations on #people and in others on #tests. As we saw, there is a huge discrepancy between them, so any comparison will be highly misleading. We could explore ways to harmonize estimates from different types of counts, but that sounds like a research project.

The modeling group can still decide to use the 'hybrid' %positivity estimates. As Jingjing showed, they still have high correlation with incidence rates. I just don't think we should publish them without being able to explain exactly what they are.

Your thoughts welcome.

krivard (Contributor) commented Sep 18, 2020

There's a dichotomy here between faithfully reproducing data from some source (like CTP or CDS) and publishing definitionally pure signals.

If we want to publish a Corona Data Scraper mirror, we should mirror CDS, warts and all.

If we want to publish a #tests or #tested signal, we probably will not be able to do that by mirroring a single source, and may have to consider a fusion signal. We already do this by combining the more-reliable USAFacts cases/deaths with the less-reliable-but-includes-Puerto-Rico JHU cases/deaths.

I want to emphasize that in both cases, we can explain exactly what the signal is -- in the former, it's an exact mirror of CDS. We should 100% call out that CDS reports #tests in some regions and #tested in others, and that this prevents meaningful comparison between regions. We should probably not include CDS in the map. Beyond that, we should direct people to the CDS documentation.

The next question to resolve then is: Do we want to publish a CDS mirror at all? Do we have internal or external users who want it? cc @ryantibs
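The fusion approach described above (prefer the more reliable source, fill in regions only the other covers) reduces to a simple map merge. A hedged sketch under that assumption, not the combined indicator's actual code:

```python
def fuse_sources(primary, fallback):
    """Combine two geo->value maps, preferring the primary source and
    filling in regions only the fallback covers (e.g. USAFacts cases
    plus JHU's Puerto Rico rows)."""
    combined = dict(fallback)   # start from the fallback's coverage
    combined.update(primary)    # primary values win wherever both report
    return combined
```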

jingjtang (Contributor) commented Sep 18, 2020

@RoniRos @krivard @ryantibs
[I want to post this as early as I can so that Katie can get some sense of it today. The information in the sheets I made could contain mistakes; some government websites provide unclear descriptions of their data.]

After investigating the dashboards of states and counties:

  • State-level results here
    For most of the states, CDS measures tests. However, for around 1/4 of the states, it measures people.

  • County-level results here
    For most of the states with county-level tested data available, the counties within them use consistent measures. There are some exceptions:

    • Wisconsin: I found at least one county that measures people and at least one that measures tests.
    • California and Massachusetts: Failed to find clearly explained data for multiple counties.
    • Missouri: Tests per 100k people are provided in an interactive map on their website. I manually recalculated the number of tests for some counties, and the numbers do not match those in CDS's reports.

RoniRos (Member) commented Sep 20, 2020

I like the idea of adding value by figuring out what's what and then publishing synthesized signals, e.g. one for each type of counting methodology, and one hybrid that covers a superset of the locations and maybe does some harmonizing. Obviously this will require more investigation and thinking, so it should be discussed and put somewhere on our priority stack.

As for mirroring CDS, let's wait for @ryantibs's answer and CAN's answers.

ryantibs (Member) commented:

Sorry for the late reply here. I think Roni and I already talked about it this morning and discussed the value of mirroring, but let me know if you still need me to weigh in on anything else.

RoniRos (Member) commented Sep 27, 2020

We decided we would like to go ahead with mirroring CDS for now.
In the longer term it is important to have more coherent signals (e.g. separately for #individuals-tested, #tests-performed, etc.). C.A.N. is actively working on this now with Velorum, so we should probably not duplicate their work.

krivard (Contributor) commented Sep 30, 2020

If we're mirroring, then we should not construct or publish a pct_positive signal.

RoniRos (Member) commented Oct 1, 2020

@krivard : That's correct.

@ryantibs : What is your thinking following our discussion with Igor? Do you think we should go ahead with simple mirroring for now? And/or do the one week project with CAN?

RoniRos (Member) commented Oct 1, 2020

Based on the team leads discussion this morning, we should put CDS mirroring on indefinite hold. Should this issue be suspended?

krivard (Contributor) commented Oct 1, 2020

Yes -- I'll close it for now, and we can reopen at a later date if we find it is needed.

@krivard krivard closed this as completed Oct 1, 2020
6 participants