
Missing per-county tested data #356

Open · TomGoBravo opened this issue Jul 28, 2020 · 13 comments
Labels: needs-verification (Waiting for verification on an item that we believe to be fixed.)

@TomGoBravo

Additional context

CovidActNow has been regularly fetching this file for months and making a copy at https://github.com/covid-projections/covid-data-public/commits/master/data/cases-cds/timeseries.csv

With the change to Project Li I noticed that many counties that used to have values in the tested column now have no data. The problem seems particularly bad in Pennsylvania.

@TomGoBravo

Actually, looking at

git show `git rev-list -n 1 --first-parent --before="2020-07-24" master`:data/cases-cds/timeseries.csv | git lfs smudge | csvgrep -c state -m Pennsylvania | csvgrep -c tested -r . | csvcut -c name,level,date | csvsort -c date | csvlook |less

I see that for Pennsylvania counties the newest date with data is 2020-06-07.

@TomGoBravo

But there are many counties where Corona Data Scraper had cases data for July but Project Li has none. I'm looking at rows of data we fetched last week.

(this creates a file with the names of every county in the US with cases in 2020-07 in data merged to https://github.com/covid-projections/covid-data-public last Friday)

git show `git rev-list -n 1 --first-parent --before="2020-07-24" master`:data/cases-cds/timeseries.csv | git lfs smudge | csvgrep -c level -m county | csvgrep -c country -m "United States" | csvgrep -c date -r '2020-07-..' | csvgrep -c cases -r . | csvcut -c name | perl -pe 's/United States/US/' | sort | uniq > data-20200724/cases-cds/timeseries-counties-cases-uniq

and comparing it to a similar file created from data fetched from https://coronadatascraper.com/timeseries.csv.zip today:

cat cds/timeseries.csv | csvgrep -c level -m county | csvgrep -c country -m "United States" | csvgrep -c date -r '2020-07-..' | csvgrep -c cases -r . | csvcut -c name | sort | uniq > cds/timeseries-counties-cases-uniq

It looks like there are 851 counties that lost cases data and 21 that gained it.
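
(For the record, the lost/gained counts fall out of comparing the two sorted files, e.g. with comm from coreutils:)

# counties with July cases last week but not today ("lost"): 851
comm -23 data-20200724/cases-cds/timeseries-counties-cases-uniq cds/timeseries-counties-cases-uniq | wc -l

# counties with July cases today but not last week ("gained"): 21
comm -13 data-20200724/cases-cds/timeseries-counties-cases-uniq cds/timeseries-counties-cases-uniq | wc -l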

@TomGoBravo

Here are 4 examples:

diff data-20200724/cases-cds/timeseries-counties-cases-uniq cds/timeseries-counties-cases-uniq  | grep Brown
< "Brown County, Kansas, US"
< "Brown County, Minnesota, US"
< "Brown County, South Dakota, US"
< "Brown County, Texas, US"

Looking at csvgrep -c name -m 'Brown County, Texas' data-20200724/cases-cds/timeseries.csv | csvcut -c name,date,cases, it seems like the cases timeseries was legit. It goes up to 303 on 2020-07-23 and agrees with https://ktxs.com/news/local/brown-county-12-new-cases-of-covid-19-2-deaths (99 cases on 2020-07-04).
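
(A quick spot check of one of these against today's download, using the same zip as above -- just a manual sanity check:)

# fetch today's CDS timeseries and pull Brown County, Texas rows that have cases
curl -sO https://coronadatascraper.com/timeseries.csv.zip
unzip -p timeseries.csv.zip | csvgrep -c name -m 'Brown County, Texas' | csvgrep -c cases -r . | csvcut -c name,date,cases | tail
# given the diff above this should come back empty, vs. cases up to 303 on 2020-07-23 in last week's fetch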

@jzohrab commented Jul 30, 2020

Hi @TomGoBravo, thx for the great notes. It looks like this is a result of a few things:

  • Some of the scrapers have not been ported from CDS to Li. I tracked the migration in this spreadsheet. The list of remaining/blocked/needs-updating scrapers:
BE/index.js
CA/NS/index.js
CH/index.js
FR/index.js
PA/index.js
US/AZ/index.js
US/CA/mercury-news.js
US/CA/san-francisco-county.js
US/CA/santa-clara-county.js
US/DC/index.js
US/KS/index.js   <<<
US/LA/index.js
US/MO/index.js
US/NV/washoe-county/index.js
US/TX/harris-county.js

  • Some of the ported sources are currently failing in live, e.g. us-pa: ref https://api.covidatlas.com/status?format=html.

I'll try fixing PA first, and see where that takes us.

@jzohrab commented Jul 30, 2020

Re "I see that for Pennsylvania counties the newest date with data is from 2020-06-07" - checking code comments and issues - we had an issue for that, https://github.com/covidatlas/coronadatascraper/issues/1055. PA changed their reporting to now use PDFs.

Code in src/shared/sources/us/pa/index.js has a comment:

    // TODO (scrapers) us-pa stopped working 2020-06-08
    // ref https://github.com/covidatlas/coronadatascraper/issues/1055
    // Now data is present in PDFs at links on
    // https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx

I'll switch from PA to KS first (one of the Brown County items you listed) to see if I can get that working.
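
(For anyone poking at the PA PDFs by hand: pdftotext from poppler-utils gives a rough idea of what a scraper would have to parse. This is just a manual spot check, not how the Li scraper works:)

# list PDF links on the PA cases page, then dump one to text
curl -s 'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx' | grep -o 'href="[^"]*\.pdf"'
curl -s -o county.pdf '<one of the PDF URLs printed above>'
pdftotext -layout county.pdf - | head -40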

@jzohrab commented Jul 30, 2020

Blarg, running into issues with getting KS to work. Similar to PA, KS switched to reporting stuff via PDFs, and for some reason the PDF code is not working -- have hacked around and can't grok it just yet. Will raise another issue for it.

@martynwong

I've also found a similar issue - lots of county-level data in the central US appears to be missing.

@jzohrab commented Aug 2, 2020

Hi all, I believe I've found the immediate reason for the missing data, though I'm not yet sure what the root cause is.

Our reports are built up by location, stored in the Locations table. I checked the production table, and we don't have brown-county-texas-us (locationID iso1:us#iso2:us-tx#fips:48049), but we do have brown-county-illinois-us (iso1:us#iso2:us-il#fips:17009).

I'm not sure why that's the case -- the location data should be populated when data is scraped. We do have data for the brown-county-texas-us location:

"locationID (S)","dateSource (S)","cases (N)","country (S)","county (S)","date (S)","deaths (N)","priority (N)","source (S)","state (S)","updated (S)"
"iso1:us#iso2:us-tx#fips:48049","2020-07-01#jhu-usa","77","iso1:US","fips:48049","2020-07-01","10","-1","jhu-usa","iso2:US-TX","2020-08-02T10:08:53.548Z"

I'll look into a manual load of location data ... I don't know why we're loading locations during data scrape anyway, as we already have all of the location data.

ps - I haven't bothered looking into the other missing counties -- thanks for the list above -- but it seems highly likely this is the problem.
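
(Roughly what the check above looks like from the CLI -- the table names here are placeholders, not the real generated names:)

# is Brown County, Texas in the locations table? (expect no item back)
aws dynamodb get-item --table-name locations \
  --key '{"locationID": {"S": "iso1:us#iso2:us-tx#fips:48049"}}'

# the scraped case data for the same locationID is present
aws dynamodb query --table-name case-data \
  --key-condition-expression 'locationID = :id' \
  --expression-attribute-values '{":id": {"S": "iso1:us#iso2:us-tx#fips:48049"}}'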

@jzohrab commented Aug 2, 2020

The "locations" lambda (which updates locations) appears to have been timing out. For most sources it's ok, but for something like jhu-usa, which updates thousands of locations, it fails. Local logging:

updating 153 of 3277: iso1:us#iso2:us-ak#fips:02050
updating 154 of 3277: iso1:us#iso2:us-ak#fips:02060

and it stops. I see errors in the lambda log, and am assuming it's that.

I bumped up the timeout for the lambda. Updating all locations takes about 1.5 mins for jhu-usa locally. Simplifying the code slightly now.
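
(For reference, the timeout bump itself is a one-liner if applied directly against AWS -- the function name below is a placeholder for whatever the deploy tooling generated:)

# give the locations lambda enough headroom for ~3300 jhu-usa updates (~1.5 min locally)
aws lambda update-function-configuration --function-name <app>-locations --timeout 300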

@jzohrab commented Aug 2, 2020

I believe this will be addressed by #367. I'll launch that to production soon (< 15 mins). We'll need to wait for a jhu-usa scrape to update all of the locations.

@jzohrab jzohrab self-assigned this Aug 2, 2020
@jzohrab jzohrab added the needs-verification label Aug 2, 2020
@jzohrab commented Aug 2, 2020

Launched to prod ... let's see how things shake out.

@jzohrab commented Aug 2, 2020

Also assigning @TomGoBravo and @martynwong; if you see the data has filled in before I do, please close the issue. Cheers! jz

@martynwong

Hurrah! The data is working for me. Thanks!
