
clean up early google symptom data #1616

Open
aysim319 opened this issue Mar 3, 2025 · 5 comments

aysim319 (Contributor) commented Mar 3, 2025

Google symptoms signals all go back to 2017 now, but all the smoothed signals have a non-contiguous first entry on Aug 15 2017 (issue Aug 20 2017). There isn't enough data in the raw version to have calculated this. The Aug 20 issue is also before the earliest issue date seen in the raw data, although the raw and smoothed values for Aug 15 match.

[screenshot attached]

The reason is that the Google Symptoms smoother skips the smoothing function when there isn't enough data, but still passes the data through instead of dropping it:

https://github.com/cmu-delphi/covidcast-indicators/blob/454ac565d0a0f2b5cf557e4efb2278c278c528a9/_delphi_utils_python/delphi_utils/smooth.py#L194-L196
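For illustration, here is a minimal sketch (not the actual delphi_utils code; the function name and window size are assumptions) of how a guard like the one linked above can leak an unsmoothed raw value into the smoothed stream:

```python
import numpy as np

def smooth_with_edge_case(values, window=7):
    """Sketch: with too little data, return the input unchanged instead of
    dropping it, so the raw value leaks into the smoothed signal."""
    values = np.asarray(values, dtype=float)
    if len(values) < 2:
        # hypothetical guard mirroring the linked lines: skip smoothing entirely
        return values
    # otherwise, a trailing moving average that is NaN until the window fills
    out = np.full(len(values), np.nan)
    for i in range(window - 1, len(values)):
        out[i] = values[i - window + 1 : i + 1].mean()
    return out

print(smooth_with_edge_case([5.0]))  # a lone point passes through unsmoothed
```

This matches the symptom described above: the very first available date shows a "smoothed" value equal to the raw value.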

Dropping the smoothed data from August 17–20 (inclusive) would solve this problem. The epidata documentation already states that the smoothed data has an earliest date of 08/21/2017.

nmdefries (Contributor) commented Mar 3, 2025

> Dropping the smoothed data from August 17–20 (inclusive) would solve this problem

We only need to drop a single point (where time_value is Aug 15, 2017) from each of the smoothed signals. There aren't any data points between that and the first valid point (where time_value is Aug 21, 2017), so we don't need to remove a range of dates. This is only true at the nation level.
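A sketch of that single-point removal, using toy data and assumed covidcast-style column names (`geo_type`, `time_value`, `val`):

```python
import pandas as pd

# Toy smoothed-signal frame; the column names mirror covidcast conventions
# but are assumptions, not the indicator's actual schema.
df = pd.DataFrame({
    "geo_type":   ["nation", "nation", "nation"],
    "time_value": pd.to_datetime(["2017-08-15", "2017-08-21", "2017-08-22"]),
    "val":        [1.2, 1.3, 1.4],
})

# Drop only the single invalid nation point at Aug 15, 2017.
bad = (df["geo_type"] == "nation") & (df["time_value"] == pd.Timestamp("2017-08-15"))
cleaned = df[~bad]
print(cleaned["time_value"].dt.strftime("%Y-%m-%d").tolist())
# ['2017-08-21', '2017-08-22']
```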

melange396 (Collaborator) commented
How did this one random date appear in the stream? Are there other non-latest-issue data points that need to be removed too?

melange396 self-assigned this Mar 3, 2025
nmdefries (Contributor) commented

The way our internal moving average smoother is defined, it returns n missing values until the window size requirement is met. But there's a hard-coded edge case upstream that does no smoothing when only one data point is available. I don't know if this should be considered a bug, or if we want to let the user toggle it, or if we should change the edge case behavior depending on the smoother selected.
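The two behaviors can be illustrated with pandas' rolling mean; this is a sketch, not the actual smoother code. `min_periods` stands in for the window-size requirement, and the `min_periods=1` variant mimics the hard-coded single-point edge case:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# Strict contract: NaN until 7 observations are available.
strict = s.rolling(window=7, min_periods=7).mean()

# The hard-coded edge case effectively behaves like min_periods=1 when only
# one point exists, letting the raw value through.
single = pd.Series([5.0]).rolling(window=7, min_periods=1).mean()

print(strict.tolist())  # first six entries NaN, last is 4.0
print(single.tolist())  # [5.0]
```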

There are no other non-latest-issue data points that need to be removed; this shows up because of the way we keep lags when patching smoothed signals. @aysim319 can describe that better.

aysim319 (Contributor, Author) commented Mar 3, 2025

> How did this one random date appear in the stream?

08/15 is the first date where data is available; in that case the smoother doesn't actually smooth anything and returns the values as-is.

> The reason is that the Google Symptoms smoother skips the smoothing function if there isn't enough data (date range), but still passes the data through instead of dropping it:

https://github.com/cmu-delphi/covidcast-indicators/blob/454ac565d0a0f2b5cf557e4efb2278c278c528a9/_delphi_utils_python/delphi_utils/smooth.py#L194-L196

However, in later runs it does in fact go through imputation (it passes the criteria from the previous link) even though there technically isn't enough data to impute properly, so most (if not all) of the values are null; later in the process (https://github.com/cmu-delphi/covidcast-indicators/blob/454ac565d0a0f2b5cf557e4efb2278c278c528a9/google_symptoms/delphi_google_symptoms/run.py#L97-L98) the null values are filtered out.

Sometimes some values do make it through and end up in the CSV created for acquisition, but it's inconsistent per geo value.
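A toy sketch of that null-filtering step (the column names and values here are made up; the actual filter lives in the run.py lines linked above):

```python
import pandas as pd

# Toy patch output: most smoothed values are null because there wasn't
# enough history to impute; a few slip through inconsistently per geo.
df = pd.DataFrame({
    "geo_id": ["1", "7", "9"],
    "val":    [None, 0.42, None],
})

# Same idea as the run.py filter: keep only rows with a non-null value.
kept = df[df["val"].notnull()]
print(kept["geo_id"].tolist())  # ['7'] -- only one geo survives
```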

> Are there other non-latest-issue data points that need to be removed too?

Yes.

sample epidata call:

```python
Epidata.covidcast(SOURCE, signal, time_type=time_type,
                  geo_type="hhs", time_values={'from': '20170810', 'to': '20171031'},
                  geo_value="*", issues={'from': '20170820', 'to': '20170827'})
```

For some of the signals and geo resolutions it does in fact produce data, but it's very inconsistent.
I set geo_value to "*" because, for the single 08/16 date, it only returns the geo value 7 and nothing else.

[screenshot attached]

nmdefries (Contributor) commented

> For some of the signals and geo resolutions it does in fact produce data, but it's very inconsistent

Oh, good find
