clean up early google symptom data #1616
Comments
How did this one random date appear in the stream? Are there other non-latest-issue data points that need to be removed too?
The way our internal moving average smoother is defined, it returns There are no other non-latest-issue data points that need to be removed; this shows up because of the way we keep lags when patching smoothed signals. @aysim319 can describe that better.
08/15 is the first date for which data is available. In that case the smoother doesn't actually smooth anything and returns the data as is. The reason is that the smoother in google symptoms skips the smoothing function if there isn't enough data (date range), but still passes the data through instead of dropping it. In future runs, however, the data does go through the imputation (it passes the criteria from the previous link) even though there technically isn't enough data to impute properly, so most if not all values are null, and a later step in the process (https://github.com/cmu-delphi/covidcast-indicators/blob/454ac565d0a0f2b5cf557e4efb2278c278c528a9/google_symptoms/delphi_google_symptoms/run.py#L97-L98) filters out the null values. Sometimes some values get through and go on to create the CSV for acquisition, but this is inconsistent per geo value.
yes. sample epidata call:
For some signals and geo resolutions it does in fact produce data, but it's very inconsistent.
Oh, good find
Google symptoms signals all go back to 2017 now, but all the smoothed signals have a non-contiguous first entry on Aug 15 2017 (issue Aug 20 2017). There isn't enough data in the raw version to have calculated this. The Aug 20 issue is also before the earliest issue date seen in the raw data, although the raw and smoothed values for Aug 15 match.
The reason is that the smoother in google symptoms skips the smoothing function if there isn't enough data, but still passes the data through instead of dropping it:
https://github.com/cmu-delphi/covidcast-indicators/blob/454ac565d0a0f2b5cf557e4efb2278c278c528a9/_delphi_utils_python/delphi_utils/smooth.py#L194-L196
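The pass-through behaviour described above can be illustrated with a minimal sketch. This is not the `delphi_utils` code at the link; the function name `moving_average_or_passthrough`, the 7-day window, and the use of `None` for incomplete windows are all assumptions made for illustration. It shows why a lone first day could come out of a "smoothed" signal with the same value as the raw signal.

```python
from statistics import fmean


def moving_average_or_passthrough(values, window=7):
    """Trailing moving average (hypothetical helper, not the delphi_utils API).

    If there are fewer points than the window, smoothing is skipped and the
    input is returned unchanged instead of being dropped.
    """
    if len(values) < window:
        # Not enough history: skip smoothing but still pass the data through.
        return list(values)
    smoothed = []
    for i in range(len(values)):
        if i + 1 < window:
            # Incomplete window: no smoothed value for this day.
            smoothed.append(None)
        else:
            smoothed.append(fmean(values[i + 1 - window : i + 1]))
    return smoothed


# A single available day comes back unsmoothed, equal to the raw value.
first_day_only = moving_average_or_passthrough([3.0])
# With a full window, only the last day gets a smoothed value.
full_window = moving_average_or_passthrough([1.0] * 7)
```

Under these assumptions, a run that sees only one day of data emits that day as is, while a later run with more history would leave the same day null.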
Dropping the smoothed data from August 17-20 (inclusive) would solve this problem. The epidata documentation already mentions that the smoothed data have an earliest date of 08/21/2017.
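As a rough sketch of that cleanup, assuming rows are plain dictionaries with a `time_value` date (the row layout, the helper name `drop_early_smoothed`, and the sample values are hypothetical, not the real database schema): it drops every smoothed row before the documented earliest date, which is slightly broader than the August 17-20 range named above.

```python
from datetime import date

# Earliest smoothed date according to the epidata documentation cited above.
FIRST_VALID_SMOOTHED_DATE = date(2017, 8, 21)


def drop_early_smoothed(rows, cutoff=FIRST_VALID_SMOOTHED_DATE):
    """Keep only rows whose time_value is on or after the cutoff."""
    return [row for row in rows if row["time_value"] >= cutoff]


# Made-up sample rows for illustration only.
rows = [
    {"time_value": date(2017, 8, 15), "value": 0.5},
    {"time_value": date(2017, 8, 21), "value": 0.7},
]
kept = drop_early_smoothed(rows)
```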