
Fix 403 when downloading data for mnist tutorial #66


Merged: 6 commits into numpy:main, Mar 15, 2021

Conversation

@rossbar (Collaborator) commented Mar 9, 2021

Uses header spoofing to circumvent problems with downloading the MNIST digit data. Also adds data caching to the CircleCI builds so that, in principle, the MNIST data will be cached between CI runs.

Closes #63
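For context, a minimal sketch of what header spoofing plus a local cache can look like with `requests`. The URL, filename, and User-Agent string here are illustrative, not necessarily what this PR uses:

```python
import os
import requests

data_dir = "_data"
base_url = "http://yann.lecun.com/exdb/mnist/"  # original MNIST host; illustrative
fname = "train-images-idx3-ubyte.gz"            # illustrative filename

# Some servers return 403 Forbidden for clients without a browser-like
# User-Agent header; supplying one works around this.
headers = {"User-Agent": "Mozilla/5.0"}

os.makedirs(data_dir, exist_ok=True)
fpath = os.path.join(data_dir, fname)
if not os.path.exists(fpath):  # the cache: skip the download if the file exists
    print(f"Downloading {fname}...")
    response = requests.get(base_url + fname, headers=headers, stream=True)
    with open(fpath, "wb") as fh:
        for chunk in response.iter_content(chunk_size=8192):
            fh.write(chunk)
```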

@rossbar changed the title from "WIP: Fix 403 when downloading data for mnist tutorial" to "Fix 403 when downloading data for mnist tutorial" on Mar 10, 2021
@rossbar (Collaborator, Author) commented Mar 12, 2021

The caching of the _data dir between CI runs seems to have worked: the MNIST data was not re-downloaded during the build step for #67.

@mattip (Member) commented Mar 12, 2021

LGTM, and will improve CI performance. Just one nit about educating people to spoof headers responsibly.

@melissawm (Member) commented

Unfortunately, this is not working for me: I get files that apparently are not in gzip format. Maybe this is why the CI is still failing too?

@rossbar (Collaborator, Author) commented Mar 12, 2021

> Maybe this is why the CI is still failing too?

It looks like the CI failure was just related to the execution timeout again (not related to downloading). The build artifact suggests the cache is working properly: there would be print statements like "Downloading xyz" in the code cell output if the data were being re-downloaded.

The gzip failure is indeed strange: the cached files are likely still compressed, otherwise execution would have been failing on the decompression step too. The docs for iter_content do say that compressed content will automatically be decompressed, but I'm not sure that behavior is always consistent; otherwise I don't understand how the cache was originally built with still-compressed files.

Some possible solutions that come to mind are:

  1. Use response.raw instead of response.iter_content, which shouldn't auto-decompress the content (see the sketch just below).
  2. Wrap the local data loading in a try/except for gzip.
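A sketch of option 1, reusing base_url, fname, fpath, and headers from the earlier sketch. response.raw exposes the undecoded socket bytes, so gzip files land on disk still compressed:

```python
import shutil
import requests

# stream=True leaves the body unread so response.raw can be consumed directly;
# copyfileobj writes the bytes to disk without any content decoding.
response = requests.get(base_url + fname, headers=headers, stream=True)
with open(fpath, "wb") as fh:
    shutil.copyfileobj(response.raw, fh)
```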

I'm partial to the second option - it adds a minor amount of boilerplate but should work regardless of whether the data was decompressed when it was downloaded (also sketched below).
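A sketch of option 2 with an illustrative path. gzip.BadGzipFile (Python 3.8+) subclasses OSError, so catching OSError covers older Pythons too:

```python
import gzip

fpath = "_data/train-images-idx3-ubyte.gz"  # illustrative path

try:
    # Normal case: the cached file is still gzip-compressed.
    with gzip.open(fpath, "rb") as fh:
        raw = fh.read()
except OSError:
    # Fallback: the file was already decompressed at download time.
    with open(fpath, "rb") as fh:
        raw = fh.read()
```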

@rossbar (Collaborator, Author) commented Mar 13, 2021

Ah - another potential problem is that the download is simply failing :). Testing locally, I'm now getting 503 errors (instead of the 403 Forbidden from before). There currently isn't a check of the response status, so that should be updated as well.
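The status check could be as simple as requests' built-in raise_for_status(), again reusing the names from the first sketch:

```python
import requests

response = requests.get(base_url + fname, headers=headers, stream=True)
# Raise requests.HTTPError for 4xx/5xx responses (e.g. 403, 503) instead of
# silently writing an error page into the cache.
response.raise_for_status()
```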

Ultimately I think this PR puts the necessary infrastructure in place to cache the data, but we may also need to change the data source, as the original website does not seem very reliable.

@melissawm (Member) commented

Sounds reasonable - thanks, @rossbar!

@melissawm merged commit 0e5b8d4 into numpy:main on Mar 15, 2021
@8bitmp3 (Contributor) commented Mar 15, 2021

Thank you @rossbar

Note / to-do from the meeting: add a warning sign.

cc @melissawm

Development

Successfully merging this pull request may close these issues:

- MNIST dataset can't be downloaded automatically