Backup process to the Internet Archive #184
Merged
Conversation
Conflicts: requirements.txt

It all looks sane to me. 👍

konklone added a commit that referenced this pull request on Nov 24, 2014: Backup process to the Internet Archive
This adds a `backup` script to the root of the repository, to back up reports and bulk data to the Internet Archive. It's mildly intended to be generalizable (e.g. a backup to one's own S3 account), but not much -- it's hooked pretty tightly into `scripts/backup/ia.py`.

More details about how the Internet Archive's uploading system works can be found in #63. #63 won't be complete until the collection is fully uploaded to IA and syncing on a regular basis. This PR covers the scripts that will be used in these processes, and that have been used to update what's there in the collection right now.
Backing up individual reports
The primary use is like:
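A minimal sketch of the basic invocation (inferred from the cron note further down, which runs the script with no arguments):

```
./backup
```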
If you want to specify a specific `--report_id`, you have to specify an `--ig` and a `--year` too. By default, the backup script will mark archived reports with a little `ia.done` file in that report's directory, and not upload reports that already have one. The `--force` flag will override this behavior and always upload the specified reports. The `--meta` flag will only upload a report's `.json` data and not the report itself, which is really only useful for testing.
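For example, a hypothetical per-report invocation (the IG name and report ID here are placeholders, not taken from this PR):

```
# Upload one specific report; --ig and --year are required alongside --report_id
./backup --ig=usps --year=2014 --report_id=some-report-id --force

# Upload only the report's .json metadata (useful for testing)
./backup --ig=usps --year=2014 --report_id=some-report-id --meta
```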
So a cronjob that wanted to keep the report archive in sync with IA could just run `./backup` every X hours or days.
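A sketch of what that crontab entry might look like (the schedule and checkout path are placeholders):

```
# Hypothetical: sync newly downloaded reports to IA every 12 hours
0 */12 * * * cd /path/to/inspectors-general && ./backup
```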
Reports will be uploaded to the `usinspectorsgeneral` collection at https://archive.org/details/usinspectorsgeneral, which the Internet Archive created for the project after we uploaded ~90-100 reports using this script. The reports will be submitted to the Archive's "derivation queue", which should provide each PDF-based report with a pleasant little report reading interface, like this one.

Backing up a giant bulk data file
An alternate use is:
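A minimal sketch, using the bulk filename this description settles on below:

```
./backup --bulk=us-inspectors-general.bulk.zip
```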
If given a `--bulk` flag with a path to a zip file, this will upload that file directly to the Archive, at https://archive.org/details/us-inspectors-general.bulk. This item is marked as part of the `usinspectorsgeneral` collection, and is the permanent link I'm giving people (and linking to from the collection's description) as a quick way to download all ~40GB of the reports we have so far.

Using the bulk data backup means creating a `.zip` file of everything, excluding the `.done` files:
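The zip step could look something like this minimal sketch, assuming the reports live under the repository's data/ directory (an assumption about the layout; adjust to wherever reports are actually stored):

```
# Zip up the data directory, skipping the ia.done marker files
zip -r us-inspectors-general.bulk.zip data -x "*.done"
```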
Then uploading that zip file to the Internet Archive with:

```
cd /path/to/inspectors-general
./backup --bulk=us-inspectors-general.bulk.zip
```
This would also be suitable to do via cron, but maybe monthly or weekly. It's an expensive operation for everyone.
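For instance, a sketch of a monthly crontab entry (the schedule, path, and zip step are all placeholders):

```
# Hypothetical: rebuild and re-upload the bulk zip at 3am on the 1st of each month
0 3 1 * * cd /path/to/inspectors-general && zip -r us-inspectors-general.bulk.zip data -x "*.done" && ./backup --bulk=us-inspectors-general.bulk.zip
```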
Next steps
The next step is a full re-download of reports. @divergentdave has done yeoman's work in detecting and removing duplicate IDs, and integrating duplicate ID checking into the main report fetching process. @spulec did similarly rigorous work going back and adding report type detection to all the scrapers. These efforts won't apply to the full archive I have on my server, or that Sunlight has on theirs, unless we re-download everything. It'd be more efficient to migrate the collection in-place, but that's not really feasible.
I've initiated that re-download on my servers. Once that's done, I'll do a full backup to the Internet Archive, using the methods above.
Finally, once those are done, I'll instrument a couple of cronjobs that will keep things in sync automatically. Once that's done, #63 will be fixed, and we'll have a pretty nice system on our hands.