Backup process to the Internet Archive #184

Merged: konklone merged 26 commits into master from internet-archive on Nov 24, 2014
Conversation

konklone (Member) commented:
This adds a backup script to the root of the repository that backs up reports and bulk data to the Internet Archive.

It's loosely intended to be generalizable (e.g. backing up to one's own S3 account), but not by much -- it's hooked pretty tightly into scripts/backup/ia.py.

More details about how the Internet Archive's uploading system works can be found in #63. #63 won't be complete until the collection is fully uploaded to IA and syncing on a regular basis. This PR covers the scripts that will be used in those processes, and that have already been used to upload what's in the collection right now.

Backing up individual reports

The primary use looks like:

./backup [--ig] [--year] [--report_id] [--force] [--meta]

If you want to specify a --report_id, you have to specify an --ig and a --year too. By default, the backup script marks archived reports with a little ia.done file in that report's directory, and skips reports that already have one. The --force flag overrides this behavior and always uploads the specified reports. The --meta flag uploads only a report's .json metadata and not the report itself, which is really only useful for testing.
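For example (the IG, year, and report ID values here are hypothetical, and the --flag=value syntax is an assumption based on the --bulk example below):

# back up one specific report (requires --ig and --year)
./backup --ig=usps --year=2014 --report_id=some-report-id

# re-upload all 2014 reports for one IG, even ones already marked done
./backup --ig=usps --year=2014 --force

# upload only the .json metadata, for testing
./backup --ig=usps --meta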

So a cronjob that wanted to keep the report archive in sync with IA could just run ./backup every X hours or days.
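For instance, a crontab entry along these lines (the path and schedule are hypothetical):

# sync new reports to the Internet Archive every 6 hours
0 */6 * * * cd /path/to/inspectors-general && ./backup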

Reports will be uploaded to the usinspectorsgeneral collection at https://archive.org/details/usinspectorsgeneral, which the Internet Archive created for the project after we uploaded ~90-100 reports using this script. The reports will be submitted to the Archive's "derivation queue", which should give each PDF-based report a pleasant little report-reading interface, like this one.

Backing up a giant bulk data file

An alternate use is:

./backup --bulk=us-inspectors-general.bulk.zip

If given a --bulk flag with a path to a zip file, this will upload that file directly to the Archive, at https://archive.org/details/us-inspectors-general.bulk. That item is marked as part of the usinspectorsgeneral collection, and its URL is the permanent link I'm giving people (and linking to from the collection's description) as a quick way to download all ~40GB of the reports we have so far.

Using the bulk data backup means creating a .zip file of everything, excluding the .done files:

cd /path/to/inspectors-general/data
zip -r ../us-inspectors-general.bulk.zip * -x "*.done"
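To sanity-check that the .done files were actually excluded (a quick check, not part of the backup script itself):

# should print nothing if the exclusion worked
unzip -l ../us-inspectors-general.bulk.zip | grep "\.done$"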

Then uploading that zip file to the Internet Archive with:

cd /path/to/inspectors-general
./backup --bulk=us-inspectors-general.bulk.zip

This would also be suitable to run via cron, though probably only weekly or monthly -- it's an expensive operation for everyone involved.
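As a sketch, those two steps could be wrapped in a small script (the script name and paths are hypothetical) and run from cron monthly:

#!/bin/bash
# bulk-backup.sh -- zip up the report archive and upload it to the Internet Archive
set -e

# rebuild the zip from scratch each time, so deleted reports don't linger
rm -f /path/to/inspectors-general/us-inspectors-general.bulk.zip

cd /path/to/inspectors-general/data
zip -r ../us-inspectors-general.bulk.zip * -x "*.done"

cd /path/to/inspectors-general
./backup --bulk=us-inspectors-general.bulk.zip

And a crontab line to run it at midnight on the first of each month:

0 0 1 * * /path/to/inspectors-general/bulk-backup.sh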

Next steps

The next step is a full re-download of reports. @divergentdave has done yeoman's work in detecting and removing duplicate IDs, and integrating duplicate ID checking into the main report fetching process. @spulec did similarly rigorous work going back and adding report type detection to all the scrapers. These efforts won't apply to the full archive I have on my server, or to the one Sunlight has on theirs, unless we re-download everything. It'd be more efficient to migrate the collection in place, but that's not really feasible.

I've initiated that re-download on my servers. Once that's done, I'll do a full backup to the Internet Archive, using the methods above.

Finally, once those are done, I'll set up a couple of cronjobs that will keep things in sync automatically. At that point, #63 will be fixed, and we'll have a pretty nice system on our hands.

parkr (Member) commented Nov 24, 2014:

It all looks sane to me. 👍

konklone added a commit that referenced this pull request on Nov 24, 2014: "Backup process to the Internet Archive"
konklone merged commit e5c966b into master on Nov 24, 2014
konklone deleted the internet-archive branch on November 24, 2014 at 06:59