Backup process to the Internet Archive #184
Merged
Conversation
Conflicts: requirements.txt

It all looks sane to me. 👍

konklone added a commit that referenced this pull request on Nov 24, 2014: Backup process to the Internet Archive
This adds a `backup` script to the root of the repository, to back up reports and bulk data to the Internet Archive. It's mildly intended to be generalizable (e.g. a backup to one's own S3 account), but not much -- it's hooked pretty tightly into `scripts/backup/ia.py`.

More details about how the Internet Archive's uploading system works can be found in #63. #63 won't be complete until the collection is fully uploaded to IA and syncing on a regular basis. This PR covers the scripts that will be used in these processes, and that have been used to update what's there in the collection right now.
Backing up individual reports
The primary use is like:
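A minimal sketch of the basic invocation (inferred from the cron note further down, which runs the script with no arguments):

```
./backup
```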
If you want to specify a specific `--report_id`, you have to specify an `--ig` and a `--year` too. By default, the backup script will mark archived reports with a little `ia.done` file in that report's directory, and not upload reports that already have one. The `--force` flag will override this behavior and always upload the specified reports. The `--meta` flag will only upload a report's `.json` data and not the report itself, which is really only useful for testing.
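For example, a hypothetical per-report invocation (the IG name and report ID here are placeholders, not taken from this PR):

```
# Upload one specific report; --ig and --year are required alongside --report_id
./backup --ig=usps --year=2014 --report_id=some-report-id --force

# Upload only the report's .json metadata (useful for testing)
./backup --ig=usps --year=2014 --report_id=some-report-id --meta
```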
So a cronjob that wanted to keep the report archive in sync with IA could just run `./backup` every X hours or days.
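A sketch of what that crontab entry might look like (the schedule and checkout path are placeholders):

```
# Hypothetical: sync newly downloaded reports to IA every 12 hours
0 */12 * * * cd /path/to/inspectors-general && ./backup
```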
Reports will be uploaded to the `usinspectorsgeneral` collection at https://archive.org/details/usinspectorsgeneral, which the Internet Archive created for the project after we uploaded ~90-100 reports using this script. The reports will be submitted to the Archive's "derivation queue", which should provide each PDF-based report with a pleasant little report reading interface, like this one.

Backing up a giant bulk data file
An alternate use is:
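A minimal sketch, using the bulk filename this description settles on below:

```
./backup --bulk=us-inspectors-general.bulk.zip
```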
If given a `--bulk` flag with a path to a zip file, this will upload that file directly to the Archive, at https://archive.org/details/us-inspectors-general.bulk. This item is marked as part of the `usinspectorsgeneral` collection, and is the permanent link I'm giving people (and linking to from the collection's description) as a quick way to download all ~40GB of the reports we have so far.

Using the bulk data backup means creating a `.zip` file of everything, excluding the `.done` files:
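The zip step could look something like this minimal sketch, assuming the reports live under the repository's data/ directory (an assumption about the layout; adjust to wherever reports are actually stored):

```
# Zip up the data directory, skipping the ia.done marker files
zip -r us-inspectors-general.bulk.zip data -x "*.done"
```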
Then uploading that zip file to the Internet Archive with:

```
cd /path/to/inspectors-general
./backup --bulk=us-inspectors-general.bulk.zip
```
This would also be suitable to do via cron, but maybe monthly or weekly. It's an expensive operation for everyone.
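For instance, a sketch of a monthly crontab entry (the schedule, path, and zip step are all placeholders):

```
# Hypothetical: rebuild and re-upload the bulk zip at 3am on the 1st of each month
0 3 1 * * cd /path/to/inspectors-general && zip -r us-inspectors-general.bulk.zip data -x "*.done" && ./backup --bulk=us-inspectors-general.bulk.zip
```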
Next steps
The next step is a full re-download of reports. @divergentdave has done yeoman's work in detecting and removing duplicate IDs, and integrating duplicate ID checking into the main report fetching process. @spulec did similarly rigorous work going back and adding report type detection to all the scrapers. These efforts won't apply to the full archive I have on my server, or that Sunlight has on theirs, unless we re-download everything. It'd be more efficient to migrate the collection in-place, but that's not really feasible.
I've initiated that re-download on my servers. Once that's done, I'll do a full backup to the Internet Archive, using the methods above.
Finally, once those are done, I'll instrument a couple of cronjobs that will keep things in sync automatically. Once that's done, #63 will be fixed, and we'll have a pretty nice system on our hands.