
Adding FOIA'd IG reports via GovernmentAttic.org scraper #276

Merged: 13 commits into unitedstates:master, Apr 8, 2016

Conversation

lukerosiak
Contributor

Adding FOIA'd IG reports is good because they are usually more scandalous than the ones published online. Some FOIA experts have created GovernmentAttic.org, which houses 2,000+ government reports obtained through records requests. Usually they FOIA for an index of all reports by an IG, then submit requests for individual ones that seem especially interesting. The site is regularly updated.

So I think it makes sense to piggyback off them (and I have gotten permission from them) since it is the only way to incorporate FOIA'd IG reports in an automated way.

This scraper first narrows GovAttic's reports down to only those from IGs, and then, within that set, to only the IGs already tracked by oversight.garden. This leaves about 420 documents right now. Some of these are actually multiple related IG reports bundled into one document, sometimes eight or more. The date is the date the document was obtained under FOIA and uploaded to GovAttic, not the date it was written by the IG. The inspector slug is set to the IG's slug, meaning the documents are saved alongside the PDFs produced by the actual IG site scrapers, rather than in a folder called govattic.
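
For illustration, that two-stage narrowing might look like the sketch below; the names (`tracked_ig_reports`, `TRACKED_IG_SLUGS`, `is_ig_report`, `inspector_slug`) are hypothetical, not the scraper's actual identifiers:

```python
# Hypothetical illustration of the two-stage narrowing described above.
TRACKED_IG_SLUGS = {"dod", "doj", "state"}  # slugs oversight.garden already tracks

def tracked_ig_reports(all_reports):
    for report in all_reports:
        if not report["is_ig_report"]:  # stage 1: keep only IG reports
            continue
        if report["inspector_slug"] in TRACKED_IG_SLUGS:  # stage 2: tracked IGs only
            yield report
```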

Comments from Eric: "I definitely welcome that contribution too, though it will be a bit more complicated. In small part because it's an unofficial source, but in large part because the quality of the documents I've seen there tends to be really poor and will need a lot of OCRing. But it's also a huge trove of super relevant documents (including the names of a ton of unreleased IG reports), so it's definitely worth including here if you're going to write it."

Response:
a) GovernmentAttic OCRs all documents they receive using high-quality OCR software, so we can extract text with pdftotext. The text layer appears to be surprisingly accurate, even for PDFs that are poorly scanned image files. Image quality can always be a problem when dealing with FOIA'd docs, but usually the government scanned them that way and put them on the CD; the requester isn't responsible for the quality loss. It is becoming more common for agencies to send native PDFs on CDs, but because of their redaction processes and other reasons, they usually don't.
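
As an aside, here is a minimal sketch of pulling that OCR text layer with poppler's pdftotext CLI; the helper name is illustrative, and the repo's own extraction code may differ:

```python
import subprocess

def extract_text(pdf_path):
    # "-" sends output to stdout; pdftotext reads the OCR text layer
    # GovernmentAttic embeds in each PDF.
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```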

b) To your point about official vs. unofficial: FOIA'd documents, which the README asks for, are always going to be unofficial. If there were a government resource for them, we wouldn't need FOIAs. Given that, GovAttic is as good as it gets, because its judgment about what to request mirrors what the average person would find interesting; its requests aren't limited to some specialty niche or bias.

```diff
@@ -180,7 +180,7 @@ def report_from(result, year, year_range,OUTCOME_CODES):


 def make_report_id(url):
-    return url.replace('/PublicFiles/','').replace('/publicfiles/','').replace('/','-').replace('.pdf','')
+    return inspector.slugify(url.replace('/PublicFiles/','').replace('/publicfiles/','').replace('.pdf',''))
```
Member

Just noting the extra OSC fix in this pull request, unrelated to the Government Attic work. 👍

@konklone
Member

@lukerosiak This is outstanding. My thanks for writing it, and for working with Government Attic to get their 👍 on including the work this way.

I want to give this some real review, and incorporate the results into a local copy of oversight.garden, before merging, but offhand this looks very thorough.

I'm guessing governmentattic.org is updated by hand as static files, and doesn't have the ability to easily add an RSS feed?

@lukerosiak
Contributor Author

Thank you! You'll be surprised how good the .json files look, because the PDFs have really good metadata (keywords, etc.) embedded in them. You're right that there's no CMS behind the site, so we can't get RSS.
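
For anyone curious, here is a sketch of reading that embedded metadata with poppler's pdfinfo; this is illustrative, not necessarily how the scraper populates the .json files:

```python
import subprocess

def pdf_keywords(pdf_path):
    # pdfinfo prints one "Field: value" line per metadata field.
    out = subprocess.run(["pdfinfo", pdf_path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("Keywords:"):
            return line.split(":", 1)[1].strip()
    return None
```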

@konklone
Member

Just mentioning that I'm not dead, and this is still on my todo list; I should be able to get to it by the end of next weekend. Others on the project are welcome to perform the review as well.

@divergentdave
Contributor

First off, thanks a ton for putting this together, @lukerosiak! This will be a great addition to the data set. I'm going to do my review changes as another PR to your branch, in part so I can test what I'm saying, and in part because the character set and date parsing bits are getting hairy.

In looking at remove_non_ascii(), it appears that utils.download() is not detecting the correct character set for some pages. The governmentattic.org server doesn't provide a character set in the Content-Type, but the HTML does include a meta tag that specifies utf-8. This gets lost in the composition of requests and BeautifulSoup, resulting in mojibake when requests guesses wrong. My plan is to add a special case to utils.download() telling requests to use utf-8. Between that and the inspector.sanitize() method, it looks like we'll only have center dots and trademark symbols left, which is fine for our purposes.
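
Roughly, the special case would look something like this sketch against plain requests; the real utils.download() has more to it:

```python
import requests

def download(url):
    response = requests.get(url)
    # The server sends no charset in Content-Type, so requests falls back
    # to guessing and can mis-decode pages. The HTML's <meta> tag says
    # utf-8, so force it before .text decodes the body.
    if "governmentattic.org" in url:
        response.encoding = "utf-8"
    return response.text
```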

Following #273, I'd like to avoid having a default date for reports. I'm going to add a call to the new inspector.log_no_date() function, and add a couple more heuristic tweaks to parse more dates.
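
In spirit, the heuristics look like this sketch: try progressively more formats and return None rather than a default, leaving the caller to record the miss via inspector.log_no_date() (whose exact signature I'm not reproducing here):

```python
from datetime import datetime

# Illustrative date formats; the real scraper's list differs.
DATE_FORMATS = ["%B %d, %Y", "%d-%b-%Y", "%m/%d/%Y", "%Y-%m-%d"]

def parse_report_date(text):
    text = text.strip().rstrip(".")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None  # caller logs via inspector.log_no_date() instead of defaulting
```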

I have opened an issue over at konklone/oversight.garden#99 to index and search the PDF keywords.

Thanks again, and expect a meta-PR from me shortly.

Nits:

  • Going to mark the new script as executable
  • remove_non_ascii() is doubly indented
  • There's a branch that never gets taken after a = result.find('a'), because find() searches the whole DOM subtree by default, not just direct children (see the sketch below)
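
On that last nit, a quick demonstration of why the branch is dead, assuming the usual bs4 behavior:

```python
from bs4 import BeautifulSoup

row = BeautifulSoup("<tr><td><a href='r.pdf'>Report</a></td></tr>",
                    "html.parser").tr

# find() searches all descendants by default, so this is never None here,
# even though the <a> is not a direct child of the <tr>:
print(row.find("a"))                   # <a href="r.pdf">Report</a>

# Only with recursive=False would the None branch be reachable:
print(row.find("a", recursive=False))  # None
```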

@lukerosiak
Contributor Author

@divergentdave Looks great to me, thank you for making those improvements! I will keep them in mind for the future. Do you need me to do anything, or are we good to go on GovAttic?

I noticed that OSC reports are not online even though my prior PR has been merged (#236). Do I need to do anything on that?

After OSC and GovAttic are online, I will move on to GAO per #269.

@divergentdave
Contributor

Hmm, #236 probably didn't get deployed. I'll look into that tonight. Thanks again!

@divergentdave reopened this on Apr 8, 2016
@divergentdave merged commit 9b3b8cd into unitedstates:master on Apr 8, 2016
@lukerosiak
Contributor Author

I believe this still needs to be deployed as well to add OSC to the list of inspectors on the oversight.garden repo:

konklone/oversight.garden#93


@divergentdave
Copy link
Contributor

Both are properly deployed now; indexing is still in progress.

https://oversight.garden/reports?query=governmentattic.org
https://oversight.garden/reports?inspector=osc
