
Adding FOIA'd IG reports via GovernmentAttic.org scraper #276

Merged: 13 commits into unitedstates:master, Apr 8, 2016

Conversation

lukerosiak
Contributor

Adding FOIA'd IG reports is good because they are usually more scandalous than the ones published online. Some FOIA experts have created GovernmentAttic.org, which houses 2,000+ government reports obtained through records requests. Usually they FOIA for an index of all reports by an IG, then submit requests for individual ones that seem especially interesting. The site is regularly updated.

So I think it makes sense to piggyback off them (and I have gotten permission from them) since it is the only way to incorporate FOIA'd IG reports in an automated way.

This scraper first narrows GovAttic's reports down to only those from IGs, and then, within that set, to only the IGs already tracked by oversight.garden. This leaves about 420 documents right now. Some of these are actually multiple related IG reports bundled into one document, sometimes eight or more. The date is the date the document was obtained under FOIA and uploaded to GovAttic, not the date it was written by the IG. The inspector slug is set to the IG's slug, meaning the documents are saved alongside the PDFs produced by the actual IG site scrapers, rather than in a folder called govattic.
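
For illustration, that two-stage narrowing might look like the sketch below; the names (`tracked_ig_reports`, `TRACKED_IG_SLUGS`, `is_ig_report`, `inspector_slug`) are hypothetical, not the scraper's actual identifiers:

```python
# Hypothetical illustration of the two-stage narrowing described above.
TRACKED_IG_SLUGS = {"dod", "doj", "state"}  # slugs oversight.garden already tracks

def tracked_ig_reports(all_reports):
    for report in all_reports:
        if not report["is_ig_report"]:  # stage 1: keep only IG reports
            continue
        if report["inspector_slug"] in TRACKED_IG_SLUGS:  # stage 2: tracked IGs only
            yield report
```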

Comments from Eric: "I definitely welcome that contribution too, though it will be a bit more complicated. In small part because it's an unofficial source, but in large part because the quality of the documents I've seen there tends to be really poor and will need a lot of OCRing. But it's also a huge trove of super relevant documents (including the names of a ton of unreleased IG reports), so it's definitely worth including here if you're going to write it."

Response:
a) GovernmentAttic OCRs all documents they receive using high-quality OCR software, so we can extract text with pdftotext. The text layer appears to be surprisingly accurate, even for PDFs that are poorly scanned image files. Image quality can always be a problem when dealing with FOIA'd docs, but usually the government scanned them that way and put them on the CD; the requester isn't responsible for the quality loss. It is becoming more common for agencies to send native PDFs on CDs, but because of their redaction processes and other reasons, they usually don't.
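
As an aside, here is a minimal sketch of pulling that OCR text layer with poppler's pdftotext CLI; the helper name is illustrative, and the repo's own extraction code may differ:

```python
import subprocess

def extract_text(pdf_path):
    # "-" sends output to stdout; pdftotext reads the OCR text layer
    # GovernmentAttic embeds in each PDF.
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```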

b) To your point about official vs. unofficial: FOIA'd documents, which the README asks for, are always going to be unofficial. If there were a government resource for them, we wouldn't need FOIAs. Given that, GovAttic is as good as it gets, because its judgment about what to request mirrors what the average person would find interesting; its requests aren't limited to some specialty niche or bias.

```diff
@@ -180,7 +180,7 @@ def report_from(result, year, year_range,OUTCOME_CODES):


 def make_report_id(url):
-    return url.replace('/PublicFiles/','').replace('/publicfiles/','').replace('/','-').replace('.pdf','')
+    return inspector.slugify(url.replace('/PublicFiles/','').replace('/publicfiles/','').replace('.pdf',''))
```
Member

Just noting the extra OSC fix in this pull request, unrelated to the Government Attic work. 👍

@konklone
Member

@lukerosiak This is outstanding. My thanks for writing it, and for working with Government Attic to get their 👍 on including the work this way.

I want to give this some real review, and incorporate the results into a local copy of oversight.garden, before merging, but offhand this looks very thorough.

I'm guessing governmentattic.org is updated by hand as static files, and doesn't have the ability to easily add an RSS feed?

@lukerosiak
Contributor Author

Thank you! You'll be surprised how good the .json files look, because the PDFs have really good metadata (keywords, etc.) embedded in them. You're right that there's no CMS behind the site, so we can't get RSS.
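
For anyone curious, here is a sketch of reading that embedded metadata with poppler's pdfinfo; this is illustrative, not necessarily how the scraper populates the .json files:

```python
import subprocess

def pdf_keywords(pdf_path):
    # pdfinfo prints one "Field: value" line per metadata field.
    out = subprocess.run(["pdfinfo", pdf_path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("Keywords:"):
            return line.split(":", 1)[1].strip()
    return None
```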

@konklone
Member

Just mentioning that I'm not dead, and this is still on my todo list; I should be able to get to it by the end of next weekend. Others on the project are welcome to perform the review as well.

@divergentdave
Contributor

First off, thanks a ton for putting this together, @lukerosiak! This will be a great addition to the data set. I'm going to do my review changes as another PR to your branch, in part so I can test what I'm saying, and in part because the character set and date parsing bits are getting hairy.

In looking at remove_non_ascii(), it appears that utils.download() is not detecting the correct character set for some pages. The governmentattic.org server doesn't provide a character set in the Content-Type, but the HTML does include a meta tag that specifies utf-8. This gets lost in the composition of requests and BeautifulSoup, resulting in mojibake when requests guesses wrong. My plan is to add a special case to utils.download() telling requests to use utf-8. Between that and the inspector.sanitize() method, it looks like we'll only have center dots and trademark symbols left, which is fine for our purposes.
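
Roughly, the special case would look something like this sketch against plain requests; the real utils.download() has more to it:

```python
import requests

def download(url):
    response = requests.get(url)
    # The server sends no charset in Content-Type, so requests falls back
    # to guessing and can mis-decode pages. The HTML's <meta> tag says
    # utf-8, so force it before .text decodes the body.
    if "governmentattic.org" in url:
        response.encoding = "utf-8"
    return response.text
```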

Following #273, I'd like to avoid having a default date for reports. I'm going to add a call to the new inspector.log_no_date() function, and add a couple more heuristic tweaks to parse more dates.
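
In spirit, the heuristics look like this sketch: try progressively more formats and return None rather than a default, leaving the caller to record the miss via inspector.log_no_date() (whose exact signature I'm not reproducing here):

```python
from datetime import datetime

# Illustrative date formats; the real scraper's list differs.
DATE_FORMATS = ["%B %d, %Y", "%d-%b-%Y", "%m/%d/%Y", "%Y-%m-%d"]

def parse_report_date(text):
    text = text.strip().rstrip(".")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None  # caller logs via inspector.log_no_date() instead of defaulting
```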

I have opened an issue over at konklone/oversight.garden#99 to index and search the PDF keywords.

Thanks again, and expect a meta-PR from me shortly.

Nits:

  • Going to mark the new script as executable
  • remove_non_ascii() is doubly indented
  • There's a branch that never gets taken after a = result.find('a'), because find() searches the whole DOM subtree by default, not just direct children (see the sketch below)
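
On that last nit, a quick demonstration of why the branch is dead, assuming the usual bs4 behavior:

```python
from bs4 import BeautifulSoup

row = BeautifulSoup("<tr><td><a href='r.pdf'>Report</a></td></tr>",
                    "html.parser").tr

# find() searches all descendants by default, so this is never None here,
# even though the <a> is not a direct child of the <tr>:
print(row.find("a"))                   # <a href="r.pdf">Report</a>

# Only with recursive=False would the None branch be reachable:
print(row.find("a", recursive=False))  # None
```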

@lukerosiak
Contributor Author

@divergentdave Looks great to me, thank you for making those improvements! I will keep them in mind for the future. Do you need me to do anything, or are we good to go on GovAttic?

I noticed that OSC reports are not online even though my prior PR has been merged (#236). Do I need to do anything on that?

After OSC and GovAttic are online, I will move on to GAO per #269.

@divergentdave
Contributor

Hmm, #236 probably didn't get deployed. I'll look into that tonight. Thanks again!

@divergentdave reopened this on Apr 8, 2016
@divergentdave merged commit 9b3b8cd into unitedstates:master on Apr 8, 2016
@lukerosiak
Contributor Author

I believe this still needs to be deployed as well to add OSC to the list of inspectors on the oversight.garden repo:

konklone/oversight.garden#93


@divergentdave
Copy link
Contributor

Both are properly deployed now; indexing is still in progress.

https://oversight.garden/reports?query=governmentattic.org
https://oversight.garden/reports?inspector=osc
