Adding FOIA'd IG reports via GovernmentAttic.org scraper #276
@@ -180,7 +180,7 @@ def report_from(result, year, year_range,OUTCOME_CODES):
 def make_report_id(url):
-  return url.replace('/PublicFiles/','').replace('/publicfiles/','').replace('/','-').replace('.pdf','')
+  return inspector.slugify(url.replace('/PublicFiles/','').replace('/publicfiles/','').replace('.pdf',''))
Just noting the extra OSC fix in this pull request, unrelated to the Government Attic work. 👍
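To illustrate what the OSC change in the diff above does, here is a minimal sketch comparing the two versions of `make_report_id`. The `slugify` function below is a hypothetical stand-in written for this example; it is not the project's `inspector.slugify`, whose exact behavior may differ. The sample path is also invented.

```python
import re


def slugify(text):
    # Hypothetical stand-in for inspector.slugify (an assumption, not the
    # project's implementation): collapse every run of characters other than
    # ASCII letters and digits into a single hyphen.
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")


def make_report_id_old(url):
    # Old version from the diff: only slashes are turned into hyphens.
    return url.replace('/PublicFiles/', '').replace('/publicfiles/', '').replace('/', '-').replace('.pdf', '')


def make_report_id_new(url):
    # New version from the diff: strip the prefix and extension, then slugify.
    return slugify(url.replace('/PublicFiles/', '').replace('/publicfiles/', '').replace('.pdf', ''))


sample = '/PublicFiles/2015/Some Report.pdf'  # invented path for illustration
print(make_report_id_old(sample))  # 2015-Some Report
print(make_report_id_new(sample))  # 2015-Some-Report
```

Under this stand-in, the old version leaves spaces and other punctuation in the ID, while the new one normalizes them, which is one plausible motivation for the fix.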
@lukerosiak This is outstanding, and my thanks for writing this, and for working with Government Attic to get their 👍 on including the work this way. I want to give this some real review, and incorporation of the results into a local copy of oversight.garden, before merging -- but offhand this looks very thorough. I'm guessing governmentattic.org is updated by hand as static files, and doesn't have the ability to easily add an RSS feed?
Thank you! You'll be surprised how good the .json files look because the PDFs have really good metadata of keywords, etc. embedded in them. You're right that there's no CMS behind the site, so we can't get RSS.
Just mentioning that I'm not dead, and this is still in my todo list -- I should be able to get to it by the end of next weekend. Others on the project are welcome to perform the review as well.
First off, thanks a ton for putting this together, @lukerosiak! This will be a great addition to the data set. I'm going to do my review changes as another PR to your branch, in part so I can test what I'm saying, and in part because the character set and date parsing bits are getting hairy.
In looking at […]
Following #273, I'd like to avoid having a default date for reports. I'm going to add a call to the new […]
I have opened an issue over at konklone/oversight.garden#99 to index and search the PDF keywords.
Thanks again, and expect a meta-PR from me shortly.
Nits:
governmentattic.org doesn't provide an encoding in HTTP headers, and requests can't parse <meta> tags inside the body of a document. Thus, we explicitly decode these pages as utf-8 with a special case for governmentattic.org.
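The special case described in this commit message could look roughly like the sketch below. It is not the code from the PR: the `ENCODING_OVERRIDES` table, the `decode_body` helper, and the ISO-8859-1 fallback are all assumptions made for illustration, and the example works on raw bytes so it runs without a network connection.

```python
from urllib.parse import urlparse

# Hypothetical override table (an assumption for this sketch): hosts whose
# pages should always be decoded with a fixed encoding.
ENCODING_OVERRIDES = {
    "governmentattic.org": "utf-8",
    "www.governmentattic.org": "utf-8",
}


def decode_body(raw, url, header_charset=None):
    """Decode response bytes, preferring a per-host override.

    raw            -- the undecoded response body (bytes)
    url            -- the URL the body was fetched from
    header_charset -- charset from the Content-Type header, if any
    """
    host = urlparse(url).hostname or ""
    # The override wins; otherwise use the header charset; otherwise fall
    # back to ISO-8859-1 (an assumed default for this sketch).
    encoding = ENCODING_OVERRIDES.get(host) or header_charset or "iso-8859-1"
    return raw.decode(encoding, errors="replace")


body = "Café".encode("utf-8")
print(decode_body(body, "http://www.governmentattic.org/page.html"))  # Café
```

With the `requests` library, setting `response.encoding = "utf-8"` before reading `response.text` should have the same effect for a given response.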
@divergentdave Looks great to me, thank you for making those improvements! I will keep them in mind for the future. Do you need me to do anything or are we good to go on GovAttic? I noticed that OSC reports are not online even though my prior PR has been merged in (#236). Do I need to do anything on that? After OSC and GovAttic are online, I will move on to GAO per #269.
Hmm, #236 probably didn't get deployed. I'll look into that tonight. Thanks again!
I believe this still needs to be deployed as well to add OSC to the list of […]
Both are properly deployed now, indexing is still in progress. https://oversight.garden/reports?query=governmentattic.org
Adding FOIA'd IG reports is worthwhile because they are usually more scandalous than the ones published online. A group of FOIA experts runs GovernmentAttic.org, which houses 2,000+ government reports obtained through records requests. Usually they FOIA an index of all reports by an IG, then request the individual ones that seem especially interesting. The site is regularly updated.
So I think it makes sense to piggyback off them (and I have gotten permission from them) since it is the only way to incorporate FOIA'd IG reports in an automated way.
This scraper first narrows GovAttic's reports down to only those from IGs, and then within that, to only IGs already tracked by oversight.garden. This leaves you with about 420 documents right now. Some of these are actually multiple related IG reports in one document--sometimes eight or more. The date is the date it was obtained under FOIA/uploaded to GovAttic, not the date it was written by the IG. The inspector slug is set to the IG's slug, meaning it saves the documents alongside PDFs produced by the actual IG site scrapers, rather than a folder called govattic.
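The two-stage narrowing described above can be sketched as follows. This is not the scraper's actual code: the title pattern, the `TRACKED_IGS` mapping, the field names, and the sample listings are all hypothetical values chosen for illustration.

```python
import re

# Hypothetical mapping from agency name to the inspector slug used by the
# existing IG scrapers (illustrative values only).
TRACKED_IGS = {
    "Department of Energy": "energy",
    "Department of the Interior": "interior",
}

# Assumed heuristic for spotting IG material in a listing title.
IG_PATTERN = re.compile(r"\b(inspector general|OIG)\b", re.IGNORECASE)


def select_reports(listings, tracked=TRACKED_IGS):
    selected = []
    for item in listings:
        # Stage 1: keep only listings that look like IG reports.
        if not IG_PATTERN.search(item["title"]):
            continue
        # Stage 2: keep only IGs that already have a slug.
        slug = tracked.get(item["agency"])
        if slug is None:
            continue
        selected.append({
            "inspector": slug,  # filed alongside the IG's own reports
            "title": item["title"],
            "url": item["url"],
            # Date posted to the archive, not the date the IG wrote it.
            "published_on": item["posted"],
        })
    return selected


listings = [  # invented sample data
    {"title": "OIG audit reports, 2012", "agency": "Department of Energy",
     "url": "https://example.org/a.pdf", "posted": "2016-01-04"},
    {"title": "Cafeteria menus", "agency": "Department of Energy",
     "url": "https://example.org/b.pdf", "posted": "2016-01-05"},
    {"title": "Inspector General closing memos", "agency": "Untracked Agency",
     "url": "https://example.org/c.pdf", "posted": "2016-01-06"},
]
print([r["inspector"] for r in select_reports(listings)])  # ['energy']
```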
Comments from Eric: "I definitely welcome that contribution too, though it will be a bit more complicated. In small part because it's an unofficial source, but in large part because the quality of the documents I've seen there tends to be really poor and will need a lot of OCRing. But it's also a huge trove of super relevant documents (including the names of a ton of unreleased IG reports), so it's definitely worth including here if you're going to write it."
Response:
a) GovernmentAttic OCRs all documents they receive using high-quality OCR software, so we can extract text with pdftotext. The extracted text layer appears to be surprisingly accurate, even for PDFs that are poorly scanned image files. Image quality can always be a problem when dealing with FOIA'd docs, but usually the agency scanned them that way and put them on the CD; the requester is not responsible for the quality loss. It is becoming more common for agencies to send native PDFs on CD, but because of their redaction processes and other reasons, they usually don't.
b) To your point about official vs. unofficial: FOIA'd documents, which the README asks for, are always going to be unofficial. If there were a government resource for them, we wouldn't need FOIA requests. Given that, GovAttic is as good as it gets, because its judgment about what to request mirrors what the average person might find interesting; its requests aren't limited to a specialty niche or skewed by a particular bias.
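For point (a), text extraction with pdftotext could be wrapped along these lines. This is a sketch, not the project's extraction code: the helper names are hypothetical, and it assumes the Poppler `pdftotext` binary is installed and on the PATH.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=True):
    # Build the argument list. "-layout" asks pdftotext to preserve the
    # physical layout; "-" as the output file sends the text to stdout.
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path):
    # Fail early with a clear message if the binary is missing.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    # Decode leniently, since OCR text layers can contain stray bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("report.pdf")` would return whatever text layer the PDF carries; for scanned documents that is the OCR output embedded by whoever processed the file.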