
[crec_stager] first commit #4

Open · wants to merge 3 commits into base: master
Conversation

@will-horning will-horning commented Feb 16, 2017

This adds a new entry point called crec_stager to the data pipeline. The existing pipeline operates entirely off of local disk (all three parts of the ETL pipeline).

The crec_stager downloads the previous day's CREC zip from gpo.gov to local disk, extracts all .htm files, then uploads each one to S3 using the key format <CUSTOM_PREFIX>/YYYY/MM/DD/<CREC_FILENAME>.zip.
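The download/extract/upload flow described above can be sketched roughly as follows. The function names, the key-building helper, and the use of boto3's `put_object` are illustrative assumptions, not the actual crec_stager code:

```python
import io
import zipfile
from datetime import datetime


def build_s3_key(prefix, dt, filename):
    """Build a key of the form <PREFIX>/YYYY/MM/DD/<FILENAME>."""
    return '{0}/{1}/{2}'.format(prefix, dt.strftime('%Y/%m/%d'), filename)


def stage_crec(bucket, prefix, zip_bytes, dt):
    """Extract .htm files from a CREC zip and upload each to S3."""
    import boto3  # assumed available in the pipeline/Lambda environment
    s3 = boto3.client('s3')
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith('.htm'):
                key = build_s3_key(prefix, dt, name.split('/')[-1])
                s3.put_object(Bucket=bucket, Key=key, Body=zf.read(name))
```

Keeping the key construction in a pure helper like `build_s3_key` makes the date-partitioned layout easy to unit-test without touching S3.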

It's designed to be run either locally or as an AWS Lambda job.

To run locally:

python crec_stager.py

That will upload the data from yesterday to our good ol' test bukkit: use-this-bucket-to-test-your-bullshit. Check out the docstrings for more details.

I also deployed it to Chartbeat's AWS account as the Lambda function lambda_test, and set up a scheduled event trigger to run it once per day.
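The scheduled Lambda entry point can be sketched roughly like this; the handler name and return shape are assumptions for illustration, not necessarily what's deployed as lambda_test:

```python
from datetime import datetime, timedelta


def lambda_handler(event, context):
    # A once-per-day CloudWatch scheduled event carries no useful payload,
    # so every invocation just stages yesterday's CREC.
    target = datetime.utcnow() - timedelta(days=1)
    date_str = target.strftime('%Y-%m-%d')
    # In the real job this is where the CREC zip for `target` would be
    # downloaded and its .htm files uploaded to S3.
    return {'staged_date': date_str}
```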

Eventually we'll need to convert the parser and solr importer to work off of the staged files in S3 instead of local disk, but this is at least a starting point to make this ETL pipeline a little more robust.

@rmangi @ackramer @jeiranj


@ackramer ackramer left a comment


Just curious, how does this relate to the current scraper.py? Will parts of that still be required or is this a replacement?

LGTM

)
parser.add_argument(
    '--loglevel',
    help='Log level, one of INFO, ERROR, WARN, DEBUG or CRITICAL.',


choices=[logging.INFO, logging.ERROR, ...] and then default=logging.INFO. then you can get rid of the LOGLEVELS dict. :)
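One way that suggestion could look in practice. Note that CLI values arrive as strings, so this sketch uses the level *names* as choices and maps them to the numeric constants with `getattr`, rather than listing `logging.INFO` etc. directly:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
# Let argparse validate the level name directly, replacing the LOGLEVELS dict.
parser.add_argument(
    '--loglevel',
    choices=['INFO', 'ERROR', 'WARN', 'DEBUG', 'CRITICAL'],
    default='INFO',
    help='Log level.',
)
args = parser.parse_args(['--loglevel', 'DEBUG'])  # example invocation
level = getattr(logging, args.loglevel)  # maps 'DEBUG' -> logging.DEBUG
```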


Attributes:
DEFAULT_LOG_FORMAT (:obj:`str`): A template string for log lines.
LOGLEVELS (:obj:`dict`): A lookup of loglevel name to the loglevel code.


😍 rst docstrings ftw

@will-horning
Author

@ackramer It's a replacement for the scraper, but it's not currently compatible with the parser and Solr importer.
