[crec_stager] first commit #4
base: master
Conversation
Just curious, how does this relate to the current scraper.py? Will parts of that still be required or is this a replacement?
LGTM
)
parser.add_argument(
    '--loglevel',
    help='Log level, one of INFO, ERROR, WARN, DEBUG or CRITICAL.',
`choices=[logging.INFO, logging.ERROR, ...]` and then `default=logging.INFO`. Then you can get rid of the LOGLEVELS dict. :)
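A minimal sketch of that suggestion (the surrounding parser setup here is illustrative, not the code from this PR):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument(
    '--loglevel',
    type=int,  # so "--loglevel 20" matches the numeric constants in choices
    choices=[logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL],
    default=logging.INFO,
    help='Log level as a numeric logging constant (10, 20, 30, 40 or 50).',
)
args = parser.parse_args()
logging.basicConfig(level=args.loglevel)  # no LOGLEVELS dict lookup needed
```

The trade-off is that the flag then takes numeric values (20 for INFO) rather than names; keeping a name-based flag and mapping it with `getattr(logging, name)` is a reasonable alternative if string levels are preferred.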
Attributes:
    DEFAULT_LOG_FORMAT (:obj:`str`): A template string for log lines.
    LOGLEVELS (:obj:`dict`): A lookup of loglevel name to the loglevel code.
😍 rst docstrings ftw
@ackramer It's a replacement for the scraper, but it's not currently compatible with the parser and solr importer.
This adds a new entry point called crec_stager to the data pipeline. The existing pipeline operates entirely off of local disk (all three parts of the ETL pipeline).
The crec_stager downloads the previous day's CREC zip from gpo.gov to local disk, extracts all `.htm` files, then uploads each one to S3 using the key format `<CUSTOM_PREFIX>/YYYY/MM/DD/<CREC_FILENAME>.zip`. It's designed to be run either locally or as an AWS lambda job.
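As a rough sketch of that key layout (the helper, bucket, and prefix names below are assumptions for illustration, not the actual crec_stager code), the upload step might look something like this with boto3:

```python
import os
from datetime import datetime, timedelta

import boto3


def stage_extracted_files(htm_paths, bucket, prefix):
    """Upload extracted CREC files to s3://<bucket>/<prefix>/YYYY/MM/DD/<filename>."""
    # The stager works off the previous day's CREC.
    date = datetime.utcnow() - timedelta(days=1)
    s3 = boto3.client('s3')
    keys = []
    for path in htm_paths:
        key = '{0}/{1}/{2}'.format(prefix, date.strftime('%Y/%m/%d'), os.path.basename(path))
        s3.upload_file(path, bucket, key)
        keys.append(key)
    return keys
```

For a local test this could be pointed at the test bucket, e.g. `stage_extracted_files(paths, 'use-this-bucket-to-test-your-bullshit', 'crec')`, where the `'crec'` prefix is made up for the example.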
To run locally, invoke the crec_stager entry point directly. That will upload the data from yesterday to our good ol' test bukkit: `use-this-bucket-to-test-your-bullshit`. Check out the docstrings for more details.

I also deployed it to chartbeat's aws account, under the lambda job `lambda_test`, and set up a scheduled event trigger for once per day. Eventually we'll need to convert the parser and solr importer to work off of the staged files in S3 instead of local disk, but this is at least a starting point to make this ETL pipeline a little more robust.
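For that eventual conversion, pulling a day's staged files back out of S3 is mostly a prefix listing; a sketch under the same assumed key layout (function and parameter names are hypothetical):

```python
import boto3


def list_staged_keys(bucket, prefix, date):
    """List the staged CREC object keys for one day under <prefix>/YYYY/MM/DD/."""
    s3 = boto3.client('s3')
    day_prefix = '{0}/{1}/'.format(prefix, date.strftime('%Y/%m/%d'))
    keys = []
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket, Prefix=day_prefix):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys
```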
@rmangi @ackramer @jeiranj