[Issue 2482] Migrate delivery metrics transform and load from simpler-grants-sandbox #2617
Conversation
Like the approach so far, thanks for sharing as a draft.
Looks good! Like the overall approach and you did a great job of folding your prototype into the existing codebase.
I left a few comments that are likely better tackled in future sprints, but the one immediate question we might want to tackle is:
Do we want to use schemas or table prefixes to indicate that all of the tables being created are specific to GitHub data?
Dropping and recreating the tables works while we don't have any data in them, so perfect for right now! Long-term though, we should land on a more robust migration strategy.
That shouldn't block merging this in, but we should probably aim to tackle that in sprint 2.1 or 2.2.
# create tables

CREATE TABLE deliverable (
Do we want to use schemas?
Since we'll eventually have other data in the data warehouse besides GitHub data, it could be helpful to either prefix the table names with gh_ or to create and use a github schema to organize these tables.
Good callout. I like the lo-fi solution of using a gh_ prefix on the table names. See 92fd950
for ghid in dataset.get_issue_ghids():
    issue_df = dataset.get_issue(ghid)
    epic_id = id_map[entity.EPIC].get(issue_df['epic_ghid'])
    deliverable_id = id_map[entity.EPIC].get(issue_df['deliverable_ghid'])
    sprint_id = id_map[entity.SPRINT].get(issue_df['sprint_ghid'])
    quad_id = id_map[entity.QUAD].get(issue_df['quad_ghid'])
    row_id = random.randint(100, 999)  # TODO: get actual row id via insert or select
    issue_map[ghid] = row_id
Not something we need to address in this PR, but in a future sprint, it might be worth looking into the value of inserting these (and other records) using a bulk statement (i.e. executemany()) instead of inserting one row at a time.
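For illustration, a minimal sketch of what an executemany-style batch insert could look like with SQLAlchemy; the gh_issue table, its columns, and the connection URL here are placeholder assumptions, not the actual schema:

```python
# Sketch only: passing a list of parameter dicts to Connection.execute()
# makes SQLAlchemy run an executemany-style batch instead of one INSERT
# round trip per row. Table and column names are illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg://localhost/analytics")  # placeholder URL

rows = [
    {"ghid": "I_0001", "title": "First issue", "points": 2},
    {"ghid": "I_0002", "title": "Second issue", "points": 5},
]

with engine.connect() as conn:
    conn.execute(
        text("insert into gh_issue (ghid, title, points) values (:ghid, :title, :points)"),
        rows,  # a list of dicts is sent as a single batched call
    )
    conn.commit()
```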
ack
This looks solid! Nice job translating your prototype code to the existing repo.
One thought about where this code lives: I can see that it makes sense in integrations because it involves actually reading/writing data to the DB. That being said, I was envisioning analytics.integrations as being (relatively) metrics/dataset agnostic.
I was struggling with the same question with some of the post-extraction transformations for GitHub.
I'm wondering if it actually makes more sense to move some of this to a dedicated analytics.etl package (and do the same for the GitHub transformations that currently reside in analytics.integrations.github.main) so that integrations can stay focused on functions and interfaces that can be reused across multiple datasets.
So for example, we might expand the code in analytics.integrations.db to have an upsert() method that accepts a table name, a list of dicts, and a match key, then handles the rest of the logic for actually inserting or updating each record passed. Or we could have an upsert_scd() to handle the SCD logic.
Again not something we have to tackle now, but worth thinking about as we continue to build this out.
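A rough sketch of what that generic upsert() helper could look like, assuming Postgres ON CONFLICT semantics; the function body and identifier handling below are illustrative, not part of this PR:

```python
# Hypothetical sketch of a dataset-agnostic upsert() for analytics.integrations.db.
from sqlalchemy import text
from sqlalchemy.engine import Connection

def upsert(conn: Connection, table: str, rows: list[dict], match_key: str) -> None:
    """Insert each row, or update the existing row that matches on match_key."""
    if not rows:
        return
    columns = list(rows[0].keys())
    col_list = ", ".join(columns)
    placeholders = ", ".join(f":{col}" for col in columns)
    updates = ", ".join(f"{col} = excluded.{col}" for col in columns if col != match_key)
    # Table and column names are assumed to come from trusted, hard-coded callers.
    conflict_action = f"do update set {updates}" if updates else "do nothing"
    statement = text(
        f"insert into {table} ({col_list}) values ({placeholders}) "
        f"on conflict ({match_key}) {conflict_action}"
    )
    conn.execute(statement, rows)  # a list of dicts executes as one batch
```

A caller could then do something like upsert(conn, "gh_epic", epic_rows, match_key="ghid") and let the helper decide between insert and update.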
Agreed on the less-than-perfect fit in the integrations directory. Let's discuss further in a future planning session.
I changed it to analytics.integrations.etldb for the time being.
analytics/src/analytics/cli.py (Outdated)
@etl_app.command(name="initialize_database")
def initialize_database() -> None:
    """Initialize delivery metrics database."""
    print("initializing database")
    delivery_metrics_db.init_db()
    print("WARNING: database was NOT initialized because db integration is WIP")
    return
We should probably talk to @coilysiren about the best way to trigger a command that should only be run once.
Long-term, I'm also wondering if we maybe want to abstract this as a migration entry point: we could pass it the path to a version-controlled SQL file or migration script, which would let us continue to evolve the data warehouse schema without writing a new entry point for each migration.
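Something along these lines, purely as a hypothetical sketch; the migrate command name, the option wiring, and the naive statement splitting are all assumptions, and a real implementation would probably lean on a proper migration tool:

```python
# Hypothetical "migration" entry point: apply a version-controlled SQL file
# instead of hard-coding DDL in a new command for every schema change.
from pathlib import Path

import typer
from sqlalchemy import create_engine, text

etl_app = typer.Typer()  # in the real codebase this already exists in cli.py

@etl_app.command(name="migrate")
def migrate(
    sql_file: Path = typer.Argument(..., help="Path to a version-controlled SQL migration file"),
    database_url: str = typer.Option("postgresql+psycopg://localhost/analytics"),  # placeholder
) -> None:
    """Apply a SQL migration file to the ETL database."""
    engine = create_engine(database_url)
    with engine.connect() as conn:
        # Naive split on ";"; assumes statements contain no embedded semicolons.
        for statement in sql_file.read_text().split(";"):
            if statement.strip():
                conn.execute(text(statement))
        conn.commit()
```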
I think the "run once" point is now moot, as I've removed the DROP statements from the SQL and added IF NOT EXISTS clauses.
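In other words, initialization now follows an idempotent pattern roughly like the sketch below; the table and column names are illustrative, not the exact DDL in this PR:

```python
# Sketch of idempotent initialization: no DROP statements, and IF NOT EXISTS
# makes re-runs a no-op for tables that already exist.
from sqlalchemy import create_engine, text

def init_db(database_url: str = "postgresql+psycopg://localhost/analytics") -> None:
    """Create ETL tables if they do not already exist; safe to run repeatedly."""
    engine = create_engine(database_url)
    with engine.connect() as conn:
        conn.execute(
            text(
                "CREATE TABLE IF NOT EXISTS gh_deliverable ("
                " id SERIAL PRIMARY KEY,"
                " ghid TEXT UNIQUE NOT NULL,"
                " title TEXT,"
                " t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,"
                " t_modified TIMESTAMP"
                ")"
            )
        )
        conn.commit()
```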
The command can now be run multiple times with no adverse consequences.
Awesome! Seems like a great interim solution
def connection(self) -> Connection:
    """Get a connection object from the db engine."""
    return self._db_engine.connect()
@DavidDudas-Intuitial when I try to run poetry run transform_and_load with the full export, I get the following error:

[screenshot of the error omitted]
It probably requires further digging, but I have a hunch that it might be stemming from this line, where you create a new connection, since this is invoked for each record in a loop and I'm pretty sure we're exceeding the maximum number of concurrent connections.
I think a potential fix would be to use the top-level self._db_engine to create a session or connection that you pass in using dependency injection and re-use throughout the loop, so that you're not spawning thousands of connections in the course of one run.
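Roughly, the suggested shape is sketched below; transform_and_load and sync_issue here are illustrative stand-ins for the functions in this PR, not their actual signatures:

```python
# Sketch of the suggested fix: open one connection from the top-level engine
# and inject it into the per-record helpers, instead of calling connect()
# inside the loop and exhausting the server's connection limit.
from sqlalchemy.engine import Connection, Engine

def transform_and_load(db_engine: Engine, dataset, ghid_map: dict) -> None:
    """Re-use a single connection (and transaction) for the whole run."""
    with db_engine.connect() as connection:  # one connection for the entire run
        for ghid in dataset.get_issue_ghids():
            sync_issue(connection, dataset.get_issue(ghid), ghid_map)
        connection.commit()  # commit once after the loop

def sync_issue(connection: Connection, issue_df, ghid_map: dict) -> None:
    """Insert or update a single issue using the injected connection."""
    # ...execute insert/upsert statements against `connection` here...
```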
init-db:
	@echo "=> Initializing the database schema"
	@echo "====================================================="
	$(POETRY) analytics etl initialize_database
	@echo "====================================================="

gh-transform-and-load:
	@echo "=> Transforming and loading GitHub data into the database"
	@echo "====================================================="
	$(POETRY) analytics etl transform_and_load \
		--deliverable-file $(DELIVERY_FILE) \
		--effective-date $(EFFECTIVE_DATE)
	@echo "====================================================="
I also added these because I wasn't able to trigger the command from the natively installed Python application -- I don't have the psycopg_c binding installed on my computer (which is needed by psycopg, but doesn't get distributed directly with the Python library).
Nice add. I assumed this would be needed in the near future, but had not spent any time on it yet. Thanks for adding it!
@widal001 PTAL
LGTM! Thanks for fixing the timeout issue. I left a few other questions/suggestions, namely:
- Considering adding a t_created column to help us debug when records are created vs modified
- Clarifying the behavior of gh_issue_history, which seems to simply insert a new record for each effective date whether or not something has changed -- totally okay for now, but something we might want to reconsider in the long run.
Neither of those things are blocking though, and I'd rather get something deployed today or tomorrow.
The same is true for testing -- we want to implement a minimum set of tests to pass the checks and prevent regressions if we change the load behavior, but we can break more robust testing into a follow-on ticket.
def sync_issues(db: EtlDb, dataset: EtlDataset, ghid_map: dict) -> dict:
    """Insert or update (if necessary) a row for each issue and return a map of row ids."""
However, when I was testing the history table, I noticed something kind of strange -- after changing the point value of a given ticket, I saw the point value change in the gh_issue_history table, but there was still only one row for the issue whose point value I changed:

[screenshot: query results showing a single gh_issue_history row for that issue]
Am I misunderstanding how that table is supposed to work?
Never mind, see the following comment!
"insert into gh_issue_history (issue_id, status, is_closed, points, d_effective) " | ||
"values (:issue_id, :status, :is_closed, :points, :effective) " | ||
"on conflict (issue_id, d_effective) " | ||
"do update set (status, is_closed, points, t_modified) = " | ||
"(:status, :is_closed, :points, current_timestamp) " | ||
"returning id", | ||
) |
Oh I see, so this table stores a new record for each effective date, and multiple inserts on the same day simply overwrite the previous instance with the same effective date.
Yes, the granularity of updates is presumed to be daily.
d_effective DATE NOT NULL,
t_modified TIMESTAMP,
UNIQUE(issue_id, d_effective)
One thought on this and other tables -- it might be helpful to have a t_created column as well for debugging purposes.
That can be scoped into a future ticket though!
Good suggestion, and done: 8946b80
@mdragon @coilysiren @acouch Can I please get your review? This is ready for merge, if you approve.
The parts I had the time to read look good to me. Please try to avoid submitting PRs > 500 lines though. They are very hard to review. The tremendous diff is probably how a previous PR ended up printing the database password, which is a violation of security policy.
@@ -22,7 +22,6 @@ def get_db() -> Engine:
    A SQLAlchemy engine object representing the connection to the database.
    """
    db = get_db_settings()
    print(f"postgresql+psycopg://{db.user}:{db.password}@{db.db_host}:{db.port}")
Yeah... please don't print out the password...
@coilysiren I agree with you. This is not my code; it was already there when I started.
@@ -143,6 +144,20 @@ lint: ## runs code quality checks
# Data Commands #
#################

init-db:
	@echo "=> Initializing the database schema"
	@echo "====================================================="
FYI this ends up looking very messy when you see it in the AWS Console
Summary
Adds new CLI capabilities to (1) initialize the ETL database and (2) transform and load GitHub data into the ETL database
Fixes #2482
Time to review: 10 mins
Changes proposed
- New etl_dataset that can be hydrated from json
- New CLI commands poetry run analytics etl initialize_database and transform_and_load
- New integrations/etldb to encapsulate transform and load logic
- create table sql from the sandbox repo, updated to be Postgres-friendly

TODO
- initialize_database
- transform_and_load
Context for reviewers
poetry run analytics etl initialize_database
poetry run analytics etl transform_and_load --deliverable-file ./data/test-etl-01.json --effective-date 2024-10-21
Additional information