
[Issue 2482] Migrate delivery metrics transform and load from simpler-grants-sandbox #2617

Merged: 68 commits into main on Nov 5, 2024

Conversation

@DavidDudas-Intuitial (Collaborator) commented Oct 29, 2024

Summary

Adds new CLI capabilities to (1) initialize the ETL database and (2) transform and load data into the ETL database

Fixes #2482

Time to review: 10 mins

Changes proposed

What was added, updated, or removed in this PR.

  • Creates new dataset etl_dataset that can be hydrated from JSON
  • Adds new entry point to CLI: poetry run analytics etl
  • Exposes new commands initialize_database and transform_and_load
  • Creates new subpackage integrations/etldb to encapsulate transform and load logic
  • Ports CREATE TABLE SQL from the sandbox repo, updated to be Postgres-friendly

TODO

  • DB integration - connect to Postgres
  • Finish initialize_database
  • Port insert/update/select SQL from the sandbox repo and update it to be Postgres-friendly
  • Finish transform_and_load
  • Fix linter issues
  • Write documentation
  • Write tests

Context for reviewers

Testing instructions, background context, more in-depth details of the implementation, and anything else you'd like to call out or ask reviewers. Explain how the changes were verified.

  1. To initialize ETL database: poetry run analytics etl initialize_database
  2. To transform and load into ETL database: poetry run analytics etl transform_and_load --deliverable-file ./data/test-etl-01.json --effective-date 2024-10-21

Additional information

Screenshots, GIF demos, code examples or output to help show the changes working as expected.

@DavidDudas-Intuitial self-assigned this on Oct 29, 2024
@DavidDudas-Intuitial marked this pull request as draft on October 29, 2024
@DavidDudas-Intuitial changed the title from "[Issue 2482] DRAFT: Migrate delivery metrics transform and load from simpler-grants-sandbox" to "[Issue 2482] Migrate delivery metrics transform and load from simpler-grants-sandbox" on Oct 29, 2024
@acouch (Collaborator) commented Oct 29, 2024

Like the approach so far, thanks for sharing as a draft.

@widal001 (Collaborator) left a comment

Looks good! Like the overall approach and you did a great job of folding your prototype into the existing codebase.

I left a few comments that are likely better tackled in future sprints, but the one immediate question we might want to tackle is:

Do we want to use schemas or table prefixes to indicate that all of the tables being created are specific to GitHub data?

@widal001 (Collaborator) commented Oct 29, 2024:

Dropping and recreating the tables works while we don't have any data in them, so perfect for right now! Long-term though, we should land on a more robust migration strategy.

That shouldn't block merging this in, but we should probably aim to tackle that in sprint 2.1 or 2.2.

Collaborator Author:

Great point. I removed the DROP statements from the SQL file; we don't need them in there, and I can drop tables manually during dev. See 92fd950 and 0230ff8


# create tables

CREATE TABLE deliverable (

Collaborator:

Do we want to use schemas?

Since we'll eventually have other data in the data warehouse besides GitHub data, it could be helpful to either prefix the table names with gh_ or to create and use a github schema to organize these tables.
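
For illustration, a minimal sketch of the two options under discussion, assuming SQLAlchemy with the psycopg driver; the connection URL and column definitions are placeholders, not code from this PR:

from sqlalchemy import create_engine, text

# Placeholder connection string; the real settings come from the app config.
engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")

with engine.begin() as conn:
    # Option 1: group GitHub tables under a dedicated schema.
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS github"))
    conn.execute(text("CREATE TABLE IF NOT EXISTS github.deliverable (id SERIAL PRIMARY KEY)"))

    # Option 2: keep tables in the default schema but prefix their names.
    conn.execute(text("CREATE TABLE IF NOT EXISTS gh_deliverable (id SERIAL PRIMARY KEY)"))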

@DavidDudas-Intuitial (Collaborator Author) commented Oct 29, 2024:

Good callout. I like the lofi solution of using gh_ prefix on the table names. See 92fd950

Comment on lines 50 to 57
for ghid in dataset.get_issue_ghids():
issue_df = dataset.get_issue(ghid)
epic_id = id_map[entity.EPIC].get(issue_df['epic_ghid'])
deliverable_id = id_map[entity.EPIC].get(issue_df['deliverable_ghid'])
sprint_id = id_map[entity.SPRINT].get(issue_df['sprint_ghid'])
quad_id = id_map[entity.QUAD].get(issue_df['quad_ghid'])
row_id = random.randint(100, 999) # TODO: get actual row id via insert or select
issue_map[ghid] = row_id

Collaborator:

Not something we need to address in this PR, but in a future sprint, it might be worth looking into the value of inserting these (and other records) using a bulk statement (i.e. executemany()) instead of inserting one row at a time.
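
For reference, a minimal sketch of the bulk-insert idea, assuming SQLAlchemy 2.x with the psycopg driver; the gh_issue table, its columns, and the connection URL are illustrative placeholders rather than the PR's actual code:

from sqlalchemy import create_engine, text

# Placeholder engine; real settings come from the app config.
engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")

rows = [
    {"ghid": "I_0001", "title": "First issue"},
    {"ghid": "I_0002", "title": "Second issue"},
]

with engine.begin() as conn:
    # Passing a list of parameter dicts makes SQLAlchemy use executemany-style
    # execution, inserting all rows in one call instead of one row at a time.
    conn.execute(
        text("INSERT INTO gh_issue (ghid, title) VALUES (:ghid, :title)"),
        rows,
    )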

Collaborator Author:

ack

Collaborator:

This looks solid! Nice job translating your prototype code to the existing repo.

One thought about where this code lives: I can see that it makes sense in integrations because it involves actually reading/writing data to the DB. That being said, I was envisioning analytics.integrations as being (relatively) metrics/dataset agnostic.

I was struggling with the same question with some of the post-extraction transformations for GitHub.

I'm wondering if it actually makes more sense to move some of this to a dedicated analytics.etl package (and doing the same for the GitHub transformations that currently reside in analytics.integrations.github.main) so that integrations can stay focused on functions and interfaces that can be reused across multiple datasets.

So for example we might expand the code in analytics.integrations.db to have an upsert() method that accepts a table name, a list of dicts and a match key, then handles the rest of the logic for actually inserting or updating each record passed. Or we could have an upsert_scd() to handle the SCD logic.

Again not something we have to tackle now, but worth thinking about as we continue to build this out.
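
As a rough sketch only, here is what a generic upsert() helper along those lines could look like, assuming SQLAlchemy and Postgres ON CONFLICT; the function and its signature are hypothetical, not part of this PR:

from sqlalchemy import text
from sqlalchemy.engine import Connection


def upsert(conn: Connection, table: str, rows: list[dict], match_key: str) -> None:
    """Insert each row, or update it if a row with the same match_key already exists.

    Hypothetical helper for discussion: assumes the table has a unique
    constraint on match_key, and that table/column names come from trusted
    code, since identifiers cannot be bound as parameters.
    """
    if not rows:
        return
    columns = list(rows[0].keys())
    col_list = ", ".join(columns)
    placeholders = ", ".join(f":{col}" for col in columns)
    updates = ", ".join(f"{col} = excluded.{col}" for col in columns if col != match_key)
    stmt = text(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({match_key}) DO UPDATE SET {updates}"
    )
    conn.execute(stmt, rows)  # executemany-style: one statement, many rows

An upsert_scd() variant could layer the slowly-changing-dimension logic on top of the same pattern.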

Collaborator Author:

Agreed on the less than perfect fit in the integrations directory. Let's discuss further in a future planning session.

Collaborator Author:

I changed it to analytics.integrations.etldb for the time being

Comment on lines 269 to 275
@etl_app.command(name="initialize_database")
def initialize_database() -> None:
""" Initialize delivery metrics database """
print("initializing database")
delivery_metrics_db.init_db()
print("WARNING: database was NOT initialized because db integration is WIP")
return

Collaborator:

We should probably talk to @coilysiren about the best way to trigger a command that should only be run once.

Long-term I'm also wondering if we maybe want to abstract this as a migration entry point to which we can pass a path to a SQL file that is version controlled or migration script that will allow us to continue to evolve the data warehouse schema without writing a new entry point for each migration.
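
As a sketch of that idea (hypothetical, not part of this PR), a migration entry point could accept a path to a version-controlled SQL file; this assumes the CLI is built with typer, which the @etl_app.command decorator pattern suggests, and uses a placeholder connection URL:

from pathlib import Path

import typer
from sqlalchemy import create_engine, text

migration_app = typer.Typer()


@migration_app.command(name="migrate")
def migrate(
    sql_file: Path = typer.Option(..., help="Path to a version-controlled SQL migration file"),
) -> None:
    """Apply a SQL migration file to the ETL database."""
    # Placeholder engine; real settings would come from the app config.
    engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")
    with engine.begin() as conn:
        # Naive split on ";" works for simple DDL files, but not for statements
        # that themselves contain semicolons (e.g. function bodies).
        for statement in sql_file.read_text().split(";"):
            if statement.strip():
                conn.execute(text(statement))
    print(f"applied migration: {sql_file}")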

Collaborator Author:

I think the "run once" point is now moot, as I've removed the DROP statements from the SQL and added IF NOT EXISTS clauses.

Collaborator Author:

The command can now be run multiple times with no adverse consequences.

Collaborator:

Awesome! Seems like a great interim solution


def connection(self) -> Connection:
"""Get a connection object from the db engine."""
return self._db_engine.connect()

@widal001 (Collaborator) commented Nov 1, 2024:

@DavidDudas-Intuitial when I try to run poetry run analytics etl transform_and_load with the full export, I get the following error:

[Screenshot: error output, 2024-11-01]

It probably requires further digging, but I have a hunch that it might be stemming from this line, where you create a new connection, since this is invoked for each record in a loop and I'm pretty sure we're exceeding the maximum number of concurrent connections.

I think a potential fix would be to use the top-level self._db_engine to either create a session or a connection that you pass in using dependency injection to be re-used throughout the loop, so that you're not spawning thousands of connections in the course of one run.
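
A minimal sketch of the suggested fix, assuming a SQLAlchemy Engine; sync_issue(), the gh_issue table, and its columns are hypothetical stand-ins for the PR's per-record logic:

from sqlalchemy import text
from sqlalchemy.engine import Connection, Engine


def sync_issue(conn: Connection, issue: dict) -> None:
    """Write one issue row using a connection passed in by the caller."""
    conn.execute(
        text("INSERT INTO gh_issue (ghid, title) VALUES (:ghid, :title)"),
        issue,
    )


def sync_all_issues(engine: Engine, issues: list[dict]) -> None:
    """Open one connection/transaction and reuse it for every row in the loop."""
    with engine.begin() as conn:  # a single connection for the whole run
        for issue in issues:
            sync_issue(conn, issue)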

Comment on lines +148 to +161
init-db:
@echo "=> Initializing the database schema"
@echo "====================================================="
$(POETRY) analytics etl initialize_database
@echo "====================================================="

gh-transform-and-load:
@echo "=> Transforming and loading GitHub data into the database"
@echo "====================================================="
$(POETRY) analytics etl transform_and_load \
--deliverable-file $(DELIVERY_FILE) \
--effective-date $(EFFECTIVE_DATE)
@echo "====================================================="

Collaborator:

I also added these because I wasn't able to trigger the command from the natively installed Python application: I don't have the psycopg-c binding installed on my computer (it's needed by psycopg but isn't distributed directly with the Python library).

Collaborator Author:

Nice add. I assumed this would be needed in the near future, but had not spent any time on it yet. Thanks for adding it!

@DavidDudas-Intuitial (Collaborator Author) commented Nov 1, 2024

@widal001 PTAL

widal001 previously approved these changes Nov 4, 2024
@widal001 (Collaborator) left a comment

LGTM! Thanks for fixing the timeout issue. I left a few other questions/suggestions, namely:

  • Consider adding a t_created column to help us debug when records are created vs. modified
  • Clarify the behavior of gh_issue_history, which seems to simply insert a new record for each effective date whether or not something has changed -- totally okay for now, but something we might want to reconsider in the long run.

Neither of those things is blocking, though, and I'd rather get something deployed today or tomorrow.

The same is true for testing -- we want to implement a minimum set of tests to pass the checks and prevent regressions if we change the load behavior, but we can break more robust testing into a follow-on ticket.



def sync_issues(db: EtlDb, dataset: EtlDataset, ghid_map: dict) -> dict:
"""Insert or update (if necessary) a row for each issue and return a map of row ids."""

Collaborator:

This most recent set of changes fixed the timeout beautifully!
[Screenshot: successful run, 2024-11-04]

Collaborator:

However, when I was testing the history table, I noticed something kind of strange -- after changing the point value of a given ticket, I saw the point value change in the gh_issue_history table but there was still only one row for the issue whose point value I changed:

[Screenshot: gh_issue_history table rows, 2024-11-04]

Am I misunderstanding how that table is supposed to work?

Collaborator:
The reason will be displayed to describe this comment to others. Learn more.

Never mind, see the following comment!

Comment on lines +95 to +101
"insert into gh_issue_history (issue_id, status, is_closed, points, d_effective) "
"values (:issue_id, :status, :is_closed, :points, :effective) "
"on conflict (issue_id, d_effective) "
"do update set (status, is_closed, points, t_modified) = "
"(:status, :is_closed, :points, current_timestamp) "
"returning id",
)

Collaborator:

Oh I see, so this table stores a new record for each effective date, and multiple inserts on the same day simply overwrite the previous instance with the same effective date.

Collaborator Author:

Yes, the granularity of updates is presumed to be daily

Comment on lines 58 to 60
d_effective DATE NOT NULL,
t_modified TIMESTAMP,
UNIQUE(issue_id, d_effective)

Collaborator:

One thought on this and other tables -- it might be helpful to have a t_created column as well for debugging purposes.

That can be scoped into a future ticket though!
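
For illustration, a minimal sketch of the t_created idea as a Postgres migration, assuming the gh_issue_history table from this PR; the column type, default, and connection URL are suggestions only:

from sqlalchemy import create_engine, text

# Placeholder engine; real settings come from the app config.
engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")

with engine.begin() as conn:
    # Record when each row was first inserted, alongside the existing t_modified.
    conn.execute(text(
        "ALTER TABLE gh_issue_history "
        "ADD COLUMN IF NOT EXISTS t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP"
    ))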

Collaborator Author:

Good suggestion, and done: 8946b80

@DavidDudas-Intuitial (Collaborator Author) commented:

@mdragon @coilysiren @acouch Can I please get your review? This is ready for merge, if you approve.

@coilysiren (Collaborator) left a comment

The parts I had the time to read look good to me. Please try to avoid submitting PRs > 500 lines though. They are very hard to review. The tremendous diff is probably how a previous PR ended up printing the database password, which is a violation of security policy.

@@ -22,7 +22,6 @@ def get_db() -> Engine:
A SQLAlchemy engine object representing the connection to the database.
"""
db = get_db_settings()
print(f"postgresql+psycopg://{db.user}:{db.password}@{db.db_host}:{db.port}")

Collaborator:

Yeah... please don't print out the password...

Collaborator Author:

@coilysiren I agree with you. This is not my code; it was already there when I started.

@@ -143,6 +144,20 @@ lint: ## runs code quality checks
# Data Commands #
#################

init-db:
@echo "=> Initializing the database schema"
@echo "====================================================="

Collaborator:

FYI this ends up looking very messy when you see it in the AWS Console

@DavidDudas-Intuitial merged commit b64f419 into main on Nov 5, 2024 (7 checks passed)
@DavidDudas-Intuitial deleted the issue-2482-migrate-delivery-metrics branch on November 5, 2024 at 20:46
Labels: analytics, ci/cd, documentation, python
Projects: None yet
Development: Successfully merging this pull request may close these issues: Migrate the transform and load step into /analytics
5 participants