
[Issue 2482] Migrate delivery metrics transform and load from simpler-grants-sandbox #2617

Merged: 68 commits into main on Nov 5, 2024

Conversation

@DavidDudas-Intuitial (Collaborator) commented Oct 29, 2024

Summary

Adds new CLI capabilities to (1) initialize the ETL database and (2) transform and load data into the ETL database

Fixes #2482

Time to review: 10 mins

Changes proposed

What was added, updated, or removed in this PR.

  • Creates new dataset etl_dataset that can be hydrated from JSON
  • Adds new entry point to CLI: poetry run analytics etl
  • Exposes new commands initialize_database and transform_and_load
  • Creates new subpackage integrations/etldb to encapsulate transform and load logic
  • Ports CREATE TABLE SQL from the sandbox repo, updated to be Postgres-friendly

TODO

  • DB integration - connect to Postgres
  • Finish initialize_database
  • Port insert/update/select SQL from the sandbox repo and update it to be Postgres-friendly
  • Finish transform_and_load
  • Fix linter issues
  • Write documentation
  • Write tests

Context for reviewers

Testing instructions, background context, more in-depth details of the implementation, and anything else you'd like to call out or ask reviewers. Explain how the changes were verified.

  1. To initialize ETL database: poetry run analytics etl initialize_database
  2. To transform and load into ETL database: poetry run analytics etl transform_and_load --deliverable-file ./data/test-etl-01.json --effective-date 2024-10-21

Additional information

Screenshots, GIF demos, code examples or output to help show the changes working as expected.

@DavidDudas-Intuitial self-assigned this on Oct 29, 2024
@DavidDudas-Intuitial marked this pull request as draft on October 29, 2024
@DavidDudas-Intuitial changed the title from "[Issue 2482] DRAFT: Migrate delivery metrics transform and load from simpler-grants-sandbox" to "[Issue 2482] Migrate delivery metrics transform and load from simpler-grants-sandbox" on Oct 29, 2024
@acouch (Collaborator) commented Oct 29, 2024

Like the approach so far, thanks for sharing as a draft.

@widal001 (Collaborator) left a comment

Looks good! Like the overall approach and you did a great job of folding your prototype into the existing codebase.

I left a few comments that are likely better tackled in future sprints, but the one immediate question we might want to tackle is:

Do we want to use schemas or table prefixes to indicate that all of the tables being created are specific to GitHub data?

@widal001 (Collaborator) commented Oct 29, 2024:

Dropping and recreating the tables works while we don't have any data in them, so perfect for right now! Long-term though, we should land on a more robust migration strategy.

That shouldn't block merging this in, but we should probably aim to tackle that in sprint 2.1 or 2.2.

Collaborator Author:

Great point. I removed the DROP statements from the SQL file; we don't need them in there, and I can drop tables manually during dev. See 92fd950 and 0230ff8


# create tables

CREATE TABLE deliverable (

Collaborator:

Do we want to use schemas?

Since we'll eventually have other data in the data warehouse besides GitHub data, it could be helpful to either prefix the table names with gh_ or to create and use a github schema to organize these tables.
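
For illustration, a minimal sketch of the two options under discussion, assuming SQLAlchemy with the psycopg driver; the connection URL and column definitions are placeholders, not code from this PR:

from sqlalchemy import create_engine, text

# Placeholder connection string; the real settings come from the app config.
engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")

with engine.begin() as conn:
    # Option 1: group GitHub tables under a dedicated schema.
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS github"))
    conn.execute(text("CREATE TABLE IF NOT EXISTS github.deliverable (id SERIAL PRIMARY KEY)"))

    # Option 2: keep tables in the default schema but prefix their names.
    conn.execute(text("CREATE TABLE IF NOT EXISTS gh_deliverable (id SERIAL PRIMARY KEY)"))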

@DavidDudas-Intuitial (Collaborator Author) commented Oct 29, 2024:

Good callout. I like the lofi solution of using gh_ prefix on the table names. See 92fd950

Comment on lines 50 to 57
for ghid in dataset.get_issue_ghids():
issue_df = dataset.get_issue(ghid)
epic_id = id_map[entity.EPIC].get(issue_df['epic_ghid'])
deliverable_id = id_map[entity.EPIC].get(issue_df['deliverable_ghid'])
sprint_id = id_map[entity.SPRINT].get(issue_df['sprint_ghid'])
quad_id = id_map[entity.QUAD].get(issue_df['quad_ghid'])
row_id = random.randint(100, 999) # TODO: get actual row id via insert or select
issue_map[ghid] = row_id

Collaborator:

Not something we need to address in this PR, but in a future sprint, it might be worth looking into the value of inserting these (and other records) using a bulk statement (i.e. executemany()) instead of inserting one row at a time.
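
For reference, a minimal sketch of the bulk-insert idea, assuming SQLAlchemy 2.x with the psycopg driver; the gh_issue table, its columns, and the connection URL are illustrative placeholders rather than the PR's actual code:

from sqlalchemy import create_engine, text

# Placeholder engine; real settings come from the app config.
engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")

rows = [
    {"ghid": "I_0001", "title": "First issue"},
    {"ghid": "I_0002", "title": "Second issue"},
]

with engine.begin() as conn:
    # Passing a list of parameter dicts makes SQLAlchemy use executemany-style
    # execution, inserting all rows in one call instead of one row at a time.
    conn.execute(
        text("INSERT INTO gh_issue (ghid, title) VALUES (:ghid, :title)"),
        rows,
    )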

Collaborator Author:

ack

Collaborator:

This looks solid! Nice job translating your prototype code to the existing repo.

One thought about where this code lives: I can see that it makes sense in integrations because it involves actually reading/writing data to the DB. That being said, I was envisioning analytics.integrations as being (relatively) metrics/dataset agnostic.

I was struggling with the same question with some of the post-extraction transformations for GitHub.

I'm wondering if it actually makes more sense to move some of this to a dedicated analytics.etl package (and doing the same for the GitHub transformations that currently reside in analytics.integrations.github.main) so that integrations can stay focused on functions and interfaces that can be reused across multiple datasets.

So for example we might expand the code in analytics.integrations.db to have an upsert() method that accepts a table name, a list of dicts and a match key, then handles the rest of the logic for actually inserting or updating each record passed. Or we could have an upsert_scd() to handle the SCD logic.

Again not something we have to tackle now, but worth thinking about as we continue to build this out.
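
As a rough sketch only, here is what a generic upsert() helper along those lines could look like, assuming SQLAlchemy and Postgres ON CONFLICT; the function and its signature are hypothetical, not part of this PR:

from sqlalchemy import text
from sqlalchemy.engine import Connection


def upsert(conn: Connection, table: str, rows: list[dict], match_key: str) -> None:
    """Insert each row, or update it if a row with the same match_key already exists.

    Hypothetical helper for discussion: assumes the table has a unique
    constraint on match_key, and that table/column names come from trusted
    code, since identifiers cannot be bound as parameters.
    """
    if not rows:
        return
    columns = list(rows[0].keys())
    col_list = ", ".join(columns)
    placeholders = ", ".join(f":{col}" for col in columns)
    updates = ", ".join(f"{col} = excluded.{col}" for col in columns if col != match_key)
    stmt = text(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({match_key}) DO UPDATE SET {updates}"
    )
    conn.execute(stmt, rows)  # executemany-style: one statement, many rows

An upsert_scd() variant could layer the slowly-changing-dimension logic on top of the same pattern.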

Collaborator Author:

Agreed on the less than perfect fit in the integrations directory. Let's discuss further in a future planning session.

Collaborator Author:

I changed it to analytics.integrations.etldb for the time being

Comment on lines 269 to 275
@etl_app.command(name="initialize_database")
def initialize_database() -> None:
""" Initialize delivery metrics database """
print("initializing database")
delivery_metrics_db.init_db()
print("WARNING: database was NOT initialized because db integration is WIP")
return

Collaborator:

We should probably talk to @coilysiren about the best way to trigger a command that should only be run once.

Long-term I'm also wondering if we maybe want to abstract this as a migration entry point to which we can pass a path to a SQL file that is version controlled or migration script that will allow us to continue to evolve the data warehouse schema without writing a new entry point for each migration.
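
As a sketch of that idea (hypothetical, not part of this PR), a migration entry point could accept a path to a version-controlled SQL file; this assumes the CLI is built with typer, which the @etl_app.command decorator pattern suggests, and uses a placeholder connection URL:

from pathlib import Path

import typer
from sqlalchemy import create_engine, text

migration_app = typer.Typer()


@migration_app.command(name="migrate")
def migrate(
    sql_file: Path = typer.Option(..., help="Path to a version-controlled SQL migration file"),
) -> None:
    """Apply a SQL migration file to the ETL database."""
    # Placeholder engine; real settings would come from the app config.
    engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")
    with engine.begin() as conn:
        # Naive split on ";" works for simple DDL files, but not for statements
        # that themselves contain semicolons (e.g. function bodies).
        for statement in sql_file.read_text().split(";"):
            if statement.strip():
                conn.execute(text(statement))
    print(f"applied migration: {sql_file}")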

Collaborator Author:

I think the "run once" point is now moot, as I've removed the DROP statements from the SQL and added IF NOT EXISTS clauses.

Collaborator Author:

The command can now be run multiple times with no adverse consequences.

Collaborator:

Awesome! Seems like a great interim solution


def connection(self) -> Connection:
"""Get a connection object from the db engine."""
return self._db_engine.connect()

@widal001 (Collaborator) commented Nov 1, 2024:

@DavidDudas-Intuitial when I try to run poetry run analytics etl transform_and_load with the full export, I get the following error:

[Screenshot: error output, 2024-11-01]

It probably requires further digging, but I have a hunch that it might be stemming from this line, where you create a new connection, since this is invoked for each record in a loop and I'm pretty sure we're exceeding the maximum number of concurrent connections.

I think a potential fix would be to use the top-level self._db_engine to either create a session or a connection that you pass in using dependency injection to be re-used throughout the loop, so that you're not spawning thousands of connections in the course of one run.
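
A minimal sketch of the suggested fix, assuming a SQLAlchemy Engine; sync_issue(), the gh_issue table, and its columns are hypothetical stand-ins for the PR's per-record logic:

from sqlalchemy import text
from sqlalchemy.engine import Connection, Engine


def sync_issue(conn: Connection, issue: dict) -> None:
    """Write one issue row using a connection passed in by the caller."""
    conn.execute(
        text("INSERT INTO gh_issue (ghid, title) VALUES (:ghid, :title)"),
        issue,
    )


def sync_all_issues(engine: Engine, issues: list[dict]) -> None:
    """Open one connection/transaction and reuse it for every row in the loop."""
    with engine.begin() as conn:  # a single connection for the whole run
        for issue in issues:
            sync_issue(conn, issue)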

Comment on lines +148 to +161
init-db:
@echo "=> Initializing the database schema"
@echo "====================================================="
$(POETRY) analytics etl initialize_database
@echo "====================================================="

gh-transform-and-load:
@echo "=> Transforming and loading GitHub data into the database"
@echo "====================================================="
$(POETRY) analytics etl transform_and_load \
--deliverable-file $(DELIVERY_FILE) \
--effective-date $(EFFECTIVE_DATE)
@echo "====================================================="

Collaborator:

I also added these because I wasn't able to trigger the command from the natively installed Python application: I don't have the psycopg-c binding installed on my computer (it's needed by psycopg but isn't distributed directly with the Python library).

Collaborator Author:

Nice add. I assumed this would be needed in the near future, but had not spent any time on it yet. Thanks for adding it!

@DavidDudas-Intuitial (Collaborator Author) commented Nov 1, 2024

@widal001 PTAL

widal001 previously approved these changes Nov 4, 2024
@widal001 (Collaborator) left a comment

LGTM! Thanks for fixing the timeout issue. I left a few other questions/suggestions, namely:

  • Consider adding a t_created column to help us debug when records are created vs. modified
  • Clarify the behavior of gh_issue_history, which seems to simply insert a new record for each effective date whether or not something has changed -- totally okay for now, but something we might want to reconsider in the long run.

Neither of those things is blocking, though, and I'd rather get something deployed today or tomorrow.

The same is true for testing -- we want to implement a minimum set of tests to pass the checks and prevent regressions if we change the load behavior, but we can break more robust testing into a follow-on ticket.



def sync_issues(db: EtlDb, dataset: EtlDataset, ghid_map: dict) -> dict:
"""Insert or update (if necessary) a row for each issue and return a map of row ids."""

Collaborator:

This most recent set of changes fixed the timeout beautifully!
[Screenshot: successful run, 2024-11-04]

Collaborator:

However, when I was testing the history table, I noticed something kind of strange -- after changing the point value of a given ticket, I saw the point value change in the gh_issue_history table but there was still only one row for the issue whose point value I changed:

[Screenshot: gh_issue_history table rows, 2024-11-04]

Am I misunderstanding how that table is supposed to work?

Collaborator:
The reason will be displayed to describe this comment to others. Learn more.

Never mind, see the following comment!

Comment on lines +95 to +101
"insert into gh_issue_history (issue_id, status, is_closed, points, d_effective) "
"values (:issue_id, :status, :is_closed, :points, :effective) "
"on conflict (issue_id, d_effective) "
"do update set (status, is_closed, points, t_modified) = "
"(:status, :is_closed, :points, current_timestamp) "
"returning id",
)

Collaborator:

Oh I see, so this table stores a new record for each effective date, and multiple inserts on the same day simply overwrite the previous instance with the same effective date.

Collaborator Author:

Yes, the granularity of updates is presumed to be daily

Comment on lines 58 to 60
d_effective DATE NOT NULL,
t_modified TIMESTAMP,
UNIQUE(issue_id, d_effective)

Collaborator:

One thought on this and other tables -- it might be helpful to have a t_created column as well for debugging purposes.

That can be scoped into a future ticket though!
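
For illustration, a minimal sketch of the t_created idea as a Postgres migration, assuming the gh_issue_history table from this PR; the column type, default, and connection URL are suggestions only:

from sqlalchemy import create_engine, text

# Placeholder engine; real settings come from the app config.
engine = create_engine("postgresql+psycopg://user:password@localhost:5432/analytics")

with engine.begin() as conn:
    # Record when each row was first inserted, alongside the existing t_modified.
    conn.execute(text(
        "ALTER TABLE gh_issue_history "
        "ADD COLUMN IF NOT EXISTS t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP"
    ))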

Collaborator Author:

Good suggestion, and done: 8946b80

@DavidDudas-Intuitial (Collaborator Author) commented:

@mdragon @coilysiren @acouch Can I please get your review? This is ready for merge, if you approve.

@coilysiren (Collaborator) left a comment

The parts I had the time to read look good to me. Please try to avoid submitting PRs > 500 lines though. They are very hard to review. The tremendous diff is probably how a previous PR ended up printing the database password, which is a violation of security policy.

@@ -22,7 +22,6 @@ def get_db() -> Engine:
A SQLAlchemy engine object representing the connection to the database.
"""
db = get_db_settings()
print(f"postgresql+psycopg://{db.user}:{db.password}@{db.db_host}:{db.port}")

Collaborator:

Yeah... please don't print out the password...

Collaborator Author:

@coilysiren I agree with you. This is not my code; it was already there when I started.

@@ -143,6 +144,20 @@ lint: ## runs code quality checks
# Data Commands #
#################

init-db:
@echo "=> Initializing the database schema"
@echo "====================================================="

Collaborator:

FYI this ends up looking very messy when you see it in the AWS Console

@DavidDudas-Intuitial merged commit b64f419 into main on Nov 5, 2024 (7 checks passed)
@DavidDudas-Intuitial deleted the issue-2482-migrate-delivery-metrics branch on November 5, 2024 at 20:46
Labels: analytics, ci/cd, documentation, python
Projects: None yet
Development: Successfully merging this pull request may close these issues: Migrate the transform and load step into /analytics
5 participants