[Issue 2482] Migrate delivery metrics transform and load from simpler-grants-sandbox #2617

Merged: 68 commits from issue-2482-migrate-delivery-metrics into main, Nov 5, 2024. Changes shown from 49 commits.

Commits:
76aea7a  copy files from simpler-grants-sandbox (DavidDudas-Intuitial, Oct 17, 2024)
ebdb816  updated readme (DavidDudas-Intuitial, Oct 18, 2024)
320df01  update readme (DavidDudas-Intuitial, Oct 18, 2024)
5a91616  cherry pick changes from https://github.com/agilesix/simpler-grants-s… (DavidDudas-Intuitial, Oct 18, 2024)
0f983f7  stub out etl command (DavidDudas-Intuitial, Oct 28, 2024)
b837ddf  created dataset for delivery metrics (DavidDudas-Intuitial, Oct 28, 2024)
10166a9  remove unnecessary files (DavidDudas-Intuitial, Oct 28, 2024)
d5a9d12  remove unnecessary files (DavidDudas-Intuitial, Oct 28, 2024)
dbadc9f  protect against null ghid (DavidDudas-Intuitial, Oct 28, 2024)
eb3a3a9  get effective date from command line (DavidDudas-Intuitial, Oct 29, 2024)
b13c7dc  updated comments (DavidDudas-Intuitial, Oct 29, 2024)
38aa7ca  created abstraction in integrations dir to encapsulate db logic (DavidDudas-Intuitial, Oct 29, 2024)
505f1cc  corrected comment (DavidDudas-Intuitial, Oct 29, 2024)
7d8300a  added create table sql file (DavidDudas-Intuitial, Oct 29, 2024)
949882f  created cli entry point to init db (DavidDudas-Intuitial, Oct 29, 2024)
710e629  stubbed out command to init db (DavidDudas-Intuitial, Oct 29, 2024)
92fd950  add gh_ prefix to table names; remove DROP statements from sql (DavidDudas-Intuitial, Oct 29, 2024)
0230ff8  add 'if not exists' clause to create statements' (DavidDudas-Intuitial, Oct 29, 2024)
44433c3  renamed classes and paths (DavidDudas-Intuitial, Oct 30, 2024)
e8c20ae  update sql to use postgres syntax (DavidDudas-Intuitial, Oct 30, 2024)
5a9ec75  added class to encapsulate db connection (DavidDudas-Intuitial, Oct 30, 2024)
82c51f5  finish implementaiton of init_db (DavidDudas-Intuitial, Oct 30, 2024)
cb2fc7e  update output message (DavidDudas-Intuitial, Oct 30, 2024)
a733e1a  port model classes from sandbox to represent etldb entities (DavidDudas-Intuitial, Oct 30, 2024)
f2bc5fd  port quad insert/update from sandbox (DavidDudas-Intuitial, Oct 30, 2024)
9d9ca31  finished implementing insert/select/update for each eltdb entity (DavidDudas-Intuitial, Oct 31, 2024)
12aaa95  fixed lint errors (DavidDudas-Intuitial, Oct 31, 2024)
6de57d1  fixed more linter errors (DavidDudas-Intuitial, Oct 31, 2024)
1257ff6  added docstrings (DavidDudas-Intuitial, Oct 31, 2024)
f5fae44  fixed more lint issues (DavidDudas-Intuitial, Oct 31, 2024)
d227c90  Merge branch 'main' into issue-2482-migrate-delivery-metrics (DavidDudas-Intuitial, Oct 31, 2024)
72b4a06  fixed verbose output (DavidDudas-Intuitial, Oct 31, 2024)
268a44b  minor code cleanup (DavidDudas-Intuitial, Oct 31, 2024)
15efd80  Update usage.md (DavidDudas-Intuitial, Oct 31, 2024)
195593a  remove blank lines (DavidDudas-Intuitial, Oct 31, 2024)
81355e8  Merge branch 'issue-2482-migrate-delivery-metrics' of github.com:HHS/… (DavidDudas-Intuitial, Oct 31, 2024)
8f9de75  change constant name (DavidDudas-Intuitial, Oct 31, 2024)
39bf886  ran black to format code (DavidDudas-Intuitial, Oct 31, 2024)
244c187  Merge branch 'main' into issue-2482-migrate-delivery-metrics (DavidDudas-Intuitial, Oct 31, 2024)
5d0097e  fixed formatting to pass CI checks (DavidDudas-Intuitial, Oct 31, 2024)
122ce66  fixed more formatting issues to pass ci checks (DavidDudas-Intuitial, Oct 31, 2024)
93edd79  fixed another formatting issues to pass ci checks (DavidDudas-Intuitial, Oct 31, 2024)
a6248c3  simplified string transformation (DavidDudas-Intuitial, Oct 31, 2024)
9928d06  fixed type hint issues (DavidDudas-Intuitial, Nov 1, 2024)
8b2bafb  more formatting (DavidDudas-Intuitial, Nov 1, 2024)
75f902a  created select method for each model class; updated format of execute… (DavidDudas-Intuitial, Nov 1, 2024)
62d45cd  formatting (DavidDudas-Intuitial, Nov 1, 2024)
1b5e74a  feat: Adds Makefile targets for init-db and gh-etl (widal001, Nov 1, 2024)
823aeb9  refactor: Removes statement printing db connection string (widal001, Nov 1, 2024)
186941d  added db dependency injection and connection reuse (DavidDudas-Intuitial, Nov 1, 2024)
b068e80  add type hint for dbh params (DavidDudas-Intuitial, Nov 1, 2024)
3726af9  formatting (DavidDudas-Intuitial, Nov 4, 2024)
36c3f22  unit tests for EtlDataset (DavidDudas-Intuitial, Nov 5, 2024)
190d1cd  add cli tests (DavidDudas-Intuitial, Nov 5, 2024)
7ce883b  Merge branch 'main' into issue-2482-migrate-delivery-metrics (DavidDudas-Intuitial, Nov 5, 2024)
4b9f590  formatted tests (DavidDudas-Intuitial, Nov 5, 2024)
a960540  add missing import (DavidDudas-Intuitial, Nov 5, 2024)
9be9504  formatting (DavidDudas-Intuitial, Nov 5, 2024)
8946b80  add t_created field to each table that does not already have it (DavidDudas-Intuitial, Nov 5, 2024)
69fcf63  fixed path issue (DavidDudas-Intuitial, Nov 5, 2024)
d7bd8fa  fixed path issue (DavidDudas-Intuitial, Nov 5, 2024)
3b3e516  formatting (DavidDudas-Intuitial, Nov 5, 2024)
69164c9  attempt to fix path problem (DavidDudas-Intuitial, Nov 5, 2024)
89e67ff  move json file to tests directory so CI can find it (DavidDudas-Intuitial, Nov 5, 2024)
2ce2217  Merge branch 'main' into issue-2482-migrate-delivery-metrics (DavidDudas-Intuitial, Nov 5, 2024)
487aaca  remove unused import (DavidDudas-Intuitial, Nov 5, 2024)
e71cff1  formatting (DavidDudas-Intuitial, Nov 5, 2024)
afced78  restored load_json_data_as_df in utils (DavidDudas-Intuitial, Nov 5, 2024)
19 changes: 18 additions & 1 deletion analytics/Makefile
@@ -13,12 +13,13 @@ ISSUE_FILE ?= $(OUTPUT_DIR)/issue-data.json
DELIVERY_FILE ?= $(OUTPUT_DIR)/delivery-data.json
SPRINT ?= @current
# Names of the points and sprint fields in the GitHub project
-POINTS_FIELD ?= Points
+POINTS_FIELD ?= Story Points
SPRINT_FIELD ?= Sprint
UNIT ?= points
ACTION ?= show-results
MIN_TEST_COVERAGE ?= 80
APP_NAME ?= grants-analytics
EFFECTIVE_DATE ?= $(shell date +"%Y-%m-%d")

# Required for CI to work properly
SHELL = /bin/bash -o pipefail
@@ -144,6 +145,20 @@ lint: ## runs code quality checks
# Data Commands #
#################

init-db:
	@echo "=> Initializing the database schema"
	@echo "====================================================="
	$(POETRY) analytics etl initialize_database
	@echo "====================================================="

Review comment (Collaborator), on the banner echoes: FYI this ends up looking very messy when you see it in the AWS Console.

gh-transform-and-load:
	@echo "=> Transforming and loading GitHub data into the database"
	@echo "====================================================="
	$(POETRY) analytics etl transform_and_load \
		--deliverable-file $(DELIVERY_FILE) \
		--effective-date $(EFFECTIVE_DATE)
	@echo "====================================================="

Comment on lines +147 to +160 (Collaborator): I also added these because I wasn't able to trigger the command from the natively installed Python application, because I don't have the psycopg_c binding installed on my computer (it's needed by psycopg, but doesn't get distributed directly with the Python library).

Reply (Collaborator Author): Nice add. I assumed this would be needed in the near future, but had not spent any time on it yet. Thanks for adding it!
sprint-data-export:
	@echo "=> Exporting project data from the sprint board"
	@echo "====================================================="
@@ -186,6 +201,8 @@ issue-data-export:

gh-data-export: sprint-data-export issue-data-export roadmap-data-export delivery-data-export

gh-etl: delivery-data-export gh-transform-and-load

sprint-burndown:
	@echo "=> Running sprint burndown report"
	@echo "====================================================="
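Taken together, gh-etl chains delivery-data-export with gh-transform-and-load. For orientation, a rough Python equivalent of the transform-and-load leg, assuming $(POETRY) expands to `poetry run` and using an illustrative file path (the export leg's flags are not shown in this diff):

import subprocess
from datetime import date

# EFFECTIVE_DATE defaults to today's date, like `date +"%Y-%m-%d"` in the Makefile
effective_date = date.today().strftime("%Y-%m-%d")

subprocess.run(
    [
        "poetry", "run", "analytics", "etl", "transform_and_load",
        "--deliverable-file", "data/delivery-data.json",  # hypothetical DELIVERY_FILE
        "--effective-date", effective_date,
    ],
    check=True,
)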
51 changes: 50 additions & 1 deletion analytics/src/analytics/cli.py
@@ -1,7 +1,9 @@
# pylint: disable=C0415
"""Expose a series of CLI entrypoints for the analytics package."""

import logging
import logging.config
from datetime import datetime
from pathlib import Path
from typing import Annotated, Optional

@@ -10,8 +12,9 @@
from sqlalchemy import text

from analytics.datasets.deliverable_tasks import DeliverableTasks
from analytics.datasets.etl_dataset import EtlDataset
from analytics.datasets.issues import GitHubIssues
-from analytics.integrations import db, github, slack
+from analytics.integrations import db, etldb, github, slack
from analytics.metrics.base import BaseMetric, Unit
from analytics.metrics.burndown import SprintBurndown
from analytics.metrics.burnup import SprintBurnup
@@ -39,6 +42,8 @@
STATUS_ARG = typer.Option(
    help="Deliverable status to include in report, can be passed multiple times",
)
DELIVERABLE_FILE_ARG = typer.Option(help="Path to file with exported deliverable data")
EFFECTIVE_DATE_ARG = typer.Option(help="YYYY-MM-DD effective date to apply to each imported row")
# fmt: on

# instantiate the main CLI entrypoint
@@ -47,10 +52,12 @@
export_app = typer.Typer()
metrics_app = typer.Typer()
import_app = typer.Typer()
etl_app = typer.Typer()
# add sub-commands to main entrypoint
app.add_typer(export_app, name="export", help="Export data needed to calculate metrics")
app.add_typer(metrics_app, name="calculate", help="Calculate key project metrics")
app.add_typer(import_app, name="import", help="Import data into the database")
app.add_typer(etl_app, name="etl", help="Transform and load local file")


@app.callback()
@@ -292,3 +299,45 @@ def export_json_to_database(delivery_file: Annotated[str, ISSUE_FILE_ARG]) -> None:
    )
    rows = len(issues.to_dict())
    logger.info("Number of rows in table: %s", rows)


# ===========================================================
# Etl commands
# ===========================================================


@etl_app.command(name="initialize_database")
def initialize_database() -> None:
    """Initialize etl database."""
    print("initializing database")
    etldb.init_db()
    print("done")


@etl_app.command(name="transform_and_load")
def transform_and_load(
    deliverable_file: Annotated[str, DELIVERABLE_FILE_ARG],
    effective_date: Annotated[str, EFFECTIVE_DATE_ARG],
) -> None:
    """Transform and load etl data."""
    # validate effective date arg
    try:
        dateformat = "%Y-%m-%d"
        datestamp = (
            datetime.strptime(effective_date, dateformat)
            .astimezone()
            .strftime(dateformat)
        )
        print(f"running transform and load with effective date {datestamp}")
    except ValueError:
        print("FATAL ERROR: malformed effective date, expected YYYY-MM-DD format")
        return

    # hydrate a dataset instance from the input data
    dataset = EtlDataset.load_from_json_file(file_path=deliverable_file)

    # sync data to db
    etldb.sync_db(dataset, datestamp)

    # finish
    print("transform and load is done")
145 changes: 145 additions & 0 deletions analytics/src/analytics/datasets/etl_dataset.py
@@ -0,0 +1,145 @@
"""
Implement the EtlDataset class.

This is a sub-class of BaseDataset that models
quad, deliverable, epic, issue, and sprint data.
"""

from enum import Enum
from typing import Any, Self

import pandas as pd
from numpy.typing import NDArray

from analytics.datasets.base import BaseDataset
from analytics.datasets.utils import load_json_data_as_df


class EtlEntityType(Enum):
"""Define entity types in the db schema."""

DELIVERABLE = "deliverable"
EPIC = "epic"
ISSUE = "issue"
SPRINT = "sprint"
QUAD = "quad"


class EtlDataset(BaseDataset):
"""Encapsulate data exported from github."""

COLUMN_MAP = {
"deliverable_url": "deliverable_ghid",
"deliverable_title": "deliverable_title",
"deliverable_pillar": "deliverable_pillar",
"epic_url": "epic_ghid",
"epic_title": "epic_title",
"issue_url": "issue_ghid",
"issue_title": "issue_title",
"issue_parent": "issue_parent",
"issue_type": "issue_type",
"issue_is_closed": "issue_is_closed",
"issue_opened_at": "issue_opened_at",
"issue_closed_at": "issue_closed_at",
"issue_points": "issue_points",
"issue_status": "issue_status",
"sprint_id": "sprint_ghid",
"sprint_name": "sprint_name",
"sprint_start": "sprint_start",
"sprint_length": "sprint_length",
"sprint_end": "sprint_end",
"quad_id": "quad_ghid",
"quad_name": "quad_name",
"quad_start": "quad_start",
"quad_length": "quad_length",
"quad_end": "quad_end",
}

@classmethod
def load_from_json_file(cls, file_path: str) -> Self:
"""
Load the input json file and instantiates an instance of EtlDataset.

Parameters
----------
file_path: str
Path to the local json file containing data exported from GitHub

Returns
-------
Self:
An instance of the EtlDataset dataset class
"""
# load input datasets
df = load_json_data_as_df(
file_path=file_path,
column_map=cls.COLUMN_MAP,
date_cols=None,
)

# transform entity id columns
prefix = "https://github.com/"
for col in ("deliverable_ghid", "epic_ghid", "issue_ghid", "issue_parent"):
df[col] = df[col].str.replace(prefix, "")

return cls(df)

# QUAD getters

def get_quad(self, quad_ghid: str) -> pd.Series:
"""Fetch data about a given quad."""
query_string = f"quad_ghid == '{quad_ghid}'"
return self.df.query(query_string).iloc[0]

def get_quad_ghids(self) -> NDArray[Any]:
"""Fetch an array of unique non-null quad ghids."""
df = self.df[self.df.quad_ghid.notna()]
return df.quad_ghid.unique()

# DELIVERABLE getters

def get_deliverable(self, deliverable_ghid: str) -> pd.Series:
"""Fetch data about a given deliverable."""
query_string = f"deliverable_ghid == '{deliverable_ghid}'"
return self.df.query(query_string).iloc[0]

def get_deliverable_ghids(self) -> NDArray[Any]:
"""Fetch an array of unique non-null deliverable ghids."""
df = self.df[self.df.deliverable_ghid.notna()]
return df.deliverable_ghid.unique()

# SPRINT getters

def get_sprint(self, sprint_ghid: str) -> pd.Series:
"""Fetch data about a given sprint."""
query_string = f"sprint_ghid == '{sprint_ghid}'"
return self.df.query(query_string).iloc[0]

def get_sprint_ghids(self) -> NDArray[Any]:
"""Fetch an array of unique non-null sprint ghids."""
df = self.df[self.df.sprint_ghid.notna()]
return df.sprint_ghid.unique()

# EPIC getters

def get_epic(self, epic_ghid: str) -> pd.Series:
"""Fetch data about a given epic."""
query_string = f"epic_ghid == '{epic_ghid}'"
return self.df.query(query_string).iloc[0]

def get_epic_ghids(self) -> NDArray[Any]:
"""Fetch an array of unique non-null epic ghids."""
df = self.df[self.df.epic_ghid.notna()]
return df.epic_ghid.unique()

# ISSUE getters

def get_issue(self, issue_ghid: str) -> pd.Series:
"""Fetch data about a given issue."""
query_string = f"issue_ghid == '{issue_ghid}'"
return self.df.query(query_string).iloc[0]

def get_issue_ghids(self) -> NDArray[Any]:
"""Fetch an array of unique non-null issue ghids."""
df = self.df[self.df.issue_ghid.notna()]
return df.issue_ghid.unique()
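For reference, a minimal usage sketch of EtlDataset against a toy in-memory frame. The rows and ghids below are invented for illustration, and it assumes BaseDataset accepts a DataFrame directly, as the `return cls(df)` above suggests:

import pandas as pd

from analytics.datasets.etl_dataset import EtlDataset

# toy frame with already-mapped column names (invented rows)
df = pd.DataFrame(
    {
        "issue_ghid": ["owner/repo/issues/1", "owner/repo/issues/2"],
        "issue_title": ["First issue", "Second issue"],
        "sprint_ghid": ["sprint-1", None],
    }
)

dataset = EtlDataset(df)  # assumes the BaseDataset constructor takes a DataFrame

# the get_*_ghids() helpers drop nulls before deduplicating
print(dataset.get_issue_ghids())   # ['owner/repo/issues/1' 'owner/repo/issues/2']
print(dataset.get_sprint_ghids())  # ['sprint-1']  (the None row is dropped)

# get_issue() returns the first matching row as a pandas Series
print(dataset.get_issue("owner/repo/issues/1")["issue_title"])  # First issue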
1 change: 0 additions & 1 deletion analytics/src/analytics/integrations/db.py
@@ -22,7 +22,6 @@ def get_db() -> Engine:
        A SQLAlchemy engine object representing the connection to the database.
    """
    db = get_db_settings()
-   print(f"postgresql+psycopg://{db.user}:{db.password}@{db.db_host}:{db.port}")
    return create_engine(
        f"postgresql+psycopg://{db.user}:{db.password}@{db.db_host}:{db.port}",
        pool_pre_ping=True,

Review comment (Collaborator): Yeah... please don't print out the password...

Reply (Collaborator Author): @coilysiren I agree with you. This is not my code; it was already there when I started.
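If a connection log line is still wanted for debugging, the URL can be logged with the password masked rather than removed entirely. A minimal sketch, assuming SQLAlchemy 1.4+ (whose URL.render_as_string(hide_password=True) masks the password); the function name and parameters here are hypothetical, not part of the PR:

import logging

from sqlalchemy import create_engine
from sqlalchemy.engine import URL

logger = logging.getLogger(__name__)


def get_engine_with_safe_logging(user: str, password: str, host: str, port: int):
    """Create an engine and log the connection target without leaking the password."""
    url = URL.create(
        "postgresql+psycopg",
        username=user,
        password=password,
        host=host,
        port=port,
    )
    # hide_password=True renders the password component as '***'
    logger.info("connecting to %s", url.render_as_string(hide_password=True))
    return create_engine(url, pool_pre_ping=True)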
11 changes: 11 additions & 0 deletions analytics/src/analytics/integrations/etldb/__init__.py
@@ -0,0 +1,11 @@
"""Read and write data from/to delivery metrics database."""

__all__ = [
    "init_db",
    "sync_db",
]

from analytics.integrations.etldb.main import (
    init_db,
    sync_db,
)
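The two exported names mirror what the CLI commands above do; a sketch of driving them directly from Python, with the file path invented for illustration:

from analytics.datasets.etl_dataset import EtlDataset
from analytics.integrations import etldb

# create the gh_* tables if they do not already exist
etldb.init_db()

# hydrate a dataset from an exported JSON file (hypothetical path), then sync it
# with an effective date string, matching the call pattern in cli.py
dataset = EtlDataset.load_from_json_file(file_path="data/delivery-data.json")
etldb.sync_db(dataset, "2024-11-05")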
96 changes: 96 additions & 0 deletions analytics/src/analytics/integrations/etldb/create_etl_db.sql
@@ -0,0 +1,96 @@
CREATE TABLE IF NOT EXISTS gh_deliverable (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    pillar TEXT,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);

CREATE TABLE IF NOT EXISTS gh_deliverable_quad_map (
    id SERIAL PRIMARY KEY,
    deliverable_id INTEGER NOT NULL,
    quad_id INTEGER,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(deliverable_id, d_effective)
);
CREATE INDEX IF NOT EXISTS gh_dqm_i1 on gh_deliverable_quad_map(quad_id, d_effective);

CREATE TABLE IF NOT EXISTS gh_epic (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);

CREATE TABLE IF NOT EXISTS gh_epic_deliverable_map (
    id SERIAL PRIMARY KEY,
    epic_id INTEGER NOT NULL,
    deliverable_id INTEGER,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(epic_id, d_effective)
);
CREATE INDEX IF NOT EXISTS gh_edm_i1 on gh_epic_deliverable_map(deliverable_id, d_effective);

CREATE TABLE IF NOT EXISTS gh_issue (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    type TEXT NOT NULL,
    opened_date DATE,
    closed_date DATE,
    parent_issue_ghid TEXT,
    epic_id INTEGER,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);
CREATE INDEX IF NOT EXISTS gh_issue_i1 on gh_issue(epic_id);

CREATE TABLE IF NOT EXISTS gh_issue_history (
    id SERIAL PRIMARY KEY,
    issue_id INTEGER NOT NULL,
    status TEXT,
    is_closed INTEGER NOT NULL,
    points INTEGER NOT NULL DEFAULT 0,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(issue_id, d_effective)
);
CREATE INDEX IF NOT EXISTS gh_ih_i1 on gh_issue_history(issue_id, d_effective);

Review comment on gh_issue_history (Collaborator): One thought on this and other tables -- it might be helpful to have a t_created column as well for debugging purposes. That can be scoped into a future ticket though!

Reply (Collaborator Author): Good suggestion, and done: 8946b80

CREATE TABLE IF NOT EXISTS gh_issue_sprint_map (
    id SERIAL PRIMARY KEY,
    issue_id INTEGER NOT NULL,
    sprint_id INTEGER,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(issue_id, d_effective)
);

CREATE TABLE IF NOT EXISTS gh_sprint (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    start_date DATE,
    end_date DATE,
    duration INTEGER,
    quad_id INTEGER,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);

CREATE TABLE IF NOT EXISTS gh_quad (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    start_date DATE,
    end_date DATE,
    duration INTEGER,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);
CREATE INDEX IF NOT EXISTS gh_quad_i1 on gh_quad(start_date);