[Issue 2482] Migrate delivery metrics transform and load from simpler-grants-sandbox #2617
Changes from 49 commits
Makefile:

```diff
@@ -13,12 +13,13 @@ ISSUE_FILE ?= $(OUTPUT_DIR)/issue-data.json
 DELIVERY_FILE ?= $(OUTPUT_DIR)/delivery-data.json
 SPRINT ?= @current
 # Names of the points and sprint fields in the GitHub project
-POINTS_FIELD ?= Points
+POINTS_FIELD ?= Story Points
 SPRINT_FIELD ?= Sprint
 UNIT ?= points
 ACTION ?= show-results
 MIN_TEST_COVERAGE ?= 80
 APP_NAME ?= grants-analytics
+EFFECTIVE_DATE ?= $(shell date +"%Y-%m-%d")

 # Required for CI to work properly
 SHELL = /bin/bash -o pipefail
```
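For context, the new `EFFECTIVE_DATE` default resolves to today's date in ISO `YYYY-MM-DD` form. A minimal Python equivalent of the Makefile's `date +"%Y-%m-%d"` expression (the helper name here is hypothetical, just for illustration):

```python
from datetime import date

def default_effective_date(today: date) -> str:
    """Format a date the same way as the Makefile's `date +"%Y-%m-%d"` default."""
    return today.strftime("%Y-%m-%d")

print(default_effective_date(date(2024, 7, 4)))  # → 2024-07-04
```

Because the variable uses `?=`, callers can still override it, e.g. `make gh-transform-and-load EFFECTIVE_DATE=2024-07-01`.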
```diff
@@ -144,6 +145,20 @@ lint: ## runs code quality checks
 # Data Commands #
 #################

+init-db:
+	@echo "=> Initializing the database schema"
+	@echo "====================================================="
+	$(POETRY) analytics etl initialize_database
+	@echo "====================================================="
+
+gh-transform-and-load:
+	@echo "=> Transforming and loading GitHub data into the database"
+	@echo "====================================================="
+	$(POETRY) analytics etl transform_and_load \
+		--deliverable-file $(DELIVERY_FILE) \
+		--effective-date $(EFFECTIVE_DATE)
+	@echo "====================================================="
+
 sprint-data-export:
 	@echo "=> Exporting project data from the sprint board"
 	@echo "====================================================="
```

**Review comment on lines +147 to +160:**

> I also added these because I wasn't able to trigger the command from the natively installed python application because I don't have the

**Reply:**

> Nice add. I assumed this would be needed in the near future, but had not spent any time on it yet. Thanks for adding it!
```diff
@@ -186,6 +201,8 @@ issue-data-export:

 gh-data-export: sprint-data-export issue-data-export roadmap-data-export delivery-data-export

+gh-etl: delivery-data-export gh-transform-and-load
+
 sprint-burndown:
 	@echo "=> Running sprint burndown report"
 	@echo "====================================================="
```
New file (Python module defining `EtlDataset`, `@@ -0,0 +1,145 @@`):

```python
"""
Implement the EtlDataset class.

This is a sub-class of BaseDataset that models
quad, deliverable, epic, issue, and sprint data.
"""

from enum import Enum
from typing import Any, Self

import pandas as pd
from numpy.typing import NDArray

from analytics.datasets.base import BaseDataset
from analytics.datasets.utils import load_json_data_as_df


class EtlEntityType(Enum):
    """Define entity types in the db schema."""

    DELIVERABLE = "deliverable"
    EPIC = "epic"
    ISSUE = "issue"
    SPRINT = "sprint"
    QUAD = "quad"


class EtlDataset(BaseDataset):
    """Encapsulate data exported from GitHub."""

    COLUMN_MAP = {
        "deliverable_url": "deliverable_ghid",
        "deliverable_title": "deliverable_title",
        "deliverable_pillar": "deliverable_pillar",
        "epic_url": "epic_ghid",
        "epic_title": "epic_title",
        "issue_url": "issue_ghid",
        "issue_title": "issue_title",
        "issue_parent": "issue_parent",
        "issue_type": "issue_type",
        "issue_is_closed": "issue_is_closed",
        "issue_opened_at": "issue_opened_at",
        "issue_closed_at": "issue_closed_at",
        "issue_points": "issue_points",
        "issue_status": "issue_status",
        "sprint_id": "sprint_ghid",
        "sprint_name": "sprint_name",
        "sprint_start": "sprint_start",
        "sprint_length": "sprint_length",
        "sprint_end": "sprint_end",
        "quad_id": "quad_ghid",
        "quad_name": "quad_name",
        "quad_start": "quad_start",
        "quad_length": "quad_length",
        "quad_end": "quad_end",
    }

    @classmethod
    def load_from_json_file(cls, file_path: str) -> Self:
        """
        Load the input json file and instantiate an EtlDataset.

        Parameters
        ----------
        file_path: str
            Path to the local json file containing data exported from GitHub

        Returns
        -------
        Self:
            An instance of the EtlDataset dataset class
        """
        # load input datasets
        df = load_json_data_as_df(
            file_path=file_path,
            column_map=cls.COLUMN_MAP,
            date_cols=None,
        )

        # transform entity id columns
        prefix = "https://github.com/"
        for col in ("deliverable_ghid", "epic_ghid", "issue_ghid", "issue_parent"):
            df[col] = df[col].str.replace(prefix, "")

        return cls(df)

    # QUAD getters

    def get_quad(self, quad_ghid: str) -> pd.Series:
        """Fetch data about a given quad."""
        query_string = f"quad_ghid == '{quad_ghid}'"
        return self.df.query(query_string).iloc[0]

    def get_quad_ghids(self) -> NDArray[Any]:
        """Fetch an array of unique non-null quad ghids."""
        df = self.df[self.df.quad_ghid.notna()]
        return df.quad_ghid.unique()

    # DELIVERABLE getters

    def get_deliverable(self, deliverable_ghid: str) -> pd.Series:
        """Fetch data about a given deliverable."""
        query_string = f"deliverable_ghid == '{deliverable_ghid}'"
        return self.df.query(query_string).iloc[0]

    def get_deliverable_ghids(self) -> NDArray[Any]:
        """Fetch an array of unique non-null deliverable ghids."""
        df = self.df[self.df.deliverable_ghid.notna()]
        return df.deliverable_ghid.unique()

    # SPRINT getters

    def get_sprint(self, sprint_ghid: str) -> pd.Series:
        """Fetch data about a given sprint."""
        query_string = f"sprint_ghid == '{sprint_ghid}'"
        return self.df.query(query_string).iloc[0]

    def get_sprint_ghids(self) -> NDArray[Any]:
        """Fetch an array of unique non-null sprint ghids."""
        df = self.df[self.df.sprint_ghid.notna()]
        return df.sprint_ghid.unique()

    # EPIC getters

    def get_epic(self, epic_ghid: str) -> pd.Series:
        """Fetch data about a given epic."""
        query_string = f"epic_ghid == '{epic_ghid}'"
        return self.df.query(query_string).iloc[0]

    def get_epic_ghids(self) -> NDArray[Any]:
        """Fetch an array of unique non-null epic ghids."""
        df = self.df[self.df.epic_ghid.notna()]
        return df.epic_ghid.unique()

    # ISSUE getters

    def get_issue(self, issue_ghid: str) -> pd.Series:
        """Fetch data about a given issue."""
        query_string = f"issue_ghid == '{issue_ghid}'"
        return self.df.query(query_string).iloc[0]

    def get_issue_ghids(self) -> NDArray[Any]:
        """Fetch an array of unique non-null issue ghids."""
        df = self.df[self.df.issue_ghid.notna()]
        return df.issue_ghid.unique()
```
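The getters above all follow one pattern: `df.query(...)` plus `iloc[0]` for a single entity, and `unique()` over the non-null rows for a list of ghids. A self-contained sketch of that pattern on a toy frame (the `org/repo/...` ghids are placeholders, not project data):

```python
import pandas as pd

# Toy frame mimicking two issue rows plus one row with no issue.
df = pd.DataFrame(
    {
        "issue_ghid": ["org/repo/issues/1", "org/repo/issues/2", None],
        "issue_title": ["Issue A", "Issue B", None],
    }
)

def get_issue(frame: pd.DataFrame, issue_ghid: str) -> pd.Series:
    """Fetch the first row matching the given issue ghid, as EtlDataset does."""
    return frame.query(f"issue_ghid == '{issue_ghid}'").iloc[0]

def get_issue_ghids(frame: pd.DataFrame):
    """Fetch unique non-null issue ghids."""
    return frame[frame.issue_ghid.notna()].issue_ghid.unique()

row = get_issue(df, "org/repo/issues/2")
print(row["issue_title"])                 # → Issue B
print(list(get_issue_ghids(df)))          # → ['org/repo/issues/1', 'org/repo/issues/2']
```

One design caveat worth noting: the f-string interpolation into `query()` assumes trusted ghid input, since the value is not escaped or quoted defensively.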
Database engine module:

```diff
@@ -22,7 +22,6 @@ def get_db() -> Engine:
     A SQLAlchemy engine object representing the connection to the database.
     """
     db = get_db_settings()
-    print(f"postgresql+psycopg://{db.user}:{db.password}@{db.db_host}:{db.port}")
     return create_engine(
         f"postgresql+psycopg://{db.user}:{db.password}@{db.db_host}:{db.port}",
         pool_pre_ping=True,
```

**Review comment on the deleted `print`:**

> Yeah... please don't print out the password...

**Reply:**

> @coilysiren I agree with you. This is not my code; it was already there when I started.
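Deleting that `print` is the right call, since the connection URL embeds the database password. If the connection target ever does need to be logged, a redacting formatter is a safer pattern; a minimal sketch (the helper name is hypothetical, not part of this PR):

```python
def redacted_db_url(user: str, password: str, host: str, port: int) -> str:
    """Build the connection URL with the password masked, safe for logging."""
    # The real URL keeps the password; only the logged form masks it.
    return f"postgresql+psycopg://{user}:***@{host}:{port}"

print(redacted_db_url("analytics", "s3cret", "localhost", 5432))
# → postgresql+psycopg://analytics:***@localhost:5432
```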
New file (the `etldb` package `__init__`, `@@ -0,0 +1,11 @@`):

```python
"""Read and write data from/to the delivery metrics database."""

__all__ = [
    "init_db",
    "sync_db",
]

from analytics.integrations.etldb.main import (
    init_db,
    sync_db,
)
```
New file (PostgreSQL schema for the delivery metrics database, `@@ -0,0 +1,96 @@`):

```sql
CREATE TABLE IF NOT EXISTS gh_deliverable (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    pillar TEXT,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);

CREATE TABLE IF NOT EXISTS gh_deliverable_quad_map (
    id SERIAL PRIMARY KEY,
    deliverable_id INTEGER NOT NULL,
    quad_id INTEGER,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(deliverable_id, d_effective)
);
CREATE INDEX IF NOT EXISTS gh_dqm_i1 ON gh_deliverable_quad_map(quad_id, d_effective);

CREATE TABLE IF NOT EXISTS gh_epic (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);

CREATE TABLE IF NOT EXISTS gh_epic_deliverable_map (
    id SERIAL PRIMARY KEY,
    epic_id INTEGER NOT NULL,
    deliverable_id INTEGER,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(epic_id, d_effective)
);
CREATE INDEX IF NOT EXISTS gh_edm_i1 ON gh_epic_deliverable_map(deliverable_id, d_effective);

CREATE TABLE IF NOT EXISTS gh_issue (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    type TEXT NOT NULL,
    opened_date DATE,
    closed_date DATE,
    parent_issue_ghid TEXT,
    epic_id INTEGER,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);
CREATE INDEX IF NOT EXISTS gh_issue_i1 ON gh_issue(epic_id);

CREATE TABLE IF NOT EXISTS gh_issue_history (
    id SERIAL PRIMARY KEY,
    issue_id INTEGER NOT NULL,
    status TEXT,
    is_closed INTEGER NOT NULL,
    points INTEGER NOT NULL DEFAULT 0,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(issue_id, d_effective)
);
CREATE INDEX IF NOT EXISTS gh_ih_i1 ON gh_issue_history(issue_id, d_effective);

CREATE TABLE IF NOT EXISTS gh_issue_sprint_map (
    id SERIAL PRIMARY KEY,
    issue_id INTEGER NOT NULL,
    sprint_id INTEGER,
    d_effective DATE NOT NULL,
    t_modified TIMESTAMP,
    UNIQUE(issue_id, d_effective)
);

CREATE TABLE IF NOT EXISTS gh_sprint (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    start_date DATE,
    end_date DATE,
    duration INTEGER,
    quad_id INTEGER,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);

CREATE TABLE IF NOT EXISTS gh_quad (
    id SERIAL PRIMARY KEY,
    ghid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    start_date DATE,
    end_date DATE,
    duration INTEGER,
    t_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    t_modified TIMESTAMP
);
CREATE INDEX IF NOT EXISTS gh_quad_i1 ON gh_quad(start_date);
```

**Review comment on `gh_issue_history`:**

> One thought on this and other tables -- it might be helpful to have a
>
> That can be scoped into a future ticket though!

**Reply:**

> Good suggestion, and done: 8946b80
**Review comment:**

> FYI this ends up looking very messy when you see it in the AWS Console