
Commit 2e59857

Update repo_mlp.ipynb to train an org wide model (kubeflow#128)
* Related to kubeflow#110
* Uses the newly computed embeddings for all issues in the org to train the model
* Clean up .gitignore
* Create a kustomize package to start a notebook with appropriate settings for running the example
  * Warning: the kpt setters aren't fully configured yet
* Evaluate the model qualitatively by fetching recent issues from BigQuery and computing predictions for those issues
  * Very few issues get an area or platform label. It looks like the model falls far short of our goal of labeling 25% of issues.
* Move code for fetching issues from BigQuery into github_bigquery.py
  * Still need to update Get-GitHub-Issues.ipynb to use this
* When computing ROC curves, only look at issues with an area, kind, or platform label (see the sketch below)
  * Do this because we have lots of examples with no labels. We shouldn't treat those as "true negatives" because it's likely a human never looked at them, and if they did, some labels might apply.
1 parent 073d2d2 commit 2e59857
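As a minimal, hypothetical sketch of the ROC filtering described in the last bullet above (it assumes a pandas DataFrame with a parsed_labels column like the one produced by github_bigquery.get_issues; the function name, prefixes, and score input are illustrative, not code from this commit):

```python
# Hypothetical sketch: compute an ROC curve for one label while skipping issues that
# carry no area/kind/platform label at all, so unlabeled issues are not treated as
# true negatives. Not the notebook's actual code.
import numpy as np
from sklearn.metrics import roc_curve

def roc_for_label(issues, label, scores):
    """issues: DataFrame with a 'parsed_labels' column (list of label strings).
    scores: model probabilities for `label`, aligned with the rows of `issues`."""
    triage_prefixes = ("area/", "kind/", "platform/")
    has_triage_label = issues["parsed_labels"].apply(
        lambda labels: any(l.startswith(triage_prefixes) for l in labels))
    y_true = issues.loc[has_triage_label, "parsed_labels"].apply(
        lambda labels: int(label in labels))
    y_score = np.asarray(scores)[has_triage_label.to_numpy()]
    return roc_curve(y_true, y_score)
```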

File tree

16 files changed: +6144 -135 lines changed

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,3 @@
-.*
 *.pem
 *.hdf5
 *.pkl
@@ -13,5 +12,9 @@ fairing/__pycache__/**
 *.pyc
 py/code_intelligence/.data/**
 
+# ignore coredumps
+**/core.*
+# ignore checkpoints
+**/.ipynb_checkpoints/
 # TODO(jlewi): Is this a remote module? Why is the fairing src getting cloned here?
 Label_Microservice/src/**

Issue_Embeddings/notebooks/Get-GitHub-Issues.ipynb

Lines changed: 1 addition & 0 deletions
@@ -397,6 +397,7 @@
     }
    ],
    "source": [
+    "# TODO(jlewi): This code should now be a function in embeddings/github_bigquery.py\n",
     "query = \"\"\"SELECT \n",
     "  JSON_EXTRACT(payload, '$.issue.html_url') as html_url,\n",
     "  JSON_EXTRACT(payload, '$.issue.title') as title,\n",
Issue_Triage/notebooks/triage.ipynb

Lines changed: 928 additions & 116 deletions
Large diffs are not rendered by default.

Label_Microservice/notebooks/Label_k8s_issues_with_MLP.ipynb

Lines changed: 7 additions & 2 deletions
@@ -7,6 +7,10 @@
     "## Background\n",
     "In this notebook, we show how to feed the embeddings from the language model into the MLP classifier. Then, we take the github repo, `kubernetes/kubernetes`, as an example. We do transfer learning and show the results.\n",
     "\n",
+    "* TODO(jlewi): This notebook is duplicative with repo_mlp.ipynb. It looks like this might have contained\n",
+    "the original training code which has been refactored in repo_mlp.ipynb. But unlike repo_mlp.ipynb this\n",
+    "notebook contains code for model evaluation. We should probably combine them and remove duplication.\n",
+    "\n",
     "## Data\n",
     "**combined_sig_df.pkl**\n",
     "https://storage.googleapis.com/issue_label_bot/notebook_files/combined_sig_df.pkl\n",
@@ -451,6 +455,7 @@
    "source": [
     "from sklearn.metrics import roc_auc_score\n",
     "\n",
+    "# TODO(jlewi): I moved this into mlp.py\n",
     "def calculate_auc(predictions):\n",
     "    auc_scores = []\n",
     "    counts = []\n",
@@ -5654,9 +5659,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }
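
The hunk above only shows the first lines of calculate_auc, which the TODO says now lives in mlp.py. As a rough, hypothetical sketch of what such a per-label AUC computation typically looks like (the input format below is an assumption for illustration, not taken from this commit):

```python
# Hypothetical per-label AUC sketch; not the code from mlp.py.
from sklearn.metrics import roc_auc_score

def calculate_auc_sketch(predictions):
    """predictions: assumed to be an iterable of (y_true, y_score) pairs, one per label."""
    auc_scores = []
    counts = []
    for y_true, y_score in predictions:
        counts.append(int(sum(y_true)))   # number of positive examples for this label
        if len(set(y_true)) < 2:          # AUC is undefined with only one class present
            auc_scores.append(float("nan"))
            continue
        auc_scores.append(roc_auc_score(y_true, y_score))
    return auc_scores, counts
```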

Label_Microservice/notebooks/repo_mlp.ipynb

Lines changed: 4918 additions & 10 deletions
Large diffs are not rendered by default.

k8s-notebooks/Kptfile

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+apiVersion: kpt.dev/v1alpha1
+kind: Kptfile
+metadata:
+  name: .
+packageMetadata:
+  shortDescription: sample description
+openAPI:
+  definitions:
+    io.k8s.cli.setters.namespace:
+      x-k8s-cli:
+        setter:
+          name: namespace
+          value: kubeflow-jlewi
+    io.k8s.cli.substitutions.namespace:
+      x-k8s-cli:
+        substitution:
+          name: namespace
+          pattern: NAMESPACE
+          values:
+          - marker: NAMESPACE
+            ref: '#/definitions/io.k8s.cli.setters.namespace'
+    io.k8s.cli.setters.name:
+      x-k8s-cli:
+        setter:
+          name: name
+          value: mnist
+    io.k8s.cli.substitutions.name:
+      x-k8s-cli:
+        substitution:
+          name: name
+          pattern: NAME
+          values:
+          - marker: NAME
+            ref: '#/definitions/io.k8s.cli.setters.name'

k8s-notebooks/README.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# Notebook Manifests
+
+TODO(jlewi): kpt setters aren't properly configured yet
+* volumes need to be properly set
+
+This directory contains a kustomize package for spinning up
+a notebook on Kubeflow to run the example.
+
+Create a secret with the GITHUB_TOKEN
+
+```
+kubectl -n kubeflow-jlewi create secret generic github-token --from-literal=github_token=${GITHUB_TOKEN}
+```
+

k8s-notebooks/kustomization.yaml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: kubeflow-jlewi # {"$ref":"#/definitions/io.k8s.cli.substitutions.namespace"}
+resources:
+- notebook.yaml
+- pvc.yaml
+- service.yaml
+- virtual_service.yaml

k8s-notebooks/notebook.yaml

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+apiVersion: kubeflow.org/v1
+kind: Notebook
+metadata:
+  labels:
+    app: mnist # {"$ref":"#/definitions/io.k8s.cli.substitutions.name"}
+  name: mnist # {"$ref":"#/definitions/io.k8s.cli.substitutions.name"}
+spec:
+  template:
+    spec:
+      containers:
+      - env:
+        - name: JUPYTERLAB_DIR # Set the JUPYTERLAB_DIR so we can install extensions
+          value: /home/jovyan/.jupyterlab_dir
+        - name: GITHUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: github-token
+              key: github_token
+        image: gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
+        name: mnist # {"$ref":"#/definitions/io.k8s.cli.substitutions.name"}
+        # Bump the resources to include a GPU
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+          requests:
+            cpu: "15"
+            memory: 32.0Gi
+        volumeMounts:
+        - mountPath: /home/jovyan
+          name: workspace-mnist
+        - mountPath: /dev/shm
+          name: dshm
+      # Start a container running theia which is an IDE
+      - env:
+        - name: GITHUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: github-token
+              key: github_token
+        # TODO(jlewi): Should we use an image which actually includes an appropriate toolchain like python?
+        image: theiaide/theia:next
+        name: theia
+        resources:
+          requests:
+            cpu: "4"
+            memory: 1.0Gi
+        volumeMounts:
+        - mountPath: /mount/jovyan
+          name: workspace-mnist
+      serviceAccountName: default-editor
+      ttlSecondsAfterFinished: 300
+      volumes:
+      - name: workspace-mnist
+        persistentVolumeClaim:
+          claimName: workspace-mnist
+      - emptyDir:
+          medium: Memory
+        name: dshm

k8s-notebooks/pvc.yaml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  # TODO(jlewi): Need to create a kpt setter for this.
+  name: workspace-mnist
+spec:
+  accessModes:
+  - ReadWriteOnce
+  resources:
+    requests:
+      # We need more storage.
+      storage: 100Gi
+  storageClassName: standard
+  volumeMode: Filesystem

k8s-notebooks/service.yaml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# Define a service for theia
+# TODO(jlewi): This needs to be adjusted based on kpt setters
+apiVersion: v1
+kind: Service
+metadata:
+  name: mnist-theia
+spec:
+  ports:
+  - name: http-theia
+    port: 3000
+    protocol: TCP
+    targetPort: 3000
+  selector:
+    notebook-name: mnist
+  type: ClusterIP

k8s-notebooks/virtual_service.yaml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+apiVersion: networking.istio.io/v1alpha3
+kind: VirtualService
+metadata:
+  name: notebook-kubeflow-jlewi-mnist-theia
+  namespace: kubeflow-jlewi
+spec:
+  gateways:
+  - kubeflow/kubeflow-gateway
+  hosts:
+  - '*'
+  http:
+  - match:
+    - uri:
+        # The prefix must have a trailing slash
+        # And when you navigate to the URL you must include the trailing slash.
+        prefix: /notebook/kubeflow-jlewi/mnist-theia/
+    rewrite:
+      uri: /
+    route:
+    - destination:
+        host: mnist-theia.kubeflow-jlewi.svc.cluster.local
+        port:
+          number: 3000
+    timeout: 300s

py/code_intelligence/embeddings.py

Lines changed: 14 additions & 6 deletions
@@ -17,8 +17,8 @@ def find_max_issue_num(owner, repo):
 
     Returns
     -------
-    int
-    the highest issue number associated with this repo.
+    int
+        the highest issue number associated with this repo.
     """
     url = f'https://github.com/{owner}/{repo}/issues'
     r = requests.get(url)
@@ -174,24 +174,32 @@ def pass_through(x):
     """Avoid messages when the model is deserialized in fastai library."""
     return x
 
-def load_model_artifact(model_url):
+# TODO(jlewi): I think we should just get rid of this method.
+# Callers should use gcs_util and then call inference_wrapper
+def load_model_artifact(model_url, local_dir=None):
     """
     Download the pretrained language model from URL
     Args:
       model_url: URL to store the pretrained model
+      local_dir: (Optional) Directory where model files are stored
 
     Returns
     ------
     InferenceWrapper
       a wrapper for a Learner object in fastai.
     """
-    path = Path('./model_files')
-    full_path = path/'model.pkl'
-
+    if not local_dir:
+        home = str(Path.home())
+        local_dir = os.path.join(home, "model_files")
+
+    full_path = os.path.join(local_dir, 'model.pkl')
+
     if not full_path.exists():
         logging.info('Loading model.')
         path.mkdir(exist_ok=True)
         request_url.urlretrieve(model_url, path/'model.pkl')
+    else:
+        logging.info(f"Model {full_path} exists")
     return InferenceWrapper(model_path=path, model_file_name='model.pkl')
 
 
py/code_intelligence/gcs_util.py

Lines changed: 12 additions & 0 deletions
@@ -71,6 +71,18 @@ def upload_file_to_gcs(bucket_name, gcs_filename, local_filename, storage_client
     blob = bucket.blob(gcs_filename)
     blob.upload_from_filename(local_filename)
 
+
+def copy_from_gcs(gcs_path, local_filename, storage_client=None):
+    """
+    Download a file in GCS to the local.
+    Args:
+      gcs_path: gcs path
+      local_filename: the new local file, str
+      storage_client: client to bundle configuration needed for API requests
+    """
+    bucket_name, gcs_file_name = split_gcs_uri(gcs_path)
+    return download_file_from_gcs(bucket_name, gcs_file_name, local_filename, storage_client=storage_client)
+
 def download_file_from_gcs(bucket_name, gcs_filename, local_filename, storage_client=None):
     """
     Download a file in GCS to the local.
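
A short usage sketch for the new copy_from_gcs helper; the import path and the GCS URI below are illustrative assumptions, not part of this commit:

```python
# Hypothetical usage of copy_from_gcs; bucket and object names are placeholders.
from code_intelligence import gcs_util  # import path assumed

gcs_util.copy_from_gcs("gs://some-bucket/models/model.pkl", "/tmp/model.pkl")
```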
py/code_intelligence/github_bigquery.py

Lines changed: 66 additions & 0 deletions

@@ -0,0 +1,66 @@
+"""This module contains code to get issue data from BigQuery."""
+
+import dateutil
+import json
+from pandas.io import gbq
+import re
+
+def get_issues(login, project, max_age_days=None):
+  """Get issue data from bigquery.
+
+  Args:
+    login: Which GitHub organization to query for
+    project: GCP project to charge BigQuery to
+    max_age_days: (Optional) If present only fetch issues which were created
+      less than max_age_days ago
+  """
+  query = f"""SELECT
+          JSON_EXTRACT(payload, '$.issue.html_url') as html_url,
+          JSON_EXTRACT(payload, '$.issue.title') as title,
+          JSON_EXTRACT(payload, '$.issue.body') as body,
+          JSON_EXTRACT(payload, "$.issue.labels") as labels,
+          JSON_EXTRACT(payload, "$.issue.created_at") as created_at,
+          JSON_EXTRACT(payload, "$.issue.updated_at") as updated_at,
+          org.login,
+          type,
+        FROM `githubarchive.month.20*`
+        WHERE (type="IssuesEvent" or type="IssueCommentEvent") and org.login = '{login}'"""
+
+  if max_age_days:
+    # We need to convert the created_at field to a timestamp.
+    # JSON_EXTRACT returns a json string meaning it is quoted and we need
+    # to remove the quotes
+    query += f""" and DATETIME_DIFF(CURRENT_DATETIME(), PARSE_DATETIME(
+        "\\"%Y-%m-%dT%TZ\\"", JSON_EXTRACT(payload,
+        "$.issue.created_at")), DAY)
+        <= {max_age_days} """
+
+  issues_and_pulls=gbq.read_gbq(query, dialect='standard', project_id=project)
+
+  # pull request comments also get included so we need to filter those out
+  pattern = re.compile(".*issues/[\d]+")
+
+  issues_index = issues_and_pulls["html_url"].apply(lambda x: pattern.match(x) is not None)
+  issues = issues_and_pulls[issues_index]
+
+  # We need to group the events by issue and then select the most recent event for each
+  # issue as that should have the most up to date labels for each issue.
+  # TODO(jlewi): Should we be converting updated_at to a datetime before doing the sort?
+  latest_issues = issues.groupby("html_url", as_index=False).apply(lambda x: x.sort_values(["updated_at"]).iloc[-1])
+
+  # we need to deserialize the json strings to remove escaping
+  for f in ["html_url", "title", "body", "created_at", "updated_at"]:
+    latest_issues[f] = latest_issues[f].apply(lambda x : json.loads(x))
+
+  # Parse timestamps
+  for f in ["created_at", "updated_at"]:
+    latest_issues[f] = latest_issues[f].apply(lambda x : dateutil.parser.parse(x))
+
+  # Parse labels
+  def get_labels(x):
+    d = json.loads(x)
+    return [i["name"] for i in d]
+
+  latest_issues["parsed_labels"] = latest_issues["labels"].apply(get_labels)
+
+  return latest_issues
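
A brief usage sketch for get_issues; the module import path and the GCP project id are placeholders, not part of this commit:

```python
# Hypothetical usage of get_issues; "my-gcp-project" is a placeholder project id.
from code_intelligence import github_bigquery  # import path assumed

recent = github_bigquery.get_issues("kubeflow", project="my-gcp-project", max_age_days=14)
print(recent[["html_url", "parsed_labels"]].head())
```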
