Commit 073d2d2

Compute embeddings for all kubeflow repositories. (kubeflow#124)
* This is the first step in creating an org-wide model for all of Kubeflow (kubeflow#110)
* Modify the Get-GitHub-Issues.ipynb notebook to reuse code in the embeddings directory.
* Add some missing packages to requirements.txt.
* Use the GitHub GraphQL client to get a list of all repositories.
* Add some missing GCS utilities.
* Remove some of the duplication between Get-GitHub-Issues.ipynb and our library methods (kubeflow#122)
* Start fetching the data from BigQuery.
* Using BigQuery turned out to be a lot better for bulk pulling all of the Kubeflow issues.
* Use HDF5 to save the data.
* Start a doc to keep track of notes for how to train a Kubeflow model. Add logic to save to HDF5 and do a sanity check compared to the inference code.
* Add a function to fetch the data using the GitHub API.
* Related to kubeflow#126: switching the embeddings service to use the GraphQL API rather than HTML fetching.
* I added this function as a way to sanity check that we get the same data using BigQuery as at inference time.
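
The sanity check mentioned in the commit message (same issue text from BigQuery as from the GitHub API at inference time) could look roughly like the sketch below. This is not the code from this commit: `IssueText`, `normalize`, `find_mismatches`, and the `owner/repo#number` keys are hypothetical names chosen for illustration, and the whitespace normalization is an assumption about which differences should be ignored.

```python
from typing import Dict, List, NamedTuple, Optional


class IssueText(NamedTuple):
    """Title and body of one issue (hypothetical container for this sketch)."""
    title: str
    body: str


def normalize(text: str) -> str:
    """Collapse whitespace so that e.g. CRLF vs LF does not count as a difference."""
    return " ".join(text.split())


def find_mismatches(bulk: Dict[str, IssueText],
                    live: Dict[str, IssueText]) -> List[str]:
    """Return the sorted keys whose title or body differ between the two sources.

    Keys that are present in only one of the two sources are reported as well.
    """
    mismatched = []
    for key in sorted(set(bulk) | set(live)):
        a: Optional[IssueText] = bulk.get(key)
        b: Optional[IssueText] = live.get(key)
        if a is None or b is None:
            mismatched.append(key)
        elif (normalize(a.title) != normalize(b.title)
              or normalize(a.body) != normalize(b.body)):
            mismatched.append(key)
    return mismatched
```

An empty result would suggest that the bulk export and the per-issue fetch agree on the text that feeds the embedding model; any key returned is worth inspecting by hand.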
1 parent 9bbdce3 commit 073d2d2

13 files changed

+1159
-677
lines changed

Issue_Embeddings/notebooks/01_AcquireData.ipynb

Lines changed: 14 additions & 4 deletions
@@ -155,9 +155,7 @@
 {
 "cell_type": "code",
 "execution_count": 43,
-"metadata": {
-"scrolled": false
-},
+"metadata": {},
 "outputs": [
 {
 "data": {
@@ -501,8 +499,20 @@
 "display_name": "Python 3",
 "language": "python",
 "name": "python3"
+},
+"language_info": {
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
+"name": "python",
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.6.9"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

Issue_Embeddings/notebooks/02_fastai_DataBunch.ipynb

Lines changed: 2 additions & 2 deletions
@@ -267,9 +267,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.7"
+"version": "3.6.9"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

Issue_Embeddings/notebooks/03_Create_Model.ipynb

Lines changed: 2 additions & 2 deletions
@@ -282,9 +282,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.7"
+"version": "3.6.9"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

Issue_Embeddings/notebooks/07_Get_Repo_TrainingData_BigQuery.ipynb

Lines changed: 2 additions & 2 deletions
@@ -451,9 +451,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.3"
+"version": "3.6.9"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

Issue_Embeddings/notebooks/Get-GitHub-Issues.ipynb

Lines changed: 886 additions & 179 deletions
Large diffs are not rendered by default.

Issue_Embeddings/requirements.txt

Lines changed: 11 additions & 3 deletions
@@ -17,15 +17,18 @@ cymem==2.0.2
 cytoolz==0.9.0.1
 decorator==4.4.0
 defusedxml==0.6.0
+dill==0.3.0
 entrypoints==0.3
 fastai==1.0.53.post3
 fastprogress==0.1.21
 flask-session==0.3.1
 flask==1.0.2
 ftfy==4.4.3
 gcsfs==0.2.1
-google-auth-oauthlib==0.3.0
-google-auth==1.6.3
+github3.py>=1.3.0
+google-auth-oauthlib
+google-auth
+google-cloud-bigquery
 gunicorn==19.9.0
 html5lib==1.0.1
 idna==2.8
@@ -40,13 +43,15 @@ jedi==0.13.3
 jinja2==2.10.1
 joblib==0.13.2
 jsonschema==3.0.1
+JSON-log-formatter==0.2.0
 jupyter-client==5.2.4
 jupyter-core==4.4.0
 kiwisolver==1.1.0
 markupsafe==1.1.1
 matplotlib==3.0.3
 mdparse==0.13
 mistune==0.8.4
+more-itertools==8.2.0
 murmurhash==1.0.2
 mwparserfromhell==0.5.3
 nbconvert==5.5.0
@@ -58,6 +63,7 @@ numpy==1.16.4
 oauthlib==3.0.1
 packaging==19.0
 pandas==0.24.2
+pandas-gbq
 pandocfilters==1.4.2
 parso==0.4.0
 passlib==1.7.1
@@ -89,17 +95,19 @@ rsa==4.0
 scikit-learn==0.20.3
 scipy==1.2.1
 send2trash==1.5.0
-six==1.12.0
+six>=1.13.0
 soupsieve==1.9.1
 spacy==2.1.4
 srsly==0.0.7
+tables
 terminado==0.8.2
 testpath==0.4.2
 textacy==0.7.1
 thinc==7.0.4
 timeout-decorator==0.4.1
 toolz==0.9.0
 tornado==6.0.2
+torch==1.1.0
 tqdm==4.32.2
 traitlets==4.3.2
 typing==3.6.6
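
This change mixes exact pins (`dill==0.3.0`) with lower bounds (`six>=1.13.0`) and bare names (`google-auth`, `tables`), so an install is no longer fully reproducible. A minimal sketch for listing the entries that are not exact pins follows; `unpinned` is a hypothetical helper, not part of this repository, and it assumes plain `name==version` lines without extras, markers, or `-r` includes.

```python
import re
from typing import List

# An exact pin looks like "dill==0.3.0"; ">=" bounds and bare names are not pins.
_EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+==\S+$")


def unpinned(requirements_text: str) -> List[str]:
    """Return the requirement lines that are not pinned with '=='."""
    result = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if line and not _EXACT_PIN.match(line):
            result.append(line)
    return result


if __name__ == "__main__":
    # A few lines taken from the diff above.
    sample = "\n".join([
        "dill==0.3.0",
        "github3.py>=1.3.0",
        "google-auth",
        "tables",
        "torch==1.1.0",
    ])
    for entry in unpinned(sample):
        print(entry)
```

Running it against the full requirements.txt would show which packages can drift between installs.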
