#!/usr/bin/env python
# coding: utf-8
# # Software development patterns through git data mining
#
# In[ ]:
import sys, os, getpass, warnings
warnings.filterwarnings('ignore')
from patterns.visualizer import Visualizer
# Create a `Visualizer` object for a project, e.g., for Spack, `vis = Visualizer('spack')`. The `get_data` method brings in data from the database and annotates it with *locc*, *locc+*, *locc-*, and *change-size-cos* = *1-similarity* (**expensive call**). The data is cached locally after it is fetched from the database and is subsequently loaded from disk, unless you specify the `cache=False` parameter, e.g., `vis.get_data(cache=False)`. Available projects include `'lammps', 'spack', 'petsc', 'Nek5000', 'nwchem', 'E3SM', 'qmcpack', 'qdpxx', 'LATTE', 'namd', 'enzo-dev'` (the full list can be obtained by calling the `Visualizer()` constructor without arguments).
#
# The `get_data` method automatically removes changes associated with non-code files. What counts as code is determined by file suffix: the list includes common suffixes plus those found by manually checking a sample of ECP projects for files that can be labeled as code (as opposed to simulation input data, documentation, or generated files). For the list of suffixes, refer to the `Patterns.code_suffixes` list in [patterns.py](https://github.com/HPCL/ideas-uo/blob/master/src/patterns/patterns.py). This filtering makes the analysis of decades of project data feasible. You can disable it by passing the `code_only=False` parameter to `get_data`. You can also explicitly remove non-code rows with `vis.remove_noncode()`. Optionally, you can remove files that are likely copies of external code (path contains `extern` or `contrib`) with `vis.remove_external()`.
# In[ ]:
vis = Visualizer(project_name='spack')
vis.get_data()
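# The code-only and external-code filtering described above can be sketched in plain Python. This is a minimal illustration, not the implementation in `patterns.py`: the suffix set, the helper names (`is_code`, `is_external`, `filter_paths`), and the example paths are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical suffix set for illustration only; the real list lives in
# Patterns.code_suffixes and is longer.
EXAMPLE_CODE_SUFFIXES = {".py", ".c", ".cpp", ".h", ".f90", ".sh"}

# Path fragments that suggest a copy of external code (taken from the text above).
EXTERNAL_MARKERS = ("extern", "contrib")


def is_code(path, suffixes=EXAMPLE_CODE_SUFFIXES):
    """Return True when the file suffix is in the code-suffix set."""
    return PurePosixPath(path).suffix.lower() in suffixes


def is_external(path, markers=EXTERNAL_MARKERS):
    """Return True when the path contains one of the external-code markers."""
    lowered = path.lower()
    return any(marker in lowered for marker in markers)


def filter_paths(paths, code_only=True, drop_external=False):
    """Keep only the paths that pass the requested filters."""
    kept = []
    for path in paths:
        if code_only and not is_code(path):
            continue
        if drop_external and is_external(path):
            continue
        kept.append(path)
    return kept


# Made-up paths, used only to exercise the filters.
example_paths = [
    "lib/spack/spack/spec.py",
    "README.md",
    "lib/spack/external/six.py",
    "share/data/input.dat",
]
print(filter_paths(example_paths))
print(filter_paths(example_paths, drop_external=True))
```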
# By default, the names of projects and developers are not shown in the figures. If you wish to include project names, set `vis.hide_names` to `False`.
# In[ ]:
vis.hide_names = False
# Let's start with some global views -- this plot shows the entire project's git history (including imports from other RCS). The three different metrics shown represent different ways of quantifying the magnitude of the change based on the differences produced by `git log`. The `locc+` and `locc-` lines are lines added and removed, respectively. The `change-size-cos` is one of many text difference metrics, which computes the "distance" between the old and new code snippets in each commit. We discuss distance metrics in more detail later.
# In[ ]:
df = vis.plot_overall_project_locc(log=True)
# To focus on a given year and/or month, set them with `set_year` and `set_month`.
# In[ ]:
vis.set_year(2020)
vis.set_month(10)
# To plot a given year, pass `time_range='year'`.
# In[ ]:
df = vis.plot_overall_project_locc(time_range='year',log=True)
# Similarly, to plot a given month, pass `time_range='month'`.
# In[ ]:
df = vis.plot_overall_project_locc(time_range='month',log=True)
# ### Finding trends with averages
# We can plot the annual averages timeline for the entire project's history (by default showing LOCC and cos distance) with `plot_total_avg`. Several moving average plots are available, with different aggregation granularities (year, month) and different sliding window sizes.
# In[ ]:
vis.plot_total_avg(log=True)
# We can also compute different moving averages, indicating the aggregation frequency with the `freq` parameter. The default is `quarter`.
# In[ ]:
vis.plot_total_moving_avgs(freq='year')
# In[ ]:
vis.plot_total_moving_avgs()
# In[ ]:
vis.plot_total_moving_avgs(freq='month')
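# To make the aggregation frequency and sliding window concrete, here is a minimal standard-library sketch. It is an assumption about the general technique, not the code behind `plot_total_moving_avgs`: the helper names, the trailing-window choice, and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date


def period_key(day, freq):
    """Map a date to its aggregation period for the given frequency."""
    if freq == "year":
        return (day.year,)
    if freq == "quarter":
        return (day.year, (day.month - 1) // 3 + 1)
    if freq == "month":
        return (day.year, day.month)
    raise ValueError("freq must be 'year', 'quarter', or 'month'")


def period_totals(changes, freq="quarter"):
    """Sum change sizes per period; periods without changes are simply absent."""
    totals = defaultdict(float)
    for day, size in changes:
        totals[period_key(day, freq)] += size
    return dict(sorted(totals.items()))


def moving_average(values, window):
    """Trailing moving average; early entries average whatever is available."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Made-up (date, change size) pairs.
changes = [
    (date(2020, 1, 15), 10),
    (date(2020, 2, 3), 20),
    (date(2020, 4, 20), 30),
    (date(2020, 8, 1), 60),
]
quarterly = period_totals(changes, freq="quarter")
print(quarterly)
print(moving_average(list(quarterly.values()), window=2))
```

# A larger window smooths the curve more but reacts more slowly to bursts of activity.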
# ## More timelines
# These timelines reflect changed lines of code as reported in git commit diffs. We have two functions that generate timeline plots of a change metric: `plot_proj_change_line` and `plot_proj_change_bubble`. By default, they show the entire range of selected dates and use the cos distance metric. You can specify a different metric with the `locc_metric` argument, e.g., `vis.plot_proj_change_line(locc_metric='locc')`.
# In[ ]:
df = vis.plot_proj_change_line()
# We can also look at both line counts (LOCC) and the distance-based `change-size-cos` at the same time. If not specified, the `time_range` argument defaults to `None`, which indicates the entire time period of the dataset.
# In[ ]:
vis.set_month(11)
df = vis.plot_proj_change_bubble(time_range="month")
# We can choose to zoom into a specific year. Recall that previously we did `vis.set_year(2020)`.
# In[ ]:
df = vis.plot_proj_change_bubble(time_range='year')
# Or a specific year range.
# In[ ]:
vis.select_year_range(2019,2020)
vis.plot_proj_change_bubble(time_range='year-year')
df = vis.plot_overall_project_locc(time_range='year-year',log=True)
# We can also zoom into a single month; recall we previously did:
# ```
# vis.set_year(2020)
# vis.set_month(11)
# ```
# In[ ]:
_ = vis.plot_proj_change_line(time_range='month')
# Or a month range.
# In[ ]:
vis.select_month_range(5,11)
df = vis.plot_proj_change_line(time_range='month-month')
# ## Using a text distance metric to adjust the size of the changes
#
# We use the Python [textdistance](https://github.com/life4/textdistance) module. The following algorithms have been integrated with the visualizer.
# ```
# 'cos', 'hamming', 'damerau_levenshtein', 'jaccard', 'jaro', 'jaro_winkler', 'bag', 'editex'
# ```
# Any of the above plots can be made with any line counting metric, typically specified through the `locc_metric` argument.
# In[ ]:
vis.set_diff_alg('jaccard')
df = vis.plot_proj_change_line()
_ = vis.plot_proj_change_bubble()
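# As a rough illustration of what a "1 - similarity" distance between the old and new code of a change looks like, the sketch below computes cosine and Jaccard distances over whitespace-separated tokens. This is a generic formulation under stated assumptions: the tokenization and the function names are hypothetical, and the values may differ from what the `textdistance` module or the visualizer produce.

```python
import math
from collections import Counter


def cosine_distance(old, new):
    """1 - cosine similarity of the token-count vectors of two snippets."""
    a, b = Counter(old.split()), Counter(new.split())
    if not a and not b:
        return 0.0  # two empty snippets are treated as identical
    if not a or not b:
        return 1.0  # nothing in common with an empty snippet
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(old, new):
    """1 - |intersection| / |union| of the token sets of two snippets."""
    a, b = set(old.split()), set(new.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Made-up one-line change: a single operator differs.
old_code = "x = a + b"
new_code = "x = a - b"
print(cosine_distance(old_code, new_code))
print(jaccard_distance(old_code, new_code))
```

# Both distances are 0 for identical snippets and approach 1 as the snippets share fewer tokens, so a small edit to a long line counts for less than a full rewrite.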
# ## More patterns
# Here we look at a combination of the high-churn and domain champion patterns. Basically, we focus on the files that have the most changes and restrict the developers to those with the biggest contributions. One tricky issue that makes this nontrivial is that contributors use different names for their contributions. We have implemented a fuzzy matching scheme for author names using the Python `fuzzywuzzy` package to consolidate single-author contributions as much as possible.
# In[ ]:
N = 10
vis.reset()
#vis.set_unique_authors() # force author recomputation, this is expensive, so the result will be cached
vis.set_max_ylabel_length(30)
top_N = vis.plot_top_N_heatmap(N, locc_metric='locc')
top_N.head()
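# The idea of consolidating author names by fuzzy matching can be sketched as follows. Note the substitution: this sketch uses the standard library's `difflib.SequenceMatcher` in place of `fuzzywuzzy`, and the normalization, the greedy strategy, the 0.85 threshold, and the names are all hypothetical rather than the project's actual scheme.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def consolidate_authors(names, threshold=0.85):
    """Greedily map each name to the first earlier name it closely resembles."""
    canonical = []  # names chosen as representatives, in order of appearance
    mapping = {}
    for name in names:
        match = next(
            (rep for rep in canonical if similarity(name, rep) >= threshold),
            None,
        )
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Made-up author names as they might appear in commit metadata.
authors = ["Jane Doe", "jane doe", "Jane M. Doe", "John Smith"]
for original, representative in consolidate_authors(authors).items():
    print(original, "->", representative)
```

# A lower threshold merges more spelling variants but raises the risk of merging two different people with similar names.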
# In[ ]:
top_N_cos = vis.plot_top_N_heatmap(N, locc_metric='change-size-cos')
top_N_cos.head()
# In[ ]:
vis.set_year(2020)
top_N_cos = vis.plot_top_N_heatmap(N, time_range='year', locc_metric='change-size-cos')
top_N_cos.head()
# We can also easily see the exact differences between different ways of measuring change. This is not something that we normally compute frequently, hence there isn't a special plot function.
# In[ ]:
file_dev_locc, _ = vis.make_file_developer_df(locc_metric='locc')
file_dev_diff, _ = vis.make_file_developer_df(locc_metric='change-size-cos')
# In[ ]:
diff_df = file_dev_locc.sub(file_dev_diff, axis=0)
print("Total number of developers: %d" % diff_df.shape[1])
df = vis.commit_data
df['locc - cos diff'] = df['locc']-df['change-size-cos']
d = vis.plot_top_N_heatmap(top_N = 10, locc_metric='locc - cos diff', my_df=df)
# We can generate the "hot-files" data for any time period. The period is selected in the same way as previously described.
# In[ ]:
N = 10
vis.set_year(2019)
vis.set_month(11)
vis.set_max_ylabel_length(30)
top_N = vis.plot_top_N_heatmap(N, time_range="month",locc_metric='locc')
top_N.head()
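# Conceptually, the hot-files view is a file-by-developer table of summed change sizes, restricted to the most-changed files. A minimal sketch of that aggregation, assuming records of `(path, author, size)`; the function names and the data are hypothetical and this is not the code behind `plot_top_N_heatmap`.

```python
from collections import defaultdict


def file_developer_totals(records):
    """Sum change sizes into a nested {file: {developer: total}} table."""
    totals = defaultdict(lambda: defaultdict(float))
    for path, author, size in records:
        totals[path][author] += size
    return totals


def top_n_files(records, n):
    """Return the n files with the largest total change, with per-developer totals."""
    totals = file_developer_totals(records)
    ranked = sorted(
        totals,
        key=lambda path: sum(totals[path].values()),
        reverse=True,
    )
    return [(path, dict(totals[path])) for path in ranked[:n]]


# Made-up change records: (file path, author, change size).
records = [
    ("a.py", "dev1", 10),
    ("a.py", "dev2", 5),
    ("b.py", "dev1", 3),
    ("c.py", "dev3", 20),
]
for path, per_developer in top_n_files(records, 2):
    print(path, per_developer)
```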
# ## In the zone
# Here we look at what days of the week and times of day developers are most productive. This plot also takes the usual arguments; the defaults are `time_range=None, locc_metric='change-size-cos'`. You can choose between `'sum'` and `'mean'` for aggregating the data over the specified time range (or the entire project if `time_range` is `None`). Using `sum` helps show when the bulk of the contributions are made, while `mean` better reveals finer-grained periods of high average productivity.
# In[ ]:
df = vis.plot_zone_heatmap(agg='mean')
# In[ ]:
df = vis.plot_zone_heatmap(agg='sum')
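# The difference between `sum` and `mean` can be seen in a small day-of-week by hour table. The sketch below is only an illustration of the aggregation: `zone_table`, the per-commit averaging, and the timestamps are hypothetical, not the internals of `plot_zone_heatmap`.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def zone_table(commits, agg="sum"):
    """Aggregate change sizes into {(day, hour): value} using sum or mean."""
    if agg not in ("sum", "mean"):
        raise ValueError("agg must be 'sum' or 'mean'")
    buckets = defaultdict(list)
    for when, size in commits:
        buckets[(DAYS[when.weekday()], when.hour)].append(size)
    if agg == "sum":
        return {key: sum(sizes) for key, sizes in buckets.items()}
    return {key: sum(sizes) / len(sizes) for key, sizes in buckets.items()}


# Made-up (timestamp, change size) pairs; 2020-10-05 is a Monday.
commits = [
    (datetime(2020, 10, 5, 9, 30), 10),
    (datetime(2020, 10, 5, 9, 45), 30),
    (datetime(2020, 10, 6, 14, 0), 8),
]
print(zone_table(commits, agg="sum"))
print(zone_table(commits, agg="mean"))
```

# With `sum`, a slot with many small commits can dominate; with `mean`, a slot with a few large commits stands out instead.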
# ## Did anything unusual happen in 2020?
# This specific function looks at how 2020 contributions compare with the average (and the previous year).
#
# We use the day-time heatmap again, zooming to specific years, in this case, 2019 and 2020. With `sum`, we see when most of the changes were made, while `mean` reveals when people are most productive.
# In[ ]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,8))
vis.set_year(2019)
df_2019 = vis.plot_zone_heatmap(time_range='year',fig_ax_pair = (fig,axes[0]),agg='mean')
vis.set_year(2020)
df_2020 = vis.plot_zone_heatmap(time_range='year',fig_ax_pair = (fig,axes[1]),agg='mean')
# In[ ]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,8))
vis.set_year(2019)
df_2019 = vis.plot_zone_heatmap(time_range='year',fig_ax_pair = (fig,axes[0]),agg='sum')
vis.set_year(2020)
df_2020 = vis.plot_zone_heatmap(time_range='year',fig_ax_pair = (fig,axes[1]),agg='sum')
# In[ ]:
vis.how_was_2020('change-size-cos')
# In[ ]:
vis.how_was_2020('locc')