- 📊 Automatically produces **meaningful metrics** for in-depth assessment and comparison.

### [New] TypeEvalPy Autogen

- 🤖 **Autogenerates code snippets** and ground truth to scale the benchmark based on the original `TypeEvalPy` benchmark (an illustrative snippet/ground-truth pair is sketched below).
- 📈 The autogen benchmark now contains:
  - **Python files**: 7121
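To make this concrete, here is a minimal sketch of what an autogenerated snippet and its accompanying ground truth could look like. The function, the values, and the simplified ground-truth schema are illustrative assumptions, not verbatim output of the generator.

```python
# Hypothetical autogenerated snippet: a tiny function exercising one
# type-inference case (illustrative only, not actual generator output).
def id_func(arg):
    x = arg
    return x

result = id_func(42)

# The paired ground truth records the types a tool is expected to infer.
# The field names below are simplified for illustration; the benchmark's
# real fact format may differ.
ground_truth = [
    {"function": "id_func", "parameter": "arg", "type": ["int"]},  # parameter type
    {"function": "id_func", "variable": "x", "type": ["int"]},     # local variable type
    {"function": "id_func", "type": ["int"]},                      # return type
]
```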
| [HiTyper](https://github.com/JohnnyPeng18/HiTyper) | [Pytype](https://github.com/google/pytype) |
| [Scalpel](https://github.com/SMAT-Lab/Scalpel/issues) | [TypeT5](https://github.com/utopia-group/TypeT5) |
| [Type4Py](https://github.com/saltudelft/type4py) | |
| [GPT](https://openai.com) | |
| [Ollama](https://ollama.ai) | |
---
## 🏆 TypeEvalPy Leaderboard
Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.

| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| ---- | ------- | -------------------- | ----------------------- | ------------------- | ----- |
| 1 | **[mistral-large-it-2407-123b](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407)** | 16701 | 728 | 57550 | 74979 |
| 2 | **[qwen2-it-72b](https://huggingface.co/Qwen/Qwen2-72B-Instruct)** | 16488 | 629 | 55160 | 72277 |
| 3 | **[llama3.1-it-70b](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)** | 16648 | 580 | 54445 | 71673 |
| 4 | **[gemma2-it-27b](https://huggingface.co/google/gemma-2-27b-it)** | 16342 | 599 | 49772 | 66713 |
| 5 | **[codestral-v0.1-22b](https://huggingface.co/mistralai/Codestral-22B-v0.1)** | 16456 | 706 | 49379 | 66541 |
| 6 | **[codellama-it-34b](https://huggingface.co/meta-llama/CodeLlama-34b-Instruct-hf)** | 15960 | 473 | 48957 | 65390 |
| 7 | **[mistral-nemo-it-2407-12.2b](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)** | 16221 | 526 | 48439 | 65186 |
| 8 | **[mistral-v0.3-it-7b](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)** | 16686 | 472 | 47935 | 65093 |
| 9 | **[phi3-medium-it-14b](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)** | 16802 | 467 | 45121 | 62390 |
| 10 | **[llama3.1-it-8b](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)** | 16125 | 492 | 44313 | 60930 |
| 11 | **[codellama-it-13b](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf)** | 16214 | 479 | 43021 | 59714 |
| 12 | **[phi3-small-it-7.3b](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)** | 16155 | 422 | 38093 | 54670 |
| 13 | **[qwen2-it-7b](https://huggingface.co/Qwen/Qwen2-7B-Instruct)** | 15684 | 313 | 38109 | 54106 |
| 14 | **[HeaderGen](https://github.com/ashwinprasadme/headergen)** | 14086 | 346 | 36370 | 50802 |
| 15 | **[phi3-mini-it-3.8b](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)** | 15908 | 320 | 30341 | 46569 |
| 16 | **[phi3.5-mini-it-3.8b](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)** | 15763 | 362 | 28694 | 44819 |
| 17 | **[codellama-it-7b](https://huggingface.co/meta-llama/CodeLlama-7b-Instruct-hf)** | 13779 | 318 | 29346 | 43443 |
| 18 | **[Jedi](https://github.com/davidhalter/jedi)** | 13160 | 0 | 15403 | 28563 |
| 19 | **[Scalpel](https://github.com/SMAT-Lab/Scalpel/issues)** | 15383 | 171 | 18 | 15572 |
| 20 | **[gemma2-it-9b](https://huggingface.co/google/gemma-2-9b-it)** | 1611 | 66 | 5464 | 7141 |
| 21 | **[Type4Py](https://github.com/saltudelft/type4py)** | 3143 | 38 | 2243 | 5424 |
| 22 | **[tinyllama-1.1b](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)** | 1514 | 28 | 2699 | 4241 |
| 23 | **[mixtral-v0.1-it-8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)** | 3235 | 33 | 377 | 3645 |
| 24 | **[phi3.5-moe-it-41.9b](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct)** | 3090 | 25 | 273 | 3388 |
| 25 | **[gemma2-it-2b](https://huggingface.co/google/gemma-2-2b-it)** | 1497 | 41 | 1848 | 3386 |

_<sub>(Auto-generated based on the analysis run on 30 Aug 2024)</sub>_
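To reproduce such totals on your own tool output, the sketch below shows one way exact matches per category could be counted by comparing predictions against the ground truth. It is a minimal illustration under an assumed, simplified fact schema (`file`, `line_number`, `function`, `parameter`, `variable`, `type`) and is not the repository's actual evaluation code; the three category columns and the Total column above correspond to the four counters it produces.

```python
from collections import Counter

def exact_match_counts(ground_truth, predictions):
    """Count exact type matches per category (illustrative sketch only)."""

    def key(fact):
        # Identify the annotated entity by location and name.
        return (
            fact.get("file"),
            fact.get("line_number"),
            fact.get("function"),
            fact.get("parameter"),
            fact.get("variable"),
        )

    # Normalise predicted types so comparison is order- and case-insensitive.
    predicted = {key(p): sorted(t.lower() for t in p.get("type", [])) for p in predictions}

    counts = Counter()
    for fact in ground_truth:
        if fact.get("parameter") is not None:
            category = "function_parameter"
        elif fact.get("variable") is not None:
            category = "local_variable"
        else:
            category = "function_return"
        expected = sorted(t.lower() for t in fact.get("type", []))
        if predicted.get(key(fact)) == expected:
            counts[category] += 1

    counts["total"] = sum(counts.values())
    return dict(counts)
```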
---
Here is how the auto-generated CSV tables relate to the paper's tables:

**Table 1** in the paper is derived from three auto-generated CSV tables:

- `paper_table_1.csv` - details Exact matches by type category.
- `paper_table_2.csv` - lists Exact matches for 18 micro-benchmark categories.
- `paper_table_3.csv` - provides Sound and Complete values for tools.

**Table 2** in the paper is based on the following CSV table:

- `paper_table_5.csv` - shows Exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are _not_ included in the paper:

- `paper_table_4.csv` - containing Sound and Complete values for 18 micro-benchmark categories.
- `paper_table_6.csv` - featuring Sensitivity analysis.
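If you want to inspect these tables programmatically, a small sketch like the following is enough; the results folder name is a placeholder for the timestamped directory produced by your own run, and nothing is assumed about the CSV columns beyond what `pandas` reads.

```python
import pandas as pd

# Placeholder: substitute the timestamped results folder created by your run.
results_dir = "results_<timestamp>"

# Load the auto-generated tables described above.
exact_by_category = pd.read_csv(f"{results_dir}/paper_table_1.csv")   # feeds Table 1
sound_complete = pd.read_csv(f"{results_dir}/paper_table_3.csv")      # Sound/Complete values

print(exact_by_category.head())
print(sound_complete.head())
```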
</details>
cd autogen
```

2. **Execute the Generation Script**

   Run the following command to start the generation process:
```
This will generate a folder in the repo root containing the autogen benchmark, named with the current date.

---
### 🤝 Contributing