Commit f19fc80

Update Leaderboard
1 parent facd641 commit f19fc80

3 files changed: +316 −79 lines

README.md

+40 −67
@@ -15,6 +15,7 @@
 - 📊 Automatically produces **meaningful metrics** for in-depth assessment and comparison.
 
 ### [New] TypeEvalPy Autogen
+
 - 🤖 **Autogenerates code snippets** and ground truth to scale the benchmark based on the original `TypeEvalPy` benchmark.
 - 📈 The autogen benchmark now contains:
   - **Python files**: 7121
@@ -30,72 +31,44 @@
 | [HiTyper](https://github.com/JohnnyPeng18/HiTyper) | [Pytype](https://github.com/google/pytype) |
 | [Scalpel](https://github.com/SMAT-Lab/Scalpel/issues) | [TypeT5](https://github.com/utopia-group/TypeT5) |
 | [Type4Py](https://github.com/saltudelft/type4py) | |
-| [GPT-4](https://openai.com/research/gpt-4) | |
+| [GPT](https://openai.com) | |
 | [Ollama](https://ollama.ai) | |
 
 ---
 
----
-
 ## 🏆 TypeEvalPy Leaderboard
 
-Below is a comparison showcasing exact matches across different tools, coupled with `top_n` predictions for ML-based tools.
-
-| Rank | 🛠️ Tool | Top-n | Function Return Type | Function Parameter Type | Local Variable Type | Total |
-| ---- | ------- | ----- | -------------------- | ----------------------- | ------------------- | ----- |
-| 1 | **[HeaderGen](https://github.com/secure-software-engineering/HeaderGen)** | 1 | 186 | 56 | 322 | 564 |
-| 2 | **[Jedi](https://github.com/davidhalter/jedi)** | 1 | 122 | 0 | 293 | 415 |
-| 3 | **[Pyright](https://github.com/microsoft/pyright)** | 1 | 100 | 8 | 297 | 405 |
-| 4 | **[HiTyper](https://github.com/JohnnyPeng18/HiTyper)** | 1<br>3<br>5 | 163<br>173<br>175 | 27<br>37<br>37 | 179<br>225<br>229 | 369<br>435<br>441 |
-| 5 | **[HiTyper (static)](https://github.com/JohnnyPeng18/HiTyper)** | 1 | 141 | 7 | 102 | 250 |
-| 6 | **[Scalpel](https://github.com/SMAT-Lab/Scalpel/issues)** | 1 | 155 | 32 | 6 | 193 |
-| 7 | **[Type4Py](https://github.com/saltudelft/type4py)** | 1<br>3<br>5 | 39<br>103<br>109 | 19<br>31<br>31 | 99<br>167<br>174 | 157<br>301<br>314 |
-
-_<sub>(Auto-generated based on the the analysis run on 20 Oct 2023)</sub>_
-
----
-
-## 🏆🤖 TypeEvalPy LLM Leaderboard
-
-Below is a comparison showcasing exact matches for LLMs.
-
-| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
-| ---- | ------- | -------------------- | ----------------------- | ------------------- | ----- |
-| 1 | **[GPT-4](https://openai.com/research/gpt-4)** | 225 | 85 | 465 | 775 |
-| 2 | **[Finetuned:GPT 3.5](https://platform.openai.com/docs/models/gpt-3-5-turbo)** | 209 | 85 | 436 | 730 |
-| 3 | **[codellama:13b-instruct](https://huggingface.co/docs/transformers/model_doc/code_llama)** | 199 | 75 | 425 | 699 |
-| 4 | **[GPT 3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo)** | 188 | 73 | 429 | 690 |
-| 5 | **[codellama:34b-instruct](https://huggingface.co/docs/transformers/model_doc/code_llama)** | 190 | 52 | 425 | 667 |
-| 6 | phind-codellama:34b-v2 | 182 | 60 | 399 | 641 |
-| 7 | codellama:7b-instruct | 171 | 72 | 384 | 627 |
-| 8 | dolphin-mistral | 184 | 76 | 356 | 616 |
-| 9 | codebooga | 186 | 56 | 354 | 596 |
-| 10 | llama2:70b | 168 | 55 | 342 | 565 |
-| 11 | **[HeaderGen](https://github.com/secure-software-engineering/HeaderGen)** | 186 | 56 | 321 | 563 |
-| 12 | wizardcoder:13b-python | 170 | 74 | 317 | 561 |
-| 13 | llama2:13b | 153 | 40 | 283 | 476 |
-| 14 | mistral:instruct | 155 | 45 | 250 | 450 |
-| 15 | mistral:v0.2 | 155 | 45 | 248 | 448 |
-| 16 | vicuna:13b | 153 | 35 | 260 | 448 |
-| 17 | vicuna:33b | 133 | 29 | 267 | 429 |
-| 18 | **[Jedi](https://github.com/davidhalter/jedi)** | 122 | 0 | 293 | 415 |
-| 19 | **[Pyright](https://github.com/microsoft/pyright)** | 100 | 8 | 297 | 405 |
-| 19 | wizardcoder:7b-python | 103 | 48 | 254 | 405 |
-| 20 | llama2:7b | 140 | 34 | 216 | 390 |
-| 21 | **[HiTyper](https://github.com/JohnnyPeng18/HiTyper)** | 163 | 27 | 179 | 369 |
-| 22 | wizardcoder:34b-python | 140 | 43 | 178 | 361 |
-| 23 | orca2:7b | 117 | 27 | 184 | 328 |
-| 24 | vicuna:7b | 131 | 17 | 172 | 320 |
-| 25 | orca2:13b | 113 | 19 | 166 | 298 |
-| 26 | **[Scalpel](https://github.com/SMAT-Lab/Scalpel/issues)** | 155 | 32 | 6 | 193 |
-| 27 | **[Type4Py](https://github.com/saltudelft/type4py)** | 39 | 19 | 99 | 157 |
-| 28 | tinyllama | 3 | 0 | 23 | 26 |
-| 29 | phind-codellama:34b-python | 5 | 0 | 15 | 20 |
-| 30 | codellama:13b-python | 0 | 0 | 0 | 0 |
-| 31 | codellama:34b-python | 0 | 0 | 0 | 0 |
-| 32 | codellama:7b-python | 0 | 0 | 0 | 0 |
-
-_<sub>(Auto-generated based on the the analysis run on 14 Jan 2024)</sub>_
+Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.
+
+| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
+| ---- | ------- | -------------------- | ----------------------- | ------------------- | ----- |
+| 1 | **[mistral-large-it-2407-123b](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407)** | 16701 | 728 | 57550 | 74979 |
+| 2 | **[qwen2-it-72b](https://huggingface.co/Qwen/Qwen2-72B-Instruct)** | 16488 | 629 | 55160 | 72277 |
+| 3 | **[llama3.1-it-70b](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)** | 16648 | 580 | 54445 | 71673 |
+| 4 | **[gemma2-it-27b](https://huggingface.co/google/gemma-2-27b-it)** | 16342 | 599 | 49772 | 66713 |
+| 5 | **[codestral-v0.1-22b](https://huggingface.co/mistralai/Codestral-22B-v0.1)** | 16456 | 706 | 49379 | 66541 |
+| 6 | **[codellama-it-34b](https://huggingface.co/meta-llama/CodeLlama-34b-Instruct-hf)** | 15960 | 473 | 48957 | 65390 |
+| 7 | **[mistral-nemo-it-2407-12.2b](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)** | 16221 | 526 | 48439 | 65186 |
+| 8 | **[mistral-v0.3-it-7b](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)** | 16686 | 472 | 47935 | 65093 |
+| 9 | **[phi3-medium-it-14b](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)** | 16802 | 467 | 45121 | 62390 |
+| 10 | **[llama3.1-it-8b](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)** | 16125 | 492 | 44313 | 60930 |
+| 11 | **[codellama-it-13b](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf)** | 16214 | 479 | 43021 | 59714 |
+| 12 | **[phi3-small-it-7.3b](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)** | 16155 | 422 | 38093 | 54670 |
+| 13 | **[qwen2-it-7b](https://huggingface.co/Qwen/Qwen2-7B-Instruct)** | 15684 | 313 | 38109 | 54106 |
+| 14 | **[HeaderGen](https://github.com/ashwinprasadme/headergen)** | 14086 | 346 | 36370 | 50802 |
+| 15 | **[phi3-mini-it-3.8b](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)** | 15908 | 320 | 30341 | 46569 |
+| 16 | **[phi3.5-mini-it-3.8b](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)** | 15763 | 362 | 28694 | 44819 |
+| 17 | **[codellama-it-7b](https://huggingface.co/meta-llama/CodeLlama-7b-Instruct-hf)** | 13779 | 318 | 29346 | 43443 |
+| 18 | **[Jedi](https://github.com/davidhalter/jedi)** | 13160 | 0 | 15403 | 28563 |
+| 19 | **[Scalpel](https://github.com/SMAT-Lab/Scalpel/issues)** | 15383 | 171 | 18 | 15572 |
+| 20 | **[gemma2-it-9b](https://huggingface.co/google/gemma-2-9b-it)** | 1611 | 66 | 5464 | 7141 |
+| 21 | **[Type4Py](https://github.com/saltudelft/type4py)** | 3143 | 38 | 2243 | 5424 |
+| 22 | **[tinyllama-1.1b](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)** | 1514 | 28 | 2699 | 4241 |
+| 23 | **[mixtral-v0.1-it-8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)** | 3235 | 33 | 377 | 3645 |
+| 24 | **[phi3.5-moe-it-41.9b](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct)** | 3090 | 25 | 273 | 3388 |
+| 25 | **[gemma2-it-2b](https://huggingface.co/google/gemma-2-2b-it)** | 1497 | 41 | 1848 | 3386 |
+
+_<sub>(Auto-generated based on the analysis run on 30 Aug 2024)</sub>_
 
 ---
 
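The columns in the new leaderboard count exact matches between a tool's inferred types and the benchmark's ground truth, split into the three annotation categories. As a rough illustration only, here is a minimal Python sketch of such a tally; the JSON layout and field names (`file`, `line_number`, `name`, `category`, `type`) are assumptions for illustration, not TypeEvalPy's actual schema:

```python
import json

# Minimal sketch of an exact-match tally. The record layout
# (file / line_number / name / category / type) is hypothetical --
# TypeEvalPy's real ground-truth schema may differ.

CATEGORIES = ("function_returns", "function_parameters", "local_variables")

def site_key(entry):
    # Identify an annotation site by file, line, and annotated name.
    return (entry["file"], entry["line_number"], entry.get("name"))

def exact_matches(ground_truth, predictions):
    """Count sites where the top prediction equals the expected type."""
    predicted = {site_key(e): e["type"] for e in predictions}
    counts = {c: 0 for c in CATEGORIES}
    for gt in ground_truth:
        pred = predicted.get(site_key(gt))
        if pred and gt["category"] in counts and pred[0] == gt["type"][0]:
            counts[gt["category"]] += 1  # top-1 prediction matches ground truth
    counts["total"] = sum(counts[c] for c in CATEGORIES)
    return counts

if __name__ == "__main__":
    with open("ground_truth.json") as f_gt, open("tool_output.json") as f_out:
        print(exact_matches(json.load(f_gt), json.load(f_out)))
```

A "Total" column like the one above is then just the sum of the three per-category counts.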
@@ -125,15 +98,16 @@ Each results folder will have a timestamp, allowing you to easily track and comp
 Here is how the auto-generated CSV tables relate to the paper's tables:
 
 - **Table 1** in the paper is derived from three auto-generated CSV tables:
-  - `paper_table_1.csv` - details Exact matches by type category.
-  - `paper_table_2.csv` - lists Exact matches for 18 micro-benchmark categories.
-  - `paper_table_3.csv` - provides Sound and Complete values for tools.
 
+  - `paper_table_1.csv` - details Exact matches by type category.
+  - `paper_table_2.csv` - lists Exact matches for 18 micro-benchmark categories.
+  - `paper_table_3.csv` - provides Sound and Complete values for tools.
 
 - **Table 2** in the paper is based on the following CSV table:
-  - `paper_table_5.csv` - shows Exact matches with top_n values for machine learning tools.
+  - `paper_table_5.csv` - shows Exact matches with top_n values for machine learning tools.
+
+Additionally, there are CSV tables that are _not_ included in the paper:
 
-Additionally, there are CSV tables that are *not* included in the paper:
 - `paper_table_4.csv` - containing Sound and Complete values for 18 micro-benchmark categories.
 - `paper_table_6.csv` - featuring Sensitivity analysis.
 </details>
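To inspect those CSV tables programmatically, a minimal pandas sketch; the results folder name below is hypothetical, so substitute the timestamped directory your run actually produced:

```python
import pandas as pd

# Hypothetical path: replace with the timestamped results folder from your run.
results_dir = "results_<timestamp>"

# Per the mapping above: Table 1 of the paper draws on paper_table_1.csv
# through paper_table_3.csv; Table 2 draws on paper_table_5.csv.
exact_by_category = pd.read_csv(f"{results_dir}/paper_table_1.csv")
top_n_ml_tools = pd.read_csv(f"{results_dir}/paper_table_5.csv")

print(exact_by_category.head())
print(top_n_ml_tools.head())
```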
@@ -260,7 +234,6 @@ To generate an extended version of the original TypeEvalPy benchmark to include
    cd autogen
    ```
 
-
 2. **Execute the Generation Script**
 
    Run the following command to start the generation process:
@@ -270,7 +243,7 @@ To generate an extended version of the original TypeEvalPy benchmark to include
    ```
 
    This will generate a folder in the repo root containing the autogen benchmark, named with the current date.
-
+
 ---
 
 ### 🤝 Contributing
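Regarding the generation step in the last hunk: since the script writes a date-stamped folder to the repo root, you can locate the most recent run with a small sketch like this; the `*autogen*` glob pattern is only a guess at the naming scheme, so adjust it to whatever your run produces:

```python
from pathlib import Path

# The README states generation writes a date-stamped folder to the repo root.
# "*autogen*" is an assumed pattern; change it to match the actual folder name.
candidates = sorted(
    (p for p in Path(".").glob("*autogen*") if p.is_dir()),
    key=lambda p: p.stat().st_mtime,
)
latest = candidates[-1] if candidates else None
print(f"Most recent autogen benchmark folder: {latest}")
```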
