Skip to content

Commit e197e98

Browse files
author
jimmy.xj
committed
Update README.md
1 parent 102dd35 commit e197e98

File tree

1 file changed

+37
-2
lines changed

1 file changed

+37
-2
lines changed

Diff for: README.md

+37-2
Original file line numberDiff line numberDiff line change
@@ -9,16 +9,19 @@
99
DevOps-Eval is a comprehensive evaluation suite specifically designed for foundation models in the DevOps field. We hope DevOps-Eval could help developers, especially in the DevOps field, track the progress and analyze the important strengths/shortcomings of their models.
1010

1111

12-
📚 This repo contains questions and exercises related to DevOps, including the AIOps.
12+
📚 This repo contains questions and exercises related to DevOps, including the AIOps, ToolLearning;
1313

14-
💥️ There are currently **5977** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
14+
💥️ There are currently **7486** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
1515

1616
🔥 There are a total of **2840** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **time series forecasting**, and **root cause analysis**.
1717

18+
🔧 There are a total of **1509** samples in the ToolLearning subcategory, covering 239 tool scenes across 59 fields.
19+
1820
<p align="center"> <a href="resources/devops_diagram_zh.jpg"> <img src="images/data_info.png" style="width: 100%;" id="data_info"></a></p>
1921

2022

2123
## 🔔 News
24+
* **[2023.12.27]** Add 1509 **ToolLearning** samples, covering 239 tool categories across 59 fields; Release the associated evaluation leaderboard;
2225
* **[2023.11.27]** Add 487 operation scene samples and 640 time series forecasting samples; Update the Leaderboard;
2326
* **[2023.10.30]** Add the AIOps Leaderboard.
2427
* **[2023.10.25]** Add the AIOps samples, including log parsing, time series anomaly detection, time series classification and root cause analysis.
@@ -30,9 +33,11 @@ DevOps-Eval is a comprehensive evaluation suite specifically designed for founda
3033
- [🏆 Leaderboard](#-leaderboard)
3134
- [👀 DevOps](#-devops)
3235
- [🔥 AIOps](#-aiops)
36+
- [🔧 ToolLearning](#-toollearning)
3337
- [⏬ Data](#-data)
3438
- [👀 Notes](#-notes)
3539
- [🔥 AIOps Sample Example](#-aiops-sample-example)
40+
- [🔧 ToolLearning Sample Example](#-toollearning-sample-example)
3641
- [🚀 How to Evaluate](#-how-to-evaluate)
3742
- [🧭 TODO](#-todo)
3843
- [🏁 Licenses](#-licenses)
@@ -83,6 +88,9 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
8388
| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 63.45 | 75.46 | 67.17 |
8489

8590
### 🔥 AIOps
91+
92+
<details>
93+
8694
#### Zero Shot
8795
| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
8896
|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:|
@@ -119,6 +127,29 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
119127
| Internlm-7B—Chat | 62.57 | 12.8 | 22.33 | 21 | 50.31 | 36.69 |
120128
| Internlm-7B—Base | 48 | 33.2 | 29 | 35 | 31.56 | 35.85 |
121129

130+
</details>
131+
132+
133+
### 🔧 ToolLearning
134+
<details>
135+
136+
| **FuncCall-Filler** | dataset_name | fccr | 1-fcffr | 1-fcfnr | 1-fcfpr | 1-fcfnir | aar |
137+
|:-------------------:| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
138+
| Qwen-14b-chat | luban | 98.37 | 99.73 | 99.86 | 98.78 | 100 | 81.58 |
139+
| Qwen-7b-chat | luban | 99.46 | 99.86 | 100 | 99.59 | 100 | 79.25 |
140+
| Baichuan-7b-chat | luban | 97.96 | 99.32 | 100 | 98.64 | 100 | 89.53 |
141+
| Internlm-chat-7b | luban | 94.29 | 95.78 | 100 | 98.5 | 100 | 88.19 |
142+
| Qwen-14b-chat | fc_data | 98.78 | 99.73 | 100 | 99.05 | 100 | 94.7 |
143+
| Qwen-7b-chat | fc_data | 98.1 | 99.87 | 99.73 | 98.5 | 100 | 93.14 |
144+
| Baichuan-7b-chat | fc_data | 98.91 | 99.87 | 99.87 | 99.18 | 100 | 89.5 |
145+
| Internlm-chat-7b | fc_data | 61 | 100 | 97.68 | 63.32 | 100 | 69.46 |
146+
| CodeLLaMa-7b | fc_data | 50.58 | 100 | 98.07 | 52.51 | 100 | 63.59 |
147+
| CodeFuse-7b-16k | fc_data | 60.23 | 100 | 97.3 | 62.93 | 99.61 | 61.12 |
148+
| CodeFuse-7b-4k | fc_data | 47.88 | 100 | 96.14 | 51.74 | 99.61 | 61.85 |
149+
150+
151+
</details>
152+
122153

123154
## ⏬ Data
124155
#### Download
@@ -216,6 +247,9 @@ D: 12
216247
answer: D
217248
explanation: According to the analysis, the value 265 in the given time series at 12 o'clock is significantly larger than the surrounding data, indicating a sudden increase phenomenon. Therefore, selecting option D is correct.
218249
```
250+
#### 🔧 ToolLearning Sample Example
251+
252+
👀 👀The data format is compatible with OpenAI's Function Calling. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details.
219253

220254

221255
## 🚀 How to Evaluate
@@ -289,6 +323,7 @@ python src/run_eval.py \
289323
## 🧭 TODO
290324
- [x] add AIOps samples.
291325
- [x] add AIOps scenario **time series forecasting**.
326+
- [x] add **ToolLearning** samples.
292327
- [ ] increase in sample size.
293328
- [ ] add samples with the difficulty level set to hard.
294329
- [ ] add the English version of the samples.

0 commit comments

Comments
 (0)