@@ -7,6 +7,7 @@ Uses data from the ``vulnerability-lookup:meta`` container such as vulnrichment
## Datasets

+ Various datasets generated are available on HuggingFace:

https://huggingface.co/datasets/circl/vulnerability-dataset
@@ -21,32 +22,101 @@ Authenticate to HuggingFace:
huggingface-cli login
```

- Creation of datasets:
+ Install VulnTrain:

```bash
$ pipx install VulnTrain
+ ```
+
+ Then ensure that the kvrocks database of Vulnerability-Lookup is running.
+

- $ vulntrain-create-dataset
+ Creation of datasets:
+
+ ```bash
+ $ vulntrain-create-dataset --nb-rows 10000 --upload --repo-id CIRCL/vulnerability-dataset-10k
+ Generating train split: 9999 examples [00:00, 177710.74 examples/s]
DatasetDict({
    train: Dataset({
-         features: ['id', 'title', 'description'],
-         num_rows: 4
+         features: ['id', 'title', 'description', 'cpes'],
+         num_rows: 8999
    })
    test: Dataset({
-         features: ['id', 'title', 'description'],
-         num_rows: 1
+         features: ['id', 'title', 'description', 'cpes'],
+         num_rows: 1000
    })
})
- Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1317.72ba/s]
- Uploading the dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.16it/s]
- Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2233.39ba/s]
- Uploading the dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.39it/s]
- README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 428/428 [00:00<00:00, 1.70MB/s]
+ Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 49.66ba/s]
+ Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.03s/it]
+ Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.36ba/s]
+ Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.19s/it]
+ README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 2.34MB/s]
```
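The log above reports 9999 generated examples, split into 8999 training rows and 1000 test rows. Those counts are consistent with a 10% test split whose size is rounded up, although the ratio is not stated in the source. The sketch below only reproduces that arithmetic; the function name `split_sizes` and the default `test_fraction` are assumptions made for illustration, not part of VulnTrain.

```python
import math


def split_sizes(n_rows: int, test_fraction: float = 0.1) -> tuple[int, int]:
    """Return (n_train, n_test) for a split whose test size is rounded up.

    Assumption: a 10% test fraction, chosen because it matches the
    row counts shown in the log; the source does not state the ratio.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_train = n_rows - n_test
    return n_train, n_test


n_train, n_test = split_sizes(9999)
print(n_train, n_test)
```

With 9999 input rows this yields the same 8999 and 1000 that appear in the `DatasetDict` output.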
- Train:
+ ### Train
+
+ #### Training for text generation
+
+ For now, the base model is either distilbert-base-uncased (loaded with AutoModelForMaskedLM) or gpt2 (loaded with AutoModelForCausalLM).
+ The goal is to generate text.
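The two model classes named above correspond to two different training objectives: masked language modeling and causal language modeling. As a rough illustration of the difference, the sketch below builds training pairs for each objective from a list of tokens. It is a simplification written for this note, not VulnTrain's code: the helper names, the `[MASK]` placeholder handling, and the 15% masking rate are assumptions, and real masked-LM data collators apply a more elaborate replacement scheme.

```python
import random

MASK = "[MASK]"


def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens; the model predicts them from both sides."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # the hidden token is the target
        else:
            inputs.append(tok)
            labels.append(None)   # unmasked positions carry no target
    return inputs, labels


def causal_lm_example(tokens):
    """Causal LM: predict each token from the ones before it."""
    return tokens[:-1], tokens[1:]


tokens = "a heap overflow allows remote code execution".split()
print(masked_lm_example(tokens))
print(causal_lm_example(tokens))
```

The causal form is the one that maps most directly onto left-to-right text generation, since every target is simply the next token.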

```bash
$ vulntrain-train-dataset
- ```
+ Using CPU.
+ [codecarbon WARNING @ 07:45:34] Multiple instances of codecarbon are allowed to run at the same time.
+ [codecarbon INFO @ 07:45:34] [setup] RAM Tracking...
+ [codecarbon INFO @ 07:45:34] [setup] CPU Tracking...
+ [codecarbon WARNING @ 07:45:34] No CPU tracking mode found. Falling back on CPU constant mode.
+ Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU
+
+ [codecarbon WARNING @ 07:45:36] We saw that you have a 13th Gen Intel(R) Core(TM) i7-1365U but we don't know it. Please contact us.
+ [codecarbon INFO @ 07:45:36] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-1365U
+ [codecarbon INFO @ 07:45:36] [setup] GPU Tracking...
+ [codecarbon INFO @ 07:45:36] No GPU found.
+ [codecarbon INFO @ 07:45:36] >>> Tracker's metadata:
+ [codecarbon INFO @ 07:45:36] Platform system: Linux-6.1.0-31-amd64-x86_64-with-glibc2.36
+ [codecarbon INFO @ 07:45:36] Python version: 3.13.0
+ [codecarbon INFO @ 07:45:36] CodeCarbon version: 2.8.3
+ [codecarbon INFO @ 07:45:36] Available RAM : 30.937 GB
+ [codecarbon INFO @ 07:45:36] CPU count: 12
+ [codecarbon INFO @ 07:45:36] CPU model: 13th Gen Intel(R) Core(TM) i7-1365U
+ [codecarbon INFO @ 07:45:36] GPU count: None
+ [codecarbon INFO @ 07:45:36] GPU model: None
+ [codecarbon INFO @ 07:45:39] Saving emissions data to file /home/cedric/git/VulnTrain/emissions.csv
+ Base model distilbert-base-uncased
+ README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 5.96MB/s]
+ train-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 1.48M/1.48M [00:00<00:00, 6.92MB/s]
+ test-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████| 170k/170k [00:00<00:00, 488kB/s]
+ Generating train split: 100%|█████████████████████████████████████████████████████████████████████████| 8999/8999 [00:00<00:00, 277013.99 examples/s]
+ Generating test split: 100%|██████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 205250.99 examples/s]
+ Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 8999/8999 [00:01<00:00, 8233.47 examples/s]
+ [codecarbon INFO @ 07:45:47] [setup] RAM Tracking...
+ [codecarbon INFO @ 07:45:47] [setup] CPU Tracking...
+ [codecarbon WARNING @ 07:45:47] No CPU tracking mode found. Falling back on CPU constant mode.
+ Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU
+
+ [codecarbon WARNING @ 07:45:48] We saw that you have a 13th Gen Intel(R) Core(TM) i7-1365U but we don't know it. Please contact us.
+ [codecarbon INFO @ 07:45:48] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-1365U
+ [codecarbon INFO @ 07:45:48] [setup] GPU Tracking...
+ [codecarbon INFO @ 07:45:48] No GPU found.
+ [codecarbon INFO @ 07:45:48] >>> Tracker's metadata:
+ [codecarbon INFO @ 07:45:48] Platform system: Linux-6.1.0-31-amd64-x86_64-with-glibc2.36
+ [codecarbon INFO @ 07:45:48] Python version: 3.13.0
+ [codecarbon INFO @ 07:45:48] CodeCarbon version: 2.8.3
+ [codecarbon INFO @ 07:45:48] Available RAM : 30.937 GB
+ [codecarbon INFO @ 07:45:48] CPU count: 12
+ [codecarbon INFO @ 07:45:48] CPU model: 13th Gen Intel(R) Core(TM) i7-1365U
+ [codecarbon INFO @ 07:45:48] GPU count: None
+ [codecarbon INFO @ 07:45:48] GPU model: None
+ [codecarbon INFO @ 07:45:51] Saving emissions data to file /home/cedric/git/VulnTrain/vulnerability/emissions.csv
+ 0%| | 0/2700 [00:00<?, ?it/s][codecarbon INFO @ 07:45:54] Energy consumed for RAM : 0.000048 kWh. RAM Power : 11.601505279541016 W
+ [codecarbon INFO @ 07:45:54] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
+ [codecarbon INFO @ 07:45:54] 0.000225 kWh of electricity used since the beginning.
+ 0%| | 1/2700 [00:07<5:45:36, 7.68s/it]
+ ```
+
+
+ #### Training for classification
+
+ tf-idf on the vulnerability descriptions.
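
To make the classification note concrete, here is a minimal standard-library sketch of TF-IDF weighting over a few descriptions. It assumes whitespace tokenization and the plain `tf * log(N / df)` formula; the sample descriptions are invented for illustration, and nothing here is taken from VulnTrain's implementation. Library vectorizers typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical descriptions, made up for this example.
descriptions = [
    "buffer overflow in image parser",
    "sql injection in login form",
    "buffer overflow in network daemon",
]
weights = tfidf(descriptions)
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

Terms that occur in every description (here, "in") get a weight of zero, while terms unique to one description score highest, which is what makes the resulting vectors usable as features for a classifier.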