Commit e169baf

chg: [RELEASE] Experimenting distilbert-base-uncased (AutoModelForMaskedLM) and gpt2 (AutoModelForCausalLM). The goal is to generate text.
1 parent 589dd09 commit e169baf

4 files changed: +121 -24 lines changed

README.md

+83 -13
@@ -7,6 +7,7 @@ Uses data from the ``vulnerability-lookup:meta`` container such as vulnrichment
77

88
## Datasets
99

10+
Various datasets generated are available on HuggingFace:
1011

1112
https://huggingface.co/datasets/circl/vulnerability-dataset
1213

@@ -21,32 +22,101 @@ Authenticate to HuggingFace:
2122
huggingface-cli login
2223
```
2324

24-
Creation of datasets:
25+
Install VulnTrain:
2526

2627
```bash
2728
$ pipx install VulnTrain
29+
```
30+
31+
Then ensure that the kvrocks database of Vulnerability-Lookup is running.
32+
2833

29-
$ vulntrain-create-dataset
34+
Creation of datasets:
35+
36+
```bash
37+
$ vulntrain-create-dataset --nb-rows 10000 --upload --repo-id CIRCL/vulnerability-dataset-10k
38+
Generating train split: 9999 examples [00:00, 177710.74 examples/s]
3039
DatasetDict({
3140
train: Dataset({
32-
features: ['id', 'title', 'description'],
33-
num_rows: 4
41+
features: ['id', 'title', 'description', 'cpes'],
42+
num_rows: 8999
3443
})
3544
test: Dataset({
36-
features: ['id', 'title', 'description'],
37-
num_rows: 1
45+
features: ['id', 'title', 'description', 'cpes'],
46+
num_rows: 1000
3847
})
3948
})
40-
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1317.72ba/s]
41-
Uploading the dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.16it/s]
42-
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2233.39ba/s]
43-
Uploading the dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.39it/s]
44-
README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 428/428 [00:00<00:00, 1.70MB/s]
49+
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 49.66ba/s]
50+
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.03s/it]
51+
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.36ba/s]
52+
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.19s/it]
53+
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 2.34MB/s]
4554
```
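The split sizes in the output above follow from the row count: 9999 generated rows divide into 8999 training and 1000 test examples. The split call itself is not part of this diff, so the 10% hold-out fraction and the round-up rule below are assumptions that merely reproduce the logged numbers; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (train, test) sizes for a hold-out split.

    Assumption: the test share is rounded up and the remainder goes to train.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_train = n_rows - n_test
    return n_train, n_test


# 9999 rows were generated for --nb-rows 10000 in the log above.
train_rows, test_rows = split_sizes(9999, 0.1)
print(train_rows, test_rows)  # 8999 1000
```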
4655

4756

48-
Train:
57+
### Train
58+
59+
#### Training for text generation
60+
61+
For now we are using distilbert-base-uncased (AutoModelForMaskedLM) or gpt2 (AutoModelForCausalLM).
62+
The goal is to generate text.
4963

5064
```bash
5165
$ vulntrain-train-dataset
52-
```
66+
Using CPU.
67+
[codecarbon WARNING @ 07:45:34] Multiple instances of codecarbon are allowed to run at the same time.
68+
[codecarbon INFO @ 07:45:34] [setup] RAM Tracking...
69+
[codecarbon INFO @ 07:45:34] [setup] CPU Tracking...
70+
[codecarbon WARNING @ 07:45:34] No CPU tracking mode found. Falling back on CPU constant mode.
71+
Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU
72+
73+
[codecarbon WARNING @ 07:45:36] We saw that you have a 13th Gen Intel(R) Core(TM) i7-1365U but we don't know it. Please contact us.
74+
[codecarbon INFO @ 07:45:36] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-1365U
75+
[codecarbon INFO @ 07:45:36] [setup] GPU Tracking...
76+
[codecarbon INFO @ 07:45:36] No GPU found.
77+
[codecarbon INFO @ 07:45:36] >>> Tracker's metadata:
78+
[codecarbon INFO @ 07:45:36] Platform system: Linux-6.1.0-31-amd64-x86_64-with-glibc2.36
79+
[codecarbon INFO @ 07:45:36] Python version: 3.13.0
80+
[codecarbon INFO @ 07:45:36] CodeCarbon version: 2.8.3
81+
[codecarbon INFO @ 07:45:36] Available RAM : 30.937 GB
82+
[codecarbon INFO @ 07:45:36] CPU count: 12
83+
[codecarbon INFO @ 07:45:36] CPU model: 13th Gen Intel(R) Core(TM) i7-1365U
84+
[codecarbon INFO @ 07:45:36] GPU count: None
85+
[codecarbon INFO @ 07:45:36] GPU model: None
86+
[codecarbon INFO @ 07:45:39] Saving emissions data to file /home/cedric/git/VulnTrain/emissions.csv
87+
Base model distilbert-base-uncased
88+
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 5.96MB/s]
89+
train-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 1.48M/1.48M [00:00<00:00, 6.92MB/s]
90+
test-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████| 170k/170k [00:00<00:00, 488kB/s]
91+
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████| 8999/8999 [00:00<00:00, 277013.99 examples/s]
92+
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 205250.99 examples/s]
93+
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 8999/8999 [00:01<00:00, 8233.47 examples/s]
94+
[codecarbon INFO @ 07:45:47] [setup] RAM Tracking...
95+
[codecarbon INFO @ 07:45:47] [setup] CPU Tracking...
96+
[codecarbon WARNING @ 07:45:47] No CPU tracking mode found. Falling back on CPU constant mode.
97+
Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU
98+
99+
[codecarbon WARNING @ 07:45:48] We saw that you have a 13th Gen Intel(R) Core(TM) i7-1365U but we don't know it. Please contact us.
100+
[codecarbon INFO @ 07:45:48] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-1365U
101+
[codecarbon INFO @ 07:45:48] [setup] GPU Tracking...
102+
[codecarbon INFO @ 07:45:48] No GPU found.
103+
[codecarbon INFO @ 07:45:48] >>> Tracker's metadata:
104+
[codecarbon INFO @ 07:45:48] Platform system: Linux-6.1.0-31-amd64-x86_64-with-glibc2.36
105+
[codecarbon INFO @ 07:45:48] Python version: 3.13.0
106+
[codecarbon INFO @ 07:45:48] CodeCarbon version: 2.8.3
107+
[codecarbon INFO @ 07:45:48] Available RAM : 30.937 GB
108+
[codecarbon INFO @ 07:45:48] CPU count: 12
109+
[codecarbon INFO @ 07:45:48] CPU model: 13th Gen Intel(R) Core(TM) i7-1365U
110+
[codecarbon INFO @ 07:45:48] GPU count: None
111+
[codecarbon INFO @ 07:45:48] GPU model: None
112+
[codecarbon INFO @ 07:45:51] Saving emissions data to file /home/cedric/git/VulnTrain/vulnerability/emissions.csv
113+
0%| | 0/2700 [00:00<?, ?it/s][codecarbon INFO @ 07:45:54] Energy consumed for RAM : 0.000048 kWh. RAM Power : 11.601505279541016 W
114+
[codecarbon INFO @ 07:45:54] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
115+
[codecarbon INFO @ 07:45:54] 0.000225 kWh of electricity used since the beginning.
116+
0%| | 1/2700 [00:07<5:45:36, 7.68s/it]
117+
```
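The two base models named in this section belong to different model classes, so the loader has to match the checkpoint. The sketch below shows one way to pick the class from the model name; it is not the project's actual training code. `objective_for` and `load_model` are hypothetical helpers, and only the `from_pretrained` calls are real `transformers` API.

```python
# Model names taken from the README text and the comment in summarize.py.
CAUSAL_MODELS = {"distilgpt2", "gpt2"}
MASKED_MODELS = {"distilbert-base-uncased"}


def objective_for(base_model: str) -> str:
    """Return "causal" or "masked" for a known base model name."""
    if base_model in CAUSAL_MODELS:
        return "causal"
    if base_model in MASKED_MODELS:
        return "masked"
    raise ValueError(f"Unknown base model: {base_model}")


def load_model(base_model: str):
    """Load the checkpoint with the matching Auto class (downloads weights)."""
    # Imported lazily so the mapping above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoModelForMaskedLM

    if objective_for(base_model) == "causal":
        return AutoModelForCausalLM.from_pretrained(base_model)
    return AutoModelForMaskedLM.from_pretrained(base_model)


print(objective_for("distilbert-base-uncased"))  # masked
print(objective_for("gpt2"))  # causal
```

A masked language model predicts tokens hidden inside a sentence, while a causal model predicts the next token from left to right, so the causal checkpoints are the more direct fit when the goal is free-form text generation.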
118+
119+
120+
#### Training for classification
121+
122+
tf-idf on the vulnerability descriptions.

pyproject.toml

+1 -1
@@ -5,7 +5,7 @@ build-backend = "poetry.core.masonry.api"
 
 [project]
 name = "VulnTrain"
-version = "0.1.0"
+version = "0.2.0"
 description = "Generate datasets amd models based on vulnerabilities descriptions from Vulnerability-Lookup."
 authors = [
     {name = "Cédric Bonhomme",email = "[email protected]"}

vulntrain/create_dataset.py

+36 -9
@@ -5,6 +5,7 @@
 
 """
 
+import argparse
 import json
 from typing import Any, Generator
 
@@ -13,7 +14,8 @@
 
 
 class VulnExtractor:
-    def __init__(self):
+    def __init__(self, nb_rows):
+        self.nb_rows = nb_rows
         self.valkey_client = valkey.Valkey(
             host="127.0.0.1",
             port=10002,
@@ -66,12 +68,8 @@ def get_all(
                 yield vuln
 
     def __call__(self):
-        # count = 0
+        count = 0
         for vuln in self.get_all("cvelistv5", True):
-            # count += 1
-            # if count == 1000:
-            #     return
-
             #
             # CVE id, title, and description
             #
@@ -137,6 +135,11 @@ def __call__(self):
 
             vuln_cpes = list(dict.fromkeys(cpe.lower() for cpe in vuln_cpes))
 
+
+            count += 1
+            if count == self.nb_rows:
+                return
+
             #
             # Create the data
             #
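The added check increments `count` and returns as soon as it equals `self.nb_rows`, and it sits before the "Create the data" step. If the record is yielded in that later step (not visible in this hunk), the generator stops one record short, which would match the 9999 examples logged for `--nb-rows 10000` in the README. With the default of 0 the condition never matches, so every row is returned. The sketch below compares that pattern with `itertools.islice`; `take`, `count_before_yield` and `fake_records` are hypothetical names used only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take(records: Iterable[T], nb_rows: int = 0) -> Iterator[T]:
    """Yield at most nb_rows records; 0 (the CLI default) means no limit."""
    if nb_rows <= 0:
        yield from records
    else:
        yield from islice(records, nb_rows)


def count_before_yield(records: Iterable[T], nb_rows: int) -> Iterator[T]:
    """Same shape as the check in the diff: return before yielding record nb_rows."""
    count = 0
    for record in records:
        count += 1
        if count == nb_rows:
            return
        yield record


# Stand-in data; the real records come from the kvrocks database.
fake_records = [{"id": f"CVE-0000-{i:04d}"} for i in range(20)]

print(len(list(take(fake_records, 10))))                # 10
print(len(list(count_before_yield(fake_records, 10))))  # 9
print(len(list(take(fake_records))))                    # 20
```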
@@ -150,7 +153,30 @@ def __call__(self):
 
 
 def main():
-    extractor = VulnExtractor()
+    parser = argparse.ArgumentParser(description="Dataset generation.")
+    parser.add_argument(
+        "--upload",
+        action="store_true",
+        help="Upload to HuggingFace.",
+        default=False,
+    )
+    parser.add_argument(
+        "--repo-id",
+        dest="repo_id",
+        help="Repo id.",
+        default="",
+    )
+    parser.add_argument(
+        "--nb-rows",
+        dest="nb_rows",
+        type=int,
+        help="Number of rows in the dataset.",
+        default=0,
+    )
+
+    args = parser.parse_args()
+
+    extractor = VulnExtractor(args.nb_rows)
 
     vulns = list(extractor())
 
@@ -165,8 +191,9 @@ def gen():
     )
 
     print(dataset_dict)
-    # dataset_dict.push_to_hub("cedricbonhomme/vulnerability-descriptions")
-    dataset_dict.push_to_hub("CIRCL/vulnerability-dataset")
+    if args.upload:
+        # dataset_dict.push_to_hub("cedricbonhomme/vulnerability-descriptions")
+        dataset_dict.push_to_hub(args.repo_id)
 
 
 if __name__ == "__main__":
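With these arguments, `--upload` without `--repo-id` passes an empty string to `push_to_hub`. The sketch below rebuilds the same three options and adds two validation checks that are not part of this commit. `build_parser` and `parse_cli` are hypothetical helper names; the flags and help texts are copied from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the options added in this commit."""
    parser = argparse.ArgumentParser(description="Dataset generation.")
    parser.add_argument("--upload", action="store_true", help="Upload to HuggingFace.")
    parser.add_argument("--repo-id", dest="repo_id", default="", help="Repo id.")
    parser.add_argument(
        "--nb-rows",
        dest="nb_rows",
        type=int,
        default=0,
        help="Number of rows in the dataset.",
    )
    return parser


def parse_cli(argv=None) -> argparse.Namespace:
    """Parse and validate arguments (the checks are an assumed addition)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.upload and not args.repo_id:
        parser.error("--upload requires --repo-id")
    if args.nb_rows < 0:
        parser.error("--nb-rows must be zero (no limit) or positive")
    return args


args = parse_cli(
    ["--nb-rows", "10000", "--upload", "--repo-id", "CIRCL/vulnerability-dataset-10k"]
)
print(args.nb_rows, args.upload, args.repo_id)
```

`parser.error` prints the usage message and exits with status 2, so a missing repository id is reported before any data is read from the database.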

vulntrain/summarize.py

+1 -1
@@ -17,7 +17,7 @@
 # - distilgpt2, gpt2
 #
 BASE_MODEL = "distilbert-base-uncased"  # distilgpt2, gpt2
-DATASET = "circl/vulnerability-dataset"
+DATASET = "circl/vulnerability-dataset-10k"
 MODEL_PATH = "./vulnerability"
 
 
 if torch.cuda.is_available():
