chg: [RELEASE] Experimenting with distilbert-base-uncased (AutoModelForMaskedLM) and gpt2 (AutoModelForCausalLM). The goal is to generate text.

cedricbonhomme · cedricbonhomme · commit e169bafe352a · 2025-02-20T07:53:15.000+01:00
diff --git a/README.md b/README.md
@@ -7,6 +7,7 @@ Uses data from the ``vulnerability-lookup:meta`` container such as vulnrichment
 
 ## Datasets
 
+The generated datasets are available on HuggingFace:
 
 https://huggingface.co/datasets/circl/vulnerability-dataset
 
@@ -21,32 +22,101 @@ Authenticate to HuggingFace:
 huggingface-cli login
 ```
 
-Creation of datasets:
+Install VulnTrain:
 
 ```bash
 $ pipx install VulnTrain
+```
+
+Then ensure that the kvrocks database of Vulnerability-Lookup is running.
+
 
-$ vulntrain-create-dataset 
+Creation of datasets:
+
+```bash
+$ vulntrain-create-dataset --nb-rows 10000 --upload --repo-id CIRCL/vulnerability-dataset-10k
+Generating train split: 9999 examples [00:00, 177710.74 examples/s]
 DatasetDict({
     train: Dataset({
-        features: ['id', 'title', 'description'],
-        num_rows: 4
+        features: ['id', 'title', 'description', 'cpes'],
+        num_rows: 8999
     })
     test: Dataset({
-        features: ['id', 'title', 'description'],
-        num_rows: 1
+        features: ['id', 'title', 'description', 'cpes'],
+        num_rows: 1000
     })
 })
-Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1317.72ba/s]
-Uploading the dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.16it/s]
-Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2233.39ba/s]
-Uploading the dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.39it/s]
-README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 428/428 [00:00<00:00, 1.70MB/s]
+Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 49.66ba/s]
+Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.03s/it]
+Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.36ba/s]
+Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.19s/it]
+README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 2.34MB/s]
 ```
 
 
-Train:
+### Train
+
+#### Training for text generation
+
+For now, we are using distilbert-base-uncased (AutoModelForMaskedLM) or gpt2 (AutoModelForCausalLM).
+The goal is to generate text.
 
 ```bash
 $ vulntrain-train-dataset 
-```
+Using CPU.
+[codecarbon WARNING @ 07:45:34] Multiple instances of codecarbon are allowed to run at the same time.
+[codecarbon INFO @ 07:45:34] [setup] RAM Tracking...
+[codecarbon INFO @ 07:45:34] [setup] CPU Tracking...
+[codecarbon WARNING @ 07:45:34] No CPU tracking mode found. Falling back on CPU constant mode. 
+ Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU
+
+[codecarbon WARNING @ 07:45:36] We saw that you have a 13th Gen Intel(R) Core(TM) i7-1365U but we don't know it. Please contact us.
+[codecarbon INFO @ 07:45:36] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-1365U
+[codecarbon INFO @ 07:45:36] [setup] GPU Tracking...
+[codecarbon INFO @ 07:45:36] No GPU found.
+[codecarbon INFO @ 07:45:36] >>> Tracker's metadata:
+[codecarbon INFO @ 07:45:36]   Platform system: Linux-6.1.0-31-amd64-x86_64-with-glibc2.36
+[codecarbon INFO @ 07:45:36]   Python version: 3.13.0
+[codecarbon INFO @ 07:45:36]   CodeCarbon version: 2.8.3
+[codecarbon INFO @ 07:45:36]   Available RAM : 30.937 GB
+[codecarbon INFO @ 07:45:36]   CPU count: 12
+[codecarbon INFO @ 07:45:36]   CPU model: 13th Gen Intel(R) Core(TM) i7-1365U
+[codecarbon INFO @ 07:45:36]   GPU count: None
+[codecarbon INFO @ 07:45:36]   GPU model: None
+[codecarbon INFO @ 07:45:39] Saving emissions data to file /home/cedric/git/VulnTrain/emissions.csv
+Base model distilbert-base-uncased
+README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 503/503 [00:00<00:00, 5.96MB/s]
+train-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 1.48M/1.48M [00:00<00:00, 6.92MB/s]
+test-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████| 170k/170k [00:00<00:00, 488kB/s]
+Generating train split: 100%|█████████████████████████████████████████████████████████████████████████| 8999/8999 [00:00<00:00, 277013.99 examples/s]
+Generating test split: 100%|██████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 205250.99 examples/s]
+Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 8999/8999 [00:01<00:00, 8233.47 examples/s]
+[codecarbon INFO @ 07:45:47] [setup] RAM Tracking...
+[codecarbon INFO @ 07:45:47] [setup] CPU Tracking...
+[codecarbon WARNING @ 07:45:47] No CPU tracking mode found. Falling back on CPU constant mode. 
+ Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU
+
+[codecarbon WARNING @ 07:45:48] We saw that you have a 13th Gen Intel(R) Core(TM) i7-1365U but we don't know it. Please contact us.
+[codecarbon INFO @ 07:45:48] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-1365U
+[codecarbon INFO @ 07:45:48] [setup] GPU Tracking...
+[codecarbon INFO @ 07:45:48] No GPU found.
+[codecarbon INFO @ 07:45:48] >>> Tracker's metadata:
+[codecarbon INFO @ 07:45:48]   Platform system: Linux-6.1.0-31-amd64-x86_64-with-glibc2.36
+[codecarbon INFO @ 07:45:48]   Python version: 3.13.0
+[codecarbon INFO @ 07:45:48]   CodeCarbon version: 2.8.3
+[codecarbon INFO @ 07:45:48]   Available RAM : 30.937 GB
+[codecarbon INFO @ 07:45:48]   CPU count: 12
+[codecarbon INFO @ 07:45:48]   CPU model: 13th Gen Intel(R) Core(TM) i7-1365U
+[codecarbon INFO @ 07:45:48]   GPU count: None
+[codecarbon INFO @ 07:45:48]   GPU model: None
+[codecarbon INFO @ 07:45:51] Saving emissions data to file /home/cedric/git/VulnTrain/vulnerability/emissions.csv
+  0%|                                                                                                                       | 0/2700 [00:00<?, ?it/s][codecarbon INFO @ 07:45:54] Energy consumed for RAM : 0.000048 kWh. RAM Power : 11.601505279541016 W
+[codecarbon INFO @ 07:45:54] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
+[codecarbon INFO @ 07:45:54] 0.000225 kWh of electricity used since the beginning.
+  0%|                                                                                                             | 1/2700 [00:07<5:45:36,  7.68s/it]
+```
+
+
+#### Training for classification
+
+tf-idf on the vulnerability descriptions.
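
Review note: the README only names TF-IDF here, so the following is a minimal, library-free sketch of how TF-IDF weights could be computed over descriptions. It is not VulnTrain's implementation; the function name `tfidf` and the sample descriptions are hypothetical, and a real pipeline would likely rely on an existing vectorizer and feed the weights to a classifier.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (plain tf * idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical descriptions, used only to illustrate the weighting.
descriptions = [
    "buffer overflow in parser",
    "sql injection in login form",
]
for doc_weights in tfidf(descriptions):
    print(doc_weights)
```

Terms that appear in every description (here `in`) get a weight of zero, while terms specific to one description get the highest weights.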
diff --git a/pyproject.toml b/pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "poetry.core.masonry.api"
 
 [project]
 name = "VulnTrain"
-version = "0.1.0"
+version = "0.2.0"
 description = "Generate datasets and models based on vulnerability descriptions from Vulnerability-Lookup."
 authors = [
     {name = "Cédric Bonhomme",email = "cedric.bonhomme@circl.lu"}
diff --git a/vulntrain/create_dataset.py b/vulntrain/create_dataset.py
@@ -5,6 +5,7 @@
 
 """
 
+import argparse
 import json
 from typing import Any, Generator
 
@@ -13,7 +14,8 @@
 
 
 class VulnExtractor:
-    def __init__(self):
+    def __init__(self, nb_rows):
+        self.nb_rows = nb_rows
         self.valkey_client = valkey.Valkey(
             host="127.0.0.1",
             port=10002,
@@ -66,12 +68,8 @@ def get_all(
                 yield vuln
 
     def __call__(self):
-        # count = 0
+        count = 0
         for vuln in self.get_all("cvelistv5", True):
-            # count += 1
-            # if count == 1000:
-            #     return
-
             #
             # CVE id, title, and description
             #
@@ -137,6 +135,11 @@ def __call__(self):
 
             vuln_cpes = list(dict.fromkeys(cpe.lower() for cpe in vuln_cpes))
 
+
+            count += 1
+            if count == self.nb_rows:
+                return
+
             #
             # Create the data
             #
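
Review note: judging only from the visible hunks, the counter is incremented and compared before the record is built and yielded, so `--nb-rows 10000` appears to stop after 9999 rows. That matches the README output above (8999 + 1000 examples). With the default of 0 the comparison never succeeds, so no limit applies. The sketch below shows one alternative that yields exactly `nb_rows` items, using `itertools.islice`; `limit_rows` is a hypothetical helper, not part of VulnTrain.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def limit_rows(rows: Iterable[T], nb_rows: int = 0) -> Iterator[T]:
    """Yield at most nb_rows items; 0 (the CLI default) means no limit."""
    if nb_rows <= 0:
        yield from rows
    else:
        yield from islice(rows, nb_rows)


# Hypothetical stand-in for the records produced by the extractor.
fake_vulns = ({"id": f"CVE-0000-{i:04d}"} for i in range(25))
print(len(list(limit_rows(fake_vulns, 10))))  # prints 10
```

Wrapping the generator this way keeps the limit out of the extraction loop, so rows skipped inside the loop do not interact with the counter.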
@@ -150,7 +153,30 @@ def __call__(self):
 
 
 def main():
-    extractor = VulnExtractor()
+    parser = argparse.ArgumentParser(description="Dataset generation.")
+    parser.add_argument(
+        "--upload",
+        action="store_true",
+        help="Upload to HuggingFace.",
+        default=False,
+    )
+    parser.add_argument(
+        "--repo-id",
+        dest="repo_id",
+        help="Repo id.",
+        default="",
+    )
+    parser.add_argument(
+        "--nb-rows",
+        dest="nb_rows",
+        type=int,
+        help="Number of rows in the dataset.",
+        default=0,
+    )
+
+    args = parser.parse_args()
+
+    extractor = VulnExtractor(args.nb_rows)
 
     vulns = list(extractor())
 
@@ -165,8 +191,9 @@ def gen():
     )
 
     print(dataset_dict)
-    # dataset_dict.push_to_hub("cedricbonhomme/vulnerability-descriptions")
-    dataset_dict.push_to_hub("CIRCL/vulnerability-dataset")
+    if args.upload:
+        # dataset_dict.push_to_hub("cedricbonhomme/vulnerability-descriptions")
+        dataset_dict.push_to_hub(args.repo_id)
 
 
 if __name__ == "__main__":
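
Review note: with the arguments as added in this commit, `--upload` without `--repo-id` would pass an empty string to `push_to_hub`. The sketch below shows one way the parser could reject that combination up front with `parser.error`. The `parse_args` wrapper and both checks are assumptions for illustration, not code from this commit; the option names and help strings are taken from the diff.

```python
import argparse


def parse_args(argv=None):
    """Parse the dataset-generation options and reject invalid combinations."""
    parser = argparse.ArgumentParser(description="Dataset generation.")
    parser.add_argument("--upload", action="store_true", help="Upload to HuggingFace.")
    parser.add_argument("--repo-id", dest="repo_id", default="", help="Repo id.")
    parser.add_argument(
        "--nb-rows",
        dest="nb_rows",
        type=int,
        default=0,
        help="Number of rows in the dataset.",
    )
    args = parser.parse_args(argv)

    # Assumed policy: uploading requires an explicit target repository.
    if args.upload and not args.repo_id:
        parser.error("--repo-id is required when --upload is set")
    # Assumed policy: 0 means no limit, negative values are rejected.
    if args.nb_rows < 0:
        parser.error("--nb-rows must be zero (no limit) or positive")
    return args


# Example call with an explicit argument list instead of sys.argv.
args = parse_args(["--nb-rows", "10000", "--upload", "--repo-id", "CIRCL/vulnerability-dataset-10k"])
print(args.nb_rows, args.upload, args.repo_id)
```

`parser.error` prints the usage message and exits with status 2, so the failure happens before any dataset is generated.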
diff --git a/vulntrain/summarize.py b/vulntrain/summarize.py
@@ -17,7 +17,7 @@
 # - distilgpt2, gpt2
 #
 BASE_MODEL = "distilbert-base-uncased"  # distilgpt2, gpt2
-DATASET = "circl/vulnerability-dataset"
+DATASET = "circl/vulnerability-dataset-10k"
 MODEL_PATH = "./vulnerability"
 
 if torch.cuda.is_available():
