chg: [RELEASE] Updated documentation and CHANGELOG.

cedricbonhomme · cedricbonhomme · commit 3f11a97528c5 · 2025-02-25T08:38:45.000+01:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,23 @@
 # Changelog
 
+## Release 1.0.0 (2025-02-25)
+
+### News
+
+- Introduced a new trainer to automatically classify vulnerabilities based on their descriptions,  
+  even when CVSS scores are unavailable.  
+- Added CVSS parsing to the dataset generation script.  
+
+### Changes
+
+- Refactored the project structure for better organization.  
+- Improved CPE parsing.  
+- Enhanced the dataset generation script.  
+- Optimized the trainer for text generation on vulnerability descriptions.  
+- Improved command-line argument parsing.  
+- Improved the process of pushing the tokenizer and trainer to Hugging Face.  
+
+
 ## Release 0.5.1 (2025-02-22)
 
 Fixed configuration module name.
diff --git a/README.md b/README.md
@@ -18,14 +18,16 @@ Check out the datasets and models on Hugging Face:
 
 ## Usage
 
-Various types of commands are available:
+Three types of commands are available:
 
 - **Dataset generation**: Create and prepare datasets.
-- **Model training**: Train models on the prepared datasets.
-- **Model validation**: Evaluate the performance of the trained model.
+- **Model training**: Train models using the prepared datasets.
+  - Train a model for text generation to assist in writing vulnerability descriptions.
+  - Train a model to classify vulnerabilities by severity.
+- **Model validation**: Assess the performance of trained models.
 
 
-### Generate datasets
+### Dataset generation
 
 Authenticate to HuggingFace:
 
@@ -45,7 +47,7 @@ Then ensure that the kvrocks database of Vulnerability-Lookup is running.
 Creation of datasets:
 
 ```bash
-$ vulntrain-dataset-generation --sources cvelistv5 --nb-rows 10000 --upload --repo-id CIRCL/vulnerability-dataset-10k
+$ vulntrain-dataset-generation --sources cvelistv5 --nb-rows 10000 --repo-id CIRCL/vulnerability-dataset-10k
 Generating train split: 9999 examples [00:00, 177710.74 examples/s]
 DatasetDict({
     train: Dataset({
@@ -65,7 +67,7 @@ README.md: 100%|█████████████████████
 ```
 
 
-### Train
+### Model training
 
 #### Training for text generation
 
diff --git a/pyproject.toml b/pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "poetry.core.masonry.api"
 
 [project]
 name = "VulnTrain"
-version = "0.5.1"
+version = "1.0.0"
 description = "Generate datasets and models based on vulnerabilities data from Vulnerability-Lookup."
 authors = [
     {name = "Cédric Bonhomme",email = "cedric.bonhomme@circl.lu"}
diff --git a/vulntrain/datasets/create_dataset.py b/vulntrain/datasets/create_dataset.py
@@ -126,11 +126,16 @@ def main():
         help="Comma-separated list of sources (cvelistv5, github)",
     )
     parser.add_argument(
-        "--upload", action="store_true", help="Upload dataset to Hugging Face"
+        "--repo-id",
+        dest="repo_id",
+        default="",
+        help="The name of the repository you want to push your object to. It should contain your organization name when pushing to a given organization.",
     )
-    parser.add_argument("--repo-id", required=False, help="Hugging Face repository ID")
     parser.add_argument(
-        "--commit-message", default="", help="Commit message when publishing"
+        "--commit-message",
+        dest="commit_message",
+        default="",
+        help="Commit message when publishing",
     )
     parser.add_argument(
         "--nb-rows", type=int, default=0, help="Number of rows in the dataset"
@@ -150,7 +155,7 @@ def main():
     )
     print(dataset_dict)
 
-    if args.upload:
+    if args.repo_id:
         if args.commit_message:
             # dataset_dict.push_to_hub(args.repo_id, commit_message=args.commit_message, token=hf_token)
             dataset_dict.push_to_hub(args.repo_id, commit_message=args.commit_message)
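
Note on the `create_dataset.py` change: the `--upload` flag is removed, and a non-empty `--repo-id` now acts as the switch that triggers the push to Hugging Face. The following is a minimal, standalone sketch of that behaviour, not the project's actual module. `build_parser` and `should_push` are hypothetical helper names introduced for illustration; the argument names, `dest` values, and defaults are taken from the diff above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild only the arguments touched by this commit (illustrative subset)."""
    parser = argparse.ArgumentParser(prog="vulntrain-dataset-generation")
    parser.add_argument(
        "--repo-id",
        dest="repo_id",
        default="",  # empty string means "do not push"
        help="Repository to push to, including the organization name if any.",
    )
    parser.add_argument(
        "--commit-message",
        dest="commit_message",
        default="",
        help="Commit message when publishing",
    )
    parser.add_argument(
        "--nb-rows", type=int, default=0, help="Number of rows in the dataset"
    )
    return parser


def should_push(args: argparse.Namespace) -> bool:
    """Hypothetical helper mirroring the new `if args.repo_id:` check."""
    return bool(args.repo_id)


if __name__ == "__main__":
    parser = build_parser()

    # Without --repo-id the dataset is only generated locally.
    local_only = parser.parse_args(["--nb-rows", "10000"])
    print(should_push(local_only))

    # With --repo-id the push branch would be taken.
    with_repo = parser.parse_args(
        ["--nb-rows", "10000", "--repo-id", "CIRCL/vulnerability-dataset-10k"]
    )
    print(should_push(with_repo))
```

One consequence of this design is that existing invocations passing `--upload` will now fail with an "unrecognized arguments" error from `argparse`, which is consistent with the README example dropping the flag.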