
Commit 2bc426f

chg: [user-manual] Added a page on how to get started with AI datasets and models.
1 parent 98b89d0 commit 2bc426f

File tree

- content/user-manual/_index.md
- content/user-manual/ai/index.md

2 files changed: +111 -0 lines changed

content/user-manual/_index.md (+1)

@@ -16,3 +16,4 @@ It covers fundamental concepts, key features, and essential administrative tasks
 - [Sightings](./sightings)
 - [System monitoring](./system-monitoring)
 - [PyVulnerabilityLookup](./pyvulnerabilitylookup)
+- [AI datasets and models](./ai)

content/user-manual/ai/index.md (new file, +110)

---
title: AI datasets and models
description: Machine learning datasets and models
toc: true
---

## Datasets

<html>
<iframe
  src="https://huggingface.co/datasets/CIRCL/vulnerability-scores/embed/viewer/default/train"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>
</html>

This dataset is updated daily.

Sources of the data:

- [CVE Program](https://vulnerability.circl.lu/recent#cvelistv5) (enriched with data from vulnrichment and Fraunhofer FKIE)
- [GitHub Security Advisories](https://vulnerability.circl.lu/recent#github)
- [PySec advisories](https://vulnerability.circl.lu/recent#pysec)
- [CSAF Red Hat](https://vulnerability.circl.lu/recent#csaf_redhat)
- [CSAF Cisco](https://vulnerability.circl.lu/recent#csaf_cisco)

The licenses for each security advisory feed are listed here:
https://vulnerability.circl.lu/about#sources
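
Since the dataset is updated daily, a previously cached copy may lag behind. A minimal sketch for forcing a fresh download with the ``datasets`` library (``download_mode`` is a standard ``load_dataset`` argument, not something specific to this dataset):

```python
from datasets import load_dataset

# Bypass the local cache so that the latest daily update is fetched.
dataset = load_dataset(
    "CIRCL/vulnerability-scores",
    download_mode="force_redownload",
)

print(dataset)
```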

### Get started with the dataset

```python
import json

from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("CIRCL/vulnerability-scores")

vulnerabilities = ["CVE-2012-2339", "RHSA-2023:5964", "GHSA-7chm-34j8-4f22", "PYSEC-2024-225"]

# Keep only the entries whose identifier is in the list above
filtered_entries = dataset.filter(lambda elem: elem["id"] in vulnerabilities)

for entry in filtered_entries["train"]:
    print(json.dumps(entry, indent=4))
```

For each vulnerability, you will find all assigned severity scores and associated CPEs.
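
To see exactly which fields carry those scores and CPEs, you can inspect the columns of the current dataset version rather than hard-coding field names; a minimal sketch (column names evolve with the dataset, so nothing here assumes a particular schema):

```python
from datasets import load_dataset

dataset = load_dataset("CIRCL/vulnerability-scores")

# List the columns exposed by the current version of the dataset
# (identifier, description, severity scores, CPEs, ...).
print(dataset["train"].column_names)

# Print one raw entry to see how the scores and CPEs are stored.
print(dataset["train"][0])
```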

## Models

### Text classification

#### vulnerability-severity-classification-roberta-base

This model is a fine-tuned version of ``roberta-base`` on the
[CIRCL/vulnerability-scores](https://huggingface.co/datasets/CIRCL/vulnerability-scores) dataset.
Fine-tuning on two NVIDIA L40S GPUs takes approximately six hours.

Try it with Python:

```python
>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer
... import torch
...
... labels = ["low", "medium", "high", "critical"]
...
... model_name = "CIRCL/vulnerability-severity-classification-roberta-base"
... tokenizer = AutoTokenizer.from_pretrained(model_name)
... model = AutoModelForSequenceClassification.from_pretrained(model_name)
... model.eval()
...
... test_description = "langchain_experimental 0.0.14 allows an attacker to bypass the CVE-2023-36258 fix and execute arbitrary code via the PALChain in the python exec method."
... inputs = tokenizer(test_description, return_tensors="pt", truncation=True, padding=True)
...
... # Run inference
... with torch.no_grad():
...     outputs = model(**inputs)
...     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
...
... # Print results
... print("Predictions:", predictions)
... predicted_class = torch.argmax(predictions, dim=-1).item()
... print("Predicted severity:", labels[predicted_class])
...
tokenizer_config.json: 100%|██████████| 1.25k/1.25k [00:00<00:00, 4.51MB/s]
vocab.json: 100%|██████████| 798k/798k [00:00<00:00, 2.66MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 3.42MB/s]
tokenizer.json: 100%|██████████| 3.56M/3.56M [00:00<00:00, 5.92MB/s]
special_tokens_map.json: 100%|██████████| 280/280 [00:00<00:00, 1.14MB/s]
config.json: 100%|██████████| 913/913 [00:00<00:00, 3.40MB/s]
model.safetensors: 100%|██████████| 499M/499M [00:44<00:00, 11.2MB/s]
Predictions: tensor([[2.5910e-04, 2.1585e-03, 1.3680e-02, 9.8390e-01]])
Predicted severity: critical
```

``critical`` has a score of about 98%.
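
If you just want a prediction without handling the tokenizer and softmax yourself, the same model can also be loaded through the ``transformers`` pipeline API. A minimal sketch (the label names in the output come from the model's configuration, so they may be spelled differently from the ``labels`` list used above):

```python
from transformers import pipeline

# Download the model and tokenizer and wrap them in a text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="CIRCL/vulnerability-severity-classification-roberta-base",
)

description = (
    "langchain_experimental 0.0.14 allows an attacker to bypass the "
    "CVE-2023-36258 fix and execute arbitrary code via the PALChain "
    "in the python exec method."
)

# The pipeline tokenizes, runs inference, and applies softmax internally,
# returning the top label and its score.
print(classifier(description))
```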

Try it with the Hugging Face space:

<html>
<iframe
  src="https://circl-vulnerability-severity-classification-roberta-base.hf.space"
  frameborder="0"
  width="850"
  height="450"
></iframe>
</html>
