RFC Set cachedir and backupcachedir as parameter for parallel instances of the tool with own database #4773

Open
wants to merge 1 commit into main

Conversation

motto-phytec

The cachedir and backup_cachedir are always left at the default value (~/.cache/cve-bin-tool) for the data sources. Only for the cvedb is the cachedir configurable; the data_sources, the cve_scanner and the other components always use the default cachedir path.
If you set the cachedir of the CVEDB to a different location, then the cvedb access in line 441
purl2cpe_conn = sqlite3.connect(self.cachedir / "purl2cpe/purl2cpe.db")
fails, because the SQL connection for purl2cpe ends up using DISK_LOCATION_DEFAULT instead of the self.cachedir of the cvedb.
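
For illustration, a minimal standalone sketch of that mismatch (not the real source; DISK_LOCATION_DEFAULT and the purl2cpe path follow the description above, and custom_cachedir is only a placeholder):

import sqlite3
from pathlib import Path

# Compiled-in default cache location, as described above.
DISK_LOCATION_DEFAULT = Path("~/.cache/cve-bin-tool").expanduser()

# Placeholder for a per-instance cache directory configured on the CVEDB.
custom_cachedir = Path("/workspace/pipeline-a/cve-cache")

# One side resolves purl2cpe.db below the default location ...
purl2cpe_default = DISK_LOCATION_DEFAULT / "purl2cpe" / "purl2cpe.db"

# ... while the other side resolves it below the configured cachedir, so as soon as
# the two directories differ the connection points at a path that was never
# populated and fails (or silently opens an empty database).
purl2cpe_conn = sqlite3.connect(custom_cachedir / "purl2cpe" / "purl2cpe.db")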

The motivation for this RFC patch is to run several instances of cve-bin-tool in parallel, each with its own cachedir.
In parallel operation there is a real risk of collisions when accessing the database: if a task starts later and wants to update the database, the first task performs a rollback of the database, which corrupts the whole database.

This RFC patch tries to make the cachedir settable in all affected files, but there are many affected files and dependencies.
I am not sure whether this is the right approach to achieve better parallel processing.
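
To make the goal concrete, here is a minimal sketch of the intended usage, assuming the CVEDB constructor accepts the configurable cachedir mentioned above (the import path follows cve_bin_tool.cvedb; the exact keyword and the workspace paths are assumptions/placeholders):

from pathlib import Path
from cve_bin_tool.cvedb import CVEDB

# Placeholder per-pipeline cache directories.
cache_a = Path("/workspace/pipeline-a/cve-cache")
cache_b = Path("/workspace/pipeline-b/cve-cache")

# Each instance keeps its own cve.db and downloaded data, so an update started by
# one pipeline can no longer roll back and corrupt the database that another
# pipeline is currently reading.
cvedb_a = CVEDB(cachedir=cache_a)
cvedb_b = CVEDB(cachedir=cache_b)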

@terriko
Contributor

terriko commented Feb 6, 2025

I think this is probably a change we should have.

BUT for parallel work, our standard recommendation is that you not allow scan jobs to update the database at all, which avoids this problem, so you might want to switch to doing that.

That is:

  1. have one database update job that will do updates as needed. I usually have it scan against a blank or predictable small binary like so:
cve-bin-tool ~/blank.csv

Our docs actually recommend you use -u now here, but that will take longer as it refreshes all data instead of just pulling the updates it needs: https://cve-bin-tool.readthedocs.io/en/latest/how_to_guides/multiple_scans_at_once.html

  2. In every parallel scan job, use -u never as an option so that the scan job will not attempt to update the database (a rough sketch of the whole pattern follows below).
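
Put together, a rough sketch of that pattern driven from a Python script (the scan target is a placeholder; only the -u behaviour discussed above is assumed):

import subprocess
from pathlib import Path

# 1. Dedicated update job, scheduled so it never overlaps the scan jobs.
#    Scanning a small, predictable file keeps the run short while still letting
#    the tool refresh the database as needed.
subprocess.run(["cve-bin-tool", str(Path.home() / "blank.csv")], check=False)

# 2. Every parallel scan job passes "-u never" so it only reads the shared database.
#    check=False because cve-bin-tool can exit non-zero when it finds CVEs.
result = subprocess.run(
    ["cve-bin-tool", "-u", "never", "/path/to/scan-target"],  # placeholder target
    check=False,
)
print("scan exit code:", result.returncode)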

I'm wondering if we should make some changes to make this happen better or make it more obvious to users that this is the recommended solution. I'll open an issue so someone could maybe work on that.

@motto-phytec
Author

Hi @terriko,
thank you for the information and for the way to update the database without a real scan.
We use cve-bin-tool as a Python package inside our own Python script, and this script runs in several asynchronous Jenkins pipelines, so we can start the pipelines without updating the CVE db. Our challenge is deciding when we can update the cve.db.
Do you have a similar use case?

The other thing is that, at the moment, you cannot point the tool at a different cve db path. The various CVE sources and files always use the compiled-in default path ~/.cache/cve-bin-tool/cve.db, so you cannot run a pipeline with its own cve.db in its workspace. If you make a copy into the workspace and use CVEDB, then it depends on the data request which database path is actually used, and if there is no database in ~/.cache/cve-bin-tool/cve.db it crashes, e.g. in purl2cpe_conn = sqlite3.connect(self.cachedir / "purl2cpe/purl2cpe.db")
Is using a cve db in a different folder also a use case for you?

Contributor

@terriko left a comment


Jenkins would be a bit different, but we're using the cache functionality of github actions:

https://github.com/intel/cve-bin-tool/blob/main/.github/workflows/update-cache.yml
https://github.com/intel/cve-bin-tool/actions/workflows/update-cache.yml

So the update job does the update and then clobbers the cache, and the scan jobs each make a copy of the cache to their local directory before they start. Because it's github actions which uses new containers for every job, the individual scan/test jobs can't share data any other way.

That probably doesn't help for the case where you're reusing the same machine/container instead of fresh ones, but it probably explains why the cache directory hasn't been fixed yet.

Anyhow, back to this PR -- do you have time to fix the linter issues? You probably need to run isort and black and just let them auto-fix. It would probably be good to update the branch at the same time, since we made some CI changes that will probably make the tests run faster for you. I can do that through the web interface but I don't want to make your local branch all out of sync if you're going to fix the linter issues.

@terriko
Contributor

terriko commented Feb 13, 2025

Oh, forgot the link for how to run the linters and what they're doing:
https://github.com/intel/cve-bin-tool/blob/main/CONTRIBUTING.md#running-linters

The cachedir and backupcachedir were always left at the default value for
the data sources; only for the cvedb is the cachedir configurable.
With this patch cve-bin-tool can run as several instances, each with
its own cache for the CVE information.

In addition, the patch is a workaround for the cvedb access in line 441, purl2cpe_conn = sqlite3.connect(self.cachedir / "purl2cpe/purl2cpe.db"), which fails if the cvedb has a different cachedir than the purl2cpe data under DISK_LOCATION_DEFAULT.

Signed-off-by: Maik Otto <[email protected]>
@motto-phytec
Author

Hi @terriko
I want to close this PR, because there are too many dependencies on other files and not all of the necessary files are changed.
Thank you for your comments. We have added one Jenkins pipeline that updates the CVE database, and the other CVE check pipelines only have read access to it. The only critical point is that the CVE update pipeline must not run at the same time as the other check pipelines.
