RFC Set cachedir and backupcachedir as parameter for parallel instances of the tool with own database #4773

Open
wants to merge 1 commit into main

Conversation

motto-phytec

The cachedir and backup_cachedir are always left at the default value (~/.cache/cve-bin-tool) for the data sources. Only for the cvedb is the cachedir configurable; the data_sources, the cve_scanner and the other components always use the default cachedir path.
If you set the cachedir of the CVEDB to a different location, then the cvedb access in line 441
purl2cpe_conn = sqlite3.connect(self.cachedir / "purl2cpe/purl2cpe.db")
fails, because the SQL connection for purl2cpe ends up using DISK_LOCATION_DEFAULT instead of the self.cachedir of the cvedb.
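
For illustration, a minimal standalone sketch of that mismatch (not the real source; DISK_LOCATION_DEFAULT and the purl2cpe path follow the description above, and custom_cachedir is only a placeholder):

import sqlite3
from pathlib import Path

# Compiled-in default cache location, as described above.
DISK_LOCATION_DEFAULT = Path("~/.cache/cve-bin-tool").expanduser()

# Placeholder for a per-instance cache directory configured on the CVEDB.
custom_cachedir = Path("/workspace/pipeline-a/cve-cache")

# One side resolves purl2cpe.db below the default location ...
purl2cpe_default = DISK_LOCATION_DEFAULT / "purl2cpe" / "purl2cpe.db"

# ... while the other side resolves it below the configured cachedir, so as soon as
# the two directories differ the connection points at a path that was never
# populated and fails (or silently opens an empty database).
purl2cpe_conn = sqlite3.connect(custom_cachedir / "purl2cpe" / "purl2cpe.db")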

The motivation for this RFC patch is to run several instances of cve-bin-tool in parallel, each with its own cachedir.
In parallel operation there is a real risk of collisions when accessing the database: if a task starts later and wants to update the database, the first task performs a rollback of the database, which corrupts the whole database.

This RFC patch tries to make the cachedir settable in all affected files, but there are many affected files and dependencies.
I am not sure whether this is the right approach to achieve better parallel processing.
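
To make the goal concrete, here is a minimal sketch of the intended usage, assuming the CVEDB constructor accepts the configurable cachedir mentioned above (the import path follows cve_bin_tool.cvedb; the exact keyword and the workspace paths are assumptions/placeholders):

from pathlib import Path
from cve_bin_tool.cvedb import CVEDB

# Placeholder per-pipeline cache directories.
cache_a = Path("/workspace/pipeline-a/cve-cache")
cache_b = Path("/workspace/pipeline-b/cve-cache")

# Each instance keeps its own cve.db and downloaded data, so an update started by
# one pipeline can no longer roll back and corrupt the database that another
# pipeline is currently reading.
cvedb_a = CVEDB(cachedir=cache_a)
cvedb_b = CVEDB(cachedir=cache_b)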

@terriko
Contributor

terriko commented Feb 6, 2025

I think this is probably a change we should have.

BUT for parallel work, our standard recommendation is that you not allow scan jobs to update the database at all, which avoids this problem, so you might want to switch to doing that.

That is:

  1. have one database update job that will do updates as needed. I usually have it scan against a blank or predictable small binary like so:
cve-bin-tool ~/blank.csv

Our docs actually recommend you use -u now here, but that will take longer as it refreshes all data instead of just pulling the updates it needs: https://cve-bin-tool.readthedocs.io/en/latest/how_to_guides/multiple_scans_at_once.html

  2. In every parallel scan job, use -u never as an option so that the scan job will not attempt to update the database (a rough sketch of the whole pattern follows below).
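
Put together, a rough sketch of that pattern driven from a Python script (the scan target is a placeholder; only the -u behaviour discussed above is assumed):

import subprocess
from pathlib import Path

# 1. Dedicated update job, scheduled so it never overlaps the scan jobs.
#    Scanning a small, predictable file keeps the run short while still letting
#    the tool refresh the database as needed.
subprocess.run(["cve-bin-tool", str(Path.home() / "blank.csv")], check=False)

# 2. Every parallel scan job passes "-u never" so it only reads the shared database.
#    check=False because cve-bin-tool can exit non-zero when it finds CVEs.
result = subprocess.run(
    ["cve-bin-tool", "-u", "never", "/path/to/scan-target"],  # placeholder target
    check=False,
)
print("scan exit code:", result.returncode)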

I'm wondering if we should make some changes to make this happen better or make it more obvious to users that this is the recommended solution. I'll open an issue so someone could maybe work on that.

@motto-phytec
Author

Hi @terriko,
thank you for the information and for the way to update the database without a real scan.
We use cve-bin-tool as a Python package inside our own Python script, and this script runs in several asynchronous Jenkins pipelines, so we can start the pipelines without updating the CVE db. Our challenge is deciding when we can update the cve.db.
Do you have a similar use case?

The other thing is that, at the moment, you cannot point the tool at a different cve db path. The various CVE sources and files always use the compiled-in default path ~/.cache/cve-bin-tool/cve.db, so you cannot run a pipeline with its own cve.db in its workspace. If you make a copy into the workspace and use CVEDB, then it depends on the data request which database path is actually used, and if there is no database in ~/.cache/cve-bin-tool/cve.db it crashes, e.g. in purl2cpe_conn = sqlite3.connect(self.cachedir / "purl2cpe/purl2cpe.db")
Is using a cve db in a different folder also a use case for you?

Contributor

@terriko left a comment


Jenkins would be a bit different, but we're using the cache functionality of github actions:

https://github.com/intel/cve-bin-tool/blob/main/.github/workflows/update-cache.yml
https://github.com/intel/cve-bin-tool/actions/workflows/update-cache.yml

So the update job does the update and then clobbers the cache, and the scan jobs each make a copy of the cache to their local directory before they start. Because it's github actions which uses new containers for every job, the individual scan/test jobs can't share data any other way.

That probably doesn't help for the case where you're reusing the same machine/container instead of fresh ones, but it probably explains why the cache directory hasn't been fixed yet.

Anyhow, back to this PR -- do you have time to fix the linter issues? You probably need to run isort and black and just let them auto-fix. It would probably be good to update the branch at the same time, since we made some CI changes that will probably make the tests run faster for you. I can do that through the web interface but I don't want to make your local branch all out of sync if you're going to fix the linter issues.

@terriko
Contributor

terriko commented Feb 13, 2025

Oh, forgot the link for how to run the linters and what they're doing:
https://github.com/intel/cve-bin-tool/blob/main/CONTRIBUTING.md#running-linters

The cachedir and backupcachedir were always left at the default value for
the data sources; only for the cvedb is the cachedir configurable.
With this patch cve-bin-tool can run as several instances, each with
its own cache for the CVE information.

In addition, the patch is a workaround for the cvedb access in line 441, purl2cpe_conn = sqlite3.connect(self.cachedir / "purl2cpe/purl2cpe.db"), which fails if the cvedb has a different cachedir than the purl2cpe data under DISK_LOCATION_DEFAULT.

Signed-off-by: Maik Otto <[email protected]>
@motto-phytec
Author

Hi @terriko
I want to close this PR, because there are too many dependencies on other files and not all of the necessary files are changed.
Thank you for your comments. We have added one Jenkins pipeline that updates the CVE database, and the other CVE check pipelines only have read access to it. The only critical point is that the CVE update pipeline must not run at the same time as the other check pipelines.
