CheckM2 Data Manager writes to CheckM2 install dir #6717

Open
natefoo opened this issue Feb 3, 2025 · 9 comments

@natefoo
Member

natefoo commented Feb 3, 2025

This is not the DM's fault; CheckM2 itself does it, and as far as I can tell there's no way to override it. However, it makes running with Singularity impossible and with Conda inadvisable (since it modifies your installation).

$ mkdir test
$ singularity exec -B $(pwd)/test:/test /cvmfs/singularity.galaxyproject.org/all/checkm2:1.0.2--pyh7cba7a3_0 checkm2 database --download --path /test
[02/03/2025 06:06:42 PM] INFO: Command: Download database. Checking internal path information.
[02/03/2025 06:06:43 PM] INFO: Downloading https://zenodo.org/api/records/5571251/files/checkm2_database.tar.gz/content to /test/checkm2_database.tar.gz.
100%|███████| 1.74G/1.74G [01:09<00:00, 25.1MiB/s]
[02/03/2025 06:07:53 PM] INFO: Extracting files from archive...
[02/03/2025 06:08:07 PM] INFO: Verifying version and checksums...
[02/03/2025 06:08:07 PM] INFO: Verification success.
Traceback (most recent call last):
  File "/usr/local/bin/checkm2", line 282, in <module>
    fileManager.DiamondDB().download_database(args.path)
  File "/usr/local/lib/python3.8/site-packages/checkm2/fileManager.py", line 131, in download_database
    with open(diamond_location, 'w') as dd:
OSError: [Errno 30] Read-only file system: '/usr/local/lib/python3.8/site-packages/checkm2/version/diamond_path.json'

I'll look at this, although even if I can prevent it from writing to the internal diamond_path.json, which does exist:

$ singularity exec -B $(pwd)/test:/test /cvmfs/singularity.galaxyproject.org/all/checkm2:1.0.2--pyh7cba7a3_0 cat /usr/local/lib/python3.8/site-packages/checkm2/version/diamond_path.json
{"Type": "DIAMONDDB", "DBPATH": "Not Set"}%    

I don't know if the tool will work if this file isn't updated.

@natefoo
Member Author

natefoo commented Feb 4, 2025

I can confirm that a modified diamond_path.json is not needed in order to run the tool:

Fetched the DB using a writable conda install of checkm2:

$ checkm2 database --download --path $(pwd)/data
[02/04/2025 12:34:01 PM] INFO: Command: Download database. Checking internal path information.
[02/04/2025 12:34:03 PM] INFO: Downloading https://zenodo.org/api/records/5571251/files/checkm2_database.tar.gz/content to /home/nate/xplor/checkm2/data/checkm2_database.tar.gz.
100%|████████| 1.74G/1.74G [01:05<00:00, 26.4MiB/s]
[02/04/2025 12:35:09 PM] INFO: Extracting files from archive...
[02/04/2025 12:35:24 PM] INFO: Verifying version and checksums...
[02/04/2025 12:35:24 PM] INFO: Verification success.
[02/04/2025 12:35:25 PM] INFO: Diamond DATABASE downloaded successfully! Consider running <checkm2 testrun> to verify everything works.

The json in the conda env just contains the path to the database:

$ cat ~/.condas/24.11/envs/[email protected]/lib/python3.8/site-packages/checkm2/version/diamond_path.json
{"Type": "DIAMONDDB", "DBPATH": "/home/nate/xplor/checkm2/data/CheckM2_database/uniref100.KO.1.dmnd"}%             

But this appears to be ignored when --database_path is provided with predict:

$ apptainer -s exec --cleanenv -B $(pwd)/data:/data:ro -B $(pwd)/input:/input:ro -B $(pwd)/output:/output /cvmfs/singularity.galaxyproject.org/all/checkm2:1.0.2--pyh7cba7a3_0 checkm2 predict --input /input --allmodels --genes --ttable 13 -x .faa --threads 1 --database_path /data/CheckM2_database/uniref100.KO.1.dmnd --output-directory /output
[02/04/2025 12:48:21 PM] INFO: Running CheckM2 version 1.0.2
[02/04/2025 12:48:21 PM] INFO: Custom database path provided for predict run. Checking database at /data/CheckM2_database/uniref100.KO.1.dmnd...
[02/04/2025 12:48:23 PM] INFO: Running quality prediction workflow with 1 threads.
[02/04/2025 12:48:24 PM] INFO: Using user-supplied protein files.
[02/04/2025 12:48:24 PM] INFO: Calculating metadata for 2 bins with 1 threads:
    Finished processing 2 of 2 (100.00%) bin metadata.
[02/04/2025 12:48:25 PM] INFO: Annotating input genomes with DIAMOND using 1 threads
[02/04/2025 12:50:07 PM] INFO: Processing DIAMOND output
[02/04/2025 12:50:07 PM] INFO: Predicting completeness and contamination using ML models.
[02/04/2025 12:50:10 PM] INFO: Parsing all results and constructing final output table.
[02/04/2025 12:50:10 PM] INFO: CheckM2 finished successfully.

@bernt-matthias
Contributor

Is there an upstream issue?

It's hardcoded here:
https://github.com/chklovski/CheckM2/blob/319dae65f1c7f2fc1c0bb160d90ac3ba64ed9457/checkm2/defaultValues.py#L31

Wondering if we can hack this by copying the module to the working dir and prepending it to PYTHONPATH.
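
A rough sketch of the PYTHONPATH idea, in Python. It assumes (not verified here) that CheckM2 derives the location of diamond_path.json from the location of its own module files, so that a writable copy found earlier on the import path would receive the write instead of the install directory. The helper names `shadow_package` and `env_with_pythonpath` are made up for illustration.

```python
import os
import shutil
from pathlib import Path


def shadow_package(package_dir, work_dir):
    """Copy an installed package into work_dir and return the copy's path."""
    package_dir = Path(package_dir)
    target = Path(work_dir) / package_dir.name
    shutil.copytree(package_dir, target)
    return target


def env_with_pythonpath(work_dir, base_env=None):
    """Return a copy of the environment with work_dir first on PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH")
    if existing:
        # Keep whatever was already there, but let the shadow copy win.
        env["PYTHONPATH"] = os.pathsep.join([str(work_dir), existing])
    else:
        env["PYTHONPATH"] = str(work_dir)
    return env
```

The returned environment could then be passed to something like `subprocess.run(["checkm2", "database", "--download", "--path", db_dir], env=env)`. Whether that actually redirects the write depends on the assumption above.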

@bernt-matthias
Contributor

chklovski/CheckM2#126

@bernt-matthias
Contributor

Alternatively: can we ignore the error using stdio?

@natefoo
Member Author

natefoo commented Feb 7, 2025

Yes the issue is upstream - imo we should just wait and only implement our own workaround if there isn't an upstream fix. I already worked around it (run in conda, throw the env away) for my own use so there is no urgency from me.

@SantaMcCloud
Contributor

SantaMcCloud commented Feb 8, 2025

We could also change the DM to use a tool such as this one, for example: https://github.com/dvolgyes/zenodo_get

CheckM2 also downloads the database from Zenodo, so we can use the record ID (5571251 for the current version) to download the file and extract it, since it is a tar.gz archive.

Maybe this is a good solution until they fix it?

I can write a conda recipe and test this tool with the current DM, since there is not a lot to adjust in the current wrapper, if this is wanted :)

Zenodo also has an API, but I can't tell how well it would work as a solution.
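
For illustration, a minimal Python sketch of fetching the archive directly, without going through `checkm2 database --download`. The URL pattern is copied from the CheckM2 log earlier in this thread; the function names are hypothetical, and unlike CheckM2 this sketch does no version or checksum verification.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

RECORD_ID = "5571251"                 # record ID quoted in this thread
FILENAME = "checkm2_database.tar.gz"  # file name seen in the CheckM2 log


def zenodo_file_url(record_id, filename):
    """Build a download URL in the same form that CheckM2 logs."""
    return f"https://zenodo.org/api/records/{record_id}/files/{filename}/content"


def download(url, dest):
    """Stream url to dest without holding the whole file in memory."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def extract(archive, target_dir):
    """Unpack a .tar.gz archive into target_dir (created if missing)."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    return target_dir
```

A data manager could call `download(zenodo_file_url(RECORD_ID, FILENAME), archive)` followed by `extract(archive, target)`, which never touches the CheckM2 install directory.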

@chklovski

chklovski commented Feb 9, 2025

Hi, author of CheckM2 here - what's your preferred alternative?

The original reason for the hardcoding was to enable easy central installation of a conda environment where lots of users can activate it (our main use case in our lab) and utilise the tool without needing to have their own config file for the database path (which yes, can just be bypassed). The idea was the admin makes the initial changes (the database has to be downloaded anyway for CheckM2 to work), modifies the path in the json file in the install directory, the tool can then be used by anyone by simply calling 'conda activate central_checkm2_environment' with no further work required from them.

What's the best alternative for your use-case? Move the json to the user's home dir? Have the admin install the database in a different directory somewhere, then export a CHECKM2_DB upon conda env activation? Other?

Currently updating for a new release, so happy to incorporate suggested changes so it works for you as well.
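
One way the alternatives listed above could be combined is a simple lookup order: an explicit path first, then an environment variable, then the json file. The Python sketch below is only an illustration of that idea and is not CheckM2's actual code; the `resolve_db_path` name is hypothetical, `CHECKM2_DB` is the variable name suggested above, and "Not Set" is the placeholder value seen in the stock diamond_path.json.

```python
import json
import os
from pathlib import Path


def resolve_db_path(cli_path=None, config_file=None, environ=None):
    """Return the database path from the first source that provides one."""
    environ = os.environ if environ is None else environ

    # 1. An explicit --database_path style argument always wins.
    if cli_path:
        return cli_path

    # 2. Next, an environment variable set by the admin or the env activation.
    env_path = environ.get("CHECKM2_DB")
    if env_path:
        return env_path

    # 3. Finally, the json file, if it exists and has been filled in.
    if config_file and Path(config_file).is_file():
        stored = json.loads(Path(config_file).read_text()).get("DBPATH")
        if stored and stored != "Not Set":
            return stored

    return None
```

With this order a central install keeps working from the json file alone, while a read-only container can rely on the argument or the variable.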

@bernt-matthias
Contributor

Anything that avoids the non-zero exit code would be fine for us.

I guess the easiest option would be to catch the OSError: [Errno 30] Read-only file system: exception and print a message to the user (either stdout or stderr is fine). If you like, you can also return a non-zero exit code, but it should be a specific number (then the Galaxy data manager can recognize it and accept the run as successful).

Alternatively, add a flag that just disables the writing of the json file. This might be even cleaner.
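
Both suggestions can be sketched in a few lines of Python. This is not CheckM2's code: the `record_db_path` function, its `skip_write` parameter (standing in for the proposed flag) and the `opener` parameter (there only so the failure can be simulated) are all hypothetical. The json layout follows the diamond_path.json contents shown earlier in the thread.

```python
import errno
import json
import sys


def record_db_path(config_file, db_path, skip_write=False, opener=open):
    """Try to store db_path in config_file; return True if it was written."""
    if skip_write:
        # Equivalent of a "do not touch the install dir" flag.
        return False
    try:
        with opener(config_file, "w") as handle:
            json.dump({"Type": "DIAMONDDB", "DBPATH": db_path}, handle)
    except OSError as exc:
        # Only tolerate read-only or permission problems; re-raise the rest.
        if exc.errno not in (errno.EROFS, errno.EACCES):
            raise
        print(
            f"WARNING: could not update {config_file} ({exc.strerror}); "
            "pass --database_path to predict instead.",
            file=sys.stderr,
        )
        return False
    return True
```

A caller that still wants to signal the skipped write could map a `False` return value to one dedicated exit code, which is the variant a data manager could recognize.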

@bernt-matthias
Contributor

An environment variable is not really necessary as long as --database_path covers everything when the file is absent.
