This workflow enables efficient all-vs-all MS2 spectra comparison between two folders, supporting both mzML and mgf formats. It uses fast indexing and cosine similarity, and is designed for large-scale GNPS2 batch spectral comparison tasks.
- Input two folders (each can contain multiple mzML/mgf files); all spectra are automatically discovered recursively.
- Only compares spectra between folder1 and folder2 (no intra-folder comparison).
- Flexible parameter configuration: tolerance, threshold, minimum matched peaks, alignment strategy, peak filtering, threads, and output filename.
- Outputs all high-scoring matches as a TSV file with columns: set1, set2, delta_mz, cosine.
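
The recursive discovery and folder1-versus-folder2 pairing described above can be sketched as follows. This is only an illustration, not the code in `bin/ms2crosscompare.py`: the function names `discover_spectrum_files` and `cross_folder_pairs` are hypothetical, and case-insensitive extension matching is an assumption.

```python
from itertools import product
from pathlib import Path

# Extensions the workflow is documented to accept.
# Assumption: matching is case-insensitive (".mzML", ".MGF", ...).
SPECTRUM_EXTENSIONS = {".mzml", ".mgf"}


def discover_spectrum_files(folder):
    """Recursively collect mzML/mgf files under `folder`, sorted for stable output."""
    root = Path(folder)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SPECTRUM_EXTENSIONS
    )


def cross_folder_pairs(folder1, folder2):
    """Pair every file in folder1 with every file in folder2, never within one folder."""
    files1 = discover_spectrum_files(folder1)
    files2 = discover_spectrum_files(folder2)
    return list(product(files1, files2))
```

Because pairs are built as a product of the two file lists, no file is ever compared with another file from its own folder.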
Parameter | Description | Default Value |
---|---|---|
inputspectra1 | Path to input folder 1 | (must be specified) |
inputspectra2 | Path to input folder 2 | (must be specified) |
tolerance | MS2 tolerance (Da) | 0.01 |
threshold | Cosine similarity threshold | 0.7 |
minmatches | Minimum number of matched peaks | 6 |
alignment_strategy | Alignment strategy (index_single_charge/index_multi_charge) | index_single_charge |
enable_peak_filtering | Enable peak filtering (yes/no) | no |
threads | Number of threads | 1 |
output | Output filename | cross_compare_results.tsv |
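
To show how `tolerance`, `threshold`, and `minmatches` interact, here is a minimal brute-force sketch of a greedy cosine score between two peak lists. It is not the workflow's indexed implementation: the function names `greedy_cosine` and `passes_filters` are hypothetical, peaks are assumed to be `(mz, intensity)` tuples, and treating both cutoffs as inclusive (`>=`) is an assumption.

```python
import math


def greedy_cosine(peaks_a, peaks_b, tolerance=0.01):
    """Return (cosine, matched_peak_count) for two lists of (mz, intensity) peaks."""
    norm_a = math.sqrt(sum(i * i for _, i in peaks_a))
    norm_b = math.sqrt(sum(i * i for _, i in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0

    # Score every candidate pair whose fragment m/z values agree within the tolerance (Da).
    candidates = []
    for ia, (mz_a, int_a) in enumerate(peaks_a):
        for ib, (mz_b, int_b) in enumerate(peaks_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((int_a * int_b / (norm_a * norm_b), ia, ib))

    # Greedy assignment: take the highest-scoring pairs first, using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    score = 0.0
    for contribution, ia, ib in candidates:
        if ia in used_a or ib in used_b:
            continue
        used_a.add(ia)
        used_b.add(ib)
        score += contribution
    return score, len(used_a)


def passes_filters(peaks_a, peaks_b, tolerance=0.01, threshold=0.7, minmatches=6):
    """Apply the matched-peak and cosine cutoffs (assumed inclusive) to one spectrum pair."""
    cosine, matches = greedy_cosine(peaks_a, peaks_b, tolerance)
    return matches >= minmatches and cosine >= threshold
```

With the defaults, a pair is kept only if at least 6 peaks match within 0.01 Da and the resulting cosine is at least 0.7, so two identical spectra with only three peaks would still be rejected.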
python bin/ms2crosscompare.py data/round1 data/round2 \
--tolerance 0.01 \
--threshold 0.7 \
--minmatches 6 \
--alignment_strategy index_single_charge \
--enable_peak_filtering no \
--threads 1 \
--output cross_compare_results.tsv
All parameters can be configured via workflowinput.yaml and are compatible with GNPS2 web interface forms.
nextflow run nf_workflow.nf --inputspectra1 data/round1 --inputspectra2 data/round2 --tolerance 0.01 --threshold 0.7 --minmatches 6 --alignment_strategy index_single_charge --enable_peak_filtering no --threads 1 --output cross_compare_results.tsv
The output is a TSV file with the following columns:
- set1: Spectrum ID from folder 1 (including filename and scan number)
- set2: Spectrum ID from folder 2
- delta_mz: Precursor mass difference
- cosine: Cosine similarity score
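
For downstream analysis, the results file can be loaded with the Python standard library. The sketch below assumes the TSV begins with a header row naming the four columns listed above; `load_matches` is a hypothetical helper, not part of the workflow.

```python
import csv


def load_matches(tsv_path, min_cosine=0.0):
    """Read the results TSV and return rows with cosine >= min_cosine, best first."""
    with open(tsv_path, newline="") as handle:
        # Assumption: the first line is a header with set1, set2, delta_mz, cosine.
        rows = [
            {
                "set1": row["set1"],
                "set2": row["set2"],
                "delta_mz": float(row["delta_mz"]),
                "cosine": float(row["cosine"]),
            }
            for row in csv.DictReader(handle, delimiter="\t")
        ]
    kept = [row for row in rows if row["cosine"] >= min_cosine]
    return sorted(kept, key=lambda row: row["cosine"], reverse=True)
```

For example, `load_matches("cross_compare_results.tsv", min_cosine=0.9)` would return only the strongest matches, sorted by descending cosine.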
Make sure to have the NextflowModule updated if you plan on using it:
git submodule update --init --remote --recursive
To test the workflow, run:
make run
To learn Nextflow, check out the documentation:
https://www.nextflow.io/docs/latest/index.html
You will need conda, mamba, and nextflow installed to run the workflow locally.
Check the definition for the workflow input and display parameters: https://wang-bioinformatics-lab.github.io/GNPS2_Documentation/workflowdev/
A set of deployment tools handles deployment to the various GNPS2 systems. Before deploying, complete the following setup steps:
- Deployment submodules checked out
- Conda environment and dependencies
- SSH configuration updated
Run the following commands from the deploy_gnps2 folder.
You might need to check out the module first. To do so, run:
git submodule init
git submodule update
You will also need to specify the username you were given on the server, which is the account your public key is associated with. To avoid entering it on every deployment, create a Makefile.credentials file in the deploy_gnps2 folder with the following contents:
USERNAME=<enter the username>
You will need to install the dependencies in GNPS2_DeploymentTooling/requirements.txt on your own local machine.
You can find this here.
One way to do this is to use conda to create an environment, for example:
conda create -n deploy python=3.8
conda activate deploy
pip install -r GNPS2_DeploymentTooling/requirements.txt
Also update your SSH config file to include the following target:
Host ucr-gnps2-dev
Hostname ucr-lemon.duckdns.org
To deploy to development, use the following command. If your SSH public key is not installed on the server, you will not be able to deploy.
make deploy-dev
To deploy to production, use the following command. If your SSH public key is not installed on the server, you will not be able to deploy.
make deploy-prod