GNPS Metadata Tools
This repository contains three Python scripts for working with GNPS metadata files: gnps_downloader.py, gnps_validator.py, and gnps_name_matcher.py. These scripts allow you to aggregate lists of GNPS metadata files, validate the files using the GNPS metadata validator, and match the files to their respective CCMS peak dataset names.
Installation
To use these scripts, you'll need to have Python 3 installed on your system.
You can download Python from the official Python website: https://www.python.org/downloads/
Install the third-party dependencies with pip (urllib is part of the Python standard library and needs no separate install):
pip install pandas requests
Usage
gnps_downloader.py
This script aggregates a list of GNPS metadata files, sorts them by creation time, and downloads the latest GNPS metadata file. It then appends the file path and file name to a TSV file.
To run the script, use the following command:
python3 gnps_downloader.py
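The sort-and-record step described above can be sketched roughly as follows. This is a minimal illustration, not the code in gnps_downloader.py: the record fields `path` and `create_time` and the helper names `pick_latest` and `append_row` are assumptions made for the example, and the download itself is left out.

```python
import csv
from pathlib import PurePosixPath


def pick_latest(records):
    """Sort records by creation time and return the newest one.

    Each record is assumed (hypothetically) to be a dict with a
    'path' and a comparable 'create_time' value.
    """
    ordered = sorted(records, key=lambda record: record["create_time"])
    return ordered[-1]


def append_row(tsv_path, file_path):
    """Append one 'file path <TAB> file name' row to a TSV file."""
    file_name = PurePosixPath(file_path).name
    with open(tsv_path, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([file_path, file_name])
```

Opening the TSV in append mode means repeated runs add rows rather than overwrite earlier ones, which matches the "appends" behaviour described above.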
gnps_validator.py
This script runs the downloaded GNPS metadata files through the GNPS metadata validator. Files that fail validation are rejected, and the names of the files that pass are appended to a TSV file.
To run the script, use the following command:
python3 gnps_validator.py
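As a rough sketch of the pass/reject split, the snippet below takes the validator as a plain callable. The `is_valid` argument is a stand-in for the GNPS metadata validator, whose actual interface is not shown here, and `split_by_validation` and `append_names` are hypothetical helper names.

```python
import csv


def split_by_validation(paths, is_valid):
    """Split paths into (passed, rejected) using a validator callable.

    'is_valid' is a placeholder for the real validator: any function
    that takes a path and returns True or False.
    """
    passed, rejected = [], []
    for path in paths:
        if is_valid(path):
            passed.append(path)
        else:
            rejected.append(path)
    return passed, rejected


def append_names(tsv_path, names):
    """Append one file name per row to a TSV file."""
    with open(tsv_path, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name in names:
            writer.writerow([name])
```

Keeping the validator behind a callable makes it possible to try the bookkeeping with a toy check before wiring in the real one.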
gnps_name_matcher.py
This script matches the GNPS metadata files to their respective CCMS peak dataset names and writes a TSV file containing every name that matches unambiguously.
To run the script, use the following command:
python3 gnps_name_matcher.py
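One way to read "matches unambiguously" is that a name is kept only when exactly one candidate is found. The sketch below illustrates that idea under an assumed matching rule (comparing base names without their extension). The real script may use a different rule, and `unambiguous_matches` is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def unambiguous_matches(metadata_names, dataset_paths):
    """Map each metadata file name to its single matching dataset path.

    Matching on the extension-free base name is an assumption made for
    this example. Names with zero or several candidates are left out.
    """
    by_stem = defaultdict(list)
    for path in dataset_paths:
        by_stem[PurePosixPath(path).stem].append(path)

    matches = {}
    for name in metadata_names:
        candidates = by_stem.get(PurePosixPath(name).stem, [])
        if len(candidates) == 1:
            matches[name] = candidates[0]
    return matches
```

With this rule, a name that appears in two different dataset folders is dropped instead of being assigned to one of them arbitrarily.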
data/allowed_terms.json
The terms allowed in REDU are read from this JSON file. Terms from controlled ontologies for the variables MassSpectrometer, NCBITaxonomy, UBERONBodyPartName, and DOIDCommonName are added to the JSON within the workflow. Run data/get_data.sh to download the required data. Additional terms can be added to the JSON file, but remember to update the Google Sheet as well (https://docs.google.com/spreadsheets/d/1v71bnUd8fiXX51zuZIUAvYETWmpwFQj-M3mu4CNsHBU/edit#gid=791995663).
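Adding a term by hand could look something like the sketch below. The layout assumed here (a flat mapping from variable name to a list of allowed values) is a guess for illustration only and may not match the real data/allowed_terms.json, so check the file before adapting it. `add_allowed_term` is a hypothetical helper.

```python
import json


def add_allowed_term(terms, variable, value):
    """Add 'value' to the allowed list for 'variable' if it is missing.

    'terms' is assumed to be a dict of variable name -> list of strings;
    the real file layout may differ.
    """
    values = terms.setdefault(variable, [])
    if value not in values:
        values.append(value)
        values.sort()
    return terms


def dump_terms(terms):
    """Serialize the terms with stable, readable formatting."""
    return json.dumps(terms, indent=2, sort_keys=True)
```

The duplicate check keeps repeated edits from growing the list, and the sorted output keeps diffs of the JSON file small.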