ENH: New action to annotate FeatureData[MAG] or GenomeData[Proteins] with optional GenomeData[Loci] #90

Closed
91 commits
d33e0cb
added new amrfinder directory and moved types to card directory
VinzentRisch Jun 28, 2024
32e8b7c
dirformat with filecollections
VinzentRisch Jul 1, 2024
d1b2ca6
dirformat with validating all filepaths
VinzentRisch Jul 1, 2024
facc75d
added test data to package data
VinzentRisch Jul 2, 2024
f948195
added amrprot.pot file to git
VinzentRisch Jul 2, 2024
c445800
merge main
VinzentRisch Jul 3, 2024
0670d7e
added new annotation format
VinzentRisch Jul 3, 2024
bfafdb8
added sampledata and feature data dir fmts
VinzentRisch Jul 4, 2024
bb9220c
register all formats
VinzentRisch Jul 4, 2024
317e5cb
using filecollections for the database format
VinzentRisch Jul 4, 2024
71a8da2
merge 80
VinzentRisch Jul 4, 2024
0bf7f20
renamed to dirfmt
VinzentRisch Jul 4, 2024
060f24d
merge 80
VinzentRisch Jul 4, 2024
8378b45
overwrite all pathmakers with code from busco moshpit
VinzentRisch Jul 4, 2024
82a1558
added field to annotation format
VinzentRisch Jul 4, 2024
f42d845
changed name of file in annotation format to allow oter names
VinzentRisch Jul 4, 2024
7e31553
added mags action
VinzentRisch Jul 5, 2024
514688c
registered annotations types in plusgin setup
VinzentRisch Jul 5, 2024
6deb616
Merge branch '85_amrfinderplusannotation_type' into 87_annotate_mags_…
VinzentRisch Jul 5, 2024
a017eeb
changes
VinzentRisch Jul 5, 2024
07e9f52
Revert "overwrite all pathmakers with code from busco moshpit"
VinzentRisch Jul 5, 2024
78c4329
Merge branch '80_amrfinder_database_type' into 85_amrfinderplusannota…
VinzentRisch Jul 5, 2024
09156bb
Merge branch '85_amrfinderplusannotation_type' into 87_annotate_mags_…
VinzentRisch Jul 5, 2024
9237c73
working action
VinzentRisch Jul 5, 2024
34a34b8
removed nested structure of annotaion type
VinzentRisch Jul 5, 2024
c004300
Merge branch '85_amrfinderplusannotation_type' into 87_annotate_mags_…
VinzentRisch Jul 5, 2024
4e38e21
working action with non nested output format
VinzentRisch Jul 5, 2024
16db485
changed magid to id in mags annotaiton
VinzentRisch Jul 8, 2024
8a902ed
moved run fucntion into utils added protein option
VinzentRisch Jul 8, 2024
0cb8492
changed utils to not inlcude _ in filenames
VinzentRisch Jul 9, 2024
4f09ee7
changed type of featuredata one to also include mutations in name
VinzentRisch Jul 9, 2024
f10dcb3
Merge branch '85_amrfinderplusannotation_type' into 87_annotate_mags_…
VinzentRisch Jul 9, 2024
64dffd6
function works without tests
VinzentRisch Jul 9, 2024
ee6804d
minor change
VinzentRisch Jul 9, 2024
bbaca6e
changed type and path_maker
VinzentRisch Jul 10, 2024
81f0fbb
Merge branch '85_amrfinderplusannotation_type' into 87_annotate_mags_…
VinzentRisch Jul 10, 2024
7577214
added sampledata contigs as input
VinzentRisch Jul 10, 2024
b97152f
added validation positive for emty files
VinzentRisch Jul 10, 2024
ac8f92c
Merge branch '85_amrfinderplusannotation_type' into 87_annotate_mags_…
VinzentRisch Jul 10, 2024
fa019d5
fixed bug in mutations empty file creation
VinzentRisch Jul 10, 2024
86babac
fixed other bug in mutations empty file creation
VinzentRisch Jul 10, 2024
55a92d4
merge 87
VinzentRisch Jul 11, 2024
783870d
changed utils protein and nucleotide naming
VinzentRisch Jul 11, 2024
983eb03
Merge branch '87_annotate_mags_amrfinderplus' into 89_annotate-sequen…
VinzentRisch Jul 11, 2024
4b9c200
changed utils protein and nucleotide naming 2
VinzentRisch Jul 11, 2024
dff2500
Merge branch '87_annotate_mags_amrfinderplus' into 89_annotate-sequen…
VinzentRisch Jul 11, 2024
107c039
Revert "changed utils protein and nucleotide naming"
VinzentRisch Jul 11, 2024
92c8400
Revert "changed utils protein and nucleotide naming 2"
VinzentRisch Jul 11, 2024
3813a2b
changed mag and samplename addition to main function
VinzentRisch Jul 11, 2024
3a363bf
Merge branch '87_annotate_mags_amrfinderplus' into 89_annotate-sequen…
VinzentRisch Jul 11, 2024
14a3b62
added tests for utils and sample data
VinzentRisch Jul 12, 2024
db37f81
changed the way manifest is loaded
VinzentRisch Jul 12, 2024
1b49424
merge main
VinzentRisch Jul 16, 2024
a0343d9
added database_format_version
VinzentRisch Jul 16, 2024
a1abc26
Merge branch '91_adding_database_format_version' into 87_annotate_mag…
VinzentRisch Jul 16, 2024
06e809a
added cureated_indet as parameter
VinzentRisch Jul 16, 2024
bff9595
bugfix missing parameter
VinzentRisch Jul 16, 2024
26e66a6
Merge branch 'main' into 87_annotate_mags_amrfinderplus
VinzentRisch Jul 16, 2024
c9a12a9
bug parameter added in mocked function
VinzentRisch Jul 16, 2024
4306889
merge main
VinzentRisch Jul 16, 2024
df06255
merge 87
VinzentRisch Jul 16, 2024
794a6d6
added comments
VinzentRisch Jul 16, 2024
b215266
merge main
VinzentRisch Jul 16, 2024
1d94b24
renaming tests
VinzentRisch Jul 16, 2024
aa63471
Merge branch '87_annotate_mags_amrfinderplus' into 89_annotate-sequen…
VinzentRisch Jul 16, 2024
1148762
added tests
VinzentRisch Jul 17, 2024
f642503
chnages after review
VinzentRisch Jul 17, 2024
e4c6bfd
added s in utils sequneces
VinzentRisch Jul 17, 2024
6004e6e
fix bug full path gff
VinzentRisch Jul 17, 2024
03ae7f9
changed -1 to 0 in indent min
VinzentRisch Jul 17, 2024
7a05e70
changed plugin setup description
VinzentRisch Jul 17, 2024
6761579
merge main
VinzentRisch Jul 17, 2024
5db4afc
changes in plugin setup
VinzentRisch Jul 17, 2024
7bc234a
chnaged to MAG input
VinzentRisch Jul 23, 2024
57e20da
removed AMRFinderplusannotation type
VinzentRisch Jul 23, 2024
16b413f
removed multidirvalidation mixing
VinzentRisch Jul 23, 2024
f5f0bb3
Merge branch '94_chnage_AMRFinderPlusAnnotation_type' into 89_annotat…
VinzentRisch Jul 23, 2024
1aaef95
changed outputs after merge
VinzentRisch Jul 23, 2024
7a42c38
changed format to prodigal
VinzentRisch Jul 23, 2024
869e03c
chnages in annotation action
VinzentRisch Jul 23, 2024
73c2c50
merge 94
VinzentRisch Jul 23, 2024
e3f2abb
added three new parameters
VinzentRisch Jul 23, 2024
f636e73
added typemap for parameters
VinzentRisch Jul 24, 2024
205a4f2
merge 94
VinzentRisch Jul 24, 2024
b298f48
merge 96
VinzentRisch Jul 24, 2024
8988ac1
rename to feature_data
VinzentRisch Jul 24, 2024
4bb7358
split function
VinzentRisch Jul 24, 2024
5cb10fe
fix annotationformat twice bug
VinzentRisch Jul 24, 2024
4127696
added all tests
VinzentRisch Jul 25, 2024
5d83e89
added citations
VinzentRisch Jul 25, 2024
0638809
merge main
VinzentRisch Jul 26, 2024
162 changes: 162 additions & 0 deletions q2_amr/amrfinderplus/feature_data.py
@@ -0,0 +1,162 @@
import glob
import os
import shutil
import tempfile

from q2_types.feature_data_mag import MAGSequencesDirFmt
from q2_types.genome_data import (
    GenesDirectoryFormat,
    LociDirectoryFormat,
    ProteinsDirectoryFormat,
)

from q2_amr.amrfinderplus.types import (
    AMRFinderPlusAnnotationsDirFmt,
    AMRFinderPlusDatabaseDirFmt,
)
from q2_amr.amrfinderplus.utils import run_amrfinderplus_n


def _validate_inputs(mags, loci, proteins):
    if mags and loci and not proteins:
        raise ValueError(
            "Loci input can only be given in combination with proteins input."
        )
    if mags and not loci and proteins:
        raise ValueError(
            "MAGs and proteins inputs together can only "
            "be given in combination with loci input."
        )
    if not mags and not proteins:
        raise ValueError("MAGs or proteins input has to be provided.")


def _get_file_paths(file, mags, proteins, loci):
    # If mags is provided, mag_id is extracted from the file name.
    if mags:
        mag_id = os.path.splitext(os.path.basename(file))[0]

        # If proteins are provided, construct the expected protein file path.
        if proteins:
            protein_path = os.path.join(str(proteins), f"{mag_id}_proteins.fasta")

            # Raise an error if the expected protein file does not exist.
            if not os.path.exists(protein_path):
                raise ValueError(
                    f"Proteins file for ID '{mag_id}' is missing in proteins input."
                )
        else:
            protein_path = None

    # If only proteins are provided (without mags), determine mag_id and protein path.
    else:
        # Extract mag_id from the file name, excluding the last 9 characters
        # '_proteins'.
        mag_id = os.path.splitext(os.path.basename(file))[0][:-9]
        protein_path = file

    # If loci are provided, construct the expected GFF file path.
    if loci:
        gff_path = os.path.join(str(loci), f"{mag_id}_loci.gff")

        # Raise an error if the expected GFF file does not exist.
        if not os.path.exists(gff_path):
            raise ValueError(f"GFF file for ID '{mag_id}' is missing in loci input.")
    else:
        gff_path = None

    return mag_id, protein_path, gff_path


def _move_or_create_files(src_dir: str, mag_id: str, file_operations: list):
    # Loop through all files.
    for file_name, target_dir in file_operations:
        # If the file exists move it to the destination dir and attach mag_id.
        if os.path.exists(os.path.join(src_dir, file_name)):
            shutil.move(
                os.path.join(src_dir, file_name),
                os.path.join(str(target_dir), f"{mag_id}_{file_name}"),
            )
        # If the file does not exist, create empty placeholder file in the
        # destination dir.
        else:
            with open(os.path.join(str(target_dir), f"{mag_id}_{file_name}"), "w"):
                pass


def annotate_feature_data_amrfinderplus(
    amrfinderplus_db: AMRFinderPlusDatabaseDirFmt,
    mags: MAGSequencesDirFmt = None,
    proteins: ProteinsDirectoryFormat = None,
    loci: LociDirectoryFormat = None,
    organism: str = None,
    plus: bool = False,
    report_all_equal: bool = False,
    ident_min: float = None,
    curated_ident: bool = False,
    coverage_min: float = 0.5,
    translation_table: str = "11",
    annotation_format: str = "prodigal",
    report_common: bool = False,
    gpipe_org: bool = False,
    threads: int = None,
) -> (
    AMRFinderPlusAnnotationsDirFmt,
    AMRFinderPlusAnnotationsDirFmt,
    GenesDirectoryFormat,
    ProteinsDirectoryFormat,
):
    # Check for disallowed input combinations.
    _validate_inputs(mags, loci, proteins)

    # Create all output directories.
    amr_annotations = AMRFinderPlusAnnotationsDirFmt()
    amr_all_mutations = AMRFinderPlusAnnotationsDirFmt()
    amr_genes = GenesDirectoryFormat()
    amr_proteins = ProteinsDirectoryFormat()

    # Create the list of files to loop over. If mags is provided, the files in
    # mags are used; if only proteins is provided, the files in proteins are used.
    if mags:
        files = glob.glob(os.path.join(str(mags), "*"))
    else:
        files = glob.glob(os.path.join(str(proteins), "*"))

    with tempfile.TemporaryDirectory() as tmp:
        # Loop over all files
        for file in files:
            # Get paths to protein and gff files, and get mag_id
            mag_id, protein_path, gff_path = _get_file_paths(file, mags, proteins, loci)

            # Run amrfinderplus
            run_amrfinderplus_n(
                working_dir=tmp,
                amrfinderplus_db=amrfinderplus_db,
                dna_sequences=file if mags else None,
                protein_sequences=protein_path,
                gff=gff_path,
                organism=organism,
                plus=plus,
                report_all_equal=report_all_equal,
                ident_min=ident_min,
                curated_ident=curated_ident,
                coverage_min=coverage_min,
                translation_table=translation_table,
                annotation_format=annotation_format,
                report_common=report_common,
                gpipe_org=gpipe_org,
                threads=threads,
            )

            # Output filenames and output directories
            file_operations = [
                ("amr_annotations.tsv", amr_annotations),
                ("amr_all_mutations.tsv", amr_all_mutations),
                ("amr_genes.fasta", amr_genes),
                ("amr_proteins.fasta", amr_proteins),
            ]

            # Move the files or create placeholder files
            _move_or_create_files(tmp, mag_id, file_operations)

    return amr_annotations, amr_all_mutations, amr_genes, amr_proteins
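
For reference, the input combinations accepted by `_validate_inputs` can be enumerated with a small standalone sketch. The function below is a copy of the three checks with no QIIME 2 dependencies, plain booleans stand in for the directory formats, and the names `validate_inputs` and `allowed_combinations` are hypothetical and used only for illustration.

```python
from itertools import product


def validate_inputs(mags, loci, proteins):
    # Standalone copy of the checks in _validate_inputs (assumption: booleans
    # stand in for the MAG, loci and protein directory formats).
    if mags and loci and not proteins:
        raise ValueError(
            "Loci input can only be given in combination with proteins input."
        )
    if mags and not loci and proteins:
        raise ValueError(
            "MAGs and proteins inputs together can only "
            "be given in combination with loci input."
        )
    if not mags and not proteins:
        raise ValueError("MAGs or proteins input has to be provided.")


def allowed_combinations():
    """Return the (mags, loci, proteins) presence triples that pass validation."""
    allowed = []
    for mags, loci, proteins in product([False, True], repeat=3):
        try:
            validate_inputs(mags, loci, proteins)
        except ValueError:
            continue
        allowed.append((mags, loci, proteins))
    return allowed


if __name__ == "__main__":
    for mags, loci, proteins in allowed_combinations():
        print(f"mags={mags!s:5} loci={loci!s:5} proteins={proteins!s:5}")
```

Under these checks four of the eight combinations pass: proteins alone, proteins with loci, MAGs alone, and all three inputs together.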
220 changes: 220 additions & 0 deletions q2_amr/amrfinderplus/tests/test_feature_data.py
@@ -0,0 +1,220 @@
import os
from pathlib import Path
from unittest.mock import MagicMock, patch

from q2_types.feature_data_mag import MAGSequencesDirFmt
from q2_types.genome_data import ProteinsDirectoryFormat
from qiime2.plugin.testing import TestPluginBase

from q2_amr.amrfinderplus.feature_data import (
    _get_file_paths,
    _move_or_create_files,
    _validate_inputs,
    annotate_feature_data_amrfinderplus,
)


class TestValidateInputs(TestPluginBase):
    package = "q2_amr.amrfinderplus.tests"

    def test_loci_mags(self):
        with self.assertRaisesRegex(
            ValueError,
            "Loci input can only be given in combination with proteins input.",
        ):
            _validate_inputs(mags="mags", loci="loci", proteins=None)

    def test_no_loci_protein_mags(self):
        with self.assertRaisesRegex(
            ValueError,
            "MAGs and proteins inputs together can only be given in combination with "
            "loci input.",
        ):
            _validate_inputs(mags="mags", loci=None, proteins="proteins")

    def test_no_protein_no_mags(self):
        with self.assertRaisesRegex(
            ValueError, "MAGs or proteins input has to be provided."
        ):
            _validate_inputs(mags=None, loci="loci_directory", proteins=None)


class TestMoveOrCreateFiles(TestPluginBase):
    package = "q2_amr.amrfinderplus.tests"

    def setUp(self):
        super().setUp()

        self.tmp = self.temp_dir.name
        self.src_dir = os.path.join(self.tmp, "src_dir")
        self.target_dir = os.path.join(self.tmp, "target_dir")
        os.mkdir(self.src_dir)
        os.mkdir(self.target_dir)

    def test_move_file(self):
        # Create a dummy file in the source directory
        with open(os.path.join(self.src_dir, "test_file.txt"), "w"):
            pass

        # Define the file operations
        file_operations = [("test_file.txt", self.target_dir)]

        # Run the function
        _move_or_create_files(
            src_dir=self.src_dir,
            mag_id="mag",
            file_operations=file_operations,
        )

        # Assert the file was moved
        self.assertTrue(
            os.path.exists(os.path.join(self.target_dir, "mag_test_file.txt"))
        )

    def test_file_missing_create_placeholder(self):
        # Define the file operations
        file_operations = [("test_file.txt", self.target_dir)]

        # Run the function
        _move_or_create_files(
            src_dir=self.src_dir,
            mag_id="mag",
            file_operations=file_operations,
        )

        # Assert the placeholder file was created
        self.assertTrue(
            os.path.exists(os.path.join(self.target_dir, "mag_test_file.txt"))
        )

    def test_with_mags_and_proteins_file_missing(self):
        # This test exercises _get_file_paths with paths that do not exist.
        with self.assertRaisesRegex(
            ValueError, "Proteins file for ID 'mag_id' is missing in proteins input."
        ):
            _get_file_paths("path/mag_id.fasta", "path/mags", "path/proteins", None)


class TestGetFilePaths(TestPluginBase):
    package = "q2_amr.amrfinderplus.tests"

    def setUp(self):
        super().setUp()

        self.test_dir = self.temp_dir
        self.test_dir_path = Path(self.test_dir.name)
        self.file_path = self.test_dir_path / "test_file.fasta"
        self.file_path.touch()  # Create an empty test file

    def test_with_mags_and_proteins_file_exists(self):
        protein_file_path = self.test_dir_path / "test_file_proteins.fasta"
        protein_file_path.touch()  # Create an empty protein file

        mag_id, protein_path, gff_path = _get_file_paths(
            file=self.file_path,
            mags=self.test_dir_path,
            proteins=self.test_dir_path,
            loci=None,
        )
        self.assertEqual(mag_id, "test_file")
        self.assertEqual(protein_path, str(protein_file_path))
        self.assertIsNone(gff_path)

    def test_with_mags_and_proteins_file_missing(self):
        with self.assertRaisesRegex(
            ValueError,
            "Proteins file for ID 'test_file' is missing in proteins input.",
        ):
            _get_file_paths(
                file=self.file_path,
                mags=self.test_dir_path,
                proteins=self.test_dir_path,
                loci=None,
            )

    def test_with_proteins_only(self):
        protein_file_path = self.test_dir_path / "test_file_proteins.fasta"
        protein_file_path.touch()  # Create an empty protein file

        mag_id, protein_path, gff_path = _get_file_paths(
            file=protein_file_path, mags=None, proteins=self.test_dir_path, loci=None
        )
        self.assertEqual(mag_id, "test_file")
        self.assertEqual(protein_path, protein_file_path)
        self.assertIsNone(gff_path)

    def test_with_loci_file_exists(self):
        gff_file_path = self.test_dir_path / "test_file_loci.gff"
        gff_file_path.touch()  # Create an empty GFF file

        mag_id, protein_path, gff_path = _get_file_paths(
            file=self.file_path,
            mags=self.test_dir_path,
            proteins=None,
            loci=self.test_dir_path,
        )
        self.assertEqual(mag_id, "test_file")
        self.assertIsNone(protein_path)
        self.assertEqual(gff_path, str(gff_file_path))

    def test_with_loci_file_missing(self):
        with self.assertRaisesRegex(
            ValueError, "GFF file for ID 'test_file' is missing in loci input."
        ):
            _get_file_paths(
                file=self.file_path,
                mags=self.test_dir_path,
                proteins=None,
                loci=self.test_dir_path,
            )

    def test_with_mags_proteins_and_loci_all_files_exist(self):
        protein_file_path = self.test_dir_path / "test_file_proteins.fasta"
        gff_file_path = self.test_dir_path / "test_file_loci.gff"
        protein_file_path.touch()  # Create an empty protein file
        gff_file_path.touch()  # Create an empty GFF file

        mag_id, protein_path, gff_path = _get_file_paths(
            file=self.file_path,
            mags=self.test_dir_path,
            proteins=self.test_dir_path,
            loci=self.test_dir_path,
        )
        self.assertEqual(mag_id, "test_file")
        self.assertEqual(protein_path, str(protein_file_path))
        self.assertEqual(gff_path, str(gff_file_path))


class TestAnnotateFeatureDataAMRFinderPlus(TestPluginBase):
    package = "q2_amr.amrfinderplus.tests"

    # Stacked @patch decorators are applied bottom-up, so the mock arguments
    # are listed in the reverse order of the decorators.
    @patch("q2_amr.amrfinderplus.feature_data._validate_inputs")
    @patch(
        "q2_amr.amrfinderplus.feature_data._get_file_paths",
        return_value=("mag_id", "protein_path", "gff_path"),
    )
    @patch("q2_amr.amrfinderplus.feature_data.run_amrfinderplus_n")
    @patch("q2_amr.amrfinderplus.feature_data._move_or_create_files")
    def test_annotate_feature_data_amrfinderplus_mags(
        self, mock_move, mock_run, mock_paths, mock_validate
    ):
        mags = MAGSequencesDirFmt()
        with open(os.path.join(str(mags), "mag.fasta"), "w"):
            pass
        annotate_feature_data_amrfinderplus(amrfinderplus_db=MagicMock(), mags=mags)

    @patch("q2_amr.amrfinderplus.feature_data._validate_inputs")
    @patch(
        "q2_amr.amrfinderplus.feature_data._get_file_paths",
        return_value=("mag_id", "protein_path", "gff_path"),
    )
    @patch("q2_amr.amrfinderplus.feature_data.run_amrfinderplus_n")
    @patch("q2_amr.amrfinderplus.feature_data._move_or_create_files")
    def test_annotate_feature_data_amrfinderplus_proteins(
        self, mock_move, mock_run, mock_paths, mock_validate
    ):
        proteins = ProteinsDirectoryFormat()
        with open(os.path.join(str(proteins), "proteins.fasta"), "w"):
            pass
        annotate_feature_data_amrfinderplus(
            amrfinderplus_db=MagicMock(), proteins=proteins
        )
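
The action tests stack four `@patch` decorators, and the order in which `unittest.mock` hands the mocks to the test is easy to get backwards: the decorator closest to the function supplies the first mock argument. A minimal standalone sketch of that rule follows. It patches `os.getcwd` and `os.getpid` purely as stand-in targets, and `check_order` is a hypothetical name; none of these are part of the plugin.

```python
import os
from unittest.mock import patch


@patch("os.getcwd")  # outermost decorator: its mock is the LAST argument
@patch("os.getpid")  # innermost decorator: its mock is the FIRST argument
def check_order(mock_getpid, mock_getcwd):
    # While the patches are active, each attribute is replaced by the mock
    # created for it, so identity checks reveal which argument is which.
    return os.getpid is mock_getpid and os.getcwd is mock_getcwd


if __name__ == "__main__":
    print(check_order())
```

In the tests here the mocks are never inspected, so a mismatched argument order would not cause a failure, but it would mislead anyone who later adds assertions such as `mock_run.assert_called_once()`.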