ENH: check RGI mapping accuracy and manually curate incorrect hits
All RGI hits are checked for mapping accuracy using `check_mapping_accuracy.py`. The check compares the drug categorization of the ARO assigned to each ARG with the drug categorization assigned to that ARG by its original database. Lists of ARGs with mismatched mappings are included in `manual_curation/`. Mismatched hits are manually curated to correct ARGs, and the manual curation files are updated. Some mismatched hits are marked `correct` or `incorrect`. `correct` hits have a drug category mismatch but an ARO mapping that was determined to be correct. `incorrect` hits have a drug category mismatch, and no manually curated ARO could be found for them. Neither `correct` nor `incorrect` hits should be included in the final manual curation. For megares, a file called `megares_meta_biocide_and_virulence_genes.tsv` is also created with metal, biocide, and virulence genes. These should not be mapped to any ARO terms and can be manually curated to no ARO. For the other databases, metal, biocide, and virulence genes are added directly to the manual curation files. Groot is derived from resfinder and argannot.

A list of all mismatched hits, with their manual curation, can be found in `db_harmonisation/all_mismatched_hits.tsv`.
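
The drug-category comparison described in this message could look roughly like the sketch below. This is not the code in `check_mapping_accuracy.py`: the function names, the dictionary inputs, and the rule "flag a hit when the two sets of drug categories share nothing" are all assumptions made for illustration.

```python
# Hypothetical sketch; not the actual check_mapping_accuracy.py.

def normalise(categories):
    """Lower-case and strip drug category labels so they compare cleanly."""
    return {c.strip().lower() for c in categories if c.strip()}


def find_mismatches(rgi_hits, aro_drug_classes, db_drug_classes):
    """Return hits whose ARO drug categories share nothing with the source database's.

    rgi_hits:         dict of original gene ID -> ARO accession assigned by RGI
    aro_drug_classes: dict of ARO accession -> iterable of drug category labels
    db_drug_classes:  dict of original gene ID -> iterable of drug category labels
    """
    mismatches = []
    for gene_id, aro in rgi_hits.items():
        aro_classes = normalise(aro_drug_classes.get(aro, ()))
        db_classes = normalise(db_drug_classes.get(gene_id, ()))
        # Assumed rule: flag only when both sides are annotated and disjoint.
        if aro_classes and db_classes and not (aro_classes & db_classes):
            mismatches.append((gene_id, aro, sorted(db_classes), sorted(aro_classes)))
    return mismatches


# Toy data with made-up identifiers, only to show the shape of the inputs.
hits = {"gene_a": "ARO:1", "gene_b": "ARO:2"}
aro_classes = {"ARO:1": ["Tetracycline antibiotic"], "ARO:2": ["macrolide antibiotic"]}
db_classes = {"gene_a": ["tetracycline antibiotic"], "gene_b": ["glycopeptide antibiotic"]}
flagged = find_mismatches(hits, aro_classes, db_classes)
```

Here only `gene_b` is flagged. A flagged hit is a candidate for manual review rather than a confirmed error, which matches the `correct` label used above for mismatches whose ARO mapping turns out to be right.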
Vedanth-Ramji committed Feb 12, 2025
1 parent 934e208 commit f59f39b
Showing 23 changed files with 19,423 additions and 1,122 deletions.
16 changes: 8 additions & 8 deletions argnorm/data/groot_ARO_mapping.tsv
@@ -1512,7 +1512,7 @@ Original ID ARO
(Gly)VanA-Ao2:HQ679900:1034-2080:1047 3002913
(Gly)VanA-B:AF192329:28835-29863:1029 3000013
(Gly)VanA-D:AY082011:5901-6932:1032 3000005
-(Gly)VanA-G:AY271782:157-606:450 3007029
+(Gly)VanA-G:AY271782:157-606:450 3000010
(Gly)VanA-M:FJ349556:4857-5888:1032 3002911
(Gly)VanAE-Pp:AF155139:5940-6971:1032 3002908
(Gly)VanB:AY655710:1-1029:1029 3000013
@@ -1632,7 +1632,7 @@ Original ID ARO
(MLS)ErmZ:AM709783:2817-3665:849 3004605
(MLS)LinA:M14039:413-898:486 3002835
(MLS)LinB:AJ238249:127-930:804 3002836
-(MLS)LmrA:X59926:318-1763:1446 3003955
+(MLS)LmrA:X59926:318-1763:1446 3003028
(MLS)LmrB:X62867:361-1197:837 3001305
(MLS)LmrB:X79146:27840-28679:840 3001305
(MLS)LnuB:KC688833:1-804:804 3002836
@@ -1647,7 +1647,7 @@ Original ID ARO
(MLS)MphA:DQ445270:1626-2531:1230 3000316
(MLS)MphB:D85892:1159-2067:909 3000318
(MLS)MphC:AF167161:5665-6564:900 3000319
-(MLS)MphD:NC_017312:2291580-2292413:834 3004581
+(MLS)MphD:NC_017312:2291580-2292413:834 3000333
(MLS)MphE:JF769133:8777-9661:885 3003741
(MLS)MsrA:AY591760:274-1740:1467 3000251
(MLS)MsrC:AY004350:496-1974:1479 3002819
@@ -1717,7 +1717,7 @@ Original ID ARO
(Rif)Arr5:EF660563:393-845:453 3002850
(Rif)Arr7:FN397623:1189-1641:453 3002852
(Rif)IRI:U56415:280-1719:1440 3002884
-(Rif)Rif:EF541029:530-2170:1641 3004047
+(Rif)Rif:EF541029:530-2170:1641 3004040
(Sul)SulI:AF071413:6700-7539:840 3000410
(Sul)SulII:EU360945:1617-2432:816 3000412
(Sul)SulIII:HQ875016:7396-8187:792 3000413
@@ -1728,7 +1728,7 @@ Original ID ARO
(Tet)Tet-31:AJ250203:1651-2883:1233 3000476
(Tet)Tet-32:DQ647324:181-2100:1920 3000196
(Tet)Tet-33:AJ420072:22940-24163:1224 3000478
-(Tet)Tet-34:AB061440:306-770:465 3007840
+(Tet)Tet-34:AB061440:306-770:465 3002870
(Tet)Tet-35:AF353562:2213-3322:1110 3000481
(Tet)Tet-36:AJ514254:2534-4456:1923 3000197
(Tet)Tet-37:AF540889:1-327:327 3002871
@@ -1758,7 +1758,7 @@ Original ID ARO
(Tet)TetR:HF545434:53576-54226:651 3003479
(Tet)TetS:L09756:447-2372:1926 3000192
(Tet)TetT:L42544:478-2433:1956 3000193
-(Tet)TetU:U01917:413-730:318 3002907
+(Tet)TetU:U01917:413-730:318 3004650
(Tet)TetV:AF030344:462-1721:1260 3000181
(Tet)TetW:AJ222769:3687-5606:1920 3000194
(Tet)TetX:M37699:586-1752:1167 3000205
@@ -9373,7 +9373,7 @@ leuO 3003843
lfrA 3003967
lin 3004651
linG 3002879
-lmr(A)_1_X59926 3003955
+lmr(A)_1_X59926 3003028
lmr(B)_1_X62867 3001305
lmrB 3002813
lmrC 3002881
@@ -9866,7 +9866,7 @@ tet(32)_2_EF626943 3000196
tet(33) 3000478
tet(33)_1_AY255627 3000478
tet(33)_2_DQ390458 3000478
-tet(34)_1_AB061440 3007840
+tet(34)_1_AB061440 3002870
tet(35) 3000481
tet(35)_1_AF353562 3000481
tet(36) 3000197
18 changes: 14 additions & 4 deletions argnorm/data/manual_curation/argannot_curation.tsv
@@ -1,4 +1,14 @@
-Original ID ARO Gene Name in CARD Description
-(Phe)cpt_strepv:U09991:AAB36569:1412-1948:537 3000249 chloramphenicol phosphotransferase Parent ARO mapping
-(Tet)tetH:EF460464:6286-7839:1554 3000175 tet(H) Loose RGI mapping. Mapped incorrectly to ARO:3004797.
-(AGly)aadC:V01282:225-701:477 3000225 ANT(6) Parent ARO mapping
+Original ID ARO Gene Name in CARD Description
+(Phe)cpt_strepv:U09991:AAB36569:1412-1948:537 3000249 chloramphenicol phosphotransferase Parent ARO mapping
+(AGly)aadC:V01282:225-701:477 3000225 ANT(6) Parent ARO mapping
+(Gly)vanA-G:AY271782:157-606:450 3000010
+(MLS)lmr(A):X59926:318-1763:1446 3003028
+(MLS)mph(D):NC_017312:2291580-2292413:834 3000333
+(Rif)rif:EF541029:530-2170:1641 3004040
+(Tet)tet(34):AB061440:306-770:465 3002870
+(Tet)tetH:EF460464:6286-7839:1554 3000175
+(Tet)tetU:U01917:413-730:318 3004650
+(TetracenomycinC)tcmA:NG_048121:101-1717:1617 3003554
+(Phe)cpt:NG_051909:101-631:531 3000249
+(Fcyn)FomC:AB016934:10868-11656:789 3004246
+(Flq)crpP:NG_062203:WP_033179079:101-298:198 3004467
214 changes: 198 additions & 16 deletions argnorm/data/manual_curation/deeparg_curation.tsv

Large diffs are not rendered by default.

652 changes: 519 additions & 133 deletions argnorm/data/manual_curation/megares_curation.tsv

Large diffs are not rendered by default.

713 changes: 416 additions & 297 deletions argnorm/data/manual_curation/ncbi_curation.tsv

Large diffs are not rendered by default.

105 changes: 54 additions & 51 deletions argnorm/data/manual_curation/resfinder_curation.tsv

Large diffs are not rendered by default.

13 changes: 9 additions & 4 deletions argnorm/data/manual_curation/resfinderfg_curation.tsv
@@ -1,4 +1,9 @@
-Original ID ARO Gene Name in CARD Description
-UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KF629588.1|pediatric_fecal_sample|CYC 3003970 D-Ala-D-Ala ligase
-UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KF629153.1|pediatric_fecal_sample|CYC 3003970 D-Ala-D-Ala ligase
-UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KJ695568.1|Agricultural soil|CYC 3003970 D-Ala-D-Ala ligase
+Original ID ARO Gene Name in CARD Description
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KF629588.1|pediatric_fecal_sample|CYC 3003970 D-Ala-D-Ala ligase
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KF629153.1|pediatric_fecal_sample|CYC 3003970 D-Ala-D-Ala ligase
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KJ695568.1|Agricultural soil|CYC 3003970 D-Ala-D-Ala ligase
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KX125757.1|human_gut|CYC 3003970
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KX125843.1|human_gut|CYC 3003970
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KF627869.1|pediatric_fecal_sample|CYC 3003970
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KF629229.1|pediatric_fecal_sample|CYC 3003970
+UDP-N-acetylmuramoyl-tripeptide--D-alanyl-D-alanine ligase|KF630034.1|pediatric_fecal_sample|CYC 3003970
9 changes: 5 additions & 4 deletions argnorm/data/manual_curation/sarg_curation.tsv
@@ -1,4 +1,5 @@
-Original ID ARO Gene Name in CARD Description
-gb|AAG57600.1|ARO:3000318|mphB 3000318 mphB
-AM180355.1.gene2260.p01 3000250 ErmC
-gb|AUW34359.1|ARO:3004445|RSA-2 3005440 RSA2 beta-lactamase ARO number of RSA2 had been changed.
+Original ID ARO Gene Name in CARD Description
+gb|AAG57600.1|ARO:3000318|mphB 3000318 mphB
+AM180355.1.gene2260.p01 3000250 ErmC
+gb|AUW34359.1|ARO:3004445|RSA-2 3005440 RSA2 beta-lactamase ARO number of RSA2 had been changed.
+gb|AAB08924.1|ARO:3004650|tetU 3004650
4 changes: 4 additions & 0 deletions db_harmonisation/README.md
@@ -179,6 +179,10 @@ cd .. && rm -r tmp

Genes from ARG annotation outputs are mapped to ARO accessions using ARO annotation tables. ARO annotation tables are constructed using the RGI (Alcock et al., 2023) and manual curation. All databases except MEGARes v3.0 are handled as amino acid files, where all ARG sequences (coding sequences) in the databases are translated to amino acid form using BioPython (Cock et al., 2009). The amino acid files are processed by RGI using the ‘protein’ mode to map genes to the ARO. The ‘Original ID’ (gene name in ARG annotation database), ‘Best_Hit_ARO’ (gene name in CARD) and ‘ARO’ (ARO accession) columns from the RGI output are specifically chosen from the RGI output, with the ‘Best_Hit_ARO’ column renamed to ‘Gene Name in CARD’ , to form the automated annotation tables. Genes which were not given an ARO mapping by RGI are manually assigned an ARO accession. The manual curation and automated annotation tables are combined to produce ARO annotation tables.

+All RGI hits are checked for mapping accuracy using `check_mapping_accuracy.py`. This is done by comparing the drug categorization of the ARO assigned to the ARG with the drug categorization assigned to the ARG by its original database. A list of ARGs with mismatched mappings are included in `manual_curation/`. Mismatched hits are manually curated to correct ARGs and manual curation files are updated. Some mismatched hits are marked `correct` or `incorrect`. `correct` hits have a drug category mismatch but have an ARO mapping that's determined to be correct. `incorrect` hits have a drug category mismatch and a manually curated ARO can't be found for them. Both `correct` and `incorrect` hits should not be included in the final manual curation. For megares, a file called `megares_meta_biocide_and_virulence_genes.tsv` is also created with metal, biocide, and virulence genes. These should not be mapped to any ARO and can be manually curated to no ARO. Metal, biocide, and virulence genes are directly added to manual curation files for other databases. Groot is derived from resfinder and argannot. All loose hit manual curation for resfinder and argannot should also be manually updated for groot.
+
+A list of all mismatched hits, with their manual curation can be found in `all_mismatched_hits.tsv`
+
## Handling ResFinder
The ResFinder v4.0 is a notable example as it contains forty instances of gene clusters or sequences with multiple coding sequences within (e.g. vanM_1_FJ349556 contains seven coding sequences). As RGI maps a single gene from the ResFinder database to multiple AROs (corresponding to different coding sequences within) in cases with gene clusters, gene clusters are manually assigned ARO accessions corresponding to specific gene clusters. Reverse complement sequences within ResFinder are also manually assigned ARO accessions.

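The README text in this hunk describes building each ARO annotation table from selected RGI output columns plus a manual curation file. A minimal sketch of that combination step follows, using only the column names quoted in the README (`Original ID`, `Best_Hit_ARO`, `ARO`, `Gene Name in CARD`). The helper names, and the choice to let a manual curation row replace an RGI row with the same `Original ID`, are assumptions and not taken from the repository's code.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_annotation_table(rgi_rows, curation_rows):
    """Combine automated RGI rows with manual curation rows (hypothetical helper)."""
    table = {}
    for row in rgi_rows:
        # Keep the three columns named in the README and rename Best_Hit_ARO.
        table[row["Original ID"]] = {
            "Original ID": row["Original ID"],
            "ARO": row["ARO"],
            "Gene Name in CARD": row["Best_Hit_ARO"],
        }
    for row in curation_rows:
        # Assumption: a manually curated row wins over the RGI row for the same ID.
        table[row["Original ID"]] = {
            "Original ID": row["Original ID"],
            "ARO": row["ARO"],
            # Curation rows may omit the gene name, as several rows in this commit do.
            "Gene Name in CARD": row.get("Gene Name in CARD") or "",
        }
    return list(table.values())


# Toy input with made-up identifiers and accessions.
rgi = read_tsv(
    "Original ID\tBest_Hit_ARO\tARO\n"
    "gene_x\tnameX\t1000001\n"
    "gene_y\tnameY\t1000002\n"
)
curated = read_tsv("Original ID\tARO\tGene Name in CARD\ngene_y\t1000009\n")
combined = build_annotation_table(rgi, curated)
```

Keying the table on `Original ID` keeps one row per gene, so a curated entry replaces the automated one in place instead of producing a duplicate.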