ENH: Improve mapping accuracy of normalizers by making the preprocessing of input genes and reference genes more precise #84

Vedanth-Ramji · 2025-02-01T09:23:32Z

Previously, input ids for resfinder and ncbi genes were built from the gene symbol alone. However, several genes can share the same gene_symbol. For example, tet(O/32/O)_1_JQ740052, tet(O/32/O)_2_AJ295238, and tet(O/32/O)_3_NZ_AUJS01000017 all have the gene_symbol tet(O/32/O), yet they do not all map to the same ARO: tet(O/32/O)_1_JQ740052 and tet(O/32/O)_2_AJ295238 are mapped to ARO:3007119, while tet(O/32/O)_3_NZ_AUJS01000017 is mapped to ARO:3000190. To solve this, I've included the reference accessions (e.g. AJ295238) in the preprocessing step for input ids from the resfinder and ncbi databases in ResFinderNormalizer, AMRFinderPlusNormalizer, and AbricateNormalizer.
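To make the collision concrete, here is a minimal sketch using only the three reference IDs and ARO numbers quoted above. The helper names, the assumed `symbol_variant_accession` layout of the IDs, and the `_` separator are illustrative assumptions, not argnorm's actual code.

```python
# Reference IDs and ARO mappings quoted in the description above.
REFERENCE_TO_ARO = {
    "tet(O/32/O)_1_JQ740052": "ARO:3007119",
    "tet(O/32/O)_2_AJ295238": "ARO:3007119",
    "tet(O/32/O)_3_NZ_AUJS01000017": "ARO:3000190",
}


def symbol_only(ref_id):
    """Old-style key (hypothetical helper): text before the first '_'."""
    return ref_id.split("_", 1)[0]


def symbol_and_accession(ref_id):
    """New-style key (hypothetical helper): gene symbol plus accession.

    Assumes the ID layout is symbol_variant_accession; the accession
    itself may contain underscores (e.g. NZ_AUJS01000017).
    """
    symbol, _variant, accession = ref_id.split("_", 2)
    return f"{symbol}_{accession}"


def build_lookup(key_func):
    """Group the ARO numbers reachable from each key."""
    lookup = {}
    for ref_id, aro in REFERENCE_TO_ARO.items():
        lookup.setdefault(key_func(ref_id), set()).add(aro)
    return lookup


by_symbol = build_lookup(symbol_only)
by_symbol_and_accession = build_lookup(symbol_and_accession)

# The symbol alone is ambiguous: one key, two different AROs.
print(by_symbol)
# With the accession included, every key resolves to a single ARO.
print(by_symbol_and_accession)
```

Under these assumptions, the symbol-only lookup cannot tell which ARO to return for tet(O/32/O), while the symbol-plus-accession lookup can.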

Exact changes to processing:

Resfinder database:

- ResFinderNormalizer: Changed preprocessing of input ids to concatenate entries from `gene_name` and `reference_accession` for hamronized results and `Resistance gene` and `Accession no.` for raw results.
- AbricateNormalizer: Changed preprocessing of input ids to concatenate entries from `gene_name` and `reference_accession` for hamronized results and `GENE` and `ACCESSION` for raw results.

NCBI database:

- AMRFinderPlusNormalizer: Changed preprocessing of input ids to concatenate entries from `gene_name` and `reference_accession` for hamronized results and `Sequence name` and `Accession of closest sequence` for raw results.
- AbricateNormalizer: Changed preprocessing of input ids to take entries from `gene_name` for hamronized results and `PRODUCT` for raw results.

Reference genes from the `resfinder` and `ncbi` databases are now processed in the same way, so that they match these new input ids.
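The column choices listed above can be summarised as a small lookup table. The sketch below is only an illustration: the column names come from this description, but `ID_COLUMNS`, `make_input_id`, the `_` separator, and the sample rows are hypothetical and are not taken from `argnorm/normalizers.py`.

```python
# (normalizer, database, is_hamronized) -> columns joined to form the input id.
# Column names are those listed in the PR description.
ID_COLUMNS = {
    ("ResFinderNormalizer", "resfinder", True): ("gene_name", "reference_accession"),
    ("ResFinderNormalizer", "resfinder", False): ("Resistance gene", "Accession no."),
    ("AbricateNormalizer", "resfinder", True): ("gene_name", "reference_accession"),
    ("AbricateNormalizer", "resfinder", False): ("GENE", "ACCESSION"),
    ("AMRFinderPlusNormalizer", "ncbi", True): ("gene_name", "reference_accession"),
    ("AMRFinderPlusNormalizer", "ncbi", False): (
        "Sequence name",
        "Accession of closest sequence",
    ),
    ("AbricateNormalizer", "ncbi", True): ("gene_name",),
    ("AbricateNormalizer", "ncbi", False): ("PRODUCT",),
}

SEPARATOR = "_"  # assumption: the real separator used by argnorm may differ


def make_input_id(row, normalizer, database, is_hamronized):
    """Join the configured columns of one result row into a single input id."""
    columns = ID_COLUMNS[(normalizer, database, is_hamronized)]
    return SEPARATOR.join(str(row[column]) for column in columns)


# Illustrative rows only; real tool output contains many more columns.
raw_resfinder_row = {"Resistance gene": "tet(O/32/O)", "Accession no.": "AJ295238"}
hamronized_row = {"gene_name": "tet(O/32/O)", "reference_accession": "AJ295238"}

print(make_input_id(raw_resfinder_row, "ResFinderNormalizer", "resfinder", False))
print(make_input_id(hamronized_row, "AbricateNormalizer", "ncbi", True))
```

Keeping the per-tool differences in one table means the joining logic is written once, and the reference genes can be keyed with the same function so that both sides of the lookup agree.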


luispedro

Avoid the looping if you can

argnorm/normalizers.py

Vedanth-Ramji · 2025-02-04T06:40:27Z

I just saw that I can cut down on looping in the abricate normalizer as well. I'll make that change too and rebase, so it will be included in the latest commit.


luispedro requested changes Feb 4, 2025


RFCT: simplify implementation of resfinder and ncbi input_id and ref_…

a2cbdb5

…gene preprocessing Cutting down unnecessary looping throughout input and reference genes.

Vedanth-Ramji force-pushed the update_ref_gene_processing branch from 262289e to a2cbdb5 February 4, 2025 06:47

luispedro merged commit 2b064e2 into BigDataBiology:main Feb 4, 2025
6 checks passed

Vedanth-Ramji deleted the update_ref_gene_processing branch February 4, 2025 13:54
