ENH: Improve mapping accuracy of normalizers by making the preprocessing of input genes and reference genes more precise #84

Merged

Vedanth-Ramji
Member

Previously, input ids for resfinder and ncbi genes were processed using gene symbols alone. However, several genes can share the same gene_symbol. For example, tet(O/32/O)_1_JQ740052, tet(O/32/O)_2_AJ295238, and tet(O/32/O)_3_NZ_AUJS01000017 all have the gene_symbol tet(O/32/O), yet tet(O/32/O)_1_JQ740052 and tet(O/32/O)_2_AJ295238 are mapped to ARO:3007119 while tet(O/32/O)_3_NZ_AUJS01000017 is mapped to ARO:3000190. To solve this, I've included the reference accessions (e.g. AJ295238) in the preprocessing step for input ids from the resfinder and ncbi databases in ResFinderNormalizer, AMRFinderPlusNormalizer, and AbricateNormalizer.

Exact changes to processing:
Resfinder database:

  • ResFinderNormalizer: Changed preprocessing of input ids to concatenate entries from gene_name and reference_accession for hamronized results and Resistance gene and Accession no. for raw results.
  • AbricateNormalizer: Changed preprocessing of input ids to concatenate entries from gene_name and reference_accession for hamronized results and GENE and ACCESSION for raw results.

NCBI database:

  • AMRFinderPlusNormalizer: Changed preprocessing of input ids to concatenate entries from gene_name and reference_accession for hamronized results and Sequence name and Accession of closest sequence for raw results.
  • AbricateNormalizer: Changed preprocessing of input ids to use entries from gene_name for hamronized results and PRODUCT for raw results.

Reference genes from the resfinder and ncbi databases are now processed to match these new input genes.
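
To illustrate why the accession matters, here is a minimal sketch that uses only the tet(O/32/O) example above. It is not the code in argnorm/normalizers.py: the `REFERENCE_ROWS` table, the `make_key` and `normalize` helpers, and the `_` separator are all assumptions made for illustration.

```python
# Illustrative only: these rows mirror the tet(O/32/O) example from this PR.
# The table layout and helper names are hypothetical, not argNorm's API.
REFERENCE_ROWS = [
    # (gene symbol, reference accession, ARO number)
    ("tet(O/32/O)", "JQ740052", "ARO:3007119"),
    ("tet(O/32/O)", "AJ295238", "ARO:3007119"),
    ("tet(O/32/O)", "NZ_AUJS01000017", "ARO:3000190"),
]


def make_key(gene_name, accession):
    """Join gene name and accession into one lookup key (separator is an assumption)."""
    return f"{gene_name}_{accession}"


# Keyed by gene symbol alone: the three rows collapse into one entry,
# and whichever row comes last silently wins.
by_symbol = {gene: aro for gene, _, aro in REFERENCE_ROWS}

# Keyed by gene symbol plus accession: each reference sequence keeps its own mapping.
by_symbol_and_accession = {
    make_key(gene, accession): aro for gene, accession, aro in REFERENCE_ROWS
}


def normalize(gene_name, accession):
    """Return the ARO number for an input gene, or None if it is not in the table."""
    return by_symbol_and_accession.get(make_key(gene_name, accession))


print(len(by_symbol))                              # one entry for three reference genes
print(normalize("tet(O/32/O)", "AJ295238"))         # ARO:3007119
print(normalize("tet(O/32/O)", "NZ_AUJS01000017"))  # ARO:3000190
```

The same key has to be built for both input genes and reference genes, which is why the reference side is reprocessed in this PR as well.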

Member

@luispedro luispedro left a comment

Avoid the looping if you can

Three review comments on argnorm/normalizers.py (outdated, resolved)
@Vedanth-Ramji
Member Author

I just saw that I can cut down looping in the abricate normalizer as well. I'll just get that as well and rebase, so it'll be included in the latest commit.

…gene preprocessing

Cutting down unnecessary looping throughout input and reference genes.
@Vedanth-Ramji Vedanth-Ramji force-pushed the update_ref_gene_processing branch from 262289e to a2cbdb5 February 4, 2025 06:47
@luispedro luispedro merged commit 2b064e2 into BigDataBiology:main Feb 4, 2025
6 checks passed
@Vedanth-Ramji Vedanth-Ramji deleted the update_ref_gene_processing branch February 4, 2025 13:54