ENH: Improve mapping accuracy of normalizers by making the preprocessing of input genes and reference genes more precise #84
Previously, for resfinder and ncbi genes, input ids were processed using only gene symbols. However, several genes can have the same gene_symbol. For example, `tet(O/32/O)_1_JQ740052`, `tet(O/32/O)_2_AJ295238`, and `tet(O/32/O)_3_NZ_AUJS01000017` all have the same gene_symbol, `tet(O/32/O)`. While `tet(O/32/O)_1_JQ740052` and `tet(O/32/O)_2_AJ295238` are mapped to `ARO:3007119`, `tet(O/32/O)_3_NZ_AUJS01000017` is mapped to `ARO:3000190`. To solve this, I've included the reference accessions (e.g. `AJ295238`) in the preprocessing step for input ids from the resfinder and ncbi databases in `ResFinderNormalizer`, `AMRFinderPlusNormalizer`, and `AbricateNormalizer`.

Exact changes to processing:
Resfinder database:
- `gene_name` & `reference_accession` for hamronized results and `Resistance gene` & `Accession no.` for raw results.
- `gene_name` and `reference_accession` for hamronized results and `GENE` and `ACCESSION` for raw results.

NCBI database:
- `gene_name` and `reference_accession` for hamronized results & `Sequence name` and `Accession of closest sequence` for raw results.
- `gene_name` for hamronized results and `PRODUCT` from raw results.

Reference genes from the `resfinder` and `ncbi` databases are now processed to match these new input genes.
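
To illustrate why the accession is needed, here is a minimal sketch using the three `tet(O/32/O)` entries from the description. It is not the code in this PR: the function names, the tuple key, and the assumption that reference ids have the form `<symbol>_<variant>_<accession>` (with no underscore in the symbol) are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch, not the actual normalizer code.
# The three reference ids and their ARO mappings are taken from the PR description.
REFERENCE_TO_ARO = {
    "tet(O/32/O)_1_JQ740052": "ARO:3007119",
    "tet(O/32/O)_2_AJ295238": "ARO:3007119",
    "tet(O/32/O)_3_NZ_AUJS01000017": "ARO:3000190",
}


def split_reference_id(ref_id):
    """Split '<symbol>_<variant>_<accession>' into (symbol, accession).

    Assumes the symbol contains no underscore; the accession may
    (e.g. 'NZ_AUJS01000017'), so we split at most twice.
    """
    symbol, _variant, accession = ref_id.split("_", 2)
    return symbol, accession


def build_symbol_index(table):
    """Old behaviour (sketch): key on the gene symbol only."""
    index = {}
    for ref_id, aro in table.items():
        symbol, _accession = split_reference_id(ref_id)
        index.setdefault(symbol, set()).add(aro)
    return index


def build_composite_index(table):
    """New behaviour (sketch): key on (gene symbol, reference accession)."""
    return {split_reference_id(ref_id): aro for ref_id, aro in table.items()}


def make_input_key(gene_name, reference_accession):
    """Build the matching key from an input row's gene name and accession."""
    return (gene_name.strip(), reference_accession.strip())


symbol_index = build_symbol_index(REFERENCE_TO_ARO)
composite_index = build_composite_index(REFERENCE_TO_ARO)

# With the symbol alone, one key points at two different ARO terms.
print(sorted(symbol_index["tet(O/32/O)"]))

# With symbol + accession, each input resolves to a single ARO term.
print(composite_index[make_input_key("tet(O/32/O)", "AJ295238")])
print(composite_index[make_input_key("tet(O/32/O)", "NZ_AUJS01000017")])
```

The symbol-only index cannot choose between `ARO:3007119` and `ARO:3000190`, whereas the composite key resolves each input unambiguously, provided the reference genes are preprocessed into the same key form.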