BigDataBiology
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 11 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 23 additions & 31 deletions
diff --git a/docs/api.md b/docs/api.md
Lines changed: 7 additions & 20 deletions
@@ -2,6 +2,17 @@
 
 ## Unreleased
 
+#### Added `HamronizationNormalizer`
+- Removed the `is_hamronized` property for all normalizers and removed `--hamronized` flag for CLI.
+- All hamronized results now go through the `HamronizationNormalizer` class.
+- `HamronizationNormalizer` reads a hamronized file line by line, extracts the input genes, and loads all ARO mapping tables to support hamronized results that combine the outputs from multiple tools and databases.
+- On the CLI, hamronization commands look like:
+```bash
+argnorm hamronization -i PATH_TO_INPUT -o PATH_TO_OUTPUT
+```
+
+> Note: Updated preprocessing of resfinder genes. Entries from `gene_name` and `reference_accession` in hamronized results are now concatenated to form input genes for `HamronizationNormalizer`. This improves ARO mapping accuracy (previously only `gene_symbol` was used, and several genes can share the same `gene_symbol`) and also simplifies preprocessing of resfinder inputs (if `gene_symbol` were used, two different preprocessing functions would be required for `resfinder` and for `abricate` with the resfinder db).
+
 #### Update `confers_resistance_to()` to use `regulates`, `part_of`, and `participates_in` ARO relationships
 Previously, argNorm used the `is_a` ARO relationship along with `confers_resistance_to_drug_class` and `confers_resistance_to_antibiotic` to map ARGs to the drugs they confer resistance to. While this worked well for most genes, some ARGs such as those coding for efflux pumps/proteins (e.g. `ARO:3003548`, `ARO:3000826`, `ARO:3003066`) were previously not mapped to any drugs. This is because none of their superclasses mapped to drugs/antibiotics via `confers_resistance_to_antibiotic` or `confers_resistance_to_drug_class`. However, these genes were related to other ARGs that did map to drugs via the `regulates`, `part_of`, or `participates_in` ARO relationships. argNorm now also utilizes these three relationships to ensure that even if the superclasses (derived using `is_a`) of an ARG don't map to a drug, the gene can be assigned a drug mapping.
 
 
@@ -51,18 +51,18 @@ The `resistance_to_drug_classes` column will contain ARO numbers of the broader
 If you use argNorm in a publication, please cite the preprint:
 > Ugarcina Perovic S, Ramji V et al. argNorm: Normalization of Antibiotic Resistance Gene Annotations to the Antibiotic Resistance Ontology (ARO). Queensland University of Technology ePrints, 2024. DOI: https://doi.org/10.5204/rep.eprints.252448 [Preprint] (Under review).
 
-## Supported tools and databases
+## Supported ARG annotation tools and databases
 
 | ARG database                       | Tool for ARG annotation                                 |
 | ---------------------------------- | ------------------------------------------------------- |
-| ARG-ANNOT v5.0                     | [ABRicate v1.0.1](https://github.com/tseemann/abricate) |
-| DeepARG v2                         | [DeepARG v1.0.2](https://bench.cs.vt.edu/deeparg)       |
-| Groot v1.1.2                       | [GROOT v1.1.2](https://github.com/will-rowe/groot)      |
-| MEGARes v3.0                       | [ABRicate v1.0.1](https://github.com/tseemann/abricate) |
-| NCBI Reference Gene Database v3.12 | [ABRicate v1.0.1](https://github.com/tseemann/abricate) & [AMRFinderPlus v3.10.30](https://github.com/ncbi/amr) |
-| ResFinder v4.0                     | [ABRicate v1.0.1](https://github.com/tseemann/abricate) & [ResFinder v4.0](https://bitbucket.org/genomicepidemiology/resfinder/src/master/) |
-| ResFinderFG v2.0                   | [ABRicate v1.0.1](https://github.com/tseemann/abricate) |
-| SARG (reads mode) v3.2.1           | [ARGs-OAP v2.3](https://galaxyproject.org/use/args-oap/)  | 
+| ARG-ANNOT v5.0                     | [ABRicate v1.0.1](https://github.com/tseemann/abricate) & [hAMRonization](https://github.com/pha4ge/hAMRonization) |
+| DeepARG v2                         | [DeepARG v1.0.2](https://bench.cs.vt.edu/deeparg) & [hAMRonization](https://github.com/pha4ge/hAMRonization)       |
+| Groot v1.1.2                       | [GROOT v1.1.2](https://github.com/will-rowe/groot) & [hAMRonization](https://github.com/pha4ge/hAMRonization)      |
+| MEGARes v3.0                       | [ABRicate v1.0.1](https://github.com/tseemann/abricate) & [hAMRonization](https://github.com/pha4ge/hAMRonization) |
+| NCBI Reference Gene Database v3.12 | [ABRicate v1.0.1](https://github.com/tseemann/abricate), [AMRFinderPlus v3.10.30](https://github.com/ncbi/amr), & [hAMRonization](https://github.com/pha4ge/hAMRonization) |
+| ResFinder v4.0                     | [ABRicate v1.0.1](https://github.com/tseemann/abricate), [ResFinder v4.0](https://bitbucket.org/genomicepidemiology/resfinder/src/master/), & [hAMRonization](https://github.com/pha4ge/hAMRonization) |
+| ResFinderFG v2.0                   | [ABRicate v1.0.1](https://github.com/tseemann/abricate) & [hAMRonization](https://github.com/pha4ge/hAMRonization) |
+| SARG (reads mode) v3.2.1           | [ARGs-OAP v2.3](https://galaxyproject.org/use/args-oap/) & [hAMRonization](https://github.com/pha4ge/hAMRonization)  | 
 
 - Note: ARG database and ARG annotation tool versions can change. argNorm is only intended for supported versions listed above.
 - Note: the argNorm tool will be periodically updated to support the latest versions of databases and annotation tools if they undergo significant changes.
@@ -98,7 +98,7 @@ argNorm is readily available in the funcscan pipeline which can be accessed (her
 Here is a basic outline of calling argNorm.
 
 ```bash
-argnorm [tool] [--db] -i [path to original_annotation.tsv] -o [path to annotation_result_with_aro.tsv] [--hamronized (if hAMRonization used)]
+argnorm [tool] [--db] -i [path to original_annotation.tsv] -o [path to annotation_result_with_aro.tsv]
 ```
 
 ### `tool` (required)
@@ -109,6 +109,7 @@ The most important ***required positional*** argument is `tool` (see [here](#sup
 - `resfinder`
 - `amrfinderplus`
 - `groot`
+- `hamronization`
 
 ### I/O (required)
 - `-i` or `--input`: path to the annotation result
@@ -135,31 +136,26 @@ ARG annotation tools can use several ARG databases for annotation. Hence, the `t
 | `resfinder`     | Not required |
 | `amrfinderplus` | Not required |
 | `groot`         | Any from `groot-argannot`, `groot-resfinder`, `groot-db`, `groot-core-db`, or `groot-card` |
-
-### `--hamronized` (optional)
-Use this if the input is hamronized by [hAMRonization](https://github.com/pha4ge/hAMRonization)
+| `hamronization` | Not required |
 
 ### `-h` or `--help`
 Use `argnorm -h` or `argnorm --help` to see available options.
 
 ```bash
 >argnorm -h
-usage: argnorm [-h]
-               [--db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg,groot-argannot,groot-resfinder,groot-db,groot-core-db,groot-card}]
-               [--hamronized] [-i INPUT] [-o OUTPUT]
-               {argsoap,abricate,deeparg,resfinder,amrfinderplus,groot}
+usage: argnorm [-h] [--db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg,groot-argannot,groot-resfinder,groot-db,groot-core-db,groot-card}] [-i INPUT] [-o OUTPUT]
+               {argsoap,abricate,deeparg,resfinder,amrfinderplus,groot,hamronization}
 
 argNorm normalizes ARG annotation results from different tools and databases to the same ontology, namely ARO (Antibiotic Resistance Ontology).
 
 positional arguments:
-  {argsoap,abricate,deeparg,resfinder,amrfinderplus,groot}
+  {argsoap,abricate,deeparg,resfinder,amrfinderplus,groot,hamronization}
                         The tool you used to do ARG annotation.
 
-optional arguments:
+options:
   -h, --help            show this help message and exit
   --db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg,groot-argannot,groot-resfinder,groot-db,groot-core-db,groot-card}
                         The database you used to do ARG annotation.
-  --hamronized          Use this if the input is hamronized (processed using the hAMRonization tool)
   -i INPUT, --input INPUT
                         The annotation result you have
   -o OUTPUT, --output OUTPUT
@@ -209,23 +205,19 @@ argnorm -h
 
 ```
 > argnorm -h
-usage: argnorm [-h]
-               [--db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg}]
-               [--hamronized] [-i INPUT] [-o OUTPUT]
-               {argsoap,abricate,deeparg,resfinder,amrfinderplus}
+usage: argnorm [-h] [--db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg,groot-argannot,groot-resfinder,groot-db,groot-core-db,groot-card}] [-i INPUT] [-o OUTPUT]
+               {argsoap,abricate,deeparg,resfinder,amrfinderplus,groot,hamronization}
 
 argNorm normalizes ARG annotation results from different tools and databases to the same ontology, namely ARO (Antibiotic Resistance Ontology).
 
 positional arguments:
-  {argsoap,abricate,deeparg,resfinder,amrfinderplus}
+  {argsoap,abricate,deeparg,resfinder,amrfinderplus,groot,hamronization}
                         The tool you used to do ARG annotation.
 
-optional arguments:
+options:
   -h, --help            show this help message and exit
-  --db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg}
+  --db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg,groot-argannot,groot-resfinder,groot-db,groot-core-db,groot-card}
                         The database you used to do ARG annotation.
-  --hamronized          Use this if the input is hamronized (processed using
-                        the hAMRonization tool)
   -i INPUT, --input INPUT
                         The annotation result you have
   -o OUTPUT, --output OUTPUT
@@ -257,10 +249,10 @@ wget https://raw.githubusercontent.com/BigDataBiology/argNorm/main/examples/raw/
 Here is a basic outline of most argNorm commands:
 
 ```bash
-argnorm [tool] -i [original_annotation.tsv] -o [argnorm_result.tsv] [--hamronized]
+argnorm [tool] -i [original_annotation.tsv] -o [argnorm_result.tsv] [--db]
 ```
 
-Here, `tool` refers to the ARG annotation tool used (ResFinder in this case). `original_annotation.tsv` is the path to the input data and `argnorm_result.tsv` is the path to output file where the resulting table from argNorm will be stored. `--hamronized` is an option to indicate if the input data is a result of using the [hAMRonization package](https://github.com/pha4ge/hAMRonization). In our example, the input data is not a result of using the hAMRonization package, and so the `--hamronized` option can be omitted.
+Here, `tool` refers to the ARG annotation tool used (ResFinder in this case). `original_annotation.tsv` is the path to the input data and `argnorm_result.tsv` is the path to the output file where the resulting table from argNorm will be stored. `--db` is the ARG database used along with `tool` to perform annotation. ResFinder does not require `--db` (argNorm will automatically load the ResFinder database); however, `--db` is required for the ARG annotation tools `groot` and `abricate`.
 
 
 To run argNorm on the input data, use this command in your terminal:
 
@@ -84,9 +84,8 @@ print(drugs_to_drug_classes(['ARO:0000030', 'ARO:0000051', 'ARO:0000069', 'ARO:3
 
 Normalizer classes for specific tools, which normalize ARG annotation outputs. Same functionality as the CLI.
 
-All normalizers have 2 parameters:
+All normalizers have 1 optional parameter:
 * database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinderfg, sarg, groot-db, groot-core-db, groot-card, groot-argannot, and groot-resfinder.
-* is_hamronized (bool, False by default): whether or not the ARG annotation output has been processed by the hamronization package.
 
 > Note: the database parameter only needs to be specified for AbricateNormalizer and GrootNormalizer. ncbi, deeparg, resfinder, sarg, megares, argannot, resfinderfg are the supported databases for AbricateNormalizer and groot-db, groot-core-db, groot-argannot, groot-resfinder, and groot-card are the supported databases for GrootNormalizer.
 
@@ -97,6 +96,7 @@ Available normalizers:
 * argnorm.normalizers.AMRFinderPlusNormalizer
 * argnorm.normalizers.AbricateNormalizer
 * argnorm.normalizers.GrootNormalizer
+* argnorm.normalizers.HamronizationNormalizer
 
 ### Methods
 
@@ -128,18 +128,7 @@ resfinder_normalizer.run('./resfinder.resfinder.orfs.tsv').to_csv('./resfinder.r
 
 This will create a file called `resfinder.resfinder.orfs.normed.tsv` with ARO mappings and drug categorization.
 
-### Example 2: using AbricteNormalizer with the ResFinderFG database
-
-The database parameter needs to be specified for the AbricateNormalizer. Supported databases are:
-* `ncbi`
-* `deeparg`
-* `resfinder`
-* `sarg`
-* `megares`
-* `argannot`
-* `resfinderfg`
-
-For this example, we will run the AbricateNormalizer with the [`resfinderfg` database option](https://www.big-data-biology.org/paper/2022_resfinderfgv2/).
+### Example 2: using HamronizationNormalizer
 
 Download the sample data [here](https://raw.githubusercontent.com/BigDataBiology/argNorm/7ee9d74c9fa51956ecb7706fa979cc0696ae305d/examples/hamronized/abricate.resfinderfg.tsv), and store it in a folder called `argnorm_normalizers_tutorial`.
 
@@ -151,13 +140,11 @@ wget https://raw.githubusercontent.com/BigDataBiology/argNorm/7ee9d74c9fa51956ec
 
 Save the following piece of Python code in the `argnorm_normalizers_tutorial` folder, and run the script.
 
-> Note: the data is hamronized, and so the `is_hamronized` parameter should be set to `True`.
-
 ```
-from argnorm.normalizers import AbricateNormalizer
+from argnorm.normalizers import HamronizationNormalizer
 
-abricate_normalizer = AbricateNormalizer(database='resfinderfg', is_hamronized=True)
-abricate_normalizer.run('./abricate.resfinderfg.tsv').to_csv('./abricate.resfinderfg.normed.tsv', sep='\t')
+normalizer = HamronizationNormalizer()
+normalizer.run('./abricate.resfinderfg.tsv').to_csv('./abricate.resfinderfg.normed.tsv', sep='\t')
 ```
 
-This will create a file called `abricate.resfinderfg.normed.tsv` with ARO mappings and drug categorization.
+This will create a file called `abricate.resfinderfg.normed.tsv` with ARO mappings and drug categorization.
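
The CHANGELOG note above says that input genes for `HamronizationNormalizer` are formed by concatenating `gene_name` and `reference_accession` rather than relying on `gene_symbol` alone. The following is only a minimal sketch of that idea, not argNorm's actual implementation: the helper `build_input_genes`, the `|` separator, and the two sample rows are all hypothetical and exist purely to show why the combined key disambiguates genes that share a `gene_symbol`.

```python
import csv
import io

# Hypothetical two-row excerpt of a hamronized TSV (real files have many more columns).
SAMPLE = (
    "gene_symbol\tgene_name\treference_accession\n"
    "geneA\tgeneA_1\tACC0001\n"
    "geneA\tgeneA_2\tACC0002\n"
)


def build_input_genes(handle, sep="|"):
    """Yield one key per row by joining gene_name and reference_accession.

    The separator is an assumption made for this sketch; argNorm may join
    the two fields differently.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        yield f"{row['gene_name']}{sep}{row['reference_accession']}"


# Combined keys, one per row.
genes = list(build_input_genes(io.StringIO(SAMPLE)))

# The gene_symbol column alone, for comparison.
symbols = [
    row["gene_symbol"]
    for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
]

print(genes)         # two distinct combined keys
print(set(symbols))  # a single shared gene_symbol
```

In this toy input both rows share one `gene_symbol`, so a lookup keyed on it could not tell them apart, while the combined key stays unique per row.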