From ae99a2ac53bdff2c875de58a0191eeed69e2527e Mon Sep 17 00:00:00 2001 From: Rafael Goncalves Date: Wed, 5 Jun 2024 12:24:06 -0400 Subject: [PATCH] Update README.md --- README.md | 42 +++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/README.md b/README.md index 2534e5e..3f273dc 100644 --- a/README.md +++ b/README.md @@ -46,7 +46,7 @@ dfd = text2term.map_terms(source_terms={"asthma":"disease", "acute bronchitis":[ Examples of Programmatic Caching ### Examples of Programmatic Caching -text2term supports caching an ontology for repeated use. The next example caches an ontology and gives it a name for use later on +text2term supports caching an ontology for repeated use. Here we cache an ontology and give it a name for later use: ```python mondo = text2term.cache_ontology(ontology_url="http://purl.obolibrary.org/obo/mondo.owl", ontology_acronym="MONDO") @@ -107,7 +107,7 @@ python text2term.py -s test/unstruct_terms.txt -t test/mondo.owl -iris http://pu While MONDO uses terms from other ontologies such as CHEBI and Uberon, the tool only considers terms whose IRIs start either with "http://purl.obolibrary.org/obo/mondo" or "http://identifiers.org/hgnc". 
--- -Cache an ontology for repeated use, by first running the tool as usual while instructing it to cache the ontology using `-c `: +Cache an ontology for repeated use by running the tool while instructing it to cache the ontology via `-c `: ```shell python text2term -s test/unstruct_terms.txt -t http://purl.obolibrary.org/obo/mondo.owl -c MONDO ``` @@ -157,9 +157,14 @@ The function returns a pandas `DataFrame` containing the generated ontology mapp - Unmapped terms can still be included in the output if `incl_unmapped` is True `target_ontology`—Path, URL or name of 'target' ontology to map the source terms to -: Ontology names can be given as values to `target_ontology` (eg "EFO" or "CL")--text2term uses [bioregistry](https://bioregistry.io) to get URLs for such names. -: When using BioPortal or Zooma, this should be a comma-separated list of ontology acronyms (eg 'EFO,HPO') or **'all'** to search all ontologies. -: When the target ontology has been cached, this should be the ontology name given when it was first cached. + +> [!TIP] +> Ontology names can be given as values to `target_ontology` e.g. "EFO" or "CL"--text2term uses [bioregistry](https://bioregistry.io) to get URLs for such names. +> +> Similarly, when the target ontology has been cached, enter the name used upon caching. + +> [!NOTE] +> When using BioPortal or Zooma, this should be a comma-separated list of ontology acronyms (eg 'EFO,HPO') or **'all'** to search all ontologies. `base_iris`—Map only to ontology terms whose IRIs start with one of the strings given in this tuple @@ -171,8 +176,7 @@ The function returns a pandas `DataFrame` containing the generated ontology mapp `separator`—Character that separates columns when input is a table (eg '\t' for TSV) -`mapper`—Method used to compare source terms with ontology terms - : One of levenshtein, jaro, jarowinkler, jaccard, fuzzy, tfidf, zooma, bioportal +`mapper`—Method used to compare source terms with ontology terms. 
One of `levenshtein, jaro, jarowinkler, jaccard, fuzzy, tfidf, zooma, bioportal` (see [Supported Mappers](#supported-mappers)) `max_mappings`—Maximum number of top-ranked mappings returned per source term @@ -307,18 +311,22 @@ To display a help message with descriptions of tool arguments do: The mapping score associated with each mapping is indicative of how similar an input term is to an ontology term (via its labels or synonyms). The mapping/similarity scores generated by text2term are the result of applying one of the following "mappers": -TF-IDF-based mapper -: [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf), a statistical measure often used in information retrieval, measures how important a word is to a document in a corpus of documents. We first generate TF-IDF-based vectors of the source terms and of labels and synonyms of ontology terms. Then we compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors to determine how similar a source term is to a target term (label or synonym). +**TF-IDF-based mapper**—[TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) is a statistical measure often used in information retrieval that measures how important a word is to a document in a corpus of documents. We first generate TF-IDF-based vectors of the source terms and of labels and synonyms of ontology terms. Then we compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors to determine how similar a source term is to a target term (label or synonym). + +**BioPortal Web API-based mapper**—uses an interface to the [BioPortal Annotator](https://bioportal.bioontology.org/annotator) that we built to allow mapping terms to ontologies in the [BioPortal](https://bioportal.bioontology.org) repository. 
+ +> [!IMPORTANT] +> Make sure to specify the target ontology name(s) as they appear in BioPortal -BioPortal Web API-based mapper -: uses an interface to the [BioPortal Annotator](https://bioportal.bioontology.org/annotator) that we built to allow mapping terms to ontologies in the [BioPortal](https://bioportal.bioontology.org) repository. To use it, make sure to specify the target ontology name(s) as they appear in BioPortal. +> [!WARNING] +> There are no confidence scores associated with BioPortal annotations, so the mapping score of all mappings is set to 1. -: _Note_: there are no confidence scores associated with BioPortal annotations, so we decided to set the mapping score of all mappings to 1. +**Zooma Web API-based mapper**—uses a [Zooma](https://www.ebi.ac.uk/spot/zooma/) interface that we built to allow mapping terms to ontologies in the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4) repository. -Zooma Web API-based mapper -: uses a [Zooma](https://www.ebi.ac.uk/spot/zooma/) interface that we built to allow mapping terms to ontologies in the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4) repository. To use it, make sure to specify the target ontology name(s) as they appear in OLS. +> [!IMPORTANT] +> Make sure to specify the target ontology name(s) as they appear in OLS -Syntactic distance-based mappers -: text2term provides support for commonly used and popular syntactic (edit) distance metrics. Specifically, we implemented support for Levenshtein, Jaro, Jaro-Winkler, Jaccard, and Indel metrics. We use the [nltk](https://pypi.org/project/nltk/) package to compute Jaccard distances, and [rapidfuzz](https://pypi.org/project/rapidfuzz/) for all others. +**Syntactic distance-based mappers**—text2term provides support for commonly used and popular syntactic (edit) distance metrics: Levenshtein, Jaro, Jaro-Winkler, Jaccard, and Indel. 
We use the [nltk](https://pypi.org/project/nltk/) package to compute Jaccard distances and [rapidfuzz](https://pypi.org/project/rapidfuzz/) to compute all others. -_Note_: syntactic distance-based mappers and Web API-based mappers perform slowly (much slower than the TF-IDF mapper). The former because they do pairwise comparisons between each input string and each ontology term label/synonym. In the Web API-based approaches there are networking and API load overheads. \ No newline at end of file +> [!NOTE] +> Syntactic distance-based mappers and Web API-based mappers are much slower than the TF-IDF mapper. The former perform pairwise comparisons between each input string and each ontology term label/synonym; the latter incur networking and API load overheads. \ No newline at end of file
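
To make the TF-IDF mapper description above more concrete, here is a minimal, standard-library-only sketch of the general technique: build TF-IDF vectors for a source term and a handful of ontology labels, then rank the labels by cosine similarity. This is not text2term's implementation. The word-level tokenization, the smoothed IDF formula, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `best_match`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of token -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each token appears in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption; other variants are equally valid).
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1.0 for tok, count in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({tok: freq * idf[tok] for tok, freq in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(source_term, labels):
    """Return the (label, score) pair most similar to the source term."""
    # The source term is included in the corpus so it shares the IDF weights.
    vectors = tfidf_vectors([source_term] + list(labels))
    source_vec, label_vecs = vectors[0], vectors[1:]
    scored = [(label, cosine(source_vec, vec)) for label, vec in zip(labels, label_vecs)]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    labels = ["asthma", "bronchitis", "acute bronchitis"]
    print(best_match("acute bronchitis", labels))
```

Because every source term is compared against precomputed label vectors, this style of matching scales better than running an edit-distance computation on every term/label pair, which is consistent with the performance note above.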