Commit

Update README.md
rsgoncalves committed Jun 5, 2024
1 parent 133be13 commit ae99a2a
Showing 1 changed file with 25 additions and 17 deletions.
42 changes: 25 additions & 17 deletions README.md
@@ -46,7 +46,7 @@ dfd = text2term.map_terms(source_terms={"asthma":"disease", "acute bronchitis":[
<summary><b>Examples of Programmatic Caching</b></summary>

### Examples of Programmatic Caching
-text2term supports caching an ontology for repeated use. The next example caches an ontology and gives it a name for use later on
+text2term supports caching an ontology for repeated use. Here we cache an ontology and give it a name for later use:
```python
mondo = text2term.cache_ontology(ontology_url="http://purl.obolibrary.org/obo/mondo.owl",
ontology_acronym="MONDO")
@@ -107,7 +107,7 @@ python text2term.py -s test/unstruct_terms.txt -t test/mondo.owl -iris http://pu
While MONDO uses terms from other ontologies such as CHEBI and Uberon, the tool only considers terms whose IRIs start either with "http://purl.obolibrary.org/obo/mondo" or "http://identifiers.org/hgnc".

---
-Cache an ontology for repeated use, by first running the tool as usual while instructing it to cache the ontology using `-c <name>`:
+Cache an ontology for repeated use by running the tool while instructing it to cache the ontology via `-c <name>`:
```shell
python text2term -s test/unstruct_terms.txt -t http://purl.obolibrary.org/obo/mondo.owl -c MONDO
```
@@ -157,9 +157,14 @@ The function returns a pandas `DataFrame` containing the generated ontology mapp
- Unmapped terms can still be included in the output if `incl_unmapped` is True

`target_ontology`&mdash;Path, URL or name of 'target' ontology to map the source terms to
-: Ontology names can be given as values to `target_ontology` (eg "EFO" or "CL")--text2term uses [bioregistry](https://bioregistry.io) to get URLs for such names.
-: When using BioPortal or Zooma, this should be a comma-separated list of ontology acronyms (eg 'EFO,HPO') or **'all'** to search all ontologies.
-: When the target ontology has been cached, this should be the ontology name given when it was first cached.

+> [!TIP]
+> Ontology names can be given as values to `target_ontology` e.g. "EFO" or "CL"--text2term uses [bioregistry](https://bioregistry.io) to get URLs for such names.
+>
+> Similarly, when the target ontology has been cached, enter the name used upon caching.
+> [!NOTE]
+> When using BioPortal or Zooma, this should be a comma-separated list of ontology acronyms (eg 'EFO,HPO') or **'all'** to search all ontologies.
`base_iris`&mdash;Map only to ontology terms whose IRIs start with one of the strings given in this tuple

@@ -171,8 +176,7 @@ The function returns a pandas `DataFrame` containing the generated ontology mapp

`separator`&mdash;Character that separates columns when input is a table (eg '\t' for TSV)

-`mapper`&mdash;Method used to compare source terms with ontology terms
-: One of levenshtein, jaro, jarowinkler, jaccard, fuzzy, tfidf, zooma, bioportal
+`mapper`&mdash;Method used to compare source terms with ontology terms. One of `levenshtein, jaro, jarowinkler, jaccard, fuzzy, tfidf, zooma, bioportal` (see [Supported Mappers](#supported-mappers))

`max_mappings`&mdash;Maximum number of top-ranked mappings returned per source term

@@ -307,18 +311,22 @@ To display a help message with descriptions of tool arguments do:

The mapping score associated with each mapping is indicative of how similar an input term is to an ontology term (via its labels or synonyms). The mapping/similarity scores generated by text2term are the result of applying one of the following "mappers":

-TF-IDF-based mapper
-: [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf), a statistical measure often used in information retrieval, measures how important a word is to a document in a corpus of documents. We first generate TF-IDF-based vectors of the source terms and of labels and synonyms of ontology terms. Then we compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors to determine how similar a source term is to a target term (label or synonym).
+**TF-IDF-based mapper**&mdash;[TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) is a statistical measure often used in information retrieval that measures how important a word is to a document in a corpus of documents. We first generate TF-IDF-based vectors of the source terms and of labels and synonyms of ontology terms. Then we compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors to determine how similar a source term is to a target term (label or synonym).

+**BioPortal Web API-based mapper**&mdash;uses an interface to the [BioPortal Annotator](https://bioportal.bioontology.org/annotator) that we built to allow mapping terms to ontologies in the [BioPortal](https://bioportal.bioontology.org) repository.

+> [!IMPORTANT]
+> Make sure to specify the target ontology name(s) as they appear in BioPortal
-BioPortal Web API-based mapper
-: uses an interface to the [BioPortal Annotator](https://bioportal.bioontology.org/annotator) that we built to allow mapping terms to ontologies in the [BioPortal](https://bioportal.bioontology.org) repository. To use it, make sure to specify the target ontology name(s) as they appear in BioPortal.
+> [!WARNING]
+> there are no confidence scores associated with BioPortal annotations, so we decided to set the mapping score of all mappings to 1
-: _Note_: there are no confidence scores associated with BioPortal annotations, so we decided to set the mapping score of all mappings to 1.
+**Zooma Web API-based mapper**&mdash;uses a [Zooma](https://www.ebi.ac.uk/spot/zooma/) interface that we built to allow mapping terms to ontologies in the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4) repository.

-Zooma Web API-based mapper
-: uses a [Zooma](https://www.ebi.ac.uk/spot/zooma/) interface that we built to allow mapping terms to ontologies in the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4) repository. To use it, make sure to specify the target ontology name(s) as they appear in OLS.
+> [!IMPORTANT]
+> Make sure to specify the target ontology name(s) as they appear in OLS
-Syntactic distance-based mappers
-: text2term provides support for commonly used and popular syntactic (edit) distance metrics. Specifically, we implemented support for Levenshtein, Jaro, Jaro-Winkler, Jaccard, and Indel metrics. We use the [nltk](https://pypi.org/project/nltk/) package to compute Jaccard distances, and [rapidfuzz](https://pypi.org/project/rapidfuzz/) for all others.
+**Syntactic distance-based mappers**&mdash;text2term provides support for commonly used and popular syntactic (edit) distance metrics: Levenshtein, Jaro, Jaro-Winkler, Jaccard, and Indel. We use the [nltk](https://pypi.org/project/nltk/) package to compute Jaccard distances and [rapidfuzz](https://pypi.org/project/rapidfuzz/) to compute all others.

-_Note_: syntactic distance-based mappers and Web API-based mappers perform slowly (much slower than the TF-IDF mapper). The former because they do pairwise comparisons between each input string and each ontology term label/synonym. In the Web API-based approaches there are networking and API load overheads.
+> [!NOTE]
+> Syntactic distance-based mappers and Web API-based mappers perform slowly (much slower than the TF-IDF mapper). The former because they do pairwise comparisons between each input string and each ontology term label/synonym. In the Web API-based approaches there are networking and API load overheads.

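The TF-IDF mapper description in the last hunk (vectorize source terms and ontology labels, then rank by cosine similarity) can be illustrated with a small standard-library sketch. This is not text2term's implementation: the whitespace tokenizer, the smoothed IDF formula, the choice to include the source term in the corpus, and the helper names `tokenize`, `tfidf_vectors`, `cosine_similarity`, and `best_match` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of token -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each token occurs in.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (assumed formula).
            token: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * vec_b.get(token, 0.0) for token, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(source_term, target_labels):
    """Return the (label, score) pair with the highest cosine similarity."""
    vectors = tfidf_vectors([source_term] + list(target_labels))
    source_vec, label_vecs = vectors[0], vectors[1:]
    scores = [cosine_similarity(source_vec, vec) for vec in label_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return target_labels[best], scores[best]


labels = ["asthma", "acute bronchitis", "chronic bronchitis"]
match, score = best_match("bronchitis acute", labels)
print(match, round(score, 3))
```

Because the vectors are bags of words, word order does not affect the score. Each source term is compared against precomputed label vectors rather than through character-level edit operations, which is one plausible reason a vector-based mapper can outpace the pairwise syntactic mappers mentioned in the note.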