Skip to content

MoultDB/TaxonMatch

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🌿 TaxonMatch: Integrating Taxonomic Data from GBIF, NCBI, iNaturalist, PaleoDB, IUCN and other sources

📌 Introduction

TaxonMatch is a Python framework designed to integrate, clean, and analyze taxonomic data from GBIF, NCBI, iNaturalist, PaleoDB, and IUCN. It enhances taxonomic consistency across biodiversity datasets, simplifies taxonomic name matching, and enables the generation of phylogenetic trees based on consolidated data.

🔍 Main Features

  • 📥 Download and clean taxonomic datasets from GBIF, NCBI, iNaturalist, PaleoDB, and IUCN.
  • 🔗 Taxonomic name matching to identify synonyms and discrepancies.
  • 🌳 Generate phylogenetic trees from consolidated taxonomic data.
  • 🦴 Analyze fossil taxa and identify their closest living relatives.
  • 🌍 Assign conservation status to species using IUCN data.

⚙ Installation & Setup

Prerequisites

Ensure you have Python 3.8+ installed along with the necessary dependencies:

pip install -r requirements.txt
pip install git+https://github.com/MicheleRoar/TaxonMatch.git

🚀 Usage & Workflow

1️⃣ Download Taxonomic Datasets

import taxonmatch as txm

# Download GBIF and NCBI datasets
gbif_dataset = txm.download_gbif_taxonomy()
ncbi_dataset = txm.download_ncbi_taxonomy()

2️⃣ Taxonomic Name Matching (GBIF vs NCBI)

# Select a specific clade (example: Apidae)
gbif_apidae, ncbi_apidae = txm.select_taxonomic_clade("Apidae", gbif_dataset, ncbi_dataset)

# Perform taxonomic matching
matched_df, unmatched_df, possible_typos_df = txm.match_dataset(gbif_apidae, ncbi_apidae)

3️⃣ Generate a Phylogenetic Tree

tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)
txm.print_tree(tree, root_name="Apidae")
txm.save_tree(tree, "taxon_tree.txt")

4️⃣ Add Conservation Status Using IUCN Data

df_with_iucn_status = txm.add_iucn_status_column(matched_df)
df_with_iucn_status[df_with_iucn_status.iucnRedListCategory.isin(['ENDANGERED', 'CRITICALLY_ENDANGERED', 'VULNERABLE'])]

5️⃣ Identify Closest Living Relatives for Fossil Species

paleodb_dataset = txm.download_gbif_taxonomy(source="paleodb")
a3cat = txm.download_ncbi_taxonomy(source="a3cat")

query = "Arthropoda;Insecta;Hymenoptera;Formicidae;Formica;Formica seuberti"
txm.find_top_n_similar(query, a3cat, n_neighbors=4)

📈 Visualization

Distribution of Conservation Status

import matplotlib.pyplot as plt

conservation_counts = df_with_iucn_status['iucnRedListCategory'].value_counts()
plt.figure(figsize=(10, 6))
conservation_counts.plot(kind='bar', color="blue", edgecolor='black')
plt.title('Distribution by Conservation Status')
plt.ylabel('Number of Species')
plt.xlabel('Category')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

License

This project is licensed under the MIT License.

🤝 Contributing & Maintenance

If you want to contribute to TaxonMatch:

  • Open an issue to report bugs or suggest improvements.
  • Create a pull request with a clear description of your changes.
  • Follow the coding guidelines and ensure all modifications are tested.

For questions or support, contact [email protected].

License

This project is licensed under the MIT License.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 96.3%
  • Python 3.7%