Cell type annotation is a critical yet laborious step in single-cell RNA sequencing analysis. We present a trustworthy large language model (LLM)-agent, CellTypeAgent, which integrates LLMs with verification from relevant databases. CellTypeAgent achieves higher accuracy than existing methods while mitigating hallucinations. We evaluated CellTypeAgent across nine real datasets involving 303 cell types from 36 tissues. This combined approach holds promise for more efficient and reliable cell type annotation.
- Clone the repository

      git clone https://github.com/jianghao-zhang/CellTypeAgent.git
      cd CellTypeAgent

- Create a conda environment and install the dependencies

      conda create -n CellTypeAgent python=3.10
      conda activate CellTypeAgent
      pip install -r requirements.txt

- Configure your OpenAI/Anthropic/DeepSeek API keys in the 'CellTypeAgent/APIs' folder.
- Prepare the data
  - The datasets used in the paper are stored in the 'CellTypeAgent/data' folder.
  - Download the gene expression data used in the paper from Google Drive and place it in the 'CellTypeAgent/data/CELLxGENE' directory.
  - See the README.md in the 'CellTypeAgent/data' folder for more information.
- Run an experiment on all datasets:

      python CellTypeAgent/get_prediction.py
      python CellTypeAgent/get_selection.py

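
The two commands above can also be chained from a small Python driver. This is a minimal sketch, not part of the repository: `build_commands` and `run_pipeline` are hypothetical helper names, and it assumes that the scripts are launched from the repository root and that the selection step should only run once the prediction step has finished successfully.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as given in this README, relative to the repository root.
PIPELINE_STEPS = [
    Path("CellTypeAgent/get_prediction.py"),
    Path("CellTypeAgent/get_selection.py"),
]


def build_commands(steps=PIPELINE_STEPS, python=sys.executable):
    """Return one argv list per step, in the order the README runs them."""
    return [[python, step.as_posix()] for step in steps]


def run_pipeline(steps=PIPELINE_STEPS):
    """Run each step in order (hypothetical helper, not a repository function).

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so a failed prediction step stops the run before selection.
    """
    for cmd in build_commands(steps):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository root, inside the activated conda environment, runs both steps with the same interpreter that launched the driver.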
To utilize CellTypeAgent with your own datasets, follow these steps:
- Format your data according to the structure in CellTypeAgent/data/GPTCellType/datasets
- Download the corresponding gene expression data from the CZ CellxGene - Gene Expression Atlas. For more details, refer to the README.md in the 'CellTypeAgent/data' folder.
- Modify the dataset settings in get_prediction.py and get_selection.py
- Configure model parameters as needed (e.g., model, top_n, max_markers)
- Run the pipeline as described in the Example Usage section
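
Before running the pipeline on your own data, it can help to confirm that the directories named in the steps above are in place. The check below is a rough sketch under stated assumptions: `missing_dirs` is a hypothetical helper, the two paths are the ones mentioned in this README, and treating an empty directory as "missing" is an assumption of this example rather than a rule enforced by CellTypeAgent. It does not validate the file format inside those directories.

```python
from pathlib import Path

# Directories named in this README, relative to the repository root.
REQUIRED_DIRS = [
    "CellTypeAgent/data/GPTCellType/datasets",
    "CellTypeAgent/data/CELLxGENE",
]


def missing_dirs(repo_root=".", required=REQUIRED_DIRS):
    """Return the required directories that are absent or contain no entries."""
    root = Path(repo_root)
    problems = []
    for rel in required:
        path = root / rel
        # Assumption: an existing but empty directory means the data
        # has not been downloaded or formatted yet.
        if not path.is_dir() or not any(path.iterdir()):
            problems.append(rel)
    return problems
```

An empty list from `missing_dirs()` means both directories exist and contain at least one entry. Any returned path points to a step above that still needs attention.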