Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation
This repository contains the implementation of advanced AI systems for automated patient cohort generation from electronic health records and epidemiological question answering.
This codebase implements the methods from the following research papers:
Latest Work (2025) - Accepted at European Conference on Artificial Intelligence (ECAI 2025)
International Joint Workshop of Artificial Intelligence for Healthcare and HYbrid Models for Coupling Deductive and Inductive ReAsoning (HC@AIxIA+HYDRA 2025):
@misc{ziletti2025generating,
title={Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation},
author={Angelo Ziletti and Leonardo D'Ambrosi},
year={2025},
eprint={2502.21107},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Previous Work (2024 - 6th Clinical Natural Language Processing Workshop @ NAACL):
@inproceedings{ziletti-dambrosi-2024-retrieval,
title = "Retrieval augmented text-to-{SQL} generation for epidemiological question answering using electronic health records",
author = "Ziletti, Angelo and D{'}Ambrosi, Leonardo",
editor = "Naumann, Tristan and Bethard, Steven and Savova, Guergana and Uzuner, Ozlem",
booktitle = "Proceedings of the 6th Clinical Natural Language Processing Workshop",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.clinicalnlp-1.4",
pages = "47--53",
}
The system provides two main generative AI-powered capabilities:
Clinical cohort definition is crucial for patient recruitment in clinical trials and cohort identification in observational studies. The Cohort Generator automates the translation of complex inclusion/exclusion criteria into accurate SQL queries through the following pipeline:
Processing Pipeline:
- Criteria Parsing: Converts raw input cohort criteria into semi-structured definitions
- RAG-Enhanced Processing: Uses two-level Retrieval Augmented Generation (RAG) with:
  - Criteria-specific knowledge base (EpiAskKB)
  - Cohort-level knowledge base (EpiCohortKB)
- Query Generation: Creates RAG-enhanced instruction prompts for SQL query generation
- Medical Concept Mapping: Maps medical concepts to standardized medical coding systems
- SQL Generation: Produces verified executable SQL queries for final cohort identification
- Patient Funnel: Generates detailed tracking of inclusion/exclusion steps
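The two-level retrieval idea in the pipeline above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `KBEntry` class, the token-overlap similarity, the toy SQL, and the table names are all assumptions made for the example, and a real system would likely use a stronger retriever.

```python
from dataclasses import dataclass


@dataclass
class KBEntry:
    """One knowledge-base entry: a natural-language text paired with SQL."""
    text: str
    sql: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a toy stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, kb: list[KBEntry], k: int = 1) -> list[KBEntry]:
    """Return the k entries whose text is most similar to the query."""
    return sorted(kb, key=lambda e: jaccard(query, e.text), reverse=True)[:k]


def build_prompt(cohort: str, criteria: list[str],
                 criteria_kb: list[KBEntry], cohort_kb: list[KBEntry]) -> str:
    """Assemble an instruction prompt from both retrieval levels."""
    parts = ["Similar cohorts:"]
    for entry in retrieve(cohort, cohort_kb):          # cohort-level retrieval
        parts.append(f"-- {entry.text}\n{entry.sql}")
    parts.append("Similar criteria:")
    for criterion in criteria:                         # criteria-level retrieval
        for entry in retrieve(criterion, criteria_kb):
            parts.append(f"-- {entry.text}\n{entry.sql}")
    parts.append(f"Write SQL for this cohort: {cohort}")
    return "\n".join(parts)


# Hypothetical toy knowledge bases (table and column names are made up).
CRITERIA_KB = [
    KBEntry("patients with atrial fibrillation",
            "SELECT person_id FROM conditions WHERE name = 'afib'"),
    KBEntry("patients older than 65",
            "SELECT person_id FROM persons WHERE age > 65"),
]
COHORT_KB = [
    KBEntry("elderly patients with atrial fibrillation",
            "SELECT p.person_id FROM persons p JOIN conditions c "
            "ON p.person_id = c.person_id WHERE p.age > 65 AND c.name = 'afib'"),
]
```

Calling `build_prompt("patients older than 70 with atrial fibrillation", ["patients older than 70"], CRITERIA_KB, COHORT_KB)` returns a prompt that combines the closest cohort-level example with the closest criteria-level example.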
Results:
For more information, please refer to our manuscript.
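To make the patient funnel step of the pipeline concrete, here is a small in-memory sketch of step-by-step inclusion/exclusion tracking. It is an illustration only: the `patient_funnel` function, the step format, and the toy patient records are assumptions for this example and do not reflect how the repository computes its funnel.

```python
from typing import Callable

Patient = dict
Step = tuple[str, str, Callable[[Patient], bool]]  # (label, "include" | "exclude", predicate)


def patient_funnel(patients: list[Patient], steps: list[Step]) -> list[tuple[str, int]]:
    """Apply the steps in order and record how many patients remain after each."""
    remaining = list(patients)
    funnel = [("All patients", len(remaining))]
    for label, kind, predicate in steps:
        if kind == "include":
            remaining = [p for p in remaining if predicate(p)]
        elif kind == "exclude":
            remaining = [p for p in remaining if not predicate(p)]
        else:
            raise ValueError(f"Unknown step kind: {kind!r}")
        funnel.append((label, len(remaining)))
    return funnel


# Hypothetical toy data.
PATIENTS = [
    {"id": 1, "age": 70, "afib": True, "stroke": False},
    {"id": 2, "age": 50, "afib": True, "stroke": False},
    {"id": 3, "age": 80, "afib": False, "stroke": False},
    {"id": 4, "age": 68, "afib": True, "stroke": True},
]
STEPS = [
    ("Age >= 65", "include", lambda p: p["age"] >= 65),
    ("Atrial fibrillation", "include", lambda p: p["afib"]),
    ("Prior stroke", "exclude", lambda p: p["stroke"]),
]

for label, count in patient_funnel(PATIENTS, STEPS):
    print(f"{label}: {count}")
```

Each row of the funnel reports the cohort size after one criterion, which makes it easy to see which inclusion or exclusion step removes the most patients.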
The QA system enables natural language querying of clinical data by automatically translating questions into SQL queries and generating human-readable answers. It is a re-implementation of our previous NAACL Clinical NLP Workshop paper.
Key Features:
- Natural language question processing
- RAG-enhanced SQL generation using question-SQL pair knowledge base (EpiAskKB)
- Medical concept mapping to standardized codes
- Contextual answer generation
The system uses a single SQLite database (a .db file) containing two specialized knowledge bases:
- EpiAskKB: Question-SQL pairs for epidemiological question answering
- EpiCohortKB: Cohort descriptions and corresponding SQL pairs for cohort generation
These knowledge bases enable the RAG-enhanced generation process described in our work.
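Since the knowledge bases ship as a SQLite file, one way to explore them is with Python's built-in `sqlite3` module. The sketch below only lists table names; it assumes nothing about the schema, and the `list_tables` helper is a name introduced here for illustration.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database file, sorted by name."""
    if not Path(db_path).exists():
        # sqlite3.connect would otherwise silently create an empty file.
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Example (run from the repository root):
#   list_tables("querylib_20250825.db")
```

The returned table names can then be queried directly to see how the question-SQL and cohort-SQL pairs are stored.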
For more information, please refer to the manuscript.
- Python ≥3.12
- SQLite 3.50.2 or higher
- Conda (recommended) or venv
Clone and set up the environment:
git clone https://github.com/Bayer-Group/epi-cohort-text2sql-ecai2025.git
cd epi-cohort-text2sql-ecai2025
conda create -n ascent-ai python=3.12
conda activate ascent-ai
pip install .
If you don't want to use conda, you can create a virtual environment using venv:
git clone https://github.com/Bayer-Group/epi-cohort-text2sql-ecai2025.git
cd epi-cohort-text2sql-ecai2025
python3 -m venv ascent-env
source ascent-env/bin/activate # On Windows use `ascent-env\Scripts\activate`
pip install .
To install directly from GitHub without cloning the repository:
pip install git+https://github.com/Bayer-Group/epi-cohort-text2sql-ecai2025.git
The system requires configuration for:
- AI model endpoints (Claude via AWS Bedrock, GPT via Azure, etc.)
- [Optional] Database connections (if queries are executed)
See example.env for configuration template.
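As a rough illustration of how settings from a `.env`-style file could be read, here is a minimal standard-library parser. It is a sketch under assumptions: the `load_env_file` helper is introduced here, the key names in the comment are hypothetical, and the repository may load its configuration through a different mechanism. The actual keys are the ones listed in example.env.

```python
from pathlib import Path


def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Example with hypothetical keys (not taken from example.env):
#   settings = load_env_file(".env")
#   endpoint = settings.get("MODEL_ENDPOINT")
```

Keeping credentials in a local `.env` file that is excluded from version control avoids hard-coding endpoint details in the source.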
epi-cohort-text2sql-ecai2025/
├── src/ascent_ai/            # Main package
│   ├── config/               # Configuration management
│   ├── models/               # AI models and inference
│   │   └── inference/        # Cohort and QA systems
│   ├── data/                 # Data processing utilities
│   ├── db/                   # Database connections
│   └── utils/                # Utility functions
├── docs/examples/            # Usage examples
├── querylib_20250825.db      # Query knowledge base (EpiAskKB and EpiCohortKB)
└── requirements.txt          # Dependencies
The docs/examples directory contains example implementations of the two main features.
To run the cohort generator example, run the following from the root folder of the repository:
python docs/examples/example_cohort_generator.py
To run the QA system example, run the following from the root folder of the repository:
python docs/examples/example_qa_system.py
Note: Medical coding is not performed in this code version; the SQL queries are returned with entity placeholders. For more information on integrating medical coding, see the following repository: https://github.com/Bayer-Group/text-to-sql-epi-ehr-naacl2024
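The entity placeholders mentioned in the note above would need to be resolved to medical codes before a query can be executed. The sketch below shows one way this could look. The `[type@name]` placeholder syntax, the `fill_placeholders` helper, the table name, and the code values are all assumptions for illustration; check the generated SQL and the linked repository for the actual format and coding logic.

```python
import re

# Assumed placeholder syntax for this sketch: [entity_type@entity name]
PLACEHOLDER = re.compile(r"\[(\w+)@([^\]]+)\]")


def fill_placeholders(sql: str, code_lookup: dict[tuple[str, str], list[str]]) -> str:
    """Replace each placeholder with a quoted, comma-separated code list."""

    def replace(match: re.Match) -> str:
        key = (match.group(1), match.group(2).strip().lower())
        codes = code_lookup.get(key)
        if codes is None:
            return match.group(0)  # leave unresolved placeholders untouched
        return ", ".join(f"'{code}'" for code in codes)

    return PLACEHOLDER.sub(replace, sql)


# Hypothetical query and lookup table (illustrative codes only).
QUERY = "SELECT person_id FROM conditions WHERE code IN ([condition@Atrial Fibrillation])"
LOOKUP = {("condition", "atrial fibrillation"): ["I48.0", "I48.1"]}

print(fill_placeholders(QUERY, LOOKUP))
```

Leaving unknown placeholders untouched makes missing mappings easy to spot before the query is sent to a database. In practice the codes should come from a vetted terminology source rather than a hand-written dictionary.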
See CONTRIBUTING.md for contribution guidelines.
See LICENSE file for details.
- Angelo Ziletti - Lead Author
- Leonardo D'Ambrosi
Please cite our work if you use this code in your research. For questions or issues, please contact the authors.



