BioAlign-QLoRA: Quantifying the Structural Alignment of LLM Embeddings with a Biomedical Knowledge Graph Following QLoRA Fine-Tuning
Academic Research Project - CSE443 Bioinformatics Coursework
BRAC University, Department of Computer Science and Engineering
This project addresses the fundamental challenge of transforming generalist Large Language Models (LLMs) into specialized biomedical experts through high-efficiency fine-tuning. We investigate whether targeted QLoRA (Quantized Low-Rank Adaptation) fine-tuning can induce a deep structural reorganization within a model's internal representations, causing its understanding of biological concepts to align more closely with real-world knowledge graphs.
Key Innovation: We introduce a novel "Knowledge Graph Separation" score that quantifies the geometric alignment between an LLM's embedding space and biological knowledge structures, providing empirical evidence of successful knowledge transfer.
- Features
- Project Structure
- Installation
- Usage
- Models
- Dataset
- Results
- Academic Context
- Authors
- License
- Acknowledgments
- Novel Evaluation Framework: Introduction of "Knowledge Graph Separation" score for quantifying embedding alignment
- Structural Reorganization Analysis: Empirical evidence of profound embedding-space transformation (up to +126% KG Separation improvement)
- Multi-Model Comparative Study: Fine-tuning of Llama-3 8B, Mistral 7B, and Phi-3 Mini 3.8B
- High-Efficiency Training: QLoRA implementation enabling fine-tuning on consumer-grade hardware
- Benchmark Outperformance: Consistently outperformed the pre-trained BioMistral-7B expert model
- Curated Dataset: 68,444 gene-disease associations from Comparative Toxicogenomics Database (CTD)
- Accessibility Focus: Lightweight models suitable for resource-constrained environments
```
BioAlign-QLoRA/
├── adapters/                        # Fine-tuned model adapters
│   ├── fine_tuned_llama3_gda/       # Llama3 QLoRA adapters
│   ├── fine_tuned_mistral7b_gda/    # Mistral 7B QLoRA adapters
│   └── fine_tuned_phi3_gda/         # Phi-3 QLoRA adapters
├── codes/                           # Core implementation
│   ├── analysis.ipynb               # Comprehensive analysis notebook
│   ├── dataproc.ipynb               # Data preprocessing pipeline
│   ├── eda.ipynb                    # Exploratory data analysis
│   ├── finetuneLlama.ipynb          # Llama3 fine-tuning implementation
│   ├── finetuneMistral7b.ipynb      # Mistral 7B fine-tuning implementation
│   └── finetunePhi3.ipynb           # Phi-3 fine-tuning implementation
├── data/                            # Dataset and processed files
│   ├── ctd_processed_dataset.csv    # Main processed dataset
│   ├── raw&processed/               # Raw and processed data files
│   └── visuals/                     # Generated visualizations
├── paper/                           # Research documentation
│   └── BioAlign_Report_LLM_BioKG_QLoRA.pdf
├── LICENSE                          # MIT License
└── README.md                        # This file
```
- Python 3.8 or higher
- CUDA-compatible GPU (recommended)
- Conda or pip package manager
- Clone the repository:

  ```bash
  git clone https://github.com/fnziad/BioAlign-QLoRA.git
  cd BioAlign-QLoRA
  ```

- Create a virtual environment:

  ```bash
  conda create -n bioalign-qlora python=3.8
  conda activate bioalign-qlora
  ```

- Install dependencies:

  ```bash
  pip install torch transformers datasets peft accelerate
  pip install pandas numpy matplotlib seaborn scikit-learn
  pip install jupyter notebook
  ```

- Download required models: The fine-tuned adapters are included in the `adapters/` directory. Base models will be downloaded automatically when running the notebooks.
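Once installed, an adapter can also be loaded outside the notebooks. The following is a minimal sketch using the standard Hugging Face / PEFT loading path; the base-model ID below is an assumption, and the exact checkpoints used are specified in the fine-tuning notebooks under `codes/`:

```python
# Sketch: load one of the included QLoRA adapters on top of its 4-bit base model.
# BASE_MODEL is an assumed checkpoint ID; see the notebooks for the exact one used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"        # assumed base checkpoint
ADAPTER_DIR = "adapters/fine_tuned_llama3_gda"   # adapter shipped in this repository

# 4-bit NF4 quantization, as used by QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)  # attach the QLoRA adapter
model.eval()
```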
Run the notebooks in the following order:

```bash
jupyter notebook codes/dataproc.ipynb          # Data preprocessing pipeline
jupyter notebook codes/eda.ipynb               # Exploratory data analysis
jupyter notebook codes/finetuneLlama.ipynb     # Llama3 fine-tuning
jupyter notebook codes/finetuneMistral7b.ipynb # Mistral 7B fine-tuning
jupyter notebook codes/finetunePhi3.ipynb      # Phi-3 fine-tuning
jupyter notebook codes/analysis.ipynb          # Comprehensive analysis
```

This project implements QLoRA fine-tuning for three state-of-the-art language models:
| Model | Parameters | Zero-Shot Accuracy | KG Separation Improvement | Use Case |
|---|---|---|---|---|
| Llama-3 8B | 8B | 81.0% (+57.0%) | +49.0% | High-performance biomedical understanding |
| Mistral 7B | 7B | 83.8% (+41.1%) | -38.0%* | Efficient inference with architectural innovations |
| Phi-3 Mini | 3.8B | 68.8% (+16.2%) | +126.3% | Lightweight deployment for resource constraints |
*Mistral's negative KG separation change indicates optimization prioritized classification accuracy over geometric purity.
- Rank (r): 16
- Alpha: 32
- Dropout: 0.0 (optimized for Unsloth)
- Max Steps: 2,000
- Learning Rate: 2e-4
- Optimizer: AdamW 8-bit
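For reference, the hyperparameters above can be expressed as PEFT / transformers configuration objects. This is a sketch only: `target_modules`, the batch size, and gradient accumulation are assumptions, and the exact settings used with Unsloth are in the fine-tuning notebooks.

```python
# Sketch of the listed hyperparameters as PEFT / transformers configs.
# target_modules and batch-size settings are assumptions; see codes/ for the
# exact Unsloth configuration.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,              # rank
    lora_alpha=32,     # alpha
    lora_dropout=0.0,  # 0.0, as used for Unsloth's optimized path
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection set
)

training_args = TrainingArguments(
    output_dir="outputs",
    max_steps=2000,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",          # 8-bit AdamW via bitsandbytes
    per_device_train_batch_size=2,   # assumed
    gradient_accumulation_steps=4,   # assumed
    logging_steps=50,
)
```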
The project utilizes a curated dataset from the Comparative Toxicogenomics Database (CTD):
- Source: CTD Curated Gene-Disease Associations (high-confidence only)
- Total Size: 68,444 gene-disease pairs (perfectly balanced)
- Positive Examples: 34,222 evidence-backed associations
- Negative Examples: 34,222 synthetically generated pairs
- Format: Simple template - "{GeneSymbol} is associated with {DiseaseName}" (see the sketch after this list)
- Unique Genes: 9,111
- Unique Diseases: 5,858
- Evidence Types: Marker/mechanism, therapeutic (direct evidence only)
- Average Text Length: 52 characters
- Training Split: 80% (54,755) / Test: 20% (13,689)
- Evidence Distribution: An average of 1.09 PubMed IDs per positive association
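The sketch below shows how processed CTD rows can be rendered into the simple association template. The column names (`GeneSymbol`, `DiseaseName`, `label`) are assumptions; the actual schema of `ctd_processed_dataset.csv` is handled in `codes/dataproc.ipynb`.

```python
# Sketch: render processed CTD rows into the "{GeneSymbol} is associated with
# {DiseaseName}" template. Column names here are assumptions, not the verified schema.
import pandas as pd

df = pd.read_csv("data/ctd_processed_dataset.csv")

def to_text(row: pd.Series) -> str:
    # Simple association template used for fine-tuning
    return f"{row['GeneSymbol']} is associated with {row['DiseaseName']}"

df["text"] = df.apply(to_text, axis=1)
print(df[["text", "label"]].head())
```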
| Model | Zero-Shot Accuracy | KG Separation Score | Probe Accuracy | Benchmark Comparison |
|---|---|---|---|---|
| Mistral-QLoRA | 83.8% | 0.1008 (-38.0%) | 13.6% | Outperformed BioMistral |
| Llama-3-QLoRA | 81.0% | 0.0578 (+49.0%) | 7.5% | Outperformed BioMistral |
| Phi-3-QLoRA | 68.8% | 0.0349 (+126.3%) | 5.9% | Outperformed BioMistral |
| BioMistral-7B | 50.9% | N/A | N/A | Baseline |
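The exact definition of the Knowledge Graph Separation score is given in the research report in `paper/`. Purely to illustrate the underlying idea, how cleanly embeddings of true versus false associations separate geometrically, a rough proxy could be computed with an off-the-shelf clustering metric such as the silhouette score. The function below is an assumption for illustration only, not the metric used in this study.

```python
# Illustrative only: a rough proxy for "how separated are true vs. false
# gene-disease associations in embedding space", using the silhouette score.
# This is NOT the paper's Knowledge Graph Separation metric.
import numpy as np
from sklearn.metrics import silhouette_score

def separation_proxy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """embeddings: (n, d) sentence embeddings; labels: 1 = true association, 0 = negative."""
    return float(silhouette_score(embeddings, labels))

# Toy usage with random vectors (real inputs would be LLM embeddings of the templates)
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(3, 1, (100, 8))])
lab = np.array([1] * 100 + [0] * 100)
print(separation_proxy(emb, lab))
```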
- Structural Reorganization: All models showed geometric transformation from "chaotic clouds" to distinct clusters
- Benchmark Superior: All fine-tuned models outperformed the pre-trained BioMistral expert
- Efficiency Achievement: Phi-3 Mini demonstrated a remarkable 126% KG Separation improvement despite having only 3.8B parameters
- Training Efficiency: Significant improvements achieved in just 2,000 steps (~29% of one epoch)
- Accessibility Validation: Lightweight models proven viable for resource-constrained deployment
Course: CSE443 - Bioinformatics
Institution: BRAC University
Department: Computer Science and Engineering
Semester: Summer 2025
Project Type: Group Research Project
- Investigate whether targeted fine-tuning can induce structural reorganization in LLM embeddings
- Develop quantitative metrics for measuring knowledge graph alignment in neural representations
- Create computationally efficient methods for specializing generalist models
- Validate accessibility of advanced AI tools for resource-constrained environments
- Demonstrate democratization pathway for biomedical AI applications
Fahad Nadim Ziad - First Author & Primary Contributor
- Student ID: 24341216
- Email: [email protected]
- GitHub: @fnziad
- Role: Project conception and idea, dataset collection and curation, all code implementation, model fine-tuning, experimental framework, Knowledge Graph Separation methodology, final revision and corrections
Aalavi Mahin Khan - Co-Author
- Student ID: 22301789
- Department: Computer Science and Engineering, BRAC University
- Role: Research report writing, documentation
Khaled Saifullah Karim - Co-Author
- Student ID: 24341262
- Department: Computer Science and Engineering, BRAC University
- GitHub: @KsKarim7
- Role: Research report writing, file organization and repository management
Note: Team members may have additional repository access and resources. Please refer to individual repositories for supplementary materials.
If you use this work in your research, please cite:
```bibtex
@misc{ziad2025bioalign,
  title={Quantifying the Structural Alignment of LLM Embeddings with a Biomedical Knowledge Graph Following QLoRA Fine-Tuning},
  author={Ziad, Fahad Nadim and Khan, Aalavi Mahin and Karim, Khaled Saifullah},
  year={2025},
  institution={BRAC University},
  note={CSE443 Bioinformatics Course Project, Summer 2025},
  url={https://github.com/fnziad/BioAlign-QLoRA}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Professor Swakkhar Shatabda - Course Instructor, CSE443 Bioinformatics, BRAC University
- BRAC University Department of Computer Science and Engineering
- Comparative Toxicogenomics Database (CTD) for providing the curated gene-disease associations
- Unsloth for high-efficiency QLoRA implementation enabling accessible fine-tuning
- Hugging Face for transformer models and infrastructure
- Meta AI, Mistral AI, and Microsoft for open-source model contributions
- Open-source research community for foundational tools and methodologies
- QLoRA Paper: Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs
- CTD Database: Davis, A.P., et al. (2023). Comparative Toxicogenomics Database
- Transformers Library: Wolf, T., et al. (2020). Transformers: State-of-the-art Natural Language Processing
Academic Integrity Statement: This work represents original research conducted as part of the CSE443 Bioinformatics coursework. All team members contributed to different aspects of the project, and appropriate attribution has been provided for external resources and datasets used.
Contact: For questions about this research project, please contact the first author or refer to the detailed research report in the paper/ directory.