Med vLLM is a project aimed at creating a specialized language model for medical applications. By leveraging the efficient Nano vLLM and the domain knowledge of BioBERT and ClinicalBERT, we provide a tool that's both powerful and resource-friendly.
Hugging Face Hub: https://huggingface.co/Junaidi-AI/med-vllm
You can load the config directly from the Hub via:
from medvllm.medical.config.models.medical_config import MedicalModelConfig
cfg = MedicalModelConfig.from_pretrained("Junaidi-AI/med-vllm")

Large language models have shown great promise in various fields, but their size and resource requirements can be prohibitive, especially in resource-constrained environments like hospitals or research labs. Med vLLM addresses this by using a lightweight inference engine while maintaining high performance on medical tasks such as analyzing clinical notes or assisting with medical research.
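Because hospital and lab environments are often offline or firewalled, it can help to keep a local copy of the config. The sketch below assumes MedicalModelConfig.from_pretrained follows the usual Hugging Face convention of also accepting a local directory; the fallback path is illustrative only and is not part of the repository layout.

```python
from medvllm.medical.config.models.medical_config import MedicalModelConfig

# Assumption: from_pretrained accepts either a Hub repo id or a local directory
# containing a previously saved config (the standard Hugging Face convention).
try:
    cfg = MedicalModelConfig.from_pretrained("Junaidi-AI/med-vllm")
except Exception:
    # Hypothetical local fallback for air-gapped machines.
    cfg = MedicalModelConfig.from_pretrained("./local_configs/med-vllm")

print(cfg)  # inspect the loaded medical config
```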
- Efficient Inference: Powered by Nano vLLM for lightweight performance.
- Medical Expertise: Pre-trained on medical data with BioBERT and ClinicalBERT.
- Easy Integration: Seamlessly fits into existing workflows.
- Customizable: Adaptable for specific medical applications.
- Python 3.8 or higher
- PyTorch
- Hugging Face Transformers library
- Clone the repository:
  git clone https://github.com/your-github-username/med-vllm.git
- Navigate to the project directory:
  cd med-vllm
- Create and activate a virtual environment (recommended):
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
Run a sample inference:
python run_inference.py --model bioBERT --input "The patient has a history of diabetes and hypertension."

This will process the input using the specified model (e.g., BioBERT). You can also use --model clinicalBERT to switch models.
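If you need to process many notes at once, the same CLI can be driven from Python. The sketch below simply shells out to run_inference.py with the flags shown above (and assumes it is run from the repository root); it is not a documented batch API.

```python
# Minimal sketch: batch several notes through the run_inference.py CLI shown above.
# Assumes the script is invoked from the repository root; there is no documented
# batch API, so this just calls the CLI once per note.
import subprocess

notes = [
    "The patient has a history of diabetes and hypertension.",
    "Patient shows signs of pneumonia.",
]

for note in notes:
    result = subprocess.run(
        ["python", "run_inference.py", "--model", "bioBERT", "--input", note],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())
```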
Med vLLM includes a comprehensive test suite to ensure code quality and functionality. The tests are written with Python's unittest framework and run with pytest.
To run all tests:
# Run all tests
python -m pytest tests/unit/ -v
# Run a specific test file
python -m pytest tests/unit/test_medical_adapters.py -v
# Run a specific test class
python -m pytest tests/unit/test_medical_adapters.py::TestBioBERTAdapter -v
# Run a specific test method
python -m pytest tests/unit/test_medical_adapters.py::TestBioBERTAdapter::test_biomedical_text_processing -v

To generate a test coverage report:
# Install coverage if not already installed
pip install coverage
# Run tests with coverage
coverage run -m pytest tests/unit/
# Generate coverage report
coverage report -m
# Generate HTML coverage report
coverage html

The HTML report will be available in the htmlcov directory.
- A/B smoke test for text generation strategies (offline echo engine):
  python scripts/ab_test_textgen.py --dataset benchmarks/datasets/textgen_small.jsonl --output benchmarks/results/textgen_ab_results.json
- Domain expert evaluation protocol and template: see docs/expert_eval_protocol.md and docs/expert_eval_template.csv. Aggregate filled scores with:
  python scripts/aggregate_expert_eval.py path/to/your_eval.csv
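The A/B script above writes its results to a JSON file. Its exact schema is not documented in this README, so the sketch below just loads the file and pretty-prints whatever structure it contains.

```python
# Sketch: inspect the A/B results written by scripts/ab_test_textgen.py.
# The JSON schema is not documented here; this only loads and pretty-prints it.
import json
from pprint import pprint

with open("benchmarks/results/textgen_ab_results.json") as f:
    results = json.load(f)

pprint(results)
```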
The test suite is organized as follows:
- tests/unit/test_medical_adapters.py: Contains all unit tests for medical adapters
  - TestBaseAdapter: Tests for the base adapter functionality
  - TestBioBERTAdapter: Tests specific to the BioBERT adapter
  - TestClinicalBERTAdapter: Tests specific to the ClinicalBERT adapter
For benchmark quick starts (CPU/GPU adapter smokes, training smokes, report generation), see:
benchmarks/README.md
Classify a clinical note as positive or negative for a condition:
python run_inference.py --model clinicalBERT --task classify --input "Patient shows signs of pneumonia."

Extract medical entities from text:
python run_inference.py --model bioBERT --task ner --input "Patient prescribed metformin for diabetes."

Use a simple, pluggable NER processor with a regex fallback or your own model-backed pipeline:
from medvllm.tasks import NERProcessor
proc = NERProcessor(inference_pipeline=None, config=None) # regex fallback
res = proc.extract_entities("Patient has myocardial infarction (MI). Aspirin given.")
linked = proc.link_entities(res, ontology="UMLS")
html = proc.highlight_entities(linked)

- Example script: examples/ner_processor_example.py
- Documentation: docs/ner_processor.md
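As a small follow-up to the snippet above, the highlighted output can be saved to disk for review in a browser. This assumes highlight_entities returns an HTML string, as the example suggests; see docs/ner_processor.md for the exact return types and entity schema.

```python
# Follow-up sketch: run the NER example above end to end and save the
# highlighted output. Assumes highlight_entities returns an HTML string;
# see docs/ner_processor.md for the actual return types.
from pathlib import Path

from medvllm.tasks import NERProcessor

proc = NERProcessor(inference_pipeline=None, config=None)  # regex fallback
entities = proc.extract_entities("Patient has myocardial infarction (MI). Aspirin given.")
linked = proc.link_entities(entities, ontology="UMLS")
html = proc.highlight_entities(linked)

Path("ner_highlight.html").write_text(html, encoding="utf-8")
print("Wrote ner_highlight.html")
```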
Measure linking performance and cache effectiveness on longer notes:
python3 -m benchmarks.benchmark_linking --paragraphs 50 --runs 3 --ontology RXNORM

See docs/ner_processor.md for external enrichment (RxNorm, UMLS CAS/TGT) configuration.
Generate a summary of a patient's medical history:
python run_inference.py --model clinicalBERT --task generate --input "Patient has diabetes and hypertension."

To fine-tune Med vLLM on your own medical dataset:
- Prepare your dataset in a compatible format (e.g., JSON or CSV); see the sketch below for one possible layout.
- Use the provided training script:
python train.py --model bioBERT --dataset path/to/your/data
- Evaluate the fine-tuned model with:
python evaluate.py --model path/to/finetuned/model
Detailed instructions will be provided as the project evolves.
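For illustration, a tiny JSONL dataset for the classification task above could be assembled as follows. The "text" and "label" field names are assumptions rather than a documented schema, so check train.py (or its --help output) for the format it actually expects.

```python
# Illustrative sketch only: build a small JSONL dataset for fine-tuning.
# The "text" / "label" field names are assumptions, not a documented schema.
import json

examples = [
    {"text": "Patient shows signs of pneumonia.", "label": "positive"},
    {"text": "No acute cardiopulmonary findings.", "label": "negative"},
]

with open("clinical_notes.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The resulting file could then be passed to the training script, e.g. python train.py --model bioBERT --dataset clinical_notes.jsonl.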
- Currently supports only English-language medical texts.
- Multilingual support is planned for future releases.
We welcome contributions! To get involved:
- Report bugs or suggest features by opening an issue.
- Submit pull requests with improvements, following the project's code style and including tests for new features.
This project is licensed under the MIT License - see the LICENSE file for details.
Med vLLM builds upon:

- Nano vLLM
- BioBERT
- ClinicalBERT
- Hugging Face Transformers

Thanks to their creators for their open-source contributions.
The Triton streaming softmax×V kernel is experimental and gated by default. Use it only for development and benchmarking.
- Gating (default off): The Triton path is disabled unless explicitly enabled via env vars.
- Fallbacks: If disabled, we use a safe row-softmax + matmul path; Flash Attention is optional and not required.
Set these environment variables to route the attention softmax×V through the Triton streaming kernel:
export MEDVLLM_ENABLE_TRITON_SOFTMAXV=1
export MEDVLLM_ENABLE_TRITON_SOFTMAXV_STREAMING=1
export MEDVLLM_FORCE_STREAMING_SOFTMAXV=1  # force use during dev

Autotune may cause long JIT times. Use these to constrain it:
- Fast compile (single tiny config): MEDVLLM_SOFTMAXV_COMPILE_FAST=1
- Narrow preset (few configs): MEDVLLM_SOFTMAXV_COMPILE_NARROW=1
- Force single config by index: MEDVLLM_SOFTMAXV_FORCE_CONFIG=<int>
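Since these switches are plain environment variables, they can also be set from Python, which is convenient in notebooks or test scripts. The sketch below only uses variables documented in this section and assumes they are set before the Med vLLM engine is imported or run, so they are visible when the kernel gating reads them.

```python
# Dev-only sketch: enable the experimental Triton streaming softmax*V path
# from Python. Set these before importing/running the Med vLLM engine so the
# gating code sees them (an assumption about when they are read).
import os

os.environ["MEDVLLM_ENABLE_TRITON_SOFTMAXV"] = "1"
os.environ["MEDVLLM_ENABLE_TRITON_SOFTMAXV_STREAMING"] = "1"
os.environ["MEDVLLM_FORCE_STREAMING_SOFTMAXV"] = "1"   # force use during dev
os.environ["MEDVLLM_SOFTMAXV_COMPILE_FAST"] = "1"      # single tiny config to keep JIT fast
```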
Bypass autotune entirely by compiling exactly one configuration (recommended during early bring-up):
export MEDVLLM_SOFTMAXV_NO_AUTOTUNE=1
export MEDVLLM_SOFTMAXV_BLOCK_N=128 # seq tile
export MEDVLLM_SOFTMAXV_BLOCK_D=64 # feature tile
export MEDVLLM_SOFTMAXV_K=4 # inner unroll
export MEDVLLM_SOFTMAXV_NUM_WARPS=4
export MEDVLLM_SOFTMAXV_NUM_STAGES=2
export MEDVLLM_SOFTMAXV_MAX_TILES_CAP=32  # cap compile-time loop bound

- Warm-up compile on a smaller shape to prime the JIT cache:
python benchmarks/benchmark_attention.py \
--device cuda --seq 256 --heads 8 --dim 512 --iters 1 \
--attn-softmaxv-bench --enable-triton-softmaxv

- Target run on your actual shape (consider a shell timeout on first build):
python benchmarks/benchmark_attention.py \
--device cuda --seq 512 --heads 8 --dim 512 --iters 3 \
--attn-softmaxv-bench --enable-triton-softmaxv

Notes:
- If compile stalls: prefer the no-autotune path; reduce NUM_STAGES to 1; increase BLOCK_N to shrink MAX_TILES.
- Performance tuning ideas: switch to block pointers for V, experiment with small-width dot patterns inside the K-unroll, and re-expand autotune once compile is reliable.
If you use Med vLLM in your research or application, please cite it as:
[SHA888](https://github.com/SHA888). (2025). Med vLLM: A Medical Language Model. GitHub repository, https://github.com/SHA888/med-vllm