Med vLLM is a project aimed at creating a specialized language model for medical applications. By leveraging the efficient Nano vLLM and the domain knowledge of BioBERT and ClinicalBERT, we provide a tool that's both powerful and resource-friendly.
Hugging Face Hub: https://huggingface.co/Junaidi-AI/med-vllm
You can load the config directly from the Hub via:
from medvllm.medical.config.models.medical_config import MedicalModelConfig
cfg = MedicalModelConfig.from_pretrained("Junaidi-AI/med-vllm")

Large language models have shown great promise in various fields, but their size and resource requirements can be prohibitive, especially in resource-constrained environments like hospitals or research labs. Med vLLM addresses this by using a lightweight inference engine while maintaining high performance on medical tasks such as analyzing clinical notes or assisting with medical research.
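Because hospital and lab environments are often offline or firewalled, it can help to keep a local copy of the config. The sketch below assumes MedicalModelConfig.from_pretrained follows the usual Hugging Face convention of also accepting a local directory; the fallback path is illustrative only and is not part of the repository layout.

```python
from medvllm.medical.config.models.medical_config import MedicalModelConfig

# Assumption: from_pretrained accepts either a Hub repo id or a local directory
# containing a previously saved config (the standard Hugging Face convention).
try:
    cfg = MedicalModelConfig.from_pretrained("Junaidi-AI/med-vllm")
except Exception:
    # Hypothetical local fallback for air-gapped machines.
    cfg = MedicalModelConfig.from_pretrained("./local_configs/med-vllm")

print(cfg)  # inspect the loaded medical config
```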
- Efficient Inference: Powered by Nano vLLM for lightweight performance.
- Medical Expertise: Pre-trained on medical data with BioBERT and ClinicalBERT.
- Easy Integration: Seamlessly fits into existing workflows.
- Customizable: Adaptable for specific medical applications.
- Python 3.8 or higher
- PyTorch
- Hugging Face Transformers library
- Clone the repository:
  git clone https://github.com/your-github-username/med-vllm.git
- Navigate to the project directory:
  cd med-vllm
- Create and activate a virtual environment (recommended):
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
Run a sample inference:
python run_inference.py --model bioBERT --input "The patient has a history of diabetes and hypertension."

This will process the input using the specified model (e.g., BioBERT). You can also use --model clinicalBERT to switch models.
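If you need to process many notes at once, the same CLI can be driven from Python. The sketch below simply shells out to run_inference.py with the flags shown above (and assumes it is run from the repository root); it is not a documented batch API.

```python
# Minimal sketch: batch several notes through the run_inference.py CLI shown above.
# Assumes the script is invoked from the repository root; there is no documented
# batch API, so this just calls the CLI once per note.
import subprocess

notes = [
    "The patient has a history of diabetes and hypertension.",
    "Patient shows signs of pneumonia.",
]

for note in notes:
    result = subprocess.run(
        ["python", "run_inference.py", "--model", "bioBERT", "--input", note],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())
```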
Med vLLM includes a comprehensive test suite to ensure code quality and functionality. The tests are written with Python's unittest framework and run with pytest.
To run all tests:
# Run all tests
python -m pytest tests/unit/ -v
# Run a specific test file
python -m pytest tests/unit/test_medical_adapters.py -v
# Run a specific test class
python -m pytest tests/unit/test_medical_adapters.py::TestBioBERTAdapter -v
# Run a specific test method
python -m pytest tests/unit/test_medical_adapters.py::TestBioBERTAdapter::test_biomedical_text_processing -v

To generate a test coverage report:
# Install coverage if not already installed
pip install coverage
# Run tests with coverage
coverage run -m pytest tests/unit/
# Generate coverage report
coverage report -m
# Generate HTML coverage report
coverage html

The HTML report will be available in the htmlcov directory.
- A/B smoke test for text generation strategies (offline echo engine):
  python scripts/ab_test_textgen.py --dataset benchmarks/datasets/textgen_small.jsonl --output benchmarks/results/textgen_ab_results.json
- Domain expert evaluation protocol and template: see docs/expert_eval_protocol.md and docs/expert_eval_template.csv. Aggregate filled scores with:
  python scripts/aggregate_expert_eval.py path/to/your_eval.csv
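The A/B script above writes its results to a JSON file. Its exact schema is not documented in this README, so the sketch below just loads the file and pretty-prints whatever structure it contains.

```python
# Sketch: inspect the A/B results written by scripts/ab_test_textgen.py.
# The JSON schema is not documented here; this only loads and pretty-prints it.
import json
from pprint import pprint

with open("benchmarks/results/textgen_ab_results.json") as f:
    results = json.load(f)

pprint(results)
```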
The test suite is organized as follows:
- tests/unit/test_medical_adapters.py: Contains all unit tests for medical adapters
  - TestBaseAdapter: Tests for the base adapter functionality
  - TestBioBERTAdapter: Tests specific to the BioBERT adapter
  - TestClinicalBERTAdapter: Tests specific to the ClinicalBERT adapter
For benchmark quick starts (CPU/GPU adapter smokes, training smokes, report generation), see:
benchmarks/README.md
Classify a clinical note as positive or negative for a condition:
python run_inference.py --model clinicalBERT --task classify --input "Patient shows signs of pneumonia."

Extract medical entities from text:
python run_inference.py --model bioBERT --task ner --input "Patient prescribed metformin for diabetes."

Use a simple, pluggable NER processor with a regex fallback or your own model-backed pipeline:
from medvllm.tasks import NERProcessor
proc = NERProcessor(inference_pipeline=None, config=None) # regex fallback
res = proc.extract_entities("Patient has myocardial infarction (MI). Aspirin given.")
linked = proc.link_entities(res, ontology="UMLS")
html = proc.highlight_entities(linked)

- Example script: examples/ner_processor_example.py
- Documentation: docs/ner_processor.md
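As a small follow-up to the snippet above, the highlighted output can be saved to disk for review in a browser. This assumes highlight_entities returns an HTML string, as the example suggests; see docs/ner_processor.md for the exact return types and entity schema.

```python
# Follow-up sketch: run the NER example above end to end and save the
# highlighted output. Assumes highlight_entities returns an HTML string;
# see docs/ner_processor.md for the actual return types.
from pathlib import Path

from medvllm.tasks import NERProcessor

proc = NERProcessor(inference_pipeline=None, config=None)  # regex fallback
entities = proc.extract_entities("Patient has myocardial infarction (MI). Aspirin given.")
linked = proc.link_entities(entities, ontology="UMLS")
html = proc.highlight_entities(linked)

Path("ner_highlight.html").write_text(html, encoding="utf-8")
print("Wrote ner_highlight.html")
```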
Measure linking performance and cache effectiveness on longer notes:
python3 -m benchmarks.benchmark_linking --paragraphs 50 --runs 3 --ontology RXNORM

See docs/ner_processor.md for external enrichment (RxNorm, UMLS CAS/TGT) configuration.
Generate a summary of a patient's medical history:
python run_inference.py --model clinicalBERT --task generate --input "Patient has diabetes and hypertension."

To fine-tune Med vLLM on your own medical dataset:
- Prepare your dataset in a compatible format (e.g., JSON or CSV); see the sketch below for one possible layout.
- Use the provided training script:
python train.py --model bioBERT --dataset path/to/your/data
- Evaluate the fine-tuned model with:
python evaluate.py --model path/to/finetuned/model
Detailed instructions will be provided as the project evolves.
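For illustration, a tiny JSONL dataset for the classification task above could be assembled as follows. The "text" and "label" field names are assumptions rather than a documented schema, so check train.py (or its --help output) for the format it actually expects.

```python
# Illustrative sketch only: build a small JSONL dataset for fine-tuning.
# The "text" / "label" field names are assumptions, not a documented schema.
import json

examples = [
    {"text": "Patient shows signs of pneumonia.", "label": "positive"},
    {"text": "No acute cardiopulmonary findings.", "label": "negative"},
]

with open("clinical_notes.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The resulting file could then be passed to the training script, e.g. python train.py --model bioBERT --dataset clinical_notes.jsonl.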
- Currently supports only English-language medical texts.
- Multilingual support is planned for future releases.
We welcome contributions! To get involved:
- Report bugs or suggest features by opening an issue.
- Submit pull requests with improvements, following the project's code style and including tests for new features.
This project is licensed under the MIT License - see the LICENSE file for details.
Med vLLM builds upon:

- Nano vLLM
- BioBERT
- ClinicalBERT
- Hugging Face Transformers

Thanks to their creators for their open-source contributions.
The Triton streaming softmax×V kernel is experimental and gated by default. Use it only for development and benchmarking.
- Gating (default off): The Triton path is disabled unless explicitly enabled via env vars.
- Fallbacks: If disabled, we use a safe row-softmax + matmul path; Flash Attention is optional and not required.
Set these environment variables to route the attention softmax×V through the Triton streaming kernel:
export MEDVLLM_ENABLE_TRITON_SOFTMAXV=1
export MEDVLLM_ENABLE_TRITON_SOFTMAXV_STREAMING=1
export MEDVLLM_FORCE_STREAMING_SOFTMAXV=1  # force use during dev

Autotune may cause long JIT times. Use these to constrain it:
- Fast compile (single tiny config): MEDVLLM_SOFTMAXV_COMPILE_FAST=1
- Narrow preset (few configs): MEDVLLM_SOFTMAXV_COMPILE_NARROW=1
- Force single config by index: MEDVLLM_SOFTMAXV_FORCE_CONFIG=<int>
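Since these switches are plain environment variables, they can also be set from Python, which is convenient in notebooks or test scripts. The sketch below only uses variables documented in this section and assumes they are set before the Med vLLM engine is imported or run, so they are visible when the kernel gating reads them.

```python
# Dev-only sketch: enable the experimental Triton streaming softmax*V path
# from Python. Set these before importing/running the Med vLLM engine so the
# gating code sees them (an assumption about when they are read).
import os

os.environ["MEDVLLM_ENABLE_TRITON_SOFTMAXV"] = "1"
os.environ["MEDVLLM_ENABLE_TRITON_SOFTMAXV_STREAMING"] = "1"
os.environ["MEDVLLM_FORCE_STREAMING_SOFTMAXV"] = "1"   # force use during dev
os.environ["MEDVLLM_SOFTMAXV_COMPILE_FAST"] = "1"      # single tiny config to keep JIT fast
```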
Bypass autotune entirely by compiling exactly one configuration (recommended during early bring-up):
export MEDVLLM_SOFTMAXV_NO_AUTOTUNE=1
export MEDVLLM_SOFTMAXV_BLOCK_N=128 # seq tile
export MEDVLLM_SOFTMAXV_BLOCK_D=64 # feature tile
export MEDVLLM_SOFTMAXV_K=4 # inner unroll
export MEDVLLM_SOFTMAXV_NUM_WARPS=4
export MEDVLLM_SOFTMAXV_NUM_STAGES=2
export MEDVLLM_SOFTMAXV_MAX_TILES_CAP=32  # cap compile-time loop bound

- Warm-up compile on a smaller shape to prime the JIT cache:
python benchmarks/benchmark_attention.py \
--device cuda --seq 256 --heads 8 --dim 512 --iters 1 \
--attn-softmaxv-bench --enable-triton-softmaxv

- Target run on your actual shape (consider a shell timeout on first build):
python benchmarks/benchmark_attention.py \
--device cuda --seq 512 --heads 8 --dim 512 --iters 3 \
--attn-softmaxv-bench --enable-triton-softmaxv

Notes:
- If compile stalls: prefer the no-autotune path; reduce NUM_STAGES to 1; increase BLOCK_N to shrink MAX_TILES.
- Performance tuning ideas: switch to block pointers for V, experiment with small-width dot patterns inside the K-unroll, and re-expand autotune once compile is reliable.
If you use Med vLLM in your research or application, please cite it as:
[SHA888](https://github.com/SHA888). (2025). Med vLLM: A Medical Language Model. GitHub repository, https://github.com/SHA888/med-vllm