AI-Powered Medical Text De-identification (LLM/RAG-driven medical text de-identification toolkit)

One-click deployment: supports OpenAI / Anthropic / Ollama / MiniMind LLMs
High accuracy: RAG + LLM dual engine, 95%+ PHI detection accuracy
Hybrid strategy: three-tier SpaCy + Regex + LLM detection, 30-100x faster
Multi-language: 10+ languages, including Traditional/Simplified Chinese, English, Japanese, Korean, French, German
Batch processing: 10+ formats, including Excel/CSV/PDF/Word, in one pass
Privacy-first: medical data never leaves your environment; HIPAA/GDPR compliant
Fully open source: MIT License, commercial use permitted
- Overview
- Key Features
- Quick Start
- Installation
- Usage Examples
- Supported LLM Providers
- Architecture
- Documentation
- Contributing
- License
Medical De-identification Toolkit is an open-source Python library that uses LLM (Large Language Model) and RAG (Retrieval-Augmented Generation) technology to automatically identify and mask Protected Health Information (PHI) in medical records.
| Challenge | Traditional Approach | Our Solution |
|---|---|---|
| PHI Detection | Rule-based regex | LLM + RAG semantic understanding |
| Multi-language | Separate models | Single multilingual pipeline |
| Custom Rules | Hard-coded | RAG retrieves from regulation docs |
| Deployment | Heavy dependencies | Supports ultra-light MiniMind (26M params) |
- 20+ PHI Types: Name, Date, Location, Medical Record Number, Age >89, Rare Diseases, etc.
- Multi-language: Traditional Chinese, Simplified Chinese, English, Japanese, Korean, and more
- Context-aware: Understands medical context for accurate detection
| Strategy | Description | Example |
|---|---|---|
| Redaction | Complete removal | 張三 → [REDACTED] |
| Masking | Type-based placeholder | 張三 → [NAME] |
| Generalization | Reduce precision | 1990-05-15 → 1990 |
| Pseudonymization | Consistent replacement | 張三 → Patient_A |
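In spirit, the four strategies reduce to simple text transformations. The helpers below are an illustrative sketch of each row of the table, not the toolkit's actual `MaskingStrategy` implementation:

```python
# Illustrative sketch of the four masking strategies; these helpers are
# hypothetical and only mirror the semantics described in the table above.

def redact(text: str) -> str:
    return "[REDACTED]"                      # complete removal

def mask(text: str, phi_type: str) -> str:
    return f"[{phi_type}]"                   # type-based placeholder

def generalize_date(iso_date: str) -> str:
    return iso_date.split("-")[0]            # reduce precision: keep the year

_pseudonyms: dict[str, str] = {}

def pseudonymize(text: str) -> str:
    # consistent replacement: the same input always maps to the same alias
    if text not in _pseudonyms:
        _pseudonyms[text] = f"Patient_{chr(ord('A') + len(_pseudonyms))}"
    return _pseudonyms[text]
```

Pseudonymization is the only stateful strategy: it must remember prior mappings so that repeated mentions of the same patient stay linkable after de-identification.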
- Text: TXT, CSV, JSON
- Office: XLSX, XLS, DOCX
- Document: PDF, HTML, XML
- Healthcare: FHIR R4 JSON
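Dispatching a file to the right parser typically keys on its extension. A hypothetical sketch of such a registry (the loader names here are made up for illustration; the toolkit's real loader module may differ):

```python
from pathlib import Path

# Hypothetical extension → loader registry covering the formats listed above.
LOADERS = {
    ".txt": "TextLoader", ".csv": "CSVLoader", ".json": "JSONLoader",
    ".xlsx": "ExcelLoader", ".xls": "ExcelLoader", ".docx": "WordLoader",
    ".pdf": "PDFLoader", ".html": "HTMLLoader", ".xml": "XMLLoader",
}

def pick_loader(path: str) -> str:
    # Normalize the extension so "report.PDF" and "report.pdf" match alike.
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return LOADERS[suffix]
```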
- Cloud: OpenAI GPT-4o, Anthropic Claude 3
- Local: Ollama (Qwen, Llama, Mistral)
- Ultra-light: MiniMind (26M-104M params) (NEW!)
- DSPy Integration: Automatic prompt optimization (NEW!)
from medical_deidentification.application.processing import DeidentificationEngine
from medical_deidentification.infrastructure.llm import LLMPresets, create_llm
# 1. Choose your LLM (pick one)
llm = create_llm(LLMPresets.local_minimind()) # Free, runs locally!
# llm = create_llm(LLMPresets.local_qwen()) # Better quality
# llm = create_llm(LLMPresets.gpt_4o()) # Best quality (requires API key)
# 2. Create engine
engine = DeidentificationEngine(llm=llm)
# 3. Process medical text
text = """
病患姓名:王大明,身分證字號:A123456789
出生日期:1985年3月15日,聯絡電話:0912-345-678
診斷:法布瑞氏症(罕見疾病)
主治醫師:陳醫師,台北榮民總醫院
"""
result = engine.process(text)
print(result.deidentified_text)
# Output: 病患姓名:[NAME],身分證字號:[ID]...

pip install medical-deidentification

git clone https://github.com/u9401066/medical-deidentification.git
cd medical-deidentification
pip install -e .

git clone https://github.com/u9401066/medical-deidentification.git
cd medical-deidentification
poetry install
poetry shell

from medical_deidentification.infrastructure.rag import PHIIdentificationChain
from medical_deidentification.infrastructure.llm import LLMConfig, create_llm
# Configure LLM
config = LLMConfig(
    provider="ollama",
    model_name="qwen2.5:7b",
    temperature=0.0
)
llm = create_llm(config)

# Create the identification chain and detect PHI entities
phi_chain = PHIIdentificationChain(llm=llm)
entities = phi_chain.identify_phi(medical_text)
for entity in entities:
    print(f"Found: {entity.text} ({entity.phi_type}, confidence: {entity.confidence})")

from medical_deidentification.application.processing import (
    BatchPHIProcessor,
    BatchProcessingConfig
)
# Configure batch processor
batch_config = BatchProcessingConfig(
    max_rows=100,          # Process first 100 rows
    language="zh-TW",      # Traditional Chinese
    skip_empty_rows=True
)
processor = BatchPHIProcessor(phi_chain, batch_config)
result = processor.process_excel_file("patient_records.xlsx")
# Export results
result.to_excel("phi_results.xlsx")
print(f"Found {result.total_entities} PHI entities in {result.processed_rows} rows")

from medical_deidentification.infrastructure.llm import LLMPresets, create_llm
# MiniMind: Only 104M parameters, runs on any hardware!
config = LLMPresets.local_minimind()
llm = create_llm(config)
# First, pull the model (one-time setup)
# $ ollama pull jingyaogong/minimind2

from medical_deidentification.infrastructure.rag import (
    RegulationRetrievalChain,
    PHIIdentificationChain
)
# Load regulation documents (HIPAA, GDPR, Taiwan PDPA, etc.)
regulation_chain = RegulationRetrievalChain(
    regulation_dir="./regulations"
)
# PHI detection with regulation context
phi_chain = PHIIdentificationChain(
    regulation_chain=regulation_chain,
    llm=llm
)
# The system now retrieves relevant regulations to guide PHI detection
entities = phi_chain.identify_phi(medical_text)

from medical_deidentification.infrastructure.dspy import (
    PHIIdentifier,
    PHIPromptOptimizer,
    PHIEvaluator
)
# Configure DSPy with Ollama
from medical_deidentification.infrastructure.dspy.phi_module import configure_dspy_ollama
configure_dspy_ollama(model_name="qwen2.5:1.5b")
# Create base PHI identifier
identifier = PHIIdentifier()
# Run automatic optimization with DSPy
optimizer = PHIPromptOptimizer()
result = optimizer.optimize(
    trainset=training_examples,
    method="bootstrap",    # or "mipro"
    max_iterations=10
)
# Use optimized module
optimized_identifier = result.best_module
entities = optimized_identifier(medical_text="Patient John Smith, age 45...")
# Check metrics
print(f"F1 Score: {result.optimized_score:.2%}")
print(f"Speed improvement: {result.time_improvement:.2%}")

| Provider | Models | Structured Output | Setup |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-3.5 | ✅ Native | OPENAI_API_KEY |
| Anthropic | Claude 3 Opus/Sonnet/Haiku | ✅ Native | ANTHROPIC_API_KEY |
| Model | Parameters | Speed | Quality | GPU VRAM |
|---|---|---|---|---|
| MiniMind2 | 104M | ⚡⚡⚡⚡⚡ | ⭐⭐ | 1GB |
| MiniMind2-Small | 26M | ⚡⚡⚡⚡⚡ | ⭐ | <1GB |
| Qwen 2.5 7B | 7B | ⚡⚡⚡ | ⭐⭐⭐⭐ | 4GB |
| Llama 3.1 8B | 8B | ⚡⚡⚡ | ⭐⭐⭐⭐ | 4GB |
| Mistral 7B | 7B | ⚡⚡⚡ | ⭐⭐⭐ | 4GB |
# Install Ollama (https://ollama.ai)
# Then pull your preferred model:
ollama pull jingyaogong/minimind2 # Ultra-light (recommended for testing)
ollama pull qwen2.5:7b # Balanced (recommended for production)
ollama pull llama3.1:8b              # General purpose

See Ollama Setup Guide for detailed instructions.
┌────────────────────────────────────────────────────────┐
│             Hybrid PHI Detection Pipeline              │
├────────────────────────────────────────────────────────┤
│ Level 1: Regex Fast Scan (~0.001s)                     │
│   ├── ID Numbers, Phone, Email, Date patterns          │
│   └── Coverage: ~30% of PHI                            │
├────────────────────────────────────────────────────────┤
│ Level 2: SpaCy NER (~0.01-0.05s)                       │
│   ├── PERSON, DATE, ORG, GPE, LOC entities             │
│   └── Coverage: ~40% of PHI                            │
├────────────────────────────────────────────────────────┤
│ Level 3: Small LLM - Uncertain Regions Only (~0.5-2s)  │
│   ├── Qwen2.5-0.5B/1.5B for remaining ~30%             │
│   └── Fall back to Qwen2.5-7B for complex cases        │
└────────────────────────────────────────────────────────┘
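In outline, the cascade runs the cheap tiers first and reserves the LLM for whatever they could not resolve. A minimal sketch of that control flow (the regex tier is real; the NER and LLM tiers are stand-in callables, not the toolkit's actual SpaCy/Qwen components):

```python
import re
from typing import Callable

# Level-1 patterns: well-formatted identifiers regex can catch cheaply.
PHONE = re.compile(r"\b09\d{2}-\d{3}-\d{3}\b")   # Taiwan mobile number format
TW_ID = re.compile(r"\b[A-Z]\d{9}\b")            # Taiwan national ID format

def detect_phi(text: str,
               ner_tier: Callable[[str], list] = lambda t: [],
               llm_tier: Callable[[str], list] = lambda t: []) -> list:
    # Level 1: cheap regex scan (~0.001s).
    hits = PHONE.findall(text) + TW_ID.findall(text)
    # Level 2: NER catches names/dates/locations that regex cannot.
    hits += ner_tier(text)
    # Level 3: only the text still unresolved goes to the (slow) LLM.
    remaining = PHONE.sub("", TW_ID.sub("", text))
    hits += llm_tier(remaining)
    return hits
```

The speedup comes from ordering: most PHI never reaches Level 3, so the per-document LLM cost shrinks to the genuinely ambiguous spans.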
┌───────────────────────────────────────────────────────────┐
│           FIFO Stateless Streaming Architecture           │
├───────────────────────────────────────────────────────────┤
│ Large File → [Chunk Iterator] → Process → Output → Next   │
│                                                           │
│ Features:                                                 │
│   ├── FIFO: Process chunks in order, one at a time        │
│   ├── Checkpoint: Resume from last processed chunk        │
│   ├── No accumulation: Immediate output, low memory       │
│   ├── Unlimited file size: Stream processing              │
│   └── Tools: Pre-scan with Regex/SpaCy before LLM         │
├───────────────────────────────────────────────────────────┤
│ Usage:                                                    │
│   chain = StreamingPHIChain(llm, config)                  │
│   for result in chain.process_file("large.txt"):          │
│       print(f"Chunk {result.chunk_id}: {len(result.phi)}")│
└───────────────────────────────────────────────────────────┘
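The checkpoint-and-resume idea behind this design fits in a few lines. `iter_chunks` and `process_file` below are illustrative helpers sketching the pattern, not `StreamingPHIChain`'s real API:

```python
import json
import os
from typing import Iterator, Tuple

def iter_chunks(path: str, chunk_chars: int = 4096) -> Iterator[Tuple[int, str]]:
    """Yield (chunk_id, text) pairs in FIFO order, one chunk in memory at a time."""
    with open(path, encoding="utf-8") as f:
        chunk_id = 0
        while chunk := f.read(chunk_chars):
            yield chunk_id, chunk
            chunk_id += 1

def process_file(path: str, checkpoint: str = "ckpt.json") -> int:
    """Process chunks in order; persist progress so a crashed run can resume."""
    done = -1
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            done = json.load(f)["last_chunk"]
    processed = 0
    for chunk_id, chunk in iter_chunks(path):
        if chunk_id <= done:               # already finished in a prior run
            continue
        # ... de-identify `chunk` and emit its output immediately here ...
        with open(checkpoint, "w") as f:   # record progress only after emitting
            json.dump({"last_chunk": chunk_id}, f)
        processed += 1
    return processed
```

Because the checkpoint is written after each chunk's output is emitted, a restart at worst reprocesses one chunk; memory stays bounded by the chunk size regardless of file size.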
┌───────────────────────────────────────────────────────────────┐
│               Medical De-identification Toolkit               │
├───────────────────────────────────────────────────────────────┤
│ Interface Layer                                               │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐         │
│   │     CLI     │   │   Python    │   │  REST API   │         │
│   │   (Typer)   │   │   Library   │   │  (Future)   │         │
│   └─────────────┘   └─────────────┘   └─────────────┘         │
├───────────────────────────────────────────────────────────────┤
│ Application Layer                                             │
│   ┌───────────────────────────────────────────────────┐       │
│   │ DeidentificationEngine  │  BatchPHIProcessor      │       │
│   │ PHI Detection Pipeline  │  Report Generator       │       │
│   └───────────────────────────────────────────────────┘       │
├───────────────────────────────────────────────────────────────┤
│ Infrastructure Layer                                          │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│   │   LLM    │  │   RAG    │  │  Loader  │  │  Output  │      │
│   │ Factory  │  │  Engine  │  │ (10 fmt) │  │ Manager  │      │
│   └──────────┘  └──────────┘  └──────────┘  └──────────┘      │
├───────────────────────────────────────────────────────────────┤
│ Domain Layer                                                  │
│   ┌───────────────────────────────────────────────────┐       │
│   │ PHIEntity │ PHIType │ MaskingStrategy │ Config    │       │
│   └───────────────────────────────────────────────────┘       │
└───────────────────────────────────────────────────────────────┘
See Architecture Guide for detailed design.
| Document | Description |
|---|---|
| Architecture Guide | System design & DDD structure |
| Deployment Guide | Installation & configuration |
| Ollama Setup | Local LLM setup guide |
| Batch Processing | Excel/CSV batch processing |
| RAG Usage Guide | Regulation retrieval system |
| LLM Provider | Model | Time | Hardware |
|---|---|---|---|
| MiniMind | minimind2 | ~2-5s | CPU only |
| Ollama | qwen2.5:7b | ~15-25s | RTX 3090 |
| OpenAI | gpt-4o-mini | ~3-5s | API |
| Anthropic | claude-3-haiku | ~2-4s | API |
| PHI Type | Precision | Recall | F1 Score |
|---|---|---|---|
| Name | 96% | 94% | 95% |
| Date | 98% | 97% | 97.5% |
| ID Number | 99% | 98% | 98.5% |
| Location | 92% | 90% | 91% |
| Age >89 | 100% | 99% | 99.5% |
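Each F1 value above is the harmonic mean of its row's precision and recall, which you can verify directly:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Name row: precision 96%, recall 94%
print(round(f1(0.96, 0.94) * 100, 1))  # prints 95.0
```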
We welcome contributions!
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- Never commit real PHI to this repository
- Medical data is processed in-memory only (not persisted)
- Designed for HIPAA and GDPR compliance
- Users are responsible for proper usage in their context
- LangChain - LLM framework
- Ollama - Local LLM runtime
- MiniMind - Ultra-lightweight LLM
- FAISS - Vector similarity search
Built with ❤️ for Healthcare Privacy