๐Ÿฅ Medical De-identification Toolkit

License: MIT Python 3.11+ LangChain Code style: black

๐Ÿ”’ LLM/RAG ้ฉ…ๅ‹•็š„้†ซ็™‚ๆ–‡ๆœฌๅŽป่ญ˜ๅˆฅๅŒ–ๅทฅๅ…ท | AI-Powered Medical Text De-identification

English | ็น้ซ”ไธญๆ–‡


✨ Highlights

🚀 One-click deployment: supports OpenAI / Anthropic / Ollama / MiniMind LLM backends
🎯 High accuracy: dual RAG + LLM engine, 95%+ PHI identification accuracy
⚡ Hybrid strategy: three-tier SpaCy + Regex + LLM detection, 30-100x faster
🌐 Multi-language: 10+ languages, including Traditional/Simplified Chinese, English, Japanese, Korean, French, German
📊 Batch processing: handle 10+ formats (Excel/CSV/PDF/Word) in one pass
🔐 Privacy first: medical records are never persisted; designed for HIPAA/GDPR compliance
🆓 Fully open source: MIT License, commercial use permitted



Overview

Medical De-identification Toolkit is an open-source Python library that uses LLM (Large Language Model) and RAG (Retrieval-Augmented Generation) technology to automatically identify and mask Protected Health Information (PHI) in medical records.


🎯 Why This Tool?

| Challenge | Traditional Approach | Our Solution |
| --- | --- | --- |
| PHI Detection | Rule-based regex | 🤖 LLM + RAG semantic understanding |
| Multi-language | Separate models | 🌐 Single multilingual pipeline |
| Custom Rules | Hard-coded | 📚 RAG retrieves from regulation docs |
| Deployment | Heavy dependencies | ⚡ Supports ultra-light MiniMind (26M params) |

🌟 Key Features

๐Ÿ” PHI Detection | PHI ่ญ˜ๅˆฅ

  • 20+ PHI Types: Name, Date, Location, Medical Record Number, Age >89, Rare Diseases, etc.
  • Multi-language: Traditional Chinese, Simplified Chinese, English, Japanese, Korean, and more
  • Context-aware: Understands medical context for accurate detection
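
For illustration only, a detected entity can be pictured as a small record carrying the matched text, its PHI type, and a confidence score. The field names below mirror the attributes used in the usage examples further down; the toolkit's actual PHIEntity class in the domain layer may differ, and the offset fields here are an assumption.

from dataclasses import dataclass

@dataclass
class PHIEntitySketch:
    text: str          # surface form, e.g. "0912-345-678"
    phi_type: str      # e.g. "NAME", "DATE", "ID_NUMBER", "PHONE"
    confidence: float  # detection confidence in [0.0, 1.0]
    start: int = -1    # character offsets in the source text (assumed fields)
    end: int = -1

entity = PHIEntitySketch("0912-345-678", "PHONE", 0.98, start=35, end=47)
print(f"{entity.phi_type}: {entity.text} ({entity.confidence:.0%})")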

๐Ÿ›ก๏ธ De-identification Strategies | ๅŽป่ญ˜ๅˆฅๅŒ–็ญ–็•ฅ

| Strategy | Description | Example |
| --- | --- | --- |
| Redaction | Complete removal | 張三 → [REDACTED] |
| Masking | Type-based placeholder | 張三 → [NAME] |
| Generalization | Reduce precision | 1990-05-15 → 1990 |
| Pseudonymization | Consistent replacement | 張三 → Patient_A |

(張三 is a generic Chinese placeholder name, comparable to "John Doe".)
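
As a rough, plain-Python illustration of how these four strategies differ (a sketch only, not the toolkit's MaskingStrategy API):

import hashlib

def apply_strategy(value: str, phi_type: str, strategy: str) -> str:
    # Minimal sketch of the four strategies listed above.
    if strategy == "redaction":
        return "[REDACTED]"
    if strategy == "masking":
        return f"[{phi_type}]"
    if strategy == "generalization" and phi_type == "DATE":
        return value[:4]  # keep only the year: 1990-05-15 -> 1990
    if strategy == "pseudonymization":
        alias = hashlib.sha256(value.encode()).hexdigest()[:6]
        return f"Patient_{alias}"  # the same input always maps to the same alias
    return value

print(apply_strategy("張三", "NAME", "masking"))               # [NAME]
print(apply_strategy("1990-05-15", "DATE", "generalization"))  # 1990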

๐Ÿ“ Supported Formats | ๆ”ฏๆดๆ ผๅผ

📄 Text: TXT, CSV, JSON
📊 Office: XLSX, XLS, DOCX
📑 Document: PDF, HTML, XML
🏥 Healthcare: FHIR R4 JSON

🤖 Multiple LLM Backends

  • Cloud: OpenAI GPT-4o, Anthropic Claude 3
  • Local: Ollama (Qwen, Llama, Mistral)
  • Ultra-light: MiniMind (26M-104M params) ← 🆕 NEW!
  • DSPy Integration: Automatic prompt optimization ← 🆕 NEW!

🚀 Quick Start

30-Second Demo

from medical_deidentification.application.processing import DeidentificationEngine
from medical_deidentification.infrastructure.llm import LLMPresets, create_llm

# 1. Choose your LLM (pick one)
llm = create_llm(LLMPresets.local_minimind())  # Free, runs locally!
# llm = create_llm(LLMPresets.local_qwen())    # Better quality
# llm = create_llm(LLMPresets.gpt_4o())        # Best quality (requires API key)

# 2. Create engine
engine = DeidentificationEngine(llm=llm)

# 3. Process medical text
text = """
病患姓名：王大明，身分證字號：A123456789
出生日期：1985年3月15日，聯絡電話：0912-345-678
診斷：法布瑞氏症（罕見疾病）
主治醫師：陳醫師，台北榮民總醫院
"""
# (Patient name: Wang Da-Ming, ID no. A123456789; date of birth: 1985-03-15,
#  phone: 0912-345-678; diagnosis: Fabry disease (rare disease);
#  attending physician: Dr. Chen, Taipei Veterans General Hospital)

result = engine.process(text)
print(result.deidentified_text)
# Output: 病患姓名：[NAME]，身分證字號：[ID]...

📦 Installation

Option 1: pip (Recommended)

pip install medical-deidentification

Option 2: From Source

git clone https://github.com/u9401066/medical-deidentification.git
cd medical-deidentification
pip install -e .

Option 3: Poetry (Development)

git clone https://github.com/u9401066/medical-deidentification.git
cd medical-deidentification
poetry install
poetry shell
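
A quick sanity check after any of the three options (assuming the import name matches the distribution name):

python -c "import medical_deidentification"   # no output means the import succeeded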

💡 Usage Examples

Example 1: Basic PHI Identification

from medical_deidentification.infrastructure.rag import PHIIdentificationChain
from medical_deidentification.infrastructure.llm import LLMConfig, create_llm

# Configure LLM
config = LLMConfig(
    provider="ollama",
    model_name="qwen2.5:7b",
    temperature=0.0
)
llm = create_llm(config)

# Build the identification chain and run it on a sample note
# (Example 4 below shows how to add a regulation_chain for RAG context)
phi_chain = PHIIdentificationChain(llm=llm)
medical_text = "Patient John Smith, DOB 1985-03-15, contact 0912-345-678"
entities = phi_chain.identify_phi(medical_text)
for entity in entities:
    print(f"Found: {entity.text} ({entity.phi_type}, confidence: {entity.confidence})")

Example 2: Batch Processing Excel

from medical_deidentification.application.processing import (
    BatchPHIProcessor,
    BatchProcessingConfig
)

# Configure batch processor
batch_config = BatchProcessingConfig(
    max_rows=100,           # Process first 100 rows
    language="zh-TW",       # Traditional Chinese
    skip_empty_rows=True
)

processor = BatchPHIProcessor(phi_chain, batch_config)  # phi_chain as built in Example 1
result = processor.process_excel_file("patient_records.xlsx")

# Export results
result.to_excel("phi_results.xlsx")
print(f"Found {result.total_entities} PHI entities in {result.processed_rows} rows")

Example 3: Using MiniMind (Ultra-light Local LLM)

from medical_deidentification.infrastructure.llm import LLMPresets, create_llm

# One-time setup: pull the model first
# $ ollama pull jingyaogong/minimind2

# MiniMind: only 104M parameters, light enough to run on CPU-only machines
config = LLMPresets.local_minimind()
llm = create_llm(config)
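
The MiniMind configuration plugs into the same engine as any other backend, so a complete round trip looks just like the 30-second demo above:

from medical_deidentification.application.processing import DeidentificationEngine

engine = DeidentificationEngine(llm=llm)
result = engine.process("Patient John Smith, DOB 1985-03-15, phone 0912-345-678")
print(result.deidentified_text)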

Example 4: RAG-Enhanced Detection

from medical_deidentification.infrastructure.rag import (
    RegulationRetrievalChain,
    PHIIdentificationChain
)

# Load regulation documents (HIPAA, GDPR, Taiwan PDPA, etc.)
regulation_chain = RegulationRetrievalChain(
    regulation_dir="./regulations"
)

# PHI detection with regulation context
phi_chain = PHIIdentificationChain(
    regulation_chain=regulation_chain,
    llm=llm
)

# The system now retrieves relevant regulations to guide PHI detection
entities = phi_chain.identify_phi(medical_text)

Example 5: DSPy Automatic Prompt Optimization 🆕

from medical_deidentification.infrastructure.dspy import (
    PHIIdentifier,
    PHIPromptOptimizer,
    PHIEvaluator
)

# Configure DSPy with Ollama
from medical_deidentification.infrastructure.dspy.phi_module import configure_dspy_ollama
configure_dspy_ollama(model_name="qwen2.5:1.5b")

# Create base PHI identifier
identifier = PHIIdentifier()

# Run automatic optimization with DSPy
# (trainset below: labeled PHI examples that you provide)
optimizer = PHIPromptOptimizer()
result = optimizer.optimize(
    trainset=training_examples,
    method="bootstrap",  # or "mipro"
    max_iterations=10
)

# Use optimized module
optimized_identifier = result.best_module
entities = optimized_identifier(medical_text="Patient John Smith, age 45...")

# Check metrics
print(f"F1 Score: {result.optimized_score:.2%}")
print(f"Speed improvement: {result.time_improvement:.2%}")

🤖 Supported LLM Providers

Cloud Providers

| Provider | Models | Structured Output | Setup |
| --- | --- | --- | --- |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-3.5 | ✅ Native | OPENAI_API_KEY |
| Anthropic | Claude 3 Opus/Sonnet/Haiku | ✅ Native | ANTHROPIC_API_KEY |

Local Models (via Ollama)

| Model | Parameters | Speed | Quality | GPU VRAM |
| --- | --- | --- | --- | --- |
| MiniMind2 🆕 | 104M | ⚡⚡⚡⚡⚡ | ⭐⭐ | 1GB |
| MiniMind2-Small | 26M | ⚡⚡⚡⚡⚡ | ⭐ | <1GB |
| Qwen 2.5 7B | 7B | ⚡⚡⚡ | ⭐⭐⭐⭐ | 4GB |
| Llama 3.1 8B | 8B | ⚡⚡⚡ | ⭐⭐⭐⭐ | 4GB |
| Mistral 7B | 7B | ⚡⚡⚡ | ⭐⭐⭐ | 4GB |

Quick Setup for Local Models

# Install Ollama (https://ollama.ai)
# Then pull your preferred model:

ollama pull jingyaogong/minimind2     # Ultra-light (recommended for testing)
ollama pull qwen2.5:7b                # Balanced (recommended for production)
ollama pull llama3.1:8b               # General purpose
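
To confirm a model is actually available before pointing the toolkit at it, list what Ollama has pulled:

ollama list    # the model you pulled (e.g. qwen2.5:7b) should appear here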

📖 See Ollama Setup Guide for detailed instructions.


๐Ÿ—๏ธ Architecture | ็ณป็ตฑๆžถๆง‹

Hybrid PHI Detection Pipeline 🆕

┌──────────────────────────────────────────────────────────────┐
│                Hybrid PHI Detection Pipeline                 │
├──────────────────────────────────────────────────────────────┤
│  Level 1: Regex Fast Scan (~0.001s)                          │
│  ├── ID Numbers, Phone, Email, Date patterns                 │
│  └── Coverage: ~30% of PHI                                   │
├──────────────────────────────────────────────────────────────┤
│  Level 2: SpaCy NER (~0.01-0.05s)                            │
│  ├── PERSON, DATE, ORG, GPE, LOC entities                    │
│  └── Coverage: ~40% of PHI                                   │
├──────────────────────────────────────────────────────────────┤
│  Level 3: Small LLM - Uncertain Regions Only (~0.5-2s)       │
│  ├── Qwen2.5-0.5B/1.5B for remaining ~30%                    │
│  └── Fall back to Qwen2.5-7B for complex cases               │
└──────────────────────────────────────────────────────────────┘
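
Conceptually, each level only sees what the cheaper levels could not resolve. A simplified sketch of the cascade follows; the function names, patterns, and hand-off logic are illustrative, not the toolkit's internal API:

import re

ID_PATTERN = re.compile(r"\b[A-Z]\d{9}\b")              # e.g. Taiwan ID format
PHONE_PATTERN = re.compile(r"\b09\d{2}-\d{3}-\d{3}\b")  # e.g. 0912-345-678

def hybrid_detect(text, spacy_nlp, llm_detect):
    """Cascade sketch: cheap regex first, then spaCy NER, then an LLM on the rest."""
    entities = []
    # Level 1: regex fast scan for rigidly formatted PHI
    for pattern, phi_type in [(ID_PATTERN, "ID_NUMBER"), (PHONE_PATTERN, "PHONE")]:
        entities += [(m.group(), phi_type) for m in pattern.finditer(text)]
    # Level 2: spaCy NER for names, dates, organizations, locations
    for ent in spacy_nlp(text).ents:
        if ent.label_ in {"PERSON", "DATE", "ORG", "GPE", "LOC"}:
            entities.append((ent.text, ent.label_))
    # Level 3: blank out what was already found and let the small LLM handle the rest
    remainder = text
    for surface, _ in entities:
        remainder = remainder.replace(surface, " " * len(surface))
    entities += llm_detect(remainder)
    return entities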

Streaming PHI Chain 🆕

┌──────────────────────────────────────────────────────────────┐
│            FIFO Stateless Streaming Architecture             │
├──────────────────────────────────────────────────────────────┤
│  Large File → [Chunk Iterator] → Process → Output → Next     │
│                                                              │
│  Features:                                                   │
│  ├── 🔄 FIFO: Process chunks in order, one at a time         │
│  ├── 💾 Checkpoint: Resume from last processed chunk         │
│  ├── 🚫 No accumulation: Immediate output, low memory        │
│  ├── 📄 Unlimited file size: Stream processing               │
│  └── ⚡ Tools: Pre-scan with Regex/SpaCy before LLM          │
├──────────────────────────────────────────────────────────────┤
│  Usage:                                                      │
│    chain = StreamingPHIChain(llm, config)                    │
│    for result in chain.process_file("large.txt"):            │
│        print(f"Chunk {result.chunk_id}: {len(result.phi)}")  │
└──────────────────────────────────────────────────────────────┘
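
The FIFO idea itself is easy to sketch in plain Python: read one chunk, emit it, record a checkpoint, and never hold more than one chunk in memory. This is an illustration of the pattern, not the StreamingPHIChain implementation:

import json, os

def stream_chunks(path, chunk_size=4000, checkpoint=".resume.json"):
    # Resume from the last recorded position if a checkpoint file exists.
    offset = 0
    if os.path.exists(checkpoint):
        with open(checkpoint) as ck:
            offset = json.load(ck)["offset"]
    with open(path, encoding="utf-8") as f:
        f.seek(offset)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk  # caller processes and emits immediately (no accumulation)
            with open(checkpoint, "w") as ck:
                json.dump({"offset": f.tell()}, ck)

for i, chunk in enumerate(stream_chunks("large.txt")):
    print(f"Chunk {i}: {len(chunk)} chars")  # replace the print with PHI processing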

System Architecture

┌────────────────────────────────────────────────────────────────┐
│               Medical De-identification Toolkit                │
├────────────────────────────────────────────────────────────────┤
│  Interface Layer                                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │     CLI     │  │   Python    │  │  REST API   │             │
│  │   (Typer)   │  │   Library   │  │  (Future)   │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
├────────────────────────────────────────────────────────────────┤
│  Application Layer                                             │
│  ┌────────────────────────────────────────────────────────┐    │
│  │  DeidentificationEngine  │  BatchPHIProcessor          │    │
│  │  PHI Detection Pipeline  │  Report Generator           │    │
│  └────────────────────────────────────────────────────────┘    │
├────────────────────────────────────────────────────────────────┤
│  Infrastructure Layer                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │   LLM    │  │   RAG    │  │  Loader  │  │  Output  │        │
│  │ Factory  │  │  Engine  │  │ (10 fmt) │  │ Manager  │        │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘        │
├────────────────────────────────────────────────────────────────┤
│  Domain Layer                                                  │
│  ┌────────────────────────────────────────────────────────┐    │
│  │  PHIEntity  │  PHIType  │  MaskingStrategy  │  Config  │    │
│  └────────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────┘

📖 See Architecture Guide for detailed design.


📚 Documentation

| Document | Description |
| --- | --- |
| 📖 Architecture Guide | System design & DDD structure |
| 🚀 Deployment Guide | Installation & configuration |
| 🔧 Ollama Setup | Local LLM setup guide |
| 📊 Batch Processing | Excel/CSV batch processing |
| 🔍 RAG Usage Guide | Regulation retrieval system |

📊 Performance

Processing Speed (per document, ~1500 chars)

| LLM Provider | Model | Time | Hardware |
| --- | --- | --- | --- |
| MiniMind | minimind2 | ~2-5s | CPU only |
| Ollama | qwen2.5:7b | ~15-25s | RTX 3090 |
| OpenAI | gpt-4o-mini | ~3-5s | API |
| Anthropic | claude-3-haiku | ~2-4s | API |

Accuracy Benchmarks

| PHI Type | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| Name | 96% | 94% | 95% |
| Date | 98% | 97% | 97.5% |
| ID Number | 99% | 98% | 98.5% |
| Location | 92% | 90% | 91% |
| Age >89 | 100% | 99% | 99.5% |
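
F1 here is the usual harmonic mean of precision and recall, F1 = 2·P·R / (P + R); for Name, 2 × 0.96 × 0.94 / (0.96 + 0.94) ≈ 0.95, which matches the 95% in the table.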

๐Ÿค Contributing | ่ฒข็ป

We welcome contributions!

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


โš ๏ธ Privacy Notice | ้šฑ็ง่ฒๆ˜Ž

  • Never commit real PHI to this repository
  • Medical data is processed in-memory only (not persisted)
  • Designed for HIPAA and GDPR compliance
  • Users are responsible for proper usage in their context

๐Ÿ™ Acknowledgments | ่‡ด่ฌ


Built with ❤️ for Healthcare Privacy
