End-to-end workflow for generating synthetic PII-containing healthcare text, fine-tuning a small language model (SLM) with LoRA adapters for PII removal / redaction, and preparing an optimized GGUF artifact for edge / resource-constrained deployment.
This repository demonstrates a three-stage pipeline:
- Synthetic Data Generation (notebooks/step_1_synthetic_data_generation.ipynb): Creates structured healthcare-style text with embedded PII (names, dates, MRNs, etc.).
- Model Fine-Tuning (notebooks/step_2_slm_finetuning.ipynb): Applies LoRA to adapt a small base model for PII detection / transformation.
- Edge Deployment (notebooks/step_3_slm_edge_deployment.ipynb): Exports / converts the model to GGUF and provides patterns for lightweight inference.
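To make the first stage concrete, here is a minimal sketch of how a synthetic note with embedded PII and its redacted target could be produced. It is not the notebook's actual generator: the value pools, the sentence template, the placeholder tags ([NAME], [MRN], [DATE]), and the function name make_example are all assumptions chosen for illustration.

```python
import random

# Hypothetical value pools; the notebook's real generators may differ.
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dara"]
LAST_NAMES = ["Okafor", "Lindqvist", "Moreau", "Tanaka"]

# Hypothetical sentence template with one slot per PII type.
TEMPLATE = "Patient {name} ({mrn}) was admitted on {date} with chest pain."


def make_example(rng):
    """Build one synthetic note, its redacted target, and the PII values used."""
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    mrn = f"MRN-{rng.randint(0, 999999):06d}"
    date = f"2024-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
    return {
        "text": TEMPLATE.format(name=name, mrn=mrn, date=date),
        "redacted": TEMPLATE.format(name="[NAME]", mrn="[MRN]", date="[DATE]"),
        "entities": {"NAME": name, "MRN": mrn, "DATE": date},
    }


# Seeding the generator keeps the synthetic dataset reproducible.
example = make_example(random.Random(0))
print(example["text"])
print(example["redacted"])
```

Filling the same template twice, once with values and once with placeholder tags, gives an aligned input/target pair of the kind a redaction model can be fine-tuned on.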
data/
  synthetic_data/         # Train/val/test synthetic JSON/JSONL artifacts
  lora_finetuned_model/   # LoRA adapter + tokenizer assets (LFS tracked)
  gguf_model/             # Exported / quantized GGUF model & tokenizer (LFS tracked)
notebooks/                # Three sequential workflow notebooks
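The JSONL artifacts under data/synthetic_data/ can be read with the standard library alone. The sketch below assumes one JSON object per line; the file name train.jsonl and the helper name load_jsonl are hypothetical, so adjust them to whatever the directory actually contains.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical file name; check data/synthetic_data/ for the real ones.
train_path = Path("data/synthetic_data/train.jsonl")
if train_path.exists():
    print(len(load_jsonl(train_path)), "training records")
```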
Clone the repository and ensure Git LFS is installed so large model artifacts pull correctly:
git clone https://github.com/superlinear-ai/pii-removal-edge-deployment.git
cd pii-removal-edge-deployment
git lfs install
git lfs pull
- Install Git LFS locally if cloning first.
- Clone (or download) the repo locally (optional if you upload files manually):
git clone https://github.com/superlinear-ai/pii-removal-edge-deployment.git
cd pii-removal-edge-deployment
git lfs install
git lfs pull
- Open Google Colab.
- Upload the desired notebook from notebooks/ (File -> Upload notebook).
- If the notebook requires the model artifacts, either:
  - Upload the needed subfolders from data/ manually, or
  - Add a cell to git clone the repo (Colab runtime) and run git lfs install && git lfs pull.
Large model & adapter artifacts (*.bin, *.pt, *.safetensors, *.gguf) and dataset JSONL files are tracked via patterns in .gitattributes.
Key commands:
git lfs install # One-time per machine
git lfs track "*.gguf" # Example: track new pattern
git add .gitattributes
git add path/to/large_file.gguf
git commit -m "Add new quantized model"
To verify LFS pointers:
git lfs ls-files
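A complementary check is whether the files in your working tree hold real content or are still LFS pointer stubs, which happens when git lfs pull was skipped. Pointer files are small text files whose first line is the Git LFS spec version line, so a prefix comparison is enough for a rough check. The helper names and the suffix list below are assumptions for illustration.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the LFS pointer header."""
    with Path(path).open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def unresolved_pointers(root, suffixes=(".gguf", ".safetensors", ".bin", ".pt")):
    """List artifact files under root that are still pointer stubs."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes and is_lfs_pointer(p)
    )


if Path("data").is_dir():
    for stub in unresolved_pointers("data"):
        print("still a pointer, run git lfs pull:", stub)
```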
- Place raw large files under an appropriate subfolder in data/.
- Ensure the pattern is in .gitattributes (edit if necessary).
- Commit via LFS (see above).
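Before committing, the second step above can be sanity-checked with a short script. This is a simplified sketch: it only handles basename globs such as *.gguf and does not implement full gitattributes matching rules, so treat git check-attr filter -- path as the authoritative answer. The function names are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import Path, PurePosixPath


def lfs_patterns(gitattributes_text):
    """Collect patterns whose attribute list contains filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(path, patterns):
    """Simplified check: match only the file's basename against each glob."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


attributes = Path(".gitattributes")
if attributes.exists():
    patterns = lfs_patterns(attributes.read_text(encoding="utf-8"))
    # Hypothetical file path, used only to illustrate the call.
    print(is_lfs_tracked("data/gguf_model/model.gguf", patterns))
```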
PRs welcome for improvements, tooling, reproducibility, and inference examples.
See the LICENSE file.