MIMIC-IV (v2.2) Preprocessing Pipeline

Adapted from the official MIMIC-IV data pipeline:
Healthylaife — https://github.com/healthylaife/MIMIC-IV-Data-Pipeline

Based on:
An Extensive Data Processing Pipeline for MIMIC-IV
Gupta et al., Proceedings of Machine Learning Research, 2022.

Overview

This pipeline preprocesses MIMIC-IV v2.2 Electronic Health Record (EHR) data into structured static and time-series representations suitable for machine learning tasks such as:

Mortality prediction
Length of Stay
Readmission
Phenotyping

It performs:

Clinical grouping
Feature summarization and selection
Outlier removal
Time-series representation
Data verification and formatting

Dataset Access

Before downloading the data, request access to the MIMIC-IV dataset through the PhysioNet portal.

Access must be approved via the official data use agreement on PhysioNet.

Once access is granted, download the required MIMIC-IV version (e.g., v2.2) and place the raw files in the directory specified by RawDataPath in the configuration file.

PhysioNet portal:
https://physionet.org/content/mimiciv/2.2/

Usage

Run with Default Configuration

python generate_dataset.py

Uses base_config defined in the code.

Run with a Custom Configuration File

python generate_dataset.py ../folder/config.json

Configuration Parameters

Dataset Version and Paths

Parameter	Description
`Version`	Dataset version (e.g. `"2.2"`)
`RawDataPath`	Path to raw MIMIC-IV data

Prediction Task Settings

`Task`

Defines the downstream task:

"mortality"
"LengthOfStay"
"readmission"
"phenotype"

Task-Specific Parameters

Mortality

Parameter	Description
`Mortality_prediction_horizon`	Prediction horizon (hours)

Example:
Predict mortality within the next 24 hours using 24.

Length of Stay

Parameter	Description
`LengthOfStay_greater_or_equal_threshold`	Threshold in days

Example:
3 → classify whether LOS ≥ 3 days.

Readmission

Parameter	Description
`Readmission_number_of_days_threshold`	Readmission window (days)

Example:
30 → predict readmission within 30 days.

Phenotype Prediction

Parameter	Description
`Phenotype`	Target phenotype (`HF`, `COPD`, `CKD`, `CAD`)
`Phenotype_prediction_horizon`	Prediction window (hours)

Example:
Predict phenotype within first 24 hours of ICU stay.

Data Inclusion Flags

Parameter	Description
`Include_ICU`	Include ICU stays
`Include_Diagnosis`	Include diagnosis codes
`Include_Procedures`	Include procedures
`Include_Medications`	Include medications
`Include_chart_event`	Include labs and vitals
`Include_output_event`	Include ICU output events

When ICU is enabled:

Chart events include labs + vitals
Output events can be included

When ICU is disabled:

Vitals are excluded
Output events are not available (thus the flag has no impact)

Disease Filtering

Parameter	Description
`Disease_Filter`	Filter cohort by disease (`None`, `HF`, `COPD`, `CKD`, `CAD`)

Outlier Handling

Parameter	Description
`Outliers_management`	`remove`, `impute_mean`, `impute_median`, `impute_mode`, `impute_random`, `none`
`Outliers_threshold_right`	Right percentile threshold (e.g. 98.0)
`Outliers_threshold_left`	Left percentile threshold (e.g. 0.0)

Outliers are detected per feature distribution after unit standardization.

Time-Series Configuration

Parameter	Description
`Time_window_size`	Observation window (hours)
`Time_window_bucket_size`	Time resolution (hours)

Example:

24 hours window
1 hour bins

Missing Value Handling

Parameter	Description
`Missing_values_management`	`mean`, `median`, `mode`, `random`, `none`

Imputation is performed only within the observation window to prevent data leakage.

If none:

Missing values are encoded as 0.

Oversampling

Parameter	Description
`Oversampling`	`True`, `False`

When enabled, the pipeline performs oversampling of the minority class to reduce class imbalance in the training data.

Oversampling is applied after feature extraction

Concatenate

Parameter	Description
`Concatenate`	`True`, `False`

When enabled, temporal features from multiple sources are merged along the feature dimension:

Warning

Concatenation can significantly increase the feature dimensionality and memory usage.

Output Format

Parameter	Description
`Output_format`	`csv`, `npy` , `pkl`

Preprocessing Steps

1. Clinical Grouping

Purpose: Reduce dimensionality while preserving clinical meaning.

Diagnosis Grouping

ICD-9 codes are converted to ICD-10.
Mapping follows Gupta et al.
First three digits of ICD-10 used as grouping root.

Result: Reduced and standardized diagnosis feature space.

Medication Grouping

NDC codes converted to 11-digit format.
Mapped to FDA NDC database.
Grouped using:
- Labeler code (digits 1–5)
- Product code (digits 6–9)
Drug names converted to non-proprietary names.

Result: Consistent medication representation.

2. Feature Summarizing and Selection

After grouping, the pipeline generates summary CSV files containing:

Mean frequency per admission
Percentage of missing values

3. Outlier Removal

The pipeline:

Detects statistical outliers per feature
Uses user-defined Z-score or percentile thresholds
Either removes or imputes values

Before detection:

Units are standardized per feature

Example:

If threshold = 98
Values > 98th percentile are removed or replaced

4. Time-Series Representation

Observation Window

User defines:

Reference point (admission/discharge)
Duration (e.g. first 24 hours)

Only data inside this window is used.

Time Binning

Data is discretized into fixed intervals:

Size defined by Time_window_bucket_size
Example: 1-hour bins

Feature Encoding

Labs & Vitals

Forward-filled by default
Aggregated if high frequency
Missing values imputed (mean/median) if enabled
Imputation uses only values inside the observation window (prevents leakage)

Medications

Represented by dosage between start and stop time
0 when not administered
No imputation performed

Procedures

1 at time of procedure
0 otherwise

Diagnoses

Static feature
Do not vary over time

5. Data Verification

The pipeline automatically:

Removes admissions with invalid timestamps
Removes medications with invalid start/stop times
Adjusts medication times to:
- Shift start to admission if given before admission
- Shift stop to discharge if beyond discharge

Output Format

For each admission:

Three CSV files are generated:

Static features
Dynamic time-series features
Demographic features

Then a pair of X , y file are generated.

Data Flow Summary

Raw MIMIC-IV v2.2
→ Clinical grouping
→ Feature summary
→ Feature filtering
→ Outlier handling
→ Time window selection
→ Time binning
→ Missing value handling
→ Verification checks
→ CSV + dictionary output

Intended Use

This pipeline produces machine-learning-ready structured EHR representations for:

Deep learning models
Time-series models
Traditional ML classifiers
Survival analysis

It prevents:

Data leakage
Inconsistent time references
Invalid clinical timestamps
Unit mismatch errors

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MIMIC-IV (v2.2) Preprocessing Pipeline

Overview

Dataset Access

Usage

Run with Default Configuration

Run with a Custom Configuration File

Configuration Parameters

Dataset Version and Paths

Prediction Task Settings

`Task`

Task-Specific Parameters

Mortality

Length of Stay

Readmission

Phenotype Prediction

Data Inclusion Flags

Disease Filtering

Outlier Handling

Time-Series Configuration

Missing Value Handling

Oversampling

Concatenate

Output Format

Preprocessing Steps

1. Clinical Grouping

Diagnosis Grouping

Medication Grouping

2. Feature Summarizing and Selection

3. Outlier Removal

4. Time-Series Representation

Observation Window

Time Binning

Feature Encoding

5. Data Verification

Output Format

Data Flow Summary

Intended Use

FilesExpand file tree

mimiciv.md

Latest commit

History

mimiciv.md

File metadata and controls

MIMIC-IV (v2.2) Preprocessing Pipeline

Overview

Dataset Access

Usage

Run with Default Configuration

Run with a Custom Configuration File

Configuration Parameters

Dataset Version and Paths

Prediction Task Settings

Task

Task-Specific Parameters

Mortality

Length of Stay

Readmission

Phenotype Prediction

Data Inclusion Flags

Disease Filtering

Outlier Handling

Time-Series Configuration

Missing Value Handling

Oversampling

Concatenate

Output Format

Preprocessing Steps

1. Clinical Grouping

Diagnosis Grouping

Medication Grouping

2. Feature Summarizing and Selection

3. Outlier Removal

4. Time-Series Representation

Observation Window

Time Binning

Feature Encoding

5. Data Verification

Output Format

Data Flow Summary

Intended Use

`Task`