This project contains a reusable Python function for cleaning and standardizing transaction data using pandas.
It was built to mirror the logic used in an exploratory Jupyter Notebook as closely as possible, while making the process reproducible, readable, and suitable for production use.
The function handles:
- Data type normalization
- Missing and invalid values
- Consistent formatting across categorical columns
- Median-based imputation for numeric data
- Correct handling of dependent columns such as Total Spent
The goal is data consistency and analytical reliability, not artificial perfection at the cent level.
-
Type conversion comes first
All numeric and date columns are converted to their correct data types before any cleaning or replacement happens.
This ensures that invalid values are properly detected as NaN or NaT. -
Cleaning before replacement
String columns are stripped, normalized, and standardized before replacing missing or invalid values. -
Median-based imputation for numeric columns
Missing values in numeric columns are replaced using the median, not the mean, to reduce sensitivity to outliers.
This matches the logic used in the original notebook. -
Dependent column recalculation
- Total Spent is recalculated as: Quantity × Price Per Unit where both values are available.
- Remaining missing values are then filled using the median of Total Spent.
-
Dates are never fabricated
- Invalid or missing Transaction Date values are converted to NaT.
- They are not replaced or imputed, preserving data integrity.
-
Rounding reflects real-world behavior
- Minor differences (typically $1–$7) compared to notebook outputs are expected.
- These arise from floating-point arithmetic and rounding order.
- The differences are statistically insignificant and acceptable for analysis.
Numeric Columns
- Quantity
- Price Per Unit
- Total Spent
Missing values are replaced using the median after cleaning and recalculation.
Categorical Columns
- Item
- Payment Method
- Location
Invalid values such as: NaN, None, "", Unknown, Error are replaced with standardized labels.
Date Column
- Transaction Date
Converted using pd.to_datetime(errors="coerce").
Invalid or missing dates remain as NaT.
You may notice small differences (usually $1–$7) when comparing totals to the original notebook results.
These differences are caused by:
- Floating-point precision
- When rounding is applied
- Median values being calculated on slightly different intermediate results
These differences:
- Do not affect trends
- Do not affect aggregates in a meaningful way
- Are well within acceptable analytical tolerance
For real-world data analysis, this behavior is normal and expected.
from clean_transactions import clean_transactions
import pandas as pd
df = pd.read_csv("transactions.csv")
df_clean = clean_transactions(df)
When This Is Suitable
This pipeline is appropriate for:
Exploratory data analysis
Dashboards and reporting
Financial summaries
Freelance or client data cleaning work
Reproducible preprocessing before modeling
If absolute cent-level reconciliation is required (e.g. accounting or audits), additional domain-specific rules should be applied.
Final Notes
This project prioritizes:
Correct logic
Transparency
Reproducibility
Practical data science standards
Perfection at the cent level is not always realistic or necessary.