Transaction Data Cleaning Pipeline

Overview

This project contains a reusable Python function for cleaning and standardizing transaction data using pandas.
It was built to mirror the logic used in an exploratory Jupyter Notebook as closely as possible, while making the process reproducible, readable, and suitable for production use.

The function handles:

Data type normalization
Missing and invalid values
Consistent formatting across categorical columns
Median-based imputation for numeric data
Correct handling of dependent columns such as Total Spent

The goal is data consistency and analytical reliability, not artificial perfection at the cent level.

Key Design Principles

Type conversion comes first
All numeric and date columns are converted to their correct data types before any cleaning or replacement happens.
This ensures that invalid values are properly detected as NaN or NaT.
Cleaning before replacement
String columns are stripped, normalized, and standardized before replacing missing or invalid values.
Median-based imputation for numeric columns
Missing values in numeric columns are replaced using the median, not the mean, to reduce sensitivity to outliers.
This matches the logic used in the original notebook.
Dependent column recalculation
- Total Spent is recalculated as: Quantity × Price Per Unit where both values are available.
- Remaining missing values are then filled using the median of Total Spent.
Dates are never fabricated
- Invalid or missing Transaction Date values are converted to NaT.
- They are not replaced or imputed, preserving data integrity.
Rounding reflects real-world behavior
- Minor differences (typically $1–$7) compared to notebook outputs are expected.
- These arise from floating-point arithmetic and rounding order.
- The differences are statistically insignificant and acceptable for analysis.

Columns Handled

Numeric Columns

Quantity
Price Per Unit
Total Spent

Missing values are replaced using the median after cleaning and recalculation.

Categorical Columns

Item
Payment Method
Location

Invalid values such as: NaN, None, "", Unknown, Error are replaced with standardized labels.

Date Column

Transaction Date

Converted using pd.to_datetime(errors="coerce").
Invalid or missing dates remain as NaT.

Why Small Differences Exist vs Notebook Output

You may notice small differences (usually $1–$7) when comparing totals to the original notebook results.

These differences are caused by:

Floating-point precision
When rounding is applied
Median values being calculated on slightly different intermediate results

These differences:

Do not affect trends
Do not affect aggregates in a meaningful way
Are well within acceptable analytical tolerance

For real-world data analysis, this behavior is normal and expected.

Usage Example

from clean_transactions import clean_transactions
import pandas as pd

df = pd.read_csv("transactions.csv")
df_clean = clean_transactions(df)

When This Is Suitable

This pipeline is appropriate for:

Exploratory data analysis

Dashboards and reporting

Financial summaries

Freelance or client data cleaning work

Reproducible preprocessing before modeling

If absolute cent-level reconciliation is required (e.g. accounting or audits), additional domain-specific rules should be applied.

Final Notes

This project prioritizes:

Correct logic

Transparency

Reproducibility

Practical data science standards

Perfection at the cent level is not always realistic or necessary.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
__pycache__		__pycache__
input		input
output		output
README.md		README.md
clean_transactions.py		clean_transactions.py
requirements.txt		requirements.txt
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Transaction Data Cleaning Pipeline

Overview

Key Design Principles

Columns Handled

Why Small Differences Exist vs Notebook Output

Usage Example

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Transaction Data Cleaning Pipeline

Overview

Key Design Principles

Columns Handled

Why Small Differences Exist vs Notebook Output

Usage Example

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages