Skip to content

rajab-bett-analytics/Transaction-Data-Cleaning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Transaction Data Cleaning Pipeline

Overview

This project contains a reusable Python function for cleaning and standardizing transaction data using pandas.
It was built to mirror the logic used in an exploratory Jupyter Notebook as closely as possible, while making the process reproducible, readable, and suitable for production use.

The function handles:

  • Data type normalization
  • Missing and invalid values
  • Consistent formatting across categorical columns
  • Median-based imputation for numeric data
  • Correct handling of dependent columns such as Total Spent

The goal is data consistency and analytical reliability, not artificial perfection at the cent level.


Key Design Principles

  1. Type conversion comes first
    All numeric and date columns are converted to their correct data types before any cleaning or replacement happens.
    This ensures that invalid values are properly detected as NaN or NaT.

  2. Cleaning before replacement
    String columns are stripped, normalized, and standardized before replacing missing or invalid values.

  3. Median-based imputation for numeric columns
    Missing values in numeric columns are replaced using the median, not the mean, to reduce sensitivity to outliers.
    This matches the logic used in the original notebook.

  4. Dependent column recalculation

    • Total Spent is recalculated as: Quantity × Price Per Unit where both values are available.
    • Remaining missing values are then filled using the median of Total Spent.
  5. Dates are never fabricated

    • Invalid or missing Transaction Date values are converted to NaT.
    • They are not replaced or imputed, preserving data integrity.
  6. Rounding reflects real-world behavior

    • Minor differences (typically $1–$7) compared to notebook outputs are expected.
    • These arise from floating-point arithmetic and rounding order.
    • The differences are statistically insignificant and acceptable for analysis.

Columns Handled

Numeric Columns

  • Quantity
  • Price Per Unit
  • Total Spent

Missing values are replaced using the median after cleaning and recalculation.

Categorical Columns

  • Item
  • Payment Method
  • Location

Invalid values such as: NaN, None, "", Unknown, Error are replaced with standardized labels.

Date Column

  • Transaction Date

Converted using pd.to_datetime(errors="coerce").
Invalid or missing dates remain as NaT.


Why Small Differences Exist vs Notebook Output

You may notice small differences (usually $1–$7) when comparing totals to the original notebook results.

These differences are caused by:

  • Floating-point precision
  • When rounding is applied
  • Median values being calculated on slightly different intermediate results

These differences:

  • Do not affect trends
  • Do not affect aggregates in a meaningful way
  • Are well within acceptable analytical tolerance

For real-world data analysis, this behavior is normal and expected.


Usage Example

from clean_transactions import clean_transactions
import pandas as pd

df = pd.read_csv("transactions.csv")
df_clean = clean_transactions(df)

When This Is Suitable

This pipeline is appropriate for:

Exploratory data analysis

Dashboards and reporting

Financial summaries

Freelance or client data cleaning work

Reproducible preprocessing before modeling

If absolute cent-level reconciliation is required (e.g. accounting or audits), additional domain-specific rules should be applied.

Final Notes

This project prioritizes:

Correct logic

Transparency

Reproducibility

Practical data science standards

Perfection at the cent level is not always realistic or necessary.

About

A reusable Python pipeline for cleaning and standardizing transaction data using pandas. Handles type normalization, missing/invalid values, median imputation, dependent column recalculation, and consistent formatting for reliable analysis and reporting.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages