# CleanEasy

CleanEasy is a powerful, user-friendly Python library designed to simplify data cleaning and preprocessing for data scientists and analysts. Built on top of pandas, numpy, scikit-learn, and nltk, it provides a chainable API to handle common tasks like missing value imputation, outlier detection, text processing, date manipulation, and categorical encoding. With detailed logging and formatted output, CleanEasy makes data preparation intuitive, transparent, and visually appealing.
- Introduction
- Features
- Installation
- Usage
- Project Structure
- Testing
- Contributing
- License
- Contact and Support
- FAQ
- Roadmap
## Introduction

CleanEasy streamlines the data cleaning process by offering a unified interface for a wide range of preprocessing tasks. Whether you're working with DataFrames, NumPy arrays, lists, dictionaries, or CSV files, CleanEasy handles data conversion, cleaning, and validation with ease. Its method-chaining API allows you to build complex cleaning pipelines in a readable, maintainable way, while detailed logs and formatted outputs (using `tabulate` and `colorama`) ensure clarity and usability.
Key highlights:
- Supports multiple data input formats.
- Extensive methods for imputation, outlier removal, text processing, and encoding.
- Built-in validation tools for skewness, normality, and correlations.
- Auto-cleaning pipeline for quick preprocessing.
- Pretty-printed output for easy interpretation.
## Features

CleanEasy offers a rich set of tools for data preprocessing:

### Data Input

- Accepts `pandas.DataFrame`, `numpy.ndarray`, lists, dictionaries, or CSV file paths.
- Automatically converts inputs to a `pandas.DataFrame` using `convert_to_dataframe`.
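A converter of this sort can be sketched in a few lines. The helper name `convert_to_dataframe` comes from this README, but the branching logic below is an assumption about its behavior, not CleanEasy's actual code:

```python
import numpy as np
import pandas as pd

def to_dataframe(data):
    """Normalize supported inputs to a DataFrame (sketch only --
    CleanEasy's convert_to_dataframe may behave differently)."""
    if isinstance(data, pd.DataFrame):
        return data.copy()
    if isinstance(data, (np.ndarray, list, dict)):
        return pd.DataFrame(data)
    if isinstance(data, str):  # treat strings as CSV file paths
        return pd.read_csv(data)
    raise TypeError(f"Unsupported input type: {type(data).__name__}")

df = to_dataframe({"a": [1, 2], "b": [3, 4]})
```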
### Missing Value Imputation

- KNN Imputation: `impute_knn` for numeric columns using k-nearest neighbors.
- Statistical Imputation: `impute_mean`, `impute_median`, `impute_mode`.
- Time-Series Imputation: `impute_forward_fill`, `impute_backward_fill`, `impute_interpolate`.
- Constant Imputation: `impute_constant` with a user-specified value.
- Drop Missing: `drop_missing_rows` and `drop_missing_columns` based on thresholds.
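KNN imputation of the kind `impute_knn` provides can be done with scikit-learn directly. This is a sketch of the underlying technique, not CleanEasy's implementation; the `n_neighbors`/`weights` parameters mirror the usage example later in this README:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one gap per numeric column
df = pd.DataFrame({"age": [25.0, 30.0, None, 28.0],
                   "salary": [50000.0, None, 60000.0, 55000.0]})

# KNN imputation, roughly what impute_knn wraps internally
imputer = KNNImputer(n_neighbors=2, weights="distance")
df[["age", "salary"]] = imputer.fit_transform(df[["age", "salary"]])

# A statistical alternative in the spirit of impute_median
df["age"] = df["age"].fillna(df["age"].median())  # no-op here; gaps already filled
```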
### Outlier Handling

- Isolation Forest: `remove_outliers_isolation_forest` for robust outlier removal.
- IQR: `remove_outliers_iqr` and `cap_outliers_iqr` for interquartile-range-based handling.
- Z-Score: `remove_outliers_zscore` and `cap_outliers_zscore` for standard-deviation-based handling.
- DBSCAN: `remove_outliers_dbscan` for clustering-based outlier detection.
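The IQR-based pair presumably follows the conventional 1.5×IQR rule; here is a plain-pandas sketch of that rule (the 1.5 factor and the column name are assumptions, not confirmed details of `remove_outliers_iqr`):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 28, 27, 1000]})  # 1000 is an outlier

# Conventional IQR bounds: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Drop rows outside the bounds (in the spirit of remove_outliers_iqr)
removed = df[(df["age"] >= lo) & (df["age"] <= hi)]

# Or clip values to the bounds instead (in the spirit of cap_outliers_iqr)
capped = df.assign(age=df["age"].clip(lower=lo, upper=hi))
```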
### Text Processing

- Tokenization: `tokenize_text` using NLTK's word tokenizer.
- Lemmatization: `lemmatize_text` with the WordNet lemmatizer.
- Cleaning: `lowercase_text`, `remove_special_chars`, `trim_whitespace`, `remove_numbers`, `replace_text`.
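`tokenize_text` and `lemmatize_text` wrap NLTK (and need the resources listed under Installation). The simpler cleaning steps can be sketched with plain pandas string methods; the chain below illustrates the kind of operations those methods name, not their exact regexes:

```python
import pandas as pd

s = pd.Series(["  John@Doe ", "Jane Smith!", "Order 66"])

# Roughly what lowercase_text / remove_special_chars / remove_numbers /
# trim_whitespace might do, chained via the .str accessor
cleaned = (s.str.lower()
             .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation/symbols
             .str.replace(r"\d+", "", regex=True)      # drop digits
             .str.strip())                             # trim surrounding whitespace
```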
### Date Handling

- Parsing: `parse_dates` to convert strings to datetime.
- Feature Extraction: `extract_year`, `extract_month`, `extract_quarter`, `extract_day_of_week`.
- Formatting: `standardize_date_format` for consistent date strings.
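`parse_dates` presumably wraps `pd.to_datetime`, and the extractors map onto pandas' `.dt` accessor. A sketch under those assumptions (the `errors="coerce"` policy, which turns unparseable strings into `NaT`, is also an assumption):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2023-01-01", "2023-02-02", "invalid"]})

# Parse strings to datetime; "invalid" becomes NaT instead of raising
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Feature extraction in the spirit of extract_year / extract_quarter /
# extract_day_of_week
df["year"] = df["date"].dt.year
df["quarter"] = df["date"].dt.quarter
df["dayofweek"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
```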
### Categorical Encoding

- Frequency Encoding: `frequency_encode` for value counts.
- Label Encoding: `label_encode` for ordinal categories.
- One-Hot Encoding: `one_hot_encode` with a drop-first option.
- Rare Categories: `merge_rare_categories` to group infrequent categories.
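Frequency and one-hot encoding can be sketched with pandas alone; this shows the underlying operations, not CleanEasy's code (`normalize` and `drop_first` mirror the options named above):

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "A", "C"]})

# Frequency encoding: map each category to its relative frequency,
# as frequency_encode with normalize=True presumably does
freq = df["category"].map(df["category"].value_counts(normalize=True))

# One-hot encoding with the first category dropped, mirroring
# one_hot_encode's drop-first option
onehot = pd.get_dummies(df["category"], prefix="category", drop_first=True)
```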
### Data Validation

- Skewness: `check_skewness` for numeric columns.
- Normality: `check_normality` using the Shapiro-Wilk test.
- Missing Values: `check_missing_proportion` for column-wise missing ratios.
- Unique Values: `check_unique_values` for distinct counts.
- Correlations: `check_correlation` and `remove_highly_correlated` for numeric columns.
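These checks correspond to standard pandas/scipy calls; a sketch of the computations (the data and the exact statistics CleanEasy returns are illustrative):

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({"age": [25.0, 30.0, 28.0, 27.0, 90.0],
                   "salary": [50.0, 60.0, 56.0, 54.0, 180.0]})

# Skewness per column, as check_skewness presumably reports it
skew = df["age"].skew()  # > 0: right-tailed because of the 90

# Shapiro-Wilk normality test, which check_normality is documented to use
w_stat, p_value = stats.shapiro(df["age"])  # small p suggests non-normality

# Pearson correlation matrix, in the spirit of check_correlation;
# remove_highly_correlated would drop one column of a pair above a threshold
corr = df.corr(method="pearson")
```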
### Additional Transformations

- Duplicates: `drop_duplicates` and `identify_duplicates`.
- Scaling: `standardize_numeric` (z-score) and `normalize_numeric` (min-max).
- Binning: `bin_numeric` for discretizing numeric columns.
- Log Transformation: `log_transform` for handling skewed data.
- Auto-Cleaning: `auto_clean` for a customizable, one-step pipeline.
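The scaling, binning, and log transforms each reduce to a short pandas/numpy expression; this sketch shows the formulas those method names suggest (bin counts, labels, and the `log1p` choice are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [50000.0, 60000.0, 55000.0, 52000.0]})

# Z-score standardization, as standardize_numeric presumably computes it
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()

# Min-max normalization to [0, 1], in the spirit of normalize_numeric
mm = (df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min())

# Equal-width binning (bin_numeric) and a log transform (log_transform);
# log1p handles zeros gracefully, which is a common choice for skewed data
bins = pd.cut(df["salary"], bins=2, labels=["low", "high"])
logged = np.log1p(df["salary"])
```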
### Logging and Output

- Detailed logging of all operations with customizable log levels.
- Formatted console output with tables (`tabulate`) and colors (`colorama`).
- Results storage in `get_results()` for inspection.
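The log-level control can be sketched with Python's standard `logging` module. The format string below matches the sample log lines shown later in this README; the handler wiring is illustrative, not CleanEasy's internals:

```python
import logging

# Build a logger whose output format matches CleanEasy's sample log lines
logger = logging.getLogger("CleanEasy-demo")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # what log_level='INFO' presumably selects

logger.info("Imputed ['age', 'salary'] with KNN (n_neighbors=3, weights=distance)")
logger.debug("This message is suppressed at INFO level")
```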
## Installation

### Prerequisites

- Python: 3.8 or higher
- Operating System: Windows, macOS, or Linux
- Virtual Environment: recommended for dependency isolation
- Terminal: for running commands (e.g., Windows Terminal, VS Code, or bash)
1. **Clone or Download the Repository**

   ```bash
   git clone https://github.com/CyberMatic-AmAn/cleaneasy.git
   cd cleaneasy
   ```

2. **Create a Virtual Environment** (optional but recommended)

   ```bash
   python -m venv venv
   .\venv\Scripts\activate   # Windows
   source venv/bin/activate  # Linux/macOS
   ```

3. **Install Dependencies**

   Install required packages from `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

   Dependencies include:
   - `pandas>=1.5.0`
   - `numpy>=1.23.0`
   - `scipy>=1.9.0`
   - `scikit-learn>=1.1.0`
   - `nltk>=3.7`
   - `pytest>=7.0.0`
   - `tabulate>=0.8.9`
   - `colorama>=0.4.4`

4. **Download NLTK Data**

   Some methods (e.g., `tokenize_text`, `lemmatize_text`) require NLTK resources:

   ```python
   import nltk
   nltk.download('punkt')
   nltk.download('punkt_tab')
   nltk.download('wordnet')
   ```

5. **Install CleanEasy as a Package**

   Install the `cleaneasy` package locally to make it importable:

   ```bash
   pip install .
   ```
## Usage

The `main.py` script demonstrates a typical cleaning pipeline. It processes a sample dataset with missing values, outliers, text, dates, and categorical data, producing formatted output.

```python
import numpy as np
import pandas as pd
from tabulate import tabulate
from colorama import init, Fore, Style
from cleaneasy import CleanEasy

# Initialize colorama for colored output
init()


def format_dict(d, indent=0):
    """Pretty-print a dictionary with indentation for nested structures."""
    result = []
    for key, value in d.items():
        key_str = f"{Fore.CYAN}{key}{Style.RESET_ALL}"
        if isinstance(value, dict):
            result.append(f"{' ' * indent}{key_str}:")
            result.append(format_dict(value, indent + 1))
        elif isinstance(value, list) and key == 'name_tokens':
            value_str = ', '.join([str(item) for item in value])
            result.append(f"{' ' * indent}{key_str}: {value_str}")
        else:
            if isinstance(value, (np.floating, np.integer)):
                value = float(value) if isinstance(value, np.floating) else int(value)
            result.append(f"{' ' * indent}{key_str}: {value}")
    return '\n'.join(result)


# Sample data
data = {
    'name': ['John@Doe', 'Jane Smith!', None, 'Alice'],
    'age': [25, 30, 1000, None],
    'salary': [50000, None, 60000, 55000],
    'date': ['2023-01-01', '2023-02-02', 'invalid', '2023-03-03'],
    'category': ['A', 'B', 'A', 'C']
}
df = pd.DataFrame(data)

# Initialize CleanEasy
cleaner = CleanEasy(df, log_level='INFO')

# Apply cleaning steps
cleaner.parse_dates(columns=['date'])
cleaner = (cleaner
    .impute_knn(columns=['age', 'salary'], n_neighbors=3, weights='distance')
    .remove_outliers_isolation_forest(columns=['age'], contamination=0.2, random_state=42)
    .tokenize_text(columns=['name'], lowercase=True)
    .extract_day_of_week(columns=['date'], return_numeric=True)
    .frequency_encode(columns=['category'], normalize=True)
)

# Store skewness results
skewness_results = cleaner.check_skewness(columns=['age', 'salary'])

# Continue method chain
cleaned_df = (cleaner
    .remove_highly_correlated(threshold=0.8, method='pearson')
    .get_cleaned_data()
)

# Display results
print(f"\n{Fore.GREEN}=== Cleaned DataFrame ==={Style.RESET_ALL}")
cleaned_df_display = cleaned_df.copy()
cleaned_df_display['name_tokens'] = cleaned_df_display['name_tokens'].apply(lambda x: ', '.join(x))
print(tabulate(cleaned_df_display, headers='keys', tablefmt='psql', showindex=True, floatfmt='.2f'))

print(f"\n{Fore.GREEN}=== Cleaning Steps ==={Style.RESET_ALL}")
for i, step in enumerate(cleaner.get_cleaning_log(), 1):
    print(f"{i}. {step}")

print(f"\n{Fore.GREEN}=== Skewness Results ==={Style.RESET_ALL}")
skewness_formatted = {k: float(v) for k, v in skewness_results.items()}
for col, value in skewness_formatted.items():
    print(f"{Fore.CYAN}{col}{Style.RESET_ALL}: {value:.4f}")

print(f"\n{Fore.GREEN}=== All Results ==={Style.RESET_ALL}")
results = cleaner.get_results()
for key, value in results.items():
    if isinstance(value, dict):
        for subkey, subvalue in value.items():
            if isinstance(subvalue, (np.floating, np.integer)):
                results[key][subkey] = float(subvalue) if isinstance(subvalue, np.floating) else int(subvalue)
print(format_dict(results))
```

Running `python main.py` produces:
```
2025-07-05 12:35:10,417 - CleanEasy - INFO - Initialized CleanEasy with data type: DataFrame
2025-07-05 12:35:10,421 - CleanEasy - INFO - Parsed date to datetime
2025-07-05 12:35:10,425 - CleanEasy - INFO - Imputed ['age', 'salary'] with KNN (n_neighbors=3, weights=distance)
2025-07-05 12:35:10,540 - CleanEasy - INFO - Removed 1 outliers from ['age'] using Isolation Forest (contamination=0.2)
2025-07-05 12:35:10,610 - CleanEasy - INFO - Tokenized text in name (lowercase=True)
2025-07-05 12:35:10,637 - CleanEasy - INFO - Extracted day of week from date to date_dayofweek (numeric=True)
2025-07-05 12:35:10,641 - CleanEasy - INFO - Frequency encoded category to category_freq (normalize=True)
2025-07-05 12:35:10,641 - CleanEasy - INFO - Skewness for age: 1.7314
2025-07-05 12:35:10,642 - CleanEasy - INFO - Skewness for salary: 1.7314
2025-07-05 12:35:10,645 - CleanEasy - INFO - Dropped 1 highly correlated columns (method=pearson, threshold=0.8)

=== Cleaned DataFrame ===
+----+-------------+--------+------------+------------+----------------+------------------+-----------------+
|    | name        |    age | date       | category   | name_tokens    |   date_dayofweek |   category_freq |
|----+-------------+--------+------------+------------+----------------+------------------+-----------------|
|  0 | John@Doe    |  25.00 | 2023-01-01 | A          | john, @, doe   |                6 |            0.33 |
|  1 | Jane Smith! |  30.00 | 2023-02-02 | B          | jane, smith, ! |                3 |            0.33 |
|  3 | Alice       | 512.50 | 2023-03-03 | C          | alice          |                4 |            0.33 |
+----+-------------+--------+------------+------------+----------------+------------------+-----------------+

=== Cleaning Steps ===
1. Parsed date columns
2. Imputed missing values with KNN (weights=distance)
3. Removed outliers using Isolation Forest
4. Tokenized text columns
5. Extracted day of week from datetime columns
6. Applied frequency encoding
7. Checked skewness
8. Removed highly correlated columns (threshold=0.8)

=== Skewness Results ===
age: 1.7314
salary: 1.7314

=== All Results ===
knn_imputation:
 columns: ['age', 'salary']
 n_neighbors: 3
 weights: distance
isolation_forest:
 columns: ['age']
 outliers_removed: 1
name_tokens: [john, @, doe], [jane, smith, !], [alice]
category_freq:
 A: 0.3333333333333333
 B: 0.3333333333333333
 C: 0.3333333333333333
skewness:
 age: 1.7314295926231227
 salary: 1.7314295926231076
correlated_columns_dropped: ['salary']
```
Use `auto_clean` for a one-step pipeline:
```python
cleaner = CleanEasy(df, log_level='INFO')
cleaned_df = cleaner.auto_clean(
    impute_method='knn',
    outlier_method='isolation_forest',
    text_clean=True,
    date_parse=True,
    categorical_encode='frequency'
)
print(f"\n{Fore.GREEN}=== Auto-Cleaned DataFrame ==={Style.RESET_ALL}")
print(tabulate(cleaned_df, headers='keys', tablefmt='psql', showindex=True, floatfmt='.2f'))
```

## Project Structure

```
cleaneasy/
├── cleaneasy/
│   ├── __init__.py        # Package initialization and exports
│   ├── core.py            # Core CleanEasy class with cleaning methods
│   ├── utils.py           # Utility functions (e.g., convert_to_dataframe)
│   └── validators.py      # Validation functions (e.g., check_skewness)
├── tests/
│   ├── __init__.py        # Test package initialization
│   ├── test_core.py       # Tests for core.py
│   ├── test_utils.py      # Tests for utils.py
│   └── test_validators.py # Tests for validators.py
├── docs/
│   ├── conf.py            # Sphinx documentation configuration
│   └── index.rst          # Sphinx documentation index
├── main.py                # Example script demonstrating usage
├── pyproject.toml         # Project metadata and build configuration
├── requirements.txt       # Dependencies
├── README.md              # This file
└── LICENSE                # License file (MIT)
```
## Testing

CleanEasy includes a test suite using `pytest` to ensure reliability.
1. **Install pytest**

   ```bash
   pip install pytest
   ```

2. **Run Tests**

   ```bash
   cd cleaneasy
   pytest tests/
   ```

Tests cover:

- Initialization and data conversion (`test_core.py`, `test_utils.py`)
- Cleaning methods (e.g., `impute_knn`, `remove_outliers_isolation_forest`)
- Validation functions (e.g., `check_skewness`, `check_normality`)
## Contributing

We welcome contributions to CleanEasy! To contribute:
1. **Fork the Repository**

   ```bash
   git clone https://github.com/CyberMatic-AmAn/cleaneasy.git
   cd cleaneasy
   ```

2. **Create a Branch**

   ```bash
   git checkout -b feature/your-feature-name
   ```

3. **Make Changes**
   - Add new features or fix bugs in `cleaneasy/`.
   - Update tests in `tests/`.
   - Document changes in `docs/` if necessary.

4. **Run Tests**

   Ensure all tests pass: `pytest tests/`.

5. **Submit a Pull Request**
   - Push your branch: `git push origin feature/your-feature-name`.
   - Open a pull request on GitHub with a clear description of changes.

6. **Report Issues**
   - Use the GitHub Issues page to report bugs or suggest features.
   - Include detailed descriptions and reproduction steps.
## License

CleanEasy is licensed under the MIT License. See the `LICENSE` file for details.
## Contact and Support

- Email: exehyper999@gmail.com
- GitHub Issues: github.com/CyberMatic-AmAn/cleaneasy/issues
- Documentation: https://github.com/CyberMatic-AmAn/cleaneasy

For support, open an issue on GitHub or contact the maintainer directly.
## FAQ

**Why do text methods raise NLTK resource errors?**

Ensure NLTK data is downloaded:

```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
```

**Why is console output not colored?**

- Verify `colorama` is installed: `pip show colorama`.
- Ensure your terminal supports ANSI colors (e.g., Windows Terminal, VS Code).
- Check that `colorama.init()` is called in `main.py`.
**How do I add a custom cleaning method?**

- Add the method to `cleaneasy/core.py` in the `CleanEasy` class.
- Ensure it returns `self` for method chaining.
- Update tests in `tests/test_core.py`.
- Document the method in `docs/` and this `README.md`.
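The return-`self` contract can be illustrated with a stripped-down class. This is a sketch of the chaining pattern only; the class and method bodies are not CleanEasy's actual code:

```python
import pandas as pd

class MiniCleaner:
    """Toy illustration of the return-self method-chaining pattern."""

    def __init__(self, df):
        self.df = df.copy()

    def drop_duplicates(self):
        self.df = self.df.drop_duplicates()
        return self  # returning self is what makes chaining work

    def lowercase_text(self, columns):
        for col in columns:
            self.df[col] = self.df[col].str.lower()
        return self

    def get_cleaned_data(self):
        return self.df

df = pd.DataFrame({"name": ["Ann", "Ann", "Bob"]})
out = MiniCleaner(df).drop_duplicates().lowercase_text(["name"]).get_cleaned_data()
```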
**Can CleanEasy handle large datasets?**

Yes, but performance depends on the methods used (e.g., `impute_knn` and `remove_outliers_isolation_forest` can be computationally intensive). Test with a sample first.
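For very large CSVs, one workable pattern (plain pandas, not a CleanEasy feature) is to clean in chunks so memory stays bounded; note that per-chunk statistics like the median below are only approximations of their global values:

```python
import io
import pandas as pd

# Simulate a large CSV; in practice pass a file path to read_csv
csv = io.StringIO("age,salary\n25,50000\n30,\n28,60000\n")

# Read and clean the file two rows at a time
chunks = []
for chunk in pd.read_csv(csv, chunksize=2):
    # Cheap per-chunk cleaning; chunk-local median, not the global one
    chunk["salary"] = chunk["salary"].fillna(chunk["salary"].median())
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
```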