From e766bef9e91d7f77334f267a8b2a603aadf92318 Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Wed, 13 Nov 2024 18:49:48 -0600 Subject: [PATCH 1/9] Revise general intro in docs --- README.md | 6 +++--- docs/index.md | 4 ++-- docs/usage.md | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index ab81a98..424b16f 100644 --- a/README.md +++ b/README.md @@ -3,8 +3,8 @@ Banner image for Pandas Checks -## Introduction -**Pandas Checks** is a Python library for data science and data engineering. It adds non-invasive health checks for Pandas method chains. +## What is it? +**Pandas Checks** is a Python package for data science and data engineering. It adds non-invasive health checks for Pandas method chains. It can inspect and validate your data at various points in your Pandas pipelines, without modifying the underlying data. @@ -27,7 +27,7 @@ pip install pandas-checks import pandas_checks ``` -It works in Jupyter, IPython, and Python scripts run from the command line. +It works in Jupyter notebooks, IPython, and Python scripts run from the command line. ## Usage Pandas Checks adds `.check` methods to Pandas DataFrames and Series. diff --git a/docs/index.md b/docs/index.md index 0013bda..78cb6ab 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,9 +2,9 @@ Banner image for Pandas Checks -## Introduction +## What is it? -**Pandas Checks** is a Python library for data science and data engineering. It adds non-invasive health checks for Pandas method chains. +**Pandas Checks** is a Python package for data science and data engineering. It adds non-invasive health checks for [Pandas](https://github.com/pandas-dev/pandas/) method chains. ## What are method chains? Method chains are one of the [coolest features](https://tomaugspurger.net/posts/method-chaining/) of the Pandas library! They allow you to write more functional code with fewer intermediate variables and fewer side effects. 
If you're familiar with R, method chains are Python's version of [dplyr pipes](https://style.tidyverse.org/pipes.html). diff --git a/docs/usage.md b/docs/usage.md index a0e5de1..96c23d0 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -7,7 +7,7 @@ First make Pandas Check available in your environment. pip install pandas-checks ``` -Then import it in your code. It works in Jupyter, IPython, and Python scripts run from the command line. +Then import it in your code. It works in Jupyter notebooks, IPython, and Python scripts run from the command line. ```python import pandas_checks From 17c9a370a7ddd6475148338b1539d1d126b21f8c Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Wed, 13 Nov 2024 19:02:12 -0600 Subject: [PATCH 2/9] Add TOC --- README.md | 153 ++++++++++++++++++++++++++++++------------------------ 1 file changed, 85 insertions(+), 68 deletions(-) diff --git a/README.md b/README.md index 424b16f..4142fbd 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,21 @@ As Fleetwood Mac says, [you would never break the chain](https://www.youtube.com > 💡 Tip: > See the [full documentation](https://cparmet.github.io/pandas-checks/) for all the details on the what, why, and how of Pandas Checks. - +## Table of Contents + * [Installation](#installation) + * [Usage](#usage) + * [Methods available](#methods-available) + + [Describe data](#describe-data) + + [Export interim files](#export-interim-files) + + [Time your code](#time-your-code) + + [Turn on/off Pandas Checks](#turn-on-off-pandas-checks) + + [Validate data](#validate-data) + + [Visualize data](#visualize-data) + * [Customizing a check](#customizing-a-check) + * [Configuring Pandas Checks](#configuring-pandas-checks) + * [Giving feedback and contributing](#giving-feedback-and-contributing) + * [License](#license) + ## Installation ```bash @@ -85,73 +99,76 @@ The `.check` methods will display the following results: The `.check` methods didn't modify how the `iris` data is processed by your code. 
They just let you check the data as it flows down the pipeline. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`. -## Features -### Check methods +## Methods available Here's what's in the doctor's bag. -* **Describe** - - Standard Pandas methods: - - `.check.columns()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.columns) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.columns) - - `.check.dtypes()` for [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.dtypes) | `.check.dtype()` for [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.dtype) - - `.check.describe()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.describe) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.describe) - - `.check.head()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.head) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.head) - - `.check.info()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.info) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.info) - - `.check.memory_usage()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.memory_usage) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.memory_usage) - - `.check.nunique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nunique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nunique) - - `.check.shape()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.shape) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.shape) - - `.check.tail()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.tail) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.tail) - - `.check.unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.unique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.unique) - - `.check.value_counts()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.value_counts) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.value_counts) - - New methods in Pandas Checks: - - `.check.function()`: Apply an arbitrary lambda function to your data and see the result - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.function) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.function) - - `.check.ncols()`: Count columns - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ncols) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ncols) - - `.check.ndups()`: Count rows with duplicate values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ndups) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ndups) - - `.check.nnulls()`: Count rows with null values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nnulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nnulls) - - `.check.print()`: Print a string, a variable, or the current dataframe - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print) - -* **Export interim files** - - `.check.write()`: Export the current data, inferring file format from the name - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.write) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.write) - -* **Time your code** - - `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print_time_elapsed) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print_time_elapsed) - - 💡 Tip: You can also use this stopwatch outside a method chain, anywhere in your Python code: - - ```python - from pandas_checks import print_elapsed_time, start_timer - - start_time = start_timer() - ... - print_elapsed_time(start_time) - ``` +### Describe data +Standard Pandas methods: +- `.check.columns()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.columns) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.columns) +- `.check.dtypes()` for [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.dtypes) | `.check.dtype()` for [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.dtype) +- `.check.describe()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.describe) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.describe) +- `.check.head()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.head) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.head) +- `.check.info()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.info) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.info) +- `.check.memory_usage()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.memory_usage) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.memory_usage) +- `.check.nunique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nunique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nunique) +- `.check.shape()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.shape) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.shape) +- `.check.tail()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.tail) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.tail) +- `.check.unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.unique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.unique) +- `.check.value_counts()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.value_counts) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.value_counts) + +New methods in Pandas Checks: +- `.check.function()`: Apply an arbitrary lambda function to 
your data and see the result - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.function) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.function) +- `.check.ncols()`: Count columns - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ncols) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ncols) +- `.check.ndups()`: Count rows with duplicate values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ndups) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ndups) +- `.check.nnulls()`: Count rows with null values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nnulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nnulls) +- `.check.print()`: Print a string, a variable, or the current dataframe - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print) + +### Export interim files +- `.check.write()`: Export the current data, inferring file format from the name - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.write) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.write) + +### Time your code +- 
`.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print_time_elapsed) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print_time_elapsed) +- 💡 Tip: You can also use this stopwatch outside a method chain, anywhere in your Python code: + + ```python + from pandas_checks import print_elapsed_time, start_timer + + start_time = start_timer() + ... + print_elapsed_time(start_time) + ``` -* **Turn off Pandas Checks** - - `.check.disable_checks()`: Don't run checks, for production mode etc. By default, still runs assertions. - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.disable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.disable_checks) - - `.check.enable_checks()`: Run checks - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.enable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.enable_checks) - -* **Validate** - - *General* - - `.check.assert_data()`: Check that data passes an arbitrary condition - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_data) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_data) - - *Types* - - `.check.assert_datetime()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_datetime) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_datetime) - - `.check.assert_float()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_float) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_float) - - `.check.assert_int()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_int) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_int) - - `.check.assert_str()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_str) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_str) - - `.check.assert_timedelta()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_timedelta) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_timedelta) - - `.check.assert_type()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_type) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_type) - - *Values* - - `.check.assert_less_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_less_than) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_less_than) - - `.check.assert_greater_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_greater_than) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_greater_than) - - `.check.assert_negative()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_negative) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_negative) - - `.check.assert_no_nulls()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_no_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_no_nulls) - - `.check.assert_all_nulls()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_all_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_all_nulls) - - `.check.assert_positive()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_positive) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_positive) - - `.check.assert_unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_unique) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_unique) - -* **Visualize** - - `.check.hist()`: A histogram - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.hist) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.hist) - - `.check.plot()`: An arbitrary plot you can customize - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.plot) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.plot) - -### Customizing a check +### Turn on/off Pandas Checks +These can be used to disable subsequent Pandas Checks methods, either temporarily for a single method chain or permanently such as in a production environment. +- `.check.disable_checks()`: Don't run checks. By default, still runs assertions. 
- [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.disable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.disable_checks) +- `.check.enable_checks()`: Run checks again - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.enable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.enable_checks) + +### Validate data +General: +- `.check.assert_data()`: Check that data passes an arbitrary condition - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_data) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_data) + +Types: +- `.check.assert_datetime()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_datetime) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_datetime) +- `.check.assert_float()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_float) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_float) +- `.check.assert_int()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_int) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_int) +- `.check.assert_str()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_str) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_str) +- `.check.assert_timedelta()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_timedelta) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_timedelta) +- `.check.assert_type()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_type) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_type) + +Values: +- `.check.assert_less_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_less_than) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_less_than) +- `.check.assert_greater_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_greater_than) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_greater_than) +- `.check.assert_negative()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_negative) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_negative) +- `.check.assert_no_nulls()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_no_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_no_nulls) +- `.check.assert_all_nulls()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_all_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_all_nulls) +- `.check.assert_positive()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_positive) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_positive) +- `.check.assert_unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_unique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_unique) + +### Visualize data +- `.check.hist()`: A histogram - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.hist) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.hist) +- `.check.plot()`: An arbitrary plot you can customize - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.plot) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.plot) + +## Customizing a check You can use Pandas Checks methods like the regular Pandas methods. 
They accept the same arguments. For example, you can pass: * `.check.head(7)` @@ -175,8 +192,8 @@ Also, most Pandas Checks methods accept 3 additional arguments: Power user output -### Configuring Pandas Check -#### Global configuration +## Configuring Pandas Checks +### Global configuration You can change how Pandas Checks works everywhere. For example: ```python @@ -196,7 +213,7 @@ Run `pdc.describe_options()` to see the arguments you can pass to `.set_format() > > To turn off assertions too, add the argument `enable_asserts=False`, such as: `disable_checks(enable_asserts=False)`. -#### Local configuration +### Local configuration You can also adjust settings within a method chain by bookending the chain, like this: ```python From 0bc221524c959650a0c247aaed960010cde24846 Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Thu, 14 Nov 2024 07:20:34 -0600 Subject: [PATCH 3/9] Add .check.nrows() to README --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 4142fbd..5c453b6 100644 --- a/README.md +++ b/README.md @@ -110,6 +110,7 @@ Standard Pandas methods: - `.check.head()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.head) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.head) - `.check.info()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.info) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.info) - `.check.memory_usage()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.memory_usage) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.memory_usage) +- 
`.check.nrows()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nrows) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nrows) - `.check.nunique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nunique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nunique) - `.check.shape()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.shape) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.shape) - `.check.tail()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.tail) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.tail) From d7d88314fcfcb5a6f3c757b85d4f6c13f32ebf9f Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Thu, 14 Nov 2024 07:22:18 -0600 Subject: [PATCH 4/9] Wordsmith and update format --- README.md | 42 +++++++------- docs/usage.md | 156 +++++++++++++++++++++++++------------------------- 2 files changed, 101 insertions(+), 97 deletions(-) diff --git a/README.md b/README.md index 5c453b6..9c9b830 100644 --- a/README.md +++ b/README.md @@ -61,9 +61,9 @@ def clean_iris_data(iris: pd.DataFrame) -> pd.DataFrame: return ( iris - .dropna() # Drop rows with any null values - .rename(columns={"FLOWER_SPECIES": "species"}) # Rename a column - .query("species=='setosa'") # Filter to rows with a certain value + .dropna() + .rename(columns={"FLOWER_SPECIES": "species"}) + .query("species=='setosa'") ) ``` @@ -72,7 +72,6 @@ But what if you want to make the 
chain more robust? Or see what's happening to t You can add some `.check` steps. ```python - ( iris .dropna() @@ -92,9 +91,9 @@ You can add some `.check` steps. ``` The `.check` methods will display the following results: - +

Sample output - +

The `.check` methods didn't modify how the `iris` data is processed by your code. They just let you check the data as it flows down the pipeline. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`. @@ -105,12 +104,12 @@ Here's what's in the doctor's bag. ### Describe data Standard Pandas methods: - `.check.columns()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.columns) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.columns) -- `.check.dtypes()` for [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.dtypes) | `.check.dtype()` for [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.dtype) +- `.check.dtype()` - [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.dtype) +- `.check.dtypes()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.dtypes) - `.check.describe()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.describe) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.describe) - `.check.head()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.head) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.head) - `.check.info()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.info) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.info) - `.check.memory_usage()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.memory_usage) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.memory_usage) -- `.check.nrows()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nrows) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nrows) - `.check.nunique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nunique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nunique) - `.check.shape()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.shape) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.shape) - `.check.tail()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.tail) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.tail) @@ -122,6 +121,7 @@ New methods in Pandas Checks: - `.check.ncols()`: Count columns - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ncols) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ncols) - `.check.ndups()`: Count rows with duplicate values - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ndups) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ndups) - `.check.nnulls()`: Count rows with null values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nnulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nnulls) +- `.check.nrows()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nrows) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nrows) - `.check.print()`: Print a string, a variable, or the current dataframe - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print) ### Export interim files @@ -129,23 +129,24 @@ New methods in Pandas Checks: ### Time your code - `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print_time_elapsed) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print_time_elapsed) -- 💡 Tip: You can also use this stopwatch outside a method chain, anywhere in your Python code: +

+> 💡 Tip: You can also use this stopwatch outside a method chain, anywhere in your Python code: - ```python - from pandas_checks import print_elapsed_time, start_timer +```python +from pandas_checks import print_elapsed_time, start_timer - start_time = start_timer() - ... - print_elapsed_time(start_time) - ``` +start_time = start_timer() +... +print_elapsed_time(start_time) +``` -### Turn on/off Pandas Checks -These can be used to disable subsequent Pandas Checks methods, either temporarily for a single method chain or permanently such as in a production environment. +### Turn Pandas Checks on or off +These methods can be used to disable subsequent Pandas Checks methods, either temporarily for a single method chain or permanently such as in a production environment. - `.check.disable_checks()`: Don't run checks. By default, still runs assertions. - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.disable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.disable_checks) -- `.check.enable_checks()`: Run checks again - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.enable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.enable_checks) +- `.check.enable_checks()`: Run checks again. 
- [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.enable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.enable_checks) ### Validate data -General: +Custom: - `.check.assert_data()`: Check that data passes an arbitrary condition - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_data) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_data) Types: @@ -190,6 +191,7 @@ Also, most Pandas Checks methods accept 3 additional arguments: .check.describe(subset=['sepal_width', 'sepal_length']) # Only apply the check to certain columns ) ``` +

Power user output @@ -237,7 +239,7 @@ You can also adjust settings within a method chain by bookending the chain, like > 💡 Tip: **Hybrid EDA-Prod data processing** > -> Exploratory data analysis (EDA) is traditionally thought of as the first step of data projects. But often when we're in production, we wish we could reuse parts of the EDA. Maybe we're debugging, editing prod code, or need to change the input data. Unfortunately, the EDA code is often too stale to fire up again. The prod pipeline has changed too much. +> Exploratory data analysis (EDA) is traditionally thought of as the first step of data projects. But often when we're in production, we wish we could reuse parts of the EDA. Maybe we're debugging, editing prod code, or need to change the input data. Unfortunately, the original EDA code is often too stale to fire up again. The prod pipeline has changed too much. > > If you used Pandas Checks during EDA, you can keep your `.check` methods in your first prod code. In production, you can disable Pandas Checks, but enable it when you need it. This can make your prod pipline more transparent and easier to inspect. diff --git a/docs/usage.md b/docs/usage.md index 96c23d0..571ca8c 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -24,7 +24,6 @@ Pandas Checks adds `.check` methods to Pandas DataFrames and Series. Say you have a nice function. ```python - def clean_iris_data(iris: pd.DataFrame) -> pd.DataFrame: """Preprocess data about pretty flowers. @@ -37,9 +36,9 @@ def clean_iris_data(iris: pd.DataFrame) -> pd.DataFrame: return ( iris - .dropna() # Drop rows with any null values - .rename(columns={"FLOWER_SPECIES": "species"}) # Rename a column - .query("species=='setosa'") # Filter to rows with a certain value + .dropna() + .rename(columns={"FLOWER_SPECIES": "species"}) + .query("species=='setosa'") ) ``` @@ -48,7 +47,6 @@ But what if you want to make the chain more robust? Or see what's happening to t You can add some `.check` steps. 
```python - ( iris .dropna() @@ -68,80 +66,84 @@ You can add some `.check` steps. ``` The `.check` methods will display the following results: +

+Sample output +The `.check` methods didn't modify how the `iris` data is processed by your code. They just let you check the data as it flows down the pipeline. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`. -Sample output +## Methods available +Here's what's in the doctor's bag. +### Describe data +Standard Pandas methods: +- `.check.columns()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.columns) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.columns) +- `.check.dtype()` - [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.dtype) +- `.check.dtypes()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.dtypes) +- `.check.describe()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.describe) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.describe) +- `.check.head()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.head) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.head) +- `.check.info()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.info) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.info) +- `.check.memory_usage()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.memory_usage) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.memory_usage) +- `.check.nunique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nunique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nunique) +- `.check.shape()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.shape) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.shape) +- `.check.tail()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.tail) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.tail) +- `.check.unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.unique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.unique) +- `.check.value_counts()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.value_counts) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.value_counts) + +New methods in Pandas Checks: +- `.check.function()`: Apply an arbitrary lambda function to your data and see the result - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.function) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.function) +- `.check.ncols()`: Count columns - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ncols) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ncols) +- `.check.ndups()`: Count rows with duplicate values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ndups) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ndups) +- `.check.nnulls()`: Count rows with null values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nnulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nnulls) +- `.check.nrows()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nrows) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nrows) +- `.check.print()`: Print a string, a variable, or the current dataframe - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print) + +### Export interim files +- `.check.write()`: Export the current data, inferring file format from the name - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.write) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.write) + +### Time your code +- `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print_time_elapsed) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print_time_elapsed) +

+> 💡 Tip: You can also use this stopwatch outside a method chain, anywhere in your Python code: -The `.check` methods didn't modify how the `iris` data is processed by your code. They just let you check the data as it flows down the pipeline. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`. +```python +from pandas_checks import print_elapsed_time, start_timer +start_time = start_timer() +... +print_elapsed_time(start_time) +``` -## Features -### Check methods -Here's what's in the doctor's bag. +### Turn Pandas Checks on or off +These methods can be used to disable subsequent Pandas Checks methods, either temporarily for a single method chain or permanently such as in a production environment. + +- `.check.disable_checks()`: Don't run checks. By default, still runs assertions. - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.disable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.disable_checks) +- `.check.enable_checks()`: Run checks again. 
- [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.enable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.enable_checks) + +### Validate data +Custom: +- `.check.assert_data()`: Check that data passes an arbitrary condition - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_data) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_data) + +Types: +- `.check.assert_datetime()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_datetime) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_datetime) +- `.check.assert_float()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_float) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_float) +- `.check.assert_int()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_int) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_int) +- `.check.assert_str()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_str) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_str) +- `.check.assert_timedelta()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_timedelta) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_timedelta) +- `.check.assert_type()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_type) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_type) + +Values: +- `.check.assert_less_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_less_than) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_less_than) +- `.check.assert_greater_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_greater_than) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_greater_than) +- `.check.assert_negative()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_negative) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_negative) +- `.check.assert_no_nulls()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_no_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_no_nulls) +- `.check.assert_all_nulls()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_all_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_all_nulls) +- `.check.assert_positive()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_positive) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_positive) +- `.check.assert_unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_unique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_unique) + +### Visualize data +- `.check.hist()`: A histogram - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.hist) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.hist) +- `.check.plot()`: An arbitrary plot you can customize - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.plot) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.plot) - **Describe** - - Standard Pandas methods: - - `.check.columns()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.columns) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.columns) - - `.check.dtypes()` for 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.dtypes) | `.check.dtype()` for [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.dtype) - - `.check.describe()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.describe) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.describe) - - `.check.head()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.head) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.head) - - `.check.info()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.info) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.info) - - `.check.memory_usage()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.memory_usage) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.memory_usage) - - `.check.nunique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nunique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nunique) - - `.check.shape()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.shape) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.shape) - - `.check.tail()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.tail) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.tail) - - `.check.unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.unique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.unique) - - `.check.value_counts()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.value_counts) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.value_counts) - - New functions in Pandas Checks: - - `.check.function()`: Apply an arbitrary lambda function to your data and see the result - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.function) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.function) - - `.check.ncols()`: Count columns - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ncols) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ncols) - - `.check.ndups()`: Count rows with duplicate values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.ndups) | 
[Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.ndups) - - `.check.nnulls()`: Count rows with null values - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.nnulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.nnulls) - - `.check.print()`: Print a string, a variable, or the current dataframe - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print) - -* **Export interim files** - - `.check.write()`: Export the current data, inferring file format from the name - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.write) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.write) - -* **Time your code** - - `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.print_time_elapsed) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.print_time_elapsed) - - 💡 Tip: You can also use this stopwatch outside a method chain, anywhere in your Python code: - - ```python - from pandas_checks import print_elapsed_time, start_timer - - start_time = start_timer() - ... - print_elapsed_time(start_time) - ``` - -* **Turn off Pandas Checks** - - `.check.disable_checks()`: Don't run checks, for production mode etc. By default, still runs assertions. 
- [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.disable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.disable_checks) - - `.check.enable_checks()`: Run checks - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.enable_checks) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.enable_checks) - -* **Validate** - - *General* - - `.check.assert_data()`: Check that data passes an arbitrary condition - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_data) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_data) - - *Types* - - `.check.assert_datetime()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_datetime) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_datetime) - - `.check.assert_float()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_float) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_float) - - `.check.assert_int()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_int) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_int) - - `.check.assert_str()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_str) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_str) - - `.check.assert_timedelta()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_timedelta) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_timedelta) - - `.check.assert_type()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_type) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_type) - - *Values* - - `.check.assert_less_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_less_than) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_less_than) - - `.check.assert_greater_than()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_greater_than) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_greater_than) - - `.check.assert_negative()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_negative) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_negative) - - `.check.assert_no_nulls()` - 
[DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_no_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_no_nulls) - - `.check.assert_all_nulls()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_all_nulls) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_all_nulls) - - `.check.assert_positive()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_positive) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_positive) - - `.check.assert_unique()` - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.assert_unique) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.assert_unique) - -* **Visualize** - - `.check.hist()`: A histogram - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.hist) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.hist) - - `.check.plot()`: An arbitrary plot you can customize - [DataFrame](https://cparmet.github.io/pandas-checks/API%20reference/DataFrameChecks/#pandas_checks.DataFrameChecks.DataFrameChecks.plot) | [Series](https://cparmet.github.io/pandas-checks/API%20reference/SeriesChecks/#pandas_checks.SeriesChecks.SeriesChecks.plot) - -### Customizing a check +## Customizing a check You can use Pandas Checks methods like the regular Pandas methods. 
They accept the same arguments. For example, you can pass: * `.check.head(7)` @@ -165,8 +167,8 @@ Also, most Pandas Checks methods accept 3 additional arguments: Power user output -### Configuring Pandas Check -#### Global configuration +## Configuring Pandas Check +### Global configuration You can change how Pandas Checks works everywhere. For example: ```python @@ -186,7 +188,7 @@ Run `pdc.describe_options()` to see the arguments you can pass to `.set_format() > > To turn off assertions too, add the argument `enable_asserts=False`, such as: `disable_checks(enable_asserts=False)`. -#### Local configuration +### Local configuration You can also adjust settings within a method chain by bookending the chain, like this: ```python @@ -212,7 +214,7 @@ You can also adjust settings within a method chain by bookending the chain, like Exploratory Data Analysis is often taught as a one-time step we do to plan our production data processing. But sometimes EDA is a cyclical process we go back to for deeper inspection during debugging, code edits, or changes in the input data. If explorations were useful in EDA, they may be useful again. -Unfortunately, it's hard to go back to EDA. It's too out of sync. The prod data processing pipeline has usually evolved too much, making the EDA code a historical artifact full of cobwebs that we can't easily fire up again. +Unfortunately, it's hard to go back to the original EDA code. It's too out of sync. The prod data processing pipeline has usually evolved too much, making the EDA code a historical artifact full of cobwebs that we can't easily fire up again. But if you use Pandas Checks during EDA, you could roll your `.check` methods into your first production code. Then in prod mode, disable Pandas Checks when you don't need it, to save compute and streamline output. When you ever need to pull out those EDA tools, enable Pandas Checks globally or locally. 
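The EDA-to-production workflow described above can be illustrated with a toy sketch. This is not Pandas Checks' implementation. It assumes a hypothetical accessor named `sketch_check` and a hypothetical `CHECKS_ENABLED` switch, built on the standard pandas `register_dataframe_accessor` extension point, only to show how the same chain can run with checks on (EDA) and off (prod) without changing the data that flows through it.

```python
import pandas as pd

# Hypothetical global switch for this sketch; Pandas Checks manages its own mode.
CHECKS_ENABLED = {"on": True}


@pd.api.extensions.register_dataframe_accessor("sketch_check")
class SketchCheck:
    """Toy accessor (hypothetical name) that mimics a non-invasive check."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def nrows(self, log: list) -> pd.DataFrame:
        # Record the row count only while checks are switched on.
        if CHECKS_ENABLED["on"]:
            log.append(self._df.shape[0])
        # Always return the same DataFrame so the method chain continues.
        return self._df


def run_pipeline(df: pd.DataFrame, log: list) -> pd.DataFrame:
    return (
        df
        .sketch_check.nrows(log)      # check before filtering
        .query("sepal_length > 5")
        .sketch_check.nrows(log)      # check after filtering
    )


iris_like = pd.DataFrame({"sepal_length": [4.9, 5.1, 6.3]})

# "EDA mode": checks run and record what they see.
eda_log: list = []
eda_result = run_pipeline(iris_like, eda_log)

# "Prod mode": the same chain, with checks skipped.
CHECKS_ENABLED["on"] = False
prod_log: list = []
prod_result = run_pipeline(iris_like, prod_log)
```

With the real package, the equivalent switch would be the `.check.disable_checks()` and `.check.enable_checks()` methods (or their global counterparts) mentioned in the README, rather than a hand-rolled flag.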
From 65961a76d7c42685b21c61753c520ac80b6eba45 Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Sat, 16 Nov 2024 08:08:10 -0600 Subject: [PATCH 5/9] Add example usage to docstrings --- pandas_checks/DataFrameChecks.py | 384 ++++++++++++++++++++++++++++-- pandas_checks/SeriesChecks.py | 389 +++++++++++++++++++++++++++++-- 2 files changed, 736 insertions(+), 37 deletions(-) diff --git a/pandas_checks/DataFrameChecks.py b/pandas_checks/DataFrameChecks.py index 3c82217..fd9bcf4 100644 --- a/pandas_checks/DataFrameChecks.py +++ b/pandas_checks/DataFrameChecks.py @@ -65,6 +65,14 @@ def assert_data( ) -> pd.DataFrame: """Tests whether Dataframe meets condition. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + # Validate that the Dataframe has at least 2 rows + + ( + iris + .check.assert_data(lambda df: df.shape[0]>1, verbose=True) + ) + Args: condition: Assertion criteria in the form of a lambda function, such as `lambda df: df.shape[0]>10`. fail_message: Message to display if the condition fails. @@ -160,8 +168,14 @@ def assert_datetime( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns is datetime or timestamp. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + df + .check.assert_datetime(subset="datetime_col") + ) + Args: - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. pass_message: Message to display if the condition passes. fail_message: Message to display if the condition fails. raise_exception: Whether to raise an exception if the condition fails. @@ -194,10 +208,16 @@ def assert_float( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns is floats. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + df + .check.assert_float(subset="float_col") + ) + Args: fail_message: Message to display if the condition fails. 
pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -228,10 +248,16 @@ def assert_int( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns is integers. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + df + .check.assert_int(subset="int_col") + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -262,14 +288,27 @@ def assert_less_than( exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.DataFrame: - """Tests whether Dataframe or subset of columns is < or <= a value. Optionally raises an exception. Does not modify the DataFrame itself. + """Tests whether all values in a Dataframe or subset of columns is < or <= a maximum threshold. Optionally raises an exception. Does not modify the DataFrame itself. 
+ + Example: + # Validate that sepal_length is always < 1000 + ( + iris + .check.assert_less_than(1000, subset="sepal_length") + ) + + # Validate that two columns are each always <= 1000 + ( + iris + .check.assert_less_than(1000, subset=["sepal_length", "petal_length"], or_equal_to=True) + ) Args: max: the max value to compare DataFrame to. Accepts any type that can be used in <, such as int, float, str, datetime or_equal_to: whether to test for <= min (True) or < max (False) fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -305,14 +344,28 @@ def assert_greater_than( exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.DataFrame: - """Tests whether Dataframe or subset of columns is > or >= a value. Optionally raises an exception. Does not modify the DataFrame itself. + """Tests whether all values in a Dataframe or subset of columns is > or >= a minimum threshold. Optionally raises an exception. Does not modify the DataFrame itself. + + + Example: + # Validate that sepal_length is always >0 + ( + iris + .check.assert_greater_than(0, subset="sepal_length") + ) + + # Validate that two columns are each always >= 0.1 + ( + iris + .check.assert_greater_than(0.1, subset=["sepal_length", "petal_length"], or_equal_to=True) + ) Args: min: the minimum value to compare DataFrame to. Accepts any type that can be used in >, such as int, float, str, datetime fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. 
or_equal_to: whether to test for >= min (True) or > min (False) - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -349,6 +402,12 @@ def assert_negative( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns has all negative values. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + df + .check.assert_negative(subset="column_name") + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -395,10 +454,16 @@ def assert_no_nulls( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns has no nulls. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + iris + .check.assert_no_nulls(subset=["sepal_length"]) + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -430,10 +495,18 @@ def assert_all_nulls( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns has all nulls. Optionally raises an exception. Does not modify the DataFrame itself. 
+ Example: + ( + iris + .check.assert_all_nulls(subset=["sepal_length"]) + ) + + # Will raise an exception, "ㄨ Assert all nulls failed" + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -466,10 +539,16 @@ def assert_positive( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns has all positive values. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + iris + .check.assert_positive(subset=["sepal_length"]) + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. assert_no_nulls: Whether to also enforce that data has no nulls. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -511,10 +590,16 @@ def assert_str( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns is strings. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + iris + .check.assert_str(subset=["species", "another_string_column"]) + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. 
raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -545,10 +630,16 @@ def assert_timedelta( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns is of type timedelta. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + ( + df + .check.assert_timedelta(subset=["timedelta_col"]) + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -580,11 +671,18 @@ def assert_type( ) -> pd.DataFrame: """Tests whether Dataframe or subset of columns meets type assumption. Optionally raises an exception. Does not modify the DataFrame itself. + Example: + # Validate that a column of mixed types has overall type `object` + ( + iris + .check.assert_type(object, subset="column_with_mixed_types") + ) + Args: dtype: The required variable type fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. 
@@ -631,12 +729,25 @@ def assert_unique( exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.DataFrame: - """Tests whether Dataframe or subset of columns has no duplicate rows. Optionally raises an exception. Does not modify the DataFrame itself. + """Validates that a subset of columns have no duplicate values, or validates that a DataFrame has no duplicate rows. Optionally raises an exception. Does not modify the DataFrame itself. + + Example: + # Validate that a column has no duplicate values + ( + df + .check.assert_unique(subset="id_column") + ) + + # Validate that a DataFrame has no duplicate rows + ( + df + .check.assert_unique() + ) Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. ` + subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -665,6 +776,12 @@ def columns( ) -> pd.DataFrame: """Prints the column names of a DataFrame, without modifying the DataFrame itself. + Example: + ( + df + .check.columns() + ) + Args: fn: An optional lambda function to apply to the DataFrame before printing columns. Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before printing their names. Applied after fn. @@ -693,6 +810,12 @@ def describe( See Pandas docs for describe() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + df + .check.describe() + ) + Args: fn: An optional lambda function to apply to the DataFrame before running Pandas describe(). 
Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before running Pandas describe(). Applied after fn. @@ -714,6 +837,14 @@ def describe( def disable_checks(self, enable_asserts: bool = True) -> pd.DataFrame: """Turns off Pandas Checks globally, such as in production mode. Calls to .check functions will not be run. Does not modify the DataFrame itself. + Example: + ( + iris + .check.disable_checks() + .check.assert_data(lambda df: df.shape[0]>10) # This check will NOT be run + .check.enable_checks() # Subsequent calls to .check will be run + ) + Args enable_assert: Optionally, whether to also enable or disable assert statements @@ -733,6 +864,12 @@ def dtypes( See Pandas docs for dtypes for additional usage information. + Example: + ( + iris + .check.dtypes() + ) + Args: fn: An optional lambda function to apply to the DataFrame before running Pandas dtypes. Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before running Pandas .dtypes. Applied after fn. @@ -753,6 +890,15 @@ def dtypes( def enable_checks(self, enable_asserts: bool = True) -> pd.DataFrame: """Globally enables Pandas Checks. Subequent calls to .check methods will be run. Does not modify the DataFrame itself. + Example: + ( + iris + ["sepal_length"] + .check.disable_checks() + .check.assert_data(lambda s: s.shape[0]>10) # This check will NOT be run + .check.enable_checks() # Subsequent calls to .check will be run + ) + Args: enable_asserts: Optionally, whether to globally enable or disable calls to .check.assert_data(). @@ -772,7 +918,8 @@ def function( Example: .check.function(fn=lambda df: df.shape[0]>10, check_name='Has at least 10 rows?') - which will result in 'True' or 'False' + + # Will return either 'True' or 'False' Args: fn: A lambda function to apply to the DataFrame. Example: `lambda df: df.shape[0]>10`. 
Applied before subset. @@ -790,6 +937,16 @@ def get_mode( ) -> pd.DataFrame: """Displays the current values of Pandas Checks global options enable_checks and enable_asserts. Does not modify the DataFrame itself. + Example: + ( + iris + .check.get_mode() + ) + + # The check will print: + # "🐼🩺 Pandas Checks mode: {'enable_checks': True, 'enable_asserts': True}" + + Args: check_name: An optional name for the check. Will be used as a preface the printed result. @@ -810,6 +967,12 @@ def head( See Pandas docs for head() for additional usage information. + Example: + ( + iris + .check.head(10) + ) + Args: n: The number of rows to display. fn: An optional lambda function to apply to the DataFrame before running Pandas head(). Example: `lambda df: df.shape[0]>10`. Applied before subset. @@ -839,6 +1002,12 @@ def hist( See Pandas docs for hist() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + .check.hist(subset=["sepal_length", "sepal_width"]) + ) + Args: fn: An optional lambda function to apply to the DataFrame before running Pandas hist(). Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before running Pandas hist(). Applied after fn. @@ -878,6 +1047,12 @@ def info( See Pandas docs for info() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + .check.info() + ) + Args: fn: An optional lambda function to apply to the DataFrame before running Pandas info(). Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before running Pandas info(). Applied after fn. 
@@ -904,6 +1079,12 @@ def memory_usage( See Pandas docs for memory_usage() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + .check.memory_usage() + ) + Args: fn: An optional lambda function to apply to the DataFrame before running Pandas memory_usage(). Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before running Pandas memory_usage(). Applied after fn. @@ -933,6 +1114,12 @@ def ncols( ) -> pd.DataFrame: """Displays the number of columns in a DataFrame, without modifying the DataFrame itself. + Example: + ( + iris + .check.ncols() + ) + Args: fn: An optional lambda function to apply to the DataFrame before counting the number of columns. Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before counting the number of columns. Applied after fn. @@ -961,6 +1148,13 @@ def ndups( See Pandas docs for duplicated() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + # Count the rows with duplicate pairs of values in two columns + ( + iris + .check.ndups(subset=["sepal_length", "sepal_width"]) + ) + Args: fn: An optional lambda function to apply to the DataFrame before counting the number of duplicates. Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before counting duplicate rows. Applied after fn. @@ -994,6 +1188,19 @@ def nnulls( See Pandas docs for isna() for additional usage information. 
+ Example: + # Count the number of rows that have any nulls, one count per column + ( + iris + .check.nnulls() + ) + + # Count the number of rows in the DataFrame that have a null in any column + ( + iris + .check.nnulls(by_column=False) + ) + Args: fn: An optional lambda function to apply to the DataFrame before counting the number of rows with a null. Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string to select a subset of columns before counting nulls. @@ -1038,6 +1245,12 @@ def nrows( ) -> pd.DataFrame: """Displays the number of rows in a DataFrame, without modifying the DataFrame itself. + Example: + ( + iris + .check.nrows() + ) + Args: fn: An optional lambda function to apply to the DataFrame before counting the number of rows. Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string name of one column to limit which columns are considered when counting rows. Applied after fn. @@ -1066,6 +1279,12 @@ def nunique( See Pandas docs for nunique() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + .check.nunique(column="sepal_width") + ) + Args: column: The name of a column to count uniques in. Applied after fn. fn: An optional lambda function to apply to the DataFrame before running Pandas nunique(). Example: `lambda df: df.shape[0]>10`. Applied before subset. @@ -1099,6 +1318,13 @@ def plot( See Pandas docs for plot() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + + Example: + ( + iris + .check.plot(kind="scatter", x="sepal_width", y="sepal_length", title="Sepal width vs sepal length") + ) + Args: fn: An optional lambda function to apply to the DataFrame before running Pandas plot(). Example: `lambda df: df.shape[0]>10`. Applied before subset. 
subset: An optional list of column names or a string name of one column to limit which columns are plotted. Applied after fn. @@ -1133,6 +1359,21 @@ def print( ) -> pd.DataFrame: """Displays text, another object, or (by default) the current DataFrame's head. Does not modify the DataFrame itself. + Example: + # Print messages and milestones + ( + iris + .check.print("Starting data cleaning...") + ... + ) + + # Inspect a DataFrame, such as the interim result of data processing + ( + iris + ... + .check.print(fn=lambda df: df.query("sepal_width<0"), check_name="Rows with negative sepal_width") + ) + Args: object: Object to print. Can be anything printable: str, int, list, another DataFrame, etc. If None, print the DataFrame's head (with `max_rows` rows). fn: An optional lambda function to apply to the DataFrame before printing `object`. Example: `lambda df: df.shape[0]>10`. Applied before subset. @@ -1160,6 +1401,24 @@ def print_time_elapsed( ) -> pd.DataFrame: """Displays the time elapsed since start_time. + Example: + import pandas_checks as pdc + + start_time = pdc.start_timer() + + ( + iris + ... # Do some data processing + .check.print_time_elapsed(start_time, "Cleaning took") + + ... # Do more + .check.print_time_elapsed(start_time, "Processing total time", units="seconds") # Force units to stay in seconds + + ) + + # Result: "Cleaning took: 17.298324584960938 seconds" + # "Processing total time: 71.0400543212890625 seconds" + Args: start_time: The index time when the stopwatch started, which comes from the Pandas Checks start_timer() lead_in: Optional text to print before the elapsed time. @@ -1179,6 +1438,17 @@ def print_time_elapsed( def reset_format(self) -> pd.DataFrame: """Globally restores all Pandas Checks formatting options to their default "factory" settings. Does not modify the DataFrame itself.
+ Example: + ( + iris + .check.set_format(precision=9, use_emojis=False) + + # Print DF summary stats with precision 9 digits and no Pandas Checks emojis + .check.describe() + + .check.reset_format() # Go back to default precision and emojis 🥳 + ) + Returns: The original DataFrame, unchanged. """ @@ -1190,8 +1460,16 @@ def set_format(self, **kwargs: Any) -> pd.DataFrame: Run pandas_checks.describe_options() to see a list of available options. - For example, .check.set_format(check_text_tag= "h1", use_emojis=False`) - will globally change Pandas Checks to display text results as H1 headings and remove all emojis. + Example: + ( + iris + .check.set_format(precision=9, use_emojis=False) + + # Print DF summary stats with precision 9 digits and no Pandas Checks emojis + .check.describe() + + .check.reset_format() # Go back to default precision and emojis 🥳 + ) Args: **kwargs: Pairs of setting name and its new value. @@ -1205,9 +1483,26 @@ def set_format(self, **kwargs: Any) -> pd.DataFrame: def set_mode(self, enable_checks: bool, enable_asserts: bool) -> pd.DataFrame: """Configures the operation mode for Pandas Checks globally. Does not modify the DataFrame itself. + Example: + + # Disable checks but keep running assertions + # Same as using .check.disable_checks() + ( + iris + .check.set_mode(enable_checks=False, enable_asserts=True) + .check.describe() # This check will not be run + .check.assert_data(lambda df: df.shape[0]>10) # This check will still be run + ) + + # Disable checks and assertions + ( + iris + .check.set_mode(enable_checks=False, enable_asserts=False) + ) + Args: - enable_checks: Whether to run any Pandas Checks methods globally. Does not affect .check.assert_data(). - enable_asserts: Whether to run calls to Pandas Checks .check.assert_data() statements globally. + enable_checks: Whether to run any Pandas Checks methods globally. Does not affect .check.assert_*(). + enable_asserts: Whether to run calls to Pandas Checks .check.assert_*() statements globally.
Returns: The original DataFrame, unchanged. @@ -1225,6 +1520,13 @@ def shape( See Pandas docs for shape for additional usage information. + Example: + ( + iris + .check.shape() + .check.shape(fn=lambda df: df.query("sepal_length<5"), check_name="Shape of DataFrame subgroup with sepal_length<5") + ) + Args: fn: An optional lambda function to apply to the DataFrame before running Pandas `shape`. Example: `lambda df: df.shape[0]>10`. Applied before subset. subset: An optional list of column names or a string name of one column to limit which columns are considered when printing the shape. Applied after fn. @@ -1256,6 +1558,12 @@ def tail( See Pandas docs for tail() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + .check.tail(10) + ) + Args: n: Number of rows to show. fn: An optional lambda function to apply to the DataFrame before running Pandas tail(). Example: `lambda df: df.shape[0]>10`. Applied before subset. @@ -1284,6 +1592,14 @@ def unique( See Pandas docs for unique() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + .check.unique("species") + ) + # The check will print: + # 🌟 Unique values of species: ['setosa', 'versicolor', 'virginica'] + Args: column: Column to check for unique values. fn: An optional lambda function to apply to the DataFrame before calling Pandas unique(). Example: `lambda df: df.shape[0]>10`. Applied before subset. @@ -1319,6 +1635,12 @@ def value_counts( See Pandas docs for value_counts() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + .check.value_counts("sepal_length") + ) + Args: column: Column to check for value counts. max_rows: Maximum number of rows to show in the value counts. 
@@ -1356,10 +1678,30 @@ def write( ) -> pd.DataFrame: """Exports DataFrame to file, without modifying the DataFrame itself. - Format is inferred from path extension like .csv. + The file format is inferred from the extension. Supports: + - .csv + - .feather + - .parquet + - .pkl # Pickle + - .tsv # Tab-separated data file + - .xlsx This functions uses the corresponding Pandas export function such as to_csv(). See Pandas docs for those functions for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + + # Process data + ... + + # Export the interim data for inspection + .check.write("iris_interim.xlsx") + + # Continue processing + ... + ) + Args: path: Path to write the file to. format: Optional file format to force for the export. If None, format is inferred from the file's extension in `path`. diff --git a/pandas_checks/SeriesChecks.py b/pandas_checks/SeriesChecks.py index b4b9ef5..783b5b8 100644 --- a/pandas_checks/SeriesChecks.py +++ b/pandas_checks/SeriesChecks.py @@ -59,6 +59,15 @@ def assert_data( ) -> pd.Series: """Tests whether Series meets condition. Optionally raises an exception. Does not modify the Series itself. + Example: + # Validate that the Series has at least 2 rows + + ( + iris + ["sepal_length"] + .check.assert_data(lambda s: s.shape[0]>1, verbose=True) + ) + Args: condition: Assertion criteria in the form of a lambda function, such as `lambda s: s.shape[0]>10`. fail_message: Message to display if the condition fails. @@ -149,6 +158,13 @@ def assert_datetime( ) -> pd.Series: """Tests whether Series is datetime or timestamp. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + df + ["datetime_col"] + .check.assert_datetime() + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. 
@@ -180,6 +196,13 @@ def assert_float( ) -> pd.Series: """Tests whether Series is floats. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + df + ["float_col"] + .check.assert_float() + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -211,6 +234,13 @@ def assert_int( ) -> pd.Series: """Tests whether Series is integers. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + df + ["int_col"] + .check.assert_int() + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -242,7 +272,22 @@ def assert_less_than( exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.Series: - """Tests whether Series is < or <= a value. Optionally raises an exception. Does not modify the Series itself. + """Tests whether all values in Series are < or <= a maximum threshold. Optionally raises an exception. Does not modify the Series itself. + + Example: + # Validate that sepal_length is always < 1000 + ( + iris + ["sepal_length"] + .check.assert_less_than(1000) + ) + + # Validate that it's always <= 1000 + ( + iris + ["sepal_length"] + .check.assert_less_than(1000, or_equal_to=True) + ) Args: max: the max value to compare Series to. Accepts any type that can be used in <, such as int, float, str, datetime @@ -282,7 +327,22 @@ def assert_greater_than( exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.Series: - """Tests whether Series is > or >= a value. Optionally raises an exception. Does not modify the Series itself. + """Tests whether Series is > or >= a minimum threshold. Optionally raises an exception. Does not modify the Series itself. 
+ + Example: + # Validate that sepal_length is always >0 + ( + iris + ["sepal_length"] + .check.assert_greater_than(0) + ) + + # Validate that two columns are each always >= 0.1 + ( + iris + [["sepal_length", "petal_length"]] + .check.assert_greater_than(0.1, or_equal_to=True) + ) Args: min: the minimum value to compare Series to. Accepts any type that can be used in >, such as int, float, str, datetime @@ -323,6 +383,13 @@ def assert_negative( ) -> pd.Series: """Tests whether Series has all negative values. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + df + ["column_name"] + .check.assert_negative() + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -366,6 +433,12 @@ def assert_no_nulls( ) -> pd.Series: """Tests whether Series has no nulls. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + iris["sepal_length"] + .check.assert_no_nulls() + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -398,6 +471,15 @@ def assert_all_nulls( ) -> pd.Series: """Tests whether Series has all nulls. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + iris + ["sepal_length"] + .check.assert_all_nulls() + ) + + # Will raise an exception, "ㄨ Assert all nulls failed" + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -431,6 +513,13 @@ def assert_positive( ) -> pd.Series: """Tests whether Series has all positive values. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + iris + ["sepal_length"] + .check.assert_positive() + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes.
@@ -454,8 +543,8 @@ def assert_positive( self._obj.dropna().check.assert_data( condition=lambda s: (s > 0).all().all(), - fail_message=fail_message, pass_message=pass_message, + fail_message=fail_message, raise_exception=raise_exception, exception_to_raise=exception_to_raise, message_shows_condition=False, @@ -473,6 +562,13 @@ def assert_str( ) -> pd.Series: """Tests whether Series is strings. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + iris + ["species"] + .check.assert_str() + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -504,6 +600,12 @@ def assert_timedelta( ) -> pd.Series: """Tests whether Series is of type timedelta. Optionally raises an exception. Does not modify the Series itself. + Example: + ( + df + .check.assert_timedelta(subset=["timedelta_col"]) + ) + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -536,6 +638,14 @@ def assert_type( ) -> pd.Series: """Tests whether Series meets type assumption. Optionally raises an exception. Does not modify the Series itself. + Example: + # Validate that a column of mixed types has overall type `object` + ( + iris + ["column_with_mixed_types"] + .check.assert_type(object) + ) + Args: dtype: The required variable type fail_message: Message to display if the condition fails. @@ -575,7 +685,14 @@ def assert_unique( exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.Series: - """Tests whether Series has no duplicate rows. Optionally raises an exception. Does not modify the Series itself. + """Validates that a Series has no duplicate values. Optionally raises an exception. Does not modify the Series itself. + + Example: + ( + df + ["id_column"] + .check.assert_unique() + ) Args: fail_message: Message to display if the condition fails. 
@@ -609,6 +726,13 @@ def describe( See Pandas docs for describe() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + ["sepal_length"] + .check.describe() + ) + Args: fn: An optional lambda function to apply to the Series before running Pandas describe(). Example: `lambda s: s.dropna()`. check_name: An optional name for the check to preface the result with. @@ -625,6 +749,15 @@ def describe( def disable_checks(self, enable_asserts: bool = True) -> pd.Series: """Turns off Pandas Checks globally, such as in production mode. Calls to .check functions will not be run. Does not modify the Series itself. + Example: + ( + iris + ["sepal_length"] + .check.disable_checks() + .check.assert_data(lambda s: s.shape[0]>10) # This check will NOT be run + .check.enable_checks() # Subsequent calls to .check will be run + ) + Args enable_assert: Optionally, whether to also enable or disable assert statements @@ -643,6 +776,13 @@ def dtype( See Pandas docs for .dtype for additional usage information. + Example: + ( + iris + ["sepal_length"] + .check.dtype() + ) + Args: fn: An optional lambda function to apply to the Series before running Pandas dtype. Example: `lambda s: s.dropna()`. check_name: An optional name for the check to preface the result with. @@ -661,6 +801,15 @@ def dtype( def enable_checks(self, enable_asserts: bool = True) -> pd.Series: """Globally enables Pandas Checks. Subequent calls to .check methods will be run. Does not modify the Series itself. + Example: + ( + iris + ["sepal_length"] + .check.disable_checks() + .check.assert_data(lambda s: s.shape[0]>10) # This check will NOT be run + .check.enable_checks() # Subsequent calls to .check will be run + ) + Args: enable_asserts: Optionally, whether to globally enable or disable calls to .check.assert_data(). 
@@ -679,7 +828,8 @@ def function( Example: .check.function(fn=lambda s: s.shape[0]>10, check_name='Has at least 10 rows?') - which will result in 'True' or 'False' + + # Will return either 'True' or 'False' Args: fn: The lambda function to apply to the Series. Example: `lambda s: s.dropna()`. @@ -696,6 +846,17 @@ def get_mode( ) -> pd.Series: """Displays the current values of Pandas Checks global options enable_checks and enable_asserts. Does not modify the Series itself. + Example: + ( + iris + ["sepal_length"] + .check.get_mode() + ) + + # The check will print: + # "🐼🩺 Pandas Checks mode: {'enable_checks': True, 'enable_asserts': True}" + + Args: check_name: An optional name for the check. Will be used as a preface the printed result. @@ -715,6 +876,13 @@ def head( See Pandas docs for head() for additional usage information. + Example: + ( + iris + ["sepal_length"] + .check.head(10) + ) + Args: n: The number of rows to display. fn: An optional lambda function to apply to the Series before running Pandas head(). Example: `lambda s: s.dropna()`. @@ -738,6 +906,13 @@ def hist( See Pandas docs for hist() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + ["sepal_length"] + .check.hist() + ) + Args: fn: An optional lambda function to apply to the Series before running Pandas head(). Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -764,6 +939,13 @@ def info( See Pandas docs for info() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + ["sepal_length"] + .check.info() + ) + Args: fn: An optional lambda function to apply to the Series before running Pandas info(). Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. 
@@ -788,6 +970,13 @@ def memory_usage( See Pandas docs for memory_usage() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + ["sepal_length"] + .check.memory_usage() + ) + Args: fn: An optional lambda function to apply to the Series before running Pandas memory_usage(). Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -814,6 +1003,13 @@ def ndups( See Pandas docs for duplicated() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + ["sepal_length"] + .check.ndups() + ) + Args: fn: An optional lambda function to apply to the Series before counting the number of duplicates. Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -836,6 +1032,13 @@ def nnulls( See Pandas docs for isna() for additional usage information. + Example: + ( + iris + ["sepal_length"] + .check.nnulls() + ) + Args: fn: An optional lambda function to apply to the Series before counting rows with nulls. Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -855,6 +1058,13 @@ def nrows( ) -> pd.Series: """Displays the number of rows in a Series, without modifying the Series itself. + Example: + ( + iris + ["sepal_width"] + .check.nrows() + ) + Args: fn: An optional lambda function to apply to the Series before counting the number of rows. Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -877,6 +1087,13 @@ def nunique( See Pandas docs for nunique() for additional usage information, including more configuration options you can pass to this Pandas Checks method. 
+ Example: + ( + iris + ["sepal_width"] + .check.nunique() + ) + Args: fn: An optional lambda function to apply to the Series before running Pandas nunique(). Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -905,6 +1122,15 @@ def plot( See Pandas docs for plot() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + # Show a box plot of the Series distribution + ( + iris + ["sepal_width"] + .check.plot(kind="box", title="Distribution of sepal width") + ) + + Args: fn: An optional lambda function to apply to the Series before running Pandas plot(). Example: `lambda s: s.dropna()`. check_name: An optional title for the plot. @@ -932,6 +1158,22 @@ def print( ) -> pd.Series: """Displays text, another object, or (by default) the current DataFrame's head. Does not modify the Series itself. + Example: + # Print messages and milestones + ( + iris + ["sepal_width"] + .check.print("Starting data cleaning...") + ... + ) + + # Inspect a Series, such as the interim result of data processing + ( + iris + ... + .check.print(fn=lambda s: s[s<0], check_name="Negative values of sepal_width") + ) + Args: object: Object to print. Can be anything printable: str, int, list, another DataFrame, etc. If None, print the Series's head (with `max_rows` rows). fn: An optional lambda function to apply to the Series before printing `object`. Example: `lambda s: s.dropna()`. @@ -954,10 +1196,29 @@ def print_time_elapsed( ) -> pd.Series: """Displays the time elapsed since start_time. + Example: + import pandas_checks as pdc + + start_time = pdc.start_timer() + + ( + iris + ["species"] + ... # Do some data processing + .check.print_time_elapsed(start_time, "Cleaning took") + + ... 
# Do more + .check.print_time_elapsed(start_time, "Processing total time", units="seconds") # Force units to stay in seconds + + ) + + # Result: "Cleaning took: 17.298324584960938 seconds" + # "Processing total time: 71.0400543212890625 seconds" + Args: - start_time: The index time when the stopwatch started, which comes from the Pandas Checks start_timer() - lead_in: Optional text to print before the elapsed time. - units: The units in which to display the elapsed time. Allowed values: "auto", "milliseconds", "seconds", "minutes", "hours" or shorthands "ms", "s", "m", "h". + start_time: The index time when the stopwatch started, which comes from the Pandas Checks start_timer() + lead_in: Optional text to print before the elapsed time. + units: The units in which to display the elapsed time. Allowed values: "auto", "milliseconds", "seconds", "minutes", "hours" or shorthands "ms", "s", "m", "h". Raises: ValueError: If `units` is not one of allowed values. @@ -973,6 +1234,18 @@ def print_time_elapsed( def reset_format(self) -> pd.Series: """Globally restores all Pandas Checks formatting options to their default "factory" settings. Does not modify the Series itself. + Example: + ( + iris + ["sepal_width"] + .check.set_format(precision=9, use_emojis=False) + + # Print Series summary stats with precision 9 digits and no Pandas Checks emojis + .check.describe() + + .check.reset_format() # Go back to default precision and emojis 🥳 + ) + Returns: The original Series, unchanged. """ @@ -980,10 +1253,23 @@ def reset_format(self) -> pd.Series: return self._obj def set_format(self, **kwargs: Any) -> pd.Series: - """Configures selected formatting options for Pandas Checks. Run pandas_checks.describe_options() to see a list of available options. Does not modify the Series itself + """Configures selected formatting options for Pandas Checks. Does not modify the Series itself. + + Run pandas_checks.describe_options() to see a list of available options. 
+ + See .check.reset_format() to restore defaults. + + Example: + ( + iris + ["sepal_width"] + .check.set_format(precision=9, use_emojis=False) + + # Print Series summary stats with precision 9 digits and no Pandas Checks emojis + .check.describe() - For example, .check.set_format(check_text_tag= "h1", use_emojis=False`) - will globally change Pandas Checks to display text results as H1 headings and remove all emojis. + .check.reset_format() # Go back to default precision and emojis 🥳 + ) Args: **kwargs: Pairs of setting name and its new value. @@ -997,9 +1283,28 @@ def set_mode(self, enable_checks: bool, enable_asserts: bool) -> pd.Series: """Configures the operation mode for Pandas Checks globally. Does not modify the Series itself. + Example: + + # Disable checks but keep running assertions + # Same as using .check.disable_checks() + ( + iris + ["sepal_width"] + .check.set_mode(enable_checks=False, enable_asserts=True) + .check.describe() # This check will not be run + .check.assert_data(lambda s: s.shape[0]>10) # This check will still be run + ) + + # Disable checks and assertions + ( + iris + ["sepal_width"] + .check.set_mode(enable_checks=False, enable_asserts=False) + ) + Args: - enable_checks: Whether to run any Pandas Checks methods globally. Does not affect .check.assert_data(). - enable_asserts: Whether to run calls to Pandas Checks .check.assert_data() globally. + enable_checks: Whether to run any Pandas Checks methods globally. Does not affect .check.assert_*() calls. + enable_asserts: Whether to run calls to Pandas Checks .check.assert_*() globally. Returns: The original Series, unchanged. @@ -1016,6 +1321,14 @@ def shape( See Pandas docs for `shape` for additional usage information. 
+ Example: + ( + iris + ["sepal_width"] + .check.shape() + .check.shape(fn=lambda s: s[s<5], check_name="Shape of sepal_width series with values <5") + ) + Args: fn: An optional lambda function to apply to the Series before running Pandas `shape`. Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -1045,6 +1358,12 @@ def tail( See Pandas docs for tail() for additional usage information. + Example: + ( + iris["sepal_width"] + .check.tail(10) + ) + Args: n: Number of rows to show. fn: An optional lambda function to apply to the Series before running Pandas tail(). Example: `lambda s: s.dropna()`. @@ -1067,6 +1386,15 @@ def unique( See Pandas docs for unique() for additional usage information. + Example: + ( + iris + ["species"] + .check.unique() + ) + # The check will print: + # 🌟 Unique values of species: ['setosa', 'versicolor', 'virginica'] + Args: fn: An optional lambda function to apply to the Series before running Pandas unique(). Example: `lambda s: s.dropna()`. check_name: An optional name for the check, to be printed as preface to the result. @@ -1095,6 +1423,13 @@ def value_counts( See Pandas docs for value_counts() for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Example: + ( + iris + ["sepal_length"] + .check.value_counts() + ) + Args: max_rows: Maximum number of rows to show in the value counts. fn: An optional lambda function to apply to the Series before running Pandas value_counts(). Example: `lambda s: s.dropna()`. @@ -1126,10 +1461,34 @@ def write( ) -> pd.Series: """Exports Series to file, without modifying the Series itself. - Format is inferred from path extension like .csv. + The file format is inferred from the extension. Supports: + - .csv + - .feather + - .parquet + - .pkl # Pickle + - .tsv # Tab-separated data file + - .xlsx This functions uses the corresponding Pandas export function such as to_csv(). 
See Pandas docs for those functions for additional usage information, including more configuration options you can pass to this Pandas Checks method. + Note: + Exporting to some formats such as Excel, Feather, and Parquet may require you to install additional packages. + + Example: + ( + iris + ["sepal_length"] + + # Process data + ... + + # Export the interim data for inspection + .check.write("sepal_length_interim.xlsx") + + # Continue processing + ... + ) + Args: path: Path to write the file to. format: Optional file format to force for the export. If None, format is inferred from the file's extension in `path`. @@ -1140,8 +1499,6 @@ def write( Returns: The original Series, unchanged. - Note: - Exporting to some formats such as Excel, Feather, and Parquet may require you to install additional packages. """ ( pd.DataFrame(_apply_modifications(self._obj, fn)).check.write( From fff415d8d52688c7773d6277709b2f8521f46571 Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Sat, 16 Nov 2024 08:25:26 -0600 Subject: [PATCH 6/9] Revise example usage for assertion methods --- pandas_checks/DataFrameChecks.py | 60 ++++++++++++++++++++++---------- pandas_checks/SeriesChecks.py | 52 +++++++++++++++++++-------- 2 files changed, 79 insertions(+), 33 deletions(-) diff --git a/pandas_checks/DataFrameChecks.py b/pandas_checks/DataFrameChecks.py index fd9bcf4..3cae7ad 100644 --- a/pandas_checks/DataFrameChecks.py +++ b/pandas_checks/DataFrameChecks.py @@ -70,7 +70,13 @@ def assert_data( ( iris - .check.assert_data(lambda df: df.shape[0]>1, verbose=True) + .check.assert_data(lambda df: df.shape[0]>1) + + # Or customize the message displayed when the assertion fails + .check.assert_data(lambda df: df.shape[0]>1, "Assertion failed, DataFrame has no rows!") + + # Or show a warning instead of raising an exception + .check.assert_data(lambda df: df.shape[0]>1, "FYI DataFrame has no rows", raise_exception=False) ) Args: @@ -174,6 +180,8 @@ def assert_datetime( 
.check.assert_datetime(subset="datetime_col") ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: subset: Optional, which column or columns to check the condition against. pass_message: Message to display if the condition passes. @@ -214,6 +222,8 @@ def assert_float( .check.assert_float(subset="float_col") ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -254,6 +264,8 @@ def assert_int( .check.assert_int(subset="int_col") ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -291,18 +303,18 @@ def assert_less_than( """Tests whether all values in a Dataframe or subset of columns is < or <= a maximum threshold. Optionally raises an exception. Does not modify the DataFrame itself. Example: - # Validate that sepal_length is always < 1000 ( iris + + # Validate that sepal_length is always < 1000 .check.assert_less_than(1000, subset="sepal_length") - ) - # Validate that two columns are each always <= 1000 - ( - iris + # Validate that two columns are each always less than or equal to 1000 .check.assert_less_than(1000, subset=["sepal_length", "petal_length"], or_equal_to=True) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: max: the max value to compare DataFrame to. 
Accepts any type that can be used in <, such as int, float, str, datetime or_equal_to: whether to test for <= min (True) or < max (False) @@ -348,18 +360,17 @@ def assert_greater_than( Example: - # Validate that sepal_length is always >0 ( iris - .check.assert_greater_than(0, subset="sepal_length") - ) + # Validate that sepal_length is always greater than 0.1 + .check.assert_greater_than(0.1, subset="sepal_length") - # Validate that two columns are each always >= 0.1 - ( - iris + # Validate that two columns are each always greater than or equal to 0.1 .check.assert_greater_than(0.1, subset=["sepal_length", "petal_length"], or_equal_to=True) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: min: the minimum value to compare DataFrame to. Accepts any type that can be used in >, such as int, float, str, datetime fail_message: Message to display if the condition fails. @@ -408,6 +419,8 @@ def assert_negative( .check.assert_negative(subset="column_name") ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -460,6 +473,8 @@ def assert_no_nulls( .check.assert_no_nulls(subset=["sepal_length"]) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -501,7 +516,9 @@ def assert_all_nulls( .check.assert_all_nulls(subset=["sepal_length"]) ) - # Will raise an exception, "ㄨ Assert all nulls failed" + # Will raise an exception "ㄨ Assert all nulls failed" + + # See docs for .check.assert_data() for examples of how to customize assertions Args: fail_message: Message to display if the condition fails. 
@@ -545,6 +562,8 @@ def assert_positive( .check.assert_positive(subset=["sepal_length"]) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -596,6 +615,8 @@ def assert_str( .check.assert_str(subset=["species", "another_string_column"]) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -636,6 +657,8 @@ def assert_timedelta( .check.assert_timedelta(subset=["timedelta_col"]) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -678,6 +701,8 @@ def assert_type( .check.assert_type(object, subset="column_with_mixed_types") ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: dtype: The required variable type fail_message: Message to display if the condition fails. @@ -732,18 +757,17 @@ def assert_unique( """Validates that a subset of columns have no duplicate values, or validates that a DataFrame has no duplicate rows. Optionally raises an exception. Does not modify the DataFrame itself. Example: - # Validate that a column has no duplicate values ( df + # Validate that a column has no duplicate values .check.assert_unique(subset="id_column") - ) - # Validate that a DataFrame has no duplicate rows - ( - df + # Validate that a DataFrame has no duplicate rows .check.assert_unique() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. 
diff --git a/pandas_checks/SeriesChecks.py b/pandas_checks/SeriesChecks.py index 783b5b8..e04fb27 100644 --- a/pandas_checks/SeriesChecks.py +++ b/pandas_checks/SeriesChecks.py @@ -65,7 +65,13 @@ def assert_data( ( iris ["sepal_length"] - .check.assert_data(lambda s: s.shape[0]>1, verbose=True) + .check.assert_data(lambda s: s.shape[0]>1) + + # Or customize the message displayed when the assertion fails + .check.assert_data(lambda s: s.shape[0]>1, "Assertion failed, Series has no rows!") + + # Or show a warning instead of raising an exception + .check.assert_data(lambda s: s.shape[0]>1, "FYI Series has no rows", raise_exception=False) ) Args: @@ -165,6 +171,8 @@ def assert_datetime( .check.assert_datetime() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -203,6 +211,8 @@ def assert_float( .check.assert_float() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -241,6 +251,8 @@ def assert_int( .check.assert_int() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -275,20 +287,19 @@ def assert_less_than( """Tests whether all values in Series are < or <= a maximum threshold. Optionally raises an exception. Does not modify the Series itself. 
Example: - # Validate that sepal_length is always < 1000 ( iris ["sepal_length"] + + # Validate that sepal_length is always < 1000 .check.assert_less_than(1000) - ) - # Validate that it's always <= 1000 - ( - iris - ["sepal_length"] + # Validate that it's always <= 1000 .check.assert_less_than(1000, or_equal_to=True) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: max: the max value to compare Series to. Accepts any type that can be used in <, such as int, float, str, datetime fail_message: Message to display if the condition fails. @@ -330,19 +341,14 @@ def assert_greater_than( """Tests whether Series is > or >= a minimum threshold. Optionally raises an exception. Does not modify the Series itself. Example: - # Validate that sepal_length is always >0 ( iris ["sepal_length"] - .check.assert_greater_than(0) + # Validate that the Series is always >= 0 + .check.assert_greater_than(0, or_equal_to=True) ) - # Validate that two columns are each always >= 0.1 - ( - iris - [["sepal_length", "petal_length"]] - .check.assert_greater_than(0.1, or_equal_to=True) - ) + # See docs for .check.assert_data() for examples of how to customize assertions Args: min: the minimum value to compare Series to. Accepts any type that can be used in >, such as int, float, str, datetime @@ -390,6 +396,8 @@ def assert_negative( .check.assert_negative() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -439,6 +447,8 @@ def assert_no_nulls( .check.assert_no_nulls() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. 
@@ -480,6 +490,8 @@ def assert_all_nulls( # Will raise an exception, "ㄨ Assert all nulls failed" + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -520,6 +532,8 @@ def assert_positive( .check.assert_positive() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -569,6 +583,8 @@ def assert_str( .check.assert_str() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -606,6 +622,8 @@ def assert_timedelta( .check.assert_timedelta(subset=["timedelta_col"]) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. @@ -646,6 +664,8 @@ def assert_type( .check.assert_type(object) ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: dtype: The required variable type fail_message: Message to display if the condition fails. @@ -694,6 +714,8 @@ def assert_unique( .check.assert_unique() ) + # See docs for .check.assert_data() for examples of how to customize assertions + Args: fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. 
From 95a27f18afe3241015f99b505096f52ba3b95aff Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Sat, 16 Nov 2024 08:34:53 -0600 Subject: [PATCH 7/9] Clarify default fail_message on assert type methods --- pandas_checks/DataFrameChecks.py | 8 ++++---- pandas_checks/SeriesChecks.py | 12 ++++++------ 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/pandas_checks/DataFrameChecks.py b/pandas_checks/DataFrameChecks.py index 3cae7ad..7abf86d 100644 --- a/pandas_checks/DataFrameChecks.py +++ b/pandas_checks/DataFrameChecks.py @@ -184,8 +184,8 @@ def assert_datetime( Args: subset: Optional, which column or columns to check the condition against. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. - fail_message: Message to display if the condition fails. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -618,7 +618,7 @@ def assert_str( # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. @@ -660,7 +660,7 @@ def assert_timedelta( # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. 
subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. @@ -705,7 +705,7 @@ def assert_type( Args: dtype: The required variable type - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. diff --git a/pandas_checks/SeriesChecks.py b/pandas_checks/SeriesChecks.py index e04fb27..5d97905 100644 --- a/pandas_checks/SeriesChecks.py +++ b/pandas_checks/SeriesChecks.py @@ -174,7 +174,7 @@ def assert_datetime( # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -214,7 +214,7 @@ def assert_float( # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -254,7 +254,7 @@ def assert_int( # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. 
+ fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -586,7 +586,7 @@ def assert_str( # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -625,7 +625,7 @@ def assert_timedelta( # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -668,7 +668,7 @@ def assert_type( Args: dtype: The required variable type - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. 
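Reviewer note on the `fail_message` wording above: the revised docstrings say that when `fail_message` is None, the type assertions report expected vs observed type. A small sketch of that fallback pattern is below. The helper name `type_fail_message` and the exact message text are assumptions made for illustration; the patch does not show how the library formats the message.

```python
from typing import Optional


def type_fail_message(
    expected: type,
    observed: type,
    fail_message: Optional[str] = None,
) -> str:
    """Hypothetical helper, not part of pandas_checks.

    Returns the caller's message when one is given; otherwise builds a
    default that names the expected and observed types.
    """
    if fail_message is not None:
        return fail_message
    # Assumed wording; the real library may phrase this differently.
    return f"Expected type {expected.__name__}, observed {observed.__name__}"


default_text = type_fail_message(int, str)
custom_text = type_fail_message(int, str, fail_message="Column must hold integers")
print(default_text)
print(custom_text)
```

Using None as the sentinel, rather than a fixed default string, is what lets the default message depend on the types involved in the specific failure.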
From fc3a72609a324c356fc7f6dfba3ad4aa4d715fb2 Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Sat, 16 Nov 2024 08:37:01 -0600 Subject: [PATCH 8/9] Change default of assert_greater_than and assert_less_than of or_equal_to: False --- pandas_checks/DataFrameChecks.py | 6 +++--- pandas_checks/SeriesChecks.py | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/pandas_checks/DataFrameChecks.py b/pandas_checks/DataFrameChecks.py index 7abf86d..1c219e7 100644 --- a/pandas_checks/DataFrameChecks.py +++ b/pandas_checks/DataFrameChecks.py @@ -294,7 +294,7 @@ def assert_less_than( max: Any, fail_message: str = " ㄨ Assert maximum failed ", pass_message: str = " ✔️ Assert maximum passed ", - or_equal_to: bool = True, + or_equal_to: bool = False, subset: Union[str, List, None] = None, raise_exception: bool = True, exception_to_raise: Type[BaseException] = DataError, @@ -317,7 +317,7 @@ def assert_less_than( Args: max: the max value to compare DataFrame to. Accepts any type that can be used in <, such as int, float, str, datetime - or_equal_to: whether to test for <= min (True) or < max (False) + or_equal_to: whether to test for <= max (True) or < max (False) fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. subset: Optional, which column or columns to check the condition against. 
@@ -350,7 +350,7 @@ def assert_greater_than( min: Any, fail_message: str = " ㄨ Assert minimum failed ", pass_message: str = " ✔️ Assert minimum passed ", - or_equal_to: bool = True, + or_equal_to: bool = False, subset: Union[str, List, None] = None, raise_exception: bool = True, exception_to_raise: Type[BaseException] = DataError, diff --git a/pandas_checks/SeriesChecks.py b/pandas_checks/SeriesChecks.py index 5d97905..42ace53 100644 --- a/pandas_checks/SeriesChecks.py +++ b/pandas_checks/SeriesChecks.py @@ -279,7 +279,7 @@ def assert_less_than( max: Any, fail_message: str = " ㄨ Assert maximum failed ", pass_message: str = " ✔️ Assert maximum passed ", - or_equal_to: bool = True, + or_equal_to: bool = False, raise_exception: bool = True, exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, @@ -304,7 +304,7 @@ def assert_less_than( max: the max value to compare Series to. Accepts any type that can be used in <, such as int, float, str, datetime fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - or_equal_to: whether to test for <= min (True) or < max (False) + or_equal_to: whether to test for <= max (True) or < max (False) raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. 
@@ -333,7 +333,7 @@ def assert_greater_than( min: Any, fail_message: str = " ㄨ Assert minimum failed ", pass_message: str = " ✔️ Assert minimum passed ", - or_equal_to: bool = True, + or_equal_to: bool = False, raise_exception: bool = True, exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, From 6fff0c03ebc27b74ee56d415a9f64754a57b2287 Mon Sep 17 00:00:00 2001 From: Chad Parmet Date: Sat, 16 Nov 2024 08:41:11 -0600 Subject: [PATCH 9/9] Alphabetize assert methods --- pandas_checks/DataFrameChecks.py | 184 +++++++++++++++---------------- pandas_checks/SeriesChecks.py | 182 +++++++++++++++--------------- 2 files changed, 183 insertions(+), 183 deletions(-) diff --git a/pandas_checks/DataFrameChecks.py b/pandas_checks/DataFrameChecks.py index 1c219e7..482c5ba 100644 --- a/pandas_checks/DataFrameChecks.py +++ b/pandas_checks/DataFrameChecks.py @@ -52,6 +52,51 @@ class DataFrameChecks: def __init__(self, pandas_obj: Union[pd.DataFrame, pd.Series]) -> None: self._obj = pandas_obj + def assert_all_nulls( + self, + fail_message: str = " ㄨ Assert all nulls failed ", + pass_message: str = " ✔️ Assert all nulls passed ", + subset: Union[str, List, None] = None, + raise_exception: bool = True, + exception_to_raise: Type[BaseException] = DataError, + verbose: bool = False, + ) -> pd.DataFrame: + """Tests whether Dataframe or subset of columns has all nulls. Optionally raises an exception. Does not modify the DataFrame itself. + + Example: + ( + iris + .check.assert_all_nulls(subset=["sepal_length"]) + ) + + # Will raise an exception "ㄨ Assert all nulls failed" + + # See docs for .check.assert_data() for examples of how to customize assertions + + Args: + fail_message: Message to display if the condition fails. + pass_message: Message to display if the condition passes. + subset: Optional, which column or columns to check the condition against. + raise_exception: Whether to raise an exception if the condition fails. 
+ exception_to_raise: The exception to raise if the condition fails and raise_exception is True. + verbose: Whether to display the pass message if the condition passes. + + Returns: + The original DataFrame, unchanged. + """ + + self._obj.check.assert_data( + condition=lambda df: df.isna().all().all(), + fail_message=fail_message, + pass_message=pass_message, + subset=subset, + raise_exception=raise_exception, + exception_to_raise=exception_to_raise, + message_shows_condition=False, + verbose=verbose, + ) + return self._obj + def assert_data( self, condition: Callable, @@ -247,28 +292,37 @@ def assert_float( ) return self._obj - def assert_int( + def assert_greater_than( self, - fail_message: Union[str, None] = None, - pass_message: str = " ✔️ Assert integeer passed ", + min: Any, + fail_message: str = " ㄨ Assert minimum failed ", + pass_message: str = " ✔️ Assert minimum passed ", + or_equal_to: bool = False, subset: Union[str, List, None] = None, raise_exception: bool = True, - exception_to_raise: Type[BaseException] = TypeError, + exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.DataFrame: - """Tests whether Dataframe or subset of columns is integers. Optionally raises an exception. Does not modify the DataFrame itself. + """Tests whether all values in a Dataframe or subset of columns is > or >= a minimum threshold. Optionally raises an exception. Does not modify the DataFrame itself. + Example: ( - df - .check.assert_int(subset="int_col") + iris + # Validate that sepal_length is always greater than 0.1 + .check.assert_greater_than(0.1, subset="sepal_length") + + # Validate that two columns are each always greater than or equal to 0.1 + .check.assert_greater_than(0.1, subset=["sepal_length", "petal_length"], or_equal_to=True) ) # See docs for .check.assert_data() for examples of how to customize assertions Args: + min: the minimum value to compare DataFrame to. 
Accepts any type that can be used in >, such as int, float, str, datetime fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. + or_equal_to: whether to test for >= min (True) or > min (False) subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -277,47 +331,43 @@ def assert_int( Returns: The original DataFrame, unchanged. """ + if or_equal_to: + min_fn = lambda df: (df >= min).all().all() + else: + min_fn = lambda df: (df > min).all().all() - self._obj.check.assert_type( - dtype=int, + self._obj.check.assert_data( + condition=min_fn, fail_message=fail_message, pass_message=pass_message, subset=subset, raise_exception=raise_exception, exception_to_raise=exception_to_raise, + message_shows_condition=False, verbose=verbose, ) return self._obj - def assert_less_than( + def assert_int( self, - max: Any, - fail_message: str = " ㄨ Assert maximum failed ", - pass_message: str = " ✔️ Assert maximum passed ", - or_equal_to: bool = False, + fail_message: Union[str, None] = None, + pass_message: str = " ✔️ Assert integeer passed ", subset: Union[str, List, None] = None, raise_exception: bool = True, - exception_to_raise: Type[BaseException] = DataError, + exception_to_raise: Type[BaseException] = TypeError, verbose: bool = False, ) -> pd.DataFrame: - """Tests whether all values in a Dataframe or subset of columns is < or <= a maximum threshold. Optionally raises an exception. Does not modify the DataFrame itself. + """Tests whether Dataframe or subset of columns is integers. Optionally raises an exception. Does not modify the DataFrame itself. 
Example: ( - iris - - # Validate that sepal_length is always < 1000 - .check.assert_less_than(1000, subset="sepal_length") - - # Validate that two columns are each always less than or equal too 100 - .check.assert_less_than(1000, subset=["sepal_length", "petal_length"], or_equal_to=True) + df + .check.assert_int(subset="int_col") ) # See docs for .check.assert_data() for examples of how to customize assertions Args: - max: the max value to compare DataFrame to. Accepts any type that can be used in <, such as int, float, str, datetime - or_equal_to: whether to test for <= max (True) or < max (False) fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. subset: Optional, which column or columns to check the condition against. @@ -328,54 +378,49 @@ def assert_less_than( Returns: The original DataFrame, unchanged. """ - if or_equal_to: - max_fn = lambda df: (df <= max).all().all() - else: - max_fn = lambda df: (df < max).all().all() - self._obj.check.assert_data( - condition=max_fn, + self._obj.check.assert_type( + dtype=int, fail_message=fail_message, pass_message=pass_message, subset=subset, raise_exception=raise_exception, exception_to_raise=exception_to_raise, - message_shows_condition=False, verbose=verbose, ) return self._obj - def assert_greater_than( + def assert_less_than( self, - min: Any, - fail_message: str = " ㄨ Assert minimum failed ", - pass_message: str = " ✔️ Assert minimum passed ", + max: Any, + fail_message: str = " ㄨ Assert maximum failed ", + pass_message: str = " ✔️ Assert maximum passed ", or_equal_to: bool = False, subset: Union[str, List, None] = None, raise_exception: bool = True, exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.DataFrame: - """Tests whether all values in a Dataframe or subset of columns is > or >= a minimum threshold. Optionally raises an exception. Does not modify the DataFrame itself. 
- + """Tests whether all values in a Dataframe or subset of columns is < or <= a maximum threshold. Optionally raises an exception. Does not modify the DataFrame itself. Example: ( iris - # Validate that sepal_length is always greater than 0.1 - .check.assert_greater_than(0.1, subset="sepal_length") - # Validate that two columns are each always greater than or equal to 0.1 - .check.assert_greater_than(0.1, subset=["sepal_length", "petal_length"], or_equal_to=True) + # Validate that sepal_length is always < 1000 + .check.assert_less_than(1000, subset="sepal_length") + + # Validate that two columns are each always less than or equal too 100 + .check.assert_less_than(1000, subset=["sepal_length", "petal_length"], or_equal_to=True) ) # See docs for .check.assert_data() for examples of how to customize assertions Args: - min: the minimum value to compare DataFrame to. Accepts any type that can be used in >, such as int, float, str, datetime + max: the max value to compare DataFrame to. Accepts any type that can be used in <, such as int, float, str, datetime + or_equal_to: whether to test for <= max (True) or < max (False) fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - or_equal_to: whether to test for >= min (True) or > min (False) subset: Optional, which column or columns to check the condition against. raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. @@ -385,12 +430,12 @@ def assert_greater_than( The original DataFrame, unchanged. 
""" if or_equal_to: - min_fn = lambda df: (df >= min).all().all() + max_fn = lambda df: (df <= max).all().all() else: - min_fn = lambda df: (df > min).all().all() + max_fn = lambda df: (df < max).all().all() self._obj.check.assert_data( - condition=min_fn, + condition=max_fn, fail_message=fail_message, pass_message=pass_message, subset=subset, @@ -499,51 +544,6 @@ def assert_no_nulls( ) return self._obj - def assert_all_nulls( - self, - fail_message: str = " ㄨ Assert all nulls failed ", - pass_message: str = " ✔️ Assert all nulls passed ", - subset: Union[str, List, None] = None, - raise_exception: bool = True, - exception_to_raise: Type[BaseException] = DataError, - verbose: bool = False, - ) -> pd.DataFrame: - """Tests whether Dataframe or subset of columns has all nulls. Optionally raises an exception. Does not modify the DataFrame itself. - - Example: - ( - iris - .check.assert_all_nulls(subset=["sepal_length"]) - ) - - # Will raise an exception "ㄨ Assert all nulls failed" - - # See docs for .check.assert_data() for examples of how to customize assertions - - Args: - fail_message: Message to display if the condition fails. - pass_message: Message to display if the condition passes. - subset: Optional, which column or columns to check the condition against. - raise_exception: Whether to raise an exception if the condition fails. - exception_to_raise: The exception to raise if the condition fails and raise_exception is True. - verbose: Whether to display the pass message if the condition passes. - - Returns: - The original DataFrame, unchanged. 
- """ - - self._obj.check.assert_data( - condition=lambda df: df.isna().all().all(), - fail_message=fail_message, - pass_message=pass_message, - subset=subset, - raise_exception=raise_exception, - exception_to_raise=exception_to_raise, - message_shows_condition=False, - verbose=verbose, - ) - return self._obj - def assert_positive( self, fail_message: str = " ㄨ Assert positive failed ", diff --git a/pandas_checks/SeriesChecks.py b/pandas_checks/SeriesChecks.py index 42ace53..d6f184d 100644 --- a/pandas_checks/SeriesChecks.py +++ b/pandas_checks/SeriesChecks.py @@ -47,6 +47,49 @@ class SeriesChecks: def __init__(self, pandas_obj: pd.Series) -> None: self._obj = pandas_obj + def assert_all_nulls( + self, + fail_message: str = " ㄨ Assert all nulls failed ", + pass_message: str = " ✔️ Assert all nulls passed ", + raise_exception: bool = True, + exception_to_raise: Type[BaseException] = DataError, + verbose: bool = False, + ) -> pd.Series: + """Tests whether Series has all nulls. Optionally raises an exception. Does not modify the Series itself. + + Example: + ( + iris + ["sepal_length"] + .check.assert_all_nulls() + ) + + # Will raise an exception, "ㄨ Assert all nulls failed" + + # See docs for .check.assert_data() for examples of how to customize assertions + + Args: + fail_message: Message to display if the condition fails. + pass_message: Message to display if the condition passes. + raise_exception: Whether to raise an exception if the condition fails. + exception_to_raise: The exception to raise if the condition fails and raise_exception is True. + verbose: Whether to display the pass message if the condition passes. + + Returns: + The original Series, unchanged. 
+ """ + + self._obj.check.assert_data( + condition=lambda s: s.isna().all().all(), + fail_message=fail_message, + pass_message=pass_message, + raise_exception=raise_exception, + exception_to_raise=exception_to_raise, + message_shows_condition=False, + verbose=verbose, + ) + return self._obj + def assert_data( self, condition: Callable, @@ -234,28 +277,33 @@ def assert_float( ) return self._obj - def assert_int( + def assert_greater_than( self, - fail_message: Union[str, None] = None, - pass_message: str = " ✔️ Assert integeer passed ", + min: Any, + fail_message: str = " ㄨ Assert minimum failed ", + pass_message: str = " ✔️ Assert minimum passed ", + or_equal_to: bool = False, raise_exception: bool = True, - exception_to_raise: Type[BaseException] = TypeError, + exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.Series: - """Tests whether Series is integers. Optionally raises an exception. Does not modify the Series itself. + """Tests whether Series is > or >= a minimum threshold. Optionally raises an exception. Does not modify the Series itself. Example: ( - df - ["int_col"] - .check.assert_int() + iris + ["sepal_length"] + # Validate that the Series is always >= 0 + .check.assert_greater_than(0, or_equal_to=True) ) # See docs for .check.assert_data() for examples of how to customize assertions Args: - fail_message: Message to display if the condition fails. If None, will report expected vs observed type. + min: the minimum value to compare Series to. Accepts any type that can be used in >, such as int, float, str, datetime + fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. + or_equal_to: whether to test for >= min (True) or > min (False) raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. 
verbose: Whether to display the pass message if the condition passes. @@ -263,48 +311,44 @@ def assert_int( Returns: The original Series, unchanged. """ + if or_equal_to: + min_fn = lambda s: (s >= min).all().all() + else: + min_fn = lambda s: (s > min).all().all() - self._obj.check.assert_type( - dtype=int, + self._obj.check.assert_data( + condition=min_fn, fail_message=fail_message, pass_message=pass_message, raise_exception=raise_exception, exception_to_raise=exception_to_raise, + message_shows_condition=False, verbose=verbose, ) return self._obj - def assert_less_than( + def assert_int( self, - max: Any, - fail_message: str = " ㄨ Assert maximum failed ", - pass_message: str = " ✔️ Assert maximum passed ", - or_equal_to: bool = False, + fail_message: Union[str, None] = None, + pass_message: str = " ✔️ Assert integeer passed ", raise_exception: bool = True, - exception_to_raise: Type[BaseException] = DataError, + exception_to_raise: Type[BaseException] = TypeError, verbose: bool = False, ) -> pd.Series: - """Tests whether all values in Series are < or <= a maximum threshold. Optionally raises an exception. Does not modify the Series itself. + """Tests whether Series is integers. Optionally raises an exception. Does not modify the Series itself. Example: ( - iris - ["sepal_length"] - - # Validate that sepal_length is always < 1000 - .check.assert_less_than(1000) - - # Validate that it's always <= 1000 - .check.assert_less_than(1000, or_equal_to=True) + df + ["int_col"] + .check.assert_int() ) # See docs for .check.assert_data() for examples of how to customize assertions Args: - max: the max value to compare Series to. Accepts any type that can be used in <, such as int, float, str, datetime - fail_message: Message to display if the condition fails. + fail_message: Message to display if the condition fails. If None, will report expected vs observed type. pass_message: Message to display if the condition passes. 
- or_equal_to: whether to test for <= max (True) or < max (False) raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -312,49 +356,48 @@ def assert_less_than( Returns: The original Series, unchanged. """ - if or_equal_to: - max_fn = lambda s: (s <= max).all().all() - else: - max_fn = lambda s: (s < max).all().all() - self._obj.check.assert_data( - condition=max_fn, + self._obj.check.assert_type( + dtype=int, fail_message=fail_message, pass_message=pass_message, raise_exception=raise_exception, exception_to_raise=exception_to_raise, - message_shows_condition=False, verbose=verbose, ) return self._obj - def assert_greater_than( + def assert_less_than( self, - min: Any, - fail_message: str = " ㄨ Assert minimum failed ", - pass_message: str = " ✔️ Assert minimum passed ", + max: Any, + fail_message: str = " ㄨ Assert maximum failed ", + pass_message: str = " ✔️ Assert maximum passed ", or_equal_to: bool = False, raise_exception: bool = True, exception_to_raise: Type[BaseException] = DataError, verbose: bool = False, ) -> pd.Series: - """Tests whether Series is > or >= a minimum threshold. Optionally raises an exception. Does not modify the Series itself. + """Tests whether all values in Series are < or <= a maximum threshold. Optionally raises an exception. Does not modify the Series itself. Example: ( iris ["sepal_length"] - # Validate that the Series is always >= 0 - .check.assert_greater_than(0, or_equal_to=True) + + # Validate that sepal_length is always < 1000 + .check.assert_less_than(1000) + + # Validate that it's always <= 1000 + .check.assert_less_than(1000, or_equal_to=True) ) # See docs for .check.assert_data() for examples of how to customize assertions Args: - min: the minimum value to compare Series to. 
Accepts any type that can be used in >, such as int, float, str, datetime + max: the max value to compare Series to. Accepts any type that can be used in <, such as int, float, str, datetime fail_message: Message to display if the condition fails. pass_message: Message to display if the condition passes. - or_equal_to: whether to test for >= min (True) or > min (False) + or_equal_to: whether to test for <= max (True) or < max (False) raise_exception: Whether to raise an exception if the condition fails. exception_to_raise: The exception to raise if the condition fails and raise_exception is True. verbose: Whether to display the pass message if the condition passes. @@ -363,12 +406,12 @@ def assert_greater_than( The original Series, unchanged. """ if or_equal_to: - min_fn = lambda s: (s >= min).all().all() + max_fn = lambda s: (s <= max).all().all() else: - min_fn = lambda s: (s > min).all().all() + max_fn = lambda s: (s < max).all().all() self._obj.check.assert_data( - condition=min_fn, + condition=max_fn, fail_message=fail_message, pass_message=pass_message, raise_exception=raise_exception, @@ -471,49 +514,6 @@ def assert_no_nulls( ) return self._obj - def assert_all_nulls( - self, - fail_message: str = " ㄨ Assert all nulls failed ", - pass_message: str = " ✔️ Assert all nulls passed ", - raise_exception: bool = True, - exception_to_raise: Type[BaseException] = DataError, - verbose: bool = False, - ) -> pd.Series: - """Tests whether Series has all nulls. Optionally raises an exception. Does not modify the Series itself. - - Example: - ( - iris - ["sepal_length"] - .check.assert_all_nulls() - ) - - # Will raise an exception, "ㄨ Assert all nulls failed" - - # See docs for .check.assert_data() for examples of how to customize assertions - - Args: - fail_message: Message to display if the condition fails. - pass_message: Message to display if the condition passes. - raise_exception: Whether to raise an exception if the condition fails. 
- exception_to_raise: The exception to raise if the condition fails and raise_exception is True. - verbose: Whether to display the pass message if the condition passes. - - Returns: - The original Series, unchanged. - """ - - self._obj.check.assert_data( - condition=lambda s: s.isna().all().all(), - fail_message=fail_message, - pass_message=pass_message, - raise_exception=raise_exception, - exception_to_raise=exception_to_raise, - message_shows_condition=False, - verbose=verbose, - ) - return self._obj - def assert_positive( self, fail_message: str = " ㄨ Assert positive failed ",