docs: add post on use of Great Tables in Pointblank library #595
@@ -0,0 +1,127 @@
---
title: "How We Used Great Tables to Supercharge Reporting in Pointblank"
html-table-processing: none
author: Rich Iannone
date: 2025-02-03
freeze: true
jupyter: python3
---

The Great Tables package lets you make tables that look great in a report, a book, or a web page. The API is meant to be easy to work with, so DataFrames can be turned into publication-quality tables without a lot of hassle. And having nice-looking tables in the mix elevates the quality of the medium you're working in.

We were inspired by this and decided to explore what it could mean to introduce a package where reporting is largely in the form of beautiful tables. To this end, we started work on a new Python package that generates tables (c/o Great Tables) as reporting objects. This package is called [Pointblank](https://github.com/posit-dev/pointblank). Its focus is data validation, and the reporting tables it produces inform users about the results of a data validation workflow. In this post we'll go through how Pointblank:

- enables you to validate many types of DataFrames and SQL databases
- provides easy-to-understand validation result tables and thorough drilldowns
- gives you nice previews of data tables across a range of backends

### Validating data with Pointblank

Just like Great Tables, Pointblank takes a table as its primary input; its goal is to perform checks on that tabular data. Other libraries in this domain include [Great Expectations](https://github.com/great-expectations/great_expectations), [pandera](https://github.com/unionai-oss/pandera), [Soda](https://github.com/sodadata/soda-core?tab=readme-ov-file), and [PyDeequ](https://github.com/awslabs/python-deequ). Let's look at the main reporting table that users are likely to see quite often.

Prep'ing people for what they'll be seeing in example
Suggested change
Thanks! Adding this in.

```{python}
#| code-fold: true
#| code-summary: "Show the code"

import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="An example validation",
        thresholds=(0.1, 0.2, 0.5),
    )
    .col_vals_gt(columns="d", value=1000)
    .col_vals_le(columns="c", value=5)
    .col_exists(columns=["date", "date_time"])
    .interrogate()
)

validation
```

The table is chock full of the information you need when doing data validation tasks. And it's also easy on the eyes. Some cool features include:
Directed attention at example
Suggested change
Thanks! This will be added in.

1. a header with information on the type of input table plus important validation options
2. vertical color strips on the left side to indicate overall status of the rows
3. icons in several columns (space saving and they let you know what's up)
4. 'CSV' buttons that, when clicked, provide you with a CSV file
5. a footer with timing information for the analysis

It's a nice table, and it scales well to the large variety of validation types and options available in the Pointblank library. Viewing this table is a central part of using that library. The great thing about the reporting being a table like this is that it can be shared by placing it in a publication environment of your choosing (for example, a Quarto document).

Here is the code that was used to generate the data validation above:

```{python}
#| eval: false

import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="An example validation",
        thresholds=(0.1, 0.2, 0.5),
    )
    .col_vals_gt(columns="d", value=1000)
    .col_vals_le(columns="c", value=5)
    .col_exists(columns=["date", "date_time"])
    .interrogate()
)

validation
```

Pointblank makes it easy to get started by giving you a simple entry point (`Validate()`), allowing you to define as many validation steps as needed.
Suggested change
This is good. Using it!

Pointblank enables you to validate many types of DataFrames and SQL databases. Pandas and Polars are supported through Narwhals, and numerous backends (like DuckDB and MySQL) are also supported through our Ibis integration.
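
To make the idea of a validation step more concrete, here is a minimal pure-Python sketch of what a check like `col_vals_gt(columns="d", value=1000)` has to work out: each row is treated as one test unit, and the fraction of failing units is compared against the values passed to `thresholds=`. This is not Pointblank's implementation. The sample values, the names `StepResult`, `check_gt`, and `threshold_level`, the treatment of missing values as failures, and the reading of the three thresholds as increasingly severe levels reached at or above each fraction are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    n_units: int   # number of rows checked (one test unit per row)
    n_failed: int  # rows that did not satisfy the condition

    @property
    def f_failed(self) -> float:
        # Fraction of failing test units; 0.0 for an empty column
        return self.n_failed / self.n_units if self.n_units else 0.0


def check_gt(values, value):
    """Count entries that fail `entry > value` (None counts as a failure here)."""
    n_failed = sum(1 for v in values if v is None or not v > value)
    return StepResult(n_units=len(values), n_failed=n_failed)


def threshold_level(result, thresholds):
    """Return the index of the most severe threshold reached, or None."""
    level = None
    for i, limit in enumerate(thresholds):
        if result.f_failed >= limit:
            level = i
    return level


# Hypothetical column values (not the actual `small_table` data)
d = [3423.29, 9999.99, 2343.23, 108.34, 833.98, 1035.64]

result = check_gt(d, value=1000)
print(result.n_failed, result.n_units, round(result.f_failed, 3))
print(threshold_level(result, thresholds=(0.1, 0.2, 0.5)))
```

In this sketch two of the six values fail, so the second threshold (index 1) is reached but the third is not. The real library performs the equivalent work on the backend that holds the data and records the outcome for the report table.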

### Exploring data validation failures

Note that the above validation showed 6 failures in the first step. You might want to know exactly *what* failed, giving you a chance to fix the underlying data quality issues. To do that, you can use the `get_step_report()` method:
Suggested change
Much clearer! Adding it in.

```{python}
validation.get_step_report(i=1)
```

The use of a table for reporting is ideal here! The main features of this step report table include:

1. a header with summarized information
2. the selected rows that contain the failures
3. a highlighted column of interest

Different types of validation methods will have step report tables that organize the pertinent information in a way that makes sense for the validation performed.
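
The essence of such a step report can be sketched in plain Python: keep only the rows that failed, remember their original row numbers, and note which column was checked. The sample rows and the helper name `failing_rows` are hypothetical, and this is not how Pointblank builds its report.

```python
def failing_rows(rows, column, predicate):
    """Return (row_number, row) pairs for rows where `predicate` is false.

    Row numbers start at 1 so they match how a reader would count rows.
    """
    return [
        (number, row)
        for number, row in enumerate(rows, start=1)
        if not predicate(row[column])
    ]


# Hypothetical rows (illustrative only)
rows = [
    {"a": 2, "c": 3, "d": 3423.29},
    {"a": 3, "c": 8, "d": 9999.99},
    {"a": 6, "c": 3, "d": 108.34},
    {"a": 2, "c": 7, "d": 833.98},
]

# Same condition as the first validation step: values in `d` must exceed 1000
extract = failing_rows(rows, column="d", predicate=lambda v: v > 1000)

print("column of interest: d")
for number, row in extract:
    print(number, row)
```

Keeping the original row numbers alongside the failing rows is what lets a reader trace each failure back to the source table.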

### Previewing datasets across backends

Because Pointblank supports many backends, each with its own way of displaying the underlying data, we provide the `preview()` function. With it you can get a beautiful and consistent view of any data table. Here is how it looks against a 2,000-row DuckDB table that's included in the package (`game_revenue`):
Tweaked a bit to clarify that backends vary (not pointblank), which motivates
Suggested change
Nice! Definitely adding this in.

```{python}
#| code-fold: true
#| code-summary: "Show the code"

pb.preview(pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"))
```

The `preview()` function was built with a few design goals in mind:
Directed people's attention at example:
Suggested change
This is great! Will add it in.

- get the dimensions of the table and display them prominently in the header
- provide the column names and the column types
- have a consistent line height along with a sensible limit to the column width
- use a monospaced typeface having high legibility
- work for all sorts of tables!
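
As a rough illustration of these goals, the sketch below builds a plain-text preview: dimensions first, then column names and types, then a few head and tail rows with every cell clipped and padded to a fixed width. It is only a toy under stated assumptions. The function `preview_text`, its parameters (`n_head`, `n_tail`, `max_width`), and the sample data are hypothetical and do not reflect how `preview()` is implemented or what arguments it takes.

```python
def preview_text(columns, dtypes, rows, n_head=2, n_tail=2, max_width=12):
    """Return a plain-text preview of a table given as a list of row tuples."""

    def clip(value):
        # Cap the cell width so long values cannot stretch a column
        text = str(value)
        return text if len(text) <= max_width else text[: max_width - 3] + "..."

    def line(cells):
        # Pad every cell to the same width, as a monospaced layout would
        return " | ".join(clip(c).ljust(max_width) for c in cells)

    # Dimensions go first so they are the most prominent piece of information
    out = [f"Rows: {len(rows)}  Columns: {len(columns)}", line(columns), line(dtypes)]

    if len(rows) <= n_head + n_tail:
        out += [line(r) for r in rows]
    else:
        out += [line(r) for r in rows[:n_head]]
        out.append("(rows omitted)")
        out += [line(r) for r in rows[len(rows) - n_tail:]]
    return "\n".join(out)


# Hypothetical data, loosely modeled on a game revenue table
columns = ["player_id", "item_type", "item_revenue"]
dtypes = ["string", "string", "float64"]
rows = [("ECPANOIXLZHF896", "iap", 0.5 * i) for i in range(1, 7)]

text = preview_text(columns, dtypes, rows)
print(text)
```

Because the function only needs column names, type labels, and rows, the same layout can be produced no matter which backend the data came from, which is the point of having a single preview function.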

This is a nice drop-in replacement for looking at DataFrames or Ibis tables (the types of tables that Pointblank can work with). If you were to inspect the DuckDB table materialized by `pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")` without `preview()` you'd get this:

```{python}
#| code-fold: true
#| code-summary: "Show the code"

pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")
```

Which is not nearly as good.

### In closing

We hope this post is a good introduction to Pointblank and that it provides some insight into how Great Tables makes sense for reporting in a different library. If you'd like to learn more about Pointblank, please visit the [project website](https://posit-dev.github.io/pointblank/) and check out the many [examples](https://posit-dev.github.io/pointblank/demos/).
@@ -63,7 +63,7 @@ extra = [

 dev = [
     "great_tables[dev-no-pandas]",
-    "pandas"
+    "pandas",
 ]

 dev-no-pandas = [
I think we can remove this now