---
title: "How We Used Great Tables to Supercharge Reporting in Pointblank"
html-table-processing: none
author: Rich Iannone
date: 2025-02-03
freeze: true
jupyter: python3
---

The Great Tables package allows you to make tables, and they're really great when part of a report, a book, or a web page. The API is meant to be easy to work with, so DataFrames can be made into publication-quality tables without a lot of hassle. And having nice-looking tables in the mix elevates the quality of the medium you're working in.

To go a bit further, we are working on a new Python package that generates tables (c/o Great Tables) as reporting objects. This package is called [Pointblank](https://github.com/posit-dev/pointblank); its focus is data validation, and the reporting tables it produces inform users of the results of a data validation workflow. In this post we'll highlight the tables it can make and, in doing so, convince you that such outputs can be useful and worth the effort on the part of the maintainer.

### The table report for a data validation

Just like with Great Tables, Pointblank's primary input is a table, and the library's goal is to perform checks on that tabular data. Other libraries in this domain include [Great Expectations](https://github.com/great-expectations/great_expectations), [pandera](https://github.com/unionai-oss/pandera), [Soda](https://github.com/sodadata/soda-core?tab=readme-ov-file), and [PyDeequ](https://github.com/awslabs/python-deequ). Let's look at the main reporting table that users are likely to see quite often.

```{python}
# | code-fold: true
# | code-summary: "Show the code"

import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="An example validation",
        thresholds=(0.1, 0.2, 0.5)
    )
    .col_vals_gt(columns="d", value=1000)
    .col_vals_le(columns="c", value=5)
    .col_exists(columns=["date", "date_time"])
    .interrogate()
)

validation
```

The table is chock-full of the information you need when doing data validation tasks. And it's also easy on the eyes. Some cool features include:

1. a header with information on the type of input table plus important validation options
2. vertical color strips on the left side to indicate the overall status of each row
3. icons in several columns (they save space and let you know what's up)
4. 'CSV' buttons that, when clicked, provide you with a CSV file
5. a footer with timing information for the analysis

It's a nice table, and it scales well to the large variety of validation types and options available in the Pointblank library. Viewing this table is a central part of using the library, and the great thing about the reporting being a table like this is that it can be shared by placing it in a publication environment of your choosing (for example, it could be put in a Quarto document).

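Because the report is built with Great Tables, it can also be exported anywhere HTML goes. Here's a minimal sketch of writing it out as a standalone HTML file; it assumes the underlying GT object is exposed through a `get_tabular_report()` method (check the Pointblank docs for the exact name) and uses the `as_raw_html()` method from Great Tables:

```python
# A sketch (not run here): export the validation report as standalone HTML;
# `get_tabular_report()` is assumed to return a GT object
report = validation.get_tabular_report()

# `as_raw_html()` is a Great Tables method that renders the table to an HTML string
with open("validation_report.html", "w") as f:
    f.write(report.as_raw_html())
```
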
We didn't stop there, however: we went ahead and made it possible to view other artifacts as tables.

### Preview of a dataset

Because Pointblank allows for the collection of data extracts (subsets of the target table where data quality issues were encountered), we found it useful to have a function, `preview()`, that provides a consistent view of tabular data. It also works with any type of table that Pointblank supports (and that's a lot of them). Here is how that looks with the 2,000-row `game_revenue` dataset included in the package:

```{python}
# | code-fold: true
# | code-summary: "Show the code"

pb.preview(pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"))
```

We had a few design goals in mind for the `preview()` function (a small usage sketch follows this list):

- get the dimensions of the table and display them prominently in the header
- provide the column names and the column types
- have a consistent line height along with a sensible limit on column width
- use a monospaced typeface with high legibility
- work for all sorts of tables!

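The amount of data shown is also meant to be tunable. Here's a small sketch, assuming `preview()` takes `n_head=` and `n_tail=` arguments for controlling how many rows are shown from the top and bottom of the table:

```python
# A sketch: show three rows from the top and three from the bottom
# (`n_head=` and `n_tail=` are assumed parameter names)
pb.preview(
    pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
    n_head=3,
    n_tail=3,
)
```
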
This is a nice drop-in replacement for looking at DataFrames or Ibis tables (the types of tables that Pointblank can work with). If you were to inspect the DuckDB table materialized by `pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")` without `preview()`, you'd get this:

```{python}
# | code-fold: true
# | code-summary: "Show the code"

pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")
```

Which is not nearly as good.

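And since `preview()` accepts any table Pointblank supports, it pairs naturally with the data extracts mentioned earlier. Here's a sketch, assuming `get_data_extracts()` returns the failing rows for a given step when called with `frame=True`:

```python
# A sketch: pull the rows that failed step 1 of the earlier validation
# (`frame=True` is assumed to return the extract as a DataFrame)
failing_rows = validation.get_data_extracts(i=1, frame=True)

# The extract is itself a table, so it can be viewed with `preview()`
pb.preview(failing_rows)
```
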
### Explaining the result of a particular validation step, with a table!

We were clearly excited about the possibilities of using Great Tables within Pointblank, because we did even more. Data validations are performed as distinct steps (e.g., a step could check that all values in a specific column are greater than a fixed value), and while you get a count of atomic successes and failures in each step, it's better to see exactly what failed. This is all in the service of helping the user get to the root causes of a data quality issue. So, we have a method called `get_step_report()` that gives you a custom view of the failures on a stepwise basis. Of course, it uses a table to get the job done.

Let's look at an example where we check a table against an expected schema. It turns out the schema expectation doesn't match the actual table's schema, and the report for this step shows which elements don't match up:

```{python}
# | code-fold: true
# | code-summary: "Show the code"

# Create a schema for the target table (`small_table` as a DuckDB table)
schema = pb.Schema(
    columns=[
        ("date_time", "timestamp"),
        ("dates", "date"),
        ("a", "int64"),
        ("b",),
        ("c",),
        ("d", "float64"),
        ("e", ["bool", "boolean"]),
        ("f", "str"),
    ]
)

# Use the `col_schema_match()` validation method to perform a schema check
validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="duckdb"))
    .col_schema_match(schema=schema)
    .interrogate()
)

validation.get_step_report(i=1)
```

With just a basic accounting of the steps, you'd see a failure and be left wondering what went wrong. The tabular step report above serves to reveal the issues in an easy-to-understand manner.

The use of a table is so ideal here! On the left are the column names and data types of the target table. On the right are the elements of the expected schema. We can very quickly see three places where the expectation doesn't match the actual table:

1. the dtype for the first column, `date_time`, is incorrect
2. the name of the second column, `date`, is misspelled (as `"dates"`)
3. the dtype for the last column is incorrect (`"str"` instead of `"string"`)

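One way to sidestep mistakes like these is to capture the expected schema from a table you trust instead of writing it out by hand. Here's a sketch that assumes `pb.Schema` accepts a table through a `tbl=` argument:

```python
# A sketch: derive the schema directly from the target table
# (the `tbl=` argument to `Schema` is an assumption; check the docs)
actual_schema = pb.Schema(
    tbl=pb.load_dataset(dataset="small_table", tbl_type="duckdb")
)
```
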
This reporting can scale nicely to very large tables since the width of the table will always be fixed (the schema column comparisons are represented as rows). Other nice touches include a robust header with information on the schema comparison settings, the step number, and an indication of the overall pass/fail status (here, a large red cross mark).

There are many types of validations, and so naturally there are different types of step reports, but the common thread is that they all use Great Tables to provide the reporting in a sensible fashion.

### In closing

We hope this post provides some insight into how Great Tables can be versatile enough to be used within other Python libraries. The added benefit is that outputs that are GT objects can be further modified or styled by users of the library producing them.

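As an example of that last point, here's a sketch of restyling a validation report, assuming the GT object is available via `get_tabular_report()` and using the `tab_options()` method from Great Tables:

```python
# A sketch: get the report as a GT object and apply Great Tables styling
report = validation.get_tabular_report()  # method name assumed

# `tab_options()` is part of the Great Tables API; here it shrinks the font
report.tab_options(table_font_size="12px")
```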
