
# From zero to data hero (with Bauplan), Data Camp Edition

## Overview

This project implements a standard landing / staging / transformation architecture (sometimes called "Medallion") in Bauplan. In particular, you will learn:

- the simplest way to safely load "files" (in S3) into "tables": tables (but not files) can be branched, queried, and governed, and they are the foundation of any serious data workload;
- the simplest way to safely go from source tables to refined data assets for business outcomes, e.g. tables to be visualized in BI tools, or datasets to be used by data scientists in their notebooks. This is accomplished by writing pipelines, which are DAGs of SQL and Python functions chained together.

We distinguish between an interactive, manual flow, in which most operations are done through the Bauplan CLI, and an automated, scripted flow, in which the same operations are done through a Python script using the Python SDK (which exposes nice, typed Python objects through no-nonsense APIs).

## Desired architecture at a glance

- Landing zone (S3)
  - Archive and debugging store
  - Parquet and CSV files
- Staging zone
  - Loading Parquet files into Iceberg tables
  - Raw data in Iceberg format
  - Data quality testing
- Transform layer
  - Data pipelines
  - Transformation and aggregation

## Setup

### Python environment

We recommend using uv to manage the dependencies:

```bash
uv sync
```

Additional dependencies (on top of bauplan) are only necessary to run the notebook code at the end.

### Bauplan

- Join the bauplan sandbox, create your username and API key.
- Complete the 3-minute tutorial to get familiar with the platform.

## From landing to transformation: interactive mode

### Step 0

You need to set up your landing zone with some initial files, which is outside the scope of Bauplan (typically, your application will send events to S3 buckets through Kafka or a similar mechanism). To simulate this step, we will add the CSV files in the `data` folder to a public bucket hosted by Bauplan and connected to our public sandbox. Do this for the three files in `data`, corresponding to the three sample tables `product_data`, `supplier_sku_lookup`, and `transaction_line_item`, and copy the S3 path that the app shows (this is where the files now are). Do not run the commands displayed there just yet!
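
For context, outside of this simulation the landing step is typically just an object store write from your application code. Below is a minimal sketch with boto3, where the bucket name, key prefix, and local file names are all hypothetical (in this tutorial, the web app does this for you):

```python
import boto3

# Hypothetical landing bucket and prefix: replace with your own.
BUCKET = "my-landing-bucket"
PREFIX = "landing"

s3 = boto3.client("s3")
for fname in (
    "product_data.csv",
    "supplier_sku_lookup.csv",
    "transaction_line_item.csv",
):
    # upload each local file to the landing zone
    s3.upload_file(f"data/{fname}", BUCKET, f"{PREFIX}/{fname}")
```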

### Step 1: landing to staging

Once the data has landed in S3, we move from files to tables using a Git-like pattern (a.k.a. Write-Audit-Publish): we create a data branch to sandbox our operations and import the three source tables there, i.e. `product_data`, `supplier_sku_lookup`, `transaction_line_item`. For the `product_data` table, for example, this looks as follows (remember to replace the placeholders with your values):

```bash
bauplan checkout main
bauplan checkout -b <YOUR_USERNAME>.demodatacamp
bauplan namespace create <YOUR_USERNAME>_data_camp
bauplan table create \
    --search-uri s3://<URL_FROM_THE_APP>/1755359275284_demo-data-2025-02-12-product_data.csv \
    --name product_data --namespace <YOUR_USERNAME>_data_camp
bauplan table import \
    --search-uri s3://<URL_FROM_THE_APP>/1755359275284_demo-data-2025-02-12-product_data.csv \
    --name product_data --namespace <YOUR_USERNAME>_data_camp
```

You should then repeat `table create` and `table import` for the other two tables, `supplier_sku_lookup` and `transaction_line_item`, using the same namespace and just replacing the S3 path with the one corresponding to each file.

Note: we use namespaces to logically group your tables together. This may or may not be needed in the real world, depending on your use case.
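
If you prefer scripting this step, the same create-and-import sequence can be expressed with the Python SDK. A minimal sketch, assuming the client methods shown here (`create_branch`, `create_table`, `import_data`) behave as in the current SDK docs, with the usual placeholders to replace:

```python
import bauplan

client = bauplan.Client()  # picks up your API key from the local Bauplan config

BRANCH = "<YOUR_USERNAME>.demodatacamp"
NAMESPACE = "<YOUR_USERNAME>_data_camp"
# placeholder: the S3 URI the app showed for the uploaded file
SOURCE_URI = "s3://<URL_FROM_THE_APP>/1755359275284_demo-data-2025-02-12-product_data.csv"

# sandbox the operations on a new branch off main
client.create_branch(BRANCH, from_ref="main")
# create the Iceberg table from the file schema, then append the data
client.create_table(table="product_data", search_uri=SOURCE_URI, branch=BRANCH, namespace=NAMESPACE)
client.import_data(table="product_data", search_uri=SOURCE_URI, branch=BRANCH, namespace=NAMESPACE)
```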

To check that the data quality is what we expect, let's, for example, query the table and verify that there are no NULLs in the `customer_product_id` column:

```bash
bauplan query "SELECT COUNT(*) FROM product_data WHERE customer_product_id IS NULL" --namespace <YOUR_USERNAME>_data_camp
```

(Note the `COUNT(*)`: counting the column itself would skip NULL rows and always return 0, making the check vacuous.)
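
The same audit can act as a gate in a script. A sketch, assuming `client.query` accepts a ref and a namespace and returns an Arrow table, as in the SDK docs:

```python
import bauplan

client = bauplan.Client()
result = client.query(
    "SELECT COUNT(*) AS null_rows FROM product_data WHERE customer_product_id IS NULL",
    ref="<YOUR_USERNAME>.demodatacamp",
    namespace="<YOUR_USERNAME>_data_camp",
).to_pandas()
# fail loudly before any merge if the audit does not pass
assert result["null_rows"].iloc[0] == 0, "NULL customer_product_id found!"
```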

If the check passes (count=0), we can "promote" the tables to the official staging zone on the main branch, similar to what we do with code branches in Git (your namespace will now be visible):

```bash
bauplan checkout main
bauplan branch merge <YOUR_USERNAME>.demodatacamp
bauplan namespace
```

![Landing to staging](landing_staging.jpg)

### Step 2: staging to transformation

Once the data is in the staging zone, we can run our transformation pipelines. This is where we apply business logic to the data, such as aggregations, joins, and other transformations. The pipeline is a two-step DAG in the `bpln_pipeline` folder:

- step 1 (in SQL): build an intermediate table called `top_selling_products`;
- step 2 (in Python): build a final table called `top_selling_suppliers` that aggregates the results of the first step and enriches them with additional data from the `supplier_sku_lookup` table (see the sketch right after this list).
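
For reference, here is what a Python step of such a DAG can look like. This is an illustrative sketch, not the actual code in `bpln_pipeline`: it assumes the `model`/`python` decorators from the Bauplan docs, and the join key and column names are hypothetical.

```python
import bauplan


@bauplan.model()
@bauplan.python("3.11", pip={"pandas": "2.2.0"})
def top_selling_suppliers(
    # the parents of this node in the DAG, passed in as Arrow tables
    products=bauplan.Model("top_selling_products"),
    lookup=bauplan.Model("supplier_sku_lookup"),
):
    # hypothetical join key and revenue column: the real pipeline defines its own
    df = products.to_pandas().merge(lookup.to_pandas(), on="sku")
    return df.groupby("supplier_id", as_index=False).agg(
        total_supplier_revenue=("revenue", "sum")
    )
```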

We will once again use branches to develop safely on production data, without affecting the main branch until we are ready to merge. `cd` into `bpln_pipeline`, verify that you are on `main` with `bauplan branch`, then run the pipeline on a new branch:

```bash
cd bpln_pipeline
bauplan branch
bauplan checkout -b <YOUR_USERNAME>.transform_datacamp
bauplan run --namespace <YOUR_USERNAME>_data_camp
```

![Staging to transform](staging_transform.jpg)

Note: we could merge here, but by default we don't, in order to showcase that artifacts on branches can immediately be used and shared inside an organization before a final decision to merge is made.

### Step 3 (optional): visualize final tables

If you want to explore the final datasets, we have a few options.

#### Run ad hoc queries in the CLI

```bash
bauplan query "SELECT SUM(total_supplier_revenue) AS rev FROM top_selling_suppliers" --namespace <YOUR_USERNAME>_data_camp
```

#### Use a Python notebook to explore the data

```bash
uv run marimo edit query_to_pandas.py
```

Note: make sure to edit the notebook cells with your own variables before running.
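
The core of the notebook is a one-liner with the SDK: query the branch and convert the Arrow result to a pandas DataFrame. A minimal sketch, with the same placeholders as above:

```python
import bauplan

client = bauplan.Client()
df = client.query(
    "SELECT * FROM top_selling_suppliers",
    ref="<YOUR_USERNAME>.transform_datacamp",
    namespace="<YOUR_USERNAME>_data_camp",
).to_pandas()
df.head()
```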

#### Use a BI tool

You can also connect a BI tool compatible with Postgres. For example, you can run Metabase in a local Docker container and visualize the table we just created. The configuration details for the connection to the public sandbox are:

```text
host: nlb.pub.bauplanlabs.us-east-1.computing.prod.bauplanlabs.com
port: 25432
database: <YOUR_USERNAME>.transform_datacamp
user: <YOUR_USERNAME>
password: <YOUR_API_KEY>
advanced settings (for Metabase): preferQueryMode=simple
```
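
The same endpoint works from any Postgres client, not just BI tools. Here is a sketch with psycopg2, which uses the simple query protocol by default (the equivalent of the `preferQueryMode=simple` setting above); the namespace-qualified table name is an assumption:

```python
import psycopg2

conn = psycopg2.connect(
    host="nlb.pub.bauplanlabs.us-east-1.computing.prod.bauplanlabs.com",
    port=25432,
    dbname="<YOUR_USERNAME>.transform_datacamp",  # the branch acts as the database
    user="<YOUR_USERNAME>",
    password="<YOUR_API_KEY>",
)
with conn.cursor() as cur:
    # assumption: the table is reachable under your namespace
    cur.execute(
        "SELECT SUM(total_supplier_revenue) FROM <YOUR_USERNAME>_data_camp.top_selling_suppliers"
    )
    print(cur.fetchone())
```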

For a video walkthrough of the setup, see this video - note that passwords have been edited out.

## From landing to transformation: automation mode

We now perform the exact same operations, but scripted as a Python file that can be run in an automated way. This is useful for production pipelines, or for running the same operations multiple times.

```bash
cd src
uv run automated_flow.py --namespace_suffix data_camp_auto
```

The script executes the entire logic for you: these are the same steps as above, but now embedded inside a single Python flow:
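
Condensed to its essentials, such a flow is a Write-Audit-Publish sequence expressed with the SDK. The sketch below is illustrative, not the actual contents of `automated_flow.py`, and assumes the same client methods as above (the real script adds parametrization and error handling):

```python
import bauplan

client = bauplan.Client()
branch = "<YOUR_USERNAME>.data_camp_auto"
namespace = "<YOUR_USERNAME>_data_camp_auto"
# placeholder: map each table to the S3 URI shown by the app
sources = {"product_data": "s3://<URL_FROM_THE_APP>/...product_data.csv"}

# Write: sandbox the import on a branch
client.create_branch(branch, from_ref="main")
for table, uri in sources.items():
    client.create_table(table=table, search_uri=uri, branch=branch, namespace=namespace)
    client.import_data(table=table, search_uri=uri, branch=branch, namespace=namespace)

# Audit: the same NULL check as in the interactive flow
nulls = client.query(
    "SELECT COUNT(*) AS c FROM product_data WHERE customer_product_id IS NULL",
    ref=branch,
    namespace=namespace,
).to_pandas()
assert nulls["c"].iloc[0] == 0, "quality check failed, not merging"

# Publish: promote the tables to main, then run the transformation pipeline
client.merge_branch(source_ref=branch, into_branch="main")
client.run(project_dir="../bpln_pipeline", ref=branch, namespace=namespace)
```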

Note that we did not rewrite a single line of business logic, nor infrastructure code: going from development to production has never been easier. This is a video of a successful automated run.
