This project implements a standard landing / staging / transformation architecture in Bauplan (sometimes called "Medallion"). In particular, you will learn:
- the simplest way to safely load "files" (in S3) into "tables": tables (but not files) can be branched, queried, and governed, and they are the foundation of any serious data workload;
- the simplest way to safely go from source tables to refined data assets for business outcomes, e.g. tables to be visualized in BI tools, or datasets to be used by data scientists in their notebooks. This is accomplished by writing pipelines, i.e. DAGs of SQL and Python functions chained together.
We distinguish between an interactive, manual flow, in which most operations are done through the Bauplan CLI, and an automated, scripted flow, in which the same operations are done through a Python script using the Python SDK (which exposes nice, typed Python objects through no-nonsense APIs).
- Landing Zone (S3)
  - archive and debugging store
  - Parquet and CSV files
- Staging Zone
  - loading Parquet into Iceberg
  - raw data in Iceberg format
  - data quality testing
- Transform Layer
  - data pipelines
  - transformation and aggregation
We recommend using uv to manage the dependencies:

```bash
uv sync
```

Additional dependencies (on top of bauplan) are only necessary to run the code in the notebooks at the end.
- Join the bauplan sandbox, create your username and API key.
- Complete the 3-min tutorial to get familiar with the platform.
You need to set up your landing zone with some initial files; this step is outside the scope of Bauplan (typically, your application will send events to S3 buckets through Kafka or a similar mechanism). To simulate it, we will add the csv files in the data folder to a public repo hosted by Bauplan and connected to our public sandbox. Do this for the three files in data, corresponding to the three sample tables product_data, supplier_sku_lookup, and transaction_line_item, and copy the S3 path that the app shows (this is where the files now are). Do not run the commands displayed there just yet!
Once the data has landed in S3, we move from files to tables using a Git-like pattern (a.k.a. Write-Audit-Publish): we create a data branch to sandbox our operations and import the three source tables there, i.e. product_data, supplier_sku_lookup, transaction_line_item. For the product_data table, for example, this looks as follows (remember to replace the placeholders with your values):
```bash
bauplan checkout main
bauplan checkout -b <YOUR_USERNAME>.demodatacamp
bauplan namespace create <YOUR_USERNAME>_data_camp
bauplan table create \
  --search-uri s3://<URL_FROM_THE_APP>/1755359275284_demo-data-2025-02-12-product_data.csv \
  --name product_data --namespace <YOUR_USERNAME>_data_camp
bauplan table import \
  --search-uri s3://<URL_FROM_THE_APP>/1755359275284_demo-data-2025-02-12-product_data.csv \
  --name product_data --namespace <YOUR_USERNAME>_data_camp
```

You should then repeat table create and table import for the other two tables, supplier_sku_lookup and transaction_line_item, using the same namespace, and just replacing the S3 path with the one corresponding to the other files.
Note: we use namespaces to logically group your tables together. This may or may not be needed in the real-world, depending on your use case.
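The same import flow can also be scripted with the Python SDK. The sketch below mirrors the CLI steps above; the method names follow the SDK's style, but treat them as assumptions and verify the exact signatures against the SDK reference.

```python
def stage_source_tables(username: str, s3_uris: dict) -> None:
    """Branch, create the namespace, then create and import each table.

    A sketch of the CLI flow above using the Python SDK; assumes
    `pip install bauplan` and a configured API key.
    """
    import bauplan

    client = bauplan.Client()
    branch = f"{username}.demodatacamp"
    namespace = f"{username}_data_camp"

    # sandbox the imports on a data branch off main (Write-Audit-Publish)
    client.create_branch(branch, from_ref="main")
    client.create_namespace(namespace, branch=branch)

    for table_name, uri in s3_uris.items():
        # create_table infers the schema from the files at the S3 URI,
        # import_data then loads the actual rows into the Iceberg table
        client.create_table(table=table_name, search_uri=uri,
                            branch=branch, namespace=namespace)
        client.import_data(table=table_name, search_uri=uri,
                           branch=branch, namespace=namespace)
```

You would call it with a mapping from the three table names to the S3 paths copied from the app.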
To check that the data quality is what is expected, let's, for example, query the table and check that there are no NULLs in the customer_product_id column:
```bash
bauplan query "SELECT COUNT(*) AS null_ids FROM product_data WHERE customer_product_id IS NULL" --namespace <YOUR_USERNAME>_data_camp
```

If the check passes (count = 0), we can "promote" the tables to the official staging zone on the main branch, similar to what we do with code branches in Git (your namespace will now be visible):
```bash
bauplan checkout main
bauplan branch merge <YOUR_USERNAME>.demodatacamp
bauplan namespace
```

Once the data is in the staging zone, we can run our transformation pipelines. This is where we apply business logic to the data, such as aggregations, joins, and other transformations. The pipeline is a two-step DAG in bpln_pipeline:
- step 1 (in SQL): build an intermediate table called top_selling_products;
- step 2 (in Python): build a final table called top_selling_suppliers that aggregates the results of the first step and enriches them with additional data from the supplier_sku_lookup table.
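Stripped of the bauplan model decorators, the core of step 2 is just a join plus an aggregation. A sketch in pandas, where the column names (other than customer_product_id and total_supplier_revenue, which appear elsewhere in this walkthrough) are assumptions for illustration:

```python
import pandas as pd


def top_selling_suppliers(top_products: pd.DataFrame,
                          sku_lookup: pd.DataFrame) -> pd.DataFrame:
    # enrich the best-selling products with supplier info from the lookup table
    enriched = top_products.merge(sku_lookup, on="customer_product_id", how="inner")
    # aggregate revenue per supplier and rank suppliers by total revenue
    out = (
        enriched.groupby("supplier_name", as_index=False)["product_revenue"]
        .sum()
        .rename(columns={"product_revenue": "total_supplier_revenue"})
    )
    return out.sort_values("total_supplier_revenue", ascending=False)
```

Inside the actual pipeline, the same logic runs as a bauplan Python model whose inputs are the tables produced upstream.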
We will once again use branches to develop safely on production data, without affecting the main branch until we are ready to merge. Cd into bpln_pipeline, verify you are now on main with bauplan branch, then run the pipeline on a new branch:
```bash
cd bpln_pipeline
bauplan branch
bauplan checkout -b <YOUR_USERNAME>.transform_datacamp
bauplan run --namespace <YOUR_USERNAME>_data_camp
```

Note: we may or may not merge here, but we don't by default to showcase that artifacts on branches can immediately be used and shared inside an organization before making a final decision to merge.
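The branch-and-run step also has a scripted equivalent in the Python SDK. A sketch, with the caveat that the method names and parameters follow the SDK's style but should be verified against the SDK reference:

```python
def run_pipeline_on_branch(username: str, project_dir: str = "bpln_pipeline"):
    """Scripted equivalent of the CLI steps above.

    Assumes `pip install bauplan` and a configured API key.
    """
    import bauplan

    client = bauplan.Client()
    branch = f"{username}.transform_datacamp"
    # develop safely on a branch, never directly on main
    client.create_branch(branch, from_ref="main")
    # execute the DAG in the project directory against the branch
    return client.run(
        project_dir=project_dir,
        ref=branch,
        namespace=f"{username}_data_camp",
    )
```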
If you want to explore the final datasets, we have a few options. You can query from the CLI:

```bash
bauplan query "SELECT SUM(total_supplier_revenue) AS rev FROM top_selling_suppliers" --namespace <YOUR_USERNAME>_data_camp
```

Or open the marimo notebook:

```bash
uv run marimo edit query_to_pandas.py
```
Note: make sure to edit the notebook cells with your own variables before running.
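The notebook's core pattern, querying a branch and getting a pandas DataFrame back, looks roughly like this with the SDK (a sketch; assumes `pip install bauplan`, a configured API key, and that `Client.query` returns an Arrow table, as in the current SDK):

```python
def revenue_to_pandas(username: str):
    """Query the final table on the dev branch and return a pandas DataFrame."""
    import bauplan

    client = bauplan.Client()
    sql = (
        "SELECT SUM(total_supplier_revenue) AS rev "
        f"FROM {username}_data_camp.top_selling_suppliers"
    )
    # `ref` pins the query to the development branch
    result = client.query(query=sql, ref=f"{username}.transform_datacamp")
    return result.to_pandas()
```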
You can also connect a BI tool compatible with Postgres: for example, run Metabase in a local Docker container and visualize the table we just created. The configuration details for connecting to the public sandbox are:
host: nlb.pub.bauplanlabs.us-east-1.computing.prod.bauplanlabs.com
port: 25432
database: <YOUR_USERNAME>.transform_datacamp
user: <YOUR_USERNAME>
password: <YOUR_API_KEY>
advanced settings (for metabase): preferQueryMode=simple
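Since the endpoint speaks the Postgres wire protocol, any Postgres client works, not just BI tools. For example, with psycopg2 (a sketch; assumes `pip install psycopg2-binary`):

```python
def connect_to_sandbox(username: str, api_key: str):
    """Open a Postgres-protocol connection to the sandbox endpoint above."""
    import psycopg2

    return psycopg2.connect(
        host="nlb.pub.bauplanlabs.us-east-1.computing.prod.bauplanlabs.com",
        port=25432,
        dbname=f"{username}.transform_datacamp",  # the branch acts as the database
        user=username,
        password=api_key,  # your bauplan API key
    )
```

From the returned connection you can open a cursor and run the same SELECT statements you would run in Metabase.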
For a video walkthrough of the setup, see this video - note that passwords have been edited out.
We now do the exact same operations, but scripted as a Python file which can be run in an automated way. This is useful for production pipelines, or for running the same operations multiple times.
```bash
cd src
uv run automated_flow.py --namespace_suffix data_camp_auto
```

The script executes the entire logic for you: these are the same steps as we did above, but now they are embedded inside a Python flow:
- import the files to tables and merge into main to get to a ready staging zone;
- run the transformation pipeline to get the final tables.
Note that we did not have to rewrite a single line of business logic, nor any infrastructure code: going from development to production has never been easier. This is a video of a successful automated run.

