chore: Add a basic cli for `generate_data.py` #3421

dangotbanned · 2026-01-25T12:17:39Z

Description

A pretty modest start, but it adds the ability to control the scale factor and prevents data generation from running on module import.

I've wanted to be able to control the (previously global) parameter since (#805 (comment)) 😅

$ python generate_data.py --help
usage: generate_data.py [-h] [-sf SCALE_FACTOR]

Generate the data required to run TPCH queries.

options:
  -h, --help            show this help message and exit
  -sf, --scale-factor SCALE_FACTOR
                        Scale the database by this factor (default: 0.1)

                        ┌──────────────┬───────────────┐
                        │ Scale factor ┆ Database (MB) │
                        ╞══════════════╪═══════════════╡
                        │ 0.1          ┆ 25            │
                        │ 1.0          ┆ 250           │
                        │ 3.0          ┆ 754           │
                        │ 100.0        ┆ 26624         │
                        └──────────────┴───────────────┘

Related issues

Child of chore(typing): Improve tpch typing #3420
Started from reviewing chore(typing): Improve tpch typing #3420

dangotbanned · 2026-01-25T12:43:38Z

tpch/generate_data.py

+        tbl_arrow = tbl.to_arrow_table()
+        new_schema = convert_schema(tbl_arrow.schema)
+        tbl_arrow = tbl_arrow.cast(new_schema)
+        pq.write_table(tbl_arrow, data_path / f"{t}.parquet")


I kinda wanna add logging for the file writes.
Right now there isn't any feedback on whether any part of the script succeeded.

It would be especially nice for a larger --scale-factor, where the runtime can be much longer.
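
For illustration, one way the file-write logging could look is sketched below. This is not part of the PR: the logger name and the `write_with_logging` helper are hypothetical, and plain bytes stand in for the `pq.write_table(...)` call so the snippet runs without pyarrow.

```python
import logging
import tempfile
from pathlib import Path

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("tpch.generate_data")


def write_with_logging(data_path, name, payload):
    """Write one table file and log before and after the write.

    `payload` is plain bytes here; in the real script this step would be
    `pq.write_table(tbl_arrow, data_path / f"{t}.parquet")`.
    """
    target = data_path / f"{name}.parquet"
    logger.info("Writing %s", target)
    target.write_bytes(payload)
    logger.info("Wrote %s (%d bytes)", target, target.stat().st_size)
    return target


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    with tempfile.TemporaryDirectory() as tmp:
        for table in ("lineitem", "orders"):
            write_with_logging(Path(tmp), table, b"placeholder")
```

Logging once before and once after each write means a long run at a large scale factor shows which table is currently in progress, not just which ones have finished.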

FBruzzesi

Thanks @dangotbanned - left a few comments! It would be nice if we could finally run anything regarding #805

tpch/__init__.py

FBruzzesi · 2026-01-25T13:15:47Z

tpch/generate_data.py

+        default="0.1",
+        dest="scale_factor",
+        help=f"Scale the database by this factor (default: %(default)s)\n{TABLE_SCALE_FACTOR}",
+    )


You can add type=float to parse it as a float, otherwise scale_factor will be a string in main (which does not really matter as we always use it inside string formatting, but it's not what the annotation in main is declaring)

> You can add type=float to parse it as a float, otherwise scale_factor will be a string in main

Woops - I did have that but lost it when copying over somehow?

I only learned about type and default requiring the default to be a string yesterday as well 😭
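
A minimal sketch of the suggested fix, assuming the argument is declared roughly as in the diff above (the `build_parser` wrapper is hypothetical, and the table in the help text is omitted). When `default` is a string, argparse runs it through `type` just like a command-line value, so `scale_factor` arrives in `main` as a `float` either way:

```python
import argparse


def build_parser():
    # Hypothetical wrapper; only the add_argument call mirrors the diff.
    parser = argparse.ArgumentParser(
        prog="generate_data.py",
        description="Generate the data required to run TPCH queries.",
    )
    parser.add_argument(
        "-sf",
        "--scale-factor",
        default="0.1",  # string default: argparse converts it via `type`
        type=float,
        dest="scale_factor",
        help="Scale the database by this factor (default: %(default)s)",
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    # Explicit argument lists keep the demo independent of sys.argv.
    print(parser.parse_args([]).scale_factor)
    print(parser.parse_args(["-sf", "3"]).scale_factor)
```

Keeping the default as the string "0.1" also means `%(default)s` in the help text prints exactly what was written.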

FBruzzesi · 2026-01-25T13:19:27Z

tpch/generate_data.py

-import duckdb
-import pyarrow as pa
-import pyarrow.csv as pc
-import pyarrow.parquet as pq


Is there any real benefit of not having the imports on the top level in this case?

If those imports are at the top-level, then the module will import them eagerly.

By moving them to where they are used, the script still has them when needed, but importing the module (like in __init__.py) doesn't trigger those imports.
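
The pattern being described can be sketched as follows. This is an illustration rather than the PR's code: the stdlib `csv` module stands in for the heavier `duckdb` / `pyarrow` imports, and the body of `main` is hypothetical.

```python
import sys


def main(rows):
    # Deferred import: loaded only when main() actually runs.
    # `csv` is a stand-in for duckdb / pyarrow in this sketch.
    import csv
    import io

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Executed only when run as a script, never on `import`.
    sys.stdout.write(main([["a", 1], ["b", 2]]))
```

Importing this module only defines `main`; the deferred imports and the work itself happen only when the function is called.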

dangotbanned added 3 commits January 24, 2026 23:18

chore: Don't run generate_data.py on import

ca27d63

chore: Add a basic cli for generate_data.py

9ff9f15

Merge branch 'tpch/refactor-typing' into tpch/refactor-cli

78fc381

dangotbanned added developer tools internal labels Jan 25, 2026

dangotbanned marked this pull request as ready for review January 25, 2026 12:39

dangotbanned commented Jan 25, 2026

dangotbanned requested a review from FBruzzesi January 25, 2026 12:48

FBruzzesi reviewed Jan 25, 2026

Update tpch/__init__.py

ff05c8e

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

dangotbanned marked this pull request as draft January 25, 2026 14:09
