
Conversation

Member

@dangotbanned dangotbanned commented Jan 25, 2026

Description

A pretty modest start, but this adds the ability to control the scale factor and prevents data generation from happening on module import.

I've wanted to be able to control the (previously global) parameter since (#805 (comment)) 😅

$ python generate_data.py --help
usage: generate_data.py [-h] [-sf SCALE_FACTOR]

Generate the data required to run TPCH queries.

options:
  -h, --help            show this help message and exit
  -sf, --scale-factor SCALE_FACTOR
                        Scale the database by this factor (default: 0.1)

                        ┌──────────────┬───────────────┐
                        │ Scale factor ┆ Database (MB) │
                        ╞══════════════╪═══════════════╡
                        │ 0.1          ┆ 25            │
                        │ 1.0          ┆ 250           │
                        │ 3.0          ┆ 754           │
                        │ 100.0        ┆ 26624         │
                        └──────────────┴───────────────┘
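
For reference, a minimal sketch of how a parser could produce help text like the above. It assumes `argparse.RawTextHelpFormatter` is what keeps the table's line breaks intact; `build_parser` and the shortened `TABLE_SCALE_FACTOR` string are illustrative stand-ins, not the PR's actual code.

```python
import argparse

# Illustrative stand-in for the table in the help output (assumption: the
# real constant holds the full box-drawn table).
TABLE_SCALE_FACTOR = """
Scale factor | Database (MB)
0.1          | 25
1.0          | 250
"""


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper; the name is not taken from the PR."""
    parser = argparse.ArgumentParser(
        description="Generate the data required to run TPCH queries.",
        # RawTextHelpFormatter keeps newlines in help strings instead of
        # re-wrapping them into a single paragraph.
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "-sf",
        "--scale-factor",
        default="0.1",
        dest="scale_factor",
        help=f"Scale the database by this factor (default: %(default)s)\n{TABLE_SCALE_FACTOR}",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().format_help())
```

With the default formatter, the same help string would be re-wrapped and the table rows would run together.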

Related issues

@dangotbanned dangotbanned marked this pull request as ready for review January 25, 2026 12:39
tbl_arrow = tbl.to_arrow_table()
new_schema = convert_schema(tbl_arrow.schema)
tbl_arrow = tbl_arrow.cast(new_schema)
pq.write_table(tbl_arrow, data_path / f"{t}.parquet")
Member Author

I kinda wanna add logging for the file writes.
Right now there isn't any feedback on whether any part of the script succeeded.

That would be especially nice for a larger --scale-factor, where the runtime can be much longer.
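
One possible shape for that feedback, sketched with the standard `logging` module. `write_with_log` and the logger name are hypothetical, and the `write` callable is only a stand-in for something like `pq.write_table`.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("generate_data")  # hypothetical logger name


def write_with_log(path: Path, write: Callable[[Path], object]) -> None:
    """Run `write(path)` and log the destination and the resulting file size."""
    logger.info("Writing %s", path)
    write(path)
    logger.info("Wrote %s (%d bytes)", path, path.stat().st_size)


if __name__ == "__main__":
    import tempfile

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a real write such as pq.write_table(tbl_arrow, path).
        write_with_log(Path(tmp) / "example.parquet", lambda p: p.write_bytes(b"demo"))
```

Logging before and after each write gives a progress signal at larger scale factors without changing what gets written.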

Member

@FBruzzesi FBruzzesi left a comment

Thanks @dangotbanned - left a few comments! It would be nice if we could finally run anything regarding #805

default="0.1",
dest="scale_factor",
help=f"Scale the database by this factor (default: %(default)s)\n{TABLE_SCALE_FACTOR}",
)
Member

You can add type=float to parse it as a float, otherwise scale_factor will be a string in main (which does not really matter, as we always use it inside string formatting, but it's not what the annotation in main declares).

Member Author

> You can add type=float to parse it as a float, otherwise scale_factor will be a string in main

Whoops - I did have that but somehow lost it when copying over?

I only learned about type and default requiring the default to be a string yesterday as well 😭
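
A small sketch of the behaviour discussed in this thread: when `type=float` is set, argparse also runs a string default through the converter, so `scale_factor` arrives as a float whether or not the flag is passed. The parser below is a pared-down stand-in, not the PR's code.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-sf", "--scale-factor", dest="scale_factor", default="0.1", type=float
)

# No flag given: the string default "0.1" is converted by `type`.
defaulted = parser.parse_args([])
# Flag given: the command-line string is converted the same way.
explicit = parser.parse_args(["-sf", "3"])

print(type(defaulted.scale_factor).__name__, defaulted.scale_factor)  # float 0.1
print(type(explicit.scale_factor).__name__, explicit.scale_factor)  # float 3.0
```

Without `type=float`, both values would stay strings (`"0.1"` and `"3"`).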

Comment on lines -6 to -9
import duckdb
import pyarrow as pa
import pyarrow.csv as pc
import pyarrow.parquet as pq
Member

Is there any real benefit to not having the imports at the top level in this case?

Member Author

If those imports are at the top level, then the module will import them eagerly.

By moving them to where they are used, the script still has them when needed, but merely importing the module (like in __init__.py) doesn't trigger them.
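
To illustrate the difference with a self-contained sketch: the function below defers its import to call time, so importing the enclosing module stays cheap. `sqlite3` is only a stdlib stand-in for heavier dependencies such as `duckdb` or `pyarrow`, and `count_rows` is a made-up name.

```python
def count_rows(values: list[int]) -> int:
    """Load `values` into an in-memory table and count the rows."""
    # Deferred import: executed the first time the function runs, not when
    # the module containing this function is imported.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
        (n,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
        return n
    finally:
        conn.close()


if __name__ == "__main__":
    # Work only happens when run as a script, never on a plain import.
    print(count_rows([1, 2, 3]))
```

The trade-off is that a missing dependency only surfaces when the function is first called, rather than at import time.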

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
@dangotbanned dangotbanned marked this pull request as draft January 25, 2026 14:09

3 participants