-
Notifications
You must be signed in to change notification settings - Fork 176
chore: Add a basic cli for generate_data.py
#3421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: tpch/refactor-typing
Are you sure you want to change the base?
Conversation
| tbl_arrow = tbl.to_arrow_table() | ||
| new_schema = convert_schema(tbl_arrow.schema) | ||
| tbl_arrow = tbl_arrow.cast(new_schema) | ||
| pq.write_table(tbl_arrow, data_path / f"{t}.parquet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda wanna add logging for the file writes.
There isn't any feedback on if any part of the script was successful.
Would be especially nice for a larger --scale-factor, where the runtime can be MUCH LONGER
FBruzzesi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dangotbanned - left a few comments! It would be nice if we would finally be able to run anything regarding #805
| default="0.1", | ||
| dest="scale_factor", | ||
| help=f"Scale the database by this factor (default: %(default)s)\n{TABLE_SCALE_FACTOR}", | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add type=float to parse it as a float, otherwise scale_factor will be a string in main (which does not really matter as we always use it inside string formatting, but it's not what the annotation in main is declaring)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add
type=floatto parse it as a float, otherwise scale_factor will be a string inmain
Woops - I did have that but lost it when copying over somehow?
I only learned about type and default requiring the default to be a string yesterday as well 😭
| import duckdb | ||
| import pyarrow as pa | ||
| import pyarrow.csv as pc | ||
| import pyarrow.parquet as pq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any real benefit of not having the imports on the top level in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If those imports are at the top-level, then the module will import them eagerly.
By moving them to where they are used - the script has them when needed - but importing the module (like in __init__.py) doesn't
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
Description
A pretty modest start, but adds the ability to control the scale factor and prevents data generation happening on module import.
I've wanted to be able to control the (previously global) parameter since (#805 (comment)) 😅
Related issues
tpchtyping #3420tpchtyping #3420