tablefaker is a lightweight Python tool to generate realistic synthetic datasets from a YAML schema, for testing, demos, and prototyping.
Key features:
- Schema-driven YAML config: specify tables, columns, and data types for generation in a simple YAML format.
- Faker-based generators: use built-in functions, community providers, or create custom Python functions to generate data tailored to your needs.
- Referential integrity: The tool generates parent tables before child tables to ensure data integrity. Supports multi-level FK relationships.
- Realistic foreign-key distributions: generation supports distribution strategies (uniform, zipf, and weighted_parent) to mimic real-world data.
- Multiple output formats: csv, json, parquet, excel, sql, deltalake (streaming).
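To make the foreign-key distribution idea concrete, here is a minimal standard-library sketch of how a uniform and a zipf-style strategy differ when child rows pick a parent key. It is an illustration of the concept only, not tablefaker's implementation: the function `pick_parent_ids`, its parameters, and the `customers`/`orders` names are hypothetical, and `weighted_parent` is left out because its exact semantics are not described here.

```python
import random
from collections import Counter


def pick_parent_ids(parent_ids, n_children, strategy="uniform", s=1.2, seed=None):
    """Return one parent id per child row (hypothetical helper, for illustration)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    if strategy == "uniform":
        # Every parent is equally likely to be referenced.
        weights = [1.0] * len(parent_ids)
    elif strategy == "zipf":
        # Rank-based weights: the k-th parent gets weight 1 / k**s,
        # so a few parents attract most of the child rows.
        weights = [1.0 / (rank ** s) for rank in range(1, len(parent_ids) + 1)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return rng.choices(parent_ids, weights=weights, k=n_children)


# Parent keys exist before any child row is drawn, so every reference is valid.
customers = list(range(1, 11))
orders_uniform = pick_parent_ids(customers, 1000, "uniform", seed=42)
orders_zipf = pick_parent_ids(customers, 1000, "zipf", seed=42)

print("uniform:", Counter(orders_uniform).most_common(3))
print("zipf:   ", Counter(orders_zipf).most_common(3))
```

With the zipf-style weights, the first few parents receive a disproportionate share of child rows, which is closer to how real orders cluster around a handful of customers.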
Installation:
pip install -e .
Quickstart:
# generate CSVs to current folder
tablefaker --config tests/test_basic_table.yaml
CLI flags (see tablefaker --help for full list):
- --config <PATH> (required): Path to your YAML config file. See docs/yaml-reference.md for full schema and examples.
- --file_type <extension> (choices: csv, json, parquet, excel, sql, deltalake, streaming; default: csv): Output file format to generate. Use streaming to run TableFaker's streaming server instead of writing files.
- --target <PATH|DIR|HOST:PORT>: Output destination. If a directory is provided, multiple files will be written into it; if a single file path is provided, output will be written to that file. When using file_type=streaming, --target can be used to specify the host:port to bind the streaming server (see docs/streaming-server.md for configuration details).
- --seed <INT>: Use a numeric seed to make generation deterministic and reproducible.
- --infer-attrs <true|false> (default: false): Enable name-based attribute inference for columns (attempts to infer semantic attributes from column names).
- --relationships: Generate and write a relationships YAML alongside the generated outputs.
- --semantic-view: Generate and write a Snowflake-compatible semantic view YAML.
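When driving the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags listed above; the helper `build_tablefaker_command` and the `./out` target are hypothetical names chosen for illustration, and the command is only printed, not executed.

```python
import shlex

# Choices for --file_type as listed in the flag description above.
FILE_TYPES = {"csv", "json", "parquet", "excel", "sql", "deltalake", "streaming"}


def build_tablefaker_command(config, file_type="csv", target=None, seed=None,
                             relationships=False, semantic_view=False):
    """Assemble a tablefaker argument list (hypothetical helper, for illustration)."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unsupported file_type: {file_type}")
    cmd = ["tablefaker", "--config", config, "--file_type", file_type]
    if target is not None:
        cmd += ["--target", target]
    if seed is not None:
        cmd += ["--seed", str(seed)]  # same seed, same generated data
    if relationships:
        cmd.append("--relationships")
    if semantic_view:
        cmd.append("--semantic-view")
    return cmd


cmd = build_tablefaker_command(
    "tests/test_basic_table.yaml", file_type="parquet", target="./out", seed=42
)
print(shlex.join(cmd))
# tablefaker --config tests/test_basic_table.yaml --file_type parquet --target ./out --seed 42

# To actually run it (requires tablefaker to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.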
Minimal YAML:
version: 1
config:
  locale: en_US
tables:
  - table_name: person
    row_count: 100
    columns:
      - column_name: id
        data: row_id
      - column_name: first_name
        data: fake.first_name()
- For full YAML reference, see docs/yaml_reference.md
- Basic and advanced configuration example: docs/sample-configs.md
- Domain-specific example: domains/hotel/hotel.yaml
Notes:
- Parent tables must be defined before child tables.
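The ordering rule in the note can be checked mechanically. The following is a minimal sketch, assuming table dependencies are available as a plain mapping from child table to parent tables; that mapping, the helper `check_table_order`, and the table names are hypothetical and do not reflect tablefaker's actual schema format or internals.

```python
def check_table_order(table_names, parents):
    """Return (child, parent) pairs where the parent is not defined earlier.

    table_names: table names in the order they appear in the config.
    parents: dict mapping a child table to the tables it references
             (hypothetical representation, for illustration only).
    """
    seen = set()
    problems = []
    for name in table_names:
        for parent in parents.get(name, ()):
            if parent not in seen:
                problems.append((name, parent))
        seen.add(name)
    return problems


# Multi-level chain in a valid order: customer -> order -> order_item.
good_order = ["customer", "order", "order_item"]
deps = {"order": ["customer"], "order_item": ["order"]}
print(check_table_order(good_order, deps))  # []

# A child listed before its parent is reported.
bad_order = ["order", "customer"]
print(check_table_order(bad_order, {"order": ["customer"]}))  # [('order', 'customer')]
```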
Advanced features:
- Relationships YAML extraction: generate a YAML file of inferred table relationships with --relationships. See docs/relationships.md.
- Streaming server: continuous, dependency-aware streaming to Delta/Parquet. See detailed usage in docs/streaming-server.md.
- Semantic View YAML generation: produce Snowflake-compatible semantic view YAML with --semantic-view. See docs/semantic-view.md. Note: semantic view generation uses an LLM to provide descriptions, which requires an llm.config (see docs/llm-config.md and table-faker/llm.config.example).
- Plugin provider loading: add packages or local modules via config.python_import and register community providers in config.community_providers. See docs/custom-providers.md for details.
- Custom functions: docs/custom-functions.md