jonaolden/table-faker

 
 

Table Faker

tablefaker is a lightweight Python tool to generate realistic synthetic datasets from a YAML schema, for testing, demos, and prototyping.

Key features:

  • Schema-driven YAML config: Specify tables, columns, and data types for generation in a simple YAML format.
  • Faker-based generators: Use built-in functions, community providers, or custom Python functions to generate data tailored to your needs.
  • Referential integrity: The tool generates parent tables before child tables to ensure data integrity. Supports multi-level FK relationships.
  • Realistic foreign-key distributions: Generation supports distribution strategies (uniform, zipf, and weighted_parent) to mimic real-world data.
  • Multiple output formats: csv, json, parquet, excel, sql, deltalake (streaming).
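
To make the difference between the uniform and zipf strategies concrete, here is a minimal sketch in plain Python. It is not tablefaker's implementation: the function names, the exponent `s`, and the choice to rank parents by their position in the list are all assumptions made for illustration. The weighted_parent strategy is left out because this README does not describe how it works.

```python
import random
from collections import Counter


def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights 1/rank**s for ranks 1..n (s is an assumed exponent)."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def assign_foreign_keys(parent_ids, child_count, strategy="uniform", seed=None):
    """Pick one parent id per child row (hypothetical helper, not tablefaker's API)."""
    rng = random.Random(seed)
    if strategy == "uniform":
        # Every parent is equally likely to be referenced.
        return [rng.choice(parent_ids) for _ in range(child_count)]
    if strategy == "zipf":
        # Earlier parents are referenced far more often than later ones.
        weights = zipf_weights(len(parent_ids))
        return rng.choices(parent_ids, weights=weights, k=child_count)
    raise ValueError(f"unsupported strategy: {strategy}")


# The parent table is generated first, so its ids exist before children refer to them.
parent_ids = list(range(1, 11))
child_fks = assign_foreign_keys(parent_ids, 1000, strategy="zipf", seed=42)

counts = Counter(child_fks)
print(counts.most_common(3))
```

Because every child value is drawn from `parent_ids`, referential integrity holds by construction; the strategy only changes how skewed the references are.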

Installation (run from a local clone of the repository):

pip install -e .

Quickstart:

# generate CSVs in the current folder
tablefaker --config tests/test_basic_table.yaml

CLI flags (see tablefaker --help for full list):

  • --config <PATH> (required) Path to your YAML config file. See docs/yaml-reference.md for full schema and examples.

  • --file_type <extension> (choices: csv, json, parquet, excel, sql, deltalake, streaming) (default: csv) Output file format to generate. Use streaming to run TableFaker's streaming server instead of writing files.

  • --target <PATH|DIR|HOST:PORT> Output destination. If a directory is provided, multiple files are written into it; if a single file path is provided, output is written to that file. With --file_type streaming, --target specifies the host:port the streaming server binds to (see docs/streaming-server.md for configuration details).

  • --seed <INT> Use a numeric seed to make generation deterministic and reproducible.

  • --infer-attrs <true|false> (default: false) Enable name-based attribute inference for columns (attempts to infer semantic attributes from column names).

  • --relationships Generate and write a relationships YAML alongside the generated outputs.

  • --semantic-view Generate and write a Snowflake-compatible semantic view YAML.
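
When the CLI is driven from a script, the flags above can be assembled programmatically. The sketch below only builds the argument list from flags documented in this section; the helper name `build_tablefaker_command` and the `out` directory are hypothetical, and the command is printed rather than executed.

```python
import shlex

# File types listed for --file_type in this README.
ALLOWED_FILE_TYPES = {"csv", "json", "parquet", "excel", "sql", "deltalake", "streaming"}


def build_tablefaker_command(config, file_type="csv", target=None, seed=None,
                             relationships=False):
    """Return the argv list for a tablefaker run (hypothetical helper)."""
    if file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type}")
    cmd = ["tablefaker", "--config", config, "--file_type", file_type]
    if target is not None:
        cmd += ["--target", target]
    if seed is not None:
        # A fixed seed makes the run reproducible.
        cmd += ["--seed", str(seed)]
    if relationships:
        cmd.append("--relationships")
    return cmd


cmd = build_tablefaker_command(
    "tests/test_basic_table.yaml", file_type="parquet", target="out", seed=42
)
print(shlex.join(cmd))
# tablefaker --config tests/test_basic_table.yaml --file_type parquet --target out --seed 42
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming tablefaker is installed and on the PATH.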

Minimal YAML:

version: 1
config:
  locale: en_US
tables:
  - table_name: person
    row_count: 100
    columns:
      - column_name: id
        data: row_id
      - column_name: first_name
        data: fake.first_name()
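
As a rough mental model of what this schema asks for, the following sketch walks the same structure as a Python dict and produces rows. It is an illustration only, not how tablefaker works internally: it assumes `row_id` is a 1-based counter, and it replaces the real `fake.first_name()` call with a small hard-coded list (`STAND_IN_NAMES`) so the example runs without Faker installed.

```python
import random

# The minimal YAML above, written out as the dict a YAML parser would produce.
schema = {
    "version": 1,
    "config": {"locale": "en_US"},
    "tables": [
        {
            "table_name": "person",
            "row_count": 100,
            "columns": [
                {"column_name": "id", "data": "row_id"},
                {"column_name": "first_name", "data": "fake.first_name()"},
            ],
        }
    ],
}

# Stand-in for Faker's first names (assumption, for a dependency-free example).
STAND_IN_NAMES = ["Alice", "Bob", "Carol", "David", "Eve"]


def generate_table(table, seed=None):
    """Generate row dicts for one table definition (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for row_id in range(1, table["row_count"] + 1):
        row = {}
        for col in table["columns"]:
            if col["data"] == "row_id":
                row[col["column_name"]] = row_id
            elif col["data"] == "fake.first_name()":
                row[col["column_name"]] = rng.choice(STAND_IN_NAMES)
            else:
                raise ValueError(f"unsupported expression: {col['data']}")
        rows.append(row)
    return rows


people = generate_table(schema["tables"][0], seed=1)
print(len(people), people[0]["id"], people[-1]["id"])
# 100 1 100
```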

Notes:

  • Parent tables must be defined before child tables.
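
The ordering rule can be checked before generation with a few lines of Python. This is a sketch under assumptions: the `parents` mapping (child table name to the tables it references) is a hypothetical input, since this README does not show how foreign keys are declared in the YAML, and the table names are made up.

```python
def check_parent_order(table_names, parents):
    """Return (child, parent) pairs where the parent is not defined before the child.

    table_names: table names in the order they appear in the config.
    parents: dict mapping a child table name to the names of its parent tables.
    """
    seen = set()
    problems = []
    for name in table_names:
        for parent in parents.get(name, []):
            if parent not in seen:
                problems.append((name, parent))
        seen.add(name)
    return problems


parents = {"order": ["customer"], "order_item": ["order"]}

# Parents come first at every level, so nothing is reported.
print(check_parent_order(["customer", "order", "order_item"], parents))
# []

# "order" appears before its parent "customer".
print(check_parent_order(["order", "customer", "order_item"], parents))
# [('order', 'customer')]
```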

Advanced features:
