Commit 52056f3

More things (#6)
* chore: interim commit
* feat: added exceptions
* chore: interim commit
* feat: schema validation works well
* fix: fixed an incorrect column name
* fix: made better defaults
* chore: dedent
* fix: more numpy
* chore: different exception catching in one place
* chore: wip for fast shuffle
* feat: using a C shared library for array shuffling
* chore: moved library
* fix: identified bug with fast_shuffle; implemented random drop for ids
* feat: fixes an off-by-one causing a missing item, and improved randomness. The C library needed to use a re-entrant version of rand() in order to not deterministically generate shuffled arrays. Unrelated (but included in this commit, oh well), the schema now defaults to "full_name" instead of "name", as the generation script has also changed to now do first, last, and/or full names.
* feat: shifted some files into content/ directory
* chore: simplified exception handling for _add_error()
* chore: ran black, fixed some mypy typing
* chore: added poetry
* feat: added json creation
* chore: added example of bad json schema
* chore: added longer example of valid json schema
* feat: added TEXT column generation, made schema validation default
* chore: added sql to gitignore
* chore: fixed make clean target
* chore: added exception handling for missing library
* chore: added mpl-2.0 license
* chore: updated README
* chore: removed dead code, added whitespace
* fix: removed some looping lower() calls in favor of doing it once for a given schema
* feat: bundled two C libraries
* chore: updated gitignore
* fix: removed txt option as it didn't make sense; fixed a leftover erroneous note in help
* feat: added csv support
* chore: updated readme
* feat: added a library for darwin_x86-64
* feat: adding constants
* chore: added lorem ipsum, attributing names
* chore: tweaking full example schema
* chore: interim commit, terribly sorry for lack of description
* chore: updated .gitignore
* fix: checking is now done for column max capacity at validation, and random ids get looped under the column's max
* chore: updated readme
* chore: renamed a class instance, and reformatted
* feat: added chunking of generated sql files
* chore: ran black
* chore: fixed readme doubling
* WIP: interim commit
* fix: swapped chunking behavior to default
* chore: cleanup
* feat: added better defaults for json and timestamps; speedups; bugfixes
* chore: updated readme
* chore: updated base schemas
* chore: split gensql for better readability; renamed create_entries
* chore: removed old files not in use
* feat: added cities
* feat: added countries - wip, speed for random selection is atrocious
* fix: improved speed of country by way of severely reducing the amount of american countries - top 600 by population now
* fix: power instead of multiplication
* chore: updated readme
* chore: fixing a regression
* fix: improved documentation, made some options more accessible
* fix: improved visibility of errors in schema
* fix: split out validation into its own class
* chore: updated readme
* fix: moved schema inputs to their own directory
* chore: updated readme
* feat: added unique checks for emails, other cleanup and small improvements
* fix: no-op gitignore to get the empty directory into git
* chore: removing duplicates
* fix: better error handling, added fixed-length option
* WIP: didn't see a speedup from this, but it may prove useful so storing it
* chore: changed to logging, removed an unused module
* fix: merged sql usage from sqlite branch; approximately 20% faster for some city/country combinations due to cache usage
* WIP: didn't see a speedup from this, but it may prove useful so storing it
* fix: fixed edge case if user doesn't specify any primary key
* fix: moved json keys into constants
* fix: 'upgraded' to string comprehension to create a json column. This is in no way good, is extremely fragile, and in general is a terrible idea - but it is slightly faster than json.dumps()
* Revert "fix: 'upgraded' to string comprehension to create a json column". This reverts commit d67a835.
* chore: minor reformatting, added a note about performance on sample()
* chore: broke out runner's init
* fix: added some tests
* feat: added uuid support
* fix: macos creates v4 now
* chore: added prebuilts for apple silicon
* chore: UUID_STR_LEN should be 37, not 36
* chore: renamed for use with run.sh
* chore: added script to symlink correct shared library
* fix: modified how doubles/floats are created
* chore: clarified a unicode symbol
* fix: improved schema validation checks for primary keys
* fix: added more helpful error message for JSON validation errors
* chore: ran black
* feat: added exception for unsupported RDBMS
* chore: removed sqlserver as supported type, renamed postgresql to postgres
* feat: added extremely poor postgres support - does not generate table schema, but does produce valid sql files
1 parent: d7bfdfc
40 files changed: +9681 −762

.gitignore (+1)

@@ -173,3 +173,4 @@ poetry.toml
 *.sql
 *.gz
 library/fast_shuffle.so
+library/uuid.so

Makefile (+24 −2)

@@ -1,5 +1,27 @@
+CC := gcc
+CFLAGS := -Wextra -Wall -O3
+LDLIBS := -luuid
+LDFLAGS := /usr/lib/x86_64-linux-gnu
+
+ifeq ($(shell uname), Darwin)
+MAC_HEADER_DIR := $(shell find /opt/homebrew/Cellar/ossp-uuid -name "uuid.h" -exec dirname {} \;)
+MAC_HEADER_DIR := $(strip $(MAC_HEADER_DIR))
+LDFLAGS := /opt/homebrew/lib
+endif
+
+.PHONY: make
 make:
-    gcc -Wextra -Wall -O3 -shared library/fast_shuffle.c -o library/fast_shuffle.so
+    @if [ ! -f $(LDFLAGS)/libuuid.a ]; then \
+        echo "FATAL: libuuid.a not found"; \
+        echo "On MacOS, use: brew install ossp-uuid"; \
+        echo "On Debian/Ubuntu, use: sudo apt-get install uuid-dev"; \
+        echo "On CentOS/Fedora/RHEL, use: sudo yum install libuuid-devel"; \
+        echo "On openSUSE, use: sudo zypper in libuuid-devel"; \
+        exit 1; \
+    fi
+    $(CC) $(CFLAGS) -shared library/fast_shuffle.c -o library/fast_shuffle.so
+    $(CC) $(CFLAGS) -shared library/uuid.c -L$(LDFLAGS) $(LDLIBS) -o library/uuid.so
 
+.PHONY: clean
 clean:
-    rm -f library/fast_shuffle.so
+    rm -f library/fast_shuffle.so library/uuid.so
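For context on how these shared objects are consumed: the README notes that ctypes looks the library up by filename, so loading the freshly built `library/fast_shuffle.so` from Python looks roughly like the sketch below. The exported symbol name and signature (`shuffle`) are assumptions for illustration only; the real interface is defined in `library/fast_shuffle.c`, which is not shown in this diff.

```python
import ctypes

# Hypothetical sketch: load the shared object built by the Makefile above.
# The symbol name and prototype below are assumed, not taken from the repo:
#   void shuffle(int *arr, size_t n);
fast_shuffle = ctypes.CDLL("library/fast_shuffle.so")

n = 10
arr = (ctypes.c_int * n)(*range(n))  # a C int[10] holding 0..9

fast_shuffle.shuffle.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.c_size_t]
fast_shuffle.shuffle.restype = None
fast_shuffle.shuffle(arr, n)

print(list(arr))  # the same ten integers, in shuffled order
```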

README.md (+45 −33)

@@ -7,35 +7,40 @@ Ever want to quickly create millions of rows of random data for a database, with
 ## Usage
 
 ```shell
-usage: create_entries.py [-h] [--extended-help] [-c] [--country {au,de,fr,ke,jp,mx,ua,uk,us}] [-d] [--drop-table] [--force] [-f {csv,mysql,postgresql,sqlserver}] [--generate-dates] [-g] [-i INPUT] [-n NUM] [-o OUTPUT] [-r] [-t TABLE] [--validate VALIDATE]
+usage: gensql.py [-h] [--extended-help] [--country {random,au,de,fr,gb,ke,jp,mx,ua,us}] [-d] [--drop-table] [--force] [-f {csv,mysql,postgresql,sqlserver}] [--fixed-length]
+                 [--generate-dates] [-g] [-i INPUT] [--no-check] [--no-chunk] [-n NUM] [-o OUTPUT] [-q] [-r] [-t TABLE] [--validate VALIDATE]
 
 options:
   -h, --help            show this help message and exit
   --extended-help       Print extended help
-  -c, --chunk           Chunk SQL INSERT statements
-  --country {au,de,fr,ke,jp,mx,ua,uk,us}
-                        The country's phone number structure to use if generating phone numbers
+  --country {random,au,de,fr,gb,ke,jp,mx,ua,us}
+                        A specific country (or random) to use for cities, phone numbers, etc.
   -d, --debug           Print tracebacks for errors
   --drop-table          WARNING: DESTRUCTIVE - use DROP TABLE with generation
   --force               WARNING: DESTRUCTIVE - overwrite any files
   -f {csv,mysql,postgresql,sqlserver}, --filetype {csv,mysql,postgresql,sqlserver}
                         Filetype to generate
+  --fixed-length        Disable any variations in length for JSON arrays, text, etc.
   --generate-dates      Generate a file of datetimes for later use
   -g, --generate-skeleton
                         Generate a skeleton input JSON schema
   -i INPUT, --input INPUT
                         Input schema (JSON)
-  -n NUM, --num NUM     The number of rows to generate
+  --no-check            Do not perform validation checks for unique columns
+  --no-chunk            Do not chunk SQL INSERT statements
+  -n NUM, --num NUM     The number of rows to generate - defaults to 1000
   -o OUTPUT, --output OUTPUT
-                        Output filename
+                        Output filename - defaults to gensql
+  -q, --quiet           Suppress printing various informational messages
   -r, --random          Enable randomness on the length of some items
   -t TABLE, --table TABLE
-                        Table name to generate SQL for
+                        Table name to generate SQL for - defaults to the filename
   --validate VALIDATE   Validate an input JSON schema
 ```
 
 ### Usage example
 
+0. Either build the C libraries with `make`, or execute `run.sh` to symlink the correct file for your arch, if available.
 1. Create a schema if you'd like, or use the included examples.
 
 ```

@@ -50,7 +55,7 @@ GenSQL expects a JSON input schema, of the format:
 }
 ```
 2. If necessary, build the C library with the included Makefile. Otherwise, rename the included file for your platform to `fast_shuffle.so` (or change the name ctypes is looking for, your choice).
-3. Run GenSQL, example `python3 create_entries.py -i $YOUR_SCHEMA.json -n 10000 -f mysql`.
+3. Run GenSQL, example `python3 gensql.py -i $YOUR_SCHEMA.json -n 10000 -f mysql`.
 
 ## Requirements
 

@@ -61,10 +66,16 @@ GenSQL expects a JSON input schema, of the format:
 
 * The `--filetype` flag only supports `csv` and `mysql`. The only supported RDBMS is MySQL (probably 8.x; it _might_ work with 5.7.8 if you want a JSON column, and earlier if you don't).
 * Generated datetimes are in UTC, i.e. no DST events exist. If you remove the query to set the session's timezone, you may have a bad time.
-* This uses a C library to perform random shuffles. There are no external libraries, so as long as you have a reasonably new compiler, `make` should work for you.
+* This uses a C library for a few functions, notably filling large arrays and shuffling them. For UUID creation, the library <uuid/uuid.h> is required to build the shared library.
+* Currently, generating UUIDs only supports v1 and v4, and if they're to be stored as `BINARY` types, only .sql file format is supported. Also as an aside, it's a terrible idea to use a UUID (at least v4) as a PK in InnoDB, so please be sure of what you're doing. If you don't believe me, generate one, and another using a monotonic integer or something similar, and compare on-disk sizes for the tablespaces.
 * `--force` and `--drop-table` have warnings for a reason. If you run a query with `DROP TABLE IF EXISTS`, please be sure of what you're doing.
-* `--random` allows for TEXT and JSON columns to have varying amounts of length, which may or may not matter to you. It will cause a ~10% slowdown. If not selected, a deterministic 20% of the rows in these columns will have a longer length than the rest. If this also bothers you, change DEFAULT_VARYING_LENGTH to `False`.
+* `--random` allows for TEXT and JSON columns to have varying amounts of length, which may or may not matter to you. It will cause a ~10% slowdown. If not selected, a deterministic 20% of the rows in these columns will have a longer length than the rest. If this also bothers you, use `--fixed-length`.
 * `--generate-dates` takes practically the same amount of time, or slightly longer, than just having them generated on-demand. It's useful if you want to have the same set of datetimes for a series of tables, although their actual ordering for row generation will remain random.
+* Any column with `id` in its name will by default be assumed to be an integer type, and will have integers generated for it. You can provide hints to disable this, or to enable it for columns without `id` in their names, by using `is_id: {true, false}` in your schema.
+* To have an empty JSON array be set as the default value for a JSON column, use the default value `array()`.
+* The generated values for a JSON column can be an object of random words (the default), or an array of random integers. For the latter, set the hint `is_numeric_array` in the schema's object.
+* To have a column be given no `INSERT` statements, e.g. remain empty / with its default value, set the hint `is_empty: true` in the schema definition for the column.
+* To have the current datetime statically defined as the default value for a TIMESTAMP column, use the default value `static_now()`. To also have the column's default automatically update the timestamp, use the default value `now()`. To have the column's default value be NULL, but update automatically to the current timestamp when the row is updated, use `null_now()`.
 * Using a column of name `phone` will generate realistic - to the best of my knowledge - phone numbers for a given country (very limited set). It's currently non-optimized for performance, and thus incurs a ~40% slowdown over the baseline. A solution in C may or may not speed things up, as it's not that performing `random.shuffle()` on a 10-digit number is slow, it's that doing so `n` times is a lot of function calls. Inlining C functions in Python [does exist](https://github.com/ssize-t/inlinec), but the non-caching of its compilation would probably negate any savings.
 * Similarly, a column of name `email` will generate realistic email addresses (all with `.com` TLD), and will incur a ~40% slowdown over the baseline.
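To illustrate how the new schema hints described in the bullets above might fit together, here is a purely hypothetical schema fragment. The exact key layout (including the `type` key) and the column names are placeholders; the project's real example schemas are not part of this diff, so treat this only as a sketch of how `is_id`, `is_empty`, `is_numeric_array`, `array()`, `static_now()`, and `null_now()` compose.

```python
import json

# Hypothetical schema fragment; key layout and column names are placeholders,
# only the hint names and default-value keywords come from the README above.
schema = {
    "user_id": {"type": "bigint unsigned", "is_id": True},
    "tags": {"type": "json", "default": "array()", "is_numeric_array": True},
    "notes": {"type": "text", "is_empty": True},
    "created_at": {"type": "timestamp", "default": "static_now()"},
    "updated_at": {"type": "timestamp", "default": "null_now()"},
}

# GenSQL expects the schema as a JSON file, so it would be written out like:
print(json.dumps(schema, indent=2))
```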

@@ -81,7 +92,8 @@ And then, from within the `mysql` client:
 
 ```mysql
 mysql> SET @@time_zone = '+00:00';
-mysql> LOAD DATA INFILE '/path/to/your/file.csv' INTO TABLE $TABLE_NAME FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'" IGNORE 1 LINES;
+mysql> SET @@unique_checks = 0;
+mysql> LOAD DATA INFILE '/path/to/your/file.csv' INTO TABLE $TABLE_NAME FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'" IGNORE 1 LINES ($COL_0, $COL_1, ... $COL_N);
 Query OK, 1000 rows affected (1.00 sec)
 Records: 1000 Deleted: 0 Skipped: 0 Warnings: 0
 ```

@@ -101,9 +113,9 @@ However, if you don't have access to the host, there are some tricks GenSQL has
 * Disabling autocommit
   * Normally, each statement is committed one at a time. With this disabled, an explicit `COMMIT` statement must be used to commit.
 * Disabling unique checks
-  * Normally, the SQL engine will check that any columns declaring `UNIQUE` constraints do in fact meet that constraint. With this disabled, repetitive `INSERT` statements are much faster, with the obvious risk of violating the constraint. For nonsense data that has been created with unique elements, this is safe to temporarily disable.
+  * Normally, the SQL engine will check that any columns declaring `UNIQUE` constraints do in fact meet that constraint. With this disabled, repetitive `INSERT` statements are much faster, with the obvious risk of violating the constraint. Since GenSQL by default does its own checks at creation for unique columns (currently limited to integer columns and `email` columns), this is generally safe to disable. If you use `--no-check`, this should not be disabled.
 * Multi-INSERT statements
-  * Normally, an `INSERT` statement might look something like `INSERT INTO $TABLE (col_1, col_2) VALUES (row_1, row_2);` Instead, they can be written like `INSERT INTO $TABLE (col_1, col_2) VALUES (row_1, row_2), (row_3, row_4),` with `n` tuples of row data. By default, `mysqld` (the server) is limited to a 64 MiB packet size, and `mysql` (the client) to a 16 MiB packet size. Both of these can be altered up to 1 GiB, but the server side may not be accessible to everyone, so GenSQL limits itself to a 10,000 row chunk size, which should comfortably fit under the server limit. For the client, you'll need to pass `--max-allowed-packet=67108864` as an arg.
+  * Normally, an `INSERT` statement might look something like `INSERT INTO $TABLE (col_1, col_2) VALUES (row_1, row_2);` Instead, they can be written like `INSERT INTO $TABLE (col_1, col_2) VALUES (row_1, row_2), (row_3, row_4),` with `n` tuples of row data. By default, `mysqld` (the server) is limited to a 64 MiB packet size, and `mysql` (the client) to a 16 MiB packet size. Both of these can be altered up to 1 GiB, but the server side may not be accessible to everyone, so GenSQL limits itself to a 10,000 row chunk size, which should comfortably fit under the server limit. For the client, you'll need to pass `--max-allowed-packet=67108864` as an arg. If you don't want this behavior, you can use `--no-chunk` when creating the data.
 
 
 Testing with inserting 100,000 rows (DB is backed by spinning disks):

@@ -147,55 +159,55 @@ Testing the creation of the standard 4-column schema, as well as an extended 8-c
 #### Python 3.11
 
 ```shell
-time python3.11 create_entries.py -n 1000000 --force --drop-table
-python3.11 create_entries.py -n 1000000 --force --drop-table  4.56s user 0.16s system 99% cpu 4.744 total
+time python3.11 gensql.py -n 1000000 --force --drop-table
+python3.11 gensql.py -n 1000000 --force --drop-table  4.56s user 0.16s system 99% cpu 4.744 total
 
-time python3.11 create_entries.py -i full.json -n 1000000 --force --drop-table
-python3.11 create_entries.py -i full.json -n 1000000 --force --drop-table  12.70s user 1.13s system 98% cpu 14.089 total
+time python3.11 gensql.py -i full.json -n 1000000 --force --drop-table
+python3.11 gensql.py -i full.json -n 1000000 --force --drop-table  12.70s user 1.13s system 98% cpu 14.089 total
 ```
 
 #### Python 3.10
 
 ```shell
-time python3 create_entries.py -n 1000000 --force --drop-table
-python3 create_entries.py -n 1000000 --force --drop-table  5.27s user 0.17s system 99% cpu 5.442 total
+time python3 gensql.py -n 1000000 --force --drop-table
+python3 gensql.py -n 1000000 --force --drop-table  5.27s user 0.17s system 99% cpu 5.442 total
 
-time python3 create_entries.py -i full.json -n 1000000 --force --drop-table
-python3 create_entries.py -i full.json -n 1000000 --force --drop-table  16.23s user 0.54s system 99% cpu 16.840 total
+time python3 gensql.py -i full.json -n 1000000 --force --drop-table
+python3 gensql.py -i full.json -n 1000000 --force --drop-table  16.23s user 0.54s system 99% cpu 16.840 total
 ```
 
 ### Intel i9 Macbook Pro
 
 #### Python 3.11
 
 ```shell
-time python3.11 create_entries.py -n 1000000 --force --drop-table
-python3.11 create_entries.py -n 1000000 --force --drop-table  8.51s user 0.47s system 99% cpu 9.023 total
+time python3.11 gensql.py -n 1000000 --force --drop-table
+python3.11 gensql.py -n 1000000 --force --drop-table  8.51s user 0.47s system 99% cpu 9.023 total
 
-time python3.11 create_entries.py -i full.json -n 1000000 --force --drop-table
-python3.11 create_entries.py -i full.json -n 1000000 --force --drop-table  25.68s user 1.60s system 99% cpu 27.395 total
+time python3.11 gensql.py -i full.json -n 1000000 --force --drop-table
+python3.11 gensql.py -i full.json -n 1000000 --force --drop-table  25.68s user 1.60s system 99% cpu 27.395 total
 ```
 
 #### Python 3.10
 
 ```shell
-time python3 create_entries.py -n 1000000 --force --drop-table
-python3 create_entries.py -n 1000000 --force --drop-table  9.88s user 0.46s system 99% cpu 10.405 total
+time python3 gensql.py -n 1000000 --force --drop-table
+python3 gensql.py -n 1000000 --force --drop-table  9.88s user 0.46s system 99% cpu 10.405 total
 
-time python3 create_entries.py -i full.json -n 1000000 --force --drop-table
-python3 create_entries.py -i full.json -n 1000000 --force --drop-table  32.60s user 1.66s system 99% cpu 34.364 total
+time python3 gensql.py -i full.json -n 1000000 --force --drop-table
+python3 gensql.py -i full.json -n 1000000 --force --drop-table  32.60s user 1.66s system 99% cpu 34.364 total
 ```
 
 ### Xeon E5-2650v2 server
 
 A ramdisk was used to eliminate the spinning disk overhead for the server.
 
 ```shell
-time python3.11 create_entries.py -n 1000000 --force --drop-table -o /mnt/ramdisk/test.sql
-python3.11 create_entries.py -n 1000000 --force --drop-table -o  15.35s user 0.85s system 98% cpu 16.377 total
+time python3.11 gensql.py -n 1000000 --force --drop-table -o /mnt/ramdisk/test.sql
+python3.11 gensql.py -n 1000000 --force --drop-table -o  15.35s user 0.85s system 98% cpu 16.377 total
 
-time python3.11 create_entries.py -i full.json -n 1000000 --force --drop-table -o /mnt/ramdisk/test.sql
-python3.11 create_entries.py -i full.json -n 1000000 --force --drop-table -o  45.26s user 3.79s system 99% cpu 49.072 total
+time python3.11 gensql.py -i full.json -n 1000000 --force --drop-table -o /mnt/ramdisk/test.sql
+python3.11 gensql.py -i full.json -n 1000000 --force --drop-table -o  45.26s user 3.79s system 99% cpu 49.072 total
 ```
 
 ## TODO
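The "Multi-INSERT statements" item in the README diff above explains why GenSQL emits INSERTs in 10,000-row chunks. A minimal sketch of that chunking idea, under the assumption that rows are already generated in memory, looks like the following; it is not the project's actual implementation, and the `repr()`-based quoting is a placeholder rather than proper SQL escaping.

```python
# Minimal sketch of chunked multi-row INSERT generation (not GenSQL's real code).
# Table and column names are placeholders; quoting via repr() is illustrative only.
CHUNK_SIZE = 10_000  # stays comfortably under mysqld's default 64 MiB packet size

def chunked_inserts(table, columns, rows, chunk_size=CHUNK_SIZE):
    """Yield one multi-row INSERT statement per chunk of rows."""
    cols = ", ".join(columns)
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i : i + chunk_size]
        values = ", ".join(
            "(" + ", ".join(repr(v) for v in row) + ")" for row in chunk
        )
        yield f"INSERT INTO {table} ({cols}) VALUES {values};"

# Example: 25,000 rows become three INSERT statements (10k, 10k, 5k rows each).
rows = [(i, f"user_{i}") for i in range(25_000)]
statements = list(chunked_inserts("users", ["id", "full_name"], rows))
print(len(statements))  # 3
```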

broken.json (−34)

This file was deleted.
