* chore: interim commit
* feat: added exceptions
* chore: interim commit
* feat: schema validation works well
* fix: fixed an incorrect column name
* fix: made better defaults
* chore: dedent
* fix: more numpy
* chore: different exception catching in one place
* chore: wip for fast shuffle
* feat: using a c shared library for array shuffling
* chore: moved library
* fix: identified bug with fast_shuffle; implemented random drop for ids
* feat: fixes an off-by-one causing a missing item, and improved randomness
The C library needed to use a re-entrant version of rand() so that it would not deterministically generate the same shuffled arrays.
Unrelated (but included in this commit, oh well): the schema now defaults to "full_name" instead of "name", as the generation script has also changed to generate first, last, and/or full names.
* feat: shifted some files into content/ directory
* chore: simplified exception handling for _add_error()
* chore: ran black, fixed some mypy typing
* chore: added poetry
* feat: added json creation
* chore: added example of bad json schema
* chore: added longer example of valid json schema
* feat: added TEXT column generation, made schema validation default
* chore: added sql to gitignore
* chore: fixed make clean target
* chore: added exception handling for missing library
* chore: added mpl-2.0 license
* chore: updated README
* chore: removed dead code, added whitespace
* fix: removed some looping lower() calls in favor of doing it once for a given schema
* feat: bundled two C libraries
* chore: updated gitignore
* fix: removed txt option as it didn't make sense; fixed a leftover erroneous note in help
* feat: added csv support
* chore: updated readme
* feat: added a library for darwin_x86-64
* feat: adding constants
* chore: added lorem ipsum, attributing names
* chore: tweaking full example schema
* chore: interim commit, terribly sorry for lack of description
* chore: updated .gitignore
* fix: checking is now done for column max capacity at validation, and random ids get looped under the column's max
* chore: updated readme
* chore: renamed a class instance, and reformatted
* feat: added chunking of generated sql files
* chore: ran black
* chore: fixed readme doubling
* WIP: interim commit
* fix: swapped chunking behavior to default
* chore: cleanup
* feat: added better defaults for json and timestamps; speedups; bugfixes
* chore: updated readme
* chore: updated base schemas
* chore: split gensql for better readability; renamed create_entries
* chore: removed old files not in use
* feat: added cities
* feat: added countries - wip, speed for random selection is atrocious
* fix: improved speed of country by way of severely reducing the amount of american countries - top 600 by population now
* fix: power instead of multiplication
* chore: updated readme
* chore: fixing a regression
* fix: improved documentation, made some options more accessible
* fix: improved visibility of errors in schema
* fix: split out validation into its own class
* chore: updated readme
* fix: moved schema inputs to their own directory
* chore: updated readme
* feat: added unique checks for emails, other cleanup and small improvements
* fix: no-op gitignore to get the empty directory into git
* chore: removing duplicates
* fix: better error handling, added fixed-length option
* WIP: didn't see a speedup from this, but it may prove useful so storing it
* chore: changed to logging, removed an unused module
* fix: merged sql usage from sqlite branch; approximately 20% faster for some city/country combinations due to cache usage
* WIP: didn't see a speedup from this, but it may prove useful so storing it
* fix: fixed edge case if user doesn't specify any primary key
* fix: moved json keys into constants
* fix: 'upgraded' to string comprehension to create a json column
this is in no way good, is extremely fragile, and in general is a
terrible idea - but it is slightly faster than json.dumps()
* Revert "fix: 'upgraded' to string comprehension to create a json column"
This reverts commit d67a835.
* chore: minor reformatting, added a note about performance on sample()
* chore: broke out runner's init
* fix: added some tests
* feat: added uuid support
* fix: macos creates v4 now
* chore: added prebuilts for apple silicon
* chore: UUID_STR_LEN should be 37, not 36
* chore: renamed for use with run.sh
* chore: added script to symlink correct shared library
* fix: modified how doubles/floats are created
* chore: clarified a unicode symbol
* fix: improved schema validation checks for primary keys
* fix: added more helpful error message for JSON validation errors
* chore: ran black
* feat: added exception for unsupported RDBMS
* chore: removed sqlserver as supported type, renamed postgresql to postgres
* feat: added extremely poor postgres support - does not generate table schema, but does produce valid sql files
  --fixed-length        Disable any variations in length for JSON arrays, text, etc.
  --generate-dates      Generate a file of datetimes for later use
  -g, --generate-skeleton
                        Generate a skeleton input JSON schema
  -i INPUT, --input INPUT
                        Input schema (JSON)
  --no-check            Do not perform validation checks for unique columns
  --no-chunk            Do not chunk SQL INSERT statements
  -n NUM, --num NUM     The number of rows to generate - defaults to 1000
  -o OUTPUT, --output OUTPUT
                        Output filename - defaults to gensql
  -q, --quiet           Suppress printing various informational messages
  -r, --random          Enable randomness on the length of some items
  -t TABLE, --table TABLE
                        Table name to generate SQL for - defaults to the filename
  --validate VALIDATE   Validate an input JSON schema
```

### Usage example

0. Either build the C libraries with `make`, or execute `run.sh` to symlink the correct file for your arch, if available.
1. Create a schema if you'd like, or use the included examples.

```
@@ -50,7 +55,7 @@ GenSQL expects a JSON input schema, of the format:
}
```
2. If necessary, build the C library with the included Makefile. Otherwise, rename the included file for your platform to `fast_shuffle.so` (or change the name ctypes is looking for, your choice); a loading sketch follows these steps.
3. Run GenSQL, example `python3 gensql.py -i $YOUR_SCHEMA.json -n 10000 -f mysql`.

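Step 2 notes that ctypes loads the shared library by filename. Below is a minimal sketch of what that lookup can look like from Python; the exported function name and signature (`shuffle_ints`) are assumptions for illustration, not GenSQL's actual interface.

```python
# Sketch only: `shuffle_ints` and its signature are hypothetical, not GenSQL's API.
import ctypes

lib = ctypes.CDLL("./fast_shuffle.so")  # the filename ctypes is pointed at

# Assume the C side exports: void shuffle_ints(long *arr, size_t n);
lib.shuffle_ints.argtypes = [ctypes.POINTER(ctypes.c_long), ctypes.c_size_t]
lib.shuffle_ints.restype = None

values = (ctypes.c_long * 5)(1, 2, 3, 4, 5)  # a small C array of longs
lib.shuffle_ints(values, len(values))        # shuffled in place by the C code
print(list(values))
```

If the shared library is missing or built for the wrong platform, `ctypes.CDLL` raises an `OSError`, which is why renaming (or symlinking via `run.sh`) the correct prebuilt file matters.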
## Requirements

@@ -61,10 +66,16 @@ GenSQL expects a JSON input schema, of the format:

* The `--filetype` flag only supports `csv` and `mysql`. The only supported RDBMS is MySQL (probably 8.x; it _might_ work with 5.7.8 if you want a JSON column, and earlier if you don't).
* Generated datetimes are in UTC, i.e. no DST events exist. If you remove the query to set the session's timezone, you may have a bad time.
* This uses a C library for a few functions, notably filling large arrays and shuffling them. For UUID creation, the `<uuid/uuid.h>` header is required to build the shared library.
* Currently, generating UUIDs only supports v1 and v4, and if they're to be stored as `BINARY` types, only the .sql file format is supported. As an aside, it's a terrible idea to use a UUID (at least v4) as a PK in InnoDB, so please be sure of what you're doing. If you don't believe me, generate one table with a UUID PK and another using a monotonic integer or something similar, and compare the on-disk sizes of the tablespaces.
* `--force` and `--drop-table` have warnings for a reason. If you run a query with `DROP TABLE IF EXISTS`, please be sure of what you're doing.
* `--random` allows TEXT and JSON columns to have varying lengths, which may or may not matter to you. It will cause a ~10% slowdown. If not selected, a deterministic 20% of the rows in these columns will have a longer length than the rest. If this also bothers you, use `--fixed-length`.
* `--generate-dates` takes practically the same amount of time as, or slightly longer than, having the datetimes generated on demand. It's useful if you want the same set of datetimes for a series of tables, although their actual ordering for row generation will remain random.
* Any column with `id` in its name will by default be assumed to be an integer type, and will have integers generated for it. You can provide hints to disable this, or to enable it for columns without `id` in their names, by using `is_id: {true, false}` in your schema (see the schema sketch after this list).
* To have an empty JSON array be set as the default value for a JSON column, use the default value `array()`.
* The generated values for a JSON column can be an object of random words (the default), or an array of random integers. For the latter, set the hint `is_numeric_array` in the schema's object.
* To have a column be given no `INSERT` statements, e.g. remain empty / keep its default value, set the hint `is_empty: true` in the schema definition for the column.
* To have the current datetime statically defined as the default value for a TIMESTAMP column, use the default value `static_now()`. To also have the column's default automatically update the timestamp, use the default value `now()`. To have the column's default value be NULL, but update automatically to the current timestamp when the row is updated, use `null_now()`.
* Using a column named `phone` will generate realistic - to the best of my knowledge - phone numbers for a given country (a very limited set). It's currently not optimized for performance, and thus incurs a ~40% slowdown over the baseline. A solution in C may or may not speed things up: it's not that performing `random.shuffle()` on a 10-digit number is slow, it's that doing so `n` times is a lot of function calls. Inlining C functions in Python [does exist](https://github.com/ssize-t/inlinec), but the non-caching of its compilation would probably negate any savings.
* Similarly, a column named `email` will generate realistic email addresses (all with the `.com` TLD), and will incur a ~40% slowdown over the baseline.

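To make the hints above concrete, here is a hypothetical schema sketch written from Python, so it can be dumped to the JSON file GenSQL reads. The overall key layout is an assumption for illustration; only the hint names (`is_id`, `is_numeric_array`, `is_empty`) and the special default values (`array()`, `static_now()`, `now()`, `null_now()`) come from the notes above, so check the bundled example schemas for the real format.

```python
# Hypothetical layout: column name -> options. Only the hint names and special
# defaults are taken from the README notes above; everything else is illustrative.
import json

schema_sketch = {
    "user_id": {"type": "int", "is_id": True},                        # force integer-id generation
    "full_name": {"type": "varchar"},
    "tags": {"type": "json", "default": "array()",                    # empty JSON array as the default
             "is_numeric_array": True},                               # values become arrays of random ints
    "notes": {"type": "text", "is_empty": True},                      # no INSERTs; column keeps its default
    "created_at": {"type": "timestamp", "default": "static_now()"},   # static current datetime as default
    "updated_at": {"type": "timestamp", "default": "now()"},          # default that also updates on row update
}

with open("sketch_schema.json", "w") as f:
    json.dump(schema_sketch, f, indent=2)
```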
@@ -81,7 +92,8 @@ And then, from within the `mysql` client:

```mysql
mysql> SET @@time_zone = '+00:00';
mysql> SET @@unique_checks = 0;
mysql> LOAD DATA INFILE '/path/to/your/file.csv' INTO TABLE $TABLE_NAME FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'" IGNORE 1 LINES ($COL_0, $COL_1, ... $COL_N);
Query OK, 1000 rows affected (1.00 sec)
Records: 1000  Deleted: 0  Skipped: 0  Warnings: 0
```
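For reference against the `LOAD DATA` statement above (comma-separated fields, optionally enclosed by single quotes, one header line skipped), here is a small sketch of producing a CSV in that shape with Python's `csv` module; the column names and rows are invented, not GenSQL's actual output.

```python
# Sketch only: columns and values are invented. The quoting mirrors the
# FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'" IGNORE 1 LINES clause above.
import csv

rows = [
    (1, "Ada Lovelace", "ada@example.com"),
    (2, "Grace Hopper", "grace@example.com"),
]

with open("file.csv", "w", newline="") as f:
    writer = csv.writer(f, quotechar="'", quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["id", "full_name", "email"])  # header row, skipped by IGNORE 1 LINES
    writer.writerows(rows)
```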
@@ -101,9 +113,9 @@ However, if you don't have access to the host, there are some tricks GenSQL has

* Disabling autocommit
  * Normally, each statement is committed one at a time. With this disabled, an explicit `COMMIT` statement must be used to commit.
* Disabling unique checks
  * Normally, the SQL engine will check that any columns declaring `UNIQUE` constraints do in fact meet that constraint. With this disabled, repetitive `INSERT` statements are much faster, with the obvious risk of violating the constraint. Since GenSQL by default does its own checks at creation for unique columns (currently limited to integer columns and `email` columns), this is generally safe to disable. If you use `--no-check`, this should not be disabled.
* Multi-INSERT statements
  * Normally, an `INSERT` statement might look something like `INSERT INTO $TABLE (col_1, col_2) VALUES (row_1, row_2);`. Instead, they can be written like `INSERT INTO $TABLE (col_1, col_2) VALUES (row_1, row_2), (row_3, row_4), ...` with `n` tuples of row data. By default, `mysqld` (the server) is limited to a 64 MiB packet size, and `mysql` (the client) to a 16 MiB packet size. Both can be raised up to 1 GiB, but the server side may not be accessible to everyone, so GenSQL limits itself to a 10,000 row chunk size, which should comfortably fit under the server limit. For the client, you'll need to pass `--max-allowed-packet=67108864` as an arg. If you don't want this behavior, you can use `--no-chunk` when creating the data (a chunking sketch follows this list).

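To illustrate the multi-row `INSERT` chunking described above (not GenSQL's actual code, just the general technique it applies with its 10,000-row chunks):

```python
# Illustrative sketch of multi-row INSERT chunking; values are pre-escaped
# strings purely for brevity. Not GenSQL's implementation.
def chunked_inserts(table, columns, rows, chunk_size=10_000):
    """Yield one multi-row INSERT statement per chunk of rows."""
    cols = ", ".join(columns)
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        values = ", ".join(f"({', '.join(row)})" for row in chunk)
        yield f"INSERT INTO {table} ({cols}) VALUES {values};"

# 25,000 rows become three statements (10k + 10k + 5k), each comfortably under
# the default 64 MiB server packet limit for rows of this size.
rows = [(str(i), f"'user_{i}'") for i in range(25_000)]
statements = list(chunked_inserts("users", ["id", "username"], rows))
print(len(statements))  # 3
```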
Testing with inserting 100,000 rows (DB is backed by spinning disks):
@@ -147,55 +159,55 @@ Testing the creation of the standard 4-column schema, as well as an extended 8-column schema:

#### Python 3.11

```shell
❯ time python3.11 gensql.py -n 1000000 --force --drop-table
python3.11 gensql.py -n 1000000 --force --drop-table  4.56s user 0.16s system 99% cpu 4.744 total

❯ time python3.11 gensql.py -i full.json -n 1000000 --force --drop-table
python3.11 gensql.py -i full.json -n 1000000 --force --drop-table  12.70s user 1.13s system 98% cpu 14.089 total
```

#### Python 3.10

```shell
❯ time python3 gensql.py -n 1000000 --force --drop-table
python3 gensql.py -n 1000000 --force --drop-table  5.27s user 0.17s system 99% cpu 5.442 total

❯ time python3 gensql.py -i full.json -n 1000000 --force --drop-table
python3 gensql.py -i full.json -n 1000000 --force --drop-table  16.23s user 0.54s system 99% cpu 16.840 total
```

### Intel i9 Macbook Pro

#### Python 3.11

```shell
❯ time python3.11 gensql.py -n 1000000 --force --drop-table
python3.11 gensql.py -n 1000000 --force --drop-table  8.51s user 0.47s system 99% cpu 9.023 total

❯ time python3.11 gensql.py -i full.json -n 1000000 --force --drop-table
python3.11 gensql.py -i full.json -n 1000000 --force --drop-table  25.68s user 1.60s system 99% cpu 27.395 total
```

#### Python 3.10

```shell
❯ time python3 gensql.py -n 1000000 --force --drop-table
python3 gensql.py -n 1000000 --force --drop-table  9.88s user 0.46s system 99% cpu 10.405 total

❯ time python3 gensql.py -i full.json -n 1000000 --force --drop-table
python3 gensql.py -i full.json -n 1000000 --force --drop-table  32.60s user 1.66s system 99% cpu 34.364 total
```

### Xeon E5-2650v2 server

A ramdisk was used to eliminate the spinning disk overhead for the server.

```shell
❯ time python3.11 gensql.py -n 1000000 --force --drop-table -o /mnt/ramdisk/test.sql
python3.11 gensql.py -n 1000000 --force --drop-table -o  15.35s user 0.85s system 98% cpu 16.377 total
```