The country's phone number structure to use if generating phone numbers
-d, --debug Print tracebacks for errors
--drop-table WARNING: DESTRUCTIVE - use DROP TABLE with generation
--force WARNING: DESTRUCTIVE - overwrite any files
@@ -63,6 +64,9 @@ GenSQL expects a JSON input schema, of the format:
* This uses a C library to perform random shuffles. There are no external libraries, so as long as you have a reasonably new compiler, `make` should work for you.
* `--force` and `--drop-table` have warnings for a reason. If you run a query with `DROP TABLE IF EXISTS`, please be sure of what you're doing.
* `--random` lets TEXT and JSON columns vary in length, which may or may not matter to you. It causes a ~10% slowdown. If it is not selected, a deterministic 20% of the rows in these columns will be longer than the rest. If this also bothers you, change `DEFAULT_VARYING_LENGTH` to `False`.
* `--generate-dates` takes about as long as, or slightly longer than, generating datetimes on demand. It's useful if you want the same set of datetimes across a series of tables, although their ordering during row generation will remain random.
* A column named `phone` will generate realistic (to the best of my knowledge) phone numbers for a given country, from a very limited set of countries. It's currently not optimized for performance, and thus incurs a ~40% slowdown over the baseline. A solution in C may or may not speed things up: performing `random.shuffle()` on a 10-digit number isn't slow in itself, but doing so `n` times is a lot of function calls. Inlining C functions in Python [does exist](https://github.com/ssize-t/inlinec), but since its compilation isn't cached, that would probably negate any savings.
* Similarly, a column named `email` will generate realistic email addresses (all with a `.com` TLD), and will incur a ~40% slowdown over the baseline.
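
The per-row cost described for `phone` columns can be illustrated with a rough sketch. This is not GenSQL's actual implementation: `make_phone` and `make_plain_int` are hypothetical helpers, and treating a phone number as a shuffle of the ten digits is an assumption made purely to show how one `random.shuffle()` call per row adds up over `n` rows.

```python
import random
import timeit

DIGITS = list("0123456789")


def make_phone(rng: random.Random) -> str:
    """Hypothetical helper: shuffle the ten digits into a 10-character string."""
    digits = DIGITS.copy()
    rng.shuffle(digits)  # one Python-level call per generated row
    return "".join(digits)


def make_plain_int(rng: random.Random) -> int:
    """Hypothetical baseline: a single random integer below 10**10."""
    return rng.randrange(10**10)


if __name__ == "__main__":
    rng = random.Random(42)
    n = 10_000  # number of rows to simulate
    t_phone = timeit.timeit(lambda: make_phone(rng), number=n)
    t_plain = timeit.timeit(lambda: make_plain_int(rng), number=n)
    print(f"shuffle-based: {t_phone:.4f}s, plain int: {t_plain:.4f}s for {n} rows")
```

The absolute timings will differ by machine; the point is only that the overhead scales with the number of rows rather than with the cost of any single shuffle.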
### Loading data
@@ -120,6 +124,18 @@ mysql -h 127.0.0.1 -usgarland -ppassword test -e 0.02s user 0.01s system 0% cp
Or, in terms of ratios, using chunking is approximately 3x as fast as the baseline, while loading a CSV is approximately 4x as fast as the baseline.

```
# baseline
❯ time mysql -h localhost -usgarland -ppassword test < test.sql
mysql -h localhost -usgarland -ppassword test < test.sql 32.75s user 10.90s system 14% cpu 4:55.91 total
# no unique checks
❯ time mysql -h localhost -usgarland -ppassword test < test.sql
mysql -h localhost -usgarland -ppassword test < test.sql 25.11s user 8.67s system 14% cpu 3:48.38 total
# no unique checks, single insert, 1 gb buffer size
❯ time mysql -h localhost -usgarland -ppassword --max-allowed-packet=1073741824 test < test.sql
mysql -h localhost -usgarland -ppassword --max-allowed-packet=1073741824 test 10.64s user 0.91s system 7% cpu 2:28.29 total
```
```
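
To put the three wall-clock totals above in the same ratio terms, here is a small sketch that converts them to seconds and divides. The helper name `to_seconds` is hypothetical; the timing strings are copied from the output above.

```python
def to_seconds(wall: str) -> float:
    """Convert a `time` wall-clock string such as '4:55.91' to seconds."""
    minutes, seconds = wall.split(":")
    return int(minutes) * 60 + float(seconds)


# Wall-clock totals copied from the timing output above.
timings = {
    "baseline": "4:55.91",
    "no unique checks": "3:48.38",
    "no unique checks, single insert, 1 gb buffer size": "2:28.29",
}

baseline = to_seconds(timings["baseline"])
for label, wall in timings.items():
    elapsed = to_seconds(wall)
    print(f"{label}: {elapsed:.2f}s, {baseline / elapsed:.2f}x vs baseline")
```

On these numbers, disabling unique checks is roughly 1.3x as fast as the baseline, and adding a single insert with the larger packet size is roughly 2x as fast.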
## Benchmarks
**NOTE: THESE ARE NOT CURRENT, AND SHOULD NOT BE RELIED ON**