
Commit d7bfdfc

feat: added speedup, as well as email and phone (#5)

1 parent 199e4a2 · commit d7bfdfc

9 files changed: +210 −55 lines changed

README.md (+20 −4)
@@ -7,13 +7,14 @@ Ever want to quickly create millions of rows of random data for a database, with
 ## Usage
 
 ```shell
-usage: create_entries.py [-h] [--extended-help] [-c] [-d] [--drop-table] [--force] [-f {csv,mysql,postgresql,sqlserver}] [--generate-dates] [-g] [-i INPUT] [-n NUM] [-o OUTPUT]
-                         [-r] [-t TABLE] [--validate VALIDATE]
+usage: create_entries.py [-h] [--extended-help] [-c] [--country {au,de,fr,ke,jp,mx,ua,uk,us}] [-d] [--drop-table] [--force] [-f {csv,mysql,postgresql,sqlserver}] [--generate-dates] [-g] [-i INPUT] [-n NUM] [-o OUTPUT] [-r] [-t TABLE] [--validate VALIDATE]
 
 options:
   -h, --help            show this help message and exit
   --extended-help       Print extended help
   -c, --chunk           Chunk SQL INSERT statements
+  --country {au,de,fr,ke,jp,mx,ua,uk,us}
+                        The country's phone number structure to use if generating phone numbers
   -d, --debug           Print tracebacks for errors
   --drop-table          WARNING: DESTRUCTIVE - use DROP TABLE with generation
   --force               WARNING: DESTRUCTIVE - overwrite any files
@@ -63,6 +64,9 @@ GenSQL expects a JSON input schema, of the format:
 * This uses a C library to perform random shuffles. There are no external libraries, so as long as you have a reasonably new compiler, `make` should work for you.
 * `--force` and `--drop-table` have warnings for a reason. If you run a query with `DROP TABLE IF EXISTS`, please be sure of what you're doing.
 * `--random` allows TEXT and JSON columns to vary in length, which may or may not matter to you. It causes a ~10% slowdown. If not selected, a deterministic 20% of the rows in these columns will be longer than the rest. If that also bothers you, change DEFAULT_VARYING_LENGTH to `False`.
+* `--generate-dates` takes practically the same amount of time as, or slightly longer than, generating datetimes on demand. It's useful if you want the same set of datetimes for a series of tables, although their ordering during row generation remains random.
+* A column named `phone` will generate realistic - to the best of my knowledge - phone numbers for a given country (currently a very limited set). This is not yet optimized for performance, and incurs a ~40% slowdown over the baseline. A solution in C may or may not speed things up: it isn't that performing `random.shuffle()` on a 10-digit number is slow, it's that doing so `n` times is a lot of function calls. Inlining C functions in Python [does exist](https://github.com/ssize-t/inlinec), but the non-caching of its compilation would probably negate any savings. A sketch of the approach is shown below.
+* Similarly, a column named `email` will generate realistic email addresses (all with a `.com` TLD), and incurs a ~40% slowdown over the baseline.
 
 ### Loading data
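The digit-shuffle approach the phone note describes is small enough to sketch. A minimal illustration follows; the `PHONE_NUMBER` country formatters here are hypothetical stand-ins, since the real mapping is imported from the project's constants and isn't shown in this diff:

```python
import random

# Hypothetical stand-ins for the PHONE_NUMBER formatters imported by
# create_entries.py; the real per-country formats live elsewhere in the repo.
PHONE_NUMBER = {
    "us": lambda d: f"({d[0:3]}) {d[3:6]}-{d[6:10]}",
    "uk": lambda d: f"0{d[0:4]} {d[4:10]}",
}

def make_phone(country: str) -> str:
    # shuffle the ten digits, then format them per country -
    # the same shape as the make_row branch added in this commit
    digits = [str(x) for x in range(10)]
    random.shuffle(digits)
    return PHONE_NUMBER[country]("".join(digits))

print(make_phone("us"))  # e.g. (382) 590-1467
```

The ~40% slowdown comes from doing this per row: each row costs a `shuffle`, a `join`, and a formatter call, which adds up at millions of rows.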
@@ -120,6 +124,18 @@ mysql -h 127.0.0.1 -usgarland -ppassword test -e 0.02s user 0.01s system 0% cp
 
 Or, in terms of ratios, using chunking is approximately 3x as fast as the baseline, while loading a CSV is approximately 4x as fast as the baseline.
 
+```
+# baseline
+❯ time mysql -h localhost -usgarland -ppassword test < test.sql
+mysql -h localhost -usgarland -ppassword test < test.sql 32.75s user 10.90s system 14% cpu 4:55.91 total
+# no unique checks
+❯ time mysql -h localhost -usgarland -ppassword test < test.sql
+mysql -h localhost -usgarland -ppassword test < test.sql 25.11s user 8.67s system 14% cpu 3:48.38 total
+# no unique checks, single insert, 1 GB max allowed packet size
+❯ time mysql -h localhost -usgarland -ppassword --max-allowed-packet=1073741824 test < test.sql
+mysql -h localhost -usgarland -ppassword --max-allowed-packet=1073741824 test 10.64s user 0.91s system 7% cpu 2:28.29 total
+```
+
 ## Benchmarks
 
 **NOTE: THESE ARE NOT CURRENT, AND SHOULD NOT BE RELIED ON**
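The chunking win comes from batching many rows into one multi-row INSERT. Here is a minimal sketch of that statement shape, assuming a made-up chunk size and table name; the commit's `make_sql_rows` does the same trailing comma-to-semicolon swap:

```python
# Sketch of chunked multi-row INSERT assembly; DEFAULT_INSERT_CHUNK_SIZE and
# the table name are illustrative values, not the project's.
DEFAULT_INSERT_CHUNK_SIZE = 10_000

def chunked_inserts(vals: list[str], table: str = "test") -> str:
    parts = []
    for i in range(0, len(vals), DEFAULT_INSERT_CHUNK_SIZE):
        parts.append(f"INSERT INTO `{table}` VALUES\n")
        for row in vals[i : i + DEFAULT_INSERT_CHUNK_SIZE]:
            parts.append(f"({row}),\n")
        # close the chunk: swap the last comma for a semi-colon
        parts[-1] = parts[-1][::-1].replace(",", ";", 1)[::-1]
    return "".join(parts)

print(chunked_inserts(["1, 'a@example.com'", "2, 'b@example.com'"]))
```

Fewer statements means less per-statement overhead on the server, which is broadly where the ~3x load-time improvement comes from.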
@@ -184,8 +200,8 @@ python3.11 create_entries.py -i full.json -n 1000000 --force --drop-table -o 4
 
 ## TODO
 
-* Support other SQL varieties, as well as CSV and TXT.
-* Add more column data sources, like addresses, phone numbers, and email addresses.
+* Support other SQL varieties.
+* Add more column data sources.
 * Create tests.
 * Come up with a coherent exception handling mechanism.
 * Add logging, maybe.

create_entries.py (+64 −17)
@@ -23,6 +23,7 @@
     JSON_OBJ_MAX_KEYS,
     JSON_OBJ_MAX_VALS,
     MYSQL_INT_MIN_MAX,
+    PHONE_NUMBER,
 )
 from utilities import utilities
 

@@ -90,6 +91,8 @@ def _add_error(error_schema: dict, key: tuple, value: dict, error_message: str):
         "timestamp",
         "text",
         "json",
+        "email",
+        "phone",
     ]
     pks = []
     errors = {}
@@ -116,6 +119,20 @@ def _add_error(error_schema: dict, key: tuple, value: dict, error_message: str):
         ]
         if col_pk:
             pks.append(k)
+        if k == "phone" and "char" not in col_type:
+            _add_error(
+                errors,
+                (k, "type"),
+                v,
+                f"column `{k}` must be of type CHAR or VARCHAR",
+            )
+        if k == "phone" and col_unique:
+            _add_error(
+                errors,
+                (k, "unique"),
+                v,
+                f"unique is not a valid option for column `{k}` - this is a performance decision; numbers are still unlikely to collide",
+            )
         if not col_type:
             _add_error(
                 errors,
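Concretely, a schema entry such as `"phone": {"type": "bigint", "unique": "true"}` would now fail validation twice: once because `phone` must be CHAR or VARCHAR, and once because `unique` is disallowed on `phone` as a performance trade-off (the shuffled ten-digit numbers are unlikely to collide anyway).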
@@ -135,7 +152,7 @@ def _add_error(error_schema: dict, key: tuple, value: dict, error_message: str):
                 errors,
                 (k, "width"),
                 v,
-                f"column type `{col_type}` is not supported",
+                f"width is not a valid option for column `{k}` of type `{col_type}`",
             )
         if col_autoinc and "int" not in col_type:
             _add_error(
@@ -260,6 +277,9 @@ def mysql(
     msg += f"DROP TABLE IF EXISTS `{tbl_name}`;\n"
     msg += f"CREATE TABLE `{tbl_name}` (\n"
     for col, col_attributes in schema.items():
+        # this may expand in the future
+        if col in ["phone"]:
+            cols[col]["create_ranged_arr"] = True
         col_opts = []
         for k, v in col_attributes.items():
             match k:
@@ -322,6 +342,8 @@ def __init__(self, args, schema, tbl_name, tbl_cols, tbl_create):
         self.tbl_cols = tbl_cols
         self.tbl_create = tbl_create
         self.tbl_name = tbl_name
+        _has_monotonic = False
+        _has_unique = False
 
         # exceeding auto_increment capacity is checked at schema validation, but since
         # the user can specify --validate without passing --num, uniques have to be checked here
@@ -336,18 +358,25 @@ def __init__(self, args, schema, tbl_name, tbl_cols, tbl_create):
                     f"MYSQL_MAX_{v['type'].upper().split()[0]}_SIGNED"
                 ]
 
+            if v.get("auto_inc"):
+                _has_monotonic = True
             if v.get("unique"):
+                _has_unique = True
                 if self.args.num > col_max_val:
                     raise TooManyRowsError(k, self.args.num, col_max_val) from None
+            # if uniquity isn't required, and the requested number of rows is greater
+            # than the column can handle, just set it to the column's max since we can repeat
             else:
                 if self.args.num > col_max_val:
                     self.rand_max_id = col_max_val
                 else:
                     self.rand_max_id = self.args.num
 
-        self.monotonic_id = self.allocator(self.args.num)
-        self.random_id = self.allocator(self.rand_max_id, shuffle=True)
-        self.unique_id = self.allocator(self.args.num, shuffle=True)
+        if _has_monotonic:
+            self.monotonic_id = self.allocator(0, self.args.num)
+        self.random_id = self.allocator(0, self.rand_max_id, shuffle=True)
+        if _has_unique:
+            self.unique_id = self.allocator(0, self.args.num, shuffle=True)
         try:
             with open("content/dates.txt", "r") as f:
                 self.dates = f.readlines()
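The two-argument calls above imply a ranged allocator contract: hand out ids from [start, end), optionally shuffled, and let released ids rejoin at the far end. Below is a hypothetical pure-Python stand-in; the real allocator is backed by the C fast_shuffle library and isn't shown in this diff:

```python
import random
from collections import deque

class IdAllocator:
    """Hypothetical stand-in for the allocator used by create_entries.py."""

    def __init__(self, start: int, end: int, shuffle: bool = False):
        ids = list(range(start, end))
        if shuffle:
            random.shuffle(ids)
        self._ids = deque(ids)

    def allocate(self) -> int:
        return self._ids.popleft()

    def release(self, id_: int) -> None:
        # released ids are appended to the right of the deque,
        # so they won't be immediately repeated
        self._ids.append(id_)

alloc = IdAllocator(0, 1000, shuffle=True)
first = alloc.allocate()
alloc.release(first)
```

Guarding `monotonic_id` and `unique_id` behind `_has_monotonic` / `_has_unique` means tables without those columns skip two allocations of `num` ids each, which is part of this commit's speedup.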
@@ -380,9 +409,9 @@ def sample(
             sample_list.append(iterable[idx])
         return sample_list
 
-    def make_row(self, schema: dict, idx: int) -> dict:
+    def make_row(self, schema: dict, idx: int, has_timestamp: bool) -> dict:
         row = {}
-        if any("timestamp" in s.values() for s in schema.values()):
+        if has_timestamp:
             date = self.sample(self.dates, self.args.num)
         for col, opts in schema.items():
             if "id" in col:
@@ -395,7 +424,6 @@ def make_row(self, schema: dict, idx: int) -> dict:
 
                 # these are appended to the right of the deque, so they won't be immediately repeated
                 self.random_id.release(row[col])
-
             elif col == "first_name":
                 random_first = self.sample(self.first_names, self.num_rows_first_names)
                 first_name = f"{random_first}".replace("'", "''")
@@ -414,13 +442,14 @@ def make_row(self, schema: dict, idx: int) -> dict:
 
             elif schema[col]["type"] == "json":
                 json_dict = {}
-                keys = self.sample(
+                json_keys = self.sample(
                     self.wordlist, self.num_rows_wordlist, JSON_OBJ_MAX_KEYS
                 )
-                vals = self.sample(
-                    self.wordlist, self.num_rows_wordlist, JSON_OBJ_MAX_VALS
+                # grab an extra for use with email if needed
+                json_vals = self.sample(
+                    self.wordlist, self.num_rows_wordlist, JSON_OBJ_MAX_VALS + 1
                 )
-                json_dict[keys.pop()] = vals.pop()
+                json_dict[json_keys.pop()] = json_vals.pop()
                 max_rows_pct = float(
                     schema.get(col, {}).get("max_length", DEFAULT_MAX_FIELD_PCT)
                 )
@@ -432,13 +461,30 @@ def make_row(self, schema: dict, idx: int) -> dict:
                 json_arr_len = ceil((JSON_OBJ_MAX_VALS - 1) * max_rows_pct)
                 # make 20% of the JSON objects nested with a list object of length json_arr_len
                 if not idx % 5:
-                    key = keys.pop()
+                    key = json_keys.pop()
                     json_dict[key] = {}
-                    json_dict[key][keys.pop()] = [
-                        vals.pop() for _ in range(json_arr_len)
+                    json_dict[key][json_keys.pop()] = [
+                        json_vals.pop() for _ in range(json_arr_len)
                     ]
                 row[col] = f"'{json.dumps(json_dict)}'"
 
+            elif col == "email":
+                try:
+                    email_domain = json_vals.pop()
+                except UnboundLocalError:
+                    email_domain = self.sample(self.wordlist, self.num_rows_wordlist)
+                try:
+                    email_local = random_first
+                except UnboundLocalError:
+                    email_local = self.sample(
+                        self.first_names, self.num_rows_first_names
+                    )
+                row[col] = f"'{email_local.lower()}@{email_domain}.com'"
+            elif col == "phone":
+                phone_digits = [str(x) for x in range(10)]
+                random.shuffle(phone_digits)
+                phone_str = "".join(phone_digits)
+                row[col] = f"'{PHONE_NUMBER[args.country](phone_str)}'"
             elif schema[col]["type"] == "text":
                 max_rows_pct = float(
                     schema.get(col, {}).get("max_length", DEFAULT_MAX_FIELD_PCT)
@@ -490,8 +536,8 @@ def make_sql_rows(self, vals: list) -> list:
             chunk_list = vals[i : i + DEFAULT_INSERT_CHUNK_SIZE]
             for row in chunk_list:
                 insert_rows.append(f"({row}),\n")
-            # if we reach the end of a chunk list, make the multi-insert a commit by swapping
-            # the last comma to a semi-colon
+            # if we reach the end of a chunk list, make the multi-insert statement a single
+            # query by swapping the last comma to a semi-colon
             insert_rows[-1] = insert_rows[-1][::-1].replace(",", ";", 1)[::-1]
         else:
             for row in vals:
@@ -508,8 +554,9 @@ def make_sql_rows(self, vals: list) -> list:
     def run(self):
         sql_inserts = []
         random.seed(os.urandom(4))
+        _has_timestamp = any("timestamp" in s.values() for s in self.schema.values())
         for i in range(1, self.args.num + 1):
-            row = self.make_row(self.schema, i)
+            row = self.make_row(self.schema, i, _has_timestamp)
            sql_inserts.append(row)
         vals = [",".join(str(v) for v in d.values()) for d in sql_inserts]
         match args.filetype:
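Hoisting the `any()` scan out of the per-row loop is the other part of the speedup: the schema is invariant across rows, so the check only needs to run once. Here is a rough, self-contained demonstration of the difference; all names are made up for the demo:

```python
import timeit

schema = {f"col{i}": {"type": "int"} for i in range(20)}
schema["created_at"] = {"type": "timestamp"}

def per_row(n: int) -> None:
    # old shape: re-scan the schema on every row
    for _ in range(n):
        any("timestamp" in s.values() for s in schema.values())

def hoisted(n: int) -> None:
    # new shape: scan once, then reuse the flag
    has_timestamp = any("timestamp" in s.values() for s in schema.values())
    for _ in range(n):
        _ = has_timestamp

print("per-row:", timeit.timeit(lambda: per_row(100_000), number=1))
print("hoisted:", timeit.timeit(lambda: hoisted(100_000), number=1))
```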

full.json (+5 −0)
@@ -38,5 +38,10 @@
         "type": "timestamp",
         "nullable": "true",
         "default": "NULL"
+    },
+    "email": {
+        "type": "varchar",
+        "width": "255",
+        "nullable": "true"
     }
 }

library/fast_shuffle.c (+12 −0)
@@ -14,6 +14,18 @@ uint32_t *fill_array(uint32_t size) {
     return arr;
 }
 
+uint32_t *fill_array_range(uint32_t start, uint32_t end) {
+    uint32_t size = end - start;
+    uint32_t *arr = calloc(size, sizeof(uint32_t));
+    if (!arr) {
+        return NULL;
+    }
+    /* fill with the values [start, end); index from zero so writes stay inside the allocation */
+    for (uint32_t i = 0; i < size; i++) {
+        arr[i] = start + i;
+    }
+    return arr;
+}
+
 uint32_t right_shift(uint32_t range, uint32_t *seed) {
     uint64_t random32bit, multiresult;
     uint32_t leftover, threshold;
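For poking at `fill_array_range` from Python, a ctypes sketch like the following should work once `make` has produced a shared object. The library path and the idea of loading it via ctypes are assumptions; the project's actual Python/C binding isn't part of this diff:

```python
import ctypes

# Assumed artifact name; adjust to whatever your `make` actually produces.
lib = ctypes.CDLL("./library/fast_shuffle.so")
lib.fill_array_range.argtypes = [ctypes.c_uint32, ctypes.c_uint32]
lib.fill_array_range.restype = ctypes.POINTER(ctypes.c_uint32)

start, end = 100, 110
arr = lib.fill_array_range(start, end)  # values in [start, end)
print([arr[i] for i in range(end - start)])  # [100, 101, ..., 109]
# note: the C side allocates with calloc; a long-running caller
# would need to free this buffer via the C library
```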

only_email.json (+25 −0)
@@ -0,0 +1,25 @@
+
+{
+    "user_id": {
+        "type": "bigint unsigned",
+        "nullable": "false",
+        "auto increment": "true",
+        "primary key": "true"
+    },
+    "email": {
+        "type": "varchar",
+        "width": "255",
+        "nullable": "true"
+    },
+    "external_id": {
+        "type": "bigint unsigned",
+        "nullable": "false",
+        "unique": "true",
+        "default": "0"
+    },
+    "last_modified": {
+        "type": "timestamp",
+        "nullable": "true",
+        "default": "null"
+    }
+}

only_phone.json (+25 −0)
@@ -0,0 +1,25 @@
+
+{
+    "user_id": {
+        "type": "bigint unsigned",
+        "nullable": "false",
+        "auto increment": "true",
+        "primary key": "true"
+    },
+    "phone": {
+        "type": "varchar",
+        "width": "255",
+        "nullable": "true"
+    },
+    "external_id": {
+        "type": "bigint unsigned",
+        "nullable": "false",
+        "unique": "true",
+        "default": "0"
+    },
+    "last_modified": {
+        "type": "timestamp",
+        "nullable": "true",
+        "default": "null"
+    }
+}
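With either schema in place, generation is the usual invocation, along the lines of `python3 create_entries.py -i only_phone.json -n 100000 --country us -o phone.sql` (illustrative values; see the usage text above for the full option list).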
