
Commit 5c3eba4

Merge pull request #135 from datafold/test-snowflake-prod

tests: parallel + snowflake, presto in CI + benchmark scripts

2 parents: f08d821 + 43c4042

File tree

13 files changed: +452 −353 lines

.github/workflows/ci.yml

Lines changed: 8 additions & 5 deletions

```diff
@@ -32,15 +32,18 @@ jobs:
       uses: actions/setup-python@v3
       with:
         python-version: ${{ matrix.python-version }}
-
+
     - name: Build the stack
-      run: docker-compose up -d mysql
+      run: docker-compose up -d mysql postgres presto

     - name: Install Poetry
       run: pip install poetry

     - name: Install package
-      run: poetry install
-
+      run: "poetry install"
+
     - name: Run unit tests
-      run: poetry run python3 -m unittest
+      env:
+        DATADIFF_SNOWFLAKE_URI: '${{ secrets.DATADIFF_SNOWFLAKE_URI }}'
+        DATADIFF_PRESTO_URI: '${{ secrets.DATADIFF_PRESTO_URI }}'
+      run: poetry run unittest-parallel -j 16
```
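The two `DATADIFF_*` secrets make the cloud-database tests runnable in CI. A minimal sketch of how a test might consume them, assuming tests skip when a URI is absent; the class and helper names below are illustrative, not this repository's actual test code:

```python
import os
import unittest

# Illustrative only: the real test suite may wire these env vars up differently.
SNOWFLAKE_URI = os.environ.get("DATADIFF_SNOWFLAKE_URI")

class TestSnowflakeTypes(unittest.TestCase):
    @unittest.skipUnless(SNOWFLAKE_URI, "DATADIFF_SNOWFLAKE_URI is not set")
    def test_int_column_diff(self):
        # Connect with SNOWFLAKE_URI and diff the seeded test tables here.
        ...
```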

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -134,6 +134,7 @@ ratings*.csv
 drive
 mysqltuner.pl
 benchmark_*.jsonl
+benchmark_*.png

 # Mac
 .DS_Store
```

README.md

Lines changed: 18 additions & 6 deletions

````diff
@@ -171,9 +171,9 @@ Users can also install several drivers at once:
 Usage: `data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]`

 See the [example command](#example-command-and-output) and the [sample
-connection strings](#supported-databases).
+connection strings](#supported-databases).

-Note that for some databases, the arguments that you enter in the command line
+Note that for some databases, the arguments that you enter in the command line
 may be case-sensitive. This is the case for the Snowflake schema and table names.

 Options:
@@ -423,11 +423,16 @@ $ docker-compose up -d mysql postgres # run mysql and postgres dbs in background

 **3. Run Unit Tests**

+There are more than 1000 tests for all the different type and database
+combinations, so we recommend using `unittest-parallel`, which is installed as a
+development dependency.
+
 ```shell-session
-$ poetry run python3 -m unittest
+$ poetry run unittest-parallel -j 16       # run all tests
+$ poetry run python -m unittest -k <test>  # run an individual test
 ```

-**4. Seed the Database(s)**
+**4. Seed the Database(s) (optional)**

 First, download the CSVs of seeding data:

@@ -451,7 +456,7 @@ $ poetry run preql -f dev/prepare_db.pql mssql://<uri>
 $ poetry run preql -f dev/prepare_db.pql bigquery:///<project>
 ```

-**5. Run **data-diff** against seeded database**
+**5. Run **data-diff** against the seeded database (optional)**

 ```bash
 poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose
@@ -460,7 +465,14 @@ poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgr
 **6. Run benchmarks (optional)**

 ```shell-session
-$ dev/benchmark.sh
+$ dev/benchmark.sh                 # runs the benchmarks and writes results to benchmark_<sha>.jsonl
+$ poetry run python3 dev/graph.py  # creates graphs from the benchmark_*.jsonl files
+```
+
+You can adjust how many rows to benchmark by passing `N_SAMPLES` to `dev/benchmark.sh`:
+
+```shell-session
+$ N_SAMPLES=100000000 dev/benchmark.sh  # 100m rows, our canonical target
 ```

````

data_diff/utils.py

Lines changed: 13 additions & 0 deletions

```diff
@@ -1,3 +1,5 @@
+import math
+
 from typing import Sequence, Optional, Tuple, Union, Dict, Any
 from uuid import UUID

@@ -38,3 +40,14 @@ def is_uuid(u):
     except ValueError:
         return False
     return True
+
+
+def number_to_human(n):
+    millnames = ["", "k", "m", "b"]
+    n = float(n)
+    millidx = max(
+        0,
+        min(len(millnames) - 1, int(math.floor(0 if n == 0 else math.log10(abs(n)) / 3))),
+    )
+
+    return "{:.0f}{}".format(n / 10 ** (3 * millidx), millnames[millidx])
```

dev/benchmark.sh

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+#!/bin/bash
+
+run_test() {
+    N_SAMPLES=${N_SAMPLES:-1000000} N_THREADS=${N_THREADS:-16} LOG_LEVEL=${LOG_LEVEL:-info} BENCHMARK=1 \
+        poetry run python3 -m unittest tests/test_database_types.py -v -k $1
+}
+
+run_test "postgresql_int_mysql_int"
+run_test "mysql_int_mysql_int"
+run_test "postgresql_int_postgresql_int"
+run_test "postgresql_ts6_n_tz_mysql_ts0"
+run_test "postgresql_ts6_n_tz_snowflake_ts9"
+run_test "postgresql_int_presto_int"
+run_test "postgresql_int_redshift_int"
+run_test "postgresql_int_snowflake_int"
+run_test "postgresql_int_bigquery_int"
+run_test "snowflake_int_snowflake_int"
+
+poetry run python dev/graph.py
```
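The script drives everything through environment variables with defaults (`N_SAMPLES`, `N_THREADS`, `LOG_LEVEL`) and sets `BENCHMARK=1` for the run. A minimal sketch of how the benchmarked tests could read these knobs and emit the JSONL rows that `dev/graph.py` consumes; `record_benchmark` is a hypothetical helper, not code from this commit:

```python
import json
import os

# Defaults mirror dev/benchmark.sh; the real tests may read these differently.
N_SAMPLES = int(os.environ.get("N_SAMPLES", 1_000_000))
N_THREADS = int(os.environ.get("N_THREADS", 16))
BENCHMARK = bool(os.environ.get("BENCHMARK"))

def record_benchmark(row: dict, sha: str) -> None:
    """Hypothetical: append one result row to the benchmark_<sha>.jsonl file."""
    if BENCHMARK:
        with open(f"benchmark_{sha}.jsonl", "a") as f:
            f.write(json.dumps(row) + "\n")
```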

dev/graph.py

Lines changed: 56 additions & 0 deletions

```diff
@@ -0,0 +1,56 @@
+# Use this to graph the benchmarking results (see benchmark.sh)
+#
+# To run this:
+# - pip install pandas
+# - pip install plotly
+#
+
+import pandas as pd
+import plotly.graph_objects as go
+from data_diff.utils import number_to_human
+import glob
+
+for benchmark_file in glob.glob("benchmark_*.jsonl"):
+    rows = pd.read_json(benchmark_file, lines=True)
+    rows["cloud"] = rows["test"].str.match(r".*(snowflake|redshift|presto|bigquery)")
+    sha = benchmark_file.split("_")[1].split(".")[0]
+    print(f"Generating graphs from {benchmark_file}..")
+
+    for n_rows, group in rows.groupby(["rows"]):
+        image_path = f"benchmark_{sha}_{number_to_human(n_rows)}.png"
+        print(f"\t rows: {number_to_human(n_rows)}, image: {image_path}")
+
+        r = group.drop_duplicates(subset=["name_human"])
+        r = r.sort_values(by=["cloud", "source_type", "target_type", "name_human"])
+
+        fig = go.Figure(
+            data=[
+                go.Bar(
+                    name="count(*)",
+                    x=r["name_human"],
+                    y=r["count_max_sec"],
+                    text=r["count_max_sec"],
+                    textfont=dict(color="blue"),
+                ),
+                go.Bar(
+                    name="data-diff (checksum)",
+                    x=r["name_human"],
+                    y=r["checksum_sec"],
+                    text=r["checksum_sec"],
+                    textfont=dict(color="red"),
+                ),
+                go.Bar(
+                    name="Download and compare †",
+                    x=r["name_human"],
+                    y=r["download_sec"],
+                    text=r["download_sec"],
+                    textfont=dict(color="green"),
+                ),
+            ]
+        )
+        # Change the bar mode
+        fig.update_layout(title=f"data-diff {number_to_human(n_rows)} rows, sha: {sha}")
+        fig.update_traces(texttemplate="%{text:.1f}", textposition="outside")
+        fig.update_layout(uniformtext_minsize=2, uniformtext_mode="hide")
+        fig.update_yaxes(title="Time")
+        fig.write_image(image_path, scale=2)
```
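`dev/graph.py` assumes each JSONL row carries at least the columns it references: `test`, `name_human`, `source_type`, `target_type`, `rows`, `count_max_sec`, `checksum_sec`, and `download_sec`. An illustrative row, with the field names taken from the script above and the values invented:

```python
# Illustrative benchmark_<sha>.jsonl row; the timings are made up.
example_row = {
    "test": "test_types_postgresql_int_snowflake_int",  # matched against the cloud regex
    "name_human": "postgresql <-> snowflake: int",
    "source_type": "postgresql",
    "target_type": "snowflake",
    "rows": 1_000_000,     # grouped on, then humanized for the image name
    "count_max_sec": 1.2,  # "count(*)" bar
    "checksum_sec": 3.4,   # "data-diff (checksum)" bar
    "download_sec": 17.9,  # "Download and compare" bar
}
```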
