-
Notifications
You must be signed in to change notification settings - Fork 485
/
Copy pathslides.py
247 lines (182 loc) · 6.34 KB
/
slides.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "duckdb==1.1.1",
# "marimo",
# "numpy==2.0.2",
# "pandas==2.2.3",
# ]
# ///
import marimo
__generated_with = "0.8.20"
app = marimo.App(width="medium", layout_file="layouts/slides.slides.json")
@app.cell(hide_code=True)
def __(mo):
mo.md(
r"""
!!! tip "This notebook is best viewed as an app."
Hit `Cmd/Ctrl+.` or click the "app view" button in the bottom right.
"""
)
return
@app.cell
def __(mo):
mo.md(
"""
# DuckDB Tricks 🦆
We use a simple example data set to present a few tricks that are useful when using DuckDB.
>
> Turned into slides from <https://duckdb.org/2024/08/19/duckdb-tricks-part-1.html>
"""
)
return
@app.cell
def __(mo):
mo.md(
r"""
## Installation
```bash
uv add duckdb
```
"""
)
return
@app.cell
def __(mo):
mo.md(r"""# Creating the example data set""")
return
@app.cell
def __(duckdb, mo, print_and_run):
_SQL = """
CREATE OR REPLACE TABLE example (s STRING, x DOUBLE);
INSERT INTO example VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4);
COPY example TO 'example.csv';
"""
duckdb.sql(_SQL)
mo.md(
f"""
Creating the example data set
We start by creating a data set that we'll use in the rest of the blog post. To this end, we define a table, populate it with some data and export it to a CSV file.
{print_and_run(_SQL)}
Wait a bit, that’s way too verbose! DuckDB’s syntax has several SQL shorthands including the “friendly SQL” clauses. Here, we combine the VALUES clause with the FROM-first syntax, which makes the SELECT clause optional. With these, we can compress the data creation script to ~60% of its original size. The new formulation omits the schema definition and creates the CSV with a single command:
```sql
COPY (FROM VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4) t(s, x))
TO 'example.csv';
```
Regardless of which script we run, the resulting CSV file will look like this:
{print_and_run("SELECT * from example")}
"""
)
return
@app.cell
def __(mo):
mo.md(r"""# Pretty-printing floating-point numbers""")
return
@app.cell
def __(mo, print_and_run):
mo.md(
f"""
When printing a floating-point number to the output, the fractional parts can be difficult to read and compare. For example, the following query returns three numbers between 1 and 8 but their printed widths are very different due to their fractional parts.
{print_and_run("SELECT x FROM 'example.csv';")}
By casting a column to a DECIMAL with a fixed number of digits after the decimal point, we can pretty-print it as follows:
{print_and_run('''
SELECT x::DECIMAL(15, 3) AS x
FROM 'example.csv';
''')}
A typical alternative solution is to use the printf or format functions, e.g.:
{print_and_run('''
SELECT printf('%.3f', x)
FROM 'example.csv';
''')}
However, these approaches require us to specify a formatting string that's easy to forget. What's worse, the statement above returns string values, which makes subsequent operations (e.g., sorting) more difficult. Therefore, unless keeping the full precision of the floating-point numbers is a concern, casting to DECIMAL values should be the preferred solution for most use cases.
"""
)
return
@app.cell
def __(mo):
mo.md(r"""# Copying the schema of a table""")
return
@app.cell
def __(mo, print_and_run):
mo.md(
f"""
To copy the schema from a table without copying its data, we can use LIMIT 0.
{print_and_run('''
CREATE OR REPLACE TABLE example AS
FROM 'example.csv';
CREATE OR REPLACE TABLE tbl AS
FROM example
LIMIT 0;
''')}
This will result in an empty table with the same schema as the source table:
{print_and_run('DESCRIBE tbl;')}
This will return the schema of the table.
```sql
CREATE TABLE example(s VARCHAR, x DOUBLE);
```
After editing the table’s name (e.g., example to tbl), this query can be used to create a new table with the same schema.
"""
)
return
@app.cell
def __(mo):
mo.md(r"""# Shuffling data""")
return
@app.cell
def __(mo):
rerun = mo.ui.button(label="Run again")
return (rerun,)
@app.cell
def __(mo, print_and_run, rerun):
mo.md(
f"""
Sometimes, we need to introduce some entropy into the ordering of the data by shuffling it. To shuffle non-deterministically, we can simply sort on a random value provided the random() function:
{rerun}
{print_and_run('''
FROM 'example.csv' ORDER BY random();
''')}
Shuffling deterministically is a bit more tricky. To achieve this, we can order on the hash, of the rowid pseudocolumn. Note that this column is only available in physical tables, so we first have to load the CSV in a table, then perform the shuffle operation as follows:
{rerun}
{print_and_run('''
CREATE OR REPLACE TABLE example AS FROM 'example.csv';
FROM example ORDER BY hash(rowid + 42);
''')}
Note that the + 42 is only necessary to nudge the first row from its position – as hash(0) returns 0, the smallest possible value, using it for ordering leaves the first row in its place.
"""
)
return
@app.cell
def __(mo):
mo.md(
"""
# Closing thoughts
That’s it for today. The tricks shown in this post are available on [duckdbsnippets.com](https://duckdbsnippets.com).
"""
)
return
@app.cell
def __(duckdb, mo):
# Utils
def print_and_run(sql: str):
result = duckdb.sql(sql)
sql = sql.strip()
if not result:
return f"""
```sql
{sql}
```
"""
return f"""
```sql
{sql}
```
{mo.ui.table(result.df(), selection=None, pagination=None, show_column_summaries=False)}
"""
return (print_and_run,)
@app.cell
def __():
import marimo as mo
import duckdb
return duckdb, mo
if __name__ == "__main__":
app.run()