
Commit 62a03f9

feat: add title for "how databend optimizer works" (#2376)
1 parent 0f72dff commit 62a03f9

1 file changed: +49 -3 lines changed

docs/en/guides/81-how-databend-works/02-how-databend-optimizer-works.md

Lines changed: 49 additions & 3 deletions
@@ -1,4 +1,6 @@
-# How Databend Optimizer Works
+---
+title: How Databend Optimizer Works
+---

 ## Core Concepts

@@ -18,12 +20,14 @@ Databend's query optimizer is built on several key abstractions that work togeth
 Databend collects and uses these statistics to guide optimization decisions:

 **Table Statistics:**
+
 - `num_rows`: Number of rows in the table
 - `data_size`: Size of the table data in bytes
 - `number_of_blocks`: Number of storage blocks
 - `number_of_segments`: Number of segments

 **Column Statistics:**
+
 - `min`: Minimum value in the column
 - `max`: Maximum value in the column
 - `null_count`: Number of null values
@@ -133,12 +137,14 @@ Databend's query optimizer passes through four distinct phases to transform SQL
 **1. Subquery Decorrelation (SubqueryDecorrelatorOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT * FROM customers c
 WHERE c.total_orders > (SELECT AVG(total_orders) FROM customers WHERE region = c.region)
 ```

 **Before:**
+
 ```
 Filter (c.total_orders > Subquery)
 └─ Scan (customers as c)
@@ -149,6 +155,7 @@ Filter (c.total_orders > Subquery)
 ```

 **After:**
+
 ```
 # Correlated subquery transformed into join operation
 Join (c.region = r.region)
@@ -165,18 +172,21 @@ Filter (c.total_orders > r.avg_total)
 **2. Statistics-based Aggregate Optimization (RuleStatsAggregateOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT MIN(price) FROM products
 ```

 **Before:**
+
 ```
 Aggregate (MIN(price))
 └─ EvalScalar
    └─ Scan (products)
 ```
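
A rough illustration of the idea behind this rule, not Databend's actual implementation: when trustworthy column statistics exist, `MIN(price)` can be read from metadata instead of scanning rows. `ColumnStats`, its `exact` flag, `min_from_stats`, `scan_min`, and `eval_min` are hypothetical names for this sketch, and restricting the shortcut to exact statistics is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics (min, max, null_count as in the doc)."""
    min: Optional[float]
    max: Optional[float]
    null_count: int
    exact: bool = True  # assumption: statistics may be stale or approximate


def min_from_stats(stats: Optional[ColumnStats]) -> Optional[float]:
    """Return the pre-computed minimum, or None if a real scan is required."""
    if stats is None or not stats.exact:
        return None
    return stats.min


def scan_min(rows):
    """Fallback: compute MIN by scanning every row, skipping NULLs."""
    values = [r for r in rows if r is not None]
    return min(values) if values else None


def eval_min(rows, stats):
    """Prefer the statistics shortcut; fall back to a scan when it is unavailable."""
    shortcut = min_from_stats(stats)
    return shortcut if shortcut is not None else scan_min(rows)


prices = [19.9, 5.0, None, 42.0]
stats = ColumnStats(min=5.0, max=42.0, null_count=1)
assert eval_min(prices, stats) == 5.0  # answered from statistics
assert eval_min(prices, None) == 5.0   # answered by scanning
```

The point of the sketch is only that the aggregate disappears from the plan when the answer is already stored as metadata.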

 **After:**
+
 ```
 # MIN aggregate replaced with pre-computed value from statistics
 EvalScalar (price_min)
@@ -188,18 +198,21 @@ EvalScalar (price_min)
 **3. Statistics Collection (CollectStatisticsOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT * FROM orders WHERE region = 'Asia'
 ```

 **Before:**
+
 ```
 Filter (region = 'Asia')
 └─ Scan (orders)
 [No statistics]
 ```

 **After:**
+
 ```
 Filter (region = 'Asia')
 └─ Scan (orders)
@@ -216,11 +229,13 @@ Filter (region = 'Asia')
 **4. Aggregate Normalization (RuleNormalizeAggregateOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT COUNT(id), COUNT(*), COUNT(DISTINCT region) FROM orders GROUP BY region
 ```

 **Before:**
+
 ```
 Aggregate (
 GROUP BY [region],
@@ -232,6 +247,7 @@ Aggregate (
 ```

 **After:**
+
 ```
 # Optimized aggregates
 EvalScalar (COUNT(*) AS count_id, COUNT(*) AS count_star)
@@ -244,18 +260,21 @@ EvalScalar (COUNT(*) AS count_id, COUNT(*) AS count_star)
 ```

 **What it does:** Optimizes aggregate functions by:
-1. Rewriting COUNT(non-nullable) to COUNT(*)
-2. Reusing a single COUNT(*) for multiple count expressions
+
+1. Rewriting COUNT(non-nullable) to COUNT(\*)
+2. Reusing a single COUNT(\*) for multiple count expressions
 3. Eliminating DISTINCT when counting columns that are already in GROUP BY
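
The first two rewrites in this list can be sketched as a small normalization pass. This is a toy model under stated assumptions, not the code of `RuleNormalizeAggregateOptimizer`: the `Count` class, the `normalize_counts` function, and the `non_nullable` set are hypothetical names, and column nullability is assumed to be known up front.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Count:
    """COUNT(arg); arg=None stands for COUNT(*)."""
    arg: Optional[str] = None


def normalize_counts(counts, non_nullable):
    """Return (aggregates to compute, index into that list for each input)."""
    unique = []   # aggregates that still have to be evaluated
    mapping = []  # for every input expression, its position in `unique`
    for c in counts:
        # Rewrite 1: COUNT(col) equals COUNT(*) when col can never be NULL.
        if c.arg is not None and c.arg in non_nullable:
            c = Count(None)
        # Rewrite 2: evaluate each identical aggregate once and reuse it.
        if c not in unique:
            unique.append(c)
        mapping.append(unique.index(c))
    return unique, mapping


# COUNT(id), COUNT(*) with `id` assumed to be declared NOT NULL.
unique, mapping = normalize_counts([Count("id"), Count()], non_nullable={"id"})
assert unique == [Count(None)]  # a single COUNT(*) is evaluated
assert mapping == [0, 0]        # both outputs alias the same aggregate
```

The `mapping` list plays the role of the `EvalScalar` aliases (`count_id`, `count_star`) that point at the shared `COUNT(*)`.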

 **5. Filter Pull-up (PullUpFilterOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id WHERE o.region = 'Asia' AND c.status = 'active'
 ```

 **Before:**
+
 ```
 Filter (c.status = 'active')
 └─ Filter (o.region = 'Asia')
@@ -265,6 +284,7 @@ Filter (c.status = 'active')
 ```

 **After:**
+
 ```
 # Filters pulled up to the top
 Filter (o.region = 'Asia' AND c.status = 'active' AND o.customer_id = c.id)
@@ -284,17 +304,20 @@ Filter (o.region = 'Asia' AND c.status = 'active' AND o.customer_id = c.id)
 #### Filter Pushdown Rules

 **SQL Example:**
+
 ```sql
 SELECT * FROM orders WHERE region = 'Asia'
 ```

 **Before:**
+
 ```
 Filter (region = 'Asia')
 └─ Scan (orders)
 ```
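
As a minimal sketch of what a filter-into-scan pushdown does to this plan shape, the snippet below models the two operators as plain data classes. `Scan`, `Filter`, and `push_down_filter_scan` are hypothetical names chosen to mirror the plan text; this toy version drops the `Filter` node entirely, matching the simplified plans of this example, whereas a real engine may keep it for predicates the storage layer cannot fully evaluate.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Scan:
    table: str
    pushdown_predicates: List[str] = field(default_factory=list)


@dataclass
class Filter:
    predicate: str
    child: Union["Filter", Scan]


def push_down_filter_scan(plan):
    """If a Filter sits directly on a Scan, hand its predicate to the Scan."""
    if isinstance(plan, Filter) and isinstance(plan.child, Scan):
        scan = plan.child
        return Scan(scan.table, scan.pushdown_predicates + [plan.predicate])
    return plan  # the rule does not match; leave the plan unchanged


before = Filter("region = 'Asia'", Scan("orders"))
after = push_down_filter_scan(before)
assert after == Scan("orders", ["region = 'Asia'"])
```

Carrying the predicate on the scan lets the storage layer skip data that cannot match, instead of reading everything and filtering afterwards.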

 **After (PushDownFilterScan rule):**
+
 ```
 # Filter pushed down to scan layer
 Scan (orders, pushdown_predicates=[region = 'Asia'])
@@ -305,18 +328,21 @@ Scan (orders, pushdown_predicates=[region = 'Asia'])
 #### Limit Pushdown Rules

 **SQL Example:**
+
 ```sql
 SELECT * FROM orders ORDER BY order_date LIMIT 10
 ```

 **Before:**
+
 ```
 Limit (10)
 └─ Sort (order_date)
    └─ Scan (orders)
 ```

 **After (PushDownLimitSort rule):**
+
 ```
 # Limit pushed through sort
 Sort (order_date)
@@ -329,17 +355,20 @@ Sort (order_date)
 #### Elimination Rules

 **SQL Example:**
+
 ```sql
 SELECT * FROM orders WHERE 1=1
 ```

 **Before:**
+
 ```
 Filter (1=1)
 └─ Scan (orders)
 ```
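
The elimination step for this plan can be sketched in a few lines. This is an illustrative toy, not Databend's `EliminateFilter` code: `Scan`, `Filter`, `eliminate_filter`, and the `ALWAYS_TRUE` whitelist are hypothetical, and a real optimizer would decide "always true" by constant folding rather than by string comparison.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    predicate: str
    child: Union["Filter", Scan]


# Assumption: a tiny whitelist of predicates that are trivially true.
ALWAYS_TRUE = {"1=1", "true"}


def eliminate_filter(plan):
    """Remove Filter nodes whose predicate is trivially true."""
    if isinstance(plan, Filter):
        child = eliminate_filter(plan.child)
        normalized = plan.predicate.replace(" ", "").lower()
        if normalized in ALWAYS_TRUE:
            return child  # the filter keeps every row, so it can be dropped
        return Filter(plan.predicate, child)
    return plan


assert eliminate_filter(Filter("1=1", Scan("orders"))) == Scan("orders")
kept = Filter("region = 'Asia'", Scan("orders"))
assert eliminate_filter(kept) == kept  # meaningful filters are preserved
```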

 **After (EliminateFilter rule):**
+
 ```
 # Redundant filter removed
 Scan (orders)
@@ -350,11 +379,13 @@ Scan (orders)
 **7. Aggregate Splitting (RecursiveRuleOptimizer - SplitAggregate)**

 **SQL Example:**
+
 ```sql
 SELECT region, SUM(amount) FROM orders GROUP BY region
 ```

 **Before:**
+
 ```
 # Single-phase aggregation (mode: Initial)
 Aggregate (
@@ -366,6 +397,7 @@ Aggregate (
 ```

 **After:**
+
 ```
 # Two-phase aggregation
 Aggregate (
@@ -388,11 +420,13 @@ Aggregate (
 **8. Join Order Optimization (DPhpyOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id JOIN products p ON o.product_id = p.id WHERE c.region = 'Asia'
 ```

 **Before (original order):**
+
 ```
 Join
 ├─ Join
@@ -402,6 +436,7 @@ Join
 ```

 **After (optimized order):**
+
 ```
 # Optimized join order based on cost estimation
 Join
@@ -424,18 +459,21 @@ This optimizer is particularly important for queries involving multiple joins, w
 **9. Single Join to Inner Join Conversion (SingleToInnerOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT o.* FROM orders o LEFT SINGLE JOIN customers c ON o.customer_id = c.id
 ```

 **Before:**
+
 ```
 LeftSingleJoin (o.customer_id = c.id)
 ├─ Scan (orders as o)
 └─ Scan (customers as c)
 ```

 **After:**
+
 ```
 # Single join converted to inner join
 InnerJoin (o.customer_id = c.id)
@@ -448,11 +486,13 @@ InnerJoin (o.customer_id = c.id)
 **10. Join Condition Deduplication (DeduplicateJoinConditionOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t2.id = t3.id AND t3.id = t1.id
 ```
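
One common way to detect an equality that is already implied by transitivity is a union-find structure over the joined columns. The sketch below applies it to the three conditions of this query; whether Databend's `DeduplicateJoinConditionOptimizer` uses this exact data structure is not stated here, and `deduplicate_join_conditions` is a hypothetical helper name.

```python
def deduplicate_join_conditions(conditions):
    """Keep only equalities that connect previously unconnected columns."""
    parent = {}

    def find(x):
        # Lazily register unseen columns as their own set representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for left, right in conditions:
        root_l, root_r = find(left), find(right)
        if root_l == root_r:
            continue  # both sides are already known to be equal
        parent[root_l] = root_r
        kept.append((left, right))
    return kept


conds = [("t1.id", "t2.id"), ("t2.id", "t3.id"), ("t3.id", "t1.id")]
assert deduplicate_join_conditions(conds) == [
    ("t1.id", "t2.id"),
    ("t2.id", "t3.id"),
]
```

Once `t1.id = t2.id` and `t2.id = t3.id` are recorded, all three columns fall into one equivalence class, so `t3.id = t1.id` adds no information and is skipped.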

 **Before:**
+
 ```
 Join (t2.id = t3.id AND t3.id = t1.id)
 ├─ Scan (t3)
@@ -462,6 +502,7 @@ Join (t2.id = t3.id AND t3.id = t1.id)
 ```

 **After:**
+
 ```
 # Removed transitive join condition
 Join (t2.id = t3.id)
@@ -483,18 +524,21 @@ This optimization reduces the number of join conditions that need to be evaluate
 **11. Join Commutation (CommuteJoin Rule)**

 **SQL Example:**
+
 ```sql
 SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
 ```

 **Before (orders is larger than customers):**
+
 ```
 Join (o.customer_id = c.id)
 ├─ Scan (orders as o) # Larger table (10M rows)
 └─ Scan (customers as c) # Smaller table (100K rows)
 ```

 **After (CommuteJoin rule applied):**
+
 ```
 # Join order swapped to put smaller table on left
 Join (c.id = o.customer_id)
@@ -515,6 +559,7 @@ Since Databend typically uses the right side as the build side in hash joins, th
 **12. Cost-Based Implementation Selection (CascadesOptimizer)**

 **SQL Example:**
+
 ```sql
 SELECT customer_name, SUM(total_price) as total_spend
 FROM customers JOIN orders ON customers.id = orders.customer_id
@@ -606,6 +651,7 @@ Costs are calculated recursively - a plan's total cost includes all its operatio
 Databend's query optimizer employs a sophisticated, multi-stage pipeline to transform user SQL queries into highly efficient physical execution plans. It leverages core concepts like SExpr for plan representation, a rich set of transformation rules, detailed statistics, and a cost model to explore and evaluate various plan alternatives.

 The process involves:
+
 1. **Preparation:** Decorrelating subqueries and gathering necessary statistics.
 2. **Logical Optimization:** Applying rule-based transformations (like filter pushdown, aggregate normalization) to refine the logical plan structure.
 3. **Join Optimization:** Strategically determining the best join order and methods using techniques like dynamic programming.
