From d631fbf86552c02a43d33426779a16aa3a385168 Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Sun, 31 Dec 2023 15:03:44 +0800
Subject: [PATCH] docs: finish all docs

Signed-off-by: Alex Chi
---
 docs/README.md                  |  2 ++
 docs/src/SUMMARY.md             |  6 +++---
 docs/src/datafusion_cli.md      | 12 ++++++++++++
 docs/src/demo_three_join.md     | 30 ++++++++++++++++++++++++++++++
 docs/src/demo_tpch_q8.md        | 26 ++++++++++++++++++++++++++
 docs/src/partial_exploration.md |  2 ++
 docs/src/sqlplannertest.md      |  4 ++++
 docs/src/tpch_q8_demo.md        |  1 -
 8 files changed, 79 insertions(+), 4 deletions(-)
 create mode 100644 docs/src/demo_tpch_q8.md
 delete mode 100644 docs/src/tpch_q8_demo.md

diff --git a/docs/README.md b/docs/README.md
index 1da9464d..44af2adf 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -5,3 +5,5 @@ The docs is written in `mdbook` format. You can follow the [`mdbook` installatio
 ```shell
 mdbook serve
 ```
+
+The online version of the documentation can be found at [https://cmu-db.github.io/optd/](https://cmu-db.github.io/optd/).
diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
index 2d3b5dc3..7a43ad2e 100644
--- a/docs/src/SUMMARY.md
+++ b/docs/src/SUMMARY.md
@@ -19,10 +19,10 @@
 
 # Demo
 
-- [(WIP) Three Join Demo](./demo_three_join.md)
-- [(WIP) TPC-H Q8 Demo](./tpch_q8_demo.md)
+- [Three Join Demo](./demo_three_join.md)
+- [TPC-H Q8 Demo](./demo_tpch_q8.md)
 
 # Testing
 
-- [(WIP) SQLPlannerTest](./sqlplannertest.md)
+- [SQLPlannerTest](./sqlplannertest.md)
 - [(WIP) Datafusion CLI](./datafusion_cli.md)
diff --git a/docs/src/datafusion_cli.md b/docs/src/datafusion_cli.md
index 08847c9f..1e8b10b5 100644
--- a/docs/src/datafusion_cli.md
+++ b/docs/src/datafusion_cli.md
@@ -1 +1,13 @@
 # Datafusion CLI
+
+Developers can interact with optd through the Datafusion CLI. The CLI supports creating tables, populating data, and executing ANSI SQL queries.
+
+```shell
+cargo run --bin datafusion-optd-cli
+```
+
+We also provide a TPC-H dataset at scale 0.01 for testing. The test SQL can be executed with the Datafusion CLI.
+
+```shell
+cargo run --bin datafusion-optd-cli -- -f tpch/test.sql
+```
diff --git a/docs/src/demo_three_join.md b/docs/src/demo_three_join.md
index 39033a9a..3d6abb42 100644
--- a/docs/src/demo_three_join.md
+++ b/docs/src/demo_three_join.md
@@ -1 +1,31 @@
 # Three Join Demo
+
+You can run this demo with the following command:
+
+```shell
+cargo run --release --bin optd-adaptive-three-join
+```
+
+We create 3 tables and join them. The underlying data is updated every time the query is executed.
+
+```sql
+select * from t1, t2, t3 where t1v1 = t2v1 and t1v2 = t3v2;
+```
+
+When the data distribution and the table sizes change, the optimal join order changes as well. The output of this demo looks like the following.
+
+```plain
+Iter 66: (HashJoin (HashJoin t1 t2) t3) <-> (best) (HashJoin (HashJoin t1 t2) t3), Accuracy: 66/66=100.000
+Iter 67: (HashJoin (HashJoin t2 t1) t3) <-> (best) (HashJoin (HashJoin t1 t2) t3), Accuracy: 66/67=98.507
+Iter 68: (HashJoin t2 (HashJoin t1 t3)) <-> (best) (HashJoin (HashJoin t1 t2) t3), Accuracy: 66/68=97.059
+Iter 69: (HashJoin (HashJoin t1 t2) t3) <-> (best) (HashJoin (HashJoin t1 t2) t3), Accuracy: 67/69=97.101
+Iter 70: (HashJoin (HashJoin t1 t2) t3) <-> (best) (HashJoin (HashJoin t1 t2) t3), Accuracy: 68/70=97.143
+Iter 71: (HashJoin (HashJoin t1 t2) t3) <-> (best) (HashJoin (HashJoin t1 t2) t3), Accuracy: 69/71=97.183
+Iter 72: (HashJoin (HashJoin t2 t1) t3) <-> (best) (HashJoin (HashJoin t1 t2) t3), Accuracy: 69/72=95.833
+```
+
+The left plan, in Lisp representation, is the join order determined by the adaptive query optimization algorithm. The right plan is the best plan. The accuracy is the percentage of executions in which the adaptive query optimization algorithm generates the cost-optimal plan.
+
+To find the optimal plan and compute the accuracy, we set up two optimizers in this demo: the normal optimizer and the optimal optimizer. Each time we insert some data into the tables, we invoke the normal optimizer once and invoke the optimal optimizer with all possible combinations of join orders, so that the optimal optimizer can produce an optimal plan based on the cost model and the join selectivity.
+
+Because the algorithm only knows the runtime information from the last run, collected before new data is added to the tables, there will be some iterations where it cannot generate the optimal plan. It converges to the optimal plan as more runtime information is collected.
diff --git a/docs/src/demo_tpch_q8.md b/docs/src/demo_tpch_q8.md
new file mode 100644
index 00000000..38e18544
--- /dev/null
+++ b/docs/src/demo_tpch_q8.md
@@ -0,0 +1,26 @@
+# TPC-H Q8 Demo
+
+
+You can run this demo with the following command:
+
+```shell
+cargo run --release --bin optd-adaptive-tpch-q8
+```
+
+In this demo, we create the TPC-H schema with test data of scale 0.01. There are 8 tables in TPC-H Q8, and it is impossible to enumerate all join combinations in one run. The demo runs this query multiple times, each time exploring a subset of the plan space. Optimization is therefore fast for each iteration, and as more of the plan space is explored with each iteration, the produced plan converges to the optimal join order.
+
+```plain
+--- ITERATION 5 ---
+plan space size budget used, not applying logical rules any more. current plan space: 10354
+(HashJoin region (HashJoin (HashJoin (HashJoin (HashJoin (HashJoin part (HashJoin supplier lineitem)) orders) customer) nation) nation))
+plan space size budget used, not applying logical rules any more. current plan space: 11743
++--------+------------+
+| col0   | col1       |
++--------+------------+
+| 1995.0 | 1.00000000 |
+| 1996.0 | 0.32989690 |
++--------+------------+
+2 rows in set. Query took 0.115 seconds.
+```
+
+The output contains the current join order in Lisp representation, the plan space size, and the query result.
diff --git a/docs/src/partial_exploration.md b/docs/src/partial_exploration.md
index 06b45a97..e1bfcbb0 100644
--- a/docs/src/partial_exploration.md
+++ b/docs/src/partial_exploration.md
@@ -1,3 +1,5 @@
 # Partial Exploration
 
 When the plan space is very large, optd will generate a sub-optimal plan at first, and then use the runtime information to continue the plan space search next time the same query (or a similar query) is being optimized. This is partial exploration.
+
+Developers can pass `partial_explore_iter` and `partial_explore_space` in the optimizer options to specify how far the optimizer expands the plan space each time `step_optimize_rel` is invoked. To use partial exploration, developers should not clear the internal state of the optimizer across different runs.
diff --git a/docs/src/sqlplannertest.md b/docs/src/sqlplannertest.md
index b6bc5643..121e5f64 100644
--- a/docs/src/sqlplannertest.md
+++ b/docs/src/sqlplannertest.md
@@ -1 +1,5 @@
 # SQLPlannerTest
+
+optd uses risinglightdb's SQL planner test library to ensure the optimizer works correctly and stably produces the expected plan. SQL planner test is a regression test. Developers provide the test framework with a yaml file containing the queries to be optimized and the information they want to collect. The test framework generates the test results and stores them in SQL files. When a developer submits a pull request, the reviewers should check whether any of these outputs changed unexpectedly.
+
+The test cases can be found in `optd-sqlplannertest/tests`. Currently, we check whether optd can enumerate all join orders by using the `explain:logical_join_orders,physical_plan` task, and check whether the query output is as expected by using the `execute` task.
diff --git a/docs/src/tpch_q8_demo.md b/docs/src/tpch_q8_demo.md
deleted file mode 100644
index d01f8c19..00000000
--- a/docs/src/tpch_q8_demo.md
+++ /dev/null
@@ -1 +0,0 @@
-# TPC-H Q8 Demo