This repository was archived by the owner on Jan 7, 2025. It is now read-only.

Commit f8f714c

doc: docs for cardinality benchmarking (#183)
**Summary**: Updated `README.md`, `SUMMARY.md`, and added a new file `cost_model_benchmarking.md` to document cardinality benchmarking.

**Details**:
* `README.md` contains a quickstart command.
* `cost_model_benchmarking.md` contains conceptual info and notes about operating and extending the system.
* I named it "benchmarking" instead of "testing" in the docs to distinguish it from functional testing. I renamed `perftest` and `cardtest` to `perfbench` and `cardbench` to match how we're calling it "benchmarking" instead of "testing".
1 parent 5255d8d commit f8f714c

19 files changed (+101 −54 lines)

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -7,6 +7,6 @@ members = [
     "optd-sqlplannertest",
     "optd-adaptive-demo",
     "optd-gungnir",
-    "optd-perftest",
+    "optd-perfbench",
 ]
 resolver = "2"

README.md

Lines changed: 9 additions & 2 deletions
@@ -12,7 +12,7 @@ optd is a research project and is still evolving. It should not be used in produ
 
 ## Get Started
 
-There are two demos you can run with optd. More information available in the [docs](docs/).
+There are three demos you can run with optd. More information available in the [docs](docs/).
 
 ```
 cargo run --release --bin optd-adaptive-tpch-q8
@@ -25,6 +25,13 @@ You can also run the Datafusion cli to interactively experiment with optd.
 cargo run --bin datafusion-optd-cli
 ```
 
+You can also test the performance of the cost model with the "cardinality benchmarking" feature (more info in the [docs](docs/)).
+Before running this, you will need to manually run Postgres on your machine.
+Note that there is a CI script which tests this command (TPC-H with scale factor 0.01) before every merge into main, so it should be very reliable.
+```
+cargo run --release --bin optd-perfbench cardbench tpch --scale-factor 0.01
+```
+
 ## Documentation
 
 The documentation is available in the mdbook format in the [docs](docs) directory.
@@ -38,7 +45,7 @@ The documentation is available in the mdbook format in the [docs](docs) director
 * `optd-adaptive-demo`: Demo of adaptive optimization capabilities of optd. More information available in the [docs](docs/).
 * `optd-sqlplannertest`: Planner test of optd based on [risinglightdb/sqlplannertest-rs](https://github.com/risinglightdb/sqlplannertest-rs).
 * `optd-gungnir`: Scalable, memory-efficient, and parallelizable statistical methods for cardinality estimation (e.g. TDigest, HyperLogLog).
-* `optd-perftest`: A CLI program for testing performance (cardinality, throughput, etc.) against other databases.
+* `optd-perfbench`: A CLI program for benchmarking performance (cardinality, throughput, etc.) against other databases.
 
 
 # Related Works

dev_scripts/which_queries_work.sh

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ fi
 successful_ids=()
 IFS=','
 for id in $all_ids; do
-    cargo run --release --bin optd-perftest cardtest $benchmark_name --query-ids $id &>/dev/null
+    cargo run --release --bin optd-perfbench cardbench $benchmark_name --query-ids $id &>/dev/null
 
     if [ $? -eq 0 ]; then
         echo >&2 $id succeeded
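The loop above follows a common shell pattern: run each candidate query, check its exit status, and collect the IDs that succeed. A minimal standalone sketch of that pattern (the real script collects into a bash array; this POSIX version uses string concatenation, and `run_query` is a hypothetical stub standing in for the `cargo run ... cardbench` invocation):

```shell
# Stub standing in for the real `cargo run ... cardbench ... --query-ids $id`
# invocation; here it succeeds only for even-numbered query IDs.
run_query() {
    [ $(( $1 % 2 )) -eq 0 ]
}

all_ids="1,2,3,4"
successful_ids=""

IFS=','
for id in $all_ids; do
    if run_query "$id"; then
        # Append the working ID, comma-separating after the first one.
        successful_ids="${successful_ids:+$successful_ids,}$id"
    fi
done

# Prints the comma-separated string to paste into a --query-ids argument.
echo "$successful_ids"
```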

docs/src/SUMMARY.md

Lines changed: 4 additions & 1 deletion
@@ -22,7 +22,10 @@
 - [Three Join Demo](./demo_three_join.md)
 - [TPC-H Q8 Demo](./demo_tpch_q8.md)
 
-# Testing
+# Performance Benchmarking
+- [Cost Model Cardinality Benchmarking](./cost_model_benchmarking.md)
+
+# Functional Testing
 
 - [SQLPlannerTest](./sqlplannertest.md)
 - [Datafusion CLI](./datafusion_cli.md)

docs/src/cost_model_benchmarking.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# Cost Model Cardinality Benchmarking
+
+## Overview
+You can benchmark the cardinality estimates of optd's cost model against other DBMSs using the optd-perfbench module.
+
+All aspects of benchmarking (except for setting up comparison DBMSs) are handled automatically. This includes loading workload data, building statistics, gathering the true cardinality of workload queries, running explains on workload queries, and aggregating cardinality estimation results.
+
+We elected not to automate the installation and setup of the DBMSs in order to accommodate the needs of all users. For instance, some users prefer installing Postgres via Homebrew, others choose to install the Mac application, while others wish to create a Postgres Docker container. However, it could be feasible in the future to standardize on Docker and automatically start a container. The only difficult part in that scenario is tuning Postgres and other DBMSs to the machine being run on, as this is currently done manually using PGTune.
+
+Additionally, our system provides **fine-grained, robust caching** for every single step of the process. After the first run of a workload, all subsequent runs will *only require running explains*, which takes a matter of seconds for all workloads. We use "acknowledgement files" to ensure that the caching is robust in that we never cache incomplete results.
+
+## Basic Operation
+First, you need to manually install, configure, and start the DBMS(s) being compared against. Currently, only Postgres is supported. To see an example of how Postgres is installed, configured, and started on a Mac, check the `patrick/` folder in the [gungnir-experiments](https://github.com/wangpatrick57/gungnir-experiments) repository.
+
+Once the DBMS(s) being compared against are set up, run this to quickly get started. It should take a few minutes on the first run and a few seconds on subsequent runs. This specific command, which tests TPC-H with scale factor 0.01, is **run in a CI script** before every merge to main, so it should be very reliable.
+```
+cargo run --release --bin optd-perfbench cardbench tpch --scale-factor 0.01
+```
+
+After this, you can try out different workloads and scale factors based on the CLI options.
+
+Roughly speaking, there are two main ways the benchmarking system is used: (a) to compare the cardinality estimates of optd against another system *in aggregate*, or (b) to investigate the cardinality estimates of a small subset of queries. The command above is for use case (a). The system automatically outputs a variety of *aggregate* information about the q-error, including the median, p95, max, and more. Additionally, the system outputs *comparative* information which shows the # of queries in which a given DBMS performs the best or is tied for the best.
+
+For use case (b), you will want to set the `RUST_LOG` environment variable to `info` and use the `--query-ids` parameter. Setting `RUST_LOG` to `info` will show the results of the explain commands on all DBMSs, and `--query-ids` will let you run only specific queries to avoid cluttering the output.
+```
+RUST_LOG=info cargo run --release --bin optd-perfbench cardbench tpch --scale-factor 0.01 --query-ids 2
+```
+
+## Supporting More Queries
+Currently, we are missing support for a few queries in TPC-H, JOB, and JOB-light. An *approximate* list of supported queries can be found in the `[workload].rs` files (e.g. `tpch.rs` and `job.rs`). If `--query-ids` is omitted from the command, we use the list of supported queries as defined in the `[workload].rs` file by default. Some of these queries are not supported by DataFusion, some are not supported by optd, and some fail because we run into an OOM error when trying to execute them on Postgres. Because of the last point, the set of supported queries may differ between machines. The list of queries in `[workload].rs` (at least the one in `tpch.rs`) is tested to be working on the CI machine.
+
+The *definitive* list of supported queries on your machine can be found by running `dev_scripts/which_queries_work.sh`, which simply runs the benchmarking system for each query individually. While this script does take a long time to complete when first run, it has the nice side effect of warming up all your caches so that subsequent runs are fast. The script outputs a string to replace the `WORKING_*QUERY_IDS` variable in `[workload].rs` as well as another string to use as the `--query-ids` argument. If you use `which_queries_work.sh` to figure out the queries that work on your machine, you probably want to use `--query-ids` instead of setting `WORKING_*QUERY_IDS`.
+
+If you add support for more queries, you will want to rerun `dev_scripts/which_queries_work.sh`. Since you are permanently adding support for more queries, you will want to update `WORKING_*QUERY_IDS`.
+
+## Adding More DBMSs
+Currently, only Postgres is supported. Additional DBMSs can be easily added using the `CardbenchRunnerDBMSHelper` trait and optionally the `TruecardGetter` trait. `CardbenchRunnerDBMSHelper` must be implemented by every supported DBMS because it has functions for gathering estimated cardinalities from DBMSs. `TruecardGetter` only needs to be implemented by at least one DBMS: the true cardinality should be the same across all DBMSs, so we only execute the queries for real on a single DBMS to drastically reduce benchmarking runtime. `TruecardGetter` is currently implemented for Postgres, so it is unnecessary to implement it for any other DBMS unless one wishes to improve the runtime of benchmarking (e.g. by gathering true cardinalities using an OLAP DBMS for OLAP workloads). Do keep in mind that true cardinalities are cached after the first run of a workload and can be shared across users (in the future, perhaps we'll even put the cached true cardinalities in the GitHub repository itself), so this optimization is not terribly important.
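To make the extension point above concrete, here is a simplified, synchronous Rust sketch of the *shape* of the helper trait. It is not the real `CardbenchRunnerDBMSHelper` (which is async and operates on benchmark/workload types); `CardbenchHelperSketch`, `MockDBMS`, and `eval_estcards` are all hypothetical names for illustration.

```rust
// Simplified stand-in for the real CardbenchRunnerDBMSHelper trait
// (the actual trait is async and takes workload/benchmark arguments).
trait CardbenchHelperSketch {
    // &self so the trait stays object-safe for Box<dyn CardbenchHelperSketch>.
    fn get_name(&self) -> &str;
    // Must return estimates in the same query order for every DBMS,
    // so results can be zipped against the true cardinalities.
    fn eval_estcards(&self, query_ids: &[u32]) -> Vec<usize>;
}

struct MockDBMS;

impl CardbenchHelperSketch for MockDBMS {
    fn get_name(&self) -> &str {
        "MockDBMS"
    }
    fn eval_estcards(&self, query_ids: &[u32]) -> Vec<usize> {
        // A real implementation would run EXPLAIN on each query and parse
        // the plan's estimated row count; here we fake a constant estimate.
        query_ids.iter().map(|_| 42).collect()
    }
}

fn main() {
    let dbmss: Vec<Box<dyn CardbenchHelperSketch>> = vec![Box::new(MockDBMS)];
    for dbms in &dbmss {
        println!("{}: {:?}", dbms.get_name(), dbms.eval_estcards(&[1, 2]));
    }
}
```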

optd-perftest/Cargo.toml renamed to optd-perfbench/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [package]
-name = "optd-perftest"
+name = "optd-perfbench"
 version = "0.1.0"
 edition = "2021"

File renamed without changes.

optd-perftest/src/cardtest.rs renamed to optd-perfbench/src/cardbench.rs

Lines changed: 15 additions & 15 deletions
@@ -9,12 +9,12 @@ use anyhow::{self};
 use async_trait::async_trait;
 
 /// This struct performs cardinality testing across one or more DBMSs.
-/// Another design would be for the CardtestRunnerDBMSHelper trait to expose a function
+/// Another design would be for the CardbenchRunnerDBMSHelper trait to expose a function
 /// to evaluate the Q-error. However, I chose not to do this design for reasons
-/// described in the comments of the CardtestRunnerDBMSHelper trait. This is why
-/// you would use CardtestRunner even for computing the Q-error of a single DBMS.
-pub struct CardtestRunner {
-    pub dbmss: Vec<Box<dyn CardtestRunnerDBMSHelper>>,
+/// described in the comments of the CardbenchRunnerDBMSHelper trait. This is why
+/// you would use CardbenchRunner even for computing the Q-error of a single DBMS.
+pub struct CardbenchRunner {
+    pub dbmss: Vec<Box<dyn CardbenchRunnerDBMSHelper>>,
     truecard_getter: Box<dyn TruecardGetter>,
 }
 
@@ -25,12 +25,12 @@ pub struct Cardinfo {
     pub truecard: usize,
 }
 
-impl CardtestRunner {
+impl CardbenchRunner {
     pub async fn new(
-        dbmss: Vec<Box<dyn CardtestRunnerDBMSHelper>>,
+        dbmss: Vec<Box<dyn CardbenchRunnerDBMSHelper>>,
         truecard_getter: Box<dyn TruecardGetter>,
     ) -> anyhow::Result<Self> {
-        Ok(CardtestRunner {
+        Ok(CardbenchRunner {
             dbmss,
             truecard_getter,
         })
@@ -57,7 +57,7 @@ impl CardtestRunner {
             .into_iter()
             .zip(truecards.iter())
             .map(|(estcard, &truecard)| Cardinfo {
-                qerror: CardtestRunner::calc_qerror(estcard, truecard),
+                qerror: CardbenchRunner::calc_qerror(estcard, truecard),
                 estcard,
                 truecard,
             })
@@ -90,8 +90,8 @@ impl CardtestRunner {
 /// When more performance tests are implemented, you would probably want to extract
 /// get_name() into a generic "DBMS" trait.
 #[async_trait]
-pub trait CardtestRunnerDBMSHelper {
-    // get_name() has &self so that we're able to do Box<dyn CardtestRunnerDBMSHelper>
+pub trait CardbenchRunnerDBMSHelper {
+    // get_name() has &self so that we're able to do Box<dyn CardbenchRunnerDBMSHelper>
     fn get_name(&self) -> &str;
 
     // The order of queries in the returned vector has to be the same between all databases,
@@ -103,7 +103,7 @@ pub trait CardtestRunnerDBMSHelper {
 }
 
 /// The core logic of cardinality testing.
-pub async fn cardtest_core<P: AsRef<Path>>(
+pub async fn cardbench_core<P: AsRef<Path>>(
     workspace_dpath: P,
     rebuild_cached_optd_stats: bool,
     pguser: &str,
@@ -115,10 +115,10 @@ pub async fn cardtest_core<P: AsRef<Path>>(
     let truecard_getter = pg_dbms.clone();
     let df_dbms =
         Box::new(DatafusionDBMS::new(&workspace_dpath, rebuild_cached_optd_stats, adaptive).await?);
-    let dbmss: Vec<Box<dyn CardtestRunnerDBMSHelper>> = vec![pg_dbms, df_dbms];
+    let dbmss: Vec<Box<dyn CardbenchRunnerDBMSHelper>> = vec![pg_dbms, df_dbms];
 
-    let mut cardtest_runner = CardtestRunner::new(dbmss, truecard_getter).await?;
-    let cardinfos_alldbs = cardtest_runner
+    let mut cardbench_runner = CardbenchRunner::new(dbmss, truecard_getter).await?;
+    let cardinfos_alldbs = cardbench_runner
         .eval_benchmark_cardinfos_alldbs(&benchmark)
         .await?;
     Ok(cardinfos_alldbs)
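The `calc_qerror` helper referenced in this diff is not shown in the hunk. The standard definition of q-error, which it presumably implements, is the larger of est/true and true/est, so it is always ≥ 1 and symmetric between over- and under-estimation. A self-contained sketch assuming that definition (the exact handling of zero cardinalities in optd may differ):

```rust
/// Q-error of a cardinality estimate: max(est/true, true/est).
/// A perfect estimate yields exactly 1.0; a 10x over- or under-estimate
/// yields 10.0. Sketch of the standard definition, assuming positive inputs.
fn calc_qerror(estcard: usize, truecard: usize) -> f64 {
    let est = estcard as f64;
    let tru = truecard as f64;
    (est / tru).max(tru / est)
}

fn main() {
    assert_eq!(calc_qerror(100, 100), 1.0);   // perfect estimate
    assert_eq!(calc_qerror(10, 1000), 100.0); // 100x under-estimate
    assert_eq!(calc_qerror(1000, 10), 100.0); // 100x over-estimate
    println!("q-error checks pass");
}
```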

optd-perftest/src/datafusion_dbms.rs renamed to optd-perfbench/src/datafusion_dbms.rs

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ use std::{
 
 use crate::{
     benchmark::Benchmark,
-    cardtest::CardtestRunnerDBMSHelper,
+    cardbench::CardbenchRunnerDBMSHelper,
     job::{JobKit, JobKitConfig},
     tpch::{TpchKit, TpchKitConfig},
 };
@@ -47,7 +47,7 @@ const WITH_LOGICAL_FOR_TPCH: bool = true;
 const WITH_LOGICAL_FOR_JOB: bool = true;
 
 #[async_trait]
-impl CardtestRunnerDBMSHelper for DatafusionDBMS {
+impl CardbenchRunnerDBMSHelper for DatafusionDBMS {
     fn get_name(&self) -> &str {
         "DataFusion"
     }
