| 1 | +--- |
| 2 | +title: External Dictionaries |
| 3 | +description: This RFC proposes the implementation of an external dictionary feature in Databend to allow seamless access to data from external sources. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +- RFC PR: [datafuselabs/databend-docs#996](https://github.com/datafuselabs/databend-docs/pull/996) |
| 8 | +- Tracking Issue: [datafuselabs/databend#15901](https://github.com/datafuselabs/databend/issues/15901) |
| 9 | + |
| 10 | +## Summary |
| 11 | + |
| 12 | +An external dictionary lets Databend query data held in external data sources without importing it first.
| 13 | + |
| 14 | +## Motivation |
| 15 | + |
| 16 | +Accessing data from an external database such as MySQL within Databend usually requires exporting the MySQL dataset and then importing it into Databend. This procedure is burdensome for large datasets, and frequent updates on the source side can leave the imported copy inconsistent.
| 17 | +
| 18 | +An external dictionary removes these problems by connecting Databend directly to other database systems. Once a dictionary is created, Databend reads the external dataset directly, so modifications made at the source are visible in real time and overall data management is simpler.
| 19 | + |
| 20 | +## Guide-level explanation |
| 21 | + |
| 22 | +A DICTIONARY is created, dropped, and queried with the following syntax.
| 23 | + |
| 24 | +1. Create a Dictionary named user_info. |
| 25 | + |
| 26 | +```sql |
| 27 | +CREATE DICTIONARY user_info( |
| 28 | + user_id UInt64,
| 29 | + user_name String, |
| 30 | + user_address String |
| 31 | +) |
| 32 | +primary key(user_id) |
| 33 | +SOURCE(MYSQL( |
| 34 | + host 'localhost'
| 35 | + user 'root' |
| 36 | + password 'root' |
| 37 | + db 'db_name' |
| 38 | + table 'table_name' |
| 39 | +)); |
| 40 | +``` |
| 41 | + |
| 42 | +2. List the existing dictionaries.
| 43 | + |
| 44 | +```sql |
| 45 | +SHOW DICTIONARIES; |
| 46 | +``` |
| 47 | + |
| 48 | +3. Show the SQL statement used to create the dictionary user_info.
| 49 | + |
| 50 | +```sql |
| 51 | +SHOW CREATE DICTIONARY user_info; |
| 52 | +``` |
| 53 | + |
| 54 | +4. Delete the Dictionary user_info. |
| 55 | + |
| 56 | +```sql |
| 57 | +DROP DICTIONARY user_info; |
| 58 | +``` |
| 59 | + |
| 60 | +Use `dict_get(dict_name, dict_field, dict_id)` to query data from a dictionary.
| 61 | +
| 62 | +The `dict_get` function takes three arguments: the first is the name of the dictionary, the second is the field to query, and the third is the ID (the primary key value) of the dictionary entry to look up.
| 63 | + |
| 64 | +## Reference-level explanation |
| 65 | + |
| 66 | +The relevant metadata of the DICTIONARY is stored in the meta module of Databend and is used to retrieve the necessary information when executing SQL queries. |
| 67 | + |
| 68 | +### Use protobuf to encode the data |
| 69 | + |
| 70 | +Protocol Buffers (Protobuf) is a data serialization framework that stores data in a compact binary format, serializes and deserializes quickly, supports many languages, and enforces a well-defined schema for data structures. Databend therefore uses Protobuf to encode the dictionary metadata and writes the binary result to the database.
| 71 | +
| 72 | +An example Protobuf structure is as follows:
| 73 | + |
| 74 | +```protobuf |
| 75 | +syntax = "proto3"; |
| 76 | +package databend.meta; |
| 77 | +//Describes the metadata of the dictionary |
| 78 | +message DictionaryMeta { |
| 79 | + //Dictionary name |
| 80 | + string name = 1; |
| 81 | + //Dictionary data source |
| 82 | + string source = 2; |
| 83 | + //Dictionary configuration options |
| 84 | + map<string, string> options = 3; |
| 85 | + //The schema of a table, such as column data types and other meta info. |
| 86 | + DataSchema schema = 4; |
| 87 | + //ID of the primary key column |
| 88 | + uint32 primary_column_id = 5;
| 89 | +} |
| 90 | +``` |
| 91 | + |
| 92 | +### Query the data of the DICTIONARY |
| 93 | + |
| 94 | +Define `DictionaryAsyncFunction` in the `async_function` module to facilitate asynchronous reading of external data. |
| 95 | + |
| 96 | +```rust |
| 97 | +enum DictionarySourceEngine { |
| 98 | + MySQL, |
| 99 | + PostgreSQL, |
| 100 | + // .. (further engines)
| 101 | +} |
| 102 | +``` |
| 103 | + |
| 104 | +```rust |
| 105 | +pub struct DictionaryAsyncFunction { |
| 106 | + engine: DictionarySourceEngine, |
| 107 | + // Dictionary address, for example: mysql://root:<password>@<host>:3306/default
| 108 | + url: String,
| 109 | + // SQL to get the value from the source table,
| 110 | + // for example: select name from user_info where id=1;
| 111 | + query_sql: String,
| 112 | + return_type: DataType,
| 113 | + // Maximum time to attempt a connection to the data source.
| 114 | + connection_timeout: std::time::Duration,
| 115 | + // Maximum execution time for the query operation.
| 116 | + query_timeout: std::time::Duration,
| 117 | + // Additional parameters the query might require, such as the values for placeholders in the SQL query.
| 118 | + params: Vec<ParameterValue>, |
| 119 | +} |
| 120 | +``` |
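As a rough illustration of how `query_sql` and `params` could fit together, the sketch below builds the lookup statement for one `dict_get` call. The helper `build_query_sql`, its arguments, and the `?` placeholder style are assumptions made for this example, not part of the RFC; it also assumes the identifiers have already been validated against the dictionary schema.

```rust
/// Hypothetical helper (not part of the RFC): builds the statement that a
/// `dict_get` call could run against the source table.
/// Assumption: `table`, `field`, and `key_column` are already validated
/// identifiers. The key value is left as a `?` placeholder so it can be
/// supplied separately, for example through `params`.
fn build_query_sql(table: &str, field: &str, key_column: &str) -> String {
    format!("SELECT {} FROM {} WHERE {} = ? LIMIT 1", field, table, key_column)
}

fn main() {
    // Mirrors the guide-level example: look up `user_name` by `user_id`.
    let sql = build_query_sql("table_name", "user_name", "user_id");
    println!("{}", sql);
}
```

Keeping the key out of the SQL text means the same statement can be reused for every row in a block, with only the bound value changing.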
| 121 | + |
| 122 | +Rename `AsyncFunction` in the `async_function` module to `AsyncFunctionDesc` to avoid naming conflicts with the `AsyncFunction` of the logical and physical plans, and add a variant for `DictionaryAsyncFunction`. The definition is as follows:
| 123 | + |
| 124 | +```rust |
| 125 | +pub enum AsyncFunctionDesc { |
| 126 | + SequenceAsyncFunction(SequenceAsyncFunction), |
| 127 | + DictionaryAsyncFunction(DictionaryAsyncFunction),
| 128 | +} |
| 129 | +``` |
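One way to picture how this enum is consumed: the pipeline builder can match on the descriptor to decide which transform to add. The sketch below uses simplified stand-in structs, and the function `transform_name` and the name `TransformSequence` are hypothetical; only `TransformDictionary` appears in this RFC.

```rust
// Simplified stand-ins for the real structs; only a few fields are kept.
#[derive(Debug)]
struct SequenceAsyncFunction {
    sequence_name: String,
}

#[derive(Debug)]
struct DictionaryAsyncFunction {
    url: String,
    query_sql: String,
}

#[derive(Debug)]
enum AsyncFunctionDesc {
    SequenceAsyncFunction(SequenceAsyncFunction),
    DictionaryAsyncFunction(DictionaryAsyncFunction),
}

/// Hypothetical dispatch: returns the name of the transform that would be
/// added to the pipeline for a given descriptor.
fn transform_name(desc: &AsyncFunctionDesc) -> &'static str {
    match desc {
        AsyncFunctionDesc::SequenceAsyncFunction(_) => "TransformSequence",
        AsyncFunctionDesc::DictionaryAsyncFunction(_) => "TransformDictionary",
    }
}

fn main() {
    let seq = AsyncFunctionDesc::SequenceAsyncFunction(SequenceAsyncFunction {
        sequence_name: "seq".to_string(),
    });
    let dict = AsyncFunctionDesc::DictionaryAsyncFunction(DictionaryAsyncFunction {
        url: "mysql://<host>:3306/default".to_string(),
        query_sql: "select name from user_info where id=1".to_string(),
    });
    println!("{:?} -> {}", seq, transform_name(&seq));
    println!("{:?} -> {}", dict, transform_name(&dict));
}
```

Because the `match` is exhaustive, adding another kind of asynchronous function later forces every dispatch site to handle it explicitly.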
| 130 | + |
| 131 | +Update the `AsyncFunction` definition in both the logical and physical plans by adding the `AsyncFunctionDesc` field. This process reuses existing logic for generating dictionary AsyncFunction logical and physical plans. |
| 132 | + |
| 133 | +- The logical plan struct is as follows:
| 134 | + |
| 135 | +```rust |
| 136 | +pub struct AsyncFunction { |
| 137 | + pub func_name: String, |
| 138 | + pub display_name: String, |
| 139 | + pub arguments: Vec<String>, |
| 140 | + pub return_type: DataType, |
| 141 | + pub index: IndexType, |
| 142 | + pub desc: AsyncFunctionDesc,//Newly added property |
| 143 | +} |
| 144 | +``` |
| 145 | + |
| 146 | +- The physical plan struct is as follows:
| 147 | + |
| 148 | +```rust |
| 149 | +pub struct AsyncFunction { |
| 150 | + pub plan_id: u32, |
| 151 | + pub func_name: String, |
| 152 | + pub display_name: String, |
| 153 | + pub arguments: Vec<String>, |
| 154 | + pub return_type: DataType, |
| 155 | + pub schema: DataSchemaRef, |
| 156 | + pub input: Box<PhysicalPlan>, |
| 157 | + pub stat_info: Option<PlanStatsInfo>, |
| 158 | + pub desc: AsyncFunctionDesc,//Newly added property |
| 159 | +} |
| 160 | +``` |
| 161 | + |
| 162 | +The `Transform` in the pipeline, where the actual reading of external data takes place, can be defined as follows: |
| 163 | + |
| 164 | +```rust |
| 165 | +pub struct TransformDictionary { |
| 166 | + ctx: Arc<QueryContext>, |
| 167 | + dict_func: DictionaryAsyncFunction, |
| 168 | +} |
| 169 | +``` |
| 170 | + |
| 171 | +Implement the `transform` method of the `AsyncTransform` trait and call an external database to obtain dictionary data. The main process is illustrated in the following diagram: |
| 172 | + |
| 173 | +<img src="/docs/public/img/rfc/20240721-external-dictionary/external-dictionary-1.png" alt="Flowchart of getting external data" style={{zoom:"80%"}} /> |
| 174 | + |
| 175 | +The execution process of the `dict_get` function is summarized in the following diagram: |
| 176 | + |
| 177 | +<img src="/docs/public/img/rfc/20240721-external-dictionary/external-dictionary-2.png" alt="Flowchart of the dict_get" style={{zoom:"80%"}} /> |
| 178 | + |
| 179 | +## Unresolved questions |
| 180 | + |
| 181 | +- Which algorithms or techniques could be employed to speed up dictionary queries?
| 182 | + |
| 183 | +## Future possibilities |
| 184 | + |
| 185 | +1. Users can connect multiple types of data sources through the External Dictionary to perform real-time operations on various data endpoints from the same client, such as files, HTTP interfaces, and additional databases like ClickHouse, Redis, MongoDB, etc. |
| 186 | + |
| 187 | + For example, if the data source is a local CSV file: |
| 188 | + |
| 189 | +```sql |
| 190 | +CREATE DICTIONARY dict_name |
| 191 | +( |
| 192 | + ... -- attributes |
| 193 | +) |
| 194 | +SOURCE(FILE(path './user_files/os.csv' format 'CommaSeparated')) -- Source configuration |
| 195 | +``` |
| 196 | + |
| 197 | +2. Add more functions for operating data dictionaries, such as `dict_get_or_default`, `dict_get_or_null`, `dict_has`, etc. |
| 198 | + |
| 199 | + For instance, `dict_get_or_default(dict_name, dict_field, dict_id, default_value)` includes an additional parameter for the default value to be returned if the target data is not found. |
| 200 | + |
| 201 | +3. Support configuring the built-in dictionary using the TOML format. |
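The intended behavior of the lookup variants named in item 2 can be sketched with an in-memory stand-in for a dictionary. This is only a model of the semantics: the `Dictionary` type alias, `sample_dict`, and the choice to return `Option` from `dict_get` are assumptions for illustration, and a real implementation would read from the external source instead of a `HashMap`.

```rust
use std::collections::HashMap;

/// In-memory stand-in for one dictionary: primary key -> (field name -> value).
type Dictionary = HashMap<u64, HashMap<String, String>>;

/// Builds a one-row dictionary used by the example below.
fn sample_dict() -> Dictionary {
    let mut row = HashMap::new();
    row.insert("user_name".to_string(), "alice".to_string());
    let mut dict = Dictionary::new();
    dict.insert(1, row);
    dict
}

/// Returns the field value for `id`, or `None` when the key or field is missing.
fn dict_get(dict: &Dictionary, field: &str, id: u64) -> Option<String> {
    dict.get(&id).and_then(|row| row.get(field)).cloned()
}

/// Same lookup, but falls back to `default` when nothing is found.
fn dict_get_or_default(dict: &Dictionary, field: &str, id: u64, default: &str) -> String {
    dict_get(dict, field, id).unwrap_or_else(|| default.to_string())
}

/// Reports whether the dictionary contains an entry for `id`.
fn dict_has(dict: &Dictionary, id: u64) -> bool {
    dict.contains_key(&id)
}

fn main() {
    let dict = sample_dict();
    println!("{:?}", dict_get(&dict, "user_name", 1));
    println!("{}", dict_get_or_default(&dict, "user_name", 2, "unknown"));
    println!("{}", dict_has(&dict, 2));
}
```

In this model `dict_get_or_null` corresponds to returning the `Option` as is, with `None` mapped to SQL `NULL`.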
| 202 | + |
| 203 | +## Reference |
| 204 | + |
| 205 | +[ClickHouse Dictionary](https://clickhouse.com/docs/en/dictionary)