Commit fa85621

add external-dictionary rfc (#996)
* add external-dictionary rfc * Update 05-external-dictionary.md * fix * fix * fix --------- Co-authored-by: Quan <[email protected]>
1 parent 8ab8472 commit fa85621

3 files changed

+205
-0
lines changed

@@ -0,0 +1,205 @@
---
title: External Dictionaries
description: This RFC proposes the implementation of an external dictionary feature in Databend to allow seamless access to data from external sources.
---

- RFC PR: [datafuselabs/databend-docs#996](https://github.com/datafuselabs/databend-docs/pull/996)
- Tracking Issue: [datafuselabs/databend#15901](https://github.com/datafuselabs/databend/issues/15901)

## Summary

Implementing External Dictionary allows Databend to access data from external data sources.

## Motivation

Accessing data from an external database such as MySQL within Databend usually requires exporting the MySQL dataset and then importing it into Databend. This procedure is burdensome for large amounts of data, and frequent updates at the source can leave the imported copy inconsistent.

An external dictionary feature resolves these problems by integrating Databend with other database systems. Once a dictionary is created, Databend reads the external dataset directly, so modifications at the source are visible in real time and overall data management is simpler.

## Guide-level explanation

A DICTIONARY is created, deleted, and queried with the following syntax.

1. Create a dictionary named user_info.

```sql
CREATE DICTIONARY user_info(
    user_id UInt64,
    user_name String,
    user_address String
)
primary key(user_id)
SOURCE(MYSQL(
    host 'localhost'
    user 'root'
    password 'root'
    db 'db_name'
    table 'table_name'
));
```

2. List the existing dictionaries.

```sql
SHOW DICTIONARIES;
```

3. Show the SQL statement used to create the dictionary user_info.

```sql
SHOW CREATE DICTIONARY user_info;
```

4. Delete the dictionary user_info.

```sql
DROP DICTIONARY user_info;
```

You can use `dict_get(dict_name, dict_field, dict_id)` to query data from a dictionary.

The `dict_get` function takes three arguments: the first is the name of the dictionary, the second is the field to query, and the third is the ID (primary key value) of the dictionary entry to look up.

## Reference-level explanation

The metadata of a DICTIONARY is stored in the meta module of Databend and is used to retrieve the necessary information when executing SQL queries.

### Use protobuf to encode the data

Protocol Buffers (Protobuf) is a data serialization framework that stores data in a compact binary format, serializes and deserializes quickly, supports multiple languages, and uses a well-defined schema for data structures. Databend therefore uses Protobuf to encode the dictionary metadata and stores the resulting binary data in the database.

An example Protobuf definition is as follows:

```protobuf
syntax = "proto3";
package databend.meta;
// Describes the metadata of the dictionary.
message DictionaryMeta {
  // Dictionary name
  string name = 1;
  // Dictionary data source
  string source = 2;
  // Dictionary configuration options
  map<string, string> options = 3;
  // The schema of a table, such as column data types and other meta info.
  // DataSchema is assumed to be defined elsewhere in this package.
  DataSchema schema = 4;
  // ID of the primary key column
  uint32 primary_column_id = 5;
}
```

### Query the data of the DICTIONARY

Define `DictionaryAsyncFunction` in the `async_function` module to facilitate asynchronous reading of external data.

```rust
enum DictionarySourceEngine {
    MySQL,
    PostgreSQL,
    // ... other engines
}
```

```rust
pub struct DictionaryAsyncFunction {
    engine: DictionarySourceEngine,
    // Dictionary address, for example: mysql://root:<password>@<host>:3306/default
    url: String,
    // SQL to get the value from the source table,
    // for example: select name from user_info where id=1;
    query_sql: String,
    return_type: DataType,
    // Maximum time to attempt a connection to the data source.
    connection_timeout: std::time::Duration,
    // Maximum execution time for the query operation.
    query_timeout: std::time::Duration,
    // Additional parameters that the query might require, such as the values for placeholders in the SQL query.
    params: Vec<ParameterValue>,
}
```
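
As a rough illustration of how the `url` and `query_sql` fields might be derived from the `SOURCE(MYSQL(...))` options, here is a minimal sketch. The `MysqlSourceOptions` struct, the `port` field, and both helper functions are hypothetical names introduced for this example and are not part of the RFC. The sketch assumes that credentials need no percent-encoding and that the field, table, and key column names have already been validated as identifiers.

```rust
/// Hypothetical subset of the `SOURCE(MYSQL(...))` options (not defined by the RFC).
pub struct MysqlSourceOptions {
    pub host: String,
    pub port: u16, // assumption: the RFC example does not list a port option
    pub user: String,
    pub password: String,
    pub db: String,
    pub table: String,
}

/// Builds a connection URL of the form `mysql://user:password@host:port/db`.
/// Assumption: no component contains characters that need percent-encoding.
pub fn build_url(opts: &MysqlSourceOptions) -> String {
    format!(
        "mysql://{}:{}@{}:{}/{}",
        opts.user, opts.password, opts.host, opts.port, opts.db
    )
}

/// Builds a lookup statement for one dictionary field.
/// The key is left as a `?` placeholder so its value can travel separately
/// (for instance in `params`) instead of being spliced into the SQL text.
pub fn build_query_sql(opts: &MysqlSourceOptions, field: &str, key_column: &str) -> String {
    format!("SELECT {} FROM {} WHERE {} = ?", field, opts.table, key_column)
}
```

Keeping the key as a placeholder is one possible design choice: the statement text stays the same for every lookup and only the bound value changes.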

Rename `AsyncFunction` in the `async_function` module to `AsyncFunctionDesc` to avoid a naming conflict with the `AsyncFunction` of the logical and physical plans, and add a `DictionaryAsyncFunction` variant. The definition is as follows:

```rust
pub enum AsyncFunctionDesc {
    SequenceAsyncFunction(SequenceAsyncFunction),
    DictionaryAsyncFunction(DictionaryAsyncFunction),
}
```
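
To show why a single enum is convenient here, the following sketch dispatches on the variant with `match`. It is only an illustration: the payload structs are stubs, and the `sequence_name` field and the `describe` function are hypothetical names that do not come from the RFC.

```rust
// Stub payloads for illustration; the real structs carry more fields.
pub struct SequenceAsyncFunction {
    pub sequence_name: String, // hypothetical field
}

pub struct DictionaryAsyncFunction {
    pub url: String,
    pub query_sql: String,
}

pub enum AsyncFunctionDesc {
    SequenceAsyncFunction(SequenceAsyncFunction),
    DictionaryAsyncFunction(DictionaryAsyncFunction),
}

/// Returns a short description of the work each variant would trigger.
/// The compiler forces every variant to be handled, so adding a new kind
/// of async function cannot be silently ignored by existing code.
pub fn describe(desc: &AsyncFunctionDesc) -> String {
    match desc {
        AsyncFunctionDesc::SequenceAsyncFunction(f) => {
            format!("next value of sequence {}", f.sequence_name)
        }
        AsyncFunctionDesc::DictionaryAsyncFunction(f) => {
            format!("run `{}` against {}", f.query_sql, f.url)
        }
    }
}
```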
130+
131+
Update the `AsyncFunction` definition in both the logical and physical plans by adding the `AsyncFunctionDesc` field. This process reuses existing logic for generating dictionary AsyncFunction logical and physical plans.
132+
133+
- The struct of logical plan is as follows:
134+
135+
```rust
136+
pub struct AsyncFunction {
137+
pub func_name: String,
138+
pub display_name: String,
139+
pub arguments: Vec<String>,
140+
pub return_type: DataType,
141+
pub index: IndexType,
142+
pub desc: AsyncFunctionDesc,//Newly added property
143+
}
144+
```
145+
- The physical plan struct is as follows:

```rust
pub struct AsyncFunction {
    pub plan_id: u32,
    pub func_name: String,
    pub display_name: String,
    pub arguments: Vec<String>,
    pub return_type: DataType,
    pub schema: DataSchemaRef,
    pub input: Box<PhysicalPlan>,
    pub stat_info: Option<PlanStatsInfo>,
    pub desc: AsyncFunctionDesc, // Newly added field
}
```

The `Transform` in the pipeline, where the actual reading of external data takes place, can be defined as follows:

```rust
pub struct TransformDictionary {
    ctx: Arc<QueryContext>,
    dict_func: DictionaryAsyncFunction,
}
```

Implement the `transform` method of the `AsyncTransform` trait and call an external database to obtain dictionary data. The main process is illustrated in the following diagram:

<img src="/docs/public/img/rfc/20240721-external-dictionary/external-dictionary-1.png" alt="Flowchart of getting external data" style={{zoom:"80%"}} />
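
The core of such a transform can be sketched as a mapping from a column of keys to a column of looked-up values. The sketch below is a deliberately simplified, synchronous stand-in: `DictionarySource`, `InMemorySource`, and `transform_keys` are hypothetical names, and a real implementation would run `query_sql` against the external database asynchronously instead of reading from a `HashMap`.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over an external source (not part of the RFC).
/// A real source would execute a query over the network.
pub trait DictionarySource {
    fn lookup(&self, key: u64) -> Option<String>;
}

/// In-memory stand-in for an external table, used only for illustration.
pub struct InMemorySource {
    pub rows: HashMap<u64, String>,
}

impl DictionarySource for InMemorySource {
    fn lookup(&self, key: u64) -> Option<String> {
        self.rows.get(&key).cloned()
    }
}

/// Maps a column of keys to a column of looked-up values.
/// A missing key yields `None`, so row count and row order are preserved.
pub fn transform_keys<S: DictionarySource>(source: &S, keys: &[u64]) -> Vec<Option<String>> {
    keys.iter().map(|key| source.lookup(*key)).collect()
}
```

Returning one output entry per input key keeps the result aligned with the input block, which is what a column-oriented pipeline stage generally needs.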
174+
175+
The execution process of the `dict_get` function is summarized in the following diagram:
176+
177+
<img src="/docs/public/img/rfc/20240721-external-dictionary/external-dictionary-2.png" alt="Flowchart of the dict_get" style={{zoom:"80%"}} />
178+
179+
## Unresolved questions
180+
181+
- Can algorithms be employed to improve the speed of data dictionary queries?
182+
183+
## Future possibilities
184+
185+
1. Users can connect multiple types of data sources through the External Dictionary to perform real-time operations on various data endpoints from the same client, such as files, HTTP interfaces, and additional databases like ClickHouse, Redis, MongoDB, etc.
186+
187+
For example, if the data source is a local CSV file:
188+
189+
```sql
190+
CREATE DICTIONARY dict_name
191+
(
192+
... -- attributes
193+
)
194+
SOURCE(FILE(path './user_files/os.csv' format 'CommaSeparated')) -- Source configuration
195+
```
196+
197+
2. Add more functions for operating data dictionaries, such as `dict_get_or_default`, `dict_get_or_null`, `dict_has`, etc.
198+
199+
For instance, `dict_get_or_default(dict_name, dict_field, dict_id, default_value)` includes an additional parameter for the default value to be returned if the target data is not found.
200+
201+
3. Support configuring the built-in dictionary using the TOML format.
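
The intended behavior of the additional lookup functions in item 2 can be modeled with a toy in-memory dictionary. This is a sketch of the semantics only, under the assumption that a missing key is the sole failure case. The `Dict` struct and its methods are hypothetical and say nothing about how Databend would implement these functions.

```rust
use std::collections::HashMap;

/// Toy dictionary with a single attribute, keyed by ID (illustration only).
pub struct Dict {
    pub values: HashMap<u64, String>,
}

impl Dict {
    /// Models `dict_has`: reports whether the key exists.
    pub fn has(&self, id: u64) -> bool {
        self.values.contains_key(&id)
    }

    /// Models `dict_get_or_null`: `None` stands in for SQL NULL.
    pub fn get_or_null(&self, id: u64) -> Option<&str> {
        self.values.get(&id).map(String::as_str)
    }

    /// Models `dict_get_or_default`: falls back to `default` for a missing key.
    pub fn get_or_default<'a>(&'a self, id: u64, default: &'a str) -> &'a str {
        self.get_or_null(id).unwrap_or(default)
    }
}
```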
202+
203+
## Reference
204+
205+
[Clickhouse Dictionary](https://clickhouse.com/docs/en/dictionary)