[FLINK-34679][cdc] add doc under core-concept. #3153

Merged
merged 6 commits on Mar 18, 2024
77 changes: 77 additions & 0 deletions docs/content/docs/core-concept/data-pipeline.md
@@ -23,3 +23,80 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Definition
Since events in Flink CDC flow from upstream to downstream in a pipeline manner, the whole ETL task is referred to as a **Data Pipeline**.

# Parameters
A pipeline corresponds to a chain of operators in Flink.
To describe a Data Pipeline, the following parts are required:
- [source]({{< ref "docs/core-concept/data-source" >}})
- [sink]({{< ref "docs/core-concept/data-sink" >}})
- [pipeline](#pipeline-configurations)

The following parts are optional (a skeleton combining all of these blocks is sketched after this list):
- [route]({{< ref "docs/core-concept/route" >}})
- [transform]({{< ref "docs/core-concept/transform" >}})
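
As a rough sketch of how these blocks fit together in one pipeline definition file (all values below are placeholders; the exact rule syntax for route and transform is covered on their own pages):

```yaml
source:        # required: where change data is read from
  type: ...
sink:          # required: where change data is written to
  type: ...
route:         # optional: source-table to sink-table mappings
  - source-table: ...
    sink-table: ...
transform:     # optional: column projection / filtering rules (see the transform page)
  - ...
pipeline:      # required: pipeline-level options such as name and parallelism
  name: ...
  parallelism: 1
```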

# Example
## Only required
We could use the following YAML file to define a concise Data Pipeline that synchronizes all tables under the MySQL `app_db` database to Doris:

```yaml
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
```

## With optional
We could use the following YAML file to define a more complete Data Pipeline that synchronizes all tables under the MySQL `app_db` database to Doris, routing them to the target database `ods_db` with the target table name prefix `ods_`:

```yaml
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

route:
  - source-table: app_db.orders
    sink-table: ods_db.ods_orders
  - source-table: app_db.shipments
    sink-table: ods_db.ods_shipments
  - source-table: app_db.products
    sink-table: ods_db.ods_products

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
```

# Pipeline Configurations
The following Data Pipeline-level config options are supported:

| parameter | meaning | optional/required |
|-----------------|-----------------------------------------------------------------------------------------|-------------------|
| name | The name of the pipeline, which will be submitted to the Flink cluster as the job name. | optional |
| parallelism | The global parallelism of the pipeline. | required |
| local-time-zone | The local time zone, which defines the current session time zone id.                     | optional          |
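
For example, a pipeline block that sets all three options might look like the following sketch (the time zone value is only illustrative):

```yaml
pipeline:
  name: Sync MySQL Database to Doris   # submitted to the Flink cluster as the job name
  parallelism: 4                       # global parallelism of the pipeline
  local-time-zone: Asia/Shanghai       # current session time zone id (illustrative value)
```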
25 changes: 25 additions & 0 deletions docs/content/docs/core-concept/data-sink.md
@@ -23,3 +23,28 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Definition
**Data Sink** is used to apply schema changes and write change data to external systems.
A Data Sink can write to multiple tables simultaneously.

# Parameters
To describe a Data Sink, the following parameters are used:

| parameter | meaning | optional/required |
|-----------------------------|-------------------------------------------------------------------------------------------------|-------------------|
| type | The type of the sink, such as doris or starrocks. | required |
| name | The name of the sink, which is user-defined (a default value provided). | optional |
| configurations of Data Sink | Configurations to build the Data Sink e.g. connection configurations and sink table properties. | optional |

# Example
We could use this YAML file to define a Doris sink:
```yaml
sink:
  type: doris
  name: doris-sink                               # Optional parameter for description purposes
  fenodes: 127.0.0.1:8030
  username: root
  password: ""
  table.create.properties.replication_num: 1     # Optional parameter for advanced functionalities
```
26 changes: 26 additions & 0 deletions docs/content/docs/core-concept/data-source.md
@@ -23,3 +23,29 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Definition
**Data Source** is used to access metadata and read the changed data from external systems.
A Data Source can read data from multiple tables simultaneously.

# Parameters
To describe a Data Source, the following parameters are used:

| parameter | meaning | optional/required |
|-------------------------------|-----------------------------------------------------------------------------------------------------|-------------------|
| type | The type of the source, such as mysql. | required |
| name | The name of the source, which is user-defined (a default value provided). | optional |
| configurations of Data Source | Configurations to build the Data Source e.g. connection configurations and source table properties. | optional |

# Example
We could use this YAML file to define a MySQL source:
```yaml
source:
  type: mysql
  name: mysql-source                  # Optional parameter for description purposes
  hostname: localhost
  port: 3306
  username: admin
  password: pass
  tables: adb.*, bdb.user_table_[0-9]+, [app|web]_order_\.*
```
49 changes: 49 additions & 0 deletions docs/content/docs/core-concept/route.md
@@ -23,3 +23,52 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Definition
**Route** specifies the rules for matching a list of source tables and mapping them to sink tables. The most typical scenario is merging sub-databases and sub-tables, i.e., routing multiple upstream source tables to the same sink table.

# Parameters
To describe a route, the following parameters are used:

| parameter | meaning | optional/required |
|--------------|----------------------------------------------------|-------------------|
| source-table | Source table id, supports regular expressions | required |
| sink-table | Sink table id, supports regular expressions | required |
| description  | Routing rule description (a default value provided) | optional          |

A route module can contain a list of source-table/sink-table rules.

# Example
## Route one Data Source table to one Data Sink table
If you want to synchronize the table `web_order` in the database `mydb` to a Doris table `ods_web_order`, you can use this YAML file to define the route:

```yaml
route:
  source-table: mydb.web_order
  sink-table: mydb.ods_web_order
  description: sync table to one destination table with given prefix ods_
```

## Route multiple Data Source tables to one Data Sink table
Furthermore, if you want to synchronize the sharding tables in the database `mydb` to a Doris table `ods_web_order`, you can use this YAML file to define the route:
```yaml
route:
  source-table: mydb\.*
  sink-table: mydb.ods_web_order
  description: sync sharding tables to one destination table
```

## Complex Route via combining route rules
Finally, if you want to specify several different mapping rules, you can combine them in one route like this:
```yaml
route:
  - source-table: mydb.orders
    sink-table: ods_db.ods_orders
    description: sync orders table to orders
  - source-table: mydb.shipments
    sink-table: ods_db.ods_shipments
    description: sync shipments table to ods_shipments
  - source-table: mydb.products
    sink-table: ods_db.ods_products
    description: sync products table to ods_products
```
15 changes: 15 additions & 0 deletions docs/content/docs/core-concept/table-id.md
@@ -23,3 +23,18 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Definition
When connecting to external systems, it is necessary to establish a mapping relationship with the storage objects of the external system. This is what **Table Id** refers to.

# Example
To be compatible with most external systems, the Table Id is represented by a 3-tuple: (namespace, schemaName, tableName).
Connectors should establish the mapping between Table Id and storage objects in external systems.

The following table lists the parts of the Table Id for different data systems:

| data system | parts in tableId | String example |
|-----------------------|--------------------------|---------------------|
| Oracle/PostgreSQL | database, schema, table | mydb.default.orders |
| MySQL/Doris/StarRocks | database, table | mydb.orders |
| Kafka | topic | orders |
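
As a sketch, a Table Id string is what goes into table-selecting options such as `tables` in a source or `source-table`/`sink-table` in a route; for MySQL the two-part `database.table` form from the table above applies (values are illustrative):

```yaml
source:
  type: mysql
  tables: mydb.orders              # MySQL Table Id: database.table
route:
  - source-table: mydb.orders      # matched by its Table Id
    sink-table: ods_db.ods_orders  # Table Id in the sink system
```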
7 changes: 7 additions & 0 deletions docs/content/docs/core-concept/transform.md
@@ -23,3 +23,10 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Definition
The **Transform** module helps users delete and expand data columns based on the existing columns in the table.
It also helps users filter out unnecessary data during the synchronization process.

# Example
This feature will be supported soon.