
Commit 4d3da26

[FLINK-34682][cdc][docs] Add "Understand Flink CDC API" page for Flink CDC docs
This closes #3162.
1 parent 14b81dc commit 4d3da26

File tree: 2 files changed, 98 additions and 0 deletions

docs/content/docs/developer-guide/understand-flink-cdc-api.md

Lines changed: 98 additions & 0 deletions
@@ -5,6 +5,7 @@ type: docs
aliases:
- /developer-guide/understand-flink-cdc-api
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -23,3 +24,100 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Understand Flink CDC API

If you are planning to build your own Flink CDC connectors, or considering
contributing to Flink CDC, you might want to take a deeper look at the APIs of
Flink CDC. This document goes through some important concepts and interfaces
to help you with your development.

## Event

An event in the context of Flink CDC is a special kind of record in Flink's
data stream. It describes a captured change in the external system on the
source side, gets processed and transformed by internal operators built by
Flink CDC, and is finally passed to the data sink, which writes or applies it
to the external system on the sink side.

Each change event contains the table ID it belongs to and the payload that the
event carries. Based on the type of payload, we categorize events into the
following kinds:

### DataChangeEvent

DataChangeEvent describes data changes in the source. It consists of five
fields:

- `Table ID`: ID of the table the change belongs to
- `Before`: pre-image of the data
- `After`: post-image of the data
- `Operation type`: type of the change operation
- `Meta`: metadata of the change

For the operation type field, we pre-define 4 operation types:

- Insert: new data entry, with `before = null` and `after = new data`
- Delete: removal of data, with `before = removed data` and `after = null`
- Update: update of existing data, with `before = data before change`
  and `after = data after change`
- Replace: replacement of existing data, with `before = null`
  and `after = data after change`
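
As a concrete illustration of the field layout, the sketch below constructs
one event of each kind in Java. It assumes static factory methods on
`DataChangeEvent` and the helper types `TableId` and `RecordData`; treat the
package and method names as assumptions for illustration rather than a
confirmed API surface.

```java
// Assumed package and class names, for illustration only.
import org.apache.flink.cdc.common.data.RecordData;
import org.apache.flink.cdc.common.event.DataChangeEvent;
import org.apache.flink.cdc.common.event.TableId;

public class DataChangeEventExamples {

    public static void emit(TableId tableId, RecordData oldRow, RecordData newRow) {
        // Insert: before = null, after = new data.
        DataChangeEvent insert = DataChangeEvent.insertEvent(tableId, newRow);

        // Delete: before = removed data, after = null.
        DataChangeEvent delete = DataChangeEvent.deleteEvent(tableId, oldRow);

        // Update: before = data before change, after = data after change.
        DataChangeEvent update = DataChangeEvent.updateEvent(tableId, oldRow, newRow);
    }
}
```
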
### SchemaChangeEvent

SchemaChangeEvent describes schema changes in the source. Compared to
DataChangeEvent, the payload of a SchemaChangeEvent describes changes to the
table structure in the external system, including:

- `AddColumnEvent`: new column in the table
- `AlterColumnTypeEvent`: type change of a column
- `CreateTableEvent`: creation of a new table. Also used to describe the schema
  of a pre-emitted DataChangeEvent
- `DropColumnEvent`: removal of a column
- `RenameColumnEvent`: name change of a column
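
These types can be pictured as subtypes of one `SchemaChangeEvent` interface.
The hypothetical Java sketch below only mirrors the list above; the actual
Flink CDC classes carry richer payloads, so treat the shapes as illustrative.

```java
/** Hypothetical sketch of the SchemaChangeEvent hierarchy described above. */
interface SchemaChangeEvent {
    String tableId(); // table the schema change belongs to
}

// Simplified payloads for illustration; the real classes are more detailed.
record AddColumnEvent(String tableId, String column, String type) implements SchemaChangeEvent {}
record AlterColumnTypeEvent(String tableId, String column, String newType) implements SchemaChangeEvent {}
record CreateTableEvent(String tableId, java.util.List<String> columns) implements SchemaChangeEvent {}
record DropColumnEvent(String tableId, String column) implements SchemaChangeEvent {}
record RenameColumnEvent(String tableId, String oldName, String newName) implements SchemaChangeEvent {}
```
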
### Flow of Events

As you may have noticed, a data change event does not have its schema bound to
it. This reduces the size of data change events and the overhead of
serialization, but makes them not self-descriptive. How, then, does the
framework know how to interpret a data change event?

To resolve the problem, the framework adds a requirement to the flow of events:
a `CreateTableEvent` must be emitted before any `DataChangeEvent` if a table is
new to the framework, and a `SchemaChangeEvent` must be emitted before any
`DataChangeEvent` if the schema of a table is changed. This requirement makes
sure that the framework is aware of the schema before processing any data
changes.

{{< img src="/fig/flow-of-events.png" alt="Flow of Events" >}}
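
To make the contract concrete, here is a minimal hypothetical sketch of how a
schema-tracking operator could enforce this ordering. The event types are
simplified stand-ins, not the actual framework classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-ins for the event types described above. */
interface Event { String tableId(); }
interface Schema {}
interface CreateTableEvent extends Event { Schema schema(); }
interface DataChangeEvent extends Event {}

/** Hypothetical operator: tracks schemas so data events stay interpretable. */
class SchemaTrackingOperator {
    private final Map<String, Schema> knownSchemas = new HashMap<>();

    void processEvent(Event event) {
        if (event instanceof CreateTableEvent) {
            // Schema events arrive first and register or update the schema.
            CreateTableEvent e = (CreateTableEvent) event;
            knownSchemas.put(e.tableId(), e.schema());
        } else if (event instanceof DataChangeEvent) {
            Schema schema = knownSchemas.get(event.tableId());
            if (schema == null) {
                // Contract violated: data arrived before its schema.
                throw new IllegalStateException(
                        "No schema known for table " + event.tableId());
            }
            // Interpret the event's payload using the tracked schema ...
        }
    }
}
```
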
## Data Source

A data source works as a factory of `EventSource` and `MetadataAccessor`,
constructing the runtime implementations that capture changes from the
external system and provide metadata.

`EventSource` is a Flink source that reads changes, converts them to events,
then emits them to downstream Flink operators. You can refer to the
[Flink documentation](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/)
to learn the internals and how to implement a Flink source.

`MetadataAccessor` serves as the metadata reader of the external system,
listing namespaces, schemas and tables, and providing the table schema (table
structure) of a given table ID.
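
Put together, a data source can be thought of as the hedged sketch below. The
interface shapes and names are assumptions for illustration; the actual Flink
CDC interfaces may differ.

```java
import java.util.List;

/** Hypothetical sketch: a data source is a factory of the two runtime parts. */
interface MyDataSource {
    MyEventSource createEventSource();           // captures changes, emits events
    MyMetadataAccessor createMetadataAccessor(); // reads metadata
}

/** Hypothetical Flink source that reads changes and emits them as events. */
interface MyEventSource {}

/** Hypothetical metadata reader mirroring the responsibilities above. */
interface MyMetadataAccessor {
    List<String> listNamespaces();
    List<String> listSchemas(String namespace);
    List<String> listTables(String namespace, String schema);
    MySchema getTableSchema(String tableId);     // table structure for an ID
}

/** Placeholder for a table schema (column names and types). */
interface MySchema {}
```
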
## Data Sink

Symmetrical with the data source, a data sink consists of `EventSink`
and `MetadataApplier`, which write data change events and apply schema
changes (metadata changes) to the external system.

`EventSink` is a Flink sink that receives change events from the upstream
operator and applies them to the external system. Currently, only Flink's
Sink V2 API is supported.

`MetadataApplier` handles schema changes. When the framework receives a schema
change event from the source, it performs some internal synchronizations and
flushes, and then applies the schema change to the external system via this
applier.
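
Mirroring the source-side sketch, a hedged illustration of the sink side is
shown below. The reference to Flink's Sink V2 interface
(`org.apache.flink.api.connector.sink2.Sink`) is real; the remaining names are
assumptions for illustration.

```java
import org.apache.flink.api.connector.sink2.Sink;

/** Hypothetical sketch: a data sink pairs a Flink sink with a schema applier. */
interface MyDataSink {
    Sink<MyEvent> createEventSink();             // writes data change events
    MyMetadataApplier createMetadataApplier();   // applies schema changes
}

/** Hypothetical applier invoked after internal synchronizations and flushes. */
interface MyMetadataApplier {
    void applySchemaChange(MySchemaChangeEvent event);
}

/** Placeholders for the events flowing into the sink. */
interface MyEvent {}
interface MySchemaChangeEvent extends MyEvent {}
```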

docs/static/fig/flow-of-events.png

158 KB