type: docs
aliases:
  - /developer-guide/understand-flink-cdc-api
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Understand Flink CDC API

If you are planning to build your own Flink CDC connectors, or considering
contributing to Flink CDC, you might want to have a deeper look at the APIs of
Flink CDC. This document goes through some important concepts and interfaces
to help you with your development.

## Event

An event in the context of Flink CDC is a special kind of record in Flink's
data stream. It describes captured changes in the external system on the
source side, gets processed and transformed by internal operators built by
Flink CDC, and is finally passed to the data sink, which writes or applies it
to the external system on the sink side.

Each change event contains the table ID it belongs to and the payload that the
event carries. Based on the type of payload, we categorize events into the
following kinds:

### DataChangeEvent

DataChangeEvent describes data changes in the source. It consists of 5 fields:

- `Table ID`: ID of the table the change belongs to
- `Before`: pre-image of the data
- `After`: post-image of the data
- `Operation type`: type of the change operation
- `Meta`: metadata of the change

For the operation type field, we pre-define 4 operation types:

- Insert: new data entry, with `before = null` and `after = new data`
- Delete: removal of data, with `before = removed data` and `after = null`
- Update: update of existing data, with `before = data before change`
  and `after = data after change`
- Replace: replacement of existing data, with `before = null`
  and `after = new data`

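To make these fields concrete, here is a minimal, self-contained Java sketch of
the structure described above. The class and field shapes are illustrative
only, not the actual Flink CDC classes, which carry `RecordData` payloads and
richer metadata.

```java
import java.util.Map;

public final class DataChangeEventSketch {

    // The 4 pre-defined operation types.
    enum OperationType { INSERT, DELETE, UPDATE, REPLACE }

    // Simplified event: the real class carries RecordData images whose
    // interpretation depends on the schema announced earlier in the stream.
    record DataChangeEvent(
            String tableId,              // table ID the change belongs to
            Object[] before,             // pre-image, null for insert
            Object[] after,              // post-image, null for delete
            OperationType op,            // type of the change operation
            Map<String, String> meta) {} // metadata of the change

    public static void main(String[] args) {
        // An insert carries only the post-image ...
        DataChangeEvent insert = new DataChangeEvent(
                "mydb.users", null, new Object[] {1, "Alice"},
                OperationType.INSERT, Map.of());

        // ... while an update carries both images.
        DataChangeEvent update = new DataChangeEvent(
                "mydb.users",
                new Object[] {1, "Alice"},
                new Object[] {1, "Alicia"},
                OperationType.UPDATE, Map.of("source", "binlog"));

        System.out.println(insert.op() + " on " + insert.tableId());
        System.out.println(update.op() + " on " + update.tableId());
    }
}
```
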
### SchemaChangeEvent

SchemaChangeEvent describes schema changes in the source. In contrast to
DataChangeEvent, the payload of SchemaChangeEvent describes changes to the
table structure in the external system, including:

- `AddColumnEvent`: new column in the table
- `AlterColumnTypeEvent`: type change of a column
- `CreateTableEvent`: creation of a new table; also used to describe the schema
  of a pre-emitted DataChangeEvent
- `DropColumnEvent`: removal of a column
- `RenameColumnEvent`: name change of a column

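As an illustration only, the kinds listed above can be modeled as a small
sealed hierarchy. These records are simplified stand-ins; the real events are
classes in Flink CDC that carry full schema and column objects.

```java
import java.util.List;

public class SchemaChangeSketch {

    // Each schema change kind becomes one permitted subtype.
    sealed interface SchemaChangeEvent
            permits AddColumnEvent, AlterColumnTypeEvent, CreateTableEvent,
                    DropColumnEvent, RenameColumnEvent {}

    record AddColumnEvent(String tableId, String column, String type)
            implements SchemaChangeEvent {}
    record AlterColumnTypeEvent(String tableId, String column, String newType)
            implements SchemaChangeEvent {}
    record CreateTableEvent(String tableId, List<String> columns)
            implements SchemaChangeEvent {}
    record DropColumnEvent(String tableId, String column)
            implements SchemaChangeEvent {}
    record RenameColumnEvent(String tableId, String oldName, String newName)
            implements SchemaChangeEvent {}

    public static void main(String[] args) {
        SchemaChangeEvent e =
                new RenameColumnEvent("mydb.users", "name", "full_name");
        System.out.println(e);
    }
}
```
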
### Flow of Events

As you may have noticed, a data change event doesn't have its schema bound to
it. This reduces the size of data change events and the overhead of
serialization, but makes them not self-descriptive. How, then, does the
framework know how to interpret a data change event?

To resolve the problem, the framework adds a requirement to the flow of events:
a `CreateTableEvent` must be emitted before any `DataChangeEvent` if a table is
new to the framework, and a `SchemaChangeEvent` must be emitted before any
`DataChangeEvent` if the schema of the table has changed. This requirement
makes sure that the framework is aware of the schema before processing any
data changes.

{{< img src="/fig/flow-of-events.png" alt="Flow of Events" >}}

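The sketch below mirrors this ordering rule with a hypothetical `validateFlow`
helper. It is not part of Flink CDC; it only shows how the requirement could
be checked over a stream of events.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FlowOfEventsSketch {

    interface Event { String tableId(); }
    record CreateTableEvent(String tableId) implements Event {}
    record DataChangeEvent(String tableId) implements Event {}

    // Enforce the contract: every DataChangeEvent must be preceded by a
    // CreateTableEvent for the same table.
    static void validateFlow(List<Event> events) {
        Set<String> knownTables = new HashSet<>();
        for (Event e : events) {
            if (e instanceof CreateTableEvent) {
                knownTables.add(e.tableId());
            } else if (!knownTables.contains(e.tableId())) {
                throw new IllegalStateException(
                        "DataChangeEvent for unknown table: " + e.tableId());
            }
        }
    }

    public static void main(String[] args) {
        // Legal flow: schema first, then data changes.
        validateFlow(List.of(
                new CreateTableEvent("mydb.users"),
                new DataChangeEvent("mydb.users"),
                new DataChangeEvent("mydb.users")));

        // Illegal flow: data change before the table's schema is announced.
        try {
            validateFlow(List.of(new DataChangeEvent("mydb.orders")));
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```
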
## Data Source

Data source works as a factory of `EventSource` and `MetadataAccessor`,
constructing the runtime implementations of a source that captures changes
from the external system and provides its metadata.

`EventSource` is a Flink source that reads changes, converts them to events,
then emits them to downstream Flink operators. You can refer to the
[Flink documentation](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/)
to learn about the internals and how to implement a Flink source.

`MetadataAccessor` serves as the metadata reader of the external system,
listing namespaces, schemas and tables, and providing the table schema (table
structure) of a given table ID.

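As a rough sketch under assumed interface shapes (the real interfaces
integrate with Flink's Source API and differ in their signatures), the factory
role of a data source could be modeled like this:

```java
import java.util.List;

public class DataSourceSketch {

    // Reads changes from the external system (a Flink source at runtime).
    interface EventSource {
        void run();
    }

    // Reads metadata: namespaces, schemas, tables, and per-table schemas.
    interface MetadataAccessor {
        List<String> listNamespaces();
        List<String> listTables(String namespace);
        String getTableSchema(String tableId);
    }

    // The data source ties the two together as a factory.
    interface DataSource {
        EventSource getEventSource();
        MetadataAccessor getMetadataAccessor();
    }
}
```
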
## Data Sink

Symmetrical with the data source, a data sink consists of `EventSink`
and `MetadataApplier`, which write data change events and apply schema
changes (metadata changes) to the external system.

`EventSink` is a Flink sink that receives change events from upstream
operators and applies them to the external system. Currently, only Flink's
Sink V2 API is supported.

`MetadataApplier` is used to handle schema changes. When the framework
receives a schema change event from the source, after making some internal
synchronizations and flushes, it applies the schema change to the external
system via this applier.
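
Mirroring the source-side sketch, here is an illustrative model of the
sink-side pair. The interface shapes are assumptions for illustration; the
real `EventSink` builds on Flink's Sink V2 API.

```java
public class DataSinkSketch {

    // Writes data change events to the external system.
    interface EventSink {
        void write(Object dataChangeEvent);
    }

    // Applies schema changes to the external system after the framework
    // has synchronized and flushed in-flight data.
    interface MetadataApplier {
        void applySchemaChange(Object schemaChangeEvent);
    }

    // The data sink is the factory for both runtime pieces.
    interface DataSink {
        EventSink getEventSink();
        MetadataApplier getMetadataApplier();
    }
}
```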