[FLINK-35272][cdc][runtime] Pipeline Transform job supports omitting / renaming calculation column #3285

yuxiqian · 2024-04-30T10:55:14Z

This closes FLINK-35272, FLINK-35774, FLINK-35852.

Currently, pipeline jobs with transform (including projection and filtering) are constructed with the following topology:

SchemaTransformOp --> DataTransformOp --> SchemaOp

where schema projections are applied in SchemaTransformOp, while data projection and filtering are applied in DataTransformOp. The idea is that SchemaTransformOp might be embedded in sources in the future to reduce the payload size transferred within the Flink job.

However, the current implementation has a known defect: it omits unused columns too early, so columns that downstream operators rely on are already gone by the time records arrive in DataTransformOp. See the following example:

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, upper(name) as newname
    filter: age > 18

Such transformation rules will fail because the name and age columns are removed in SchemaTransformOp, so their values cannot be retrieved in DataTransformOp, where the actual expression evaluation and filtering take place.

This PR introduces a new design, renaming the transform topology as follows:

PreTransformOp --> PostTransformOp --> SchemaOp

where PreTransformOp filters out a column only if both of the following hold:

- The column is not present in the projection rules
- The column is not referenced, even indirectly, by any calculation or filtering expression

Referenced columns are emitted in exactly the same order as in the original schema. All schema and data events concerning those temporarily referenced columns are omitted after PostTransformOp. For example, given the following transform rule:

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, age + 4 as newage
    filter: age > 4

PreTransformOp will yield the intermediate schema (ID INT NOT NULL, AGE INT) and the correspondingly trimmed data records to downstream. Calculated columns (newage here) are not created at this stage since they have not been evaluated yet; unused columns (name here) are removed as early as possible.
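The pruning rule above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code in this PR: the class `PreTransformPruningSketch` and its methods are hypothetical, column names are assumed to be lower-case, projection expressions are passed with their `as alias` part already stripped, and column references are found by naive identifier scanning (string literals are not handled) rather than by a real SQL parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Flink CDC): picks the source columns to keep before evaluation. */
final class PreTransformPruningSketch {

    // Naive identifier matcher; an assumption made for this sketch only.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Collects every source column that appears as an identifier in any of the expressions. */
    static Set<String> referencedColumns(List<String> sourceColumns, List<String> expressions) {
        Set<String> referenced = new LinkedHashSet<>();
        for (String expression : expressions) {
            Matcher matcher = IDENTIFIER.matcher(expression);
            while (matcher.find()) {
                String token = matcher.group().toLowerCase();
                if (sourceColumns.contains(token)) {
                    referenced.add(token);
                }
            }
        }
        return referenced;
    }

    /** Keeps a column if a projection or the filter mentions it, preserving the original schema order. */
    static List<String> retainedColumns(
            List<String> sourceColumns, List<String> projectionExpressions, String filter) {
        List<String> expressions = new ArrayList<>(projectionExpressions);
        if (filter != null) {
            expressions.add(filter);
        }
        Set<String> referenced = referencedColumns(sourceColumns, expressions);
        List<String> retained = new ArrayList<>();
        for (String column : sourceColumns) { // iterate in schema order, not in reference order
            if (referenced.contains(column)) {
                retained.add(column);
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("id", "name", "age");
        // projection: id, age + 4 as newage; filter: age > 4
        System.out.println(retainedColumns(schema, List.of("id", "age + 4"), "age > 4")); // prints [id, age]
    }
}
```

For the earlier failing rule (projection `id, upper(name) as newname`, filter `age > 18`), the same sketch keeps all three columns, since `name` and `age` are both referenced.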

If a column is explicitly written down in the projection, it is passed downstream as-is. Referenced columns, however, get a special prefix added to their names. For the first example (projection id, upper(name) as newname with filter age > 18), a schema like [id, newname, __PREFIX__name, __PREFIX__age] is passed downstream. Note that expression evaluation and filtering have not taken effect at this point, so a DataChangeEvent would look like [1, null, 'Alice', 19].

~~Adding prefix is meant to deal with such cases:~~

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, upper(name) as name

Here we need to distinguish the calculated (new) name column from the referenced original (old) name column. After the name mangling process, the schema would look like [id, name, __PREFIX__name].

Also, filtering is still done in PostTransformOp, since users could write a filter expression that references a calculated column whose value is not available until PostTransformOp evaluates it. It also means that, in the following somewhat ambiguous case:

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, age * 2 as age
    filter: age > 18

~~The filtering expression is applied to the calculated age column (doubled!) instead of the original one.~~

Now, any calculated column referenced in a filter expression is rewritten to its original definition. For example, the following transform rule:

transform:
  - source-table: employee
    projection: id, age * 2 as newage
    filter: newage > 18

...will be rewritten as follows:

transform:
  - source-table: employee
    projection: id, age * 2 as newage
    filter: age * 2 > 18

Hence, no calculated columns need to be evaluated before the filtering process.
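A minimal sketch of this filter rewrite, assuming a plain token substitution instead of a real SQL parser: the class `FilterRewriteSketch` and its method are hypothetical names used only for illustration, identifiers inside string literals are not treated specially, and each substituted definition is wrapped in parentheses so that operator precedence is preserved.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Flink CDC): inlines calculated-column definitions into a filter. */
final class FilterRewriteSketch {

    // Naive identifier matcher; an assumption made for this sketch only.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /**
     * Replaces every identifier that names a calculated column with its parenthesized definition.
     * The substitution is single-pass: inserted definitions are not scanned again.
     */
    static String rewriteFilter(String filter, Map<String, String> calculatedColumns) {
        Matcher matcher = IDENTIFIER.matcher(filter);
        StringBuffer rewritten = new StringBuffer();
        while (matcher.find()) {
            String definition = calculatedColumns.get(matcher.group());
            String replacement = definition == null ? matcher.group() : "(" + definition + ")";
            // quoteReplacement keeps '$' and '\' in the definition from being read as group references
            matcher.appendReplacement(rewritten, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(rewritten);
        return rewritten.toString();
    }

    public static void main(String[] args) {
        // projection: id, age * 2 as newage; filter: newage > 18
        System.out.println(rewriteFilter("newage > 18", Map.of("newage", "age * 2")));
        // prints (age * 2) > 18
    }
}
```

The parentheses make the output differ textually from the rewritten rule shown above, but the two filters are equivalent; without them, a definition such as `age + 4` substituted into `newage * 2 > 18` would change meaning.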

yuxiqian · 2024-04-30T11:00:31Z

This PR is still at a very early stage; looking for @aiwenmo & @lvyanquan's comments.

yuxiqian · 2024-05-06T07:19:18Z

Updated based on previous comments, cc @aiwenmo

flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/parser/TransformParser.java

...src/test/java/org/apache/flink/cdc/runtime/operators/transform/PreTransformOperatorTest.java

yuxiqian · 2024-05-07T08:38:14Z

Thanks to @aiwenmo for the kind review; I've addressed the comments above.

...me/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PostTransformOperator.java

...runtime/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PostTransformers.java

flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/parser/TransformParser.java

yuxiqian · 2024-05-11T06:38:49Z

Thanks @aiwenmo for reviewing, I've addressed your comments.

aiwenmo

LGTM

yuxiqian · 2024-05-20T02:01:52Z

cc @PatrickRen @lvyanquan

lvyanquan

Thanks @yuxiqian for this, I've left some comments.

...me/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PreTransformProcessor.java

...e/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PostTransformProcessor.java

flink-cdc-common/src/main/java/org/apache/flink/cdc/common/utils/SchemaUtils.java

yuxiqian · 2024-07-01T03:24:50Z

Thanks for @lvyanquan's comments! I've addressed them in the latest commits.

yuxiqian · 2024-07-23T14:50:14Z

Done, rebased due to some conflicts with 26ff6d2. Will add CAST ... AS tests after #3357 is merged.

...src/main/java/org/apache/flink/cdc/runtime/operators/transform/TransformFilterProcessor.java

yuxiqian · 2024-08-01T03:59:26Z

Resolved conflicts with #3357 and added CAST ... AS tests in UT / E2e cases. The CI failure seems to be a known issue and should be fixed by #3449.

yuxiqian · 2024-08-08T06:55:51Z

Rebased to latest master.

leonardBang

Thanks @yuxiqian for the update, LGTM now

…rhaul # Conflicts: # flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PreTransformOperator.java # Conflicts: # flink-cdc-e2e-tests/flink-cdc-pipeline-e2e-tests/src/test/java/org/apache/flink/cdc/pipeline/tests/TransformE2eITCase.java # flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/operators/transform/PostTransformOperator.java

Somehow this has been fixed in FLINK-35272. Just added an E2e case to verify if it works as expected.

yuxiqian · 2024-08-08T11:58:01Z

Seems CI is failing due to an expired link in Doris docs. Pushed another commit to fix this.

…omputed column This closes apache#3285.

github-actions bot added composer runtime e2e-tests labels Apr 30, 2024

yuxiqian marked this pull request as draft April 30, 2024 10:55

yuxiqian force-pushed the FLINK-35272 branch 2 times, most recently from c448e89 to 31a7c1d Compare May 6, 2024 06:06

yuxiqian marked this pull request as ready for review May 6, 2024 07:19

aiwenmo reviewed May 7, 2024

flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/parser/TransformParser.java (outdated)

...src/test/java/org/apache/flink/cdc/runtime/operators/transform/PreTransformOperatorTest.java (outdated)

yuxiqian force-pushed the FLINK-35272 branch from 31a7c1d to 615547f Compare May 7, 2024 08:37

yuxiqian force-pushed the FLINK-35272 branch from 615547f to 35ab5f8 Compare May 7, 2024 08:41

github-actions bot added the common label May 10, 2024

yuxiqian requested a review from aiwenmo May 10, 2024 07:44

aiwenmo reviewed May 11, 2024

yuxiqian force-pushed the FLINK-35272 branch 2 times, most recently from ec700df to eb1dac3 Compare May 11, 2024 04:32

aiwenmo approved these changes May 12, 2024

yuxiqian changed the title ~~[FLINK-35272][cdc][runtime] Transform projection & filter feature overhaul~~ [FLINK-35272][cdc][runtime] Pipeline Transform job supports omitting / renaming calculation column May 24, 2024

yuxiqian force-pushed the FLINK-35272 branch from 9b7c313 to 423f452 Compare June 12, 2024 03:27

yuxiqian force-pushed the FLINK-35272 branch from d01d9a2 to 3630d61 Compare June 20, 2024 02:00

lvyanquan reviewed Jun 28, 2024

yuxiqian force-pushed the FLINK-35272 branch 2 times, most recently from 9274868 to cb119cd Compare July 1, 2024 02:42

yuxiqian force-pushed the FLINK-35272 branch from baca7be to b2db57d Compare July 1, 2024 08:56

yuxiqian requested a review from lvyanquan July 1, 2024 08:58

github-actions bot added the reviewed label Jul 23, 2024

yuxiqian force-pushed the FLINK-35272 branch from bd2df46 to 8f45808 Compare July 23, 2024 14:48

yuxiqian force-pushed the FLINK-35272 branch from 8f45808 to e1bcab8 Compare July 23, 2024 14:55

yuxiqian commented Jul 23, 2024

...src/main/java/org/apache/flink/cdc/runtime/operators/transform/TransformFilterProcessor.java (outdated)

yuxiqian force-pushed the FLINK-35272 branch 2 times, most recently from abb397f to 86515a1 Compare August 1, 2024 02:07

yuxiqian requested a review from leonardBang August 1, 2024 03:59

yuxiqian force-pushed the FLINK-35272 branch 4 times, most recently from 2859296 to d221efa Compare August 8, 2024 06:53

leonardBang approved these changes Aug 8, 2024

github-actions bot added the approved label Aug 8, 2024

aiwenmo mentioned this pull request Aug 8, 2024

[FLINK-35774][cdc-runtime] Fix the cache of transform is not updated after process schema change event #3455

Closed

yuxiqian added 7 commits August 8, 2024 17:48

[FLINK-35852] Fix decimal precision mismatch after transformation

c59db41

Somehow this has been fixed in FLINK-35272. Just added an E2e case to verify if it works as expected.

Address comments

c44473c

add cast tests, resolve conflicts

1ebe0c6

Resolve conflicts

e1074a9

refactor: merge MySqlToDorisE2e & ComplexDataTypesE2e

bcd98c2

minor: fix transform temporal function failure when millisecond is .000

5fa85fc

yuxiqian force-pushed the FLINK-35272 branch from d221efa to 5fa85fc Compare August 8, 2024 09:49

docs: fix dead links

2b719ea

leonardBang merged commit 81d916f into apache:master Aug 8, 2024
18 of 22 checks passed

qiaozongmi pushed a commit to qiaozongmi/flink-cdc that referenced this pull request Sep 23, 2024

[FLINK-35272][cdc-runtime] Transform supports omitting and renaming c…

9705c5d

…omputed column This closes apache#3285.

ChaomingZhangCN pushed a commit to ChaomingZhangCN/flink-cdc that referenced this pull request Jan 13, 2025

[FLINK-35272][cdc-runtime] Transform supports omitting and renaming c…

b51a3a9

…omputed column This closes apache#3285.
