
[FLINK-36964] Fix potential exception when SchemaChange in parallel w… #3818

Merged (4 commits, Jan 13, 2025)

Conversation

lvyanquan (Contributor) commented Dec 26, 2024

Fix a potential exception when SchemaChange runs in parallel with the Paimon Sink.

This closes FLINK-36964 and FLINK-35888.

yuxiqian (Contributor) commented Dec 27, 2024

Hi @lvyanquan, I have some concerns about how BucketAssignOperator works with schema evolution:

  1. FlushEvent now works as a "pipeline drain indicator": after all data sink writers acknowledge it, there shouldn't be any unhandled / uncommitted events flowing through the pipeline, so SchemaRegistry can evolve the downstream DB safely.

[Diagram: normal FlushEvent semantics]

However, the bucket assigning strategy might break that assumption: data sink writers might receive data change events carrying a stale schema, even after the external schema evolution process has finished.

[Diagram: questionable FlushEvent semantics]
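The "drain" protocol in point 1 can be sketched as follows. This is a hypothetical, self-contained simulation under my own naming (SinkWriter, tryEvolveSchema are illustrative, not actual Flink CDC classes): the registry may only evolve the downstream schema once every writer has flushed its buffer and acknowledged, so no event encoded with the old schema remains in flight.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class FlushDrainDemo {
    /** A sink writer that buffers events and flushes them on FlushEvent. */
    static class SinkWriter {
        final ArrayDeque<String> buffer = new ArrayDeque<>();
        final List<String> committed = new ArrayList<>();

        void write(String event) { buffer.add(event); }

        /** Flush everything buffered, then ack; afterwards this writer
         *  holds no events encoded with the pre-evolution schema. */
        boolean flushAndAck() {
            while (!buffer.isEmpty()) committed.add(buffer.poll());
            return true;
        }
    }

    /** The registry evolves the schema only after every writer has acked. */
    static boolean tryEvolveSchema(List<SinkWriter> writers) {
        for (SinkWriter w : writers) {
            if (!w.flushAndAck()) return false;
        }
        return true; // safe: no unhandled events remain in the pipeline
    }

    public static void main(String[] args) {
        List<SinkWriter> writers = List.of(new SinkWriter(), new SinkWriter());
        writers.get(0).write("row-1");
        writers.get(1).write("row-2");
        boolean evolved = tryEvolveSchema(writers);
        // All buffers are drained before evolution is allowed to proceed.
        System.out.println(evolved && writers.stream().allMatch(w -> w.buffer.isEmpty()));
        // prints: true
    }
}
```

The bucket-assignment concern above is exactly that a re-partitioning step after the flush can reintroduce stale-schema events that this protocol assumes are gone.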

  2. Potential schema operator hanging risk with a "distributed" topology.

[Diagram: hanging risks]

Basically the same as described in #3680. In short, if a broadcast / custom partitioning topology is applied, blocking one upstream partition implicitly blocks all downstream partitions from handling events.
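The hanging risk described above can be illustrated with a minimal sketch (hypothetical names; bounded queues stand in for Flink's network buffers): once one downstream partition's buffer fills, say because it is blocked waiting on schema evolution, the broadcasting upstream can emit nothing further, so even an idle partition with free space stops receiving events.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

public class BroadcastBlockDemo {
    /** Broadcast one event to every partition; returns false as soon as
     *  any partition's buffer is full, i.e. the upstream is now stuck. */
    static boolean broadcast(List<ArrayBlockingQueue<String>> partitions, String event) {
        for (ArrayBlockingQueue<String> p : partitions) {
            if (!p.offer(event)) return false; // backpressure blocks the upstream
        }
        return true;
    }

    public static void main(String[] args) {
        // Partition 0 is "blocked on schema evolution": tiny buffer, no consumer.
        ArrayBlockingQueue<String> blockedPartition = new ArrayBlockingQueue<>(1);
        ArrayBlockingQueue<String> idlePartition = new ArrayBlockingQueue<>(16);
        List<ArrayBlockingQueue<String>> partitions = List.of(blockedPartition, idlePartition);

        broadcast(partitions, "e1");                    // fills partition 0's buffer
        boolean delivered = broadcast(partitions, "e2"); // upstream blocks here

        // The idle partition never sees "e2", despite having plenty of space.
        System.out.println(delivered + " " + idlePartition.size());
        // prints: false 1
    }
}
```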

  3. Why do we need another bucket assigner operator?

AFAIK all data change events have already been hashed and distributed in PartitionOperator. Since adding a BucketAssignOperator does not change the parallelism, is there any reason we can't compute the correct bucket partition ID in advance, instead of creating another partitioning topology?
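The in-advance bucket computation suggested above could look roughly like the sketch below. This is illustrative only, not the actual PartitionOperator API: the idea is to derive the Paimon fixed-mode bucket (bucket = |hash(key)| % numBuckets) inside the existing hash-partitioning step, so all records of one bucket are routed to one subtask without a separate BucketAssignOperator stage.

```java
public class BucketAwarePartitioner {
    /** Fixed-bucket assignment, as in Paimon's fixed-bucket mode:
     *  bucket = |hash(key) % numBuckets|. */
    static int bucketFor(String key, int numBuckets) {
        return Math.abs(key.hashCode() % numBuckets);
    }

    /** Route by bucket rather than by raw key hash, so a single subtask
     *  owns every record of a given bucket and no re-partitioning is needed. */
    static int subtaskFor(String key, int numBuckets, int parallelism) {
        return bucketFor(key, numBuckets) % parallelism;
    }

    public static void main(String[] args) {
        // Records sharing a key land in the same bucket and the same subtask.
        System.out.println(subtaskFor("pk-42", 8, 4) == subtaskFor("pk-42", 8, 4));
        // prints: true
    }
}
```

The design point is that bucket assignment is a pure function of the key and the (fixed) bucket count, so it can be folded into the partitioner instead of requiring an extra shuffle.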

yuxiqian (Contributor) commented Dec 27, 2024

A little off topic, but we're really lacking E2e tests against real-life data sinks with complicated writing topologies. #3491 (with much higher testing pressure) might be necessary to expose more issues like this in the future.

@lvyanquan lvyanquan marked this pull request as ready for review January 6, 2025 15:34
lvyanquan (Contributor, Author)

Hi @yuxiqian, could you help review this?

yuxiqian (Contributor) left a review comment:


Thanks for @lvyanquan's great work, just left some comments.

yuxiqian (Contributor) left a review comment:


It would be nice to have corresponding E2e cases verifying the changes in this PR. Could @leonardBang @ruanhang1993 please trigger the CI workflow?

lvyanquan (Contributor, Author)

The last CI run passed; rebased onto master to fix the conflict.

leonardBang (Contributor)

@lvyanquan Would you like to rebase onto the latest master branch, since conflicts have appeared?

lvyanquan (Contributor, Author)

Rebased onto master.

leonardBang (Contributor) left a review comment:


Thanks @lvyanquan for the contribution and @yuxiqian for the review, LGTM

@leonardBang leonardBang merged commit 75b8a0c into apache:master Jan 13, 2025
27 checks passed