[Fix][Connector-V2] Fix possible data loss in scenarios of request_tablet_size is less than the number of BUCKETS #8768
Friendly ping, do you have time to take a look @Hisoka-X 🙏?
Please add a test case (reproduce & verify).
Ok. I'll try to complete it today.
I added a test case. When the table is created with 3 buckets (DISTRIBUTED BY HASH(`BIGINT_COL`) BUCKETS 3) and request_tablet_size is set to a value less than 3, the row count read is 31, which is less than the expected 100. After applying the fix, the StarRocksIT#testStarRocksReadRowCount() test passes.
Thanks @xiaochen-zhou
Good PR.
Purpose of this pull request
When users explicitly set QUERY_TABLET_SIZE (request_tablet_size) instead of using the default value Integer.MAX_VALUE, the returned list of partitions can contain several partitions with the same beAddress.
To avoid establishing a connection to the BE for each split during the read process, the StarRocksBeReadClient is cached based on the beAddress from the split's partition. However, in StarRocksBeReadClient#openScanner(), some variables are not being reset, leading to the potential use of stale values from the cache. For example, if eos (end of stream) is true from a previous partition read, it can cause data loss for new partitions on the same BE.
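The stale-state problem can be illustrated with a minimal, self-contained sketch. It is not the connector's actual code: StaleCacheDemo, CachedReadClient, Split, countRows and sampleSplits are hypothetical names made up for this illustration, and partitions are modeled as plain in-memory lists. Only the idea is taken from the description above: one client is cached per beAddress, and the per-scan state (eos, read offset) has to be cleared in openScanner().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleCacheDemo {

    /** Hypothetical split: a BE address plus the rows of one partition. */
    static class Split {
        final String beAddress;
        final List<Integer> rows;

        Split(String beAddress, List<Integer> rows) {
            this.beAddress = beAddress;
            this.rows = rows;
        }
    }

    /** Hypothetical stand-in for a cached per-BE read client. */
    static class CachedReadClient {
        private final boolean resetOnOpen; // true models the fixed behaviour
        private List<Integer> rows = new ArrayList<>();
        private int offset = 0;
        private boolean eos = false;

        CachedReadClient(boolean resetOnOpen) {
            this.resetOnOpen = resetOnOpen;
        }

        void openScanner(List<Integer> partitionRows) {
            this.rows = partitionRows;
            if (resetOnOpen) {
                // Clear state left over from the previous partition on this BE.
                this.offset = 0;
                this.eos = false;
            }
        }

        boolean hasNext() {
            if (!eos && offset >= rows.size()) {
                eos = true;
            }
            return !eos;
        }

        int next() {
            return rows.get(offset++);
        }
    }

    /** Reads all splits, reusing one client per beAddress, and returns the row count. */
    static int countRows(List<Split> splits, boolean resetOnOpen) {
        Map<String, CachedReadClient> cache = new HashMap<>();
        int count = 0;
        for (Split split : splits) {
            CachedReadClient client =
                    cache.computeIfAbsent(split.beAddress, k -> new CachedReadClient(resetOnOpen));
            client.openScanner(split.rows);
            while (client.hasNext()) {
                client.next();
                count++;
            }
        }
        return count;
    }

    /** Two partitions on the same BE and one on another BE, 6 rows in total. */
    static List<Split> sampleSplits() {
        return Arrays.asList(
                new Split("be1:9060", Arrays.asList(1, 2, 3)),
                new Split("be1:9060", Arrays.asList(4, 5)),
                new Split("be2:9060", Arrays.asList(6)));
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + countRows(sampleSplits(), false)); // 4
        System.out.println("with reset: " + countRows(sampleSplits(), true));     // 6
    }
}
```

In this sketch the second partition on be1:9060 is silently skipped when the state is not reset, because eos is still true from the first partition. That is the same kind of loss as the 31-of-100 row count reported in the test case above.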
Does this PR introduce any user-facing change?
no
How was this patch tested?
add new tests