
[Fix][Connector-V2] Fix possible data loss in scenarios of request_tablet_size is less than the number of BUCKETS #8768

Merged
8 commits merged into apache:dev from fix_sr_data_loss on Feb 21, 2025

Conversation

xiaochen-zhou (Contributor)

Purpose of this pull request

When users explicitly set QUERY_TABLET_SIZE instead of relying on the default value Integer.MAX_VALUE, the returned List<QueryPartition> partitions can contain multiple partitions with the same beAddress.

List<StarRocksSourceSplit> getStarRocksSourceSplit() {
  List<StarRocksSourceSplit> sourceSplits = new ArrayList<>();
  // Each QueryPartition becomes one split; with a small request_tablet_size,
  // several partitions (and therefore several splits) can share the same beAddress.
  List<QueryPartition> partitions = starRocksQueryPlanReadClient.findPartitions();
  for (int i = 0; i < partitions.size(); i++) {
    sourceSplits.add(
      new StarRocksSourceSplit(
        partitions.get(i), String.valueOf(partitions.get(i).hashCode())));
  }
  return sourceSplits;
}

To avoid establishing a new connection to the BE for every split during the read, the StarRocksBeReadClient is cached by the beAddress of the split's partition. However, StarRocksBeReadClient#openScanner() does not reset some of the client's state, so stale values from the previous scan can be reused. For example, if eos (end of stream) was left as true by a previous partition read, reading the next partition on the same BE terminates immediately and loses data.
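As a rough sketch of the fix idea (the class and field names below are simplified stand-ins, not the actual StarRocksBeReadClient code), the cached client has to re-initialize its per-scan state every time a new scanner is opened:

// Minimal sketch, assuming a client that is cached per BE address and reused across splits.
// Only the reset idea matters here; the real connector resets its own fields in openScanner().
class CachedBeReadClientSketch {
  private boolean eos;       // end-of-stream flag left over from the previous scan
  private long readRowCount; // rows read in the current scan

  void openScanner(String partitionId) {
    // Reset per-scan state before reusing the cached client for a new partition.
    // Without this, eos == true from the previous partition makes the next partition
    // on the same BE look already exhausted, so its rows are silently skipped.
    this.eos = false;
    this.readRowCount = 0;
    // ... then build and send the open-scanner request for partitionId ...
  }

  boolean hasNext() {
    return !eos;
  }
}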

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new tests.

Check list

@xiaochen-zhou xiaochen-zhou changed the title [Feature][Connector-V2] Fix possible data loss in certain scenarios of starrocks [Fix][Connector-V2] Fix possible data loss in certain scenarios of starrocks Feb 19, 2025
@xiaochen-zhou (Contributor, Author)

Friendly ping: do you have time to take a look, @Hisoka-X 🙏?

@hailin0 (Member) commented Feb 19, 2025

Please add a test case (reproduce & verify).

@xiaochen-zhou (Contributor, Author)

Please add a test case (reproduce & verify).

Ok. I'll try to complete it today.

@xiaochen-zhou (Contributor, Author)

I added a test case, StarRocksIT#testStarRocksReadRowCount(), to verify whether the number of rows written to the sink matches the number of rows read from the source in scenarios where request_tablet_size is less than the number of BUCKETS.
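In essence the test reduces to a row-count assertion like the sketch below (illustrative only; the real logic lives in StarRocksIT#testStarRocksReadRowCount() and runs a full SeaTunnel job first):

import org.junit.jupiter.api.Assertions;

// Sketch of the check, not the actual StarRocksIT code.
class ReadRowCountCheckSketch {
  void assertNoRowsLost(long rowsWrittenToSource, long rowsReadIntoSink) {
    // Before the fix this failed (31 != 100) whenever request_tablet_size < BUCKETS;
    // after the fix every row written to the source table reaches the sink.
    Assertions.assertEquals(rowsWrittenToSource, rowsReadIntoSink);
  }
}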

When I set the table's buckets to 3:

DISTRIBUTED BY HASH(`BIGINT_COL`) BUCKETS 3

At the same time, when request_tablet_size is set to a value less than 3:

(screenshot: source options with request_tablet_size set to a value less than 3)

The StarRocksIT#testStarRocksReadRowCount() test could not pass before the fix:

(screenshot: failing test run before the fix)

In this case, the row count is 31, which is less than the expected 100.

(screenshot: row count of 31 instead of the expected 100)

After applying the fix, the StarRocksIT#testStarRocksReadRowCount() test now passes successfully:

(screenshot: passing test run after the fix)

@hailin0 @Hisoka-X

@xiaochen-zhou xiaochen-zhou changed the title [Fix][Connector-V2] Fix possible data loss in certain scenarios of starrocks [Fix][Connector-V2] Fix possible data loss in scenarios of request_tablet_size is less than the number of BUCKETS. Feb 20, 2025
@xiaochen-zhou xiaochen-zhou changed the title [Fix][Connector-V2] Fix possible data loss in scenarios of request_tablet_size is less than the number of BUCKETS. [Fix][Connector-V2] Fix possible data loss in scenarios of request_tablet_size is less than the number of BUCKETS Feb 20, 2025
@Hisoka-X (Member) left a comment


@hailin0 (Member) commented Feb 21, 2025

good pr

@hailin0 hailin0 merged commit 3c6f216 into apache:dev Feb 21, 2025
3 checks passed
@xiaochen-zhou xiaochen-zhou deleted the fix_sr_data_loss branch February 22, 2025 16:53