Duplicate data was generated during expansion. #2723

Open
VIVALXH opened this issue Jan 16, 2025 · 10 comments

Labels
bug type bug

Comments

VIVALXH commented Jan 16, 2025

Search before asking

  • I had searched in the issues and found no similar issues.

Version

2.10

Minimal reproduce step

Env: Kvrocks cluster with 4 shards.

  1. Use go-redis and open multiple goroutines to concurrently and continuously generate data (a runnable sketch follows this list):
userId := helper.GetRandomInt64InRange(1, 10000)
key := strconv.FormatInt(userId, 10)

uuidStr := uuid.New().String()
timestamp := time.Now().UnixNano()
value := fmt.Sprintf("%s-%d", uuidStr, timestamp)

_, err := c.Do(context.TODO(), 2, "LPUSH", key, value)
  2. Use the following commands to expand the Kvrocks cluster from 4 shards to 8 shards:
CLUSTERX SETNODEID $NODE_ID $NODE_ID
CLUSTERX MIGRATE $slot $dst_nodeid
CLUSTERX SETSLOT $slot NODE $node_id $new_version
  3. Check the data after the expansion.
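
For reference, a self-contained version of the writer loop (a sketch only: it assumes go-redis v9 and github.com/google/uuid, replaces the custom helper.GetRandomInt64InRange with math/rand, uses a placeholder address, and drops the 2 argument from the snippet above on the assumption that the stock redis.ClusterClient.Do signature is used, which takes the command and its arguments directly):

package main

import (
	"context"
	"fmt"
	"math/rand"
	"strconv"
	"sync"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

func main() {
	// Placeholder address: point this at one of the Kvrocks cluster nodes.
	c := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"127.0.0.1:6666"},
	})
	defer c.Close()

	var wg sync.WaitGroup
	for g := 0; g < 8; g++ { // number of concurrent writers is arbitrary
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				// Keys are random user IDs in [1, 10000]; every value embeds a UUID
				// plus a nanosecond timestamp, so any duplicate found afterwards must
				// come from the write path, not from the generator.
				key := strconv.FormatInt(rand.Int63n(10000)+1, 10)
				value := fmt.Sprintf("%s-%d", uuid.New().String(), time.Now().UnixNano())
				if err := c.Do(context.TODO(), "LPUSH", key, value).Err(); err != nil {
					fmt.Println("LPUSH error:", err)
				}
			}
		}()
	}
	wg.Wait()
}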

What did you expect to see?

No duplicate data should be generated.

What did you see instead?

[root@iZwz9dff7pa0adxvbwkbc8Z ~]# ./check_redis_list
2025/01/16 13:34:20 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
LIST 1978 has duplicate data:
duplicate data: 99b22bda-96de-43ec-98d4-d74d5a590d11-1736994529584945168
duplicate data index: [43 44]
list length: 395

LIST 3171 has duplicate data:
duplicate data: 9bb7a7cc-3004-4d5b-965d-fe171badc21a-1736994509584166314
duplicate data index: [86 87]
list length: 412

LIST 3468 has duplicate data:
duplicate data: 9c3bc47b-d92c-4d55-b588-9fbc594849fc-1736994519583905106
duplicate data index: [65 66]
list length: 406

LIST 3616 has duplicate data:
duplicate data: 41342d98-ace0-4212-b9ea-437772a17e68-1736994539585064098
duplicate data index: [32 33]
list length: 376

LIST 5519 has duplicate data:
duplicate data: c9fb599e-453e-4615-bc7c-74b1b8a8bec0-1736994529583967377
duplicate data index: [51 52]
list length: 414

LIST 6445 has duplicate data:
duplicate data: 3d1c3404-af08-4d47-a0f8-2fa0cea0885a-1736994509583942704
duplicate data index: [87 88]
list length: 377

LIST 8973 has duplicate data:
duplicate data: ab937bce-f43f-4f2c-a65a-b1c6cc5f08b9-1736994529584052504
duplicate data index: [55 56]
list length: 414

LIST 9250 has duplicate data:
duplicate data: 1550b5fc-ba5a-4d88-9b9d-daf9c52650ef-1736994509583124277
duplicate data index: [82 83]
list length: 421

LIST 9555 has duplicate data:
duplicate data: 606ab92e-4052-43fb-900d-91e45892ef2a-1736994529583492941
duplicate data index: [59 60]
list length: 395

Anything Else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
VIVALXH added the bug (type bug) label Jan 16, 2025
git-hulk (Member) commented:

@VIVALXH Great! Thanks for your detailed information, it's helpful for identifying this issue.

git-hulk (Member) commented Feb 2, 2025

Hi @VIVALXH

go-redis might retry on network errors. Would you mind disabling the retry to see if the issue still exists? https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/options.go#L75C2-L75C12
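
A minimal sketch of that setting, with a placeholder address (per the doc comment at the link above, the default is 3 retries, 0 keeps the default, and -1 disables retries):

package main

import "github.com/redis/go-redis/v9"

func newClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:      []string{"127.0.0.1:6666"}, // placeholder
		MaxRetries: -1,                         // -1 disables command retries; the default is 3
	})
}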

VIVALXH (Author) commented Feb 6, 2025

@git-hulk thanks.

No duplicate data was generated, but this resulted in the loss of a significant amount of data that should have been written.

I controlled the osscluster retry count by setting redis.UniversalOptions.MaxRedirects = -1.
(https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/osscluster.go#L45)

However, this inevitably led to Moved errors and TryAgain errors, as mentioned in PR #1240.

The Moved error occurs between the following two commands:

  • CLUSTERX MIGRATE $slot $dst_nodeid
  • CLUSTERX SETSLOT $slot NODE $node_id $new_version

Additionally, if the interval between these two commands is longer (e.g., 10 seconds), the probability of generating duplicate data increases. The duplicate data primarily appears on the new nodes after scaling out. Following the process I described, the issue can be reproduced consistently.
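
For reference, the configuration described above, as a sketch with a placeholder address. In go-redis, MaxRedirects caps how many times the cluster client follows MOVED/ASK redirects (and retries TRYAGAIN) in its process loop; -1 disables that loop while each command is still attempted once:

package main

import "github.com/redis/go-redis/v9"

func newClient() redis.UniversalClient {
	return redis.NewUniversalClient(&redis.UniversalOptions{
		Addrs:        []string{"127.0.0.1:6666"}, // placeholder
		MaxRedirects: -1,                         // don't follow MOVED/ASK or retry TRYAGAIN
	})
}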

git-hulk (Member) commented Feb 6, 2025

I controlled the osscluster retry count by setting redis.UniversalOptions.MaxRedirects = -1.

@VIVALXH You should set MaxRetries to -1 instead of MaxRedirects. Otherwise it will prevent retrying when receiving the TRYAGAIN error.

VIVALXH (Author) commented Feb 6, 2025

@git-hulk

I am using ClusterClient, and internally it uses MaxRedirects instead of MaxRetries in its process function. It worked as expected and achieved our goal of preventing duplicate data.
https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/osscluster.go#L967

This suggests that retries might have caused the data duplication, but in practice we cannot disable retries entirely, because doing so loses a significant amount of data.

git-hulk (Member) commented Feb 6, 2025

@VIVALXH The cluster client also supports MaxRetries; setting MaxRedirects=-1 means it won't retry when the TRYAGAIN error occurs.

VIVALXH (Author) commented Feb 6, 2025

Hi @VIVALXH

go-redis might retry on network errors. Would you mind disabling the retry to see if the issue still exists? https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/options.go#L75C2-L75C12

@git-hulk
Now we can conclude that retries may have caused the duplicate data, but in practice it is impossible to disable them.

VIVALXH (Author) commented Feb 7, 2025

@VIVALXH The cluster client also supports MaxRetries; setting MaxRedirects=-1 means it won't retry when the TRYAGAIN error occurs.

@git-hulk
Hello, I tried setting MaxRetries=-1 today, but duplicate data still occurred. Only when I set MaxRedirects=-1 did the duplicate data stop.

So I guess the data duplication is caused by retries due to Moved errors or TryAgain errors?

git-hulk (Member) commented Feb 7, 2025

So I guess the data duplication is caused by retries due to Moved errors or TryAgain errors?

I'm not quite sure about this. But after thinking about this issue, it's impossible to push an element into a list exactly once without a deduplication mechanism, because we always need to retry whenever there is a network issue or timeout, or we might lose data.
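
One possible direction, sketched below rather than a confirmed fix: since every value in this reproduction is unique, the push itself can be made idempotent so that retries become harmless. Assuming the Kvrocks version in use supports EVAL and LPOS, a script like the following skips a value that is already in the list. Note that LPOS is O(N), so this trades write throughput for safety:

package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Hypothetical helper, not part of go-redis or Kvrocks: push the value only if
// it is not already present, so a retried push cannot create a duplicate.
var idempotentPush = redis.NewScript(`
if redis.call('LPOS', KEYS[1], ARGV[1]) then return 0 end
redis.call('LPUSH', KEYS[1], ARGV[1])
return 1
`)

func pushOnce(ctx context.Context, c redis.UniversalClient, key, value string) error {
	return idempotentPush.Run(ctx, c, []string{key}, value).Err()
}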

VIVALXH (Author) commented Feb 7, 2025

So I guess the data duplication is caused by retries due to Moved errors or TryAgain errors?

I'm not quite sure about this. But after thinking about this issue, it's impossible to push an element into a list exactly once without a deduplication mechanism, because we always need to retry whenever there is a network issue or timeout, or we might lose data.

Yes, it is impossible to disable retries, because doing so loses a significant amount of data.

At present, duplicate data only appears during scaling. Is there any other solution that can avoid this?
