Duplicate data was generated during expansion. #2723

Open
VIVALXH opened this issue Jan 16, 2025 · 10 comments

Labels
bug type bug

Comments

VIVALXH commented Jan 16, 2025

Search before asking

  • I had searched in the issues and found no similar issues.

Version

2.10

Minimal reproduce step

Env: Kvrocks cluster with 4 shards.

  1. Use go-redis and open multiple goroutines to concurrently and continuously generate data (a runnable sketch follows this list):
userId := helper.GetRandomInt64InRange(1, 10000)
key := strconv.FormatInt(userId, 10)

uuidStr := uuid.New().String()
timestamp := time.Now().UnixNano()
value := fmt.Sprintf("%s-%d", uuidStr, timestamp)

_, err := c.Do(context.TODO(), 2, "LPUSH", key, value)
  2. Use the following commands to expand the Kvrocks cluster from 4 shards to 8 shards:
CLUSTERX SETNODEID $NODE_ID $NODE_ID
CLUSTERX MIGRATE $slot $dst_nodeid
CLUSTERX SETSLOT $slot NODE $node_id $new_version
  3. Check the data after the expansion.
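
For reference, a self-contained version of the writer loop (a sketch only: it assumes go-redis v9 and github.com/google/uuid, replaces the custom helper.GetRandomInt64InRange with math/rand, uses a placeholder address, and drops the 2 argument from the snippet above on the assumption that the stock redis.ClusterClient.Do signature is used, which takes the command and its arguments directly):

package main

import (
	"context"
	"fmt"
	"math/rand"
	"strconv"
	"sync"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

func main() {
	// Placeholder address: point this at one of the Kvrocks cluster nodes.
	c := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"127.0.0.1:6666"},
	})
	defer c.Close()

	var wg sync.WaitGroup
	for g := 0; g < 8; g++ { // number of concurrent writers is arbitrary
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				// Keys are random user IDs in [1, 10000]; every value embeds a UUID
				// plus a nanosecond timestamp, so any duplicate found afterwards must
				// come from the write path, not from the generator.
				key := strconv.FormatInt(rand.Int63n(10000)+1, 10)
				value := fmt.Sprintf("%s-%d", uuid.New().String(), time.Now().UnixNano())
				if err := c.Do(context.TODO(), "LPUSH", key, value).Err(); err != nil {
					fmt.Println("LPUSH error:", err)
				}
			}
		}()
	}
	wg.Wait()
}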

What did you expect to see?

No duplicate data should be generated.

What did you see instead?

[root@iZwz9dff7pa0adxvbwkbc8Z ~]# ./check_redis_list
2025/01/16 13:34:20 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
LIST 1978 has duplicate data:
duplicate data: 99b22bda-96de-43ec-98d4-d74d5a590d11-1736994529584945168
duplicate data index: [43 44]
list length: 395

LIST 3171 has duplicate data:
duplicate data: 9bb7a7cc-3004-4d5b-965d-fe171badc21a-1736994509584166314
duplicate data index: [86 87]
list length: 412

LIST 3468 has duplicate data:
duplicate data: 9c3bc47b-d92c-4d55-b588-9fbc594849fc-1736994519583905106
duplicate data index: [65 66]
list length: 406

LIST 3616 has duplicate data:
duplicate data: 41342d98-ace0-4212-b9ea-437772a17e68-1736994539585064098
duplicate data index: [32 33]
list length: 376

LIST 5519 has duplicate data:
duplicate data: c9fb599e-453e-4615-bc7c-74b1b8a8bec0-1736994529583967377
duplicate data index: [51 52]
list length: 414

LIST 6445 has duplicate data:
duplicate data: 3d1c3404-af08-4d47-a0f8-2fa0cea0885a-1736994509583942704
duplicate data index: [87 88]
list length: 377

LIST 8973 has duplicate data:
duplicate data: ab937bce-f43f-4f2c-a65a-b1c6cc5f08b9-1736994529584052504
duplicate data index: [55 56]
list length: 414

LIST 9250 has duplicate data:
duplicate data: 1550b5fc-ba5a-4d88-9b9d-daf9c52650ef-1736994509583124277
duplicate data index: [82 83]
list length: 421

LIST 9555 has duplicate data:
duplicate data: 606ab92e-4052-43fb-900d-91e45892ef2a-1736994529583492941
duplicate data index: [59 60]
list length: 395

Anything Else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
VIVALXH added the bug (type bug) label Jan 16, 2025
git-hulk (Member) commented:

@VIVALXH Great! Thanks for your detailed information, it's helpful for identifying this issue.

git-hulk (Member) commented Feb 2, 2025

Hi @VIVALXH

go-redis might retry on network errors. Would you mind disabling the retry to see if the issue still exists? https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/options.go#L75C2-L75C12
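
A minimal sketch of that setting, with a placeholder address (per the doc comment at the link above, the default is 3 retries, 0 keeps the default, and -1 disables retries):

package main

import "github.com/redis/go-redis/v9"

func newClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:      []string{"127.0.0.1:6666"}, // placeholder
		MaxRetries: -1,                         // -1 disables command retries; the default is 3
	})
}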

VIVALXH (Author) commented Feb 6, 2025

@git-hulk thanks.

No duplicate data was generated, but this resulted in the loss of a significant amount of data that should have been written.

I controlled the osscluster retry count by setting redis.UniversalOptions.MaxRedirects = -1.
(https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/osscluster.go#L45)

However, this inevitably led to Moved errors and TryAgain errors, as mentioned in PR #1240.

The Moved error occurs between the following two commands:

  • CLUSTERX MIGRATE $slot $dst_nodeid
  • CLUSTERX SETSLOT $slot NODE $node_id $new_version

Additionally, if the interval between these two commands is longer (e.g., 10 seconds), the probability of generating duplicate data increases. The duplicate data primarily appears on the new nodes after scaling out. Following the process I described, the issue can be reproduced consistently.
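
For reference, the configuration described above, as a sketch with a placeholder address. In go-redis, MaxRedirects caps how many times the cluster client follows MOVED/ASK redirects (and retries TRYAGAIN) in its process loop; -1 disables that loop while each command is still attempted once:

package main

import "github.com/redis/go-redis/v9"

func newClient() redis.UniversalClient {
	return redis.NewUniversalClient(&redis.UniversalOptions{
		Addrs:        []string{"127.0.0.1:6666"}, // placeholder
		MaxRedirects: -1,                         // don't follow MOVED/ASK or retry TRYAGAIN
	})
}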

git-hulk (Member) commented Feb 6, 2025

I controlled the osscluster retry count by setting redis.UniversalOptions.MaxRedirects = -1.

@VIVALXH You should set MaxRetries to -1 instead of MaxRedirects. Otherwise it will prevent retrying when receiving the TRYAGAIN error.

VIVALXH (Author) commented Feb 6, 2025

@git-hulk

I am using ClusterClient, and internally it uses MaxRedirects instead of MaxRetries in its process function. It worked as expected and achieved our goal of preventing duplicate data.
https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/osscluster.go#L967

This suggests that retries might have caused the data duplication, but in practice we cannot disable retries entirely, because doing so loses a significant amount of data.

git-hulk (Member) commented Feb 6, 2025

@VIVALXH The cluster client also supports MaxRetries; setting MaxRedirects=-1 means it won't retry when the TRYAGAIN error occurs.

VIVALXH (Author) commented Feb 6, 2025

Hi @VIVALXH

go-redis might retry on network errors. Would you mind disabling the retry to see if the issue still exists? https://github.com/redis/go-redis/blob/1139bc3aa9073851f67faa6d68df07a566901dd7/options.go#L75C2-L75C12

@git-hulk
Now we can conclude that retries may have caused the duplicate data, but in practice it is impossible to disable them.

VIVALXH (Author) commented Feb 7, 2025

@VIVALXH The cluster client also supports MaxRetries; setting MaxRedirects=-1 means it won't retry when the TRYAGAIN error occurs.

@git-hulk
Hello, I tried setting MaxRetries=-1 today, but duplicate data still occurred. Only when I set MaxRedirects=-1 did the duplicate data stop.

So I guess the data duplication is caused by retries due to Moved errors or TryAgain errors?

git-hulk (Member) commented Feb 7, 2025

So I guess the data duplication is caused by retries due to Moved errors or TryAgain errors?

I'm not quite sure about this. But after thinking about this issue, it's impossible to push an element into a list exactly once without a deduplication mechanism, because we always need to retry whenever there is a network issue or timeout, or we might lose data.
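
One possible direction, sketched below rather than a confirmed fix: since every value in this reproduction is unique, the push itself can be made idempotent so that retries become harmless. Assuming the Kvrocks version in use supports EVAL and LPOS, a script like the following skips a value that is already in the list. Note that LPOS is O(N), so this trades write throughput for safety:

package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Hypothetical helper, not part of go-redis or Kvrocks: push the value only if
// it is not already present, so a retried push cannot create a duplicate.
var idempotentPush = redis.NewScript(`
if redis.call('LPOS', KEYS[1], ARGV[1]) then return 0 end
redis.call('LPUSH', KEYS[1], ARGV[1])
return 1
`)

func pushOnce(ctx context.Context, c redis.UniversalClient, key, value string) error {
	return idempotentPush.Run(ctx, c, []string{key}, value).Err()
}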

VIVALXH (Author) commented Feb 7, 2025

So I guess the data duplication is caused by retries due to Moved errors or TryAgain errors?

I'm not quite sure about this. But after thinking about this issue, it's impossible to push an element into a list exactly once without a deduplication mechanism, because we always need to retry whenever there is a network issue or timeout, or we might lose data.

Yes, it is impossible to disable retries, because doing so loses a significant amount of data.

At present, duplicate data only appears during scaling. Is there any other solution that can avoid this?
