driver reports connection errors for decommissioned nodes and a delay for cassandra-stress, resulting in: "Command did not complete within 12870 seconds!" #401
Comments
Scylla version:
Kernel Version:
Issue description
Reproduced again. The cassandra-stress test tries to connect to decommissioned nodes.
This list contains just 3 nodes for illustration; more nodes were terminated before the connection attempts. Please take a look at the logs for more details.
OS / Image:
Test:
|
The issue seems to reproduce now in a more specific scenario.
Hmmm, I see an unexpected log message about restarting the decommissioned node as well (or is it a different node with the same IP?):
It could be node-7:
Packages
Scylla version:
Kernel Version:
Issue description
Describe your issue in detail and steps it took to produce it.
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Describe the frequency with how this issue can be reproduced.
Installation details
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
Are you sure it's not related to this one: scylladb/scylla-cluster-tests#8855 (comment) |
This issue is not accurately reported. It looks like the decommissioned node's IP is reused for a newly added node, so the driver did not necessarily get connectivity errors for the decommissioned node; more likely it's the new node being restarted by the nemesis.
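One way to check whether the IP was really reused (rather than the driver still chasing the decommissioned node) is to compare the host_id currently behind that address with the host_id recorded before the decommission. A minimal sketch with the Python driver against the system tables; the contact point and suspect IP below are placeholders, not values from this run:

```python
from cassandra.cluster import Cluster

# Placeholders (not from this run): any live contact point, plus the address
# that the driver kept reporting errors for after the decommission.
CONTACT_POINT = "10.0.0.4"
SUSPECT_IP = "10.0.0.7"

cluster = Cluster([CONTACT_POINT])
session = cluster.connect()

# Build an address -> host_id map from the system tables.
ip_to_host_id = {}
local = session.execute("SELECT host_id, broadcast_address FROM system.local").one()
ip_to_host_id[str(local.broadcast_address)] = local.host_id
for peer in session.execute("SELECT peer, host_id FROM system.peers"):
    ip_to_host_id[str(peer.peer)] = peer.host_id

# If SUSPECT_IP maps to a host_id different from the one the decommissioned
# node had, the address was reused by a newly added node, and the "errors for
# the decommissioned node" are really errors against that new node.
print(SUSPECT_IP, "->", ip_to_host_id.get(SUSPECT_IP, "not in current topology"))

cluster.shutdown()
```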
|
@yarongilor, could you please confirm that there are two issues:
|
As for the first one, I think it may be a false alarm, since the driver tried connecting to the IP of a new node, which happens to be identical to that of the decommissioned node. So it might be that something went wrong with the driver following the 2 rolling-restart nemeses.
|
I believe it is not a false alarm; we observe in many different runs (it might not reproduce in your particular one) that c-s tried to access nodes that were terminated 2 hours earlier. |
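For what it's worth, this kind of claim can be checked mechanically by correlating connection-error timestamps in the c-s log with the known decommission times. A rough sketch, where the log path, timestamp format and the decommission map are assumptions rather than values taken from this run:

```python
import re
from datetime import datetime, timedelta

# Assumed inputs (illustrative only): decommission time per IP, taken e.g.
# from the SCT events log, and a cassandra-stress log file to scan.
DECOMMISSIONED = {"10.0.0.7": datetime(2024, 12, 18, 3, 22, 0)}
LOG_FILE = "cassandra-stress.log"
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

with open(LOG_FILE) as log:
    for line in log:
        for ip, gone_at in DECOMMISSIONED.items():
            if ip not in line or "connect" not in line.lower():
                continue
            match = TS_RE.search(line)
            if not match:
                continue
            ts = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
            # Flag attempts made well after the node left the cluster.
            if ts - gone_at > timedelta(hours=1):
                print(f"{ip} still contacted {ts - gone_at} after decommission: {line.strip()}")
```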
Packages
Scylla version:
6.3.0~dev-20241217.01cdba9a9894
with build-id f5cdbc08a2634f6f378e901fbb10a27fc164783e
Kernel Version:
6.8.0-1018-azure
Issue description
Describe your issue in detail and steps it took to produce it.
Ran a 3-hour longevity test on Azure.
The node
longevity-10gb-3h-master-db-node-413a3a9b-eastus-2
was decommissioned at 2:
About 2 hours later, 2 nemeses that issue a rolling restart of all cluster nodes were executed:
Then c-s got errors for node-2 (which was decommissioned ~2 hours earlier) like:
Multiple connection errors (which might be expected) are reported for some other nodes as well during these rolling restarts:
Other unclear connection issues are found for node-10, which was decommissioned at 03:22:
node-10 removal:
connection errors:
node-10 private address is also reported long afterwards:
cassandra-stress eventually failed the test with the error:
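For context, the final failure is a harness-side timeout around the stress command rather than a cassandra-stress error as such. A simplified sketch of that pattern, assuming a plain subprocess wrapper; the command, contact points and timeout below are placeholders, not the actual SCT implementation:

```python
import subprocess

# Placeholder cassandra-stress invocation; contact points and parameters are
# illustrative only.
CS_CMD = [
    "cassandra-stress", "write", "duration=180m",
    "-node", "10.0.0.5,10.0.0.6",
    "-rate", "threads=100",
]

try:
    # If the driver keeps retrying an address that now belongs to a different
    # (or restarting) node, the stress process can stall long enough to blow
    # through the harness timeout and fail the whole test.
    subprocess.run(CS_CMD, check=True, timeout=12870)
except subprocess.TimeoutExpired:
    print("Command did not complete within 12870 seconds!")
```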
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Describe the frequency with how this issue can be reproduced.
Installation details
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image:
/subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.3.0-dev-x86_64-2024-12-18T02-02-40
(azure: undefined_region)
Test:
longevity-10gb-3h-azure-test
Test id:
413a3a9b-fe7b-4e5e-b864-6f1f26628226
Test name:
scylla-master/longevity/longevity-10gb-3h-azure-test
Test method:
longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 413a3a9b-fe7b-4e5e-b864-6f1f26628226
$ hydra investigate show-logs 413a3a9b-fe7b-4e5e-b864-6f1f26628226
Logs:
Jenkins job URL
Argus