
driver reports connection errors for decommissioned nodes and a delay for cassandra-stress, resulted with: "Command did not complete within 12870 seconds!" #401

Open
yarongilor opened this issue Dec 19, 2024 · 7 comments
Labels: bug (Something isn't working), CI Stability

Comments

@yarongilor

Packages

Scylla version: 6.3.0~dev-20241217.01cdba9a9894 with build-id f5cdbc08a2634f6f378e901fbb10a27fc164783e

Kernel Version: 6.8.0-1018-azure

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Run a 3-hour longevity test on Azure.
The node longevity-10gb-3h-master-db-node-413a3a9b-eastus-2 was decommissioned between 01:54 and 02:07:

disrupt_nodetool_seed_decommission	longevity-10gb-3h-master-db-node-413a3a9b-eastus-2	Succeeded	2024-12-18 01:54:18	2024-12-18 02:07:15
Nemesis Information
Class: Sisyphus
Name: disrupt_nodetool_seed_decommission
Status: Succeeded
Target Information
Name: longevity-10gb-3h-master-db-node-413a3a9b-eastus-2
Public IP: null
Private IP: 10.0.0.6
State: terminated
Shards: 7

About two hours later, two nemeses that each issue a rolling restart of all cluster nodes are executed:

disrupt_rolling_restart_cluster	longevity-10gb-3h-master-db-node-413a3a9b-eastus-11	Succeeded	2024-12-18 03:55:11	2024-12-18 03:57:50
disrupt_rolling_config_change_internode_compression	longevity-10gb-3h-master-db-node-413a3a9b-eastus-4	Succeeded	2024-12-18 04:00:20	2024-12-18 04:08:15

Then cassandra-stress got errors for node-2 (which was decommissioned ~2 hours earlier), such as:

WARN  [cluster1-nio-worker-2] 2024-12-18 03:56:11,975 DefaultPromise.java:593 - An exception was thrown by com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete()
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
java.lang.AssertionError: null
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionFailure(ConvictionPolicy.java:101)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection.defunct(Connection.java:812)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1667)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1657)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
	at com.datastax.shaded.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed

Multiple connection errors (which might be expected) are also reported for some other nodes during these rolling restarts:

WARN  [cluster1-nio-worker-5] 2024-12-18 03:55:13,594 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
WARN  [cluster1-nio-worker-6] 2024-12-18 03:55:15,606 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
total,     769417033,   78364,   78364,   78364,    12.8,     7.8,    40.2,    62.0,    90.7,   148.6, 9965.0,  0.00209,      0,      0,       0,       0,       0,       0
WARN  [cluster1-nio-worker-0] 2024-12-18 03:55:19,611 Connection.java:284 - Error creating netty channel to /10.0.0.8:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.8:9042
...
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.9:9042] Connection has been closed
...
WARN  [cluster1-nio-worker-0] 2024-12-18 03:55:30,963 Connection.java:284 - Error creating netty channel to /10.0.0.9:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.0.9:9042
Caused by: java.net.ConnectException: Connection refused

Additional unexplained connection issues are found for node-10, which was decommissioned around 03:20:

longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14]
live time:
	2024-12-18 03:12:00	2024-12-18 03:20:08

node-10 removal:

< t:2024-12-18 03:19:34,951 f:cluster.py      l:1169 c:sdcm.cluster_azure   p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14]: Set termination_event
< t:2024-12-18 03:20:08,988 f:cluster.py      l:1231 c:sdcm.cluster         p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 [None | 10.0.0.14] destroyed

connection errors:

WARN  [cluster1-nio-worker-2] 2024-12-18 04:16:37,254 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
...

WARN  [cluster1-nio-worker-3] 2024-12-18 04:26:40,326 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
...

WARN  [cluster1-nio-worker-5] 2024-12-18 04:36:43,398 Connection.java:284 - Error creating netty channel to /10.0.0.14:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.0.0.14:9042
Caused by: java.net.NoRouteToHostException: No route to host
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
        at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
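For what it's worth, the three warnings above recur at a near-fixed interval, which looks like a scheduled driver reconnection attempt to the dead host rather than application traffic. A quick sanity check (an illustrative throwaway snippet, not part of the test framework; timestamps copied from the WARN lines above):

```python
from datetime import datetime

# WARN timestamps for reconnection attempts to /10.0.0.14, copied from the log above
attempts = ["2024-12-18 04:16:37", "2024-12-18 04:26:40", "2024-12-18 04:36:43"]
times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in attempts]

# Interval between consecutive attempts, in seconds
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)  # [603.0, 603.0] — both gaps are ~10 minutes
```

The fixed ~603 s cadence suggests the driver still holds the decommissioned node in its metadata and keeps retrying it on a schedule.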

node-10's private address is also reported long afterwards:

< t:2024-12-18 03:55:54,995 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-18T03:55:54.979+00:00 longevity-10gb-3h-master-db-node-413a3a9b-eastus-7     !INFO | scylla[5195]:  [shard 0:main] messaging_service - Stopping client for address: 10.0.0.14:0
< t:2024-12-18 03:55:55,016 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2024-12-18T03:55:54.980+00:00 longevity-10gb-3h-master-db-node-413a3a9b-eastus-7     !INFO | scylla[5195]:  [shard 0:main] messaging_service - Stopping client for address: 10.0.0.14:0 - Done

cassandra-stress eventually failed the test with the error:

2024-12-18 04:43:29.357: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=36f22416-2cde-4c0c-bccd-2deeac0e027f duration=3h34m30s: node=Node longevity-10gb-3h-master-loader-node-413a3a9b-eastus-1 [None | 10.0.0.11]
stress_cmd=cassandra-stress write cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5
errors:
Stress command execution failed with: Command did not complete within 12870 seconds!
Command: 'sudo  docker exec 6b4a56b5947dea0840cfcb21e1e974a93072cc15cb75c103041fb8075e6f6db5 /bin/sh -c \'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=HITDHJBU91WD2JHM5LG4; cassandra-stress write no-warmup cl=QUORUM duration=180m -schema keyspace=keyspace1 \'"\'"\'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)\'"\'"\' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5 -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.9,10.0.0.10 -errors skip-unsupported-columns\''
Stdout:
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
INFO  [cluster1-reconnection-1] 2024-12-18 04:37:08,585 HostConnectionPool.java:200 - Using advanced port-based shard awareness with /10.0.0.8:9042
Stderr:
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
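When triaging this kind of failure it helps to tally which hosts the driver is still retrying. An illustrative helper (hypothetical, not part of SCT) that counts "Connection has been closed" TransportException lines per address in captured stderr:

```python
import re
from collections import Counter

def count_transport_errors(log_text: str) -> Counter:
    """Tally 'Connection has been closed' TransportExceptions per host address."""
    pattern = re.compile(
        r"TransportException: \[/(?P<addr>[\d.]+):\d+\] Connection has been closed"
    )
    return Counter(m.group("addr") for m in pattern.finditer(log_text))

# Sample lines in the same shape as the stderr above
stderr = """\
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.8:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.6:9042] Connection has been closed
"""
counts = count_transport_errors(stderr)  # 10.0.0.8 appears twice, 10.0.0.6 once
```

Running this over the full loader stderr would show whether the errors cluster on decommissioned addresses (10.0.0.6, 10.0.0.14) or on live nodes being restarted.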

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-9 (null | 10.0.0.5) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-8 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-7 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-6 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-5 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-4 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-3 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-2 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-11 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-10 (null | 10.0.0.14) (shards: -1)
  • longevity-10gb-3h-master-db-node-413a3a9b-eastus-1 (null | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.3.0-dev-x86_64-2024-12-18T02-02-40 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test
Test id: 413a3a9b-fe7b-4e5e-b864-6f1f26628226
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 413a3a9b-fe7b-4e5e-b864-6f1f26628226
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 413a3a9b-fe7b-4e5e-b864-6f1f26628226

Logs:

Jenkins job URL
Argus

@pehala pehala added bug Something isn't working CI Stability labels Jan 16, 2025

jsmolar commented Jan 16, 2025

Scylla version: 6.3.0~dev-20250115.72f28ce81e17 with build-id d583b7e768996945fc24bf7f697f5ec41306253a

Kernel Version: 6.8.0-1020-gcp

Issue description

Reproduced again. Cassandra stress test tries to connect to decommissioned nodes.

  • node 10.142.0.40 - terminated at 02:24:06 | connection attempt at 03:40:14
  • node 10.142.0.53 - terminated at 01:55:40 | connection attempt at 03:40:23
  • node 10.142.0.59 - terminated at 01:15:19 | connection attempt at 03:40:28

This list shows only 3 nodes for illustration; more nodes were terminated before the connection attempts. Please take a look at the logs for more details.
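The gaps above can be quantified directly. A throwaway check (times taken from the list above; same-day assumption, names are illustrative only):

```python
from datetime import datetime

def minutes_between(terminated: str, attempted: str) -> float:
    """Minutes from node termination to the later driver connection attempt."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(attempted, fmt) - datetime.strptime(terminated, fmt)
    return delta.total_seconds() / 60

# (address, terminated at, connection attempt at) — from the list above
nodes = [
    ("10.142.0.40", "02:24:06", "03:40:14"),
    ("10.142.0.53", "01:55:40", "03:40:23"),
    ("10.142.0.59", "01:15:19", "03:40:28"),
]
for addr, term, attempt in nodes:
    print(addr, round(minutes_between(term, attempt)))  # 76, 105, 145 minutes
```

So the driver was still attempting connections 1.25 to almost 2.5 hours after the nodes were terminated.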

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/609539021565457724 (NO RUNNER: NO RUNNER)

Test: longevity-10gb-3h-gce-test
Test id: 09a348c2-c651-412a-8600-d824f4e088c4
Test name: scylla-master/longevity/longevity-10gb-3h-gce-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file: longevity-10gb-3h.yaml

Jenkins job URL
Argus

WARN  [cluster1-nio-worker-0] 2025-01-16 03:40:18,779 Connection.java:284 - Error creating netty channel to /10.142.0.40:9042
com.datastax.shaded.netty.channel.ConnectTimeoutException: connection timed out: /10.142.0.40:9042
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
	at com.datastax.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at com.datastax.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

WARN  [cluster1-nio-worker-0] 2025-01-16 03:40:23,785 Connection.java:284 - Error creating netty channel to /10.142.0.53:9042
com.datastax.shaded.netty.channel.ConnectTimeoutException: connection timed out: /10.142.0.53:9042
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
	at com.datastax.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at com.datastax.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

WARN  [cluster1-nio-worker-7] 2025-01-16 03:40:28,792 Connection.java:284 - Error creating netty channel to /10.142.0.59:9042
com.datastax.shaded.netty.channel.ConnectTimeoutException: connection timed out: /10.142.0.59:9042
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
	at com.datastax.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at com.datastax.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at com.datastax.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)


yarongilor commented Jan 26, 2025

The issue seems to reproduce now in a more specific scenario.
It takes about half an hour; perhaps it can be narrowed down even further.
The current scenario is:

  • Decommission node-6.
  • Run disrupt_bootstrap_streaming_error (this step might be unnecessary).
  • Start a rolling restart of the cluster (disrupt_rolling_restart_cluster).
  • The driver gets a "Connection has been closed" error for node-6.

Hmm, I also see an unexpected log message about the decommissioned node restarting (or is it a different node with the same IP?):

< t:2025-01-26 13:34:38,694 f:db_log_reader.py l:125  c:sdcm.db_log_reader   p:DEBUG > 2025-01-26T13:34:36.966+00:00 longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-5     !INFO | scylla[5454]:  [shard 0: gms] gossip - Node 10.0.0.10 has restarted, now UP, status = NORMAL

It could be node-7:

< t:2025-01-26 13:15:09,905 f:cluster.py      l:446  c:sdcm.cluster_azure   p:DEBUG > Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 [None | 10.0.0.10]: SSH access -> 'ssh -i ~/.ssh/scylla_test_id_ed25519 [email protected]'
egrep "\[/10.0.0.10:9042\] Connection has been closed|sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey" csct 
< t:2025-01-26 13:06:06,709 f:tester.py       l:1207 c:LongevityTest        p:DEBUG > Nemesis threads [{'nemesis': <class 'sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey'>, 'nemesis_selector': []}]
< t:2025-01-26 13:06:56,977 f:nemesis.py      l:467  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Interval: 120 s
< t:2025-01-26 13:07:11,655 f:nemesis.py      l:416  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2025-01-26 13:07:11,655 f:nemesis.py      l:1196 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Set current_disruption -> NodetoolDecommission
< t:2025-01-26 13:07:11,655 f:nemesis.py      l:459  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Current Target: Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10] with running nemesis: NodetoolDecommission
< t:2025-01-26 13:07:11,724 f:nemesis.py      l:5474 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: >>>>>>>>>>>> Started disruption disrupt_nodetool_decommission (NodetoolDecommission nemesis) on the target node 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10]' >>>>>>>>>>>>
< t:2025-01-26 13:08:38,108 f:nemesis.py      l:1330 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Adding 1 new nodes to cluster...
< t:2025-01-26 13:17:51,276 f:nemesis.py      l:5537 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: NodetoolDecommission duration -> 639 s
< t:2025-01-26 13:17:51,276 f:nemesis.py      l:362  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Update nemesis info with: {'start': 1737896831, 'end': 1737897471, 'duration': 639, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10]', 'subtype': 'end'}
< t:2025-01-26 13:17:51,820 f:nemesis.py      l:5552 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: log_info: {'start': 1737896831, 'end': 1737897471, 'duration': 639, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 [None | 10.0.0.10]', 'subtype': 'end'}
< t:2025-01-26 13:17:52,163 f:nemesis.py      l:5561 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: <<<<<<<<<<<< Finished disruption disrupt_nodetool_decommission (NodetoolDecommission nemesis) with status 'succeeded' <<<<<<<<<<<<
< t:2025-01-26 13:18:23,120 f:nemesis.py      l:416  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2025-01-26 13:18:23,120 f:nemesis.py      l:1196 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Set current_disruption -> BootstrapStreamingError
< t:2025-01-26 13:18:23,120 f:nemesis.py      l:459  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Current Target: Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6] with running nemesis: BootstrapStreamingError
< t:2025-01-26 13:18:23,168 f:nemesis.py      l:5474 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: >>>>>>>>>>>> Started disruption disrupt_bootstrap_streaming_error (BootstrapStreamingError nemesis) on the target node 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6]' >>>>>>>>>>>>
< t:2025-01-26 13:26:39,461 f:nemesis.py      l:5301 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Wait 5 minutes with new topology
< t:2025-01-26 13:31:39,533 f:nemesis.py      l:5303 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Decommission added node
< t:2025-01-26 13:33:35,527 f:nemesis.py      l:5537 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: BootstrapStreamingError duration -> 912 s
< t:2025-01-26 13:33:35,527 f:nemesis.py      l:362  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Update nemesis info with: {'start': 1737897503, 'end': 1737898415, 'duration': 912, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6]', 'subtype': 'end'}
< t:2025-01-26 13:33:35,585 f:nemesis.py      l:5552 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: log_info: {'start': 1737897503, 'end': 1737898415, 'duration': 912, 'node': 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 [None | 10.0.0.6]', 'subtype': 'end'}
< t:2025-01-26 13:33:35,924 f:nemesis.py      l:5561 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: <<<<<<<<<<<< Finished disruption disrupt_bootstrap_streaming_error (BootstrapStreamingError nemesis) with status 'succeeded' <<<<<<<<<<<<
< t:2025-01-26 13:34:06,615 f:nemesis.py      l:416  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2025-01-26 13:34:06,615 f:nemesis.py      l:1196 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Set current_disruption -> RollingRestartCluster
< t:2025-01-26 13:34:06,615 f:nemesis.py      l:459  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: Current Target: Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 [None | 10.0.0.10] with running nemesis: RollingRestartCluster
< t:2025-01-26 13:34:06,665 f:nemesis.py      l:5474 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.UniqueDriverDisconnectionSequenceMonkey: >>>>>>>>>>>> Started disruption disrupt_rolling_restart_cluster (RollingRestartCluster nemesis) on the target node 'Node longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 [None | 10.0.0.10]' >>>>>>>>>>>>
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
< t:2025-01-26 13:35:32,653 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.12>: com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
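Bursts of identical driver errors like the four lines above are easier to triage when summarized per endpoint. A minimal sketch (a hypothetical helper, not part of SCT) that counts `TransportException` occurrences per IP in a log excerpt:

```python
import re
from collections import Counter

# Sample lines copied from the SCT log excerpt above
log = """\
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.0.0.10:9042] Connection has been closed
"""

# Extract the endpoint IP from each TransportException line
pattern = re.compile(r"TransportException: \[/([\d.]+):\d+\]")
counts = Counter(m.group(1) for m in pattern.finditer(log))
print(counts)  # which endpoints the driver lost, and how often
```

Running this over a full `c-s` log would show whether the errors cluster on one restarted node or spread across the cluster.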

Packages

Scylla version: 2025.1.0~dev-20250125.f4b1ad43d4e7 with build-id f287d3c3c2679d52454be4519393ac84ecdaf2c8

Kernel Version: 6.8.0-1020-azure

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.



Impact


How frequently does it reproduce?


Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-9 (null | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-8 (null | 10.0.0.14) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-7 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-6 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-5 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-4 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-3 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-2 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-16 (null | 10.0.0.9) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-15 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-14 (null | 10.0.0.9) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-13 (null | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-12 (null | 10.0.0.9) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-11 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-10 (null | 10.0.0.6) (shards: -1)
  • longevity-10gb-3h-repro-ja-db-node-ed6aec62-eastus-1 (null | 10.0.0.5) (shards: 7)
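Several entries in the node list above share a private IP, which matters for interpreting the driver errors discussed in the comments. A small sketch (hypothetical, using node-name suffixes from the list above) that groups the nodes by IP to surface the reuse:

```python
from collections import defaultdict

# (node name suffix, private IP) pairs copied from the run's node list
nodes = [
    ("eastus-9", "10.0.0.14"), ("eastus-8", "10.0.0.14"),
    ("eastus-7", "10.0.0.10"), ("eastus-6", "10.0.0.10"),
    ("eastus-5", "10.0.0.9"),  ("eastus-4", "10.0.0.8"),
    ("eastus-3", "10.0.0.7"),  ("eastus-2", "10.0.0.6"),
    ("eastus-16", "10.0.0.9"), ("eastus-15", "10.0.0.6"),
    ("eastus-14", "10.0.0.9"), ("eastus-13", "10.0.0.14"),
    ("eastus-12", "10.0.0.9"), ("eastus-11", "10.0.0.6"),
    ("eastus-10", "10.0.0.6"), ("eastus-1", "10.0.0.5"),
]

# Group node names by the private IP they were assigned
by_ip = defaultdict(list)
for name, ip in nodes:
    by_ip[ip].append(name)

# Keep only IPs that were handed out to more than one node
reused = {ip: names for ip, names in by_ip.items() if len(names) > 1}
for ip, names in sorted(reused.items()):
    print(ip, "->", ", ".join(names))
```

For this run, 10.0.0.6 (the decommissioned node-2's IP) was assigned to four different nodes over the test's lifetime, which is why a connection error against that IP cannot by itself identify which node the driver was talking to.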

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-2025.1.0-dev-x86_64-2025-01-25T21-46-28 (NO RUNNER: NO RUNNER)

Test: longevity-10gb-3h-azure-repro-java-driver
Test id: ed6aec62-676f-45d4-b256-17077796046b
Test name: scylla-master/reproducers/longevity-10gb-3h-azure-repro-java-driver
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor ed6aec62-676f-45d4-b256-17077796046b
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs ed6aec62-676f-45d4-b256-17077796046b

Logs:

Jenkins job URL
Argus

@roydahan
Collaborator

Are you sure it's not related to this one: scylladb/scylla-cluster-tests#8855 (comment)

@yarongilor
Author

This issue is not accurately reported. It looks like the decommissioned node's IP was reused for a newly added node, so the driver did not necessarily get connectivity errors for the decommissioned node; more likely they came from the new node being restarted by the nemesis.

< t:2024-12-18 02:03:52,813 f:cluster.py      l:2209 c:sdcm.cluster_azure   p:INFO  > Node longevity-10gb-3h-master-db-node-413a3a9b-eastus-8 [None | 10.0.0.6]: Found ScyllaDB version: 6.3.0~dev

@dkropachev
Collaborator

@yarongilor, could you please confirm that there are two issues:

  1. The driver tries to reconnect to a decommissioned node.
  2. c-s gets stuck and does not finish in time.

@yarongilor
Author

yarongilor commented Jan 28, 2025

@yarongilor, could you please confirm that there are two issues:

  1. The driver tries to reconnect to a decommissioned node.
  2. c-s gets stuck and does not finish in time.

As for the first one, I think it may be a false alarm, since the driver tried connecting to the IP of a new node, which happens to be identical to that of the decommissioned node.
As for the second, I'm not sure what happened. Grafana does show a drop from 150k ops to ~60k after a rolling nemesis:

(Grafana screenshot: throughput drop after the rolling restart)

So it might be that something went wrong with the driver following the two rolling-restart nemeses.
There is also an error that did not appear before, like:

WARN  [cluster1-nio-worker-3] 2024-12-18 03:57:10,443 DefaultPromise.java:593 - An exception was thrown by com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete()

@pehala

pehala commented Feb 4, 2025

As for the first one, I think it may be a false alarm, since the driver tried connecting to the IP of a new node, which happens to be identical to that of the decommissioned node.

Reproduced again: cassandra-stress tries to connect to decommissioned nodes.

* node 10.142.0.40 - terminated at 02:24:06 | connection attempt at 03:40:14

* node 10.142.0.53 - terminated at 01:55:40 | connection attempt at 03:40:23

* node 10.142.0.59 - terminated at 01:15:19 | connection attempt at 03:40:28

I believe it is not a false alarm: we observe in many different runs (it might not reproduce in your particular one) that c-s tried to access nodes that were terminated 2 hours earlier.
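The gap between termination and reconnection attempt can be checked mechanically. A minimal sketch using the timestamps quoted above (assuming all times fall on the same day; `gap` is a hypothetical helper, not part of SCT):

```python
from datetime import datetime, timedelta

# (node IP, terminated at, connection attempt at) from the comment above;
# all timestamps are assumed to be on the same day.
events = [
    ("10.142.0.40", "02:24:06", "03:40:14"),
    ("10.142.0.53", "01:55:40", "03:40:23"),
    ("10.142.0.59", "01:15:19", "03:40:28"),
]

def gap(terminated: str, attempt: str) -> timedelta:
    """Elapsed time between node termination and the driver's connection attempt."""
    fmt = "%H:%M:%S"
    return datetime.strptime(attempt, fmt) - datetime.strptime(terminated, fmt)

for ip, term, att in events:
    print(f"{ip}: driver reconnected {gap(term, att)} after termination")
```

All three gaps are well over an hour, far longer than any sane reconnection-policy backoff window, which supports the argument that the driver is keeping stale hosts in its metadata rather than merely retrying a briefly-down node.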
