Don't deplete all the startup nodes after ConnectionError/TimeoutError #3697
eoghanmurray wants to merge 3 commits into redis:master
… or TimeoutError against all nodes, rather keep one around so that retry algorithm has at least one node to work with
Pull Request Overview
This PR updates the cluster command execution logic to avoid removing the last remaining startup node on connection or timeout failures, ensuring retries can still proceed.
- Preserve one startup node when all others fail and wrap the original exception in a RedisClusterException
- Only remove failed nodes if more than one startup node remains
- Re-raise the appropriate exception after forcing a cluster layout reinitialization
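The three points above can be illustrated with a minimal, self-contained sketch. It does not use redis-py itself: `ClusterError` is a hypothetical stand-in for `RedisClusterException`, `handle_node_failure` is a hypothetical helper, and the startup nodes are modelled as a plain dict keyed by address, which is an assumption made for illustration only.

```python
class ClusterError(Exception):
    """Hypothetical stand-in for redis.exceptions.RedisClusterException."""


def handle_node_failure(startup_nodes, failed_node, exc):
    """Drop a failed startup node unless it is the last one left.

    Returns the exception the caller should re-raise once it has
    forced a cluster layout reinitialization.
    """
    if len(startup_nodes) == 1:
        # Last remaining node: keep it so a later retry still has a node
        # to contact, and wrap the original error so its cause survives.
        wrapped = ClusterError(
            "Redis Cluster cannot be connected. "
            "Connection or Timeout Errors across all startup nodes"
        )
        wrapped.__cause__ = exc
        return wrapped
    # More than one node remains, so the failed one can be discarded.
    startup_nodes.pop(failed_node, None)
    return exc


# Illustrative addresses only.
nodes = {"10.0.0.1:6379": object(), "10.0.0.2:6379": object()}
first = handle_node_failure(nodes, "10.0.0.1:6379", ConnectionError("refused"))
second = handle_node_failure(nodes, "10.0.0.2:6379", TimeoutError("timed out"))
```

After the first failure the dict shrinks to one entry and the original `ConnectionError` is returned unchanged. After the second failure the remaining entry is kept, and the returned `ClusterError` carries the `TimeoutError` as its `__cause__`.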
Comments suppressed due to low confidence (2)
redis/asyncio/cluster.py:824
- [nitpick] The error message could be more descriptive and grammatically clear, e.g., 'Unable to connect to Redis Cluster: connection or timeout errors on all startup nodes'.
'Connection or Timeout Errors across all startup nodes'
redis/asyncio/cluster.py:820
- Add a unit test covering the scenario where only one startup node remains to ensure it isn't removed and the correct RedisClusterException is raised with the original cause.
if len(self.nodes_manager.startup_nodes) == 1:
    ce = RedisClusterException(
        'Redis Cluster cannot be connected. '
        'Connection or Timeout Errors across all startup nodes'
    )
    ce.__cause__ = e
    e = ce
[nitpick] Reassigning the caught exception variable e to a new exception can be confusing; consider raising the new RedisClusterException directly or using a separate variable name for clarity.
    - ce = RedisClusterException(
    -     'Redis Cluster cannot be connected. '
    -     'Connection or Timeout Errors across all startup nodes'
    - )
    - ce.__cause__ = e
    - e = ce
    + raise RedisClusterException(
    +     'Redis Cluster cannot be connected. '
    +     'Connection or Timeout Errors across all startup nodes'
    + ) from e
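To compare the two forms discussed in this review comment, here is a small sketch in plain Python. `ClusterError` is again a hypothetical stand-in for `RedisClusterException`, and both helper functions are invented for illustration; neither is part of redis-py.

```python
class ClusterError(Exception):
    """Hypothetical stand-in for RedisClusterException."""


def wrap_with_cause_attribute(exc):
    # Pattern from the diff: build the wrapper, attach the cause by hand,
    # and hand it back so the caller can raise it later.
    wrapped = ClusterError("all startup nodes failed")
    wrapped.__cause__ = exc
    return wrapped


def wrap_with_raise_from(exc):
    # Pattern from the suggestion: `raise ... from` attaches the cause.
    # The exception is caught here only so it can be inspected.
    try:
        raise ClusterError("all startup nodes failed") from exc
    except ClusterError as wrapped:
        return wrapped


original = TimeoutError("timed out")
manual = wrap_with_cause_attribute(original)
chained = wrap_with_raise_from(original)
```

Both forms leave the same `__cause__` on the new exception, so the traceback chaining is equivalent. The practical difference is timing: `raise ... from e` raises immediately, whereas the reassignment form defers the raise to a later statement. Whether that matters depends on what follows in the handler, which is not shown here.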
@eoghanmurray, hello!
Thanks for the PR!
We also ran into a similar error, where redis removes all startup nodes from the pool and goes into an endless retry loop.
Could you take a look at the comments from Copilot?
I'd like to see this PR merged as soon as possible :)
Hi @eoghanmurray, thank you for your PR!
Don't deplete all the startup nodes after a ConnectionError/TimeoutError against all nodes; rather, keep one around so that the retry algorithm has at least one node to work with
Description of change
See bug report #3693