
(fix): Separated retries for read and write operations #3559


Closed
wants to merge 18 commits into from

Conversation

vladvildanov
Collaborator

Pull Request check-list

Please make sure to review and check all of these items:

  • Do tests and lints pass with this change?
  • Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
  • Is there an example added to the examples folder (if applicable)?
  • Was the change added to CHANGES file?

Description of change

Separate retries for read and write operations to prevent duplicated writes on read failure.

Closes #3554
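
For context, a minimal sketch of the idea (not the actual diff in this PR; the names follow the existing Redis._execute_command and the error handlers are simplified for illustration): the write and the read each get their own retry call, so a failed read never re-sends the command.

    def _execute_command(self, *args, **options):
        """Sketch: retry the write and the read independently."""
        pool = self.connection_pool
        command_name = args[0]
        conn = self.connection or pool.get_connection(command_name, **options)
        try:
            # Retry only the write; a failure here cannot have produced a reply.
            conn.retry.call_with_retry(
                lambda: conn.send_command(*args, **options),
                lambda error: self._disconnect_raise(conn, error),
            )
            # Retry only the read; the command is never sent a second time.
            return conn.retry.call_with_retry(
                lambda: self.parse_response(conn, command_name, **options),
                lambda error: self._disconnect_raise(conn, error),
            )
        finally:
            if not self.connection:
                pool.release(conn)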


@ManelCoutinhoSensei left a comment


This approach doesn't seem to work:

  1. There is no reconnect mechanism inside the parse_response method. If Redis goes down between send_command and the parse retry, it will get stuck retrying parse_response indefinitely (see the sketch at the end of this comment).
  2. Even if you add a guard to AbstractConnection.read_response, similar to the one in can_read :
    sock = self._sock
    if not sock:
        self.connect()

    it will still send some commands, such as those in the on_connect method, which generate new responses. As a result, the response to the original command (e.g., XADD) may no longer be available, or may not be the correct one.

Unfortunately, fixing this seems to be more complex than it appears.
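
To make point 1 concrete, here is a stripped-down, hypothetical sketch (simplified names, not redis-py internals): once the socket is gone, every retry of the read hits the same dead connection, because nothing on the read path reconnects.

    # Hypothetical, simplified retry loop: the read is retried, but the
    # failure handler only drops the socket and never reconnects, so every
    # attempt fails the same way.
    def call_with_retry(do_read, on_failure, retries=3):
        last_error = None
        for _ in range(retries + 1):
            try:
                return do_read()          # e.g. parse_response() on a closed socket
            except ConnectionError as error:
                last_error = error
                on_failure(error)         # e.g. conn.disconnect() -- no connect()
        raise last_error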

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei I think we're talking about different issues here; the one you refer to is #3555. This one fixes potential duplicate writes when a read fails, and it does fix that.

@ManelCoutinhoSensei

ManelCoutinhoSensei commented Mar 17, 2025

No, I believe I'm referring to the same issue you are (@vladvildanov).

Let me clarify my first point and let me know if I'm missing something:
As explained in the issue, the problem arises when Redis goes down between the send and the read. Consider a scenario where conn.send_command succeeds, but Redis goes down immediately afterward, and now we are attempting self.parse_response.
Redis.parse_response → AbstractConnection.read_response → _HiredisParser.read_response → _HiredisParser.read_from_socket (this is only one possible path). This last method fails at the call self._sock.recv_into(self._buffer) (line 88), because the socket was closed, as per the problem definition. This triggers your newly added retry handler _disconnect_raise, which disconnects the socket and retries the described path.
However, upon retrying, Redis.parse_response does not contain logic for reconnecting the socket, so the retry inevitably fails again, making this retry mechanism ineffective.

I understand that your fix technically avoids retrying send operations when a read fails, but it introduces additional problems in the exact scenario you're attempting to address, and that is the only scenario in which the duplication issue you're fixing actually occurs.

A potential (though admittedly suboptimal) alternative would be to remove the retry mechanism for reads entirely. Instead, you could return a custom error indicating that the send operation was done but reading the response failed, clearly informing the user that it’s their responsibility to handle or verify the outcome.

(My second point about adding reconnect mechanisms to the read part is covered in my previous comment.)

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Thanks for this thorough explanation, now I understand the core of the issue much better! However, wouldn't it make sense to reconnect inside _disconnect_raise after the disconnect? That should solve the issue, I suppose. It already works like this for Pub/Sub:

https://github.com/redis/redis-py/blob/master/redis/client.py#L902
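
For reference, a rough sketch of that idea (modeled on the PubSub handler linked above, simplified here): reconnect inside the failure callback so the next retry attempt has a live socket.

    def _disconnect_raise_connect(self, conn, error):
        # Drop the broken socket, re-raise if the error is not retryable,
        # otherwise reconnect before the retry loop runs the call again.
        conn.disconnect()
        if conn.retry_on_error is None or not isinstance(error, tuple(conn.retry_on_error)):
            raise error
        conn.connect()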

@ManelCoutinhoSensei

@vladvildanov, That might actually be a better approach than the one I was describing. However, I believe it still encounters the issue I mentioned in my second point:
Calling AbstractConnection.connect will eventually trigger AbstractConnection.on_connect, which sets several configurations by sending commands like these:

try:
    if self.lib_name:
        self.send_command("CLIENT", "SETINFO", "LIB-NAME", self.lib_name)
        self.read_response()
except ResponseError:
    pass
try:
    if self.lib_version:
        self.send_command("CLIENT", "SETINFO", "LIB-VER", self.lib_version)
        self.read_response()
except ResponseError:
    pass

As a result, by the time you attempt to read the response to the original command, it will no longer be available.

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Well, you're right: the existing logic that disconnects on retry makes it impossible to separate retries for writes and reads. I did some investigation into other clients, and they also stick to this "transactional" logic when doing retries. So, to be consistent, we assume that a command is executed at least once, but you may have to live with duplicates. Otherwise, you need to handle retries in the application.

https://redis.github.io/lettuce/advanced-usage/#failures-and-at-least-once-execution

@ManelCoutinhoSensei

ManelCoutinhoSensei commented Mar 19, 2025

Hey @vladvildanov,

Before closing this PR and the issue completely, I wanted to check if an alternative approach could address the duplicate problem while still maintaining the at least once assumption used by other libraries.

Wouldn't something like this work?

try:
    conn.retry.call_with_retry(
        lambda: conn.send_command(*args, **options),
        lambda error: self._disconnect_raise(conn, error),
    )
    return self.parse_response(conn, command_name, **options)
except ParseResponseError as e:    # To be replaced with correct Error
    raise CustomErrorSayingThatReadResponseFailed
finally:
    ...

This keeps the changes minimal, avoids the duplication caused by read failures, and preserves the original assumption (given that retries are already in place; without them, the at-least-once guarantee doesn't hold anyway). Instead of requiring users to track whether the command was sent once or twice, they would only need to check whether it was successful in case of an error.

I know other clients aren't doing this, but I believe we can do better 😉. This seems like a reasonable compromise between the ideal solution and taking no action.

What do you think?

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Currently our users have the expectation that a retry happens on timeout errors; your approach doesn't take that into account. If you disable retries, you would achieve exactly what you mentioned: "only need to check whether it was successful in case of an error".

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Let's put it this way: we can support an additional retry strategy, something like "at most once", but the existing one should be kept.

@ManelCoutinhoSensei

ManelCoutinhoSensei commented Mar 20, 2025

@vladvildanov, I have bad news and good news.

The bad news is that, as it stands, the retry mechanism does not work properly with non-idempotent operations. I believe this should be made clearer in the documentation. In the worst-case scenario, if the Redis connection is unstable, the same command could be sent up to N_RETRIES times, with the end user only being aware of the last attempt.
I mention this because even you initially found this behavior unintuitive and unexpected. This suggests that the current behavior might not align with user expectations and could benefit from a clearer explanation.


The good news is that since every execute_command call is funneled through a single method, I believe we can modify my previous suggestion into a more flexible solution. This would allow the system to support both "at least once" and "at most once" execution strategies, making the retry mechanism more useful for non-idempotent operations.

Here’s a rough draft of how it could work, to help you jump-start this small implementation:

    def _execute_command_with_strategy(self, conn, command_name, *args, **options):
        if self.at_least_once: # Or retrieve it from the retry object (e.g., conn.retry.at_least_once)
            # This behaves as before, ensuring "at least once" execution.
            return self._send_command_parse_response(conn, command_name, *args, **options)

        # "At most once" case:
        # - Send the command with retries
        # - If an exception occurs during response parsing, leave it to the user to handle
        conn.send_command(*args, **options)
        try:
            return self.parse_response(conn, command_name, **options)
        except Exception as e:
            raise InternalException from e  # Decide which exception to raise here: an internal one or something already defined for the end user


    def _execute_command(self, *args, **options):
        """Execute a command and return a parsed response"""
        pool = self.connection_pool
        command_name = args[0]
        conn = self.connection or pool.get_connection(command_name, **options)
        try:
            return conn.retry.call_with_retry(
                lambda: self._execute_command_with_strategy(conn, command_name, *args, **options),
                lambda error: self._disconnect_raise(conn, error),
            )
        finally:
            # As in the existing execute_command: release pooled connections.
            if not self.connection:
                pool.release(conn)

This way, we don't have to discard the retry mechanism entirely. Instead, we can provide a configurable approach that better supports non-idempotent operations.
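
For example, opting in could look like this (the at_least_once flag is hypothetical and taken from the draft above; it does not exist in redis-py today):

    from redis import Redis
    from redis.backoff import ExponentialBackoff
    from redis.retry import Retry

    # Hypothetical usage: at_least_once=False would select the "at most once"
    # branch from the draft above.
    r = Redis(
        host="localhost",
        port=6379,
        retry=Retry(ExponentialBackoff(), retries=3),
        at_least_once=False,
    )
    # Retried on send; never re-sent because of a failed read.
    r.xadd("events", {"type": "signup"})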

What do you think? 🚀

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Looks good to me. I'll take action on this in a separate PR and will add you as a reviewer.

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei My proposal is the following: to me, always disconnecting on retry is the odd part here. For timeout errors it doesn't make sense to disconnect every time, only after the retries are exhausted. It does make sense to disconnect on ConnectionError, as it means the connection is unhealthy.

So let's change the _disconnect_raise method to not disconnect on timeouts. This way we can support both at-least-once, by keeping the existing "transactional" retry logic, and at-most-once, by separating retries for reads and writes. In the at-most-once case, if Redis goes down after send_command, parse_response fails immediately, so the user can handle it themselves; and if a timeout happens, the client is still able to retry reads separately and might eventually return the result of the previous write. WDYT?
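
A rough sketch of that _disconnect_raise change (not an actual implementation; the retryable-error check mirrors the current handler's shape):

    import socket

    from redis.exceptions import TimeoutError

    def _disconnect_raise(self, conn, error):
        # Keep the socket on timeouts so a separate read retry can still pick
        # up the pending reply; disconnect only for errors that indicate an
        # unhealthy connection (e.g. ConnectionError).
        if not isinstance(error, (TimeoutError, socket.timeout)):
            conn.disconnect()
        if conn.retry_on_error is None or not isinstance(error, tuple(conn.retry_on_error)):
            raise error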

@vladvildanov
Collaborator Author

@petyaslavova Would appreciate your thoughts on this ⬆️⬆️

@petyaslavova
Collaborator

@petyaslavova Would appreciate your thoughts on this ⬆️⬆️

It sounds reasonable. When we have a broken connection, retrying the response reading won't bring any value.
Combined with a new configuration for strategy selection, it would be a working solution for users who don't care how many times the write has been executed—they won't need to change their implementation. At the same time, we would add flexibility for use cases where writing at most once is critical. In both cases, the time spent on reconnection will be saved.

@ManelCoutinhoSensei

@vladvildanov Let me try to rephrase your proposal to make sure I understood it correctly:

You're suggesting the following:

  • At-least-once: Keep it as I described earlier.
  • At-most-once: Slightly adjust the logic so that we can still retry read_response if the exception is either redis.exceptions.TimeoutError or socket.timeout. In those specific cases, the error handler wouldn’t disconnect, and we'd pass something like lambda error: None to skip the disconnect logic.

If that's what you're proposing, I think there are two things worth considering:

  1. The _supported_errors in the Retry object are user-configurable, so modifying the behavior for specific error types might interfere with user expectations. (Also, redoing or wrapping a Retry object manually could introduce its own complications—but maybe I'm overthinking this.)
  2. I'm not sure if there are edge cases where reading from a socket that has timed out could return stale or out-of-date data (like in some failover transition state). That might be worth double-checking.

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Yeah, you got it right! Let me address your two concerns:

  1. Instead of setting lambda error: None for read_response, I'm planning to add a condition inside the retry callback itself, something like if error in [TimeoutError, socket.timeout]: conn.disconnect(). This way we don't need to worry about user-defined errors, because even if the error list is fully overridden, the initial behaviour will be kept and the socket will be closed.

  2. That's something that makes sense to check.

@ManelCoutinhoSensei

@vladvildanov I assume that you meant if error not in ...: conn.disconnect()
The only thing I'm afraid of is that there seems to have been a conscious decision to disconnect by default even on socket.timeout, so changing it might be more dangerous than expected:

def read_response(
    self,
    disable_decoding=False,
    *,
    disconnect_on_error=True,
    push_request=False,
):
    """Read the response from a previously sent command"""
    host_error = self._host_error()
    try:
        if self.protocol in ["3", 3] and not HIREDIS_AVAILABLE:
            response = self._parser.read_response(
                disable_decoding=disable_decoding, push_request=push_request
            )
        else:
            response = self._parser.read_response(disable_decoding=disable_decoding)
    except socket.timeout:
        if disconnect_on_error:
            self.disconnect()
        raise TimeoutError(f"Timeout reading from {host_error}")

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Well, I don't get it, but then let's restrict it to TimeoutError only for now.

@ManelCoutinhoSensei

@vladvildanov I think TimeoutError is just a wrapper around socket.timeout, so if you are going to do one, do both (or neither).

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei It's still a question how we want to handle read failures in the "at-most-once" strategy when a ConnectionError occurs, because it's impossible to distinguish whether the error was thrown by a read or a write. It's also not obvious how we should behave in the case of pipelines and transactions. This is turning out to be more complicated than it initially looked. I've started an internal discussion about retries and will let you know how we proceed.

@ManelCoutinhoSensei

@vladvildanov Sounds good!

It's impossible to distinguish whether the error was thrown by a read or a write

True. That's why, in my suggestion, I proposed a new custom error for this case.

how we should behave in the case of pipelines and transactions

Good catch!! Looking at it, it shouldn't be too hard to implement something like what I proposed for the regular Redis client, though.

I've started an internal discussion about retries and will let you know how we proceed

Awesome!! Let me know if you need anything or any help! 🚀

Successfully merging this pull request may close these issues.

Potential Command Duplication in _send_command_parse_response Retry Mechanism