
(fix): Separated retries for read and write operations #3559


Closed
wants to merge 18 commits into from

Conversation

vladvildanov
Collaborator

Pull Request check-list

Please make sure to review and check all of these items:

  • Do tests and lints pass with this change?
  • Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
  • Is there an example added to the examples folder (if applicable)?
  • Was the change added to CHANGES file?

Description of change

Separate retries for read and write operations to prevent duplicated writes on read failure.

Closes #3554
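
For context, a minimal sketch of the idea (not the actual diff in this PR; the names follow the existing Redis._execute_command and the error handlers are simplified for illustration): the write and the read each get their own retry call, so a failed read never re-sends the command.

    def _execute_command(self, *args, **options):
        """Sketch: retry the write and the read independently."""
        pool = self.connection_pool
        command_name = args[0]
        conn = self.connection or pool.get_connection(command_name, **options)
        try:
            # Retry only the write; a failure here cannot have produced a reply.
            conn.retry.call_with_retry(
                lambda: conn.send_command(*args, **options),
                lambda error: self._disconnect_raise(conn, error),
            )
            # Retry only the read; the command is never sent a second time.
            return conn.retry.call_with_retry(
                lambda: self.parse_response(conn, command_name, **options),
                lambda error: self._disconnect_raise(conn, error),
            )
        finally:
            if not self.connection:
                pool.release(conn)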


@ManelCoutinhoSensei left a comment


This approach doesn't seem to work:

  1. There is no reconnect mechanism inside the parse_response method. If Redis goes down between send_command and the parse retry, it will get stuck retrying parse_response indefinitely (see the sketch at the end of this comment).
  2. Even if you add a guard to AbstractConnection.read_response, similar to the one in can_read :
    sock = self._sock
    if not sock:
        self.connect()

    it will still send some commands, such as those in the on_connect method, which generate new responses. As a result, the response to the original command (e.g., XADD) may no longer be available, or may not be the correct one.

Unfortunately, fixing this seems to be more complex than it appears.
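
To make point 1 concrete, here is a stripped-down, hypothetical sketch (simplified names, not redis-py internals): once the socket is gone, every retry of the read hits the same dead connection, because nothing on the read path reconnects.

    # Hypothetical, simplified retry loop: the read is retried, but the
    # failure handler only drops the socket and never reconnects, so every
    # attempt fails the same way.
    def call_with_retry(do_read, on_failure, retries=3):
        last_error = None
        for _ in range(retries + 1):
            try:
                return do_read()          # e.g. parse_response() on a closed socket
            except ConnectionError as error:
                last_error = error
                on_failure(error)         # e.g. conn.disconnect() -- no connect()
        raise last_error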

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei I think we're talking about different issues here; the one you refer to is #3555. This one fixes potential duplicate writes when a read fails, and it does fix that.

@ManelCoutinhoSensei

ManelCoutinhoSensei commented Mar 17, 2025

No, I believe I'm referring to the same issue you are (@vladvildanov).

Let me clarify my first point and let me know if I'm missing something:
As explained in the issue, the problem arises when Redis goes down between the send and the read. Consider a scenario where conn.send_command succeeds, but Redis goes down immediately afterward, and now we are attempting self.parse_response.
Redis.parse_response → AbstractConnection.read_response → _HiredisParser.read_response → _HiredisParser.read_from_socket (this is only one possible path). This last method fails at the call self._sock.recv_into(self._buffer) (line 88), because the socket was closed, as per the problem definition. This triggers your newly added retry handler _disconnect_raise, which disconnects the socket and retries the described path.
However, upon retrying, Redis.parse_response does not contain logic for reconnecting the socket, so the retry inevitably fails again, making this retry mechanism ineffective.

I understand that your fix technically avoids retrying send operations when a read fails, but it introduces additional problems in the exact scenario you're attempting to address, and that is the only scenario in which the duplication issue you're fixing actually occurs.

A potential (though admittedly suboptimal) alternative would be to remove the retry mechanism for reads entirely. Instead, you could return a custom error indicating that the send operation was done but reading the response failed, clearly informing the user that it’s their responsibility to handle or verify the outcome.

(My second point about adding reconnect mechanisms to the read part is covered in my previous comment.)

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Thanks for this thorough explanation, now I understand the core of the issue much better! However, wouldn't it make sense to reconnect inside _disconnect_raise after the disconnect? That should solve the issue, I suppose. It already works like this for Pub/Sub:

https://github.com/redis/redis-py/blob/master/redis/client.py#L902
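
For reference, a rough sketch of that idea (modeled on the PubSub handler linked above, simplified here): reconnect inside the failure callback so the next retry attempt has a live socket.

    def _disconnect_raise_connect(self, conn, error):
        # Drop the broken socket, re-raise if the error is not retryable,
        # otherwise reconnect before the retry loop runs the call again.
        conn.disconnect()
        if conn.retry_on_error is None or not isinstance(error, tuple(conn.retry_on_error)):
            raise error
        conn.connect()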

@ManelCoutinhoSensei

@vladvildanov, That might actually be a better approach than the one I was describing. However, I believe it still encounters the issue I mentioned in my second point:
Calling AbstractConnection.connect will eventually trigger AbstractConnection.on_connect, which sets several configurations by sending commands like these:

try:
    if self.lib_name:
        self.send_command("CLIENT", "SETINFO", "LIB-NAME", self.lib_name)
        self.read_response()
except ResponseError:
    pass
try:
    if self.lib_version:
        self.send_command("CLIENT", "SETINFO", "LIB-VER", self.lib_version)
        self.read_response()
except ResponseError:
    pass

As a result, by the time you attempt to read the response to the original command, it will no longer be available.

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Well, you're right: the existing logic that disconnects on retry makes it impossible to separate retries for writes and reads. I did some investigation into other clients, and they also stick to this "transactional" logic when doing retries. So, to be consistent, we assume that a command is executed at least once, but you may have to live with duplicates. Otherwise, you need to handle retries in the application.

https://redis.github.io/lettuce/advanced-usage/#failures-and-at-least-once-execution

@ManelCoutinhoSensei

ManelCoutinhoSensei commented Mar 19, 2025

Hey @vladvildanov,

Before closing this PR and the issue completely, I wanted to check if an alternative approach could address the duplicate problem while still maintaining the at least once assumption used by other libraries.

Wouldn't something like this work?

try:
    conn.retry.call_with_retry(
        lambda: conn.send_command(*args, **options),
        lambda error: self._disconnect_raise(conn, error),
    )
    return self.parse_response(conn, command_name, **options)
except ParseResponseError as e:    # To be replaced with correct Error
    raise CustomErrorSayingThatReadResponseFailed
finally:
    ...

This keeps the changes minimal, avoids the duplication caused by read failures, and preserves the original assumption (given that retries are already in place; without them, the at-least-once guarantee doesn't hold anyway). Instead of requiring users to track whether the command was sent once or twice, they would only need to check whether it was successful in case of an error.

I know other clients aren't doing this, but I believe we can do better 😉. This seems like a reasonable compromise between the ideal solution and taking no action.

What do you think?

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Currently our users have the expectation that a retry happens on timeout errors; your approach doesn't take that into account. If you disable retries, you would achieve exactly what you mentioned: "only need to check whether it was successful in case of an error".

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Let's put it this way: we can support an additional retry strategy, something like "at most once", but the existing one should be kept.

@ManelCoutinhoSensei

ManelCoutinhoSensei commented Mar 20, 2025

@vladvildanov, I have bad news and good news.

The bad news is that, as it stands, the retry mechanism does not work properly with non-idempotent operations. I believe this should be made clearer in the documentation. In the worst-case scenario, if the Redis connection is unstable, the same command could be sent up to N_RETRIES times, with the end user only being aware of the last attempt.
I mention this because even you initially found this behavior unintuitive and unexpected. This suggests that the current behavior might not align with user expectations and could benefit from a clearer explanation.


The good news is that since every execute_command call is funneled through a single method, I believe we can modify my previous suggestion into a more flexible solution. This would allow the system to support both "at least once" and "at most once" execution strategies, making the retry mechanism more useful for non-idempotent operations.

Here’s a rough draft of how it could work, to help you jump-start this small implementation:

    def _execute_command_with_strategy(self, conn, command_name, *args, **options):
        if self.at_least_once: # Or retrieve it from the retry object (e.g., conn.retry.at_least_once)
            # This behaves as before, ensuring "at least once" execution.
            return self._send_command_parse_response(conn, command_name, *args, **options)

        # "At most once" case:
        # - Send the command with retries
        # - If an exception occurs during response parsing, leave it to the user to handle
        conn.send_command(*args, **options)
        try:
            return self.parse_response(conn, command_name, **options)
        except Exception as e:
            raise InternalException from e  # Decide which exception to raise here: an internal one or something already defined for the end user


    def _execute_command(self, *args, **options):
        """Execute a command and return a parsed response"""
        pool = self.connection_pool
        command_name = args[0]
        conn = self.connection or pool.get_connection(command_name, **options)
        try:
            return conn.retry.call_with_retry(
                lambda: self._execute_command_with_strategy(conn, command_name, *args, **options),
                lambda error: self._disconnect_raise(conn, error),
            )
        finally:
            # As in the existing execute_command: release pooled connections.
            if not self.connection:
                pool.release(conn)

This way, we don't have to discard the retry mechanism entirely. Instead, we can provide a configurable approach that better supports non-idempotent operations.
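
For example, opting in could look like this (the at_least_once flag is hypothetical and taken from the draft above; it does not exist in redis-py today):

    from redis import Redis
    from redis.backoff import ExponentialBackoff
    from redis.retry import Retry

    # Hypothetical usage: at_least_once=False would select the "at most once"
    # branch from the draft above.
    r = Redis(
        host="localhost",
        port=6379,
        retry=Retry(ExponentialBackoff(), retries=3),
        at_least_once=False,
    )
    # Retried on send; never re-sent because of a failed read.
    r.xadd("events", {"type": "signup"})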

What do you think? 🚀

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Looks good to me. I'll take action on this in a separate PR and will add you as a reviewer.

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei My proposal is the following: to me, always disconnecting on retry is the odd part here. For timeout errors it doesn't make sense to disconnect every time, only after the retries are exhausted. It does make sense to disconnect on ConnectionError, as it means the connection is unhealthy.

So let's change the _disconnect_raise method to not disconnect on timeouts. This way we can support both at-least-once, by keeping the existing "transactional" retry logic, and at-most-once, by separating retries for reads and writes. In the at-most-once case, if Redis goes down after send_command, parse_response fails immediately, so the user can handle it themselves; and if a timeout happens, the client is still able to retry reads separately and might eventually return the result of the previous write. WDYT?
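
A rough sketch of that _disconnect_raise change (not an actual implementation; the retryable-error check mirrors the current handler's shape):

    import socket

    from redis.exceptions import TimeoutError

    def _disconnect_raise(self, conn, error):
        # Keep the socket on timeouts so a separate read retry can still pick
        # up the pending reply; disconnect only for errors that indicate an
        # unhealthy connection (e.g. ConnectionError).
        if not isinstance(error, (TimeoutError, socket.timeout)):
            conn.disconnect()
        if conn.retry_on_error is None or not isinstance(error, tuple(conn.retry_on_error)):
            raise error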

@vladvildanov
Collaborator Author

@petyaslavova Would appreciate your thoughts on this ⬆️⬆️

@petyaslavova
Collaborator

@petyaslavova Would appreciate your thoughts on this ⬆️⬆️

It sounds reasonable. When we have a broken connection, retrying the response reading won't bring any value.
Combined with a new configuration for strategy selection, it would be a working solution for users who don't care how many times the write has been executed—they won't need to change their implementation. At the same time, we would add flexibility for use cases where writing at most once is critical. In both cases, the time spent on reconnection will be saved.

@ManelCoutinhoSensei

@vladvildanov Let me try to rephrase your proposal to make sure I understood it correctly:

You're suggesting the following:

  • At-least-once: Keep it as I described earlier.
  • At-most-once: Slightly adjust the logic so that we can still retry read_response if the exception is either redis.exceptions.TimeoutError or socket.timeout. In those specific cases, the error handler wouldn’t disconnect, and we'd pass something like lambda error: None to skip the disconnect logic.

If that's what you're proposing, I think there are two things worth considering:

  1. The _supported_errors in the Retry object are user-configurable, so modifying the behavior for specific error types might interfere with user expectations. (Also, redoing or wrapping a Retry object manually could introduce its own complications—but maybe I'm overthinking this.)
  2. I'm not sure if there are edge cases where reading from a socket that has timed out could return stale or out-of-date data (like in some failover transition state). That might be worth double-checking.

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Yeah, you got it right! Let me address your two concerns:

  1. Instead of setting lambda error: None for read_response, I'm planning to add a condition inside the retry callback itself, something like if error in [TimeoutError, socket.timeout]: conn.disconnect(). This way we don't need to worry about user-defined errors, because even if the error list is fully overridden, the initial behaviour will be kept and the socket will be closed.

  2. That's something that makes sense to check.

@ManelCoutinhoSensei

@vladvildanov I assume that you meant if error not in ...: conn.disconnect()
The only thing I'm afraid of is that there seems to have been a conscious decision to disconnect by default even on socket.timeout, so changing it might be more dangerous than expected:

def read_response(
    self,
    disable_decoding=False,
    *,
    disconnect_on_error=True,
    push_request=False,
):
    """Read the response from a previously sent command"""
    host_error = self._host_error()
    try:
        if self.protocol in ["3", 3] and not HIREDIS_AVAILABLE:
            response = self._parser.read_response(
                disable_decoding=disable_decoding, push_request=push_request
            )
        else:
            response = self._parser.read_response(disable_decoding=disable_decoding)
    except socket.timeout:
        if disconnect_on_error:
            self.disconnect()
        raise TimeoutError(f"Timeout reading from {host_error}")

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei Well, I don't get it, but then let's restrict it to TimeoutError only for now.

@ManelCoutinhoSensei

@vladvildanov I think TimeoutError is just a wrapper around socket.timeout, so if you are going to do one, do both (or neither).

@vladvildanov
Collaborator Author

@ManelCoutinhoSensei It's still a question how we want to handle read failures in the "at-most-once" strategy when a ConnectionError occurs, because it's impossible to distinguish whether the error was thrown by a read or a write. It's also not obvious how we should behave in the case of pipelines and transactions. This is turning out to be more complicated than it initially looked. I've started an internal discussion about retries and will let you know how we proceed.

@ManelCoutinhoSensei

@vladvildanov Sounds good!

It's impossible to distinguish whether the error was thrown by a read or a write

True. That's why, in my suggestion, I proposed a new custom error for this case.

how we should behave in the case of pipelines and transactions

Good catch!! Looking at it, it shouldn't be too hard to implement something like what I proposed for the regular Redis client, though.

I've started an internal discussion about retries and will let you know how we proceed

Awesome!! Let me know if you need anything or any help! 🚀

Successfully merging this pull request may close these issues.

Potential Command Duplication in _send_command_parse_response Retry Mechanism