feat(proxy): add upstream_response_decision hook for status-code retries #872
Open
rickcrawford wants to merge 2 commits into cloudflare:main from
Today Pingora retries upstream calls only on connect-time errors via
fail_to_connect. There is no way to drive the retry loop from a
response-side signal, which means status-code-driven retries (the
common 502/503/504 case during rolling restarts) cannot be expressed
without working around Pingora's flow control.
This adds one minimal hook that reuses the existing retry path:
    async fn upstream_response_decision(
        &self,
        _session: &mut Session,
        _upstream_response: &ResponseHeader,
        _ctx: &mut Self::CTX,
    ) -> Option<Box<Error>> { None }
The hook fires once per upstream response, right after
upstream_response_filter and before any bytes flow downstream.
Returning Some(err) aborts the response. If err.set_retry(true) is
set, the existing proxy_to_upstream retry loop kicks in: the
upstream connection is dropped, upstream_peer() runs again, and
the retry buffer / error_while_proxy machinery decides whether the
request body can be replayed (same gating that connect-error retries
already use).
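The flow described above can be modelled roughly as follows. This is a simplified, synchronous sketch and not Pingora's code: ProxyError, response_decision, gate_on_body_replay, and proxy_with_retries are hypothetical stand-ins, and the list of statuses stands in for successive upstream_peer() picks.

```rust
// Simplified, synchronous model of the flow described above.
// Every type and function here is a hypothetical stand-in, not Pingora's API.

/// Stand-in for an error that carries a retry flag.
#[derive(Debug)]
struct ProxyError {
    reason: String,
    retry: bool,
}

impl ProxyError {
    fn new(reason: &str) -> Self {
        ProxyError { reason: reason.to_string(), retry: false }
    }

    fn set_retry(&mut self, retry: bool) {
        self.retry = retry;
    }
}

/// Stand-in for the response-side hook: ask for a retry on 502/503/504.
fn response_decision(status: u16) -> Option<ProxyError> {
    if matches!(status, 502 | 503 | 504) {
        let mut err = ProxyError::new("retryable upstream status");
        err.set_retry(true);
        Some(err)
    } else {
        None
    }
}

/// Stand-in for the gating step: a request body streamed past the retry
/// buffer cannot be replayed, so the retry flag is cleared.
fn gate_on_body_replay(err: &mut ProxyError, buffer_truncated: bool) {
    if buffer_truncated {
        err.set_retry(false);
    }
}

/// Walks the peers' statuses in order, mimicking "pick a peer, read the
/// response headers, decide". Returns the status sent downstream, or the error.
fn proxy_with_retries(
    peer_statuses: &[u16],
    buffer_truncated: bool,
    max_attempts: usize,
) -> Result<u16, ProxyError> {
    let mut last_err = ProxyError::new("no upstream peer available");
    for &status in peer_statuses.iter().take(max_attempts) {
        match response_decision(status) {
            // No decision: the response flows downstream as usual.
            None => return Ok(status),
            Some(mut err) => {
                gate_on_body_replay(&mut err, buffer_truncated);
                if !err.retry {
                    // Abort the response without retrying.
                    return Err(err);
                }
                // Drop this connection and try the next peer.
                last_err = err;
            }
        }
    }
    Err(last_err)
}
```

In this model a 503 followed by a 200 yields the 200 when the body can be replayed, and an error with the retry flag cleared when the buffer was truncated.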
Default returns None so existing callers see no behaviour change.
Use case: API gateways routing to fleets of replicas where transient
5xx during restarts are common. nginx (proxy_next_upstream), Envoy
(retry_policy), and HAProxy (retry-on) all support this.
Implementation: 33-line trait method addition + 15 lines wiring it
into upstream_filter. No new state, no API changes elsewhere.
Workspace builds and pingora-proxy lib tests pass unchanged.
The advisory landed 2026-04-22 and affects rustls-webpki 0.101.7 reached via reqwest 0.11.27. Same chain is present on upstream main, so this is unrelated to the response-decision hook in the previous commit. Following the existing audit.toml convention: temp ignore until the internal sync applies the reqwest bump.
Author
CI follow-up: pushed The audit failure is unrelated to this PR. The advisory landed 2026-04-22 and affects The fix follows the convention you've already set in that file ("Temp before internal sync applies dependency bumps"). Happy to drop the audit-config commit if you'd rather handle the ignore separately; the trait-method change in |
Discussion: #873
Summary
Today Pingora retries upstream calls only on connect-time errors via fail_to_connect. There is no way to drive the retry loop from a response-side signal, which means status-code-driven retries (the common 502/503/504 case during rolling restarts) cannot be expressed without working around Pingora's flow control.
This adds one minimal hook that reuses the existing retry path.
API
The hook fires once per upstream response, right after upstream_response_filter and before any bytes flow downstream.
- Returning Some(err) aborts the response.
- If err.set_retry(true) is set, the existing proxy_to_upstream retry loop kicks in: the upstream connection is dropped, upstream_peer() runs again, and the request goes to the next peer.
- Whether the request body can be replayed is decided by the existing retry buffer / error_while_proxy machinery, the same behaviour as today's connect-error retries.
The default returns None, so existing callers see no behaviour change.
Use case
API gateways routing to fleets of replicas where transient 5xx during restarts are common. nginx (proxy_next_upstream), Envoy (retry_policy), and HAProxy (retry-on) all support this. Pingora is the outlier in not being able to express it.
Example user code:
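A minimal sketch of what such user code might look like, written against hypothetical stand-ins so that it compiles on its own. ResponseHeader, Error, ProxyHooks, PassThrough, and Gateway below are illustrative assumptions, not Pingora's types; the proposed hook is async and also receives the session and context, which this sketch leaves out.

```rust
// Hypothetical, simplified stand-ins for Pingora's ResponseHeader and Error.
// The proposed hook is async and also takes the session and context.
struct ResponseHeader {
    status: u16,
}

#[derive(Debug)]
struct Error {
    context: &'static str,
    retry: bool,
}

impl Error {
    fn set_retry(&mut self, retry: bool) {
        self.retry = retry;
    }
}

trait ProxyHooks {
    /// Default mirrors the proposal: no decision, so behaviour is unchanged.
    fn upstream_response_decision(
        &self,
        _upstream_response: &ResponseHeader,
    ) -> Option<Box<Error>> {
        None
    }
}

/// A proxy that does not override the hook.
struct PassThrough;
impl ProxyHooks for PassThrough {}

/// A gateway that retries transient 5xx responses from a restarting replica.
struct Gateway;
impl ProxyHooks for Gateway {
    fn upstream_response_decision(
        &self,
        upstream_response: &ResponseHeader,
    ) -> Option<Box<Error>> {
        match upstream_response.status {
            502 | 503 | 504 => {
                let mut err = Box::new(Error {
                    context: "transient upstream 5xx",
                    retry: false,
                });
                // Ask the retry loop to pick another peer.
                err.set_retry(true);
                Some(err)
            }
            // Any other status is passed downstream untouched.
            _ => None,
        }
    }
}
```

Only 502, 503, and 504 are marked retryable here; a plain 500 is passed through, on the assumption that it more likely reflects an application error than a restarting replica.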
Why this fits Pingora's design
- e.set_retry(true) already drives the upstream_peer loop. We're not inventing a retry state machine; we're extending where it can be triggered.
- error_while_proxy already fires "after a connection is established"; this fires at the same lifecycle point but for a status-code signal instead of a transport error.
- proxy_next_upstream has the same restriction.
- Pingora already has retry_buffer_truncated() and the request-body retry buffer for connect-error retries. Status-code retries reuse it. If the buffer was truncated (large body, streamed past the cap), the existing error_while_proxy clears the retry flag, the same gating as today.
Implementation
- Trait method added in pingora-proxy/src/proxy_trait.rs (default returns None, fully documented).
- Wired into upstream_filter in pingora-proxy/src/lib.rs, right after upstream_response_filter.
Test plan
- cargo fmt --all -- --check clean.
- cargo check --workspace clean.
- cargo test -p pingora-proxy --lib passes.
- cargo build --workspace --tests clean.
- ... upstream_peer call wins). Happy to add this on review; I wanted to keep the initial diff minimal so the API can be discussed first.
Alternative considered
Extending error_while_proxy to also fire on response-header arrival, with a discriminator. One hook reused for two phases instead of two hooks. Smaller diff but less ergonomic: the same method would need both response-header and error inputs. I went with the dedicated hook for clarity, but happy to switch.
Prior art
- nginx: proxy_next_upstream
- Envoy: retry_policy
- HAProxy: retry-on