feat(proxy): add upstream_response_decision hook for status-code retries #872

Open

rickcrawford wants to merge 2 commits into cloudflare:main from rickcrawford:status-code-retries

Conversation

@rickcrawford commented on Apr 26, 2026

Discussion: #873

Summary

Today Pingora retries upstream calls only on connect-time errors via fail_to_connect. There is no way to drive the retry loop from a response-side signal, which means status-code-driven retries — the common 502/503/504 case during rolling restarts — cannot be expressed without working around Pingora's flow control.

This adds one minimal hook that reuses the existing retry path.

API

async fn upstream_response_decision(
    &self,
    _session: &mut Session,
    _upstream_response: &ResponseHeader,
    _ctx: &mut Self::CTX,
) -> Option<Box<Error>> { None }

The hook fires once per upstream response, right after upstream_response_filter and before any bytes flow downstream.

  • Returning Some(err) aborts the response.
  • If the user sets err.set_retry(true), the existing proxy_to_upstream retry loop kicks in: the upstream connection is dropped, upstream_peer() runs again, and the request goes to the next peer.
  • Reuse of a buffered request body is gated by the existing retry-buffer machinery via error_while_proxy — same behaviour as today's connect-error retries.

The default returns None, so existing callers see no behaviour change.

Use case

API gateways routing to fleets of replicas where transient 5xx during restarts are common. nginx (proxy_next_upstream), Envoy (retry_policy), and HAProxy (retry-on) all support this. Pingora is the outlier in not being able to express it.

Example user code:

async fn upstream_response_decision(
    &self,
    _session: &mut Session,
    upstream_response: &ResponseHeader,
    ctx: &mut Self::CTX,
) -> Option<Box<Error>> {
    let status = upstream_response.status.as_u16();
    // Retry transient gateway errors, capped at 3 attempts per request.
    if matches!(status, 502 | 503 | 504) && ctx.attempts < 3 {
        ctx.attempts += 1;
        let mut err = Error::new(ErrorType::HTTPStatus(status));
        err.set_retry(true); // opt into the existing upstream_peer retry loop
        return Some(err);
    }
    None
}
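
For context, a minimal sketch of a service that could own this hook. Only the attempts counter is implied by the example above; the MyGateway name, the CTX shape, and the peer address are illustrative assumptions.

use async_trait::async_trait;
use pingora_core::upstreams::peer::HttpPeer;
use pingora_core::Result;
use pingora_proxy::{ProxyHttp, Session};

pub struct MyGateway;

#[derive(Default)]
pub struct GatewayCtx {
    attempts: u16, // bumped by upstream_response_decision on each retry
}

#[async_trait]
impl ProxyHttp for MyGateway {
    type CTX = GatewayCtx;

    fn new_ctx(&self) -> Self::CTX {
        GatewayCtx::default()
    }

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        // On a retry this runs again and may select a different replica.
        Ok(Box::new(HttpPeer::new("10.0.0.1:8080", false, String::new())))
    }

    // ... upstream_response_decision as shown above ...
}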

Why this fits Pingora's design

  • Same primitive. e.set_retry(true) already drives the upstream_peer loop. We're not inventing a retry state machine; we're extending where it can be triggered.
  • Same lifecycle point. error_while_proxy already fires "after a connection is established"; this fires at the same lifecycle point but for a status-code signal instead of a transport error.
  • Header-time only. Aborting before headers reach downstream is well-defined; once response bytes are flowing to the client, no proxy can retry safely. nginx's proxy_next_upstream has the same restriction.
  • No new buffering. Pingora already has retry_buffer_truncated() and the request-body retry buffer for connect-error retries. Status-code retries reuse it. If the buffer was truncated (large body, streamed past the cap), the existing error_while_proxy clears the retry flag — same gating as today.
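
To make the last bullet concrete, a paraphrased sketch of that gating (not the actual pingora-proxy source; the receiver and method placement may differ):

// Paraphrased: inside error_while_proxy-style handling, a retryable error
// is downgraded when the buffered request body can no longer be replayed.
if err.retry() && session.as_ref().retry_buffer_truncated() {
    err.set_retry(false); // body streamed past the cap; fail instead of retrying
}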

Implementation

  • 33-line trait method addition in pingora-proxy/src/proxy_trait.rs (default returns None, fully documented).
  • 15 lines wiring it into upstream_filter in pingora-proxy/src/lib.rs, right after upstream_response_filter.
  • No new state, no API changes elsewhere.
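
Roughly, the wiring looks like this (illustrative sketch, not the exact diff):

// In the upstream response path, immediately after upstream_response_filter
// has run on the header:
if let Some(e) = self
    .inner
    .upstream_response_decision(session, &header, ctx)
    .await
{
    // Surfacing Err here routes through error_while_proxy, which decides
    // whether e.retry() can actually be honored.
    return Err(e);
}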

Test plan

  • cargo fmt --all -- --check clean.
  • cargo check --workspace clean.
  • cargo test -p pingora-proxy --lib passes.
  • cargo build --workspace --tests clean.
  • Adding an integration test that exercises the hook end-to-end is straightforward (send a request, have the first upstream answer with a retryable status, assert that the second upstream_peer call wins; sketched below). Happy to add this on review; I kept the initial diff minimal so the API can be discussed first.
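
A sketch of that test's upstream_peer, with made-up fixture addresses (a stub on :8001 answering 503, a healthy stub on :8002):

async fn upstream_peer(
    &self,
    _session: &mut Session,
    ctx: &mut Self::CTX,
) -> Result<Box<HttpPeer>> {
    // Attempt 0 hits the replica stubbed to answer 503; once the hook flags
    // a retry, this runs again and picks the healthy replica.
    let addr = if ctx.attempts == 0 { "127.0.0.1:8001" } else { "127.0.0.1:8002" };
    Ok(Box::new(HttpPeer::new(addr, false, String::new())))
}
// Assertion: the client receives 200 and ctx.attempts == 1 afterwards.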

Alternative considered

Extending error_while_proxy to also fire on response-header arrival, with a discriminator. One hook reused for two phases instead of two hooks. Smaller diff but less ergonomic — the same method would need both response-header and error inputs. I went with the dedicated hook for clarity, but happy to switch.
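
For comparison, the rejected shape would have needed a discriminated input along these lines (hypothetical):

// Hypothetical discriminator for the single-hook alternative: every
// implementor would match on the phase, even when only one matters to them.
enum UpstreamSignal<'a> {
    TransportError(&'a Error),          // today's error_while_proxy case
    ResponseHeader(&'a ResponseHeader), // the new status-code case
}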

@rickcrawford (Author) commented

CI follow-up: pushed 27bed32 which adds RUSTSEC-2026-0104 to .cargo/audit.toml.

The audit failure is unrelated to this PR. The advisory landed 2026-04-22 and affects rustls-webpki 0.101.7 reached via reqwest 0.11.27. The same chain is present on upstream main — I verified by running cargo audit against c0adfd3 locally and got the same failure with no PR changes applied.

The fix follows the convention you've already set in that file ("Temp before internal sync applies dependency bumps"). Happy to drop the audit-config commit if you'd rather handle the ignore separately; the trait-method change in 8324836 stands on its own.
