feat(proxy): add upstream_response_decision hook for status-code retries #872

Open

rickcrawford wants to merge 2 commits into cloudflare:main from rickcrawford:status-code-retries

Conversation

@rickcrawford commented on Apr 26, 2026

Discussion: #873

Summary

Today Pingora retries upstream calls only on connect-time errors via fail_to_connect. There is no way to drive the retry loop from a response-side signal, which means status-code-driven retries — the common 502/503/504 case during rolling restarts — cannot be expressed without working around Pingora's flow control.

This adds one minimal hook that reuses the existing retry path.

API

async fn upstream_response_decision(
    &self,
    _session: &mut Session,
    _upstream_response: &ResponseHeader,
    _ctx: &mut Self::CTX,
) -> Option<Box<Error>> { None }

The hook fires once per upstream response, right after upstream_response_filter and before any bytes flow downstream.

  • Returning Some(err) aborts the response.
  • If the user sets err.set_retry(true), the existing proxy_to_upstream retry loop kicks in: the upstream connection is dropped, upstream_peer() runs again, and the request goes to the next peer.
  • Reuse of a buffered request body is gated by the existing retry-buffer machinery via error_while_proxy — same behaviour as today's connect-error retries.

The default returns None, so existing callers see no behaviour change.

Use case

API gateways routing to fleets of replicas where transient 5xx during restarts are common. nginx (proxy_next_upstream), Envoy (retry_policy), and HAProxy (retry-on) all support this. Pingora is the outlier in not being able to express it.

Example user code:

async fn upstream_response_decision(
    &self,
    _session: &mut Session,
    upstream_response: &ResponseHeader,
    ctx: &mut Self::CTX,
) -> Option<Box<Error>> {
    let status = upstream_response.status.as_u16();
    // Retry transient gateway errors, capped at 3 attempts per request.
    if matches!(status, 502 | 503 | 504) && ctx.attempts < 3 {
        ctx.attempts += 1;
        let mut err = Error::new(ErrorType::HTTPStatus(status));
        err.set_retry(true); // opt into the existing upstream_peer retry loop
        return Some(err);
    }
    None
}
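
For context, a minimal sketch of a service that could own this hook. Only the attempts counter is implied by the example above; the MyGateway name, the CTX shape, and the peer address are illustrative assumptions.

use async_trait::async_trait;
use pingora_core::upstreams::peer::HttpPeer;
use pingora_core::Result;
use pingora_proxy::{ProxyHttp, Session};

pub struct MyGateway;

#[derive(Default)]
pub struct GatewayCtx {
    attempts: u16, // bumped by upstream_response_decision on each retry
}

#[async_trait]
impl ProxyHttp for MyGateway {
    type CTX = GatewayCtx;

    fn new_ctx(&self) -> Self::CTX {
        GatewayCtx::default()
    }

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        // On a retry this runs again and may select a different replica.
        Ok(Box::new(HttpPeer::new("10.0.0.1:8080", false, String::new())))
    }

    // ... upstream_response_decision as shown above ...
}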

Why this fits Pingora's design

  • Same primitive. e.set_retry(true) already drives the upstream_peer loop. We're not inventing a retry state machine; we're extending where it can be triggered.
  • Same lifecycle point. error_while_proxy already fires "after a connection is established"; this fires at the same lifecycle point but for a status-code signal instead of a transport error.
  • Header-time only. Aborting before headers reach downstream is well-defined; once response bytes are flowing to the client, no proxy can retry safely. nginx's proxy_next_upstream has the same restriction.
  • No new buffering. Pingora already has retry_buffer_truncated() and the request-body retry buffer for connect-error retries. Status-code retries reuse it. If the buffer was truncated (large body, streamed past the cap), the existing error_while_proxy clears the retry flag — same gating as today.
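
To make the last bullet concrete, a paraphrased sketch of that gating (not the actual pingora-proxy source; the receiver and method placement may differ):

// Paraphrased: inside error_while_proxy-style handling, a retryable error
// is downgraded when the buffered request body can no longer be replayed.
if err.retry() && session.as_ref().retry_buffer_truncated() {
    err.set_retry(false); // body streamed past the cap; fail instead of retrying
}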

Implementation

  • 33-line trait method addition in pingora-proxy/src/proxy_trait.rs (default returns None, fully documented).
  • 15 lines wiring it into upstream_filter in pingora-proxy/src/lib.rs, right after upstream_response_filter.
  • No new state, no API changes elsewhere.
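
Roughly, the wiring looks like this (illustrative sketch, not the exact diff):

// In the upstream response path, immediately after upstream_response_filter
// has run on the header:
if let Some(e) = self
    .inner
    .upstream_response_decision(session, &header, ctx)
    .await
{
    // Surfacing Err here routes through error_while_proxy, which decides
    // whether e.retry() can actually be honored.
    return Err(e);
}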

Test plan

  • cargo fmt --all -- --check clean.
  • cargo check --workspace clean.
  • cargo test -p pingora-proxy --lib passes.
  • cargo build --workspace --tests clean.
  • Adding an integration test that exercises the hook end-to-end is straightforward (send a request, have the first upstream answer with a retryable status, assert that the second upstream_peer call wins; sketched below). Happy to add this on review; I kept the initial diff minimal so the API can be discussed first.
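
A sketch of that test's upstream_peer, with made-up fixture addresses (a stub on :8001 answering 503, a healthy stub on :8002):

async fn upstream_peer(
    &self,
    _session: &mut Session,
    ctx: &mut Self::CTX,
) -> Result<Box<HttpPeer>> {
    // Attempt 0 hits the replica stubbed to answer 503; once the hook flags
    // a retry, this runs again and picks the healthy replica.
    let addr = if ctx.attempts == 0 { "127.0.0.1:8001" } else { "127.0.0.1:8002" };
    Ok(Box::new(HttpPeer::new(addr, false, String::new())))
}
// Assertion: the client receives 200 and ctx.attempts == 1 afterwards.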

Alternative considered

Extending error_while_proxy to also fire on response-header arrival, with a discriminator. One hook reused for two phases instead of two hooks. Smaller diff but less ergonomic — the same method would need both response-header and error inputs. I went with the dedicated hook for clarity, but happy to switch.
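
For comparison, the rejected shape would have needed a discriminated input along these lines (hypothetical):

// Hypothetical discriminator for the single-hook alternative: every
// implementor would match on the phase, even when only one matters to them.
enum UpstreamSignal<'a> {
    TransportError(&'a Error),          // today's error_while_proxy case
    ResponseHeader(&'a ResponseHeader), // the new status-code case
}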

@rickcrawford (Author) commented

CI follow-up: pushed 27bed32 which adds RUSTSEC-2026-0104 to .cargo/audit.toml.

The audit failure is unrelated to this PR. The advisory landed 2026-04-22 and affects rustls-webpki 0.101.7 reached via reqwest 0.11.27. The same chain is present on upstream main — I verified by running cargo audit against c0adfd3 locally and got the same failure with no PR changes applied.

The fix follows the convention you've already set in that file ("Temp before internal sync applies dependency bumps"). Happy to drop the audit-config commit if you'd rather handle the ignore separately; the trait-method change in 8324836 stands on its own.
