Allow status-code-driven upstream retries via a ProxyHttp hook #873

@rickcrawford

What is the problem your feature solves, or the need it fulfills?

Pingora retries upstream calls only on connect-time errors (fail_to_connect returning an error with e.set_retry(true)). There is no way to drive the same retry loop from a response-side signal, so status-code-driven retries — the common 502/503/504 case during rolling restarts — cannot be expressed cleanly.

This blocks API-gateway and reverse-proxy use cases where transient 5xx responses during deployments should land on a different replica rather than being surfaced to the client. nginx (proxy_next_upstream), Envoy (retry_policy.retry_on), and HAProxy (retry-on) all support this; Pingora is the outlier.

Describe the solution you'd like

Add one minimal ProxyHttp trait method that fires after upstream_response_filter and lets user code abort the response with a retryable error. The existing retry path then handles re-running upstream_peer(), with the request-body retry-buffer and error_while_proxy machinery deciding whether the request can be replayed.

async fn upstream_response_decision(
    &self,
    _session: &mut Session,
    _upstream_response: &ResponseHeader,
    _ctx: &mut Self::CTX,
) -> Option<Box<Error>> { None }
  • Default returns None — no behaviour change for existing callers.
  • Returning Some(err) aborts the response before any bytes flow downstream.
  • If the returned error was marked retryable with err.set_retry(true), Pingora re-runs upstream_peer() exactly as it does for a connect-time retry.

The hook fires at header-arrival time so aborting is safe; once response bytes are flowing to the client, no proxy can retry safely (same restriction as nginx's proxy_next_upstream).
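
To make the intended control flow concrete, here is a rough, Pingora-free model of how the retry loop might consume the hook's decision. Everything in it is a stand-in for illustration: `ProxyError`, `proxy_with_retries`, and the `peer_statuses` slice (the status each successive replica would return) are hypothetical names, not Pingora types or the code in the PR.

```rust
// Simplified stand-in for Pingora's error type (NOT the real `pingora::Error`).
#[derive(Debug)]
pub struct ProxyError {
    pub status: u16,
    pub retry: bool,
}

/// Models the proposed hook: look at the upstream status at header-arrival
/// time and optionally return a retryable error instead of the response.
pub fn upstream_response_decision(
    status: u16,
    attempts: &mut u32,
    max_retries: u32,
) -> Option<ProxyError> {
    if matches!(status, 502 | 503 | 504) && *attempts < max_retries {
        *attempts += 1;
        return Some(ProxyError { status, retry: true });
    }
    None
}

/// Models the retry loop: each slice element is the status the next selected
/// peer would answer with (a stand-in for re-running `upstream_peer()`).
pub fn proxy_with_retries(peer_statuses: &[u16], max_retries: u32) -> Result<u16, ProxyError> {
    let mut attempts = 0u32;
    let mut last_err: Option<ProxyError> = None;
    for &status in peer_statuses {
        match upstream_response_decision(status, &mut attempts, max_retries) {
            // Hook declined to intervene: the response flows downstream as-is.
            None => return Ok(status),
            // Retryable error: nothing was sent downstream, so try the next peer.
            Some(err) if err.retry => last_err = Some(err),
            // Non-retryable error: surface it immediately.
            Some(err) => return Err(err),
        }
    }
    // Ran out of peers; report the last retryable error, or a generic 502
    // if there were no peers at all.
    Err(last_err.unwrap_or(ProxyError { status: 502, retry: false }))
}
```

In this model, `proxy_with_retries(&[503, 502, 200], 3)` ends on the third replica's 200, while a fourth consecutive 503 is passed through to the client because the retry budget is exhausted and the hook returns `None`.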

User code example:

async fn upstream_response_decision(
    &self,
    _session: &mut Session,
    upstream_response: &ResponseHeader,
    ctx: &mut Self::CTX,
) -> Option<Box<Error>> {
    let status = upstream_response.status.as_u16();
    if matches!(status, 502 | 503 | 504) && ctx.attempts < 3 {
        ctx.attempts += 1;
        let mut err = Error::new(ErrorType::HTTPStatus(status));
        err.set_retry(true);
        return Some(err);
    }
    None
}
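
The example reads `ctx.attempts` without showing the context type. A minimal sketch of what it assumes follows; `RetryCtx`, `MAX_STATUS_RETRIES`, and `should_retry` are hypothetical names chosen for illustration. In a real proxy, `RetryCtx` would be the `ProxyHttp::CTX` type created per request by `new_ctx()`. The retry predicate is pulled out as a plain function so it can be unit-tested without Pingora.

```rust
/// Hypothetical per-request context; stands in for `ProxyHttp::CTX`.
#[derive(Debug, Default)]
pub struct RetryCtx {
    /// Number of status-driven retries already consumed for this request.
    pub attempts: u32,
}

/// Assumed retry budget, matching the `ctx.attempts < 3` check in the example.
pub const MAX_STATUS_RETRIES: u32 = 3;

/// Returns true, and consumes one unit of budget, when the status is a
/// transient gateway error (502/503/504) and budget remains.
pub fn should_retry(status: u16, ctx: &mut RetryCtx) -> bool {
    let retryable = matches!(status, 502 | 503 | 504);
    if retryable && ctx.attempts < MAX_STATUS_RETRIES {
        ctx.attempts += 1;
        true
    } else {
        false
    }
}
```

Because the context is per request, the budget starts from zero for each new downstream request and bounds the total number of upstream attempts that request can trigger.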

I have a draft PR ready: #872. Happy to revise based on design feedback.

Describe alternatives you've considered

  1. Extending error_while_proxy to also fire on response-header arrival. One hook reused for two phases instead of two hooks. Smaller diff but less ergonomic — the same method would need both response-header and error inputs. The dedicated hook is clearer at the API surface.

  2. User-side workaround buffering responses outside Pingora. Doable in user code: buffer small upstream responses, inspect status, and re-issue by recursion. Costs ~1 day per consumer, only works for responses under the buffer cap, and is obsolete once Pingora ships a first-class hook. We've prototyped this and the workaround code is the kind of thing we'd rather not ship.

  3. Status-aware fail_to_connect. Doesn't fit — fail_to_connect is conceptually about connection failures, and broadening its semantics would muddy the trait contract.

Additional context

Prior art:

The implementation is a 33-line trait method addition + 15 lines wiring it into upstream_filter. No new state, no API changes elsewhere. Default behaviour is identical to today. Workspace builds and pingora-proxy lib tests pass unchanged. Diff is in the linked PR.
