Add background revalidation for `stale-while-revalidate` (RFC 5861) in response_cache

### Is your feature request related to a problem? Please describe.

`response_cache` already has solid RFC 5861 groundwork: it **parses**, **serializes**, and **merges** the `stale-while-revalidate` directive across subgraph calls today. The directive round-trips cleanly through the router to downstream CDNs. This issue proposes completing that story by adding the serve-stale *runtime* so the directive takes full effect inside the router's own cache.

The opportunity: when an upstream `Cache-Control` declares a `stale-while-revalidate` window, the router can serve the slightly-stale entry immediately while refreshing it in the background — smoothing the load spike that otherwise lands at the moment of TTL expiry. This is the standard HTTP-caching tool for absorbing expiry under sustained load, and operators who already set `stale-while-revalidate` on their subgraph responses would benefit from the router honoring it end-to-end, the same way a CDN does.

The existing foundation, verified against `main` at `a412b4d6e`:

- **Parsed** into `Option<u64>` on the `CacheControl` struct — `apollo-router/src/plugins/response_cache/cache_control.rs:25`, parse at `:109`.
- **Serialized** back into the outgoing `Cache-Control` header — `cache_control.rs:210-217`, so the directive already propagates downstream.
- **Merged** across subgraph calls — `merge_inner` takes the minimum `stale-while-revalidate` across combined headers — `cache_control.rs:308-324`.
- A clearly-marked extension point in `can_use()` — `cache_control.rs:384-386`:
  ```rust
  // FIXME: we don't honor stale-while-revalidate yet
  // !expired || self.stale_while_revalidate
  !expired && !self.no_cache
  ```

Because the parsing/serialization/merge layer is already built, this is a well-scoped enhancement on top of existing structure, not a from-scratch feature. The remaining work is the serve-stale runtime, described below.

### Describe the solution you'd like

Concrete proposal below, intentionally specific so it's easy to refine. None of the names or shapes are load-bearing — they're starting points. The serve-stale runtime adds four pieces, in dependency order:

**1. Extend the stored entry's lifetime through the stale window.**

The Redis entry is currently written with `Expiration::EXAT((now + document.expire.as_secs()))` at `storage/redis.rs:384`, where `document.expire` is `cache_control.ttl()`, i.e. `max-age` minus the current age (set at `plugin.rs:1706` and `:2569`). The cache-tag z-set score mirrors this expiry at `storage/redis.rs:308`. To serve an entry during its stale window, the stored TTL becomes `ttl + stale_while_revalidate`, so the entry remains available for that window instead of disappearing at the end of `max-age`.
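A minimal sketch of the TTL arithmetic, assuming nothing about the real storage code. `storage_ttl` and its parameter names are hypothetical: `fresh_ttl` stands in for `document.expire`, `swr_secs` for the parsed directive, and `max_window` for the optional cap from the strawman config further down.

```rust
use std::time::Duration;

/// Hypothetical helper (not an existing router function): how long the
/// entry should stay in storage once the stale window is honored.
pub fn storage_ttl(
    fresh_ttl: Duration,
    swr_secs: Option<u64>,
    max_window: Option<Duration>,
) -> Duration {
    // No directive means no stale window, so the TTL is unchanged.
    let window = swr_secs.map(Duration::from_secs).unwrap_or(Duration::ZERO);
    // Clamp the upstream-declared window to the operator's cap, if any.
    let window = match max_window {
        Some(cap) => window.min(cap),
        None => window,
    };
    // Saturate rather than overflow on absurdly large upstream values.
    fresh_ttl.saturating_add(window)
}
```

With this shape the Redis expiry no longer marks the end of freshness, so the freshness check in step 2 has to work from the entry's age rather than from the key's presence.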

**2. Make the freshness check tri-state.**

`can_use()` returns `bool` today (`cache_control.rs:376`). Serve-stale introduces a third outcome:
- `Fresh` — within `max-age`, serve directly (today's `true`).
- `Stale` — past `max-age` but within the `stale-while-revalidate` window: serve **and** schedule a background refresh.
- `Expired` — past the stale window or `no-cache`: fetch fresh (today's `false`).
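The three outcomes above could be sketched as follows. `Freshness`, `Policy`, and `classify` are illustrative names, and `Policy` is a simplified stand-in for the relevant `CacheControl` fields, not the real struct. The boundary choice (an entry whose age equals `max-age` is no longer fresh) is an assumption that follows the usual HTTP freshness rule.

```rust
/// Hypothetical tri-state replacement for the `bool` returned by `can_use()`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Freshness {
    Fresh,
    Stale,
    Expired,
}

/// Simplified stand-in for the fields of `CacheControl` that matter here.
pub struct Policy {
    pub max_age: u64,
    pub stale_while_revalidate: Option<u64>,
    pub no_cache: bool,
}

/// Classify an entry given how many seconds have elapsed since it was stored.
pub fn classify(policy: &Policy, elapsed_secs: u64) -> Freshness {
    // `no-cache` always forces a fetch, matching today's `!self.no_cache`.
    if policy.no_cache {
        return Freshness::Expired;
    }
    if elapsed_secs < policy.max_age {
        return Freshness::Fresh;
    }
    match policy.stale_while_revalidate {
        // Past max-age but still inside the stale window.
        Some(window) if elapsed_secs < policy.max_age.saturating_add(window) => Freshness::Stale,
        // No directive, or the window has closed.
        _ => Freshness::Expired,
    }
}
```

Entries without the directive never reach `Stale`, so they behave exactly as they do with the current boolean check.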

**3. Add a serve-stale branch to the lookup paths.**

The three `can_use()` call sites — `cache_lookup_root` (`plugin.rs:1290`), `cache_lookup_entities` (`:1551`), and `filter_representations` (`:2394`) — currently choose between serve and fetch. The enhancement adds a third path: return the stale value to the client **and** enqueue a background revalidation for that key.
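As a rough illustration of the extra branch, the decision at each call site might reduce to something like the sketch below. `LookupOutcome` and `decide` are hypothetical names, the real lookup functions carry far more state, and `swr_enabled` stands in for the opt-in flag from the strawman config.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Freshness {
    Fresh,
    Stale,
    Expired,
}

/// Hypothetical result of a cache lookup once serve-stale exists.
#[derive(Debug, PartialEq)]
pub enum LookupOutcome<T> {
    /// Fresh hit: serve the cached value, nothing else to do.
    Serve(T),
    /// Stale hit: serve the cached value and enqueue a background refresh.
    ServeStaleAndRevalidate(T),
    /// Miss or expired: fetch from the subgraph on the request path.
    Fetch,
}

pub fn decide<T>(cached: Option<(T, Freshness)>, swr_enabled: bool) -> LookupOutcome<T> {
    match cached {
        Some((value, Freshness::Fresh)) => LookupOutcome::Serve(value),
        // Only taken when the operator opted in; otherwise stale means fetch,
        // which preserves today's behavior.
        Some((value, Freshness::Stale)) if swr_enabled => {
            LookupOutcome::ServeStaleAndRevalidate(value)
        }
        _ => LookupOutcome::Fetch,
    }
}
```

Keeping the opt-in check inside the decision means a disabled feature falls through to the existing fetch path without touching the call sites' other logic.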

**4. Add a background revalidation executor with in-flight dedupe.**

On the `Stale` path, a background task re-runs the subgraph fetch and writes the result back to cache. The primitives are already available — the cache service holds a `subgraph::BoxCloneService` (`plugin.rs:734`, already `Clone`) and `subgraph::Request` is `Clone`. The executor would:
- Run off the user's response path (the user already received the stale response with no added latency).
- Use a scoped request context so the background fetch stays out of the user's span/trace.
- Write the refreshed entry back via the existing cache store path.
- Respect the plugin's shutdown signal.
- **Dedupe in-flight refreshes (singleflight)** so concurrent requests for the same stale key trigger a single background fetch — preserving the load-smoothing benefit. This is a small net-new addition; there isn't a singleflight primitive in `response_cache` yet.
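A minimal in-process sketch of the dedupe piece, using only the standard library. `InFlight` and `RefreshGuard` are hypothetical names, and keys are plain strings here purely for illustration. This is a claim-style guard rather than a full singleflight that shares results: callers that lose the race already hold the stale value, so they only need to know that a refresh is under way.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical registry of cache keys with a background refresh in flight.
#[derive(Clone, Default)]
pub struct InFlight {
    keys: Arc<Mutex<HashSet<String>>>,
}

/// Held by the task doing the refresh; releases the key when dropped,
/// including when the task fails or is cancelled.
pub struct RefreshGuard {
    keys: Arc<Mutex<HashSet<String>>>,
    key: String,
}

impl InFlight {
    /// Returns a guard if the caller won the right to refresh `key`,
    /// or `None` if another refresh for the same key is already running.
    pub fn try_begin(&self, key: &str) -> Option<RefreshGuard> {
        let mut keys = self.keys.lock().unwrap_or_else(|p| p.into_inner());
        if keys.insert(key.to_string()) {
            Some(RefreshGuard {
                keys: Arc::clone(&self.keys),
                key: key.to_string(),
            })
        } else {
            None
        }
    }
}

impl Drop for RefreshGuard {
    fn drop(&mut self) {
        let mut keys = self.keys.lock().unwrap_or_else(|p| p.into_inner());
        keys.remove(&self.key);
    }
}
```

On the `Stale` path, a request would call `try_begin` and spawn the background fetch only when it gets a guard, moving the guard into the task so the key is released when the task ends. This dedupes within one router process only; replicas would still refresh independently.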

Strawman config (opt-in, so existing behavior is preserved by default):

```yaml
response_cache:
  enabled: true
  subgraph:
    all:
      stale_while_revalidate:
        enabled: false          # opt-in; serving stale data is a deliberate choice
        max_window: 60s         # optional cap on the SWR window the router will honor,
                                # independent of what upstream declares
```

### Describe alternatives you've considered

**Serve stale at the gate only (`!expired || self.stale_while_revalidate`).** This returns the stale entry correctly, but without a background refresh the router would keep serving it until the window closes and then fetch synchronously, which only postpones the expiry spike. Pairing serve-stale with background revalidation is what delivers the smooth-refresh behavior the directive is designed for, so the executor (step 4) is the valuable half.

**Coprocessor / rhai serve-stale.** The serve-stale decision and the background write-back both live inside the cache plugin's lookup path, below the coprocessor/rhai boundary, so this is best handled natively in `response_cache`.

**Rely on the CDN's `stale-while-revalidate`.** The directive already round-trips to the CDN, so a CDN can serve stale at the edge — a great complement. This proposal adds the same benefit on the router→subgraph hop, where origin load concentrates. The two compose well.

**Tune TTLs higher.** A blunt trade of staleness ceiling for miss rate, without background refresh or a bounded staleness window. `stale-while-revalidate` is the precise mechanism that supersedes this workaround.

### Additional context

**`stale-if-error` (RFC 5861) is a natural sibling** that reuses most of this foundation: it is also already parsed (`cache_control.rs:45`, `:139`) and serialized (`:247`). The storage-TTL extension (step 1), tri-state check (step 2), and serve-stale branch (step 3) apply to both; the difference is the trigger (origin error vs. TTL expiry), and `stale-if-error` doesn't require the background-refresh executor (step 4). They're separate concerns with separate semantics and are tracked separately, but whoever implements one can cheaply extend to the other in the same PR. Flagging this so the shared groundwork isn't built twice.

**`response_cache` is still preview** (`HIDDEN_FROM_CONFIG_JSON_SCHEMA = true`, `plugin.rs:281`), so additive opt-in config carries low compatibility risk.

**Open design questions, with my current lean:**

- *Opt-in vs. always-on*: opt-in (`enabled: false` default), since serving stale is a deliberate operator choice. Open to honoring the directive whenever upstream sends it if that's preferred.
- *Singleflight scope*: in-process dedupe (`Arc<DashMap<CacheKey, ...>>` or a singleflight crate) is simplest and removes per-replica stampedes; a distributed Redis lock is stronger under horizontal scale at the cost of a round-trip. My lean: start in-process, document the multi-replica behavior, revisit if needed.
- *Background-fetch failure handling*: keep serving stale until the window closes (RFC 5861 §3 semantics) rather than evicting on failure — this is where the `stale-if-error` relationship is most relevant.
- *Background-task context*: a scoped context so the refresh stays off the user's trace, while carrying over any request context the subgraph fetch needs (auth, etc.). The subtlest correctness question; input welcome from anyone who's wired up similar background tasks in the router.
- *Telemetry*: a distinct `cache.status` value for stale-served (e.g. `stale` / `revalidating`) so hit-ratio dashboards stay meaningful, plus counters for background-refresh success/failure.
- *Config granularity*: per-subgraph (shown above) vs. global. Per-subgraph matches the existing `response_cache.subgraph.{all,subgraphs.NAME}` shape.

**Source references** (verified against `main` at `a412b4d6e`, 2026-05-29):
- Parse / serialize / merge / `can_use` extension point: `apollo-router/src/plugins/response_cache/cache_control.rs:25,109,210-217,279-324,376-386`
- `can_use()` call sites: `plugin.rs:1290` (root), `:1551` (entities), `:2394` (representation filter)
- Cache store TTL source: `plugin.rs:1706,2569`; Redis `EXAT` write: `storage/redis.rs:384`; cache-tag z-set score: `storage/redis.rs:308`
- Background-refresh primitives available: `subgraph::BoxCloneService` at `plugin.rs:734`; `subgraph::Request: Clone`
- Plugin still preview: `plugin.rs:281`

**Implementation intent:** I plan to follow this issue up with a PR once the design direction (singleflight scope, `stale-if-error` relationship, context handling) is settled.

