Add background revalidation for stale-while-revalidate (RFC 5861) in response_cache #9560

@ebylund

Is your feature request related to a problem? Please describe.

response_cache already has solid RFC 5861 groundwork: it parses, serializes, and merges the stale-while-revalidate directive across subgraph calls today. The directive round-trips cleanly through the router to downstream CDNs. This issue proposes completing that story by adding the serve-stale runtime so the directive takes full effect inside the router's own cache.

The opportunity: when an upstream Cache-Control declares a stale-while-revalidate window, the router can serve the slightly-stale entry immediately while refreshing it in the background — smoothing the load spike that otherwise lands at the moment of TTL expiry. This is the standard HTTP-caching tool for absorbing expiry under sustained load, and operators who already set stale-while-revalidate on their subgraph responses would benefit from the router honoring it end-to-end, the same way a CDN does.

The existing foundation, verified against main at a412b4d6e:

  • Parsed into Option<u64> on the CacheControl struct — apollo-router/src/plugins/response_cache/cache_control.rs:25, parse at :109.
  • Serialized back into the outgoing Cache-Control header — cache_control.rs:210-217, so the directive already propagates downstream.
  • Merged across subgraph calls — merge_inner takes the minimum stale-while-revalidate across combined headers — cache_control.rs:308-324.
  • A clearly-marked extension point in can_use(), cache_control.rs:384-386:
    // FIXME: we don't honor stale-while-revalidate yet
    // !expired || self.stale_while_revalidate
    !expired && !self.no_cache

Because the parsing/serialization/merge layer is already built, this is a well-scoped enhancement that builds on existing structure rather than a from-scratch feature. The remaining work is the serve-stale runtime, described below.

Describe the solution you'd like

Concrete proposal below, intentionally specific so it's easy to refine. None of the names or shapes are load-bearing — they're starting points. The serve-stale runtime adds four pieces, in dependency order:

1. Extend the stored entry's lifetime through the stale window.

The Redis entry is currently written with Expiration::EXAT((now + document.expire.as_secs())), where document.expire is cache_control.ttl(), that is max-age (minus age), at storage/redis.rs:384 (the cache-tag z-set score mirrors this at :308; document.expire is set at plugin.rs:1706 and :2569). To serve an entry during its stale window, the stored TTL becomes ttl + stale_while_revalidate so the entry remains available for that window.
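A minimal sketch of that arithmetic, assuming the stale window is simply added to the fresh TTL. The helper names storage_ttl and exat_secs are hypothetical and do not exist in the router; fresh_ttl stands in for document.expire and swr_secs for the parsed Option<u64>.

```rust
use std::time::Duration;

/// Hypothetical helper: how long the stored entry should live.
/// `fresh_ttl` stands in for `document.expire`; `swr_secs` stands in for the
/// parsed `stale_while_revalidate` value (seconds), if upstream sent one.
fn storage_ttl(fresh_ttl: Duration, swr_secs: Option<u64>) -> Duration {
    // No directive means no extension; saturate instead of overflowing.
    fresh_ttl.saturating_add(Duration::from_secs(swr_secs.unwrap_or(0)))
}

/// Hypothetical helper: absolute expiry in Unix seconds, the shape an
/// EXAT-style write expects.
fn exat_secs(now_unix_secs: u64, fresh_ttl: Duration, swr_secs: Option<u64>) -> u64 {
    now_unix_secs.saturating_add(storage_ttl(fresh_ttl, swr_secs).as_secs())
}
```

With max-age=60 and stale-while-revalidate=30, the entry would be stored for 90 seconds. The freshness check, not the storage TTL, then decides whether a given read is fresh or stale.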

2. Make the freshness check tri-state.

can_use() returns bool today (cache_control.rs:376). Serve-stale introduces a third outcome:

  • Fresh — within max-age, serve directly (today's true).
  • Stale — past max-age but within the stale-while-revalidate window: serve and schedule a background refresh.
  • Expired — past the stale window or no-cache: fetch fresh (today's false).
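The three outcomes above could be sketched as follows. The Freshness enum and the freshness function are hypothetical names, and the inputs are plain integers rather than the real CacheControl fields; the exact boundary handling (strictly less than max-age counts as fresh) is an assumption, not taken from can_use().

```rust
/// Hypothetical tri-state result replacing the `bool` from `can_use()`.
#[derive(Debug, PartialEq, Eq)]
enum Freshness {
    /// Within max-age: serve directly.
    Fresh,
    /// Past max-age but inside the stale window: serve and refresh in background.
    Stale,
    /// Past the stale window, or no-cache: fetch synchronously.
    Expired,
}

/// Classify an entry by its age. All durations are in seconds.
fn freshness(age: u64, max_age: u64, swr: Option<u64>, no_cache: bool) -> Freshness {
    if no_cache {
        return Freshness::Expired;
    }
    if age < max_age {
        return Freshness::Fresh;
    }
    match swr {
        // The stale window starts where max-age ends.
        Some(window) if age < max_age.saturating_add(window) => Freshness::Stale,
        _ => Freshness::Expired,
    }
}
```

Without the directive, the function degrades to today's two-state behavior: Fresh maps to true and Expired to false.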

3. Add a serve-stale branch to the lookup paths.

The three can_use() call sites — cache_lookup_root (plugin.rs:1290), cache_lookup_entities (:1551), and filter_representations (:2394) — currently choose between serve and fetch. The enhancement adds a third path: return the stale value to the client and enqueue a background revalidation for that key.

4. Add a background revalidation executor with in-flight dedupe.

On the Stale path, a background task re-runs the subgraph fetch and writes the result back to cache. The primitives are already available — the cache service holds a subgraph::BoxCloneService (plugin.rs:734, already Clone) and subgraph::Request is Clone. The executor would:

  • Run off the user's response path (the user already received the stale response with no added latency).
  • Use a scoped request context so the background fetch stays out of the user's span/trace.
  • Write the refreshed entry back via the existing cache store path.
  • Respect the plugin's shutdown signal.
  • Dedupe in-flight refreshes (singleflight) so concurrent requests for the same stale key trigger a single background fetch — preserving the load-smoothing benefit. This is a small net-new addition; there isn't a singleflight primitive in response_cache yet.
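To make the dedupe bullet concrete, here is one possible in-process shape using only the standard library. InFlight and RefreshGuard are hypothetical names, keys are plain strings rather than the plugin's cache key type, and a real implementation would hold the guard inside the spawned background task.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical registry of keys that currently have a refresh running.
#[derive(Clone, Default)]
struct InFlight {
    keys: Arc<Mutex<HashSet<String>>>,
}

/// Held for the duration of one background refresh; releases the key on drop.
struct RefreshGuard {
    key: String,
    keys: Arc<Mutex<HashSet<String>>>,
}

impl InFlight {
    /// Returns a guard if the caller should start the refresh, or `None` if
    /// another request already started one for this key.
    fn try_begin(&self, key: &str) -> Option<RefreshGuard> {
        let mut keys = self.keys.lock().unwrap();
        if keys.insert(key.to_string()) {
            Some(RefreshGuard {
                key: key.to_string(),
                keys: Arc::clone(&self.keys),
            })
        } else {
            None
        }
    }
}

impl Drop for RefreshGuard {
    fn drop(&mut self) {
        // Release the key even if the refresh failed, so a later stale hit
        // can retry. A poisoned lock is ignored rather than panicking in drop.
        if let Ok(mut keys) = self.keys.lock() {
            keys.remove(&self.key);
        }
    }
}
```

A request that receives None still serves the stale entry; it just skips enqueueing a second fetch. This dedupes within one router process only, so each replica may still issue its own refresh.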

Strawman config (opt-in, so existing behavior is preserved by default):

response_cache:
  enabled: true
  subgraph:
    all:
      stale_while_revalidate:
        enabled: false          # opt-in; serving stale data is a deliberate choice
        max_window: 60s         # optional cap on the SWR window the router will honor,
                                # independent of what upstream declares
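
As a rough illustration of how these two settings might combine with the upstream directive: the field names enabled and max_window come from the strawman above, while the SwrConfig struct and the effective_window function are hypothetical, and config deserialization is left out.

```rust
use std::time::Duration;

/// Hypothetical in-memory form of the strawman `stale_while_revalidate` block.
struct SwrConfig {
    enabled: bool,
    max_window: Option<Duration>,
}

/// The stale window the router would honor for one entry, if any.
/// `upstream_swr_secs` is the value parsed from the subgraph's Cache-Control.
fn effective_window(cfg: &SwrConfig, upstream_swr_secs: Option<u64>) -> Option<Duration> {
    if !cfg.enabled {
        return None; // opt-in: disabled means today's behavior
    }
    // No directive from upstream means nothing to honor.
    let upstream = Duration::from_secs(upstream_swr_secs?);
    Some(match cfg.max_window {
        Some(cap) => upstream.min(cap), // never exceed the operator's cap
        None => upstream,
    })
}
```

Under this reading, an upstream stale-while-revalidate=300 with max_window: 60s yields a 60-second window, and the router never invents a window that upstream did not declare.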

Describe alternatives you've considered

Serve stale at the gate only (!expired || self.stale_while_revalidate). This returns stale entries correctly, but without the background refresh the entry would be served stale until the window closes and then fetched synchronously. Pairing serve-stale with background revalidation is what delivers the smooth-refresh behavior the directive is designed for, so the executor (step 4) is the valuable half.

Coprocessor / rhai serve-stale. The serve-stale decision and the background write-back both live inside the cache plugin's lookup path, below the coprocessor/rhai boundary, so this is best handled natively in response_cache.

Rely on the CDN's stale-while-revalidate. The directive already round-trips to the CDN, so a CDN can serve stale at the edge — a great complement. This proposal adds the same benefit on the router→subgraph hop, where origin load concentrates. The two compose well.

Tune TTLs higher. A blunt trade of staleness ceiling for miss rate, without background refresh or a bounded staleness window. stale-while-revalidate is the precise mechanism that supersedes this workaround.

Additional context

stale-if-error (RFC 5861) is a natural sibling that reuses most of this foundation: it is also already parsed (cache_control.rs:45, :139) and serialized (:247). The storage-TTL extension (step 1), tri-state check (step 2), and serve-stale branch (step 3) apply to both; the difference is the trigger (origin error vs. TTL expiry), and stale-if-error doesn't require the background-refresh executor. They're separate concerns with separate semantics and are tracked separately, but whoever implements one can cheaply extend to the other in the same PR. Flagging this so the shared groundwork isn't built twice.

response_cache is still preview (HIDDEN_FROM_CONFIG_JSON_SCHEMA = true, plugin.rs:281), so additive opt-in config carries low compatibility risk.

Open design questions, with my current lean:

  • Opt-in vs. always-on: opt-in (enabled: false default), since serving stale is a deliberate operator choice. Open to honoring the directive whenever upstream sends it if that's preferred.
  • Singleflight scope: in-process dedupe (Arc<DashMap<CacheKey, ...>> or a singleflight crate) is simplest and removes per-replica stampedes; a distributed Redis lock is stronger under horizontal scale at the cost of a round-trip. My lean: start in-process, document the multi-replica behavior, revisit if needed.
  • Background-fetch failure handling: keep serving stale until the window closes (RFC 5861 §3 semantics) rather than evicting on failure — this is where the stale-if-error relationship is most relevant.
  • Background-task context: a scoped context so the refresh stays off the user's trace, while carrying over any request context the subgraph fetch needs (auth, etc.). The subtlest correctness question; input welcome from anyone who's wired up similar background tasks in the router.
  • Telemetry: a distinct cache.status value for stale-served (e.g. stale / revalidating) so hit-ratio dashboards stay meaningful, plus counters for background-refresh success/failure.
  • Config granularity: per-subgraph (shown above) vs. global. Per-subgraph matches the existing response_cache.subgraph.{all,subgraphs.NAME} shape.

Source references (verified against main at a412b4d6e, 2026-05-29):

  • Parse / serialize / merge / can_use extension point: apollo-router/src/plugins/response_cache/cache_control.rs:25,109,210-217,279-324,376-386
  • can_use() call sites: plugin.rs:1290 (root), :1551 (entities), :2394 (representation filter)
  • Cache store TTL source: plugin.rs:1706,2569; Redis EXAT write: storage/redis.rs:384; cache-tag z-set score: storage/redis.rs:308
  • Background-refresh primitives available: subgraph::BoxCloneService at plugin.rs:734; subgraph::Request: Clone
  • Plugin still preview: plugin.rs:281

Implementation intent: I plan to follow this issue up with a PR once the design direction (singleflight scope, stale-if-error relationship, context handling) is settled.
