Is your feature request related to a problem? Please describe.
response_cache already has solid RFC 5861 groundwork: it parses, serializes, and merges the stale-while-revalidate directive across subgraph calls today. The directive round-trips cleanly through the router to downstream CDNs. This issue proposes completing that story by adding the serve-stale runtime so the directive takes full effect inside the router's own cache.
The opportunity: when an upstream Cache-Control declares a stale-while-revalidate window, the router can serve the slightly-stale entry immediately while refreshing it in the background — smoothing the load spike that otherwise lands at the moment of TTL expiry. This is the standard HTTP-caching tool for absorbing expiry under sustained load, and operators who already set stale-while-revalidate on their subgraph responses would benefit from the router honoring it end-to-end, the same way a CDN does.
The existing foundation, verified against main at a412b4d6e:
- Parsed into Option<u64> on the CacheControl struct: apollo-router/src/plugins/response_cache/cache_control.rs:25, parse at :109.
- Serialized back into the outgoing Cache-Control header (cache_control.rs:210-217), so the directive already propagates downstream.
- Merged across subgraph calls: merge_inner takes the minimum stale-while-revalidate across combined headers (cache_control.rs:308-324).
- A clearly marked extension point in can_use() (cache_control.rs:384-386):
// FIXME: we don't honor stale-while-revalidate yet
// !expired || self.stale_while_revalidate
!expired && !self.no_cache
Because the parsing/serialization/merge layer is already built, this is a well-scoped enhancement that builds on existing structure rather than a from-scratch feature. The remaining work is the serve-stale runtime, described below.
Describe the solution you'd like
Concrete proposal below, intentionally specific so it's easy to refine. None of the names or shapes are load-bearing — they're starting points. The serve-stale runtime adds four pieces, in dependency order:
1. Extend the stored entry's lifetime through the stale window.
The Redis entry is currently written with Expiration::EXAT((now + document.expire.as_secs())), where document.expire is cache_control.ttl() — max-age (minus age) — at storage/redis.rs:384 (cache-tag z-set score mirrors this at :308; document.expire set at plugin.rs:1706 and :2569). To serve an entry during its stale window, the stored TTL becomes ttl + stale_while_revalidate so the entry remains available for that window.
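The arithmetic in step 1 could look roughly like the sketch below. The helper name `storage_expiry_secs` and the plain-seconds inputs are hypothetical stand-ins, not the plugin's actual types; `ttl` plays the role of `document.expire` and `swr` the parsed directive value.

```rust
use std::time::Duration;

/// Hypothetical helper: absolute Unix expiry (in seconds) for the stored entry.
/// `ttl` stands in for `document.expire`; `swr` is the parsed
/// stale-while-revalidate value, if upstream sent one.
fn storage_expiry_secs(now_secs: u64, ttl: Duration, swr: Option<u64>) -> u64 {
    // Saturating adds so an absurdly large directive value cannot overflow.
    now_secs
        .saturating_add(ttl.as_secs())
        .saturating_add(swr.unwrap_or(0))
}
```

With no directive present, this reduces to today's `now + ttl`, so entries without a stale window would keep their current lifetime.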
2. Make the freshness check tri-state.
can_use() returns bool today (cache_control.rs:376). Serve-stale introduces a third outcome:
- Fresh: within max-age, serve directly (today's true).
- Stale: past max-age but within the stale-while-revalidate window; serve and schedule a background refresh.
- Expired: past the stale window, or no-cache; fetch fresh (today's false).
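One possible shape for the tri-state check, as a minimal sketch. The `Freshness` enum, the `classify` function, and its plain-seconds parameters are hypothetical; the strict `<` boundary is also an assumption and would need to match whatever the existing expiry check does.

```rust
/// Hypothetical tri-state result replacing the `bool` returned by `can_use()`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Freshness {
    Fresh,
    Stale,
    Expired,
}

/// Classify an entry from its elapsed age and directives (all in seconds).
fn classify(elapsed: u64, max_age: u64, swr: Option<u64>, no_cache: bool) -> Freshness {
    // no-cache always forces a fetch, regardless of age.
    if no_cache {
        return Freshness::Expired;
    }
    if elapsed < max_age {
        return Freshness::Fresh;
    }
    match swr {
        // Past max-age but still inside the stale window.
        Some(window) if elapsed < max_age.saturating_add(window) => Freshness::Stale,
        // No directive, or the stale window has closed.
        _ => Freshness::Expired,
    }
}
```

Call sites that only need today's behavior could treat `Stale` like `Expired`, which keeps the change additive while the serve-stale branch is wired in.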
3. Add a serve-stale branch to the lookup paths.
The three can_use() call sites — cache_lookup_root (plugin.rs:1290), cache_lookup_entities (:1551), and filter_representations (:2394) — currently choose between serve and fetch. The enhancement adds a third path: return the stale value to the client and enqueue a background revalidation for that key.
4. Add a background revalidation executor with in-flight dedupe.
On the Stale path, a background task re-runs the subgraph fetch and writes the result back to cache. The primitives are already available — the cache service holds a subgraph::BoxCloneService (plugin.rs:734, already Clone) and subgraph::Request is Clone. The executor would:
- Run off the user's response path (the user already received the stale response with no added latency).
- Use a scoped request context so the background fetch stays out of the user's span/trace.
- Write the refreshed entry back via the existing cache store path.
- Respect the plugin's shutdown signal.
- Dedupe in-flight refreshes (singleflight) so concurrent requests for the same stale key trigger a single background fetch, preserving the load-smoothing benefit. This is a small net-new addition; there isn't a singleflight primitive in response_cache yet.
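The in-flight dedupe from the last bullet can be sketched with the standard library alone. `InFlight`, `RefreshGuard`, and `try_begin` are hypothetical names, and a `String` key stands in for the plugin's real cache key type; a production version might prefer a concurrent map.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical in-process registry of cache keys with a refresh in flight.
#[derive(Clone, Default)]
struct InFlight {
    keys: Arc<Mutex<HashSet<String>>>,
}

/// Guard that releases the key when the refresh finishes or panics.
struct RefreshGuard {
    keys: Arc<Mutex<HashSet<String>>>,
    key: String,
}

impl InFlight {
    /// Returns a guard if the caller won the right to refresh `key`,
    /// or `None` if another refresh for that key is already running.
    fn try_begin(&self, key: &str) -> Option<RefreshGuard> {
        let mut keys = self.keys.lock().unwrap_or_else(|e| e.into_inner());
        if keys.insert(key.to_owned()) {
            Some(RefreshGuard {
                keys: Arc::clone(&self.keys),
                key: key.to_owned(),
            })
        } else {
            None
        }
    }
}

impl Drop for RefreshGuard {
    fn drop(&mut self) {
        // Tolerate a poisoned lock so a panicking refresh still frees its key.
        let mut keys = self.keys.lock().unwrap_or_else(|e| e.into_inner());
        keys.remove(&self.key);
    }
}
```

A lookup on the Stale path would call `try_begin(key)`, spawn the background fetch only when it gets a guard, and move the guard into that task so the key is released when the refresh completes or fails.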
Strawman config (opt-in, so existing behavior is preserved by default):
response_cache:
  enabled: true
  subgraph:
    all:
      stale_while_revalidate:
        enabled: false   # opt-in; serving stale data is a deliberate choice
        max_window: 60s  # optional cap on the SWR window the router will honor,
                         # independent of what upstream declares
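How the two strawman settings might combine with the upstream directive, as a sketch only. `SwrConfig` and `effective_window` are hypothetical names mirroring the config above, not existing router types.

```rust
use std::time::Duration;

/// Hypothetical mirror of the strawman `stale_while_revalidate` config block.
struct SwrConfig {
    enabled: bool,
    max_window: Option<Duration>,
}

/// Window the router would honor: none when the feature is disabled or when
/// upstream sent no directive, otherwise the upstream value clamped to `max_window`.
fn effective_window(cfg: &SwrConfig, upstream_swr: Option<u64>) -> Option<Duration> {
    if !cfg.enabled {
        return None;
    }
    let upstream = Duration::from_secs(upstream_swr?);
    Some(match cfg.max_window {
        Some(cap) => upstream.min(cap),
        None => upstream,
    })
}
```

Under these assumptions, an upstream `stale-while-revalidate=300` with `max_window: 60s` would yield a 60-second window, and the default `enabled: false` would yield none.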
Describe alternatives you've considered
Serve stale at the gate only (!expired || self.stale_while_revalidate). Returns stale correctly, but without the background refresh the entry would serve stale until the window closes and then fetch synchronously. Pairing serve-stale with background revalidation is what delivers the smooth-refresh behavior the directive is designed for, so the executor (#4) is the valuable half.
Coprocessor / rhai serve-stale. The serve-stale decision and the background write-back both live inside the cache plugin's lookup path, below the coprocessor/rhai boundary, so this is best handled natively in response_cache.
Rely on the CDN's stale-while-revalidate. The directive already round-trips to the CDN, so a CDN can serve stale at the edge — a great complement. This proposal adds the same benefit on the router→subgraph hop, where origin load concentrates. The two compose well.
Tune TTLs higher. A blunt trade of staleness ceiling for miss rate, without background refresh or a bounded staleness window. stale-while-revalidate is the precise mechanism that supersedes this workaround.
Additional context
stale-if-error (RFC 5861) is a natural sibling that reuses most of this foundation — it's also already parsed (cache_control.rs:45, :139) and serialized (:247). The storage-TTL extension (#1), tri-state check (#2), and serve-stale branch (#3) apply to both; the difference is the trigger (origin error vs. TTL expiry), and stale-if-error doesn't require the background-refresh executor. They're separate concerns with separate semantics and are tracked separately, but whoever implements one can cheaply extend to the other in the same PR — flagging so the shared groundwork isn't built twice.
response_cache is still preview (HIDDEN_FROM_CONFIG_JSON_SCHEMA = true, plugin.rs:281), so additive opt-in config carries low compatibility risk.
Open design questions, with my current lean:
- Opt-in vs. always-on: opt-in (enabled: false default), since serving stale is a deliberate operator choice. Open to honoring the directive whenever upstream sends it if that's preferred.
- Singleflight scope: in-process dedupe (Arc<DashMap<CacheKey, ...>> or a singleflight crate) is simplest and removes per-replica stampedes; a distributed Redis lock is stronger under horizontal scale at the cost of a round-trip. My lean: start in-process, document the multi-replica behavior, revisit if needed.
- Background-fetch failure handling: keep serving stale until the window closes (RFC 5861 §3 semantics) rather than evicting on failure; this is where the stale-if-error relationship is most relevant.
- Background-task context: a scoped context so the refresh stays off the user's trace, while carrying over any request context the subgraph fetch needs (auth, etc.). The subtlest correctness question; input welcome from anyone who's wired up similar background tasks in the router.
- Telemetry: a distinct cache.status value for stale-served (e.g. stale / revalidating) so hit-ratio dashboards stay meaningful, plus counters for background-refresh success/failure.
- Config granularity: per-subgraph (shown above) vs. global. Per-subgraph matches the existing response_cache.subgraph.{all,subgraphs.NAME} shape.
Source references (verified against main at a412b4d6e, 2026-05-29):
- Parse / serialize / merge / can_use extension point: apollo-router/src/plugins/response_cache/cache_control.rs:25,109,210-217,279-324,376-386
- can_use() call sites: plugin.rs:1290 (root), :1551 (entities), :2394 (representation filter)
- Cache store TTL source: plugin.rs:1706,2569; Redis EXAT write: storage/redis.rs:384; cache-tag z-set score: storage/redis.rs:308
- Background-refresh primitives available: subgraph::BoxCloneService at plugin.rs:734; subgraph::Request: Clone
- Plugin still preview: plugin.rs:281
Implementation intent: I plan to follow this issue up with a PR once the design direction (singleflight scope, stale-if-error relationship, context handling) is settled.