Data Plane routes traffic to stale Pod IPs after backend rollout, causing 504 timeouts #4697

@trantandat18pm

Description

Summary

We are experiencing intermittent 504 Gateway Timeout errors when deploying new versions of backend applications behind NGINX Gateway Fabric.

The issue occurs after a backend rollout: the NGINX data plane continues routing traffic to stale IPs of Pods that no longer exist, even though:

  • The new backend Pods are running and healthy
  • Control plane logs show successful configuration updates
  • Direct access via NodePort / ClusterIP works correctly

Restarting the data plane pods immediately resolves the issue.
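
For context, the backends behind the Gateway are ordinary Deployment/Service pairs; a minimal sketch of the shape involved, with hypothetical names (demo-app, app-ns) standing in for our real workloads:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app          # hypothetical name
  namespace: app-ns       # hypothetical namespace
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: app
        image: example/app:v2   # rolling from v1 to v2 is when the 504s appear
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz      # assumed health endpoint
            port: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
  namespace: app-ns
spec:
  selector:
    app: demo-app
  ports:
  - port: 80
    targetPort: 8080

The rollout itself is a standard rolling update (a new image tag applied to the Deployment).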

Environment

  • NGINX Gateway Fabric version: v2.3.0
  • Gateway API version: v1.4.1
  • Deployment mode:
    • Control Plane replicas: 1
    • Data Plane replicas: 5

Gateway Configuration (simplified)

  • Single Gateway
  • Multiple HTTPS listeners
  • Wildcard hostnames
  • Routes attached from multiple namespaces (an example route follows the listener snippet below)
listeners:
- name: http                  # listener names (required by the Gateway API) are placeholders
  protocol: HTTP
  port: 80
- name: https-xxxtest
  protocol: HTTPS
  port: 443
  hostname: "*.xxxtest.com"
- name: https-xxx
  protocol: HTTPS
  port: 443
  hostname: "*.xxx.com"
  # tls sections omitted for brevity
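
To make the multi-namespace attachment concrete, the routes are of roughly this shape (all names here are hypothetical):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: demo-app
  namespace: app-ns            # different namespace from the Gateway
spec:
  parentRefs:
  - name: gateway              # hypothetical Gateway name
    namespace: nginx-gateway   # hypothetical Gateway namespace
  hostnames:
  - app.xxxtest.com            # matches the "*.xxxtest.com" listener
  rules:
  - backendRefs:
    - name: demo-app
      port: 80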

Symptoms

During a backend rollout:

  • Requests through the Gateway intermittently return 504 Gateway Timeout
  • Requests to the same Service via NodePort succeed
  • Gateway access logs show traffic being forwarded to unexpected backend IPs
    • These IPs belong to Pods from the previous version that have already been terminated
  • Restarting the NGF data plane clears the stale IPs and restores normal traffic
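
How we verify the discrepancy during an incident (service and namespace names are the hypothetical ones from the sketch above):

# The EndpointSlice already lists only the new Pod IPs
kubectl -n app-ns get endpointslices \
  -l kubernetes.io/service-name=demo-app -o wide

# Yet the data-plane access logs still show upstream addresses
# belonging to the terminated Pods (filter is log-format-dependent)
kubectl -n nginx-gateway logs <data-plane-pod> | grep " 504 "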
