
KubernetesJobOperator does not recover when pods are deleted on completion #56693

@pmcquighan-camus

Description

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==10.7.0

Apache Airflow version

3.0.6

Operating System

debian 12

Deployment

Official Apache Airflow Helm Chart

Deployment details

Running on GKE, Kubernetes version 1.33

What happened

A job with parallelism 1 and 1 completion (i.e. running a single pod to completion) completed successfully. The triggerer detected the job completion, but before the task resumed, GKE deleted the pod as part of a node scaling event. Since the pod was Complete, the Job is also considered Complete, so Kubernetes will not retry the pod. Then, when the task wakes up, it fails in resume_execution, specifically when trying to fetch logs. The worst part is that on task retries the operator sees that the job is completed and tries to resume from execute_complete, hitting the same pod-not-found error again (instead of, say, retrying the Job from the start).

[2025-10-15, 09:48:22] ERROR - Task failed with exception: source="task"
ApiException: (404)
Reason: Not Found

File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 920 in run
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1215 in _execute_task
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py", line 1606 in resume_execution
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/job.py", line 276 in execute_complete
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 470 in get_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 23999 in read_namespaced_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 24086 in read_namespaced_pod_with_http_info
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348 in call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180 in __call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 373 in request
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 244 in GET
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238 in request

I think the primary workaround is to set get_logs=False, but I'm not totally certain that this covers every case where a PodNotFound might occur.
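For reference, the workaround is just flipping that flag on the operator (same placeholder task as the sketch above):

```python
run_job = KubernetesJobOperator(
    task_id="run_job",
    # ... same configuration as above ...
    get_logs=False,  # skip the pod/log fetch on resume, avoiding the get_pod 404
)
```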

Also note that the None-check on getting the pod here is not hit, since get_pod ends up raising a kubernetes.client.ApiException rather than returning None. I tried patching the code to catch that exception and re-raise it as PodNotFoundException, but that had no effect.
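The patch I tried looked roughly like this, around the get_pod call in execute_complete (a sketch from memory, not the exact diff; the PodNotFoundException import path may differ by provider version, and pod_name/pod_namespace stand in for whatever the operator pulls from the trigger event):

```python
from kubernetes.client import ApiException

from airflow.providers.cncf.kubernetes.operators.pod import PodNotFoundException  # import path from memory

try:
    pod = hook.get_pod(pod_name, pod_namespace)  # placeholders for the values from the trigger event
except ApiException as e:
    if e.status == 404:
        raise PodNotFoundException("Job pod was deleted before logs could be fetched") from e
    raise
```

Even with the 404 translated into PodNotFoundException, the retry still resumes from execute_complete and fails the same way.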

This feels similar to, but is not fixed by, #39239; notably, a task retry does not result in a successful execution.

What you think should happen instead

I think failing the task with PodNotFoundException when get_logs=True is reasonable; however, it seems like a task retry should then re-run the full task instead of just re-running execute_complete and failing on the same exception each time. This behavior seemed to occur regardless of whether the Kubernetes Job object still existed.

How to reproduce

Run a KubernetesJobOperator that does anything, and once the pod completes (but before Airflow fetches logs/marks the task complete), manually delete the pod. In an actual cloud-hosted Kubernetes environment, a cluster-autoscaling component might cause the pod to be deleted, but that is hard to rely on, so a manual delete mimics the same behavior (see the sketch below).
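One way to mimic the deletion while the task is still deferred is with the Kubernetes Python client (a sketch; the Job name and namespace are the placeholders from above, and job-name is the standard label on Job-owned pods):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() if running inside the cluster
v1 = client.CoreV1Api()

# Delete the completed pod that belongs to the (placeholder) Job before Airflow resumes.
pods = v1.list_namespaced_pod(namespace="default", label_selector="job-name=example-job")
for pod in pods.items:
    v1.delete_namespaced_pod(name=pod.metadata.name, namespace="default")
```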

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
