Description
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==10.7.0
Apache Airflow version
3.0.6
Operating System
Debian 12
Deployment
Official Apache Airflow Helm Chart
Deployment details
Running on GKE, Kubernetes version 1.33
What happened
A job with parallelism 1 and 1 completion (i.e. just running a single pod to completion) completed successfully. The triggerer detected the job completion, but before the task was resumed, GKE deleted the pod for a node scaling event. Since the pod is `Complete`, the Job is also considered `Complete`, and so Kubernetes will not retry the pod. Then, when the task wakes up, it fails in `resume_execution`, notably when trying to fetch logs. The worst part is that on task retries the operator sees "job is completed", tries to resume from `execute_complete`, and hits the same pod-not-found error again (instead of perhaps retrying the Job from the start):
```
[2025-10-15, 09:48:22] ERROR - Task failed with exception: source="task"
ApiException: (404)
Reason: Not Found
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 920 in run
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1215 in _execute_task
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py", line 1606 in resume_execution
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/job.py", line 276 in execute_complete
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 470 in get_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 23999 in read_namespaced_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 24086 in read_namespaced_pod_with_http_info
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348 in call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180 in __call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 373 in request
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 244 in GET
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238 in request
```
I think the primary workaround is to set `get_logs=False`, but I'm not totally certain that this workaround covers all cases where a pod-not-found might occur.
Also note that the None-check on the fetched pod here is not hit, since `get_pod` ends up throwing a `kubernetes.client.ApiException` rather than returning `None`. I tried patching the code to catch that exception and re-raise it as `PodNotFoundException`, but that had no effect.
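The patch I tried was roughly shaped like this (a sketch, not the actual diff; `hook.get_pod` is the method from the traceback above, while the helper name and the stand-in exception class are hypothetical):

```python
from kubernetes.client.rest import ApiException

from airflow.exceptions import AirflowException


class PodNotFoundException(AirflowException):
    """Stand-in for the provider's exception of the same name."""


def get_pod_or_raise(hook, name: str, namespace: str):
    """Fetch a pod, translating the Kubernetes 404 into PodNotFoundException."""
    try:
        return hook.get_pod(name, namespace)
    except ApiException as e:
        if e.status == 404:
            raise PodNotFoundException(f"Pod {name} not found in namespace {namespace}") from e
        raise
```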
This feels similar to, but is not fixed by, #39239; notably, a task retry does not result in a successful execution.
What you think should happen instead
I think failing the task with `PodNotFoundException` when `get_logs=True` is reasonable; however, a task retry should then re-run the full task instead of just re-running `execute_complete` and failing on the same exception over and over. This behavior occurred regardless of whether the Kubernetes Job object still existed.
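This is not the fix I'm proposing (I'd still want retries to re-run the whole task), but as a user-side stopgap something like the following subclass could treat "job succeeded but pod already gone" as success without logs. The `execute_complete` signature and the event's `status` key are assumptions based on the traceback above:

```python
from kubernetes.client.rest import ApiException

from airflow.providers.cncf.kubernetes.operators.job import KubernetesJobOperator


class PodTolerantJobOperator(KubernetesJobOperator):
    """Swallow the pod-not-found 404 when the Job itself succeeded."""

    def execute_complete(self, context, event, **kwargs):
        try:
            return super().execute_complete(context, event, **kwargs)
        except ApiException as e:
            if e.status == 404 and event.get("status") == "success":
                self.log.warning("Job succeeded but its pod was already deleted; skipping logs.")
                return None
            raise
```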
How to reproduce
Run a `KubernetesJobOperator` that does anything and, once the pod completes (but before Airflow fetches logs and marks the task complete), manually delete the pod. In an actual cloud-hosted Kubernetes environment, a cluster-autoscaling component might delete the pod for you, but that is hard to rely on, so a manual delete mimics the same behavior.
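For example, a small script using the Kubernetes Python client can mimic the autoscaler (the namespace and job name are illustrative; pods created by a Job carry a `job-name` label):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

namespace = "default"
# Pods created by a Job are labeled with the job's name.
pods = v1.list_namespaced_pod(namespace, label_selector="job-name=example-job")
for pod in pods.items:
    if pod.status.phase == "Succeeded":
        # Delete the completed pod out from under Airflow, before it fetches logs.
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
```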
Anything else
No response
Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct