
KubernetesJobOperator does not recover when pods are deleted on completion #56693

@pmcquighan-camus

Description

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==10.7.0

Apache Airflow version

3.0.6

Operating System

debian 12

Deployment

Official Apache Airflow Helm Chart

Deployment details

Running on GKE, Kubernetes version 1.33

What happened

A job with parallelism 1 and 1 completion (i.e. running a single pod to completion) completed successfully. The triggerer detected the job completion, but before the task resumed, GKE deleted the pod as part of a node scaling event. Since the pod was Complete, the Job is also considered Complete, so Kubernetes will not retry the pod. Then, when the task wakes up, it fails in resume_execution, specifically when trying to fetch logs. The worst part is that on task retries the operator sees that the job is completed and tries to resume from execute_complete, hitting the same pod-not-found error again (instead of, say, retrying the Job from the start).

[2025-10-15, 09:48:22] ERROR - Task failed with exception: source="task"
ApiException: (404)
Reason: Not Found

File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 920 in run
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1215 in _execute_task
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py", line 1606 in resume_execution
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/job.py", line 276 in execute_complete
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 470 in get_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 23999 in read_namespaced_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 24086 in read_namespaced_pod_with_http_info
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348 in call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180 in __call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 373 in request
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 244 in GET
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238 in request

I think the primary workaround is to set get_logs=False, but I'm not totally certain that this covers every case where a PodNotFound might occur.
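For reference, the workaround is just flipping that flag on the operator (same placeholder task as the sketch above):

```python
run_job = KubernetesJobOperator(
    task_id="run_job",
    # ... same configuration as above ...
    get_logs=False,  # skip the pod/log fetch on resume, avoiding the get_pod 404
)
```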

Also note that the None-check on getting the pod here is not hit, since get_pod ends up raising a kubernetes.client.ApiException rather than returning None. I tried patching the code to catch that exception and re-raise it as PodNotFoundException, but that had no effect.
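The patch I tried looked roughly like this, around the get_pod call in execute_complete (a sketch from memory, not the exact diff; the PodNotFoundException import path may differ by provider version, and pod_name/pod_namespace stand in for whatever the operator pulls from the trigger event):

```python
from kubernetes.client import ApiException

from airflow.providers.cncf.kubernetes.operators.pod import PodNotFoundException  # import path from memory

try:
    pod = hook.get_pod(pod_name, pod_namespace)  # placeholders for the values from the trigger event
except ApiException as e:
    if e.status == 404:
        raise PodNotFoundException("Job pod was deleted before logs could be fetched") from e
    raise
```

Even with the 404 translated into PodNotFoundException, the retry still resumes from execute_complete and fails the same way.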

This feels similar to, but is not fixed by, #39239; notably, a task retry does not result in a successful execution.

What you think should happen instead

I think failing the task with PodNotFoundException when get_logs=True is reasonable; however, it seems like a task retry should then re-run the full task instead of just re-running execute_complete and failing on the same exception each time. This behavior seemed to occur regardless of whether the Kubernetes Job object still existed.

How to reproduce

Run a KubernetesJobOperator that does anything, and once the pod completes (but before Airflow fetches logs/marks the task complete), manually delete the pod. In an actual cloud-hosted Kubernetes environment, a cluster-autoscaling component might cause the pod to be deleted, but that is hard to rely on, so a manual delete mimics the same behavior (see the sketch below).
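One way to mimic the deletion while the task is still deferred is with the Kubernetes Python client (a sketch; the Job name and namespace are the placeholders from above, and job-name is the standard label on Job-owned pods):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() if running inside the cluster
v1 = client.CoreV1Api()

# Delete the completed pod that belongs to the (placeholder) Job before Airflow resumes.
pods = v1.list_namespaced_pod(namespace="default", label_selector="job-name=example-job")
for pod in pods.items:
    v1.delete_namespaced_pod(name=pod.metadata.name, namespace="default")
```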

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
