I am running Linkerd with linkerd-cni. Nodes are managed by Karpenter.
I am facing an issue with Dagster jobs: when a job is scheduled onto a spot node, Karpenter creates a new node and the job pod gets stuck because the network validator fails.
What is strange is that this works on my dev cluster but not on the stage one.
The Linkerd and Karpenter configs are identical on both (my guess is that on dev linkerd-cni happens to be scheduled before the job pod, while on stage it is the other way around).
The only thing I did to make it work on dev was to increase the network validator timeout from the default 10s to 60s.
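For reference, the timeout change amounts to a Helm values override roughly like the one below. This is only a sketch: it assumes the `networkValidator.timeout` key of the linkerd-control-plane chart, so check the key against the chart version in use.

```yaml
# Values override for the linkerd-control-plane chart.
# Key name is an assumption; verify it against your chart version.
networkValidator:
  timeout: "60s"   # raised from the default of 10s
```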
I have tried enabling the repair container, but it does not seem to work for job pods: it never detects them. I tested it with a simple pod and it deletes that pod as expected.
I have also tried adding a taint to the nodes in the Karpenter NodePool and having linkerd-cni remove the taint after initialization, but the linkerd-cni Helm chart offers no way to add an extra sidecar container; only extra init containers can be added.
With an init container I of course run into a chicken-and-egg problem.
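To make the taint idea concrete, here is a minimal sketch of the NodePool side. It assumes Karpenter's `karpenter.sh/v1` API and its `startupTaints` field; the NodePool name and the taint key are hypothetical placeholders, and the fragment leaves out `nodeClassRef` and `requirements`, which a real NodePool needs.

```yaml
# Fragment of a Karpenter NodePool (not a complete manifest).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-jobs                          # hypothetical name
spec:
  template:
    spec:
      startupTaints:
        - key: example.com/cni-not-ready   # hypothetical taint key
          effect: NoSchedule               # keeps job pods off the node until something removes the taint
```

The missing piece is still whatever removes this taint once linkerd-cni has finished installing its configuration on the node.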
What is the recommended solution in this case? How do I get the first job pod running on a fresh node?