Sync-waves not working as expected #12376

Closed · jgato opened this issue Feb 9, 2023 · 22 comments · Fixed by #16748
Labels: bug · component:core (syncing, diffing, cluster state cache) · sync-waves

Comments

jgato commented Feb 9, 2023

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

ArgoCD 2.4.20

Describe the bug

When doing a sync with prune (deletion), the objects are not deleted according to their sync-waves.

To Reproduce

I have an ArgoCD app with different resources. The one with the lowest sync-wave is a Namespace, at -10. All the other objects have a sync-wave of 1.
When I delete all the objects and sync (with prune), all the objects are deleted at the same time. In the following video, you can see how the Namespace is marked as Terminating immediately after the sync is invoked.

delete-no-respect-waves.mp4

Other objects, in this case some Secrets and a BareMetalHost, take longer to be deleted, but there is no wait between waves.
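
For illustration, the manifests look roughly like this (names hypothetical, trimmed to the annotations that matter):

# Sketch of the setup: the Namespace sits in a lower wave than the objects inside it.
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  annotations:
    argocd.argoproj.io/sync-wave: "-10"
---
apiVersion: v1
kind: Secret
metadata:
  name: my-secret
  namespace: my-namespace
  annotations:
    argocd.argoproj.io/sync-wave: "1"

On deletion with prune, I would expect the wave order to be walked in reverse: the Secret (wave 1) first, the Namespace (wave -10) last.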

I have been looking at this feature:
#3959
which should implement these waits, according to this code:
https://github.com/argoproj/argo-cd/pull/3959/files/c2e5ccc81e0aaca81a2d002436d39d52a4364d2d#diff-f952d05ea83b61f771541425e28fa3931af2f2e7950261dd59cab93bdcfe2e9e
There is actually a comment about it there: deletion should wait for the objects of one wave to be gone before moving on to the next wave.

Expected behavior

The Namespace object is deleted only after the other objects have been deleted.

Version

ArgoCD 2.4.20

@jgato jgato added the bug Something isn't working label Feb 9, 2023
@keithchong keithchong added the component:core Syncing, diffing, cluster state cache label Feb 11, 2023

leoluz commented Feb 16, 2023

Needs help with investigation. Possibly a bug

@jcooklin

We've observed the same behavior.


jgato commented Feb 22, 2023

If it helps, I was also seeing this on ArgoCD 2.3, where the same thing was happening. If it is a bug, it has been around since earlier versions.

@gogovan-vincentngai

Same issue here.
My ArgoCD version: v2.4.14+029be59

I am using Crossplane. I need to create a Secret for the provider to use, so I give the Secret a sync-wave of 1 and the provider a sync-wave of 2.

Creation is fine and follows the order, but on deletion the Secret is deleted first, which leaves the provider unable to find the Secret and therefore unable to be deleted.
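
A sketch of the shape of this setup (the exact provider kind and API group/version depend on the Crossplane provider in use; names hypothetical):

# Credentials Secret in wave 1, provider object consuming it in wave 2.
apiVersion: v1
kind: Secret
metadata:
  name: provider-creds
  namespace: crossplane-system
  annotations:
    argocd.argoproj.io/sync-wave: "1"
---
apiVersion: aws.crossplane.io/v1beta1   # varies per provider
kind: ProviderConfig
metadata:
  name: default
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: provider-creds
      key: credentials

On create this ordering is respected; on delete both go at once, so the Secret can disappear while the provider object is still being torn down.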


jgato commented Mar 22, 2023

Do we know of a version where sync-waves are working?

@drpaneas

It looks like they never worked as supposed to.


Breee commented Mar 30, 2023

I observe the same problems, especially with Crossplane.

It really just fires off all the resources without waiting for them to become healthy or waiting ARGOCD_SYNC_WAVE_DELAY seconds (2 by default).

Where is the code for the sync waves located?
How can we help with fixing the issue?

@tharpooljha

We are seeing this as well with Argo 2.5, and have seen this behavior since 2.3.


drpaneas commented Apr 4, 2023

Hey there, I've raised a PR. Could some of you folks give it a try and see if it fixes the problem? Here's a test scenario:

Use: https://github.com/rh-gitops-qe/argocd-apps-examples/tree/main/syncwaves-example

e.g.

argocd app create syncwaves-example \
  --repo="https://github.com/rh-gitops-qe/argocd-apps-examples" \
  --path="./syncwaves-example" \
  --dest-server="https://kubernetes.default.svc" \
  --sync-policy="auto" \
  --project="default" \
  --dest-namespace="default" 

You should see namespaces-test 1, 2, 3 created in order.
Then delete this, and you should see namespaces 3, 2, 1 deleted in this order.

To see the order, you might want to use kubectl get events | grep <for the namespaces>


jannfis commented Apr 4, 2023

Deletion isn't supposed to obey sync wave semantics, though. It is supposed to follow the order (which it does, I guess).

What just doesn't happen is that we actually wait for the resources to be fully recycled.

So if I understand this bug correctly: when sync waves 1, 2 and 3 are in place, the order of deletion would be 3, 2, 1 (which it currently is, I believe), but we should wait between the waves – e.g. first delete resources from wave 3, wait for them to be completely recycled (and not just have a DeletionTimestamp put on them), then proceed to delete wave 2 resources, etc.

Is that a correct summary of the expectation?


jgato commented Apr 5, 2023

This is exactly what we expect.
Right now it seems it is not even waiting the delay time between waves.


jgato commented Apr 5, 2023

But @jannfis, reading this piece of code:
https://github.com/argoproj/argo-cd/pull/3959/files/c2e5ccc81e0aaca81a2d002436d39d52a4364d2d#diff-f952d05ea83b61f771541425e28fa3931af2f2e7950261dd59cab93bdcfe2e9e
@darshanime seemed to be trying to do exactly that:
"if we have objects pending deletion we want to wait for them to complete before we proceed with the next wave of deletion."


jannfis commented Apr 5, 2023

It seems that this particular piece of code only targets deletion when you delete the Application.

The pruning that happens during a sync (like you demonstrated in your original description) is handled in another part of the code, most likely in the sync code within gitops-engine.

You could see how the pruning behaves when you delete the Application vs when you sync-with-prune.

@drpaneas

I think I have a simple reproducer for this:

It's not as easy as I initially thought. According to @jannfis's new comment, the code for sync prune is different from what we were originally looking at. It lives in the gitops-engine repository.

At: https://github.com/argoproj/gitops-engine/blob/ed70eac8b7bd6b2f276502398fdbccccab5d189a/pkg/sync/sync_context.go#L387 

The core goal of this card is to have a test that allows us to verify when this bug is fixed. So the best approach, at least for now, is not a unit test but a reproducer: right now it fails (because of the wrong behavior), and once we fix the code it should pass.

Here's the reproducer I came up with:

  1. Setup ArgoCD for development purposes (see my tutorial).
  2. Use my GitHub repository, called syncwavetest, inside which you will find a manifest.yaml file.
    Start ArgoCD and deploy my manifest with the following command:

./dist/argocd app create syncwaves-example --repo="https://github.com/drpaneas/syncwavetest" --path="./" --dest-server="https://kubernetes.default.svc" --sync-policy="auto" --project="default" --dest-namespace="default"

This will create a namespace called syncwavetest and three pods, podtest{1,2,3}, inside of it. If you look at the manifest, I have attached the following wave annotations:

namespace syncwavetest -> argocd.argoproj.io/sync-wave: "10"
pod podtest1 -> argocd.argoproj.io/sync-wave: "20"
pod podtest2 -> argocd.argoproj.io/sync-wave: "30"
pod podtest3 -> argocd.argoproj.io/sync-wave: "40"

Notice that ArgoCD correctly sets them up in the appropriate order (you can see this with
kubectl get events -A --sort-by='.metadata.creationTimestamp' | grep 'syncwavetest\|podtest')

  3. Now go to my GitHub repo and remove this manifest (make it empty, without contents). Force-push with HEAD~1.
    Then go back to ArgoCD and click Refresh. You will notice that it says "OutOfSync". This is correct, because git has changed (it now has nothing), but the cluster still has stuff deployed.

  4. Click "Sync" and select "Prune" as well from the options, and see what happens:

Actual Behavior:

ArgoCD tried to delete everything at once (the namespace and the pods). See my screenshot:

sync_prune (1)

You can also verify this behavior via terminal:

drpaneas@linux:~/syncwaves$ kubectl -n syncwavetest get pods
NAME       READY   STATUS        RESTARTS   AGE
podtest1   0/1     Terminating   0          23h
podtest2   0/1     Terminating   0          23h
# podtest3 has been deleted already

As you can see, ArgoCD tried to delete everything, but it couldn't delete podtest1, podtest2, or the namespace. It couldn't because in my original manifest.yaml file (the one you deployed) I included finalizers that block these pods from getting deleted. This was done on purpose so we can witness the Terminating and 0/1 status, indicating that ArgoCD didn't respect the syncwave order during pruning.
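
For reference, the blocking trick in the manifest looks roughly like this (the finalizer name is arbitrary); Kubernetes keeps the pod in Terminating until the finalizer is removed:

apiVersion: v1
kind: Pod
metadata:
  name: podtest1
  namespace: syncwavetest
  annotations:
    argocd.argoproj.io/sync-wave: "20"
  finalizers:
    - example.com/block-deletion   # arbitrary finalizer; deletion hangs until it is removed
spec:
  containers:
    - name: test
      image: busybox
      command: ["sleep", "3600"]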

Expected behavior:

Since, according to the syncwaves attached to the pods, podtest2 needs to be deleted before podtest1, I would expect ArgoCD to not even try to delete podtest1 while podtest2 is still there. Yes, it tried to delete it, but it didn't check whether it was actually deleted/recycled. I blocked the deletion of podtest2 and podtest1 on purpose so we can see this. In other words, this is the output I would expect:

drpaneas@linux:~/syncwaves$ kubectl -n syncwavetest get pods
NAME       READY   STATUS        RESTARTS   AGE
podtest1   1/1     Ready         0          23h
podtest2   0/1     Terminating   0          23h

So podtest3 is correctly deleted first in that case, and podtest2 is trying to get deleted, so Terminating looks fine for podtest2. The only problem is podtest1, which should say Ready and not Terminating, because ArgoCD shouldn't try to delete it before podtest2 is actually recycled.

@drpaneas

Ok, some findings:

  1. The order is wrong as I suspected originally.
30 resource /Pod:syncwavetest/podtest2 obj->nil (,,), Sync/40 resource /Pod:syncwavetest/podtest3 obj->nil (,,)]"
02:34:50                controller | INFO[0979] Running tasks                                 application=argocd/syncwaves-example dryRun=true numTasks=4 syncId=00023-iIRkb
02:34:50                controller | INFO[0979] Pruning                                       application=argocd/syncwaves-example dryRun=true syncId=00023-iIRkb task="Sync/10 resource /Namespace:/syncwavetest obj->nil (,,)"
02:34:50                controller | INFO[0979] Pruning                                       application=argocd/syncwaves-example dryRun=true syncId=00023-iIRkb task="Sync/20 resource /Pod:syncwavetest/podtest1 obj->nil (,,)"
02:34:50                controller | INFO[0979] Pruning                                       application=argocd/syncwaves-example dryRun=true syncId=00023-iIRkb task="Sync/30 resource /Pod:syncwavetest/podtest2 obj->nil (,,)"
02:34:50                controller | INFO[0979] Pruning                                       application=argocd/syncwaves-example dryRun=true syncId=00023-iIRkb task="Sync/40 resource /Pod:syncwavetest/podtest3 obj->nil (,,)" 

Here I have 4 resources (with sync waves 10, 20, 30, 40), but the pruning doesn't happen in reverse order!

  2. The sync & prune happens in the pruneObject function. I modified the code to not dry-run but actually call the DeleteResource function, and I noticed that even though I have a finalizer and the object is not deleted, this function returns no error – which is a bug in my opinion. I will modify this function and understand what it does, as this is a "wet run" and not a dry-run (I modified the code for my purposes to trigger this).

So the question here is: should I try to modify the DeleteResource function to make sure the resource is actually deleted and not just scheduled for deletion? If so, I am afraid the deletion/pruning process will become very slow. Are we OK with that? Is there maybe another way you can think of?


jgato commented Apr 28, 2023

Could this be an option when deleting? Something that is not clear to me is the expected behaviour, apart from the issue of incorrect ordering. Is it expected to only mark objects for deletion? If we need some kind of synchronisation between each object's deletion, are there any other alternatives for deletion? If not, I guess this would be a very interesting feature.

spjmurray added a commit to eschercloudai/helm-cluster-api that referenced this issue May 23, 2023
According to argoproj/argo-cd#12376 waves
don't actually work properly for deletion.  I can see in the code it
does a reverse sort and picks off the largest wave first, but I guess
there's a bug where it'll kick off the next wave before the first has
actually deleted (or something to that effect). However, we can tell
argo not to delete something and let CAPI do its own thing, which
appears to work more reliably. Anecdotally, the secret at wave -1 still
gets deleted last, which defies all logic and tells me something kinda
works in some situations and not others!!
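
For reference, the "tell argo not to delete something" workaround from the commit above uses ArgoCD's documented sync-options annotation. A minimal sketch (resource hypothetical):

apiVersion: v1
kind: Secret
metadata:
  name: capi-credentials   # hypothetical resource whose teardown is left to its own controller
  annotations:
    # Documented ArgoCD sync option: Argo will not delete this resource.
    argocd.argoproj.io/sync-options: Delete=false

This sidesteps the ordering bug for individual resources, at the cost of leaving their cleanup to something else.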
@ozhankaraman

Any news on this issue? I also hit the same problem of syncwave deletion order not working as expected. I tried positive and negative syncwaves, but no luck.


sambonbonne commented Aug 1, 2023

I still have the same issue on ArgoCD 2.7.10. Does anyone have a trick while waiting for a fix?

I use the External PostgreSQL server operator and spin up some temporary review environments (thanks to ApplicationSet generators based on merge requests), but when the environment is removed, the sync wave of the Postgres and PostgresUser resources I create is not respected. So sometimes the Postgres is removed before the PostgresUser, or worse: the operator is removed before those resources, so I end up with blocking finalizers that I need to remove manually (which is annoying for something that should be automated).


svghadi commented Aug 16, 2023

I have created an enhancement proposal (#15074) to address this. Please let me know what you think.


svghadi commented Nov 27, 2023

Hello everyone! 👋 I've created a patch (argoproj/gitops-engine#538) to address this issue. I'd love for some folks to give it a try and share your thoughts. If you're up for it, let me know!

@xsoheilalizadeh

Will this change be part of v2.11.0?


svghadi commented Feb 28, 2024

@xsoheilalizadeh - I believe so. The PR was merged after the 2.10 RCs were out, so it didn't go into 2.10.x.
