Description
The end-to-end tests can be quite flaky, often because of ImagePullBackOff errors. In the majority of cases the image is not actually unavailable; instead, the image puller is being rate limited (either by the image server or client-side) and eventually fails before the throttling subsides. Both client-side and server-side throttling can be mitigated by ensuring that Task Step images in our e2e tests (i.e., the /examples directory) are frozen to a particular digest. Server-side throttling can also be mitigated by using a pull-through image proxy cache. However, it's worth noting that GitHub Container Registry does not document any server-side rate limiting.
I see three ways we can address the above:
Option 1: Freeze all images to their current tag's digest or their current latest digest
Pros:
- Very simple
- Little work required to migrate
- Mitigates both server-side and client-side rate limiting
Cons:
- Maintenance: Image digests may need to be periodically updated as older images could be deprecated/deleted
- It may become unclear which images require a specific version vs which images are just frozen for optimization
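To gauge how much migration work Option 1 involves, a small audit could list every image reference in the examples that is not yet pinned to a digest. The following is a minimal sketch, not an existing tool in this repo: the function name `find_unpinned` is hypothetical, and it assumes image references appear on simple `image:` lines in the YAML and that Tekton substitutions such as `$(params.foo)` should be skipped.

```python
import re

# Matches YAML lines such as "    image: busybox" or "  - image: 'alpine:3.19'".
IMAGE_LINE = re.compile(r"""^\s*(?:-\s+)?image:\s*["']?([^\s"'#]+)""")
# A reference counts as pinned when it ends in a sha256 digest.
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def find_unpinned(yaml_text):
    """Return (line_number, reference) pairs for images not pinned by digest."""
    findings = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = IMAGE_LINE.match(line)
        if match is None:
            continue
        ref = match.group(1)
        if "$(" in ref:
            # Parameterised image; it cannot be resolved statically.
            continue
        if not DIGEST.search(ref):
            findings.append((lineno, ref))
    return findings
```

Run over each file under /examples, the output would give a checklist of references to freeze, and the same check could later flag newly added unpinned images.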
Option 2: Run a pre-test script which resolves the image references before we execute e2e tests
(this would run in CI and the resolved references would not need to be committed into the repo)
Pros:
- No additional image update maintenance required
- Examples can continue to freeze their images when necessary, or leave them unpinned when not
- Mitigates both server-side and client-side rate limiting
- Heckin cool
Cons:
- Added complexity in CI, and an additional failure mode
- Script would need to be maintained
- Implementation more involved
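A rough sketch of what the Option 2 pre-test script might look like is below. It rewrites each unpinned `image:` line to a digest reference, resolving every distinct reference only once. All names here (`pin_images`, `strip_tag`, `crane_digest`) are hypothetical. The resolver is passed in as a function; `crane_digest` shows one possible resolver and assumes the `crane` CLI from go-containerregistry is installed in CI. The sketch also assumes image references sit on plain `image:` lines and that `$(...)` substitutions should be left alone.

```python
import re
import subprocess
from typing import Callable

# Captures indentation/prefix, optional quote, the reference, and any trailing text.
IMAGE_LINE = re.compile(
    r"""^(?P<prefix>\s*(?:-\s+)?image:\s*)(?P<quote>["']?)(?P<ref>[^\s"'#]+)(?P=quote)(?P<rest>.*)$"""
)


def strip_tag(ref: str) -> str:
    """Drop a trailing ':tag' while keeping a registry port such as 'host:5000/img'."""
    name, sep, tag = ref.rpartition(":")
    if sep and "/" not in tag:
        return name
    return ref


def crane_digest(ref: str) -> str:
    """One possible resolver. Assumption: `crane` is on PATH in the CI image."""
    result = subprocess.run(
        ["crane", "digest", ref], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def pin_images(yaml_text: str, resolve_digest: Callable[[str], str]) -> str:
    """Return yaml_text with every unpinned image reference pinned to a digest."""
    cache: dict[str, str] = {}  # resolve each distinct reference only once
    out = []
    for line in yaml_text.split("\n"):
        m = IMAGE_LINE.match(line)
        if m is None or "@sha256:" in m["ref"] or "$(" in m["ref"]:
            # Not an image line, already pinned, or a Tekton substitution.
            out.append(line)
            continue
        ref = m["ref"]
        if ref not in cache:
            cache[ref] = resolve_digest(ref)
        pinned = strip_tag(ref) + "@" + cache[ref]
        out.append(m["prefix"] + m["quote"] + pinned + m["quote"] + m["rest"])
    return "\n".join(out)
```

In CI this could run over a temporary copy of /examples before the e2e suite starts, for example `pin_images(text, crane_digest)` per file, so the resolved references never need to be committed.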
Option 3: Image proxy cache (either per e2e-test cluster or global)
Pros:
- No additional image update maintenance required
- Examples can continue to freeze their images when necessary, or leave them unpinned when not
- A more standard solution
- Mitigates server-side throttling
Cons:
- Added complexity to CI setup, as the registry would need to be set up
- Would not mitigate client-side throttling
- Mainly valuable if all e2e test clusters can share the registry, which might be a challenge to configure with GitHub workflow job networking