Example Task Steps referencing images by tag resulting in throttling during e2e tests #9084

@aThorp96

The end-to-end tests can be quite flaky, often because of ImagePullBackOff errors. In most of these cases the image is not actually unavailable; instead, the image puller is being rate limited (either by the image registry or client-side) and the pull fails before the throttling clears. Both client-side and server-side throttling can be mitigated by pinning the Task Step images used in our e2e tests (i.e., the /examples directory) to a specific digest. Server-side throttling alone can also be mitigated with a pull-through image proxy cache. It is worth noting, however, that GitHub Container Registry does not document any server-side rate limiting.

I see three ways we can address the above:

Option 1: Pin every image to the digest its current tag resolves to (or to the current latest digest if it has no tag)

Pros:

  • Very simple
  • Little work required to migrate
  • Mitigates both server-side and client-side rate limiting

Cons:

  • Maintenance: Image digests may need to be periodically updated as older images could be deprecated/deleted
  • It may become unclear which images require a specific version vs which images are just frozen for optimization
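As a rough illustration of what Option 1 amounts to for a single reference, here is a minimal Python sketch. The helper names `is_pinned` and `pin` are hypothetical and not part of the repository, and the sketch assumes sha256 digests only. It drops the tag when pinning, which is one possible choice; recording the original tag in a comment next to the reference might help with the second con above.

```python
import re

# A pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned(image_ref: str) -> bool:
    """Return True when the reference already ends in a sha256 digest."""
    return DIGEST_SUFFIX.search(image_ref) is not None


def pin(image_ref: str, digest: str) -> str:
    """Return image_ref rewritten to point at digest, dropping any tag."""
    if is_pinned(image_ref):
        return image_ref
    if not re.fullmatch(r"sha256:[0-9a-f]{64}", digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    # A tag can only appear in the last path component; a colon earlier in
    # the string is a registry port (e.g. localhost:5000/app).
    head, sep, last = image_ref.rpartition("/")
    name = last.split(":", 1)[0]
    return f"{head}{sep}{name}@{digest}"
```

For example, `pin("ghcr.io/example/app:1.2", digest)` would yield `ghcr.io/example/app@` followed by the digest, while an already pinned reference is returned unchanged.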

Option 2: Run a pre-test script which resolves the image references before we execute e2e tests

(This would run in CI, and the resolved references would not need to be committed to the repo.)

Pros:

  • No additional image update maintenance required
  • Examples can continue to pin their images when necessary, or leave them untagged when not
  • Mitigates both server-side and client-side rate limiting
  • Heckin cool

Cons:

  • Adds complexity to the CI and introduces an additional failure mode
  • Script would need to be maintained
  • Implementation more involved
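A pre-test script along the lines of Option 2 could be sketched as follows. This is only an illustration under several assumptions: the function names are hypothetical, image references are found with a simple line-based regex rather than a YAML parser, and the default resolver shells out to `crane digest`, which assumes the crane CLI is installed on the CI runner. References that are already pinned, or that contain a `$(` substitution, are left untouched.

```python
import re
import subprocess
from typing import Callable, Dict

# Matches lines such as "  - image: busybox:1.36  # note" or 'image: "alpine"'.
# Group 1: everything up to the reference, group 2: the reference, group 3: the rest.
IMAGE_LINE = re.compile(r"^(\s*(?:-\s+)?image:\s*[\"']?)([^\s\"'#]+)(.*)$")


def crane_digest(image_ref: str) -> str:
    """Resolve a reference to its digest via the crane CLI (assumed installed)."""
    result = subprocess.run(
        ["crane", "digest", image_ref],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def resolve_manifest(text: str, resolve: Callable[[str], str] = crane_digest) -> str:
    """Return text with every tag-based image reference replaced by a digest."""
    cache: Dict[str, str] = {}  # resolve each distinct reference only once
    out = []
    for line in text.split("\n"):
        match = IMAGE_LINE.match(line)
        if match is None:
            out.append(line)
            continue
        prefix, ref, rest = match.groups()
        # Skip references that are already pinned or are parameter substitutions.
        if "@sha256:" in ref or "$(" in ref:
            out.append(line)
            continue
        if ref not in cache:
            cache[ref] = resolve(ref)
        # Drop the tag, which can only appear in the last path component.
        head, sep, last = ref.rpartition("/")
        name = last.split(":", 1)[0]
        out.append(f"{prefix}{head}{sep}{name}@{cache[ref]}{rest}")
    return "\n".join(out)
```

In CI, a wrapper would read each file under the examples directory, pass its contents through `resolve_manifest`, and write the result back to the working copy before the tests start. Making the resolver a parameter keeps the rewriting logic testable without network access.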

Option 3: Image proxy cache (either per e2e-test cluster or global)

Pros:

  • No additional image update maintenance required
  • Examples can continue to pin their images when necessary, or leave them untagged when not
  • A more standard solution
  • Mitigates server-side throttling

Cons:

  • Adds complexity to the CI setup, as the registry would need to be set up
  • Would not mitigate client-side throttling
  • Mainly valuable if all e2e test clusters can share the registry, which might be a challenge to configure with GitHub workflow job networking

Labels: kind/flake (categorizes issue or PR as related to a flaky test)

Status: In Progress
