CDN invalidation: decide what to do about quotas #1871

Open
jsha opened this issue Oct 5, 2022 · 12 comments
Labels
C-bug Category: This is a bug E-easy Effort: Should be easy to implement and would make a good first PR

Comments

@jsha
Contributor

jsha commented Oct 5, 2022

According to https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cloudfront-limits.html#limits-invalidations:

File invalidation: maximum number of active wildcard invalidations allowed: 15

So, during times of heavy building (particularly when many crates in a family are released at once), we are likely to hit this limit. Presumably it will depend on how long the invalidations actually take.

We should decide what to do about these:

  • queue the invalidations
  • drop them when we hit the limit
  • try to cluster same-prefixed crates that are built at once into a single invalidation (sketched below)
  • if we hit the limit, set a ten-minute timer (to capture additional clustered builds), and invalidate the whole distribution when the timer expires
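A minimal sketch of the prefix-clustering option; the URL layout, prefix length, and helper name are illustrative assumptions, not what docs.rs actually uses:

```rust
use std::collections::BTreeMap;

/// Collapse a batch of crate names into fewer invalidation paths by grouping
/// on a shared name prefix. Purely illustrative: the real URL layout and a
/// sensible prefix length would depend on the actual CDN configuration.
fn cluster_invalidations(crate_names: &[&str], prefix_len: usize) -> Vec<String> {
    let mut groups: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &name in crate_names {
        // crate names are ASCII, so byte slicing is safe here
        let prefix = &name[..name.len().min(prefix_len)];
        groups.entry(prefix).or_default().push(name);
    }
    groups
        .into_iter()
        .map(|(prefix, names)| {
            if names.len() > 1 {
                // several crates share this prefix: one wildcard covers them all
                format!("/{prefix}*")
            } else {
                // a single crate: invalidate only its own subtree
                format!("/{}/*", names[0])
            }
        })
        .collect()
}

fn main() {
    let built = ["tokio", "tokio-util", "tokio-stream", "serde"];
    // prints ["/serde/*", "/tokio*"]
    println!("{:?}", cluster_invalidations(&built, 5));
}
```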
@syphar
Member

syphar commented Oct 5, 2022

Valid point, I missed that. We will definitely hit this at some point; my guess is only when we have much higher build capacity, much better caching, or many build failures in a short time.

In these cases I prefer optimistic / simple approaches, which means:

when we get a rate-limit error we retry after some time.
A local queue could optimize this behaviour when we have many builds.
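A minimal sketch of that retry behaviour, assuming a placeholder error type and a stubbed-out CloudFront call (neither is the real docs.rs code):

```rust
use std::time::Duration;

/// Placeholder error type; only the "too many invalidations in progress"
/// case matters for this sketch.
enum InvalidationError {
    TooManyInProgress,
    Other(String),
}

/// Hypothetical wrapper around the real CloudFront call.
fn create_invalidation(_paths: &[String]) -> Result<(), InvalidationError> {
    unimplemented!()
}

/// Naive "retry after some time": on the quota error, back off and try again.
fn invalidate_with_retry(paths: &[String]) -> Result<(), InvalidationError> {
    loop {
        match create_invalidation(paths) {
            Err(InvalidationError::TooManyInProgress) => {
                // all active invalidation slots are taken; wait for one to free up
                std::thread::sleep(Duration::from_secs(60));
            }
            other => return other,
        }
    }
}
```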

@syphar syphar added E-easy Effort: Should be easy to implement and would make a good first PR C-bug Category: This is a bug labels Oct 5, 2022
@jsha
Contributor Author

jsha commented Oct 5, 2022

In these cases I prefer optimistic / simple approaches

Same!

when we get a rate-limit error we retry after some time.

One question with this: will we get into a state where the retry queue grows without bound?

Looking at https://docs.rs/releases/activity it seems we average at least 600 releases per day. If an average invalidation takes 5 minutes and we can have 15 in parallel, that's 3 invalidations per minute throughput. With 1440 minutes in a day, we could handle up to 4320 builds per day before we wind up in unbounded growth land. Of course, that's based on a significant assumption about how long an invalidation takes.
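Spelling out that back-of-envelope calculation (the 5-minute duration is the assumption above):

```rust
fn main() {
    let active_slots = 15;            // CloudFront wildcard invalidation quota
    let minutes_per_invalidation = 5; // assumed average duration
    let throughput_per_minute = active_slots / minutes_per_invalidation; // 3
    let capacity_per_day = throughput_per_minute * 24 * 60;              // 4320
    println!("~{capacity_per_day} builds/day vs ~600 releases/day today");
}
```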

If we're going to have a queue anyhow, maybe it makes more sense for all invalidations to go onto that queue, and to have an independent component responsible for managing it? That way it could keep track of how many invalidations are in flight and avoid hitting the quota unnecessarily.

We'll also want a way for the docs.rs team to clear the queue (and, separately if needed, invalidate the whole distribution).

One other consideration: when the queue does start growing faster than we can clear it, how do we want to handle that? It may be better to treat it as a stack. That way more recently built crates are more likely to have a successful invalidation; crates that have been waiting on an invalidation for a long time are lower priority since their contents are likely to fall out of the CDN independently due to age.
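A sketch of the stack idea, using an in-memory structure purely for illustration (the real backlog would presumably be persistent):

```rust
use std::collections::VecDeque;

/// Pending invalidation paths, newest first. Treating the backlog as a stack
/// means recently built crates are invalidated first; old entries we never
/// get to will usually have aged out of the CDN cache on their own.
struct PendingInvalidations {
    paths: VecDeque<String>,
}

impl PendingInvalidations {
    fn push(&mut self, path: String) {
        self.paths.push_front(path);
    }

    /// Take up to `n` of the most recently queued paths for the next
    /// invalidation request (`n` bounded by the remaining quota headroom).
    fn next_batch(&mut self, n: usize) -> Vec<String> {
        let n = n.min(self.paths.len());
        self.paths.drain(..n).collect()
    }
}
```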

@syphar
Member

syphar commented Oct 5, 2022

Another (more pessimistic) approach would probably be:

  • collect paths to invalidate in a separate table
  • once a minute (or something like that), push the collected invalidations to CloudFront

The problematic limit is on invalidation requests, not on paths.

Since we would have to have a persistent queue for the retries anyway, we could also base the whole thing on it. Without a persistent queue we might lose a needed purge after a build.
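A rough sketch of that flush loop, with the table access and the CloudFront call stubbed out as hypothetical helpers:

```rust
use std::time::Duration;

/// Hypothetical accessors for a persistent queue table and the CloudFront
/// call; the real implementation would go through the existing database pool
/// and AWS client.
fn fetch_queued_paths() -> Vec<String> { unimplemented!() }
fn delete_queued_paths(_paths: &[String]) { unimplemented!() }
fn create_invalidation(_paths: &[String]) -> Result<(), String> { unimplemented!() }

/// Once a minute, drain whatever paths have accumulated and send them to
/// CloudFront as one batched invalidation request instead of one per build.
fn flush_loop() {
    loop {
        std::thread::sleep(Duration::from_secs(60));

        let paths = fetch_queued_paths();
        if paths.is_empty() {
            continue;
        }
        if create_invalidation(&paths).is_ok() {
            // only delete the rows once CloudFront accepted the request,
            // so a failed purge is retried on the next tick
            delete_queued_paths(&paths);
        }
    }
}
```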

@syphar
Member

syphar commented Oct 5, 2022

If an average invalidation takes 5 minutes and we can have 15 in parallel

On that I'll try to collect some data. According to the docs an invalidation can take up to 15 minutes.

@jsha
Contributor Author

jsha commented Oct 5, 2022

Another (more pessimistic) approach would probably be:

collect paths to invalidate in a separate table
once a minute (or something like that), push the collected invalidations to CloudFront

Aha, I like this approach. Coincidentally, it's quite similar to the cache invalidation we implemented for letsencrypt/boulder (for OCSP responses).

@syphar
Member

syphar commented Oct 5, 2022

In a small dummy distribution, the invalidation only takes a few seconds.

This will probably look different with more files to invalidate; we'll see in prod.

@syphar
Member

syphar commented Oct 5, 2022

It seems like there is also a rate limit on the API calls, but the information is confusing here (aws/aws-sdk-js#3983 (comment)).

We'll see how it looks after #1864 is deployed.

@Nemo157
Member

Nemo157 commented Oct 5, 2022

The problematic limit is on invalidation requests, not on paths.

I think it's the number of paths; for the non-wildcard ones they say:

If you’re invalidating files individually, you can have invalidation requests for up to 3,000 files per distribution in progress at one time. This can be one invalidation request for up to 3,000 files, up to 3,000 requests for one file each, or any other combination that doesn’t exceed 3,000 files.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html#InvalidationLimits

@syphar
Member

syphar commented Nov 20, 2022

Short update here:
Especially in combination with the rebuilds from #1181, we're running into a "too many active wildcard invalidations" error in CloudFront multiple times a day:
[screenshot of the CloudFront error]

So we have to handle this before we can activate the full page cache again.

@0xadada

0xadada commented Jan 13, 2023

@syphar

According to the docs an invalidation can take up to 15 minutes

Where did you find that number in the docs? I'm not seeing it and just want to confirm.

@syphar
Member

syphar commented Jan 13, 2023

I'm sorry, but I can't find it any more. I'm not sure if they removed it.

One thing I am sure of: I did many manual tests, and the invalidations took ~13-15 minutes to finish.

@syphar
Member

syphar commented Feb 4, 2023

Update on this issue:

Since #1961 we have a queue for these invalidations.

Through the queue we are

  • fully using the available invalidations,
  • directly enqueueing new ones when a spot frees up, and
  • spreading out occasional peaks.

We are also starting to track some metrics around the queue, and later around invalidation execution times.

There is a pending optimization: paths are sometimes queued multiple times, for example when multiple releases of a crate are yanked. These can be de-duplicated before they are sent to CloudFront.
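The de-duplication itself is straightforward; an order-preserving sketch, assuming the queued paths come out as plain strings:

```rust
use std::collections::HashSet;

/// Drop duplicate paths while keeping queue order, so that e.g. yanking
/// several releases of one crate results in a single CloudFront path entry.
fn dedup_paths(queued: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    queued
        .into_iter()
        .filter(|path| seen.insert(path.clone()))
        .collect()
}
```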

We can think about further improvements (from above), roughly sketched after this list:

  • escalating to a full distribution invalidation when the queue is too full, or
  • switching to invalidating a full crate-name prefix if we see the invalidations take too long.
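A toy sketch of such an escalation rule; the thresholds are made up for illustration and would really come from the new metrics:

```rust
/// Possible escalation decisions from the list above. All thresholds are
/// hypothetical.
enum Escalation {
    None,
    WildcardPerCratePrefix,
    FullDistributionInvalidation,
}

fn decide_escalation(queue_len: usize, avg_invalidation_minutes: u64) -> Escalation {
    const QUEUE_TOO_FULL: usize = 1_000; // hypothetical
    const TOO_SLOW_MINUTES: u64 = 30;    // hypothetical

    if queue_len > QUEUE_TOO_FULL {
        Escalation::FullDistributionInvalidation
    } else if avg_invalidation_minutes > TOO_SLOW_MINUTES {
        Escalation::WildcardPerCratePrefix
    } else {
        Escalation::None
    }
}
```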

Before digging into optimizations, there is also the option of switching to Fastly at some point, where invalidations are much faster and have no limits, but are tag-based rather than path-based.

Currently I'm leaning towards closing this issue once we have de-duplicated the paths and added the pending metric PRs.
