CDN invalidation: decide what to do about quotas #1871
Comments
Valid point, I missed that. We will definitely hit this at some point; my guess is only when we have much higher build capacity, much better caching, or many build failures in a short time. In these cases I prefer optimistic / simple approaches, which means: when we get a rate-limit error, we retry after some time.
Same!
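A minimal sketch of that optimistic retry approach, assuming a hypothetical `create_invalidation` function and error type purely for illustration (the real CloudFront client and its error values would look different):

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type; the real CloudFront client has its own.
#[derive(Debug)]
enum InvalidationError {
    RateLimited,
    Other(String),
}

/// Placeholder for the actual CreateInvalidation call (an assumption here).
fn create_invalidation(paths: &[String]) -> Result<(), InvalidationError> {
    let _ = paths;
    Ok(())
}

/// Optimistic handling: try the invalidation; on a rate-limit error,
/// wait a while and try again, up to a fixed number of attempts.
fn invalidate_with_retry(paths: &[String], max_attempts: u32) -> Result<(), InvalidationError> {
    let mut delay = Duration::from_secs(30);
    for attempt in 1..=max_attempts {
        match create_invalidation(paths) {
            Err(InvalidationError::RateLimited) if attempt < max_attempts => {
                thread::sleep(delay);
                delay *= 2; // simple backoff between retries
            }
            result => return result,
        }
    }
    Err(InvalidationError::RateLimited)
}
```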
One question with this: will we get into a state where the retry queue grows without bound? Looking at https://docs.rs/releases/activity it seems we average at least 600 releases per day. If an average invalidation takes 5 minutes and we can have 15 in parallel, that's 3 invalidations per minute of throughput. With 1440 minutes in a day, we could handle up to 4320 builds per day before we wind up in unbounded-growth land. Of course, that's based on a significant assumption about how long an invalidation takes.

If we're going to have a queue anyhow, maybe it makes more sense for all invalidations to go onto that queue, and to have an independent component responsible for managing it? That way it could keep track of how many invalidations are in flight and avoid hitting the quota unnecessarily. We'll also want a way for the docs.rs team to clear the queue (and, separately if needed, invalidate the whole distribution).

One other consideration: when the queue does start growing faster than we can clear it, how do we want to handle that? It may be better to treat it as a stack. That way, more recently built crates are more likely to have a successful invalidation; crates that have been waiting on an invalidation for a long time are lower priority, since their contents are likely to fall out of the CDN on their own due to age.
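To make the back-of-envelope numbers above explicit (the per-invalidation duration and the parallelism limit are the assumptions from that comment, not measured values):

```rust
fn main() {
    // Assumptions from the comment above, not measured values.
    let max_in_flight: f64 = 15.0;           // concurrent invalidations allowed
    let minutes_per_invalidation: f64 = 5.0; // assumed average duration
    let minutes_per_day: f64 = 1440.0;
    let observed_releases_per_day: f64 = 600.0;

    let throughput_per_minute = max_in_flight / minutes_per_invalidation; // 3.0
    let max_builds_per_day = throughput_per_minute * minutes_per_day;     // 4320.0

    println!(
        "capacity ~{max_builds_per_day} invalidations/day vs ~{observed_releases_per_day} releases/day"
    );
    // Sustained load above the capacity figure is where the queue grows without bound.
}
```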
Another approach (a pessimistic one) would probably be:
The problematic limit is on invalidation requests, not on paths. Since we would have to have a persistent queue anyway for the retries, we could also base the whole thing on it. Without a persistent queue we might lose a needed purge after a build.
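A rough sketch of what basing everything on such a queue could look like; the in-memory `VecDeque` here only stands in for whatever persistent store (e.g. a database table) docs.rs would actually use, and the names are illustrative:

```rust
use std::collections::VecDeque;

/// Sketch only: the VecDeque stands in for a persistent store, which is
/// what ensures a needed purge is not lost when a CloudFront call fails
/// or the process restarts.
struct InvalidationQueue {
    pending: VecDeque<String>, // CDN paths waiting to be invalidated
}

impl InvalidationQueue {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Called after a build or yank: record the paths that must be purged.
    fn push_paths(&mut self, paths: impl IntoIterator<Item = String>) {
        self.pending.extend(paths);
    }

    /// Called by a background worker: take the next batch to send to
    /// CloudFront. If the request is rate-limited, the worker pushes the
    /// batch back and retries later instead of dropping it.
    fn next_batch(&mut self, max_paths: usize) -> Vec<String> {
        let n = max_paths.min(self.pending.len());
        self.pending.drain(..n).collect()
    }
}

fn main() {
    let mut queue = InvalidationQueue::new();
    queue.push_paths(["/crate/foo/*".to_string(), "/crate/bar/*".to_string()]);
    let batch = queue.next_batch(30); // batch size here is arbitrary
    assert_eq!(batch.len(), 2);
}
```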
I'll try to collect some data on that. According to the docs an invalidation can take up to 15 minutes.
Aha, I like this approach. Coincidentally, it's quite similar to what we implemented for letsencrypt/boulder's cache invalidation (for OCSP responses).
In a small dummy distribution the invalidation only takes a few seconds. This will probably look different with more files to invalidate; we'll see in prod.
It seems like there is also a rate limit on the API calls, but the information is confusing here (aws/aws-sdk-js#3983 (comment)). We'll see how it looks after #1864 is deployed.
I think it's the number of paths; for the non-wildcard ones they say:
Short update here: we have to handle this before we can activate the full page cache again.
Where did you find that number in the docs? I'm not seeing it and just trying to confirm.
I'm sorry, but I can't find it any more; I'm not sure if they removed it. One thing I am sure of: I did many manual tests and the invalidations took ~13-15 minutes to finish.
Update on this issue: since #1961 we have a queue for these invalidations. Through the queue we are
Also, we are starting to track some metrics around the queue, and later around invalidation execution times. There is a pending optimization where paths are sometimes queued multiple times, for example when multiple releases of a crate are yanked; these can be de-duplicated before they are sent to CloudFront. We can think about further improvements (from above):
Before digging into optimizations, there is also the option to switch to Fastly at some point, where invalidations are much faster and without limits, but tag-based rather than path-based. Currently I'm leaning towards closing this issue once we de-duplicate the paths and the pending metric PRs have landed.
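For the de-duplication step mentioned above, a minimal sketch; queued entries are represented here as plain path strings, which is an assumption (the real queue rows carry more data):

```rust
use std::collections::HashSet;

/// Drop repeated entries from a batch of queued paths before sending them
/// to CloudFront (e.g. the same crate path queued once per yanked release),
/// keeping the original order.
fn dedupe_paths(paths: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    paths
        .into_iter()
        .filter(|p| seen.insert(p.clone()))
        .collect()
}

fn main() {
    let queued = vec![
        "/crate/foo/*".to_string(),
        "/crate/foo/*".to_string(), // queued twice, e.g. two yanked releases
        "/crate/bar/*".to_string(),
    ];
    assert_eq!(dedupe_paths(queued).len(), 2);
}
```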
According to https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cloudfront-limits.html#limits-invalidations:
So, during times of heavy building (particularly when many crates in a family are released at once), we are likely to hit this limit. Presumably it will depend on how long the invalidations actually take.
We should decide what to do about these: