Handle rate limits (429 responses) better #574

Open
gavinclarkeuk opened this issue Feb 25, 2025 · 5 comments
Labels
question Further information is requested

Comments

@gavinclarkeuk

We have a Terraform module for creating topics that does a couple of data source lookups for environment and cluster details. If users call the module in a for_each loop, they very quickly run into rate-limit issues. We mainly hit these just doing simple environment lookups, but I imagine it could happen for any resource.

Given that the API responds with a 429 and headers telling the caller when to retry, couldn't the Terraform provider respect those and back off accordingly? Also, I'm not sure whether the provider follows the other recommendations from the Confluent API docs (e.g. introducing jitter).
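
For illustration, the shape of the setup described above is roughly this (module and variable names are made up, not our real code):

module "topics" {
  source   = "./modules/confluent-topic"
  for_each = toset(["orders", "payments", "invoices"])

  topic_name       = each.key
  environment_name = "staging"
  cluster_name     = "main"
}

# Inside the module, every instance repeats the same two lookups,
# so each topic adds two more API calls:
data "confluent_environment" "this" {
  display_name = var.environment_name
}

data "confluent_kafka_cluster" "this" {
  display_name = var.cluster_name
  environment {
    id = data.confluent_environment.this.id
  }
}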

@linouk23
Copy link
Contributor

linouk23 commented Feb 25, 2025

@gavinclarkeuk thanks for creating this issue!

The Terraform Provider for Confluent uses a smart HTTP client that retries up to four times on 429 and 5xx errors, using an exponential backoff strategy.

If you're still running into issues, we recommend overriding the max_retries attribute:

provider "confluent" {
    cloud_api_key    = "..."
    cloud_api_secret = "..."
    max_retries = 10 # defaults to 4
}

Let us know if that helps!

Note: for the data source lookups for environment and cluster details, consider looking up by id rather than by display_name to reduce the number of API calls.
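
A minimal sketch of the id-based lookup, assuming the IDs can be passed in as variables (names are illustrative):

data "confluent_environment" "this" {
  id = var.environment_id # e.g. "env-abc123"
}

data "confluent_kafka_cluster" "this" {
  id = var.cluster_id # e.g. "lkc-xyz789"
  environment {
    id = data.confluent_environment.this.id
  }
}

Looking up by id avoids the list-and-filter call that a display_name lookup has to make.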

@linouk23 added the question (Further information is requested) label Feb 25, 2025
@gavinclarkeuk
Author

Setting max_retries has fixed our immediate issue, but I still think there is room for improvement here.

I appreciate we could restructure our Terraform module so it didn't need to do the lookups, but that would impact its ease of use, so solving this upstream would be preferable (and useful for others).

I guess another approach could be to cache the results of duplicate data source lookups, or to batch up and dedupe the API requests. I don't know enough about provider internals to know whether that's possible, but it would certainly reduce pressure on the API.
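
For example, one way to dedupe the calls today would be to do the lookups once in the root module and pass the resolved IDs into each module instance (a rough sketch with illustrative names):

# Root module: resolve the environment and cluster once...
data "confluent_environment" "staging" {
  display_name = "staging"
}

data "confluent_kafka_cluster" "main" {
  display_name = "main"
  environment {
    id = data.confluent_environment.staging.id
  }
}

# ...then fan out without any per-topic lookups.
module "topics" {
  source   = "./modules/confluent-topic"
  for_each = toset(["orders", "payments", "invoices"])

  topic_name     = each.key
  environment_id = data.confluent_environment.staging.id
  cluster_id     = data.confluent_kafka_cluster.main.id
}

That keeps the API pressure constant regardless of how many topics the module creates, at the cost of a slightly less self-contained module interface.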

@linouk23
Contributor

@gavinclarkeuk could you share the data sources where you can observe 429 issues? Thank you!

@gavinclarkeuk
Author

The two we are hitting the most are confluent_environment and confluent_kafka_cluster doing lookups by display name.

@linouk23
Contributor

linouk23 commented Mar 4, 2025

I guess another approach could be to cache the results of duplicate data source lookups, or to batch up and dedupe the API requests. I don't know enough about provider internals to know whether that's possible, but it would certainly reduce pressure on the API.

That's a great idea, though I'm not sure it's possible 😁

The catch is that our API doesn't really support filtering by display_name for the majority of resources. We did add these filter parameters in Terraform (TF), accepting display_name instead of the id attribute for a number of data sources, due to customer demand. That said, we now believe it might have been a mistake. It's okay to use them as long as there are no 429 errors, but once you do run into them, I'm not sure we want to encourage relying on this approach long-term.

Given that the API responds with a 429 and headers telling the caller when to retry, couldn't the Terraform provider respect those and back off accordingly?

We do use a smart HTTP client, https://github.com/hashicorp/go-retryablehttp, which parses these rate-limit headers automatically.

The two we are hitting the most are confluent_environment and confluent_kafka_cluster doing lookups by display name.

Could you file a support ticket asking for a rate limit increase for the org/v2 (confluent_environment) and cmk/v2 (confluent_kafka_cluster) list API calls, so our Product team can prioritize accordingly? Thank you!
