Scalability limits in the Resource Processor #4293

Open
TonyWildish-BH opened this issue Jan 31, 2025 · 1 comment
Labels
question Further information is requested

Comments

@TonyWildish-BH
Contributor

Description

In my scalability tests last year, I ran a script that attempted to create dozens of resources - workspaces in this case. With the Resource Processor at the default setting of a max pool size of one, my workspaces were all created, but of course I had to wait a long time, as only 5 processes were running at a time.

I tried enlarging the Resource Processor pool to see if I could create more resources in parallel. I went into the Azure portal and manually increased the pool size from max 1 to max 4, then re-ran my tests. It did indeed try to create 20 workspaces in one go, but they failed with terraform errors: Azure was throttling the API calls, and resources were left in a bad state. Unfortunately, I no longer have the logs, so I can't give the precise message. However, I do recall that terraform was not handling the throttling well.

Anyway, my question is: How can I increase the parallelism of the Resource Processor without terraform falling over?

This likely requires two things:

  • Better retry handling in terraform, so it doesn't just crash and burn when it doesn't need to. That would prevent the failures, but not speed things up.
  • Relaxed throttling limits on the Azure APIs, which would speed things up.
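
To illustrate the first point, here is a minimal sketch of retrying a throttled call with exponential backoff and jitter. It is not how terraform or the Resource Processor behaves today; `ThrottledError` and `call_with_backoff` are hypothetical names, and the operation being retried is whatever the caller passes in.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling failure (for example an HTTP 429)."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Run operation(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: pick a random delay so parallel callers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

As noted above, this only avoids the failures. The retries still wait out the throttling, so the batch does not finish any faster.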

Steps

The steps I have tried are:

  1. Increase the size of the Resource Processor pool in the Azure portal.
  2. Write a Bash script to use the tre CLI to create 20-30 workspaces in a tight loop, using --no-wait so they queue.
  3. Watch as the workspaces fail in various stages, due to throttling of terraform API calls by Azure.
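
For anyone repeating step 2, a rough Python equivalent of the loop that also keeps the output of every request on disk might look like this. It is a sketch under assumptions: `run_batch` is a hypothetical helper, and `make_command` must return the real tre CLI invocation, which is not reproduced here. With `--no-wait` the CLI returns as soon as the request is queued, so these logs only cover submission; the deployment errors themselves would still have to be collected separately.

```python
import subprocess
from pathlib import Path


def run_batch(make_command, count, log_dir):
    """Run `count` commands back to back, saving each one's output to its own log file.

    make_command(i) returns the argument list for request i.
    Returns the indexes of the commands that exited with a non-zero status.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for i in range(count):
        result = subprocess.run(make_command(i), capture_output=True, text=True)
        log_path = log_dir / f"request-{i:03d}.log"
        log_path.write_text(
            f"exit code: {result.returncode}\n"
            f"--- stdout ---\n{result.stdout}\n"
            f"--- stderr ---\n{result.stderr}\n"
        )
        if result.returncode != 0:
            failures.append(i)
    return failures
```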
@TonyWildish-BH TonyWildish-BH added the question Further information is requested label Jan 31, 2025
@marrobi
Member

marrobi commented Jan 31, 2025

Hi @TonyWildish-BH, there are some scenarios where parallel operations hit transient errors - #3177 (comment)

When it's an Azure platform issue, all we can do is try to work around it by blocking parallel operations or throttling.

If you can reproduce the scenario and provide the specific errors, then we can look at next steps. Thanks.
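
As a rough illustration of the "blocking parallel operations or throttling" workaround, a client-side cap on the number of operations in flight could be sketched as below. This is not how the Resource Processor is implemented; `run_bounded` is a hypothetical helper, and each task is any zero-argument callable the caller supplies.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_parallel):
    """Run zero-argument callables with at most max_parallel in flight.

    Returns a list of (result, error) pairs in the same order as tasks.
    """
    def guarded(task):
        try:
            return task(), None
        except Exception as exc:  # keep going so one failure does not abort the batch
            return None, exc

    # The pool size is the concurrency limit: extra tasks wait in the queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(guarded, tasks))
```

Lowering `max_parallel` trades throughput for fewer simultaneous API calls, which is the same trade-off as the pool size discussed above.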
