
Commit 802091b

Worker manager launch configs (#191)
* 0191 - worker manager launch configurations
* Launch configuration ID would only be a hash of its properties, to avoid changing the ID too often when capacity and lifecycle values change, which happens more frequently than the config itself
* Updated TOC
* Add worker created/shutdown events
* Leave note about static workers and worker interactions
* Updated info about static workers
1 parent 9f45947 commit 802091b

File tree: 3 files changed, +136 −0 lines

README.md (+1)
@@ -68,3 +68,4 @@ See [mechanics](mechanics.md) for more detail.
| RFC#180 | [Github cancel previous tasks](rfcs/0180-Github-cancel-previous-tasks.md) |
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](rfcs/0182-taskcluster-yml-remote-references.md) |
| RFC#189 | [Batch APIs for task definition, status and index path](rfcs/0189-batch-task-apis.md) |
| RFC#191 | [Worker Manager launch configurations](rfcs/0191-worker-manager-launch-configs.md) |
rfcs/0191-worker-manager-launch-configs.md (+134)

@@ -0,0 +1,134 @@
# RFC 0191 - Worker Manager launch configurations
* Comments: [#0191](https://github.com/taskcluster/taskcluster-rfcs/pull/191)
* Proposed by: @lotas
## Summary

This proposal outlines enhancements to the worker-manager service that would address the following:

* make worker pool launch configurations first-class citizens in the worker-manager service
* balanced distribution of workers across multiple launch configurations (regions)
* dynamic adjustment of a launch configuration's likelihood to be provisioned, based on **health metrics**

**Health metrics** - the combination of the fail-to-success ratio of provisioning attempts and the number of workers currently running.
## Motivation

The current worker-manager lacks the ability to deal with broken configurations, to monitor the health of launch configurations, or to track the availability of cloud resources.

This often leads to the creation of workers that are stuck in provisioning or fail to become operational, wasting resources and reducing system efficiency.

By introducing health tracking and dynamic adjustment mechanisms, we aim to optimize the provisioning process and enhance system reliability.

## Details
A new entity will be introduced to worker-manager - the **Launch Configuration**.
Each worker pool will have zero or more **launch configurations** associated with it.

Each **launch configuration** will be assigned a unique ID (a hash of its properties), stored in the database, and will be immutable (except for the status flag).

During updates of the worker pool configuration, all existing launch configurations that are not present in the new configuration will be marked as `archived`. Launch configurations that have the same unique ID (hash) will be kept active. All others will be created as new launch configurations and marked as active.
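
As a non-normative sketch, the ID derivation and reconciliation could look roughly like this (the hashing scheme, serialization, and field names are illustrative assumptions, not part of the proposal):

```ts
import { createHash } from "crypto";

interface LaunchConfig {
  launchConfigId: string;  // hash of the configuration body
  configuration: object;   // cloud-specific launch properties
  isArchived: boolean;
}

// Hash only the configuration body so the ID stays stable when unrelated
// pool fields (e.g. capacity or lifecycle values) change.
// NB: a canonical, key-order-stable serialization would be needed in practice.
const launchConfigId = (configuration: object): string =>
  createHash("sha256").update(JSON.stringify(configuration)).digest("hex");

// Reconcile stored launch configurations with an updated worker pool definition.
function reconcile(existing: LaunchConfig[], updated: object[]): LaunchConfig[] {
  const incomingIds = new Set(updated.map(launchConfigId));
  const knownIds = new Set(existing.map(lc => lc.launchConfigId));

  // archive configurations that are no longer present in the new definition
  const kept = existing.map(lc => ({
    ...lc,
    isArchived: lc.isArchived || !incomingIds.has(lc.launchConfigId),
  }));

  // create configurations whose hash is not known yet
  const added = updated
    .filter(cfg => !knownIds.has(launchConfigId(cfg)))
    .map(cfg => ({
      launchConfigId: launchConfigId(cfg),
      configuration: cfg,
      isArchived: false,
    }));

  return [...kept, ...added];
}
```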
### Launch Configuration weight

Provisioned workers would be associated with a specific launch configuration (`worker.launchConfigurationId`).
This will allow us to know how many workers with this launch configuration were successfully registered and claimed work.

Each launch configuration will have a dynamic `weight` property that will be adjusted automatically based on the following events and metrics:

* total number of successful worker provisioning attempts / registrations
* total number of failed worker provisioning attempts / registrations
* whether any worker has claimed a task
* fail-to-success ratio over a specific time period (i.e. the last hour)
* number of non-stopped workers currently running

The weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.

This will be calculated at provisioning time.
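
As a rough illustration of how the weight could drive selection at provisioning time (the RFC does not prescribe a particular algorithm; the names below are assumptions):

```ts
interface WeightedLaunchConfig {
  launchConfigId: string;
  weight: number; // 0 disables the configuration, 1 is the default
}

// Pick a launch configuration with probability proportional to its weight.
function pickLaunchConfig(configs: WeightedLaunchConfig[]): WeightedLaunchConfig | undefined {
  const active = configs.filter(c => c.weight > 0);
  const total = active.reduce((sum, c) => sum + c.weight, 0);
  if (total === 0) {
    return undefined; // nothing eligible for provisioning
  }
  let roll = Math.random() * total;
  for (const config of active) {
    roll -= config.weight;
    if (roll <= 0) {
      return config;
    }
  }
  return active[active.length - 1]; // guard against floating-point drift
}
```

With two configurations of equal weight this reduces to the 50/50 split described in Scenario 1 below.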
### Making worker-manager extensible

Worker-manager will publish additional events to Pulse to allow external systems to react:

* `launch-configuration-created`
* `launch-configuration-archived`
* `launch-configuration-paused`
* `launch-configuration-resumed`
* `worker-error` (provisioning or start-up failure; will include `workerPoolId` and `launchConfigurationId`)
* `worker-running` (registered, ready for work)
* `worker-requested` (worker just requested, provisioning is starting)
* `worker-stopping` (for Azure, when the initial stopping request comes in)
* `worker-stopped`

New API endpoints will be introduced:

* `workerManager.getLaunchConfigs(workerPoolId)` - retrieve all launch configurations with their statuses
* `workerManager.getLaunchConfig(workerPoolId, launchConfigId)` - retrieve a specific launch configuration
* `workerManager.pauseLaunchConfig(workerPoolId, launchConfigId)` - deactivate a specific active launch configuration (time bound)
* `workerManager.resumeLaunchConfig(workerPoolId, launchConfigId)` - resume a specific paused launch configuration

The last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic adjustment of weights is not enough to prevent provisioning workers in a specific region.
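
A hypothetical sketch of how an external system could react, using only the endpoints proposed above (the response shape and the `region` field on a launch configuration are assumptions):

```ts
// `workerManager` is assumed to be a Taskcluster WorkerManager client
// that already exposes the proposed endpoints.
interface WorkerManagerClient {
  getLaunchConfigs(workerPoolId: string): Promise<{
    launchConfigs: { launchConfigId: string; region?: string }[];
  }>;
  pauseLaunchConfig(workerPoolId: string, launchConfigId: string): Promise<void>;
}

// Pause every launch configuration that targets a region known to be broken.
async function pauseRegion(
  workerManager: WorkerManagerClient,
  workerPoolId: string,
  brokenRegion: string,
): Promise<void> {
  const { launchConfigs } = await workerManager.getLaunchConfigs(workerPoolId);
  for (const lc of launchConfigs) {
    if (lc.region === brokenRegion) {
      await workerManager.pauseLaunchConfig(workerPoolId, lc.launchConfigId);
    }
  }
}
```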
Existing endpoints will continue to accept the same payload as before for backward compatibility.

### Expiry of launch configurations

During worker pool configuration updates, previous launch configurations would be marked as `archived` and kept in the database for a certain amount of time.
Usually they should be kept for as long as the workers that were created with this configuration.
Once such workers have expired and been removed from the database, we can remove the launch configuration as well.

### Worker interaction

Optionally, workers might be able to call the worker-manager API periodically to check if their launch configuration is still active.
This could supersede the previous `deploymentId` mechanism.

### Static workers

The `static` worker pool configuration differs from a regular worker pool configuration in that it does not have any launch configurations.

It is currently stored as `config.workerConfig`.
To make it consistent with the rest of the worker pools, we would move it to `config.launchConfigurations` with a single launch configuration.

Static workers could use the new worker-manager API to check if their launch configuration is still active.
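
A hypothetical before/after sketch of that move; only `config.workerConfig` and `config.launchConfigurations` come from the text above, the remaining fields are assumptions:

```ts
// Current shape of a static worker pool definition (assumed).
const before = {
  providerId: "static",
  config: {
    workerConfig: { /* settings handed to the workers */ },
  },
};

// Proposed shape: the same settings wrapped in a single launch configuration.
const after = {
  providerId: "static",
  config: {
    launchConfigurations: [
      { workerConfig: { /* same settings as above */ } },
    ],
  },
};
```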
## Examples of weight adjusting

### Scenario 1 - no failures

No workers have been provisioned yet, and we have two launch configurations A and B.
Both of them would have the same weight - `1` - so the likelihood of selecting either one would be `50%`.

After some time, there could be 10 workers running for config A and 5 workers running for config B.
With this information, the weight would be adjusted to `0.33` for config A and `0.66` for config B.
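
One way to read these numbers (the exact formula is an assumption; the RFC leaves it open): each configuration's weight is reduced by its share of the currently running workers, i.e. `weight(A) = 1 - 10/15 = 1/3` (the `0.33` above) and `weight(B) = 1 - 5/15 = 2/3` (the `0.66` above), steering new provisioning toward the configuration that currently runs fewer workers.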
### Scenario 2 - failures in some regions

There are three launch configurations A, B and C.
At some point, provisioning workers in region A fails with quota-exceeded errors.
The weight of A would be adjusted proportionally to the error rate - `1 - (failed / total)`.

Note: to avoid permanently disabling a launch config, we would only adjust the weight for a specific time period (i.e. the *last hour*).
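
For example (illustrative numbers, not from the RFC): if 8 of the 20 provisioning attempts for config A in the last hour failed, its weight becomes `1 - 8/20 = 0.6` while B and C stay at `1`; once those errors age out of the one-hour window, A's weight returns to `1`.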
### Scenario 3 - new launch configurations

We want to avoid a situation where workers cannot be created or started.
This can happen when a configuration is broken, or when there are temporary issues on the cloud provider side.

During provisioning we would check: (a) the count of workers created, (b) the count of workers that registered and claimed tasks, and (c) the count of errors in the *last hour*; a sketch of this decision logic follows the list below.

1. No workers created yet: (a) == 0

   Lowering the weight for all launch configurations would not help, since they would all have the same weight; we keep it as is.

2. Workers created, but none of them registered: (a) > 0, (b) == 0

   This could indicate that workers are still starting up, so we don't adjust the weight.
   Alternatively, we could look at the creation time of those workers and after some period (30 minutes) start to lower the weight.

3. Workers created, none registered, errors exist: (a) > 0, (b) == 0, (c) > 0

   This could indicate that there are issues with the launch configuration; we would lower the weight for this launch configuration to `0` to avoid provisioning more workers.
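
A minimal sketch of this decision logic, assuming the counters are already scoped to a single launch configuration and to the last hour (the names and the final healthy-case formula are illustrative, not prescribed by the RFC):

```ts
interface LaunchConfigStats {
  created: number;     // (a) workers created for this launch configuration
  registered: number;  // (b) workers that registered and claimed tasks
  errors: number;      // (c) provisioning errors seen in the last hour
  currentWeight: number;
}

// Returns the weight to use for the next provisioning iteration.
function adjustWeight(stats: LaunchConfigStats): number {
  if (stats.created === 0) {
    // 1. nothing created yet - lowering every weight equally changes nothing
    return stats.currentWeight;
  }
  if (stats.registered === 0 && stats.errors > 0) {
    // 3. workers exist, none registered, and errors were seen - stop provisioning
    return 0;
  }
  if (stats.registered === 0) {
    // 2. workers may still be starting up - leave the weight untouched for now
    return stats.currentWeight;
  }
  // healthy configuration: scale by the failure rate, as in Scenario 2
  return Math.max(0, 1 - stats.errors / stats.created);
}
```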
This should be sufficient to react to the most common issues that can happen during provisioning and prevent creating too many workers that are expected to fail.
It also allows provisioning to resume after the error expiration timeout (the last hour by default).

rfcs/README.md (+1)
@@ -56,3 +56,4 @@
| RFC#180 | [Github cancel previous tasks](0180-Github-cancel-previous-tasks.md) |
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](0182-taskcluster-yml-remote-references.md) |
| RFC#189 | [Batch APIs for task definition, status and index path](0189-batch-task-apis.md) |
| RFC#191 | [Worker Manager launch configurations](0191-worker-manager-launch-configs.md) |
