# RFC 0191 - Worker Manager launch configurations
* Comments: [#0191](https://github.com/taskcluster/taskcluster-rfcs/pull/191)
* Proposed by: @lotas

## Summary

This proposal outlines enhancements to the worker-manager service that would address the following:

* make worker pool launch configurations first-class citizens in the worker-manager service
* balance the distribution of workers across multiple launch configurations (regions)
* dynamically adjust how likely each launch configuration is to be selected for provisioning, based on **health metrics**

**Health metrics** combine the ratio of failed to successful provisioning attempts with the number of workers currently running.

## Motivation

The current worker-manager cannot detect broken configurations, monitor the health of launch configurations, or track the availability of cloud resources.

This often leads to workers that get stuck in provisioning or never become operational, wasting resources and reducing system efficiency.

By introducing health tracking and dynamic adjustment mechanisms, we aim to optimize the provisioning process and enhance system reliability.

## Details

A new entity, the **launch configuration**, will be introduced to worker-manager.
Each worker pool will have zero or more **launch configurations** associated with it.

Each **launch configuration** will be assigned a unique ID (a hash of its properties), stored in the database, and treated as immutable (except for its status flag).

When a worker pool configuration is updated, all existing launch configurations that are not present in the new configuration will be marked as `archived`. Launch configurations that have the same unique ID (hash) will be kept active. All others will be created as new launch configurations and marked as active.
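
A sketch of this reconciliation, assuming a hash-based ID as described above; all names (`LaunchConfig`, `reconcile`, etc.) are illustrative, not the actual worker-manager implementation:

```ts
import { createHash } from 'crypto';

// Hypothetical shape of a stored launch configuration.
interface LaunchConfig {
  id: string;                      // hash of the configuration's properties
  workerPoolId: string;
  config: object;                  // the cloud-specific launch configuration
  status: 'active' | 'archived' | 'paused';
}

// Derive the immutable ID from the configuration's properties.
// NOTE: a real implementation would need a canonical (key-sorted) serialization.
function hashLaunchConfig(config: object): string {
  return createHash('sha256').update(JSON.stringify(config)).digest('hex');
}

// Reconcile stored launch configurations against an updated worker pool definition:
// archive the ones that disappeared, keep (or reactivate) matching ones, create the rest.
function reconcile(existing: LaunchConfig[], incoming: object[], workerPoolId: string): LaunchConfig[] {
  const incomingIds = new Set(incoming.map(hashLaunchConfig));
  const knownIds = new Set(existing.map(lc => lc.id));

  const kept = existing.map(lc => {
    if (!incomingIds.has(lc.id)) return { ...lc, status: 'archived' as const };
    // a configuration with a matching hash is kept active
    return lc.status === 'archived' ? { ...lc, status: 'active' as const } : lc;
  });

  const created = incoming
    .filter(cfg => !knownIds.has(hashLaunchConfig(cfg)))
    .map(cfg => ({
      id: hashLaunchConfig(cfg),
      workerPoolId,
      config: cfg,
      status: 'active' as const,
    }));

  return [...kept, ...created];
}
```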

### Launch Configuration weight

Provisioned workers would be associated with a specific launch configuration (`worker.launchConfigurationId`).
This will allow us to know how many workers with a given launch configuration successfully registered and claimed work.

Each launch configuration will have a dynamic `weight` property that will be adjusted automatically based on the following events and metrics:

* total number of successful worker provisioning attempts / registrations
* total number of failed worker provisioning attempts / registrations
* whether any worker has claimed a task
* fail-to-success ratio over a specific time period (i.e. the last hour)
* number of non-stopped workers currently running

The weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.

It will be calculated at provisioning time.
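
As an illustration, the selection step could be a standard weighted random pick, as in the sketch below (all names are illustrative):

```ts
interface WeightedLaunchConfig {
  id: string;
  weight: number; // 0 disables the configuration, 1 is the neutral default
}

// Pick a launch configuration with probability proportional to its weight.
function pickLaunchConfig(configs: WeightedLaunchConfig[]): WeightedLaunchConfig | null {
  const active = configs.filter(lc => lc.weight > 0);
  const total = active.reduce((sum, lc) => sum + lc.weight, 0);
  if (total === 0) return null; // nothing is currently eligible

  let roll = Math.random() * total;
  for (const lc of active) {
    roll -= lc.weight;
    if (roll <= 0) return lc;
  }
  return active[active.length - 1]; // guard against floating-point drift
}
```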

### Making worker-manager extensible

Worker-manager will publish additional events to Pulse to allow external systems to react:

* `launch-configuration-created`
* `launch-configuration-archived`
* `launch-configuration-paused`
* `launch-configuration-resumed`
* `worker-error` (provisioning or startup failure; will include `workerPoolId` and `launchConfigurationId`)
* `worker-running` (registered, ready for work)
* `worker-requested` (worker just requested, provisioning is starting)
* `worker-stopping` (for Azure, when the initial stopping request comes in)
* `worker-stopped`
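
For illustration only, a `worker-error` message might carry a payload along these lines; apart from `workerPoolId` and `launchConfigurationId`, which the list above calls out, all field names and values are hypothetical:

```json
{
  "workerPoolId": "proj-example/ci",
  "launchConfigurationId": "a1b2c3f0",
  "workerId": "i-0abc123",
  "error": {
    "kind": "provisioning-failed",
    "message": "Quota exceeded in region us-east-1"
  }
}
```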

New API endpoints will be introduced:

* `workerManager.getLaunchConfigs(workerPoolId)` - to retrieve all launch configurations with their statuses
* `workerManager.getLaunchConfig(workerPoolId, launchConfigId)` - to retrieve a specific launch configuration
* `workerManager.pauseLaunchConfig(workerPoolId, launchConfigId)` - to temporarily deactivate a specific active launch configuration (time-bound)
* `workerManager.resumeLaunchConfig(workerPoolId, launchConfigId)` - to resume a specific paused launch configuration

The last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic weight adjustment is not enough to prevent provisioning workers in a specific region.
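
For example, an external monitoring system could pause a launch configuration that keeps failing and resume it later. The sketch below assumes the proposed methods are exposed through the generated `taskcluster-client`; the exact client shape, pool name, and ID are illustrative:

```ts
import taskcluster from 'taskcluster-client';

async function main() {
  // Credentials and scopes are placeholders.
  const workerManager = new taskcluster.WorkerManager({
    rootUrl: process.env.TASKCLUSTER_ROOT_URL,
    credentials: { clientId: '...', accessToken: '...' },
  });

  // Pause a launch configuration that keeps failing in one region...
  await workerManager.pauseLaunchConfig('proj-example/ci', 'a1b2c3f0');

  // ...and resume it once the region has recovered.
  await workerManager.resumeLaunchConfig('proj-example/ci', 'a1b2c3f0');
}

main().catch(console.error);
```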

Existing endpoints will continue to accept the same payload as before for backward compatibility.

### Expiry of launch configurations

During worker pool configuration updates, previous launch configurations will be marked as `archived` and kept in the database for a certain amount of time.
They should typically be kept for as long as the workers that were created from them.
Once those workers have expired and been removed from the database, the launch configuration can be removed as well.
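
A minimal sketch of such an expiry pass, assuming hypothetical database helpers:

```ts
// Remove archived launch configurations once no workers reference them.
// `db` and its methods are hypothetical placeholders for the real data layer.
async function expireLaunchConfigs(db: {
  getArchivedLaunchConfigs(): Promise<{ id: string }[]>;
  countWorkersForLaunchConfig(id: string): Promise<number>;
  deleteLaunchConfig(id: string): Promise<void>;
}): Promise<void> {
  for (const lc of await db.getArchivedLaunchConfigs()) {
    const remaining = await db.countWorkersForLaunchConfig(lc.id);
    if (remaining === 0) {
      await db.deleteLaunchConfig(lc.id);
    }
  }
}
```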

### Worker interaction

Optionally, workers might be able to call the worker-manager API periodically to check whether their launch configuration is still active.
This could supersede the previous `deploymentId` mechanism.
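
On the worker side, this could be a simple periodic check built on the proposed `getLaunchConfig` endpoint; the shutdown hook below is a hypothetical stand-in for worker-specific logic:

```ts
// Periodically verify that this worker's launch configuration is still active,
// and shut down gracefully once it has been archived (similar in spirit to the
// old `deploymentId` check).
async function checkLaunchConfig(
  workerManager: { getLaunchConfig(pool: string, id: string): Promise<{ status: string }> },
  workerPoolId: string,
  launchConfigurationId: string,
): Promise<void> {
  const lc = await workerManager.getLaunchConfig(workerPoolId, launchConfigurationId);
  if (lc.status === 'archived') {
    // hypothetical: stop claiming tasks, finish current work, then exit
    await gracefulShutdown();
  }
}

declare function gracefulShutdown(): Promise<void>;
```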

### Static workers

A `static` worker pool configuration differs from a regular worker pool configuration in that it does not have any launch configurations.

It is currently stored as `config.workerConfig`.
To make it consistent with the rest of the worker pools, we would move it to `config.launchConfigurations` with a single launch configuration.

Static workers could use the new worker-manager API to check whether their launch configuration is still active.
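
For illustration, the migration might look like this (the `workerConfig` contents are made up). Before:

```json
{
  "workerConfig": {
    "shutdown": { "enabled": false }
  }
}
```

After, as a single launch configuration:

```json
{
  "launchConfigurations": [
    {
      "workerConfig": {
        "shutdown": { "enabled": false }
      }
    }
  ]
}
```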

## Examples of weight adjusting

### Scenario 1 - no failures

No workers have been provisioned yet, and we have two launch configurations, A and B.
Both start with the same weight, `1`, so each has a `50%` chance of being selected.

After some time, there could be 10 workers running for config A and 5 workers running for config B.
With this information, the weight would be adjusted to `0.33` for config A and `0.66` for config B, favouring the configuration with fewer running workers.
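
One balancing rule consistent with these numbers is to give each configuration the share of running workers it does *not* currently hold, i.e. `weight = 1 - running / totalRunning`. A minimal sketch, with hypothetical names:

```ts
// weight_i = 1 - running_i / totalRunning (neutral weight 1 when nothing runs yet)
function balanceWeights(running: Record<string, number>): Record<string, number> {
  const total = Object.values(running).reduce((a, b) => a + b, 0);
  if (total === 0) {
    return Object.fromEntries(Object.keys(running).map(id => [id, 1]));
  }
  return Object.fromEntries(
    Object.entries(running).map(([id, n]) => [id, 1 - n / total]),
  );
}

// balanceWeights({ A: 10, B: 5 }) → { A: 0.33…, B: 0.66… }
```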

### Scenario 2 - failures in some regions

There are three launch configurations: A, B and C.
At some point, provisioning workers in region A starts to fail with quota-exceeded errors.
The weight of A would be adjusted proportionally to the error rate: `1 - (failed / total)`. For example, 6 failed attempts out of 8 would reduce A's weight to `0.25`.

Note: To avoid permanently disabling a launch configuration, we would only base the adjustment on a specific time period (i.e. the *last hour*).

### Scenario 3 - new launch configurations

We want to avoid situations where workers cannot be created or started.
This can happen when a configuration is broken, or when there are temporary issues on the cloud provider side.

During provisioning we would check: (a) the count of workers created, (b) the count of workers that registered and claimed tasks, (c) the count of errors in the *last hour*.

1. No workers created yet: (a) == 0

   Lowering the weight for all launch configurations would not help, as they would all end up with the same weight, so we keep it as is.

2. Workers created, but none of them registered: (a) > 0, (b) == 0

   This could indicate that workers are still starting up, so we don't adjust the weight.
   Alternatively, we could look at the creation time of those workers and start lowering the weight after some period (e.g. 30 minutes).

3. Workers created, none registered, errors exist: (a) > 0, (b) == 0, (c) > 0

   This could indicate that there are issues with the launch configuration; we would lower its weight to `0` to avoid provisioning more workers.

This should be sufficient to react to the most common issues that can happen during provisioning and to prevent creating too many workers that are expected to fail.
It also allows provisioning to resume after the error expiration timeout (the last hour by default).
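
Put together, the three rules above (plus the error-rate scaling from scenario 2) could look roughly like this sketch; the counts and names are illustrative:

```ts
interface HealthCounts {
  created: number;     // (a) workers created
  registered: number;  // (b) workers that registered and claimed tasks
  errors: number;      // (c) errors within the last hour
}

// Returns a multiplier applied to the launch configuration's weight.
function healthAdjustment({ created, registered, errors }: HealthCounts): number {
  if (created === 0) return 1;                   // rule 1: nothing to judge yet
  if (registered === 0 && errors > 0) return 0;  // rule 3: likely broken, stop provisioning
  if (registered === 0) return 1;                // rule 2: probably still starting up
  return Math.max(0, 1 - errors / created);      // scenario 2: scale by recent error rate
}
```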