# Kubernetes/Ubernetes Control Plane Resilience

## Long Term Design and Current Status

### by Quinton Hoole, Mike Danese and Justin Santa-Barbara

### December 14, 2015

## Summary

Some confusion exists around how we currently ensure resilience of the
Kubernetes (and, by implication, Ubernetes) control plane, and how we
intend to do so in future. This document is an attempt to capture that
definitively. It covers areas including self-healing, high
availability, bootstrapping and recovery. Most of the information in
this document already exists in the form of github comments,
PR's/proposals, scattered documents, and corridor conversations, so
this document is primarily a consolidation and clarification of
existing ideas.

## Terms

* **Self-healing:** automatically restarting or replacing failed
  processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
  even if some components are down or uncontactable. This typically
  involves multiple replicas of critical services, and a reliable way
  to find available replicas. Note that it's possible (but not
  desirable) to have high availability properties (e.g. multiple
  replicas) in the absence of self-healing properties (e.g. if a
  replica fails, nothing replaces it). Fairly obviously, given enough
  time, such systems typically become unavailable (after enough
  replicas have failed); the worked example after this list makes
  that concrete.
* **Bootstrapping**: creating an empty cluster from nothing
* **Recovery**: recreating a non-empty cluster after perhaps
  catastrophic failure/unavailability/data corruption
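
To make the last point about high availability without self-healing
concrete, here is a small, purely illustrative calculation (the 30-day
MTBF is an assumed figure, not a measured Kubernetes number): without
self-healing, adding replicas only delays, rather than prevents,
eventual total failure.

```python
# Expected time until a service with n replicas and *no* self-healing has
# lost all of its replicas, assuming each replica fails independently with
# an exponentially distributed lifetime (the MTBF is an assumed figure).

def expected_time_to_total_failure(replicas: int, mtbf_days: float) -> float:
    # E[max of n i.i.d. exponentials with mean mtbf] = mtbf * (1 + 1/2 + ... + 1/n)
    harmonic = sum(1.0 / k for k in range(1, replicas + 1))
    return mtbf_days * harmonic

if __name__ == "__main__":
    for n in (1, 3, 5):
        days = expected_time_to_total_failure(n, mtbf_days=30.0)
        print(f"{n} replica(s), 30-day MTBF, no self-healing: "
              f"expected total outage after ~{days:.0f} days")
```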

## Overall Goals

1. **Resilience to single failures:** Kubernetes clusters constrained
   to single availability zones should be resilient to individual
   machine and process failures by being both self-healing and highly
   available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
   scripts for (at least) GCE, AWS and basic bare metal should adhere
   to the above (self-healing and high availability) by default (with
   options available to disable these features to reduce control plane
   resource requirements if so required). It is hoped that other
   cloud providers will also follow the above guidelines, but the
   above three are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
   which span multiple availability zones in a region should by
   default be resilient to complete failure of one entire availability
   zone (by similarly providing self-healing and high availability in
   the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
   differences between the default implementations of the above for
   GCE, AWS and basic bare metal should be minimized. This implies
   using shared libraries across these providers in the default
   scripts in preference to highly customized implementations per
   cloud provider. This is not to say that highly differentiated,
   customized per-cloud cluster creation processes (e.g. for GKE on
   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
   But those fall squarely outside the basic cross-platform OSS
   Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
   for achieving system resilience (replication controllers, health
   checking, service load balancing etc.) should be used in preference
   to building a separate set of mechanisms to achieve the same thing.
   This implies that self-hosting (the Kubernetes control plane on
   Kubernetes) is strongly preferred, with the caveat below. A sketch
   of what this looks like follows this list.
1. **Recovery from catastrophic failure:** The ability to quickly and
   reliably recover a cluster from catastrophic failure is critical,
   and should not be compromised by the above goal to self-host
   (i.e. it goes without saying that the cluster should be quickly and
   reliably recoverable, even if the cluster control plane is
   broken). This implies that such catastrophic failure scenarios
   should be carefully thought out and be the subject of regular
   continuous integration testing and disaster recovery exercises.
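
As a purely illustrative sketch of the self-hosting goal above, the
snippet below runs a control-plane component (the scheduler) under a
ReplicationController so that Kubernetes itself replaces it if it
dies. The image tag, binary path, flags and replica count are
assumptions for illustration, not the tested "kube-up" configuration.

```python
# Hypothetical self-hosting sketch: define a ReplicationController for the
# scheduler and hand it to kubectl as JSON (kubectl accepts JSON as well as
# YAML). Image, command and flags below are placeholders.
import json
import subprocess

scheduler_rc = {
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {"name": "kube-scheduler", "namespace": "kube-system"},
    "spec": {
        "replicas": 2,  # warm standbys; leader election picks one active instance
        "selector": {"component": "kube-scheduler"},
        "template": {
            "metadata": {"labels": {"component": "kube-scheduler"}},
            "spec": {
                "containers": [{
                    "name": "kube-scheduler",
                    # Placeholder image tag; a real deployment would pin a released build.
                    "image": "gcr.io/google_containers/kube-scheduler:v1.1.2",
                    "command": [
                        "/usr/local/bin/kube-scheduler",    # placeholder binary path
                        "--master=http://127.0.0.1:8080",   # placeholder API server address
                        "--leader-elect=true",
                    ],
                }],
            },
        },
    },
}

# Submit the manifest through the API server, which then keeps it running.
subprocess.run(["kubectl", "create", "-f", "-"],
               input=json.dumps(scheduler_rc).encode(), check=True)
```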

## Relative Priorities

1. **(Possibly manual) recovery from catastrophic failures:** having a
   Kubernetes cluster, and all applications running inside it,
   disappear forever is perhaps the worst possible failure mode. So it
   is critical that we be able to recover the applications running
   inside a cluster from such failures in some well-bounded time
   period.
   1. In theory a cluster can be recovered by replaying all API calls
      that have ever been executed against it, in order, but most
      often that state has been lost, and/or is scattered across
      multiple client applications or groups. So in general it is
      probably infeasible.
   1. In theory a cluster can also be recovered to some relatively
      recent non-corrupt backup/snapshot of the disk(s) backing the
      etcd cluster state. But we have no default consistent
      backup/snapshot, verification or restoration process (a minimal
      backup sketch follows this list). And we don't routinely test
      restoration, so even if we did routinely perform and verify
      backups, we have no hard evidence that we can in practice
      effectively recover from catastrophic cluster failure or data
      corruption by restoring from these backups. So there's more
      work to be done here.
1. **Self-healing:** Most major cloud providers provide the ability to
   easily and automatically replace failed virtual machines within a
   small number of minutes (e.g. GCE
   [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
   and Managed Instance Groups,
   AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
   and [Auto scaling](https://aws.amazon.com/autoscaling/) etc.). This
   can fairly trivially be used to reduce control-plane down-time due
   to machine failure to a small number of minutes per failure
   (i.e. typically around "3 nines" availability), provided that:
   1. cluster persistent state (i.e. etcd disks) is either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using etcd [dynamic member
         addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
         or [backup and
         recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).

   1. and boot disks are either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using boot-from-snapshot,
         boot-from-pre-configured-image or
         boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
   availability above the approximately "3 nines" level provided by
   automated self-healing, but it's somewhat more complex, and
   requires additional resources (e.g. redundant API servers and etcd
   quorum members). In environments where cloud-assisted automatic
   self-healing might be infeasible (e.g. on-premise bare-metal
   deployments), it also gives cluster administrators more time to
   respond (e.g. replace/repair failed machines) without incurring
   system downtime.
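
As a concrete illustration of the backup gap called out in item 1
above, the following is a minimal sketch of taking a consistent
snapshot with the etcd v2 `etcdctl backup` command referenced in the
admin guide linked in item 2. The paths, scheduling and retention
policy are assumptions for illustration; they are not part of any
existing "kube-up" automation, and a real process would also need
verification and regularly exercised restores.

```python
# Minimal sketch of a consistent etcd (v2) backup via 'etcdctl backup'.
# DATA_DIR and BACKUP_ROOT are assumed paths, not Kubernetes defaults.
import subprocess
import time

DATA_DIR = "/var/lib/etcd"              # assumed etcd data directory
BACKUP_ROOT = "/var/lib/etcd-backups"   # assumed backup destination

def take_backup() -> str:
    dest = f"{BACKUP_ROOT}/{time.strftime('%Y%m%dT%H%M%SZ', time.gmtime())}"
    # 'etcdctl backup' copies the data directory and rewrites the node and
    # cluster IDs, so the copy can later seed a fresh single-member cluster.
    subprocess.run(
        ["etcdctl", "backup", "--data-dir", DATA_DIR, "--backup-dir", dest],
        check=True,
    )
    return dest

if __name__ == "__main__":
    print("backup written to", take_backup())
```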

## Design and Status (as of December 2015)

<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>

Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API
server can connect to it will no longer work, so alternative security
will be needed in this regard (either using firewall rules, SSL certs,
or something else). All necessary flags are currently supported to
enable SSL between API server and etcd (OpenShift runs like this out
of the box), but this needs to be woven into the "kube-up" and related
scripts. The detailed design of self-hosting and the related
bootstrapping and catastrophic failure recovery will be covered in a
separate design doc.

</td>
<td>

No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self-healing"
means that even if multiple API servers (for example) are provisioned
for HA purposes, if they fail, nothing replaces them, so eventually
the system will fail. Self-healing and HA can be set up manually by
following documented instructions, but this is not currently an
automated process, and it is not tested as part of continuous
integration. So it's probably safest to assume that it doesn't
actually work in practice.

</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>

Multiple self-hosted, self-healing, warm-standby stateless controller
managers and schedulers with leader election and automatic failover of
API server clients, automatically installed by the default "kube-up"
automation.

</td>
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>

Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).

Regarding self-healing, if a node running etcd goes down, it is always
necessary to do three things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors as
described in <A HREF=")
<li>start an etcd replica on that new node,
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine),
the etcd state must be recovered from the other replicas. This is
called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
addition</A>.
In the case of remote persistent disk, the etcd state can be recovered
by attaching the remote persistent disk to the replacement node, so
the state is recoverable even if all other replicas are down.

There are also significant performance differences between local
disks and remote persistent disks. For example, the
<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained
throughput of local disks in GCE is approximately 20x that of remote
disks</A>.

Hence we suggest that self-healing be provided by remotely mounted
persistent disks in non-performance-critical, single-zone cloud
deployments. For performance-critical installations, faster local
SSDs should be used, in which case remounting on node failure is not
an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
should be used to replace the failed machine (see the sketch below
this table). Similarly, for cross-zone self-healing, cloud persistent
disks are zonal, so automatic
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
is required. Similarly, basic bare metal deployments cannot generally
rely on remote persistent disks, so the same approach applies there.
</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A>
on how to set some of this up manually in a self-hosted
configuration. But automatic bootstrapping and self-healing are not
described (and are not implemented for the non-PD cases). This all
still needs to be automated and continuously tested.
</td>
</tr>
</table>
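
To make the non-PD self-healing path above (local SSDs, cross-zone, or
bare metal) concrete, here is a minimal sketch of the etcd v2 runtime
reconfiguration steps that a replacement node would drive, per the
runtime-configuration document linked in the table. The member ID,
name and peer URL are placeholders.

```python
# Minimal sketch of replacing a failed etcd (v2) member via runtime
# reconfiguration: remove the dead member, register its replacement, then
# start etcd on the new node with --initial-cluster-state=existing.
import subprocess

def replace_member(failed_member_id: str, new_name: str, new_peer_url: str) -> None:
    # Drop the failed member from the cluster configuration.
    subprocess.run(["etcdctl", "member", "remove", failed_member_id], check=True)
    # Register the replacement; 'member add' prints the ETCD_INITIAL_CLUSTER
    # settings that the new member must be started with.
    subprocess.run(["etcdctl", "member", "add", new_name, new_peer_url], check=True)

# Example invocation (ID and URL are placeholders):
#   replace_member("272e204152", "etcd-replacement", "http://10.240.0.7:2380")
```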