
Commit 227710d

Author: Quinton Hoole (committed)

RFC design docs for Cluster Federation/Ubernetes.

1 parent afa7816 commit 227710d

6 files changed: +1253 -0 lines changed

+269
@@ -0,0 +1,269 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes/Ubernetes Control Plane Resilience

## Long Term Design and Current Status

### by Quinton Hoole, Mike Danese and Justin Santa-Barbara

### December 14, 2015

## Summary

Some confusion exists around how we currently ensure, and in future
want to ensure, resilience of the Kubernetes (and by implication
Ubernetes) control plane. This document is an attempt to capture that
definitively. It covers areas including self-healing, high
availability, bootstrapping and recovery. Most of the information in
this document already exists in the form of GitHub comments,
PRs/proposals, scattered documents, and corridor conversations, so this
document is primarily a consolidation and clarification of existing
ideas.

## Terms

* **Self-healing:** automatically restarting or replacing failed
  processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
  even if some components are down or uncontactable. This typically
  involves multiple replicas of critical services, and a reliable way
  to find available replicas. Note that it's possible (but not
  desirable) to have high availability properties (e.g. multiple
  replicas) in the absence of self-healing properties (e.g. if a
  replica fails, nothing replaces it). Fairly obviously, given enough
  time, such systems typically become unavailable (after enough
  replicas have failed).
* **Bootstrapping:** creating an empty cluster from nothing
* **Recovery:** recreating a non-empty cluster after catastrophic
  failure, unavailability or data corruption

## Overall Goals

1. **Resilience to single failures:** Kubernetes clusters constrained
   to single availability zones should be resilient to individual
   machine and process failures by being both self-healing and highly
   available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
   scripts for (at least) GCE, AWS and basic bare metal should adhere
   to the above (self-healing and high availability) by default (with
   options available to disable these features to reduce control plane
   resource requirements if so required). It is hoped that other
   cloud providers will also follow the above guidelines, but the
   above three are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
   which span multiple availability zones in a region should by
   default be resilient to complete failure of one entire availability
   zone (by similarly providing self-healing and high availability in
   the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
   differences between the default implementations of the above for
   GCE, AWS and basic bare metal should be minimized. This implies
   using shared libraries across these providers in the default
   scripts in preference to highly customized implementations per
   cloud provider. This is not to say that highly differentiated,
   customized per-cloud cluster creation processes (e.g. for GKE on
   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
   But those fall squarely outside the basic cross-platform OSS
   Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
   for achieving system resilience (replication controllers, health
   checking, service load balancing etc.) should be used in preference
   to building a separate set of mechanisms to achieve the same thing.
   This implies that self-hosting (the Kubernetes control plane on
   Kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
   reliably recover a cluster from catastrophic failure is critical,
   and should not be compromised by the above goal to self-host
   (i.e. it goes without saying that the cluster should be quickly and
   reliably recoverable, even if the cluster control plane is
   broken). This implies that such catastrophic failure scenarios
   should be carefully thought out and be the subject of regular
   continuous integration testing and disaster recovery exercises.

## Relative Priorities

1. **(Possibly manual) recovery from catastrophic failures:** having a
   Kubernetes cluster, and all applications running inside it, disappear
   forever is perhaps the worst possible failure mode. So it is critical
   that we be able to recover the applications running inside a cluster
   from such failures in some well-bounded time period.
   1. In theory a cluster can be recovered by replaying all API calls
      that have ever been executed against it, in order, but most
      often that state has been lost, and/or is scattered across
      multiple client applications or groups. So in general it is
      probably infeasible.
   1. In theory a cluster can also be recovered to some relatively
      recent non-corrupt backup/snapshot of the disk(s) backing the
      etcd cluster state. But we have no default consistent
      backup/snapshot, verification or restoration process. And we
      don't routinely test restoration, so even if we did routinely
      perform and verify backups, we have no hard evidence that we
      can in practice effectively recover from catastrophic cluster
      failure or data corruption by restoring from these backups. So
      there's more work to be done here (a consistent-backup sketch
      appears after this list).
1. **Self-healing:** Most major cloud providers provide the ability to
   easily and automatically replace failed virtual machines within a
   small number of minutes (e.g. GCE
   [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
   and Managed Instance Groups,
   AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
   and [Auto scaling](https://aws.amazon.com/autoscaling/) etc.). This
   can fairly trivially be used to reduce control-plane down-time due
   to machine failure to a small number of minutes per failure
   (i.e. typically around "3 nines" availability), provided that:
   1. cluster persistent state (i.e. etcd disks) is either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using etcd [dynamic member
         addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member),
         sketched after this list, or [backup and
         recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
   1. and boot disks are either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using boot-from-snapshot,
         boot-from-pre-configured-image or
         boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
   availability above the approximately "3 nines" level provided by
   automated self-healing, but it's somewhat more complex, and
   requires additional resources (e.g. redundant API servers and etcd
   quorum members). In environments where cloud-assisted automatic
   self-healing might be infeasible (e.g. on-premise bare-metal
   deployments), it also gives cluster administrators more time to
   respond (e.g. replace/repair failed machines) without incurring
   system downtime.
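
To make the consistent-backup gap discussed above concrete, here is a minimal
sketch (not an existing Kubernetes tool) of taking a point-in-time snapshot of
the etcd keyspace with the etcd v3 Go client. The endpoint and output path are
hypothetical, and a real backup process would also need scheduled runs,
verification, and regularly exercised restoration.

```go
// backup.go: stream a consistent snapshot of the etcd keyspace to a file.
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.internal:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect to etcd: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Snapshot returns a reader over a point-in-time, consistent copy of the keyspace.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		log.Fatalf("start snapshot: %v", err)
	}
	defer rc.Close()

	f, err := os.Create("/var/backups/etcd-snapshot.db") // hypothetical path
	if err != nil {
		log.Fatalf("create backup file: %v", err)
	}
	defer f.Close()

	if _, err := io.Copy(f, rc); err != nil {
		log.Fatalf("write snapshot: %v", err)
	}
	log.Println("snapshot written; restoration must still be exercised regularly")
}
```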
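Similarly, here is a hedged sketch of the etcd dynamic member addition referred
to above, using the etcd v3 Go client's cluster API. Endpoints and peer URLs
are hypothetical; real automation would also remove the failed member and start
the replacement etcd process with the returned cluster configuration.

```go
// replace_member.go: register a replacement etcd member with the surviving quorum.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect to the surviving quorum members.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://etcd-0.example.internal:2379", // hypothetical endpoints
			"https://etcd-1.example.internal:2379",
		},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect to etcd: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Announce the new member's peer URL; the quorum records it and waits for
	// the new etcd process to join and catch up.
	resp, err := cli.MemberAdd(ctx, []string{"https://etcd-3.example.internal:2380"})
	if err != nil {
		log.Fatalf("add replacement member: %v", err)
	}
	log.Printf("added member %x; start it with --initial-cluster-state existing", resp.Member.ID)

	// The failed member would then be dropped with cli.MemberRemove(ctx, failedID).
}
```
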
## Design and Status (as of December 2015)

<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>

Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API server
can connect to it will no longer work, so alternative security will be
needed in this regard (either using firewall rules, SSL certs, or
something else). All necessary flags are currently supported to enable
SSL between the API server and etcd (OpenShift runs like this out of
the box), but this needs to be woven into the "kube-up" and related
scripts (a TLS client sketch appears after the table below). Detailed
design of self-hosting and related bootstrapping and catastrophic
failure recovery will be covered in a separate design doc.

</td>
<td>

No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self-healing"
means that even if multiple e.g. API servers are provisioned for HA
purposes, if they fail, nothing replaces them, so eventually the
system will fail. Self-healing and HA can be set up
manually by following documented instructions, but this is not
currently an automated process, and it is not tested as part of
continuous integration. So it's probably safest to assume that it
doesn't actually work in practice.

</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>

Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of
API server clients, automatically installed by the default "kube-up"
automation (see the leader-election sketch after the table below).

</td>
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>

Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).

Regarding self-healing, if a node running etcd goes down, it is always
necessary to do three things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors),
<li>start an etcd replica on that new node,
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine),
the etcd state must be recovered from the other replicas. This is called
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member addition</A>
(see the member-addition sketch above).
In the case of remote persistent disk, the etcd state can be recovered
by attaching the remote persistent disk to the replacement node; thus
the state is recoverable even if all other replicas are down.

There are also significant performance differences between local disks
and remote persistent disks. For example, the
<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained
throughput of local disks in GCE is approximately 20x that of remote
disks</A>.

Hence we suggest that self-healing be provided by remotely mounted
persistent disks in non-performance-critical, single-zone cloud
deployments. For performance-critical installations, faster local SSDs
should be used, in which case remounting on node failure is not an
option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
should be used to replace the failed machine. Similarly, for
cross-zone self-healing, cloud persistent disks are zonal, so automatic
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
is required. Likewise, basic bare metal deployments cannot generally
rely on remote persistent disks, so the same approach applies there.
</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A>
on how to set some of this up manually in a self-hosted
configuration. But automatic bootstrapping and self-healing is not
described (and is not implemented for the non-PD cases). This all
still needs to be automated and continuously tested.
</td>
</tr>
</table>
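
As a concrete illustration of the "SSL between the API server and etcd" point
in the API Server row above, here is a minimal sketch, assuming the etcd v3 Go
client, of a client that authenticates to etcd over TLS with a client
certificate. This is roughly what kube-apiserver's --etcd-cafile,
--etcd-certfile and --etcd-keyfile flags configure; all file paths and
endpoints here are hypothetical.

```go
// secure_etcd_client.go: build an etcd client that uses mutually authenticated TLS.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func newSecureEtcdClient() (*clientv3.Client, error) {
	// Client certificate and key presented to etcd (cf. --etcd-certfile/--etcd-keyfile).
	cert, err := tls.LoadX509KeyPair(
		"/etc/kubernetes/pki/apiserver-etcd-client.crt", // hypothetical paths
		"/etc/kubernetes/pki/apiserver-etcd-client.key",
	)
	if err != nil {
		return nil, err
	}

	// CA bundle used to verify the etcd server certificate (cf. --etcd-cafile).
	caPEM, err := ioutil.ReadFile("/etc/kubernetes/pki/etcd-ca.crt")
	if err != nil {
		return nil, err
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	return clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.internal:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
		TLS: &tls.Config{
			Certificates: []tls.Certificate{cert},
			RootCAs:      caPool,
		},
	})
}

func main() {
	cli, err := newSecureEtcdClient()
	if err != nil {
		log.Fatalf("create TLS etcd client: %v", err)
	}
	defer cli.Close()
	log.Println("connected to etcd over mutually authenticated TLS")
}
```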
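
For the controller manager and scheduler row, the "warm standby" pattern is
leader election: every replica runs, but only the elected leader does work.
The sketch below uses the present-day client-go leaderelection package rather
than the 2015-era implementation; the lock name, namespace and timings are
illustrative only.

```go
// elect.go: run control loops only while holding a Lease-based leader lock.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Assumes the replica runs as a pod inside the cluster it manages.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname() // each replica needs a unique identity

	// A Lease object in kube-system acts as the shared lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "example-scheduler", // hypothetical lock name
			Namespace: "kube-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a failed leader is still assumed alive
		RenewDeadline: 10 * time.Second, // leader must renew its lease within this window
		RetryPeriod:   2 * time.Second,  // how often standbys retry acquiring the lock
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader: starting active control loops")
				<-ctx.Done() // a real binary would run its controllers/scheduler here
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership: stopping work")
			},
		},
	})
}
```
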
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
