Applying Chaos Engineering

Table of Contents

Prerequisites
Introduction to Chaos Engineering
- The Chaos Engineering Process
Application Architecture Recap
- Deploy the Application
Creating the Experiment
Running the Experiment
- Inspecting the details in the Journal
- Learning from the Experiment

This chapter explores how you can conduct Chaos Engineering against your application running in Kubernetes.

Prerequisites

This chapter uses a cluster with 3 master nodes and 5 worker nodes, as described in the multi-master, multi-node gossip-based cluster instructions.

All configuration files for this chapter are in the experiments directory. Make sure you change to that directory before executing any commands from this chapter.

You will need to install the open source Chaos Toolkit to run the examples in this chapter.

Introduction to Chaos Engineering

Chaos Engineering is the discipline of working with the uncertainty of distributed systems at scale. Modern distributed systems naturally become more complex in order to accommodate essential system size, distribution and the desirable benefit of increased speed of system change.

Given the number of variables involved, it is extremely difficult to prove formally, and ahead of time, that a system is faultless, especially when the system is constantly changing. Chaos Engineering answers this challenge by providing a means of improving confidence in the system’s availability under various production conditions.

Chaos Engineering is an empirical process: experiments exercise a distributed system to expose whatever weaknesses can be found.

The Chaos Engineering Process

The "Principles of Chaos" define the practical process of Chaos Engineering as follows:

Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
Hypothesize that this steady state will continue in both the control group and the experimental group.
Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
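
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular tool implements the process: it collapses the control and experimental groups into a before-and-after check on one toy system, and the names run_experiment, is_serving and kill_one_replica are hypothetical.

```python
from typing import Callable


def run_experiment(steady_state: Callable[[], bool],
                   inject: Callable[[], None]) -> str:
    """Return 'aborted', 'passed' or 'failed' for one experiment run."""
    # First two steps: the steady state is a measurable check, and we
    # only proceed if it holds before anything is changed.
    if not steady_state():
        return "aborted"
    # Third step: introduce a variable that reflects a real-world event.
    inject()
    # Fourth step: try to disprove the hypothesis by checking again.
    return "passed" if steady_state() else "failed"


# Toy system (assumption for illustration): a service with two replicas.
system = {"replicas": 2}


def is_serving() -> bool:
    return system["replicas"] > 0


def kill_one_replica() -> None:
    system["replicas"] -= 1


result = run_experiment(is_serving, kill_one_replica)
print(result)
```

Running the same experiment a second time against the toy system removes the last replica, so the hypothesis is disproved and a weakness (no spare capacity) is revealed.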

In this chapter we will explore implementing this process using the free and open source Chaos Toolkit.

Application Architecture Recap

The sample application already created by the prerequisites uses three services:

webapp: A web application microservice that uses the greeter and name microservices to generate a greeting for a person.
greeter: A microservice that returns a greeting based upon the greet name/value pair in the URL.
name: A microservice that returns a person’s name based upon the id name/value pair in the URL.

These services are built as Docker images and deployed in Kubernetes. All services are built as Node.js applications. The source code for the services is at https://github.com/arun-gupta/container-service-discovery/tree/master/services.

Deploy the Application

To deploy the application so that it can be a subject of a chaos experiment, execute the following:

$ kubectl create -f templates/app.yaml

Wait approximately 3 minutes for the load balancer to start accepting requests.

Creating the Experiment

A Chaos Toolkit experiment is defined using a JSON file format.

Each experiment consists of:

Header
Steady-state
Method & Probes
Rollbacks

Let’s look at how each of these is defined next.

Header

The experiment begins with some header information that describes the experiment being conducted:

{
    "version": "1.0.0",
    "title": "Terminating the greeting service should not impact users",
    "description": "How does the greeting service unavailability impact our users? Do they see an error or does the webapp get slower?",
    "tags": [
        "kubernetes",
        "aws"
    ],
    "configuration": {
        "web_app_url": {
            "type": "env",
            "key": "WEBAPP_URL"
        }
    },

The version describes the version of the experiment definition being followed. title and description describe the experimental hypothesis being explored.

It is typical to build up a catalog of experiments when exploring the weaknesses in a system, so tags provide searchable labels that make the catalog easier to navigate.

Finally, configuration supplies configuration parameters to the experiment, in this case populating the web_app_url parameter with the contents of the WEBAPP_URL environment variable.
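
As a rough illustration of what an env-type configuration entry means, the following Python sketch resolves such an entry against an environment mapping and then substitutes it into a ${web_app_url} placeholder, which is how the parameter is referenced later in the experiment. The resolve_configuration helper is hypothetical and is not the Chaos Toolkit's own code; the URL is a stand-in value.

```python
import os
from string import Template


def resolve_configuration(configuration: dict, environ=os.environ) -> dict:
    """Resolve {"type": "env", "key": ...} entries; keep other values as-is."""
    resolved = {}
    for name, value in configuration.items():
        if isinstance(value, dict) and value.get("type") == "env":
            # Raises KeyError if the environment variable is not set.
            resolved[name] = environ[value["key"]]
        else:
            resolved[name] = value
    return resolved


configuration = {"web_app_url": {"type": "env", "key": "WEBAPP_URL"}}

# Stand-in environment so the sketch runs without a cluster (assumption).
fake_env = {"WEBAPP_URL": "http://example.invalid/"}

resolved = resolve_configuration(configuration, fake_env)
url = Template("${web_app_url}").substitute(resolved)
print(url)
```

Failing loudly on a missing variable is a deliberate choice in this sketch: an experiment that silently probes an empty URL would report misleading results.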

Defining Steady-state

Steady-state defines how a system should observably respond, often within a tolerance, in order to be characterised as behaving "normally".

For the sample application, steady-state could be defined as:

The root URL of the webapp microservice should always respond with a 200 OK HTTP Status Code within a timeout of 3 seconds.

Using the Chaos Toolkit’s JSON experiment definition format, the steady-state hypothesis can be defined as:

    "steady-state-hypothesis": {
        "title": "Services are all available and healthy",
        "probes": [
            {
                "type": "probe",
                "name": "application-should-be-alive-and-healthy",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.pod.probes",
                    "func": "pods_in_phase",
                    "arguments": {
                        "label_selector": "app=webapp-pod",
                        "phase": "Running",
                        "ns": "default"
                    }
                }
            },
            {
                "type": "probe",
                "name": "application-must-respond-normally",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "${web_app_url}",
                    "timeout": 3
                }
            }
        ]
    },

The steady-state hypothesis begins with a title, which describes what the steady-state represents. A collection of probes is then defined that describes how the steady-state can be observed.

In this case the probes detect that the pods labelled app=webapp-pod are in the Running phase, and that the URL supplied by the web_app_url configuration parameter returns the specified status code, 200, within the specified timeout of 3 seconds.
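
The role of tolerance can be sketched in Python. This is a simplified reading of the two probes above and not the Chaos Toolkit's actual checking code: within_tolerance is a hypothetical helper, the HTTP probe's output is assumed to be a mapping with a "status" entry, and a probe that timed out is assumed to produce no output (None).

```python
def within_tolerance(tolerance, output) -> bool:
    """Compare a probe's output with the tolerance declared for it."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(tolerance, bool):
        return isinstance(output, bool) and output == tolerance
    if isinstance(tolerance, int):
        # Assumed shape of an HTTP probe result: {"status": <code>, ...}.
        status = output.get("status") if isinstance(output, dict) else output
        return status == tolerance
    raise TypeError(f"unsupported tolerance: {tolerance!r}")


checks = [
    # (probe name, declared tolerance, observed output)
    ("application-should-be-alive-and-healthy", True, True),
    ("application-must-respond-normally", 200, {"status": 200}),
    ("application-must-respond-normally", 200, None),  # timed out
]

results = [within_tolerance(tolerance, output) for _, tolerance, output in checks]
steady_state_met = all(results[:2])
print(results, steady_state_met)
```

The steady state is only met when every probe is within its tolerance, so a single slow response is enough to fail the hypothesis.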

Method & Probes

The next step of the Chaos Engineering process is to introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.

These variables are introduced using the experiment’s method:

    "method": [
        {
            "type": "action",
            "name": "terminate-greeting-service",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {
                    "label_selector": "app=greeter-pod",
                    "ns": "default"
                }
            }
        },
        {
            "type": "probe",
            "name": "fetch-application-logs",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.probes",
                "func": "read_pod_logs",
                "arguments": {
                    "label_selector": "app=webapp-pod",
                    "last": "20s",
                    "ns": "default"
                }
            }
        }
    ],

This experiment’s method first has an action that kills all pods that have the label of app=greeter-pod. Often Chaos Toolkit experimental methods only contain actions, as it is the actions that manipulate the real-world variables of the distributed system.

In this experiment’s case there is also a probe in the method. Probes in an experiment’s method give us a chance to collect more information as the real-world variables are being manipulated by the experiment. The probe here extends the output of the experiment with the logs from pods labelled with app=webapp-pod.

Install the Kubernetes extension for Chaos Toolkit:

$ pip install chaostoolkit-kubernetes

Rollbacks

It is sometimes useful to supply an additional set of actions at the end of an experiment so that any actions undertaken in the method can be explicitly reversed. These are contained in a rollbacks section. Because Kubernetes will recover from this experiment’s actions anyway, no rollback actions are required in this case:

    "rollbacks": []
}

This completes the experiment definition.
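
To see how the sections fit together, the Python sketch below loads a trimmed copy of the definition and lists its activities in the order the run output later in this chapter reports them: steady-state probes, then the method, then the steady-state probes again, then any rollbacks. The execution_plan helper is hypothetical, and the experiment is reduced to activity names, so treat it as an illustration rather than the Chaos Toolkit's implementation.

```python
import json

# Trimmed copy of the experiment: providers and arguments are omitted.
EXPERIMENT_JSON = """
{
    "title": "Terminating the greeting service should not impact users",
    "steady-state-hypothesis": {
        "title": "Services are all available and healthy",
        "probes": [
            {"type": "probe", "name": "application-should-be-alive-and-healthy"},
            {"type": "probe", "name": "application-must-respond-normally"}
        ]
    },
    "method": [
        {"type": "action", "name": "terminate-greeting-service"},
        {"type": "probe", "name": "fetch-application-logs"}
    ],
    "rollbacks": []
}
"""


def execution_plan(experiment: dict) -> list:
    """List (kind, name) pairs in the order they are expected to run."""
    hypothesis = [("steady-state", probe["name"])
                  for probe in experiment["steady-state-hypothesis"]["probes"]]
    method = [(activity["type"], activity["name"])
              for activity in experiment["method"]]
    rollbacks = [("rollback", action["name"])
                 for action in experiment.get("rollbacks", [])]
    # The hypothesis is checked both before and after the method.
    return hypothesis + method + hypothesis + rollbacks


plan = execution_plan(json.loads(EXPERIMENT_JSON))
for kind, name in plan:
    print(f"{kind:>12}: {name}")
```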

Running the Experiment

With your cluster running, first populate the WEBAPP_URL environment variable with the URL of your cluster’s webapp-service endpoint.

$ export WEBAPP_URL="http://$(kubectl get svc/webapp-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')/"
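
Because the load balancer can take a few minutes to start answering, it may help to confirm that the URL responds before running the experiment. The following Python sketch polls a URL until it returns 200; the function names, the 36 attempts and the 5 second delay (about 3 minutes in total) are assumptions chosen for illustration, and the demo uses a stand-in check so that it runs without a cluster.

```python
import http.client
import time
import urllib.request


def responds_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET on url answers 200 within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def wait_until_ready(url, attempts=36, delay=5.0,
                     check=responds_ok, sleep=time.sleep) -> bool:
    """Poll url until check(url) succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Demo with a stand-in check that succeeds on its third call.
calls = []


def fake_check(url: str) -> bool:
    calls.append(url)
    return len(calls) >= 3


ready = wait_until_ready("http://example.invalid/",
                         check=fake_check, sleep=lambda seconds: None)
print(ready, len(calls))
```

Against a real cluster you would call wait_until_ready(os.environ["WEBAPP_URL"]) with the default check, after importing os.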

Note

Amazon EKS uses a non-default Service Account for authenticating with the Kubernetes cluster. Until upstream kubectl supports the needed authentication mechanism, a Service Account with the required RBAC privileges needs to be created and configured in the context. This can be done using the script https://gist.github.com/mreferre/6aae10ddc313dd28b72bdc9961949978.

In addition, the chaos CLI needs to pick up the right context, which you can set with the command:

$ export KUBERNETES_CONTEXT=user1-eks-cluster

For more discussion on this topic, see aws-samples#428.

Now you can run the experiment using the chaos run command:

$ chaos run experiments/experiment.json
[2018-03-10 14:42:38 INFO] Validating the experiment's syntax
[2018-03-10 14:42:38 INFO] Experiment looks valid
[2018-03-10 14:42:38 INFO] Running experiment: Terminate the greeting service should not impact users
[2018-03-10 14:42:38 INFO] Steady state hypothesis: Services are all available and healthy
[2018-03-10 14:42:38 INFO] Probe: application-should-be-alive-and-healthy
[2018-03-10 14:42:38 INFO] Probe: application-must-respond-normally
[2018-03-10 14:42:39 INFO] Steady state hypothesis is met!
[2018-03-10 14:42:39 INFO] Action: terminate-greeting-service
[2018-03-10 14:42:40 INFO] Probe: fetch-application-logs
[2018-03-10 14:42:41 INFO] Steady state hypothesis: Services are all available and healthy
[2018-03-10 14:42:41 INFO] Probe: application-should-be-alive-and-healthy
[2018-03-10 14:42:42 INFO] Probe: application-must-respond-normally
[2018-03-10 14:42:45 ERROR]   => failed: activity took too long to complete
[2018-03-10 14:42:45 CRITICAL] Steady state probe 'application-must-respond-normally' is not in the given tolerance so failing this experiment
[2018-03-10 14:42:45 INFO] Let's rollback...
[2018-03-10 14:42:45 INFO] No declared rollbacks, let's move on.
[2018-03-10 14:42:45 INFO] Experiment ended with status: failed

The output of the chaos run command shows that the experiment ran and exposed a weakness in the system. When the greeting service is killed, the webapp-service endpoint takes longer to respond than the 3 seconds allowed as the tolerance, so the system is no longer observed to be in steady-state.

Inspecting the details in the Journal

More detail on the weaknesses discovered can be inspected by opening the journal.json file that is produced after every experiment execution. For example, journal.json records the failed steady-state probe together with the timeout exception that caused it:

{
    "activity": {
        "type": "probe",
        "name": "application-must-respond-normally",
        "tolerance": 200,
        "provider": {
            "type": "http",
            "url": "${web_app_url}",
            "timeout": 3
        }
    },
    "output": null,
    "status": "failed",
    "exception": [
    "Traceback (most recent call last):\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/connectionpool.py\", line 387, in _make_request\n    six.raise_from(e, None)\n",
    "  File \"<string>\", line 2, in raise_from\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/connectionpool.py\", line 383, in _make_request\n    httplib_response = conn.getresponse()\n",
    "  File \"/usr/local/Cellar/python/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py\", line 1331, in getresponse\n    response.begin()\n",
    "  File \"/usr/local/Cellar/python/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py\", line 297, in begin\n    version, status, reason = self._read_status()\n",
    "  File \"/usr/local/Cellar/python/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py\", line 258, in _read_status\n    line = str(self.fp.readline(_MAXLINE + 1), \"iso-8859-1\")\n",
    "  File \"/usr/local/Cellar/python/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py\", line 586, in readinto\n    return self._sock.recv_into(b)\n",
    "socket.timeout: timed out\n",
    "\nDuring handling of the above exception, another exception occurred:\n\n",
    "Traceback (most recent call last):\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/requests/adapters.py\", line 440, in send\n    timeout=timeout\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/connectionpool.py\", line 639, in urlopen\n    _stacktrace=sys.exc_info()[2])\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/util/retry.py\", line 357, in increment\n    raise six.reraise(type(error), error, _stacktrace)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/packages/six.py\", line 686, in reraise\n    raise value\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/connectionpool.py\", line 601, in urlopen\n    chunked=chunked)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/connectionpool.py\", line 389, in _make_request\n    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/urllib3/connectionpool.py\", line 309, in _raise_timeout\n    raise ReadTimeoutError(self, url, \"Read timed out. (read timeout=%s)\" % timeout_value)\n",
    "urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='35.230.7.162', port=80): Read timed out. (read timeout=3)\n",
    "\nDuring handling of the above exception, another exception occurred:\n\n",
    "Traceback (most recent call last):\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/chaoslib/provider/http.py\", line 48, in run_http_activity\n    verify=verify_tls)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n",
    "  File \"/Users/russellmiles/.venvs/chaostk/lib/python3.6/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, request=request)\n",
    "requests.exceptions.ReadTimeout: HTTPConnectionPool(host='35.230.7.162', port=80): Read timed out. (read timeout=3)\n",
    "\nDuring handling of the above exception, another exception occurred:\n\n",
    "chaoslib.exceptions.FailedActivity: activity took too long to complete\n"
    ],
    "start": "2018-03-10T14:42:42.120249",
    "end": "2018-03-10T14:42:45.280973",
    "duration": 3.160724,
    "tolerance_met": false
}
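
Reading a long traceback by eye is tedious, so a small script can pull out just the failed activities. The Python sketch below walks any JSON structure and reports entries shaped like the one above (a mapping with "activity" and "status": "failed"). The helper names are hypothetical, and the nesting used in the sample journal is an assumption; the walker does not depend on it.

```python
def failed_activities(node):
    """Yield every recorded activity run whose status is 'failed'."""
    if isinstance(node, dict):
        if node.get("status") == "failed" and "activity" in node:
            yield node
        for value in node.values():
            yield from failed_activities(value)
    elif isinstance(node, list):
        for item in node:
            yield from failed_activities(item)


def summarize(run: dict) -> str:
    """One-line summary: activity name, last exception line, duration."""
    exception = run.get("exception") or ["(no exception recorded)"]
    name = run["activity"]["name"]
    return f"{name}: {exception[-1].strip()} ({run['duration']:.2f}s)"


# Trimmed copy of the entry shown above.
entry = {
    "activity": {"type": "probe", "name": "application-must-respond-normally"},
    "output": None,
    "status": "failed",
    "exception": [
        "socket.timeout: timed out\n",
        "chaoslib.exceptions.FailedActivity: activity took too long to complete\n",
    ],
    "duration": 3.160724,
    "tolerance_met": False,
}

# Assumed nesting, for illustration only.
journal = {"steady_states": {"after": {"probes": [entry]}}}

summaries = [summarize(run) for run in failed_activities(journal)]
print("\n".join(summaries))
```

To use it on a real run, load the file with json.load(open("journal.json")) and pass the result to failed_activities.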

Learning from the Experiment

Now that chaos engineering has identified a weakness, it is time to discuss and decide how to overcome it. This is the final part of the learning loop that chaos engineering provides: experiment→discover→diagnose→decide→fix.

In the case here, the weakness could be overcome at several levels. For example, at the platform infrastructure level, additional instances of the greeter service could be enabled and provided as a High Availability failover option. At the application level, a circuit breaker could be implemented in the client code in the webapp-service to protect it against delayed invocations of the greeter-service.
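
A circuit breaker of the kind mentioned above can be sketched as follows. The sample services are Node.js applications, so this Python version only illustrates the idea; the class, its thresholds and the "Hello" fallback greeting are all hypothetical. After a number of consecutive failures the breaker opens and fails fast, then allows a single trial call once a reset period has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset period elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


# Demo with a fake clock and a greeter stand-in that always times out.
now = [0.0]
calls = {"count": 0}


def unavailable_greeter():
    calls["count"] += 1
    raise TimeoutError("greeter did not answer in time")


def greet(breaker) -> str:
    try:
        return breaker.call(unavailable_greeter)
    except (CircuitOpenError, TimeoutError):
        return "Hello"  # hypothetical default greeting


breaker = CircuitBreaker(max_failures=2, reset_after=30.0, clock=lambda: now[0])
replies = [greet(breaker) for _ in range(4)]
print(replies, calls["count"])
```

In the demo only the first two requests reach the greeter; the remaining ones are answered immediately with the fallback, which is what keeps the webapp inside its 3 second tolerance.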

You’ve now completed your first Chaos Engineering exercise and are ready to continue with the workshop!
