How Chaos Engineering Works

By: Milecia McGregor

With cloud-native applications, there's always a chance that something could interrupt your services. Maybe a cable gets unplugged and brings down a server, or one of your services loses a network connection it depends on.

These are issues your system doesn't typically account for in code or infrastructure. You need a way to figure out where some of your system weaknesses are and give those areas extra attention. It's difficult to build a system that accounts for every odd occurrence that might happen, but with some chaos engineering, you can make your system resilient against a lot of unexpected conditions.

What is chaos engineering

The definition of chaos engineering is experimenting on your system to build confidence in the system's ability to withstand unpredictable conditions in production. That means chaos engineering happens when you run experiments on a system in production.

This goes against almost everything we know about best practices and that's what makes it a useful concept. You're trying to see how well your system will stand against any number of random things that could happen.

With all of the different dependencies and services an application can need to operate, there are a lot of places where the system could unexpectedly fail. The goal of chaos engineering is to find where those failure points are and build safeguards around them before they become critical issues.
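To make "safeguard" a little more concrete, here's a minimal sketch of one common kind: falling back to a default when a dependency call fails. The function names (call_with_fallback, fetch_recommendations, default_recommendations) are made up for illustration, and a real safeguard would likely also need timeouts and retries.

```python
def call_with_fallback(primary, fallback, handled=(ConnectionError, TimeoutError)):
    """Return primary(), or fallback() if primary raises one of the handled errors."""
    try:
        return primary()
    except handled:
        return fallback()


def fetch_recommendations():
    # Hypothetical dependency call that is currently failing.
    raise ConnectionError("recommendation service unreachable")


def default_recommendations():
    # Hypothetical safe default to serve while the dependency is down.
    return ["bestsellers"]


print(call_with_fallback(fetch_recommendations, default_recommendations))  # ['bestsellers']
```

A chaos experiment is how you find out whether a safeguard like this actually kicks in when the dependency disappears.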

Why you would use it

If you have a cloud-native application that you know customers are dependent on, it wouldn't hurt to run some chaos engineering experiments. (That means you should probably run some experiments if your application is in production and you have users.) Keeping your application up and running is important to keep your customers' faith in how great your tool is.

Knowing that your system can handle any number of mishaps will give you more confidence in pushing changes to production frequently and help you figure out where you can make improvements in resource allocation and performance. One thing that's hard for many teams to define is the normal working state of their application.

Understanding what steady-state means for your system usually doesn't come up until there's a problem and you're looking for a starting point. Once you have the steady-state conditions for your application, that's when you can start experimenting and seeing where things have the potential to break.
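One way to pin down a steady-state is to boil a window of normal traffic down to a couple of numbers. The sketch below assumes you can collect (latency, success) pairs for requests; the metric choices and the sample values are made up for illustration, and your own steady-state may depend on very different signals.

```python
from statistics import median


def summarize(samples):
    """Reduce (latency_ms, succeeded) samples to a small steady-state summary."""
    if not samples:
        raise ValueError("need at least one sample to describe a steady state")
    latencies = [latency for latency, _ in samples]
    failures = sum(1 for _, succeeded in samples if not succeeded)
    return {
        "median_latency_ms": median(latencies),
        "error_rate": failures / len(samples),
    }


# Made-up observations from a quiet period: (latency in ms, request succeeded?)
baseline = summarize([(100, True), (120, True), (110, True), (500, False)])
print(baseline)  # {'median_latency_ms': 115.0, 'error_rate': 0.25}
```

Once you have a summary like this, it becomes the reference point that experiment results get compared against.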

How it works

There are four main steps to any chaos engineering experiment.

  • Define the steady-state that represents the normal working behavior of your application
  • Make a hypothesis about what will happen if something in the system breaks or fails
  • Design some experiments with different variables to reflect those real-world system failures like a server going down or a connection being interrupted
  • Compare the results from the experiment to the steady-state and see what changed
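The four steps can be sketched in plain Python against a toy system. Everything here (ToyService, run_experiment, the status codes) is a stand-in invented for illustration, not how any particular chaos tool is implemented.

```python
class ToyService:
    """Toy stand-in for a service with one third-party dependency."""

    def __init__(self, has_fallback):
        self.has_fallback = has_fallback
        self.dependency_up = True

    def status(self):
        # The service answers 200 if the dependency is up or a fallback exists.
        if self.dependency_up or self.has_fallback:
            return 200
        return 503


def run_experiment(probe, expected, inject, rollback):
    # Step 1: confirm the steady-state before touching anything.
    before = probe()
    if before != expected:
        return {"ran": False, "before": before, "during": None, "hypothesis_held": False}
    # Step 2: the hypothesis is that the probe still returns `expected`
    # while the failure is active.
    # Step 3: inject the failure, observe, and always roll back.
    inject()
    try:
        during = probe()
    finally:
        rollback()
    # Step 4: compare what was observed with the steady-state.
    return {"ran": True, "before": before, "during": during,
            "hypothesis_held": during == expected}


service = ToyService(has_fallback=False)
result = run_experiment(
    probe=service.status,
    expected=200,
    inject=lambda: setattr(service, "dependency_up", False),
    rollback=lambda: setattr(service, "dependency_up", True),
)
print(result["hypothesis_held"])  # False
```

With has_fallback=True the same experiment reports that the hypothesis held, which is the kind of difference an experiment is meant to surface.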

This is how chaos engineering experiments are defined, regardless of what tools you decide to use. As an example, here's what a chaos experiment could look like if it were implemented with the Chaos Toolkit. The URL is a placeholder for your own service, and the action assumes the third-party service runs as a Docker container named third-party-service.

{
    "version": "1.0.0",
    "title": "Does our service tolerate the loss of a third-party service?",
    "description": "Our service uses this third-party service, can it handle the third-party service going down?",
    "tags": [
        "tutorial",
        "third-party-service"
    ],
    "steady-state-hypothesis": {
        "title": "Our service responds normally",
        "probes": [
            {
                "type": "probe",
                "name": "our-service-responds",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/"
                }
            }
        ]
    },
    "method": [
        {
            "name": "turn-off-third-party-service",
            "type": "action",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": "stop third-party-service"
            }
        }
    ]
}

This experiment checks to see how well your service would do if the third-party service you're depending on goes down. Experiments are written in JSON format with Chaos Toolkit so you can focus on writing out your steady-state and hypothesis. Just remember that these are experiments that get run in production.
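The tolerance on the steady-state probe is what turns an HTTP response into a pass or a fail. Here's a simplified reading of that check in Python: a single value has to match exactly, and a list means any of its values is acceptable. The helper names fetch_status and within_tolerance are made up for this sketch, and it isn't the Chaos Toolkit's own implementation.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, and so on


def within_tolerance(value, tolerance):
    """True if value matches a single tolerance or is one of a list of tolerances."""
    if isinstance(tolerance, (list, tuple)):
        return value in tolerance
    return value == tolerance


print(within_tolerance(200, 200))         # True
print(within_tolerance(503, [200, 503]))  # True
print(within_tolerance(None, 200))        # False
```

In a real run you would pass fetch_status("http://localhost:8080/") as the value, once before the action and once after it.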

You want to make sure that you're experimenting when users aren't super active or you want to test areas that aren't being used by customers yet. Chaos engineering can be dangerous if the experiments haven't been thought through completely. You can try running some experiments in staging or a different environment to see some initial effects, but the best results will come from running experiments in production.
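A simple guard can keep an experiment from starting at a bad time. This sketch assumes you know your quiet hours and can read current traffic from your monitoring; the window (02:00 to 05:00) and the traffic threshold are made-up values you'd replace with your own.

```python
from datetime import time


def in_quiet_window(now, start=time(2, 0), end=time(5, 0)):
    """True when now (a datetime.time) falls inside the low-traffic window."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def safe_to_run(now, requests_per_minute, max_requests_per_minute=50):
    """Only allow an experiment during quiet hours and while traffic is low."""
    return in_quiet_window(now) and requests_per_minute <= max_requests_per_minute


print(safe_to_run(time(3, 0), requests_per_minute=10))   # True
print(safe_to_run(time(14, 0), requests_per_minute=10))  # False
```

Checking both the clock and the live traffic means an unusually busy night still blocks the experiment.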

Tools you can use

Since this is such a delicate task, you want to make sure that you are using a tool that meets your needs. There are a number of tools out there, and the Chaos Toolkit used in the example above is one of them.

Conclusion

Chaos engineering is a great tool to add to your resiliency testing because it imitates what would actually happen to your system in real time. It gives you insights on what your steady-state is and how different events make your application deviate from it. It might be hard to get people on board at first because you're telling them that there are flaws in the system that no one accounted for.

That's one of the reasons to pitch chaos engineering. It's a way to prevent huge issues from happening and keep your customers' full confidence in your product.