Chaos in the Cloud: A look at chaos engineering and Amazon’s Fault Injection Service
December 24, 2024

When I first started developing, we wrote huge monolithic applications that ran either locally on the desktop or in a datacenter. We wrote applications with tens or even hundreds of thousands of lines of code, yet they usually consisted of a single component, or perhaps two if we used a database, handling all the logic and functionality in one place. While this often meant complex, difficult-to-navigate code bases, it also meant that, architecturally, our applications were relatively simple.

Today, we build systems with multiple components, such as databases, message queues, and application servers, all communicating over the Internet. The functionality of a system is often broken down into microservices, each of which handles a small subset of the overall problem space. We typically deploy to the cloud, which allows us to build more scalable and resilient systems. This approach also speeds up development, because we can assign multiple individuals or teams to work on different parts of the system.


The evolution of testing methods

Looking back at those monolithic applications, testing was often a very manual process. Test writers would review the requirements and write test specifications, outlining the steps needed to test each part of the system. A tester would then work through the list of steps, noting whether the system behaved as expected. If there was a discrepancy, they would file a defect and the developers would fix the problem.

As software development matured, we added new testing methods. We introduced unit tests that allow us to prove that small, independent pieces of code work as we expect. Some of us even pursued a test-driven development (TDD) approach, where the first code we write is a test. Typically, these unit tests are executed automatically as part of a pipeline, stopping new code from being deployed until all tests pass.

We’ve also introduced integration testing to ensure our code works with other systems and applications without introducing issues, and performance testing to check that our code behaves well under load.

However, one area we haven’t been able to address until recently is how the infrastructure running our code behaves. We’re not very good at looking at how all the individual components work together, especially when they don’t work as expected. How many times have projects been delayed or even canceled because of unexpected communication delays once the application was used in anger, or because one part of the code couldn’t handle a failure elsewhere in the system?


Harnessing chaos to bring confidence

As more and more companies began to build complex, distributed systems, people started to think about how to test the resiliency of those systems.

It’s probably no surprise that Amazon was one of the first companies to consider this, with Jesse Robbins introducing the idea of “game days” in 2003 [1], an attempt to improve reliability by periodically introducing faults into the system.

However, it wasn’t until Netflix began its move to the cloud in 2011 that the idea of “chaos engineering” began to spread, largely thanks to a set of tools Netflix released. In a blog post [2], Netflix engineers described a toolkit, the so-called “Simian Army,” that could be used to inject failures into their systems.

The “Simian Army” is a set of open source tools that can be used to introduce different types of failures into Netflix’s systems. For example, “Chaos Monkey” randomly terminates instances in the production environment, “Latency Monkey” introduces delays into network communication, and “Chaos Gorilla” simulates the loss of an entire AWS region.

Over time, this approach came to be known as “chaos engineering,” which can be defined as

The discipline of experimenting on a distributed system in order to build confidence in the system’s ability to withstand turbulent conditions in production. [3]

Many people adopted this toolset and used it directly, or built their own scripts to introduce failures in a managed way to test the resiliency of their systems. However, all of this takes a lot of effort and resources to run and maintain.


Undifferentiated heavy lifting to the rescue

Amazon Web Services has a concept called “undifferentiated heavy lifting.” The idea is that when many customers spend a lot of effort solving the same problem, AWS should solve it for them, allowing them to focus on their core business.

Amazon realized that many of its customers were considering a chaos engineering approach, and in 2021 [4] it launched a new service, initially called “Fault Injection Simulator” but quickly renamed “Fault Injection Service” (FIS).

FIS is designed to allow customers to perform controlled, repeatable experiments on their workloads: introducing faults and issues and examining how their systems respond.


The perfect recipe for creating chaos

At the core of FIS is the concept of an “experiment”. You can think of it like a recipe: you need ingredients (or targets, as FIS calls them); these are the things you want to test, such as EC2 instances, RDS databases, Lambda functions, or even the underlying network. Once you have a list of ingredients, you need to know the steps to combine them; these are the actions you will take, such as introducing delays or terminating an instance.

Just as we collect recipes in a cookbook, we can store experiments in templates. This way we can run them again and again, knowing that our test masterpieces will be recreated perfectly every time.

These templates are made up of several different components; the first three are required and the rest are optional. Let’s take a look at these (a minimal sketch of a template follows the list):

  • Targets – As mentioned above, these define the resources you want to test; the full list is at https://docs.aws.amazon.com/fis/latest/userguide/targets.html#resource-types. After specifying the resource types to test, you can filter the resources by tags or by specific identifiers (such as an EC2 instance ID).
  • Actions – These describe the actions you want to perform on the targets. There are a number of different actions available, depending on the type of target you are testing; a full list is at https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html.
  • IAM role – When you execute an experiment, FIS uses the role you define to carry out the actions. This means the role needs appropriate permissions to interact with the targeted resources as well as with FIS itself.
  • Stop conditions – Sometimes you may want to stop an experiment, for example if it starts affecting the live environment. Stop conditions let you link predefined CloudWatch alarms to your template; if an alarm fires while the experiment is running, the experiment is stopped.
  • Logs – When you run an experiment, you may want to capture logs to understand what happened during the run. FIS can deliver logs to an S3 bucket or to CloudWatch Logs, capturing information such as the start and end of the experiment, the targeted resources, and the actions taken.
  • Reports – Just ahead of re:Invent 2024, new functionality was announced that allows FIS to produce reports documenting experiment results. FIS can generate these reports as PDFs, with the option to include charts from CloudWatch, in a format that can be shared with others (such as a change review board or during a code review).
  • Other options – In addition to the items above, other options are available in the template, such as how long the experiment should run and how many resources should be targeted (e.g. 75% of an Auto Scaling group).
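
To make this more concrete, here is a minimal sketch of what creating such a template might look like using the AWS SDK for Python (boto3). The role ARN, tag values, alarm ARN and the target/action names are illustrative placeholders rather than values from this article, and the exact parameters for each action should always be checked against the FIS actions reference linked above.

```python
import boto3

# Illustrative only: terminate one tagged EC2 instance, with a CloudWatch
# alarm acting as the stop condition. ARNs, tags and names are placeholders.
fis = boto3.client("fis")

response = fis.create_experiment_template(
    clientToken="demo-template-001",  # idempotency token
    description="Terminate one instance from the test fleet",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    targets={
        "testInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Environment": "test"},
            "selectionMode": "COUNT(1)",  # pick a single matching instance
        }
    },
    actions={
        "terminateOne": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "testInstances"},
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate",
        }
    ],
)

print(response["experimentTemplate"]["id"])
```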

One thing to remember is that you are not limited to a single set of targets and actions in a template. You can add multiple groups, for example to increase the CPU usage of a group of servers while restarting an RDS database.

Once a template is defined, it can be run multiple times manually, as part of a pipeline, or on a schedule. The results of these runs are then stored and available for review.
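
As a rough sketch, starting a run from a script or pipeline step could look like the following with boto3; the template ID below is a placeholder.

```python
import boto3

fis = boto3.client("fis")

# Start a run of an existing template (the ID is a placeholder) and check
# its current state. The experiment keeps running in AWS until all actions
# complete or a stop condition fires.
experiment = fis.start_experiment(experimentTemplateId="EXT1A2B3C4D5E6F7")
experiment_id = experiment["experiment"]["id"]

status = fis.get_experiment(id=experiment_id)
print(status["experiment"]["state"]["status"])  # e.g. "initiating" or "running"
```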


Pricing

As with most AWS services, you only pay for what you use. For FIS, the base fee is $0.10 per action per minute per account.
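
As a rough illustration, an experiment with three actions that each run for ten minutes would cost about 3 × 10 × $0.10 = $3.00 at that rate.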

Warning: if you would like to use the experiment reports that generate PDFs, there is an additional fee of $5 per report.


Coming soon!

I hope this overview of chaos engineering and Amazon’s Fault Injection Service was helpful. In an upcoming article, we’ll look at how to set up a template to test an EC2 Auto Scaling group and then test the group’s response to failures.


  1. https://dl.acm.org/doi/10.1145/2367376.2371297

  2. https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116

  3. https://principlesofchaos.org/

  4. https://aws.amazon.com/blogs/aws/aws-fault-injection-simulator-use-control-experiments-to-boost-resilience/
