Chaos engineering has been gaining a lot of traction over the last few years as it has moved from its origins at Netflix to more and more companies across the industry. Many development teams use it to prevent downtime: they try to break their systems on purpose so they can fix weaknesses before those weaknesses cause problems down the line.
Given the resilient nature of serverless computing, which rests on cloud providers’ uptime and availability agreements, it might seem that chaos engineering is one method of testing that wouldn’t be practical in serverless. But Emrah Samdan, vice president of product for Thundra, believes that serverless computing and chaos engineering go well together.
Because the cloud vendor guarantees availability and scalability, when doing chaos engineering in serverless environments, the goal is not necessarily to bring down the system but to find application-level failures caused by a lack of memory or time. “The purpose of chaos experiments is not to take the whole software down but to learn from failures by injecting small, controllable failures,” Samdan said.
Some of the most common examples of chaos engineering in serverless that Samdan sees are injecting latency into serverless functions to check that timeouts work correctly and injecting failures into third-party connections.
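As a rough illustration of those two experiments, the Python sketch below delays one function and forces another to fail, then checks that the caller falls back cleanly. It runs locally and stands in for a real serverless function; the helper names (inject_latency, inject_failure, call_with_timeout) and the price-lookup functions are hypothetical and are not taken from Thundra or any other chaos tooling.

```python
import functools
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def inject_latency(delay_seconds, probability=1.0):
    """Hypothetical helper: delay the wrapped call on a fraction of invocations."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # the injected, controllable fault
            return func(*args, **kwargs)
        return wrapper
    return decorator


def inject_failure(probability=1.0):
    """Hypothetical helper: make a third-party call fail on a fraction of invocations."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                raise ConnectionError("injected third-party failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_timeout(func, timeout_seconds, fallback):
    """Run func in a worker thread; return fallback if it is too slow or fails."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except (FutureTimeout, ConnectionError):
        return fallback
    finally:
        executor.shutdown(wait=False)  # do not block on the slow call


@inject_latency(delay_seconds=0.3)
def slow_price_lookup():
    return "fresh price"


@inject_failure()
def failing_price_lookup():
    return "fresh price"


print(call_with_timeout(slow_price_lookup, 0.05, "cached price"))     # cached price
print(call_with_timeout(failing_price_lookup, 0.05, "cached price"))  # cached price
```

In a real experiment the probability would usually be set well below 1.0 so that only a small share of invocations is affected.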
Samdan noted that defining the steady state is an essential first step of chaos engineering but is often overlooked. “People just want to break things, but the first step is to understand how they work, what are the ups and downs of the system, what are the limits, how resilient is your system already,” he said.
He believes that determining this baseline is even more critical in serverless environments, because what is considered normal for serverless can differ greatly from what is expected in other systems. For example, both latency and the number of executions are significant in serverless, which is not as true of other systems.
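One way to make such a baseline concrete is to summarize latency and invocation counts from a normal period and compare the same figures during an experiment. The following is a minimal sketch under stated assumptions: the function names, the choice of the 95th percentile, and the 20 percent tolerance are illustrative, not recommendations from Samdan.

```python
import statistics


def summarize_baseline(latencies_ms):
    """Summarize one observation window of per-invocation latencies."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],
        "invocations": len(latencies_ms),
    }


def within_baseline(baseline, observed, tolerance=0.2):
    """True if the observed p95 stays within the tolerated band above baseline."""
    return observed["p95"] <= baseline["p95"] * (1 + tolerance)


# Hypothetical measurements: a normal window, then a window with injected latency.
normal = summarize_baseline(list(range(1, 101)))
during_experiment = summarize_baseline([ms + 50 for ms in range(1, 101)])

print(within_baseline(normal, normal))             # True
print(within_baseline(normal, during_experiment))  # False
```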
“Chaos engineering experiments are all about asking questions to understand what happened during the experiment. Because of this, an engineering team must have proper observability in place. You cannot achieve this by examining metric charts designed to answer general questions. To ask questions about the unknowns of the distributed system, you need to have all three pillars of observability — logs, metrics, and traces — together and integrated. I see the adoption of correct observability continues, and we see more and more companies using modern tools for this purpose. I frankly believe that we’ll see more and more companies stepping into chaos engineering as modern observability becomes more widespread,” Samdan said.
For those looking to start doing chaos experiments in serverless environments, Samdan recommends starting small and starting in the staging environment. Rather than throttling all serverless functions, he advises throttling or injecting latency into one or two downstream services. “It’s not only about testing failures on your system; it’s also about testing how your team will react to these failures. So starting small encourages persevering for more comprehensive experiments,” Samdan said.
As with adopting any new methodology, changing the culture is the biggest challenge. Chaos engineering needs to be initiated and sponsored by higher-level folks in the company, Samdan believes. “Teams should be able to work in harmony by planning, running, and evaluating the game days. We should always remember that chaos experiments are not for criticizing colleagues for the weaknesses in their modules. It’s more about fixing those weaknesses before customers get impacted and letting those colleagues grow due to the experiments,” said Samdan.
Samdan also advised developers to remember that chaos engineering isn’t a silver bullet for finding every failure. It works best when used to complement other testing methodologies like unit tests and integration tests. “However, chaos engineering taps into a different point than other tests. It tests the resiliency of other parts of your system when one part has problems due to latency or failure. Considering the distributed systems serverless paradigm implies, running chaos experiments become a no-brainer to reveal the hidden traps before customers reveal them on production,” he said.