I. One Week before the Chaos:
Year 2021. A Tax Portal. Flagship project of the Government. Prestige project for the Service Provider. Go live with much fanfare. Complete disaster post go live. System unresponsive. Defects. Memes on Social Media.
Probably, the Project Manager was fired. Hopefully not.
This blog post looks at how the Service Provider could have used Chaos Engineering, a proactive approach to identifying weaknesses in a system, before the production go-live.
II. The Problem Statement:
Modern IT systems are complex: distributed software, containers running on Kubernetes or a Docker engine, all hosted on multi-cloud infrastructure (and on-prem). Developers test to find code defects, but the implementation team often fails to account for the unpredictable nature of real-world conditions. Peak traffic, overutilized workloads, connectivity issues, or even database glitches can bring down the most meticulously tested system.
III. Chaos Engineering (The Savior):
Chaos Engineering adds a preventive element to traditional testing. Instead of waiting for failures to happen organically, it deliberately subjects the system to disruptive events and conditions. These disruptions can mimic real-world failures such as region or zone outages, server outages, network congestion, or memory leaks. They can also mimic events that trigger a spike in traffic: time-bound activities (employees logging in to submit time sheets on Friday evening), month-end processes (the payroll run), or year-end processes (the last-day income tax filing scramble). By observing how the system reacts to these simulated failures, engineers can proactively identify weak points and implement preventive measures.
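As a rough illustration of the idea, the Python sketch below injects connection failures into a stand-in dependency and measures how a caller copes with and without retries. The names `FlakyDependency`, `call_with_retries`, and `success_ratio` are hypothetical and invented for this example, and the failure rates are arbitrary. A real experiment would target actual infrastructure through a chaos tool rather than an in-process simulation.

```python
import random


class FlakyDependency:
    """Hypothetical stand-in for a database or downstream service."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def query(self):
        # Inject a fault with the configured probability.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(dep, attempts):
    """Call the dependency up to `attempts` times; None means every try failed."""
    for _ in range(attempts):
        try:
            return dep.query()
        except ConnectionError:
            continue
    return None


def success_ratio(failure_rate, attempts, requests=1000, seed=0):
    """Fraction of simulated user requests that succeed under injected faults."""
    dep = FlakyDependency(failure_rate, seed)
    ok = sum(
        1 for _ in range(requests)
        if call_with_retries(dep, attempts) == "ok"
    )
    return ok / requests
```

Comparing `success_ratio(0.3, attempts=1)` with `success_ratio(0.3, attempts=3)` gives a feel for how much a simple retry policy hides the injected faults from end users.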
IV. Chaos Engineering Only Seems Like Chaos:
Chaos Engineering is a methodical process that involves:
- Scope Definition: Identify critical systems and functionalities to target with chaos experiments.
- Hypothesizing Failure Modes: Brainstorm potential real-world disruptions that could impact your system.
- Experiment Design: Develop safe and controlled tests in UAT that simulate these failure modes.
- Running Experiments: Execute the tests and monitor system behavior using Observability tools [DataDog, Grafana, Splunk, SolarWinds].
- Learning & Iteration: Analyze the results, generate reliability scores, identify vulnerabilities, and iterate on your system design to improve resilience.
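The steps above can be sketched as a small experiment runner. This is a toy model under stated assumptions, not the API of any particular tool: `Experiment`, `run_experiment`, and `replica_outage` are hypothetical names, and the "system" is just a dictionary of replica health flags.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # hypothesis probe: True while healthy
    inject: Callable[[], None]        # introduces the failure
    rollback: Callable[[], None]      # restores the system afterwards


def run_experiment(exp: Experiment) -> str:
    # Never inject into a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped"
    exp.inject()
    try:
        held = exp.steady_state()  # did the hypothesis survive the fault?
    finally:
        exp.rollback()             # always clean up, even if the probe raises
    return "passed" if held else "failed"


def replica_outage(replicas: Dict[str, bool]) -> Experiment:
    """Toy failure mode: take one replica down; the service should stay up."""
    victim = next(iter(replicas))
    original = replicas[victim]  # remembered for rollback

    def inject() -> None:
        replicas[victim] = False

    def rollback() -> None:
        replicas[victim] = original

    def steady_state() -> bool:
        # Assumed health rule: at least one replica is serving traffic.
        return any(replicas.values())

    return Experiment("replica-outage", steady_state, inject, rollback)
```

Running `run_experiment(replica_outage({"a": True, "b": True}))` exercises a redundant setup, while passing a single-replica dictionary shows the same experiment exposing a single point of failure.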
V. Automation:
Automation is a game-changer for chaos engineering, making it more efficient, scalable, and reliable. Automation lets you define and execute experiments quickly, freeing up engineers for other tasks.
Complex systems with many components are difficult to test thoroughly by hand. Automation allows you to run many experiments simultaneously, giving broader coverage.
Automated tests produce consistent results, making it easier to compare outcomes and track improvements over time.
Chaos testing can be integrated into your continuous integration and continuous delivery (CI/CD) pipeline, making it CI/CD/CT. This allows you to automatically test the resilience of your system with every product release.
Tools like Chaos Monkey, Chaos Toolkit, and Gremlin can automate experiment execution, enabling faster, more scalable testing with fewer errors. This allows you to continuously ensure your systems can handle disruptions.
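One way such a pipeline step might look is sketched below in Python. It does not use the API of Chaos Monkey, Chaos Toolkit, or Gremlin. Each experiment is a placeholder callable that returns True when the system held up, and `run_suite`, `reliability_score`, and `gate` are hypothetical helpers. The score definition (fraction of experiments passed) and the 0.9 threshold are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_suite(experiments: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run the experiments in worker threads; True means the system held up."""
    names = list(experiments)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in the same order as `names`.
        outcomes = list(pool.map(lambda name: experiments[name](), names))
    return dict(zip(names, outcomes))


def reliability_score(results: Dict[str, bool]) -> float:
    """Assumed definition: fraction of experiments that passed."""
    return sum(results.values()) / len(results) if results else 0.0


def gate(results: Dict[str, bool], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 lets the release proceed, 1 blocks it."""
    return 0 if reliability_score(results) >= threshold else 1
```

A release job could then end with `sys.exit(gate(run_suite(suite)))`, so that a low score fails the build the same way a failing unit test would.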
VI. Back to our Project Manager:
Here are some steps the Project Manager could have taken:
- Deploy Chaos Engineering tools: Explore, select, and deploy a tool such as Gremlin, Chaos Monkey, or Hey!Chaos.
- Build a Chaos Engineering team: Assemble a team with expertise in system architecture, testing, and automation.
- Start small and iterate: Begin with simple experiments and gradually increase complexity as your confidence grows.
VII. Final Thoughts:
Chaos Engineering is a strategic approach to building systems that are both functional and resilient. By proactively identifying and addressing weaknesses in design and architecture, it empowers teams to handle unexpected events and deliver a seamless user experience.
A little chaos today can save your job tomorrow.