#42 - Chaos Engineering - Mikołaj Pawlikowski

14-Jun-2021 53 mins Mikołaj Pawlikowski

included in Culture & Practices Software Craftsmanship DevOps Infrastructure Observability Architecture & Design Architecture

Buy me a coffee

“Chaos engineering is the discipline of experimenting on the system in order to increase your confidence that the system will survive difficult conditions.”

Mikołaj Pawlikowski is an engineering lead at Bloomberg and the author of “Chaos Engineering: Site reliability through controlled disruption”. In this episode, Miko explained what chaos engineering is and clarified some of the common misconceptions about it. He also discussed chaos engineering tools, the steps and prerequisites for doing chaos engineering, the skill set required of a chaos engineer, and how to explain the rationale and motivation behind chaos engineering to get management buy-in. Towards the end, Miko talked about chaos engineering for people, an interesting excerpt from his book, and his mission over the last few years to make chaos engineering boring.

Listen out for:

Career Journey - [00:04:26]
Chaos Engineering - [00:05:15]
“Chaos” in Chaos Engineering - [00:08:37]
Getting Management Buy-in - [00:13:08]
Running in Production - [00:18:29]
Other Common Misconceptions - [00:20:53]
4 Steps to Chaos Engineering - [00:25:41]
Skill Set of a Chaos Engineer - [00:28:11]
Examples of Chaos Engineering Tools - [00:32:09]
Chaos Engineering for People - [00:37:48]
Make Chaos Engineering Boring - [00:42:08]
3 Tech Lead Wisdom - [00:46:12]

_____

Mikołaj Pawlikowski’s Bio
Mikołaj Pawlikowski is an engineering lead at Bloomberg. He’s the author of “Chaos Engineering: Site reliability through controlled disruption”, a frequent speaker, and the maintainer of the Goldpinger and PowerfulSeal open source projects.

Follow Mikołaj:

LinkedIn – https://www.linkedin.com/in/mikolajpawlikowski/
Twitter – https://twitter.com/mikopawlikowski
GitHub – https://github.com/seeker89/chaos-engineering-book
“Chaos Engineering” book – https://www.manning.com/books/chaos-engineering
Chaos Engineering Newsletter – https://chaosengineering.news/

Mentions & Links:

Chaos Engineering – https://principlesofchaos.org/
SRE (Site Reliability Engineering) – https://en.wikipedia.org/wiki/Site_reliability_engineering
Dave Rensin – https://www.linkedin.com/in/drensin/
“Dave Rensin: Chaos Engineering for People Systems - Chaos Conf 2019” presentation – https://www.gremlin.com/blog/dave-rensin-chaos-engineering-for-people-systems-chaos-conf-2019/
Kubernetes – https://kubernetes.io/
Chaos Monkey – https://netflix.github.io/chaosmonkey/
strace – https://github.com/strace/strace
eBPF (extended Berkeley Packet Filter) – https://ebpf.io/
Netflix – https://www.netflix.com/
SpaceX – https://www.spacex.com/
Duolingo – https://www.duolingo.com/

Our Sponsors

Are you looking for some cool new swag?

Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference and will be delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive it.

Like this episode?

Follow @techleadjournal on LinkedIn, Twitter, Instagram.

Buy me a coffee or become a patron.

Quotes

Chaos Engineering

There are a lot of misconceptions, and the name itself doesn’t help.
Chaos engineering is basically this discipline of experimenting on the system and the system can be anything — it doesn’t have to be massive or Netflix scale — in order to increase your confidence that [the] system will survive difficult conditions.
A unit of chaos engineering, the way that I see it is this chaos experiment, when you basically have a bunch of assumptions about the system. Obviously we all design the systems, and we want them to behave in certain ways. But then in practice, it turns out that there’s a lot of behavior that we do not account for, or the system is doing what we told it instead of what we intended.
The experiment’s real goal is to either confirm that your assumptions about the system are correct, basically that your hypothesis works, or to find a problem, a place where your assumptions and the reality don’t necessarily add up.
The chaos engineering experiment typically has four steps.
1. Defining observability, because this is a prerequisite for all of that to happen. If you can’t observe the system, any kind of variable reliably, then you can’t really conduct a scientific experiment.
2. Then you go for what we typically call steady state, which is just a fancy way of saying the system-normal kind of behavior. This is the normal range.
3. And then we go and we do the fun stuff. So we try to turn our expectations of the system into a hypothesis, and we say, “Okay. So we design the system to be redundant. That means that if we take away one of the servers from the pool, it should keep working within these parameters.”
4. You implement that, and you verify what happened.
The nice thing about it is that everybody wins because if your hypothesis was wrong, then you discover something you can fix, and you can save yourself trouble and fix it before your users notice. If your hypothesis was right, that means that your system is pretty good. So, you increase your confidence.

“Chaos” in Chaos Engineering

So chaos is a little bit of a double-edged sword. Because on one hand, it captures attention in headlines. On the other hand, it requires a little bit of explanation.
One is that the chaos that we mentioned here, it’s not about increasing the amount of chaos in your system, it’s about decreasing the amount of chaos.
If we can’t control as many of these variables as possible, we can’t reliably confirm or deny our hypothesis.
There’s an entire spectrum of things that you can do with chaos engineering.
Chaos Monkey, the thing that this entire discipline started with, was very chaotic in the sense of the randomness of its work. So if you don’t know much about your system or you’re looking for emergent properties, randomness can really give you a lot of value, kind of out of the box, and it requires very little setup.
And I see this side of the spectrum as similar to the discipline of fuzzing in testing, where you produce a lot of valid inputs that you would probably not think of if you were writing unit tests manually. You just run it through whatever you’re testing, and you might discover things that you didn’t think of.
And so if you start with a system, it’s a really nice way to get into chaos engineering, because it doesn’t require much of a setup. You need to have a vague understanding of the system. You can release it, and you can find things.
The other thing I mentioned [is] these emergent properties… In any complex enough system, you’re going to have interactions that you just didn’t predict, and sometimes they’re going to be pretty bad for you.
Over the last few years, we’ve been trying to push for the other side of the spectrum, where you go into the system, you already know the system well, and you’re working on particular properties, and you want to make sure that the particular failure scenarios are covered.
From the SRE point of view, if you had an outage that you didn’t predict because the system worked differently than you expected, you would like to make sure that doesn’t happen again. One of the best kinds of regression tests that you can come up with is to simulate that failure scenario and make sure that the system actually still survives it.

Getting Management Buy-in

So it depends whom you talk to, and I, in general, tend to see two big groups. One is your colleagues, or people who have had the experience of actually being paged in the middle of the night. And I think the only real argument you need to get them on board is to tell them that, “Listen, if we do it right, there is a huge potential for you to be called less at night.” So basically the worst-case scenario is that you do some chaos engineering and you break something. If you do it well, and you do it with common sense, and you apply all the same best principles for deploying the code, because the chaos experiments are code like any other, really, you’re going to only affect a certain blast radius, anyway.
But the worst-case scenario is that, “Okay, so we did chaos engineering and we broke something. So what happens then?” One of the things that is worth mentioning is that you’re doing it on purpose. So, if it breaks, you’re typically in the office, you’re typically ready to jump to fix that. You don’t have to context switch.
The other group is people who might be decision-making on this kind of stuff, your managers and this kind of person. From my experience, the best arguments to start with, are to do some back of the napkin maths. Now I even have a phrase for that. I typically call it the “hamburger vs shark problem”. It’s about the perception of risk.
If you do it right, and you are careful about the blast radius, it’s a lot like releasing any other change to production. You test things through the stages, and you apply all the same principles you do for all other bits of code.
The point I’m trying to make is that chaos engineering, when you first hear about it and about introducing failure on purpose, is a lot like end-to-end testing of the unhappy paths. It’s like an evolution. You do the unit tests, where you test a small little subset, then you do maybe component testing, some kind of integration testing. Then at some point, you get to end-to-end testing, where you typically test the happy path or some popular path and stuff like that. Chaos engineering is a lot like end-to-end testing of a system taken as a whole during unhappy events: when things break, when machines go down, when the network gets slow, and this kind of thing.
There are ways to manage the blast radius and there are ways to manage that risk. And the goal of the entire exercise is to decrease the amount of chaos, and decrease the amount of risk that you have in your system, not to increase it.

Running in Production

If you don’t run it in production, you’re excluded from the chaos community.
You want to be so comfortable with the system, because you’ve done this for so long in the other stages, that it’s now part of your routine, and you do it in production. If you can do that, that’s obviously great.
Because if you think about that, you’re never fully testing things until they get into production. The data patterns will be different. Usage patterns will be different. So by definition, you can never fully test things until the actual proverbial rubber hits the proverbial road.
You are progressing in the same way that you progress all other software. You write this code. You run it in your test environments. You might increase it. You’re probably going to do the same thing that you do when you release any other software: limit the percentage of traffic that goes through the systems that have it, and then increase it. If anything goes wrong, you roll back.
And also it’s probably worth mentioning that there are use cases where it’s probably never going to be okay. If your choice is either to potentially introduce this failure, and you’re not confident whether it’s going to work, and someone might die because of that, that’s probably not the right moral choice to do that. Or if you have very heavy contractual obligations, now from the legal point of view, you might not be able to do that. But that doesn’t mean that you can’t get 80% of the value of harvesting the lower hanging fruit in the process.
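The progression described above (test environments first, then a small percentage of production traffic, widening only while things stay healthy, and rolling back otherwise) could be sketched roughly like this in Python. This is an illustration only, not something from the episode: `staged_rollout`, the stage percentages, and the `is_healthy` check are all hypothetical names and values.

```python
def staged_rollout(stages, is_healthy):
    """Widen the blast radius stage by stage; stop at the first unhealthy stage.

    stages:     traffic percentages to expose to the experiment, in order
                (hypothetical values chosen by whoever runs the experiment).
    is_healthy: callable taking a percentage and returning True if the system
                stayed within its normal range at that exposure (hypothetical).
    Returns (last_safe_percent, rolled_back).
    """
    last_safe = 0
    for percent in stages:
        if not is_healthy(percent):
            # Something went wrong: roll back to the last exposure that was fine.
            return last_safe, True
        last_safe = percent
    return last_safe, False


# Toy health check: pretend the system copes with up to 5% of traffic affected.
result_limited = staged_rollout([1, 5, 25, 100], lambda percent: percent <= 5)
# Toy health check: pretend the system copes at every stage.
result_full = staged_rollout([1, 5, 25, 100], lambda percent: True)
print(result_limited, result_full)
```

In a real setup the health check would be whatever observability you already trust for normal releases; the point of the sketch is only that the experiment widens gradually and stops as soon as the steady state is violated.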

Other Common Misconceptions

On a regular basis, I ask my LinkedIn network what the biggest blocker is, and it still seems to be getting the buy-in, in big part because of these misconceptions. So the production one is a big one, and that’s typically something that people really think.
The other thing is randomness. So a lot of people basically disregard chaos engineering because they feel like it’s a gimmick, a chaos monkey randomly smashing things.
For bigger systems like Netflix, that example works nicely, because doing something as simple as taking down VMs is already giving you value. It might not work for a smaller system.
It’s not just about randomness; the entire spectrum, from random to very deliberate, is also important.
Another thing I keep hearing is, “Oh, we already have chaos enough. We don’t need new chaos.”
If you still think that chaos engineering is about adding chaos to your systems, then you still haven’t gotten the memo. We add the failure so that the amount of incertitude, the amount of chaos in your system, actually decreases. So, net-net, we do all of that to decrease the amount of chaos.
A lot of people say that, because they don’t feel that confident in their systems.
I think that’s a false premise too, because regardless of where you are on your maturity spectrum, you can get some value out of doing that.
Typically these chaos experiments, unless you really go into the deep weeds, are simple to set up. They don’t take that long to implement. And especially as the tooling around that gets better and better, today it’s easier and easier to do pretty sophisticated things with little effort.
Worst-case scenario, you don’t discover anything and you feel a bit better because you know that this particular scenario is not going to take your system down. And best-case scenario, you can discover something that’s free, like the small little bets that can pay off that are really cheap to do. There’s also the benefit of the kind of mindset that goes with that, that if your engineers on your team think about the fact that this is going to be tested, that this is actually going to be put in practice, and they have to design the systems with this in mind. It just goes more naturally to bake this reliability from day one, rather than doing this as it breaks.
I think one more is about the scale. People look at Netflix and they look at Google and they say, “Oh, okay. Yeah, these are massive companies, and they’re bleeding edge, and they have all the scale and that’s great.” But it’s also worth noting that chaos engineering doesn’t require you to have all this fancy and massive distributed systems.
It’s more of a methodology that can really work on any kind of system, even if you’re working with a single-process legacy server and you would like to make sure that you understand how it breaks, and the kind of things that you expect to happen.

4 Steps to Chaos Engineering

So first is the observability… what it means is being able to measure something reliably.
- It’s important to be good at observing this. It can be very simple. This variable can be anything. It could be whether the server is up or down; that’s something we can observe. Or throughput. Or number of requests per second or anything, really.
Once you have that [observability], I mentioned the steady state — the normal range of that — is how you verify what’s going on. So normally, the server is up. Or normally, I get this many requests per second.
And then, we hypothesize.
And then the last step is to implement that [hypothesis]. Take one of the many tools that are available right now to do that. Make sure that it doesn’t affect your observability. Go run it and see what happens.

More often than not, you’re going to discover things that you might have not predicted.
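To make the four steps concrete, here is a minimal Python sketch of one experiment run against a toy, in-memory system. Everything in it is an assumption made for illustration: the `ServerPool` class, the capacity numbers, and the 5% tolerance are invented, and a real experiment would use your own metrics and one of the existing chaos engineering tools to inject the failure.

```python
class ServerPool:
    """Hypothetical toy system: identical servers sharing incoming requests."""

    def __init__(self, servers, capacity_per_server):
        self.healthy = servers
        self.capacity_per_server = capacity_per_server

    def throughput(self, demand):
        # Step 1 (observability): the variable we can measure reliably,
        # here the requests per second the pool actually serves.
        return min(demand, self.healthy * self.capacity_per_server)

    def kill_one(self):
        # Failure injection: take one server away from the pool.
        self.healthy = max(0, self.healthy - 1)


def run_experiment(pool, demand, tolerance=0.05):
    # Step 2 (steady state): record the normal value before touching anything.
    steady = pool.throughput(demand)
    # Step 3 (hypothesis): losing one server keeps throughput within tolerance.
    lower_bound = steady * (1 - tolerance)
    # Step 4 (implement and verify): inject the failure and observe again.
    pool.kill_one()
    observed = pool.throughput(demand)
    return {
        "steady": steady,
        "observed": observed,
        "hypothesis_holds": observed >= lower_bound,
    }


# Four servers have spare capacity for this demand; three servers do not.
redundant = run_experiment(ServerPool(servers=4, capacity_per_server=1000), demand=2500)
fragile = run_experiment(ServerPool(servers=3, capacity_per_server=1000), demand=2500)
print(redundant)
print(fragile)
```

Either outcome is useful: the four-server pool confirms the redundancy assumption, while the three-server pool shows a gap between the design intent and what actually happens.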

Skill Set of a Chaos Engineer

One thing that I really like about chaos engineering, is that it really cuts through different stacks and different technologies and different kinds of design and systems. It’s really more of a meta skill. That also means that it’s not necessarily a job description; it’s more of a mindset / skill that a lot of different people can have.
As an SRE-type person, I care a lot about my system not blowing up and my system working well. And as the scale increases and everything, you realize that the failure is [a] when rather than an if, and you start seeing more and more of that as it grows.
So the primary driver for me is to have another tool that helps me sleep better at night and get paged or called less at night.
If you’re an application team, you can also leverage the same thing. You design your system, your application, whatever you’re working on in a way that it’s the most reliable possible.
The SRE kind of person might be running a platform that runs a lot of clients and code that they know little about potentially. On the application side, you run fewer of the applications, but you have much more intimate knowledge about them.
At every level of the stack, whatever you’re working on, you can use the same thing.
Wherever you are on the stack, on the platform, whether you’re working with the kernel and you want to verify what happens when syscalls are blocked or delayed or something like that. Through the platform levels, the networking, maybe Kubernetes level, containers level, all the way up to the browser. So it’s really applicable to all the levels of the stack.

Chaos Engineering for People

If you think about your people systems, also known as teams, a lot of the same rules that apply when you’re building a reliable computer system also apply to building reliable human systems.
Probably the easiest way to illustrate that is that every team is always going to have some bottlenecks, and the bottlenecks might be in the form of a throughput bottleneck, or they might be in the form of a knowledge bottleneck.
So a lot of the job of having a performing and good team is to continuously try to find those bottlenecks and resolve them.
That goes into the chaos engineering mindset. Basically, the way that you think of your system with the expectation of failures happening rather than a possibility of failures happening, is going to help you build more reliable systems from the ground up.
It’s kind of like if you have a good CI system that always runs your unit tests, and every time you push a change upstream, you’re going to automatically get the feedback from the unit tests. It becomes part of the culture. You get this immediate feedback, quick turnaround for the feedback. It works. It doesn’t work. It detected the problem. It’s similar with the chaos engineering mindset.
If you think about it, it’s not a question of if, it’s a question of when, and it’s fairly easy to calculate, depending on your scale, how often you’re going to see these kinds of failures.
He [Dave Rensin] basically described the teams — and I loved the description, I still remember that — as a set of bio robots executing a distributed algorithm to produce some work. And he actually went much into the details of the games that they came up with to surface and detect the problems.
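The earlier remark that failure frequency is fairly easy to calculate from your scale can be illustrated with some back-of-the-napkin arithmetic in Python. The 2% annual failure rate per machine used below is a made-up number for illustration, not a figure from the episode, and the function names are hypothetical.

```python
def expected_failures_per_year(machines, annual_failure_rate):
    # Expected number of machine failures in a year across the whole fleet.
    return machines * annual_failure_rate


def mean_days_between_failures(machines, annual_failure_rate):
    # Average gap between failures somewhere in the fleet, in days.
    return 365 / expected_failures_per_year(machines, annual_failure_rate)


ASSUMED_RATE = 0.02  # hypothetical: each machine has a 2% chance of failing per year

for fleet_size in (10, 1_000, 10_000):
    per_year = expected_failures_per_year(fleet_size, ASSUMED_RATE)
    gap_days = mean_days_between_failures(fleet_size, ASSUMED_RATE)
    print(f"{fleet_size} machines: {per_year:.1f} failures/year, one every {gap_days:.1f} days")
```

With these assumed numbers, a fleet of 10 machines sees a failure about once every five years, while a fleet of 10,000 sees one roughly every two days, which is why at scale failure becomes a “when” rather than an “if”.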

Make Chaos Engineering Boring

And so for a technology or methodology to reach a wide audience, it has to become mainstream enough; basically, boring enough that it can be adopted by a lot of people. Because not everybody who’s working is happy to work around the rough edges.
So I would really like chaos engineering to stop being about exciting stuff that you can do and break things in production and not get fired for that and instead become this routine thing that we do just because it creates a lot of value. There are benefits to doing that.

3 Tech Lead Wisdom

A lot of what we do is about removing the BS from the equation.
- Engineers, apart from the big egos and everything, are really very finely tuned to detect BS. And so a lot of tech leadership and leadership in tech in general, in my opinion, is about just making sure that there is as little BS as possible.
- So if someone asks you a question, you have a few options: you can say yes or no, if you know the answer for sure; you can say “I don’t know”, if you don’t know the answer; or you can try to BS your way through it and try to come up with something on the spot. And I got a lot of mileage just by removing that last option.
- If you just tell people honestly, “I know”, “I don’t know”, yes or no, it builds a relationship that’s based on trust, where they also feel like they can say, “I don’t know.”
- You work with other people who are smarter than you, who have more experience than you, who by definition are going to be much better at parts of your job. If you are humble enough to just say, “Oh, I don’t know. Well, what do you think?”, it gets you a lot of mileage.
Regardless of what you think about yourself, there’s always something that you can learn from everybody.
- Coming from a mindset that if that person disagrees with me, there is probably something that they think about or they know about that I don’t know about.
- If I just try to convince them of my way of doing things, I’m not going to come out of that conversation wiser. But if I at least by default give it a shot (maybe they’re wrong, maybe there’s something I know that they don’t know), I’m going to get more value out of that. And if you keep getting value from every conversation, you’re really going to accumulate that over time.
The learning mindset, lifelong learning, is very important.
- By learning, I mean like a lot of things. There might be a point where you know everything about your niche and there isn’t much more to learn. Instead of stopping learning, you should probably start exploring another niche. Or maybe do something completely out of your comfort zone, and go learn a language, or do some art, or sport and see how that affects your brain.
- Learning things might have unexpected benefits for you. And this is something that I really recommend doing.

Transcript

Episode Introduction [00:01:03]

Henry Suryawirawan: [00:01:03] Hello, everyone. I’m so happy to be back here again with another new episode of the Tech Lead Journal podcast. Thanks for tuning in and spending your time with me today listening to this episode. If you haven’t, please subscribe to Tech Lead Journal on your favorite podcast apps and also follow Tech Lead Journal social media channels on LinkedIn, Twitter, and Instagram. And you can also make some contribution to the show and support the creation of this podcast by subscribing as a patron at techleadjournal.dev/patron, and help me towards producing great content every week.

For today’s episode, I am happy to share my conversation with Mikołaj Pawlikowski. Mikołaj is an engineering lead at Bloomberg and the author of “Chaos Engineering: Site reliability through controlled disruption”. I’m sure that many of you have heard about chaos engineering before, popularized by Netflix, along with its tools such as Chaos Monkey and Simian Army. I’m personally curious about chaos engineering and how to implement and execute it properly in order to continuously improve system reliability in the midst of disruption and disaster.

In this episode, Miko shared in depth what chaos engineering is and, importantly, what chaos engineering is not, by clarifying some of the common misconceptions surrounding it. Miko explained the prerequisites and steps required for us to start doing chaos engineering, and also mentioned some of the chaos engineering tools that we can use at different layers of our system, the skill set required of a chaos engineer, and how we should explain the rationale and motivation behind chaos engineering to get management buy-in, using a fun analogy involving a hamburger and a shark. Towards the end, Miko also shared about chaos engineering for people, an interesting excerpt taken from his book, and his ultimate mission over the last few years to make chaos engineering boring.

I hope you will enjoy this episode. And if you like it, consider helping the show by leaving a rating, review, or comment on your podcast app or social media channels. Those reviews and comments are one of the best ways to help me get this podcast to reach more listeners. And hopefully they can also benefit from all the content in this podcast. So let’s get this episode started right after our sponsor message.

Introduction [00:03:54]

Henry Suryawirawan: [00:03:54] Hey, everyone. Welcome back to another new show of the Tech Lead Journal. Today I have with me a guy called Mikołaj Pawlikowski, or in short let’s call him Miko. So Miko is the author of a recent book titled “Chaos Engineering”. So today, I’m sure we are going to be talking a lot about chaos engineering: how to implement it correctly, and maybe some of the gotchas or misconceptions about chaos engineering. So Miko, it’s good to have you here on the show. Thanks for being here.

Mikołaj Pawlikowski: [00:04:23] Very glad to be here. Thanks for inviting me.

Career Journey [00:04:26]

Henry Suryawirawan: [00:04:26] Sure. So, before we start, probably to introduce yourself to the audience here, can you tell us a little bit more about your career? Maybe some highlights or turning points?

Mikołaj Pawlikowski: [00:04:35] Sure. So I don’t know how far back you’d like me to go. Yeah, obviously right now, I do a lot of chaos engineering. I run a small SRE team that manages Kubernetes, and chaos engineering kind of evolved as one of the very important tools that we use to basically make our systems more reliable, and it plays nicely into this entire site reliability engineering mindset. I’ve been at Bloomberg for a while now. Before that, I attempted a couple of startups. So, I think, a pretty common tech background. I’ve been hacking away at coding since I was a kid. So I think probably a lot of people can relate to that.

Chaos Engineering [00:05:15]

Henry Suryawirawan: [00:05:15] Okay, cool. I mean, let’s maybe dive deeper as we talk along about your career. You published this book not so long ago, right? Chaos engineering. I think the first time I heard about this term was during the time when Netflix popularized this idea, chaos engineering, and they’ve been running it in their production, killing their servers in order to increase their reliability. But maybe for all the audience here, can you maybe explain what exactly chaos engineering is?

Mikołaj Pawlikowski: [00:05:41] Sure thing. And that’s probably a good idea because there are a lot of misconceptions, and the name itself doesn’t help. So chaos engineering is basically this discipline of experimenting on the system, and the system can be anything, it doesn’t have to be massive or Netflix scale, in order to increase your confidence that the system will survive difficult conditions. So, a unit of chaos engineering, the way that I see it, is this chaos experiment, where you basically have a bunch of assumptions about the system. Yeah. Obviously we all design the systems, and we want them to behave in certain ways. But then in practice, it turns out that there’s a lot of behavior that we do not account for, or the system is doing what we told it instead of what we intended. So the experiment’s real goal is to either confirm that your assumptions about the system are correct, basically, your hypothesis works, or you find a problem, or you find a place where your assumptions and the reality don’t necessarily add up.

So these chaos engineering experiments typically have four steps. One is defining observability, because this is like a prerequisite for all of that to happen. If you can’t observe the system, any kind of variable reliably, then you can’t really conduct a scientific experiment. Then you go for what we typically call steady state, which is just a fancy way of saying the system’s normal kind of behavior. This is the normal range. So let’s say that for observability, we’re looking at a variable like throughput or the number of requests per second that some kind of server can handle. The normal range, the steady state, might be this number of thousands of requests per second. And then we go and we do the fun stuff. So we try to turn our expectations of the system into a hypothesis, and we say, “Okay. So we design the system to be redundant. That means that if we take away one of the servers from the pool, it should keep working within these parameters.” And then you go to number four: you implement that, and you verify what happened.

The nice thing about it is that everybody wins because if your hypothesis was wrong, then you discover something you can fix, and you can save yourself trouble and fix it before your users notice. If your hypothesis was right, was correct, that means that your system is pretty good. So you increase your confidence. So, yeah, it’s the nice part of doing chaos engineering. So this is really it like, it doesn’t need to be more complicated. And then obviously, it came out of Netflix, and it made for some really good headlines because they were already doing this in production. Breaking things in production is typically one of the internet memes. So if someone comes out and says that with a straight face, it’s a bit of a controversy. So that’s really what I see as chaos engineering. I think it’s simple enough that pretty much everybody can at least kick the tires, and see what value they can get out of it.

“Chaos” in Chaos Engineering [00:08:37]

Henry Suryawirawan: [00:08:37] From your explanation just now, there’s nothing mentioning chaos, so to speak. It’s more about scientific experiment. You know what you’re going to test. You have a hypothesis. You have the steady state that you want to test, and then you introduce some kind of tests and experiments. Not necessarily all chaotic stuff, like just killing things, or taking one data center down or something like that. So it’s not necessarily about the chaos itself, although a lot of people actually have this misconception: “Okay, let’s just introduce this chaos software, and then everything just goes haywire, and let’s see how the system goes.”

Mikołaj Pawlikowski: [00:09:12] Yeah. So chaos is a little bit of a double-edged sword. Because on one hand, it captures attention in headlines. On the other hand, it requires a little bit of explanation. So there are at least two things to touch upon here. One is that the chaos that we mentioned here, it’s not about increasing the amount of chaos in your system, it’s about decreasing the amount of chaos. In my definition, I was trying to make you picture a lab coat and protective gloves and eyewear and stuff. Because if we can’t control as many of these variables as possible, we can’t reliably confirm or deny our hypothesis. But the other reason for the chaos in the name is that there’s an entire spectrum of things that you can do with chaos engineering. Chaos Monkey, the kind of thing that this entire discipline started with really, was very chaotic in the sense of the randomness of its work. So if you don’t know much about your system or you’re looking for emergent properties, randomness can really give you a lot of value, kind of out of the box, and it requires very little setup.

So, you take the system. You release a chaos monkey on that. You let it run, and you probably find some things. You can parameterize it, you can make sure that it breaks only a certain percentage or whatnot, and this already gives you value. And I see this side of the spectrum as similar to the discipline of fuzzing in testing, where you produce a lot of valid inputs that you would probably not think of if you were writing unit tests manually. You just run it through whatever you’re testing, and you might discover things that you didn’t think of. You just cram in all these inputs, and it’s like a brute force-ish approach, right? And so if you start with a system, it’s a really nice way to get into chaos engineering, because it doesn’t require much of a setup. You need to have a vague understanding of the system. You can release it, and you can find things. And the other thing I mentioned, these emergent properties, is also a fancy way of saying that. But to illustrate that for those of your audience who have not heard about it, think about neurons in your brain. Any single neuron doesn’t have the property of human consciousness, or it doesn’t have the property of thinking per se. But then, when you put them all together and they interact, from within these interactions you have the emergent property of human consciousness. Same thing for another popular example, all the cells in your heart. None of them has the property of pumping blood and oxygenating your body. But put together, they create a system that actually has those properties. These examples give you very nice properties, right? But in any complex enough system, you’re going to have interactions that you just didn’t predict, and sometimes they’re going to be pretty bad for you.

So this fuzzing-like, random side of the chaos engineering spectrum is great. It’s useful, and it’s part of the discipline. Over the last few years, we’ve been trying to push for the other side of the spectrum, where you go into the system, you already know the system well, you’re working on particular properties, and you want to make sure that particular failure scenarios are covered. From the SRE point of view, if you had an outage that you didn’t predict because the system worked differently than you expected, you would like to make sure that doesn’t happen again. One of the best kinds of regression tests that you can come up with is to simulate that failure scenario and make sure that the system actually still survives it. So the other side of the spectrum, the sophisticated, very deliberate practice where there is very little randomness, is also available to you. Right now, as a practitioner of chaos engineering, you can pick wherever is best on that spectrum for you. And I think that’s definitely a good thing.

Getting Management Buy-in [00:13:08]

Henry Suryawirawan: [00:13:08] So, as a techie, everyone understands at least the concept. But when you introduce this to the management or the business, for example, so we are going to introduce Chaos Monkey or chaos engineering in our system. How do you actually explain to them? What is the motivation, rationale behind this? Because I’m sure any business and any executives, they all want stability. They don’t want some kind of a randomness, or attack to the system that is working fine. So how do you explain the rationale and motivation behind this to sell it to them?

Mikołaj Pawlikowski: [00:13:36] Yeah, that’s probably the number one question. And that goes back to the name being a little bit misleading and a little bit of a double-edged sword. So it depends whom you talk to, and in general I tend to see two big groups. One is your colleagues, people who have had the experience of actually being paged in the middle of the night. And I think the only real argument you need to get them on board is to tell them, “Listen, if we do it right, there is a huge potential for you to be called less at night.” If you do it well, and you do it with common sense, and you apply all the same best practices as for deploying code, because chaos experiments are code like any other, really, you’re going to only affect a certain blast radius, anyway.

But the worst-case scenario is that okay, so we did chaos engineering and we broke something. So what happens then? One of the things that is worth mentioning is that you’re doing it on purpose. So, if it breaks, you’re typically in the office, you’re typically ready to jump to fix that. You don’t have to context switch. Waking up in the middle of the night when you’re being paged to figure out what’s going on is nothing pleasant. I’m sure a lot of your audience will have had that experience. You have to wake up. You need to make the coffee. Read through the alerts that may or may not be easy to understand. Deal with people who are panicking because they just got woken up and they don’t know what’s going on. You context switch, because between all those things you log in, it takes time. It’s typically not a very pleasant experience. Let’s just be honest. So if you can minimize the number of times that this happens, and instead try to do it more purposefully, that’s a big step up. So if you are on rota for support in your system, this is probably all you need to know.

The other group is people who might be making decisions on this kind of stuff, your managers and this kind of person. From my experience, the best argument to start with is to do some back-of-the-napkin maths. I even have a phrase for that now. I typically call it the hamburger versus shark problem. It’s about the perception of risk. So like I mentioned just a second ago, if you do it right, and you are careful about the blast radius, it’s pretty much like releasing any other change to production. You test things through the stages, and you apply all the same principles you do for all other bits of code. But if you ask a person on the street how afraid they should be of sharks, they have already been primed by Hollywood movies, Jaws, The Meg, and all of that, to be very afraid of sharks. But if they actually look at the statistics of how many people die of shark attacks, it’s a minuscule number. There are more people dying of coconuts landing on their heads every year than there are of sharks. Whereas the things that are really statistically likely to kill them are things like heart attacks or heart disease in general, which take more than half a million Americans every single year, compared to probably a single-digit number for shark attacks in any given year. You see that this is actually statistically much more dangerous to you. So next time you see that hamburger, with all this grease and all of that, think about the fact that it is actually more likely to kill you than the shark.

What is the point I’m trying to make here? The point is that chaos engineering, when you first hear about it and about introducing failure on purpose, sounds scary. But the way that I see it, it’s a lot like end-to-end testing of the unhappy paths. It’s like an evolution. You do the unit tests, where you test a small little subset, then you do maybe component testing, some kind of integration testing. Then at some point, you get to end-to-end testing, where you typically test the happy path or some popular path and stuff like that. Chaos engineering is a lot like end-to-end testing of a system taken as a whole, during unhappy events: when things break, when machines go down, when the network gets slow, and this kind of thing. So for someone who is hearing, okay, we’re going to add failure to our systems, and we would also like to eventually do it in production, it sounds scary. But that’s like the shark, if you think about it. There are ways to manage the blast radius and there are ways to manage that risk. And the goal of the entire exercise is to decrease the amount of chaos and the amount of risk that you have in your system, not to increase it. So just take a napkin and a few numbers, run through that, and you can typically explain to your engineers that they really shouldn’t pay too much attention to this scary sounding name, because there’s a lot of return on investment to harvest.

Running in Production [00:18:29]

Henry Suryawirawan: [00:18:29] So if I hear you correctly, this chaos engineering doesn’t necessarily mean that you have to run this experiment in production. You can also run it in maybe a pre-prod or some kind of environment where you can actually simulate and do all these tests. Is that correct?

Mikołaj Pawlikowski: [00:18:43] Oh, no. If you don’t run it in production, you’re excluded from the chaos community.

Henry Suryawirawan: [00:18:47] Okay.

Mikołaj Pawlikowski: [00:18:48] So you know, this is one of the things. Because of all the blog posts, it’s shiny. Doing stuff like this in production is unorthodox, and it makes for a great presentation. But that’s the holy grail, right? You want to be so comfortable with the system, having done this for so long in the other stages, that it’s now part of your routine, and you do it in production. And if you can do that, that’s obviously great. Because if you think about it, you’re never fully testing things until they get into production. The data patterns will be different. Usage patterns will be different. So by definition, you can never fully test things until the proverbial rubber hits the proverbial road. But sticking the chaos engineering sticker on your laptop doesn’t give you absolution to stop using common sense. So behind the scenes, the less shiny bit is that you progress the same way that you progress all other software. You write this code. You run it in your test environments. You might then widen the scope. You’re probably going to do the same thing that you do when you release any other software: limit the percentage of traffic that goes through the systems that have it, and then increase it. If anything goes wrong, you roll back.
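The progression described here (start small, widen the percentage of affected traffic, roll back if anything goes wrong) can be sketched in a few lines of Python. This is only an illustration under assumed names: `staged_rollout`, the stage percentages, and the `run_stage` health check are hypothetical and stand in for whatever deployment tooling a team already uses.

```python
def staged_rollout(stages, run_stage):
    """Run an experiment at increasing traffic percentages; stop at the first failing stage.

    stages: increasing percentages of traffic, e.g. [1, 5, 25, 100]
    run_stage: callable taking a percentage and returning True if the system stayed healthy
    Returns (completed_stages, rolled_back).
    """
    completed = []
    for percent in stages:
        if not run_stage(percent):
            # Health check failed: stop widening the blast radius and roll back.
            return completed, True
        completed.append(percent)
    return completed, False


if __name__ == "__main__":
    # Hypothetical health check: pretend the system degrades above 25% of traffic.
    def healthy_up_to_25(percent):
        return percent <= 25

    print(staged_rollout([1, 5, 25, 100], healthy_up_to_25))
```

With the made-up health check above, the rollout completes the 1%, 5% and 25% stages and reports a rollback at 100%.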

So the same best practices really apply. Once you get to the holy grail and you run it in production, that’s great. You can write a blog post and go to a conference and talk about it, as some of us do. But it’s not just about that. And it’s also probably worth mentioning that there are use cases where it’s probably never going to be okay. If your choice is to potentially introduce this failure when you’re not confident whether it’s going to work, and someone might die because of that, that’s probably not the right moral choice. Or, if you have very heavy contractual obligations, from the legal point of view you might not be able to do that. But that doesn’t mean that you can’t get 80% of the value by harvesting the lower hanging fruit in the process. So, no, that’s a myth.

Other Common Misconceptions [00:20:53]

Henry Suryawirawan: [00:20:54] Thanks for clarifying that. So obviously for those of us who are not experienced in it, there are probably many other misconceptions like the one that excludes you from the chaos engineering group. So maybe you can tell us some of the other common misconceptions that people have.

Mikołaj Pawlikowski: [00:21:08] Sure. So there is an entire suite of these, and on a regular basis I ask my LinkedIn network what the biggest blocker is. And it still seems to be getting the buy-in, in big part because of these misconceptions. So the production one is a big one, and that’s typically something that people really think. The other thing is randomness. A lot of people basically disregard chaos engineering because they feel like it’s a gimmick, a chaos monkey randomly smashing things. For bigger systems, like Netflix, that example works nicely, because doing something as simple as taking down the VMs is already giving you value. It might not work for a smaller system. So explaining what we just talked about, that it’s not just about randomness, that it’s about an entire spectrum from random to very deliberate, is also important.

Another thing I keep hearing is, “Oh, we already have enough chaos. We don’t need any more chaos.” We’ve touched upon this already, too, but I really would like to drive the point home that if you still think that chaos engineering is about adding chaos to your systems, then you still haven’t gotten the memo. We add the failure so that the amount of incertitude, the amount of chaos in your system, actually decreases. So net-net, we do all of that to decrease the amount of chaos. Under that umbrella, a lot of people also say that because they don’t feel that confident in their systems. All of our systems break, and if you are just hearing about this, and you had a massive outage last week, you might feel, “Okay, fine. I need more maturity,” or something like that. I think that’s a false premise too, because regardless of where you are on your maturity spectrum, you can get some value out of doing this.

Typically these chaos experiments, unless you go really deep into the weeds, are simple to set up. They don’t take that long to implement. And especially as the tooling around them gets better and better by the day, it’s easier and easier to do pretty sophisticated things with little effort. Worst-case scenario, you don’t discover anything and you feel a bit better, because you know that this particular scenario is not going to take your system down. Best-case scenario, you discover something almost for free. These are small little bets that are really cheap to make and that can pay off. There’s also the benefit of the mindset that goes with that. If the engineers on your team know that this is going to be tested, that this is actually going to be put in practice, they have to design the systems with this in mind. It just becomes more natural to bake this reliability in from day one, rather than adding it as things break.

So the mindset is also something that we could probably talk about a little bit more. But the kind of thinking that, “Ooh, we’re not that mature. We still have outages and stuff like that.” It’s really not helping because at any stage in your maturity, even if you’re still getting outages on a regular basis, you can get value out of that. So no, it’s not adding more chaos. It’s definitely not, “Oh, we’re not mature enough.”

I think one more is about the scale. People look at Netflix and they look at Google and they say, “Oh, okay. Yeah, these are massive companies, and they’re bleeding edge, and they have all the scale and that’s great.” But it’s also worth noting that chaos engineering doesn’t require you to have all these fancy and massive distributed systems. It’s more of a methodology that can really work on any kind of system. Even if you’re working with a single-process legacy server, and you would like to make sure that you understand how it breaks, and that the things that you expect to happen, its retries or whatever recovery logic, don’t break, you can do that. Nothing is stopping you. So really, don’t be shy, kick the tires, and you’ll see that with a little bit of investment, you can get a lot of value out of that, and you don’t need to be like, “Oh no. We’re too small for that.” So I think that’s my top list of things that I just keep hearing over and over. Yeah, it’s probably too late now to change the name. So we’re probably going to keep hearing it now and then.

4 Steps to Chaos Engineering [00:25:41]

Henry Suryawirawan: [00:25:41] Thanks for clarifying all these misconceptions and myths about chaos engineering. So in the beginning, I think you have mentioned this as well. For people who have heard about this, so they want to give it a try, right? Can you summarize again the four steps that they need in order to start introducing this chaos engineering?

Mikołaj Pawlikowski: [00:25:58] Sure thing. So first is the observability. And I like the word because it’s very technical, but what it means is being able to measure something reliably. What I mean by reliably is that in computers, you know, we’re kind of spoiled, because we can measure some things reasonably well. If we were quantum physicists, it would be a bigger problem, because the measurement could affect what we try to measure, and we’d have the uncertainty principle to deal with to begin with. But what I’m getting at here is that it’s important to be good at observing some variable. It can be very simple. This variable can be anything. It could be whether the server is up or down; that’s something we can observe. Or throughput. Or the number of requests per second, or anything really.

And then once you have that, there is the steady state I mentioned: the normal range of that variable, which is how you verify what’s going on. So normally, the server is up. Or normally, I get this many requests per second. And then, we hypothesize. We say, okay, so we are disk write heavy. Let’s say that we want to make sure that we understand what happens if someone steals a little bit of the disk. So we might introduce a hypothesis: if the disk becomes 50% slower, I still hit, let’s say, 50% of my steady state in terms of requests per second. And then the last step is to implement that. Take one of the many tools that are available right now to do that. Make sure that it doesn’t affect your observability. Go run it and see what happens. More often than not, you’re going to discover things that you might not have predicted.
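To make the four steps concrete (observe a variable, record its steady state, state a hypothesis, run the experiment), here is a minimal Python sketch. Everything in it is assumed for illustration: `ToyServer` is a stand-in whose throughput is simply proportional to disk speed, and `run_experiment` and `Hypothesis` are hypothetical names rather than part of any chaos engineering tool.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    min_fraction_of_steady_state: float  # 0.5 means "at least 50% of normal"


class ToyServer:
    """Stand-in for a real system; assumes throughput scales linearly with disk speed."""

    def __init__(self):
        self.disk_speed = 1.0

    def requests_per_second(self):
        return 1000 * self.disk_speed


def run_experiment(measure, inject_failure, remove_failure, hypothesis):
    steady_state = measure()        # steps 1 and 2: observe the variable, record the normal value
    inject_failure()                # step 4: introduce the failure named in the hypothesis
    try:
        during_failure = measure()
    finally:
        remove_failure()            # always undo the failure, even if measuring raised
    threshold = steady_state * hypothesis.min_fraction_of_steady_state
    return {
        "steady_state": steady_state,
        "during_failure": during_failure,
        "hypothesis_holds": during_failure >= threshold,  # step 3: compare with the hypothesis
    }


if __name__ == "__main__":
    server = ToyServer()
    hypothesis = Hypothesis("disk 50% slower, still at least 50% of normal throughput", 0.5)
    result = run_experiment(
        server.requests_per_second,
        lambda: setattr(server, "disk_speed", 0.5),
        lambda: setattr(server, "disk_speed", 1.0),
        hypothesis,
    )
    print(result)
```

In a real experiment `measure` would read from your monitoring, and the injection would be done by an external tool; the shape of the loop stays the same.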

Henry Suryawirawan: [00:27:38] Thanks for summarizing that again. So just to recap, first you need to have good observability. I think this is a must, right? You cannot introduce something chaotic without actually knowing and observing how the system behaves, because that kind of defeats the purpose, I guess. Then the second one is that you need to know your steady state: what exactly your system looks like in its normal state before you introduce this chaos. Third, you start to hypothesize and design experiments, I guess, about what kind of things you want to test against the steady state. And then the fourth, the last one, is to implement that. So I hope I summarized it correctly.

Skill Set of a Chaos Engineer [00:28:11]

Henry Suryawirawan: [00:28:11] You mentioned as well that there’s this chaos engineering community and things like that. Are there specific skills actually required from a chaos engineer compared to, I don’t know, a normal engineer or an SRE? Maybe you can clarify that part. Any particular skills for a chaos engineer?

Mikołaj Pawlikowski: [00:28:27] Yeah. So this is one of the things that I really like about chaos engineering: it really cuts through different stacks, different technologies, and different kinds of designs and systems. It’s really more of a meta skill, the way that I see it. That also means that it’s not necessarily a job description. I mean, some people have this job description, but it’s more of a mindset slash skill that a lot of different people can have. So from what I do daily, you know, as a kind of SRE type person, I care a lot about my system not blowing up and my system working well. And as the scale increases, you realize that failure is a when rather than an if, and you start seeing more and more of it as the system grows. So the primary driver for me is to have another tool that helps me sleep better at night, and get paged or called less at night. So this is something that works out for the SRE teams, but if you’re an application team, you can also leverage the same thing. You design your system, your application, whatever you’re working on, in a way that makes it as reliable as possible. So you apply the same skills, and there’s nothing stopping you from running your own chaos experiments. The SRE kind of person might be running a platform that runs a lot of client code that they potentially know little about. On the application side, you run fewer applications, but you have much more intimate knowledge about them.

So to give you a more tangible example, chances are that you’re using a database. One of the things that you do is write your test cases, your unit tests, and you verify what happens when the database becomes unavailable. You verify that the right error is returned, for example, or that there is a retry. Something like that, and that’s great. That’s hopefully what everybody’s doing. But now the question is, do you know what happens when the database is still available, but it becomes a bit slower? Do you know what happens if, let’s say, all of the connections to the database are now being throttled because there’s some busy neighbor? Or the network is just overloaded and everything gets much slower? Do you understand how the compounding effects of all of that work? And do you have things in place to deal with that? Or is your application just going to hang, accept enough connections to exhaust the pool, and get stuck forever? So at every level of the stack, whatever you’re working on, you can use the same thing. I really try to drive this point home in my book, which is much more applied and technical than the other books that were available at the time: it applies wherever you are on the stack. Whether you’re working with the kernel and you want to verify what happens when syscalls are blocked or delayed or something like that, through the platform levels, the networking, maybe the Kubernetes level, the containers level, all the way up to the browser. Things like: okay, so JavaScript is everywhere, but do you really test what happens when the front-end code is having trouble connecting, or it’s actually getting slow responses? Do you still display coherent data, rather than loading a little bit here, a little bit there, and giving people a false impression? So it’s really applicable to all the levels of the stack. This is something that’s very exciting for me, because you get to look at all the different things, and, you know, you’re not just confined to a single box.
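The "database is available but slow" question can be tried out even in a unit test by wrapping the dependency with artificial latency and checking that the caller gives up instead of hanging. The sketch below is an assumption-laden illustration: `with_added_latency`, `call_with_timeout`, and the fake query are hypothetical names, and a real application would normally rely on its database driver's own timeout settings.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_added_latency(fn, delay_s):
    """Wrap a dependency call so that every call is delayed by delay_s seconds."""
    def slowed(*args, **kwargs):
        time.sleep(delay_s)
        return fn(*args, **kwargs)
    return slowed


def call_with_timeout(fn, timeout_s, fallback):
    """Call fn in a worker thread; return fallback if it does not finish in time."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    def fake_query():  # stand-in for a real database call
        return "row"

    slow_query = with_added_latency(fake_query, delay_s=0.3)
    print(call_with_timeout(fake_query, timeout_s=1.0, fallback="stale cache"))
    print(call_with_timeout(slow_query, timeout_s=0.05, fallback="stale cache"))
```

The second call exceeds its time budget, so the caller gets the fallback rather than waiting on the slow dependency.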

Examples of Chaos Engineering Tools [00:32:09]

Henry Suryawirawan: [00:32:09] So talking about these layers, I’m very interested, because you can do this at any layer, not just at the platform level, which is what we normally hear about from Netflix. You can also run it at the application level, database level, or even network level, or like what you said, in the browser itself. I haven’t heard of that, to be honest. So maybe you can give us some examples, like the names of the tools and what exactly they are testing at these layers? Some examples would be great for the audience to know.

Mikołaj Pawlikowski: [00:32:37] Sure thing. So, let’s say that we’ve all had this moment when someone gave us some legacy piece of software that was compiled 10 years ago, and it runs. But the documentation is a little bit iffy at best. It’s more of a tribal knowledge thing, and you don’t necessarily want to go and read the Fortran code or whatever C code it’s implemented in. If you’re working at a startup that was started last year, maybe you’re going to be able to skip that part of your life, but it’s like a rite of passage. So here’s one thing that you could do about it. Let’s say that it’s some kind of server, and that’s an example from my book; if you want to actually go and play with that, there’s a VM that lets you start it and play. Not a lot of people know that you can use strace not only to trace the system calls, but also to change their behavior. So you can make system calls return errors. Also, if you have a modern enough version of strace, you can actually implement patterns where you fail, for example, every other request.

So one of the things that you could do at this very, very low level is this: the server is doing writes to send the response to you. So you can go and verify that, for example, if it can’t do the write, if it gets an error, there’s some kind of retry logic. If you fail a portion of the writes, you’re going to see whether the retry logic kicks in and you get the right response, or whether it breaks completely. So even if you don’t know much, you don’t even see the source code, and all you have to go by is the tribal knowledge that “Oh yeah. It’s supposed to recover. It always recovered,” you can actually get some mileage out of the most basic thing that the program does, because every program on your system is going to have to do writes. A lot of people use strace to see what a program does at the low level, but not everybody knows that it also lets you implement chaos experiments like that. Strace is also a great example of the importance of observability, because the penalty of running a program while it’s being straced is pretty high. So the measurement actually affects the program that you’re running. It slows things down very significantly.
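For reference, the kind of strace invocation described here can be assembled as below. This Python sketch only builds and prints the command; it does not run it. It assumes a version of strace that supports the `-e inject=` option (check `man strace` on your machine), and the PID, the `EIO` error, and the helper name `strace_inject_command` are placeholders chosen for the example.

```python
import shlex


def strace_inject_command(pid, syscall="write", errno_name="EIO", first=2, step=2):
    """Build (but do not run) an strace command that fails selected calls of one syscall.

    when=FIRST+STEP fails invocation number FIRST and then every STEP-th one after it,
    so first=2, step=2 fails every other call.
    """
    inject = f"inject={syscall}:error={errno_name}:when={first}+{step}"
    return ["strace", "-f", "-p", str(pid), "-e", f"trace={syscall}", "-e", inject]


if __name__ == "__main__":
    # 1234 is a placeholder PID for the legacy server process.
    print(shlex.join(strace_inject_command(1234)))
```

Running the printed command against a test instance and watching whether clients still get correct responses is the experiment; attaching to a process usually needs matching user permissions or root.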

So there are other ways that you can observe these things. There are new technologies that let you do that, like eBPF, the extended Berkeley Packet Filter. They’re becoming increasingly common and powerful, because you can do a lot of similar measurements from the observability point of view without affecting the program nearly as much. But in a lot of scenarios, strace is going to be workable, because you can still measure what you need. If all you’re interested in is the success rate, it might be okay for you to accept the penalty while doing your test. It’s something you probably wouldn’t do on a production system, because obviously it will slow things down and affect the users. So this is the lowest level that I could think of. And that’s why I went all the way there in the book, to show you that in the worst-case scenario, a single process, you can still apply the same principles. And then, I mentioned Kubernetes. When you go a level up, there’s a lot of stuff that Kubernetes solves for you, but it also introduces its own complexity. And if you don’t have a good understanding, and just expect Kubernetes to always do the right thing, you might be in for a surprise. If you are working with Kubernetes as a product, if you’re running Kubernetes for someone else, it’s also very important to understand how Kubernetes itself fails and what the fragile points are. And this is something that’s very easy to do with chaos engineering, and every one of those experiments helps.

Then there are things like building chaos engineering into your application directly. Obviously, this means that the chaos code is part of the application, and it’s prone to introducing new bugs. But you can do things like activate the chaos engineering code only behind some kind of flag, and activate that for some percentage of your traffic. Make sure that there’s no runtime penalty, and that the code is not actually hit where it’s not activated, so you can’t break things for those instances. And then in the browser, like I described, this is something that people initially react to with a smile: “Okay, well, I’m going to chaos test my browser software?” But it doesn’t take much to show that if you display some data from the previous request and some data from the new request, you might actually trick the user into doing something silly, because they have stale or inconsistent data. So doing things like this is also important. Yeah. Those are three examples that hopefully give you the entire spectrum, going all the way up from the bottom.
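A flag-guarded failure injection inside an application, as described here, might look roughly like the following Python sketch. The names `ChaosConfig`, `InjectedFailure`, and `handle_request` are hypothetical; the only point is that the flag is checked first and is off by default, so the chaos path is never taken for ordinary traffic.

```python
import random


class InjectedFailure(Exception):
    """Raised only by the chaos path, so it is easy to tell apart from real errors."""


class ChaosConfig:
    """Hypothetical flag: disabled by default, with a fraction of requests to fail when on."""

    def __init__(self, enabled=False, failure_fraction=0.0, rng=None):
        self.enabled = enabled
        self.failure_fraction = failure_fraction
        self.rng = rng or random.Random()


def handle_request(payload, chaos):
    # Cheap boolean check first: when the flag is off, no chaos code runs at all.
    if chaos.enabled and chaos.rng.random() < chaos.failure_fraction:
        raise InjectedFailure("chaos experiment: simulated failure")
    return f"processed {payload}"


if __name__ == "__main__":
    print(handle_request("order-1", ChaosConfig()))  # flag off: normal behavior
    experiment = ChaosConfig(enabled=True, failure_fraction=0.1, rng=random.Random(7))
    outcomes = []
    for i in range(20):
        try:
            outcomes.append(handle_request(f"order-{i}", experiment))
        except InjectedFailure:
            outcomes.append("failed")
    print(outcomes.count("failed"), "of 20 requests failed on purpose")
```

Using a dedicated exception type keeps injected failures distinguishable from genuine ones in logs and metrics.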

Henry Suryawirawan: [00:37:33] So I think you mentioned a lot of these examples in your book. For those of you who are interested to know more, or even play around with some of these tools, I think Miko explains it clearly in the book. For this layer, for this tool, how do you do that? So make sure to read the book if you are interested further.

Chaos Engineering for People [00:37:48]

Henry Suryawirawan: [00:37:48] So I think in the book, there’s another thing that you mention, which I find very interesting, which is what you call chaos engineering, but for people. Why did you write a specific section for that, actually?

Mikołaj Pawlikowski: [00:37:58] It’s a little bit underappreciated, but think about your people systems, also known as teams. A lot of the same rules that apply when you’re building a reliable computer system also apply to building reliable human systems. Probably the easiest way to illustrate that is that every team is always going to have some bottlenecks, and the bottlenecks might be in the form of a throughput bottleneck, or they might be in the form of a knowledge bottleneck. If you have only one person who’s capable of debugging a particular system, and that person is on holiday, you’re going to have a problem. So a lot of the job of having a performant and good team is to continuously try to find those bottlenecks and resolve them. And a lot of this stuff happens naturally, and if the people on the team are thinking in this kind of way, the chaos engineering way, it’s going to be good for your team long term.

So this is something that the last chapter in the book talks about in detail, and it goes into the human aspect of getting the buy-in. It touches upon some of the bottlenecks and some of the misconceptions that we just described. It also goes into the chaos engineering mindset. Basically, if you think of your system with the expectation of failures happening, rather than the possibility of failures happening, it’s going to help you build more reliable systems from the ground up. It’s kind of like having a good CI system that always runs your unit tests: every time you push a change upstream, you automatically get the feedback from the unit tests. It becomes part of the culture. You get this immediate feedback, a quick turnaround. It works. It doesn’t work. It detected the problem. It’s similar with the chaos engineering mindset. You build this into your automatic thinking about things, so failure is not something that you might address later, it’s something that you expect to happen. It goes back again to the back-of-the-napkin maths. If you think about it, it’s not a question of if, it’s a question of when, and it’s fairly easy to calculate, depending on your scale, how often you’re going to see a given kind of failure.
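The back-of-the-napkin calculation mentioned here is simple arithmetic. A Python sketch, under the simplifying assumptions that machines fail independently and evenly throughout the year, and with a made-up 5% annual failure rate that is not from the interview:

```python
def expected_failures_per_day(machine_count, annual_failure_rate):
    """Expected machine failures per day, assuming independent, evenly spread failures."""
    return machine_count * annual_failure_rate / 365


def days_between_failures(machine_count, annual_failure_rate):
    """Average number of days between failures under the same assumptions."""
    return 1 / expected_failures_per_day(machine_count, annual_failure_rate)


if __name__ == "__main__":
    # Illustrative fleet sizes; the 5% annual failure rate is a placeholder.
    for machines in (10, 1_000, 10_000):
        per_day = expected_failures_per_day(machines, 0.05)
        gap = days_between_failures(machines, 0.05)
        print(f"{machines:>6} machines: {per_day:.3f} failures/day, one every {gap:.1f} days")
```

With these placeholder numbers, a 10-machine fleet sees a failure roughly every two years, while a 10,000-machine fleet sees more than one per day, which is the "when, not if" point at scale.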

And then the final bit of that was inspired by a presentation that Dave Rensin gave at one of the conferences, where he basically described teams, and I loved the description, I still remember it, as a set of bio robots executing a distributed algorithm to produce some work. And he went into a lot of detail on the games that they came up with to surface and detect the problems. I basically followed his lead there to include that in the book, and to make sure that it spreads through the community. As for some examples, I talked about the bottleneck in terms of knowledge. So one of the things that you can do is just tell somebody, on a particular day, that they’re not allowed to give anybody else any help on a particular subject. That will tell you whether the rest of the team can do without them. And if they can’t, you’ve detected a bottleneck. He goes into a lot of more advanced games, all the way up to basically telling people that a fake outage is happening, and seeing what happens and what they do to address it. This obviously is a little bit more tricky, because you need to get the buy-in from the higher-ups, and people might get confused. They might not know what’s actually going on when they try to debug something that’s not really happening. Or you might go all the way in, and actually go and break something on purpose without telling your team, to test your reliability in terms of responding to that. So Dave really provided some very interesting insights on that in the presentation, and I just couldn’t not include them in the book. Hopefully he still enjoys it there.

Make Chaos Engineering Boring [00:42:08]

Henry Suryawirawan: [00:42:08] Thanks for sharing that. I think it’s a very interesting concept, this mindset of teams as a distributed system. So Miko, I know that you are very active in this community, right? For me, all the cool stories I know about chaos engineering are about Netflix, the Chaos Monkey, the Simian Army, and all that. Are there any other cool examples that you have heard people showcase, maybe at conferences, or things that are publicly available? Are there some cool things like that?

Mikołaj Pawlikowski: [00:42:33] So it’s interesting that you’re asking that, because a lot of what I’ve been trying to do over the last few years is actually to make chaos engineering very boring. Let me explain why I think boring is good. If you think about the adoption curve of different new technologies, it always follows the same bell-shaped curve. Initially, it’s a novelty. You have this very small population of early innovators that’s happy to put up with all the shortcomings. Then you have potentially the early majority, then you have the late majority, and the people who drag behind a little bit. And so for a technology or methodology to reach a wide audience, it has to become mainstream enough, basically boring enough, that it can be adopted by a lot of people. Because not everybody is happy to work around the rough edges.

So to take an example that I often bring up: the SpaceX rockets. Not that long ago, seeing these rockets go all the way to orbit, and the boosters go all the way to the orbit and then automatically land, looked like something from science fiction. They were the first ones to pull it off, and it was amazing. I remember staying up late, because of the London time zone, and watching the nine minutes or whatever of the flight, and then the landing, like something from sci-fi. But over time, as they got better at it, it stopped being exciting, because they stopped blowing up. They just go up, go down, land on the drone ship. Of course I still love it. It’s just becoming so mainstream that I no longer find myself staying up for that. So now maybe Starship is something that I’m going to want to look at, but if they start landing every test and doing it routinely, it becomes boring. And so boring is good.

The same way that a smartphone was something that was outrageous not that long ago, and now everybody has one. You can get it on the cheap, and it’s outrageously good and quick, and you have access to so many different apps. You have a small supercomputer in your pocket all the time, and you don’t even notice that. So I would really like chaos engineering to stop being about the exciting stuff, where you can break things in production and not get fired for that, and instead become this routine thing that we do just because it creates a lot of value. There are benefits to doing that.

Yeah, I’m going to go in the opposite direction of that, and say that it’s probably more about making it boring. And when you think about it, it’s like cybersecurity. In the movies we see the hackers just randomly punching on keyboards, and streams of data going through, and “I’m in.” And probably some nice graphics turn up. But in reality, most of the low-hanging fruit is so boring, because you need to check all the routine things. You need to stay up to date, and you need to patch the things that are already known, and you need to make sure that your S3 bucket is not set to public. So that’s the boring stuff. It doesn’t make it into the movies, but that’s where most of the work is done. So yeah, if you want the exciting stuff, there are definitely things on the internet. But that’s not really where most of the value is coming from. Most of the value is coming from the boring stuff.

Henry Suryawirawan: [00:45:56] Thanks for the valid points. Hopefully, one day we’ll see that chaos engineering is not just for unicorns or the cool companies that are able to do that, and that every engineering team is able to introduce some kind of experiment, a chaos experiment, in order to test the reliability of their system.

3 Tech Lead Wisdom [00:46:12]

Henry Suryawirawan: [00:46:12] So Miko, thanks for spending your time today. We have eventually come to the end of this conversation. But before I let you go, I normally ask all my guests one question, which is about three pieces of technical leadership wisdom. So can you share some of the wisdom that you have, maybe from your career, or maybe from your chaos engineering experiments, so that the audience can learn and benefit from you?

Mikołaj Pawlikowski: [00:46:35] Wow. Wisdom. That’s a big word. I’ll give it a try. I think one of the things that I learned, which probably gave me a lot of mileage, is that a lot of what we do is about removing the BS from the equation. Engineers, apart from the big egos and everything, are really very finely tuned to detect BS. And so a lot of tech leadership, and leadership in tech in general, in my opinion, is about just making sure that there is as little BS as possible. So if someone asks you a question, you have a few options. You can say yes or no, if you know the answer for sure. You can say, “I don’t know,” if you don’t know the answer. Or you can try to BS your way through it, and try to come up with something on the spot. And I got a lot of mileage just by removing that last option. If you just tell people honestly, “I know,” “I don’t know,” yes or no, it builds this relationship that’s based on trust, where they also feel like they can say, “I don’t know.” I’m supposed to be an expert in the field and paid a load of money for that. I’m senior and everything, but there’s stuff that I won’t know, and that’s completely fine. You work with other people who are smarter than you, who have more experience than you, who by definition are going to be much better at parts of your job. If you are humble enough to just say, “Oh, I don’t know. Well, what do you think?”, it gets you a lot of mileage. I think that really helps me stay out of trouble.

And as kind of a corollary to that, another nugget is that regardless of what you think about yourself, there’s always something that you can learn from everybody. And we work a lot in this industry on increasing the diversity of thought. It means coming from a mindset that, okay, if that person disagrees with me, there is probably something that they think about or they know about that I don’t know about. So if I just try to convince them of my way of doing things, I’m not going to come out of that conversation wiser. But if I at least by default give it a shot, maybe they’re wrong, maybe there’s something I know that they don’t know. But if I give it a shot, I’m going to get more value out of that. And if you keep getting value from every conversation, you’re really going to accumulate that over time. So yeah, I think that being humble, knowing that you can learn things from everybody, and not BS-ing is probably what really helped me get where I am right now.

Because you asked for three, I’m also going to attempt a third one. I think you probably hear that a lot on your show, but the learning mindset, lifelong learning, is very important. This is something that some people have built in; they start with that and they’re just super excited about learning. By learning, I mean a lot of things. There might be a point where you know everything about your niche and there isn’t much more to learn. Instead of stopping learning, you should probably start exploring another niche. Or maybe do something completely out of your comfort zone, and go learn a language, or do some art, or sport, and see how that affects your brain. In recent years I’ve been reading a lot about the brain and how it works. I found myself being able to influence a lot of weird things in my life, just by picking up new skills that seem to be completely unrelated. It’s easy these days. You can get an app, you go on Duolingo, and you can pick up a language that would normally cost you a lot of money. It would be impractical. But you can take the first few steps free or very cheaply. Obviously, the pandemic made it a bit more difficult to pick up any new sports. Anyway, the point is that learning things might have unexpected benefits for you. And this is something that I really recommend doing.

Henry Suryawirawan: [00:50:29] Yeah. I agree with that. Thanks again for reminding us of this important learning mindset. So Miko, for people who are interested in learning more about you, or maybe your recent book, where can they find you? Online maybe?

Mikołaj Pawlikowski: [00:50:40] Sure. So probably the easiest way to reach out to me is on LinkedIn, if you want to interact. Otherwise, I do have a mailing list for the book. If you want to, go to chaosengineering.news, where you can sign up and get updates. If you have any particular updates about the book, or I messed something up, do reach out. There is a GitHub repo for the book where you can download the VM, and that’s where you can put issues. I’m pretty sure I messed some things up. So looking forward to that. Hopefully, I can crash conferences when they become in person again. Kind of looking forward to them.

Henry Suryawirawan: [00:51:19] So the way I think of it, it’s also like a chaos experiment in our lives, where this pandemic suddenly threw people into a different kind of living, mindset, and routine. Hopefully, this pandemic will end soon enough so that we can all go back to our previous normal life.

So thanks again, Miko. I hope your chaos engineering mindset, and your championing of it, will lead more people to implement it in their teams, their systems, and also their companies, so that we can make it more boring, so to speak, like you said. And I wish you good luck in your career as well.

Mikołaj Pawlikowski: [00:51:51] Thank you. It was really fun being on the podcast.

– End –