#59 - DevOps Solutions to Operations Anti-Patterns - Jeffery Smith

 

   

“DevOps is about creating a collaborative environment between the development team and the operations team, and aligning goals and incentives between those two teams. Because so many of the problems that we encounter in life, not just even in technology, are due to misalignment of goals.”

Jeffery Smith is the author of “Operations Anti-Patterns, DevOps Solutions” and the Director of Production Operations at Centro. In this episode, Jeffery described DevOps essentials and emphasized what DevOps is not. He also explained CAMS, a framework that outlines the core components required for a successful DevOps transformation. We then discussed three anti-patterns taken from his book: the paternalist syndrome, alert fatigue, and wasting a perfectly good incident; and he explained how to recognize those anti-patterns in order to avoid them on our DevOps journey. Finally, Jeffery also talked about postmortems and shared tips on how to cultivate a good postmortem culture.

Listen out for:

  • Career Journey - [00:04:47]
  • DevOps - [00:09:13]
  • CAMS - [00:12:42]
  • Why DevOps Anti-Patterns - [00:16:48]
  • Anti-Pattern 1: Paternalist Syndrome - [00:19:55]
  • Anti-Pattern 2: Alert Fatigue - [00:27:20]
  • Anti-Pattern 3: Wasting a Perfectly Good Incident - [00:34:33]
  • Postmortem - [00:39:59]
  • 4 Tech Lead Wisdom - [00:45:57]

_____

Jeffery Smith’s Bio
Jeffery Smith has been in the technology industry for over 15 years, oscillating between management and individual contributor roles. Jeff currently serves as the Director of Production Operations for Centro, a media services and technology company headquartered in Chicago, Illinois. Before that, he served as the Manager of Site Reliability Engineering at Grubhub.

Jeff is passionate about DevOps transformations in organizations large and small, with a particular interest in the psychological aspects of problems in companies. He lives in Chicago with his wife Stephanie and their two kids Ella and Xander.


Our Sponsors
Are you looking for some new cool swag?

Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive your order.

 

Like this episode?
Follow @techleadjournal on LinkedIn, Twitter, Instagram.
Buy me a coffee or become a patron.

 

Quotes

DevOps

  • The first thing I was going to describe is what it’s not. It is not a role. It is not a job title. DevOps is a style of working. It’s about creating a collaborative environment between the development team and the operations team, and aligning goals and incentives between those two teams.

  • You would never post a position for an Agile engineer. That sounds insane. You would never do that because Agile is a style of working. Same thing with DevOps.

  • How do we build cohesive joint incentives for teams to work towards a common goal? And putting that sort of front and center shapes a lot of your underlying decisions moving forward. Because when you think about it, so many of the problems that we encounter in life, not just even in technology, are a misalignment of goals.

  • I think that many system engineers, system administrators, whatever we want to call, I think there are enough people that have been in these roles previously that are doing the same things that we’re doing today. It’s just with a different tool set.

  • The minute you give someone a title like that [DevOps engineer], it becomes their job. The minute you have a QA team, guess whose job quality is, right? Language is a very finicky thing because it does so much to the way we perceive the world just by giving something a name or title.

CAMS

  • CAMS is this light framework around DevOps as a concept to think about the core components that you need in order to make this transformation.

    • C is for Culture. You create that culture through a small rule change. Culture is just something that we always have to keep front and center when we’re talking about DevOps.

    • The A is for Automation. Automation is really the thing that powers this.

      • One thing is you’re always going to be asked to do more with less. But the other thing that we need to do is the more automation that you have, the more empowerment you can give to other people in the organization. It’s this thing that in the book, I talk about this idea of exporting expertise. So Automation is key.
    • The L that you alluded to that some people call it is Lean. I’ll talk about Lean briefly, but I’m not a huge fan of adding the L. Lean is just about operating dynamically and quickly.

    • Metrics. DevOps should be rooted in data. As part of that empowerment, people need to know if the thing that they’re doing is actually having an impact. We should be able to objectively point to success or failure, and we do that through metrics.

    • Sharing. You know, back to that expertise exporting, how do we share knowledge? How do we share access when appropriate? How do we share responsibility?

  • When you look at organizations that aren’t doing well in their transformation, you can usually find behaviors or actions that tie to one of these five categories.

Anti-Pattern 1: Paternalist Syndrome

  • It is when someone in a relationship assumes the role of the parent, even though no true hierarchy exists.

  • There’s a reputation of operations teams being the team of “No”. I quickly was able to identify that the reason was the fact that we had misaligned incentives.

  • The paternalist syndrome is this behavior where we assume everyone else is out to destroy the system and we have to protect it. That means nothing can go to production without it coming through us first. That means no one can have access to production, no matter how sane of a request it is.

  • We’re starting from the position of “No” and defending that instead of starting from the position of “Yes, but”: Yes, but how do we address the audit? Yes, but how do we ensure that access doesn’t spread? And it’s really a mindset.

  • What the paternalist syndrome will do is, anytime that there’s a problem, the first tool we reach in the toolbox is another gate, another checkpoint. And every time we do that, we’re adding another layer of slowdowns, very little value-add work, and wasted time for ops folks too.

  • Policies are organizational scar tissue. Every time that there’s a policy that’s enacted, you can typically go back to some inciting incident where that happened, and the policy came out as a result of that. Almost to the point where it becomes lore and legend, where no one was even there for that anymore.

  • The paternalist syndrome chapter talks about how we can identify some of the common issues that we run into, and sort of the pattern of fear, honestly. Cause that’s what it is, fear around particular situations and how we can go about breaking that down using CAMS. How do we do that through automation so that we can empower someone?

  • The other thing with the paternalist syndrome is we often insert ourselves as gatekeepers when we’re adding zero value to the process.

  • The paternalist syndrome is really about changing your mindset. Getting out of the habit of just being “No” and shifting to a “Yes, but”. Yes, but how do we solve for these problems? Because no one wants to feel like they don’t take their jobs seriously. Everyone at work is doing their best with the knowledge that they have to do their job.

Anti-Pattern 2: Alert Fatigue

  • Alert fatigue is actually a term borrowed from the medical industry. It came about from nurses who would not respond to beeping alarms in hospitals because the alarms always go off. They always go off, so they became desensitized to it to the point where even when there was an emergency, there was no way to elevate that emergency beep.

  • Alert fatigue in technology (is) to say, what alerts do we have that are just constantly firing, that are drowning out more critical, useful, actionable alerts? I think the keyword there is actionable.

  • When we design alarms, we design alarms from the perspective of things we think might be bad. The other thing to think about is if you can’t design an alert that leads an engineer to take a next step or action, you need to seriously question the value of that alert.

  • This idea of alert fatigue is this idea of making sure that these alerts are actionable. And if they’re not actionable, get rid of it. Just get rid of it because it’s not helpful.

  • I would rather be a few minutes late to an alert, but know it’s real than to be constantly alerted and not being sure if it’s accurate or not, and having to figure that out and decide.

  • Alert fatigue is really focused on identifying those patterns and trying to make them better, or just simply eliminating them.

  • Another big thing that the chapter talks about as well is creating metrics that reflect the business impact.

  • We always default to waking someone up as the default alert. But there can be different types of alert. Have a low priority alert that emails you.

  • Get rid of alerts that don’t mean anything to you. Tweak your alert notification settings so that you can do emails instead of always paging out. Try to tie your alerts to some sort of business impact so that you know whether you really need to wake someone up, or if the alert that’s firing is something that you actually care about.

  • It’s okay to actually be alerted late as long as you can guarantee that it’s a real problem and you are supposed to take an action on it. Very seldom, those extra minutes actually mean anything.

Anti-Pattern 3: Wasting a Perfectly Good Incident

  • It actually stems from a local politician here in Chicago who would always say, never let a good crisis go to waste. It’s this idea that there is so much to learn from an incident.

  • We have these mental models of our systems, right? These systems are becoming so complicated, so complex, so many different pieces. So everyone in their head has this mental model of how they think the system works. Then there’s the reality of how it actually works. And those two are seldom in full alignment.

  • It’s usually something off. And the delta, the time that you find out that your model is different than reality, is in an incident. Other than that, you can spend the rest of your life in complete delusion, thinking you understand how the system works. But then, when there’s an incident, suddenly the gap between your understanding and reality is exposed in its raw form.

  • So often, we sort of just close an incident ticket and move on. But it’s like, hang on, let’s dig into this incident and try to make our mental models better and learn from our mistakes.

  • It goes beyond just the tech side. It’s also the human side.

  • Wasting a perfectly good incident is really this idea (that) there’s so much more data and information in a failure that we can bring out if we just really want to put some energy towards it.

Postmortem

  • The first thing you need to do is do it fast, like immediately after the incident within 24 hours. Why? Because things have a sense of permanence in our minds, but only for a short period of time. Within that 24-hour period, this is the most important thing that has ever happened. But after that period, it becomes just noise in the background, while you’ve got all of these other demands on you. So if you can get people in the room as soon as possible to talk about it, the actual incident is fresher in their mind, so they can recall more of the details. They’re much more energized around it.

  • Schedule it as soon as possible. Definitely within 24 hours. Once you do a few of these, I guarantee you: Even without any real training in the process, you’re going to discover things. And as you discover them, people will instantly see the value in it.

  • The hard part is the follow-up action items. Because how do you influence people whose schedule and prioritization you have no control over to make sure that some of these things that were brought up are addressed? Sometimes, it’s worth doing it just to have the knowledge.

  • But for the things that you need to actually fix, that’s where you really need some leadership buy-in. Putting dollar signs next to the actual potential risk helps sometimes. By design the only people that really care about that (is) management. But yeah, 24 hours. Proving the use out of it and documenting that use, radiating that information as part of the sharing portion. Have it somewhere where people can see and review it and understand it.

  • Lead by example. Do it yourself. And when you do it, don’t treat it like it’s some newfangled thing. Do it as if it’s the most logical thing that you should be doing.

  • With any cultural change, you’ve got three categories of people, right? You’ve got supporters, detractors, and fence-sitters. The vast majority of people are fence-sitters. And this is true of any context. Most people aren’t on the extreme ends of either side. Most people are in the middle. But the extreme ends are the loud ones, so those are the ones that we focus on.

  • It doesn’t take a lot of people to change the culture of an organization. It takes a few supporters or even a few detractors; it works both ways, positive or negative influence. It only takes a handful of people to be able to change and have the fence-sitters move over.

  • Think about that as you’re recruiting people for these different aspects, right? Who are the people that can really be boosters for me? And I don’t have to convert everyone. I’ve got to convert a few really boisterous cheerleaders for this. And if I convert them, people are going to follow. And once people start following, you’ve overwhelmed the detractors and they lose.

4 Tech Lead Wisdom

  1. Never let perfect be the enemy of good or good enough.

    • A solution that is 70% effective is better than a perfect solution that hasn’t been implemented. So never let the fact that it’s not perfect stop you. Because guess what? It’s never perfect. It’s only perfect in your mind.

    • Keep pushing. Get it done. You’re going to learn a bunch of stuff. It’s not going to be perfect. Something’s going to be screwed up.

  2. There is no point in a project’s journey at which you know less about the requirements than at the beginning.

    • When you start, that is the most minimal amount of information you’re actually going to have about your requirements. So keep that in mind when you’re designing a solution for something. Because as you move through the project, your requirements are going to become more and more concrete and better understood.

    • We have to accept the fact that they did not know when they started. They made the best choice that they can make. So, always keep that in mind.

  3. As an engineer, you have an implicit bias in just about everything you do, and it’s particularly bad in technology.

    • “I didn’t write it, therefore it’s crap.”

    • A good engineer knows the difference between a preference and a problem. You have to understand what’s a preference and what’s a problem, so that you know where to focus and put your energy.

    • No one made a career out of rewriting something that was already working, and just introducing different problems because you never actually fix it, right? Like you might shift it. You might change the set of problems, but there’s always some new problem that you’re going to be dealing with.

  4. When you make a choice, document your constraints and your context.

    • When you make any sort of technical decision, why did you make that decision? What was the reality on the ground?

    • There’s always context around every technical decision that gets made. If you can document that, it’ll save you some hassle and future engineers some energy around understanding why particular decisions were made.

Transcript

[00:00:50] Episode Introduction

[00:00:50] Henry Suryawirawan: Hello, everyone. This is Henry Suryawirawan. Welcome to another episode of the Tech Lead Journal podcast. Thank you for tuning in and spending your time with me today listening to this episode. If you haven’t, please follow Tech Lead Journal on your podcast app and social media on LinkedIn, Twitter, and Instagram. Also consider supporting the show by subscribing as a patron at techleadjournal.dev/patron, and support me to continue producing great content every week.

DevOps as a culture and practice is one of the most widely talked about topics in the current technology landscape. According to the State of DevOps Report, elite and high DevOps performers are leading their industries in terms of organizational performance, and continue to outperform the companies that do not practice DevOps optimally. Some of the reasons companies do not practice DevOps optimally are the misconception of the whole DevOps concept as a culture, and also some anti-patterns that may get adopted unconsciously.

To shed more light on DevOps culture and practices, for today’s episode, I’m happy to share my conversation with Jeffery Smith, the author of “Operations Anti-Patterns, DevOps Solutions” and the Director of Production Operations at Centro.

In this episode, Jeffery described DevOps essentials, and importantly, emphasized what DevOps is not. He also broke down and explained CAMS (C-A-M-S), a framework that outlines the core components required for a successful DevOps transformation. We then discussed three anti-patterns that are taken from his book: the paternalist syndrome, alert fatigue, and wasting a perfectly good incident. Jeffery explained how to recognize those anti-patterns and how we can avoid them on our DevOps journey. And finally, Jeffery also talked about postmortems, and he shared some tips on how we can cultivate a good postmortem culture.

This is such a fun conversation with Jeffery, and I really, really enjoyed it, especially discussing those three DevOps anti-patterns that I could personally relate to from my own experience, and I believe that you will highly enjoy and learn a lot from this episode as well. And if you do, consider helping the show by giving it a rating and review on your podcast app or sharing some comments on the social media channels. Those reviews and comments are one of the best ways to help me get this podcast to reach more listeners. And hopefully, they will also benefit from all the content in this podcast. So let’s start our episode right after our short sponsor message.

[00:04:00] Introduction

[00:04:00] Henry Suryawirawan: Hey, everyone. Welcome back to another Tech Lead Journal podcast show. Today, I have with me an author named Jeffery Smith. Jeffery is the author of the book called “Operations Anti-Patterns, DevOps Solutions”. So I guess it’s like using DevOps solutions to overcome those anti-patterns. So today we will be talking a lot about what are some of the anti-patterns for operations people, or in the DevOps world. We’re going to be talking about some of the things that maybe you should try to avoid in your operations team, and a few more strategies that can actually improve how you manage your systems, how you operate your systems, so that using the DevOps solutions, we can overcome those things. So, thank you so much for spending your time with me today, Jeffery. Hope to have a good conversation with you today.

[00:04:45] Jeffery Smith: Awesome. Yeah. Thanks for having me. I’m looking forward to it.

[00:04:47] Career Journey

[00:04:47] Henry Suryawirawan: So Jeffery, in the beginning, maybe if you can introduce yourself, telling us more about your career, any highlights or turning points.

[00:04:54] Jeffery Smith: Sure. Yeah. So, as you mentioned, my name is Jeff Smith. I’m currently the Director of Production Operations at a company called Centro in Chicago, Illinois, here in the United States. We are an Ad Tech platform, which didn’t sound like the most exciting thing to me when I originally took the job. Ad tech is like the bane of the internet, but in reality, it is what fuels the internet and keeps so much of it free. I’ll talk a little bit more about that later, but prior to that, I was with a company called Grubhub and they’re in the food delivery space. My time at Grubhub was really when I first got introduced to DevOps and DevOps concepts. It was the first place I could really sink my teeth into it and experiment with it.

My career really started 20-odd years ago back in upstate New York where I grew up. I had been working doing data entry at a local health insurance provider in the area. I’d always been interested in computers, but I was a terrible high school student. A gentleman who was a manager of operations, who would later become my mentor, walked by and saw me reading Richard Stevens’ “TCP/IP Illustrated” book. I don’t know if you know that book, but it’s like the networking Bible. What he didn’t know at the time was that book was way over my head. I was reading it and I was getting some of it, but it was pretty dense stuff. He saw it and reached out and was like, “Hey, what are you reading there?” We talked a little bit about it. We would have conversations every now and then. Eventually, an operations position opened up on his team and he said, “Hey, would you like to switch your career, stop doing data entry and do some computer stuff?” So I was like, yeah, sure. So that’s really what kicked off this 20-year romance with tech, I’ll say.

So I was at that company for about 10 years. I worked my way up to operations manager, eventually succeeding him. You know, it was one of those moments where you’re like, “Wow, is every place as messed up as this one? Maybe I need to branch out and see what else is out there.” My wife, well, girlfriend at the time, now wife, said, “Why don’t we open up a search? Why don’t we look anywhere instead of just looking in our local area?” So that’s what brought us to Chicago. I got a job in Chicago, switched between a couple of different companies, had some success doing performance tuning for a company called Accenture in one of their lab environments. But that was right during the start of the financial downturn. I’d been brought on as a contract-to-hire, and then as soon as I got brought on, the financial crisis hit. So they weren’t really hiring, and they kept extending me, extending me, extending me, and then finally I said, “You know what? I need health insurance, right?” Like I’m walking around here a bucket away from financial ruin. So, I started looking for a new opportunity, and the day I got a new job was the day they told me that they weren’t going to be able to extend my contract anymore. And I was like, “Oh, it’s funny. Cause I was going to have the same meeting with you to tell you I was leaving.” So, everyone was relieved. But you know, I worked a couple of different jobs before getting to Grubhub.

I think Grubhub was a huge turning point. One, it was the first time I’d worked for a company that actually made a product that I cared about, and you never really think about how important that is. So often you’re just stuck in this field. When you think about it, technologists are really mercenaries. They don’t really care about the field that they’re in. They’re just like, what language am I coding in, and the field or whatever the company does is secondary to that. So it was the first time I’d worked at a company that actually made a product that I cared about. That really fueled my interest in how things truly work, not just from a computer perspective, but from a business perspective as well. And that gave me all types of insights that I was blind to before. Because, you know, before that it was like, I don’t care about health insurance, right? Like the inner workings of health insurance. I didn’t care about legal tax software, which I was responsible for running at a company called Wolters Kluwer. I just didn’t care about those things. So I didn’t have a vested interest in understanding how they work. But being at Grubhub and being curious about how things work opened me up to this world of possibility when it comes to operations.

So I carried that experience with me into Centro, and even though at first I didn’t care about how Ad Tech worked, I knew that it was important and it could be career changing to really understand the business. I’m glad I did. Because Ad Tech is way more fascinating as much as it is problematic from the various different perspectives, but it is definitely needed in the internet age. Because people don’t want to pay for Facebook. It’s the tradeoff. So how do we do that? And do that in a respectful manner for our customers, clients and users. So that’s a quick recap of 20-plus years of experience, but it’s been a fun ride.

[00:08:55] Henry Suryawirawan: Thanks for sharing your story. It’s very interesting how you started your journey. So through this book, TCP/IP. I have to agree that yeah, that book is also way over my head.

[00:09:05] Jeffery Smith: Yeah. It’s a great book, though. It’s absolutely the Bible. But back then, at that time, I was like, “What?” That was a good prop. It was a really good prop.

[00:09:13] DevOps

[00:09:13] Henry Suryawirawan: So over this 20-year journey, I’m sure you have seen a lot of things, right? In the beginning, you also mentioned the messy things of operations and administration. And now in this era of DevOps, maybe you can clarify first for the audience and listeners here. What actually is DevOps?

[00:09:28] Jeffery Smith: So the first thing I was going to describe is what it’s not. It is not a role. It is not a job title. That’s a major pet peeve of mine. I think it might even be a little detrimental in my hiring efforts because people are searching for that DevOps role. DevOps is a style of working. It’s about creating a collaborative environment between the development team and the operations team, and aligning goals and incentives between those two teams. You know, we say Dev and Ops and focus on those teams, but it really can be any grouping of teams. It could be dev and finance, ops and finance, ops and security. And we keep coming up with all these acronyms, DevSec, OpsFin, and it’s just like, let’s just call it DevOps. We understand what the idea is. I always say you would never post a position for an Agile engineer. That sounds insane. You would never do that because Agile is a style of working. Same thing with DevOps. How do we build cohesive joint incentives for teams to work towards a common goal? And putting that sort of front and center shapes a lot of your underlying decisions moving forward. Because when you think about it, so many of the problems that we encounter in life, not just even in technology, are a misalignment of goals.

[00:10:36] Henry Suryawirawan: You said that it’s not a role, right? But I agree, from this part of the world, I also keep seeing the DevOps engineer role. What do you think should be the name of the title? Is it like SRE, as some people actually prefer?

[00:10:48] Jeffery Smith: I think that many system engineers, system administrators, whatever we want to call them, I think there are enough people that have been in these roles previously that are doing the same things that we’re doing today. It’s just with a different tool set, right? When we made the switch from Unix to Linux, we didn’t really have to come up with a new job description. It was the same deal. It was just, you know, now there’s a new tool. When we make the switch from Python to Go, we don’t come up with a new job title. That’s just an asterisk under the software developer title. I think DevOps is the same. I think we have plenty of titles that could be used. I’m not naive to market forces, though, and I know that if you have a DevOps title, it’s probably an extra $10,000 or $15,000 a year. So I don’t begrudge anyone using that. In fact, I tell my team that I manage, if you want to call yourself a DevOps engineer on LinkedIn, that’s fine. I’ll back you, but that’s not what we’re called here.

Because the minute you give someone a title like that, it becomes their job. The minute you have a QA team, guess whose job quality is, right? Language is a very finicky thing because it does so much to the way we perceive the world just by giving something a name or title. That’s why I try to avoid it so that everyone knows. It’s like this DevOps things we’re talking about is everybody’s job. And we need to think about that too, from an ops perspective, right? Because we can quickly fall into a trap where we’re like, “Oh we’re building operations stuff, nobody else needs to see this.” Well, that kind of runs counter to this collaborative environment that we’re talking about. So maybe we do need to make sure that anyone and everyone can see that. Maybe we need to make sure we’re keeping the bodies out of the ops code so that we don’t have to lock down the repo. Some configuration file that they’re like, we can’t share that with anyone because there’s too much dirt in there. There’s just way too many dead bodies. We’ve got to keep that under tight locks. By putting that mindset forward, hopefully that helps to prevent and eliminate those sorts of traps.

[00:12:42] CAMS

[00:12:42] Henry Suryawirawan: Thanks for giving that clarification. So in your view, I think you mentioned this in the book as well. There are a few things that are very important when you want to adopt this DevOps culture, right? This new style of working. So it’s an acronym called CAMS, or some people actually call it CALMS. C A M S or C A L M S. So maybe you can shed some light here. What is CAMS?

[00:13:03] Jeffery Smith: So CAMS is really this light framework around DevOps as a concept to think about the core components that you need in order to make this transformation. C is for Culture. The A is for Automation. The L that you alluded to that some people call it is Lean. Metrics and Sharing are the last two. So, culture is like, one, so much of DevOps is cultural, and you need to build a cultural environment in your organization where these sorts of practices and concepts are embraced, where people are free to experiment without fear of retribution if they get something wrong. There are all these small insidious things that happen in our culture that especially us as engineers and technologists might gloss over, not realizing what an outsize impact that they have.

Have you ever worked for a company where the culture is, if you don’t use your budget by the end of the year, you lose it and you won’t get it next year? That’s a cultural thing. It doesn’t have to be that way. There is no golden finance rule. That’s just the way people operate. So what does that do? That creates a culture where everyone is spending aimlessly. They’re spending aimlessly at the end of the year to make sure that they use that budget. And I’m sure that people in finance, they’re like, sweet. That’s exactly what we wanted to happen. But you created that culture through a small rule change. So culture is just something that we always have to keep front and center when we’re talking about DevOps.

Automation is really the thing that powers this. One of the things is you’re always going to be asked to do more with less. But the other thing that we need to do is the more automation that you have, the more empowerment you can give to other people in the organization. It’s this thing that in the book, I talk about this idea of exporting expertise. So, it is a technical dance to fail over a database. I don’t know if you’ve ever had to fail over a production database. But like everyone’s been in that company where it’s, oh, we got to fail the database over, get Bob. Bob is the only one that can fail the database over. The minute Bob turns that into a script, where someone just has to execute fail_database, Bob has transferred a large chunk of his expertise into this automation script. And now he’s empowered tens of people that can do this thing. So Automation is key.
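To make that idea of exporting expertise a bit more concrete, here is a minimal sketch (not from the book, and not Centro's setup) of what a fail_database script could look like for a hypothetical PostgreSQL primary/replica pair. The host names, commands, and notification address are all assumptions for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical fail_database script: a sketch of "exporting expertise".

Assumes a PostgreSQL-style primary/replica pair reachable over SSH; the hosts,
commands, and email address are illustrative only.
"""
import smtplib
import subprocess
import sys
from email.message import EmailMessage

PRIMARY = "db-primary.example.internal"
REPLICA = "db-replica.example.internal"
OPS_EMAIL = "ops@example.com"


def run(host: str, command: str) -> str:
    """Run a command on a remote host and fail loudly if it errors."""
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, check=True
    )
    return result.stdout


def notify(body: str) -> None:
    """Email ops so the team still knows a failover happened (the Sharing part)."""
    msg = EmailMessage()
    msg["Subject"] = "Database failover executed"
    msg["From"] = "automation@example.com"
    msg["To"] = OPS_EMAIL
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def main() -> None:
    # The steps Bob used to carry in his head, now encoded once.
    run(PRIMARY, "sudo systemctl stop postgresql")  # stop writes on the old primary
    run(REPLICA, "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data")
    notify(f"{REPLICA} was promoted to primary by {sys.argv[0]}.")
    print("Failover complete. Update application connection strings if needed.")


if __name__ == "__main__":
    main()
```

The point is less the specific commands and more that the steps that used to live only in Bob's head are now encoded, reviewable, and runnable by anyone on the team.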

I’ll talk about Lean briefly, but I’m not a huge fan of adding the L. Lean is just about operating dynamically and quickly. I don’t know that it falls in the category of DevOps because typically you’re adopting a model that is similar to the organization. So like, if you’re forced to adopt Lean, but you’re in a waterfall shop, that can be problematic.

Metrics. DevOps should be rooted in data. As part of that empowerment, people need to know if the thing that they’re doing is actually having an impact. And we always talk about it with systems and computers and whatnot, but really extends everywhere, right? As we’re running our ticket queue, is it actually performing the way we want to perform? And what metrics are we judging that success or failure by? We should be able to objectively point to success or failure, and we do that through Metrics.

And then, Sharing. You know, how do we, again, back to that Expertise exporting, how do we share knowledge? How do we share access when appropriate? How do we share responsibility? How do we make sure that we’re all sort of on the hook for the same thing, and we’re all contributing to make that thing better? Another thing that’s sort of front and center. So these things sort of collapsed together to build this framework and how we should think about approaching these DevOps transformations. And when you look at organizations that aren’t doing well in their transformation, you can usually find behaviors or actions that tie to one of these five categories. I keep leaving out lean. I’m sorry for all the lean fans. I’m not hating, but I’m hating a little bit.

[00:16:39] Henry Suryawirawan: So thanks for sharing. I really like the concept of exporting expertise. So these things about automation, sharing your knowledge. I think it’s a great thing, especially in this DevOps culture.

[00:16:48] Why DevOps Anti-Patterns

[00:16:48] Henry Suryawirawan: Let’s go into the anti-patterns that you covered a lot in the book. I think there are probably 10 or 12 kinds of anti-patterns. So we can start with the favorite one, probably. But in the beginning, let’s probably dive into why you wrote this book. Why are you covering anti-patterns?

[00:17:05] Jeffery Smith: Sure. So it’s a funny story. The whole journey of the book was actually a long winding road. So, Manning had reached out to me years ago to write a book on Puppet. I was pretty active in the Puppet community at that time, at least from the user’s perspective. You know, I thought about it, and I was like, yeah, sure. But then, before we got things started, my first kid was born. So it wasn’t the ideal time to write a book. So then years later they reached out, and they were like, “Hey, curious if you’d be interested in writing a book on DevOps?” And I was like, “Well, I don’t know what a book on DevOps would be like.” But then, when I thought about it, I was like, you know what, I want a book that is practical advice for people that aren’t in glossy startups that take so much for granted. When you read a lot of these books, they’re about companies that are firmly rooted in technology, where the fact that everyone has sort of bought in is a given. And that’s not the reality that most people face. Most people are in a company that’s got a lot of legacy baggage. They’ve been around for 40-50 years. They’ve got all this inertia around bad practices that we don’t really talk about in the DevOps books. We just simply say, “Stop doing what you’re doing and start doing this.” And it’s like, “Okay, easier said than done.”

The other thing was, a lot of the books seemed like they were written for CTOs, right? Because you would need either a CTO or CTO buy-in to be able to do a lot of the things that they’re talking about. And I was like, well, there’s got to be stuff that an individual contributor or a line manager can do, because I feel like I’ve done it a couple of times, and I’m not a CTO. So that was really what sort of got me interested in writing this book. And it was originally titled “DevOps for The Rest of Us”. But people felt that was an exclusionary us versus them thing, and I hadn’t thought about it from that perspective. And I was like, yeah, it’s kind of hard to write a book about bringing everyone together and then immediately set it up as an us versus them.

So, the first draft of the book was not an anti-patterns format, actually. This is my first book. I was very academic and formal, and I got a piece of feedback in our review process. So what happens is with Manning, you’ll do a third of the book, and then you’ll release it to a bunch of potential buyers that will review it and give you feedback. So we were doing that with the first third of the book. Someone made a comment, and they said, “I’ve seen Jeff speak, and I was really excited to see this book because he’s such a fun, energetic speaker. And that personality does not translate into this book at all.” And I was like, “Wow. Okay.” So it was like, all right. And my editor was like, “I don’t know you well,” because, yeah, I’d never really met my editor. She was like, “But it sounds like this book isn’t you right now.” So that’s how we reworked it into this anti-patterns thing so I could have a little fun with it. Get more of my own voice into it. That’s how everything sort of took shape. But it was definitely not intended to be an anti-patterns book when it started. That was the result of a pretty cold, but honest piece of feedback that I got.

[00:19:50] Henry Suryawirawan: So I hope when we go through all these anti-patterns, you can also bring some of that fun. So looking forward to that.

[00:19:55] Anti-Pattern 1: Paternalist Syndrome

[00:19:55] Henry Suryawirawan: So let’s start with the first anti-pattern, which is called paternalist syndrome. What is this, actually?

[00:20:01] Jeffery Smith: So, I call it the paternalist syndrome. It is when someone in a relationship assumes the role of the parent, even though no true hierarchy exists. And you know who I’m talking about. Ops people, raise your hand. We’re all guilty of this. The BOFH, for those that don’t know, go ahead and Google it. But there’s a reputation of operations teams being the team of No. And I understand. I spent my entire career in ops. So I totally have all of that baggage from years of devs just throwing stuff into production. But I quickly was able to identify that the reason, again, was the fact that we had misaligned incentives. So the paternalist syndrome is this behavior that operations people fall into; anyone can do it, but I’m really talking to the ops folks right now. It’s this behavior where we assume everyone else is out to destroy the system and we have to protect it. That means nothing can go to production without it coming through us first. That means no one can have access to production, no matter how sane of a request it is. You just can’t do it because, you know, we can list a whole bunch of reasons. We’ll invoke security as a red herring as to why you can’t have access. We’ll invoke audit controls as a red herring as to why you can’t have access. And sometimes those are true.

But we’re really starting from the position of “no” and defending that instead of starting from the position of “yes, but”. Yes, but how do we address the audit? Yes, but how do we ensure that access doesn’t spread? And it’s really a mindset. So what the paternalist syndrome will do is anytime that there’s a problem, the first tool we reach in the toolbox is another gate, another checkpoint. Oh, this didn’t go through change control, or this went through change control, but ops wasn’t on the change control process, so now ops has to approve every change. And every time we do that, we’re adding another layer of slowdowns, very little value-add work, and wasted time for ops folks too.

There was a book that I read, “Rework” by Jason Fried from 37signals. And it’s one of my favorite quotes. I use it all the time. Policies are organizational scar tissue. Every time that there’s a policy that’s enacted, you can typically go back to some inciting incident where that happened, and the policy came out as a result of that. Almost to the point where it becomes lore and legend, where no one was even there for that anymore, right? It’s like, “Oh yeah. There’s some dude on life support in the lunch room. We keep him around just so he can explain to us why we don’t do stored procedures anymore.” There’s always some story like that. So the paternalist syndrome chapter talks about how we can identify some of the common issues that we run into, and sort of the pattern of fear, honestly. Cause that’s what it is, fear around particular situations and how we can go about breaking that down using CAMS. How do we do that through automation so that we can empower someone?

A perfect example is, we don’t let developers restart background process services, right? So, why not? They know way better than I do how the system’s going to behave cause they wrote it. So, why wouldn’t they be able to reset it if they’re seeing something weird? Maybe we don’t want to give them SSH access to production. That’s fair. But there’s a lot of different ways that we can restart a service, and then we can expose those ways to the developer and empower them to be able to do it. “Oh. But then we don’t know.” Okay. Well then, your program can send an email, right? That’s not crazy. So if you really want to know, if you’re not approving, and you just want to be notified, have your script send an email saying, “Hey, so-and-so just restarted the service through this access point.” Boom. Done.

And the other thing with the paternalist syndrome is we often insert ourselves as gatekeepers when we’re adding zero value to the process. So I’ll give you an example, a real world example from Centro. When I started at Centro, there were a lot of requests for ad hoc script execution from development. “Hey, I need this Ruby script run.” So we would have to get the script, copy it out to one of the boxes, SSH into the box, run the script in no-commit mode, and share that with the developer. He or she would look at it and say, “Okay, it looks good. Run it in commit mode.” We’d run it in commit mode, and then I’d send them the output. What value do we add to this process, other than being a middleman? What value are we adding? And the truth is zero. Because even if we wanted to be part of the approval process, I don’t know anything about the script. I didn’t write any of this code. Okay, you’re updating flights to make sure that campaign IDs match. I don’t even know what that means. You just said a bunch of words that I’m going to take at face value that this is important.

So we said, what if we were to set this up via JIRA? Maybe we could create a change process where someone attaches a script to the JIRA ticket. The JIRA ticket has to be approved by another dev who is way more qualified to approve it than I am. And once it’s approved, we have some automation grab the script from the JIRA ticket, execute it on the box, and then attach the output to the ticket. Now you don’t even need us. Now there’s not a dev sitting around waiting like, “Oh my goodness, is ops back from lunch yet? I really need to run this script, but no one’s there.” They can just do that on their own. They self-police it. They self-govern it.
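As a rough illustration of what that kind of automation could look like, here is a minimal sketch using the Python jira client. It is not Centro's implementation: the server URL, credentials, project key, status name, and the assumed --no-commit dry-run flag on the Ruby scripts are all hypothetical.

```python
"""Sketch of a JIRA-driven runner for approved ad hoc scripts.

Illustrative only: the server, credentials, project key, status name, and
the --no-commit dry-run convention are assumptions. Uses the Python `jira`
client (pip install jira).
"""
import subprocess
import tempfile
from pathlib import Path

from jira import JIRA

client = JIRA(server="https://jira.example.com", basic_auth=("bot-user", "api-token"))


def run_ruby(script_path: Path, commit: bool) -> str:
    """Run the attached Ruby script, in dry-run mode unless commit=True."""
    cmd = ["ruby", str(script_path)] + ([] if commit else ["--no-commit"])
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def process(issue) -> None:
    """Download each attached script, dry-run it, then run it for real."""
    workdir = Path(tempfile.mkdtemp())
    for attachment in issue.fields.attachment:
        script_path = workdir / attachment.filename
        script_path.write_bytes(attachment.get())

        # Dry run first; post the output so the approving dev can sanity-check it.
        output = run_ruby(script_path, commit=False)
        client.add_comment(issue, f"Dry-run output for {attachment.filename}:\n{output}")

        # Then the commit-mode run, with its output posted back to the ticket too.
        output = run_ruby(script_path, commit=True)
        client.add_comment(issue, f"Commit-mode output for {attachment.filename}:\n{output}")


def main() -> None:
    # Only pick up tickets that another developer has already approved.
    for issue in client.search_issues('project = OPS AND status = "Approved"'):
        process(issue)


if __name__ == "__main__":
    main()
```

In practice you would add error handling and probably gate the commit-mode run on an explicit second sign-off, but the shape is the same: the reviewing developer approves the ticket, and the automation does the copying, running, and attaching that ops used to do by hand.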

Now, everything’s not without trade-offs. So now, you have the issue of like, “Oh, we don’t have to fix this problem because anytime it comes up, we just run the script.” And because there’s no friction anymore, a permanent fix is probably not as attractive to them. Whereas before, they had the pain of going through ops to entice them to do that. But still, all in all, it’s a boon. It’s a win. With some standardized scripts, they would even be able to push that functionality down to customer service. So customer service knows, oh, when you see this problem, you have to submit this JIRA ticket. And then, once it’s approved, you’ll be able to execute this script to clean it up. So the paternalist syndrome is really about changing your mindset. Getting out of the habit of just being “No” and shifting to a “Yes, but”. Yes, but how do we solve for these problems? Because no one wants to feel like they don’t take their jobs seriously. Everyone at work is doing their best with the knowledge that they have to do their job. And this idea that there’s a group that instantly assumes you’re an idiot is not a warm, fuzzy feeling for anyone.

[00:26:02] Henry Suryawirawan: So, yeah. As you mentioned all these stories, like operations people assuming the role of a parent. It’s what they always say: the ops job is to actually make the system safe, make the system secure, stable, whatever that is. While the other parts of the company or the other parts of the team, like dev, are actually out there to introduce changes, instability; like what you mentioned, it’s like they’re there to destroy the company. So I think when you mentioned all these anti-patterns, it resonates a lot with the traditional world.

[00:26:31] Jeffery Smith: Yeah, there was something else that when you said, make the system safe. You know, that’s an interesting way to phrase it because I completely agree. Think about the system that we’re running this on as like a huge toolbox. Basically, what we’re saying is the only tool we have is a bunch of sharp knives. We don’t want to give you access to it because we’re afraid you’re going to cut yourself. And it’s like, well, okay, maybe we could throw some different tools there, right? Maybe we can throw a hammer in. Maybe we can throw a spoon. Something that’s a little safer, but it still allows me to do the job. But yes, if you’re asking me to hammer a nail with a machete, it’s probably unsafe. And that’s essentially what we’re doing. Like, well, because we’ve only got machetes. I can’t give anyone access.

[00:27:10] Henry Suryawirawan: So yeah, for all listeners who listen to this, I hope you notice these patterns or anti-patterns in your team. Make sure that you don’t assume this parental role unnecessarily.

[00:27:20] Anti-Pattern 2: Alert Fatigue

[00:27:20] Henry Suryawirawan: So let’s move on to maybe the next anti-pattern, which is quite common for any administrators, operations people, which is alert fatigue. So many times, there are so many alerts popping up. Probably we don’t do actions on most of them. Can you explain a little bit more about alert fatigue?

[00:27:37] Jeffery Smith: Alert fatigue is actually a term borrowed from the medical industry. It came about from nurses who would not respond to beeping alarms in hospitals because the alarms always go off. They always go off, so they became desensitized to it. To the point where even when there was an emergency, there was no way to elevate that emergency beep beyond this cacophony of sounds that was always going off from these machines. So, we borrow that term, alert fatigue in technology to say, like, what alerts do we have that are just constantly firing, that are drowning out more critical, useful, actionable alerts? I think the keyword there is actionable.

When we design alarms, we design alarms from the perspective of things we think might be bad. We say, oh, we’ll alert on high CPU utilization cause that sounds bad. But is it really bad? We buy these machines to use them. So this idea that we’re worried that a machine is at 40 or 50% utilization, that’s not really a bad thing. Let’s say it is 90% utilized. Do we care? What are the other factors? CPU utilization on its own is not a reason for concern. At least not to wake someone up. So I think we’ve all gotten those alerts where it’s like database utilization is high and it’s 3:00 AM. We’re running all of our batch processing. Like, that makes sense to me, but why am I being woken up?

And then the other thing to think about is if you can’t design an alert that leads an engineer to take a next step or action, you need to seriously question the value of that alert. So, when we have an alert that says replication lag time is high, you should be able to say in that alert, and this is an actual alert that we have, the alert will say: replication slot one has exceeded replication time. Chances are this is related to the database replication service being run by the BI team. You should investigate that database and see if that’s the cause of the replication lag. If it is, restart the Debezium connector in order to catch it up. That’s a very specific set of actions where I don’t have to do a bunch of things. If that’s not the case, if that’s not what’s going on, then it’s like, well, clearly when we wrote this alert, we had a very specific set of scenarios that we were worried about and it’s outside the boundary of that. Maybe I should look deeper into this.

So this idea of alert fatigue is this idea of making sure that these alerts are actionable. And if they’re not actionable, get rid of it. Just get rid of it. Because it’s not helpful. Now, I’m assuming you’ve been on call before, and I’m assuming, not casting any blame on anyone, but I would imagine that you’ve probably received an alert that you said, “Oh, this alert usually clears itself. Let me snooze it for 15 minutes.” Everyone’s done it. Everyone’s done it. So the question is why not just increase the threshold of the alert? And people are like, oh, well then if there’s something wrong, I won’t know for an extra 15 minutes. Well, you don’t know, anyway. Cause your first action is always to snooze it for 15 minutes. You don’t know anyway, because that’s the very first thing you do is like, “Ah, man, that stupid memory thing. We always know that when the code gets into this particular section, memory gets high and it clears itself.” So it’s like you’re not really doing yourself any service by having it. So push it 15 minutes. When it alerts, you need to react quickly because you’re already behind the eight ball. But guess what? You’re reacting and you know it’s real.

I would rather be a few minutes late to an alert, but know it’s real than to be constantly alerted and not being sure if it’s accurate or not, and having to figure that out and decide. Because I know as human beings, we’re going to err on the side of the pattern and just say, I’m going to snooze this. If you’re at a barbecue, you’re having fun. You’re out with your friends having a cookout, you’re eating, the sun’s out. It’s a beautiful day. Your alarm goes off and it’s like, “Oh, it’s that memory alert again.” You’re never ever, ever going to be like, “Guys, I got to go. This thing that alerts every 20 days or whatever is alerting again. And I got to look at it.” No. You’re going to snooze it. You’re going to continue chatting with your friends, and then you’re going to crap your pants when you realize it’s a real alert and you’ve got to do something.

So, alert fatigue is really focused on identifying those patterns and trying to make them better, or just simply eliminating them. I think another big thing that the chapter talks about as well is creating metrics that reflect the business impact. Going back to the CPU utilization example. If the database is at 90% CPU utilization, but our transactions per second are steady and aren’t climbing, do I care? I don’t care. Work it. Yeah, sure, 90% utilization. Then that’s not to say that you don’t want the metric, right? You just don’t want the alert. Because the metric is good for trending, capacity planning, all of these other great things. I’m just saying I don’t need to know about a capacity planning alert at night, at 3:30 in the morning. That can be an email.

And that’s another thing too, right? We always default to waking someone up as the default alert. But there can be different types of alert. Have a low priority alert that emails you. So you’ll wake up in the morning and then you say, “Oh wow! We were at high CPU utilization last night.” Nothing else was impacted, but that’s a good data point to know, and I’m much more receptive to that data point this morning at 9 o’clock AM while I’ve got my coffee as opposed to 3 in the morning when I don’t know what I’m really looking for or trying to solve. So, get rid of alerts that don’t mean anything to you. Tweak your alert notification settings so that you can do emails instead of always paging out. Try to tie your alerts to some sort of business impact so that you know whether you really need to wake someone up or not, or if the alert that’s firing is something that you actually care about. There it goes again. I like my databases busy, as long as they’re within their operating thresholds.
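To pull those two ideas together, actionable alert text and routing by business impact instead of paging for everything, here is a small Python sketch. It is not from the book; the metric names, thresholds, runbook wording, and notification plumbing are all assumptions.

```python
"""Sketch of severity-aware alert routing: page only on business impact.

Illustrative only; thresholds, metric names, and runbook text are assumptions,
and send_page is a stub for whatever paging tool a team actually uses.
"""
import smtplib
from email.message import EmailMessage

RUNBOOK = (
    "Replication slot 1 has exceeded its lag threshold. This is usually the "
    "BI team's replication consumer; check that database first and, if it is "
    "the cause, restart the connector to let it catch up."
)


def send_email(subject: str, body: str) -> None:
    """Low-priority path: lands in the inbox, reviewed with morning coffee."""
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "alerts@example.com", "ops@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def send_page(subject: str, body: str) -> None:
    """High-priority path: wakes someone up (stubbed out here)."""
    print(f"PAGING ON-CALL: {subject}\n{body}")


def route_alert(cpu_percent: float, replication_lag_s: float, tps_drop_percent: float) -> None:
    # Page only when a business-facing metric confirms real impact.
    if replication_lag_s > 300 and tps_drop_percent > 20:
        send_page("Replication lag with transaction impact", RUNBOOK)
    elif replication_lag_s > 300:
        send_email("Replication lag (no transaction impact yet)", RUNBOOK)

    # High CPU on its own is a capacity-planning data point, not a page.
    if cpu_percent > 90:
        send_email("Sustained high CPU", f"CPU at {cpu_percent}%; review trend and capacity.")


if __name__ == "__main__":
    route_alert(cpu_percent=92.0, replication_lag_s=420.0, tps_drop_percent=3.0)
```

The design point is that the page path only fires when a business-facing signal confirms real impact, while everything else lands in the inbox with enough context to act on in the morning.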

[00:33:18] Henry Suryawirawan: So you mentioned something that I picked an interest in, which is that it’s okay to actually be alerted late, instead of always having it in the first minute, right? You get alerts popping up here and there, but some alerts do actually recover by themselves. Because of these anti-patterns, for sure, people just put in alerts, but actually those could recover on their own in a short period of time. But what you’re saying here is that it’s okay to actually be alerted late as long as you can guarantee that it’s actually a real problem, and you are supposed to take an action on it. So I think that’s a gist for everyone here who has been in operations or is still working in operations. You should probably tweak your alerts so that they behave much more properly.

[00:33:55] Jeffery Smith: Yeah, absolutely. Absolutely. Because very seldom those extra minutes actually mean anything, right? And it’s one of those counterfactuals that we talk about in these incident reviews where it’s like, oh man. If we had known about that alert five minutes earlier, we could have prevented the outage. Probably not. You have no idea how quickly five minutes goes by. You’re looking at something. You’re tracking down a red herring. You’re like, oh yeah, it’s probably this thing over here when it’s something completely unrelated. A lot of times, those five minutes aren’t buying you as much as you think they are. Of course, your mileage may vary. Take that with a grain of salt. If you’re in high frequency trading, maybe it’s a different ball game. But for the most of us, for my target audience, it’s like, yeah, you’ll be alright.

[00:34:33] Anti-Pattern 3: Wasting a Perfectly Good Incident

[00:34:33] Henry Suryawirawan: So let’s move on to the next anti-pattern, which is about wasting a perfectly good incident. This is interesting: a perfectly good incident. Maybe you can explain more about this?

[00:34:44] Jeffery Smith: Yeah. So it actually stems from a saying by a local politician here in Chicago who would always say, never let a good crisis go to waste. It’s this idea that there is so much to learn from an incident. Because when you think about it, we have these mental models of our systems, right? These systems are becoming so complicated, so complex, so many different pieces. So everyone in their head has this mental model of how they think the system works. Then there’s the reality of how it actually works. And those two are seldom in full alignment. It’s usually something off. And the delta, the time that you find out that your model is different than reality, is in an incident. Other than that, you can spend the rest of your life in complete delusion, thinking you understand how the system works. But then, when there’s an incident, suddenly the gap between your understanding and reality is exposed in its raw form. So often, we sort of just close an incident ticket and move on. But it’s like, hang on, let’s dig into this incident and try to make our mental models better and learn from our mistakes.

It goes beyond just the tech side. It’s also the human side. So simple things that you uncover. Like, okay. All right. Henry, I noticed that this alert fired, you got it and snoozed it, and then it re-alerted 15 minutes later, and then you engaged. What happened there? Oh, well, this system alerts all the time, and it typically auto recovers. So when I got the alert, I snoozed it, thinking it was going to recover, but then when it didn’t recover, I realized, okay, something’s really wrong. So as a manager, me personally, right? If I’m not part of the on-call rotation, I may not know that reality, that dynamic exists. So just by simply asking that question in the incident review, it reveals something to me like, whoa, okay. We’ve got alerts that are so bad, people ignore them because they’re so common. And I’m sure everyone on your team is going to be backing you up. Yeah. Yeah. I know that alert. I hate that alert. So now it’s like, okay. So clearly, I have poor alerting. That poor alerting is impacting my on-call team. Because every time you’re waking someone up, anytime you page someone, you’re interrupting their life. They could be at dinner. They could be at a movie. They could be taking care of their sick mother. You have to think when we page out, what is this person doing in their life that I’m interrupting and is this worth it?

As a manager, if I’m not part of the on-call rotation, right? That is information that I can get out of the incident review process. Okay, cool. Here’s another example, a real world example, where two engineers were talking about the same system using different terminology, and they didn’t realize that they were talking about the same system because one team used term A and another team used term B. So my ops guy was thinking that there’s some new system he doesn’t know anything about that he’s tracking down. And he’s pissed because there’s no monitoring, there are no metrics around it. But lo and behold, oh no, we have a terminology difference. What you’re calling Sidekiq, we’re calling the consumer daemon, and they’re the same thing. That’s a huge disconnect. But now that he understands we’re talking about the consumer daemon, he instantly has a different view of the entire scenario. Because he’s like, oh, now that I know we’re talking about the consumer daemon, I understand that.

These aren’t technical problems. These are human, meatspace problems that are coming out of the incident review process. But once you start to dig in and peel back, and get into people’s head space, it’s like, oh, okay, all right, this is making sense. So I noticed, Henry, after that was all over, you restarted the service. What made you think to restart the service? What information led you to that? Well, honestly, I was out of ideas, and I noticed the memory utilization was high. So what made you look at the memory utilization? Well, I normally don’t look at that, but I happened to be looking at a different screen and saw that it was high. So I just said, well, why not restart the service? Okay. But when you restarted the service, you didn’t realize billing was running, and that interrupted the billing process. So now we’re not getting bills out on time, which was an ancillary effect. Oh, I didn’t realize the service communicated with the billing process. Oh yeah, it’s part of the billing process because they share a queue or something like that. Oh, well, it’s not like I intended to impact billing. I just had no idea. All of these sorts of conversations happen because, again, everyone has a slightly different mental model of the system.

So, wasting a perfectly good incident is this idea that you just say, oh yeah, the system was low on memory, Henry restarted it, the service recovered. The service didn’t just recover. You impacted the billing team, who now has to rerun the billing job. You were communicating with an engineer and didn’t realize you were using different terminology; had you known you were talking about this particular service, you might’ve made different choices. We’ve discovered that we have poor alerting, and because of that, people are hesitant to act; they’re waiting until it re-alerts. There’s all of this information that could have been easily dismissed and wrapped up in: memory utilization was high, restarted the service, closed the ticket. So wasting a perfectly good incident is really this idea that there’s so much more data and information in a failure that we can bring out if we just really want to put some energy towards it.

[00:39:59] Postmortem

[00:39:59] Henry Suryawirawan: So that actually explains all these different scenarios through anecdotes. I realize this is what some people call a postmortem activity. But one thing that I do find challenging sometimes: for those people who were involved in the crisis, after they solve the problem, of course they’re like, okay, I don’t want to deal with it anymore. That culture of assessing the incident, assessing what we can learn from it, I think it’s not there for most people, I would say. So how do you actually inculcate this culture so that it becomes a common thing that people actually want to do, because they feel value out of it?

[00:40:33] Jeffery Smith: So, the first thing that you can do as an individual contributor, and I’m assuming you have the willpower, so I’m going to say an individual contributor in this scenario. The first thing you need to do is do it fast, immediately after the incident, within 24 hours. Why? Because things have a sense of permanence in our minds, but only for a short period of time. I know, something permanent but only for a short period of time; that’s how it works. Within that 24-hour period, this is the most important thing that has ever happened. But after that period, it becomes just noise in the background, while you’ve got all of these other demands on you. So if you can get people in the room as soon as possible to talk about it, the actual incident is fresher in their minds, so they can recall more of the details. They’re much more energized around it. You’re really manipulating their psyche, honestly, by doing it early, to say, hey, let’s do it now while you’re super interested. A car dealer never lets you leave the dealership. He doesn’t want you to go home and think about it. He wants you to remember what that SUV felt like the moment you’re in the store so he can sell it to you. It’s that basic in-the-moment emotion. So, the biggest piece of advice I can give is, like I said, schedule it as soon as possible. Definitely within 24 hours.

But once you do a few of these, I guarantee you, even without any real training in the process, you’re going to discover things. And as you discover them, people will instantly see the value in it. The hard part is the follow-up action items. Because how do you influence people, whose schedule and prioritization you have no control over, to make sure that some of the things that were brought up are addressed? Sometimes, it’s worth doing it just to have the knowledge. Just to be able to say, oh, we know these things now, and even though we’re not going to correct them, in the next incident we’re aware of them. But for the things that you need to actually fix, that’s where you really need some leadership buy-in, to be able to say, “Hey, these things were discovered in the incident postmortem process. We really are going to need help and commitment from your team to address some of these.” Putting dollar signs next to the actual potential risk helps sometimes. “Oh yeah, we lost $75,000 for a 15-minute outage.” Sometimes that can be persuasive, not all the time, because again, it’s not coming out of anyone’s paycheck. So, by design, the only people that really care about that are management. But yeah, 24 hours. Prove the use of it, document that use, and radiate that information as part of the sharing portion. Have it somewhere where people can see it, review it, and understand it.

The other thing that really makes it shine is when someone’s looking at an incident and they see this beautiful postmortem incident review, and then they go to another incident and it’s just: restarted service, we’re good. It’s like, “Whoa, where’s this?” That creates this pattern of, “Hey Henry, I noticed the last incident you ran, you didn’t do a postmortem like the other members of the team. You should be doing that too.” And it just becomes a cultural thing. It just becomes this thing that everyone rallies around. But it takes time. But, you know, lead by example. Do it yourself. And when you do it, don’t treat it like it’s some newfangled thing. Do it as if it’s the most logical thing that you should be doing. “Yes, of course we’re going to do a postmortem. Why wouldn’t we? We just had an incident.” Hide the fact that this is the first postmortem you’ve ever done in your life. Sell it like it’s the most natural next step. “Yeah, we’re going to do a postmortem. We’ve got to understand why it happened. Don’t you think we need to understand what happened?” Who can say no to that? “Yeah, I guess we should probably have a better understanding.” “Yeah. So that’s why we’re going to do the postmortem. Let’s do it now. Come on.”
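
As a rough illustration of the process Jeffery describes, a lightweight post-incident review write-up might look something like the sketch below. The headings and fields are illustrative, not taken from his book; the points it tries to capture from the conversation are scheduling within 24 hours, follow-up action items with owners, and radiating the result somewhere visible.

```markdown
# Incident review: <short title>        <!-- schedule within 24 hours of resolution -->

## Impact
- Duration, affected customers and teams, estimated cost (e.g. "$75,000 for a 15-minute outage")

## Timeline
- What alerted, who was paged, what they saw, and what actions they took (including snoozed alerts)

## What we learned
- Where our mental model differed from how the system actually behaved
- Human and process findings (noisy alerts, terminology mismatches, unexpected dependencies like billing)

## Action items
- [ ] <item>: owner, priority, and whose roadmap it needs to land on
- [ ] Things we consciously will not fix, recorded so the next incident starts with that knowledge

## Sharing
- Link to this write-up posted where the whole organization can see and review it
```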

[00:43:59] Henry Suryawirawan: So as you mentioned in the beginning, right, all of this is also part of culture. You can’t just switch everybody in a second, where once an incident happens, everybody just does a postmortem for the next few incidents. So sometimes I think we need the buy-in, and also, I would say, probably some policing? Well, you said it’s more like leading by example, right? So the people at the top probably also need to allocate the time to actually do this stuff, because it’s important and it gives high value to the company, not just to that particular team.

[00:44:27] Jeffery Smith: And like with any cultural change, you’ve got three categories of people, right? You’ve got supporters, detractors, and fence-sitters. The majority of people are fence-sitters. The vast majority of people. And this is true in any context. You look at politics. Most people aren’t on the extreme ends of either side. Most people are in the middle. But the extreme ends are the loud ones, so those are the ones that we focus on. It doesn’t take a lot of people to change the culture of an organization. It takes a few supporters, or even a few detractors. It works both ways. So, positive influence: it only takes a handful of people to be able to change things and have the fence-sitters move over. Same thing with the negative, though. With the negative behaviors, it only takes a handful of people to be like, “Whoa, this is how we’re going to do things,” and then suddenly your culture is in the tank. So, think about that as you’re recruiting people for these different aspects, right? Who are the people that can really be boosters for me? I don’t have to convert everyone. I’ve got to convert a few really boisterous cheerleaders for this. And if I convert them, people are going to follow. And once people start following, you’ve overwhelmed the detractors and they lose.

[00:45:32] Henry Suryawirawan: So as I hear your insights about all these anti-patterns, the audience can actually find more anti-patterns in the book. There are other topics that I think are really relevant, but because of time, I’m sure we cannot cover all of them. So for people who are interested, go buy the book or read the book, and you can learn about all the anti-patterns in Jeffery’s fun style.

[00:45:52] Jeffery Smith: Yeah, and you need to buy the book. I don’t want to go over all 12.

[00:45:57] 4 Tech Lead Wisdom

[00:45:57] Henry Suryawirawan: Yeah. So Jeffery, before I let you go, normally I have this one last question that I always ask all the guests, which is about three technical leadership wisdoms. This is for people to maybe learn from your journey. So what kind of wisdom from your career would you like to share with everyone?

[00:46:12] Jeffery Smith: Okay. The first one is, of course, never let perfect be the enemy of good, or good enough. A solution that is 70% effective is better than a perfect solution that hasn’t been implemented. So never let the fact that it’s not perfect stop you. Because guess what? It’s never perfect. It’s only perfect in your mind. Keep pushing. Get it done. You’re going to learn a bunch of stuff. It’s not going to be perfect. Something’s going to be screwed up.

The second piece of advice, along the same theme as this whole perfect-is-the-enemy-of-good discussion, is that there is no point in a project’s journey where you know less about the requirements than at the beginning. When you start, that is the least amount of information you’re actually going to have about your requirements. So keep that in mind when you’re designing a solution for something. Because as you move through the project, your requirements are going to become more and more concrete and your understanding is going to grow. Product people and project people are going to come to you with this whole list of requirements, and it’s going to look like they did a bunch of due diligence. But know that they only know so much at that point, and cut them some slack as a result. We have a problem now at work where ops gets involved too late in the life cycle of a project. But a lot of it is because when they start, they don’t know if they’re going to need ops support or not. Because it’s like, “Oh, if it’s a feature in the monolith, we don’t need ops for anything. We can develop that on our own. They’ve given us all the automation tools we need. We can do everything we need to do without them.” But then, if halfway through the project they pivot and say, “Oh, this needs to be a separate microservice,” well, suddenly that’s a whole new ball game for ops. But we have to accept the fact that they did not know that when they started. They made the best choice that they could make. So, always keep that in mind.

I guess the third thing is, as an engineer, you have an implicit bias in just about everything you do, and it’s particularly bad in technology. There is a strong sense of, “I didn’t write it, therefore it’s crap.” A good engineer knows the difference between a preference and a problem. It’s a pattern that I see all the time. An engineer comes in, they get hired and are ready to get started. They look at some code or something and they’re like, “Oh, this is all wrong. Can’t do anything.” It’s funny, because this thing is making, like, 140 million a year. So, tell me, what is so broken about it? Is it perfect? No, of course it’s not perfect. There are a bunch of problems with it. See my previous two ideas and suggestions. You have to understand what’s a preference and what’s a problem, so that you know where to focus and put your energy. Because no one made a career out of rewriting something that was already working and just introducing different problems. Because you never actually fix it, right? You might shift it. You might change the set of problems, but there’s always some new problem that you’re going to be dealing with. So accept that, and know that everything you do is going to be the thing that some future engineer hates. You’re going to make a choice, and someone’s going to come in five years later like, “Well, this was dumb. Why didn’t you write it in Go? Why didn’t you write it in [insert new language that’s hip and trendy]?”

And I’m going to add a fourth one too. When you make a choice, document your constraints and your context. When you make any sort of technical decision, why did you make that decision? What was the reality on the ground? I remember a story where a guy was telling me about all of this handcrafted code that was built at his company. And he’s like, “I don’t understand. Why didn’t they just use Kubernetes to orchestrate all of this?” And I was like, “When did they write this stuff?” He’s like, “Oh, that was probably like 8-10 years ago.” That might be why they didn’t use Kubernetes, right? It’s like, “Oh yeah, I hadn’t thought about that.” So you’ve got 10 years of energy built into this thing. Just switching to Kubernetes isn’t an easy thing. So there’s always context around every technical decision that gets made. If you can document that, it’ll save you some hassle and save future engineers some energy around understanding why particular decisions were made.
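
As a rough sketch of that fourth point, and not something from the episode itself, a short decision record capturing constraints and context might look like this (the project and all details below are hypothetical):

```markdown
# Decision: keep the handcrafted deployment scripts

- Date: 2013-06-01                      <!-- illustrative; predates Kubernetes -->
- Context: team of three ops engineers; one monolith plus two background daemons
- Constraints: no budget for new infrastructure; deploys must finish inside a 10-minute window
- Decision: extend the existing hand-rolled scripts rather than adopt an orchestrator
- Consequences: revisit if we split into microservices or the deploy window changes
```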

[00:50:11] Henry Suryawirawan: Thanks for sharing this wisdom. I’m laughing as you shared all this, because I can see these patterns over and over again in all the places that I’ve worked. So this is common, right? So, thanks, Jeffery, for your time. For people who want to learn more about you, connect with you, or find the book, where can they find it?

[00:50:30] Jeffery Smith: Sure. Yeah. So, I have a website that I don’t really update or maintain, but if you feel narcissistic, you can check that out at attainabledevops.com. Most likely the best place to find me is on Twitter, where I’m @DarkAndNerdy. You can find the book “Operations Anti-Patterns, DevOps Solutions” at Manning.com if you want to order it direct from them, in both physical and ebook copies, and it’s also available on Amazon and on Audible. I don’t read it though, unfortunately. Everyone’s like, oh, I bought this thing, and you were going to read it. And I was like, ah, I didn’t even know they were turning it into an audiobook. I got the email the same time you guys did. So I’m going to lobby for the second edition that I read it, though. So we’ll see how that goes.

[00:51:07] Henry Suryawirawan: Yeah. I mean, when you read it, you can probably bring in this fun style, so the people who are listening can also be entertained at the same time, which I think I’m experiencing as you speak right now. So thanks again, Jeffery, for your time. I really learned a lot from this conversation, and I wish you good luck with the things that you do.

[00:51:24] Jeffery Smith: All right. Thanks for having me. I had a really good time.

– End –