#68 - 2021 Accelerate State of DevOps Report - Nathen Harvey
“Many organizations think in order to be safe, they have to be slow. But the data shows us that the best performers are getting both. And in fact, as speed increases, so too does stability.”
Nathen Harvey is the co-author of the 2021 Accelerate State of DevOps Report and a Developer Advocate at Google. In this episode, we discussed in depth the latest release of the State of DevOps Report. Nathen started by describing what the report is all about and how it got started, and explained the five key metrics the report suggests for measuring software delivery and operational performance. Nathen then explained how the report categorizes different performers based on their performance against the key metrics, and how the elite performers outperform the others in terms of speed, stability, and reliability. Next, we dived into several new key findings from the 2021 report relating to documentation, the secure software supply chain, and burnout. Towards the end, Nathen gave great tips on how we can use the findings from the reports to get started and improve our software delivery and operational performance, which will ultimately improve our organizational performance.
Listen out for:
- Career Journey - [00:05:28]
- State of DevOps Report - [00:09:32]
- The Five Key Metrics - [00:13:55]
- Speed, Safety, and Reliability - [00:19:58]
- Performers Categories - [00:23:26]
- 2021 New Key Findings - [00:28:01]
- New Finding: Documentation - [00:30:44]
- New Finding: Secure Software Supply Chain - [00:34:58]
- New Finding: Burnout - [00:37:22]
- How to Start Improving - [00:39:36]
- 3 Tech Lead Wisdom - [00:43:55]
_____
Nathen Harvey’s Bio
Nathen Harvey, Developer Relations Engineer at Google, has built a career on helping teams realize their potential while aligning technology to business outcomes. Nathen has had the privilege of working with some of the best teams and open source communities, helping them apply the principles and practices of DevOps and SRE. He is part of the Google Cloud DORA research team and a co-author of the 2021 Accelerate State of DevOps Report. Nathen was an editor for 97 Things Every Cloud Engineer Should Know, published by O’Reilly in 2020.
Follow Nathen:
- Twitter – @nathenharvey
- LinkedIn – https://linkedin.com/in/nathen
- GitHub – https://github.com/nathenharvey
Mentions & Links:
- 2021 Accelerate State of DevOps Report – https://cloud.google.com/blog/products/devops-sre/announcing-dora-2021-accelerate-state-of-devops-report
- DORA DevOps Quick Check – https://www.devops-research.com/quickcheck.html
- 📚 Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations – https://amzn.to/3spqpvC
- 📚 97 Things Every Cloud Engineer Should Know: Collective Wisdom from the Experts – https://amzn.to/3C4Nnvm
- GitHub Developer Productivity Report – https://github.blog/2021-03-10-measuring-enterprise-developer-productivity/
- Dr. Nicole Forsgren – https://nicolefv.com/
- Jez Humble – https://www.linkedin.com/in/jez-humble/
- Gene Kim – https://www.linkedin.com/in/realgenekim/
- Chef – https://www.chef.io/
- Puppet – https://puppet.com/
- National Institute of Standards and Technology – https://www.nist.gov/
- Secure software supply chain – https://github.blog/2020-09-02-secure-your-software-supply-chain-and-protect-against-supply-chain-threats-github-blog/
- Blameless postmortems – https://sre.google/sre-book/postmortem-culture/
Tech Lead Journal now offers some swag that you can purchase online. The swag is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.
Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive yours.
State of DevOps Report
- [DORA Research Program] is the longest and largest running research program of its kind. DORA itself brings an academic rigor to the investigation of this question of how do teams improve their software delivery and operations performance.
- I think we have a lot to learn from one another, and there are always questions that we have. Yes, there are reports everywhere. I think the important thing to remember is how you, as a team, utilize the findings of a report to help your team improve. That’s the biggest question always.
The Five Key Metrics
- The five key metrics, they started off as four key metrics. At the beginning of the research program, the question was how do we improve and measure software delivery performance? What they came up with was a very practical and pragmatic way to measure software delivery performance. And that’s where we got the four keys.
- How do you measure software delivery? Within DORA, what we measure are two metrics for stability and two metrics for throughput.
- Throughput is the velocity: how fast are you delivering software? For that, we look at deployment frequency. How frequently are you pushing changes out to your customers or the users of your application?
- And then importantly, we look at lead time for changes. We want to decrease the amount of time it takes for those changes to go from the developer into production.
- And then we have our stability metrics. Moving fast is great, but if you’re moving fast and always breaking everything, that’s not so good.
- The first is your change fail rate. So when you make a push into production, how frequently are you shouting out an expletive? When you have that, that’s a change failure. We want to decrease those as much as possible.
- Our final stability metric has to do with this idea, this truth, that no matter what we do to reduce risk, risk is always present. Failures are inevitable. Something is going to break. So we have to look beyond the tools. We have to look at our team and ask: how quickly can we restore service to our customers when there is something that impacts them? This is our time to restore service.
- Many organizations think in order to be safe, I have to be slow. But the data shows us that the best performers are getting both. And in fact, as speed increases, so too does stability.
- We introduced this idea, not just of software delivery performance, but software delivery and operations performance. Operations performance, because when you’ve deployed something, that’s the beginning of it providing value to you as an organization and to your customers.
- In previous years, we asked questions about your operational performance in terms of availability. But this year, we’ve refined our approach and gone deeper with the research to move from availability to reliability. Availability is really a component of reliability.
- The question we’re really getting at: can you keep your promises to your customers? Can they rely on the application or the service that your team is building and shipping? Reliability practices push organizational performance. They drive that organizational performance.
Speed, Safety, and Reliability
- Interestingly, what we found is that all of the performers, whether you’re a low performer or the best of the best, the elite performers, when it comes to software delivery, each one of those performance categories could improve their overall organizational performance with the adoption of reliability practices.
- In a world of Site Reliability Engineering, or the practice of SRE, we often talk about an error budget. And we use an error budget to say, if we’re not meeting our reliability goals, then we should focus our effort on the things that will make the system more reliable. And the flip side of that: if we are meeting or even exceeding our reliability goals, then we should focus on things that make us faster.
- We have to make sure that as a team responsible for an application or service, we have joint accountability for all of these metrics. Part of the reason we used to think that slow meant safe comes down to who was incentivized for what.
- We’ve spent a lot of time telling our developers, go faster, go faster. And then turning to our operators and saying, keep it stable. And I’ve been that operator. I know the best way to keep the system stable is to not change it. But that’s not good for our customers.
- The biggest lesson, or one of the biggest lessons, that we’ve taken from another industry is from lean manufacturing. And lean manufacturing says: do things in small batches. So when we apply that to the delivery and operations of software, we know that when we have a small change that’s always heading to production, it’s easier to reason about. When it lands in production, it’s less likely to have a huge negative impact. We also bring in practices like progressive deploys.
- By making those smaller changes, you’re less likely to have a big impact. It’s easier to reason about the change and to move it through the system.
Performers Categories
- Today, and since 2018, we’ve had low, medium, high, and elite performers: four different clusters of performance.
- On deployment frequency, elite performers are able to deploy on demand. That may mean even multiple deploys per day. But a low performer in 2021 will do fewer than one deployment per six months. So it might take seven, eight months, maybe a year, for them to do a single code deployment.
- Time to restore service for an elite performer is less than an hour. For a low performer, it may take more than six months to fully restore service when there is an outage or something that interrupts their service.
- What defined elite performers three years ago may only define high performers today, because as an industry, we are always getting better. We’ve also seen that since 2018, not only are [elite performers] getting better, but the number of teams that are able to achieve that level is growing.
2021 New Key Findings
- The real question is, how do I get better? That’s really where I find the research to be very powerful.
- We dig into different capabilities that teams should invest in. Now, those capabilities might be technical capabilities, but they’re also process capabilities, measurement capabilities, and cultural capabilities. We really want to take a full systems view. Not just the systems like the computers, but the systems that include you and me, the people in the system.
- With SRE and DevOps, we wanted to investigate: what’s the difference between them? Are they working together? And in fact, we found that they’re complementary.
- We find that cloud is one of those capabilities that drives performance. But it’s not just paying a cloud bill. You have to do cloud the right way.
- Another thing that we investigated: security. No surprise here. Security is on everyone’s mind right now.
- Another one that was really interesting, and brand new this year, is that we looked at documentation. We feel like docs could be better or that these docs are bad, but does documentation actually drive performance? We were able to draw a predictive connection: good documentation enables the right technical practices to help you hit those goals.
- We investigated culture. Specifically, we looked this year at how teams are doing with burnout. What sort of things can we put in place to help mitigate burnout from a cultural perspective?
New Finding: Documentation
- There are lots and lots of different types of documentation. We investigated which practices around documentation, in particular, matter. What we found was a predictive connection with things like having clear ownership and clear guidelines for when documentation should be updated, and making sure that it is part of the developers’ daily work.
- We talk about this concept of shift left. We want to do the same thing with documentation. As our developers are making changes, are the docs being updated? This is a question that we should be asking as part of our pipeline.
- Another thing that’s really important is we have to incentivize it properly. If I’m going to an annual performance review, and my manager discounts any work I’ve done around documentation, that’s not good. Why would I be motivated to keep the docs up to date?
- We also looked at things like: when there is an outage or an incident, do you trust the documentation? Can you use it as a tool to help you recover and restore service to your users?
- We didn’t really dig in so specifically into which types of documentation, but more into the availability of that documentation and the process around keeping it up to date.
- While we didn’t investigate the various types of documentation or try to draw any predictive connections there, I think it is important that we, as practitioners, think about the scope of documentation. It really comes down to this question: how do we enable the best technical practices that drive the outcomes we talked about, those four key metrics?
New Finding: Secure Software Supply Chain
- What we found was that the teams that are embracing, and actually putting into practice, the security practices that we investigate were 1.6 times more likely to exceed their organizational goals.
- We also found that elite performers who met or exceeded their reliability targets are twice as likely to have security integrated into the development process.
- We often think of security as the thing that slows us down, and of security as the department of no. That’s not what security should be. It’s not what security has to be. In fact, when we embrace better security practices, when we really shift left on security, we see that security can help us speed everything up.
- We investigated specific practices like testing for security. Can your developers run a test that would validate that the code they’ve written follows the right security policies? Are you working together with the security professionals within your organization? My role as a security engineer or a security professional has changed from “I’m responsible for everything” to “I’m responsible for enabling the developers and the operators to build and run more secure systems.”
New Finding: Burnout
- First, did we see a downturn or a drop in software delivery performance? Overall, we did not. In fact, we see teams continuing to improve.
- GitHub actually released a report in 2020 that shows an increase in developer activity, as measured by things like pushes, pull requests, and pull request reviews. So even GitHub is seeing these numbers improving.
- We also looked at this question of burnout. How are you feeling? For burnout, one of the things that we talk about is your ability to disconnect from work. Because if you can’t disconnect from work, work is occupying your mental space all the time.
- What we found was that teams with a generative team culture, a culture that’s really focused on the mission, and composed of people who felt included and like they belonged as part of their team, were half as likely to experience burnout, even during the pandemic.
How to Start Improving
- Unfortunately, the answer to where they should improve first is: it depends. But the DORA framework can help you with this, because what we actually do is investigate these various capabilities and help draw those predictive connections.
- We know that if we want to improve our organizational performance, software delivery and operations performance drives organizational performance.
- The research program itself looks into about 30 different capabilities like this. Whether it’s trunk-based development, or a generative culture, or documentation, lots and lots of different capabilities.
- The quick check tells you where you sit, but it will also make a recommendation for you. And then there are even further follow-up questions on each one of those capabilities to help you prioritize.
- On the site, you have access to all of the State of DevOps Reports. And just because a report says 2017 on it does not mean that report is outdated. Some of the capabilities investigated there are really important.
- At the end of the day, what the DORA research program is trying to instill in all of us as an industry is the practice and the mindset of continuous improvement. We find what’s holding us back. We make an improvement. We reassess. So get into this loop, this practice of identifying where we are and what’s holding us back, and making a change to improve.
3 Tech Lead Wisdom
- Your team knows what needs to be done to improve.
- Listen to them. Ask them what we need to do to improve. Give them audacious goals and let them get to work.
- Your job as a leader is to create that time and space and give them the resources that they need to improve.
- For the practitioners, the individual contributors: you know what needs to be done.
- Speak up! Share this with your leadership. Drive that message forward. Consult with your team and learn from each other.
- Learn from each other. To me, that’s probably the most important.
- Each and every one of us comes to work with a completely different set of lived experiences, a completely different set of knowledge. It’s really important that we all share and learn from one another.
- Throw out the org chart. Throw out the structure. Let’s have those conversations. Make sure that everyone feels included, and like they belong as part of this team. Because in truth, they do. And we’re better when we share our insights and our ideas with one another.
[00:01:02] Episode Introduction
[00:01:02] Henry Suryawirawan: Hello, hello to all of you my friends. I hope that you’re feeling great and healthy today. Welcome to another new episode of the Tech Lead Journal podcast. Thank you for tuning in and listening to this episode. If you’re new to Tech Lead Journal, I invite you to subscribe and follow the show on your podcast app or our growing social media communities on LinkedIn, Twitter, and Instagram. And if you have been regularly listening and enjoying this podcast and love the type of content that I’m producing, will you support the show by subscribing as a patron at techleadjournal.dev/patron, and support my journey to produce great episodes every week?
The State of DevOps Report. Many of you may have heard about this report, which has been running for the past seven years. It is the largest and longest running research program that aims to provide data-driven industry insights, examining the capabilities and practices that drive software delivery, as well as operational and organizational performance.
My guest for today’s episode is Nathen Harvey. Nathen is a Developer Advocate at Google, a part of the Google Cloud DORA research team, and the co-author of the 2021 Accelerate State of DevOps Report. He was also an editor for “97 Things Every Cloud Engineer Should Know”, published by O’Reilly in 2020.
In this episode, we discussed in depth the latest release of the State of DevOps Report. Nathen started by describing what the report is all about, historically how it got started, and what problems it was trying to solve. He then explained the famous five key metrics suggested by the report to measure software delivery and operational performance. Nathen then explained how the report categorizes different performers based on their performance against the key metrics, and how the elite performers outperform the others in terms of speed, stability, and reliability. Next, we dived into several new key findings that came out of the 2021 report that relate to the importance of good documentation, the secure software supply chain, and a positive team culture that mitigates burnout during challenging circumstances, like the pandemic. Towards the end, Nathen gave great tips on how we can use the findings from the reports, not just the latest one, to get started and improve our software delivery and operational performance, which will ultimately also improve our organizational performance as a whole.
I really enjoyed my conversation with Nathen, learning about the Accelerate State of DevOps Report, the five key metrics, and the new key findings in the latest 2021 report. And if you also enjoy this episode, I encourage you to share it with someone you know who would also benefit from it. You can also leave a rating and review on your podcast app, or share some comments on social media about what you enjoyed from this episode. I always love reading your reviews and comments, and they are one of the best ways to help me spread this podcast to more people. It is my hope that they can also benefit from this podcast. Let’s get our episode started right after our sponsor message.
[00:04:45] Introduction
[00:04:45] Henry Suryawirawan: Hello, everyone. Welcome back to another episode of the Tech Lead Journal podcast. Today I have with me a Developer Relations Engineer at Google. His name is Nathen Harvey. He’s actually one of the co-authors of the latest State of DevOps Report 2021. So if you are not aware of the State of DevOps Report, today we’ll be covering a lot about it, and talking about the latest findings as well. Nathen is also someone who is experienced in DevOps and SRE, so he’ll probably be talking from his experience as well. And he’s also an editor for “97 Things Every Cloud Engineer Should Know”, which was published in 2020. So welcome to the show, Nathen. Looking forward to having this discussion with you about the State of DevOps Report.
[00:05:25] Nathen Harvey: Oh, thank you so much, Henry. And I’m really happy to be here.
[00:05:28] Career Journey
[00:05:28] Henry Suryawirawan: So Nathen, to begin, maybe you can introduce yourself, sharing more about the highlights or turning points in your career.
[00:05:35] Nathen Harvey: Yeah, for sure. Thanks. As Henry mentioned, my name is Nathen Harvey. I’m a Developer Relations engineer here with Google Cloud. Some highlights of my career. Well, it’s been a long career for sure. I would say a couple of quick highlights. I really look at my career as a career where I’ve tried to help teams improve the work that they’re doing. Sometimes that means helping teams improve the work that we are doing. For example, I’ve worked with many different web applications. Well, actually one of the first applications I was responsible for, Henry, we called it the customer and partner extranet. Do you know what an extranet is? It’s basically a website, but you have to log in to get to it. In the late nineties, early 2000s, it was mind-blowing that we would have these things on the web for customers. So I’ve been a lead engineer on a project like that. I’ve helped an organization, a team that I was on, really adopt automation practices, moving things from data centers into the cloud.
One turning point that I had was when I was working at this huge software company. At the time, I was responsible for managing our sales force automation tool, which happened to be a product called Siebel back in the day, and it ran in our data centers. And I recognized a problem: we needed more capacity for our servers. But they ran in our data center. So I went to my boss, and I said, “Boss, I need more capacity”. And he said, “Yeah, sure. Fill out this form. And then I’ll get it to accounting and we’ll get it all approved”. So I filled out the forms and got it all approved. Henry, it was only 18 months later that I was able to log onto that server. It was very, very fast. 18 months is all it took. Obviously, that’s unacceptable in today’s technology landscape. The very next job I went to, though, was a startup, and we were doing everything on the cloud. So instead of waiting 18 months, I could put my credit card into a web form, and in five minutes have access to a full server, a server that I would never be able to touch, but I had full control over. That really changed everything about how I approach tech, how I think about tech. It was pretty incredible.
And then, at that same organization, I was responsible for operations. Like I said, it was a startup. It was very small. I was responsible for everything. I had a good mentor who was helping me. One day, that mentor came to me and he said, “Nathen, you know, there are these tools out there that can automate a lot of the work that you’re doing. Maybe you should look into them. They have names like Puppet and Chef”. And I looked at him and I said, “You know what, Tom? I am too busy to automate”. Too busy to automate. What a great phrase, and that phrase has really stuck with me. I’m too busy to improve? Obviously, if you automate, you’ll have more time and more capacity. So I took those words to heart and eventually, I learned both Puppet and Chef, and ended up using Chef quite a lot in another organization. And then eventually, I fell in love with the Chef community. I have to tell you, it was open source, and it was a community of practitioners that were really automating infrastructure. I fell so much in love with that community. I said, I have to help that community. And so I joined Chef, the company, and I was part of leading and fostering that community for about six years. After my time at Chef, I joined Google Cloud to continue my journey through DevOps, into SRE, and really helping teams find ways that they can perform their best.
[00:08:57] Henry Suryawirawan: Thanks so much for sharing your story. I can still remember as well, back when we traditionally used to ask for servers and all that, even the hardening process itself could take months. So I really relate to your story. And thanks for sharing the story where you said that you were too busy to automate. I think we have all been there. Especially all these SRE or DevOps people. Sometimes we are just too busy in our daily work, our daily toil, and we forget to actually improve from there.
[00:09:25] Nathen Harvey: It’s so key. It’s so key even to make a small investment in improving. Just a little bit better every day.
[00:09:32] State of DevOps Report
[00:09:32] Henry Suryawirawan: Speaking about improvement and DevOps-related automation: this year, I saw that Google Cloud and DORA just released the State of DevOps Report 2021. So can you tell us more about what the State of DevOps Report actually is?
[00:09:44] Nathen Harvey: Yes, absolutely. So the State of DevOps Report is essentially what it says in the title: a report on the state of DevOps. But let’s get into a little bit deeper background there. There’s a research program, the DORA research program. It is the longest and largest running research program of its kind. DORA itself brings an academic rigor to the investigation of this question of how do teams improve their software delivery and operations performance. And so each year DORA has published a State of DevOps Report that summarizes the findings from that year’s research and investigation. The research always goes across every industry vertical. So we talk to organizations in finance, in government, in technology, in retail, across the board, really seeking to understand what helps teams improve.
[00:10:40] Henry Suryawirawan: So if I’m not mistaken, this has been running for the last seven, eight years. Is that correct? Yeah.
[00:10:46] Nathen Harvey: Yes, that’s right. It’s been seven years now that this has been running. Yes.
[00:10:50] Henry Suryawirawan: So can you probably explain how it got started? What kind of problem were they trying to solve back then? And how does coming up with this report help?
[00:10:59] Nathen Harvey: Yeah. Interestingly, the report itself and the research were actually started by Puppet Labs, or Puppet. And what they saw almost a decade or over a decade ago now was this burgeoning community that identified as the DevOps community. Puppet had this question: what is the DevOps community? What are they doing? How can we bring the lessons that community has learned to a wider audience? So they started the State of DevOps Report and an annual survey to go with it. Now as that program evolved, Dr. Nicole Forsgren got involved in it to, again, bring a little bit more academic rigor to the research and to the investigations.
So those reports continued, and then eventually, Dr. Nicole Forsgren, together with Jez Humble and Gene Kim, formed an organization called DORA. DORA stands for DevOps Research and Assessment. What that organization did was continue the research program, in conjunction with Puppet and on their own. But they also built an assessment tool to go in and help organizations measure how they are doing on their software delivery and operations performance. And importantly, to identify where they should focus their improvement work. DORA was eventually acquired by Google Cloud in 2018, and we’ve continued to expand on that research, and continue to work together to bring the findings of that research and practical implementations to our customers.
[00:12:28] Henry Suryawirawan: If I can also relate it to my personal experience looking at this report: I think since the State of DevOps Reports, suddenly we have come to see a lot of “state of” reports on other things in the market. Including, lately, we have seen Puppet also releasing another State of DevOps Report. Why are there so many reports? Maybe you can clarify a little bit here?
[00:12:46] Nathen Harvey: Yeah. So why are there so many reports? Well, I think we have a lot to learn from one another, and there are always questions that we have. So I think this is really important. I think when DORA, that organization, was formed, and started publishing their own reports, Puppet continued with their own research. I was just looking at the 2021 Puppet State of DevOps Report, and I found fascinating insights in there. The methodology is similar across the two, although not identical. And certainly, the things that are investigated in each year’s report are completely different from one another. So I think it’s really interesting. But I also think, you know, we see a state of the secure software supply chain report, we see a state of DevSecOps report. Yes, there are reports everywhere. I think the important thing to remember is how you, as a team, utilize the findings of a report to help your team improve. That’s the biggest question always.
[00:13:46] Henry Suryawirawan: Thanks for putting that as a note. Because we all see these reports, we all see the findings, but in the end, I don’t know whether we improve or not. That’s the measurable thing that we need to focus on.
[00:13:55] The Five Key Metrics
[00:13:55] Henry Suryawirawan: Speaking about the findings, we know that the State of DevOps Report has these popular findings, which used to be called the four golden metrics or four key metrics. But now, there are five key metrics. Can you explain to us what these five key metrics are?
[00:14:11] Nathen Harvey: Yes, absolutely. So the five key metrics, and you’re right, they started off as four key metrics. Really, at the beginning of the research program, the question was how do we improve and measure software delivery performance? Frankly, when it comes to the word DevOps, Henry, you have a different definition than I have, than any one of our listeners has. This is one of the challenges of DevOps. As a community, DevOps has always shunned a common definition. So as a researcher, how do you measure a thing, and research a thing, that’s not well-defined? Well, this was one of the problems that the DORA research team set out to solve.
So what they came up with was a very practical and pragmatic way to measure software delivery performance. And that’s where we got the four keys. So software delivery, how do you measure software delivery? Well, within DORA, what we measure are two metrics for stability and two metrics for throughput. I’ll start with the throughput, the velocity. How fast are you delivering software? For that, we look at deployment frequency. So how frequently are you pushing changes out to your customers or the users of your application? And then importantly, we look at lead time for changes. So as a developer, I’m making a change to the code base. I commit that change to the version control system. You’re using a version control system, right, Henry? I hope all of you are. If you’re not using a version control system, stop listening to this podcast. Go read up on Git, or any other version control system, and then come back and listen to the rest of this podcast. Okay. Sorry. I had to get on there for a second. So deployment frequency, how frequently are you pushing out changes? Next is lead time for changes. As a developer, I make a change, commit it to the version control system. The timer starts. That timer stops when that change lands in production. And so we want to increase the frequency with which we’re pushing out changes, and decrease the amount of time it takes for those changes to go from the developer into production. Those are our velocity metrics.
And then we have our stability metrics. Moving fast is great, but if you’re moving fast and always breaking everything, that’s not so good. So stability, we look at two things there. The first is your change fail rate. So when you make a push into production, how frequently are you shouting out an expletive? “Oh my gosh. Argh, this is broken. We have to roll it back immediately”. So when you have that, that’s a change failure. Someone had to get involved in that deployment process. We want to decrease those as much as possible.
Our final stability metric has to do with this idea, this truth, that it doesn’t matter what we do to reduce risk, risk is always present. Failures are inevitable. Something is going to break. So we have to look beyond the tools. We have to look at our team, and say, how quickly can we restore service to our customers when there is something that impacts them? And so this is our time to restore service. So those are our four key metrics: deployment frequency, lead time for changes, change failure rate, and your time to restore service. Now what we found as we looked at those four metrics, and what we found consistently throughout the course of the research study, is that they aren’t trade-offs. Many organizations think in order to be safe, I have to be slow. But the data shows us that the best performers are getting both. And in fact, as speed increases, so too does stability. So there are some really interesting findings there.
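To make the four keys concrete, here is a minimal, hypothetical sketch in Python of how a team might approximate them from its own delivery data. The record format, field names, and numbers below are assumptions for illustration only; the report itself measures these via survey responses, not this exact computation.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records: each deploy carries the commit time of its
# oldest change, the time it landed in production, and whether it failed.
deploys = [
    {"committed": datetime(2021, 11, 1, 9, 0), "deployed": datetime(2021, 11, 1, 11, 30), "failed": False},
    {"committed": datetime(2021, 11, 2, 10, 0), "deployed": datetime(2021, 11, 2, 10, 45), "failed": True},
    {"committed": datetime(2021, 11, 3, 14, 0), "deployed": datetime(2021, 11, 3, 15, 10), "failed": False},
]
# Hypothetical incident records: outage start and the moment service was restored.
incidents = [
    {"start": datetime(2021, 11, 2, 11, 0), "restored": datetime(2021, 11, 2, 11, 40)},
]

days_observed = 30

# Throughput: deployment frequency and lead time for changes (commit -> production).
deployment_frequency = len(deploys) / days_observed  # deploys per day
median_lead_time = median(d["deployed"] - d["committed"] for d in deploys)

# Stability: change failure rate and time to restore service.
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
median_time_to_restore = median(i["restored"] - i["start"] for i in incidents)

print(f"Deployment frequency:   {deployment_frequency:.2f} per day")
print(f"Median lead time:       {median_lead_time}")
print(f"Change failure rate:    {change_failure_rate:.0%}")
print(f"Median time to restore: {median_time_to_restore}")
```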
Okay. But you asked about the five metrics and I said there was a fifth one that we’ve added this year. That fifth one, we’ve really refined it this year, I would say. For the past couple of years, since I want to say 2018, although I might have to check my notes on that. 2018, 2019, whatever it is. We introduced this idea, not just of software delivery performance, but software delivery and operations performance. Now, operations performance, because when you’ve deployed something, that’s the beginning of it providing value to you as an organization and to your customers. So we really need to answer this question of what our operational performance looks like.
In the past, in previous years, we asked questions about your operational performance in terms of availability. But this year, we’ve refined our approach and gone deeper with the research to move from availability to reliability. Now, a subtle change in language, but as we know, language is important. That subtle change is really there to help everyone consider not just whether the application is available; availability is really a component of reliability. In both cases, though, the question we’re really getting at is: can you keep your promises to your customers? Can they rely on the application or the service that your team is building and shipping? So what we found this year, as we dug really deep into reliability, is that reliability practices push organizational performance. They drive that organizational performance. So, as an organization, if you want to get better, investing in reliability can really help.
[00:19:24] Henry Suryawirawan: I really love the way you explain all these key metrics with a little bit of humor here and there as well. This is my first time hearing such an explanation of the key metrics. So just to recap, there are the velocity or throughput metrics, which are deployment frequency and lead time for changes. There’s the stability part, which is the change failure rate and time to restore service. And the fifth one, related to the operational side, is now called the reliability aspect. It used to be availability, but now the term has changed to reliability.
[00:19:58] Speed, Safety, and Reliability
[00:19:58] Henry Suryawirawan: I think you mentioned as well in your explanation that traditionally it was well-known that if you want to be safe and risk averse, you have to go slow. That’s why there are maybe quarterly releases, monthly releases, or even yearly releases, because the software just couldn’t get deployed. And now, it seems that most organizations understand that in order to be safer, you have to go faster, and that’s why you have these velocity and throughput measurements to indicate whether the organization changes rapidly or not. And now you have the reliability aspect as well. What does it tell us about the balance and trade-off with risk? And what does the operational side actually tell us?
[00:20:35] Nathen Harvey: Yeah. Interestingly, what we found is that all of the performers, whether you’re a low performer or the best of the best, the elite performers, when it comes to software delivery, each one of those performance categories could improve their overall organizational performance with the adoption of reliability practices. So we find that these are very complementary practices to one another. A classic example of what we investigated within reliability: are you using the data from your reliability measures to help you prioritize work? In a world of Site Reliability Engineering, or the practice of SRE, we often talk about an error budget. And we use an error budget to say, if we’re not meeting our reliability goals, then we should focus our effort on the things that will make the system more reliable. And the flip side of that: if we are meeting or even exceeding our reliability goals, then we should focus on things that make us faster.
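As a rough illustration of the error budget mechanics Nathen describes, here is a small, hypothetical Python sketch. The SLO value, request counts, and wording are made-up examples for illustration, not figures or definitions from the report.

```python
def error_budget_status(slo: float, total_requests: int, failed_requests: int) -> str:
    """Illustrative error-budget check, loosely following the SRE practice described
    above: an exhausted budget shifts focus to reliability work; remaining budget
    frees the team to focus on shipping faster."""
    allowed_failures = total_requests * (1 - slo)  # the error budget
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    if consumed >= 1.0:
        return f"Budget consumed ({consumed:.0%}): focus on reliability work."
    return f"Budget remaining ({1 - consumed:.0%} left): safe to focus on shipping faster."

# Example: a 99.9% SLO over one million requests allows ~1,000 failed requests.
print(error_budget_status(slo=0.999, total_requests=1_000_000, failed_requests=400))
# -> Budget remaining (60% left): safe to focus on shipping faster.
```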
The other thing, I think, that’s really interesting about all of these metrics together. There are kind of two things I want to point out. One, we have to make sure that as a team responsible for an application or service, we have joint accountability for all of these metrics. Part of the reason, in the past, we used to think that slow meant safe also comes down to who’s incentivized for what. We’ve spent a lot of time telling our developers, go faster, go faster. And then turning to our operators and saying, keep it stable. And I’ve been that operator. I know the best way to keep the system stable is to not change it. But that’s no good for our customers, right? So we have to take these five metrics, our four software delivery metrics and our operational metric, and really have joint ownership of them across all of our teams.
The other thing that I think is really interesting is just how they reinforce one another. In DevOps technology practices, we learn from other disciplines as well. And I think the biggest lesson, or one of the biggest lessons, that we’ve taken from another industry is from lean manufacturing. And lean manufacturing says, do things in small batches. So when we apply that to the delivery and operations of software, we know that when we have a small change that’s always heading to production, it’s easier to reason about. When it lands in production, it’s less likely to have a huge negative impact. We also bring in practices like progressive deploys, where you might deploy that change to 5% of your users and see how it went before you roll it out globally to all of your users. By making those smaller changes, you’re less likely to have a big impact. It’s easier to reason about the change and to move it through the system.
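Progressive deploys like the 5% rollout Nathen mentions are normally handled by a deployment platform or service mesh, but the control loop behind them is simple enough to sketch. Here is a hypothetical outline in Python, where `deploy_to` and `healthy` are stand-ins for whatever your platform actually provides:

```python
import time

ROLLOUT_STAGES = [0.05, 0.25, 0.50, 1.00]  # 5% of users first, then widen

def deploy_to(fraction: float) -> None:
    # Placeholder: route `fraction` of traffic to the new version
    # via your load balancer or service mesh.
    print(f"Routing {fraction:.0%} of traffic to the new version")

def healthy(fraction: float) -> bool:
    # Placeholder: compare error rates and latency of the new version
    # against the baseline before widening the rollout.
    return True

def progressive_rollout() -> None:
    for stage in ROLLOUT_STAGES:
        deploy_to(stage)
        time.sleep(1)  # in practice: a bake time of minutes or hours
        if not healthy(stage):
            print("Regression detected: rolling back")
            deploy_to(0.0)
            return
    print("Rollout complete")

progressive_rollout()
```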
[00:23:26] Performers Categories
[00:23:26] Henry Suryawirawan: Thanks for those two summaries, because they actually explain the gist of DevOps itself. Traditionally, we used to have fights between developers and operators. Now with the DevOps concept and culture, those two teams basically need to have joint accountability, like what you said, and also be incentivized to actually work together and make the customers happy at the end of the day. Speaking about how the State of DevOps Report classifies what they call performers: there are a few categories, from the elite performers down to the low performers. So can you explain a little bit what these categories are? And what do you see as the significant difference between elite and low performers?
[00:24:04] Nathen Harvey: Yes, for sure. So again, this kind of gets back to the sort of academic rigor of the research itself. What we’ve done each year, as part of the research program, is ask respondents how they are doing against those four key metrics, or those five key metrics now, of course. And what we do is we take their answers and we throw them down onto a plot, essentially. So we lay them all out on a grid, and we see these clusters emerge. These clusters show us that there are three, or four, statistically significant groups of performers.
I said three first because, actually, in the beginning part of the research program itself, there were three distinct groups. But in 2018, a fourth group started to emerge: a subset of those top performers. What we decided to do in 2018 was to identify and call those out as elite performers. So today, and since 2018, we’ve had low, medium, high, and elite performers, four different clusters of performance. What we do is we look at how they are doing across each of those metrics. So, at a very high level, I’ll tell you a little bit about the elite performers and the low performers. On deployment frequency, elite performers are able to deploy on demand. That may mean even multiple deploys per day. But a low performer in 2021 will do fewer than one deployment per six months. So it might take seven, eight months, maybe a year, for them to do a single code deployment. The flip side of that, of course, is our time to restore service or change fail rate. On the time to restore service, I’ll give you the deltas between the two. Time to restore service for an elite performer is less than an hour. For a low performer, it may take more than six months to fully restore service when there is an outage or something that interrupts their service.
Now, the other thing about those four categories is, because we’ve had them for many years, we can see changes in their shape. So what defined elite performers three years ago may only define high performers today. Because as an industry, we are always getting better. We’ve also seen that in 2018, the elite performers were about 7% of the sample. In 2021, they’re 26% of the sample. So for the elite performers, not only are they getting better, but the number of people or number of teams that are able to achieve that is growing.
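For a rough sense of how those clusters translate into thresholds, here is a hypothetical Python sketch that buckets a team using only the two endpoints quoted in this episode (elite: on-demand deploys and restore under an hour; low: fewer than one deploy per six months, or months to restore). The middle bands below are placeholders, not the report's actual 2021 cut-offs, so check the report itself for those:

```python
def classify(deploys_per_year: float, hours_to_restore: float) -> str:
    """Very rough bucketing using the two endpoints quoted in the episode.
    The medium/high boundaries below are placeholders, not the report's."""
    if deploys_per_year >= 365 and hours_to_restore < 1:
        return "elite"   # deploy on demand, restore in under an hour
    if deploys_per_year < 2 or hours_to_restore > 24 * 30 * 6:
        return "low"     # < 1 deploy per six months, or > six months to restore
    if deploys_per_year >= 52 and hours_to_restore < 24:
        return "high"    # placeholder band
    return "medium"      # placeholder band

print(classify(deploys_per_year=400, hours_to_restore=0.5))  # elite
print(classify(deploys_per_year=1, hours_to_restore=2000))   # low
```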
[00:26:36] Henry Suryawirawan: Wow. That’s a very stark difference between the elite and low performers. So if any of you are interested in assessing yourself or your team, and where you are at, go check the State of DevOps Report. For each of the four metrics, you will find statistics on how each category of performer performs on that particular metric. For example, just now Nathen mentioned deployment frequency, where elite performers can deploy on demand, multiple deploys per day, while the low performers are somewhere around one deployment per six months, and that’s really bad. Classify yourself, of course, measure where you are now, and you can look at how to improve. That’s kind of like the roadmap out of these findings.
[00:27:15] Nathen Harvey: Absolutely. You can absolutely check the State of DevOps Report, but we also have a tool, Henry, that’s available to anyone. If you go to cloud.google.com/devops, you’ll see there’s a button there that allows you to take a quick check. This quick check is the DORA DevOps Quick Check. We will ask you questions about your team’s performance today, and help you assess where you are. You can then compare yourself to the elite performers or high performers. You can compare yourself to others within your industry and across all of the survey respondents. So definitely check out the quick check.
[00:27:52] Henry Suryawirawan: Thanks for the plug there. I’ll make sure to put it in the show notes. I wasn’t aware of that tool, but yeah, definitely, it would be great if there’s an automated tool that can help you to assess.
[00:28:00] Nathen Harvey: Yeah.
[00:28:01] 2021 New Key Findings
[00:28:01] Henry Suryawirawan: So speaking about this year’s report, what are the new key findings? I saw a couple of new key findings. Maybe you can share the most important, significant findings this year.
[00:28:11] Nathen Harvey: Yeah, absolutely. Because when we look at these clusters of performance, that’s really good to know and helps you understand where you are. But the real question is how do I get better? That’s really where I find the research to be very powerful. Because we dig into different capabilities that teams should invest in. Now, those capabilities might be technical capabilities, but they’re also process capabilities, measurement capabilities, and cultural capabilities. We really want to take a full systems view. Not just the systems like the computers, but the systems that include you and me, the people in the system, Henry. We’re there. So, at a very high level, there are a number of different things that we investigated this year. We touched briefly on SRE. So with SRE and DevOps, we wanted to investigate: what’s the difference between them? Are they working together? And in fact, we found that they’re complementary. We also continued our research into cloud. Not surprisingly, we find that cloud is one of those capabilities that drives performance. But it’s not just paying a cloud bill. You have to do cloud the right way. And in fact, we lean on other research from the National Institute of Standards and Technology, which helps define the key characteristics of cloud computing.
The other thing that we investigated: security. No surprise here. Security is on everyone’s mind right now. Especially if you look back over the last 12 months, 24 months, at the number of attacks and vulnerabilities that are out there, we’re all thinking about security a lot. And so we dug into that. And then another one that was really interesting this year, that’s brand new this year, in fact, is that we looked at documentation. Everyone knows documentation. We need to do better with documentation. Well, we wanted to make sure that was actually true. Yeah, we feel like docs could be better or that these docs are bad, but does documentation actually drive performance? And we were able to draw a predictive correlation, a predictive connection: good documentation enables the right technical practices to help you hit those goals. And then, of course, I can’t talk to anyone about DevOps without talking about culture. So yes, of course, we investigated culture. Specifically, we looked this year at how teams are doing with burnout. What sort of things can we put in place to help mitigate burnout from a cultural perspective? So those are some of the high level quick findings there, and I’m happy to dig into any of them with you, Henry.
[00:30:44] New Finding: Documentation
[00:30:44] Henry Suryawirawan: Definitely, one of the most surprising things is the documentation, right? It was never in the State of DevOps Report before, but this year there was this new thing. So why do you think documentation can actually be a predictive measure of how performant an organization could be? Is it because people just need to find a summary of things within the company? Or is it about playbooks and runbooks? Or are there other types of documentation?
[00:31:10] Nathen Harvey: Yeah, this is a really good question, and of course, there are lots and lots of different types of documentation. We investigated which practices around documentation, in particular, matter. And what we found was a predictive connection with things like having clear ownership and clear guidelines for when documentation should be updated, and making sure that it is part of the developers’ daily work. If we go back to security, we talk about this concept of shift left. Shift left on security helps enable the developers to bring better security practices. We want to do the same thing with documentation. And in fact, as our developers are making changes, are the docs being updated? This is a question that we should be asking as part of our pipeline.
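The "are the docs being updated?" question can be made mechanical in a delivery pipeline. Below is a hypothetical Python sketch of such a check: it flags change sets that touch source code without touching documentation. The `src/` and `docs/` paths are assumed conventions, not anything from the report; adapt them to your repository layout:

```python
import subprocess
import sys

def changed_files(base: str = "origin/main") -> list[str]:
    # Files modified relative to the main branch, as reported by git.
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def check_docs_updated() -> int:
    files = changed_files()
    touches_code = any(f.startswith("src/") for f in files)                  # assumed layout
    touches_docs = any(f.startswith("docs/") or f.endswith(".md") for f in files)
    if touches_code and not touches_docs:
        print("Code changed but no docs updated: please review documentation.")
        return 1  # fail (or downgrade to a warning) in the pipeline
    return 0

if __name__ == "__main__":
    sys.exit(check_docs_updated())
```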
But another thing that’s really important, Henry, is we have to incentivize it properly, right? If I’m going to an annual performance review, and my manager discounts any work I’ve done around documentation, that’s not good. Why would I be motivated to keep the docs up to date? So it goes all the way up to like incentive structures, and then down to daily practices of the team that’s responsible for building and operating that software. We also looked at things like, when there is an outage or an incident, do you trust the documentation? Can you use that as a tool to help you recover and restore service to your users?
[00:32:39] Henry Suryawirawan: So, yeah. Speaking about incentivizing, at least from my experience, I can see that documentation is sometimes neglected. People like to do action-oriented stuff, whereas documentation used to be seen as more academic and slow, with nothing really coming out of it. But in this pandemic era where everyone is probably remote, I think documentation is key. Because you want to be able to find things properly without having to interact all the time. There are a lot of meetings these days, and you’ll get burned out if you’re always meeting people. And also, yes, like what you said, during an outage, when incidents are happening, do you actually trust your documentation? Someone who is probably not familiar with the incident may be able to perform the same kind of restorative actions, without needing the same person to do it all over again. So I think that’s really key for documentation. How about software-related things? Is there anything about architecture diagrams, or software-related documentation?
[00:33:33] Nathen Harvey: Yeah, we didn’t really dig in so specifically into which types of documentation, but more into the availability of that documentation, and the process around keeping it up to date. But as we talk to customers, as I work with people and teams, and in my own experience, and I’m sure you’ve experienced this as well, Henry, there are lots of different types of documentation. Like you mentioned, architecture diagrams. Are they up to date? Maybe the architecture diagrams are automatically generated from the systems themselves, so I don’t have to look to a document. There’s documentation in the wiki. Is that kept up to date? There’s even documentation in the commit messages in your version control system. So while we didn’t really investigate the various types of documentation, or try to draw any predictive connections there, I think it is important that we, as practitioners, think about the scope of documentation. And it really comes down to this question of how we enable the best technical practices that drive those outcomes that we talked about: these four key metrics. And you can start to see how documentation really can play a critical role in that.
[00:34:41] Henry Suryawirawan: So, yeah, those are the examples that you mentioned, like architecture diagrams that can be auto-generated. I have heard of some people keeping architecture decision records that they commit along with the source code. You can also put some documentation in the commit message itself, for people to actually see the history of code changes.
[00:34:58] New Finding: Secure Software Supply Chain
[00:34:58] Henry Suryawirawan: So moving on from documentation, there’s another finding that I find also interesting, which is about secure software supply chain. How do you make sure security is embedded inside your delivery pipeline? So can you talk more about that? How does it predict elite performers?
[00:35:12] Nathen Harvey: Yeah. So what we found was that the teams that are embracing, and actually putting into practice, the security practices that we investigate were 1.6 times more likely to exceed their organizational goals. We also found that elite performers who met or exceeded their reliability targets are twice as likely to have security integrated into their development process. If I’m an elite performer in software delivery, and I am meeting or exceeding my reliability targets, I’m twice as likely to have security integrated into my development process. That’s huge. I think we often think of security as the thing that slows us down, and of security as the department of no. That’s not what security should be. It’s not what security has to be. And in fact, when we embrace better security practices, when we really shift left on security, we see that security can help us speed everything up.
So we really investigated specific practices like testing for security. Can your developers run a test that would validate that the code that they’ve written follows the right security policies? Are you working together with the security professionals within your organization? I really look at this as: my role as a security engineer or a security professional has changed from “I’m responsible for everything” to “I’m responsible for enabling the developers and enabling the operators to build and run more secure systems”. Maybe as a security professional, I’m also responsible for providing pre-approved code, whether that’s libraries or open source packages that we’ve certified. You can use these without having to go through another review process. These are agreed-upon standards that we can use within the organization. All of that really helps drive the security practices and then that organizational performance.
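To make "testing for security" and pre-approved code concrete, here is a small, hypothetical sketch: a pytest-style test that fails the build when a dependency falls outside an agreed allowlist. The package names, file format, and allowlist are assumptions for illustration, not anything prescribed by the report:

```python
# test_dependencies.py — run with pytest as part of the delivery pipeline.
APPROVED_PACKAGES = {"requests", "flask", "sqlalchemy"}  # hypothetical allowlist

def read_dependencies(path: str = "requirements.txt") -> set[str]:
    deps = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                # Keep only the package name, dropping version pins like "==1.2".
                name = line.split("==")[0].split(">=")[0].strip().lower()
                deps.add(name)
    return deps

def test_only_approved_dependencies():
    unapproved = read_dependencies() - APPROVED_PACKAGES
    assert not unapproved, f"Unapproved dependencies found: {sorted(unapproved)}"
```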
[00:37:06] Henry Suryawirawan: Yeah. I can now see the relation with the key metrics. So reliability is one aspect where security actually helps a lot. If you are more reliable, basically you are also more secure in terms of your software, your deployment, and maybe your architecture, in the sense that people won’t be able to attack it as easily.
[00:37:22] New Finding: Burnout
[00:37:22] Henry Suryawirawan: Speaking about culture, the most important thing in DevOps: this year there are also findings about burnout, and maybe due to the pandemic as well, we can see the relation to that. So what are the key highlights of this cultural aspect this year?
[00:37:34] Nathen Harvey: Yeah, for sure. The first thing that I want to be very specific on: the DORA research program is about software delivery and operations. We are all globally working under a pandemic, but we leave the research into how the pandemic has affected software delivery teams to the pandemic researchers. What we focus on is this other question. So first, did we see a downturn or a drop in software delivery performance? Overall, we did not see that. In fact, we see teams continuing to improve. Actually, we can look to our peers, the peers that are also studying developers. GitHub actually released a report in 2020 that shows an increase in developer activity, as measured by things like pushes, pull requests, and pull request reviews. So even GitHub is seeing these numbers improving.
We also looked at this question of burnout. How are you feeling? And really, for burnout, one of the things that we talk about is your ability to disconnect from work. Because if you can’t disconnect from work, work is occupying your mental space all the time. What we found was that teams with a generative team culture, a team culture that’s really focused on the mission, composed of people who felt included and like they belonged as part of their team, were half as likely to experience burnout, even during the pandemic. So overall, we saw burnout go up, but for those teams with a generative culture, where the individuals on the team felt included and like they belonged as part of the team, that cut burnout in half.
[00:39:14] Henry Suryawirawan: I can relate to that. In one of my past episodes, the guest also covered the research that was done by Google on this: things like the generative culture you mentioned, being included, having a sense of mission, psychological safety, and so on. So if people want to improve their cultural aspect as well, I'll make sure to put that episode in the show notes, so that you can also refer to it for how to improve.
[00:39:36] How to Start Improving
[00:39:36] Henry Suryawirawan: We've been talking a lot about the findings and the key metrics. I know that some people may not be there yet in terms of performance. So what would be some of your key advice for them: how should they get started, or where should they improve first?
[00:39:50] Nathen Harvey: Yeah, this is a really good question. Unfortunately, Henry, the answer to "where should they improve first?" is: it depends. But the DORA framework can help you with this, because what we actually do is investigate these various capabilities and draw the predictive connections between them. So I can say, for example, that if we want to improve our organizational performance, we know that software delivery and operations performance drives organizational performance. Well, how do we get better at software delivery and operations? One way is by practicing Continuous Delivery. Okay, but how do we get better at Continuous Delivery? Well, there are a bunch of technical practices that drive Continuous Delivery, things like trunk-based development or Continuous Integration.
The research program itself looks into about 30 different capabilities like this, whether it's trunk-based development, a generative culture, or documentation; lots and lots of different capabilities. So I'm going to take you back to that quick check. Not only will the quick check tell you where you sit, but it will also make a recommendation for you. If you're a medium performer, it will say: here are the three capabilities that are likely to have a big impact on your team. And then there are even further follow-up questions on each one of those capabilities to help you prioritize.
But at the end of the day, the best thing to do is to sit back, look at the whole body of capabilities, and have an honest assessment of your team. How are we doing with each one of these capabilities? What you're looking for is the one that's holding you back. That constraint. What's constraining you? Focus your improvement there. That might mean improving your testing. It might mean leaning into something like blameless postmortems and what you can learn from incidents. One of these things may be the thing that's holding you back, so make an investment in improving in that area. Together with the report and the quick check, you can get really good insights into what the various capabilities are, and then decide which is right for your team.
[00:42:00] Henry Suryawirawan: Definitely. It depends on the context of each team, so I really agree with that. Speaking about the quick check again, make sure to try it out, because there will be recommendations as well. The State of DevOps Report itself covers a lot of these technical aspects. And if I may add one more thing: the "Accelerate" book, written by Nicole Forsgren, Jez Humble, and Gene Kim, the original authors of the DORA research. The "Accelerate" book covers all the fundamental, important technical practices, including the cultural aspect, which I think is a good resource for people to get started.
[00:42:32] Nathen Harvey: I absolutely agree. The "Accelerate" book is incredible. You should absolutely read it. The other thing, while we're on the topic: on the site, you have access to all of the State of DevOps Reports. And just because a report says 2017 on it does not mean that report is outdated. Some of the capabilities investigated there are really important. At the end of the day, what the DORA research program is really trying to instill in all of us as an industry is the practice and the mindset of continuous improvement. Let's design an experiment. We believe that by getting better at trunk-based development, we will improve our software delivery and operations performance, and drive our organizational performance. Let's test that theory: improve our trunk-based development, and then reassess. Let's measure again. We find what's holding us back, we make an improvement, and we reassess. Get into this loop, this practice of identifying where we are and what's holding us back, and making a change to improve.
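As one illustration of the "measure again" step in that loop, here is a minimal sketch, assuming a team records its deployments as simple dated records (the field names and sample data are hypothetical), of computing two of the key metrics, deployment frequency and change failure rate, from such a log. DORA measures these via survey responses; this just shows how a team might track its own numbers between reassessments.

```python
from datetime import date

# Hypothetical deployment log: (date deployed, whether it caused a failure).
deployments = [
    (date(2021, 9, 1), False),
    (date(2021, 9, 3), True),
    (date(2021, 9, 8), False),
    (date(2021, 9, 9), False),
]

def deployment_frequency(deploys, days_in_period):
    """Average number of deployments per day over the measurement period."""
    return len(deploys) / days_in_period

def change_failure_rate(deploys):
    """Share of deployments that led to a failure in production."""
    failures = sum(1 for _, failed in deploys if failed)
    return failures / len(deploys)

print(f"Deployment frequency: {deployment_frequency(deployments, 30):.2f}/day")
print(f"Change failure rate: {change_failure_rate(deployments):.0%}")
```

Tracked release over release, even a tiny script like this gives the reassessment step something concrete to compare against.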
[00:43:37] Henry Suryawirawan: I love the way you described the continuous improvement aspect. It comes back to the lean and agile principles and mindset, where you also need to run experiments, test things out, and improve based on the findings. So I think this feedback loop is really critical in the DevOps, lean, and agile mindset.
[00:43:55] 3 Tech Lead Wisdom
[00:43:55] Henry Suryawirawan: So Nathen, it's been a really pleasant conversation. I learned a lot about the State of DevOps Report, especially this year's. Unfortunately, we've reached the end of our conversation. But before I let you go, I have the one question that I ask all my guests, which is to share your three technical leadership wisdom. Think of it as advice for the listeners here: from your experience, what are the things that they should be thinking about?
[00:44:17] Nathen Harvey: Yes. Alright. You phrased it as technical leadership, so I'm going to start with some really strong advice for leaders. And that advice is this: your team knows what needs to be done to improve. Listen to them. Ask them: what do we need to do to improve? Give them audacious goals and let them get to work. Your job as a leader is to create that time and space, and give them the resources they need to improve.
Now, on the flip side of that, I'll give some advice for the practitioners, the individual contributors: you know what needs to be done. Speak up! Share it with your leadership and drive that message forward. Consult with your team, and learn from each other. And that "learn from each other", to me, is the third, and probably the most important. Each and every one of us comes to work with a completely different set of lived experiences, a completely different set of knowledge. It's really important that we all share and learn from one another. Throw out the org chart. Throw out the structure. Let's have those conversations. Make sure that everyone feels included, and like they belong as part of this team, because in truth, they do. We're better because of all of us, when we share our insights and our ideas with one another. So maybe those are my three.
[00:45:40] Henry Suryawirawan: Yeah. I must say that's a beautiful way of sharing this wisdom, and it's all interrelated. First, your team probably knows what to improve, so do ask the team. For the team members themselves: speak up about what to improve; speak up to your leaders. And in the end, we all learn from each other. So a really beautiful way of explaining this wisdom. Nathen, for people who want to connect with you, to talk more about the DevOps report, anything related to DevOps or DORA, or maybe Google Cloud, where can they find you online?
[00:46:08] Nathen Harvey: Yes. The best place to find me is Twitter, where I'm @NathenHarvey. I will just warn you that my father misspelled my name, or chose a non-traditional spelling: it's N-A-T-H-E-N-H-A-R-V-E-Y. That's my Twitter handle. You can also find me on LinkedIn; search for my name, Nathen Harvey, with an E, and you'll find me. Those are two good ways, and Henry, I'm sure you can put some links in the show notes on how folks can reach me as well.
[00:46:38] Henry Suryawirawan: Definitely. I’ll put that in the show notes. So Nathen, thanks so much again for sharing your knowledge and expertise here. Thank you so much.
[00:46:44] Nathen Harvey: Henry, it’s my pleasure. Thank you so much for having me and thank you for your dedication to Tech Lead Journal. I think bringing these types of insights and conversations to the broader community, you’re doing a real service to everyone. Thank you so much.
[00:46:57] Henry Suryawirawan: It’s been my pleasure.
– End –