#117 - How to Establish SRE Foundations From Scratch - Vladyslav Ukis

23-Jan-2023 54 mins Vladyslav Ukis

“The strength of SRE is in the alignment of operational concerns between the product management, product development, and product operations.”

Dr. Vladyslav Ukis is the Head of R&D at Siemens Healthineers and author of “Establishing SRE Foundations”. In this episode, Dr. Vlad shared insights on how to establish SRE foundations from scratch based on his firsthand experience at Siemens Healthineers and the concepts described in his book. We started by discussing the basic SRE concept and how it differs from other related concepts, such as ITIL, COBIT, and DevOps. Dr. Vlad then explained in-depth how SRE implementation can help to create an alignment between the product management, product development, and product operations teams. He also shared the importance of having internal SRE coaches to facilitate this transformation and when an organisation can start realizing the benefits of implementing SRE. In the latter half, Dr. Vlad walked us through how we can begin our SRE journey, make further progress in the journey, and measure the success of our SRE implementation. Also, do not miss his sharing on how SRE implementation can help to improve reliability in a stringent industry, such as healthcare.

Listen out for:

Career Journey - [00:06:04]
Getting to Know SRE Concept - [00:08:24]
SRE vs Other Frameworks - [00:12:20]
SRE Definition - [00:16:48]
Ops-Development-Product Alignment - [00:19:26]
SRE Coach - [00:26:36]
Realizing SRE Benefits - [00:28:52]
How to Begin SRE Journey - [00:31:37]
SRE Journey Progression - [00:36:15]
Healthcare Reliability - [00:41:48]
Measuring SRE Implementation Success - [00:46:25]
3 Tech Lead Wisdom - [00:48:44]

_____

Vladyslav Ukis’s Bio
Dr. Vladyslav Ukis graduated in Computer Science from the University of Erlangen-Nuremberg, Germany, and later from the University of Manchester, UK. He joined Siemens Healthineers after each graduation and has been working on Software Architecture, Enterprise Architecture, Innovation Management, Private and Public Cloud Computing, Team Management, Engineering Management, Portfolio Management, Partner Management, and Digital Transformation at large. He currently works as the Head of R&D for the Siemens Healthineers teamplay digital health platform, and has shared his DevOps knowledge in his book “Establishing SRE Foundations” published in 2022.

Follow Dr Vlad:

LinkedIn – linkedin.com/in/dr-vladyslav-ukis-5172ba32

Mentions & Links:

📚 Establishing SRE Foundations – https://www.amazon.com/Establishing-Foundations-Step-Step-Organizations/dp/0137424604
✍🏼 Why I wrote “Establishing SRE Foundations”? – https://www.linkedin.com/pulse/why-i-wrote-establishing-sre-foundations-dr-vladyslav-ukis/
Siemens Healthineers – https://www.siemens-healthineers.com/
teamplay digital health platform – https://www.siemens-healthineers.com/en-id/digital-health-solutions/teamplay-digital-health-platform
Google’s SRE books – https://sre.google/books/
Ben Treynor Sloss – https://www.linkedin.com/in/benjamin-treynor-sloss-207120

Our Sponsor - Skills Matter

Today’s episode is proudly sponsored by Skills Matter, the global community and events platform for software professionals.
Skills Matter is an easier way for technologists to grow their careers by connecting you and your peers with the best-in-class tech industry experts and communities. You get on-demand access to their latest content, thought leadership insights as well as the exciting schedule of tech events running across all time zones.
Head on over to skillsmatter.com to become part of the tech community that matters most to you - it’s free to join and easy to keep up with the latest tech trends.

Our Sponsor - Tech Lead Journal Shop

Are you looking for some cool new swag?

Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference and delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive it.

Like this episode?

Follow @techleadjournal on LinkedIn, Twitter, Instagram.

Buy me a coffee or become a patron.

Quotes

Getting to Know SRE Concept

The buildup of the teamplay digital health platform was new for the company. We didn’t have cloud operations before. And also, we didn’t have the skills in order to operate such a platform. That was a total paradigm shift for us when we understood that actually in order to make this work, we need the development organization to do operations.
We started promoting the idea that actually the development teams, they need to be also looking into the operational aspects and not just developing features. And we kind of have tried doing this for several years, but we saw that this is a difficult idea to grow in a development organization that has never done operations before. Because in the minds of the developers who have never touched operations, their job is to develop software. In the minds of product managers, then their job is to tell the developers what to develop. Basically, everybody’s thinking that there is this operations department that will do operations, and the operations department cannot do this because they don’t have the context, and they don’t have the internal knowledge necessary to operate services.
The trouble was that those Google SRE books, while great, I saw they were describing the high end of what you do when you really operate services at Google scale. Then I also read a couple of other books, but all they were kind of describing was already at a level that was far away from where we used to be.

SRE vs Other Frameworks

I needed to zoom out, and see what else is out there in order to introduce SRE well in the context of the existing methodologies. ITIL implementation: This is basically a framework for how you set up the operations, the IT function of an organization. And then, there are also a couple of other frameworks, like COBIT and others. What they do well is to prescribe how an IT function of a big organization can be established and run.
Then, there are other related things. DevOps is a very hyped concept in the industry. This is kind of a philosophy of how you run the product delivery, especially with short release cycles and feedback along the way from different environments and from different people, including the customers and so on. But then, generally, DevOps stays at the philosophical level. It’s a philosophy of fast software delivery and fast feedback loops. But especially in the area of operations, it doesn’t prescribe enough to get you started. You can think of DevOps as an overarching philosophy of running product delivery, and then you can think of other frameworks like ITIL and COBIT as frameworks for organizing the IT function of the enterprise.
SRE is an opinionated implementation of the ops part of DevOps. Based on the name, DevOps tries to bring together development and operations. Although I think it’s broader, but you can kind of take that assumption and say, DevOps, based on the name, tries to unite development and operations so that they work well together.
Development and operations, they should be working together. But how? Should they do pair programming? Should they do pair operations? Or what should they do? How should they work together?
SRE as a concrete implementation of the DevOps philosophy is this opinionated framework to implement the ops part of DevOps. It’s got concrete prescriptions of what should be put in place and also roughly by whom, in order to bring development and operations together. This is also coming from the origins of SRE, which is trying, as software engineers, to come up with a framework for operating services. So, this is then a sort of software engineer led approach to operations.
SRE can live very nicely in parallel with ITIL and COBIT. Because with ITIL and COBIT, you just set up the IT function of the enterprise, and then with DevOps you set the overarching philosophy for product delivery, and then SRE is a piece in order to enable the operations part of your product delivery organization well.

SRE Definition

SRE is like a way of operating software if you give it to software engineers.
Interestingly, this kind of thinking when you task software engineers with doing something can be also applied to other areas. So, for example, being specifically in healthcare, we’ve got a lot of regulatory burden.
Traditionally, these regulations are not fulfilled from the software engineer’s point of view. So they are basically fulfilled from the regulatory person’s point of view, and therefore, there are lots of documents that you have to write in order to demonstrate your compliance with the regulations. Those documents have to be handled in a certain way in order for you to be able to provide evidence during audits that you actually comply with the process and so on. And actually, if you task a software engineer to design the regulatory function, it would become something completely different.
I think you can apply that kind of thinking, just task software engineer with designing X and that leads to a totally different implementation than if it comes from a place where it’s not inspired by software engineering.
And we also started actually changing our regulatory process by adding lots of automation and so on. That definitely makes the whole process much leaner, makes the whole process much easier to apply and then leads to a great acceleration of product delivery because if you are less constrained by the regulatory burden, then you can move much faster.
Overall, I would definitely still agree with the original definition from Google, the quote by Ben Treynor Sloss that SRE is what happens when you task a software engineer with designing the operations function. So I would definitely be happy to apply that kind of thinking in other areas as well.

Ops-Development-Product Alignment

The strength of SRE is exactly in the ability to bring about that alignment in operational concerns between the product management, product development, and product operations. Whereas other methodologies and frameworks that we talked about before have their strength in other areas. Like ITIL and COBIT, their strength is designing the IT function of the enterprise. DevOps, the strength is in the philosophy of product delivery. Whereas SRE, I think the strength of the methodology is in the alignment with operational concerns in the entire organization.
It prescribes that you need to have service level objectives for your services, which are then based on service level indicators. Then, once you’ve got service level indicators and service level objectives, then automatically you’ve got so-called error budgets, which are given budgets for errors. And then if you exceed your given budget for errors, then you can do data-driven decision making about whether we invest now into reliability or whether we continue investing in features. So you’ve got these kinds of primitives that you can work with and that you can use to guide your thinking.
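As a rough illustration of how these primitives relate, here is a minimal Python sketch. The class name `ErrorBudget`, its fields, and the example numbers are assumptions made for illustration only; they are not taken from the book or from any particular SRE tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical model: an SLO target applied to a count-based SLI."""

    slo_target: float   # e.g. 0.99 means 99% of events must be good
    total_events: int   # events observed in the SLO window
    bad_events: int     # events that violated the SLI in that window

    @property
    def allowed_bad_events(self) -> float:
        # The error budget is whatever the SLO leaves over for errors.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched, 0.0 or less means it is used up.
        allowed = self.allowed_bad_events
        if allowed == 0:
            return 0.0 if self.bad_events else 1.0
        return 1.0 - self.bad_events / allowed

    @property
    def exhausted(self) -> bool:
        # True once more errors occurred than the budget grants.
        return self.bad_events > self.allowed_bad_events


budget = ErrorBudget(slo_target=0.99, total_events=10_000, bad_events=50)
print(f"allowed bad events: {budget.allowed_bad_events:.0f}")
print(f"budget remaining:   {budget.remaining_fraction:.0%}")
print(f"exhausted:          {budget.exhausted}")
```

In this sketch, the `exhausted` flag is the data point that would feed the conversation about investing in reliability versus features.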
How does SRE then bring about the alignment of different parties on operational concerns? First of all, under the SRE framework, the developers need to go on call for their services to the extent agreed in the organization. So that means that the developers need to experience firsthand what it’s like to run their services in production.
The product management typically is not involved in product operations. But under the SRE framework, they need to be involved in product operations in order to actually make those decisions when we invest in reliability versus when we invest in product features. But these decisions are done based on sound data from production. So that’s based on the error budget consumption.
If there is no error budget left anymore, what do we do? The product management needs to be involved in the definition of the so-called error budget policies that state service by service and team by team: what do we do if there is no error budget left anymore? Do we then invest in reliability? If we do invest in reliability, then how do we do this? Do we take just one engineer and put on the reliability, or do we re-prioritize the backlog and so on? As you can see, the product management becomes totally part of this entire conversation, how to guide investment in reliability.
The product operations, they usually are not involved in enabling others to do operations, but under the SRE framework, that’s exactly what they need to do. So basically they don’t necessarily need to run services in production, but they need to provide SRE infrastructure to enable developers to go on call and be efficient.
Basically, a traditional organization puts their responsibilities on their head, so to speak. Because the developers, they never went on call, but they need to go on call now. The operations people, they never enabled others to do operations, but they need to enable developers now to do operations. The product management, they were never involved in operations, but now they’re part of the conversation. How do we guide the investment into reliability?
It gives each party a certain set of tasks and aligns them using those primitives, like SLOs, error budgets, error budget policies, and so on, so that the whole organization then is aligned on those operational concerns, ideally, before hitting production.
What we did at Siemens Healthineers was that, first of all, we placed SRE in the portfolio of things that we do on the platform. We reached an agreement that this is an important topic. It’s important for us to improve the operations of the platform, and therefore we’ll try SRE, because everything that we’ve tried before wasn’t as successful.
With that, then we were able to allocate resources across the board. We were able to start the development of the SRE infrastructure in the operations teams. We were able to start talking to the development teams and literally go kind of team by team, and introduce the concept there and take each team on its individual journey towards operations improvements using SRE.
That kind of team-based coaching was also key to establishing SRE in the organization because every team is different. Also, every team is using the SRE infrastructure slightly differently.

SRE Coach

When you embark on the SRE transformation and want to establish SRE in an organization, then you need somebody who would be driving the transformation. And this is the role of SRE coach. That needs to be somebody who’s really driven to improve operations and who can understand the organization well enough in order to know what you need to do to establish it.
That’ll be different organization by organization and, especially also, culture by culture. Therefore, I think it needs to be somebody internal. I wouldn’t think that somebody external would be able to do this, because you need to know too much of the inner workings of an organization in order to make it happen. There might be just not enough trust between the teams and the external coach in order to make that change. And another thing, typically something like this is a multi-year journey, and usually somebody external is not there to spend years with the organization.
I think this is definitely a key role, and must be there for quite a long time until SRE becomes the usual ways of working for any team. And also to the point where if a new team has started, then they kind of don’t think about how to do operations, but they just use the existing SRE infrastructure and approach everything from the SRE point of view from the beginning when they set up new services.

Realizing SRE Benefits

If you are coming from a traditional development organization that has never done operations before, then there will be basic things that you can improve immediately. For example, you need a uniform logging across all services.
The next question is when do you really start seeing improvements at the management level? Typically, in order for this perception to be developed, you need to see some external change. And usually, the external change is measured based on the customer complaints. So the amount of external complaints gets reduced.
If the development teams start taking production seriously, if they start really kind of looking into how the services are running, even if it’s not kind of the best SRE way, then that usually already brings a huge effect in terms of reliability.

How to Begin SRE Journey

One important thing is that it’s important for the initiative to be taken seriously. That requires SRE to be put onto the list of organizational priorities that the organization wants to explore. Because this is such a profound change. I don’t think you can really bring this about if the organization is not serious enough about the initiative. So I think, the first thing, put this on the list of initiatives you are undertaking.
The second thing, find a possibility in the operations team to implement a little bit of SRE infrastructure and really the very minimum. So I would say just start with one SLI, say, availability, because that’s kind of the most common one and the most fundamental one. Then pick, say, one development team, that is responsible for some services that would be open to exploring a new way to do operations.
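To make "start with one SLI, say, availability" a bit more concrete, here is a small Python sketch of one common way to express an availability SLI: the share of successful requests out of all requests. The function names, the request counts, and the 99.5% target are assumptions for illustration, not something prescribed in the episode.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability as the fraction of requests that succeeded."""
    if total_requests == 0:
        # Assumption: a window without traffic counts as fully available.
        return 1.0
    return successful_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is simply a target that the SLI is compared against."""
    return sli >= slo_target


sli = availability_sli(successful_requests=9_990, total_requests=10_000)
print(f"availability: {sli:.3%}, meets 99.5% SLO: {meets_slo(sli, 0.995)}")
```

A single indicator like this is enough for the first team to start a conversation with the operations team about what the SRE infrastructure should measure next.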
From then onwards, iterate. Iterate frequently between what the development team actually needs and what the SRE infrastructure needs to provide. Basically, bring together that one team and the operations team implementing the infrastructure, and bring it to a point where that one team would say, “It’s better now and we want from this SRE infrastructure an additional 10 features.” Now we know we are onto something useful and let’s go find the second team.
Then the second team can already make use of this SRE infrastructure that was built for the first team. They will, of course, have some different context, and this is where you can then decide which additional features to implement into the SRE infrastructure.
From then onwards, it’s about going team by team, and talking to them, discussing with them whether SRE would be something for them. But usually, the more teams you onboard, the easier it gets, because the SRE infrastructure is already more mature every time you take on a new team, because you’ve got more features there, and you know that they’re useful. Because also there is a social proof that this new thing, SRE, helps with the services operations.
This is roughly the journey that you would need to take, and this is also where the SRE coaches are essential, especially at the beginning.

SRE Journey Progression

Step one is to involve the product owner into the SRE coaching sessions from the beginning. The product owner is totally a full member of these discussions.
Step two, once the team is mature enough, is to come up with the error budget policy.
- This is something that’s really helpful, because it forces the team to think exactly about this question: What will we do when there is no error budget left for this service or for that service? It does this ahead of time before you’ve exhausted your error budget, and that provokes good discussions.
- You need to put it on paper. Our error budget policy is if this happens, that’s what we do. If that happens, that’s what we do. So that’s already definitely the next level of maturity.
- I think it’s important not to start with the whole error budget policy before the team is actually ready for this, because that’s a pretty advanced concept. That’s also how I put it in the book. So I’m talking about the basic SRE Foundations and advanced SRE Foundations. And the basic SRE foundations are SLIs, SLOs, and error budgets. The advanced SRE foundations are the error budget policies and error budget based decision making.
- In order for you to talk about the error budget policy, you need already everybody’s understanding about the SLIs, the SLOs, and the error budget. Then the SLOs need to make sense. You need to already have several iterations on the whole thing.
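The "if this happens, that's what we do" idea can be sketched as plain data plus a lookup. This is only one possible shape for such a policy, written in Python. The `PolicyRule` type, the 25% threshold, and the wording of the actions are hypothetical; a real policy is agreed on by each team together with its product owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """One hypothetical 'if this happens, that's what we do' entry."""

    max_remaining: float  # rule applies when remaining budget fraction <= this
    action: str


# Assumed example policy for a single service.
POLICY = [
    PolicyRule(0.0, "re-prioritize the backlog towards reliability work"),
    PolicyRule(0.25, "assign one engineer to reliability work"),
]


def action_for(remaining_fraction: float, policy: list) -> str:
    """Return the action of the strictest rule that matches."""
    for rule in sorted(policy, key=lambda r: r.max_remaining):
        if remaining_fraction <= rule.max_remaining:
            return rule.action
    return "continue feature work"


for remaining in (0.5, 0.2, -0.1):
    print(f"{remaining:+.0%} budget left -> {action_for(remaining, POLICY)}")
```

Writing the rules down as data ahead of time mirrors the point made above: the discussion happens before the budget is exhausted, not in the middle of an incident.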
Once the error budget policy is in place, the real test is whether it will be executed once a service really reaches the point where no error budget is left. Does the team really follow what they have written down?
And then the step after is the error budget based decision making.
In the book, I put all those concepts in the pyramid.
The cool thing about that kind of high-level dashboard (of SLOs and error budgets) is that especially the product managers, they are enabled to work with this, because then you can say over time from experience, we actually thought that we would have this certain service level, but we are not fulfilling this and therefore we need to prioritize some significant work in order to build reliability into these particular services. The cool thing is that you make those decisions really based on data and not just on the opinion of the architect, on the opinion of the development and so on, which depending on the team culture, may weigh a lot or not a lot to the product manager.

Measuring SRE Implementation Success

Especially at the beginning, it’s a good idea to just see how many services the teams are onboarding onto the infrastructure. This is not outcome-based, of course, because you still don’t know whether it’s improved reliability. But at least these output based indicators at the beginning, they’re definitely good because, if teams are not willing to put their services onto the SRE infrastructure, then there’s no use in the infrastructure.
Then the next thing is, do the teams define the SLOs? Is the number of SLOs actually growing? Then the next thing, now that the SLOs are there, are they being fulfilled? If there are SLOs but the majority of them are broken, then that means the teams just forget about them. What is the percentage of SLOs that are being fulfilled and so on?
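The "percentage of SLOs that are being fulfilled" can be computed from a simple status map. The following Python sketch assumes a hypothetical dictionary of SLO names to a met/not-met flag; the SLO names are made up, and how such a map is produced depends entirely on the SRE infrastructure in use.

```python
from typing import Mapping


def slo_fulfillment_percentage(slo_status: Mapping[str, bool]) -> float:
    """Percentage of defined SLOs that are currently being met."""
    if not slo_status:
        # Assumption: with no SLOs defined there is nothing to fulfil yet.
        return 0.0
    met = sum(1 for is_met in slo_status.values() if is_met)
    return 100.0 * met / len(slo_status)


# Hypothetical snapshot of four SLOs across two services.
status = {
    "upload-service/availability": True,
    "upload-service/latency": True,
    "report-service/availability": False,
    "report-service/latency": True,
}
print(f"SLOs defined:   {len(status)}")
print(f"SLOs fulfilled: {slo_fulfillment_percentage(status):.0f}%")
```

Tracked over time, the number of SLOs and the fulfilled share give the output-based indicators described here, which still need to be checked against what customers actually report.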
Then the next step is coming back to those customer escalations that we were talking about earlier. It can easily happen that all the SLOs are perfectly green, so that’s all cool, but the customers keep calling.
The true measure of implementing SRE correctly is that your customers are happy, and your error budget is a representation of that. In your book, you also mentioned a couple of other things, like, for example, perception from the partners that you work with, not just the users.

3 Tech Lead Wisdom

Act in context. I think that’s important because there are very many technologies, methodologies and so on, but it’s the context that actually decides whether this particular thing that you have heard about, have read about, have thought about would actually fit.
Practice servant leadership. That means, on the one hand, trying to show the way to the organization and to the teams, but on the other hand, also being very well aware of their context, and therefore, kind of inviting them to go on a certain journey instead of coercing them to adopt a particular thing. On the one hand, lead, but on the other hand, do this in a servant way.
Look for ways to motivate people.
- All these things that we have talked about, they are not easy. Acknowledge that this is not easy. Empathize with the people and, specifically, as a third point, look for ways to motivate people. I think this is really important to make some change stick.
- Especially on the SRE transformation journey, there are so many ways where you can point out very small wins. I think pointing out those small wins that now as a team, we’ve done this, that really motivates the people and taking this seriously from the SRE coach point of view and really seeking things that people can be thanked for, thereby driving the motivation.

Transcript

[00:01:17] Episode Introduction

Henry Suryawirawan: Hello again, my friends and my listeners. Welcome to the Tech Lead Journal podcast, the show where you can learn about technical leadership and excellence from my conversations with great thought leaders in the tech industry. If this is your first time listening to Tech Lead Journal, please subscribe and follow the show on your podcast app and on LinkedIn, Twitter, and Instagram. And to support my journey creating this podcast, subscribe as a patron at techleadjournal.dev/patron.

My guest for today’s episode is Dr. Vladyslav Ukis. Dr. Vlad is the Head of R&D at Siemens Healthineers and the author of “Establishing SRE Foundations”. In this episode, Dr. Vlad shared insights on how to establish SRE foundations from scratch based on his firsthand experience at Siemens Healthineers and the concepts described in his book.

We started by discussing the basic SRE concept and how it differs from the other related concepts, such as ITIL, COBIT, and DevOps. Dr. Vlad then explained in depth how SRE implementation can help to create an alignment between the product management, product development, and product operations teams. He also shared the importance of having internal SRE coaches to facilitate this transformation, and when an organization can start to realize the benefits of implementing SRE. In the latter half of our conversation, Dr. Vlad walked us through how we can begin our SRE journey, make further progress in the journey, and measure the success of our SRE implementation. Also do not miss his sharing on how SRE implementation can help to improve reliability in a stringent industry, such as healthcare.

I really enjoyed my conversation with Dr. Vlad. Though there are quite a number of resources explaining the SRE concept, not many provide a practical step-by-step guide on how to implement it, especially for traditional organizations that need to evolve from their legacy operations and practices. I especially admire Dr. Vlad for writing an SRE book, despite not having the experience of working at Google, the place where the SRE concept and practices were created. Dr. Vlad successfully embarked on the journey to implement SRE in his organization and shared his firsthand experience in his book and this episode. So if you like this episode, make sure to check out his book as well.

And as always, if you find this episode useful, please help share it with your friends and colleagues, so they can also benefit from listening to this episode. I always appreciate your support in sharing and spreading this podcast and the knowledge to more people. And don’t forget to give this podcast a rating, if you’re listening on Apple Podcasts and Spotify.

Let’s continue to the conversation with Dr. Vlad, after hearing some words from our sponsors.

[00:05:02] Introduction

Henry Suryawirawan: Hello, everyone. Welcome back to another new episode of the Tech Lead Journal podcast. Today, I’m very excited to meet someone who wrote a book about SRE, which is titled “Establishing SRE Foundations”, but he is actually not someone coming from Google. As most of you probably know, SRE is a concept that was introduced and widely popularized by Google. But today, we’ll meet someone who actually didn’t come from Google, but wrote a good book about SRE. His name is Dr. Vladyslav Ukis. He’s currently the Head of R&D at Siemens Healthineers. So I’m looking forward to this conversation, Dr. Vlad, and hopefully, we have a great conversation about SRE in general.

Vladyslav Ukis: Yeah, definitely. Henry, thank you very much for having me on the show. I’m really happy that it worked out that we can talk about SRE on this really wonderful podcast, which I wasn’t aware of before. And now that I’ve seen all the great speakers from previous episodes, I’m really honored to be able to talk to you and be on the show.

Henry Suryawirawan: Thank you for your kind words.

[00:06:04] Career Journey

Henry Suryawirawan: So Dr. Vlad, I always love to ask my guests to actually share a little bit more about yourself. Maybe sharing more about your career highlights or any turning points that you think are interesting for us to learn from.

Vladyslav Ukis: Yeah. So I have spent my career so far in the healthcare domain with Siemens Healthineers. Siemens Healthineers is a Germany-based company that produces lots of medical imaging devices. So things like, for instance, computed tomography scanners or magnetic resonance devices or ultrasound devices and so on. There is lots of software that’s associated with those devices. So, there is a lot of on-premise software, first of all, that’s driving the devices themselves so that you can do scans on those machines. Then there is lots of on-premise post-processing software, which is software that is used by the physicians inside the hospital departments, but not necessarily directly at the scanners. Then there is also cloud software. That software usually manages the fleets of those scanners and helps the personnel in the hospitals to run the machines more efficiently as a whole.

So, I worked in the on-premise software at Siemens Healthineers, and then at some point, I moved into the cloud software. That was also the very beginning of the cloud journey for Siemens Healthineers. Basically, over the years, we built the first cloud-based platform for medical applications for the entire organization. And now, every business unit that wants to build a cloud application, they build it on top of the teamplay digital health platform, which we built. And that was also the product where we started exploring how to actually operate such a platform at scale. And this is where SRE came in, and this is where the whole story started that led to the book.

Henry Suryawirawan: Thanks for sharing your journey. So one thing that I picked up is that you have been in the healthcare industry for a long time, and you actually implemented the SRE concept in it. I think most of us are educated about implementing SRE for cloud-based systems. So hopefully, today, we’ll be discussing a little bit more from your journey of implementing it in the healthcare industry, which is probably quite stringent, and it’s kind of mission critical, right? So you can’t afford failures.

[00:08:24] Getting to Know SRE Concept

Henry Suryawirawan: First of all, how did you come across this SRE concept? Because you didn’t come from Google and, obviously, you come from the healthcare industry, which is traditional. Maybe tell us a little bit more about this journey.

Vladyslav Ukis: Yeah. Yeah. So, as I mentioned before, the buildup of the teamplay digital health platform was new for the company. We didn’t have cloud operations before. And also, we didn’t have the skills in order to be able to operate such a platform, because so far, the majority of the software produced by the company was on-premise, and as a development team, you don’t operate on-premise software. That is done by professional services, which is then another organization. That was a total paradigm shift for us when we understood that actually in order to make this work, we need the development organization to do operations.

So then, the next step on that journey was to see, okay, how could we do this? We have tried to operate the platform, just using means that came to our minds. So we started setting up alerts. We started promoting the idea that actually the development teams, they need to be also looking into the operational aspects and not just developing features. And we kind of have tried doing this for several years, but we saw that this is a difficult idea to grow in a development organization that has never done operations before. Because in the minds of the developers who have never touched operations, their job is to develop software. In the minds of product managers, then their job is to tell the developers what to develop. Basically, everybody’s thinking that there is this operations department that will do operations, and the operations department cannot do this because they don’t have the context, and they don’t have the internal knowledge necessary to operate services.

At some point, I kind of understood that we need to do something different in order to really grasp the operational aspect on the platform. So I went to the QCon conference in London. There, there was a whole track on SRE. So I stayed nearly the entire conference on that track, because I knew that was the pain point that the organization had at that time. Then I learned a lot about SRE. But, of course, that was just conference knowledge. After the conference, I kind of understood that this is something real, that there are lots of companies in the software industry at large going that way and trying out SRE and seem to have some success, however you define it, with it. So, basically, I thought, okay, that might be worth a try for us as well.

So, then I came back and sort of started learning more and more about this. Read the Google SRE books, of course. But the trouble was that those Google SRE books, while great, I kind of saw that they were describing the high end of what you do when you really operate services at Google scale. Then I also read a couple of other books, but all they were kind of describing was already at a level that was far away from where we used to be. So then I understood, okay, fine. This is great, but we need to start somewhere and really start small and really start seeing whether SRE would be applicable in an environment like ours, which is obviously a different culture than Google. A different context than Google, and, basically, different everything compared to Google. So then we kind of started adopting SRE slowly but surely, making small steps, implementing a little bit of infrastructure and that then grew to the entire organization, then operating the entire platform using SRE. And this is how the journey was shaped.

Henry Suryawirawan: Really exciting, right? So you learned about it in a conference. You went through like one whole day, probably. But you learned it from the conference and you started implementing it and you become like an author of it. So I think that really speaks like a tremendous journey.

[00:12:20] SRE vs Other Frameworks

Henry Suryawirawan: First of all, I think many organizations would have been in your position as well, right? From a traditional company, maybe from healthcare, maybe from some companies which were not born in the internet era. In the first few chapters of the book, you actually try to compare SRE with the traditional ways of doing IT operations, which are like ITIL, COBIT, and also DevOps, in general, as a culture. Yeah. So maybe you can give some context and a summary. What is the difference between SRE and all these other frameworks?

Vladyslav Ukis: Yeah. So, when I was writing the introduction to the book, I understood that there is not just SRE, which kind of was my focus before because I was focused for many years on just doing SRE. But then when writing a book, of course, I needed to zoom out, and see what else is out there in order to introduce SRE well in the context of the existing methodologies. So then I saw that, especially prevalent today, is the ITIL implementation. This is basically a framework for how you set up the operations, the IT function of an organization. And then, there are also a couple of other frameworks like this, like COBIT and others. What they do well is to basically prescribe how an IT function of a big organization can be established and can be run.

But then, there are other kinds of related things. Like, for instance, DevOps, right? So, DevOps is a very, very hyped concept in the industry. This is kind of a philosophy. How you run the product delivery, especially with the short cycles and feedback along the way from different environments and from different people, including the customer and so on. But then, generally, DevOps stays at the philosophical level. So basically it’s a philosophy of fast software delivery and fast feedback loops, especially in the area of operations, but also in other areas. It doesn’t prescribe enough for you to get started, so to speak. So basically you can think of DevOps as an overarching philosophy of running product delivery, and then you can think of other frameworks like ITIL and COBIT as frameworks for organizing the IT function of the enterprise.

So now, where does the SRE come in that context? So actually SRE is an opinionated implementation of the ops part of DevOps. So, you can also take a very narrow kind of view on DevOps and say that just based on the name, DevOps tries to bring together development and operations. Although I think it’s broader, but you can kind of take that assumption and say, DevOps, based on the name, tries to unite development and operations so that they work well together. But then, again, this is kind of staying at the philosophical level. Development and operations, they should be working together. But how? Should they do pair programming? Should they do pair operations? Or what should they do? How should they work together?

So, actually, SRE as a concrete implementation of the DevOps philosophy is this opinionated framework to implement the ops part of DevOps, so to speak. It’s got concrete prescriptions of what should be put in place and also roughly by whom, in order to bring development and operations together. This is also coming from the origins of SRE, which is trying, as software engineers, to come up with a framework for operating services. So, this is then sort of software engineer led approach to operations. This is what SRE is. And then if you take those things like ITIL and COBIT for organizing the IT function of the enterprise, then DevOps as the overarching philosophy of product delivery, and then SRE being a concrete implementation of the DevOps philosophy, then you can say, okay, so actually SRE can live very nicely in parallel with ITIL and COBIT. Because with ITIL and COBIT, you just set up the IT function of the enterprise, and then with DevOps you set the overarching philosophy for product delivery, and then SRE is a piece in order to enable the operations part of your product delivery organization well.

Henry Suryawirawan: Wow. Thanks for this explanation. So we can see clearly where it fits into this overall, along with the other IT and operations frameworks. So thanks for sharing that.

[00:16:48] SRE Definition

Henry Suryawirawan: You mentioned a phrase which is very popular to describe SRE. SRE is like a way of operating software if you give it to software engineers, right? I think it’s a quote from Ben Treynor Sloss. I’m also interested to listen from you. What is your definition of SRE coming outside of Google? How do you describe SRE? Is there anything that you probably have different in terms of perspectives or opinions?

Vladyslav Ukis: I think you can definitely see that it’s coming from the software engineering world. I would agree that this is what happens when you task software engineers with doing operations. And, actually, very interestingly, this kind of thinking when you task software engineers with doing something can be also applied to other areas. So, for example, being specifically in healthcare, we’ve got a lot of regulatory burden.

So, we’ve got a lot of regulatory compliance, regulations that we have to fulfill. Traditionally, these regulations are not fulfilled from the software engineer’s point of view. So they are basically fulfilled from the regulatory person’s point of view, and therefore, there are lots of documents that you have to write in order to demonstrate your compliance with the regulations. Those documents have to be handled in a certain way in order for you to be able to provide evidence during audits that you actually comply with the process and so on. And actually, if you task a software engineer to design the regulatory function, it would become something completely different.

So, actually I think you can apply that kind of thinking, just task software engineer with designing X and that leads to a totally different implementation than if it comes from a place where it’s not inspired by software engineering. So I think there’s generally great potential to apply that kind of thinking also in other areas. And we also started actually changing our regulatory process with adding lots of automation and so on. That definitely makes the whole process much leaner, makes the whole process much easier to apply and then leads to, actually, a great acceleration of product delivery because if you are less constrained by the regulatory burden, then you can move much faster.

So, overall, definitely, I would still agree with the original definition by Google, the quote by Ben Treynor Sloss, that this is what happens when you task a software engineer with designing the operations function. Yeah, I definitely would be happy to apply that kind of thinking also in other areas.

Henry Suryawirawan: Thanks for the interesting perspective. So if you want to improve something, you can also apply software engineering to X problem, right? So maybe one day we’ll see regulatory compliance engineer, when you apply software engineering to some area.

[00:19:26] Ops-Development-Product Alignment

Henry Suryawirawan: You mentioned about last time in IT, the ops team will be its own silo, development team its own silo, and product manager its own silo. So in the book, you actually explain how SRE actually is able to unify these three different teams. So tell us more how SRE actually helps bring alignment into these three different areas?

Vladyslav Ukis: So I think the strength of SRE is exactly in the ability to bring about that alignment on operational concerns between the product management, product development, and product operations. Whereas other methodologies and frameworks that we talked about before have got their strength in other areas. Like ITIL and COBIT, their strength is designing the IT function of the enterprise. DevOps, the strength is in the philosophy of product delivery. Whereas SRE, really, I think the strength of the methodology is in the alignment on operational concerns (with) the entire organization.

The cool thing about SRE that also benefits us a lot in the teamplay digital health platform is that it prescribes that you need to have service level objectives for your services, which are then based on service level indicators. Then, once you’ve got service level indicators and service level objectives, then automatically you’ve got so-called error budgets, which are given budgets for errors. And then if you exceed your given budget for errors, then you can do data-driven decision making about whether we invest now into reliability or whether we continue investing in features. So, basically, you’ve got these kinds of primitives that you can work with and that you can use to guide your thinking.
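
As a rough illustration of how these primitives relate, here is a minimal Python sketch. The class name, field names, and numbers are assumptions made for illustration; they are not taken from the episode or from the teamplay platform. It treats the SLI as measured availability and derives the error budget from the SLO target.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper tying together an SLI, an SLO, and an error budget."""

    slo_target: float      # e.g. 0.99 means 99% of requests should succeed
    total_requests: int    # requests observed in the error budget period
    failed_requests: int   # requests that counted as errors

    @property
    def sli(self) -> float:
        """Measured availability: share of successful requests."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """The error budget: how many failures the SLO tolerates."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Failures still tolerated before the budget is used up."""
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        """True once more failures occurred than the budget allows."""
        return self.remaining < 0
```

For example, with a 99% target and 10,000 requests, the budget is roughly 100 failed requests; once more than that have failed, `exhausted` becomes true and the reliability-versus-features conversation is due.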

So now, how does SRE then bring about the alignment of different parties on operational concerns? First of all, under the SRE framework, the developers, they need to go on call for their services to the extent agreed in the organization. So that means that the developers need to experience firsthand what it’s like to run their services in production. And that can be just a little bit if the agreement to operate the services is such that the product development is just a little bit involved in operations, and it may be that the development teams are fully on call for their services. So you can arrange sort of between 10 and 100 percent (of operations) can be done by the developers themselves. The product management typically is not involved in product operations. But under the SRE framework, they need to be involved in product operations in order to actually make those decisions when we invest in reliability versus when we invest in product features. But these decisions are then done based on sound data from production. So that’s based on the error budget consumption.

So, if there is no error budget left anymore, what do we do? The product management needs to be involved in the definition of the so-called error budget policies that state service by service and team by team: what do we do if there is no error budget left anymore? Do we then invest in reliability? If we do invest in reliability, then how do we do this? Do we take just one engineer and put on the reliability, or do we re-prioritize the backlog and so on? As you can see, the product management becomes totally part of this entire conversation. How to guide investment on reliability. The product operations, they usually are not involved in enabling others to do operations, but under the SRE framework, that’s exactly what they need to do. So basically they don’t necessarily need to run services in production, but they need to provide SRE infrastructure to enable developers to go on call and be efficient.

Basically, SRE turns the responsibilities of a traditional organization on their head, so to speak. Because the developers, they never went on call, but they need to go on call now. The operations people, they never enabled others to do operations, but they need to enable developers now to do operations. The product management, they were never involved in operations, but now they’re part of the conversation. How do we guide the investment into reliability? As you can see, it gives each party a certain set of tasks and aligns them using those primitives, like SLOs, error budgets, error budget policies, and so on, so that the whole organization then is aligned on those operational concerns, ideally, before hitting production.

Henry Suryawirawan: So I think you refer to this as collective ownership in your book. So from three different perspectives, from development, from operations, and also product management. And I like when you said that now the three different groups actually align on what kind of reliability they would want to offer for their services. And if there’s something wrong about reliability, for example, your error budget is exceeded, then you make certain decisions. Should you invest in reliability more or should you continue investing in features, knowing that your error budget is exceeded?

So I think this misalignment is pretty common in many organizations, right? So maybe from your journey, tell us how did you solve this? Because when you introduce SRE, it’s a totally radical concept, I assume, for traditional companies. And some people would even think, “Oh, this works in Google, but not us”. So how did you overcome that misalignment and make sure three different groups actually can align on this same implementation?

Vladyslav Ukis: Yeah. Specifically, what we did at Siemens Healthineers was that, first of all, we placed SRE into the portfolio of things that we do in the platform. We reached an agreement that this is an important topic. It’s important for us to improve the operations of the platform, and therefore we’ll try SRE, because everything that we’ve tried before wasn’t as successful. So, we’ll give it a try. We put it into the portfolio of things that we’ll work on. With that, then we were able to allocate resources across the board. We were able to start the development of the SRE infrastructure in the operations teams. We were able to start talking to the development teams and literally go kind of team by team, and introduce the concept there and take each team on its individual journey towards operations improvements using SRE.

That kind of team-based coaching was also key to establishing SRE in the organization because every team is different. Also, every team is using the SRE infrastructure slightly differently. So, basically, taking the people who are implementing this SRE infrastructure and going kind of team by team into the product teams, including the product owner, was key to actually being able to establish SRE broadly in the organization.

Henry Suryawirawan: It’s very interesting because yeah, like you said, every team is different. Some may be more advanced in terms of understanding. So for those people who haven’t really understood the SRE concept, I do have episode 96, which covers the basic ideas, principles, and concepts in SRE. So today we will not be covering that because you can refer to that episode. Do check it out if you want to learn about what we are talking about: SLO, error budget, SLI, right?

[00:26:36] SRE Coach

Henry Suryawirawan: So, Dr. Vlad, when you actually tried to implement this, I guess some people may be puzzled about SRE. “Huh?” So it’s like, how do you actually educate people? You mentioned about the role of SRE coach. Tell us more about this role. The significance of it. Like how do you actually recruit them? Is this coming from external or is it internal? So maybe tell us more about this important role of the SRE coach.

Vladyslav Ukis: Yeah. So, when you embark on the SRE transformation and want to establish SRE in an organization, then you need somebody who would be driving the transformation. And this is the role of SRE coach. That needs to be somebody who’s really driven to improve operations and who can understand the organization well enough in order to know what you need to do to establish it. That’ll be different organization by organization and, especially, that’ll be different, also, culture by culture. Therefore, I think it needs to be somebody internal. I wouldn’t think that somebody external would be able to do this, because you need to know too much of the inner workings of an organization in order to make it happen. Also, I think there might be just not enough trust between the teams and the external coach in order to make that change. And also another thing, typically something like this is a multi-year journey, and usually somebody external is not there to spend years with the organization. So, I think it needs to be somebody internal, somebody driven to improve operations and somebody convinced that SRE might be a good approach there.

That person needs to be a mediator between the operations team implementing the SRE infrastructure, development teams adopting SRE at different rates, and the management team endorsing the transformation and also providing a little bit of budget for this. So, I think this is definitely a key role, and must be there for quite a long time until SRE becomes the usual ways of working of any team. And also to the point where if a new team has started, then they kind of don’t think about how to do operations, but they just use the existing SRE infrastructure and approach everything from the SRE point of view from the beginning when they set up new services.

[00:28:52] Realizing SRE Benefits

Henry Suryawirawan: And you mentioned this could be a multi-year journey, right? So maybe from your experience, how many years did it take you to actually really start seeing the effect and benefits across maybe one or two teams, if not the whole company?

Vladyslav Ukis: So, I think you can see the benefits pretty quickly. Because if you are coming from a traditional development organization that has never done operations before, then there will be basic things that you can improve immediately. Just to talk about the basics, for example, you need a uniform logging across all services that you’ve got. Typically, in a traditional organization, you will not have uniform logging everywhere. So, there will be always, you know, one team does it in one way, another team does it in some other way, and there is this new service where the logging has not yet been enabled and so on. So, basically you kind of immediately can elevate the maturity a little bit.

Then, the next question is, okay, when do you really start seeing improvements at the management level? Where the management team members will say, “Yeah, so actually, before we had this and before the teams were doing operations, the services weren’t as reliable as now”. Something like this will take longer, of course, but not necessarily that long, I would say. Because, typically, in order for this perception to be developed, you need to see some external change. And usually, the external change is measured based on the customer complaints, right? So, usually what the management team notices is that, okay, somebody’s complaining because something is not working. So the amount of external complaints gets reduced. This is when the people will start noticing, ah, actually, yeah, this already kind of bears some fruit.

Although from the SRE coach’s point of view, they, of course, will know. Okay. We are just actually at the beginning of the journey. But already, if the development teams start taking production seriously, if they start really kind of looking into how the services are running, even if it’s not kind of the best SRE way, then that usually already brings a huge effect in terms of reliability.

Henry Suryawirawan: Yeah, I can totally relate with what you said. So previously, if you haven’t done any kind of SRE, for example, there are typically many kinds of implementations, especially across teams in the same company, right? So, like logging is one thing, observability, whatever that is, you will have so many tools, and they’re not uniform as well. It’s pretty difficult to trace from one service to the other. And I think with SRE concept, you start maybe standardizing all this, having conventions. And I think, like what you said, at the end of the day, it impacts the customers, right? Users’ happiness. So from the management point of view, I think once you start seeing a reduction in customer complaints or customer related issues, maybe that is a sign when you actually implemented SRE successfully.

[00:31:37] How to Begin SRE Journey

Henry Suryawirawan: So you have explained the SRE concept to everyone. Everyone seems to understand about the concept. So how do we start by laying the foundations? If I understand SRE correctly, it always revolves around service level objectives. When we discuss about collective ownership, I think from the product development, product operations, and product management, we also agree on the same service level objectives, right? So we have the same definition. We have the same understanding about a consequence of not meeting that objective and things like that. So tell us more, what should be the step-by-step foundational steps for people who are trying to implement SRE in their company or team?

Vladyslav Ukis: I think a couple things need to be there. One important thing is that I think it’s important for the initiative to be taken seriously. That requires SRE to be put onto the list of organizational priorities that the organization wants to explore. So, every organization has got some list of big topics that they’re working on. 10 topics roughly that we are working on right now, and that needs to be on the list there, because this is such a profound change. The development teams suddenly start going on call for their services, and the operations team start suddenly to implement some frameworks in order to enable the development teams to do operations. So basically, the development teams start doing some operations work, and the operations team starts doing some development work. And it’s a total kind of transformation for both parties. Then the product management suddenly is in the middle of the discussions about reliability, which is totally new to them. So I don’t think you can really bring this about if the organization is not serious enough about the initiative. So, I think, the first thing, put this on the list of initiatives you are undertaking.

Then the second thing, find a possibility in the operations team to implement a little bit of SRE infrastructure and really the very, very minimum. So I would say just start with one SLI, say, availability, because that’s kind of the most common one and the most fundamental one. And then just implement some infrastructure that can take some logs and calculate the availability of the service based on those logs, and then can alert on, say, certain thresholds. That definitely is already enough. Then pick, say, one development team, that is responsible for some services that would be kind of open to exploring a new way to do operations.
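
A minimal sketch of that very first piece of infrastructure might look like the following. The log format (an HTTP status code at the end of each line), the function names, and the rule that 5xx responses count as failures are all assumptions for illustration, not a description of the actual teamplay implementation.

```python
from typing import Iterable, Optional


def availability_from_logs(lines: Iterable[str]) -> Optional[float]:
    """Return the share of requests that did not fail with a server error (5xx).

    Assumes each line ends with the HTTP status code, e.g. "GET /studies 200".
    Lines that do not match this assumed format are skipped.
    Returns None when no request could be parsed.
    """
    total = 0
    good = 0
    for line in lines:
        parts = line.split()
        if not parts or not parts[-1].isdigit():
            continue  # not a request line in the assumed format
        total += 1
        if int(parts[-1]) < 500:
            good += 1
    return good / total if total else None


def should_alert(availability: Optional[float], threshold: float) -> bool:
    """Alert when the measured availability drops below the threshold."""
    return availability is not None and availability < threshold
```

Starting with one function for the SLI and one for the alert keeps the first iteration small enough to try with a single team before adding further SLIs.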

From then onwards, kind of operate SRE. Iterate very frequently between what the development team actually needs and what the SRE infrastructure needs to provide. Basically, bring together that one team and the operations team implementing the infrastructure, and bring it to a point where that one team would say, “Yeah, this actually makes sense. It’s an improvement compared to before. It’s better now and we want 10 additional features from this SRE infrastructure”. I think that’s a good starting point. So that’s solid already. You’ve got an owner of the SRE infrastructure. You’ve got infrastructure used by a real team, and they want more. This is actually a good point to say, okay, now we know we are onto something useful and let’s go find the second team.

Then the second team can already make use of this SRE infrastructure that was built for the first team. They will, of course, have some different context, and this is where you can then decide which additional features to implement into the SRE infrastructure. From then onwards, it’s about going team by team, and talking to them, discussing with them whether SRE would be something for them. But usually, the more teams you onboard, the easier it gets, because the SRE infrastructure is already more mature every time you take on a new team, because you’ve got more features there, and you know that they’re useful. Because also there is a kind of social proof that this new thing SRE kind of helps with the services operations. I would say this is kind of roughly the journey that you would need to take, and this is also where the SRE coaches are essential, especially at the beginning.

Henry Suryawirawan: So I think that totally makes sense, right? You just do it gradually, building the infrastructure as well. Don’t forget about that. Because you can’t just tell the team, hey, you go implement without an infrastructure or platform provided by the SRE best practices. Because otherwise, people will implement it differently. I totally can relate as well, because I did try similarly, leaving the teams to implement it, but I guess people’s understanding varies. You would start seeing some teams which implement it properly. Some teams which probably just follow the request or the order, so to speak. Some actually just finish and then forget about that totally.

[00:36:15] SRE Journey Progression

Henry Suryawirawan: Which brings us to the key concept of SRE. So when your SLO is actually being breached, what do you do about it? Because I think that is maybe the tipping point, how the culture changed, right? Because if you have an error budget, it got breached, but nobody cares about it or does any action. So maybe from your experience, how do you handle this such that it becomes a tipping point where people understand, okay, this we have to take seriously.

Vladyslav Ukis: Yeah. So several steps there. Step one is to involve the product owner into the SRE coaching sessions from the beginning. The product owner is totally a full member of these discussions. Number two, once the team is mature enough to come up with the error budget policy, then this is something that’s really helpful, because it forces the team to think exactly about this question. What will we do when there is no error budget left for this service or for that service? It does this ahead of time before you’ve exhausted your error budget, and that provokes really good discussions. That also touches upon other things that the team is doing. For example, the team might be, say, planning to do some workshop in order to plan the next increment. What happens if your error budget gets exhausted exactly on that day? Do you then disband the workshop or what do we do with our planning? So I think that promotes really good discussion. You need to basically put it on paper. Our error budget policy is if this happens, that’s what we do. If that happens, that’s what we do. So that’s already definitely the next level of maturity.
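
To make "if this happens, that's what we do" concrete, an error budget policy can be written down as a small set of rules. The sketch below is a hypothetical example: the thresholds and the rule structure are assumptions, and the actions merely echo the options mentioned in the conversation (one engineer on reliability, reprioritizing the backlog).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PolicyRule:
    max_remaining: float   # rule applies when the remaining budget fraction <= this value
    action: str            # what the team agreed to do in that case


# Hypothetical policy for one service, agreed ahead of time with the product owner.
POLICY: List[PolicyRule] = [
    PolicyRule(0.0, "reprioritize the backlog towards reliability work"),
    PolicyRule(0.25, "assign one engineer to reliability"),
    PolicyRule(1.0, "continue feature work as planned"),
]


def agreed_action(remaining_fraction: float, policy: List[PolicyRule] = POLICY) -> str:
    """Return the action of the strictest rule that matches the remaining budget.

    remaining_fraction is the share of the error budget still left in the
    current period: 1.0 means untouched, 0.0 or below means exhausted.
    """
    for rule in sorted(policy, key=lambda r: r.max_remaining):
        if remaining_fraction <= rule.max_remaining:
            return rule.action
    return "continue feature work as planned"
```

Writing the rules down before the budget is exhausted is the point: the discussion about what each threshold should trigger happens calmly, ahead of time.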

But I think it’s important not to start with the whole error budget policy before the team is actually ready for this, because that’s a pretty advanced concept. That’s also how I put it in the book. So I’m talking about the basic SRE Foundations and advanced SRE Foundations. And the basic SRE foundations are SLIs, SLOs, and error budgets. The advanced SRE foundations are the error budget policies and error budget based decision making. So here we are already in the advanced space, and you need to be aware of this. It doesn’t make sense for the SRE coaches to force the discussion about the error budget policy because either people will just follow the discussion and forget about it the next day or they will just not be able to talk about this because they’re just too far away from this. In order for you to talk about the error budget policy, you need already everybody’s understanding about the SLIs, everybody’s understanding about the SLOs, everybody’s understanding about the error budget. Then the SLOs need to make sense. You need to already have several iterations on the whole thing. And then you need to reach a point where at some point, okay, so now we’ve exhausted the error budget, but we haven’t talked about yet, what we’ll do now? And this is where you bring that discussion. Not before.

Once the error budget policy is in place, that’s then good enough. But then the real test is, okay, so will that error budget policy be executed? Once it really reaches the point where we don’t have that error budget anymore for the service, and that’s important. Does the team really follow what they have done? That’s the next step. And then the step after is the error budget based decision making. I think that’s a really cool step. That’s at the top of the SRE concept pyramid. In the book, I put all those concepts in the pyramid, and that’s kind of at the tip of the pyramid at the top.

Basically, what you do, you start tracking your error budget consumption or your SLO adherence over time. You divide the time into error budget periods. So, it could be one error budget period is just one month. And then you plot all your services and their SLO adherence by error budget period. So, you can have a graph there. You know, you’ve got all your services on the Y axis, and on the X axis you’ve got, say, half a year’s timeline. So, it would be six error budget periods, and then you see where you broke your SLOs and where you didn’t break them and so on. If you break the SLO for the error budget period, it’s actually useful to see whether you broke it by just a couple of percent or whether you really kind of broke it significantly.

The same also applies if you are fulfilling your SLO, then do you fulfill it with a lot of error budget left at the end of the error budget period, or whether you’ve nearly exhausted your error budget, and therefore you kind of nearly broke the SLO. Those details are useful as well. But the cool thing about that kind of high-level dashboard is that especially the product managers, they are enabled to work with this, because then you can say over time from experience, we actually thought that we would have this certain service level, but we are not fulfilling this and therefore we need to prioritize some significant work in order to build reliability into these particular services. The cool thing is that you make those decisions really based on data and not just on the opinion of the architect, on the opinion of the developers, and so on, which depending on the team culture, may weigh a lot or not a lot to the product manager.
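
The data behind such a dashboard can be sketched as a grid with one row per service and one column per error budget period. In the hypothetical example below, the function names, the category labels, and the 20% band used to decide what counts as "nearly" are assumptions for illustration.

```python
from typing import Dict, List


def classify(measured: float, slo: float, narrow_share: float = 0.2) -> str:
    """Classify one service in one error budget period.

    The margin is expressed as a share of the error budget (1 - slo), so
    "narrowly" means within 20% of the budget on either side (an assumption).
    Assumes slo < 1.0, since a 100% target leaves no error budget at all.
    """
    budget = 1 - slo
    margin = (measured - slo) / budget  # > 0: budget left, < 0: SLO broken
    if margin < -narrow_share:
        return "broken significantly"
    if margin < 0:
        return "broken narrowly"
    if margin < narrow_share:
        return "met, budget nearly exhausted"
    return "met comfortably"


def adherence_grid(
    measured: Dict[str, List[float]], slos: Dict[str, float]
) -> Dict[str, List[str]]:
    """One row per service (Y axis), one entry per error budget period (X axis)."""
    return {
        service: [classify(value, slos[service]) for value in periods]
        for service, periods in measured.items()
    }
```

A product manager reading such a grid over six monthly periods can see at a glance which services repeatedly miss their SLOs and by how much, which is the basis for the data-driven prioritization described above.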

Henry Suryawirawan: So I can totally understand. Having this cool dashboard where you can see all the different services, how they’re performing, their SLOs, the error budgets, whether they’re breached or healthy, I aspire to go that way one day. Like you said, it’s a data-driven approach of knowing where to improve and when to improve it across different teams. And don’t forget the SLO here is a representation of users’ happiness. It’s not like an internal metric that you just want to capture, right? But it’s actually a representation of users’ happiness. So when you see it’s being breached, basically the users should be unhappy as well.

[00:41:48] Healthcare Reliability

Henry Suryawirawan: Which brings me to my next question. Because I always assume that the healthcare industry must be very reliable. All these devices that you mentioned, they should function perfectly. But we know in SRE that a hundred percent reliability is totally impossible. Maybe you can tell us more from your experience implementing it in healthcare? Is healthcare always like 99.99% reliable? Or is there any kind of explanation that you can give here?

Vladyslav Ukis: There are different devices and associated software systems in healthcare. There are hardware scanners. There, of course, your availability and reliability requirements are the highest. There are very strict medical device regulations that oblige you to perform certain tests and so on in order to reach that high level of reliability.

Then there are so-called post processing workstations. These are still on-premise software systems that, of course, need to also work reliably because the work in the hospital is very fast-paced and if a system doesn’t work, then that can block the entire workflow. But at least it doesn’t harm human life anymore if it doesn’t work.

So then there is cloud software, which are services that enable you to run your fleets of scanners more efficiently. If that doesn’t work, then at least you don’t block immediately the work in the hospital department, but you would block, for instance, the operations person, the COO of a hospital who is now trying to optimize the assignment of the patients to the scanners in order to be more efficient.

So there are different kinds of levels of reliability that you need to have there, and all that is governed by the medical device regulations. They’re also a bit different depending on the criticality of the system to the human lives.

Henry Suryawirawan: I’m interested, very much interested in the hardware scanners when you mentioned they have to be the most reliable. What happens when the error budget is about to be breached? Or maybe there’s an alert coming? Are there people scrambling? Because it could affect people’s lives, right? So tell us, when this situation happens, what actually really happens?

Vladyslav Ukis: Well, first of all, so far we have only applied SRE to software, but I think it’s a good idea to also think about how that could apply to hardware. The scanners themselves, they’ve got lots of checks and balances in order not to harm the patient. So, for example, you can think of a computed tomography scanner, a CT scanner. That CT scanner delivers some acceptable dose of radiation to the patient in order to acquire images of the patient. Of course, if the scanner notices that the patient would now be harmed because there is, say, too much radiation about to be delivered, then the machine will stop the scan. So immediately, the staff will be notified. The staff will then remove the patient from the room where the scanner is installed and so on. So there are definitely lots of precautionary measures there.

Then additionally, what the cloud software can do where we apply the SRE is that we’ve got, for example, dose management software. Dose management is a cloud-based application. It can also monitor, although not kind of in real time, what’s going on with the scans being done right now, but it can monitor things like, for example, imagine you are a patient that requires several scans. That means that, actually, your accumulated radiation dose over time can actually be exceeding a certain threshold that is also governed by the government regulations. So actually you need to keep track of the radiation, not just during a particular scan, but also overall over a period of time. If you’ve got some serious disease, then you might be prescribed to do certain scans within, say, a year or so. So what that software can do, it can actually keep track of your accumulated dose and then also issue warnings to the hospital staff that actually that patient already got so much and therefore be careful and so on.
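
The accumulated-dose idea can be illustrated with a short Python sketch. Everything concrete here is an assumption made for illustration: the function names, the 365-day window, the dose unit, and above all the threshold value, which is a placeholder and not a regulatory or clinical limit. Nothing below describes how the actual dose management software works.

```python
from datetime import date, timedelta


def accumulated_dose(scans, as_of, window_days=365):
    """Sum the doses of all scans that fall inside the look-back window.

    `scans` is a list of (scan_date, dose) pairs; the unit is whatever the
    caller uses consistently (hypothetical here).
    """
    window_start = as_of - timedelta(days=window_days)
    return sum(dose for scan_date, dose in scans if window_start < scan_date <= as_of)


def needs_warning(scans, as_of, threshold, window_days=365):
    """Return (warn, total): warn is True once the accumulated dose reaches the threshold."""
    total = accumulated_dose(scans, as_of, window_days)
    return total >= threshold, total


# Made-up scan history for one patient.
history = [
    (date(2022, 1, 10), 5.0),
    (date(2022, 6, 1), 7.0),
    (date(2022, 12, 1), 6.0),
]

PLACEHOLDER_THRESHOLD = 15.0  # illustrative only, not a real limit

warn, total = needs_warning(history, date(2022, 12, 15), PLACEHOLDER_THRESHOLD)
if warn:
    print(f"Warn staff: accumulated dose {total} has reached {PLACEHOLDER_THRESHOLD}")
```

A rolling window is used so that older scans drop out of the total over time, which matches the idea of tracking the dose over a period rather than per scan.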

As you can see, this is a very interesting industry where, at different levels, you require reliability, and that can be on the one hand really kind of life-critical right now to, okay, so it’s not life critical right now, but you still need to have some monitoring of the patient in order not to expose the patients to some harmful treatment.

Henry Suryawirawan: Thanks for sharing all this. The reason I ask is because some people might have perception, okay, this may only work for certain cases, maybe internet based software, right? But actually you can apply it almost in anything, especially in technology.

[00:46:25] Measuring SRE Implementation Success

Henry Suryawirawan: So let’s go back to the discussion about implementing SRE in your organization. So let’s say you have done it or the teams have already implemented it. Maybe from your view, how do you actually measure that this SRE transformation is successful? Are there any indicators that you probably check periodically to say that, okay, this SRE implementation is actually progressing really well? Maybe some tips here. I know that you have some tips in the book as well. Maybe you can give a summary of what kind of measurement you do in order to find out that your transformation is successful.

Vladyslav Ukis: Yeah, definitely. So, especially at the beginning, it’s a good idea to just see how many services the teams are onboarding onto the infrastructure. This is not outcome-based, of course, because you still don’t know whether it has improved reliability. But at least these output-based indicators at the beginning, they’re definitely good because, if teams are not willing to put their services onto the SRE infrastructure, then there’s no use in the infrastructure. That’s one thing.

Once you see, okay, so the number of services is growing on the infrastructure, so that’s good. Means teams are finding the infrastructure useful. Then the next thing is, okay, so do the teams define the SLOs? Is the number of SLOs actually growing? Then the next thing, okay, now that the SLOs are there, but are they being fulfilled? If there are SLOs but the majority of them are broken, then that means the teams just forget about them. They basically don’t make sense. Then you see, okay, so what is the percentage of SLOs that are being fulfilled?

Then the next step is coming back to those customer escalations that we were talking about earlier. It can easily happen that all the SLOs are perfectly green, so that’s all cool, but the customers keep calling. These are the measures that you can take in order to monitor the process of the SRE introduction.
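
The progression of indicators above can be sketched as a small check in Python. This is a minimal sketch under assumptions: the `RolloutSnapshot` fields, the `assess` function, the 50% cut-off for "the majority of them are broken", and the sample numbers are all hypothetical. It walks the same sequence: services onboarded, SLOs defined, share of SLOs fulfilled, and finally customer escalations compared against green SLOs.

```python
from dataclasses import dataclass


@dataclass
class RolloutSnapshot:
    """Counts taken at one review point of the SRE introduction (hypothetical shape)."""
    services_onboarded: int
    slos_defined: int
    slos_fulfilled: int
    customer_escalations: int


def fulfillment_rate(snapshot: RolloutSnapshot) -> float:
    """Share of defined SLOs that were fulfilled; 0.0 when none are defined yet."""
    if snapshot.slos_defined == 0:
        return 0.0
    return snapshot.slos_fulfilled / snapshot.slos_defined


def assess(previous: RolloutSnapshot, current: RolloutSnapshot, min_rate: float = 0.5) -> list[str]:
    """Compare two snapshots and list findings worth discussing."""
    findings = []
    if current.services_onboarded <= previous.services_onboarded:
        findings.append("onboarding stalled: no new services on the SRE infrastructure")
    if current.slos_defined <= previous.slos_defined:
        findings.append("SLO definition stalled: the number of SLOs is not growing")
    rate = fulfillment_rate(current)
    if current.slos_defined and rate < min_rate:
        findings.append("majority of SLOs broken: teams may be ignoring them")
    if current.slos_defined and rate == 1.0 and current.customer_escalations > 0:
        findings.append("all SLOs green but customers keep calling: revisit the SLOs")
    return findings


# Made-up numbers for two consecutive review points.
last_quarter = RolloutSnapshot(services_onboarded=5, slos_defined=10, slos_fulfilled=8, customer_escalations=3)
this_quarter = RolloutSnapshot(services_onboarded=8, slos_defined=16, slos_fulfilled=16, customer_escalations=4)
for finding in assess(last_quarter, this_quarter):
    print(finding)
```

The first two checks are output-based and the last two are closer to outcomes, which mirrors the order in which the indicators become meaningful during the rollout.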

Henry Suryawirawan: I like the way you describe it. Your error budget can be all green, but the customers keep complaining. So I think that is also like a big sign that actually you might have missed the most important SLOs of all, right? Because the true measure of implementing SRE correctly, your customers should be happy, and your error budget is a representation of that. In your book, you also mentioned a couple of other things, like, for example, perception from the partners that you work with, not just the users. I think that also matters, because sometimes if you are like a SaaS provider, right? So some partners connecting to you also matter whether they complain or not. So, thanks for all this measurement.

[00:48:44] 3 Tech Lead Wisdom

Henry Suryawirawan: Dr. Vlad, it’s been a great conversation so far, but unfortunately, due to time, we have to wrap up. But before I let you go, I have one last question that I always love to ask my guests, which is to share your version of three technical leadership wisdom. Is there something that you want to share with us here, Dr. Vlad?

Vladyslav Ukis: Let me think of something kind of in the context of the SRE transformation, but still kind of applicable to general technical work.

One thing that’s important is to act in context. I think that’s important because there are very many technologies, methodologies and so on, but it’s the context that actually decides whether this particular thing that you, you know, have heard about, have read about, have thought about would actually fit. So, I think this kind of acting in context is an important, I’d say even, principle to follow.

Another important aspect, I would say, is to practice servant leadership. That means, on the one hand, trying to show the way to the organization, to the teams, but on the other hand, also being very well aware of their context, and therefore, kind of inviting them to go on a certain journey instead of coercing them to adopt a particular thing. On the one hand, lead, but on the other hand, do this in a servant way. I think that’s important.

The third thing would be, I’d say, all these things that we have talked about, they’re kind of not easy. For the developers, it’s not easy to just jump on call, right? It’s a totally different game if you’ve never done this, or for the operations people to suddenly start acting as real software developers providing frameworks for the others to use and so on. It’s a totally different world. And for the product managers to suddenly be involved in operations, it’s also, you know, out of their world so far. So I think kind of acknowledging that this is not easy, empathizing with the people and, specifically, as a third point, kind of looking for ways to motivate people. I think this is really important to make some change stick. Especially on the SRE transformation journey, there are so many places where you can point out very small wins. I think, really, pointing out those small wins, that now as a team, we’ve done this, that really motivates the people. And taking this seriously from the SRE coach point of view and really seeking things that people can be thanked for, thereby driving the motivation, I think is really, really important.

Henry Suryawirawan: Wow. I love the last one. So first, you have to acknowledge people’s situation. Empathize with them and find motivations and celebrate the small wins. I think that’s always important for any kind of leader.

Dr. Vlad, if people want to learn from your journey, so maybe they want to establish the same SRE transformation in their organization, and they want to learn from you. Is there a place where they can find you online, connect with you, or ask questions?

Vladyslav Ukis: Yeah, definitely. I can be easily found on LinkedIn. Now on LinkedIn, there are several conversation threads about the book. I think that would be also the best way to just join those threads or start new ones where we can discuss those things.

And also, thank you very much for the conversation, for the questions. I think that brought out really good aspects and was really fun to talk about. And while we were doing this, I was thinking that there is probably another myriad of things that we could talk about, because that topic is so vast.

Henry Suryawirawan: Yeah, when I saw your book, it’s very thick and lots of good information there. I haven’t finished reading it, definitely. But people, do check it out. So I think it’s really like transformation based kind of learnings, right? It provides you like the rationale of how you implement SRE, the step-by-step journey, including to the end, measuring the transformation, making sure that you are successful in your journey.

So thanks Dr. Vlad for this conversation. I really love it. So good luck with your future improvement for your SRE transformations in your company.

Vladyslav Ukis: Thank you very much. Thanks a lot for having me. And thanks to everyone who took the time to listen to this episode. Thank you very much and let’s improve operations together.

– End –