#96 - Practical Guide to Implementing SRE and SLOs - Alex Hidalgo
“Reliability is the most important thing. Your users define your reliability, so make sure you’re measuring the right thing. And 100% is out of the question, so pick the right target.”
Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and author of "Implementing Service Level Objectives". In this episode, we discussed the practical guide on how to implement SRE and SLOs. Alex started by explaining the basic concept of service reliability and the three service truths. He then explained the concept of the reliability stack, which includes the famous SRE concepts: SLIs, SLOs, and error budgets. Alex then shared his insights on how we can define a service reliability target, why a higher reliability target is expensive, and the risk of a service being too reliable. Towards the end, Alex shared his tips on how we can build an SRE culture and how we can use the error budget as a communication tool within the organization.
Listen out for:
- Career Journey - [00:07:19]
- Understanding SRE & SLO - [00:14:17]
- Service & Reliability - [00:17:30]
- Service Truths - [00:21:06]
- Reliability Stack - [00:23:45]
- Defining Reliability Target - [00:27:11]
- Higher Reliability is Expensive - [00:29:27]
- SLI - [00:34:26]
- Measuring Correctness - [00:37:30]
- Critical User Journey - [00:41:49]
- Setting SLOs - [00:45:55]
- Being Too Reliable - [00:47:18]
- Communicating with Error Budget - [00:51:02]
- Building SRE Culture - [00:54:13]
- 3 Tech Lead Wisdom - [00:57:57]
_____
Alex Hidalgo’s Bio
Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and author of “Implementing Service Level Objectives”. During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex’s previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
Follow Alex:
- Twitter – @ahidalgosre
- Nobl9 – https://www.nobl9.com/
- Website – https://www.alex-hidalgo.com/
Mentions & Links:
- 📚 Implementing Service Level Objectives – https://amzn.to/3OWJluA
- 📚 Site Reliability Engineering – https://amzn.to/3ao01w8
- 📚 The Site Reliability Workbook – https://amzn.to/3ysaNcC
- Admeld – https://www.admeld.com/
- Squarespace – https://www.squarespace.com/
- Kubernetes – https://kubernetes.io/
- Docker – https://www.docker.com/
- Apache Kafka – https://kafka.apache.org/
- Logstash – https://www.elastic.co/logstash/
- Elastic Search – https://www.elastic.co/
- Kibana – https://www.elastic.co/kibana/
- Chubby – https://people.cs.rutgers.edu/~pxk/417/notes/chubby.html
Skills Matter is an easier way for technologists to grow their careers by connecting you and your peers with the best-in-class tech industry experts and communities. You get on-demand access to their latest content, thought leadership insights as well as the exciting schedule of tech events running across all time zones.
Head on over to skillsmatter.com to become part of the tech community that matters most to you - it’s free to join and easy to keep up with the latest tech trends.
Tech Lead Journal now offers you some swags that you can purchase online. These swags are printed on-demand based on your preference, and will be delivered safely to you all over the world where shipping is available.
Check out all the cool swags available by visiting techleadjournal.dev/shop. And don't forget to show them off once you receive any of those swags.
Understanding SRE & SLO
- Some of the biggest problems with understanding what SRE and SLOs are is that no one's really ever fully defined them.
- There is, of course, the first Google SRE book. But it's 30-something chapters, and not a single team at Google actually does all those things anyway. They're best practices. They're aspirational.
- The original definition comes from Ben Treynor Sloss, the inventor of Site Reliability Engineering: SRE is what happens when you ask software engineers to solve operational problems. That's great, but it's also very vague.
- There isn't one definition, and I would actually say it doesn't really matter as long as our end goals are kind of the same. And I see the same with SLOs.
- In the first and second Google SRE books, they don't even define an SLI, a Service Level Indicator, in the same way, and it's a very important part of how to do SLOs. The two books don't even agree. I think part of the confusion is that there often aren't single resources to point people at. We don't have degrees for this.
- When it comes to more philosophical things like Site Reliability Engineering and Service Level Objectives, I think one of the reasons people sometimes struggle is that it can be more difficult to start from scratch, because there aren't strictly defined resources that people can just learn from.
Service & Reliability
- A computer server is not very much different from a human server. It is a thing that takes your request and responds to it correctly. A computer service is something that listens to a request from something and responds to it appropriately. That is the best way to think about it, instead of trying to define exactly what it means at a technological level. Cause I don't think that's important.
- People often conflate reliability with availability, but they're very different things. A service can be available and not be reliable.
- Reliability is an old term. Reliability engineering goes back to the 1940s. It's not a concept unique to SRE or Google or tech or computers.
- What reliability means is: is this system performing how it was defined to perform? For computer services, that basically means: is it doing what it needs to be doing? If we are cool with the concept of a service being a thing that does something for someone else, then reliability is: is that thing doing what it's supposed to be doing?
- The reason I always try to steer people away from thinking about it as just availability is that you can be very available and still be doing a bad job.
- A service is anything that's doing something for something else. And reliability means: are you doing that well enough?
Service Truths
- I personally believe that there are three things that are true about any service.
- One is that reliability is its most important feature. If your service is not being reliable, it's not doing much. So you always need to be thinking about reliability first.
- The second truth is that you don't get to decide what your reliability is. I don't care what your measurements say. I don't care what your logs say, what your metrics say. I don't care if you have a million healthy pods, all reporting up. If your customers, your users, anyone that depends on you thinks you're being unreliable, you are. If you're not meeting the needs of your users, you're not being reliable. So you need to take their perspective into account.
- The third service truth is that nothing is ever a hundred percent, so don't aim for it. This is just a truth of the world. Outside of pure mathematical constructs, nothing's ever a hundred percent. Things fail. Failures occur. It turns out that people are actually fine with failure, as long as failure doesn't happen too often. Don't aim for a hundred percent, because that's a fool's errand. Instead, pick a more reasonable target.
- Understand what reliable means. Understand that someone else determines what reliable means. Understand that reliability is the most important part of your service. And then make sure you're only trying to be reliable enough. Something that works for both you and the people that depend on you.
Reliability Stack
- You have three primary components of what I call the reliability stack.
- SLIs, or Service Level Indicators. Service Level Indicators are measurements. They are bits of telemetry about your system that tell you: is it doing what it's supposed to be doing? And, again, this should be from your user's perspective, or as close to it as you can get.
- SLOs, or Service Level Objectives. Service Level Objectives are targets for how often you want your SLI to be true.
- If your SLI is able to tell you, "Yes, we're currently meeting the expectations of our users", or "Oh, we're currently not meeting the expectations of our users", you're now able to inform a ratio. Good events over total events equals a ratio, equals some kind of percentage. Your SLO is a target for what you want that percentage to be (a minimal worked example follows this list).
- You can pick something more reasonable. We just talked about how a hundred percent is impossible; no one ever hits that anyway. Pick a target that's realistic, that accommodates failure, and that makes sure you're not spending too much money aiming for something you can't hit.
- At the top of the reliability stack, you have error budgets. Error budgets are just a way of thinking about how your SLO has performed over a period of time.
- An error budget takes into account a time window that's often fairly large, generally anywhere from a week to 28 or 30 days, sometimes even a whole quarter. The idea is that your error budget is the other side of the percentage.
- Your error budget measures: are we failing more often than allowed, or the right amount? Your error budget lets you think about things over periods of time. Over the last 30 days, what percentage of the time have we failed? Is that helping us meet our SLO target? Are we exceeding it?
- A budget is something you can spend. And an error budget is exactly that. What amount of it have you spent over time? That then helps you make better decisions about where you need to focus your attention.
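To make that math concrete, here is a minimal sketch in Python of the three layers described above. The event counts and the 99% target are made-up illustration values, not numbers from the episode.

```python
# A minimal sketch of the reliability stack math: SLI ratio, SLO target,
# and error budget. All numbers are made up for illustration.

good_events = 99_412     # e.g. checkout requests that did what the user needed
total_events = 100_000   # all checkout requests in the measurement window

# SLI: good events over total events.
sli = good_events / total_events                      # 0.99412

# SLO: the target for that percentage.
slo_target = 0.99                                     # 99%

# Error budget: the other side of the percentage.
allowed_failures = (1 - slo_target) * total_events    # 1,000 events may fail
actual_failures = total_events - good_events          # 588 events did fail

budget_spent = actual_failures / allowed_failures     # 0.588 -> 58.8% spent
print(f"SLI {sli:.3%}, SLO {slo_target:.0%}, error budget spent {budget_spent:.1%}")
```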
Defining Reliability Target
- It depends on your service. The important thing is to be meaningful and to be thoughtful. Think about: what is my service? Who are my users?
- I say "user" a lot, but I don't just mean customers. I mean anyone that relies on your service. It might be another team down the hallway. It might be internal people in your company. It might be another service. Your service might be six layers removed from an actual human, but that other service is still a user of your service.
- There isn't a single answer. Be thoughtful about it. Take your time. Think: what does my service do? What is it supposed to do? What do the users expect? Can I go talk to those users? If I can, let's go talk to them. Let's ask them directly.
- Is there a product management team? Do they have a user journey document? Maybe we should go look up what the defined user journeys are. Why was the software written in the first place?
- I don't have a special formula people can use to instantly understand what they need to be thinking about when they pick how reliable they need to be. But what I can say is that in every single case, if you're careful and you're thoughtful, that'll lead you to the best answers.
Higher Reliability is Expensive
- If you want to ensure that you are being more reliable, you need to ensure you're having fewer failures.
- Every individual component fails sometimes, so if you add more and more of them, you're just going to have more and more failures. You end up having to build more and more complex systems to ensure that you are, in fact, being more and more reliable over time.
- This is sometimes totally reasonable. For some systems and some services, you do need to hit very high targets. This means you need to be distributed. You need to ensure you're built for high availability and quick failovers and redundancy.
- But once you introduce so many things, you're also spending a lot more money. Whether this is running on your own hardware, in your own data centers, or everything's in the cloud, you're now running more things, so it costs more money. Both in the monthly bill, and in the fact that you now need more engineers to take care of it. If you have a hundred components instead of one, you need more engineers to take care of those hundred components. So now you have to hire a bunch of engineers.
- If you're trying to hit a really high reliability target, what a lot of people don't always do the math about is what that really means in terms of time.
- If you want four nines of reliability, that gives you only a handful of minutes of downtime per month. You can barely have any downtime at all, ever. So now, what do the on-call rotations look like? (See the rough numbers in the sketch after this list.)
- The closer you try to get to a hundred percent, the more and more that grows. It's exponential. If we can all agree that you logically cannot ever hit a hundred percent anyway, it turns into a limit that approaches infinity forever. You'll never actually hit a hundred percent, and trying to get there will just cost you more and more money in terms of how many services you need and how much you're paying your cloud providers.
- Often, leadership decides we're going to hit this target. They pick a bunch of nines all in a row, because they think that's reasonable or because they know that some Google services hit that. But you're probably not Google. Google has all those things. Google does have teams all over the world. Google does have data centers all over the world, and you probably don't.
- Just be reasonable, be thoughtful. Think about the targets you're picking and make sure that they make sense for you.
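As a rough illustration of the time math behind those targets, here is a small Python sketch (not from the episode) that converts a reliability target into the total downtime it allows in a 30-day month.

```python
# Rough downtime allowed per 30-day month for a few common reliability targets.
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes

for target_pct in (99, 99.9, 99.99, 99.999):
    allowed_minutes = (100 - target_pct) / 100 * MINUTES_PER_MONTH
    print(f"{target_pct}% -> about {allowed_minutes:,.1f} minutes of downtime per month")

# 99%     -> about 432.0 minutes (roughly 7 hours)
# 99.9%   -> about 43.2 minutes
# 99.99%  -> about 4.3 minutes
# 99.999% -> about 0.4 minutes (under 30 seconds)
```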
SLI
- I have mixed feelings about the golden signals. I kind of wish we hadn't written about them in the first SRE book the way we did. I feel that, by having labeled them the golden signals, too many people think they're the only things that matter. And I see a lot of people both start and stop there.
- There's nothing wrong with measuring availability. There's nothing wrong with measuring latency. There's nothing wrong with measuring throughput. That's not my issue. My issue is that people often start and stop there.
- They are, in fact, good starting points. Absolutely they are. So if you're asking me, are they good starting points? Yes. If that's where you need to start, start there. Almost everyone can measure those things.
- But those things rarely tell the whole story. They rarely tell you what your users are actually experiencing. As we were talking about earlier, reliability is what your users need from you.
- I think people can graduate. They can upgrade to a more user-journey-based focus: is this doing what users need it to do? That includes many levels beyond what the golden signals tell you.
- I think they were accidentally taken too seriously, or as too much of an end goal, the way they were written about. But I do think they are good starting points, because almost everyone has that data already, and if you don't, it can generally be pretty easy to instrument. It can be much more difficult to measure: did we send the data that the client asked for? That can be a lot more work. It takes a lot of time to get there.
- A really good SLI is ever-changing. It's ever-adapting. The world's going to change. Your users' expectations are going to change. What your service is even defined as doing is going to change over time.
- Make sure that you keep looking at your SLIs. Look at those measurements and ask: is what we're measuring still telling us enough? Is it still telling us what our users are experiencing? Is it still looking at things from their perspective?
- So start with errors, start with latency, start with availability. But move on from there. Think about what else your users actually need (a sketch of what that can look like follows this list).
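As a rough sketch of moving beyond the golden signals, here is a hypothetical SLI in Python that only counts a request as good when the user actually got a correct answer, not just a fast 200. The field names and the latency threshold are made up for illustration.

```python
# A hypothetical SLI that layers correctness on top of golden-signal checks.
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int
    latency_ms: float
    returned_correct_data: bool   # e.g. the response matched what the client asked for

LATENCY_THRESHOLD_MS = 300        # made-up threshold for "fast enough"

def is_good(r: Request) -> bool:
    # Golden-signal style checks: errors and latency...
    if r.status_code >= 500 or r.latency_ms > LATENCY_THRESHOLD_MS:
        return False
    # ...plus what they miss: did we send the data the client asked for?
    return r.returned_correct_data

def sli(requests: list[Request]) -> float:
    """Good events over total events."""
    if not requests:
        return 1.0
    return sum(is_good(r) for r in requests) / len(requests)

print(sli([Request(200, 120, True), Request(200, 95, False), Request(503, 40, False)]))
```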
Measuring Correctness
- It depends, again, on your service. A favorite of mine is using synthetic checks (a sketch follows this list).
- You don't always need as much resolution as you think you do. Even when we only had a few of these checks per minute, as opposed to hundreds of thousands of requests per second, because they covered so many components of the system, the data was just as good.
- But you can also set SLOs off of actual, 100% user data. That requires advanced tracing solutions. It requires a lot of work.
- Your SLI could be set off of actual human traffic. Your actual user requests could be used for that as well. Or anything in between.
- If there's anything I want to drive home, it's: be thoughtful. Be meaningful. Do what works for you. Don't overspend resources trying to do real user-journey tracing. That's not reasonable for everyone. Not everyone has the resources, the knowledge, or the time to do that. So maybe set up a synthetic check instead.
- Or, you know what? Let's take another step back. Maybe for you right now, it's just latency and error rate, just the golden signals.
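Here is one way a synthetic correctness check could look, as a minimal Python sketch. The endpoint, the expected response fragment, and the one-minute interval are all hypothetical placeholders, not anything prescribed in the episode.

```python
# A sketch of a synthetic probe: periodically exercise a known request end to end
# and record a good/bad event for the SLI based on correctness, not just availability.
# The endpoint, expected fragment, and interval are hypothetical.
import time
import urllib.request

ENDPOINT = "https://shop.example.com/api/socks/blue-42"   # placeholder URL
EXPECTED_FRAGMENT = b'"sku": "blue-42"'                    # what a correct answer contains

def probe() -> bool:
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
            body = resp.read()
            # Correct, not merely available: the right data came back.
            return resp.status == 200 and EXPECTED_FRAGMENT in body
    except Exception:
        return False

good = total = 0
while True:
    total += 1
    good += probe()
    print(f"synthetic SLI so far: {good}/{total} = {good / total:.3%}")
    time.sleep(60)   # one check a minute is often all the resolution you need
```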
Critical User Journey
- A critical user journey is generally something that's defined by the product side of your organization, telling you what needs to happen with your product for you to be a successful company. Generally, in this case, that means making money.
- It is the expected way that a feature will work. And a feature very rarely maps directly to a single service, in the sense of a single microservice or a single team owning it.
- User journeys generally span multiple components of your system. That generally means there are many different teams, not just one, responsible for all the components that the user journey travels over.
- They're a good analog for what a good SLI is. A good Service Level Indicator is a user journey. A user journey is a KPI, or key performance indicator. They're all similar in the sense that they're all measurements, likely having to do with your customers or your users, and none of them are ever going to be a hundred percent, so you can set targets on all of them.
- An SLI is a user journey, is a KPI. They're all kind of the same thing. Different business units just have slightly different ways of talking about them.
- Individual teams need SLIs and SLOs set on their own services. They need to understand how their services are operating for the things that depend on them. But then, perhaps the director of your organization needs to own SLIs and SLOs for the user journey stories that go across many different services. Perhaps the VP of Engineering needs to own the concept of: can you check out? So the checkout microservice team has their own SLO that measures how often their microservice is not throwing an error. The director of the commerce department owns an SLO that says: does the checkout workflow work? The VP of Engineering owns an SLO that says something like: are users able to use our website and send us money?
- These things build on top of each other. I think every step of that way needs its own SLO, and you can use those individual SLOs to inform an overarching SLO. The SLI for an SLO could be the SLO status of a different SLO (a small sketch of this layering follows this list).
- There are no super hard and fast rules here. It's just: are you measuring things? Are you taking your users into account? And are you making sure you're not trying to be a hundred percent?
Setting SLOs
- It's solely dependent on your service. Make the right decisions. Make sure you have enough that you're covering all the important aspects of your service. Don't choose too few, because if it's too few, you can't understand what's going on.
- And also, make sure it's not too many. If you have too many, you run into a multiple-comparison problem where you have too much data, you don't know what to look at, you don't know what's telling you what, and you no longer have a good idea of what these signals mean anymore.
- The Google SRE book said five or six per service. That seems reasonable to me. But again, like so many things, I don't like hard and fast rules. I like approaches. I like philosophies. So don't feel bad if you don't have five. Don't feel bad if you have many more than five. Just make sure it's the right amount for you.
Being Too Reliable
- The main problem you run into by being too reliable is that people will end up expecting you to continue to run that way. If your users were previously okay with you being reliable only 99% of the time, but then you spend a year being 99.9% reliable, their expectations may have changed. Now they're going to hold you to that 99.9%, even though they were totally happy with 99% before. So you paint yourself into a corner if you are too reliable too often, because user expectations will change.
- What you want to do is make sure that you aren't being too reliable. You may do this accidentally. You may just have a few months where everything was super lucky. Or maybe your whole team went on vacation, you didn't touch anything, so nothing broke for a while. But then you run into the problem that people are now going to expect this moving forward.
- Once you paint yourself into that corner, you might be stuck being held to a level of reliability that's otherwise too expensive and too difficult.
Communicating with Error Budget
- What error budgets let you do is tell others how reliable you have been. Your SLO target tells others how reliable you want to be, and your error budget lets you tell other people how reliable you've actually been.
- The reason that's such a good communication tool is that it helps other people figure out what their own SLOs should be, and it also helps other people understand what all of this actually boils down to.
- It gives you a more human-friendly way of communicating these kinds of historical reports. What did Q1 look like? What did 2021 look like? It's an eventual output from an SLO-based approach that helps you go have conversations with people, whether it's via the time-based error budget definitions.
- Or just by being able to say: look, we exceeded our error budget every quarter last year. We believe we're aiming for the right target, but we can't hit it. So we need more resources, or we need more headcount. Or it could be the opposite.
- It could be: hey, we've got a ton of error budget left over. Maybe we move some of the staff over to this other project that's having some problems. Or maybe we should be moving quicker. Maybe we should be shipping features more often. Let's spend that budget. Let's ship more features. Let's experiment. Let's try things. Let's do chaos engineering.
- There are so many different things that error budgets let you do, but they almost all involve communicating with other people.
Building SRE Culture
- The best advice I can give, or at least my favorite advice, because it's not always possible for everyone: if possible, just get started. Just do it. Just pick a service, pick an SLO, and measure it.
- Maybe it's totally the wrong target and maybe your SLI is a bad one. That's fine. Pick a new one. They're not agreements; they're objectives. Just get started with it, start gathering the data, and see what you can start doing with it. Maybe at first it's just you, and then maybe it's your team, or maybe you can get your whole team on board right away.
- I really think SLOs are often a bottom-up thing. They're very often organically grown. People can be sold on them philosophically, but they don't understand why they need to spend time implementing them until someone's given them the hard data.
- If it doesn't give you what you want, maybe pick different targets, maybe pick different measurements, or maybe you're not quite ready for them. That's also totally possible.
- Every industry out there already, in some way, understands that failure happens. It's not just computer services. Embracing failure, while ensuring you're not failing too often, is always a good thing. It will eventually lead to happier engineers, a happier business, happier users, and happier customers.
- It's all about embracing those service truths that we talked about at first. Reliability is the most important thing. Your users define your reliability, not you, so make sure you're measuring the right thing. And a hundred percent is out of the question, so pick the right target. You can embrace those truths without real-time monitoring and advanced statistics and all the stuff that comes along with it. Just get started. Even if it's in a spreadsheet. Even if it's only once a month.
3 Tech Lead Wisdom
- Be kind.
- It goes a very long way. You're dealing with other humans, no matter what you think your job is. No matter what your computer service is, it exists at some level for other humans, whether they are your customers or your end users or your coworkers or whatever.
- Be kind, just be nice to each other. Don't be pompous. Try to always remember that every decision you make impacts other people. Initiate those decisions with kindness.
- Be thoughtful.
- My most meaningful mantra at this stage in my career is "think things over". Sometimes you need to react. Sometimes you're in an emergency. Sometimes you're dealing with an incident or immense business pressures, or your company is under immense financial strain.
- But you always have time to be thoughtful. You always have time to take at least a few seconds and ask: is this the right thing I'm about to say? Is this the right thing I'm about to do? Is this the correct action I'm about to take?
- Adopt blamelessness.
- Make sure that your organization is building an appropriate culture, where we understand that humans don't make mistakes on purpose.
- Every single one of us makes mistakes, every single day. But generally speaking, unless you're a bad actor literally trying to bring down the company from the inside, people aren't making mistakes on purpose. Always remember that.
- Adopt blamelessness. Combine that with thoughtfulness. Combine that with kindness, and be better to each other.
[00:02:00] Episode Introduction
Henry Suryawirawan: Hello to all of you, my friends and my listeners. Welcome to the Tech Lead Journal podcast, the show where you can learn about technical leadership and excellence from my conversations with great thought leaders out there. And today is the episode number 96. Thank you for tuning in and listening to this episode.
If this is your first time listening to Tech Lead Journal, make sure to subscribe and follow the show on your podcast app and social media on LinkedIn, Twitter, and Instagram. And for those of you who enjoy this podcast and want to contribute to the creation of future episodes, support me by subscribing as a patron at techleadjournal.dev/patron.
Implementing SRE concepts and best practices can be daunting! Although Google released a few SRE books, including the famous "Site Reliability Engineering" book, many of us still have some gaps in terms of really understanding the essence of the concepts and practices, such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. And on top of that, how can we start building a good SRE culture and avoid some common pitfalls, especially when communicating the benefits of this set of practices, for example, to the business or stakeholders? Also, do tools matter in implementing SRE? How reliable should our service be? And how should we measure it? These are some of the common questions that I know people usually ask when introduced to SRE concepts. And if you do have the same questions and thoughts in your mind, then today's episode is definitely for you.
My guest for today's episode is Alex Hidalgo. Alex is the Principal Reliability Advocate at Nobl9 and the author of the "Implementing Service Level Objectives" book. Alex previously worked at Google as a Site Reliability Engineer and Customer Reliability Engineer, and contributed multiple chapters to "The Site Reliability Workbook". In this episode, we discussed the practical guide on how to implement SRE practices and Service Level Objectives, or SLOs. Alex started by explaining the basic concept of service reliability and the 3 service truths. He then explained the concept of the reliability stack, which includes the famous SRE concepts: SLIs, SLOs, and error budgets. Alex then shared his insights on how we can define a service reliability target, why a higher reliability target is expensive, and the risk of a service being too reliable. Towards the end, Alex shared his tips on how we can start building an SRE culture and how we can use the error budget as a communication tool within the organization.
I very much enjoyed my conversation with Alex. And even though I have been learning the SRE concepts for quite some time now, there are still a number of insights that I learned from Alex in this episode. And if you also find this episode useful, please share it with your friends and colleagues who can also benefit from listening to this episode. Leave a rating and review on your podcast app and share your comments or feedback about this episode on social media. It is my ultimate mission to make this podcast available to more people. And I need your help to support me towards fulfilling my mission. Before we continue to the conversation, let’s hear some words from our sponsor.
[00:06:37] Introduction
Henry Suryawirawan: Hey, everybody. Welcome back to Tech Lead Journal podcast. Today, I have a guest with me named Alex Hidalgo. He's the Principal Reliability Advocate at Nobl9, the author of a book titled "Implementing Service Level Objectives", and a contributor to another book in Google's SRE series titled "The Site Reliability Workbook". So as you can tell, today we are going to talk a lot about SRE and SLOs, and things like that. So if you are wondering about these practices, like SLOs and how to set them right, we're going to cover that today. So Alex, really thank you so much for your time today. Looking forward to this conversation.
Alex Hidalgo: Thanks so much for having me.
[00:07:19] Career Journey
Henry Suryawirawan: So Alex, I'd like to actually start by asking my guest to tell his or her story. Any career turning points, any highlights. So maybe you can share yours?
Alex Hidalgo: Sure. My route to where I am today has been an interesting one. I grew up as a computer nerd. My dad helped teach me how to program when I was eight or nine years old. I was on the internet before the web existed. Back when it was all still texts. So, through most of my education, I always assumed I wanted to work with computers. So, I didn’t go to college after high school. I went, and I got a job in the tech industry. But that first job was doing network security work for the governments of the US, and I hated it. I was pretty good at it. I actually got a promotion within a year, but it made me miserable. And so, I thought I didn’t want to work with computers. I thought computers were just a hobby for me. So I quit my job and realized it was still time for me to go back to school. So I started college a few years after most people. I ended up studying philosophy and history. And I took a bunch of creative writing courses. All the while, computers were still a hobby, but I kind of decided computers weren’t for me as a career.
Then I decided to move to New York at the height of the 2008 recession. So right when the global economy, and especially here in the US, things were just absolutely tanking. Suddenly, I’m in one of the most expensive cities in the world, and I can’t find a job. My money’s running out, whatever I had of savings at the time. Because in my twenties, I worked in the service industry. I was a server. I was a cook. I worked in a warehouse. My money was starting to burn out. I kind of realized I can probably still do this computer thing, even if it’s just for now, even if it’s just to make some money and get back on my feet.
The very first tech job I applied for, which was an IT desktop support kind of position. I was hired right away. I actually had so much fun. It was at a small company, only about 10 people. We were the IT department for companies that didn’t have their own. So every day I’d travel all over New York city, every borough, all over the place, just helping people with whatever they needed help with. And I really loved it. I loved that human part. That interacting with other people and helping them learn how to use their computers. Fast forward a few years, I’m working for this company called Admeld now, and then Admeld is purchased by Google. And suddenly, I’m at Google as a Site Reliability Engineer. I didn’t even know what that meant at first. But as I learned more and more about it, it spoke to me so much. It was really what I had always meant to do. It was every bit of my personality, like the things I truly love. It was all kind of reflected in it. The human aspects, thinking about users, putting humans first, blameless culture for incident response and incident retrospectives. Just everything about it I absolutely loved. So I really fell into the role really well.
Fast forward a few more years. My last role at Google before I left was on the Customer Reliability Engineering team, or CRE team. The CRE team was a group of fairly experienced SREs. We were tasked with teaching Google's largest cloud customers how to SRE. How can people make their services more reliable? The customers we were engaging with were not other tech companies, they were not subdivisions of Google, right? They were retailers, and they were industrial manufacturing companies, and they were all over the place. So we realized we needed a common vernacular. We needed a shared language. And what we decided was SLOs, Service Level Objectives, would be that shared language. So, I spent a good year traveling all over and teaching all sorts of people what SLOs are and why they're so great and how to use them. Cause that was really the building block that the CRE team wanted in order for us to engage with people in the best possible way. I had some fun doing that, but it was my time to move on from Google. I'd been there for a long time. Everyone needs to change at some point.
So I moved over to Squarespace. When I started at Squarespace, I was asked, "Hey, we want to do SLOs. You know how to do SLOs. Can you help us?" And I was like, "Sure, no problem." I set this up with my manager: I was going to spend about 60% of my time on my team's work and about 40% of my time teaching the entire organization, including my own team, how to do SLOs. I didn't realize how much work that really was when you're getting started from scratch. When you are starting from just the absolute bottom, where people barely even understand what all the different terms mean, it's a lot of work. You need to build the right tooling. Chances are your monitoring systems don't even do SLO math in the first place. You need to do education and workshops. You need to build document repositories. It was a lot of work.
We didn’t define our first SLO at Squarespace for about six months after I started. After about a year and a half of that, I was running a workshop every Friday afternoon for four hours from noon until 4:00 PM, and we had a break. But still, it was a four-hour workshop, every single Friday, and people really liked them. They were popular. But I was just tired of saying the same thing over and over again. So at some point, I was complaining to a friend of mine there, like a good coworker, and I was just like, “I wish there was a book about this. An entire book, not just a chapter in the SRE workbook, but a whole book. Because that way, I could point people at it. Instead of just doing these workshops over and over again.” My friend said, “Well, you should write it.” And I said, “No, an expert should write it.” And he said, “You are the expert.” And I cursed, like I straight up just like said curse words. Cause I knew he was right. I just didn’t realize it until he had said it. I suddenly knew I was writing a book, and I’d always heard how difficult that is. A few months later, I was working on a book for O’Reilly, “Implementing Service Level Objectives”. That’s directly led me to where I am today.
I’m now at Nobl9, which is a startup based entirely around how to do Service Level Objectives. How to measure them? We do all the tooling for you, etc, etc. So, I came to fall in love with SLOs at Google because I realized they made my life better and they made my users’ lives better. They make humans happier, and that’s always been the most important thing to me. I’ve been very lucky to have been able to focus primarily on SLOs for last six, almost seven years of my career now. It’s a great place to be.
Henry Suryawirawan: Thank you so much for sharing your story. It is a very beautiful story. A lot of ups and downs, including how you got to find your passion. And hopefully, today you are not tired of speaking about SLOs one more time. At least today we're going to cover some of the basics. Thank you so much for sharing this story.
[00:14:17] Understanding SRE & SLO
Henry Suryawirawan: So Alex, you wrote this book, "Implementing SLOs". There are a number of books, not many, but a number of books about SRE, SLOs and all that. But I still find that people find it difficult to grasp the concepts. What do you think are some of the challenges that people face in understanding these practices?
Alex Hidalgo: So I think some of the biggest problems with understanding both what SRE, Site Reliability Engineering, is, as well as what SLOs, Service Level Objectives, are, is that no one's really ever fully defined them. There is, of course, the first Google SRE book. But you know what? It's like 30-something chapters and not a single team at Google actually does all those things, anyway. They're best practices. They're aspirational. And, sure, there's the original definition from Ben Treynor Sloss, basically the inventor of Site Reliability Engineering. He said SRE is what happens when you ask software engineers to solve operational problems. That's great, but that's also very vague. And then, what is the difference between SRE and DevOps? Is a DevOps engineer even a thing? Or is DevOps just an approach? And then marketing teams got a hold of it, right? Suddenly the SRE book was selling really well. So now suddenly, oh, we're going to have SRE companies and we're going to have tooling that is SRE tooling. There isn't one definition, and I would actually say it doesn't really matter as long as our end goals are kind of the same. And I see the same with SLOs. I have my own definitions. I do think there are core definitions that everyone can agree upon.
But, you know, in the first and second Google SRE books, they don’t even define what an SLI is in the same way, a Service Level Indicator, which we can talk more about, of course. A very important part of how to do SLOs. The two different books don’t even agree. So I think part of the confusion is just that there often aren’t single resources to point people at. We don’t have degrees for this. You’re purely a software developer. You can get a degree in computer science. Certain algorithms have certain names, have very strict definitions. If you’re writing code in a certain language, you have to adhere to the syntax of that language. For certain languages, very strictly. We can very clearly say, this is a Java program versus this is a Python program. But when it comes to more philosophical things like Site Reliability Engineering, like Service Level Objectives, I think one of the reasons people sometimes struggle is because it can be more difficult to start from scratch. Like my story about Squarespace, where I realized, even though I knew how to do these things, I had this 1,200 person organization that I had to teach from the ground up. Because there aren’t these strictly defined resources that people can just learn.
Henry Suryawirawan: I think what you said speaks truth to my experience as well. It's really hard to read those SRE books, by the way. They're very dense, sometimes dry, sometimes very Google-specific. And there are not many experts out there, like certified SREs, who can actually tell you how the better practice should look. And many tool vendors just came out and, like what you said, somehow polluted the term and defined their own definitions and things like that. So I think that's one of the challenges.
[00:17:30] Service & Reliability
Henry Suryawirawan: Today, let’s try to also discuss about these concepts from the basics. Since you have a lot of experience, you write this book “Implementing SLOs”. But first of all, what is the definition of service and reliability? Because some people actually use these terms interchangeably, so many different variations. Maybe let’s start from there.
Alex Hidalgo: Yeah, absolutely. Service is actually not difficult to define because you probably already know. It’s what the word service just means. One of my favorite examples is that I’ve drawn so much of my experience and so much of why I love SLOs is because I used to be a server. Like server in a restaurant. I provided a service for people, which was to take their orders and bring them food. A computer server is not very much different. It is a thing that takes your request and responds to it correctly. It’s very similar in concept. That’s what a computer service is. Computer service is something that listens to a request from something and responds to it appropriately.
That, I think, is the best way to think about it, instead of trying to define exactly what it means at a technological level. Cause I don't think that's important. To some teams, it is important to think of their service as a pod running in Kubernetes, or a series of pods, or a Docker container somewhere, or a binary running on a virtual machine, or even a piece of hardware or networking gear. Those provide services as well, and those don't fit any of those previous definitions. For some people, the service they care about is the retail website. You go to a place to buy a pair of socks. Some people at that company just care about that entire website, even though it might be composed of 80 individual tiny microservices. A service can really be defined as anything that does something for someone else. I don't think we have to say, okay, that's a service because it is a set of pods deployed via a single deployment on Kubernetes, or this isn't a service cause it's a user journey. No, no, no, no. A service, you can think of it holistically. You can think of it philosophically. It's something that does something for someone else.
And that leads us right into what I think the correct definition of reliability is. Because people often conflate it with availability, but they’re very different things. Because a service can be available and not be reliable. Reliability is an old term. Reliability engineering even goes back to the 1940s. It’s not a concept unique to SRE or Google or tech or computers. What reliability really means is, is this system performing how it was defined to perform? For computer services, that basically means is it doing what it needs to be doing? If we are cool with the concept of a service being a thing that does something for someone else, then reliability is, is that thing doing that thing it’s supposed to be doing?
The reason I always try to steer people away from just thinking about it as being like availability is because you can be very available and still be doing a bad job. That example of the retail website, where you just need to buy a pair of socks. Well, maybe you can log into the website and it’s very quick and you can search and you get 10,000 results for socks and they have every color and every size and everything you ever wanted. But then when you go to checkout, you can’t. That’s not being reliable, even if that service is being available to you at the time. So a service is anything that’s doing something for something else. And reliability means, are you doing that well enough?
[00:21:06] Service Truths
Henry Suryawirawan: Thanks for explaining. This is very, very simple. I think everyone here could understand that. Really, as part of the foundation, right? When you understand what is service, what is reliability? The next few things will become easier. But before we go into more SRE, reliability and all that, I saw one section in your book where you mentioned about service truths. I think this is an interesting concept for me. Would you be able to share probably what do you mean by this service truths? And what are those?
Alex Hidalgo: Sure. So, I personally believe that there are three things that are true about any service. One is that reliability is its most important feature. If your service is not being reliable, it’s not doing much. As we just said, if we’re defining reliability as are you doing what you’re supposed to be doing, well, then it kind of follows. If you’re not being reliable, you’re not doing what you’re supposed to be doing. So you need always to be thinking about reliability first.
The second truth is that you don’t get to decide what your reliability is. I don’t care what your measurements say. I don’t care what your logs say, what your metrics say. I don’t care if you have a million healthy pods, all reporting up. If your customers, if your users, anyone that depends on you thinks you’re being unreliable, you are. If you’re not meeting the needs of your users, you’re not being reliable. So you need to take their perspective into account.
And then the third service truth is that nothing is ever a hundred percent. So don’t aim for it. This is just a truth of the world. Outside of pure mathematical constructs, nothing’s ever a hundred percent. Things fail. Failures occur. It turns out that people are actually fine with failure, as long as failure doesn’t happen too often. The third truth is just don’t aim for a hundred percent because that’s a fool’s errand. So instead, pick a more reasonable target. Understand what reliable means. Understand that someone else determines what reliable means. Understand that reliability is the most important part of your service. And then make sure you’re only trying to be reliable enough. Like an achievable amount. Something that works for both you and the people that depend on you.
Henry Suryawirawan: As you can tell, these are the fundamental understandings. So let me try to reiterate. The first is that reliability is the most important attribute or characteristic of your system. No matter how many features or functional requirements your system has, if it doesn't perform reliably, it's probably not a good service. The second is that you don't define your reliability; your users do it on your behalf. When they use your system, they will tell you whether it is reliable or not. And the last one: nothing is a hundred percent, so don't ever try to achieve that. People are okay with failures as long as you're not failing frequently.
[00:23:45] Reliability Stack
Henry Suryawirawan: Let’s go back to the more deep dive definition about SLI, SLOs and all that. So you have this concept of reliability stack. Maybe let’s start from there. What are in the stack? How do you define them? And how they work with each other?
Alex Hidalgo: Sure. So you have three primary components of what I call the reliability stack. This is what people really often refer to when they’re saying SLOs. When they often use the term SLOs, they really mean a few different things. First is SLIs or Service Level Indicators. Service Level Indicators are measurements. They are bits of telemetry about your system that tell you, is it doing what it’s supposed to be doing? And, again, this should be from your user’s perspective, as close as you can get, at least.
Then next, you have SLOs or Service Level Objectives, and Service Level Objectives are just targets for how often you want your SLI to be true. So if your SLI is able to tell you, yes, we’re currently meeting the expectation of our users, or it’s able to tell you, oh, we’re currently not meeting the expectations of our users. You’re now able to inform a ratio. Good events over total events equals a ratio, equals some kind of percentage. Your SLO is just a target for what you want that percentage to be. So you can pick something more reasonable. So we just talked about a hundred percent is impossible. No one ever hits that, anyway. So you can say something like. We’ll go back to the sock buying scenario. We may as well use it all night. Why not? If you can’t check out when you try to buy socks, that’s not good. But it’s okay if that only happens one in a hundred times. Because you try to check out, it doesn’t work. What are you going to do? Probably just going to click again, and as long as it works the next time, fine. So maybe you only have to make sure that sock checkout works 99% of the time. It lets you pick a target that’s realistic, that accommodates failure, that makes sure you’re not spending too much money and trying to aim for something you can’t.
And then finally, at the top of the reliability stack, you have error budgets. Error budgets are just a way of thinking about how your SLO has performed over a period of time. An error budget often takes into account a time window that’s often fairly large, generally anywhere from a week to 28 days to 30 days, sometimes even a whole quarter. The idea is your error budget is that other side of the percentage. So if you say sock checkout should work 99% of the time, what you’re also saying is 1% of the time, sock checkout is allowed to fail. Your error budget is that 1%. Your error budget measures are we failing more often or the right amount? And so, your error budget lets you think about things over periods of time. Over the last 30 days, what percentage of the time have we failed? And is that helping us meet our SLO targets? Are we exceeding it?
Your error budget, it’s worded that way because a budget is something you can spend. And an error budget is exactly that. We’re allowing ourselves this 1% or whatever you’ve defined for your own service, of course. But in our continuing example, you have this 1% of checkouts are not allowed to work. What amount of that have you spent over time? And that then helps you make better decisions about where you need to focus your attention.
Henry Suryawirawan: Thanks for mentioning this budget term. Because it is consciously defined. It's not just a number that someone picked; it's actually something that can be spent. Errors are expected sometimes, and that allowance can be spent.
[00:27:11] Defining Reliability Target
Henry Suryawirawan: So you mentioned defining reliability. Many people get stuck even here, right? Like, how reliable should your service be? You said it's not a hundred percent. For some business people, it should be a hundred percent. How can we actually define the reliability of our service? How should we go about it? What's the approach?
Alex Hidalgo: Unfortunately, this is one of those times where I invoke my very senior engineer who’s been doing this too long card and I say, it depends. It’s unique. It depends on your service. The important thing is to be meaningful and to be thoughtful. The important thing is think about what is my service? Who are my users? Real quick tangent, I say user a lot, but I don’t just mean customers. I mean anyone that relies on your service. It might be another team down the hallway. It might be internal people at your company. It might be another service. Your service might be six layers removed from an actual human, but that other service is still a user of your service. So that’s why I say user a lot, cause it can be any of those things.
There isn’t a single answer. Be thoughtful about it. Take your time. Think, what does my service do? What is it supposed to do? What do the users expect? Can I go talk to those users? If I can, let’s go talk to them. Let’s ask them directly. Is there a product management team? Do they have a user journey document? Maybe we should go look up what the defined user journeys are. Why is the software written in the first place? Why was it spun up at this company? What does it do? It’s not the most satisfying answer, because I don’t have like an answer. I don’t have a special formula people can use to instantly understand what it is that they need to be thinking about when they pick how reliable they need to be. But what I can say is in every single case, if you’re careful and you’re thoughtful, that’ll lead you to the best answers.
Henry Suryawirawan: One thing that I really love about the SRE concepts and SLOs and all that is that they always put users in the first place. It's not some random technical decision, like, okay, this is how the reliability should be; it always comes from the user's perspective. Like you mentioned in the beginning, a hundred percent is not possible, right? Even things that rely on the internet are, by default, not reliable, because your packets could be lost or you need to retry and all that.
[00:29:27] Higher Reliability is Expensive
Henry Suryawirawan: Reliability, another thing very important to understand, maybe if you can help to explain, is that as you go higher, if you aspire to go higher, it becomes more difficult, complex, and also expensive. Tell us more why this is the case?
Alex Hidalgo: Sure. I mean, if you want to ensure that you are being more reliable, you need to ensure you're having fewer failures. Every individual component fails sometimes. So if you add more and more, you're just going to have more and more failures, right? Say you have something that works 99% of the time, and you have just that one thing. But if you need to be redundant, if you want to make sure that if this thing, Thing A, fails, then you also now need to have Thing B in place. Cause now Thing B needs to take over in case Thing A fails. That's all good and well, but Thing B might also only be 99% reliable. So now you need something to determine whether Thing A or Thing B is currently being reliable and where to send that traffic. So now you have Thing C. We can go on and on, but you know, you just end up having to build more and more complex systems to ensure that you are, in fact, being more and more reliable over time.
This is sometimes totally reasonable. For some systems and some services, you do need to hit very high targets. This means you need to be distributed. You need to ensure you’re built for high availability and quick fail overs and redundancy. There’s all sorts of engineering for reliability concepts that we could spend hours and hours talking about. But once you introduce so many things, you’re also spending a lot more money. Because no matter if this is running on your own hardware or your own data centers or everything’s in the cloud, you’re now running more things. So it costs more money. Both just in a per month billing situation, as well as the fact that now you need more engineers to take care of it. If you have a hundred components instead of one, you need more engineers to take care of those hundred components. So now you have to hire a whole bunch of engineers.
And now if you’re trying to hit a really high reliability target, well, what a lot of people don’t always do the math about is what that really means in terms of like time? If you want four nines of reliability, that gives you only seconds per month to respond. You can barely have any downtime at all, ever. So now, what do the on-call rotations look like? So you now have like a hundred different components of your service, because you have to have incredible redundancy and all sorts of extra availability stuff, and now you have a whole bunch of engineering teams to take care of all those components. Now they have to be on call. But they also have to be on call and respond immediately. That means no longer can those teams just be single-homed. You need a follow the sun rotation. So you and I are exactly 12 hours apart. So we can have a team in Singapore and we can have a team in New York. Great. But now you need offices in Singapore and New York. Now you don’t just have X number of teams. You have two X number of teams. Because every team now needs teams in Singapore and teams in New York to ensure that someone’s only being on call during the day. Because no one can always wake up at 3:00 AM and respond in time to try and defend a target as stringent as 99.99%.
This just goes on and on and on. The closer you try to get to a hundred percent, the more and more that grows. It's exponential. And if we can all agree that you logically cannot ever hit a hundred percent anyway, it turns into a limit that approaches infinity forever. You'll never actually hit a hundred percent, and trying to get there will just cost you more and more money in terms of how many services you need and how much you're paying your cloud providers. God, maybe you can't even be on one cloud. Maybe you have to go multi-cloud, because what if AWS goes down? Well, we'd better be running in GCP as well. Do you know how difficult it is to write stuff that lives on top of both of those things, that ensures it can route to GCP or AWS at the right moments in time, that can even detect whether your AWS or GCP instances are running correctly at that point in time? It gets so complex to try to hit these high targets that you're just going to kill yourself trying and spend too much money. Turns out, you don't have to hit those targets anyway.
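As a rough illustration of what each extra nine actually buys you in time, here is a small sketch (not from the episode; it assumes a 30-day month) converting reliability targets into allowed downtime:

```python
# A minimal sketch (not from the episode) converting reliability targets
# into allowed downtime, to show how little room four nines leaves.

targets = [0.99, 0.999, 0.9999, 0.99999]

MINUTES_PER_MONTH = 30 * 24 * 60          # ~43,200 (using a 30-day month)
MINUTES_PER_QUARTER = MINUTES_PER_MONTH * 3
MINUTES_PER_YEAR = 365 * 24 * 60

for target in targets:
    budget = 1 - target
    print(f"{target * 100:g}% target -> "
          f"{budget * MINUTES_PER_MONTH:6.1f} min/month, "
          f"{budget * MINUTES_PER_QUARTER:6.1f} min/quarter, "
          f"{budget * MINUTES_PER_YEAR:7.1f} min/year")
```

Four nines works out to roughly four minutes of downtime per month, which is why the on-call and follow-the-sun story Alex describes gets so demanding so quickly.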
Henry Suryawirawan: Yeah. I was about to say, in the end, after all of these complexities and effort, your users actually don't need that kind of reliability. So I think the best way is, again, to check with your users: what are their expectations? And yes, sure, some systems will require this kind of high availability. But hopefully people who listen to Alex's explanation can understand that you need to really define your reliability. It's not just some number of nines that you've seen somewhere: okay, four nines, let's just go with that.
Alex Hidalgo: Yeah. So often, leadership decides we're going to hit this target. They pick a whole bunch of nines all in a row, because they think that's reasonable or because they know that some of Google's services hit that. But you know what? You're probably not Google. Google has all those things. Google does have teams all over the world. Google does have data centers all over the world, and you probably don't. As you said, just be reasonable, be thoughtful. Think about the targets you're picking and make sure that they make sense for you.
[00:34:26] SLI
Henry Suryawirawan: So let’s start with the fundamentals of the reliability stack, which is the SLI. You mentioned it's like a metric that captures your users' experience in terms of reliability. If we relate this to SRE best practice, it commonly advocates the four golden signals for monitoring to define the metrics of your system. Is this the right place to start? How do you define your SLI?
Alex Hidalgo: I have mixed feelings about the golden signals. I'm on record; this is not going to blow anyone's mind who's known me for a while. I kind of wish we hadn't written about them in the first SRE book the way we did. Because I feel that too many people think that, by having labeled them the golden signals, they're the only things that matter. And I see a lot of people both start and stop there. There's nothing wrong with measuring availability. There's nothing wrong with measuring latency. There's nothing wrong with measuring throughput. That's not my issue. My issue is that people often start and stop there. They are, in fact, good starting points. Absolutely they are. So if you're asking me, are they good starting points? Yes. If that's where you need to start, start there. Almost everyone can measure those things. It's a great place to start.
But those things rarely tell the whole story. They rarely tell you what your users are actually experiencing. As we were saying earlier, reliability is whatever your users need from you. We can tell a quick little story, right? Is your service available? Yes, it's responding to requests. Is its latency low? Sure, it's responding to requests very quickly. Is it throwing many errors? No. It's available, it's responding very quickly, and every response is an HTTP 200. Everything's great. But if you're sending them the wrong data, if the data you're sending is not what they're asking for, you're not being reliable at all. And that's not covered by the golden signals at all. So I think people can graduate. They can upgrade to a more user-journey-based focus, which is, again: is this doing what users need it to do? That includes many levels beyond what the golden signals tell you.
So, yeah, I have some mixed feelings. I think they were accidentally taken too seriously, or as too much of an end goal, because of the way they were written about. But I do think they are good starting points, because almost everyone has that data already, and if you don't, it's generally pretty easy to instrument. It can be much more difficult to measure, did we send the data the client asked for? That can be a lot more work. It takes a lot of time to get there. But that's my feeling there. A really good SLI is ever-changing. It's ever-adapting. Because the world is going to change, your users' expectations are going to change, and what your service is even defined as doing is going to change over time. So make sure that you're looking at your SLIs, you're looking at those measurements, and you're asking: is what we're measuring still telling us enough? Is it still telling us what our users are experiencing? Is it still looking at things from their perspective? So start with errors, start with latency, start with availability. But move on from there. Think about what else your users actually need.
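As a starting point of the kind Alex describes, here is a minimal sketch (not from the episode; the status and latency thresholds are invented) of a ratio-style SLI built from golden-signal data, counting an event as good only if it both succeeded and was fast enough:

```python
# A minimal sketch (not from the episode) of a ratio-style SLI built from the
# kind of data the golden signals already give you: count an event as "good"
# if it returned success *and* was fast enough, then divide by total events.
# The threshold and the event records below are made-up examples.

from dataclasses import dataclass

@dataclass
class Request:
    status_code: int
    latency_ms: float

LATENCY_THRESHOLD_MS = 300   # an assumed user expectation, not a universal number

def is_good(event: Request) -> bool:
    return event.status_code < 500 and event.latency_ms <= LATENCY_THRESHOLD_MS

def sli(events: list[Request]) -> float:
    good = sum(1 for e in events if is_good(e))
    return good / len(events)

events = [
    Request(200, 120.0),
    Request(200, 480.0),   # success, but too slow to count as good
    Request(503, 90.0),    # fast, but an error
    Request(200, 210.0),
]
print(f"SLI over this window: {sli(events):.2%}")  # 50.00%
```

As Alex says, this kind of measurement is a starting point, not an end goal; it says nothing about whether the data returned was actually correct.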
[00:37:30] Measuring Correctness
Henry Suryawirawan: You mentioned checking the correctness of data. This is something I'm sometimes confused about as well, personally. If you are always checking whether you sent the correct data, it feels like an infinite loop: how do you check, in the first place, that the data is correct? What about the testing phase of your product? How do you actually cover this correctness thing? Do we really need to measure the correctness of all requests? Or is it only partially? Or maybe only some types of service need to have this correctness attribute measured? Maybe tell us a little bit more about that.
Alex Hidalgo: I think it's all the things that you just mentioned. It depends, again, on your service. A favorite of mine is using synthetic checks. A good example: I was once responsible for a very large scale log system. Hundreds of thousands of incoming logs every minute, or maybe even per second. Very, very high volume. We wanted to make sure that we were doing things correctly. What we realized is that we could craft a special log with a special tag on it, insert it, and then wait and find out how long it took for it to go through the entire pipeline and get indexed, and for us to be able to retrieve it on the other side and verify that the signature we put in was there. That gave us availability, latency, error rates, data correctness, and even data freshness.
But the thing is, it told a pretty complete story. This is what the users of this log service actually needed to happen. They needed to insert logs, then they needed those logs to be indexed, and then they needed to be able to retrieve those logs. So we just wrote a job that would continually do this. And luckily, this log system worked decently, so this only had to happen a few times per minute. Even at a few times per minute, that gave us more than enough data to set an SLO on top of. So we were no longer dealing with the hundred thousand requests per second that the service actually dealt with; it was only a few a minute now. But we were reasonably sure, at least close enough to sure, that if any step along the way broke for most people, we would find out, because our specially crafted log otherwise went through the same workflow, the same pipelines, as everything else. That gave us a pretty good signal. And that one measurement told us four or five different things, because it covered five different services: the log-producing services, the Kafka system that sat in between, the Logstash service that pulled off the Kafka topics, the Elasticsearch cluster that Logstash inserted the logs into, which were then indexed by Elasticsearch, and then the check talked to Kibana, which sat in front of Elasticsearch, to actually retrieve the log. So the availability of all of those services, and the latency of all of them, and the error rates of all of them were all being covered by this one synthetic that just ran over and over and over again.
So that's one thing: you don't always need as much resolution as you think you do. Even though we only had a few of these per minute, as opposed to hundreds of thousands of requests per second, we were covering so many components of the system that it was just as good data. But you can also set SLOs off of actual, 100% user data. Now, this requires advanced tracing solutions. This requires a lot of work. We don't really have time to get into all the technology involved for you to be able to trace from your human end user's client or browser, across the entire internet, through your cloud provider and CDN and load balancers into your app, all the way back to the databases and whatever resources it needed to talk to, and all the way back out again, like render time at the client or whatever. But the fact is I've seen that, and it is absolutely possible, and that is a lot of work. So your SLI could, in fact, also be set off of actual human traffic. Your actual user requests could be used for that as well. Or anything in between.
If there's anything I want to drive home, it's be thoughtful. Be meaningful. Do what works for you. Don't overspend resources trying to do real user journey tracing. That's not reasonable for everyone. Not everyone has the resources, the knowledge, or the time to do that. So maybe set up a synthetic instead. Or, you know what, let's take another step back. Maybe for you right now, just latency and error rate, just the golden signals, maybe that's just fine.
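Here is a minimal sketch (not from the episode) of the synthetic-check approach Alex describes for the log pipeline above: inject a specially tagged log, wait for it to show up at the far end, and record one pass/fail plus an end-to-end latency. The endpoints, field names, and timeout are all hypothetical.

```python
# A minimal sketch (not from the episode) of an end-to-end synthetic probe for
# a log pipeline. Everything named here (URLs, payload fields, timeout) is an
# assumption for illustration, not a real system.

import time
import uuid
import requests  # third-party HTTP client; any client would do

INGEST_URL = "http://logs.example.internal/ingest"    # hypothetical
SEARCH_URL = "http://kibana.example.internal/search"  # hypothetical
TIMEOUT_SECONDS = 60

def run_probe() -> tuple[bool, float]:
    tag = f"synthetic-{uuid.uuid4()}"
    start = time.monotonic()
    # Send a specially tagged log through the normal ingest path.
    requests.post(INGEST_URL, json={"message": "probe", "tag": tag}, timeout=10)

    # Poll the far end of the pipeline until the tagged log shows up.
    while time.monotonic() - start < TIMEOUT_SECONDS:
        resp = requests.get(SEARCH_URL, params={"query": tag}, timeout=10)
        if resp.ok and resp.json().get("hits"):
            return True, time.monotonic() - start    # good event, plus freshness
        time.sleep(2)
    return False, TIMEOUT_SECONDS                     # bad event: it never arrived

if __name__ == "__main__":
    ok, latency = run_probe()
    print(f"probe ok={ok} end_to_end_seconds={latency:.1f}")
```

Run a job like this every few seconds or minutes and each result becomes one good or bad event for an SLO, covering every component the probe passes through.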
[00:41:49] Critical User Journey
Henry Suryawirawan: Thank you for clarifying that. Again, it's very insightful for me. You mentioned user journeys a couple of times. Maybe some people are also confused about this term. What do you mean by a user journey? Is it the entire experience, like I want to buy socks? Or is it when you check out? Or is it when you load a page? There are so many different ways to define this. Maybe you can help define what a user journey is, or what some people call a critical user journey.
Alex Hidalgo: Sure. So, a critical user journey is generally something that's defined by the product side of your organization, telling you what needs to happen with your product for you to be a successful company, which generally, in this case, means making money. That's the most general product management, product owner definition of what a user journey is, right? It is the expected way that a feature will work. And a feature very rarely maps directly to a single service, in terms of a microservice, in terms of a single team owning it.
So user journeys generally span multiple different components of your system. That generally means that there are many different teams, not just one, responsible for all the components that user journey travels over. It’s kind of a high level product manager explanation of user journey, but I think they’re just a good analog for what a good SLI is. A good service level indicator basically is a user journey. A user journey basically is a KPI or a key performance indicator, which is even one more step up, right? This is now the business side. What does the business say? Your business operational teams, not your computer operational teams. What do they say we need to be measuring? What do they say is important for our revenue, for our bottom line? What does the Chief Revenue Officer or the CFO, what do they care about?
But they’re all similar in the sense that they’re all measurements likely having to do with your customers or your users, and none of them are ever going to be a hundred percent, so you can set targets on all of them. I kind of like to say that an SLI is a user journey, is a KPI. They’re all kind of the same thing. Just different business units have slightly different ways of talking about them.
Henry Suryawirawan: I think the most important thing, again, is that it comes from the product. It's not randomly from some engineers saying, okay, this is our critical user journey. You mentioned that a user journey could typically involve a number of services. It could be components, databases, load balancers, and all that. It could also be multiple microservices that span multiple teams. One point of confusion here: how do you define the SLIs? Is it per team? Is it per whole user journey? How do you advise people to think about this?
Alex Hidalgo: I think it could be both. I think individual teams need SLIs set on their own services. They need SLOs set on their own services. They need to understand how their services are operating for the things that depend on them, even if those things are only other services. But then, perhaps the director of your organization needs to be setting SLIs. That's not to say the director has to be implementing them, but they need to own SLIs and SLOs for the kind of user journey stories that go across many different services. And perhaps the VP of Engineering needs to own the concept of, like, can you check out? So the checkout microservice team has their own SLO that measures how often their microservice is not throwing an error. The commerce department, the director of that department, owns an SLO that says, does the checkout workflow work? The VP of Engineering owns an SLO that says something like, are users able to use our website and send us money? Something along those lines. These things build on top of each other. I think every step of that way needs its own SLO, and you can use those individual SLOs to inform an overarching SLO. The SLI for an SLO could be the SLO status of a different SLO. Like I said, there are no super hard and fast rules here. It's just: are you measuring things? Are you taking your users into account? And are you making sure you're not trying to be a hundred percent?
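One way to picture that layering, sketched loosely and not taken from the episode (all names, targets, and attainment numbers are invented), is a higher-level SLO whose indicator is simply whether its component SLOs are within budget:

```python
# A minimal sketch (not from the episode) of SLOs layered the way Alex
# describes: team-level SLOs per service, plus a higher-level SLO whose SLI
# is "are the component SLOs currently within budget?". All names, targets,
# and attainment figures are invented for illustration.

checkout_service = {"target": 0.999, "attainment": 0.9994}
payments_service = {"target": 0.999, "attainment": 0.9985}  # over budget
frontend_service = {"target": 0.995, "attainment": 0.9971}

def within_budget(slo: dict) -> bool:
    return slo["attainment"] >= slo["target"]

# Director-level "checkout workflow" SLO: its SLI is derived from the status
# of the component SLOs rather than from a raw metric.
components = [checkout_service, payments_service, frontend_service]
workflow_sli = sum(within_budget(c) for c in components) / len(components)
workflow_target = 0.99   # e.g. "component SLOs healthy 99% of the time"

print(f"components within budget: {workflow_sli:.0%}")
print("workflow SLO met" if workflow_sli >= workflow_target else "workflow SLO missed")
```

In practice the higher-level SLI would be evaluated over time rather than at a single instant, but the sketch shows the idea of one SLO feeding another.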
[00:45:55] Setting SLOs
Henry Suryawirawan: You mentioned different departments owning SLOs. Well, let's move to the SLO part. So you have defined your SLIs, you have maybe latency, error rate, availability, and all that, and then you set your SLOs. First of all, how many SLOs should a service have? Is it in the hundreds? How many is good enough? I think there is some art here; it's mentioned in the SRE book as well, and in your book as well. What's the art? How do you decide the number of SLOs for your service?
Alex Hidalgo: This can be so boring for your listeners, because the answer is, again, it depends, right? It really does, though. It's solely dependent on your service. Make the right decisions. Make sure it's enough that you are covering all the important aspects of your service. So don't choose too few, because if it's too few, you can't understand what's going on. And also make sure it's not too many, because if you have too many, you run into the multiple comparisons problem, where you have too much data and you don't know what to look at, you don't know what's telling you what, and you no longer have a good idea of what these signals mean anymore. I think the Google SRE book said five or six per service. That seems reasonable to me. It really does. Yeah, sure. Why not? But again, like with so many things, I don't like hard and fast rules. I like approaches. I like philosophies. So don't feel bad if you don't have five. Don't feel bad if you have many more than five. Just make sure it's the right amount for you.
[00:47:18] Being Too Reliable
Henry Suryawirawan: There's also another thing that is covered in the book. When you set an SLO target, make sure you're not setting it too far beyond what users expect, because it could cause a few complications. Maybe you can explain. First of all, how do we know that we are exceeding users' expectations? Maybe we ask the users, like you mentioned. And secondly, what should we do then? Do we lower the target itself? Maybe you can explain a little bit the problem of being too reliable here.
Alex Hidalgo: Yeah. So the main problem you run into by being too reliable is that people will end up expecting you to continue to run that way. If your users were previously okay with you being reliable only 99% of the time, but then you proceed to spend, say, a year being 99.9% reliable, their expectations may now have changed. And now they're going to hold you to that 99.9%, even though they were totally happy with only 99% before. It's a lot of nines, but hopefully everyone was able to follow what I was saying there. So you paint yourself into a corner if you are too reliable too often, because user expectations will change.
What you want to do is make sure that you aren't being that reliable. You may do this accidentally. You may just accidentally have a few months where everything was super lucky. Or maybe your whole team went on vacation, you didn't touch anything, so nothing broke for a while. We can come up with a bunch of different examples. But then you run into the problem that people are now going to expect this moving forward. My favorite example of this is the Chubby team at Google. This was written about in the first Google SRE book, and you can read the whole story there. The quick version: Chubby is a global lock service. It holds tiny bits of data that are useful for various highly distributed services to be able to read and understand at certain points in their operation. Chubby just generally ran pretty well. When I tell this story and I say that, my old Chubby SRE friends often give me a dirty look, because apparently their lives are not always that easy. But from a user perspective, global Chubby, the globally available version of this, ran very well. I believe they had four nines, so a 99.99% target.
Every quarter, generally speaking, Chubby would still have error budget left, because it ran so well. So what they would do at the end of every quarter is just shut Chubby off. They would burn whatever budget they had remaining. This would be communicated to teams. Emails would be sent out: this Thursday afternoon at 3:00 PM, we're going to be shutting Chubby off for exactly 2 minutes and 17 seconds, because that's how much error budget we have left. Even though people were told about this, other services would always crash, because someone would have a dependency on Chubby that they didn't know about. Chubby, being a good citizen, being run by an excellent SRE team, would say: we're going to make sure you find out, because we're only promising you four nines. If you're expecting anything more than four nines, you're going to have trouble. So we're going to ensure you find out if you're going to have trouble, because we're giving you exactly four nines per quarter.
That's kind of a humorous story, and it's kind of out there, and there aren't a lot of teams or organizations that will ever get to that point. But it's a really good example of ensuring that your users don't get too used to you being too reliable. Because once you paint yourself into that corner, you might be stuck being held to a level of reliability that is otherwise too expensive and too difficult.
Henry Suryawirawan: Another assumption people probably make when a service is that reliable is something like Google Search. When our internet is down, the first thing people test is whether Google is down as well. When Google is deemed that reliable, people treat Google as the benchmark for the internet, even for the reliability of the internet itself.
[00:51:02] Communicating With Error Budget
Henry Suryawirawan: So you mentioned error budgets. This Chubby story is really interesting, how they used the error budget to make sure the service didn't perform too reliably. In the error budget sections of the book, one thing I find really interesting is you mention that the error budget is actually not a technical term. It's a communications framework, probably between engineers and business owners, and maybe other departments. So tell us more about this communication aspect of the error budget.
Alex Hidalgo: Yeah. So really, what error budgets let you do is tell others how reliable you have been. Your SLO target tells others how reliable you want to be, and your error budget lets you tell other people how reliable you've actually been. And the reason that's such a good communication tool is that it helps other people figure out what their own SLOs should be, and it also helps other people understand what all of this actually boils down to. Because we've said the word nine between us probably a hundred times already. But what does 99.9% even mean? Well, when you translate it into roughly 43 minutes per month, that's something that humans understand. So you're able to go to someone and say: all right, we were unreliable for 17 minutes last month, but that means we still had about 26 minutes of budget left. Although unreliability periods are not always all in a row; they're not always about downtime, they're not always about outages.
But the point is, it gives you a more human-friendly way of communicating these kinds of historical reports. What did Q1 look like? What did 2021 look like, from our users' perspective, from this hypothetical all-seeing user who never stops watching us? What did it look like in terms of time, for example? It's an eventual output of an SLO-based approach that helps you go have conversations with people. Whether that's via the time-based error budget definitions, which, again, make it easier for some human heads to wrap themselves around, or just by being able to say: look, we exceeded our error budget every quarter last year. We believe we're aiming for the right target, but we can't hit it, so we need more resources or more head count. Or it could be the opposite: hey, we've had a ton of budget left over. Maybe we move some of the staff over to this other project that's having some problems. Or maybe we should be moving quicker. Maybe we should be shipping features more often. Because you know what? We're really awesome. We're at almost a hundred percent all the time. Let's spend that budget. Let's ship more features. Let's experiment. Let's try things. Let's do chaos engineering. There are so many different things that error budgets let you do, but they almost all involve communicating with other people: hey, this data told us this thing over time. What can we do with that data?
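To show that translation in practice, here is a minimal sketch (not from the episode; the 17 minutes is the same made-up figure from Alex's example, over an assumed 30-day month) of turning an SLO target into an error budget and a remaining balance you can report to others:

```python
# A minimal sketch (not from the episode) of turning an SLO target into the
# kind of human-friendly error budget numbers Alex describes. The measured
# unreliability below is a made-up figure, as in his example.

TARGET = 0.999
MINUTES_IN_WINDOW = 30 * 24 * 60      # a 30-day month

budget_minutes = (1 - TARGET) * MINUTES_IN_WINDOW   # ~43.2 minutes
bad_minutes = 17.0                                   # measured unreliability
remaining = budget_minutes - bad_minutes

print(f"Budget for the window: {budget_minutes:.1f} minutes")
print(f"Spent so far:          {bad_minutes:.1f} minutes")
print(f"Remaining:             {remaining:.1f} minutes "
      f"({remaining / budget_minutes:.0%} of the budget left)")
```

Saying "we have about 26 minutes of budget left this month" tends to land with non-engineers in a way that "we are at 99.96% attainment" does not.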
Henry Suryawirawan: I really like how all these concepts build on top of each other, like you mentioned, the reliability stack. The error budget specifically, if you use it right and people in the organization are aligned that the error budget is an important concept, you can use it for communicating priorities as well. Like you mentioned: should we ship more features? Should we hire more people? Should we fix reliability? Or should we do experiments, or even, like Chubby, just spend the budget because we are too good? I think that's really interesting.
[00:54:13] Building SRE Culture
Henry Suryawirawan: So I think many people are interested in SRE, SLOs, SLIs, and all that. But implementing it, like you mentioned from your Squarespace experience, is hard. Maybe give us a few more tips. How should people start building this SRE culture, or even get buy-in from the people within the company?
Alex Hidalgo: The best advice I can give, or at least my favorite advice, because it's not always possible for everyone, and I want to be very upfront that I understand not everyone is in a situation where they can do what I'm about to say: if possible, just get started. Just do it. Just pick a service and pick an SLO and measure it. Maybe it's totally the wrong target. Maybe your SLI is a bad one. That's fine. Pick a new one. They're not agreements, they're objectives. Just get started with it, start gathering the data, and see what you can do with that data. Maybe at first it's just you, and then maybe it's your team, or maybe you can get your whole team on board right away. And then you can start showing the teams that you work closely with, “Hey, look at this cool data we're getting. Look at the great decision making we've been able to do, and how we've been able to plan our sprints more effectively because of our error budget data.” And then those other teams go, “Oh, that seems kind of cool. Maybe we should try that.”
I really think SLOs are often a bottom-up thing. Really, honestly, I think they're very often organically grown. I think people can be sold on them philosophically, but they don't understand why they need to spend time implementing them until someone has given them the hard data. So if possible, just start. Just go. Just start doing these things and see what it gives you. If it doesn't give you what you want, maybe pick different targets, maybe pick different measurements, or maybe you're not quite ready for them. That's also totally possible.
But every industry out there already understands, in some way, that failure happens. It's not just computer services. Embracing failure, while ensuring you're not failing too often, is always a good thing. This is always going to lead, eventually, to happier engineers, a happier business, happier users, and therefore happier customers. So that's my favorite advice. I don't know if it's the best advice, because like I said, I know some people are not in a situation where they can just go do it. I'm sympathetic to that. But that's generally what I tell people. If possible, just give it a shot and see what happens.
Henry Suryawirawan: Yeah. Sometimes people get stuck on the tooling, you know, can we get the data and all that. But I think we can always start simple, right?
Alex Hidalgo: Yeah. I've seen plenty of teams get started with SLOs by manually calculating them at the end of each quarter. No joke. They didn't have real-time alerting on their SLOs or real-time error budget status. They didn't even have an error budget. They got started by, at the end of every quarter, when they were getting ready to plan what next quarter's priorities should be, running some queries against their monitoring system, doing the math, putting it into a spreadsheet, and then calculating: okay, last quarter we were X percent reliable. What does that mean for our next quarter?
And that's a totally reasonable way to get started. It's all about, again, embracing those service truths that we talked about at the start, right? Reliability is the most important thing. Your users define your reliability, not you, so make sure you're measuring the right thing. And a hundred percent is out of the question, so pick the right target. You can embrace those truths without real-time monitoring and advanced statistics and all the stuff that comes along with it. Just get started. Even if it's in a spreadsheet. Even if it's only once a month.
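For teams starting out the way Alex describes, the end-of-quarter math can be as small as this sketch (not from the episode; the counts and target are invented), using whatever good/total counts your existing monitoring can already produce:

```python
# A minimal sketch (not from the episode) of the end-of-quarter, spreadsheet
# style calculation Alex mentions: pull good/total counts out of whatever
# monitoring you already have and compute how reliable you actually were.
# The counts and the target below are invented for illustration.

good_events = 12_846_113    # e.g. requests that succeeded and were fast enough
total_events = 12_871_502   # all requests observed in the quarter
target = 0.998

attainment = good_events / total_events
delta = attainment - target

print(f"Last quarter we were {attainment:.3%} reliable (target {target:.1%})")
if delta >= 0:
    print(f"Within target, with {delta:.3%} to spare")
else:
    print(f"Missed target by {-delta:.3%}")
```

A result like this, even computed by hand once a quarter, is enough to start the planning conversations Alex describes.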
Henry Suryawirawan: I like that you mention this: even if you don't have the tooling, just start. Because some people think, after reading the book, that it's philosophical. We are not Google, we don't have all the tools, so we are stuck. So the key message here is just start. And I think these three service truths, like we discussed in the beginning, are really important. Once you get them right, you will find ways to actually measure your users' happiness.
[00:57:57] 3 Tech Lead Wisdom
Henry Suryawirawan: So Alex, thank you so much for spending your time. It was like a crash course in SRE and SLOs, definitely. Unfortunately, we need to wrap up pretty soon. But before I let you go, I normally ask one last question of all my guests, which is to share your three technical leadership wisdom. This is maybe some advice you'd like to give us, based on your career journey, your experience, or maybe some hard lessons.
Alex Hidalgo: Sure. If there are three things I have to share: Be kind. It goes a very long way. You're dealing with other humans, no matter what you think your job is. No matter what your computer service is, it exists at some level for other humans, whether they are your customers or your end users or your coworkers or whoever. So be kind, just be nice to each other. Don't be pompous. Try to always remember that every decision you make impacts other people. Initiate those decisions with kindness.
Be thoughtful. I've said that a lot tonight. But it really is, I think, maybe my most meaningful mantra at this stage in my career: think things over. Sometimes you need to react. Sometimes you're in an emergency. Sometimes you're dealing with an incident or immense business pressure, or your company is under immense financial strain. I've been in all those places. But you always have time to be thoughtful. You always have time to take at least a few seconds and ask: okay, is this the right thing I'm about to say? Is this the right thing I'm about to do? Is this the correct action I'm about to take?
And then finally, adopt blamelessness. Make sure that your organization is building an appropriate culture, where we understand that humans don't make mistakes on purpose. That's not to say humans don't make mistakes. Of course we do. Every single one of us does, every single day. But generally speaking, unless you're a bad actor, unless you're literally trying to bring down the company from the inside, people aren't doing it on purpose. Always remember that. So adopt blamelessness. Combine that with thoughtfulness. Combine that with kindness, and be better to each other.
Henry Suryawirawan: I love your wisdom because it all touches on the human aspect. Nothing here about technology or SLOs and SRE. So thank you so much for this beautiful message.
So Alex, for people who want to follow you or maybe look into your product, Nobl9, where can they find you online?
Alex Hidalgo: So you can find me on Twitter, primarily, ahidalgosre. That’s A H I D A L G O S R E on Twitter. Also my website, Alex-Hidalgo.com. And definitely, go check out Nobl9. I believe anyone can get started with SLOs. We exist to help you do that. We exist to help you measure SLOs and calculate your error budgets the best possible way, no matter where your data lives. So come check us out at Nobl9.com. That’s N O B L 9 .com.
Henry Suryawirawan: Thank you so much. Again, I really enjoyed this conversation. It was a pleasure. Thanks, Alex.
Alex Hidalgo: Thanks Henry. I had a blast. Thanks so much for having me.
– End –