#88 - Observability Engineering - Liz Fong-Jones


“Observability is a technique for ensuring that you can understand novel problems in your system. Can you understand what’s happening in your system and why, without having to push new code, by slicing and dicing existing telemetry signals that are coming out of your system?”

Liz Fong-Jones is the co-author of the “Observability Engineering” book and a Principal Developer Advocate for SRE and Observability at Honeycomb. In this episode, Liz shared in depth about observability and why it is becoming an important practice in the industry nowadays. Liz started by explaining the fundamentals of observability and how it differs from traditional monitoring. She explained some important concepts, such as the core analysis loop, cardinality and dimensionality, and debugging from first principles. Later, Liz shared the current state of observability and how we can improve our observability by doing observability driven development and improving our practices based on the proposed observability maturity model found in the book.

Listen out for:

  • Career Journey - [00:05:44]
  • Observability - [00:06:30]
  • Pillars of Observability - [00:09:57]
  • Monitoring and SLO - [00:12:28]
  • Core Analysis Loop - [00:15:06]
  • Cardinality and Dimensionality - [00:18:41]
  • Debugging from First Principles - [00:21:20]
  • Current State of Observability - [00:26:49]
  • Implementing Observability - [00:30:20]
  • Observability Driven Development - [00:36:53]
  • Having Developers On-Call - [00:39:06]
  • Observability Maturity Model - [00:41:59]
  • 3 Tech Lead Wisdom - [00:44:10]

_____

Liz Fong-Jones’s Bio
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 17+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights. She lives in Vancouver, BC with her wife Elly, partners, and a Samoyed/Golden Retriever mix, and in Sydney, NSW. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

Follow Liz:

Mentions & Links:


Our Sponsor - Skills Matter
Today’s episode is proudly sponsored by Skills Matter, the global community and events platform for software professionals.
Skills Matter is an easier way for technologists to grow their careers by connecting you and your peers with the best-in-class tech industry experts and communities. You get on-demand access to their latest content, thought leadership insights as well as the exciting schedule of tech events running across all time zones.
Head on over to skillsmatter.com to become part of the tech community that matters most to you - it’s free to join and easy to keep up with the latest tech trends.
Our Sponsor - Tech Lead Journal Shop
Are you looking for cool new swag?
Tech Lead Journal now offers some swag that you can purchase online. The swag is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.
Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive yours.


Like this episode?
Subscribe and leave us a rating & review on your favorite podcast app or feedback page.
Follow @techleadjournal on LinkedIn, Twitter, and Instagram.
Pledge your support by becoming a patron.


Quotes

Observability

  • The goal of a Site Reliability Engineer is to try to make systems easier to operate for the people that are working on them, whether it’s SREs themselves or on-call or whether it’s the software engineering team that is building and operating and running the system.

  • One of those key considerations in terms of ensuring the system is highly available is thinking about what are your Service Level Objectives? What are your reliability goals for the system? And then, on the flip side, how do we think about trying to achieve those goals? What happens if there is an incident? Does it take five minutes to fix or does it take three hours to fix? That’s where observability comes in. Because observability is a technique for ensuring that you can understand novel problems in your system.

  • When we think about defining observability, it’s basically this question: can you understand what’s happening in your system and why, without having to push new code, and can you do so very quickly by slicing and dicing existing data that you already have, in terms of telemetry signals that are coming out of your system?

  • The key way that we think about observability and kind of that first step of instrumentation is, as you’re developing your application, you should add instrumentation to help you understand what’s happening inside of your code, so that you don’t wind up being caught on the back foot later. That doesn’t mean you have to predict every single failure that is going to happen in advance. You kind of have to leave yourself the breadcrumbs that you’re going to need to debug in the future.

  • This starts with, in my view at least, kind of having some form of tracing to know when did each request start and stop in each service. Where did that request come from? Which other service called you in turn? And maybe some other properties, like which user it is.

  • The more you enrich your trace with information that may or may not seem relevant at the time, the better. If you can freely add details like which feature flags are on at the time, which user ID it is, which browser they are using, and which language they are using, then it means down the road, you don’t have to have pre-aggregated by any of those dimensions. You can kind of see which, if any, of those dimensions are relevant.

Pillars of Observability

  • Having high-quality signals does matter. You’re not going to get great observability unless the data is there. However, you can’t just throw a bunch of data into a data lake and say that you’re done.

  • That’s kind of why I pushed back against the three pillars narrative. Because one or two really high-quality signals is enough. You don’t necessarily need kind of voluminous debug logs if you have tracing that is representative of what you would otherwise put through logging.

  • Metrics can be useful in high-volume situations. But the problem with metrics is that they’re pre-aggregated. You can’t really slice data that’s been pre-aggregated at the source, right? So I think that each signal has some benefits and drawbacks that you need to evaluate, and you don’t need to collect all three of them. In fact, there are new and emerging signals like continuous profiling.

  • There are many different telemetry types that we can think about utilizing as we kind of compile this capability that’s really not a technical capability, but instead a socio-technical capability. What can your engineers accomplish with the system, not what are you measuring on the system?

  • The goal of observability is actually to provide a level of introspection or details that helps you understand the internal state of the systems.

  • I think that when you’re measuring external state, you are measuring things like, what’s the CPU utilization, what’s the memory utilization? Those are the things that are potentially useful, but they don’t give you an idea of what was the application doing at the time that this particular request was executed.

  • That kind of detail into what was the code actually doing, and not what are the side effects of the code, I think that’s kind of what differentiates observing it from the inside out rather than from the outside in.

Monitoring and SLO

  • To me, observability is a superset of the capabilities that monitoring would provide. So to me, they’re not at all synonymous. Because with monitoring, you’re just trying to figure out when something has broken, but it won’t necessarily answer why. The old pattern was: I use my monitoring to tell me something has gone wrong, and I use my logs to figure out what happened. In a complex distributed system, logging is not going to save you anymore. It takes forever to search through. It doesn’t give you the causality of what caused what.

  • When we think about the relationship between monitoring and observability, I think this is why I mentioned Service Level Objectives earlier. To me, Service Level Objectives are kind of monitoring 2.0. They’re the answer to the “tell me when something has gone wrong”.

  • Whereas observability kind of maps more closely, to me, to kind of replacing logging than it does to replacing the metrics we used for monitoring in the past. When you have your Service Level Objective, you define what an acceptable level of success is.

  • Like peanut butter and jelly, they go better together, and then together, they kind of provide a superset of the kind of monitoring and logging that people used to do in the past.

Core Analysis Loop

  • Traditional monitoring is more reactive, while observability is more investigative. Observability really helps you form and test hypotheses really quickly. Because when I am trying to debug an incident, if my error budget alarm has gone off, the number one thing I start asking is “What’s special about the requests that are failing?”

  • Instead of looking at this wall of graphs to figure out which lines wiggle at the same time, what observability gives me is this power to have causality. The power to look at the requests that failed and correlate which properties are most associated with those failures.

  • In the case of Honeycomb, this is a feature called BubbleUp where you can draw a box around the anomaly, and it’ll tell you what’s special about the anomaly, where you’re telling us what the anomaly is. We’re not trying to guess based off of two Sigma or whatever, really. We’re just telling you, “Here’s the difference between your control and experiment.” The cool thing is from there, you can basically go in and refine.

  • And then ultimately at the end of the day, you can visualize this either in kind of aggregated metrics, like what the distribution is, or you can go in and look at the raw data and see it in a kind of tabular, log-like format showing each of the fields and the relevant queries that matched, or you can go in and look at it as a trace waterfall. It starts with forming your hypothesis, narrowing down the data, and then looking at the data to confirm your hypothesis.
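The core analysis loop Liz describes can be sketched in miniature: given wide events, ask which attribute values are over-represented among the failing requests compared with the baseline. This toy is not Honeycomb's BubbleUp implementation; every event and field below is invented:

```python
from collections import Counter

# Toy events: wide structured events with arbitrary key-value dimensions.
events = [
    {"status": 500, "region": "us-east", "feature_flag": "new_cache"},
    {"status": 500, "region": "us-east", "feature_flag": "new_cache"},
    {"status": 200, "region": "us-east", "feature_flag": "control"},
    {"status": 200, "region": "eu-west", "feature_flag": "control"},
    {"status": 200, "region": "eu-west", "feature_flag": "new_cache"},
]

def whats_special(events, is_anomaly, key):
    """Rank values of `key` by how over-represented they are in the anomaly set."""
    anomaly = Counter(e[key] for e in events if is_anomaly(e))
    baseline = Counter(e[key] for e in events if not is_anomaly(e))
    def freq(counter, value):
        return counter[value] / sum(counter.values()) if counter else 0.0
    values = set(anomaly) | set(baseline)
    # Values most skewed toward failing requests come first.
    return sorted(values, key=lambda v: freq(baseline, v) - freq(anomaly, v))

failing = lambda e: e["status"] >= 500
print(whats_special(events, failing, "feature_flag"))  # ['new_cache', 'control']
```

Here the loop is one line: draw a box around the anomaly (the `failing` predicate), compare control against experiment, refine the hypothesis, and repeat with another key.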

Cardinality and Dimensionality

  • When you’re instrumenting your code, you should feel free to add as many attributes as you like. Add as many key-value pairs as you like to explain what’s going on in your code and kind of leave those breadcrumbs. It’s almost like adding tests or adding comments. It’s just a sensible way of leaving your future self this annotation that explains what this code is doing.

  • The reason why people have hesitated to do that in the past, at least with kind of traditional monitoring systems, is that monitoring systems bill you by the number of distinct key-value pairs that occur on each metric that you are recording. So that’s caused people to either emit these key-value pairs to logs, where they get lost, or to not record them at all.

  • When we think about high dimensionality, what we’re encouraging you to do is to sprinkle these annotations throughout your code and to kind of encode as many keys as you like.

  • Dimensionality is about the number of distinct keys, and cardinality is about the number of distinct values per key. An ideal system that supports observability should allow basically unlimited cardinality and a very, very high amount of dimensionality.
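The two terms can be made concrete with a handful of wide events. The fields below are invented for illustration:

```python
# Three wide events, each a bag of key-value pairs ("breadcrumbs").
events = [
    {"service": "checkout", "user_id": "u1", "browser": "firefox"},
    {"service": "checkout", "user_id": "u2", "browser": "chrome"},
    {"service": "cart", "user_id": "u3", "browser": "chrome", "feature_flag": "beta"},
]

# Dimensionality: how many distinct keys appear across all events.
dimensions = set().union(*(e.keys() for e in events))
print(len(dimensions))  # 4 distinct keys

# Cardinality: how many distinct values a single key takes.
user_cardinality = len({e["user_id"] for e in events})
print(user_cardinality)  # 3 -- user_id is a high-cardinality dimension
```

At real traffic volumes `user_id` can take millions of values, which is exactly the kind of field traditional per-key-value metrics billing punishes and an observability store should absorb.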

Debugging from First Principles

  • When we have a really well-functioning core analysis loop, where you’re able to rapidly form hypotheses and test them, you no longer need to make these leaps of intuition to the one magically right hypothesis. And instead, you can test a lot of different hypotheses and kind of narrow down your search space, and eliminate red herrings and dead ends. Whereas in the previous ways of debugging, it used to be that you would kind of have this one person who knew the system really well in their head, who could immediately jump to knowing what the right answer was.

  • That’s why I think first principles debugging is much better, because it enables anyone on your team who has learned how to do first principles debugging to step into any unfamiliar situation and figure it out in a predictable amount of time.

  • My goal is to make it so that in the worst-case scenario, someone who’s inexperienced but knows first principles debugging will take 10 minutes, or 30 minutes at most, to debug the issue rather than three hours, five hours, 24 hours, or worse. What happens if that person who’s been at the company for 15 years leaves the company or retires?

  • We talk a lot at Honeycomb about this idea of bringing everyone on your team to the level of the best debugger. So you don’t need these magical flashes of intuition. Instead, that everyone has that base capability.

  • In the field of engineering as a whole, when we talk about going back to first principles, what we mean is throw out everything that you know. Let’s start from understanding the system from the laws of physics, or the laws of mathematics, instead of trying to read the manual, to see what this machine says it does. So you can kind of trace everything back to understand, what does this lever do? Not what does the manual say this lever does? Or what does this person remember the lever does?

  • The definition, to me, of first principles debugging is to throw out the book and use the laws of the system, laws of reality to understand it.

  • When this comes to computer systems, I start by looking at the flow of execution of one request. To try to understand what does, for instance, a normal request look like, and what does an abnormal request look like. Where do they differ? What services are they passing through? Where are they spending their time?

  • Finding example traces that exemplify both the slow path and the fast path. And then just pulling them up side by side, comparing them to try to understand where one is fast and where one is slow. And then that will bring to mind hypotheses and ideas about why these two things might be different. And then I can start testing them.

  • It’s not knowing magically which graph or which dashboard to look at. Instead, it’s spending a little bit of time puzzling out what’s the difference between these two things based off of what I can observe about them from the outside.

Current State of Observability

  • The reason why we need this today is because we are building cloud native microservices, where you can no longer use a traditional logging system that collects all that from one host where your entire query runs, or where you can do APM on the entire host to understand what are the slow paths on just that one host. That doesn’t work anymore if you have microservices, a request that needs to flow through services maintained by more than one team. At that point, you kind of now have these squishy boundaries where no one person holds that information about what’s going on in their head.

  • The motivation is that the complexity of systems has meant that we can no longer debug systems based off of known unknowns. Metrics that we thought to pre-aggregate in advance or kind of these magical flashes of insight by the expert. In that, there is no longer an expert on every aspect of your system. There might be an expert on like one or two aspects of your system, but not how all the pieces fit together. That’s why observability is crucial and needed.

  • The problem with the idea that observability has three pillars is that the people who are already selling you solutions to one or two, or maybe even all three, of these so-called pillars are trying to persuade you that what they do satisfies this requirement of being able to understand your complex systems. “You can have observability if you just buy our logging and tracing and metrics solutions.” Everyone in the industry has started calling what they do observability, even if it isn’t.

  • What’s worse is that they’re making you pay for observability three different times, like they’re selling you three different SKUs. There should be one set of data, so there’s no mismatch, so you can jump fluidly between views, and so it doesn’t cost you an arm and a leg.

Implementing Observability

  • I think the reason is that it works until it doesn’t. System complexity is a very sneaky thing. You think you can understand everything in your head, and that you’re that magical person who’s able to debug everything within five minutes, until you get the thing that stumps you, the one that takes three hours, or until you go on vacation and someone calls you because they can’t figure out the system that you’ve built.

  • I think that’s the reason why people may not necessarily invest in observability when they should be, is because the cost of inadequate observability sneaks up on you. It’s a form of technical debt to have a lack of observability.

  • The other, unfortunately, sinister thing is like it feels really good to be the person who magically debugs the problem. It feels like job security. It feels like accomplishing something.

  • I think there are two pieces to it. I think there’s kind of the foundational understanding of what observability is and isn’t.

  • I would start by reading [Observability Engineering], or at least reading the first couple of chapters, just to make sure that you have that language to explain to your team why you want them to change their practices.

  • You can add all the tools in the world, [but] adding a new tool doesn’t necessarily solve things unless people understand the motivation and understand how they are going to use it.

  • Once you have people bought in, the next step is to add the OpenTelemetry SDK to your application. OpenTelemetry is a vendor-neutral open source SDK that allows you to generate this telemetry data, whether in metrics format, log format, or trace format. The idea is that it is a common language for producing this data, transmitting it around, and propagating state and context between different microservices. So when you add OpenTelemetry to your application, you’re not locking yourself into any particular solution; you’re making an investment in your future.

  • You still might need to add some tests, or you might need to add some manual instrumentation, but OpenTelemetry, in general, does a good job of handling kind of that automatic generation of trace spans for your application based off of the frameworks that you’re using. So you can have kind of this rich data about what requests are flowing in and out with the framework and language that you’re using.

  • Then you’ll have to pick a place to send that data. And there are a wide number of options. You can certainly use open source solutions like Jaeger. The trouble with those open source solutions is that they are great for visualizing individual traces, but they’re not necessarily going to provide a comprehensive replacement for your existing monitoring workflows. Because people often need to understand what the aggregate view of the system as a whole is, and then zoom into a trace.

  • At that point, you might want to look at vendor solutions. Basically, any backend that supports OpenTelemetry is going to be a place that you can send that data to. And the bonus of OpenTelemetry is it’s vendor neutral, and it supports teeing the data. You can actually send to more than one provider at the same time and see which one you like.

  • I think that you have to think about the cost effectiveness. How much is it going to cost you and what are the benefits that you get? So you may not necessarily want to go with the lowest-cost vendor, because the lowest-cost vendor might provide you with only a very primitive ability to analyze and debug. A lot of the return on investment from observability comes from saving your engineers’ time, improving your customer outcomes, and decreasing customer churn.

  • It’s important to focus on what your evaluation criteria are. And to specify upfront: we want to be able to understand issues within half an hour. Or we want to be able to measure our Service Level Objectives in the first place, to even understand where we’re going wrong. And I think that, to go along with that: how intuitive is it? Is everyone on my team adopting it?
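The context-propagation idea behind the OpenTelemetry step above can be sketched without the SDK itself. This is a simplified toy loosely modeled on the W3C `traceparent` header convention, not the actual OpenTelemetry implementation:

```python
import uuid

def start_span(headers):
    """Continue a trace from incoming headers, or start a new one.

    Simplified sketch of W3C-style trace context: "00-<trace_id>-<span_id>-01".
    """
    parent = headers.get("traceparent")
    if parent:
        _, trace_id, parent_span, _ = parent.split("-")
    else:
        trace_id, parent_span = uuid.uuid4().hex, None
    span_id = uuid.uuid4().hex[:16]
    return {"trace_id": trace_id, "span_id": span_id, "parent_span_id": parent_span}

def outgoing_headers(span):
    """Inject the current context so the next service joins the same trace."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}

# Service A starts a trace; service B continues it from the injected headers.
span_a = start_span({})
span_b = start_span(outgoing_headers(span_a))
print(span_b["trace_id"] == span_a["trace_id"])        # True: same trace
print(span_b["parent_span_id"] == span_a["span_id"])   # True: causality preserved
```

Because every service speaks this one common language for context, a backend can stitch spans from many teams' services into a single trace waterfall, which is the property that makes the data portable between vendors.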

Observability Driven Development

  • The auto-instrumentation can only get you so far. Auto-instrumentation is never going to capture things from POST bodies, because that’s how you get password leaks, right? So you have to pick and choose which attributes you want to send along. And the answer is anything non-sensitive, basically. You would laugh if someone said, “I can automatically create your tests for you” or “I can automatically create code for you.” It’s like, “No, you can’t.”

  • I think it’s super important to understand what this code is going to look like in production. So the best way is to add that telemetry, to add those key-value pairs into your code as you’re developing it. That way you can test it inside your test suites. That way, inside of your dev box, you can emit that telemetry data to a tracing tool and see what those trace spans look like. And hey, you might discover why your tests are failing. Not by kind of running a debugger, but instead by looking at this telemetry data.

  • The earlier you adopt observability into your developer life cycle, the more your developers will use it at all stages of development, not just at production, and the sooner you’ll catch bugs. It’s really synergistic with test driven development, but you’re also exercising your observability code during your tests.
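A toy sketch of what exercising telemetry inside a test suite might look like. The recorder, handler, and attribute names are all invented; a real setup might use something like OpenTelemetry's in-memory span exporter instead of a global list:

```python
# Toy illustration of observability driven development: the telemetry is part
# of the code and is exercised by the test suite. All names here are invented.

RECORDED_SPANS = []

def handle_checkout(user_id, cart_size, flags):
    """A hypothetical handler that leaves breadcrumbs as it runs."""
    span = {"name": "checkout", "user_id": user_id,
            "cart_size": cart_size, "feature_flags": sorted(flags)}
    RECORDED_SPANS.append(span)  # in production this would export to a backend
    return cart_size > 0

def test_checkout_emits_telemetry():
    RECORDED_SPANS.clear()
    assert handle_checkout("u42", 3, {"new_cache"})
    span = RECORDED_SPANS[0]
    # The same breadcrumbs we will need in production are asserted on in tests,
    # so broken instrumentation fails the build instead of failing an incident.
    assert span["user_id"] == "u42"
    assert span["feature_flags"] == ["new_cache"]

test_checkout_emits_telemetry()
print("telemetry assertions passed")
```

Treating the instrumentation as a tested contract is what makes it synergistic with test driven development: the observability code runs on every commit, not only during an outage.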

Having Developers On-Call

  • I think that it is important for developers to have some stake in the production running of their application. However, that does not necessarily have to take the form of on-call for every engineer. But I think at the team level, it is an important thing for teams to build and run their own software.

  • I think that if a lot of developers understood the pain that ops teams go through, they might have a little bit more empathy. But at the same time, you shouldn’t throw someone who isn’t prepared directly into that. That handoff process can be really helpful for encouraging ownership. But ownership and responsibility need to be accompanied by giving people the tools that they need to be successful.

  • You wouldn’t assign a brand new developer to design the architecture of your future system. So why are you asking people who’ve never operated a system before to go swim in the deep end? I think you have to offer this graceful path, and I think that observability really is this methodology for kind of bridging the ops and dev roles together.

  • The reason why people did monitoring before was that monitoring is outside-in observation, or outside-in measurement. Because often, operators didn’t necessarily have the ability to make code changes to the system. So they were limited only to what their APM agent or their monitoring tool could measure.

  • When you have the shared responsibility of developing the telemetry and observing and looking at this telemetry, that suddenly bridges those two worlds together because you’re having the instrumentation being added in this virtuous cycle, along with looking at the results of the instrumentation.

  • It is a key responsibility of SREs to make the system debuggable, to make the system have SLOs, and to monitor those SLOs.

Observability Maturity Model

  • There are these five key areas in that observability maturity model where observability can help you.

  • We talked about observability driven development and kind of getting the code right the first time and exercising that execution path early. That’s kind of one area of maturity.

  • The other area of maturity that we actually didn’t touch on yet is Continuous Delivery. For me, observability has a “you must be this tall to ride” requirement: you must be able to ship code every couple of weeks, at least. Because it doesn’t matter that you decrease your time to resolve issues from three hours to five minutes if you can only ship code every three months, or if you can only ship new instrumentation every three months. It doesn’t make sense. Focus on making your delivery cycle faster.

  • If you are struggling with your code and being able to ship it, start by observing your delivery pipeline, not your code. So that’s kind of another key maturity area: do we understand why our build suddenly starts taking longer than, in our case, 15 minutes? We try to keep our builds under 10 minutes so we can deploy to production once an hour or faster. So keeping that software delivery cycle snappy, or getting it to be snappy, is another key area.

  • And then third, we talked a lot about the break-fix and production resilience workflow. I think it’s also really important to think about the user analytics. To what extent are people making use of your product? If no one is using it, it doesn’t really matter that you shipped that feature.

  • And then finally, I think to tie it all together, technical debt, and managing that technical debt, is really important. And observability can surface it: “Hey, you have this single point of failure.” Or, “Hey, you have this circular dependency.”

Tech Lead Wisdom

  • It is much, much more important to make the social piece work than the technical piece. It’s really important to get people talking to each other. Software delivery is much more about people communicating and being on the same page and having a shared working model than it is about what tools you use.

  • So don’t think first about introducing new tools. Think first about aligning people on the outcome and making sure everyone is agreed on the outcome. And then you can worry about what tools you’re going to use to implement it.

Transcript

[00:01:27] Episode Introduction

Henry Suryawirawan: Hello to all of you, my friends and listeners. Welcome to the Tech Lead Journal podcast, the show where you can learn about technical leadership and excellence from my conversations with great thought leaders. And I’m very happy today to present the 88th episode of the podcast. Thank you for tuning in and listening to this episode. If this is your first time listening to Tech Lead Journal, subscribe and follow the show on your podcast app and social media on LinkedIn, Twitter and Instagram. If you are a regular listener and enjoy listening to the episodes, support me by subscribing as a patron at techleadjournal.dev/patron.

My guest for today’s episode is Liz Fong-Jones. Liz is the Principal Developer Advocate for SRE and Observability at Honeycomb, and recently she published her book titled “Observability Engineering”, which she co-authored with her colleagues, Charity Majors and George Miranda. In this episode, Liz shared in depth about the concept of observability and why it is becoming an important practice in the industry nowadays. She started by explaining the fundamentals of observability and how it differs from traditional monitoring, and how observability can help us run more reliable and stable production systems, including its relation to the DevOps and SRE practices. She explained some interesting concepts, such as the core analysis loop, cardinality and dimensionality, and debugging from first principles. In the later part of the conversation, Liz shared her view of the current state of observability, including the proliferation of vendor and open source tools, and how we, engineers, can improve our system’s observability by doing observability driven development and improving our practices based on the proposed observability maturity model found in the book.

I really enjoyed my conversation with Liz, diving deep into observability and understanding the different nuances of observability concepts. If you are interested in this topic, I would also highly recommend reading the “Observability Engineering” book, which Honeycomb has kindly provided as a free eBook on their website. Find the link in the show notes to get your free copy.

And if you also enjoy this episode and find it useful, share it with your friends and colleagues who may also benefit from listening to this episode. Leave a rating and review on your podcast app, and share your comments or feedback about this episode on social media. It is my ultimate mission to make this podcast available to more people. And I need your help to support me towards fulfilling my mission. Before we continue to the conversation, let’s hear some words from our sponsor.

[00:05:04] Introduction

Henry Suryawirawan: Hello, everyone. Welcome back to another new episode of the Tech Lead Journal podcast. Today I’m very excited to finally meet someone who I have been following for quite a while. Liz Fong-Jones is here with us. So today we’ll be talking a lot about SRE and observability, which are topics and trends that are up and coming these days. Liz is actually the Principal Developer Advocate at Honeycomb. She has been working in SRE and observability for maybe 16 years, by her bio. It’s really a pleasure to meet you today, and I’m looking forward to learning everything about SRE and observability today.

Liz Fong-Jones: I’m not sure we will get to everything, but we can certainly do a good chunk of it. Hi, thanks for having me on the show.

[00:05:44] Career Journey

Henry Suryawirawan: So Liz, for people who may not know you yet, maybe you can help introduce yourself, telling us more about the highlights or turning points in your career.

Liz Fong-Jones: Yeah. So, I started working as a systems engineer in 2004. I’ve been doing all kinds of work on reliability and making systems better and easier to operate. That includes spending a number of years working at a game studio, and a number of years working at Google; that’s where I spent most of my career, as a Site Reliability Engineer. And then my career took a turn towards thinking about how do I teach not just my team to work on better systems, but how do I teach all teams across the software industry to do better? So that’s when I became a Developer Advocate, and I switched over to working at Honeycomb.

[00:06:30] Observability

Henry Suryawirawan: So, yeah. I’ve been following you with all your SRE contents. I started probably with the SRE contents, mostly. And then it took a turn since you joined Honeycomb. And now we’re talking a lot more about observability. And you have an upcoming book, which is titled “Observability Engineering”. So this is something that you wrote with Charity Majors and one of your colleagues in Honeycomb.

Liz Fong-Jones: Yes. George Miranda.

Henry Suryawirawan: Yes, George Miranda. Observability is definitely one of the hottest topics these days, especially in the technology world. Maybe you can start by helping us understand: what actually is observability? Because this term is commonly used these days.

Liz Fong-Jones: Let’s contextualize in terms of both SRE and observability. Not all of your listeners will necessarily have heard of SRE either. So, when we start with thinking about what SREs do, the goal of a Site Reliability Engineer is to try to make systems easier to operate for the people that are working on them, whether it’s SREs themselves or on-call or whether it’s the software engineering team that is kind of building and operating and running the system.

One of those key considerations in terms of ensuring the system is highly available is thinking about what are your Service Level Objectives? What are your reliability goals for the system? And then, on the flip side, how do we think about trying to achieve those goals? What happens if there is an incident? Does it take five minutes to fix or does it take three hours to fix? To me, that’s where observability comes in. Because observability is a technique for ensuring that you can understand novel problems in your system. That’s why it was a natural transition for me to go from working on SRE to working on observability. So when we think about defining observability, it’s basically this question: can you understand what’s happening in your system and why, without having to push new code, and can you do so very quickly by slicing and dicing existing data that you already have, in terms of telemetry signals that are coming out of your system?

Henry Suryawirawan: So you mentioned a couple of key points when you described observability here. The first is without pushing new code. If I think back to when I used to be a software developer, whenever I wanted to troubleshoot or debug something, sometimes I introduced new code.

Liz Fong-Jones: Yes. The print and debug statement. All of us are guilty of it.

Henry Suryawirawan: So tell us more about this without pushing new code. How can we actually do this?

Liz Fong-Jones: Yeah. So the key way that we think about observability, and that first step of instrumentation, is that as you’re developing your application, you should add instrumentation to help you understand what’s happening inside of your code, so that you don’t wind up being caught on the back foot later. That doesn’t mean you have to predict every single failure that is going to happen in advance. You just have to leave yourself the breadcrumbs that you’re going to need to debug in the future.

So this starts with, in my view at least, having some form of tracing to know when each request started and stopped in each service. Where did that request come from? Which other service called you in turn? And maybe some other properties, like which user is it? The more you enrich your trace with information that may or may not seem relevant at the time, the better. If you can freely add that detail of which feature flags are on at the time, which user ID it is, which browser they’re using, which language they’re using, then it means down the road, you don’t have to have pre-aggregated by any of those dimensions. You can see which, if any, of those dimensions are relevant.
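One way to picture this kind of instrumentation is a single “wide event” per request, enriched with every attribute that might matter later. A minimal sketch in plain Python (the service name and field names here are hypothetical, and a real system would use a tracing SDK rather than a bare dict):

```python
import json
import time

def wide_event(user_id: str, browser: str, feature_flags: dict) -> dict:
    """Build one wide, context-rich event for a single request."""
    event = {
        "timestamp": time.time(),
        "service": "checkout",  # hypothetical service name
        # Breadcrumbs that may or may not matter later: record them anyway,
        # so you can slice by them down the road without pushing new code.
        "user_id": user_id,
        "browser": browser,
    }
    # Flatten the active feature flags into individual keys.
    for flag, enabled in feature_flags.items():
        event[f"flag.{flag}"] = enabled
    return event

# A request handler would build one of these, time the work, and emit it:
ev = wide_event("user-42", "firefox", {"new_checkout": True})
print(json.dumps(ev))
```

Because every attribute rides along on the event itself, nothing has to be pre-aggregated by user, browser, or flag up front.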

[00:09:57] Pillars of Observability

Henry Suryawirawan: So instrumentation, tracing, all of these are being mentioned these days, and people also normally categorize them into the three pillars of observability: logs, metrics, and tracing. So tell us more, are these three the most important things that we need to implement for observability?

Liz Fong-Jones: I think that having high-quality signals does matter. You’re not going to get great observability unless the data is there. However, you can’t just throw a bunch of data into a data lake and say that you’re done. So I think that’s why I push back against the three pillars narrative. Because one or two really high-quality signals is enough. You don’t necessarily need voluminous debug logs if you have tracing that is representative of what you would otherwise put through your logging. Metrics can be useful in high-volume situations. But the problem with metrics is that they’re pre-aggregated. You can’t really slice data that’s been pre-aggregated at the source, right? So I think that each signal has benefits and drawbacks that you need to evaluate, and you don’t need to collect all three of them.

In fact, there are new and emerging signals like continuous profiling. So there are not really even three pillars, or lenses, or whatever you want to call them. There are many different telemetry types that we can think about utilizing as we compile this capability that’s really not a technical capability, but instead a socio-technical capability. What can your engineers accomplish with the system? Not what are you measuring on the system. Can you actually analyze it? Can you actually get to that result of “I can figure out what’s happening”?

Henry Suryawirawan: So one thing that is also interesting for me, when I read and study about observability, right? It mentions that the goal of observability is actually to provide a level of introspection or detail that helps you understand the internal state of the system. So the keyword here is internal state. Tell us more about this. How does this differ from something that monitors external state?

Liz Fong-Jones: Yeah. So I think that when you’re measuring external state, you are measuring things like: what’s the CPU utilization? What’s the memory utilization? Those things are potentially useful, but they don’t give you an idea of what the application was doing at the time that a particular request was executed. So that detail of what the code was actually doing, and not what the side effects of the code are, is what differentiates observing it from the inside out rather than from the outside in.

[00:12:28] Monitoring and SLO

Henry Suryawirawan: The other thing that people in the industry are used to, traditionally, is what we call monitoring. When we talk about system administrators, we used to talk about monitoring. And over the past several years, we’ve started to shift from monitoring to observability. So is there any significant difference between monitoring and observability, or is it just a new term coined to rebrand the same thing?

Liz Fong-Jones: To me, observability is a superset of the capabilities that monitoring would provide. So to me, they’re not at all synonymous. Because with monitoring, you’re just trying to figure out when something has broken, but it won’t necessarily answer why. Which is why people thought for a long time: I use my monitoring to tell me something has gone wrong, and I use my logs to figure out what happened. Guess what? In a complex distributed system, logging is not going to save you anymore. It takes forever to search through. It doesn’t give you the causality of what caused what.

So when we think about the relationship of monitoring and observability, I think this is why I mentioned Service Level Objectives earlier. To me, Service Level Objectives are kind of monitoring 2.0. They’re the answer to the “tell me when something has gone wrong” that monitoring claimed before. Whereas observability kind of maps more closely, to me, to kind of replacing logging than it does to replacing the metrics we used for monitoring in the past. So what do I mean by this? Well, when you have your Service Level Objective, you define what an acceptable level of success is.

For instance, at Honeycomb, we aim for our ingest pipeline to be 99.99% reliable. That means no more than one in 10,000 packets of incoming telemetry are dropped. But suppose we start seeing an elevated error rate. Say we’re allowed to drop one million telemetry packets per month, and we start dropping a hundred thousand per hour. We’re going to exhaust that error budget for the month in 10 hours. To us, that’s an emergency. We’ll go and fix it right away. But if we have a thousand bad requests being dropped per hour, exhausting the budget is going to take a thousand hours. That’s not an emergency. So it differentiates between expected levels of errors and unexpected levels of errors, and it is not susceptible to the same problem people have had with monitoring in the past of, “Oh my God, it’s 2:00 AM. One out of one request failed. That’s a hundred percent error rate. Wake up everyone.” No one likes to be paged at 2:00 AM for something that flapped and went away on its own. So like peanut butter and jelly, they go better together, and together they provide a superset of the kind of monitoring and logging that people used to do in the past.
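The burn-rate arithmetic Liz walks through can be checked with a few lines of Python (the numbers follow her example; the function name is ours):

```python
def hours_to_exhaust(budget_per_month: int, errors_per_hour: int) -> float:
    """Hours until the monthly error budget is exhausted at a given burn rate."""
    return budget_per_month / errors_per_hour

# Liz's example: a budget of one million dropped telemetry packets per month.
budget = 1_000_000

fast_burn = hours_to_exhaust(budget, 100_000)  # burning fast: page someone now
slow_burn = hours_to_exhaust(budget, 1_000)    # burning slowly: not an emergency
print(fast_burn)  # 10.0
print(slow_burn)  # 1000.0
```

The same budget tolerates a slow trickle of errors for over a month, while the fast burn would exhaust it in half a day, which is exactly the distinction an SLO-based alert makes that a raw error-rate alert cannot.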

[00:15:06] Core Analysis Loop

Henry Suryawirawan: So, one thing I noticed when reading your book in preparation for this conversation is that you mention traditional monitoring is more reactive. You mentioned getting alerted. You only see side effects of certain things happening. While observability is more investigative. So tell us more about this investigative manner. When an incident happens, how does observability actually help you?

Liz Fong-Jones: Yeah. So observability really helps you form and test hypotheses really quickly. Because when I am trying to debug an incident, if my error budget alarm has gone off, the number one thing I start asking is: what’s special about the requests that are failing? Instead of looking at this wall of graphs to figure out which lines wiggle at the same time, what observability gives me is the power of causality. The power to look at the requests that failed and correlate which properties were most associated with the failures. Maybe it’s one customer, maybe it’s one build, right? Or maybe it’s a combination of those two things.

And the really neat thing is, in the case of Honeycomb at least, the way we’ve implemented this is a feature called BubbleUp, where you can draw a box around the anomaly, and it’ll tell you what’s special about the anomaly. You’re telling us what the anomaly is. We’re not trying to guess based off of two sigma or whatever. We’re just telling you, “Here’s the difference between your control and your experiment.” The cool thing is, from there, you can refine. You can filter only to this population of events, and then repeat the process. Or group by this field, or by a combination of fields, to confirm that hypothesis. So instead of theorizing, “Oh my God, how do I set up this query? How do I write it in this obtuse query language? Oh God, I hope I got it right, because it’s going to take two minutes to run over all these logs.” Instead, you get that feedback instantly.

So going back to our SLOs: 99.5% of Honeycomb queries complete within 10 seconds. There’s no cost to messing things up. You can feel free to just try and experiment, to analyze and slice your data in a particular way, and you’ll get your result within 10 seconds. If it doesn’t work out, try a different query. If it does work out, you now have the ammunition you need to run your next query.

And then ultimately, at the end of the day, you can visualize this either as aggregated metrics, like what the distribution is, or you can go in and look at the raw data in a tabular, log-like format showing each of the fields on the events that matched, or you can look at it as a trace waterfall. That allows you to slice and visualize the data any way you need in order to understand why this is slow, or where this failure is coming from. So it starts with forming your hypothesis, narrowing down the data, and then looking at the data to confirm your hypothesis.
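A BubbleUp-style “what’s different about the failing population?” comparison can be approximated with a frequency diff over event attributes. This is a rough illustration of the idea, not Honeycomb’s actual algorithm, and the event shapes are invented:

```python
from collections import Counter

def whats_special(failing: list, baseline: list, key: str) -> list:
    """Rank values of `key` by how over-represented they are in failing events."""
    fail_counts = Counter(e.get(key) for e in failing)
    base_counts = Counter(e.get(key) for e in baseline)
    scores = []
    for value, n in fail_counts.items():
        fail_frac = n / len(failing)                      # share of failures
        base_frac = base_counts.get(value, 0) / len(baseline)  # share of baseline
        scores.append((value, fail_frac - base_frac))
    return sorted(scores, key=lambda s: -s[1])

baseline = [{"customer": "a"}, {"customer": "b"}, {"customer": "c"}, {"customer": "d"}]
failing  = [{"customer": "b"}, {"customer": "b"}, {"customer": "b"}, {"customer": "a"}]
top = whats_special(failing, baseline, "customer")
print(top[0])  # ('b', 0.5): customer "b" dominates the failures
```

The core analysis loop is then: run this over every dimension, filter to the most suspicious population, and repeat on the narrowed data.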

Henry Suryawirawan: Yeah. So if I understand correctly, what you just described is the, what you call, core analysis loop. When an incident happens, this is the sequence of events. You look at a bunch of data that you load initially, then you figure out, okay, this looks interesting. Maybe you use BubbleUp in Honeycomb, right? Where you narrow down the search results. And then from there, you correlate, and then you start again, and you iterate until you actually find the root cause. Do I understand correctly?

Liz Fong-Jones: Yes. Maybe not necessarily the root cause, but a proximate set of triggers that may have tipped the system over the edge. There is usually no one root cause in a complex system. The other thing that I’ll point out is that the thinking about the core analysis loop is work from long before I came to Honeycomb. When I was working at Google, I formulated this hypothesis about the core analysis loop and how to make it faster. And then I realized that Honeycomb was going to be the manifestation of how we bring this technology not just to Google employees, but to the rest of the world.

[00:18:41] Cardinality and Dimensionality

Henry Suryawirawan: Nice, nice. So, one thing you mentioned about this core analysis loop is slicing and dicing data, right? I think one of the commonly mentioned requirements of observability is high cardinality and dimensionality. Without these, it’s probably difficult to slice and dice data, because there’s nothing interesting you can narrow down by. So tell us more: what do you mean by cardinality and dimensionality? And why do they have to be high?

Liz Fong-Jones: Yeah. So earlier, I referred to this idea that when you’re instrumenting your code, you should feel free to add as many attributes, as many key-value pairs, as you like to really explain, as you go along, what’s going on in your code and leave those breadcrumbs. It’s almost like adding tests or adding comments, right? It’s just a sensible way of leaving your future self an annotation that explains: what is this code doing? What is this code thinking? The reason people have hesitated to do that in the past, at least with traditional monitoring systems, is that monitoring systems bill you by the number of distinct key-value pairs that occur on each metric you record. So that’s caused people to either emit these key-value pairs to logs, where they get lost, or to not record them at all.

So when we think about high dimensionality, what we’re encouraging you to do is to sprinkle these annotations throughout your code and encode as many keys as you like. But there’s a problem. The problem is that even if you reduce the number of keys in a metrics-based system, it turns out that if you have a lot of distinct values, like user ID, right? Let’s suppose you have millions of users. It turns out that your metric system has to create a time series for each distinct user and track it forever, even if that user has only appeared once. So there is this amortization built into a metric system: it expects that your keys and values will get reused often, and therefore there is a high upfront cost to having a new key and value. That’s why metric systems penalize you for having a high-cardinality dimension. So to sum this up: dimensionality is about the number of distinct keys, and cardinality is about the number of distinct values per key. An ideal system that supports observability should allow basically unlimited cardinality and a very, very high amount of dimensionality.
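The two definitions can be made concrete in a few lines of Python. This is just an illustration over a list of wide events; the event fields are invented:

```python
def dimensionality(events: list) -> int:
    """Number of distinct keys across all events."""
    return len({k for e in events for k in e})

def cardinality(events: list, key: str) -> int:
    """Number of distinct values observed for one key."""
    return len({e[key] for e in events if key in e})

events = [
    {"user_id": "u1", "browser": "firefox", "status": 200},
    {"user_id": "u2", "browser": "chrome",  "status": 200},
    {"user_id": "u3", "browser": "chrome",  "status": 500},
]
print(dimensionality(events))          # 3 distinct keys
print(cardinality(events, "user_id"))  # 3: a high-cardinality dimension
print(cardinality(events, "status"))   # 2: a low-cardinality dimension
```

A metrics system pays a per-time-series cost for every distinct `user_id` value; an event store pays only per event, which is why high-cardinality fields like user ID belong in events rather than metrics.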

Henry Suryawirawan: So for people who are trying to understand these concepts of cardinality and dimensionality, let me just repeat what Liz said. Cardinality refers to the number of unique values that you’re storing in your metric system, while dimensionality refers to the number of unique keys that you’re sending to your metric system.

[00:21:20] Debugging From First Principle

Henry Suryawirawan: Another thing that I learned about observability from the book. I mentioned in the beginning that I used to go to the server and make code changes to do debugging. But in your book, you actually mention that if you use observability, you’re doing debugging from first principles. Tell us more about this concept, because I’m still trying to understand this part.

Liz Fong-Jones: Yes. So we talked earlier about the core analysis loop, which sets us up well to think about this. When you have a really well-functioning core analysis loop, where you’re able to rapidly form hypotheses and test them, you no longer need to make these leaps of intuition to the one magically right hypothesis. Instead, you can test a lot of different hypotheses, narrow down your search space, and eliminate red herrings and dead ends. Whereas in the previous ways of debugging, you would have this one person who knew the system really well in their head, who could immediately jump to knowing what the right answer was. “Oh, I saw this two months ago. It’s this, right?”

That’s why I think first principles debugging is much better: it enables anyone on your team who has learned how to do first principles debugging to step into any unfamiliar situation and figure it out in a predictable amount of time. Sure, your expert, who has seen the system and has been at the company for 15 years, might be able to solve the issue in two minutes. But my goal is to make it so that in the worst-case scenario, someone who’s inexperienced but knows first principles debugging will take 10 minutes, or 30 minutes at most, to debug the issue, rather than three hours, five hours, 24 hours, or worse. What happens if that person who’s been at the company for 15 years leaves the company or retires? That happens these days. So we talk a lot at Honeycomb about this idea of bringing everyone on your team to the level of the best debugger. So you don’t need these magical flashes of intuition. Instead, everyone has that base capability.

This is really personal to me because I started at Google in 2008 and left in 2019. Eleven years. I had been on 12 different teams at Google in those 11 years; I was changing teams on average slightly faster than once per year. I stayed two years on one team, but I also stayed just six months on another. I never got enough time on any one team to really be the deep expert on a system. But I was still a really great and valued team member, because I had gotten really good at first principles debugging, at understanding what software was available in the Google tool stack to understand any unfamiliar service. Because Google had this standardization: every service at Google uses the same tracing, the same metric system, the same logging system. And therefore, it was possible for me to walk into a completely unfamiliar situation and figure it out in half an hour or two. That was surprising to people. Because if you’ve spent two or three years on a system, but not ten, it might take you several hours to pick out what’s going on. You don’t quite have that expert-level knowledge of the system, but you also haven’t developed that muscle of: how do I walk into any unfamiliar situation? Whereas for me, I was always somewhere between day one and day 180 on a team. I would have to figure things out on the fly for myself. That was my survival skill.

Henry Suryawirawan: So when you mention first principles, for those who are not familiar with this term, what do you mean by first principles? That’s the first thing. And tell us more: what are the skills or techniques that you categorize as first principles debugging techniques?

Liz Fong-Jones: Yeah. So in the field of engineering as a whole, when we talk about going back to first principles, what we mean is: throw out everything that you know. Start from understanding the system from the laws of physics, or the laws of mathematics, instead of reading the manual to see what this machine says it does. The machine is ultimately made up of a bunch of levers and wires and so forth. So you can trace everything back to understand: okay, what does this lever do? Not what does the manual say this lever does, or what does this person remember the lever doing? You go and look at the machinery connected to the back of this lever to understand what it does. To me, the definition of first principles debugging is to throw out the book and use the laws of the system, the laws of reality, to understand it.

When this comes to computer systems, what this fundamentally means to me, initially in my time at Google and now in my time at Honeycomb, is that I start by looking at the flow of execution of one request. I try to understand: what does a normal request look like, and what does an abnormal request look like? Where do they differ? What services are they passing through? Where are they spending their time? This is where I find examples very useful: finding example traces that exemplify both the slow path and the fast path, pulling them up side by side, and comparing them to understand where one is fast and where one is slow. And then that will bring to mind hypotheses and ideas about why these two things might be different, and I can start testing them. So that’s what I mean when I talk about first principles debugging. It’s not magically knowing which graph or which dashboard to look at. Instead, it’s spending a little bit of time puzzling out the difference between these two things based off of what I can observe about them from the outside.
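Comparing a normal trace against an abnormal one span by span can itself be sketched in code. This is a deliberately simplified model of a trace (just a list of named spans with durations; the span names are invented), but it shows the shape of the comparison:

```python
def span_durations(trace: list) -> dict:
    """Map each operation in a trace to its duration in milliseconds."""
    return {s["name"]: s["duration_ms"] for s in trace}

def biggest_divergence(fast: list, slow: list) -> tuple:
    """Find the span whose duration grew the most between the two traces."""
    f, s = span_durations(fast), span_durations(slow)
    diffs = {name: s.get(name, 0) - f.get(name, 0) for name in set(f) | set(s)}
    worst = max(diffs, key=diffs.get)
    return worst, diffs[worst]

fast_trace = [{"name": "auth", "duration_ms": 5}, {"name": "db.query", "duration_ms": 20}]
slow_trace = [{"name": "auth", "duration_ms": 6}, {"name": "db.query", "duration_ms": 900}]
print(biggest_divergence(fast_trace, slow_trace))  # ('db.query', 880): where the time went
```

No prior knowledge of the system is needed: the two traces themselves tell you where to look next, which is the first principles move.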

[00:26:49] Current State of Observability

Henry Suryawirawan: Thanks for that explanation. We’ve been talking a lot about the techniques, the internals, what observability is. Tell us more: why has observability become so hot right now? Is it just a trend, with so many tools available? Or is it actually solving a real problem?

Liz Fong-Jones: I think the answer, unfortunately, is yes to both, and those are non-overlapping sets. In terms of the definition of observability I gave earlier, of understanding complex systems from first principles and being able to debug unknown problems: the reason we need this today is because we are building cloud native microservices, where you can no longer use a traditional logging system that collects everything from the one host where your entire query runs, or do APM on the entire host to understand the slow paths on just that one host. That doesn’t work anymore if you have microservices, where a request needs to flow through services maintained by more than one team. At that point, you have these squishy boundaries where no one person holds all the information about what’s going on in their head.

So I think the motivation is that the complexity of systems has meant that we can no longer debug systems based off of known unknowns: metrics that we thought to pre-aggregate in advance, or magical flashes of insight from the expert. Because there is no longer an expert on every aspect of your system. There might be an expert on one or two aspects of your system, but not on how all the pieces fit together. So that’s the motivation. That’s why observability is crucial and needed.

The problem is that, as you mentioned, there’s this three pillars framing. The people who are already selling you solutions to one, two, or maybe even all three of these so-called pillars are trying to persuade you that what they do satisfies this requirement of being able to understand your complex systems. So they say: three pillars of observability. You can have observability if you just buy our logging and tracing and metrics solutions. I think that’s why you’ve seen so much marketing buzz and noise about it: everyone in the industry has started calling what they do observability, even if it isn’t. What’s hilarious is you see all these companies, like data observability, source code observability, and I’m like, does this word mean anything anymore?

So, going back in history, Honeycomb first started using the word observability in 2017, and it actually predates us by a little bit. I think the Twitter observability team predates us by one or two years in terms of using the control theory word observability to spread this idea of understanding systems. And then you see this explosion of people starting to call things observability from 2019 onwards, really. There’s this plethora of companies that say, “Oh, we’re an observability tool.” And it’s like, “Really? Okay, well, let’s see. Can you actually understand your systems altogether, or is this just a rebranding of your existing monitoring and logging solutions?”

Henry Suryawirawan: Yeah, sometimes this is also my confusion. When you see some brands or products call themselves observability, most of them are maybe white labeling, so to speak, right? Where you can actually see logs, metrics, and traces. Sometimes they are not integrated, in fact. So you see three different features, three different things, and you have to correlate them using human intuition.

Liz Fong-Jones: Yeah. And what’s worse is they’re making you pay for observability three different times, right? They’re selling you three different SKUs. There should be one set of data, so there’s no mismatch, so you can jump fluidly between views, and so it doesn’t cost you an arm and a leg.

[00:30:20] Implementing Observability

Henry Suryawirawan: Yeah. The cost part, I agree, because it can cost you a lot when you send a lot of data. I don’t know about your personal experience working at Honeycomb or meeting developers out there. In my experience, even though we have these cloud native microservices, there are still people who, believe it or not, do not implement any tracing, do not implement any metrics, or have very few logs in their systems. Tell us more: why is this still happening, even though we see all this buzz about observability?

Liz Fong-Jones: I think the reason is that it works until it doesn’t. System complexity is a very sneaky thing. You think you can understand everything in your head, and you’re that magical person who can debug everything within five minutes, until you hit the thing that stumps you, that takes three hours, or until you go on vacation and someone is calling you because they can’t figure out the system that you’ve built. So I think that’s why people don’t necessarily invest in observability when they should: the cost of inadequate observability sneaks up on you. A lack of observability is a form of technical debt. And we know how good the industry is at paying down its technical debt. So I think that’s what it is.

I think the other, unfortunately sinister, thing is that it feels really good to be the person who magically debugs the problem. It feels like job security. It feels like accomplishing something. If you’re the most senior engineer on your team and you know everything about the system and you’re the one making decisions about it, you may not necessarily care quite as much as the engineer who is new to your team and is struggling to understand how it all fits together.

Henry Suryawirawan: Now, if people have listened to this episode and they want to start implementing observability, how can they do it? Do they go and buy or subscribe to one of the solutions available out there? Install open source? Maybe tell us more about how we should start.

Liz Fong-Jones: I think there are two pieces to it. There’s the foundational understanding of what observability is and isn’t. And you mentioned our book, “Observability Engineering.” We can drop a link in the show notes to a way to download a free copy. So I would start by reading that book, or at least the first couple of chapters, just to make sure that you have the language to explain to your team why you want them to change their practices.

You know, you can add all the tools in the world, right? This is what I discovered at Google. We had really good tracing at Google since 2008. No one used it. No one used it because it was hard to use and no one could see why they needed to use it. There was no easy linkage to people’s existing debugging workflows. So adding a new tool doesn’t necessarily solve things unless people understand the motivation and understand how they are going to use it.

Once you have people bought in, the next step is to add the OpenTelemetry SDK to your application. So, what is OpenTelemetry? OpenTelemetry is a vendor-neutral, open source SDK that allows you to generate telemetry data, whether that’s metrics, logs, or traces. The idea is that it’s a common language for producing this data, transmitting it around, and propagating state and context between different microservices. So when you add OpenTelemetry to your application, you’re not locking yourself into any particular solution. You’re making an investment in your future, in the same way that adding a new test framework to your application is an investment in your future.
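The “propagating state and context between microservices” part works by passing trace identifiers along on outgoing requests. OpenTelemetry handles this for you; as a rough illustration of the mechanism, here is a hand-rolled sketch of the W3C Trace Context `traceparent` header format it uses (version, 32-hex trace ID, 16-hex span ID, flags), which you would never need to write yourself:

```python
import secrets

def make_traceparent(trace_id: str = "", sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars, shared by the whole trace
    span_id = secrets.token_hex(8)                # 16 hex chars, unique per span
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    """Extract the trace context a downstream service would continue from."""
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id, "sampled": flags == "01"}

# Service A starts a trace and sends the header downstream;
# service B parses it and continues the same trace with a new span.
outgoing = make_traceparent()
ctx = parse_traceparent(outgoing)
child = make_traceparent(trace_id=ctx["trace_id"])  # same trace, new span
print(outgoing, "->", child)
```

Because every service stamps the same trace ID onto its spans, a backend can reassemble the full request flow across team boundaries.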

You still might need to add some tests, or you might need to add some manual instrumentation, but OpenTelemetry, in general, does a good job of automatically generating trace spans for your application based off of the frameworks you’re using. So you can have rich data about what requests are flowing in and out with the framework and language you’re using. Then you’ll have to pick a place to send that data, and there are a wide number of options. You can certainly use open source solutions like Jaeger. The trouble with those open source solutions is that they’re great for visualizing individual traces, but they’re not necessarily going to provide a comprehensive replacement for your existing monitoring workflows. People often need to understand the aggregate view of the system as a whole and then zoom into a trace, and Jaeger only fulfills the individual-trace part of that.

So at that point, you might want to look at vendor solutions, and, you know, certainly Honeycomb is out there. I also think very highly of our competitor, LightStep. But basically, any backend that supports OpenTelemetry is going to be a place you can send that data to. And the bonus of OpenTelemetry being vendor neutral is that it supports teeing the data. You can actually send to more than one provider at the same time and see which one you like, which I think is really great. Competition is better for the market. It’s great for everyone, and it really incentivizes us as vendors to do the right thing. So that’s why I recommend OpenTelemetry. It handles a lot of what would otherwise be the rote work of creating trace spans and propagating all that state around, and it doesn’t lock you in. It gives you freedom of choice.

Henry Suryawirawan: So thanks for the tip about opting for OpenTelemetry. OpenTelemetry itself is like a new standard. Previously, there was OpenTracing.

Liz Fong-Jones: It’s the merger of OpenCensus and OpenTracing. If you have two standards, why not merge them into one? We actually have managed to deprecate both OpenCensus and OpenTracing. We’re not doing that XKCD comic thing of “now there are 23 standards.”

Henry Suryawirawan: Earlier you mentioned that if you choose vendor solutions, normally they will charge you by the amount of data being sent to their systems. So tell us a bit about the practical choices here. How do we assess solutions, first of all, if, let’s say, we want to go with a vendor’s SaaS-based product?

Liz Fong-Jones: Yeah. I think you have to think about cost effectiveness. How much is it going to cost you, and what are the benefits you get? You may not necessarily want to go with the lowest-cost vendor, because the lowest-cost vendor might provide you with a very primitive ability to analyze and debug. A lot of the return on investment from observability comes from saving your engineers’ time, improving your customer outcomes, and decreasing customer churn. That is far more important than the cost of the base solution. For instance, Jaeger is free. But how much time is it going to save you in debugging, really?

So I think there’s this continuum, and it’s important to focus on your evaluation criteria, and to specify them upfront: we want to be able to understand issues within half an hour. Or: we want to be able to measure our Service Level Objectives in the first place, to even understand where we’re going wrong. So I think that’s one dimension. And to go along with that: how intuitive is it? Is everyone on my team adopting it?

[00:36:53] Observability Driven Development

Henry Suryawirawan: So after we choose OpenTelemetry and we know which vendor solution to use, I think at the end of the day, the developers themselves need to instrument the code.

Liz Fong-Jones: That's correct. Auto-instrumentation can only get you so far. Auto-instrumentation is never going to capture things from POST bodies, because that's how you get password leaks, right? So you have to pick and choose which attributes you want to send along. And the answer is anything non-sensitive, basically. But the telemetry framework is not going to be able to do that automatically.
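The "pick and choose which attributes you send" idea can be sketched as a simple allow-list filter applied before request data is attached to a span. This is an illustrative, stdlib-only sketch; the names (`ALLOWED_ATTRIBUTES`, `annotate_span`) are hypothetical, not part of any real telemetry framework.

```python
# Hypothetical sketch: an allow-list of non-sensitive attributes,
# consulted before any request field is attached to a trace span.
ALLOWED_ATTRIBUTES = {"user_id", "plan_tier", "endpoint", "region"}

def annotate_span(span_attributes: dict, request_fields: dict) -> dict:
    """Copy only explicitly allowed fields onto a span's attributes.

    Anything not on the allow-list (e.g. a password in a POST body)
    is dropped rather than sent to the telemetry backend.
    """
    for key, value in request_fields.items():
        if key in ALLOWED_ATTRIBUTES:
            span_attributes[key] = value
    return span_attributes

attrs = annotate_span({}, {
    "user_id": "u-123",
    "endpoint": "/checkout",
    "password": "hunter2",   # sensitive: must never reach telemetry
})
print(attrs)  # {'user_id': 'u-123', 'endpoint': '/checkout'}
```

The point is that the developer, not the framework, makes the conscious decision about which fields are safe to emit.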

Henry Suryawirawan: Yeah. So auto-instrumentation can do that automagically up to a certain level. But at the end of the day, developers need to make a conscious decision to instrument their code.

Liz Fong-Jones: Exactly, right? But you would laugh if someone said, "I can automatically create your tests for you," or "I can automatically create code for you." It's like, "No, you can't."

Henry Suryawirawan: Yeah. Which brings us to the technique that you mentioned in the book. We have heard this kind of term a lot of times: something-driven development, right? So you've coined this observability-driven development and shifting left for observability. Tell us more about this technique, this concept. Why is it important?

Liz Fong-Jones: Yeah. I think it's super important to understand what this code is going to look like in production. So the best way is to add that telemetry, to add those key-value pairs into your code as you're developing it. That way you can test it inside your test suites. That way, inside of your dev box, you can emit that telemetry data to a tracing tool and see what those trace spans look like. And hey, you might discover why your tests are failing, not by running a debugger, but by looking at this telemetry data. So the earlier you adopt observability into your development life cycle, the more your developers will use it at all stages of development, not just in production, and the sooner you'll catch bugs. So the argument is that it's really synergistic with test-driven development, and you're also exercising your observability code during your tests.
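Exercising telemetry inside a test suite, as described above, might look like the following stdlib-only sketch. The in-memory span recorder and the `lookup_price` service are illustrative stand-ins, not a real tracing library; in practice you would use something like OpenTelemetry's SDK with an in-memory exporter.

```python
# Illustrative observability-driven development sketch: the test
# asserts on the emitted telemetry, not just the return value.
import contextlib
import time

RECORDED_SPANS = []  # stands in for an in-memory trace exporter

@contextlib.contextmanager
def span(name, **attributes):
    """Record a named span with key-value attributes and a duration."""
    record = {"name": name, "attributes": dict(attributes)}
    start = time.monotonic()
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - start
        RECORDED_SPANS.append(record)

def lookup_price(sku):
    """A toy service function instrumented with key-value pairs."""
    with span("lookup_price", sku=sku) as s:
        price = {"A1": 100, "B2": 250}.get(sku)
        s["attributes"]["cache_hit"] = price is not None
        return price

# Exercise the code path, then inspect the same telemetry you would
# slice and dice in production to see why something failed.
lookup_price("A1")
lookup_price("ZZ")  # unknown SKU
assert [s["attributes"]["cache_hit"] for s in RECORDED_SPANS] == [True, False]
print(len(RECORDED_SPANS))  # 2
```

If the second assertion failed, the recorded spans themselves would tell you which SKU missed and how long the lookup took, without attaching a debugger.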

Henry Suryawirawan: Yeah. So if I imagine being a developer again: as I develop a feature, I have to think about what kind of things I want to be able to observe, or look at in my observability tools, so that when an issue happens, I actually have that data. And then I implement or instrument it in my code. I think that's really key.

[00:39:06] Having Developers On-Call

Henry Suryawirawan: Another thing that you mentioned: if we want developers to make a more conscious decision about adding instrumentation, it helps to put them on call to support their own systems. So that's why this is actually very important.

Liz Fong-Jones: Yeah. So I think that it is important for developers to have some stake in the production running of their application. However, that does not necessarily have to take the form of on-call for every engineer. But I think, at the team level, it is important for teams to build and run their own software. In practice, a lot of teams struggle with that, because you just say, you know, "Here's the pager. Congratulations." As opposed to, "We're going to support you with high-quality observability and SLOs, and teach you this methodology for how you run a service. Now that we've trained you, here's the pager."

I think that if a lot of developers understood the pain that ops teams go through, the calluses that we formed from years of being subjected to 2:00 AM pages, and almost the masochism that it requires, they might have a little bit more empathy. But at the same time, you shouldn't throw someone who isn't prepared directly into that. We should clean it up a little bit before you say, "Hi, here's the pager, by the way." So I think that handoff process can be really helpful for encouraging ownership. But ownership and responsibility need to be accompanied by giving people the tools that they need to be successful. You wouldn't assign a brand new developer to design the architecture of your future system. So why are you asking people who've never operated a system before to go swim in the deep end? I think you have to offer this graceful path, and I think that observability really is this methodology for bridging the ops and dev roles together.

The reason why people did monitoring before was that monitoring is outside-in observation, or outside-in measurement. Because often, operators didn't necessarily have the ability to make code changes to the system. So they were limited only to: what can my APM agent measure? What can my monitoring tool measure? So I think that when you have the shared responsibility of developing the telemetry and observing and looking at that telemetry, that suddenly bridges those two worlds together, because the instrumentation is being added in this virtuous cycle, along with looking at the results of the instrumentation.

Henry Suryawirawan: So you mentioned all this "build it and run it". It comes back to the DevOps culture and SRE culture, right? So does observability also support the implementation of that DevOps and SRE culture?

Liz Fong-Jones: Yes, I think so. If you've seen some of my earlier videos on SRE and DevOps, I highlighted that it is a key responsibility of SREs to make the system debuggable, to make the system have SLOs, and to monitor those SLOs. At Google, we certainly didn't use the word observability until 2019. But what we were doing was probably closer to observability than to monitoring, because we were doing this practice of the core analysis loop and being able to understand the systems.

[00:41:59] Observability Maturity Model

Henry Suryawirawan: So for people who have been implementing this, how do they know they have done it right? Where are they lacking? I see in your book, you mention this term called the observability maturity model. Is there a way to gauge where people are at this point in time? And what else can they aspire to achieve?

Liz Fong-Jones: Yeah. I think that there are five key areas in that observability maturity model where observability can help you. We talked about observability-driven development, getting the code right the first time and exercising that execution path early. That's one area of maturity.

The other area of maturity that we actually didn't touch on yet is Continuous Delivery. For me, observability has a "you must be this tall to ride" requirement, which is that you can ship code every couple of weeks, at least. Because if not, it doesn't matter that you decreased your time to resolve issues from three hours to five minutes if you can only ship code every three months, or if you can only ship new instrumentation every three months. No, it doesn't make sense. Focus on making your delivery cycle faster.

So if you are struggling with your code and being able to ship it, start by observing your delivery pipeline, not your code. That's another key maturity area: do we understand why our build suddenly starts taking longer than, in our case, 15 minutes? 15 minutes is the hard cutoff for "oh my God, this is too slow." So we try to keep our builds under 10 minutes so we can deploy to production once an hour or faster. So keeping that software delivery cycle snappy, or getting it to be snappy, is another key area.

And then third, we talked a lot about the break-fix and production resilience workflow. I think it's also really important to think about user analytics. To what extent are people making use of your product? If no one is using it, it doesn't really matter that you shipped that feature.

And then finally, to tie it all together, technical debt, and managing that technical debt, is really important. And observability can surface it: "Hey, you have this single point of failure." Or, "Hey, you have this circular dependency." So I think that those are some of the key responsibilities that you have when you're thinking about how to add observability to your software delivery practices.

Henry Suryawirawan: Thanks for explaining this maturity model. You explain it in much more depth in the book, for people who want to understand which of these areas they need to invest time in. So make sure to check out the book. We'll put it in the show links later.

[00:44:10] Tech Lead Wisdom

Henry Suryawirawan: So Liz, I find myself having had a crash course on observability. Thank you so much for explaining all these concepts and implementations. Unfortunately, due to time, we need to wrap up soon. But before I let you go, I normally ask one last question of every guest on the show, which is to share three pieces of technical leadership wisdom, for people to learn from your experience and your journey. So maybe you can share with us: what is your wisdom?

Liz Fong-Jones: Sure. I might keep it brief and just leave it at one piece of wisdom, which is that it is much, much more important to make the social piece work than the technical piece. It's really important to get people talking to each other. Software delivery is much more about people communicating, being on the same page, and having a shared working model than it is about what tools you use. So don't think first about introducing new tools. Think first about aligning people on the outcome and making sure everyone agrees on the outcome. And then you can worry about what tools you're going to use to implement it.

Henry Suryawirawan: Wow. That's really wonderful. So social first, rather than the technology aspect, and try to align people. Because, yeah, I can see people implementing different technologies. Even when you talk about microservices, right? Everyone just wants to have their own services and manage them themselves, and they don't really care whether it all integrates well and serves the customers well. So thank you so much, Liz, for your time. For people who want to follow you or learn more about observability or Honeycomb, where can they reach you online?

Liz Fong-Jones: Yeah, I am lizthegrey, pretty much everywhere. That's on Twitter. That's on GitHub. That's how to look me up. I will also drop those links into the show notes.

Henry Suryawirawan: Why "the grey", if I may ask?

Liz Fong-Jones: Teenage fan-girling over Gandalf from Lord of the Rings.

Henry Suryawirawan: All right. So thank you, Liz. I hope you have a great day today.

Liz Fong-Jones: Thank you. Cheers.

– End –