#36 - Building High-Performing Teams with Observability and CI/CD - Charity Majors


“A high-performing team is one that gets to spend almost all of their time solving interesting problems that move the business forward. Not doing a lot of toil. Not working on things they have to do in order to get to the things they want to do."

Charity Majors is the co-founder and CTO of Honeycomb, provider of observability tools for engineering teams to debug production systems faster and smarter. In this episode, we discussed in depth how to build high-performing teams with observability and CI/CD as the critical pillars supporting them. We opened our discussion with what observability is and how Honeycomb provides observability for distributed systems compared to the other monitoring tools available. Charity then shared her strong views on how to build high-performing teams by focusing on Continuous Delivery, the sociotechnical aspects of the team, and the fifth key metric as her addition to the widely known DORA metrics. Towards the end, we discussed the engineer/manager pendulum, how one should be conscious about it, and why we should not treat going into management as a promotion or a one-way street.

Listen out for:

  • Career Journey - [00:06:04]
  • Observability and Monitoring - [00:10:52]
  • Observability Mindset - [00:13:08]
  • Implementing Observability - [00:15:09]
  • Observability Pillars - [00:18:35]
  • Honeycomb Overview - [00:20:06]
  • Honeycomb Cool Use Cases - [00:27:02]
  • Writing Custom Database - [00:28:40]
  • Building High-Performing Team - [00:31:20]
  • 15-Min Continuous Delivery - [00:34:45]
  • Testing in Production - [00:36:05]
  • Shipping is Company’s Heartbeat - [00:38:47]
  • Sociotechnical Aspect of Teams - [00:41:01]
  • Good Traits to Look for in Engineers - [00:43:17]
  • The 5th Key Metric - [00:45:51]
  • Engineer/Manager Pendulum - [00:48:45]
  • Effective Manager - [00:54:21]
  • Concerns on Manager/IC Pendulum - [00:55:19]
  • 3 Tech Lead Wisdom - [00:56:49]

_____

Charity Majors’s Bio
Charity Majors is the co-founder and CTO of Honeycomb, provider of tools for engineering teams to debug production systems faster and smarter. Previously, Charity ran infrastructure at Parse and was an engineering manager at Facebook, where she ran next-generation distributed systems at scale. Charity is the co-author of Database Reliability Engineering (O’Reilly), and is devoted to a world where every engineer is on call and nobody thinks on call sucks.

Follow Charity:

Mentions & Links:


Our Sponsors
Are you looking for some cool new swag?
Tech Lead Journal now offers merchandise that you can purchase online. Each item is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.
Check out all the cool swag available by visiting https://techleadjournal.dev/shop.
And don't forget to show off your swag once you receive it.


Like this episode?
Subscribe and leave us a rating & review on your favorite podcast app or feedback page.
Follow @techleadjournal on LinkedIn, Twitter, and Instagram.
Pledge your support by becoming a patron.


Quotes

Career Journey

  • Every day, it’d be like a new app would hit the iTunes Top 10. Poof! Parse would go down. I tried every monitoring tool, every logging tool, every APM tool. I tried everything and nothing helped because they were all dealing with metrics and pre-aggregates.

  • Pre-aggregates are great when you know what questions you’re going to need to ask, and you capture that data in that way. But when you have to ask different questions every day, and you have no idea what questions you’re going to need to ask, all the existing tools just fell apart.

  • We’re trying to build a way to understand your system, just by asking questions from the outside, without having to ship any custom code to understand the problem, which would imply that you could have known what it was going to be in advance. So it’s all about the unknown unknowns, and it’s all about being able to infinitely slice and dice the telemetry that’s coming out.

  • If you say you want to be able to ask any question of your systems, some things proceed from that. Some technical prerequisites proceed from that. You have to be able to handle high cardinality and high dimensionality, and you have to gather the telemetry in such a way that you have one arbitrarily-wide structured event, per request per service.
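To make this concrete, here is a small illustrative sketch in Python. It is not Honeycomb's actual data model or API, just a hypothetical shape for "one arbitrarily-wide structured event per request per service" and the kind of after-the-fact slicing it enables:

```python
# A hypothetical arbitrarily-wide structured event: one flat map of
# fields per request per service, keeping high-cardinality values
# (user_id, build_id) instead of pre-aggregating them away.
events = [
    {"service": "api", "endpoint": "/login", "user_id": "u-9231",
     "build_id": "b-552", "duration_ms": 83, "status": 200},
    {"service": "api", "endpoint": "/login", "user_id": "u-1007",
     "build_id": "b-553", "duration_ms": 2470, "status": 500},
    {"service": "api", "endpoint": "/home", "user_id": "u-9231",
     "build_id": "b-553", "duration_ms": 112, "status": 200},
]

def slice_by(events, field, where=None):
    """Group events by any field, optionally filtered. No schema,
    no pre-declared index: any question can be asked after the fact."""
    groups = {}
    for e in events:
        if where and not where(e):
            continue
        groups.setdefault(e.get(field), []).append(e)
    return groups

# An "unknown unknown" question invented after the fact:
# which build is producing errors?
errors_by_build = slice_by(events, "build_id",
                           where=lambda e: e["status"] >= 500)
print(sorted(errors_by_build))  # ['b-553']
```

Because nothing was aggregated at write time, the same events can answer tomorrow's question (by user, by endpoint, by any new field) just as easily.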

Observability and Monitoring

  • Monitoring is about taking it from the perspective of the service, and saying, “Is this service objectively healthy?” Is the service healthy? Do I need to plan for more capacity? What’s my consumption percentage? It’s usually built on a structure of metrics.

  • The specific technical term for metric is that it’s a number that you can append tags to. So it’s really cheap. You can store a lot of metrics very efficiently. You can age them out as they get older, you can drop some detail, and you can persist them for a very long time.

  • The source of truth for observability is not metrics, because when you capture that one metric, that one number, you’ve discarded all of the context, except for a few tags which you bring back.

  • With observability, you’ve captured it in this very wide event, which captures all that context, and it says all of these things are linked together by this one request from the user’s perspective. So you can then ask any combination of questions from then on.

  • Monitoring is about from the perspective of the service. Observability is about from the perspective of the user. What is the user’s experience like?

  • The user for monitoring and metrics is typically ops. The user for observability is typically people who are writing code. People who are shipping code every day, who want to understand the magical conjunction between your code, your infrastructure at this point in time, and that user’s experience.
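A tiny hypothetical sketch of that difference (the field names are invented for illustration): collapsing events into a tagged metric keeps a cheap number, but discards the per-request context that the wide events retain:

```python
# Two hypothetical wide events for the same endpoint.
events = [
    {"endpoint": "/login", "user_id": "u-1", "region": "eu", "status": 500},
    {"endpoint": "/login", "user_id": "u-2", "region": "us", "status": 200},
]

# A metric: just a number you can append tags to -- cheap to store...
metric = {
    "name": "http_errors",
    "value": sum(1 for e in events if e["status"] >= 500),
    "tags": {"endpoint": "/login"},
}

# ...but which user saw the error? That context is gone from the metric,
# while the wide events can still answer it.
affected = [e["user_id"] for e in events if e["status"] >= 500]
print(metric["value"], affected)  # 1 ['u-1']
```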

Observability Mindset

  • This became necessary when the monolith began to decompose. Because, if all else failed with the monolith, you could always attach a debugger and step through the code. Well, now you’ve blown up the monolith into God knows how many services, and every time it hops the network, you’ve lost all the context. You can no longer trace that code anymore.

  • You can think of this as like packing along all of that context that it needs every time it hops the wire.

Implementing Observability

  • Absolutely anyone in the chain can take responsibility for this. I think that most commonly we have ops teams who could get it set up on behalf of their developers and start to teach them. Increasingly, their job is not to make the code run. It’s to help developers understand.

  • First wave of DevOps was like “Ops people, you must learn to write code.” It was like, “Okay. Okay. We get the message.” And now it’s like, “Okay, software engineers, it’s your turn. It’s your turn to learn to write operable services. It’s your turn to learn to support those services, and to build systems that are not too expensive to maintain.”

  • Sometimes, it’s the developers who take it upon themselves to bring observability in. Often, it’s the ops teams who bring it in on behalf of their developers. Often, it’s execs who have been paying attention; they’ve got their ear to the ground. They know that systems are becoming impossible to run and scale and maintain without this sort of tooling.

  • Forever, the way that we’ve been doing this has been by guessing. When we see something wrong, we guess based on past outages and battle scars. And then we go and look for evidence that we’re right. That works really well when you have a lot of people who’ve been there for a very long time, who have a lot of context in their heads, and the systems aren’t changing that much.

  • I think it doesn’t matter who takes it upon themselves to do this, as long as someone does. Because it is very important. Because otherwise, people and teams burn out. This is an ethical issue, I believe. This is a humane issue.

  • We’ve all grown up working in the tech industry accepting a certain amount of self-abuse, just like masochism, and ops is just infamous for this. But it doesn’t scale very well. And our systems increasingly don’t reward that kind of magical thinking.

  • Whenever you have a platform, you’re inviting your users’ chaos to live at your house, and you’re responsible for it.

  • The separation between business and technical metrics is pretty arbitrary and detrimental, in fact. I think that you want everyone to be reading off the same answer sheet. You want everyone in the company to have the same view of reality, and, to the extent that you can, all be looking at the same numbers.

  • I believe that tools can create silos. Because the borders of your tool is the borders of your knowledge.

Observability Pillars

  • Metrics, logs, and traces are just data formats. And you can derive all of them from the arbitrarily wide structured data blob. You can derive your metrics from it. You can derive your log lines from it. You can derive your spans from it. You can’t go in the other direction.

  • We need to stop thinking about this in terms of the archaic tooling that we’ve inherited, and start thinking about it in terms of the use we get out of the system.

  • So many legacy tools are just like, you definitely need metrics, logs, and traces. Why? Because they happen to have metrics products, logging products, and tracing products that they want to sell you. And if they’re selling you those three products, you’re paying to store the data in three different places, in three different ways, and they can’t talk to each other.
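As a rough sketch of this one-way derivation (the field names here are invented, not any vendor's format), a single wide event can be projected down into each of the three "pillar" formats, while none of them can be projected back up:

```python
# One wide structured event with full context.
event = {
    "trace_id": "t-1", "span_id": "s-7", "parent_id": "s-3",
    "service": "checkout", "name": "charge_card",
    "timestamp": "2021-06-01T12:00:00Z", "duration_ms": 412, "status": 200,
}

# Derive a metric: just a number with a couple of tags.
metric = ("request_duration_ms", event["duration_ms"],
          {"service": event["service"]})

# Derive a log line: the event flattened to text.
log_line = " ".join(f"{k}={v}" for k, v in event.items())

# Derive a trace span: the event's timing plus its trace/span/parent ids.
span = {k: event[k]
        for k in ("trace_id", "span_id", "parent_id", "name", "duration_ms")}

# The other direction is impossible: from the metric alone you cannot
# recover the trace_id, the service's other fields, or anything discarded.
print(metric[1], span["span_id"])  # 412 s-7
```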

Honeycomb Overview

  • I never want to take the human out of the driver’s seat. But I do want to look for ways to make machines do what machines are good at. Machines are good at crunching lots of numbers. To let humans do what humans are good at, which is attaching meaning to things.

  • If you see a spike, you don’t actually know if that was good or bad or something you expected to see, something you wanted to see. You don’t know that until a human has attached a meaning to that. And then the computer can crunch a bunch of numbers and back up the meaning that you assigned to it.

  • I don’t want any developer to give up that centrality of being in the driver’s seat. But I do want to look for ways to take out the repetitive, the loop.

  • One other thing I will point out is that something philosophically at Honeycomb that we really believe in is the power of the team. Because whenever we’re working on these complex distributed systems, I’m an expert in my corner of it. But I don’t know your part nearly as well. Yet while I’m debugging, I have to take the whole system into account.

  • You should be able to draw on the collective wisdom of your team as experts in the system. I trust that so much more than I would ever trust any AI.

  • Observability is fundamentally about the first person perspective.

  • It’s better to have second best data than no data at all.

Honeycomb Cool Use Cases

  • Because we’re engineers, we love to figure things out. We love spotting inefficiencies. And some things seem really hard and tough, but then you just see them in the right tools.

Writing Custom Database

  • It comes down to the fact that the trade-offs that we wanted to make are not trade-offs that anyone else would make. For our purposes, fast and mostly right is better than accurate. We’re fine with just throwing away some very small tail of data in order to get sub-second query latencies. This is really important because when you’re debugging, you want to be in a state of flow. So we needed to be really fast. We were okay with discarding some data.

  • We also knew that we didn’t want there to be any schemas. We wanted people to be able to throw in more dimensions at any point in time.

  • No schema changes, no indexes. Because indexes, again, violate the theory of observability, which is that you shouldn’t have to predict what’s going to be fast or what you’re going to need to ask. Everything should be equally fast to query on.

  • So that meant that we needed to build a columnar store. There were not a lot of open source, freely available columnar stores. We also knew that we needed the data to be very distributed. We didn’t want to spend all of our time managing the database.
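A toy sketch, emphatically not Honeycomb's actual storage engine, of why a schemaless columnar layout fits these constraints: every field is scanned as its own column, so no field needs an index to be fast to query, and a never-before-seen dimension can show up at any time:

```python
class ColumnStore:
    """Toy schemaless columnar store: one list of values per field."""

    def __init__(self):
        self.columns = {}   # field name -> list of values (None = absent)
        self.rows = 0

    def insert(self, event):
        # A never-before-seen dimension simply becomes a new column,
        # backfilled with None -- no schema change required.
        for field in event:
            self.columns.setdefault(field, [None] * self.rows)
        for field, col in self.columns.items():
            col.append(event.get(field))
        self.rows += 1

    def scan(self, field, predicate):
        """Scan one column only; every field is equally fast, no index."""
        col = self.columns.get(field, [])
        return [i for i, v in enumerate(col) if v is not None and predicate(v)]

store = ColumnStore()
store.insert({"duration_ms": 80, "status": 200})
store.insert({"duration_ms": 2400, "status": 500, "retry": True})  # new dimension
print(store.scan("duration_ms", lambda v: v > 1000))  # [1]
```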

Building High-Performing Team

  • A high-performing team is one that gets to spend almost all of their time solving interesting problems that move the business forward. Not doing a lot of toil. Not like working on things that they have to do in order to get to the things that they want to do.

  • In order to get there, I think that focusing on the DORA metrics is a really good place to start. How often do you deploy? How long does it take? How often does it fail? How long does it take to restore? And a lot of improving those metrics involves leveling up in observability. Seeing what you’re doing, like putting on your glasses before you drive down the freeway. And a lot of teams, they’re really flying blind.

  • It’s important to note that high-performing teams, they need to feel emotional safety. They need to support each other. They need to be kind.

  • The high-performing teams, you don’t get them by just hiring the best people or hiring the best engineers. You don’t. In fact, some of the people I’ve known who are the best individual engineers are pretty toxic to teams. Overall, you as a leader, it’s your responsibility not to coddle the best engineers. It’s your responsibility to build teams that have a very steady, strong level of combined output. Because that’s what wins the race. It’s not individuals. It’s setting your teams up for success.

  • The place to start for almost everybody I’ve ever talked to is you need to shrink the amount of time between when someone has written the code and when the code is live. You should get the interval down to 15 minutes or less. And if you get that, you get to spend your time on a lot of problems that are much, much higher up the value chain. But if you get that wrong, you’re just going to waste all your time tending to the pathologies. And it’s really inefficient and ineffective.
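As a hedged illustration of those four DORA metrics plus the 15-minute target (the deploy records below are invented), the bookkeeping is simple:

```python
from datetime import datetime, timedelta

# Hypothetical deploy records:
# (commit time, live time, failed?, time to restore if failed).
deploys = [
    (datetime(2021, 6, 1, 10, 0), datetime(2021, 6, 1, 10, 12), False, None),
    (datetime(2021, 6, 1, 14, 0), datetime(2021, 6, 1, 14, 14), True,
     timedelta(minutes=9)),
    (datetime(2021, 6, 2, 9, 30), datetime(2021, 6, 2, 9, 41), False, None),
]

deploy_frequency = len(deploys) / 2          # deploys per day over 2 days
lead_times = [live - commit for commit, live, _, _ in deploys]
change_failure_rate = sum(f for _, _, f, _ in deploys) / len(deploys)
restores = [r for _, _, _, r in deploys if r is not None]
mean_time_to_restore = sum(restores, timedelta()) / len(restores)

# The 15-minute target: every commit-to-live interval under 15 minutes.
print(all(lt <= timedelta(minutes=15) for lt in lead_times))  # True
```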

15-Min Continuous Delivery

  • It is important because you don’t want people to have enough time in there to be able to shift context. That moment when you’ve written the code, you want this to be a small diff.

  • How quickly can you deploy one line of code? That should be your goal here, right? You want it to be short enough that you hook into people’s physiological systems. Their muscle memory. You want all of your developers to go and look at their code after they’ve shipped it.

  • And if it’s an hour or more, then I guarantee you’re going to start batching multiple people’s changes up together, when it’s really important that you be shipping one person’s change set at a time. Cause that creates ownership. If it’s your change going up, you’re going to go look at it.

  • You need to construct that feedback loop and keep it crisp.

Testing in Production

  • We need tools for production that are like scalpels. That are very sensitive, very precise. Developer cycles are the scarcest resource in our universe. They’re not infinite. If we’re wasting them all on pre-production stuff, then we’re left with nothing for production.

  • I just think that the importance needs to be inverted there. We need to be thinking first about production. Do we have the tools that we need to ship safely and swiftly and understand what’s going on? And then, give 80% to production and 20% to pre-production. Instead of 80% to pre-production and 20% to production.

  • The point is not that everyone should be testing in production. The point is that everyone is testing in production. Whether you admit it or not, and you’re better off, if you just admit it, and try to do it well.

  • Everybody has to test in production. You do. I do. It’s not embarrassing. It’s a fact of life. Now, how can we do it better?

  • You should invest that energy into observability. And if you want to get really precise about releases, you should be using feature flags, so that you can flip things on gradually and decouple your releases from your deploys. You should invest in observability so that you can see, and in canaries. If you’re really scared about a change, ship it to 1% of the traffic, or ship it to 1% of the users, or ship it to one node.

  • Build the tools that will let you have this very high level of precision around who you’re shipping it to, and then the ability to compare the traffic stream so that you can see what the difference in error rates is. Or what the difference in latency is.
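A minimal sketch of that canary comparison, with invented traffic data: summarize the baseline and canary streams, then compare their error rates and latencies before rolling the flag out further:

```python
def summarize(requests):
    """Reduce a traffic stream to an error rate and a median latency."""
    n = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    latency = sorted(r["duration_ms"] for r in requests)
    return {"error_rate": errors / n, "p50_ms": latency[n // 2]}

# Hypothetical streams: ~99% of traffic on the old code, ~1% on the canary.
baseline = [{"status": 200, "duration_ms": 80 + i % 40} for i in range(99)]
canary = ([{"status": 500, "duration_ms": 900}]
          + [{"status": 200, "duration_ms": 85} for _ in range(9)])

b, c = summarize(baseline), summarize(canary)
# A canary error rate well above baseline is the signal to flip the
# flag back off before the change reaches 100% of users.
print(c["error_rate"] > b["error_rate"])  # True
```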

Shipping is Company’s Heartbeat

  • Shipping is your company’s heartbeat.

  • It can be done. This can absolutely be done. It’s not hard. It’s just engineering work. You just need to commit to doing ongoing work, set yourself a limit, alert when it goes off. Take it seriously, and invest in it over time, and you can hold yourself to a high standard.

  • Shipping, for any tech company, giving value to our users is why we exist. Shipping new changes, shipping new code for users should be absolutely as common, as ordinary, as uneventful, and as regular. It should be a nothing.

  • It is because those changes are small and uneventful that it’s not a big deal. I feel we need to invert the way that we think about this. This is a social change. This is a mindset change. But it’s not a new thing.

  • The thing is this is so good for engineers. Engineers who have worked with true CI/CD are unwilling to ever go back. Cause it is like a different profession. It is so much more interesting and exciting and up to the moment. You see your real impact on users' lives many times a day, and that’s fundamentally motivating and inspiring.

Sociotechnical Aspect of Teams

  • Imagine you brought in a set of replacements to keep everything running, and then something breaks. How long do you think it would take them to figure out and fix even a very small problem? It’s a thought experiment to show you just how much of the system lives in our heads.

  • People are a very important part of the technical system, and you can’t just separate them. The part of the system that lives in your head is a part of the system.

  • What’s really important to us is their communication skill. They need to be able to write code, obviously. But the real interview is not us looking at their code; it’s us talking to them about their code. We’re just having a conversation: talk to me about your choices here. Why did you choose this? What else might you have done? What trade-offs did you have in mind when you were doing this? Because there are so many people who are capable of writing great code, but can’t communicate about it.

  • We believe that high-performing teams come from people who are ready, willing, and able to talk about what they do, and why they’ve done what they’ve done, without getting their ego too wrapped up with it. It’s just being able to have a conversation about it.

  • We were never looking for the best programmers. We were looking for the best communicators.

Good Traits to Look for in Engineers

  • We look for humility. We look for people who can be gently challenged and don’t fall apart. We look for people who are constantly learning; I guess growth mindset is the term of art for it. But we want people to be able to come to a problem curious. Not to come in feeling terrified that it will turn out they don’t know what they’re doing. Because it turns out we don’t know what we’re doing either.

  • One of our values is that we hire adults. We don’t want to be your family or your whole life. But we also do hire people who take a lot of pride in their craft. We don’t want to hire people who just go to work to pay the bills either. Because we take a lot of pleasure in what we do.

  • Other values are that everything is an experiment, and fast and close to right is better than done and perfect.

  • Feedback as a gift is another one. Because I think that so many pathologies in the modern workplace can be traced back to the fact that people are scared to give each other feedback. And so it builds up, and it becomes a big thing.

  • We spend more time at work with each other than we do at home with our partners and our families. I think we shouldn’t be embarrassed about talking about our feelings and being emotional human beings at work. Because if you’re not, you’re just faking it, and I don’t believe that lets you perform at a high level over the long term.

The 5th Key Metric

  • Every team should be tracking how often someone gets paged out of hours. I strongly believe that software engineers need to be on call for their own code, but this comes hand-in-hand with commitment from management to make sure that it does not suck.

  • The reason to put software engineers on call is because this is how we make it better. This is how we close that feedback loop. How we make it short. How we get things fixed quickly so they don’t get to infest and become big problems. We fixed them when they’re small.

  • It’s about pulling your own weight. It’s about being willing to support your own code. It’s about management making sure you have the time to fix the things that need to be fixed.

  • You can’t just put people on call and then tell them to build features all the time. If you’re putting them on call, you have to give them the time to fix the problems so they’re not getting woken up all the time. I feel we should have that metric and hold managers accountable for it. Never promote a manager whose team’s paging alerts are going off at all hours. That is just a sign of terrible management. Their team is going to burn out.
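A sketch of tracking that fifth metric, assuming a simple pager log and a hypothetical 09:00 to 18:00 working window:

```python
from datetime import datetime

# Hypothetical pager log: when each alert paged the on-call engineer.
pages = [
    datetime(2021, 6, 1, 3, 12),   # 3am       -> out of hours
    datetime(2021, 6, 1, 14, 5),   # 2pm       -> working hours
    datetime(2021, 6, 2, 23, 40),  # ~midnight -> out of hours
]

def out_of_hours(ts, start=9, end=18):
    """Treat anything outside the 09:00-18:00 window as out of hours."""
    return not (start <= ts.hour < end)

# The fifth key metric: out-of-hours pages over the period,
# something a manager can be held accountable for.
fifth_metric = sum(1 for p in pages if out_of_hours(p))
print(fifth_metric)  # 2
```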

Engineer/Manager Pendulum

  • People are agonizing over whether to become managers because they see it as a one-way trip. They’re not sure if they can get hired as an engineer afterwards. They’re not sure if they should. They’re often told that they shouldn’t.

  • The best technical leaders that I’ve ever worked with have been people who have spent time as managers. Because you learn things about how to help a team overcome obstacles and work as a unit that you can’t learn any other way. You learn things about how the business works that you can’t learn any other way. But they have also been hands-on within the past five years. Because when you take a step away from the day-to-day of building and shipping code, you’re on a path that diverges, and there comes a point where you are not an effective leader of people who are writing and shipping code.

  • If you were to stay a line manager who manages engineers, the people who are writing and shipping code, you’re going to need to make your way back to the well from time to time. You can’t go more than five years without building and shipping code regularly yourself and maintain that level of attention and empathy, and immersion in the problem set. Your engineers won’t respect you as much. You won’t be able to mentor and develop engineers as well. You won’t have opinions that you can trust about the technology.

  • You have to train yourself not to just speak up all the time as a manager. Once you step out of that position and you’re the manager, it’s your job to guide people up to reclaim the oxygen that you once breathed. It’s your job to grow people into that space. And this can require giving up a lot of ego on our parts. But it’s necessary. The best way to grow people is to have a manager who actually understands what they’re going through and can help guide and nudge and suggest opportunities for them.

  • We need to stop talking about management as a promotion. It’s not a promotion. It’s just a change of career. It’s a lateral step. What that means is that it’s not a demotion if a manager wants to go back to engineering.

  • You want your managers to want to be managing people. You want the people who were better as engineers to try it for a while, maybe, and then be able to switch back to engineering without any loss of status, without any loss of ego. So it’s really important that management is not a promotion, just a change of career.

  • Anyone who’s trying management out, you should ask for a two-year commitment. Otherwise, it’s too rough on the teams if you’re shuffling managers around constantly. You literally have to become a different person to be an effective manager. You have to retrain your own instincts, your own reactions, the way that you think, your pacing. Your pacing on a daily, weekly basis is so much different as a manager. You have to gain the instinct and intuition to know when you’re doing a good job and when you’re doing a bad job. That takes at least two years to build.

  • I wish that pretty much anyone who is interested in trying management got a shot at it. And then, we should ask ourselves honestly, “Is this good for me? Do I like it or not?” And there should be a path back and forth that is well-trodden. I feel like so much of the pathologies that I see on teams come from the relationship between management and engineers.

  • I think that, in the way that the interval between when you write the code and when you ship the code is the fundamental building block of a healthy software digestive cycle, the pathway between senior engineer and manager being an equal one and a well-trodden one is the fundamental building block of a good social side of the sociotechnical organization.

Effective Manager

  • A lot of it comes down to self-awareness. Go to therapy.

  • Learn stuff about power dynamics, especially if you’re white or Asian in tech. There’s a lot of stuff that you need to learn to be sensitive to that you’ve perhaps never been sensitive to before. You owe it to your reports to learn about those things.

  • Learn how to listen. All the advice that I would give for new managers is really about self-awareness and humility and communication skills.

Concerns on Manager/IC Pendulum

  • Choose your companies carefully, especially as you get to be more senior in your career. You should be interviewing them just as much as they’re interviewing you. Ask them about their philosophy of management.

  • You want to find a place that has a culture that matches your own aspirations.

  • It is honestly much easier to go back from being a manager to being an engineer, because this is a superpower. Everyone wants engineers. There is so much more demand for engineers than there is availability.

3 Tech Lead Wisdom

  • Your career is an enormous asset. It is the most valuable thing you possess. It is a multi-million dollar appreciating asset, and you should treat it that way.

    • Think about it for the long term. Sometimes this means taking a job with a little bit less pay because you think it makes the story of your career look much better.

    • Be in a position where you are not spending all of your money every month. Don’t live paycheck to paycheck because then you’re trapped. You can’t take any jobs that pay any less than that, and sometimes you’re going to want to. Look at your career as a long-term asset and think about the story that you wanted to tell.

  • Try to be an ally to women, other genders, and other races.

  • The advice that I always give to young women who ask me for advice is: when you’re a junior engineer, just toughen up. Try not to take it too personally. Put your head down, learn as much as you can, just survive. Become a senior engineer, and then get sensitive, and then learn to be sensitive on behalf of other people. Use your power for good. Anyone who’s a senior engineer in tech has an absolute responsibility to learn to be sensitive, and to use our power for good on behalf of those who can’t.

Transcript

Episode Introduction [00:01:06]

Henry Suryawirawan: [00:01:06] Hello, everyone. I’m back here again with another new episode of the Tech Lead Journal podcast. Thank you so much for tuning in and spending your time with me today listening to this episode. If you’re new to the podcast, know that Tech Lead Journal is available for you to subscribe to on major podcast apps, such as Spotify, Apple Podcasts, Google Podcasts, YouTube, and many others. Also check out and follow the Tech Lead Journal social media channels on LinkedIn, Twitter, and Instagram. Every day, I post quotes and words of wisdom from recent podcast episodes, and I share them on those channels to give us some inspiration and motivation to get better each day.

And if you’re thinking about making a contribution to the show and supporting the creation of this podcast, please consider joining as a patron by visiting techleadjournal.dev/patron. I highly appreciate any kind of support, and your contribution will help me towards sustainably producing this show every week.

For today’s episode, I am extremely excited to share my conversation with Charity Majors. Charity is the co-founder and CTO of Honeycomb, provider of observability tools for engineering teams to debug production systems faster and smarter. Charity is a frequent speaker and is someone that I admire a lot, not just for her thought leadership on DevOps, observability, distributed systems, etc., but also for her strong opinions on multiple other topics, such as management, engineering, startups, culture, diversity, and many more, which you can also find and follow on her Twitter and personal blog.

In this episode, Charity and I discussed in depth how to build high-performing teams with observability and CI/CD as the critical pillars supporting them. We opened our discussion with what observability is, how it differs from monitoring, and why traditional monitoring tools and pre-aggregates are not sufficient to help us understand our systems better and to uncover the unknown unknowns. Charity shared how Honeycomb provides unique capabilities for observability in distributed systems and how it works, including an interesting use case that provides observability for CI/CD pipelines.

Then we moved on to discuss building high-performing engineering teams. Charity shared her strong views on how to build such teams by focusing on Continuous Delivery, caring a lot about the sociotechnical aspects of a team, and the fifth key metric as her addition to the widely known DORA metrics. Towards the end, we discussed the engineer/manager pendulum, based on her popular blog post from some time back. Charity shared what she means by the pendulum and how one should be conscious about it. She gave her words of advice and explained that going into management is not a one-way street, and that we should also not treat it as a promotion.

This is such an amazing episode that you don’t want to miss, and I’m certain that you will also enjoy this episode. If you like it, consider helping the show by leaving it a rating, review, or comment on your podcast app or social media channels. Those reviews and comments are one of the best ways to help me get this podcast to reach more listeners, and hopefully they can also benefit from the contents in this podcast. Let’s get this episode started right after our short sponsor message.

Introduction [00:05:12]

Henry Suryawirawan: [00:05:12] Hey, everyone. Welcome to another new episode of the Tech Lead Journal. Today, I’m very excited to have someone that I really admire from the DevOps technology side. Her name is Charity Majors. I’m sure many of you would have recognized her name. Today, we’ll be talking a lot about observability, monitoring, CI/CD, and a few other topics that she’s famous for from her blog posts. Charity is the CTO and co-founder of Honeycomb, a developer tool that helps improve the observability of your applications. She also writes a great blog at charity.wtf, and co-hosts a popular podcast called O11ycast, which I think is one of the best observability and monitoring podcasts available out there. So Charity, thanks for your time today. Hope to have a good conversation with you.

Charity Majors: [00:06:01] Yeah. Thanks so much for having me. I’m really excited.

Career Journey [00:06:04]

Henry Suryawirawan: [00:06:04] So Charity, maybe in the first place, let us hear more from you about your career journey, about your highlights and turning points.

Charity Majors: [00:06:11] Yeah. Well, I’m kind of an accidental technologist. I went to college for classical piano performance when I was 15, and I never had a computer when I was growing up. I never even had email until I went to college. But I didn’t want to be poor. I didn’t want to be broke, and I found myself hanging out in the computer lab a lot. I never actually got a degree in computer science. I majored in classical studies, Latin and Greek, and creative writing and philosophy. But I found that I was spending more and more time working in the lab than doing any of my schoolwork. I dropped out a couple times, came to San Francisco, and I’ve been working at startups ever since. I have made a niche for myself being an early, or ideally the first, infrastructure hire that joins a team of software engineers, when they think that they’ve got something that might be real, and I like to help them grow it up. And then I typically start hiring a team, and then I manage the team. Then there comes a point where I’m like, well, I’ve done everything I could do here. And I find another startup, and I do it again.

So that’s what I was doing when I joined Parse, which was a backend for mobile developers. And that was about six years ago. I was there before beta, then we got acquired by Facebook, and I stayed at Facebook for 2.5 years. To give you a little bit of context: at Parse, we were doing a lot of microservices before there were microservices. We were doing these really hard cotenancy problems. We had over a million apps running on Parse with one database and one pool of app workers, and we were going down constantly. It was really professionally embarrassing to me. Cause every day, it’d be like a new app would hit the iTunes Top 10. Poof! Parse would go down. I tried every monitoring tool, every logging tool, every APM tool. I tried everything and nothing helped, because they were all dealing with metrics and pre-aggregates. Pre-aggregates are great when you know what questions you’re going to need to ask, and you capture that data in that way. But when you have to ask different questions every day, and you have no idea what questions you’re going to need to ask, all the existing tools just fell apart. So there was one tool at Facebook called Scuba that we started shipping some data sets into. And the time it took us to figure out which app was the root cause of us going down or whatever just dropped like a rock. It went from hours or days, open-ended, like “Who knew?”, to predictably seconds. Not even minutes. Every time, we would just start slicing and dicing, and we could find the answer. This made a huge impression on me, and when I was leaving Facebook, I was like, “Oh my God, I can’t go back to being without this tooling.” I was supposed to go and be an engineering manager at Slack, Stripe, or something. And I was just like, this needs to exist, because everyone is starting to have this problem. Everyone is starting to have this problem of high cardinality and high dimensionality, and there is no tool out there that exists.

So we started building the tool that would become Honeycomb. We had to start by writing our own database effectively, because there was nothing out there that could do this. And that was difficult, but it wasn’t as difficult as trying to figure out the language. Cause we knew we weren’t building a monitoring tool. We knew that it was something very fundamentally different. That it was, you know, for helping you develop your application. And it was about six months into having started Honeycomb when I happened to Google the term “observability”, which no one was using at the time. When I read the definition, which comes from like mechanical engineering and control theory, and it has to do with how well can you understand any state that your system can get into just by looking at it from the outside. And I went, “Oh my God!” Like light bulbs appeared in my head. This is what we’re trying to do. This is what we’re trying to build. We’re trying to build a way to understand your system, just by asking questions from the outside, without having to ship any custom code to understand the problem. Which implies that you could have known what it was going to be in advance. So it’s all about the unknown unknowns, and it’s all about being able to infinitely slice and dice at the telemetry that’s coming out.

We started talking about it a lot. And for the first two years or so, it was just nothing. Everybody was like, we’ve been told this is impossible. We believe that you’re lying. Blah, blah, blah. And about three years into doing Honeycomb, one day it was just like everything changed. I woke up the next morning and the entire world was like, “But of course we need observability. We have always wanted observability. And by the way, we do observability too.” And I was just like, “Oh my God! No.” So there’s been a little bit of a bandwagon effect. Now the frustrating thing is that everybody’s like, “Yeah, we do observability too.” And it’s like, well, no, there’s supposed to be a very specific technical definition. Because if you say you want to be able to ask any question of your systems, some things proceed from that. Some technical prerequisites proceed from that. You have to be able to handle high cardinality and high dimensionality, and you have to gather the telemetry in such a way that you have one arbitrarily-wide structured event, per request, per service. I’ve written all these blog posts and stuff. The people who follow it closely understand, but the rest of the market is just like, “Oh, observability. It’s a new synonym for telemetry. Got it.” And I’m just like, “Okay, well, this is better. This is progress, you know.”

Henry Suryawirawan: [00:10:47] It’s common in the technology industry. Any new jargon, any new buzzword, everyone just uses it.

Observability and Monitoring [00:10:52]

Henry Suryawirawan: [00:10:52] So maybe you can help us and share, from your point of view, what is observability? And how does it differ from monitoring, which is what we are used to?

Charity Majors: [00:11:01] Yeah. I think it’s really important that people understand the way it’s different, because monitoring is a very mature, important discipline of its own, and there are a lot of best practices and things that go along with it. I think that the metrics tooling is great. It fulfills a separate need in the ecosystem from observability. Monitoring is really about taking the perspective of the service, and saying, “Is this service, objectively speaking, healthy?” Is the service healthy? Do I need to plan for more capacity? What’s my consumption percentage? It’s usually built on a structure of metrics. There’s the generic sense of “metrics”, which is just telemetry. But the specific technical term “metric” means a number that you can append tags to. So it’s really cheap. You can store a lot of metrics very efficiently. You can age them out as they get older, you can drop some detail, and you can persist them for a very long time. But the source of truth for observability is not metrics, because when you capture that one metric, that one number, you’ve discarded all of the context, except for the few tags which you bring back.

So for example, you have a spike of errors in your graph. “Ah, what’s wrong?” Well, with observability, you might slice and dice to figure out, “Oh, these errors are for all the requests that are coming from this version of iOS, from this endpoint, from this language pack, from this version of the firmware, from this build ID.” Just the long string of things, like this is what’s different about them. Well, you can’t do that with metrics unless you happened to capture, in advance, a metric for this version of iOS, this build ID. And that build ID is always incrementing and always changing, so you would have to capture those specific numbers in advance with metrics. Whereas with observability, you’ve captured it in this very wide event, which captures all that context, and it says all of these things are linked together by this one request from the user’s perspective. So you can then ask any combination of questions from then on. So monitoring is about the perspective of the service. Observability is about the perspective of the user. What is the user’s experience like? Which I think is super important. So, the user for monitoring and metrics is typically ops. The user for observability is typically people who are writing code. People who are shipping code every day, who want to understand the magical conjunction between your code, your infrastructure at this point in time, and that user’s experience.

Observability Mindset [00:13:08]

Henry Suryawirawan: [00:13:09] So to me, when I hear about it, it seems like it requires a lot of mindset change. First of all, capturing all this high-dimensionality, high-cardinality data, and also implementing that. So for someone who wants to explore this observability further, what should they do to change their mindset?

Charity Majors: [00:13:27] Well, I think that a shortcut is often just seeing your own system, your own data, in an observability tool. Which is why Honeycomb has a really generous free tier, because we want people to be able to sign up and play around with it. Even if they don’t become our users, I still want them to understand the mind shift and experience the difference. And so I think that’s the shortest, quickest route to experiencing it, to understanding it. There are a lot of things that will just click. If you’re used to trying to understand your own system, and you see it in a different tool, you already know the fields and the landscape, and so it’s a pretty quick and easy way. Another way of thinking about this is that this became necessary when the monolith began to decompose. Because, if all else failed with the monolith, you could always attach a debugger and step through the code. Well, now you’ve blown up the monolith into God knows how many services, and every time it hops the network, oops!, you’ve lost all the context. You can no longer trace that code anymore.

Which is why, when you’re gathering the telemetry for observability, the way we do it in Honeycomb is, we initialize a Honeycomb event as soon as a request enters the service. And then, we pre-populate it with everything we know about the environment, the language internals, goroutines, anything about the context that was passed in, parameters. And then while it’s executing in that service, you as a developer can go, “Oh, this might be useful. Stuff it in.” We also wrap a lot of common functions, like database calls. So anytime you do a database call, we wrap it, and we record the time it took, and the raw query, and the normalized query, and the results. And every time we do an HTTP request, we wrap it, and we give you the duration, and the results, and all that stuff. And then when the request is ready to exit or error from that service, it gets packaged up into that one arbitrarily wide structured data blob, and shipped off to Honeycomb. So you can think of this as packing along all of that context that it needs every time it hops the wire. Well now, you’ve got that context, because we’re passing it along with the request.
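For readers who want a concrete picture, the per-request event pattern Charity describes can be sketched roughly like this. This is an illustrative Python sketch, not Honeycomb's actual SDK; all names and fields here are made up.

```python
import time

class RequestEvent:
    """One arbitrarily wide structured event, per request, per service."""

    def __init__(self, service, environment):
        # Pre-populate with everything known when the request enters the service.
        self.fields = {
            "service": service,
            "environment": environment,
            "start_time": time.time(),
        }

    def add(self, key, value):
        # Developers stuff in anything that might be useful while handling the request.
        self.fields[key] = value

    def timed_db_call(self, raw_query, run):
        # Wrap a database call: record its duration and the raw query.
        start = time.time()
        result = run()
        self.add("db.query", raw_query)
        self.add("db.duration_ms", (time.time() - start) * 1000)
        return result

    def close(self):
        # Package the whole context up as one wide event, ready to ship.
        self.fields["duration_ms"] = (time.time() - self.fields["start_time"]) * 1000
        return self.fields

# Usage: one event accumulates context for the whole request.
event = RequestEvent(service="checkout", environment="production")
event.add("user.id", 42)
event.add("ios.version", "14.2")
rows = event.timed_db_call("SELECT * FROM carts", lambda: ["cart-1"])
wide_event = event.close()
```

The point of the pattern is that everything learned during the request ends up in one blob, so any combination of fields can be queried later.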

Implementing Observability [00:15:09]

Henry Suryawirawan: [00:15:10] That’s pretty cool. So in the first place, who’s responsible for implementing this? Because in the beginning you said it should be developers. First of all, they will be the main users of this. But as we all know, observability and monitoring these days are highly associated with ops, or the DevOps side. So who should implement this? And how should someone actually think about what kind of questions they would ask about their system, and where this observability might be useful for their system down the road?

Charity Majors: [00:15:37] Absolutely anyone in the chain can take responsibility for this. I think that most commonly we have ops teams who get it set up on behalf of their developers and start to teach them. Which is, I think, a really beautiful thing. I think that ops teams are not becoming any less necessary. And increasingly, their job is not to make the code run. It’s to help developers understand. This is like the second wave of DevOps, right? First wave of DevOps was like “Ops people, you must learn to write code.” It was like, “Okay. Okay. We get the message.” And now it’s like, “Okay, software engineers, it’s your turn. It’s your turn to learn to write operable services. It’s your turn to learn to support those services, and to build systems that are not too expensive to maintain. But we’re here for you. We will help you.” So I think that sometimes it’s the developers who take it upon themselves to bring observability in. Often it’s the ops teams who bring it in on behalf of their developers. Often it’s execs who have been paying attention, who’ve got their ear to the ground. They know that systems are becoming impossible to run and scale and maintain without this sort of tooling.

Because forever, the way that we’ve been doing this has been by guessing. When we see something wrong, we guess based on past outages and battle scars. We’re like, “Ah, I think it’s this, right?” And then we go and look for evidence that we’re right. That works really well when you have a lot of people who’ve been there for a very long time, who have a lot of context in their heads, and the systems aren’t changing that much. But when the systems start breaking in a novel way every single day, it doesn’t do you much good. So, I think it doesn’t matter who takes it upon themselves to do this, as long as someone does. Because it is very important. Because people burn out; teams burn out. This is an ethical issue, I believe. This is a humane issue. We’ve all grown up working in the tech industry accepting a certain amount of self-abuse, just masochism, and ops is just infamous for this. We’ll throw ourselves on the grenade, drag ourselves into the meetings in the morning, like, I’ve been up all night. It’s like a badge of honor for us, and it was fun. I enjoy being the wizard who points at a graph and goes, “Ck”. That’s super fun. But it doesn’t scale very well. And our systems increasingly don’t reward that kind of magical thinking.

Henry Suryawirawan: [00:17:43] Yeah. Especially these days when people are using lots and lots more microservices, distributed systems, even platforms like Cloud, where you have so many services.

Charity Majors: [00:17:50] Platforms are the worst. Because whenever you have a platform, you’re inviting your users’ chaos to live at your house, and you’re responsible for it.

Henry Suryawirawan: [00:17:59] So can this observability concept be used for monitoring business metrics as well?

Charity Majors: [00:18:04] I think that the separation between business and technical metrics is pretty arbitrary and detrimental, in fact. I think that you want everyone to be reading off the same answer sheet. You want everyone in the company to have the same view of reality, and to the extent that you can, all be looking at the same numbers. Well, that’s just a lot of arguments you don’t have to have. I believe that tools can create silos. Because the borders of your tool are the borders of your knowledge. I’ve been in so many meetings where you spend more time arguing about what was true, because of what your tools said, than about the actual problem that you’re trying to solve together.

Observability Pillars [00:18:35]

Henry Suryawirawan: [00:18:36] So maybe you can mention a few attributes. I know you have metrics. You maybe have logging, and you maybe have traces, spans, and things like that. What are the things that are crucial in any distributed system these days?

Charity Majors: [00:18:49] I mean, metrics, logs, and traces are just data formats. And you can derive all of them from the arbitrarily wide structured data blob that I was just describing. You can derive your metrics from it. You can derive your log lines from it. You can derive your spans from it. You can’t go in the other direction. You can’t go from a metric to a wide structured blob. You can’t go from a log line to a wide structured blob. So I really think that we need to stop thinking about this in terms of the archaic tooling that we’ve inherited, and start thinking about it in terms of the use we get out of the system. And I feel like the move to these arbitrarily wide structured data blobs is inevitable, but it also shouldn’t be a big deal for people, right? If you squint, they’re just logs, they’re just structured. A trace is just a wide log line that has a couple of dedicated span ID fields and trace ID fields. So I think it’s kind of a red herring. It pisses me off that so many legacy tools are just like, you definitely need metrics, logs, and traces. Why? Because they happen to have metrics products, logging products, and tracing products that they want to sell you. And if they’re selling you those three products, then you’re paying to store your data in three different places, three different ways, and they can’t talk to each other. So I see this as a really cynical ploy on behalf of the entrenched legacy, the Splunks and the New Relics and the Datadogs of the world. They count up all the products they want to sell you, and they’re like, “This is how many you need.” And it’s like, “No, you don’t.” What you need is the functionality.
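The one-way derivation Charity describes, from a wide event down to a metric, a log line, or a span, can be sketched in a few lines. The field names below are illustrative, not any vendor's schema.

```python
import json

# One wide structured event, as described above (fields are made up).
wide_event = {
    "trace.trace_id": "abc123",
    "trace.span_id": "def456",
    "service": "checkout",
    "endpoint": "/cart",
    "status_code": 500,
    "duration_ms": 312.5,
    "user.id": 42,
    "ios.version": "14.2",
}

def to_metric(event):
    # A metric keeps one number plus a few tags; everything else is discarded.
    return {"name": "request_duration_ms",
            "value": event["duration_ms"],
            "tags": {"service": event["service"], "endpoint": event["endpoint"]}}

def to_log_line(event):
    # A structured log line is just the event, serialized.
    return json.dumps(event, sort_keys=True)

def to_span(event):
    # A span is a wide event with dedicated trace/span ID fields.
    return {"trace_id": event["trace.trace_id"],
            "span_id": event["trace.span_id"],
            "duration_ms": event["duration_ms"]}
```

The reverse direction is lossy: from `to_metric(wide_event)` alone you can never recover `user.id` or `ios.version`, which is exactly why the wide event, not the metric, has to be the source of truth.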

Honeycomb Overview [00:20:06]

Henry Suryawirawan: [00:20:07] Maybe you can share a little bit more about Honeycomb for people who haven’t used it before. I myself haven’t used it personally, but I’ve heard a lot of rave reviews about it. So maybe you can share a little bit about how Honeycomb can help? And what can those tools, which like you said are sold to you as a suite of products, not achieve compared to Honeycomb?

Charity Majors: [00:20:25] Well, so, you know, there’s a narcissism of small differences. Of course, we’re all trying to solve the same problems here. So for example, we usually start, all of us universally, we usually start our debugging experience by looking at a metrics graph. We see a spike in errors. We’re like, “Ah, there’s a problem.” Then, because you can’t slice and dice metrics arbitrarily, you typically jump over to your logging tool, and you start just trying to visually correlate. Well, that spike of errors looks like this over here. It’s not really science, but you know, you try and find the same thing around the same time. Then maybe you want to trace it or something. So you pick out a trace ID, you pick out a request ID, and then you jump over to your tracing tool, and you paste it in, and you hope that one got captured. And so you’re just visually jumping around, and you, the human, are the glue in the middle. Well, you do exactly the same thing in Honeycomb. Only you don’t jump around. You just look at the graph, you see there’s a spike or something. The way we do it at Honeycomb is you just draw a little box around the thing you’re interested in. And then we pre-compute all the dimensions inside the box you said you were interested in, and outside, as the baseline, and then we diff them, and then we sort them. So all the ones that are different rise up to the top, and then you can click on one and just say, “Show me this trace. Show me an example trace for this.” So you’re doing exactly the same thing, but you know that you’re dealing with the same data because you’re not jumping around. The cool thing is that because traces and events are just two sides of the same coin, it turns out that tracing is just visualizing by time. So you’re dealing with exactly the same events, exactly the same data, stored only one time, but you’re able to turn it around and inspect it in all of the different ways, just to try and find the anomalies.

I think that the BubbleUp trick that we’ve done is a game changer. Because what it did was it replaced the manual process of what we used to do at Parse with Scuba. What we used to do is go, “Okay. There’s a spike. Well, let’s see, is it for all endpoints?” “No, it’s just for the endpoints that are write endpoints.” “Well, is it for all the write endpoints?” “No, it’s just the ones that are writing to, say, MongoDB.” “Well, is it to all the MongoDB endpoints?” “Well, no. It’s only erroring to these three primaries.” “Well, what do they have in common? Or is it to all of them?” “Well, no, it’s just to this application ID.” You’re following a trail of breadcrumbs, which, let me point out, is vastly preferable to guessing and looking for evidence that you’re right. Here, you’re actually following the trail of breadcrumbs to the answer every time. But what BubbleUp did is it just replaced that whole iterative process, because we pre-compute them, and we show you them all at once. You just draw a box, “bloop”. There’s the answer, which is pretty cool.
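The core idea behind that diff-and-sort step can be sketched in a few lines: compare how often each dimension=value pair occurs inside the selected region versus the baseline, and sort so the biggest differences rise to the top. Honeycomb's real implementation is surely more sophisticated; this is only an illustration of the principle.

```python
from collections import Counter

def bubble_up(selected, baseline):
    """Rank dimension=value pairs by how over-represented they are
    in the selected events relative to the baseline events."""
    def freqs(events):
        counts = Counter()
        for e in events:
            for dim, val in e.items():
                counts[(dim, val)] += 1
        total = max(len(events), 1)
        return {k: v / total for k, v in counts.items()}

    inside, outside = freqs(selected), freqs(baseline)
    diffs = {k: inside.get(k, 0) - outside.get(k, 0)
             for k in set(inside) | set(outside)}
    # Largest positive difference = most over-represented in the selection.
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: errors cluster on one iOS version.
errors = [{"ios.version": "14.2", "endpoint": "/cart"}] * 9 + \
         [{"ios.version": "14.1", "endpoint": "/cart"}]
normal = [{"ios.version": "14.1", "endpoint": "/cart"}] * 10

top = bubble_up(errors, normal)[0][0]
# top is ("ios.version", "14.2"): the anomalous dimension bubbles up,
# while endpoint=/cart (identical in both sets) does not.
```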

And this is where we hear a lot these days about AIOps, and machine learning, and all this ********. I have lots of strong feelings on this topic. I feel like I never want to take the human out of the driver’s seat. But I do want to look for ways to make machines do what machines are good at. Machines are good at crunching lots of numbers. And to let humans do what humans are good at, which is attaching meaning to things. If you see a spike, you don’t actually know if that was good or bad, or something you expected to see, something you wanted to see. You don’t know that until a human has attached a meaning to it. And then the computer can crunch a bunch of numbers and back up the meaning that you assigned to it. So I don’t want any developer to give up that centrality of being in the driver’s seat. But I do want to look for ways to take out the repetitive loop. The “Is it this? Is it that?” We can collapse a lot of that work so that you just have to point and click and understand.

Henry Suryawirawan: [00:23:34] The way you describe it actually sounds to me as if I’m Googling something, right? I start from google.com, search one thing, and then it brings me to more and more links until maybe I get a solution.

Charity Majors: [00:23:47] Yeah. It’s about starting up high and methodically working your way to the answer. It’s a process that we’re all familiar with, and our tools just haven’t supported it. We’ve just had to make these wild leaps of fancy.

One other thing I will point out is that something philosophically at Honeycomb that we really believe in is the power of the team. Because whenever we’re working on these complex distributed systems, I’m an expert in my corner of it. The service that I’m writing right now, like the thing that I’m building, I know everything about it. I know the variable names. I know where the gremlins live. But I don’t know your part nearly as well. Yet while I’m debugging, I have to take the whole system into account.

So we talk a lot about how we can bring everyone up to the level of the best debugger in every single corner of the system. Well, we can do that by piggybacking on the way that I interact with my system as an expert, the way you interact with your system as an expert. So if I get paged about something, maybe it looks like a MySQL problem, and I don’t know anything about MySQL. But I know that the experts on the team are Ben and Emily. And I feel like we had an outage like this last Thanksgiving. So I just want to go look at: were any of them on call? What did they do? What was meaningful enough to them that they would run it a bunch of times? What did they tag and put into a post-mortem? What did they add a comment to? Or name and put into their personal arsenal of useful links? You should wear grooves in the system as you use it, and you should be able to draw on the collective wisdom of your team as experts in the system. I trust that so much more than I would ever trust any AI.

Henry Suryawirawan: [00:25:09] So, if I understand correctly, the way it works is you instrument your code, your services to give you this blob of data. Is there any other mechanism that you use? And how about the platforms that you mentioned? So how do you instrument the platforms, like the databases, the cloud platform that you use?

Charity Majors: [00:25:25] Well, observability is fundamentally about the first-person perspective. It’s about writing a library for your code, with it being the first-person actor: this is what I am doing, this is what I am seeing. But data is data, and it’s better to have second-best data than no data at all. And when you’ve got, say, a database that you’re not instrumenting, and it’s not your code, you do want some ways to get telemetry out of it. The easy answer is we’ll accept any structured data you can throw at us. It’s most useful to you as a debugger if you’ve gathered it into one arbitrarily wide structured data blob per request, per service. But you know, if you have an ELK stack, you can just put a little stanza in your Logstash config that will duplicate the output streams, and send us everything you were sending to ELK. That can be a really nice little bridge for people as they’re getting started. Because they already know this data. They know how to navigate it. You don’t have to change any code that way.

For things like MySQL, I’m a databases person, right? So this is an old school ****, but what we do is we get three sources of data for MySQL or MongoDB or whatever. We get the slow query log, cause there’s some information that is only sent to the slow query log. We also stream over the wire. We do a tcpdump, we reconstruct all the transactions, and then we ship each transaction in to Honeycomb as an event. Cause there’s some information you can only get that way. And thirdly, we also run a cron job that connects every couple of seconds to the command line interface, and just dumps out all the internal stats, and we ship those in as events too. Those three sources of data are basically the sum total of everything you can get out of these databases. And it’s frustrating, because they clearly weren’t designed to just output all of the information that we would like to have in one place. But you can get a pretty good approximation with those three sources.
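To make the third source concrete, here is a hedged sketch of turning a database's internal-stats dump into one wide event. It parses the kind of tab-separated output that `mysql -e "SHOW GLOBAL STATUS"` produces; the sample values are made up, and a real collector would run this on a schedule and ship each dict as an event.

```python
# Made-up sample of tab-separated stats output (header row, then name/value pairs).
sample_output = """Variable_name\tValue
Threads_connected\t48
Slow_queries\t7
Queries\t1523400
Uptime\t86400
"""

def stats_to_event(text):
    """Parse a stats dump into one wide event (a flat dict)."""
    event = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, value = line.split("\t")
        # Store numbers as numbers so they can be graphed and diffed over time.
        event[name.lower()] = int(value) if value.isdigit() else value
    return event

event = stats_to_event(sample_output)
# Ship `event` every few seconds; diffing successive events shows rates.
```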

Honeycomb Cool Use Cases [00:27:02]

Henry Suryawirawan: [00:27:03] So I’m sure you have plenty of these, right? But any kind of cool use cases that maybe your customers have shared?

Charity Majors: [00:27:09] The coolest thing, that was a big surprise to me, was when one of our customers, a year or two ago, made a library called buildevents for build pipelines. And what they did was basically they made it so that you can instrument your CI/CD pipeline as a trace. Just like your traces. So they overloaded that a little bit. If you put the buildevents library in your Circle or Travis or Jenkins or whatever, then it will wrap the build. So it shows each test when it fires off and when it finishes. You can see how well the parallelization is doing, or if it’s all sequential, or where the time is going. And it’s so amazing, because bringing down the amount of time it takes to build and ship your code is so central. I think every team really should be paying attention to this. But there hasn’t been a good way to visualize it. And I thought that was just ******* brilliant.

It’s such a good use of visualization by time, and that had never occurred to me. So now we try and get people to do it all the time. Because it’s such a great gateway drug to observability. Because we’re engineers, we love to figure things out. We love spotting inefficiencies. And some things seem really hard and tough, but then you see them in the right tools, and it’s like, “Oh duh, I can fix this.” And I think almost everyone has this experience with buildevents. They’re like, “Oh my God. That’s where all of my time is going.” And they go and fix it. And you get such a dopamine hit, such a high off of it. It’s just so fun.
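The idea of treating a build as a trace can be sketched simply: every pipeline step becomes a span under one trace ID, so the slowest step is obvious. This is not how the actual buildevents tool is invoked; it's an illustrative sketch with made-up names.

```python
import time
import uuid

def run_pipeline(steps):
    """Run named build steps and emit one timed span per step,
    all sharing a single trace ID."""
    trace_id = str(uuid.uuid4())
    spans = []
    for name, step in steps:
        start = time.time()
        step()  # run the build step (checkout, compile, test, ...)
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.time() - start) * 1000,
        })
    return spans

# Stand-in steps; sleeps simulate work of different sizes.
spans = run_pipeline([
    ("checkout", lambda: time.sleep(0.01)),
    ("unit-tests", lambda: time.sleep(0.03)),
    ("package", lambda: time.sleep(0.01)),
])
slowest = max(spans, key=lambda s: s["duration_ms"])
# Visualized by time, the trace makes it obvious where the build time goes.
```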

Henry Suryawirawan: [00:28:25] I’ve never imagined it that way. But yeah, now that you’ve mentioned it, I think it’s pretty cool if we can have that data. Especially if you can correlate with a real production system, like what happened during build time? What happened during development? Maybe some test cases behave differently during that time. So that’s pretty cool.

Writing Custom Database [00:28:40]

Henry Suryawirawan: [00:28:40] So you mentioned in the beginning as well that you had to write your own database. Maybe you can share a little bit, since you are a database person, why the current database tooling was not good enough for you?

Charity Majors: [00:28:51] Really, it comes down to the fact that the trade-offs that we wanted to make are not trade-offs that anyone else would make. For our purposes, fast and mostly right is better than slow and perfectly accurate. We’re fine with just throwing away some very small tail of data in order to get sub-second query latencies. This is the other thing about Honeycomb. It’s ******* fast. The 95th percentile for queries is under a second. And this is really important, because when you’re debugging, you want to be in a state of flow. You want to be asking this and this and this and this. Imagine if Google took five minutes to return a search. You’d be planning your searches in advance. You’d be like, “Oh God, okay, I’m going to run this query, and then I’m going to go to the bathroom.” That’s not good. So we needed to be really fast. We were okay with discarding some data. We also knew that it was very core that we didn’t want there to be any schemas. We wanted people to be able to throw in more dimensions at any point in time. A maturely instrumented service with Honeycomb will have 300, 400, 500 dimensions. So the events are that wide, and we don’t want the friction of, okay, now I’m going to add another one. We want people to just be able to throw it in, forget about it, stop sending it, forget about it. So no schema changes, no indexes. Because indexes, again, violate the theory of observability, which is that you shouldn’t have to predict what’s going to be fast, or what you’re going to need to ask. Every field should be equally fast to query on.

So that meant that we needed to build a columnar store. There are not a lot of open source, freely available columnar stores. We also knew that we needed the data to be very distributed. We didn’t want to spend all of our time managing the database, or adding databases for each user. The way it’s stored is actually cool. When data comes in through the API, the events get dropped into Kafka, into a topic just for you, your account. And then the topic is consumed by N numbers of retrievers. All of our services are named after dogs, cause we’re a dog company. So the retrievers, a pair of them for each topic, will consume off the queue. And then a thing that we did just in the last year or two is we actually made our database go serverless. At first, we were storing it all on SSDs in Amazon. And then we were like, wouldn’t it be cool if it was just in S3? We thought that the performance hit would be huge, but it wasn’t. So now the queries actually run as Lambda jobs, and the data gets aged out to S3 after a few hours on disk. And eventually, I think we’d like to make it so that you can choose to age it out to your own S3 buckets, so that it’s really just hosted on your own storage, and you can keep it forever for free. We have 60 days of storage for free, which I think is the best in the industry. And we’re able to do that because we were able to outsource to S3, where it’s really cheap.
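A toy column-oriented store makes the trade-offs above tangible: schemaless (new dimensions can appear at any time), no indexes (every query is a full column scan, so every field is equally fast), and columns stored separately. This is a teaching sketch, not a description of Honeycomb's actual engine.

```python
from collections import defaultdict

class ColumnStore:
    """Toy columnar store: each dimension is its own list of values."""

    def __init__(self):
        self.columns = defaultdict(list)  # column name -> list of values
        self.count = 0

    def insert(self, event):
        for name in set(self.columns) | set(event):
            col = self.columns[name]
            # Backfill None for rows that predate a brand-new dimension,
            # so adding a dimension never requires a schema migration.
            while len(col) < self.count:
                col.append(None)
            col.append(event.get(name))
        self.count += 1

    def avg(self, field, where=None):
        # Scan: filter on one column, aggregate another. No index needed,
        # so any field can be in the filter with the same cost.
        rows = range(self.count)
        if where:
            fcol, fval = where
            rows = [i for i in rows if self.columns[fcol][i] == fval]
        vals = [self.columns[field][i] for i in rows
                if self.columns[field][i] is not None]
        return sum(vals) / len(vals) if vals else None

store = ColumnStore()
store.insert({"endpoint": "/cart", "duration_ms": 100})
store.insert({"endpoint": "/cart", "duration_ms": 300})
store.insert({"endpoint": "/home", "duration_ms": 50, "new_dimension": "x"})
# avg duration for /cart is 200.0, and the new dimension needed no migration.
```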

Henry Suryawirawan: [00:31:16] Wow. Sounds really cool. Thanks for sharing that, of course.

Building High-Performing Team [00:31:20]

Henry Suryawirawan: [00:31:20] So Charity, you mentioned also that you’re passionate about building high-performing teams. Lately I’ve seen one of your talks about CI/CD, and also the blog posts. So can you share with us a little bit more about what you think about high-performing teams? What is actually a high-performing team versus a low-performing one?

Charity Majors: [00:31:36] A high-performing team is one that gets to spend almost all of their time solving interesting problems that move the business forward. Not doing a lot of toil. Not working on things that they have to do in order to get to the things that they want to do. Not slogging through fixing the build again. Not slogging through outages. Not waiting for tests to pass. Not waiting for deploys to finish, fixing something and then realizing it was a different problem. Just all of the stuff that makes this job frustrating sometimes. I think that we tend to take for granted that it has to be that way. I’ve now worked on a few high-performing teams where it just doesn’t. If you get your **** in order, you can have a job where you spend most of your time solving new and interesting problems. And that is a joy, right? It brings a joy to your work that is just indescribable.

And in order to get there, I think that focusing on the DORA metrics is a really good place to start. How often do you deploy? How long does it take? How often does it fail? How long does it take to restore? Time to recovery. And a lot of improving those metrics involves leveling up in observability. Seeing what you’re doing, like putting on your glasses before you drive down the freeway. And a lot of teams, they’re really flying blind. That means that a lot of their work is wasted. I also feel like it’s important to note that high-performing teams need to feel emotional safety. They need to support each other. They need to be kind. And you don’t get high-performing teams by just hiring the best people or hiring the best engineers. You don’t. In fact, some of the people I’ve known who are the best individual engineers are pretty toxic to teams. Overall, as a leader, it’s your responsibility not to coddle the best engineers. It’s your responsibility to build teams that have a very steady, strong level of combined output. Because that’s what wins the race. It’s not individuals. It’s setting your teams up for success.

I just published a slide deck with a bunch of steps the other day. I do think there is a number one technical point that people should focus on. Now, I know nobody would disagree with the goal, right? High-performing teams should spend all their time moving the business forward. No *******’s going to disagree with that sentence. The hard part is how. How? And the place to start, for almost everybody I’ve ever talked to, is that you need to shrink the amount of time between when someone has written the code and when the code is live. You should get the interval down to 15 minutes or less. If you get that, you get to spend your time on a lot of problems that are much, much higher up the value chain. But if you get that wrong, you’re just going to waste all your time tending to the pathologies. And it’s really inefficient, ineffective. In my experience, if the number of people it takes to write and build the system is, say, x with a 15-minutes-or-less interval, well, if that interval stretches to hours, you need twice as many people. If that interval stretches to days, you need twice as many people again. And if it’s weeks, you need twice as many again. Because of the cost of context switching, and waiting on each other, and the diffs get bigger, and the reviews take longer. And now you need a dedicated SRE team to manage all of the deploys and the failures, and the changes get batched up. It’s just a death spiral. So the place that I think everyone needs to start is getting that fully automated in 15 minutes or less, and then everything is going to get so much easier for you.
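That "code written to code live" interval can be tracked like any other metric. Here is a minimal sketch with made-up timestamps; the 15-minute budget is the one Charity describes, and the data structure is purely illustrative.

```python
from datetime import datetime, timedelta

# Budget from the discussion: someone's change should be live within 15 minutes.
BUDGET = timedelta(minutes=15)

# Made-up sample data: (commit merged, change live in production).
deploys = [
    (datetime(2021, 5, 1, 10, 0), datetime(2021, 5, 1, 10, 9)),   # 9 minutes: fine
    (datetime(2021, 5, 1, 11, 0), datetime(2021, 5, 1, 11, 48)),  # 48 minutes: too slow
]

for merged, live in deploys:
    lead = live - merged
    status = "OK" if lead <= BUDGET else "OVER BUDGET -- treat as a regression"
    print(f"lead time {lead}: {status}")
```

In practice you'd pull these timestamps from your version control and deploy tooling, but the calculation is this simple: the metric is the gap, and the goal is to keep the gap under the budget for every single change.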

Henry Suryawirawan: [00:34:39] Or in short, it’s Continuous Delivery, right? So that’s what teams should aspire to.

Charity Majors: [00:34:44] Continuous Delivery for real.

15-Min Continuous Delivery [00:34:45]

Henry Suryawirawan: [00:34:45] Yep. So 15 minutes, to me, sounds like a really short amount of time, and there are so many things that people want to do in the pipeline. Like, for example, automated tests, performance tests, security tests, whatever tests. So is it really important to have 15 minutes as a max limit?

Charity Majors: [00:35:01] Yes. I feel like I’m being generous with that. It is important because you don’t want people to have enough time in there to be able to shift context. That moment when you’ve written the code, you want this to be a small diff, right? You want this to be a few lines. How quickly can you deploy one line of code? That should be your goal here, right? You want it to be short enough that you hook into people’s physiological systems. Their muscle memory. You want all of your developers to go and look at their code after they’ve shipped it. So it needs to be short enough that it hooks into your physical expectations, so that you don’t feel done until you’ve gone in and you’ve looked at it. And if that’s an hour or more, I’m sorry, that’s too much time. Someone’s gone and switched over to something else, and that means that they’ve flushed all of this context out of their head, and they’ve gotten distracted, and they’re not going to do it regularly. And if it’s an hour or more, I guarantee you’re going to start batching multiple people’s changes up together, when it’s really important that you be shipping one person’s change set at a time. Cause that creates ownership. If it’s your change going up, you’re going to go look at it. If you know it’s your change plus those of one to 20 of your coworkers going out at some point in the next day, you’re not going to go look at it. You need to construct that feedback loop and keep it crisp.

Testing in Production [00:36:05]

Henry Suryawirawan: [00:36:06] And I also noticed that you are one of the strongest proponents of testing in production, right? Is that also one of the reasons why? Because if you want to achieve this 15 minutes, you can’t do a lot of testing.

Charity Majors: [00:36:16] It all goes together. It all goes together. I think there’s been a major sea change over the past five years, in terms of the startups that have started and the way people are talking about this stuff. It used to be that we put all this focus on pre-production stuff. All of our energy went to things like every developer getting their own staging environment, and “Oh, there’s pre-production stuff and everything.” But we hit diminishing returns there a long time ago. And I feel like over the last five years, we’ve started to focus correctly much more on, “Okay, but now what about production?” We need tools for production that are like scalpels. That are very sensitive, very precise. Developer cycles are the scarcest resource in our universe. They’re not infinite. If we’re wasting them all on pre-production stuff, then we’re left with nothing for production.

I just think that the importance needs to be inverted there. We need to be thinking first about production. Do we have the tools that we need to ship safely and swiftly and understand what’s going on? And then, give 80% to production and 20% to pre-production. Instead of 80% to pre-production and 20% to production. But the point is not that everyone should be testing in production. The point is that everyone is testing in production. Whether you admit it or not, and you’re better off, if you just admit it, and try to do it well. Don’t deny, deny, deny, and then sneakily like SSH in to see what’s going on. No. Everybody has to test in production. You do. I do. It’s not embarrassing. It’s a fact of life. Now, how can we do it better?

Henry Suryawirawan: [00:37:36] Yeah. I mean, I must admit, I also test in production most of the time. Because you can’t really predict what actually happens.

Charity Majors: [00:37:42] Yeah. You have no way. The gold standard is to capture and replay like 24 hours’ worth of traffic. And I will say that the closer you get to laying bits on disk, the more paranoid you should be. I have written this suite of software three times for three different databases: it captures 24 hours of traffic and replays it again and again under different conditions and stuff. When you’re about to do that major database upgrade, you should absolutely ******* do that. But every day, when you’re shipping a few lines at a time? No, of course not. But you should invest that energy into observability. And if you want to get really precise about releases, you should be using feature flags, so that you can flip things on and off and decouple your releases from your deploys. You should invest in observability so that you can see, and in canaries. So if you’re really scared about a change, ship it to 1% of the traffic, or ship it to 1% of the users, or ship it to one node. Build the tools that will let you have this very high level of precision around who you’re shipping it to, and then the ability to compare the traffic streams so that you can see what the difference in error rates is. Or what the difference in latency is. Or anything like that. Build those tools to address your paranoia. Don’t say that you’re not testing in production.
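The canary comparison she describes reduces to a very small calculation once the traffic is split. A hedged sketch, with invented traffic data and an arbitrary tolerance threshold that any real service would tune for itself:

```python
def error_rate(requests):
    """Fraction of requests that returned a server error."""
    errors = sum(1 for r in requests if r["status"] >= 500)
    return errors / len(requests)

# Made-up traffic: the canary gets ~1% of requests, the baseline the rest.
baseline = [{"status": 200}] * 990 + [{"status": 500}] * 10  # 1.0% errors
canary   = [{"status": 200}] * 96  + [{"status": 500}] * 4   # 4.0% errors

delta = error_rate(canary) - error_rate(baseline)
TOLERANCE = 0.01  # acceptable worsening -- a per-service judgment call

if delta > TOLERANCE:
    print("canary looks worse -- flip the flag off and investigate")
else:
    print("canary healthy -- widen the rollout")
```

The same shape works for latency percentiles or any other dimension: compare the canary's stream against the baseline's, and gate the rollout on the difference.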

Shipping is Company’s Heartbeat [00:38:47]

Henry Suryawirawan: [00:38:48] So I’m interested to dig deeper into this 15 minutes. What actually has to happen in these 15 minutes, apart from building the software, packaging it, and deploying it?

Charity Majors: [00:38:57] Run tests. Absolutely run tests. I know that with some languages, like Golang, it hardly takes any time at all. We can run a **** load of tests. But this is achievable even for things like Ruby, which is notoriously slow to package and to run all this stuff. Our friends at Intercom, who are the ones who came up with the slogan that shipping is your company’s heartbeat, which I really value and respect, they run a Ruby monolith, and their entire build and test pipeline runs and deploys in under 10 minutes. It can be done. This can absolutely be done. It’s not hard. It’s just engineering work. You just need to commit to doing ongoing work. Set yourself a limit, alert when it’s exceeded, take it seriously, and invest in it over time, and you can hold yourself to a high standard.
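One way to "set yourself a limit and alert when it's exceeded" is to make the pipeline itself fail when it runs long, so slowness gets fixed like any other regression. A rough sketch; the wrapped command and the budget are placeholders, and real CI systems have their own ways to express this.

```python
import subprocess
import sys
import time

# Intercom-style budget from the discussion: build + test + deploy in under 10 minutes.
LIMIT_SECONDS = 10 * 60

def run_with_budget(cmd, limit=LIMIT_SECONDS):
    """Run a pipeline step; fail the build if it exceeds the time budget."""
    start = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    if elapsed > limit:
        print(f"pipeline took {elapsed:.0f}s, over the {limit}s budget")
        return 1  # treat slowness as a build failure, not a fact of life
    return result.returncode

# A trivially fast stand-in for your real build-and-test command:
rc = run_with_budget([sys.executable, "-c", "pass"])
print("exit code:", rc)
```

The point is to encode the limit somewhere the team can't quietly drift past it, instead of leaving it as an aspiration.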

Henry Suryawirawan: [00:39:39] So I’m also interested in this phrase, “Shipping is the heartbeat of your company.” Maybe you can tell us a little bit more about that.

Charity Majors: [00:39:45] Again, this came from Intercom, and I adopted it because I love it so much. Shipping, for any tech company, giving value to our users, is why we exist. Shipping new changes, shipping new code for users, should be absolutely as common, as ordinary, as uneventful, and as regular as possible. It should be a nothing. It should be a nothing burger. Ask me how many times a day we ship? I don’t ******* know. We ship all the time, like a couple of times an hour. I don’t know. Cause we’re making changes constantly. And it is because those changes are small and uneventful that it’s not a big deal. I feel like we need to invert the way that we think about this. This is a social change. This is a mindset change. But it’s not a new thing.

The principles of CI/CD have been around for what, 15 years? Accelerate was written with all of that data years ago. We’ve known this for a very long time. We just need to put it into action. And the thing is, this is so good for engineers. Engineers who have worked with true CI/CD are unwilling to ever go back. Cause it is like a different profession. It is so much more interesting and exciting and up to the moment. You see your real impact on users’ lives many times a day, and that’s fundamentally motivating and inspiring. It’s real hard to ever be willing to go back to a situation where releases are so far away from your lived experience, or are someone else’s problem. You don’t want that.

Sociotechnical Aspect of Teams [00:41:01]

Henry Suryawirawan: [00:41:02] And I think this is related to the sociotechnical team aspect that you mentioned, right? Like you mentioned earlier, you don’t need all the highest performers in your team, you just need a team that works well together. Maybe you can also share a little bit: what do you mean by the sociotechnical aspect of a team?

Charity Majors: [00:41:18] Yeah. So for example, think of it this way. The New York Times is a technical and social organization. Now, imagine that all of the people on the tech side got flown to The Bahamas for a month, and they brought in a new set of people who are all experienced, talented, good engineers, etc. You brought in a set of replacements to keep everything running, and then something breaks. How long do you think it would take them to figure out and fix even a very small problem? First of all, how would they even get into the systems? It’s a thought experiment to show you just how much of the system lives in our heads. These people are not fungible, right? Engineers are not fungible. People are a very important part of the technical system, and you can’t just separate them. I think it’s really important, especially as we’re entering this age of Machine Learning this, AI that. We should remind ourselves that no, people are really ******* important here. The part of the system that lives in your head is a part of the system.

At Honeycomb, we interviewed lots of engineers, and what’s really important to us is their communication skills. Now, they need to be able to write code, obviously. But the real interview is not us looking at their code, it’s us talking to them about their code. We’re just having a conversation: talk to me about your choices here. Why did you choose this? What else might you have done? What trade-offs did you have in mind when you were doing this? Because there are so many people who are capable of writing great code but can’t communicate about it, and those aren’t the people that we want for our team. Cause we believe that high-performing teams come from people who are ready, willing, and able to talk about what they do, and why they’ve done what they’ve done, without getting their ego too wrapped up in it. Just being able to have a conversation about it. Christine and I could have just hired all of our ex-Google, ex-Facebook friends. We knew that we didn’t want to create a cultural bubble, so we made sure to go outside of our networks. But we were never looking for the best programmers. We were looking for the best communicators. The level at which this team performs just blows me away every single day. They’re an order of magnitude better than the elite category in the DORA metrics. And it’s not because they’re the flashiest coders. It’s because they are dogged about improving their craft, and it’s really inspiring.

Good Traits to Look for in Engineers [00:43:17]

Henry Suryawirawan: [00:43:17] So what else do you look for in a candidate when you interview them, apart from communication skills and, of course, the basic programming skills they need to have? What other things do you look for in an engineer?

Charity Majors: [00:43:27] We look for humility. We look for people who can be gently challenged and don’t fall apart. We look for people who are constantly curious. I guess growth mindset is a term of art for it, which feels a little corporate-y to me. But we want people to be able to come to a problem curious. Not terrified that it will turn out that they don’t know what they’re doing. Because it turns out we don’t know what we’re doing either. All day long, none of us know what we’re doing. We have company values like: we hire adults. We don’t want to be your family or your whole life. But we also do hire people who take a lot of pride in their craft. We don’t want to hire people who just go to work to pay the bills either. Because we take a lot of pleasure in what we do. Nobody honestly has more than four hours a day in them of hard, focused engineering labor. And it’s so much more important to me that people know how to manage their time to get themselves those four hours than for them to sit at their desk 24 hours a day. We hire adults. Other values are: everything is an experiment, and fast and close to right is better than done and perfect. Which traces all the way back to, and manifests in, our storage engine, right? That’s our philosophy for storing data. It turns out it’s our philosophy for people as well.

Feedback as a gift is another one. Because I think that so many pathologies in the modern workplace can be traced back to the fact that people are scared to give each other feedback. And so it builds up, and it becomes a big thing. And if we can just deescalate that, and make it not a big thing to be constantly giving and receiving feedback, and to be able to give feedback knowing that we might be wrong about it. Just like, “Hey, here’s something that I’ve noticed that I thought might help you.” Cause we’re all on the same team, right? We should be able to help each other improve, which is so easy to say and so hard to do. Because we all have our own personal damage. We all have our own traumas and stuff around this. So we’re all constantly having to learn and unlearn bad behavioral patterns. But there were a couple of people who said to me after they started working at Honeycomb, “Wow, you guys talk about your feelings an awful lot.” And I’m just like, embarrassed. It’s okay. Yes. Female-founded company. Yes. I guess we do talk about our feelings a lot. But you know what? It’s better than the alternative.

Henry Suryawirawan: [00:45:28] Thanks for sharing all that. I think those are really good lessons for other teams who would like to build a high-performing team.

Charity Majors: [00:45:35] You know, we spend more time at work with each other than we do at home with our partners and our families. I think that we shouldn’t be embarrassed about talking about our feelings and being emotional human beings at work. Because if you’re not, well, you’re just faking it, and I don’t believe that lets you perform at a high level over the long term.

The 5th Key Metric [00:45:51]

Henry Suryawirawan: [00:45:52] I agree with you on that. So, one thing that you mentioned about DORA metrics: I think in some of your presentations I saw that you actually add one more metric. Not just the four key metrics. So what is the fifth? Maybe you can share.

Charity Majors: [00:46:04] I think that every team should be tracking how often someone gets paged out of hours. I strongly believe that software engineers need to be on call for their own code, but this comes hand-in-hand with a commitment from management to make sure that it does not suck. I’m not just saying, software engineers, you need to go be masochists too, just like ops has been all these years. No. The reason to put software engineers on call is because this is how we make it better. This is how we close that feedback loop. How we make it short. How we get things fixed quickly, so they don’t fester and become all infected and gross and big problems. We fix them when they’re small. So I think that these two have to go hand-in-hand. Yes, developers need to be on call for their code, and yes, management needs to make sure it doesn’t suck, and probably compensate them for their time too. I think it’s reasonable to expect that any engineer who works on a 24/7, highly available system should expect to get woken up once or twice a year for their work. I think that’s reasonable. I do. I think it’s compatible with an adult life. The exception being if you have a child that is waking up in the middle of the night. I would never put one of those people on call. If you have a baby, you are out of the on-call rotation until your kid is regularly sleeping through the night. It’s not fair to ask anyone to wake up for two things, right? So, with that caveat.
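This fifth metric is straightforward to compute from pager history. A minimal sketch with made-up data, an assumed 9-to-6 working window, and invented names; a real version would pull events from your paging tool and respect time zones.

```python
from datetime import datetime

def out_of_hours(page_time, start_hour=9, end_hour=18):
    """True if a page landed outside the (assumed) working window or on a weekend."""
    return not (start_hour <= page_time.hour < end_hour) or page_time.weekday() >= 5

# Made-up pager history: (engineer, timestamp of the page).
pages = [
    ("ana", datetime(2021, 5, 3, 14, 30)),  # Monday afternoon: in hours
    ("ana", datetime(2021, 5, 4, 3, 12)),   # 3am: counts against the team
    ("ben", datetime(2021, 5, 8, 11, 0)),   # Saturday: counts too
]

counts = {}
for who, when in pages:
    if out_of_hours(when):
        counts[who] = counts.get(who, 0) + 1
print(counts)  # -> {'ana': 1, 'ben': 1}
```

The number itself matters less than the trend and the accountability: if it climbs, management owes the team time to fix what's paging them.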

These are human systems, right? So there are exceptions to everything. I’m not actually saying every person must be on call. At Facebook, I had a guy on my team who was totally willing. He wanted to share his part of the burden. But when he took the pager, he got such anxiety, and it wasn’t cause he was getting called or paged all the time. He was just so afraid. He had a personality that was not compatible with it. We tried for a couple of months, and he gave it a good shot. But I just felt so bad for him, and the whole team did, right? They were just like, “Okay. No, Alan doesn’t have to be on call. What can Alan do instead?” And so Alan took over ownership of the build pipeline and became the build cop. So that was technically more work, but it was during business hours, and the rest of us were totally fine picking up the slack. So it’s about pulling your own weight. It’s about being willing to support your own code. It’s about management making sure you have the time to fix the things that need to be fixed. Cause you can’t just put people on call and then tell them to build features all the time. If you’re putting them on call, you have to give them the time to fix the **** so they’re not getting woken up all the time. And I just feel like we should have that metric and hold managers accountable for it. Never promote a manager whose team’s pagers are going off at all hours. That is just a sign of terrible management. Their team is going to burn out. It should come down on them the hardest.

Henry Suryawirawan: [00:48:24] Wow. Thanks for sharing that. I think, yeah, it’s a really good key metric. Especially when you have so many distributed systems in production, where problems could happen at any point in time, and worse if you have a global system where your users come from anywhere in the world. I think it’s also critical to monitor this, like how often you get paged per night, right?

Charity Majors: [00:48:41] Because that’s what burns your teams out faster than anything.

Engineer/Manager Pendulum [00:48:45]

Henry Suryawirawan: [00:48:45] So, one of your most popular blog posts is about the engineer/manager pendulum. Because this podcast is also about technical leadership and engineering management, maybe you can share a little bit about what you mean by this pendulum between engineer and manager?

Charity Majors: [00:48:59] Yeah. Well, when I became a manager, I was told that it was a one-way trip. And I’ve heard this repeatedly. People are agonizing over whether to become managers because they see it as a one-way trip. They’re not sure if they can get hired as an engineer afterwards. They’re not sure if they should. They’re often told that they shouldn’t. And I feel like this has to end. The best technical leaders that I’ve ever worked with have been people who have spent time as managers. Because you learn things about how to help a team overcome obstacles and work as a unit that you can’t learn any other way. You learn things about how the business works that you can’t learn any other way. But they have also been hands-on within the past five years. Because when you take a step away from the day-to-day of building and shipping code, you’re on a path that diverges, right? And there comes a point where you are not an effective leader of people who are writing and shipping code. For the sake of your own career, you need to be cognizant of that.

And there are two different paths you can take. If you want to continue managing people, you can make it your goal to work your way up the ladder. Now, there are an order of magnitude fewer jobs at each level as you go up from manager to senior manager, to director, to senior director or VP or whatever. There are fewer and fewer jobs. It is more and more competitive. There’s more and more bias involved, I think. But for people who love it, it can be very rewarding. But if you stay a line manager who manages engineers, who manages the people who are writing and shipping code, you’re going to need to make your way back to the well from time to time. You can’t go more than five years without building and shipping code regularly yourself and still maintain that level of attention and empathy, and immersion in the problem set. Your engineers won’t respect you as much. You won’t be able to mentor and develop engineers as well. You won’t have opinions that you can trust about the technology.

And there are lots of places out there that just take this for granted. They’re like, managers shouldn’t know anything about technology. They shouldn’t have opinions. We rely on our tech leads for that. And that is a choice. I think that it is a choice that is becoming less and less popular or prevalent. Because I think the outcomes are materially worse. I think that it serves everyone when a manager has opinions of their own that they can trust about what’s going on.

Now that said, you do have to train yourself not to just speak up all the time as a manager. This can be a very humbling thing, right? Many of us became managers because we were top dog, right? We were the most technical or the most senior or whatever. But once you step out of that position and you’re the manager, it’s your job to guide people up to reclaim the oxygen that you once breathed. It’s your job to grow people into that space. And this can require giving up a lot of ego on our part. But it’s necessary. The best way to grow people is to have a manager who actually understands what they’re going through and can help guide and nudge and suggest opportunities for them.

So I feel like this is a change that needs to be fought on the ground, company by company. We need to explain to our executives why it’s better for everyone. We need to build this into our org charts. We need to stop talking about management as a promotion. It’s not a promotion. It’s just a change of career. It’s a lateral step. What that means is that it’s not a demotion if a manager wants to go back to engineering. You don’t want to have managers who are resentful or who miss being engineers, and who try to reclaim that glory, and try to be the one who’s always talking, who has all the best answers. You want your managers to want to be managing people. You want the people who were better as engineers to try it for a while, maybe, and then be able to switch back to engineering without any loss of status, without any loss of ego. So it’s really important that management is not a promotion, just a change of career. I think anyone who’s trying management out should ask for a two-year commitment. Otherwise, it’s too rough on the teams if you’re shuffling managers around constantly. You literally have to become a different person to be an effective manager. You have to retrain your own instincts, your own reactions, the way that you think, your pacing. Your pacing on a daily, weekly basis is so much different as a manager. You have to gain the instinct and intuition to know when you’re doing a good job and when you’re doing a bad job. That takes at least two years to build. And so in the early days as a new manager, everybody’s all, “Ah! Am I doing a good job or not? Ah! Do I like this or not?” And it’s like, chill out. You don’t know. You have no way of knowing. Just turn the voices off. Learn everything you can and get back to me in two years, and then we’ll have a conversation about it.

But I feel like the industry would be so much better as a whole if we de-mythologized management. If pretty much anyone who is interested in trying management got a fair shake at it. And then, if we asked ourselves honestly, “Is this good for me? Do I like it or not?” And there should be a path back and forth that is well-trodden. I feel like so much of the pathology that I see on teams comes from the relationship between management and engineers. If we could just make it less of a thing. Just make it another thing that you can try, and it’s not a big deal. It’s not a promotion. It’s not a big ego thing. Because you’re not in charge of people. It’s not like that. It’s a support role. I hadn’t really thought of this until now, but I think in the way that the interval between when you write the code and when you ship the code is the fundamental building block of a healthy software digestive cycle, the pathway between senior engineer and manager being an equal one and a well-trodden one is the fundamental building block of a really good social side of the sociotechnical organization.

Henry Suryawirawan: [00:53:59] I also think many engineers, of course, aspire to try out management. So that’s the swing from engineer to manager, right?

Charity Majors: [00:54:06] And they should. Because even if, long-term, they want to be a senior technical leader, there are things that you will learn as a manager that will give you so much more empathy and make you so much more effective as a technical leader, from doing a two-year stint as a manager. I think it’s a good thing for engineers to try management.

Effective Manager [00:54:21]

Henry Suryawirawan: [00:54:22] And maybe, what are some tips for new managers out there? For an engineer who wants to try being a manager, how do they become effective in that new role?

Charity Majors: [00:54:31] Yeah. I think that a lot of it comes down to self-awareness. So number one, I’d say, go to therapy. Go to a therapist, and go to therapy every week. Learn stuff about power dynamics. Especially if you’re a white or Asian dude in Tech, I think that there’s a lot of stuff that you need to learn to be sensitive to that you’ve perhaps never been sensitive to before. You owe it to your reports to learn about those things. So start following some feminists and some people of color, and learn about power dynamics. Learn how to listen. All of the advice that I would give for new managers is really about self-awareness and humility and communication skills. Which is another reason that I think it’s a really interesting two or three-year journey to go on. As an engineer who’s now in your thirties, who maybe has a family now, I think it’s a nice change of pace. I think that it dovetails really well with just growing up.

Concerns on Manager/IC Pendulum [00:55:19]

Henry Suryawirawan: [00:55:19] And I think one of the concerns about the pendulum swinging back from manager to engineer is that once they go back to being an individual contributor, it’s probably difficult, especially in this region at least, the way I see it, to get that management role back later. Especially when people look at the years of experience or the bigger team that they managed before. So what do you see in terms of this pendulum back? How can someone assure themselves that it’s actually safe? That I can go back to IC, and I can always go back to being a manager.

Charity Majors: [00:55:50] Choose your companies carefully. Especially as you get to be more senior in your career. You should be interviewing them just as much as they’re interviewing you. Ask them about their philosophy of management. Is it a promotion? How do they think about it? Has anyone ever gone back from being a manager to IC? Would you discourage them? Talk to them about it. You really want to find a place that has a culture that matches your own aspirations. Also, like just comfort yourself that if it’s been two or three years, you can go back. It’s fine. Especially if you’re a dude, right? Nobody’s going to even question you. If it’s been two or three years, yeah, it’ll take you a month or two. You might have to study a little bit for the coding interview, but you know, you’ll be fine. You might have to change jobs if you aren’t at a company that values this. But it is honestly much easier to go back from being a manager to being an engineer, because this is a superpower. Everyone wants engineers. There is so much more demand for engineers than there is availability. So, if it’s only been two or three years, you’re going to be fine. But if you’re concerned about that, then make sure it’s only two or three years. Don’t go to five years or beyond, or it could actually be very difficult.

3 Tech Lead Wisdom [00:56:49]

Henry Suryawirawan: [00:56:50] So it’s been a pleasant conversation. But before we end our chat here, I normally ask all my guests to share their three technical leadership wisdom. So Charity, do you have some wisdom that you want to share with all my listeners here?

Charity Majors: [00:57:01] Yeah. I think number one is your career is an enormous asset. It is the most valuable thing you possess. It is a multi-million dollar appreciating asset, and you should treat it that way. In my twenties, I didn’t think of my career that way. I just lurched from one thing to the next, you know, seat of my pants. And it turned out okay for me, I was pretty lucky. But I think that a little bit of introspection, a little bit of where do I want to be? And pointing your nose in that direction. It has a way of making you attuned to opportunities that open up. If you’re like someday, I might want to be a manager, then you might be more likely to take a role at a company that has some obvious management roles opening up. So I just feel like, treat your career the way that you would want a friend of yours with a multi-million dollar appreciating asset to treat their assets. Think about it for the long term. Sometimes this means taking a job with a little bit less pay because you think it makes the story of your career look much better.

I think that being in a position where you are not spending all of your money every month is a very good tip for everyone. Have some ******* money in the bank. Don’t live paycheck to paycheck, because then you’re trapped. You can’t take any jobs that pay any less than that, and sometimes you’re going to want to. Look at your career as a long-term asset and think about the story that you want to tell. And it’s not something that has to take up a lot of time and energy, right? Just a couple hours a year, I think, will really help.

It’s really valuable, though. Like we are so fortunate to live in this time and be doing this thing. Because this is the only growth industry of my lifetime. We live in a different economic reality than most of the world does. Most of the United States has been in a depression for 15 years. Not in Tech. Opportunities abound. Yes, Tech has problems with sexism and racism, and yet, I often feel like I want us to just tone down some of the **** that we tell the kids about. Cause I’m like, if you’re telling them not to come to Tech cause we’re too sexist here, where do you want them to go? What is this magical, mythical industry where they will be highly paid and valued? Where they’re not going to have to deal with all this ********. This is called living in the world, guys. Like it just is this way. And so I think that we owe it to ourselves to get as many women and under-represented minorities into Tech, and help them survive and tough it out. Because we’re so ******* fortunate and blessed.

So that was two or three things all in one, I guess. Try not to be financially constrained. Try to be an ally to women, other genders, and races. The advice that I always give to young women who ask me for advice is: when you’re a junior engineer, just toughen up. This is problematic. I absolutely agree. Toughen up. Try not to take it too personally. Put your head down, learn as much as you can, just survive. Become a senior engineer, and then get sensitive, and then learn to be sensitive on behalf of other people. Use your power for good. Anyone who’s a senior engineer in Tech has an absolute responsibility to learn to be sensitive, and to use our power for good on behalf of those who can’t.

Henry Suryawirawan: [00:59:45] Wow. Thanks for sharing all this. I think it’s a message that we all need to learn. Especially these days, when so many things are happening, in tech and around the world. So thanks for sharing that. So Charity, if people want to connect with you or find out more, where can they find you online?

Charity Majors: [01:00:00] I am on Twitter as mipsytipsy, my EverQuest enchanter’s name, and charity.wtf is my blog, and Honeycomb.io is where you can sign up for a free Honeycomb account and use the build event stuff to visualize your CI/CD pipeline the way I was talking about. It’s super fun.

Henry Suryawirawan: [01:00:18] I’ll make sure to try that one day.

Charity Majors: [01:00:21] It takes like less than an hour. It’s super fun. You get to see where all your time is going. It’s great.

Henry Suryawirawan: [01:00:24] Cool. So thanks again for your time, Charity.

Charity Majors: [01:00:26] Thanks Henry.

Henry Suryawirawan: [01:00:27] I wish you good luck with Honeycomb and everything that you do.

– End –