#57 - Observing Your Production Systems and Yourself - Jamie Riedesel


“Software telemetry is what you use to figure out what your production systems are doing. It’s all about shortening that feedback loop between the user experience and the engineers who are writing the user experience.”

Jamie Riedesel is a Staff Engineer at Dropbox working on the HelloSign product and also the author of “Software Telemetry”. In this episode, Jamie shared an overview of software telemetry and explained why it is important for us to understand how our production systems are behaving by using those telemetry data. She also explained different software telemetry types, concepts such as observability and cardinality, and shared some software telemetry best practices.

In the second part of our conversation, Jamie opened up and shared her own personal experience dealing with toxic work environments. She emphasized the importance of self-awareness and psychological safety, as well as went through the five key dynamics to a successful team based on Google’s re:Work blog post.  

Listen out for:

  • Career Journey - [00:05:15]
  • Software Telemetry - [00:07:22]
  • Knowing Your Production System - [00:12:13]
  • Types of Software Telemetry - [00:16:45]
  • High Cardinality - [00:22:34]
  • Observability & Buzzwords - [00:27:08]
  • In-House vs. SaaS - [00:30:04]
  • Some Telemetry Best Practices - [00:32:35]
  • Toxic Workplace - [00:38:45]
  • Identifying Your Toxicity - [00:44:18]
  • Psychological Safety - [00:49:02]
  • Identifying a Person’s Baggage - [00:53:52]
  • Who is On The Team Matters Less - [00:58:09]
  • 3 Tech Lead Wisdom - [01:01:49]

_____

Jamie Riedesel’s Bio
Jamie Riedesel has over twenty years of experience in the tech industry, and has spent her time as a System Administrator, Systems Engineer, DevOps Engineer, and Platform Engineer. She is currently a Staff Engineer at Dropbox, working on their HelloSign product. Jamie’s blog at sysadmin1138.net has been there since 2004 and survived the apocalypse of Google Reader shutting down. Jamie is the author of “Software Telemetry” through Manning Publications, and also has a deep interest in reforming team cultures to be less toxic.

Follow Jamie:

Mentions & Links:


Our Sponsors
Are you looking for some cool new swag?
Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.
Check out all the cool swag available by visiting https://techleadjournal.dev/shop.
And don't forget to show it off once you receive your order.


Like this episode?
Subscribe and leave us a rating & review on your favorite podcast app or feedback page.
Follow @techleadjournal on LinkedIn, Twitter, and Instagram.
Pledge your support by becoming a patron.


Quotes

Software Telemetry

  • Software telemetry is what you use to figure out what your production systems are doing.

  • The oldest of them all is centralized logging.

  • The next one to really get its own recognition is what we call metrics now, which grew out of what ops called monitoring.

  • Operations and development worked together and came up with these systems to provide metrics for the software, as well as using a lot of the same systems that the ops side was.

  • And it was somewhere around 2010, give or take a few years, when we really started breaking out. The first major time series databases, which are specifically designed for metrics, started coming out.

  • Just about all the open source telemetry systems you see now came out of large companies, solving their own problem and realizing they could share.

  • Tracing took a little while to catch on and really come into its own.

  • Observability is more of a practice than a technique. But back then, observability was a tracing system.

  • Observability isn’t a system, it’s a practice. Whatever it is you need to do to figure out what your systems are actually doing and creating observable systems.

  • That’s a cloud philosophy. If your provider does it, make them do it so you don’t have to do it.

Knowing Your Production System

  • It goes back to one of the major revolutions that hit the marketplace about 2008 or 2009. I’m talking about the DevOps revolution. Operations and dev were historically very separated, and what usually happens is that software engineering and operations are sort of passive-aggressive toward each other.

  • The DevOps revolution is far from evenly spread. I mean, there are still corners of the industry that haven’t caught up.

  • And once you’ve got that relationship building, suddenly a miracle thing starts happening. Software engineering starts caring about how it’s performing in production.

  • And it was the software as a service revolution that really brought this out. It shortened the feedback loop because back in the ancient .com days, the feedback from release to feedback from the customers could take literal months. Now, these days with your front end frameworks, you can run AB experiments in real time on your Javascript, and your BI team absolutely is paying attention to what the users are doing.

  • If you, as a software engineer, aren’t paying attention to this, you’re falling behind the curve, especially if you’re doing the software as a service.

  • These sorts of feedback systems - all about delivering high quality software, high quality solutions, releasing quickly with fewer bugs, getting that end-to-end feedback loop established - allowed so many synergies along the way.

  • It’s all about shortening that feedback loop between the user experience and the engineers who are writing the user experience. These telemetry systems give you so many ways of looking at it.

Types of Software Telemetry

  • The three big ones: the first is centralized logging, which is all those log statements that send words to a text file. Just a print statement sent to a file. The kind of thing we all use during early stage debugging.

    • Logging is very much print statements as a service. Typically these days, centralized logging, thanks to syslog, frankly, has a concept of log levels. So that is your debug, your info, your warn, your error.
  • The second one, by age, is the metric systems, which I mentioned before, kind of grew out of the old operations monitoring systems, or, if not grew out of, were heavily inspired by them. It’s a number with a small number of text tags attached to it, so you can look at numbers in great big aggregates.

  • The third major one that came on was distributed tracing, which has a bit of an interesting history. Because that descended in part out of metrics.

    • Metrics are limited cardinality. The reason they’re limited cardinality is because when they came out, around 2010, databases were expensive. Solid State Drives weren’t everywhere yet. So you had to be very careful about how fast and how often you wrote to disk.

    • Fast forward a few years, Solid State Drives are now a thing. And people are realizing that, you know what, metrics are a lot better if they can handle way more cardinality in there, or actually, the dimensionality.

    • Distributed tracing brings the real power by building these essentially gigantic linked lists of events. So you can bring up the entire process. In a distributed tracing system, you can see the entire flow on a single screen.

  • There’s a fourth one I talk about in the book, and it’s a bit of a stealth one. I call it the SIEM, the Security Information Event Management system.

    • It’s what’s used by security teams. It’s a combination of all three of the other styles. But instead of processing code, SIEM systems are more about tracking, tracing, and observing how people interact with the production systems.
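To make the distinction between the three main types concrete, here is a purely illustrative Python sketch of the kind of record each one carries. The field names here are hypothetical, not taken from any particular telemetry product:

```python
import time
import uuid

# Centralized logging: a structured event, "print statements as a service".
log_event = {
    "timestamp": time.time(),
    "level": "error",          # debug / info / warn / error, per syslog tradition
    "message": "payment failed",
    "order_id": "ord-123",
}

# Metrics: a number plus a small set of low-cardinality tags,
# designed to be viewed in great big aggregates.
metric_point = {
    "timestamp": time.time(),
    "name": "checkout.latency_ms",
    "value": 412.0,
    "tags": {"region": "us-east", "status": "error"},
}

# Distributed tracing: spans linked by IDs into one end-to-end flow,
# essentially a gigantic linked list of events.
trace_span = {
    "trace_id": uuid.uuid4().hex,   # shared by every span in the request
    "span_id": uuid.uuid4().hex,
    "parent_span_id": None,         # None marks the root span
    "name": "POST /checkout",
    "duration_ms": 412.0,
}
```

The trace span's shared `trace_id` is what lets a tracing system stitch the entire flow together onto a single screen, as described above.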

High Cardinality

  • Cardinality is the single biggest constraint of the databases behind telemetry systems.

  • At its base description, cardinality is the number of unique values in a given field, multiplied by the number of unique values in every other field, to give you the complete database cardinality.

  • There’s a special field called the timestamp that hopefully is unique for every entry.

  • There’s a different one, and that’s called dimensionality. This is like the number of unique fields in the database. This is where we get some interesting times because different databases handle different problems very well.

  • Cardinality is a major thing to worry about, and frankly, the distributed tracing and observability systems we’re getting simply were not possible until we had more modern databases with better cardinality tolerance.
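The multiplication described above can be sketched directly. Because total cardinality is roughly the product of each field's unique-value count, a single high-cardinality field (a per-user tag, say) multiplies the whole ceiling. The field names and counts below are hypothetical:

```python
from math import prod

# Hypothetical tag fields and their unique-value counts.
unique_values = {
    "region": 4,
    "service": 20,
    "status_code": 10,
}

# Worst-case series count is the product across all fields.
assert prod(unique_values.values()) == 800

# Add one high-cardinality field -- say a per-user tag -- and the
# ceiling multiplies by every unique user the system has ever seen.
unique_values["user_id"] = 1_000_000
assert prod(unique_values.values()) == 800_000_000
```

This is why metrics systems of the 2010 era kept tags few and low-cardinality, and why modern databases with better cardinality tolerance were a prerequisite for tracing and observability systems.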

Observability & Buzzwords

  • The Pillars of Observability. It’s the logs, metrics and traces. Using those three, you can create an observable system.

In-House vs. SaaS

  • Frankly, I believe (for) any company just starting out - trying to build this thing that they want to start off with - absolutely, use SaaS for everything.

  • Every SaaS provider I know of charges by ingestion rate, which is a great thing for them because as you grow, so does your bill. So, the cost sensitivity is a major thing here.

  • There comes a time in any company’s growth, especially startups that have to grow, when you have to make that decision. Okay, we can continue paying this well-established partner, who has served us very well, lots and lots of money, or we can take the time to build our own internal service that is mostly good, maybe lacking in a few areas, but way cheaper. That sort of decision is more of a business process decision than a techie one.

  • I will say that most companies will be better off trying to use a SaaS provider, because these are companies who are doing what they can to make the user experience of telemetry the best possible.

  • Short version is use SaaS for most things, but there comes a time when that decision to in-source starts making sense for business reasons.

Some Telemetry Best Practices

  • If you are using logging in any way, absolutely use a structured logger. Don’t just use print statements to a file.

    • Every language I’ve run into, with the possible exception of COBOL, has some way to do structured logging.

    • A structured logger provides the interface into logging for your code. “Logger” is an instantiated version of your centralized logger, and “.info” indicates the priority of the event.

    • Most centralized loggers are pluggable to a certain extent, just to kind of take all the hassle out of shipping stuff out of your software into the main system.

  • The second point I have is to have a tracing system. These days, that is the missing leg.

    • If your telemetry systems have a lack, it’s probably the lack of tracing. You do need one. It gives you so many insights, and if you can’t put it into your production systems, put it in your CI systems.

    • If that one pull request is different, you can actually track over time how various functions change on your CI systems. So you are much better able to identify performance regressions.

  • The third one I gave is be aware of privacy information and health information in your telemetry, even in your tracing system.

    • If you find that your telemetry systems are storing these data, that means you have to treat your telemetry systems with the same iron-box controls, two keys required to get in and look at anything, that you apply to your production data. And most companies find that to be too much bother.

    • The cardinal rule of a telemetry system is, you need to make it easy to use.

    • The other side effect of paying far more attention to privacy and health information is that you have to spend less time dealing with spills.

    • And spills can be incredibly costly because if you find something that’s been there for a year and a half, you may have to go back and clean up a year and a half of your historic telemetry.

  • The fourth and final one, and I mentioned this before, is definitely manage cardinality.

    • (Because) that is the thing that will kill your database, if you’re managing it yourself.

    • Now, if you’re using a software as a service provider, they will make you care about cardinality by how their pricing structure works.
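The first practice above, structured logging, can be sketched with Python's standard logging module. This is a minimal illustration rather than a production setup; the JsonFormatter class and the `fields` convention are assumptions of this sketch, not part of any specific structured-logging library:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),   # debug / info / warning / error
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Carry through any structured fields passed via `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `logger.info` carries the priority; `extra` carries queryable fields,
# instead of burying "ord-123" inside a free-text print statement.
logger.info("payment failed", extra={"fields": {"order_id": "ord-123"}})
```

Because each event is a JSON object rather than free text, the centralized logging system can index fields like order_id directly, which is the practical difference between a structured logger and "print statements to a file."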

Toxic Workplace

  • A toxic workplace, from the point of view of the person experiencing it, is one that causes that person to adopt coping strategies to maintain psychological safety. And when they apply those coping strategies to their home life, their volunteer work, or their next job, it creates toxicity for other people. So, this sort of toxicity is communicable. It builds up over time.

  • What I didn’t recognize at that period was that I was stopping the flow. Now, if you’ve worked in agile, especially in a smaller company, agile is all about flow. Those sprint planning meetings, the status updates, are all about the flow of ideas, constant flow back and forth. And me throwing that hard “No” dropped the ball, which meant somebody had to pick up the ball again and get it moving around. And I did that often enough that people noticed I was hard to work with. So that was me sharing my toxicity with other people.

  • I just hadn’t realized I’d internalized this toxicity and was sharing it in unhealthy ways.

  • There are ways you can identify that someone is reacting to a bad situation. They’re seeing it here, even though it’s not there. But they’re sure it is, and so they’re throwing up all the coping strategies, trying to maintain psychological safety. That’s where so much of the toxicity in this industry comes from. Because one of the key things is that it doesn’t matter where the toxic workplace was.

  • Something early on in your career could poison an entire 20 years. If you’re a senior, you’ve got some of this stuff. You don’t get to senior in this industry without picking up your own share of damage along the way. So yeah, identify it in yourself and start learning to identify it in others and help them work through it.

Identifying Your Toxicity

  • I’ve given this talk two different ways. One talk I give is to people who are individual contributors, team members, and in that one, I’m trying to get people to retrospect on their lives.

  • Have that realization. Because once you realize, once you notice that you do it, you can actually start working on it. So that’s one of the big things. So that self-actualization part.

  • If you notice someone who just is absolutely quiet in meetings, they never talk in meetings. But when you get them one-on-one, they have a lot of ideas. But they never talk in the meetings and you need them to talk in the meetings, cause that’s where stuff is exchanged. Now, that kind of thing is a reaction to past trauma.

    • There are several different ways that could be, and the only way to really figure it out is to ask them about their history.

    • Another reason why some people can be quiet in meetings is maybe they just don’t like big groups of people. And so if you get them on one-on-one, maybe you could solicit their feedback that way.

    • It’s about learning work styles in a lot of ways. You can turn some of these mice who never talk in meetings into people who give a lot of solid feedback just by finding the way that works best for them.

  • Another one, and this is very much personal for my own ops background, is cynicism.

    • They’ve [ops people] been dealing with power differentials for so long. It’s like cynicism is natural. It’s human. It’s what you do. That sort of cynics handshake is so common when you’re kind of on the wrong side of these differentials.

    • For a team that is truly generative, who’s figured things out, and it’s highly feedback driven, that sort of cynicism is poison, absolute poison. Pay attention to it. So listen.

  • There’s one blog post that I wish more people really internalized. And this is Google’s re:Work blog from November of 2015. Because that’s where the phrase psychological safety got injected into all talk of culture. There are four other points on there. And if you miss on any one of those, you’ve got some toxic elements, because people need to retrain themselves to maintain psychological safety.

Psychological Safety

  • Psychological safety is the feeling that a person has that they can safely take risks. Or that they know what behaviors they need to do to be able to stay safe within a given job.

  • You can maintain psychological safety by providing coping skills.

  • If you have a manager playing power games, one coping strategy that’s very effective is just disengage completely. Do what you’re told. Nothing more, nothing less.

  • The coping strategies that maintain safety in one environment aren’t always appropriate for another.

  • Another big one, and that’s number two on the re:Work blog list, is dependability, which is: I can depend on my coworkers to deliver what they promise. And it’s incredibly important because if your coworkers can’t deliver anything, you just tend to do it all yourself.

    • You pick the things that you care about, and you only do things that you can do. And if someone offers to help, that’s a risk, and you always have a plan for what happens when they flake out, always. You just don’t trust that they will deliver. You trust that they will fail, and you’ll be happily surprised if they deliver.
  • Down from there is what they call structure and clarity. What I call: we know where we’re going. We have a plan. We know who’s doing what for the roles, and we know how we’re measuring for the goals. So if you don’t have that, but you do have dependability, you tend to just prioritize what you’re doing yourself.

  • Down from there is meaning of work, which is: the work is personally meaningful to me. So you’re working on the right stuff. That is incredibly important to have, and if you don’t have that personal meaning, you just work on the subprojects that are important to you as a person. And that leads to the pathology of just not putting your full effort into things simply because you’re not working on the stuff you really like.

  • The fifth one on that list is impact of work: the work our team does has a positive impact.

Identifying a Person’s Baggage

  • The thing to remember about the interview process is it is definitely a high-pressure sales thing.

  • Charity mentioned in her talk that your career is your highest asset in your entire life. And it’s your job, as someone who is interviewing, to convince another company to invest.

  • And there are people who get very good at the social game. So when you get into that all-around interview, they will show you enthusiasm for your product. They will show you a healthy, positive response to criticism. They’ll show you positive communication models between various different people.

  • It’s like maybe a month, two months, three months after they’re onboard, when they’re showing all their new coworkers the happy face, the political face, again, the coping strategies.

  • After that honeymoon period has faded a little bit is when you might start seeing some of the hidden toxicity.

  • So when you get someone that socially skilled into like the all-around interview, you see this amaze-balls person, but you don’t know, you can’t know in that environment that you’re hiring somebody who’s going to be a net negative, which is why you need to pay attention to this over the next year of someone’s onboarding.

Who is On The Team Matters Less

  • How the teams interoperate with each other matters more than what they’re working on. Because it goes back to Conway’s Law and how technology talks to each other.

  • The team itself is its own unit, but how the teams operate with others matters just as much.

  • So when you have that, in order to get the integration working, you need to get the teams talking first. The teams need to talk before we can get the technology talking to each other. The human layer happens before the technology layer.

  • For the most part, we’re dealing with organizational issues. Who should work on this? Which is the right stuff to work on? How do we scope things to make it feasible over time?

  • The people and process matter more in a technical organization than the technical, in a lot of ways.

3 Tech Lead Wisdom

  1. Teams are not just about tools, but also about the people in them and how they collaborate.

    • You’re not just these automatons who produce code. There are people there. They react emotionally to things, and you ignore that at your peril, absolutely.
  2. Everyone brings their previous jobs with them, even the bad parts.

    • During the interviews, you’re looking for people who are minimally damaged or at least can hide their damage in ways that at least worked well for your team.

    • Definitely be aware of that. Be aware of that in yourself. So look back and see what bad workplaces trained you into.

  3. Old tools can still bring great value.

    • Leadership is knowing when it’s worthwhile to scrap the old and build the new, but also when to say no to the update. Because newer is not always better, and this is where that business decision comes in. The old tools can deliver a lot of value for a long time.

    • The future is not evenly distributed in the [less technically savvy] environment at all. And you’re going to run into this again and again, and being open-minded about that is incredibly important as a leader.

Transcript

[00:01:09] Episode Introduction

[00:01:09] Henry Suryawirawan: Hello to all of you, my listeners. Welcome to another new episode of the Tech Lead Journal podcast. Really, really thank you for spending your time with me today listening to this episode. If you haven’t, please follow Tech Lead Journal on your podcast app and social media on LinkedIn, Twitter, and Instagram. You can also contribute and support this podcast by subscribing as a patron at techleadjournal.dev/patron, and help me to continue producing great content every week.

For today’s episode, I’m happy to share my conversation with Jamie Riedesel. Jamie is a Staff Engineer at Dropbox working on the HelloSign product. With the maturity of DevOps and SRE practices in the industry, together with the exploding number of applications being developed, cloud technologies, and the growing number of new startups and software products, there’s also a growing trend of open source tools and technologies, and multiple SaaS providers that offer different solutions to tackle the observability problem. And underlying all those observability solutions is actually software telemetry.

In this episode, Jamie shared an overview of software telemetry that she covers in her book and explained why it is important for us to understand how our production systems are behaving by using those telemetry data. She also explained the different software telemetry types, concepts such as observability and high cardinality, and shared some software telemetry best practices that she covers in great details in her book.

In the second part of our conversation, we switched our topics of discussion to talk about the impact of a toxic workplace. Jamie opened up and shared her own personal experience dealing with toxic work environments, and how it actually affected her career negatively and the company where she worked at. She emphasized the importance of self-awareness, identifying our own toxic patterns and past trauma, and importantly, having psychological safety in the workplace, as well as went through the five key dynamics to a successful team based on Google’s re:Work blog post.

I enjoyed my conversation with Jamie, learning about software telemetry, observability, and dealing with toxic work environments, and I hope that you will find something useful from this episode as well. Consider helping the show by leaving it a rating or review on your podcast app, and you can also leave some comments on our social media channels. Though it may seem trivial, those reviews and comments are one of the best ways to help me get this podcast to reach more listeners. And hopefully they can also benefit from all the contents in this podcast. Let’s get this episode started right after our sponsor message.

[00:04:37] Introduction

[00:04:37] Henry Suryawirawan: Hey, everyone. Welcome back to another new episode of the Tech Lead Journal podcast. Today, I have with me a guest named Jamie Riedesel. She’s a Staff Engineer at Dropbox, working on the HelloSign product. She has a lot of experience working in the tech industry, from system administration to DevOps engineering, and she has also given a lot of talks at conferences. She recently published a book titled “Software Telemetry”. Congratulations on your book, by the way. So, I’ll be talking a lot with Jamie today about software telemetry and also learning about some of the experience that she’s willing to share later on in the conversation. So Jamie, welcome to the show.

[00:05:14] Jamie Riedesel: Thank you. Glad to be here.

[00:05:15] Career Journey

[00:05:15] Henry Suryawirawan: So Jamie, at the beginning, maybe you can help to introduce yourself for people to know more about you. Maybe you can mention your highlights or turning points in your career.

[00:05:23] Jamie Riedesel: Yeah, that works well for me. The reason my name is getting out right now is, of course, the book for Manning, “Software Telemetry”. But this has been a book 20-odd years in coming. Simply because telemetry, what people use to figure out how their systems are working, is something that we’ve been doing since, frankly, the mainframe era. It’s changed so much in the last 10 years that I wanted to capture some of the operability challenges I’ve seen. Cause you’re seeing a lot of details about how to use metrics, or how to use logging, or how to use tracing to do what you need to do, but not so much on how to keep these systems running, which is what started it. The reason that occurred to me is simply because I’ve been on the operations side of the technology industry since the very beginning. I started in help desk, doing desktop support way back for a city. I moved sideways from there into systems administration. Got my Novell stuff in, started doing some Windows, stayed in sysadmin type work until 2010 or ’11, when I made a sideways leap, the biggest leap of my career, into a 20-person startup, which was Ruby on Rails, cause that’s what we were doing in 2010.

It was a fascinating experience building a SaaS product from the ground up. I’d never done that before. Taught me so much. I learned some of the early telemetry stuff there, because of what they had to do. Now fast forward to where I am now. I started at HelloSign in 2015, which was about 15 engineers, I think, when we started. The company count was bigger. But now we’ve been bought by Dropbox, and the engineering count is somewhere in the seventies for our stuff. Oh my goodness. It’s a much bigger process. Watching this thing grow over time has just given me so much more insight into the scalability problems of these telemetry systems. Frankly, watching the industry evolve. Because when I started in 2015, the phrase distributed tracing was only seen in certain circles. But now these days, everybody considers it the gold standard. So yeah, it’s been absolutely fascinating to watch this grow.

[00:07:22] Software Telemetry

[00:07:22] Henry Suryawirawan: Thanks for sharing your journey. It’s been a really long journey, multiple experiences from doing the help desk, starting into sysadmin, and working in a startup. I’m sure that along the journey, you have seen a lot of things about software telemetry. But first of all, what is your definition of software telemetry, actually?

[00:07:39] Jamie Riedesel: Software telemetry, frankly, is what you use to figure out what your production systems are doing. I mean, that’s a very nebulous term. I use that as the grandparent term of all this stuff, because there are multiple different techniques people use to peer into that. I mean, the oldest of them all is centralized logging. Spit out an English language statement, or whatever language you’re speaking, into a log file and look at it. We’ve been doing that since the early eighties. The fun thing about this job is actually getting some history in here. Syslog was part of 3BSD, came around 1980-ish as a plus one to the old sendmail program, and that provided a system logger for systems running on these old BSD systems. And thus, we had our first recognizable modern telemetry system born. Fast forward 15 years, we have people running into the middle of the dot com boom, and you have these large infrastructures with hundreds of servers running in racks somewhere, and they’re sending these syslogs to a file or maybe an NFS share somewhere that you have to grep through to figure out what’s going on. But it was definitely telemetry as we know it. So yeah, that was the first one.

I believe the next one to really get its own recognition is what we call metrics now, but it kind of grew out of what ops-side people called monitoring, you know. How fast is the computer running? How much resources is it using? The software people said, you know, can we send some numbers in there? Cause I’d kind of like to know how many transactions per second we’re doing. We can do that. Operations folk and development worked together and came up with these systems to provide metrics for the software, as well as using a lot of the same systems that the ops side was. And it was, oh, geez, somewhere around 2010, give or take a few years, when we really started breaking out. The first major time series databases, which are specifically designed for metrics, started coming out. OpenTSDB was the first big one. I believe that came out of StumbleUpon. And Etsy’s StatsD was the biggest software out there for doing this sort of thing, just because they were already at scale, had solved this problem, and open-sourced it. Just about all of the open source telemetry systems you see now came out of large companies, solving their own problem and realizing they could share. So that was metrics, came out 2010-ish, give or take a few years.

Tracing took a little bit to catch on, really come into its own. But yeah, it was around 2015-16. Charity Majors actually talked about a little bit of this back in April. Because the tracing systems that we have are originally called observability systems. Now, we’ve gone a little wibbly wobbly of what observability is. It’s more of a practice than a technique. But back then, observability was, oh yeah, that’s a tracing system, which gives you an idea as to how the terms have changed in such a short period of time. Because in 2021 here, observability isn’t a system, it’s a practice. Whatever it is you need to do to figure out what your systems are actually doing and creating observable systems. So, yeah, tracing came on and now we’re trying to see how this will evolve into the future.

The thing that gets me, just because, as I talked about at the start, I was from a very traditional IT type background. So I’ve seen a lot of the history, a lot of what people call legacy systems, and they’re still out there. It’s like, yeah, if you’re building a brand new product, a new something, a new greenfield thing for your startup, yeah, definitely use a full-fledged observability system for that. A SaaS provider, Honeycomb, is great for that. But the thing is that most of us listening to this podcast right now are working on code that’s already there, or maintaining something. We’re extending something that’s already there. And so we have this deep legacy of stuff that came before. These centralized logging systems, some of which may extend all the way back to 1980. Some of these metrics systems that have evolved several times over the last 10 years, and are now coming into these new high cardinality metrics databases that are coming onto the market right now, but maybe started off as a Zenoss install way back at the beginning. And the tracing, yeah, maybe we started on Jaeger at the beginning, but now we’ve gone to a SaaS provider just because of scale, or maybe our cloud providers started offering it now. Azure and Amazon both have an offering. Let them do it. You know, that’s a cloud philosophy. If your provider does it, make them do it so you don’t have to do it. It’s all there. So yeah, software telemetry covers the vast range of everything that we do, from the mainframers, who are still there - they may have first written their code in 1968, but they’re still developing and they still need telemetry - to the people bringing a brand new product to market right now, today. All of them have this telemetry. And the thing is that there’s a lot of different ways to solve it.

[00:12:13] Knowing Your Production System

[00:12:13] Henry Suryawirawan: So you mentioned the need for figuring out what your production system is doing. I believe for most people, maybe it’s obvious, right? Knowing what is happening in your production system. But in my career, some developers actually don’t really care about making their applications observable or having metrics available. So why do you think people should care about knowing what your system is doing in production?

[00:12:35] Jamie Riedesel: That’s a very good question because it goes back to one of the major revolutions that hit the marketplace about 2008 or 2009. I’m talking about the DevOps revolution. Before it started kicking off around 2010, operations and dev were historically very separated. What usually happens is that software engineering and operations are sort of passive aggressive at each other. “Yeah, I want this deployed.” “Yeah, tell me how to deploy it.” “Okay, here’s how you do it.” “Now tell me how to test it.” And this sort of back and forth, somewhat adversarial relationship. Ops is convinced that software just doesn’t know how things actually work. Meanwhile, developers are wondering why ops is being such a stick in the mud about getting anything done. It was just a terrible time. The thing is that once you get operations and development talking, actually talking to each other, communicating their needs, working together way early on in the process - okay, we need to get this out, how do we actually deploy this thing? Maybe we can build some deployment automation. Okay, how do we make sure to plan all these steps so that it builds in your CI systems? All this back-and-forth feedback, that all happened in modern, up-to-date companies in the 2008 to 2015 era.

Now the DevOps revolution is far from evenly spread. I mean, there are still corners of the industry that haven’t caught up. But I’m talking about why you care about these sorts of things. Once you’ve got that relationship building, suddenly a miraculous thing starts happening. Software engineering starts caring about how it’s performing in production. And it’s not just because they’re on call, but because now you actually can get the details of: okay, if users are doing this, our systems do that. And it was the software as a service revolution that really brought this out. It shortened the feedback loop, because back in the ancient dot-com days, you were shipping shrinkwrap software that people had to download and install on their local system. The loop from release to feedback from the customers could take literal months. That was so long. Now, these days with your front end frameworks, you can run A/B experiments in real time on your JavaScript, and your BI team absolutely is paying attention to what the users are doing. If you, as a software engineer, aren’t paying attention to this, you’re falling behind the curve, especially if you’re doing software as a service.

So yeah, these sorts of feedback systems are all about delivering high quality software, high quality solutions. Releasing quickly with fewer bugs. Getting that end-to-end feedback loop established allowed so many synergies along the way. Because anytime you have a little break in the process, you have to wait for someone to get around to it, and meanwhile people’s attention wanders off to other things, they lose that context, and then they come back to it. All this multitasking doesn’t do anybody much good. It’s like, yeah, we kicked the feature out, but we haven’t looked at any of the metrics in a month. Then you finally look and, “Surprise!” Support’s been screaming at you for three weeks, and no one’s actually paid attention. You don’t want that to happen. You don’t want to hear about problems directly from your users. You want to see it beforehand and go, oh, that’s a problem, we should probably deal with it. So it’s all about shortening that feedback loop between the user experience and the engineers who are writing the user experience. These telemetry systems give you so many ways of looking at it. So many different ways.

If you take the SRE concepts of Service Level Agreements, Service Level Objectives, and Service Level Indicators to show your minimum promised service levels, that sort of gross, very top level thing tells you if you’re meeting your promises. But at the same time, especially when support comes in and says, “Hi. One of our top 10 customers by contract value is extremely cranky right now, because when they hit this API, it does this weird thing” - that feedback from customer support allows you to look at your telemetry, look specifically at their stuff, and see: okay, when they do this, it goes kaboom. Oh wow, that’s a scaling problem. They’re a top 10 customer, and they’re probably our top three customer by what it is that we do. So they’re running into some sort of weird scaling problem on a table or something. The telemetry systems let you look at those unique atomic cases and go, “Oh, that’s a problem”, and start building theories to generalize the solution. Look to see where else it may be happening. And oh wow, two of our other customers are on the verge of hitting it. This is a hot fix. And off you go to try to fix it before the problem compounds and harms your bottom line.

[00:16:35] Henry Suryawirawan: Thanks for reminding us again about the importance of these things. If you don’t know what is happening in your production system, maybe you won’t be able to react quickly to your customers’ problems.

[00:16:45] Types of Software Telemetry

[00:16:45] Henry Suryawirawan: So you’ve already mentioned a couple of examples, but maybe there are other variants as well? What are the different types of telemetry that we have?

[00:16:54] Jamie Riedesel: There are three big ones. The first is centralized logging, which is all those log statements that send words to a text file. It’s like “account 143 created in zone 52”. You know, it’s fully English language. Just a print statement sent to a file. The kind of thing we all use during early stage debugging. Because when we first start a program, when we don’t trust it yet, we haven’t instrumented the full thing for centralized logging. We’re just trying to figure out this one function, so we just slap in some print statements. Logging is very much print statements as a service. That’s what it does. In these big centralized logging systems that are full of this sort of natural language, you just learn over time which are the problem statements and which are the non-problem statements. So centralized logging is, like I said, the first one. Typically these days, centralized logging, thanks to syslog, frankly, has a concept of log levels. That is your debug, your info, your warn, your error. Syslog had a critical level, and some loggers have a verbose or trace level that extends beyond that. But nearly every logger that I’ve actually looked at has a concept of saying, yeah, this is an info log, this is a warning log. So theoretically, in your centralized logging system, you could say, just show me the errors, and get a much reduced set of the whole thing.
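As a sketch of the log-level filtering Jamie describes (“just show me the errors”), here is what it looks like with Python’s standard logging module; the logger name and messages are invented for illustration:

```python
import logging

# "Print statements as a service": a logger with a level threshold.
logging.basicConfig(format="%(levelname)s %(message)s")
logger = logging.getLogger("billing")   # hypothetical subsystem name
logger.setLevel(logging.ERROR)          # "just show me the errors"

logger.debug("account 143 created in zone 52")   # suppressed
logger.info("payment retried for account 143")   # suppressed
logger.error("payment failed for account 143")   # emitted
```

Flipping the threshold to `logging.DEBUG` turns the full firehose back on without touching any of the call sites, which is exactly why levels are baked into nearly every logger.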

Now, the second one, by age, frankly, is the metrics systems, which I mentioned before kind of grew out of the old operations monitoring systems - or if not grew out of, were heavily inspired by them. These are things like anything you get through Grafana, frankly, because if you’re in open source, you’ve used this tool. And SolarWinds is another popular monitoring type product that you can put application metrics into if you want. So you’re hitting a time series database like InfluxDB, Prometheus for the Kubernetes crowd, OpenTSDB. There are a few others - Amazon now has a Timestream service, and I know that Azure has one too. There’s like a new time series database every quarter; I just haven’t kept up. But yeah, metrics are just what it says on the tin. It’s a number with a very few text tags attached to it, so you can look at numbers in great big aggregates. You can get things like transactions completed in a given period of time, or times a user logged in, or various other metrics you may be interested in for tracking and trending and such.
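A toy illustration of “a number with a few text tags”: the aggregator below mimics, in a few lines, what a time-series backend does on ingest. The metric and tag names are invented for the example:

```python
from collections import defaultdict

# Each series is identified by a metric name plus a small set of tags.
counters = defaultdict(float)

def emit(name, value, **tags):
    # Tags should stay low-cardinality: app or region, not account IDs.
    series = (name, tuple(sorted(tags.items())))
    counters[series] += value

emit("transactions.completed", 1, app="billing", region="us-east")
emit("transactions.completed", 1, app="billing", region="us-east")
emit("transactions.completed", 1, app="billing", region="eu-west")
```

Because aggregation happens per (name, tags) series, the storage cost grows with the number of distinct tag combinations, not the number of events - the property that makes metrics cheap, and also what makes high-cardinality tags dangerous, as discussed below.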

The third major one that came on was, of course, distributed tracing, which has a bit of an interesting history, because it descended in part out of metrics. Charity Majors talked about this in April, and she’s totally right about this. Metrics are limited cardinality. The reason they’re limited cardinality is because when they came out, around 2010, databases were expensive. Solid state drives weren’t everywhere yet. So you had to be very careful about how fast and how often you wrote to disk. These older databases had to be very write-preserving in order to preserve speed and serve stated time ranges, so you needed to limit how much was written in - all the optimizations were there. Fast forward a few years, solid state drives are now a thing. And people are realizing that, you know what, metrics are a lot better if they can handle way more cardinality - or actually, more dimensionality, the number of fields in there. And distributed tracing was a way to say: okay, what if we emitted this rich event that has maybe more than one metric attached to it? And a bunch of little context details, like account number or session ID or process ID or subscription ID or region or cluster. All these little details help you isolate what generated this event, so you can just drill in.

These tracing systems, as Charity mentioned, have the database stuff right there. I mean, nearly every modern application has some sort of database access somewhere, and everybody cares about how fast that database is. So all the distributed tracing libraries right now have easy, built-in ways to trace how long a query takes. Plunk that in as another attribute, build it into the chain. Distributed tracing brings the real power by building these essentially gigantic linked lists of events, so you can bring up the entire process. A user hits login; that goes through 15 different workers and sub-processes and microservices. But in a distributed tracing system, you can see the entire flow on a single screen. Building that takes a lot of automation, and we couldn’t have done it in 2010 because the technology and speeds weren’t there yet. But as technology improves, so does this.
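The “gigantic linked list of events” can be sketched in plain Python: each rich event (a span) carries measurements plus context attributes, and links to its parent by ID. This is a toy model under assumed field names, not any real tracing library’s API:

```python
import uuid

def make_span(trace_id, name, parent_id=None, **attrs):
    """One rich event: measurements plus context, linked to its caller."""
    return {
        "trace_id": trace_id,      # ties the whole request together
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,    # the link back to the caller's span
        "name": name,
        "attributes": attrs,       # account ID, region, duration...
    }

trace_id = uuid.uuid4().hex
root = make_span(trace_id, "POST /login", account_id="143", region="us-east")
child = make_span(trace_id, "db.query", parent_id=root["span_id"],
                  duration_ms=12.4)
```

A tracing UI reassembles the whole request by grouping on `trace_id` and walking the `parent_id` links - that reassembly is what puts the entire flow on one screen.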

But there’s a fourth one I talk about in the book, and it’s a bit of a stealth one. I call it the SIEM, the Security Information and Event Management system. It’s what’s used by security teams. It’s a combination of all three of the other styles. But instead of observing code, SIEM systems are more about tracking, tracing and observing the people who interact with the production systems. A lot of the same techniques you use for code can also be used for these systems. Now, compliance regimes and regulatory regimes, especially banking, have very clear descriptions of what you have to track: privilege usage, privilege escalation, account logins, account logouts, account lockouts because someone tried to log in too often - like, how often does that happen? All that stuff needs to go into the SIEM. That’s logging for the most part. There are also metrics. And building these big correlations between logins, logouts, or lockouts - what processes did users run? Where was their use of sudo along the way? That’s a form of distributed tracing. It uses a lot of the same techniques that you use on the pure software side. The security teams need similar techniques, just tuned a different way to do what it is that they do. So yeah, there are the three major software ones and the stealthy security one, SIEM. That’s what I call the four major forms of telemetry.

[00:22:34] High Cardinality

[00:22:34] Henry Suryawirawan: There’s one term that you mentioned a couple of times, like high cardinality. So for people maybe who are beginners in this DevOps and logging, metrics, and all that, can you maybe explain, is it just about adding more tags, or maybe adding more attributes to the thing that you said?

[00:22:49] Jamie Riedesel: It is, yes. Cardinality is the single biggest constraint of the databases behind telemetry systems, so you really need to pay attention to this stuff. At its base description, cardinality is the number of unique values in a given field, multiplied by the number of unique values in every other field, to give you the complete database cardinality. There’s a special field called the timestamp that hopefully is unique for every entry, so it is going to be highly cardinal. But timestamp is such a special field that every database I’ve run into has special ways to make sure that one index is incredibly fast. Because when you’re looking at telemetry, every single query is going to be time-bound in some way. So the time is a special index. Now, we’re talking about the rest of them.
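To make the arithmetic concrete, here is the worst-case cardinality calculation with invented per-field counts - note how a single high-cardinality field dominates the product:

```python
# Unique value counts per indexed field (toy numbers for illustration).
unique_values = {
    "region": 4,
    "app": 25,
    "status": 10,
    "account_id": 100_000,   # the one high-cardinality field
}

total = 1
for field, count in unique_values.items():
    total *= count

# Without account_id: 4 * 25 * 10 = 1,000 possible series.
# With it: 100,000,000 -- the kind of blowup that kills older databases.
```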

I mentioned that centralized logging can also have quite extensive cardinality, but that kind of comes down to how people do markup as they’re building these log items. So you have syslog, that ancient protocol that was standardized around 2001 - ancient by internet time - which has something like eight different fields in it: the process ID, the message ID, the log level it came in as, the facility number, which is kind of an obscure thing, the actual text of the event itself. These things are built into the standard, and you can totally build a system on that. But in my experience, most companies who go deep into centralized logging customize their fields to their specific use cases. Maybe we actually do want account ID in everything, or a bunch of other context on that log item, so we can get an idea as to where this event happened. So that provides different kinds of cardinality. Take account ID: hopefully, you have a lot of accounts. So if you have something like Twitter with billions of accounts, most of which are bots, that index is going to be huge and highly cardinal.

Now there’s a different one, and that’s called dimensionality. This is the number of unique fields in the database. This is where we get some interesting times, because different databases handle different problems very well. Take Elasticsearch, that venerable database behind all full-text search if you’re not using Mongo: it’s very good at highly cardinal fields. If you give it 15 fields that are all practically unique in each document, it will happily index all of those and be very fast for search. But the problem with Elasticsearch is that if you put 15,000 fields in the database, but only 2 to 50 are used in any given document or event, those other 14,950 fields are still there. They’re just carrying a null value. So what could actually be a short event with a little data on it - because all those nulls are in there, the document size increases so much. It increases the database size, and it slows down search overall, because it has to crawl through many more blocks to get the data it’s working on. So that’s one way cardinality can impact your databases.

The other thing to consider is how your database handles index cardinality, and this is where the time series databases - the old school ones, anyway - tend to fall on their faces. Because the main thing they do is numbers. Numbers are easy. Numbers are short. You tend not to index those, but you do index the tags, like the application or the data center it’s in. And if those have just a few different values, the lookups split fast - zoom, zoom, go, go, go - and you can have thousands of different application metrics in there for the extra bits. But if you start putting in something like a tag with the account number on it, ooh, you could have tens of thousands or hundreds of thousands of values for that one tag alone, which doesn’t leave you much room for the actual software metric you’re looking for, because the database will start slowing down or just taking too much memory to keep going. Cardinality is a major thing to worry about, and frankly, the distributed tracing and observability systems we’re getting simply were not possible until we had more modern databases with better cardinality tolerance. It couldn’t have been done 10 years ago. We’re just getting to the point where storage is fast enough, the techniques are fast enough, and we have enough memory to keep a lot of this stuff in RAM, that we can just brute force some of it.

[00:26:53] Henry Suryawirawan: Thanks for explaining from the history point of view how this all happened. Because currently, I think we take it for granted - so many SaaS providers, or open source systems, are already available. We just shove in our data. But yeah, historically it was probably not possible back then.

[00:27:08] Observability & Buzzwords

[00:27:08] Henry Suryawirawan: Another thing that is a source of confusion for many people: there are so many terms being used in the industry. Maybe it’s buzzwords, or it’s a trend. Things like monitoring, which was the traditional term used before. And there’s observability, there’s APM, there’s SIEM, most of which you’ve mentioned. What’s your opinion about all these? Are they just talking about the same thing, or do they represent different things?

[00:27:31] Jamie Riedesel: Ah, this is a fun question, because there’s history here. For those who haven’t caught up, observability is the big word right now. The phrase that goes along with it is the Pillars of Observability: logs, metrics and traces. Using those three, you can create an observable system. You talked about APM and a few others. Now, APM is an interesting one, because APM stands for Application Performance Monitoring. That showed up around 2010, ’11, ’12, kind of as metrics came in, as a way to brand the “Oh, we’re monitoring your software. We actually track the performance of your software.” Companies like New Relic are huge in this one. They invested deeply in APM as the technique and the brand, and they steered the conversation for software telemetry of that era very much into that brand category of APM.

Writing this book has made me further realize just how vast the technology industry is. Because we have these companies that were deep into New Relic. They’ve got their code fully instrumented, they’ve got all these metrics, some proto-traces available through New Relic. And now the rest of the industry is saying, oh cool, observability systems, observability systems, and starts buzzwording on this new thing, which leaves this older guard - also known as well-established companies you recognize, like New Relic and Datadog - having to find a way to rebrand their terms, which are kind of getting the smell of old and stodgy. It’s a marketing thing in a lot of ways. They’re totally useful tools. They’re great at what they do. But they have brand history in these older systems. When you have legacy systems, the decisions made by an engineer 10 years ago can impact you today. The same holds true for product marketing and other sorts of things. You need to have some sort of continuity of identity while still trying to evolve your brand image to adopt these new things. So yeah, there’s definitely a marketplace of ideas going on right now between the established SaaS players and the newcomers - Honeycomb is one of them - arguing over what the correct term is. “No, no, no. We’re observability.” “No, no. We’re performance monitoring.” It’s all this marketplace of ideas. Yes, it’s hard to keep up with all the terms. Absolutely. I have no doubt that in five years’ time, we’ll be talking about some new thing that’s maybe only just a murmur right now. So yeah, there’s a lot going on right now.

[00:29:53] Henry Suryawirawan: Yeah. Sometimes I, myself, am very confused by all these terms. Like, okay, it seems like this vendor is talking about this thing, but it sounds similar to another vendor talking about a different thing. So, yeah. Thanks for clarifying that.

[00:30:04] In-House vs SaaS

[00:30:04] Henry Suryawirawan: So, Jamie, a lot of people actually try to use software as a service for this kind of telemetry. Do you have an opinion on whether we should have it in-house, or do we actually need to go the SaaS way?

[00:30:16] Jamie Riedesel: Frankly, I believe any company just starting out - you’ve got those three people with a big idea, probably with day jobs, trying to build this thing they want to start - should absolutely use SaaS for everything. Good heavens, do that. I mean, at that stage of a growth company, you don’t have time to mess about installing your own stuff. Pay someone else; that’s what it’s there for. Honestly, it’s when you start getting growth, when you get 50-100 engineers around and keep building on, when you grow to the point where you actually need full-time ops people, that the temptation comes to in-source some of that stuff - maybe run open source in a cloud or something - so you don’t have to pay the SaaS provider. Because every SaaS provider I know of charges by ingestion rate, which is a great thing for them, because as you grow, so does your bill. So the cost sensitivity is a major thing here.

There comes a time in any company’s growth, especially startups that have to grow, when you have to make that decision. Okay, we can continue paying this well-established partner, who has served us very well, lots and lots of money, or we can take the time to build our own internal service, which is maybe mostly as good but kind of lacking in a few areas, and will be way cheaper. That sort of decision is more of a business process decision than a techie one. But I will say that most companies will be better off trying to use a SaaS provider, because these are companies who are doing what they can to make the user experience of telemetry the best possible, because that’s what sells. Charity is a great person. She talked on this podcast back in April, and that episode is a great one to listen to for how she describes this process. They had to provide a user interface that was better than anything else, to provide differentiation, and they’re very good at it - and that’s hard to build in-house. The thing that I pay attention to is what open source projects are out there that are actually getting traction. There’s one major open source project for distributed tracing, and that’s Jaeger. There’s not much else. Because it’s a very hard space, and it’s the kind of hard space where you go, “You know what? I can either open source this and give it away, or I can try to make money off of it. I think I’ll make money off of it.” That’s what a lot of people do, because it’s just that much of a crunchy problem. So, yeah. Short version: use SaaS for most things, but there comes a time when the decision to in-source starts making sense for business reasons.

[00:32:35] Some Telemetry Best Practices

[00:32:35] Henry Suryawirawan: So before we move on to the next section of our conversation, maybe for people who are implementing this or who are trying to operate their telemetry systems, could you give some of your best practices on how they should operate?

[00:32:48] Jamie Riedesel: Absolutely. If you use logging - and nearly every system I’ve run into has some form of logging; the newest ones may just go straight to some sort of distributed tracing - but if you are using logging in any way, absolutely use a structured logger. Don’t just use print statements to a file. Every language I’ve run into, with the possible exception of COBOL, has some way to do structured logging. What a structured logger does is provide the interface into the logging for your code. So let’s say, logger.info("text"), just to use a basic example. Logger is an instantiated version of your centralized logger, .info is the priority of it, and there are words in there. Now, what happens between when you do that in your code and when it actually goes into the database is interesting. Because all the structured loggers have some way to transform that text string between when you call it in your production code and when it actually goes onto storage or into a queue or something. And that allows you to do very useful things, like being able to search for email addresses, which are now considered private information under GDPR.

That way, you can live-redact these things, so you don’t have to worry about cleaning them out of your database. That greatly enables your software to survive processing private information without accidentally storing it in your telemetry systems. So centralized logging is incredibly powerful, and not just because of that, but also because most structured loggers are pluggable to a certain extent. If you’re writing to a file, cool. If you’re writing to a syslog endpoint, cool. But if you’re writing to, say, a Kafka endpoint, or maybe a Redis queue, or maybe a Kinesis endpoint in Amazon, or whatever it is, there may be a plugin for that, to take all the hassle out of shipping stuff out of your software into the main system. So yes, structured loggers are incredibly important if you’re using logging.
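A minimal sketch of the live-redaction idea, using a filter in Python’s standard logging module. Real structured loggers offer richer processor pipelines for this; the regex, logger name, and message here are simplified assumptions:

```python
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class RedactEmails(logging.Filter):
    """Runs between the log call and storage, so PII never lands on disk."""
    def filter(self, record):
        record.msg = EMAIL.sub("[REDACTED]", str(record.msg))
        return True

logger = logging.getLogger("signup")
logger.addFilter(RedactEmails())
logger.warning("signup failed for jane.doe@example.com")  # stored redacted
```

Because the transformation happens in the logging pipeline rather than at every call site, one filter protects every log statement in the application.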

The second point I have is: have a tracing system. These days, that is the missing leg. If your telemetry systems have a lack, it’s probably the lack of tracing. You do need one. It gives you so many insights, and if you can’t put it into your production systems, put it into your CI systems - your Jenkins, or your Travis CI. Because the sneaky thing about putting tracing in your CI systems is that if you have two different test runs of the same code - say, one pull request apart - you can actually track over time how various functions change on your CI systems. So you’re much better able to identify performance regressions. It’s incredibly powerful. Your production system may be some planet-scale thing that runs in 15 data centers, but your CI system may only run on, like, six machines during testing, which is a much cheaper way to do it. You can keep much more data that way, too. So you can keep CI tracing data going back a year, watch the charts of how various functions improved or regressed over time, and identify those things. That is incredibly important.
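The CI-tracing payoff described above - comparing a function’s traced duration across two runs of the same suite - reduces to a simple comparison. The function names, timings, and threshold below are all invented for the sketch:

```python
# Traced per-function durations (ms) from two CI runs, one PR apart.
baseline_ms = {"render_invoice": 40.0, "auth_check": 5.0}
current_ms  = {"render_invoice": 95.0, "auth_check": 5.2}

THRESHOLD = 1.5  # flag anything 50% slower than the baseline run

regressions = [name for name, ms in current_ms.items()
               if ms > baseline_ms[name] * THRESHOLD]
# regressions -> ["render_invoice"]
```

Run against a year of retained CI traces, the same comparison becomes a chart of each function’s trend, so regressions surface before they ever reach production.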

The third one I’d give is: be aware of privacy information and health information in your telemetry, even in your tracing system. This is very important. When the European GDPR hit, it changed the threat landscape for handling private information in a lot of ways. Other places have followed in GDPR’s footsteps. You know, California in the United States has a privacy law, and other privacy laws have followed in the image of GDPR. The biggest thing they’ve taken is a much more restrictive view of what counts as private data, what you can do with it, and who can view it. So if you find that your telemetry systems are storing this data, the problem is that you then have to treat your telemetry systems like the same iron box - with two pass-throughs, two keys to get in to look at it - that you use for your production data. And most companies find that to be too much bother. Because the cardinal rule of a telemetry system is: you need to make it easy to use. And if you have to get on the VPN, go past the web VPN portal with 2FA, log in to the telemetry system, and only then can you actually look at your logging - no one’s going to do that. Maybe a few senior engineers who are tracing bugs. But someone who just released a tool, someone who’s maybe a junior engineer, probably doesn’t even know how to log into it. I mean, that’s not a good telemetry system. So keeping this private data out of your telemetry systems allows you to make your telemetry far more broadly available, which improves its use.

But the other side effect of paying far more attention to privacy and health information is that you have to spend less time dealing with spills. Because when something does leak in, and it goes to half the company just because that’s who can look at it, the privacy regulators and the handlers in your company have to make a decision. Is this a breach? Or is this merely an “oops”? How bad is the problem, and what legal threat do we face from this spill? That has to happen every time you have a mistakenly logged email address or a social security number or national ID number or whatever it is. But if you’re just not logging that in the first place, they have to make that decision far less often. And spills can be incredibly costly, because if you find something that’s been there for a year and a half, you may have to go back and clean up a year and a half of your historic telemetry. That’s a bad time for everyone involved.

The fourth and final one, and I mentioned this before: definitely manage cardinality. That is the thing that will kill your database if you’re managing it yourself. Now, if you’re with a software as a service provider, they will make you care about cardinality through how their pricing structure works. Sumo Logic is the one I know, and they have a limit: you can have only so many fields that you can declare when you ingest data. You can pay for more if you really need it, but they make sure that you don’t really want to. So that’s one example of a SaaS provider putting downward pressure on cardinality. So yeah, unless you’re using something like Honeycomb, which clearly has some way to manage cardinality in magical ways, you’re going to have to manage it yourself somehow. It is probably the top thing to pay attention to in the databases supporting your telemetry systems.

[00:38:33] Henry Suryawirawan: Thanks for sharing all these tips. In your book, actually, you have plenty more. You even built a recommendation checklist reference. So, for people who are interested in more details on tips and best practices, make sure you check out the book.

[00:38:45] Toxic Workplace

[00:38:45] Henry Suryawirawan: Let’s go on to the second part of the conversation, which is a topic that you are passionate about, right? You have given talks about it multiple times as well: overcoming organizational trauma, or the toxic workplace. To start, maybe you can share your personal experience?

[00:39:00] Jamie Riedesel: Yes. To start this off, I’d like to frame it with the definition of just what a toxic workplace is. The definition I work from - a definition from the point of view of the person experiencing it - is that a toxic workplace is one that causes that person to adopt coping strategies to maintain psychological safety. And when they apply those coping strategies to their home life, their volunteer work, or their next job, it creates toxicity for other people. So this sort of toxicity is communicable. It builds up over time.

Now, to give you an example of how that worked for me: my most toxic job ever was at a large public university. Public is important here, because it meant that our funding came in some part from the state, with an approval process. But also, because it was a university and I was one of the central systems administrators, there was no person at the very top of the org chart to say, “Do this because I said so”, which meant that it was very consensus driven in a lot of ways. The specific group I was with was structurally disempowered in a key way. We were the central systems administrators, but we frequently would get these products that we were asked to deploy. Some department somewhere identified a business need, qualified a bunch of vendors, got budget approval to buy it, worked with certain other high-level technical managers to make sure that we had room for it. And then, after the decisions were all made and the money was paid for the licenses, they would come to us and say, “Okay, here’s a fistful of licenses. Here’s some money for new hardware if you need it. We need it in three months. Thanks.” And then walk off. Sometimes that would be okay. But a lot of times, the choices made earlier on - because we weren’t consulted during that early qualification period - didn’t fit our infrastructure well at all. And because there wasn’t any official way to push back on that, we had to use the unofficial ways. Unofficial ways include things like sarcasm, belittling, yelling, maybe a little bit of profanity. Just trying to make people very scared of coming to us, asking themselves that critical question: “Am I going to get yelled at if I bring this to them?”

Sometimes that would cause some of the bad ideas to get less bad, which was the success criterion in that toxic environment. Now, my job immediately after that one was at a 20-person startup. I was the first full-time ops hire. They had a mess they needed to clean up. They knew they needed to clean it up, and they hired me to deal with it. Awesome! It was a great job. But the thing was that I literally had a spot at the table then. So we’d be going through these sprint planning or backlog grooming exercises to see what we were going to be working on in future sprints, and I brought some of these bad habits with me. I was actually there at the table, but someone would say, “I think it’s time to upgrade to this,” and if I didn’t think it was a good idea, I would just give a flat “No” and stop that idea dead. And everybody looked at me with that silent gaze. I got everybody’s attention. I stopped the bad idea, good for me. What I didn’t recognize at that period was that I was stopping the flow. Now, if you’ve worked in Agile, especially in a smaller company, Agile is all about flow. Those sprint planning meetings and status updates are all about the flow of ideas, constant flow back and forth, back and forth. And me throwing that hard “No” dropped the ball, which meant somebody had to pick up the ball again and get it moving around. And I did that often enough that people noticed I was hard to work with. So that was me sharing my toxicity with other people. Now, I ended up getting fired from that job, partly for that, partly for other reasons. But that was clearly a contributor. I was kind of a “no assholes” rule firing, just because I hadn’t realized I’d internalized this toxicity and was sharing it in unhealthy ways.

So part of why I give these talks is to help people deal with the person I was when I came out of that most toxic job ever. Because there are ways you can identify that someone is reacting to a bad situation, and they’re seeing it here even though it’s not there. But they’re seeing it, they’re sure it is, and so they’re throwing up all the coping strategies, trying to maintain psychological safety. That’s where so much of the toxicity in this industry comes from. Because one of the key things is that it doesn’t matter where the toxic workplace was. You could have spent one year of high school and two years of college spending your summers forming fiberglass shells for lawnmowers. That is not the tech industry. But if your boss was an asshole who played power games about shift scheduling and who got the shit jobs and all that stuff, you’re going to be predisposed for the rest of your life to be, “Okay, I’m going to be looking out for the power gamers, because I need to know how to play that game.” Even if it’s not there. Something that early on in your career could poison an entire 20 years. It absolutely can happen. So, yeah. That’s one of the big leadership things I hope to bring forward: if you’re senior, you’ve got some of this stuff. You don’t get to senior in this industry without getting your own share of damage along the way. So yeah, identify it in yourself, and start learning to identify it in others and help them work through it.

[00:43:56] Henry Suryawirawan: Thanks for being vulnerable and sharing your personal story. I do agree. Sometimes we tend to pick this up unconsciously. Especially in the tech industry, this kind of power play is well-known, with people who are perhaps very goal-driven, or want to move things fast, and are also not inclusive. And there are a lot of places in the world where most of the tech people are still men, and white or Asian, and all that.

[00:44:18] Identifying Your Toxicity

[00:44:18] Henry Suryawirawan: So for people who are actually in this kind of situation, maybe you have some tips on how they can actually cope with it? Like you mentioned in the beginning, the first step is to identify it. But sometimes it’s unconscious. So how can you identify that?

[00:44:30] Jamie Riedesel: I’ve given this talk two different ways. One version I give to people who are individual contributors, team members, and in that one, I’m trying to get people to retrospect on their lives and have that “Oh, I did that. I totally did that. I realize I do that now. Oh.” moment. Because once you notice that you do it, you can actually start working on it. So that’s one of the big things, that self-awareness part. But here’s the thing. Senior technicians and engineering managers are in a unique position, because you’re leading these teams or you’re seriously influencing them. People look up to you, and they see you as someone who potentially has power. So say you notice someone who is absolutely quiet in meetings. They never talk in meetings. But when you get them one-on-one, they have a lot of ideas, and you need them to talk in the meetings, because that’s where stuff is exchanged. Now, that kind of thing is a reaction to past trauma.

There are several different ways that could go, and the only way to really figure it out is to ask them about their history. It could be that they’re the new kid on the block and they’re feeling a lack of credibility. So they don’t talk in the meetings because they feel like, for whatever reason, they haven’t earned it yet, that they need to earn it by doing something else first. They just don’t feel that credibility yet. So what you can do is explicitly call on them and see how they react. Do they actually provide input? Or do they sort of demur and back away? If they demur and back away, that tells you they still feel they were put on the spot by you unfairly, so you need to come at it a different way. But if they start contributing, you can reinforce that and say thank you for it. And they start learning that they can just talk, instead of having to be invited to the table. Sort of build their credibility. I mean, that’s just one example.

Another reason why some people can be quiet in meetings is maybe they just don’t like big groups of people. They get scared by that much socialization; there’s just too much going on to keep track of. So if you get them one-on-one, maybe you can solicit their feedback that way. That tells you, as a team leader or a team manager, that maybe you should reshape how you do some of your feedback rituals to allow more one-on-one or async. We all figured out this async thing over the last year and a half of the pandemic. Get in a Google Doc, leave a bunch of comments around, or whatever tool is used. That sort of thing is a great alternative to this giant wall of Zoom images staring at you, which some people really don’t like. Some people work best async. And yeah, it’s about learning work styles in a lot of ways. You can turn some of these mice who never talk in meetings into people who give a lot of solid feedback, just by finding the way that works best for them.

Another one, and this is very personal given my own ops background, is cynicism. When you’ve lived in structurally disempowered areas, and this applies to old-school ops like I was in, and it also applies to people who’ve been racial minorities most of their life, you’ve been dealing with power differentials for so long that cynicism is natural. It’s human. It’s what you do. It’s like “the man.” When you get to these sysadmin conferences, part of what we talked about for so long was the war stories. “Yeah, it was terrible.” “Yeah, I remember I had to deal with that, too.” That sort of cynic’s handshake is so common when you’re on the wrong side of these differentials. But for a team that is truly generative, who’s figured things out and is highly feedback driven, that sort of cynicism is poison, absolute poison. So when you get a new person on the team, they are quiet for a few weeks, and then they start making some sallies to their coworkers, saying, “Yeah, management, right?” That’s a sign. And it’s an interesting one, because the manager won’t get it, and maybe not the senior engineer either. Maybe the IC2 or IC3 on the team will get it, because they’re low enough on the totem pole to be safe for the new person to try the cynic’s handshake with. So yeah, if you’re someone who’s sort of middling on the technical chart for your work, you may see this before the seniors do. So pay attention to it. Listen.

These are all different things. There’s one blog post that I wish more people really internalized, and this is Google’s re:Work blog from November of 2015. You probably know this one if you’ve done any sort of work with office cultures, because that’s where the phrase psychological safety got injected into all talk of culture. But what people forget, because the post was titled “The five keys to a successful team,” is that there are four other points on there. And if you miss on any one of those, you’ve got some toxic elements, because people need to adopt coping strategies to maintain psychological safety.

[00:49:02] Psychological Safety

[00:49:02] Henry Suryawirawan: Maybe for people who are not yet familiar with the term psychological safety: it seems to be the buzzword in all these discussions about great workplaces and great teams to work in, and it’s the one that is mentioned most often. So what do you think psychological safety is?

[00:49:19] Jamie Riedesel: Psychological safety is the feeling a person has that they can safely take risks, or that they know what behaviors they need to perform to stay safe within a given job. We’re talking careers here, so psychological safety is paramount: everything people do, they do to maintain psychological safety, and when they don’t have it, they burn out and they go away. But the thing is, and this is where toxic environments really play in, you can maintain psychological safety through coping strategies. So if you have a manager playing power games, one coping strategy that’s very effective is to just disengage completely. Do what you’re told, nothing more, nothing less, and don’t have any ego about it. Because if your manager moves you from point A to point B to point C, you just do the thing. You do the work. We’re over here now. That’s fine. I don’t care. I’m just getting a paycheck. Move on. And so when they go from there into a generative environment that’s high feedback and needs attachment, because feedback is about caring, this fully detached person is like, “I don’t care. I’m here for a paycheck. Okay. Just tell me what to do. I don’t need to tell you my feelings every single time.” The coping strategies that maintain safety in one environment aren’t always appropriate for another. So yeah, psychological safety is important.

Another big one, and that’s number two on the re:Work blog list, is dependability, which is: I can depend on my coworkers to deliver on what they promise. It’s incredibly important, because if your coworkers can’t deliver anything, you tend to just do it all yourself. You pick the things that you care about, and you only do things that you can do. And if someone offers to help, that’s a risk, and you always have a plan for what happens when they flake out, always. You just don’t trust that they will deliver. You trust that they will fail, and you’ll be happily surprised if they deliver. That’s what happens. So you’ve kept a job, but dependability wasn’t a thing you could have.

Down from there is what they call structure and clarity. What I call: we know where we’re going, we have a plan, we know who’s doing what for the roles, and we know how we’re measuring for the goals. If you don’t have that, but you do have dependability, you tend to just prioritize your own work. Maybe, because you lack the structure and clarity, you just ignore quarterly planning, because it always gets pre-empted for some reason. There’s a crisis every month that blows the plan out of the water, so you just work on that. So when management comes to you and says, “We need some quarterly goals,” you’re sitting there going, “We’re not going to work on this anyway. In a month, we’re going to be working on something else because it’s a crisis. So I’m just going to check out of the process.” So if you are a leader who’s trying to get ideas about what we should pitch for the next quarter, where our goals could be, and you’re getting a lot of dead-air silence, ask yourself: is that because we never work on the plan? Look at your own team. How have we followed through on the previous quarters? If the plan has always been pre-empted by other crises, then we have a problem with structure and clarity.

Down from there is meaning of work, which is: the work is personally meaningful to me. So you’re working on the right stuff. Maybe you’re on a QA team and it’s your job to catch the bugs before they go to production, or catch the bad stuff before customers yell at you. So you can have that personal meaning. That is incredibly important to have, and if you don’t have it, you just work on the subprojects that are important to you as a person. Again, you maintain the dependability and maybe some of the other stuff, but you’re not really invested in the work. And that leads to the pathology of just not putting your full effort into things, simply because you’re not working on the stuff you really like.

The fifth one on the list is impact of work: the work our team does has positive impact. So take that QA team I mentioned, whose job is to find stuff before it goes to production. If that QA team, or the QA processes at the company, is under-resourced, you’ve got a QA team that’s expected to stop a lot of stuff, but they don’t. Stuff’s still getting out to production. You’ve got very highly valued, cranky customers coming back and yelling about the software quality. So yeah, you’re doing the right work, but just not enough of it. And that leads to more disengagement. You just stop looking at the ever-increasing ticket queue. You work on what you can at your own pace. No one can rush you, because it doesn’t matter. I’ll just do what I can do as I do it. There are so many different ways this plays out. I could spend an entire hour talking about that blog post by itself. There’s a lot there.

[00:53:32] Henry Suryawirawan: Right. Thanks for mentioning all these five keys to a highly effective team. I worked at Google myself, and I saw this research and read it internally. It’s one of the great insights that people should look into in order to build a highly effective team. Make sure to check that out: psychological safety, dependability, structure and clarity, meaning, and impact.

[00:53:52] Identifying a Person’s Baggage

[00:53:52] Henry Suryawirawan: So there are two things that stood out from some of your talks that I saw, which I want to highlight here for senior people. The first one is about baggage. You mentioned that everyone has baggage, and you cannot always tell it is there during the interview process. Can you maybe share a little bit more about this one?

[00:54:10] Jamie Riedesel: Absolutely. The thing I want you, as listeners, to think back on is what your job interviews looked like. Maybe you do a lot of interviewing. You’re on the hiring side of the desk, and you’re looking at stuff. Maybe you’re looking at a lot of code, but you’re also running what’s known as the all-around interview, the one where they try to see if you’re a sneaky bozo or an asshole or something. Whether you’re going to be a culture add, as opposed to a culture negative. I think that’s the new term. During that one, the system is supposed to filter out the brilliant asshole: the person who ticks all the boxes on the technical interviews, they’re amazeballs, but when you actually talk to them, they’re belittling, they’re dismissive, they’re basically hard to work with. And yeah, we don’t want to have them around here.

Now, the thing to remember about the interview process is that it is definitely a high-pressure sales situation. Charity mentioned in her talk that your career is the highest-value asset in your entire life, and it’s your job, as someone interviewing for a position, to convince another company to invest half a million, a million, or if you’re staff-plus like me, maybe over a million dollars of lifetime investment in your career. You’re going to be staying there for multiple years. That’s a bigger transaction than buying a house. That’s why it’s so hard to get these jobs, but it also means that if you’re trying to get one, you need to treat it like a high-pressure sales situation. And there are people who get very good at the social game. So when you get into that all-around interview, they will show you enthusiasm for your product. They will show you a healthy, positive response to criticism. They’ll show you positive communication styles with various different people. Maybe, if it’s more of a junior role, they’ll show no ego about asking for help. Maybe, if they’re a senior person, they’ll listen to the interviewer and you’ll have this nice back-and-forth conversation: “Oh yeah, I see your point.” And you get this from people coming out of functionally manipulative environments. It’s maybe a month, two months, three months after they’re onboard, when they’ve been showing all their new coworkers the happy face, the political face, again the coping strategies: I need you to like me, so I’m going to do everything in my power to make sure that you like me, so I can build my credibility with you, so I can work on my own goals. After that honeymoon period has faded a little bit, you might start seeing some of the hidden toxicity. When the mask comes off, you start getting maybe some snarky comments off to the side, or maybe something else.

People like that are how I was after my worst job ever. Because, as I mentioned before, we had no grand high person to say “do it because I said so.” The way that those of us central systems administrators got anything done was persuasion and manipulation. I quite literally ran good cop/bad cop. One of the people in our College of Arts and Sciences wasn’t talking to us, so I came to him with an open hand. I was the new kid on the block and I was reasonable, unlike the leader of our team, who wasn’t. I got him to commit in ways that the leader of our team couldn’t, because I wasn’t that asshole. And when I talked to the leader of our team about this, he said, “Good. So long as they’re doing what they’re supposed to, keep at it. Use me as the bad guy. Go for it.” That sort of thing is antithetical to a generative environment. Oh my God! But it’s absolutely how the game is played in certain other environments. So when you get someone that socially skilled into the all-around interview, you see this amazeballs person, but you don’t know, you can’t know in that setting, that you’re hiring somebody who’s going to be a net negative. Which is why you need to pay attention over the next year of someone’s onboarding. So yeah, the interview process weeds a lot out, but it is not in any way the sovereign remedy to all of this.

[00:57:43] Henry Suryawirawan: Thanks for highlighting that possibility of someone who can ace the interview, but over their time in the company starts to show toxicity, attitude, or maybe some sneaky behavior. And also the other case, of people who did well in the interview but over time didn’t do well because, I don’t know, they are shy or they don’t feel psychologically safe. They also have these kinds of baggage that you should notice as their manager or as senior people within the team.

[00:58:09] Who is On The Team Matters Less

[00:58:09] Henry Suryawirawan: The other thing that is interesting in your presentation, which I want to highlight as well. I will read it here: “Who is on the team matters less than how the team members interact, structure their work, and view their contributions.” For people who are in a high-performing team, this is something you should listen to. Maybe you can share a little bit more about this.

[00:58:28] Jamie Riedesel: Yeah, this is actually something Manuel talked about in June, in his team organization talk. How the teams inter-operate with each other matters more than what they’re working on, because it goes back to Conway’s Law and how technology talks to each other. So, yes, the team itself is its own unit, if you will, but how the teams operate with others matters just as much. If you’re in a monolithic team, a team working on a large software monolith, you have a large code base, you’re responsible for this bit, and everybody’s responsible for various parts of the same code base. There’s a certain amount of presumed inter-team communication you have to do just by living in the same code base. Conway’s Law again: things like classing, how you pass messages between different classes, where you put files. All this stuff you need to know when you’re working within a single code base. So yeah, you’re going to be in cooperation with these other teams just by necessity, just because of how it works.

Now, if you’ve got a microservices environment, you go up the stack a little bit. You’re talking about how the APIs talk to each other. A single team may have one or five microservices they manage, but you’re also interacting with other teams that have maybe a promise for their API and a bunch of other stuff. But the really interesting thing happens when you try to get two different teams that historically haven’t talked to each other to start talking to each other. The great example of this I lived through, frankly, is HelloSign getting bought by Dropbox in February of 2019. We had a large system. They had an extremely large system. Watching how the interoperability happened there was absolutely fascinating, because our approaches to telemetry were very different. The thing is that they were huge. They had done far more automation in places that we weren’t ready for yet.

So when you have that, in order to get the integration working, you need to get the teams talking first. The teams need to talk before you can get the technology talking, because you need to talk to the other team to figure out how to have your technologies talk to each other. The human layer happens before the technology layer. And how do you handle things like tracking transactions across that? Because if you’re in a merger situation, you may have two different islands of telemetry, but you’ve got this service that’s ping-ponging between both of them. How do you get an end-to-end trace that way? You generally don’t, unfortunately. That takes a lot of engineering to provide. So yeah, that’s why the social layer is so incredibly important. Staff-plus engineers and engineering managers have all realized at this point that among the biggest problems we’re working on, yes, there are some really gnarly technical problems that only we have the context to solve. But for the most part, we’re dealing with organizational issues. Who should work on this? Which is the right stuff to work on? How do we scope things to make them feasible over time? Those sorts of people things: sitting in meetings, talking to product managers, working with executives to provide ideas of what’s possible so they can figure out where to aim. “Okay, this is what we should focus on. Tell me how much of it we can do.” And we do all that back and forth until we finally get to specifying epics for engineering at the very end of that human process. And yeah, the people and process matter more in a technical organization than the technical, in a lot of ways.

[01:01:42] Henry Suryawirawan: It’s like what they say, it’s always a people problem, right?

[01:01:46] Jamie Riedesel: Yes. Very much.

[01:01:49] 3 Tech Lead Wisdom

[01:01:49] Henry Suryawirawan: So, Jamie, thanks for sharing your story and talking about software telemetry. Unfortunately, due to time, we have to cut it here. But before I let you go, normally I have this one question that I always ask for all my guests, which is what I call three technical leadership wisdom. Would you be able to share your wisdom for our listeners here to learn from you?

[01:02:07] Jamie Riedesel: Absolutely. The first one is something Manuel talked about in June: teams are not just about tools, but also about the people in them and how they collaborate. Charity talked about this in April as well, when she said that her teams talk about emotion a lot. That goes straight into all the toxicity things I’ve been talking about, because you’re not just automatons who produce code. There are people there. They react emotionally to things, and you ignore that at your peril, absolutely.

The second one is that everyone brings their previous jobs with them, even the bad parts. Like I said earlier, during the interviews, you’re looking for people who are minimally damaged, or who can at least hide their damage in ways that work well enough for your team. Like that person I mentioned with the job forming fiberglass shells for lawnmowers back in college: that trauma can still affect them 25 years later, when they’re on the verge of principal engineer. So yeah, definitely be aware of that. Be aware of it in yourself, too. Look back and see what bad workplaces trained you into.

And the third one is that old tools can still bring great value. Leadership is knowing when it’s worthwhile to scrap the old and build the new, but also when to say no to the update. Because newer is not always better, and this is where the business decision comes in. If you’ve got a 20-year-old Java codebase that’s been upgraded all the way from Java 2 and has been running syslog the entire time, yeah, maybe you could bolt some distributed tracing in there, but it’s not going to be rip-and-replace. It’s going to be a long, painful crossfade as the design patterns of telemetry impact how your code operates. And you have to crossfade it, because it’s going to be a long project: you’ve got something deeply personalized to your use case, and you’re adapting it to a completely new way of handling telemetry. So yeah, the old tools can deliver a lot of value for a long time.

The thing that I’ve definitely learned through all this is that in 10 years, in 2031, there are still going to be people who can’t use those highly cardinal, highly dimensional event blobs Charity was talking about, because they just haven’t caught up yet. Maybe there will be a lot more SaaS providers doing it. Maybe there might even be an open source thing. But for whatever reason, places like Orange County, California, and their property tax division, something that might actually be on a mainframe, just haven’t caught up with it, because maybe they don’t need to at that point. The future is not evenly distributed in their environment at all. You’re going to run into this again and again, and being open-minded about that is incredibly important as a leader.

[01:04:38] Henry Suryawirawan: Thanks for sharing your wisdom. I really love that, especially the last one. Make sure to value the old tools as well; don’t always scrap and rewrite everything or reach for a new tool, the new kid on the block. So, Jamie, for people who would like to know more about you, or maybe continue this conversation online, where can those people find you?

[01:04:56] Jamie Riedesel: The best place to get ahold of me on social media is Twitter, at SysAdm1138. It would have been SysAdmin1138, but oddly, Twitter doesn’t let you use the string “admin” in usernames. So yeah, SysAdm1138 on Twitter is the best place. I do have a blog at SysAdmin1138.Net. Not .Com, .Net. That has been going since 2004. Ever since the death of Google Reader, I’ve been kind of slow on updates, but it does get some occasionally. Add it to whatever you use for RSS.

[01:05:24] Henry Suryawirawan: Any significance of 1138?

[01:05:28] Jamie Riedesel: Ha, funny thing. When I was coming up with a blog title back in January of 2004, I wanted to find a nom de blog that combined the two interests of my life: the systems administration work that I was doing, and 1138, which is a callback to George Lucas’s student film, THX 1138. So I plopped them together, and there we go.

[01:05:48] Henry Suryawirawan: Thanks for sharing that. So, Jamie, it’s been a pleasure talking to you. Thanks for sharing all your stories and your expertise. I’m looking forward to seeing the book published all around the world, where people can find it.

[01:05:59] Jamie Riedesel: Thank you.

– End –