#72 - Managing SRE Toils Using AIOps and NoOps - Amrith Raj

 

   

“It is important to eliminate toil. If you don’t eliminate toil, you won’t have time to fix problems strategically, because strategic initiatives take precedence.”

Amrith Raj is a Senior Solutions Architect at Dynatrace. In this episode, Amrith walked us through the evolution and current state of IT Operations (ITOps). He described how the ITOps role has developed over time and has become increasingly challenging with the growing level of infrastructure abstraction and complexity, especially in the current era of cloud and Platform-as-a-Service. To manage such a high amount of complexity, Amrith shared the importance of having good culture and practices and touched on some important Google SRE concepts, such as toil and automation. Amrith then shared some recent ITOps advancements, namely NoOps and AIOps, and how we can leverage them to improve the way we solve problems. Also, make sure you do not miss Amrith’s pro tips on how we can become a better SRE and how to use Function-as-a-Service effectively.

Listen out for:

  • Career Journey - [00:05:29]
  • Managing Cloud Capacity - [00:07:51]
  • IT Ops - [00:11:09]
  • Cloud & Abstraction - [00:16:01]
  • Culture & Toil - [00:18:54]
  • NoOps - [00:28:29]
  • AIOps - [00:31:20]
  • Tips on AWS Lambda - [00:44:16]
  • 3 Tech Lead Wisdom - [00:48:12]

_____

Amrith Raj’s Bio
Amrith Raj is a Senior Solutions Architect at Dynatrace, supporting the transformation journey of their highly diversified customers through automated and intelligent observability. He has authored an e-book on Cloud Capacity and has published papers related to Cloud, Data Centres and IT Infrastructure. He and members of the Cloud Use Cases group published a whitepaper called Moving to Cloud back when cloud computing was still too new to be adopted by companies. He has held multiple roles which involved responsibilities around leadership, engineering and operations, modernisation, cloud architecture, automation, migration and building fault-tolerant cloud infrastructure. Based in Melbourne, he is passionate about how technology can be used to transform human lives.

Follow Amrith:

Mentions & Links:

 

Our Sponsors
Are you looking for some new cool swag?

Tech Lead Journal now offers some swag that you can purchase online. The swag is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive your order.

 

Like this episode?
Follow @techleadjournal on LinkedIn, Twitter, Instagram.
Buy me a coffee or become a patron.

 

Quotes

IT Ops

  • Think about our journeys. We used to go to banks for transactions. I always feel banks are the oldest software consumers. Today, we order food, taxis and other resources through apps, and we think those companies know it best, but the fact is banks were the oldest consumers.

  • Whenever they use systems, they buy hardware and software. Once you get the hardware and software, the question is, who is responsible for maintaining them? Who’s going to build those? Who’s going to maintain those systems? Somebody has to support it.

  • The team that is responsible for maintaining them, in the simple term, is IT operations. And IT operations have evolved to an extent. Because now technologies are much easier to consume. You have cloud computing. You have Platform as a Service.

  • The challenge has been that it’s all abstracted. It’s heavily abstracted now. Before it was a LAMP stack, for example, and we know it’s running on one server. We know it’s there and we try to just troubleshoot it. Now, because of the shared responsibility model, your scope has changed. Your scope is now to the product that you support, to the administration of those products that you support.

  • But operations haven’t changed. We still need operations. It’s the operator who makes sure that things are running all the time. Like the serviceman who fixes your car, they are the person who fixes things when there is an outage. They are interrupt driven. If something is wrong, they come back and fix it. They are motivated to bring the system back to normalcy.

  • When you see an outage next time, remember that there is an operator trying to bring that system back online.

  • It’s part of computing. Systems and software will fail. They are meant to be imperfect. They will not work all the time, and that’s an acknowledgement that we need to deal with and face every day.

  • IT Ops is getting tougher and tougher because it’s no longer the LAMP stack that I mentioned. If I need to operate an application, it could be hosted on Kubernetes. It could be hosted on your Platform as a Service, like App Engine. It could be hosted on a physical server. You don’t know where it is hosted. You cannot simply analyze the metrics from these three machines and find out what’s going wrong. You need to know the complete transactions.

  • Monitoring is one part of it. How are you going to measure all these? Are you going to just check the CPU and memory of these machines? Are you going to check the performance of the pods? How’s the application performance? So the complexity has exploded.

Cloud & Abstraction

  • Cloud makes it simple to consume the resources. But it also makes it hard to manage and monitor them. You need to make sure your systems are running all the time. And then you have your circle of influence, where you are in control.

  • Cloud providers have been doing an excellent job. They’re really good at what they do. Yes, they had outages. But I would still rather trust them than trust myself to build my own data center.

Culture & Toil

  • Google has shared its SRE concepts with the world. It has been there for a couple of years now, and a lot of industries are trying to adopt it. It’s very similar to the ITIL model. It’s not drastically different. But it helps you understand how you should approach these operations.

  • Here’s a pro tip to be an SRE. Stop logging on to the server. That’s the first step that you need to do. Zero touch. Don’t log on. If you’re logging on a day-to-day basis and seeing what’s going on, what’s wrong with my server, you would end up logging on all the time, and then you’re wasting time logging into it.

  • The problem that we face is the scope of influence has now changed. The complexity that we face is now no longer just the infrastructure problem. It’s the application problem. There’s tracing that happens for each individual request. You have no idea how your front end users are using it.

  • Operations teams which were part of ITIL V3 and V4 have to evolve and apply some SRE practices. You don’t have to extract everything that is there in the Google SRE book and apply it. Because, number one, you’re not Google. Google does it a different way because their scale is different.

  • What we could do is start with those baby steps. The baby step is to look at how you log on to the server. When something goes wrong, how do you respond to it?

  • As you evolve, then ideally, you shouldn’t be paged. You shouldn’t get an alert. You shouldn’t get a call for which the responses are robotic, which means that if I get an alert, and the only thing I would do is go and restart the service, that is not intelligence. I’m not applying intelligence. I’m just being robotic and solving the problem.

  • Can that job be automated? That job is called toil. The way in which we respond to issues in our system could be improved drastically. And this is where we need to identify toil.

  • If an incident happened once, and you found out that the incident can be solved by a permanent fix, then we should do the permanent fix. If the incident happens again and again, you try to find out what is going wrong here. Try to investigate it, and go deeper as much as you can, and try to fix the problem, fix the root cause of the problem.

  • And if you’re doing this every time, every week, that’s a great opportunity to automate that problem. A great opportunity to fix that problem. Then comes the cost consideration. If the effort is too much, you need to reconsider whether it is worth automating that problem.

  • The next time you step out from an individual perspective, you will see that there is toil everywhere and you can manage it. It is important to eliminate it, because if you don’t eliminate toil, you won’t have time to work on the greater good. You won’t have time to fix problems strategically. We have to identify and fix that toil as much as we can because strategic initiatives take precedence.

NoOps

  • NoOps, by definition, says that your operations team is so automated that you don’t have to involve humans there: how you automate monitoring, how you automate quality, how you automate response, and how you automate your problem remediation.

AIOps

  • Every company, every application, every team that you work for is different. What applies in one solution doesn’t apply everywhere. So you have to change your process and approach by rule number one, which was identifying toil properly. Once you identify it, then you fix it. That’s close to NoOps.

  • AIOps are systems that work heavily by automating the monitoring part of it.

  • Monitoring is not just about collecting data. You want to collect metrics, logs, and traces. Then, what am I going to do with that? That’s the challenge that we have with monitoring.

  • This is where an AIOps platform would address this by having an agent or something that is deployed. It would understand the environment and the topology completely. It needs to understand the topology from the front-end all the way back to the node level, and understand the relationships.

  • And then in the next stages, it automatically baselines all this information. When should I let you know when there is a CPU increase? When should I let you know that there are more HTTP 500 errors? When it happens, what am I supposed to do? Am I going to send you an alert? Or do I need to tell you what went wrong?

Advice on Using Lambda

  • Consuming Lambda is so easy. Whenever there is Lambda, it’s like a duct tape solution. The problem is that it’s getting harder to manage functions.

  • The good thing about Lambda is that it doesn’t cost much. There is no cost just for having it; you only pay whenever it gets invoked.

  • If you’re using it for NoOps, you should properly plan it with playbooks and runbooks. Playbooks are the ones that you would create for when there is an incident, and runbooks are the ones that help you create or execute a standard operating procedure. Those are driven by playbooks, and you will find these playbooks are available through your configuration management tools like Ansible and Puppet, or in AWS as Systems Manager.

  • If you decide, “Okay, I think Lambda is the right way for my automation”, build some standards there. Have proper naming convention to your Lambda functions. How are you going to deploy your functions? Use GitOps. Maintain your code of Lambda in a centralized repository. Create those Lambdas through GitOps. Don’t do it manually.

  • Whenever you build a solution, you should be able to show how well you know it. The confidence in the system should increase, not go downwards. If I make a solution, I should be confident enough to explain it.

  • I should not have to check a manual to find out how it was done. The only way operations engineers can actually succeed is by being very confident in the way in which they build these systems. Your circle of influence, you should know it in and out as much as you can.

3 Tech Lead Wisdom

  1. Stay curious as much as you can.

    • As children, we were all curious. We were all very curious about trying to find out what is going on. We tend to lose some curiosity for some reason. I think that is something we have to correct.

    • If you’re curious, stay curious as much as you can, because there will be a moment when you realize that whatever you learned in the past would play a role in the future.

    • You may not solve that problem tomorrow, but there will be an ‘aha!’ moment.

  2. Be a leader in your own space.

    • If you are a junior developer or an engineering manager, be the best junior developer or engineering manager in the world.

    • Nobody’s rating you, but you have got to be the best at what you’re trying to do, by rule number one, which was curiosity.

    • The role of a leader is to solve the problem, and you would solve the problem when you’re curious, when you try to address it. So specialize and master and excel in that job.

    • And if you don’t know where to start, again, go to rule number one. Be curious.

  3. Stay happy and contented.

    • Whatever we’re doing, we have to find a human connection to the work that we’re doing.

    • Find that relationship from the work that you’re doing. Have that human connection. I feel that when you find that connection, it’ll really help.

    • So that curiosity and leadership that you learn would help you find the next spot and how you can improve things, how you can make things better for yourself, for the community, for the greater good.

Transcript

[00:01:15] Episode Introduction

[00:01:15] Henry Suryawirawan: Hello to all of you, my listeners and friends. Wherever you are, I hope that you’re feeling grounded and full of joy. Welcome to another new episode of the Tech Lead Journal podcast. Thank you for spending your time with me today listening to this episode. If you’re new to Tech Lead Journal, I invite you to subscribe and follow the show on your podcast app and join our growing social media communities on LinkedIn, Twitter, and Instagram. And if you have been regularly listening and enjoying this podcast and its contents, will you support the show by subscribing as a patron at techleadjournal.dev/patron, and support my journey to continue producing great episodes every week.

My guest for today’s episode is Amrith Raj, a Senior Solutions Architect at Dynatrace. He has authored a book on Cloud Capacity and has published multiple papers related to cloud, data centers, and IT infrastructure. One of the papers published in 2013 is called Moving to Cloud, when cloud computing was still too new to be adopted by companies.

In this episode, Amrith walked us through the evolution and current state of IT operations, or ITOps for short. He described how the ITOps role has developed over time and has become increasingly challenging with the growing level of infrastructure abstraction and complexity, especially in the current era of cloud and Platform-as-a-Service. When dealing with such a high amount of complexity, Amrith shared the importance of adopting good culture and practices. And he touched on some important Google SRE concepts, such as managing toil and having more automation. Amrith then shared some recent ITOps advancements, namely the NoOps movement and AIOps technologies, and how we can leverage them to improve the way we operate and solve IT operations problems. Also, throughout our conversation, make sure you do not miss Amrith’s pro tips on how we can become a better SRE and how to use Function-as-a-Service effectively.

I enjoyed my conversation with Amrith, learning about the evolution of the IT operations role, how the role has become increasingly challenging, and how the NoOps movement and AIOps technologies can help us give more HugOps to the IT operations people. If you enjoy and find this conversation useful, I encourage you to share it with someone you know who would also benefit from listening to this episode. You can also leave a rating and review on your podcast app, or share some comments on social media about what you enjoyed from this episode. So let’s get our episode started right after our sponsor message.

[00:04:34] Introduction

[00:04:34] Henry Suryawirawan: Hi, everyone. Welcome back to a new Tech Lead Journal episode. Today I have with me a guest named Amrith Raj. He’s a senior solutions architect at Dynatrace. He’s very experienced in IT operations, system administration, and cloud computing. And today, we are going to talk a lot about IT operations, how it started and the evolution of IT operations. I think we all know these days about DevOps and SRE practices. And these days, you might have heard concepts such as NoOps and AIOps. So what are all these? Maybe Amrith today will give some insights into what all these X-Ops are, so we get to know all these concepts, and whether they are actually a myth or something that is possible to do these days. We’ll be covering that a lot today. So Amrith, thank you so much for attending this recording, and I’m looking forward to having a good discussion with you today.

[00:05:28] Amrith Raj: Thank you for having me.

[00:05:29] Career Journey

[00:05:29] Henry Suryawirawan: So Amrith, in the beginning as usual, I ask my guests to share their journey, their career highlights, or any turning points that you want to share with the audience.

[00:05:38] Amrith Raj: Sure. First of all, I started as a tech support engineer. I used to support storage arrays. For people who don’t know what storage arrays are: there is a server, and it needs access to storage. That storage would come from a storage array, and there were large vendors who created these storage arrays and provided them in data centers. So whenever there was a fault, I used to be notified, and we used to dial in using modems, the traditional modems you can imagine that make those nice noises, and then try to troubleshoot what was wrong and fix it. So that was my first entry point.

I moved on to storage administration, working on a variety of storage-related projects and transformation projects, and further into server and data center migration projects and data center transformation projects. And that’s where I became very curious about cloud computing. It was sometime around 2011, or maybe 2007 or 2008, when AWS was getting very popular. I was interested to know more about cloud. Other members of the community and I were part of the Cloud Use Cases group. The Cloud Use Cases group published a paper on moving to cloud, to which I contributed. Apart from that, there were other papers that I published related to cloud and data center operations.

Then I moved on to managing cloud for CSC, Computer Sciences Corporation. This was a time when CSC was in the Gartner quadrant for Infrastructure as a Service. I was managing the cloud capacity. It was hard because we had to work on understanding the demand, forecasting it, and getting the disks and storage. So managing that capacity planning of servers, compute, and storage was my job for close to 4.5 years. Then I moved to public cloud. I worked at DXC as the AWS lead architect, helping customers move to the cloud through their app modernization and cloud migration journey. Recently, I’ve joined Dynatrace, which is an all-in-one platform with automatic and intelligent observability. It helps the world’s largest organizations tame cloud complexity and accelerate transformation.

[00:07:51] Managing Cloud Capacity

[00:07:51] Henry Suryawirawan: Thanks for sharing your journey. I have one interesting observation when you mentioned all these: cloud capacity. I think many people still believe that cloud is elastic and infinite, like everything is just there when you want it. And I understand. I’ve worked at a cloud provider before, so I know cloud capacity is one of the hard topics as well. You have the limitations of data centers. You can’t just spin up as many things as you want. Maybe share a little bit of insight from when you were doing this for 4.5 years. Tell us more about managing cloud capacity.

[00:08:24] Amrith Raj: Yeah, sure. CSC had a lot of data centers, and this is how I did it. Across those data centers, they had resources, and these were proprietary hardware. Public cloud had more of an open standard to an extent, like Facebook with the Open Compute Project, and Amazon created their own infrastructure. Google creates their own infrastructure. Companies like CSC cannot do that. We had to depend on vendors like VCE Vblock and Dell EMC to provide that. What we used to do is get those Vblocks, install them, and then work on demand with my sales team. The sales team would give me information: this is happening, we have this project coming up. What I had to do was compare the capacity that was there, see whether it met the demand, and try to stay in line.

The challenge was to juggle capacity, and we used to do that all the time. We used to move disks between data centers, from one location to another, based on demand. The term we used back then was trusted cloud, because it’s more of a private and public cloud that customers can trust. We used to move hardware here and there by working through our vendor to get the hardware, on a pay-as-you-go model. So that was the way in which it was done. It was, to an extent, predictable. But it was difficult to manage because of the economics. If I buy more and then customers don’t use it, that’s a lot of hardware sitting there. Whereas vendors like AWS, Google, and Azure know they can invest a lot in that hardware upfront and still manage. But even they run out of capacity. I don’t know if anybody has noticed, but they run out of capacity as well. When I was provisioning, there were certain instances that said, “try again later”. That does happen from time to time, and whoever has seen this might have noticed that there are capacity issues in the cloud.

[00:10:09] Henry Suryawirawan: And especially for small data centers. Cloud has a presence in multiple countries and cities, but their data center capacities are not the same. Thanks for sharing this. And especially the difference between public cloud, which adopted open standards, and trusted cloud or private cloud, which is probably a little bit more like managing data centers. You put in servers and resources as per demand, right? Not elastically, where you just put in everything until one day people come and ask for it.

[00:10:36] Amrith Raj: For that reason, what I’ve seen is, if I go through the numbers, we had more resources and more data centers than AWS and Google had. The reason was that each one was small. The infrastructure was small in scale. When companies like Google and Amazon go and build data centers, they start from scratch, from the real estate. As much as possible, they try to use their own infrastructure so that they absorb those costs, and they go from there. That’s the reason these companies are no longer in the cloud business. We realized that it’s not our cup of tea anymore. Unfortunately.

[00:11:09] IT Ops

[00:11:09] Henry Suryawirawan: So, yeah, I’m looking at this career journey that you have. I think you are pretty much experienced end-to-end, I would say, in terms of IT operations and system administration. You managed the resources in data centers, then moved to the cloud, and now you’re in the observability area. So maybe coming back to the topic that we want to talk about: IT operations. What’s the significance of IT operations these days? I know this term and this role have existed for a long time. So in your view, what is ITOps these days? Has it transformed a lot? Is it a significant part of the tech industry? What’s your view on this?

[00:11:44] Amrith Raj: It is a significant part of the tech industry. Think about how it has evolved. Through my journey, it’s at a high level, but think about our journeys. We used to go to banks for transactions. I always feel banks are the oldest software consumers. Today, we order food, taxis and other resources through apps. And we think those companies know it best, but the fact is banks were the oldest consumers. Whenever they use systems, they buy hardware and software. Once you get the hardware and software, the question is: who is responsible for maintaining them? Who’s going to build those? Who’s going to maintain those systems? Once they deliver the software, it could be in the form of an ATM. The ATM was probably the first hardware and software amalgamation that we were all exposed to, where we could go and withdraw cash. So somebody has to support it. That system has evolved. From there, it went to internet banking, mobile banking, and now we transact through the phone. The team that is responsible for maintaining them, in simple terms, is IT operations. And IT operations have evolved to an extent. Because now technologies are much easier to consume. You have cloud computing. You have Platform as a Service. You build those resources and now you get that done.

The challenge has been that it’s all abstracted. It’s heavily abstracted now. Before, it was a LAMP stack, for example, and we knew it was running on one server. We knew it was there and we would just try to troubleshoot it. That simple. Now, because of the shared responsibility model, your scope has changed. Your scope is now the product that you support, the administration of those products that you support. But operations haven’t changed. We still need operations. It’s the operator who makes sure that things are running all the time. If you think about the serviceman who fixes your car, the operator is that person when there is an outage. They are interrupt driven. If something is wrong, they come back and fix it. They are motivated to bring the system back to normalcy. When you see an outage next time, remember that there is an operator trying to bring that system back online.

[00:13:50] Henry Suryawirawan: Yeah. And speaking about outages and incidents, there were a number in recent weeks, right? Facebook. And I think there was another famous company which was down as well.

[00:13:59] Amrith Raj: So, those are interesting, and those are large-scale outages. Whoever was involved in those operations teams, big HugOps to them. Make sure that they don’t lose morale there. But it’s part of computing. Systems and software will fail. They are meant to be imperfect. They will not work all the time, and that’s an acknowledgement that we need to deal with and face every day.

[00:14:22] Henry Suryawirawan: And speaking about what you mentioned just now: traditionally, the IT operations support model was probably easier. You could see the server. How many of them? What software is running? But these days, especially with the prevalence of microservices and distributed architectures, there are a lot of technologies involved, for example, a number of database technologies, web servers, application servers, and maybe other components installed together. This seems to make it more complicated. How about from this angle? How do you see the job of IT operations becoming tougher and tougher?

[00:14:55] Amrith Raj: It is getting tougher and tougher. Because it’s no longer the LAMP stack that I mentioned. If I need to operate an application, it could be hosted on Kubernetes. It could be hosted on your Platform as a Service, like App Engine. It could be hosted on a physical server. You don’t know where it is hosted. And then, think about having it running on a Kubernetes cluster, say, with three nodes, and a lot of microservices running on top of that. Each of those nodes is beefy. You have a large amount of RAM and storage. You cannot simply analyze the metrics from these three machines and find out what’s going wrong. You need to know the complete transactions.

Now, monitoring is one part of it. How are you going to measure all of this? Are you going to just check the CPU and memory of these machines? Are you going to check the performance of the pods? How’s the application performance? So the complexity has exploded. And this is where the topic that we’re talking about, AIOps, plays a significant role to demystify and simplify these problems. It is there to address this, which we’ll cover in the next section as we go through this talk.

[00:16:01] Cloud & Abstraction

[00:16:01] Henry Suryawirawan: So speaking about abstraction, right? You mentioned abstraction. These days, a lot of things are getting more and more abstracted. Maybe in the past you worked on physical things. Then there’s a VM. Now we talk about Kubernetes, and maybe cloud itself is another set of abstractions, because you just talk to an API, maybe using CLI commands, instead of directly SSH-ing in and doing all these Unix operations. In terms of abstractions, some people say it is actually making things easier. But I also know that sometimes it makes things difficult, because you have to learn all these new abstractions, the concepts, the ideas, and even the commands and all that. What do you think about abstraction? Does it make things easier or more difficult?

[00:16:42] Amrith Raj: There are two sides to that. Cloud makes it simple to consume those resources. But it also makes it hard to manage and monitor them. On one side, you’re consuming it. It’s very easy. In five seconds, you can have a VM running. It’s done. You can deploy an app. That’s done. But think about the problem that I explained earlier. You need to make sure your systems are running all the time. You don’t get a tick mark just for consuming the resources. Your customers are using those all the time. And then you have your circle of influence, where you are in control. This is my app. I’ll take care of it. It’s my baby. I’ll solve the problem. If anything happens to my baby, I’ll take care of it. But if there is something else, if there’s a storm coming and that storm is in the form of a cloud outage, there’s nothing you can do but try to protect your baby. That’s how it is. It’s unfortunate and fortunate at the same time. The speed, agility, and deployment times are what you get from cloud, but you have to manage the risks.

To be frank, if we go through the numbers, cloud providers have been doing an excellent job. They’re really good at what they do. Yes, they have had outages. But I would still rather trust them than trust myself to build my own data center. Because I don’t have the time, and all the disadvantages of building my own data centers are valid. They’re really valid. I’ve gone through that journey. I know how hard it is. I don’t want to do that again. I want to sit with my customers through their problems. I’m going to leave the infrastructure problem and everything else abstracted through the cloud providers. They are accountable. They have SLAs. They give me credits. Of course, I really don’t want them to have bad outages. But I think I would rather live with managing cloud. It’s the right choice today. If there’s a better choice that comes up later on, yeah, sure, let’s keep an eye on that.

[00:18:28] Henry Suryawirawan: And I think all these AI and serverless technologies, and we haven’t talked about serverless yet, are technologies where you seem to just deploy and everything works. I can see a lot of benefits, like what you mentioned: simple to consume, simple to deploy. I think that’s still key, especially in these modern days where people are building more and more applications, products, and systems. IT is eating the world, really. But at the same time, managing servers, I think, can still be tough, especially depending on the different abstractions through which you manage your resources.

[00:18:54] Culture and Toil

[00:18:54] Henry Suryawirawan: That is one aspect of IT operations, which is about technology and resources. How about the practices and culture? I mean, these days we talk about DevOps, we talk about SRE. So what do you think of them? Are they helping ITOps as well?

[00:19:09] Amrith Raj: Yeah. So to rewind the clock a little bit. When I was part of IT operations, supporting a lot of customers and applications, there was, and still is, a model called ITIL, the IT Infrastructure Library, now at V4. I did the V3 version when I was there. It focused on how to manage incidents, how to manage problems, and how to respond to situations. You had things like the PIR, the Post-Incident Review, that you run to investigate when something goes wrong. Today, Google has shared its SRE concepts with the world. It has been there for a couple of years now, and a lot of industries are trying to adopt it. It’s very similar to the ITIL model. It’s not drastically different. But it helps you understand how you should approach these operations. We cannot simply log on to a server and fix the problem.

Here’s a pro tip to be an SRE. Stop logging on to the server. That’s the first step that you need to take. Zero touch. Don’t log on. The moment you log on, if you’re logging on to fix the problem, maybe I’ll give you one point there. But if you’re logging on on a day-to-day basis just to see what’s going on, what’s wrong with my server, you would end up logging on all the time, and then you’re wasting time logging into it. The problem that we face is that the scope of influence has now changed. The complexity that we face is no longer just the infrastructure problem. It’s the application problem. There’s tracing that happens for each individual request. You have no idea how your front-end users are using it. So that changes.

So operations teams which were part of ITIL V3 and V4 have to evolve and apply some SRE practices. Now, how do you do that? You don’t have to extract everything that is there in the Google SRE book and apply it. Because, number one, you’re not Google. Google does it a different way because their scale is different. They have to serve billions of Google accounts. We probably don’t. A billion is a big number. The Gmail accounts and YouTube that you see, that’s massive scale. What we can do is start with baby steps. The baby step is to look at how you log on to the server. When something goes wrong, how do you respond to it? What is wrong with that server? Think about the job I did. I had to be on call. If something was wrong, I would get a phone call. They would say this is the problem, and I would try to guess what the problem was and try to replay it. Okay, this could be the issue, and I would go in and try to find out: is that the issue? Trying to exclude what could be the problem, and then finally navigating to: okay, this is the issue. Let me fix it. Now I know how to fix it. I had to spend a lot of time to try and find what the problem was. And that was the pre-SRE phase.

As you evolve, then ideally, you shouldn’t be paged. You shouldn’t get an alert. You shouldn’t get a call for which the response is robotic, which means that if I get an alert and the only thing I do is go and restart the service, that is not intelligence. I’m not applying intelligence. I’m just being robotic and solving the problem. So this job is something that we could improve on. And that’s where we need to identify toil.

Toil is something that I want to introduce here. I want to pick a general example. You are running this podcast, and the biggest value that you have is this conversation that you and I are having. That is the biggest value. Then you have some form of post-processing and post-production. That’s editing, which is also valuable. You have to add something. Then there are other channels that you have to handle. You have to publish to the various platforms like Spotify, Google Podcasts, Apple Podcasts, YouTube. And then you convert the voice to text, take that text, change it into Markdown, and update it on your site. These are tasks that you do which are repeatable. You’re doing this every time there’s a podcast, and they do not carry the value. I’m not saying that it doesn’t have value. I’m saying the most valuable part is this conversation that you and I are having. The other things are important and you have to do them, you have no other choice, but they can be automated. So that’s the next step. Can that job be automated? That job is called toil.

It could be anything else. Another example that I can think of is cleaning my house with a vacuum cleaner or a broom. I have to use that and clean it. Can I solve that problem? Why is that not so important? It’s not so important because I could spend more time with my family. That is more valuable. I have to spend more time there. Whereas cleaning the house is important, but I have to do it, and I’ve been doing it manually. It is important, but I should find a way to automate it. How do I do this? I buy a robotic vacuum cleaner. I’m not advertising a robotic vacuum cleaner right now, but that is an important aspect. There is a baby cradle that automatically rocks when the baby cries. Gosh. When people are doing this for babies crying during the night, how are we responding to our systems? I think the way in which we respond to issues in our systems could be improved drastically. And this is where we need to identify toil. If that all makes sense.

[00:24:22] Henry Suryawirawan: Yeah. So toil is a concept that I also read about in the SRE book. And in fact, inside the book, it’s mentioned that even though you can identify toil, some of it is always there. You can’t just eliminate all the toil. In fact, a 50-50 ratio is commonly used for any Google SRE. I like the way you described toil. It’s something that is repeatable. It happens almost all the time. You know how to fix it, how to respond to it, and potentially it can be automated. These three attributes, I would say, are the core components, because if an incident happened once and you just fixed it, and that’s it, I think it’s not considered toil. Is that a correct understanding?

[00:24:59] Amrith Raj: That is correct. Now, if an incident happened once, and you found out that the incident can be solved by a permanent fix, then we should do the permanent fix. If the incident happens again and again, you try to find out what is going wrong. Try to investigate it, go as deep as you can, and try to fix the problem, fix the root cause of the problem. At any given point of time you realize this is happening, and this is happening for a lot of organizations. As we saw in the journey from the bank to where they are right now, they’re the oldest consumers of software. They have a lot of legacy software. Any business that has used computers for a long period of time will have legacy software. That software would be outdated. The vendors wouldn’t even release a patch.

The only solution, for example, if there’s a memory leak, is to restart that service. And if you’re doing this every time, every week, that’s a great opportunity to automate that problem. A great opportunity to fix that problem. Then comes the cost consideration. Now, if I need to spend, I would say, about 5 to 10 days just to automate that problem, to write the script, to implement the integration, connect the dots and make sure the solution works well in the real world when there’s an actual incident, can it do that job? If the effort is too much, you need to reconsider whether it is worth automating that problem. But you will find a lot of use cases. The next time you step out from an individual perspective, you will see that there is toil everywhere and you can manage it. It is important to eliminate it, because if you don’t eliminate toil, you won’t have time to work on the greater good. You won’t have time to fix problems strategically.

Let’s take that example. The operator is there. He’s using that older software. And the only job that he does is restarting that service. But hey, let’s rewind the clock. Why are we doing that? We’re doing that because that software is out of date. If it’s out of date, can we transform it? Can we do an upgrade? No, the vendor doesn’t exist anymore. Okay. Can we move that application somewhere else? Possible. How much effort would that be? Let me come back to you. And that capital expense, if it’s affordable, you should go for it. But hey, you don’t have time for that because you’re too busy fixing that problem. So it is extremely important that we manage time.
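[Editor’s note] For illustration, here is a minimal sketch of what codifying that particular “restart the leaky service” toil could look like. It assumes a Linux host running the service under systemd; the service name and memory threshold are hypothetical, and in practice the check would more likely be triggered by a monitoring alert than by a local script. The strategic fix, as Amrith says, is still to patch or replace the leaking application.

```python
#!/usr/bin/env python3
"""Sketch: automate the 'robotic' restart so a human is no longer paged for it.
Service name and threshold are illustrative assumptions, not from the episode."""
import subprocess

SERVICE = "legacy-app"        # hypothetical service name
MEMORY_LIMIT_MB = 1500        # illustrative threshold

def current_memory_mb(service: str) -> float:
    # systemd reports the service's cgroup memory usage, e.g. "MemoryCurrent=1610612736"
    out = subprocess.run(
        ["systemctl", "show", service, "--property=MemoryCurrent"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    value = out.split("=", 1)[1]
    return int(value) / (1024 * 1024) if value.isdigit() else 0.0

if __name__ == "__main__":
    if current_memory_mb(SERVICE) > MEMORY_LIMIT_MB:
        # The codified 'robotic' fix; log it so the recurring toil stays visible.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
```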

This is not the industrial-age problem that we talk about, that machines will take over. Machines take over and you have no operations? That never happened. When I was in school, I was always taught that when the industrial age came, machines would replace people and people would run out of jobs. To an extent, that never happened. We always change. We are always evolving. I’m not doing the same job that I was doing many years ago. That job does not exist. It would have been automated, and I’m focusing on something else. So this is happening. We have to identify and fix that toil as much as we can because strategic initiatives take precedence.

[00:27:51] Henry Suryawirawan: I also operate systems these days. I know it’s a delicate balance: how much expense and effort is required to actually permanently fix or automate something, versus just fixing it manually? Because restarting something is so easy, right? And I like the point that you mentioned here about managing time. You have to manage your time. Don’t spend a lot of time just working on toil, because otherwise, first of all, you’ll be hiring a lot more people if your system grows larger. And the second thing is that you’re not able to invest time in doing the strategic thing, the most valuable thing, maybe for your users, for your customers, or for your business.

[00:28:29] NoOps

[00:28:29] Henry Suryawirawan: Speaking about toil, the next concept that you introduced is actually NoOps. I think this is interrelated, right? Because you want to automate away all the toil, and maybe that’s why there’s this thing called NoOps. So no ops person going there and restarting your thing. Is that correct?

[00:28:44] Amrith Raj: That is correct. NoOps, by definition, says that your operations are so automated that you don’t have to involve humans there. It is true to an extent. So I’ll explain how that would work. I’ll give you an analogy. There is a term called a medpod that has been seen in fiction, in movies. For people who are not aware of medpods, I’ve seen them in a couple of movies like Elysium and Prometheus (the movies, not the monitoring tool). In those movies, they have a bed with a glass enclosure. If you have an ailment, you lie down on it and it scans for issues. Diagnosis in the medical world is so complicated today. You’ve got to get X-rays, you’ve got to get a CT scan, do your blood test, do a nasal swab. And after those, the doctor would say, okay, this is the problem. Let me fix this problem. That’s how doctors actually diagnose the problem. Then they would say, okay, to address this issue, take these medicines. That’s one option, or we need to do surgery, or you need to do some physiotherapy, or you need to do something else. They come up with a resolution to solve that issue.

The same thing needs to be applied. Believe me, it has to happen sooner with IT operations because the complexity is exploding. It’s exploding quite a lot. There were 180 or 190 services when I started to count AWS services. Every cloud provider has a lot of services. Now, you have to use those services. When you’re using those services, go back to the original problem: your circle of influence. The circle of influence has changed. For EKS, you’re doing something. For GKE, you’re doing something. For ECS, you’re doing another thing. For App Engine, it’s much less. So your circle of influence is now changing constantly. And now what are you going to monitor? How are you going to monitor it? So the medpod, what it would actually do after a diagnosis, once it has diagnosed the problem, is immediately fix it. And then you can come out as if it’s all over. I wish that existed. It would help with global medical problems. But in the world that you and I live in, where we have systems, we have the advantage that we could use AIOps to address that problem. The whole cycle, the whole process, is NoOps. How do you automate monitoring? How do you automate quality? How do you automate response, and how do you automate your problem remediation? All those aspects have to be automated, because then you can strategically look at bigger problems, transform the application, move it to a different service, and so on.

[00:31:20] AIOps

[00:31:20] Henry Suryawirawan: Speaking about all these movies that you’ve mentioned, science fiction, right? Is this NoOps also science fiction? Can it ever be achieved?

[00:31:27] Amrith Raj: The good news is it’s not science fiction. I’ll give you certain examples that we have actually implemented in a lot of customer solutions. And it all varies. Each company, each industry, each team, each individual is different. We have to acknowledge that. The problem that we solved in one place may not be applicable elsewhere. So we have to pick it up carefully. That’s why identifying toil and all those aspects, how you classify them, is really important. Have you marked them out?

So, to give you an example of what we did. There was this company, and they had a lot of SAP workloads. For specific reasons, this company was very cost-conscious. Anything on cloud had to go through rigorous approvals. But with the benefits of cloud, you do get bill shock at the same time, because we generally like to spin things up so easily. So after certain iterations, they realized that they needed to control the cost and work out how to do that. At the same time, they had another issue where the disks kept growing and the size of the data inside those disks was exploding. Every time this happened, an engineer had to go in, expand the size of the volume at the cloud level, expand the volume at the operating system level, and once that was done, make sure the application was functioning properly. So this was the problem that was thrown to us.

What the team did was look at this problem and work out a solution. The solution was to pick up those alerts and send them to a monitoring tool; the monitoring tool would pick them up based on the tag of that instance and send them back through an API, as a webhook, to Systems Manager. This is a solution from AWS. We had two variants of it. One was Lambda. One was Systems Manager. I would highly discourage people from using Lambda because you are going to create another toil, and I’ll explain that. And then, from there, it sends commands to the server to expand the disk. First, it’ll send the command to the cloud provider to expand the specific volume, and then send the command to the server. It waits until the expansion has completed; there is sufficient logic in there, and there was a lot of trial and error. So this is an example of NoOps that you could apply in your real world. You may then argue, “Wait! Expanding the size of a drive in 2021? What are we doing?” That’s the important aspect.
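[Editor’s note] As a rough sketch of that flow, the snippet below grows an EBS volume and then grows the filesystem over Systems Manager, so nobody logs on to the server. The volume ID, instance ID, device name and filesystem type are placeholders, not values from the episode; a production runbook would add error handling, application checks and the trial-and-error logic Amrith mentions.

```python
"""Sketch of the webhook-driven disk-expansion automation, assuming an
EBS-backed Linux instance. All identifiers are illustrative placeholders."""
import time
import boto3

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")

def expand_disk(volume_id: str, instance_id: str, new_size_gib: int) -> None:
    # 1. Grow the volume at the cloud-provider level.
    ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)

    # 2. Wait until the new capacity is usable by the instance.
    while True:
        mods = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        state = mods["VolumesModifications"][0]["ModificationState"]
        if state in ("optimizing", "completed"):
            break
        time.sleep(15)

    # 3. Grow the partition and filesystem at the OS level via Systems Manager
    #    (zero touch: no one SSHes in).
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [
            "growpart /dev/xvda 1",    # device name is an assumption
            "resize2fs /dev/xvda1",    # ext4 assumed; use xfs_growfs for XFS
        ]},
    )
```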

Every company, every application, every team that you work for is different. What applies in one solution doesn’t apply everywhere. So you have to shape your process and approach by rule number one, which was identifying toil properly. Once you identify it, then you fix it. That’s close to NoOps. Now you have this job of automating it. Then you iterate on the next iteration of it. Okay, I’ve automated this part. Let me look at your alerting configurations. Am I getting a lot of noise? Are all these alerts valid? And this is where we use AIOps. AIOps are systems, and these are real systems, that work heavily on automating the monitoring part of it. Think about the Kubernetes cluster we like to talk about because of the layers of abstraction that it has. It’s one of the things that I can imagine has a lot of abstraction layers between the cloud provider and us. You have various workloads, and now you need to monitor the nodes, the pods, the containers. That’s just the infrastructure side of things. Then you want to look into the application part of it. You don’t know what application is hosted on that cluster. You have multiple web frameworks being used. You may have multiple web technologies being used. Then your apps are being used from a front-end, which is mobile based. You have an app for that. And then you have a web version of that. So the complexity there is exploding. You can just imagine how difficult that is. If I go in as an operator and say, okay, you want me to monitor this, where would I start? You have so many components and you want to monitor all of them. Monitoring is not just about collecting data. You want to collect metrics, logs, and traces. Okay, I’ll collect them. Then, what am I going to do with that? That’s the challenge that we have with monitoring.

And this is where an AIOps platform addresses this, by having an agent or something that is deployed. It would understand the environment, understand the topology completely. It needs to understand the topology from the front-end all the way back to the node level, and understand the relationships. And then, in the next stage, it automatically baselines all this information. When should I let you know that there is a CPU increase? When should I let you know that there are more HTTP 500 errors? When it happens, what am I supposed to do? Am I going to send you an alert? Or do I need to tell you what went wrong? Can I tell you that the 500 errors are happening because you have a database connection issue? So you talk about the symptom as “I’m getting 500 errors”, and then you talk about the cause. The cause is that the database is refusing connections. This way, you can now just address the problem. You find out why the database is refusing connections, and then optimize it, change it.

So this is the way in which an AIOps solution works. It’s not an aspirational goal. This is happening in the industry. We have workloads that are doing it, providing root cause analysis and putting it in front of us. It says, this is how it happened. And then there is the next stage of it. The next stage is: okay, now I know this is the problem. You go on to the problem remediation phase, where you remediate that problem with automatic solutions. You have self-healing applications. You’d say, “This is the problem. Address that problem.” That’s how an AIOps system works in the real world, in IT today. And these solutions are available.

[00:37:31] Henry Suryawirawan: So I haven’t personally used these AIOps platforms or solutions. Sometimes some observability providers mention, yeah, we have AI and all that. I’m not sure how sophisticated that AI is. But from your explanation just now, it seems to help a lot in identifying not just the symptom, which is the problem that we see at the surface, but also the root cause. Because one of the challenges of IT operations these days is that someone reports a problem and, like you mentioned, we don’t know where to start. Some of us have so much data, and some may not. But even when we have all the data available, we still do not know where to start. We have a lot of charts. We have a lot of logs. We have to go through the system and find it. So it seems like this AIOps is able to automate that part: looking at the symptoms and correlating them with something that is failing as the root cause. Is that correct, how it works?

[00:38:20] Amrith Raj: Yeah, that is correct. In my past years of experience, I’ve worked on a lot of monitoring solutions. I’ve worked on open source solutions, and then came my exposure to Dynatrace. What Dynatrace does is discover the software and collect all the information about the dependencies. The first thing is to understand, to get the data. So it starts with installing the agent. There is OneAgent that gets installed on the server, at the host level. The job of the agent is to scan everything. What is it talking to? What technology is it running? Then it looks at the topology mapping. Then it detects: okay, I am inside a Kubernetes cluster. Cool. How many pods are there? This many pods. How many containers are running? This many containers. At the next level up, it detects that there is a Node.js application running, there is a React application running. Cool. That means it’s a web application. At the next stage, it detects from the front end that there is a user coming in. So the agent does the majority of the work, and this is automatic monitoring (instrumentation).

The next stage is that the agent reports to Dynatrace, and Dynatrace automatically baselines these metrics. Baselining is extremely important, and I’ve seen this at a lot of companies I’ve gone through. I don’t know if you remember, last December there was an outage in Google. The authentication system was down. When I looked at the root cause of it, a storage quota had gone to zero. So these are scenarios. Scenarios are something we have to plan ahead for and think about: what could potentially go wrong? Once you define these baselines, you realize, okay, this is how it is. Now, at what point should I alert? It should understand when there is an anomaly. An anomaly is not exactly a single CPU spike. That doesn’t make sense. It needs to know that a week before, there were a lot of users. Today is Monday, and every Monday at nine o’clock, for some reason, and for good reasons, the CPU goes up. But this time it has blown over the fence, and there is an issue. That is an anomaly. An anomaly is a massive change in the environment. And this is where Dynatrace picks that up and sends that information back to your front-end, that there is some anomaly.
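[Editor’s note] To make the baselining idea concrete, here is a deliberately simplified, hand-rolled sketch. Platforms such as Dynatrace do this automatically and with far more sophisticated models; the (weekday, hour) bucketing and the three-sigma rule here are illustrative assumptions only, meant to show why a regular Monday-9am spike is not an anomaly while a value that blows over the fence is.

```python
"""Toy 'baseline first, then flag anomalies' illustration (not a real AIOps engine)."""
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(samples):
    """samples: iterable of (weekday, hour, cpu_percent) tuples from past weeks."""
    buckets = defaultdict(list)
    for weekday, hour, cpu in samples:
        buckets[(weekday, hour)].append(cpu)
    # Keep mean and spread per time slot; a spread needs at least two points.
    return {slot: (mean(v), stdev(v)) for slot, v in buckets.items() if len(v) > 1}

def is_anomaly(baseline, weekday, hour, cpu, sigmas=3.0):
    """Alert only when the value is far outside the usual band for that slot."""
    if (weekday, hour) not in baseline:
        return False                      # no history yet: don't page anyone
    avg, spread = baseline[(weekday, hour)]
    return cpu > avg + sigmas * max(spread, 1.0)
```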

Now, the second stage is that you have a lot of things happening at the same time. There are HTTP 500 errors. That’s one event. Your database is having issues. Another event. The CPU is spiking. Another event. All these events, if they are there, would tend to send different alarms to different teams. That is another problem. The noise has to be suppressed and condensed into something that is much more meaningful. The meaningful alert would look like: HTTP errors seen from the front-end, this many users affected, and the impact is this many users plus these other two applications that were affected. Then it goes to the next stage and tries to find out what the problem is.
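[Editor’s note] Here is a similarly simplified sketch of that noise-suppression step: collapsing a burst of related events into one combined problem instead of paging several teams separately. The event shape and the pure time-window grouping are assumptions for illustration; a real platform groups by topology and causal order, not just by time.

```python
"""Toy illustration of condensing simultaneous events into one meaningful alert."""
from datetime import timedelta

def group_events(events, window=timedelta(minutes=5)):
    """events: list of {'time': datetime, 'entity': str, 'title': str}, sorted by time.
    Returns one combined problem per burst instead of one alarm per event."""
    bursts = []
    for event in events:
        if bursts and event["time"] - bursts[-1]["last_seen"] <= window:
            bursts[-1]["events"].append(event)
            bursts[-1]["last_seen"] = event["time"]
        else:
            bursts.append({"events": [event], "last_seen": event["time"]})
    return [
        {
            "summary": f"{len(b['events'])} related events, starting with: {b['events'][0]['title']}",
            "affected": sorted({e["entity"] for e in b["events"]}),
        }
        for b in bursts
    ]
```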

Identifying what went wrong is like solving a mystery. You have to find out what went wrong there. Imagine you are a security person in a large building and you’re tracking a person coming through. This building is a skyscraper, about 20 or maybe a hundred floors. He’s coming through each and every point, and you’re trying to find out all the places he went, through your CCTV cameras. And that’s the trace that you’re doing with request tracing, also known as distributed tracing. You’re finding out where my request is going. And then you find out that this person went to this floor, and then he goes into this room, and then something happens. He spent so much time there, and that gives the information back. So that is solving the mystery. And this is the difference when you think about causation and correlation. The man went across all these rooms at certain given points of time.

Another example of why correlation alone is something we shouldn’t rely on: if I step out of my house and see that there is a big dent, a big pothole, in the driveway, and there is a big fat cat sitting in front of it, and that pothole did not exist last night, should I assume that the cat did that? That would be incorrect. Instead, I have to use my CCTV camera and see what went wrong that day. Was it a rock? Or did I have a hailstorm? What happened? And you uncover those mysteries. This is where the topology information that Dynatrace picked up is used. It is able to find out that this happened first, the next event happened next, and the following event happened later on. Now it is trying to correlate these and find out the exact cause of it. So you provide the exact cause through causation and not just by correlation. There is an in-depth article about how Dynatrace does this job. This is what AIOps should look like. I would say, in terms of the maturity level of AIOps, this is one of the highest levels of maturity, but that’s not enough. In fact, because these are running systems, you need to go even further and make sure that these bugs do not get into production in the first place. So that is what an AIOps solution looks like. It’s actually available for us all to consume and use.

[00:43:30] Henry Suryawirawan: Thanks so much for sharing all these analogies and trying to explain it as clearly as you could. From the explanation itself, I can tell that there are a lot of complexities, especially if you are working in IT operations. You mentioned HugOps, right? We have to appreciate the job of IT operations. These days, systems are getting more and more complicated, with a lot of abstractions as well. So sometimes when outages happen, it’s not that IT is not doing its job. It’s just that finding the root cause, first of all, is difficult. So hopefully the AIOps platform will also integrate nicely with NoOps. If you have the NoOps automation in place, the AIOps could probably even trigger this NoOps, or maybe you can build your application in a self-healing manner in which it can recover. So thanks for sharing that.

[00:44:16] Tips on AWS Lambda

[00:44:16] Henry Suryawirawan: Speaking of what you mentioned just now about implementing things in Lambda, if you still remember, why is it not something that people should do? Because I think some people tend to use Function-as-a-Service to solve some of these automations. So maybe you can shed some light there.

[00:44:30] Amrith Raj: Yes. So consuming Lambda is so easy, right? This is debatable, and I may stir up some controversy, but this is my view. Lambda often ends up as a duct tape solution. A feature is missing? Okay, I'll solve that with Lambda. This other feature is missing? I'll solve that with Lambda too. In our environment, we had hundreds of Lambda functions created by random users. Everybody was saying, I want this Lambda because I'm doing this and that. What about your framework? Did you apply some framework? No, that doesn't apply; this is not part of a web framework. Okay, what is it part of? It's part of the automation that we talked about. Okay, how did you name it? I just used a naming convention. Okay, how many times has it been invoked? Let me check and come back to you. Oh, in fact, it has never been invoked. The problem is that it gets harder and harder to manage these functions.
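
That "how many times has this been invoked?" question can be answered without guessing. Below is a minimal audit sketch (an editorial addition, not from the episode) that lists every Lambda function in an account and region and sums its invocations over the last 30 days using the standard AWS/Lambda CloudWatch metric. It assumes boto3 is installed and AWS credentials are configured.

```python
# Audit sketch: flag Lambda functions that were never invoked in the last 30 days.
from datetime import datetime, timedelta, timezone

import boto3

lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)


def invocation_count(function_name):
    # Sum the per-day invocation datapoints for this function.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=start,
        EndTime=end,
        Period=86400,          # one datapoint per day
        Statistics=["Sum"],
    )
    return int(sum(dp["Sum"] for dp in stats["Datapoints"]))


for page in lambda_client.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        count = invocation_count(fn["FunctionName"])
        flag = "  <- never invoked in 30 days" if count == 0 else ""
        print(f"{fn['FunctionName']}: {count} invocations{flag}")
```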

Now, the good thing about Lambda is that it doesn't cost much. There is no cost while it sits idle; you only pay when it gets invoked. Having a Lambda is not like having a disk volume; it's not going to consume storage. But if you're using it for NoOps, you should plan it properly with playbooks and runbooks. Playbooks are what you follow when there is an incident, and runbooks help you execute a standard operating procedure. Those are driven by the playbooks, and you will find they are available through your configuration management tools like Ansible and Puppet, or in AWS through Systems Manager. With Systems Manager, you can create these playbooks, define those runbooks, and specify exactly how you want the execution to happen. There are even Step Functions that you could use.
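
As a small, concrete sketch of what driving a runbook through Systems Manager can look like (an editorial example, not from the episode), the following starts an AWS-managed Automation document via boto3. The instance ID is a placeholder, and it assumes credentials with permission to run the automation.

```python
# Minimal sketch: start a Systems Manager Automation runbook instead of
# wiring up yet another ad-hoc Lambda function.
import boto3

ssm = boto3.client("ssm")

response = ssm.start_automation_execution(
    DocumentName="AWS-RestartEC2Instance",                 # AWS-managed automation runbook
    Parameters={"InstanceId": ["i-0123456789abcdef0"]},    # placeholder instance ID
)
print("Automation execution started:", response["AutomationExecutionId"])

# The execution, its inputs, and its history can then be tracked centrally in
# Systems Manager, which is the point: the runbook lives in one place.
```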

But if you are on the operations side of things, try to use Systems Manager. Then you have a centralized place where you create these runbooks. You don't want to create additional toil by managing all these Lambda functions. That's hard. If you decide that Lambda is the right way for your automation, build some standards around it. Have a proper naming convention for your Lambda functions. How are you going to deploy them? Use GitOps. Maintain your Lambda code in a centralized repository and create those Lambdas through GitOps. Build those things; don't do it manually. That's old school, and it's not going to work out. You may solve the problem, but please be ready to explain that Lambda function to the next engineer. That is a real problem, and we have to address it. Have some standards. It's as simple as that.

[00:47:04] Henry Suryawirawan: Thanks for sharing your battle scars. I know some people might think, okay, let me just create a Lambda, a function as a service, in whatever cloud provider they use. So thanks for sharing this battle scar. I think it gives people a very good consideration before going that way. And if it's a solution that works for you, then go ahead; just do some standardization, follow conventions, and make sure you are able to manage it. Don't let it grow too large.

[00:47:28] Amrith Raj: Yeah. So it's about how you explain it. Whenever we build a solution, we should be able to show how well we know it. The confidence in the system should go up, not down. If I build a solution, I should be confident enough to explain it: this is how it works, and this is why. I should not have to check a manual to find out how it was done. The only way operations engineers can truly succeed is by being very confident in the way they build these systems. Within your circle of influence, you should know things in and out as much as you can. Then, when you share that information with somebody else, they'll say, ah, this is a piece of cake. Now I know what you're talking about; it makes sense why you did this, and I understand the rationale behind it.

[00:48:12] 3 Tech Lead Wisdom

[00:48:12] Henry Suryawirawan: Thanks, Amrith. It's been a really exciting talk; we covered a lot of things about IT operations. Unfortunately, it's coming to an end. But before I let you go, I normally have one question that I always ask my guests, which is to share your three technical leadership wisdom. This is advice that you want to give to all the listeners here to learn from your journey.

[00:48:31] Amrith Raj: Sure. As children, we were all curious, very curious about finding out what is going on. I relate this to myself: we tend to lose some of that curiosity for some reason, and I think that is something we have to correct. If you're curious, stay curious as much as you can, because there will be a moment when you realize that whatever you learned in the past will play a role in the future. If there is a technology somebody is talking about, make notes of it, write it down. You may not solve that problem tomorrow, but there will be an 'aha!' moment.

The second tip is that you have to be a leader in your own space. Whether you are a junior developer or an engineering manager, be the best junior developer or the best engineering manager in the world. Nobody's rating you, but you've got to be the best at what you're doing, starting with rule number one, which was curiosity. The role of a leader is to solve the problem, and you will solve the problem when you're curious, when you try to address it. So specialize, master, and excel in that job. And if you don't know where to start, again, go back to rule number one: be curious.

The third tip, I would say, is that we have to stay happy and contented. Whatever we're doing, we have to find a human connection to the work. For example, you may be working on a legacy technology that is not in demand at all, but you see that the technology you're working on is used in a hospital, and doctors and nurses are using those apps every day. So find that relationship to the work that you're doing. Have that human connection. I feel that when you find that connection, it really helps. That curiosity and leadership you build will help you find the next spot and see how you can improve things, how you can make things better for yourself, for the community, for the greater good. Those are the three tips that I would recommend.

[00:50:20] Henry Suryawirawan: Lovely! I think it's really beautiful how you phrased it. Start by building your curiosity, take leadership in whatever role you do, and at the end of the day, be happy and fulfilled in the role that you're doing by finding the human connections and how your work benefits greater humanity.

So thanks again, Amrith, for your time here. It's been very encouraging, and there's a lot of learning here about AIOps and NoOps. I learned a lot of things, and I hope one day I'll be able to use this kind of platform. For people who want to follow up more about these concepts, where can they find you online?

[00:50:52] Amrith Raj: The best way is LinkedIn. You can find me on LinkedIn; I usually respond, even to strangers, absolutely. Also Twitter, and Amrith.me, which is my website. I don't post a lot of things, but it's there if you want to find the links on how to connect with me.

[00:51:07] Henry Suryawirawan: Thanks again, Amrith. Good luck with your journey with AIOps and NoOps, and with all the IT operations in this world.

[00:51:14] Amrith Raj: Thank you Henry, for having me again. It’s my pleasure.

– End –