#225 - Driving Engineering Excellence with Platform Engineering and IDP - Ganesh Datta
“We bucket engineering excellence into four pillars: reliability practices, security practices, efficiency practices, and velocity practices.”
Is your engineering team running like the wild, wild west? What does engineering excellence look like in practice?
In this episode, Ganesh Datta, co-founder and CTO of Cortex, explores what it takes to achieve engineering excellence. Ganesh shares lessons from his own journey, from early bug-fixing to building a company focused on engineering excellence.
We discuss how platform engineering and internal developer platforms (IDPs) can help teams scale, improve reliability, and align with business outcomes. Ganesh also explains why culture, leadership, and clear metrics matter more than any single tool.
If you’re looking to make your engineering team a true business driver, this conversation is for you.
Key topics discussed:
- How to define engineering excellence and why it’s tied to business outcomes.
- The critical role of leadership in connecting engineering initiatives to business values.
- When to invest in platform engineering and internal developer platforms as your team grows.
- Common misconceptions about platform engineering.
- The importance of clear metrics, shared language, and transparency for continuous improvement.
- Building a culture that supports operational excellence through rituals and repeated messaging.
- Real-world examples of using generative AI to accelerate platform adoption and incident analysis.
Timestamps:
- (01:56) Career Turning Points
- (07:50) The Practice of Finding the Patterns in Issues
- (11:39) The Definition of Engineering Excellence
- (17:10) The Leader’s Role in Engineering Excellence
- (22:31) Aligning Engineering Excellence with the Business Outcomes
- (26:30) The Importance of Metrics in Engineering Excellence
- (33:35) The Culture that Drives Engineering Excellence
- (39:05) Platform Engineering and Internal Developer Platform
- (45:02) The Biggest Misconception of Platform Engineering or IDP
- (50:36) Cortex as an Engineering Excellence Platform
- (52:39) Generative AI Use Case in Platform Engineering
- (55:26) 3 Tech Lead Wisdom
_____
Ganesh Datta’s Bio
Ganesh Datta is a co-founder and CTO of Cortex. Before co-founding Cortex, he was a Principal Software Engineer at Mission Lane, where he was responsible for driving the development of real-time underwriting infrastructure.
At LendUp, Ganesh was a Senior Software Engineer leading the development and optimization of the company’s decisioning infrastructure and financial account management system. Ganesh holds a Bachelor of Science in Computer Science from the University of California, San Diego.
Follow Ganesh:
- LinkedIn – linkedin.com/in/gsdatta
- Twitter – x.com/gsdatta
- Website – www.cortex.io
- Email – ganesh@cortex.io
- Join Ganesh & Cortex at IDPCon in NYC – idpcon.com
Mentions & Links:
- Engineering excellence vs developer experience – https://www.cortex.io/post/engineering-excellence-vs-developer-experience
- Platform engineering – https://en.wikipedia.org/wiki/Platform_engineering
- Coding assistants – https://devops.com/10-key-features-of-ai-code-assistants/
- Internal developer platform – https://www.atlassian.com/developer-experience/internal-developer-platform
- Engineering Intelligence – https://www.cortex.io/post/engineering-intelligence-platforms-definition-benefits-tools
- P90 – https://baselime.io/glossary/p90
- P95 – https://www.dragonflydb.io/faq/what-is-p95-latency
- Mean-time to repair (MTTR) – https://en.wikipedia.org/wiki/Mean_time_to_repair
- Customer relationship management (CRM) – https://en.wikipedia.org/wiki/Customer_relationship_management
- North Star metric – https://amplitude.com/blog/product-north-star-metric
- Terraform – https://en.wikipedia.org/wiki/Terraform_(software)
- Helm – https://helm.sh/
- Pulumi – https://www.pulumi.com/
- Spacelift – https://spacelift.io/
- Crossplane – https://www.crossplane.io/
- Kratix – https://www.kratix.io/
- Cortex.io – https://www.cortex.io/
- Datadog – https://www.datadoghq.com/
- Salesforce – https://en.wikipedia.org/wiki/Salesforce
- ServiceNow – https://www.servicenow.com/
Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference and will be delivered safely to you anywhere in the world where shipping is available.
Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive any of it.
Career Turning Points
-
When I started at this company, I was working on a team with an initiative to break the first microservice out of the monolith. It was interesting because, at the time, that particular project was successful but kind of struggling. A lot of things were broken, lots of bugs, a lot of bug fixing.
-
It was a turning point in my career because my manager’s coaching helped a lot. He said, “Hey, don’t look at these as just bugs. If you wanna become a very senior engineer one day, the whole point is to be able to look at these things and say, what are the patterns we’re seeing here.” That framing really, really helped. I was like, okay, I’m gonna stop thinking about them as bugs and start thinking about them as patterns of issues in this particular microservice.
-
Eventually, I saw a ton of patterns and got the opportunity to propose a new architecture for the way we were computing account management data for credit cards and loans. It was great because having had that personal pain of going through all these different bugs, I knew exactly the impact it would have to do this re-architecture. I was able to articulate a very clear business value, the types of bugs we were running into, and the customer impact. And so we got to take on that project.
-
Another thing I got to do while I was there was a lot of performance work. Small issues kept popping up, and it was an opportunity for me to be much more proactive: without anybody asking, I would dig through Datadog for our highest P90s and P95s and ask, “What are the bottlenecks we’re seeing? And can we start attacking those more aggressively?” In my spare time, I would go and try to fix these issues.
-
The last one was really investing time and energy into how I could take all the things that we learned and share them across the organization. It came from personal pain. I’d get paged, go to an upstream or downstream service that had an issue, and the logs looked totally different, the metrics were named completely differently. It sucks to manage these services.
-
So I took it upon myself to say, okay, we’re gonna standardize this thing. I put together what I eventually realized was a production readiness checklist, but at the time it was just a list of best practices, standards, and logging formats. Nobody was following these practices, so I decided to make it really easy to do these things. I built standard libraries that would pre-configure request tracing and logging, and then we built cookie-cutter templates to spin up these services. (A rough sketch of the library idea appears at the end of this section.)
-
Without realizing it, I was doing this grunt-work, dev-experience type of thing. It was an example of how, if you can make your peers’ lives better, you will earn that respect and the right to take on bigger and bigger projects. Don’t wait for permission for a lot of these things. You can make things better.
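To make that concrete, here is a minimal sketch of what such a shared library might look like, assuming a Python stack. The module name, JSON fields, and `X-Trace-Id` header are hypothetical illustrations, not the actual library described above:

```python
# standards.py - a hypothetical shared library that pre-configures
# structured logging and request tracing, so every service emits the
# same log format and propagates the same trace header.
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Format every record as one JSON object with a shared schema."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service_name,
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def configure_logging(service_name: str) -> logging.Logger:
    """One call at startup gives a service the org-standard log setup."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service_name))
    logger = logging.getLogger(service_name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def trace_headers(incoming: dict | None = None) -> dict:
    """Reuse the upstream trace ID if present, else start a new trace,
    so logs from upstream and downstream services can be correlated."""
    incoming = incoming or {}
    return {"X-Trace-Id": incoming.get("X-Trace-Id", str(uuid.uuid4()))}
```

A cookie-cutter template would then call `configure_logging()` in every new service’s entry point and pass `trace_headers()` on outbound requests, so no individual team has to remember the conventions.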
The Practice of Finding the Patterns in Issues
-
It’s important to be very systematic about it. If we’re a marketing team trying to find which cold emails work best, we’re not just gonna look at our emails and pray. We’re gonna send out three different types of emails and look at the response rates. You’re very systematic about it. Why not do the exact same thing for engineering practices?
-
How would I assess my own view of the world in some sort of meaningful way? It doesn’t have to be perfect. If you start with some simple categorization of these things, as you go through it, you will naturally realize, “Hey, there’s actually another class of things that I didn’t think about.” Let me go back and iterate on the patterns I wanna match against. Starting with something rudimentary is really important.
-
The other thing to do is, now with LLMs, you can take a list of your Jira tickets, put it into an LLM, and say, “Hey, tell me what you see here.” It’s much easier now than it was 10 years ago. But it’s the ability to go through and say, if you’re on a product team, can you categorize the types of things you’re working on into the highest-level abstraction? (A rough sketch of this follows at the end of this section.)
-
What is the highest level of abstraction possible before you go really low? By starting at the biggest abstraction layer and whittling it down to something more and more granular, you can then work your way back up and see how things relate. Going through that systematic exercise is really important.
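As a rough sketch of the LLM idea mentioned above, assuming the OpenAI Python SDK as the client (any LLM client would do), with an illustrative model name and prompt:

```python
# Sketch: ask an LLM to bucket raw ticket summaries into high-level
# patterns, automating the categorization exercise described above.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()


def categorize_tickets(ticket_summaries: list[str]) -> str:
    """Return the model's grouping of tickets into named patterns."""
    prompt = (
        "Here are bug/ticket summaries from one service. Group them "
        "into the smallest set of high-level patterns you see, name "
        "each pattern, and list the tickets that fall under it:\n\n"
        + "\n".join(f"- {summary}" for summary in ticket_summaries)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you have
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example (hypothetical tickets):
# print(categorize_tickets([
#     "Timeout calling loans service from account summary",
#     "Stale balance shown after repayment",
# ]))
```

Treat the output as a starting point for the iteration loop: sanity-check the buckets, refine them, and rerun.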
The Definition of Engineering Excellence
-
To us, engineering excellence basically means the alignment of engineering practices with business outcomes. Of course, every engineering organization needs to build features and products—that’s table stakes. But it’s the practices that we adopt as an engineering organization that lead to the outcomes that we care about.
-
Going back to the sales and marketing analogy, sales excellence is: given an account, given an opportunity, how likely am I to close it at what dollar amount? That’s the obvious business outcome. There are practices I can do to achieve that. Am I asking the right questions? Am I talking to the right personas? Am I using the right decks? Am I helping them see the value of the product properly? All the right practices make it more likely for a customer to see value and buy the product.
-
Sales organizations have been doing this for at least a century. It’s a very well-studied thing, like bookkeeping and tracking your CRM. This idea of, “Hey, if we can build a high-functioning machine and really operationalize the way we do sales, we’re gonna drive more revenue.” Engineering is slowly coming to that same realization of, “Hey, if we build the right systems and the right processes in the way we do engineering, we will get those business outcomes we care about.”
-
When I think about the business outcomes that engineering excellence ties into, they tend to bucket into three things. The first is time to value, time to market, and innovation. Are we innovating as much as we can? Are we getting our product into the market as quickly as possible? The second is cost efficiency. Are we managing our costs and being as efficient as we can? Are we doing more with less? And the third is product quality and customer reliability. Are we delivering a world-class product? Is it reliable? Is the customer happy with it?
-
Depending on where you are in your company journey and what type of company you are, the outcome you care about might be totally different. As a startup, at the very beginning of the Cortex journey, all we cared about was time to market. You almost don’t care about quality; it’s gonna break. But we wanted to see what works and what people care about, so let’s move really quickly. Then from there, it became, “Okay, well now we have a ton of customers. They really care about reliability. We wanna build a great customer experience.” Time to market is still really important. We don’t wanna lose that. But it’s an ebb and flow.
-
But you talk to a larger organization, maybe they care about cost and efficiency. “Hey, we hired a thousand engineers in the last three years, and there’s a lot of pressure from investors and we gotta focus on cost.” Maybe a company’s trying to go IPO and they need to get their costs in order. There’s some outcome that your processes can relate to. And so engineering excellence is the connection of engineering process to those outcomes.
-
Underneath that, we bucket engineering excellence into four pillars: reliability practices, security practices, efficiency practices, and velocity practices. Those four pillars span the entire SDLC.
-
When you think about developer experience, for example, it’s about the tools for a developer in their day-to-day work. It’s very localized. Engineering excellence is more broad. It’s from architecture review process to incident management to security. And developer experience is part of it.
-
So: think about the business outcome first, then about the part of the engineering process I want to change to drive that outcome, and then about how I’m gonna do it. That, to me, is what engineering excellence is. It’s our practices in service of business outcomes.
The Leader’s Role in Engineering Excellence
-
It’s extremely, extremely important, because if the teams working on engineering excellence initiatives don’t know how their work ties to a business outcome, it doesn’t go anywhere.
-
A lot of organizations in the last two years have spent a lot of time investing in platform engineering. And platform engineering should, in theory, enable quite a few outcomes. It should help you go to market faster. If you build the right platform, your developers are more autonomous, but they’re also following your best practices. Your platform is secure, it’s modernized, so you’re being efficient, you’re using your resources effectively. So at face value, platform engineering should be a great way to drive engineering excellence.
-
But platform engineering, especially in large organizations, can sometimes become this corner of the organization where it’s not really tied to leadership, or the leadership is not able to articulate to that team how their work ties into the broader business outcome.
-
It’s really important for an engineering leader to say, “Hey, this quarter or this year or this half, the thing that we really care about is business outcome X. We wanna move to market faster, or we wanna improve security,” or whatever that is. And interrogating each set of initiatives as to how exactly this initiative ties back to that. And can I then tell that story to the rest of the organization?
-
As an engineering leader, also challenge your teams to steel-thread. So rather than saying, “We’re gonna build the whole platform, we’re gonna do this whole thing and then see the business outcome,” it’s, “No, choose a business outcome. What is the number one thing blocking that today? And what is the bare minimum slice across the entire stack that I can deliver to move that needle just a little bit forward?”
-
When I think about reliability, if MTTR is top of mind and I see my CEO saying, “Hey, we have too many incidents. We’re paying too much in SLA fees. We need to improve our incident costs,” I’m not gonna go and rebuild all of our services and fix all of our tech debt all at once. It’s, “No, what is the number one thing that’s stopping us?” Okay, we’re spending too much time fighting fires. Okay, why are we spending too much time fighting fires? Well, there are like 10 reasons, but the number one reason is we don’t know which teams are in the critical path for these key outages. And so we spend 30 minutes per incident getting the right people in the room. Let’s take that one level further. What is the next thing we can do to standardize that? And so on. Now up and down the entire stack, you have a single thread.
-
As an engineering leader, it’s really important to maintain that focus and challenge the teams. What is the bare minimum thing? If a product leader is doing that for the product, engineering excellence is the engineering team’s product in some sense. It’s the day-to-day practices and the tools and the stuff we’re building for those engineering and business outcomes. We should have that same product management mindset of, what is the minimum viable or minimum lovable process or tool that we can build for this particular issue?
-
Carve out small slices of value that drive the business outcomes, tell that story more broadly across the organization, make sure everyone’s bought into the business outcome, and hold teams accountable to driving towards something very clear rather than just doing science experiments all the time. That way, the next time you wanna do one of those things, your CEO’s bought in too. You’ve created the feedback loop.
-
As an engineering leader, it’s your duty to bring those business outcomes from the business to engineering. And then use the same business outcome to tell the story back to the business of, “Hey, here’s how our investments impacted the business.” As an engineering leader, you’re gonna be thinking about both sides of that equation.
Aligning Engineering Excellence with the Business Outcomes
-
It’s understanding what the North Star is. We should understand either what metric we’re trying to move within our organization, or which business metric we’re trying to move. It doesn’t have to be a business outcome necessarily.
-
If you are on a hands-on team, in security or something like that, buying a new tool, you can still tie it back to one of those pillars. And as long as you can, your engineering leader should be able to tell that story to the business.
-
So it’s not just, “Oh, I’m creating a cool experience for my developers.” Does the CEO really care about that? Maybe. But fundamentally, if you tell me, “Hey, not only are my developers gonna be happy, I’m saving us X weeks per service, I’m increasing our security posture and our incident rates should go down,” that’s a much easier way to make that case.
-
Coding assistants are the same thing. Can we measure their impact?
-
I had a customer recently tell me, “Hey, why are you making such a big fuss about measuring the impact of 50 bucks a month? You wouldn’t ask, ‘Does my developer really need 16 gigs of RAM versus eight?’ Or, ‘Do they really need IntelliJ? Can’t they just use Vim?’” It’s obviously one of the tools in their toolkit. You wouldn’t question a developer who’s trying to get an IntelliJ license or something like that. So why are we making a big fuss about spending 50 bucks on a tool when it’s just part of your general toolkit: laptop, IDE, terminal, coding assistant?
-
And so the same thing should hold true for your engineering excellence platform as a whole.
The Importance of Metrics in Engineering Excellence
-
That’s a very, very important question. It is a must, and I’ll explain why. Going back to the sales analogy again, imagine if a sales leader said, “I don’t need Salesforce, I don’t need a CRM. We’re just gonna sell and hope for the best.” You would fire that sales leader tomorrow. Any sales leader who doesn’t have a CRM, who isn’t using that data to make judgements about the organization and where to invest, is gonna be laughed out of the room. Any data leader that doesn’t have data tooling is gonna be laughed out of the room.
-
So why is it that engineering leaders are not treated the same way? Why is it that for engineering leaders, we don’t say, “Of course, I need a system of record”? Every single other function has a system of record. Sales teams have CRMs, marketing teams have marketing automation, finance has ERP, IT has CMDBs and ITSM tools like ServiceNow. But what does engineering have? We just throw caution to the wind. Do whatever you want. Every team is just out there; it’s the wild, wild west. And it’s really insane if you think about it. Every other function is treated so differently. And engineering is the most expensive organization. We should be much more systematic, data-driven, and methodical about the way we build our engineering organizations.
-
What I would challenge, though, is that it’s not just about the metrics. Imagine if all Salesforce did was fancy reports, just a funnel dashboard or something. You wouldn’t need that either. What you need is an end-to-end platform that helps you understand your software ecosystem. So it’s accountability. Given a service, which team is accountable for it? Where’s it deployed, where’s it running? How many vulnerabilities does it have? What is the code coverage like? What is the on-call rotation like? Where can I collect this data? How do I use that data to drive best practices?
-
For example, I can use a CRM to drive practices in sales. Before you start a POC with a customer, you need to fill out all these fields. You need to make sure you understand if they have a budget. You need to make sure you understand who’s gonna buy it. The idea is that you’re using that data to drive some sort of behavior in the daily workflow of an AE (account executive). So you’re using it to drive practices.
-
So it’s not just about the dashboards. The metrics side of the equation for an engineering team should be, A, having a way to store all this data in a meaningful way because it can’t just be metrics. It needs to be the actual data structure on which I’m reporting. To us, the way we think about it is the service catalog, the infrastructure catalog, a catalog of all this information, just like a CRM.
-
The second part is the ability to drive behavior around it. So, defining what good looks like. This is where I need teams to be. Can we create a shared language as to what good looks like? Because I can say your repo has 30 vulnerabilities. Okay. Is that good? Is it bad? Does it matter whether they’re high or low severity? What does that actually mean? It doesn’t mean anything for me to tell you you have 30 vulnerabilities. There needs to be some distinction for our organization.
-
What does good look like? Okay, for a financial services company, zero critical vulnerabilities is great. Zero critical vulnerabilities, resolved within seven days, is excellent. And zero critical vulnerabilities resolved within seven days, plus medium vulnerabilities resolved within 60 days, is top of the line. That’s what we wanna aspire to. Now not only do we have a metric, we have a graduated path for getting better. So the platform should be able to tell you that. (A hypothetical sketch of such graduated levels follows at the end of this section.)
-
And the final bit is the reporting of engineering metrics. Based on the data that we have, based on the practices that we’re following, are we seeing the impact on the data? So cycle time and DORA metrics and things like that. Those things can’t live in isolation. You can’t go and tell a developer, “Improve your MTTR.” I don’t know where to start. Instead it’s, “Hey, we’re using MTTR as a measurement. And we believe that in order to improve MTTR, we need to change these five practices. So we’re gonna go change those five practices, and then come back and see if the MTTR changed.”
-
The engineering metrics are not about measuring productivity or checking whether you’re taking advantage of your teams. They’re about hypothesis generation and hypothesis testing. It’s all about continuous improvement, which we consider one of the pillars of engineering excellence. It’s not good versus bad; it’s “I’m better than I was yesterday.” And that’s a very important part of engineering excellence.
-
Engineering metrics should be used to say, “Today our cycle time is X, we want to move faster. The time to first review is really long. Our hypothesis is that the CI build time is taking up a good chunk of that effort. By reducing our CI build time, our cycle time will go down, which means that our throughput will go up.” That’s the hypothesis. And then you go and you say, “Okay, well, how do we change our build time?” Then you run an initiative, and then you go back to your metric and say, “Didn’t move the needle.” So the engineering metrics are a way to create those hypotheses and test them. But if you’re not doing that, how do you know if your efforts are actually moving the needle, if you’re actually making an impact? So engineering metrics are very, very important. The data under the hood is very, very important. And the ability to set standards and drive behavior is also really important. It’s all those things together that create the flywheel of a data-driven, high-visibility, high-trust, engineering culture that is focused on excellence.
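As a hypothetical sketch of the graduated “what good looks like” levels described earlier in this section (an illustration only, not Cortex’s actual scorecard format; the names are invented and the thresholds mirror the vulnerability example):

```python
# Sketch: encode graduated security standards as code, so a service
# always knows its current level and the next step toward "good".
from dataclasses import dataclass


@dataclass
class ServiceSecurityStats:
    open_critical_vulns: int       # unresolved critical vulnerabilities
    worst_critical_fix_days: int   # slowest recent critical fix, in days
    worst_medium_fix_days: int     # slowest recent medium fix, in days


def security_level(s: ServiceSecurityStats) -> str:
    great = s.open_critical_vulns == 0
    excellent = great and s.worst_critical_fix_days <= 7
    top = excellent and s.worst_medium_fix_days <= 60
    if top:
        return "top of the line"
    if excellent:
        return "excellent"
    if great:
        return "great"
    return "needs work"


# 0 open criticals, criticals fixed in 5 days, mediums in 90 days:
print(security_level(ServiceSecurityStats(0, 5, 90)))  # "excellent"
```

The point of the graduated levels is exactly what the prose says: a team is never just “failing”; it always has a named next rung to climb.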
The Culture that Drives Engineering Excellence
-
The right culture is one that is transparent and has high visibility. Engineering metrics are not this ivory tower thing that your VPs and CTOs look at and then come down and smite the teams with. If every team has visibility into your SLOs and your latency and so on in APM, why can’t your teams look at their own metrics in the engineering metrics tool? It’s visibility up and down the stack. Everybody gets access to the metrics. Everybody understands this is not a measurement tool. This is a way for us to find improvements. So a culture around metrics, a culture around visibility, everyone looking at the same data. That’s one part of it.
-
The second part of it is a shared language. Does everyone actually know what good means? If you ask three people on the team, “What does production readiness look like?” and you get three different answers, immediate fail. Having a shared language as to what good looks like. This is why companies do OKRs and North Star goals and North Star metrics and all these things. Everyone at least is thinking of the same thing. Does everyone know what good looks like and what that shared definition is? How much can you dumb down that definition? Your definition of good can be shared, but if it’s 60 criteria for production readiness, it doesn’t matter. No one’s gonna remember all that stuff.
-
The ability to create a clear path to greatness is really important. It’s, “Hey, these are the basics. These are the non-negotiables for production. This is what the gold standard looks like.” Being able to bucket up what good looks like into these categories is really important. Because then you, as a team lead, can say, “Hey, we’ve all agreed that these 15 things are the bare minimum to go to production. We’re not doing those things.” And so I can use that as a way to advocate for time. I can go to my manager or my VP and be like, “Hey, we all agree that these 15 things are really important. We’re not getting time to meet them. We need time.” And so now you’ve created that shared language. I can make that trade-off very clearly. And so that communication is very important.
-
The third thing is repeated messaging. How much of the same data can you use in multiple places? Can you talk about it in your all-hands? Talk about it in your monthly operational excellence reviews. Talk about it in your team meetings and team retros. Is there enough data, from a tooling standpoint, for each team to be able to self-report on these things? This is why teams spend so much time on OKR tracking: each team can very quickly iterate. The CEO is saying something, the CTO is saying something, and it trickles down, and I can check my own work against that. For engineering practices, it’s the same thing. Does your team, your manager, your director, your VP, do they all have a slice of the data where they can, at their level, come in and see, “Hey, how are we tracking against those practices?” The ability to create that instant visibility and repeated communication is really important.
-
This is a common adage in business: you have to keep repeating things until you feel like you’ve repeated yourself so many times that people are sick of you. At that point, people finally realize what you’re talking about. It’s the same thing for engineering practices. Keep repeating yourself. Congratulate people. Celebrate the wins. Do you have data to be able to celebrate those wins? If you celebrate one in an all-hands meeting, people are like, “Oh, clearly this is important enough that it’s getting a slide at the all-hands. Are we actually tracking towards it?”
-
So can you create that culture? It’s about creating momentum. It’s about creating visibility. It’s about shared language. It sounds like cult behavior, but culture in some ways is about how you can create cult-like behaviors within an organization that are healthy and drive the right outcomes.
Platform Engineering and Internal Developer Platform
-
You need to start with data. So every organization should have some sort of data. Do you have a system where you can track your software assets and infrastructure assets? Do you have a single place where you can determine accountability and ownership? You can probably get away without any of this up until about 30 to 40 engineers, roughly speaking. At that point, you have enough clusters in your team graph that you need a system where you can say, “Ganesh’s team is working on X, Henry’s team is working on Y, this is their sphere of accountability. They own these different repos.” Having that in one place is really important, and you should do it very, very early on. Usually that comes in the form of an internal developer portal, which is a service catalog.
-
Around that same time, at 30 to 40 engineers, you also wanna start measuring some of the key metrics: cycle time, review time, and so on. Ideally, your portal already provides this data; if not, you’re getting it in a manual way. Anything before that is a little bit of noise, honestly. You can get some value out of it, but it’s honestly not worth it. At a 30-person engineering organization, the priorities are probably more about moving fast. So cycle time, throughput, incidents, keeping the lights on. Those kinds of metrics are probably pretty important.
-
From there, once you get to around 60 to 75 engineers, platform-esque investments start to come into play. It doesn’t have to be a platform engineering team, but some sort of collective group of infra, cloud, staff engineers, senior engineers, and maybe one or two platform people coalesces and starts to work on org-wide initiatives. Because at about 50 to 60 people, you’re getting to a point where practices are starting to diverge. People are starting to do things in weird ways. And so starting at that stage is really important.
-
However, I will say at that point, you want to be very, very tactical about the types of investments you’re making. So even from a platform engineering perspective, what is the thing that is slowing you down at a 60- or 70-person engineering team? It’s usually around deployments, build systems and tooling, infrastructure provisioning. Do we have repeatability around those things? It’s not the quote-unquote sexy platform engineering stuff that people are doing, but it’s the tactical things that really matter.
-
If you hit a point where you’re 60 or 70 engineers, you probably have really slow builds or something like that. You probably have review bottlenecks. You may have started the company with ClickOps in Google Cloud or AWS, and so around 60 engineers you wanna start automating, terraforming or using infrastructure as code for some of that stuff (a minimal sketch follows at the end of this section). Think about the next phase: “Hey, what’s gonna break when I get to 100 engineers?” and start to get out of that mode. So around 60 to 70 engineers, you wanna start investing in platform practices, but very tactically, against specific initiatives.
-
I think once you get to around 120-ish engineers, you have enough independent workstreams across the organization that true platform engineering starts to have a real impact. So a standard set of deploy pipelines and deploy capabilities, standardizing on whatever kind of platform tooling you want. Those kinds of decisions start to become really important because they have compounding effects. I would say somewhere in the 60-to-120 range is when platform engineering starts to become a thing. And anything bigger than that, you absolutely need platform engineering.
-
My hot take, though, is that at whatever point you start bringing in SRE and security and stuff, you probably also wanna start thinking about platform engineering. We’re starting to see this practice a little bit more. Why not have those teams roll into an engineering excellence leader rather than all of them reporting to different places? You want your platform engineering teams to be working with your security teams and your SRE teams; their work depends on each other. Your SRE team cannot build reliable software if your platform doesn’t take that into account. Or security cannot do their job if your infrastructure-as-code tooling is not thinking about security from day one. So why is platform engineering over here and security over here? Bring those things closer together.
-
An easy way to do that is to bring them together into an engineering excellence team. So instead of having a head of developer experience or a head of SRE, have a head of engineering excellence and have these orgs roll up into that. Especially up until you’re a really large enterprise, at which point each of those things ends up becoming their own VPs, have them roll into a single leader. That way, all their work is really, really aligned to each other. It makes hiring easier as well. So that’s one of my hot takes from an organization design perspective.
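To ground the ClickOps-to-infrastructure-as-code point from earlier in this section, here is a minimal sketch using Pulumi’s Python SDK, one of several tools linked above (Terraform, Crossplane, and others fill the same role). The resource name and tags are illustrative:

```python
# __main__.py of a Pulumi project: declare infrastructure in code so
# provisioning is repeatable and reviewable instead of console clicks.
import pulumi
from pulumi_aws import s3

# An S3 bucket for build artifacts, tagged so ownership shows up in
# your catalog instead of living in someone's head.
artifacts = s3.Bucket(
    "build-artifacts",
    tags={"team": "platform", "managed-by": "pulumi"},
)

# Expose the generated bucket name as a stack output.
pulumi.export("bucket_name", artifacts.id)
```

Running `pulumi up` creates or updates the bucket from this definition, and every change lands as a reviewable diff rather than an untracked console click.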
The Biggest Misconception of Platform Engineering or IDP
-
One of the biggest misconceptions, with portals in particular, is, “Do I have to get all my data in order first before I put it into the portal to get any value out of it?” That’s the biggest misconception. The reason I say that is because if you don’t have a system that can hold data and tell you what’s wrong with the data, how are you ever gonna fix it in the first place? It’s like right now everything is an unknown unknown. And so using a portal is a way to go from the unknown unknowns to the known unknowns. So it’s like, okay, before we didn’t know anything. Now, we know what we don’t know. And then now, we can start to clean up the data. So you can actually clean up your data using a portal. That is the whole point of having a system of record.
-
When a sales leader comes into a new sales organization, they’re not gonna say, “First we’re gonna take all of our calendar invites and put it into a spreadsheet and then clean it up and then go buy Salesforce.” It’s like, “No, I’m gonna buy Salesforce, dump everything in there and then we’re gonna figure out how to make sense of this data, because I need a system that’s gonna help me see the world.”
-
The second misconception is “if you build it, they’ll come.” People think about portal and platform initiatives as, “Hey, we’re gonna build this really cool stuff and these great self-serve experiences, and it’s gonna be so good that people will just come and use it.” Completely false. This is basic product management, product thinking. It’s very hard for people to change their behaviors. People are used to doing things a certain way. Especially, and I speak from personal experience, as developers, we have our vimrcs, we spend so much time tuning our setups and our practices; that’s kind of who we are as engineers. We like to tailor our workflows and our tooling. And so if you say, “Oh, there’s this other tool, use the platform,” the reaction is, “I’ve been doing these things my own way. Seems fine to me. You go build your platform, I’ll do my thing.” That’s a very common issue we see.
-
Instead, it’s really important to start with the definition of what good looks like, and with scorecards next. So before we go off and build all these self-serve tools and platform engineering things, we’re gonna build a scorecard that says, “Hey, in order to go to production, these are all the things you should be doing, these are all the practices you should be following.” One of our customers calls it the “Golden State.” What is the Golden State of a service? If you define that, then your Golden Path makes a lot more sense. It’s, “Hey, this is what every service needs to do.”
-
It’s very easy to get people to buy into that, especially because most organizations have a production process already. 99% of organizations have something, whether it’s a Confluence page or some process. And so you tell your team, “Hey, we’re not trying to introduce a new process. It’s nothing new. We’re taking that old thing we used to do, automating it, and just telling you exactly where you stand from a production readiness standpoint. We’re doing a thing we already used to do; we’re just doing it better.”
-
And so now, everyone agrees this is what production readiness looks like. Then you tell people, “Okay, well you can do it your old way to get to production ready. Or we built this self-starter kit on our platform that you click a button, you got a new service in five minutes. It’s gonna meet all of your requirements within five minutes. So do you want to use the self-serve kit or do you wanna go do your old thing?” One in 20 people will do their old thing, but they still have to meet all those requirements, so at least you’re still getting the same outcome. But 19 out of 20 people will use the new tool because it’s like, “Well, if you’re telling me I need to do all these things, that’s what good looks like, which I agree with, then I’m just gonna use the new tool.” So now you’ve told people the Golden Path and the Golden State, you brought them together as a platform serving a very key purpose. You’ve solved a key bottleneck, and you’ve got that buy-in from the people.
-
Last but not least, it’s very important to tie to a thing that they’re already doing. You don’t want to come and introduce a completely new workflow, a completely new process. Find a thing that the team is doing already and make that better. And that will get buy-in for your platform. So a platform doesn’t have to be brand new. A platform doesn’t have to be one thing. A platform can be a collection of best-of-breed capabilities. You can iterate. You don’t have to go and deliver a whole platform all at once. And so if you think about incremental value like we talked about earlier, business outcomes, solving a thing that already exists and making that better, then you will naturally say, “Okay, we’re just gonna deliver the smallest slice of a platform that we can and continue iterating on it.” So have that product management mindset. Those are the biggest things that we see wrong with platform engineering practices today.
Cortex as an Engineering Excellence Platform
-
Cortex is an internal developer portal, but we like to think about it as an engineering excellence platform at the end of the day. It starts with the catalog. We help you catalog all of your services, your infrastructure, and your teams, and then automatically determine ownership for them. We have an ML model that can go through all of your repos and automatically assign the teams we believe own each repo. So within 10 minutes of setting up Cortex, you have a list of all your repos and which teams are accountable for each of them.
-
We have a product called Scorecards, which allows you to define best practices and standards. So for example, automating production readiness, tracking migrations and audits, package upgrades, security standards, and whatnot. So being able to define those best practices and standards.
-
And from there, we have a tool called Workflows. Workflows is basically a wizard builder where you can build experiences for developers. So spinning up a new service, deploys, infrastructure provisioning, just-in-time credentials access, and so on. So we have a really, really easy and powerful way to build those self-serve experiences, including code scaffolding and project bootstrapping.
-
And finally, we have an engineering intelligence product, which is around engineering metrics: DORA metrics, cycle time metrics, MTTR, things like that. The idea is that you should get all these things in a single platform. You shouldn’t have to go to reporting in one place, self-service in another, and cataloging in another; we bring all those things into the same platform. Because at the end of the day, those things are a flywheel: measure the metrics, figure out your practices, drive them using scorecards, use the catalog data that underpins all this stuff, and build self-serve experiences that make those things better.
-
We build a lot of native integrations, and there’s a lot of automation built in. It usually takes very, very little time to get started and get those metrics and the catalog up to date.
Generative AI Use Case in Platform Engineering
-
We have been working with one of the big four consulting firms, and it’s really interesting how they use generative AI in their transformation and modernization projects. They’re able to use generative AI to rapidly accelerate the code transformations, the migrations, and things like that. So rather than going in with a ton of manual effort, they’ve built a really powerful toolchain around platform engineering to help with the adoption of their platform.
-
The platform isn’t about gen AI itself, but it’s about, “Hey, we know this platform is gonna help drive these different use cases and user journeys. Can we use generative AI to migrate existing things into the new world and stuff like that?” So that was a really interesting use case.
-
A lot of tools have gone down the path of just these generic chatbots. It’s like, “Hey, we have all this data. Let’s just throw a chatbot on it with some RAG.” And like, okay, it’s cool, it’s a nice demo, but it doesn’t actually do anything.
-
From our perspective, the application of generative AI is really gonna shine when it applies to specific use cases, specific types of data.
-
Internally, if you’re on the SRE team, we’ve seen folks use generative AI for analyzing their incidents and things like that. If you’re on the platform engineering team, you can use generative AI to summarize your own feature backlog or the requests from your internal end users: can you find patterns of feature requests and things like that? It’s been a really interesting use case.
-
It’s like having a junior analyst on your team that you can hand a problem to: “Hey, can you go and do the synthesis? Here’s how you should think about it.” Those are the kinds of things that it’s really, really good at.
-
From a platform engineering perspective, it’s about getting another brain on your team that can help you reason about, “Are we working on the right things? Are we finding the right patterns?” and so on. That’s how we’ve seen some really interesting use cases.
3 Tech Lead Wisdom
-
Be a good storyteller.
-
It doesn’t matter if you’re an engineer. It doesn’t matter if you’re an engineering leader. Be a great storyteller.
-
A lot of what you do in any role is telling stories, whether it’s giving one of your reports advice. You’re telling a story. It’s like, “Hey, imagine if we improve these things. How much better could it be? What kind of behavior do we wanna see? How do I help my reports see the light?” You’re telling a story. You’re managing up, you’re flagging a risk to your manager about some project. You’re telling a story.
-
It’s not about making up excuses or fabricating a story. It’s about, “Can I communicate beginning, middle, end? Can I be concise? Can I give clear information? Can I bring people with me in my thought process?”
-
Being able to tell a story is very important, especially as an engineering leader making a big decision. How do you help the team see that? How do you explain to them the trade-offs, the thought that went into this, why we’re doing a certain thing, what does it mean for the business? Tell a story. It’s really, really important.
-
-
Deeply understand your product.
-
Especially as an engineering leader, having an above-average understanding of the product that you’re working on will help you make very quick judgements and reason about the space that you’re operating in.
-
You will, of course, lean on your own team for tactical advice; they’re the experts at the end of the day. But if you have a deep understanding of your product, you are able to reason about trade-offs and route information much more quickly.
-
-
Think about the culture.
-
Think about your operational cadence in particular. The rituals and the practices that you set up help dictate the way your organization operates, especially as it gets larger.
-
For example, if you create an operational excellence review, that itself starts to create the conversation of what metrics should we be looking at, how often should we be looking at these metrics, who should be involved, are we bringing in the right people? Just the fact that you have an operational excellence review creates this conversation.
-
Then go one bit more granular. In the operational excellence review, what outcomes do I care about? How am I gonna structure the meeting agenda to drive those outcomes? Imagine just the fact that you say, “Hey, we are gonna stop our conversation 10 minutes before the end of the meeting. And the only thing we’re gonna talk about in the last 10 minutes is action items.” It completely changes the format of the meeting. You are creating a culture where this is not just to talk about things; we wanna do something about it.
-
How can you structure your operational cadence, your rituals, to create the culture that you want? If you wanna celebrate shipping and create momentum, create a weekly demo meeting for your engineering team. No matter what you’ve built, you have to come and demo. You just have to demo something. Even if it’s an API, come show us your code. It doesn’t matter. Show your team what you’re working on. Show us the interesting things we’re doing across the organization. That’s a ritual.
-
Those rituals can involve other teams as well. Maybe a ritual is that your engineering team and the support team get together for a hackathon once a quarter. What I want out of that is the cross-pollination of information that’s gonna reduce the support load and escalations from support to engineering. That’s the thing I’m thinking about: I’m creating a ritual that creates that natural cross-pollination between those two teams.
-
What are the actual rituals I can run on a recurring basis so that even if I’m gone, on PTO or whatever, those systems are in place, the wheels are turning, and the culture ensures that, even when I’m not in the room, the behaviors are still happening the way I’d want? So think about culture from the lens of operational cadence.
-
[00:01:23] Introduction
Henry Suryawirawan: Hey, guys. Welcome back to another new episode of the Tech Lead Journal podcast. Today, I have with me Ganesh Datta. He’s the co-founder and CTO of Cortex.io. So today we’ll be talking a lot about how to build a great engineering team, best in class. So welcome to the show, Ganesh.
Ganesh Datta: Thanks for having me.
[00:01:56] Career Turning Points
Henry Suryawirawan: Right. Ganesh, I always love to start by maybe having my guests explaining about themselves, maybe any career turning points that you think we all can learn from you.
Ganesh Datta: So my name’s Ganesh. I’m one of the co-founders and CTO of Cortex, like you mentioned. I was a software engineer before starting Cortex. I was at a fintech startup, and I got to see the monolith to microservice journey while I was there. And it was a very interesting opportunity, because there were quite a few turning points, and I’ll kind of give you the story of the journey that I went through and some of those turning points.
So when I started at this company, I was working on a team and there was an initiative to break the first microservice outta the monolith. I eventually got added to that team. And it was interesting, because, at the time, that particular project was successful but kind of struggling. And so, like, it was the first microservice. There were a lot of learnings, a lot of things we had to do.
And it was an interesting time, because, at face value, it was like, okay, a lot of things are broken, like lots of bugs, like a lot of bug fixing. And it was, especially as an early career engineer, it felt like, oh my God, I’m just like fixing bugs all day long. Like, what’s the point here? But I would say it was actually a turning point in my career because it allowed me, and this is where I think my manager’s coaching helped a lot, was, hey, like, don’t look at these as just bugs, right? Like, if you wanna become a, you know, very senior engineer one day, the whole point is to be able to look at these things and say, what are the patterns we’re seeing here, right? And so, like, I think that framing really, really helped. And I was like, okay, well I’m gonna stop thinking about them as bugs and start thinking about them as patterns of issues in this particular microservice.
And so, eventually, I, you know, saw a ton of patterns and I got the opportunity to propose like a new architecture for the way we were computing, like, account management data for like credit cards and loans and things like that. And so we got buy-in to work on this big project. And it was great because having had that personal pain of going through all these different bugs, I knew exactly the impact it would have to do this re-architecture. And like, I know every engineer wants to do a re-architecture, but I think that the point was I was able to articulate a very clear business value and like the types of bugs we were running into and the customer impact and whatnot. And so we got to take on that project. And there were a ton of learnings around that. You know, the way you communicate it, the way you set timelines. So I would say that was a really pivotal project for me.
Another thing that, you know, I got to do while I was there was a lot of performance based things. It was like these small issues that kept popping up. And it was an opportunity for me to be much more proactive, like without anybody saying, like, you just go in and dig through Datadog and say like, where are we seeing like our highest P95s and, you know, P90, uh, requests, and what are the patterns? What are the bottlenecks we’re seeing? And can we start attacking those more aggressively? And so, in the, my spare time, I would go and try to fix these issues.
And eventually, it was a really fun thing we would do. For every endpoint where we fixed our P90s and P95s, we would print out a chart of the latency graph and I would stick it up on the wall next to me. So eventually, I had like a trophy case of these latency charts. It’s like, look at all the things that we fixed, you know, over the last year or so. And it was just a way to represent the kind of schlepping work that you have to do to keep services and systems up and running. So I would say that’s another turning point.
And the last one I’ll say was really investing time and energy into how can I take all the things that we learned and share them across the organization. And it was another thing that was just like, it came from personal pain, which is, as an engineer that was on call for a service, I would get paged for something, I’d go to another service and I’d try to figure out what’s going on. It was like a downstream or upstream service had an issue. I’m like, the logs look totally different, the metrics are named completely different. Like, why is my service doing one thing and your service is doing another thing?
Like it sucks to manage these services. And so I took it upon myself to say, okay, we’re gonna standardize this thing. And it was more outta frustration than anything else. I was like, we just gotta do things the same way, otherwise, like, I’m gonna lose it. And so I put together this... what I eventually realized was a production readiness checklist, but it was just like a list of best practices and standards and like logging formats. And I was like, okay, nobody’s following these practices, so I’m gonna make it really easy to do these things. So I built like standard libraries that would pre-configure your request tracing and your logging, and then we built like cookie cutter templates to spin up these services.
And without realizing it, it was like I was kind of doing these like grunt work dev experience type things. And now obviously like teams have come up around that, but at the time, it was not really a thing. And it was an example of like, hey, if you can make your peers’ lives better, you will earn that respect, you will earn the right to kind of take on bigger and bigger projects. And like don’t wait for permission for a lot of these things. Like you can make things better. And so that was another turning point for me.
And like the reason I tell these stories is also, A, it helped me in my own personal career and growth as a software engineer, but those were the things that led me to Cortex as well. So like the idea of like, hey, how do we help bigger and bigger organizations do like production readiness and ownership and standardizing the way you operate and like, drive excellence and standards. Like that became a bigger and bigger, bigger question for me until it was like, I gotta do this a better way. And that’s kinda how Cortex started. So it was kind of a series of, you know, turning points for me and my career. And then eventually the biggest turning point was like, I quit and started Cortex. So like, yeah, those are kind of the things that I found really impactful as part of my journey.
Henry Suryawirawan: Thank you for sharing such a good story, right? So I feel there are many things that we can learn just from your sharing, right? The first is about like looking for patterns, right? Maybe it could be bugs, could be issues, incidents, whatever that is, right? ‘Cause sometimes when we are tested in those moments, right, we feel frustrated, oh, why is this happening to me?
But actually, if you want to level up and step up, right? So you can probably solve the root cause instead of just the bugs, the symptoms of the issues, right? And I think learning and sharing, I think that’s really a good thing as well for any engineers, right? It could be internal, it could be also publicly, maybe through blog posts or whatever that is, right?
[00:07:50] The Practice of Finding the Patterns in Issues
Henry Suryawirawan: So maybe I would like to pick a little bit about finding the patterns, right? Because I’m sure, in most engineering teams, they will always have these issues, incidents that they have to face, especially when things are in production, right? So for engineers that want to, I don’t know, improve their skills in finding the root cause and fixing it, what would be your advice to, you know, look for patterns instead of just fixing the issues? Is there any practices or things that you do normally?
Ganesh Datta: Yeah, I think it’s, it’s important to be very systematic about it. If we’re a marketing team and we’re trying to find like which cold emails work best, we’re not just gonna like, look at our emails and pray like, okay, well let’s, let’s hope we get the best email. It’s like, okay, we’re gonna send out three different types of emails. We’re gonna look at the response rates and like, you’re very systematic about it. Like, why not do the exact same thing for engineering practices?
And so, if you are like a very junior engineer and you’re feeling frustrated by incidents, let’s take an example. One of the things that you can do is think about, okay, am I frustrated at the fact that I’m being alerted? Am I frustrated at like constant interruptions? Am I frustrated about the lack of information when something goes wrong? Is it that something is going wrong and I don’t know what to do about it? Something is going wrong and the same thing has happened multiple times? Nothing is going wrong and I’m being alerted about it? Or it looks like nothing is going wrong, but something actually is, as you later realize?
So like there’s kind of these buckets you can bucket it into. And so it’s say, okay, I’ll start there. So how would I like assess my own view of the world in some sort of like meaningful way? And it doesn’t have to be perfect because if you start with some simple like categorization of these things, like as you go through it, you will naturally realize, hey, there’s actually another class of things that I didn’t think about. Let me go back and like iterate on my pattern, the patterns I wanna match against and so on. And so like starting with like even something rudimentary is really important.
The thing that I started with as a very junior engineer was alert categorization. It was this bucketing of alerts, and I was like, okay, I don't even know what the patterns are, but I know there are general things that I'm getting alerted for that I shouldn't be, or whatnot. So I went through and got a list of all the alerts in the last three months, and I literally just went through and marked them: this shouldn't be an alert, this should be an alert, this one needs a runbook, blah, blah, blah. And then I shared it with the team, 'cause I was like, I don't even know if I'm doing this right. Let's get the team's feedback.
And it was like, hey, actually this alert is really important because of X. Or, this alert doesn't make sense, we can quiet it. It was kind of that conversation of, why are we even doing these things? And then you go back and you iterate on it. Say, okay, well, if this alert's important, then why is it frustrating? Is it because I dunno what to do about it? Then you go back with a proposal. And I think that iteration loop is really important.
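To make that exercise concrete, here is a minimal sketch of the kind of alert triage pass Ganesh describes. It is illustrative only: the category names and the CSV layout are assumptions, not something from the episode.

```python
# Tally a hand-categorized export of the last three months of alerts.
import csv
from collections import Counter

# Buckets roughly matching the ones described above.
CATEGORIES = [
    "real-actionable",   # something is wrong and a runbook exists
    "real-no-runbook",   # something is wrong and I don't know what to do
    "real-recurring",    # the same thing has happened multiple times
    "noise",             # nothing is wrong but I'm being alerted
    "missed",            # something was wrong and no alert fired
]

def triage(path: str) -> Counter:
    """Read an alert export with 'alert_name' and a hand-filled 'category'
    column, and tally how often each category occurs."""
    counts: Counter = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            category = row.get("category", "").strip()
            counts[category if category in CATEGORIES else "uncategorized"] += 1
    return counts

if __name__ == "__main__":
    for category, n in triage("alerts_last_90_days.csv").most_common():
        print(f"{category:16s} {n}")
```

Even a tally this crude gives you the artifact to bring to the team for the feedback loop described above.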
I think the other thing to do is, honestly, now with LLMs you can even take a list of your Jira tickets, put it into an LLM, and say, hey, tell me what you see here, alright? It's much easier now than it was 10 years ago. But the ability to go through and say, if you're on a product team, can you categorize the types of things you're working on into the highest-level abstraction? Basically, start with your practice, right? What is the highest level of abstraction possible before you go really low?
Like, okay, these tickets I'm working on, these bugs I'm working on, are on this part of the product or that part of the product. Okay, let me go into that part of the product now. Make it a little more granular. A little more granular. Until, okay, now I'm not able to break it down into any more patterns. But you've naturally found patterns that way, by starting at the biggest abstraction layer and whittling it down, more and more granular. Then you can work your way back up and see how things relate. And so I think going through that systematic exercise is really important.
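Here is a rough sketch of that top-down LLM pass over a ticket export. The `call_llm` function is a stand-in for whichever LLM client you use, not a real API, and the prompt wording is just one plausible way to phrase the exercise.

```python
# One pass of the top-down pattern-finding exercise: ask for a handful of
# buckets at the highest level of abstraction, then rerun per bucket.
from typing import List

def call_llm(prompt: str) -> str:
    """Placeholder: wire this up to your LLM provider of choice."""
    raise NotImplementedError

def bucket_tickets(tickets: List[str], granular: bool = False) -> str:
    """Ask for at most 5 buckets; set granular=True when drilling into a bucket."""
    level = "next more granular" if granular else "highest possible"
    prompt = (
        f"Group these {len(tickets)} ticket summaries into at most 5 buckets "
        f"at the {level} level of abstraction. Name each bucket and list "
        "which tickets fall into it.\n\n"
        + "\n".join(f"- {t}" for t in tickets)
    )
    return call_llm(prompt)
```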
Henry Suryawirawan: Yeah. So many great tips for those of you engineers who want to improve your skills in what we sometimes call systems thinking, right? How to actually identify these root-cause patterns. And I like the tip where you mentioned using an LLM. So maybe give it a try, you know. LLMs are sometimes very good at summarizing things and seeing things that we don't normally see. So I think those are good tips as well.
[00:11:39] The Definition of Engineering Excellence
Henry Suryawirawan: So Ganesh, today, we plan to talk about building a best-in-class engineering team or building engineering excellence within that team, right? So maybe could you start a little bit by defining first, what is engineering excellence to you now that you are a CTO at a company and your previous job as well? You have seen like big engineering teams operate. How do you define engineering excellence?
Ganesh Datta: It’s a great question. The one thing I’ll add is, I feel like I have a unique perspective here, because, you know, I’ve been in software engineering and I’m a CTO now, but also the goal of Cortex as a company is to help our customers with their engineering excellence initiatives. I’ve seen how the world’s best e-commerce companies, financial services companies, and healthtech companies have tried to do this as well. And so as I answer your questions, I’m gonna try and weave in the things that I’ve learned from our own customers and the other CTOs that we work with.
I think, to us, engineering excellence basically means like the alignment of engineering practices with business outcomes, right? So like, of course, every engineering organization needs to build features and products and all that stuff. Like that’s table stakes. But it’s the practices that we adopt as an engineering organization that lead to the outcomes that we care about.
And so, again, kind of going back to the sales and marketing analogy. Sales excellence is: given an account, given an opportunity, how likely am I to close it, and at what dollar amount, right? That’s the obvious business outcome. And there are practices I can follow to get there, right? Am I asking the right questions? Am I talking to the right personas? Am I using the right decks? Am I helping them see the value of the product properly? All the right practices make it more likely for a customer to see value and buy the product, right? And sales organizations have been doing this for a century, at least, at this point. It’s a very well-studied thing, like bookkeeping and tracking your CRM. The idea of CRMs existed way before Salesforce, right? But it’s this idea of, hey, if we can build a high-functioning machine and really operationalize the way we do sales, we’re gonna drive more revenue. Engineering is slowly coming to that same realization: hey, if we build the right systems and the right processes into the way we do engineering, we will get those business outcomes we care about.
And so when I think about the business outcomes that engineering excellence ties into, they tend to bucket into three things. And this is what we think about as well. The first is time to value, time to market, and innovation. Are you innovating as much as you can? Are you getting your product into the market as quickly as possible? The second one is around cost efficiency. Are you managing your costs and being as efficient as you can? Are you doing more with less? And the third one is around product quality and customer reliability. Are we delivering a world-class product? Is it reliable? Is the customer happy with it?
I use these three examples because, depending on where you are in your company journey and what type of company you are, the outcome you care about might be totally different, right? As a startup, at the very beginning of the Cortex journey, all we cared about was time to market. It was like, you don’t care about quality, it’s gonna break. But we wanna see what works and what people care about, so let’s move really quickly. And then from there, it became, okay, well, now we have a ton of customers. They really care about reliability. We wanna build a great customer experience. Time to market is still really important, we don’t wanna lose that. But it’s an ebb and flow. Okay, these next couple of months, we’re gonna really focus on reliability. That’s the thing that really matters.
But you talk to like a larger organization, maybe they care about cost and efficiency. Like, hey, we hired a thousand engineers in the last like three years, and like, you know, there’s a lot of pressure from investors and blah, blah, blah. And we gotta like, you know, focus on cost. We’re gonna really double down on that. You know, maybe a company’s trying to go IPO and they need to get their costs in order. There’s some outcome that your processes can relate to. And so engineering excellence is like the connection of engineering process to those outcomes.
In my head, I break engineering excellence into four buckets underneath that: reliability practices, security practices, efficiency practices, and velocity practices. Those four are the pillars of engineering excellence, if you will. And it spans the entire SDLC, right? When you think about developer experience, for example, that’s about the tools for developers in their day-to-day work. It’s very localized. Engineering excellence is more broad: the architecture review process, incident management, security. All those things are the excellence of your engineering organization, right? And developer experience is part of it. It’s like, hey, are we giving our developers tools and making it easy to be excellent? But it’s not the entire thing. And I think that’s why I really focus on engineering excellence. ‘Cause in some cases, maybe the developer experience sucks, but right now it’s a trade-off I’m willing to make because we need to improve our security or something like that. So it’s thinking about the business outcome first, then the part of the engineering process that I want to change to drive that outcome, and then how I’m gonna do it. That, to me, is what engineering excellence is: our practices in service of business outcomes.
Henry Suryawirawan: Wow, thanks for outlining it in such a great way, from outcomes to pillars. So I think at the end of the day, it’s all about business outcomes, right? Engineering is just one organization within the company. Obviously, engineering needs to also drive business outcomes, right?
So in many various companies, I believe there are many challenges, like what you mentioned, right? Time to market and innovation: bringing new features, maybe new things, into your system as fast as you can. The second one is about reliability and quality: bugs, incidents, maybe security issues, and things like that. And the third thing is about cost efficiency, which I think many companies these days are looking back at, how much they spend, and trying to optimize a little bit. So every company might go through this journey differently as well. Like what you mentioned, maybe startups prioritize something different than big organizations.
[00:17:10] The Leader’s Role in Engineering Excellence
Henry Suryawirawan: So maybe we can dive slightly deeper into all this, right? Obviously, there are many dimensions to it. One of the most important things in any engineering excellence effort is leadership, so maybe I’ll start from there. Because most likely, you can’t do anything if the leadership doesn’t support these things, right? So how important do you think the role of leaders is in an organization, such that the engineering team can achieve this excellence?
Ganesh Datta: I think it is extremely, extremely important, especially because if a team that is working on engineering excellence initiatives does not know how its work ties to a business outcome, it doesn’t go anywhere. And I’ll give you a concrete example of this. A lot of organizations in the last two years have spent a lot of time investing in platform engineering. And platform engineering should, in theory, enable quite a few outcomes, right? It should help you go to market faster. If you build the right platform, your developers are more autonomous, but they’re also following your best practices, your platform is secure, it’s modernized, so you’re being efficient, you’re using your resources effectively. So at face value, platform engineering should be a great way to drive engineering excellence.
But platform engineering, especially in large organizations, can sometimes become this corner of the organization where it’s not really tied to leadership, or the leadership is not able to articulate to that team how their work ties into the broader business outcome. And I think it’s really important for an engineering leader to say, hey, this quarter or this year or this half or whatever, the thing that we really care about is business outcome X. We wanna move to market faster, or we wanna improve security, or whatever that is. And then interrogating each set of initiatives: how exactly does this initiative tie back to that? And can I then tell that story to the rest of the organization? So, hey, our platform engineering team is gonna be heads down for three months. You’re not gonna see anything. But it’s because we are building out the very first slice of that platform, the slice that is going to improve the security of the way we provision infrastructure. Okay, now I understand how it connects back.
As an engineering leader, also challenge your teams to steel-thread it. Rather than saying, we’re gonna build the whole platform, do this whole thing, and then see the business outcome, it’s like, no: choose a business outcome. What is the number one thing blocking it today? And what is the bare minimum slice across the entire stack that I can deliver to move that needle just a little bit forward, right?
And so, when I think about reliability: if I know MTTR is top of mind and I see my CEO saying, hey, we have too many incidents, we’re paying too much in SLA fees, we need to improve our incident costs. Okay. I’m not gonna go and rebuild all of our services and fix all of our tech debt all at once. It’s like, no, what is the number one thing that’s stopping us? Okay, we’re spending too much time fighting fires. Why are we spending too much time fighting fires? Well, there are like 10 reasons, but the number one reason is we don’t know which teams are in the critical path for these key outages. And so we spend 30 minutes per incident getting the right people in the room. After that, there are still a lot of other problems, but we spend 30 minutes there. Okay, let’s take that one level further. What is the next thing we can do to standardize that, and so on? And so now, up and down the entire stack, you have a single thread.
And so as an engineering leader, it’s really important to maintain that focus and challenge the teams. Do we have to do all this? What is the bare minimum thing? A product leader does that for the product, and engineering excellence is the engineering team’s product in some sense. It’s the day-to-day practices and the tools and the stuff we’re building for those engineering and business outcomes. And so we should have that same product management mindset: what is the minimum viable, or minimum lovable, process or tool that we can build for this particular issue?
And so being able to carve out small slices of value that drive the business outcomes, telling that story broadly across the organization, making sure everyone’s bought into the business outcome, and holding teams accountable to not just doing science experiments all the time but driving towards something very clear, means that the next time you wanna do one of those things, your CEO’s bought in too. It’s like, hey, this investment in this seemingly random engineering process resulted in us moving the business outcome. So the next time the engineering organization wants to go and say, hey, we need to fix this tech debt, you’ve earned those stripes. You can say, the last time we made this argument, we moved the business needle. We’re not lying to you, we have another thing that we wanna do. You’ve got that buy-in, you’ve created the feedback loop. And so as an engineering leader, it’s your duty to bring those business outcomes from the business to engineering, and then use the same business outcomes to tell the story back to the business: hey, here’s how our investments impacted the business. As an engineering leader, you’re thinking about both sides of that equation.
Henry Suryawirawan: Wow. I think those are really great tips, right? Again, it all comes back to the business outcomes first. Try to tie it to maybe one single business outcome and identify the initiative that you wanna do. And I like that you mentioned creating a slice, right? You know, maybe people call it a thin slice in product management, right?
So the same thing you can do in your technology stack. And platform engineering is also the most common example these days for engineering teams trying to improve things. But obviously, introducing technology itself might not be sufficient, right? Because I think many people try to use technology as the solution for improving their engineering excellence and cannot convey that to the stakeholders.
I think this is one of the challenges, simply because maybe there are gaps in understanding between technical folks and business stakeholders who are non-technical. And the other thing is about building the story around the kind of outcomes that you wanna drive, right?
[00:22:31] Aligning Engineering Excellence with the Business Outcomes
Henry Suryawirawan: So for people who probably still rely on implementing tools, be it, I don’t know, cloud, AI, platform engineering, whatever that is, right? Or maybe sometimes re-architecting certain things to improve their engineering excellence. What would be your advice for those people? Before they approach these things with such new technologies and new architectures, is there anything that they can do before they embark on that journey?
Ganesh Datta: Yeah, I think it’s understanding what the North Star is. We should understand which business metric we’re trying to move within our organization. Or, even if it’s not gonna immediately move a metric, the hypothesis as to how this series of things will eventually move some specific needle. And it doesn’t have to be a business outcome necessarily. If you are on a hands-on team, you know, in security or something like that, buying a new tool, you can still tie it back to one of those pillars, right? Hey, I’m improving our security, I’m improving our reliability, I’m doing these things. And your engineering leader should then be able to tell that story to the business. As long as you can tie it back to one of these pillars.
And so the way I think about it, for example, going back to the platform engineering example: let’s say you’re a platform engineer and you wanna spend two months building out a self-starter kit for new services. On its face, it’s not a particularly interesting project to the business, right? It’s like, hey, I’m investing in Terraform, I’m investing in IAM policy tools, I’m buying a bunch of stuff and setting all these things up. I’m spending all this time and energy. But why does it matter? Okay, well, actually, the reason all that stuff matters is, and this is where measuring the current state of the world is really important: today, it takes us one and a half weeks to spin up a new service and go through all the approvals. It takes us four days for security review. Once it’s ready to go to production, it takes us four days for SRE review. And then finally it goes to production. So on either side of the equation, we’re spending a minimum of two weeks outside of the actual writing of code.
And so therefore, if I can reduce those two weeks down to days or hours, then the ROI is there. So by investing in a self-starter kit through the platform engineering team and buying the right tools for it, over the 50 services that we create per quarter across the organization, times a reduction of two weeks, we’ve saved X dollars. And we’ve improved our security posture, because now every team is by default registered with our vulnerability management tool the moment they create a new service, so the likelihood of a breach is much lower. So now I’ve made a case for these tools from an efficiency perspective and a security perspective.
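As a back-of-the-envelope version of that argument, here is the arithmetic with the hypothetical numbers from the example; the loaded cost figure is a placeholder you would substitute with your own.

```python
# ROI sketch for the self-starter kit, using the numbers from the example above.
services_per_quarter = 50
weeks_saved_per_service = 2        # approvals + security review + SRE review
loaded_cost_per_eng_week = 4_000   # placeholder: your fully loaded $/engineer-week

eng_weeks_saved = services_per_quarter * weeks_saved_per_service
quarterly_savings = eng_weeks_saved * loaded_cost_per_eng_week

print(f"Engineer-weeks saved per quarter: {eng_weeks_saved}")       # 100
print(f"Approximate quarterly savings: ${quarterly_savings:,}")     # $400,000
```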
And so it’s not just, oh, I’m creating a cool experience for my developers. Does the CEO really care about that? Maybe, you know. Yeah, I want my developers to be happy, and there’s a retention aspect and all those things about sentiment. But fundamentally, if you tell me, hey, not only are my developers gonna be happy, I’m saving us X weeks per service, I’m increasing our security posture, and our incident rates should go down, that’s a much easier case to make.
And the other thing, like you mentioned, coding assistants are the same thing. It’s like, okay, can we measure the impact of that? That’s an interesting one, actually, coding assistants. Kind of a side tangent here: I had a customer recently tell me, hey, why are you making such a big fuss about measuring the impact of 50 bucks a month or whatever? You wouldn’t ask, does my developer really need 16 gigs of RAM versus eight? Or, do they really need IntelliJ, can’t they just use vim? It’s obvious that it’s one of the tools in their toolkit, right? You wouldn’t question a developer who’s trying to get IntelliJ or something like that. So why are we making a big fuss about spending 50 bucks for a tool? It’s just part of your general toolkit: laptop, IDE, terminal, coding assistant. So it’s an interesting idea. And the same thing should hold true for your engineering excellence platform as a whole.
Another topic as well is, you know, why are more engineering leaders not thinking about engineering excellence? But yeah, that was a bit of a side tangent. I hope I answered the question.
[00:26:30] The Importance of Metrics in Engineering Excellence
Henry Suryawirawan: Yeah, I think that’s a great example that you outlined just now, right? So one thing that I picked up from your explanations, and also from your career journey: engineering also needs to be able to measure things, right? Be it, you know, software telemetry or even the velocity metrics that you mentioned. Those kinds of measurements, I think, must be in place, right?
For us to build engineering excellence, how important is it for leaders to actually build an initiative to gather these kinds of measurements? Because sometimes it’s easy, the metrics can come out of systems, right? But sometimes, especially when you touch multiple moving parts, it’s very hard to actually quantify the metrics. So how important do you think it is for an engineering organization to put in the effort, invest the time, and actually come up with metrics, and maybe have a good cadence around looking back at the metrics, how to improve them, and things like that? So maybe any tips on this?
Ganesh Datta: Yeah, absolutely. That’s a very, very important question. So I think it is a must and I’ll explain why. Going back to the sales analogy again, like imagine if a sales leader said like, I don’t need Salesforce, I don’t need a CRM. We’re just gonna sell and hope for the best. Like you would fire that sales leader tomorrow. Like any sales leader that doesn’t have a CRM that’s not using that data to make judgements about the organization and where to invest is gonna be laughed outta the room, right? Any data leader that doesn’t have data tooling is gonna be laughed outta the room.
So why is it that engineering leaders are not treated the same way? Why do we, as engineering leaders, not say, of course I need a system of record? Every single other function has a system of record, right? Sales teams have CRMs, marketing teams have marketing automation, finance has ERPs, IT has CMDBs and ITSM tools like ServiceNow. Every single function has a system of record. But what does engineering have? We just throw caution to the wind. Do whatever you want. Every team is just out there, it’s the wild, wild west. And it’s really insane if you think about it. Every other function is treated so differently, and engineering is the most expensive organization. We should be much more systematic and data-driven and methodical about the way we build our engineering organizations.
And what I would challenge, though, is that it’s not just about the metrics. Imagine if all Salesforce did was fancy reports, right? Just a funnel dashboard or something. You wouldn’t need that either. What you need is an end-to-end platform that helps you understand your software ecosystem. So it’s accountability: given a service, which team is accountable for it? Where’s it deployed, where’s it running? How many vulnerabilities does it have? What is the code coverage like? What is the on-call rotation? Where can I collect this data? How do I use that data to drive best practices?
So, for example, in Salesforce, again, I can use a CRM to drive practices in sales. It’s like, hey, before you start a POC with a customer, you need to fill out all these fields. You need to make sure you understand if they have budget. You need to make sure you understand who’s gonna buy it. And the AE has to go and fill in this information. The idea is that you’re using that data to drive some sort of behavior in the daily workflow of an AE, right? So you’re using it to drive practices. Or you’re using marketing automation tools to drive practices.
And so it’s not just about the dashboards. The metrics side of the equation for an engineering team should be, A, having a way to store all this data in a meaningful way, ‘cause it can’t just be metrics. There’s the question of: what is the actual data structure on which I’m reporting? So to us, the way we think about it is the service catalog, the infrastructure catalog, a catalog of all this information, right? Just like a CRM.
The second part is the ability to drive behavior around it. So defining: this is what good looks like, this is where I need teams to be. Can we create a shared language as to what good looks like? Because I can say your repo has 30 vulnerabilities. Okay. Is that good? Is it bad? Does it matter if they’re high or low severity? What does that actually mean? It doesn’t mean anything for me to tell you you have 30 vulnerabilities, right? There needs to be some distinction for our organization.
Like, for us as a company, what does good look like? Okay, for a financial services company, zero critical vulnerabilities is great. Zero critical vulnerabilities, resolving them within seven days, is excellent, right? And zero critical vulnerabilities, resolving them within seven days, and resolving medium vulnerabilities within 60 days, is top of the line. That’s where we wanna aspire to. So now not only do we have a metric, we have a graduated ladder of how to get better. And the platform should be able to tell you that.
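A minimal sketch of that graduated ladder as a scorecard rule might look like the following. This is purely illustrative and is not Cortex's actual scorecard format; the field names and tier labels are assumptions.

```python
# Grade a repo's vulnerability posture against the graduated ladder above.
from dataclasses import dataclass

@dataclass
class VulnStats:
    open_critical: int
    max_days_to_resolve_critical: int  # slowest critical fix in the window
    max_days_to_resolve_medium: int    # slowest medium fix in the window

def grade(s: VulnStats) -> str:
    if s.open_critical > 0:
        return "needs work"
    if s.max_days_to_resolve_critical > 7:
        return "great"            # zero criticals open, but slow to resolve
    if s.max_days_to_resolve_medium > 60:
        return "excellent"        # criticals resolved within 7 days
    return "top of the line"      # criticals within 7 days, mediums within 60

print(grade(VulnStats(open_critical=0,
                      max_days_to_resolve_critical=5,
                      max_days_to_resolve_medium=45)))  # -> top of the line
```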
And the final bit is the reporting of engineering metrics. It’s like, okay, based on the data that we have, based on the practices we’re following, are we seeing the impact in the data? So cycle time and, you know, DORA metrics and things like that. Those things can’t live in isolation, right? You can’t go and tell a developer, improve your MTTR. It’s like, okay, I don’t know where to start, right? Instead, it’s, hey, we’re using MTTR as a measurement. We believe that in order to improve MTTR, we need to change these five practices. So we’re gonna change the five practices, and we’re gonna come back and see if the MTTR changed.
So engineering metrics are not about measuring productivity or, you know, checking whether you’re getting the most out of your teams. It’s about hypothesis generation and hypothesis testing, right? Based on the metrics, I have a hypothesis. It’s all about continuous improvement, right? One of the pillars of engineering excellence we consider is continuous improvement. It’s not good versus bad, it’s: I’m better than I was yesterday. And that’s a very important part of engineering excellence.
And so engineering metrics should be used to say: today our cycle time is X, and we want to move faster. The time to first review is really long. Our hypothesis is that the CI build time is taking up a good chunk of that. By reducing our CI build time, our cycle time will go down, which means our throughput will go up. That’s the hypothesis. And then you go and say, okay, well, how do we change our build time? Then you run an initiative, and this is where Cortex can help: okay, we’re gonna migrate our teams from legacy Jenkins servers to GitHub Actions so we have better caching. But in a big organization right now, without a tool like Cortex, and not to make a sales plug, it takes a lot of time to track a migration like that. So you need the right tooling in place to track a migration like that and make sure everyone’s doing the migration. Okay, you do that migration.
And then you go back to your metric and see whether it moved the needle. So engineering metrics are a way to create those hypotheses and test them. Because if you’re not doing that, how do you know if your efforts are actually moving the needle, if you’re actually making an impact? So engineering metrics are very, very important. The data under the hood is very, very important. And the ability to set standards and drive behavior is also really important. All of those things together are what creates the flywheel of a data-driven, high-visibility, high-trust engineering culture that is focused on excellence. That’s how I think about how metrics fit into the broader ecosystem.
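Here is a small sketch of that measure, change, remeasure loop. The data is stubbed and the 10% threshold for "the needle moved" is an assumption; in practice you would pull cycle times from your own tooling.

```python
# Hypothesis test: did median cycle time drop after the CI migration?
from statistics import median
from typing import List

def needle_moved(before: List[float], after: List[float],
                 min_improvement: float = 0.10) -> bool:
    """True if median cycle time dropped by at least min_improvement (10%)."""
    b, a = median(before), median(after)
    return (b - a) / b >= min_improvement

# Example: PR cycle times in hours, sampled before and after the change.
before = [52.0, 41.5, 60.0, 38.0, 55.5]   # median 52.0
after  = [35.0, 30.5, 44.0, 29.0, 40.0]   # median 35.0

print(needle_moved(before, after))  # -> True: hypothesis supported
```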
Henry Suryawirawan: Right. I like the angle where you mentioned continuous improvement, right? So obviously, excellence is kind of like the North Star, the vision of where we wanna be. I don’t think any team can just say, hey, I’m excellent, you know. You can’t be excellent forever anyway. So the continuous improvement aspect, I think, is very important, and so is the motion of how to improve yourself from where you are today, and then tomorrow, and so on and so forth, right? So I think that’s really crucial.
[00:33:35] The Culture that Drives Engineering Excellence
Henry Suryawirawan: So the other thing that you mentioned is about driving behavior. I believe this touches on the aspect of culture, right? In an engineering organization, culture is really important. What kind of culture drives good engineering excellence initiatives and those kinds of things? So maybe you can elaborate? I know culture is a bit of a big, open-ended topic. But maybe from your experience, what kind of culture do you think drives this kind of engineering excellence?
Ganesh Datta: Yeah. So I’ll start with a segue from the last question, engineering metrics. The right culture is one that is transparent and has high visibility. What I mean by that is: engineering metrics are not this ivory tower thing that your VPs and CTOs are looking at, coming down and smiting the teams based on those metrics. If every team has visibility into your SLOs and your latency and stuff in APM, why can’t your teams look at their own engineering metrics too, right? So it’s visibility up and down the stack. Everybody gets access to the metrics. Everybody understands this is not a measurement tool; it’s a way for us to find improvements. So that is really important. A culture around metrics, a culture around visibility, everyone looking at the same data. That’s one part of it.
The second part of it is a shared language. Does everyone actually know what good means? If you ask three people on the team what production readiness looks like and you get three different answers, that’s an immediate fail. So having a shared language as to what good looks like. This is why, you know, companies do OKRs and North Star goals and North Star metrics and all these things. Everyone at least is thinking of the same thing, right? If you’re a social media company, you probably have a North Star metric: hey, we want to improve retention. That’s our number one thing, and this year that’s the only thing that matters. Okay, well, everybody knows that. Everyone can say, retention’s the only thing that matters, and ask themselves, am I actually doing something about that?
So the same thing holds true for engineering process. Does everyone know what good looks like and what that shared definition is? And how much can you simplify that definition? Your definition of good can be shared, but if it’s 60 criteria for production readiness, it doesn’t matter, right? No one’s gonna remember all that stuff. It’s just that Confluence page over there. And so the ability to create a clear path to greatness, I think, is really important. It’s like, hey, these are the basics, these are the non-negotiables for production, and this is what the gold standard looks like. Being able to bucket what good looks like into these categories, I think, is really important.
‘Cause then, as a team lead, you can say, hey, we’ve all agreed that these 15 things are the bare minimum to go to production, and we’re not doing them. And so I can use that as a way to advocate for time. I can go to my manager or my VP and say, hey, we all agree that these 15 things are really important. We’re not getting time to meet them. We need time. And so now you’ve created that shared language. And the VP, I mean, yes, they can still say no, but the idea is that we have agreed to those things, and I can make that trade-off very clearly. So that communication is very important.
The third thing is repeated messaging. How much of the same data can you use in multiple places? If we’re doing a migration like I mentioned earlier, can I talk about it in the all-hands? Talk about it in your monthly operational excellence reviews. Talk about it in your team meetings and team retros, right? Is there enough data, in whatever you’re using from a tooling standpoint, for each team to be able to self-report on these things? This is why teams spend so much time on OKR tracking: each team can very quickly iterate.
So the CEO is saying something, the CTO is saying something, it trickles down, and I can check my own work against that. For engineering practices, it’s the same thing. Your team, your manager, your director, your VP, do they all have a slice of the data where they can come in at their level and see, hey, how are we tracking against those practices? And so the ability to create that instant visibility and repeated communication is really important.
This is a common adage in business, right? You have to keep repeating things until you feel like you’ve repeated yourself so many times that people are sick of you. And at that point, people finally realize what you’re talking about. It’s the same thing for engineering practices. Keep repeating yourself. And congratulate people. Celebrate the wins. Hey, Ganesh’s team has finished migrating 80% of their services to the new platform. Congratulations, right? Do you have the data to be able to celebrate those wins? And if you say that in an all-hands meeting, people think, oh, clearly this is important enough that it’s getting a slide at the all-hands. Are we actually tracking towards it?
So can you create that culture? It’s about creating momentum. It’s about creating visibility. It’s about shared language. It sounds like cult behavior, but really, culture in some ways is: how can you create cult-like behaviors within an organization that are healthy and drive the right outcomes, of course. That’s kinda the framing I like to think about, kind of tongue in cheek.
Henry Suryawirawan: Yeah. In my experience also, the bigger you grow, the more important all these things become, right? Especially the repeating of the message. Because when you have hundreds of engineers, for example, it’s very difficult to align everybody. So a few things that I picked up from your sharing: high transparency, maybe having the metrics in place so everyone can see them all the time. Shared language, so everyone can agree on the same definitions. Because when you have multiple engineering teams, everyone thinks of excellence differently, right? If you don’t have a shared language for what good looks like, it’s very difficult to align everybody. And then, yeah, repeated messaging. And building a clear path is also important, because you can announce an initiative and let people run with it, but if there’s no clear, streamlined path, it’s very hard for people to achieve it as well.
[00:39:05] Platform Engineering and Internal Developer Platform
Henry Suryawirawan: So the other aspect that I want to dive deeper into is platform engineering, because you are building one yourself, right? So, cortex.io. People these days talk about platform engineering, internal developer platforms, and things like that. So maybe the first question is: at what engineering stage do you think everyone should start thinking seriously about platform engineering and an internal developer platform?
Ganesh Datta: The interesting thing is, I kind of break this up into two aspects. Internal developer platform versus internal developer portal, there’s a whole debate about what each thing means and when you need them. It’s a whole thing. But basically, I would argue that you need to start with data. Every organization should have some sort of data. Do you have a system where you can track your software assets and infrastructure assets? Do you have a single place where you can determine accountability and ownership? You can probably get away without any of this stuff up until about 30 to 40 engineers, roughly speaking. At that point, you have enough clusters in your team graph that having a system where you can say, Ganesh’s team is working on X, Henry’s team is working on Y, this is their sphere of accountability, they own these different repos, having that in one place is really important. You should do that very, very early on.
And usually that comes in the form of an internal developer portal, which has a service catalog or something like that. Around that same time, ideally your portal already provides this, but you also want to start measuring some of the key metrics, like cycle time, review time, and so on, at around 30 to 40 engineers. Anything before that is honestly a little bit of noise. You can get some value out of it, but it’s not really worth it. Ideally your portal gives you this data, but if not, you get it in a manual way and track some key metrics. And at a 30-person engineering organization, the focus is probably more on moving fast. So cycle time, throughput, incidents, keeping the lights on. Those kinds of metrics are probably pretty important.
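For illustration, a minimal version of that "who owns this repo" system is just a catalog keyed by repo. The field names here are assumptions for the sketch, not any particular product's schema.

```python
# A tiny service catalog: one place that answers "which team owns this repo?"
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Service:
    name: str
    owner_team: str
    repo: str
    on_call_rotation: str = ""
    tags: List[str] = field(default_factory=list)

catalog: Dict[str, Service] = {}

def register(svc: Service) -> None:
    catalog[svc.repo] = svc

def owner_of(repo: str) -> str:
    svc = catalog.get(repo)
    return svc.owner_team if svc else "unowned (fix this first)"

register(Service("payments-api", owner_team="team-payments",
                 repo="org/payments-api", on_call_rotation="payments-oncall"))
print(owner_of("org/payments-api"))  # -> team-payments
```

Even a spreadsheet with these columns captures the same idea; the point is a single, shared source of accountability before anything fancier.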
From there, around 60 to 75 engineers is where platform-esque investments start to come into play. It doesn’t have to be a platform engineering team, but some sort of collective group of infra, cloud, and staff or senior engineers, maybe with one or two platform people, coalesces and starts to work on org-wide initiatives. Because at about 50 to 60 people, you’re getting to a point where practices are starting to diverge. People are starting to do things in weird ways. So starting at that stage, I think, is really important.
However, I will say that at that point, you want to be very, very tactical about the types of investments you’re making. Even from a platform engineering perspective: what is the thing that is slowing you down at a 60- or 70-person engineering team? It’s usually around deployments, build systems and tooling, infrastructure provisioning. Do we have repeatability around those things? It’s not the quote-unquote sexy platform engineering stuff that people are doing, but it’s the tactical things that really matter.
If you hit a point where you’re 60 or 70 engineers, you probably have really slow builds or something like that. You probably have review bottlenecks. You may have started the company, and I’m speaking from experience, with ClickOps in Google Cloud or AWS or whatever. So around 60 engineers, you wanna start automating, adopting Terraform, infrastructure as code, some of that stuff. Thinking about the next phase: hey, what’s gonna break when I get to 100 engineers? So around 60 to 70 engineers, you wanna start investing in platform practices, but very tactically, against specific initiatives.
I think once you get to around 120-ish engineers, you have enough independent work streams across the organization that true platform engineering starts to have a real impact. A standard set of deploy pipelines and deploy capabilities, where you’re standardizing on Helm, or on Terraform or Pulumi or Spacelift or Crossplane or Kratix or whatever platform tool you want. Those kinds of decisions start to become really important because they have compounding effects. So I would say somewhere in the 60 to 120 range is when platform engineering starts to become a thing. And anything bigger than that, you absolutely need platform engineering.
My hot take, though, is that at whatever point you start bringing in SRE and security and so on, you probably also wanna start thinking about platform engineering. And I would argue, and we’re starting to see this practice a little bit more, why not have those teams roll up to an engineering excellence leader rather than all of them reporting to different places? You want your platform engineering teams to be working with your security teams and your SRE teams, right? Their work depends on each other. Your SRE team cannot build reliable software if your platform doesn’t take that into account, right? And security cannot do their job if your infrastructure-as-code tooling is not thinking about security from day one. So why is platform engineering over here and security over there? Bring those things closer together.
And an easy way to do that is to bring them together into an engineering excellence team. So instead of having a head of developer experience or a head of SRE or whatever, have a head of engineering excellence and have these orgs roll up into that, at least until you’re a really large enterprise, at which point each of those things ends up becoming its own VP. Have them roll into a single leader, so that all their work is really aligned with each other. It makes hiring easier as well. So that’s one of my hot takes from an organization design perspective.
Henry Suryawirawan: Yeah, thanks for sharing such a good breakdown of how big the engineering team should be before you start thinking about certain things. And especially when you have hundreds or even thousands of engineers, having that kind of platform capability is really important. Because otherwise, people will be doing the same thing in different ways, and you cannot rationalize it when things go wrong, right? Or especially when you wanna improve something, you kind of double the effort, and things like that. So definitely, when you grow in size, you need this.
[00:45:02] The Biggest Misconception of Platform Engineering or IDP
Henry Suryawirawan: So maybe from your experience building this internal developer platform, what are some of the biggest misconceptions people have when they wanna adopt platform engineering or an IDP, versus what an IDP actually is?
Ganesh Datta: Yeah. So, kind of going back to the data question. One of the biggest misconceptions, with portals in particular, is: do I have to get all my data in order first before I put it into the portal, to get any value out of the portal? If I want engineering metrics, if I want scorecards and best practices and all these things, do I have to clean all my data up first before I buy a portal? Or am I wasting time with a portal? That’s the biggest misconception. And the reason I say that is because if you don’t have a system that can hold the data and tell you what’s wrong with the data, how are you ever gonna fix it in the first place, right? Right now, everything is an unknown unknown. You can say, I’m gonna clean up my data, but what are you really doing? What does that actually mean?
And so using a portal is a way to go from unknown unknowns to known unknowns. Before, we didn’t know anything. Now, we know what we don’t know. And then we can start to clean up the data: okay, now we know about these things, and we don’t know about those things. So you can actually clean up your data using a portal. That is the whole point of having a system of record, right?
When a sales leader comes into a new sales organization, and I love this analogy, I’m gonna keep using it, they’re not gonna say: first we’re gonna take all of our calendar invites, put them into a spreadsheet, clean it up, and then go buy Salesforce. It’s like, no, I’m gonna buy Salesforce, dump everything in there, and then we’re gonna figure out how to make sense of this data, because I need a system that’s gonna help me see the world. That’s the first misconception.
The second misconception is: if you build it, they’ll come. People think about portal and platform initiatives as, hey, we’re gonna build this really cool stuff and these great self-serve experiences, and it’s gonna be so good that people will just come and use it. Completely false. This is basic product management, product thinking, right? It’s very hard for people to change their behaviors. People are used to doing things a certain way. And I speak from personal experience: as developers, we have our vimrcs, we spend so much time customizing our setups and our practices. That’s kind of who we are as engineers. We like to tailor our workflows and our tooling. So if you say, oh, there’s this other tool, use the platform, the response is: I’ve been doing these things the other way, seems fine to me. You go build your platform, I’ll do my thing, right? That’s a very common issue we see.
And so instead, it’s really important to start with the definition of what good looks like. And this is how we built our product, starting with Scorecards. It’s like, okay, before we go off and build all these self-serve tools and platform engineering things, we’re gonna build a scorecard that says: in order to go to production, these are all the things you should be doing, right? These are all the practices you should be following. One of our customers calls it the Golden State. What is the Golden State of a service? If you define that, then your Golden Path makes a lot more sense. It’s like, hey, this is what every service needs to do.
It’s very easy to get people to buy into that, especially because most organizations have a production process already, right? 99% of organizations have something, a Confluence page or some process: hey, this is what production ready looks like. And so you tell your team: hey, we’re not trying to introduce a new process. It’s nothing new. We’re taking that old thing that we used to do, we’re automating it, and we’re just gonna tell you exactly where you stand from a production readiness standpoint. We’re doing a thing we already used to do, we’re just doing it better. It’s like, okay, that seems fine. I can buy into that.
And so now everyone agrees: okay, this is what production readiness looks like. Then you tell people: you can do it your old way to get to production-ready. Or, we built this self-starter kit on our platform where you click a button and you get a new service in five minutes, and it’s gonna meet all of your requirements within five minutes. So do you want to use the self-serve kit, or do you wanna go do your old thing? One in 20 people will do their own thing, but they still have to meet all those requirements, so at least you’re still getting the same outcome. But 19 out of 20 people will use the new tool, because: well, if you’re telling me I need to do all these things, and that’s what good looks like, which I agree with, then I’m just gonna use the new tool. So now you’ve shown people the Golden Path and the Golden State, you’ve brought them together with a platform serving a very key purpose, solving a key bottleneck, and you’ve got buy-in from the people. So that’s the other misconception: if you build it, they will come. It’s absolutely not true. And that’s where we see a lot of IDP initiatives fail.
And last but not least, it’s very important to tie into a thing that they’re already doing. I kind of mentioned this already: you don’t want to come in and introduce a completely new workflow, a completely new process. Find a thing that the team is doing already and make that better. That will get buy-in for your platform. A platform doesn’t have to be brand new. A platform doesn’t have to be one thing. A platform can be a collection of best-of-breed capabilities, right? That’s the other thing: people say, we’re gonna deliver a whole platform. No, no, no. Your platform can be infrastructure as code, it can be CI/CD, it can be all these things. And you can iterate, you can bounce around. You don’t have to go and deliver a whole platform all at once.
And so if you think about incremental value, like we talked about earlier with business outcomes, solving a thing that already exists and making it better, then you will naturally say: okay, we’re just gonna deliver the smallest slice of a platform that we can and continue iterating on it. So have that product management mindset. Those are kind of the biggest things we see go wrong with platform engineering practices today.
Henry Suryawirawan: Thanks for sharing those misconceptions. I’m sure many people can relate, especially those who have embarked on this journey. Especially the platform-as-a-product thing. I think many people still build a platform and assume people will just use it. Or maybe worse, they force it onto teams without actually explaining the reason why and how it benefits them.
[00:50:36] Cortex as an Engineering Excellence Platform
Henry Suryawirawan: So obviously Cortex is one way of doing this. So maybe a little bit of a plug about Cortex. How do you think people can use Cortex to help with all these things? Maybe some features. I know that you have this concept called Engineering Intelligence. So maybe tell us a little bit more: how can people use Cortex to actually solve all these kinds of things?
Ganesh Datta: Yeah, absolutely. So Cortex is an internal developer portal, but we like to think about it as an engineering excellence platform at the end of the day. It starts with the catalog. We help you catalog all of your services, your infrastructure, your teams, and then automatically determine ownership for them. We have an ML model that can go through all of your repos and automatically assign the teams that we believe own each repo. So within, you know, 10 minutes of setting up Cortex, you have a list of all your repos and which teams are accountable for each of them.
We have a product called Scorecards, which allows you to define best practices and standards. So like, for example, automating production readiness, tracking migrations and audits, package upgrades, security standards and whatnot. So being able to define those best practices and standards.
And then from there, we have a tool called Workflows, which is basically a wizard builder where you can build experiences for developers: spinning up a new service, deploys, infrastructure provisioning, just-in-time credential access, and so on. So we have a really easy and powerful way to build those self-serve experiences, including code scaffolding and project bootstrapping.
And then, finally, we have an Engineering Intelligence product, which is around engineering metrics: DORA metrics, cycle time metrics, MTTR, things like that. The idea is that you should get all these things in a single platform. You shouldn’t have to go to reporting in one place, self-service in another, and cataloging in another, so we bring all those things into the same platform. Because at the end of the day, those things are a flywheel: measure the metrics, figure out your practices, drive them using scorecards, use the catalog data that underpins all this stuff, and build self-serve experiences that make those things better.
So at the end of the day, that’s what Cortex is. We build a lot of native integrations, and there’s a lot of automation in it. So it usually takes very little time to get started and get those metrics and the catalog up to date.
[00:52:39] Generative AI Use Case in Platform Engineering
Henry Suryawirawan: It sounds like a really great tool, especially if you are a big engineering organization. So you mentioned something about embedding an ML model, right? And obviously, the biggest thing people talk about these days is AI, generative AI. So what are some cool use cases that you have seen, maybe in your own experience as well, using generative AI with an internal developer platform or platform engineering? Any cool use cases you can share today?
Ganesh Datta: Yeah, that’s a really interesting question. So there’s some stuff that we’re working on that I can’t share just yet. But I’ll give you a high-level idea of how we think about generative AI and how it applies to this space, and then some of the cool use cases we’ve seen with it as well.
You know, we have been working with one of the big four consulting firms, and it’s really interesting the way they use generative AI as they’re doing transformation projects and modernization projects. They’re able to use generative AI to rapidly accelerate code transformations, migrations, and things like that. So rather than going in with a ton of manual effort, they build a really powerful toolchain around platform engineering to help with the adoption of their platform. So the platform isn’t about gen AI itself, but it’s, hey, we know this platform is gonna help drive these different use cases and user journeys; can we use generative AI to migrate existing things into the new world and stuff like that? So that was a really interesting use case.
Another thing that I think is really important, and I call this example out deliberately, is that a lot of tools have gone down the path of generic chatbots. It's like, hey, we have all this data, let's just throw a chatbot on it with some RAG. It's a nice demo, but it doesn't actually do anything. From our perspective, the application of generative AI is really gonna shine when it applies to specific use cases and specific types of data. For example, if you're on the SRE team, we've seen folks use generative AI for analyzing their incidents. If you're on the platform engineering team, you can use generative AI to summarize your own feature backlog or the requests from your internal end users. Can you find patterns in feature requests and things like that? It's been a really interesting use case.
It's as if you had a junior analyst on your team that you could hand a problem to and say, hey, can you go figure out the synthesis, and here's how you should think about it. Those are the kinds of things it's really, really good at. Obviously there are coding assistants and the like, but that's fairly obvious. From a platform engineering perspective, it's about getting another brain on your team that can help you reason about questions like: are we working on the right things? Are we finding the right patterns? So that's how we've seen some really interesting use cases.
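As a rough illustration of that "junior analyst" pattern, here is a minimal sketch that feeds raw internal feature requests to an LLM and asks it for recurring themes. It assumes the openai Python client with an OPENAI_API_KEY in the environment; the model name, prompt, and sample data are all illustrative, not a description of what Cortex ships:

```python
# A minimal sketch of the "junior analyst" idea: ask an LLM to find recurring
# themes in raw internal feature requests. Assumes the openai client and an
# API key in the environment; prompt, model, and data are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

feature_requests = [
    "Please let us template new gRPC services, not just REST.",
    "Provisioning a Postgres instance still requires a ticket.",
    "Can workflows support approval steps for prod credentials?",
    "New-service scaffolding doesn't include our lint config.",
]

prompt = (
    "You are an analyst on a platform engineering team. Group the following "
    "internal feature requests into recurring themes, and for each theme "
    "suggest one concrete platform improvement:\n\n"
    + "\n".join(f"- {r}" for r in feature_requests)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The value here is exactly what Ganesh describes: not a generic chatbot, but a targeted synthesis task over a specific kind of internal data.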
Henry Suryawirawan: Right. I don't know what you're building, but I'm guessing that if you had some kind of assistant within Cortex that you could ask questions and use to find patterns, that would definitely be cool. Maybe that's one use case of generative AI.
[00:55:26] 3 Tech Lead Wisdom
Henry Suryawirawan: So Ganesh, it's been a pleasant conversation. I have a tradition in my podcast of asking one last question of all my guests. I call this the three technical leadership wisdom. You can think of it as advice you wanna give to listeners. If you can share your version today, that would be great.
Ganesh Datta: Yeah, absolutely. My first one, and this is one I stand by very much, is: be a good storyteller. It doesn't matter if you're an engineer or an engineering leader. Be a great storyteller. A lot of what you do in any role is telling stories. When you're giving one of your reports advice, you're telling a story: hey, imagine if we improved these things, how much better could it be? What kind of behavior do we wanna see? How do I help my reports see the light? When you're managing up, flagging a risk to your manager about some project, you're telling a story. It's not about making up excuses or fabricating a story. It's about: can I communicate a beginning, middle, and end? Can I be concise? Can I give clear information? Can I bring people with me in my thought process? Being able to tell a story is very important, especially as an engineering leader. When you're making a big decision, how do you help the team see it? How do you explain the trade-offs, the thought that went into it, why we're doing a certain thing, and what it means for the business? Tell a story. It's really, really important.
The second thing I'll say is: deeply understand your product. Especially as an engineering leader, having an above-average understanding of the product you're working on will help you make very quick judgments and reason about the space you're operating in. Of course, you will lean on your own team and the experts on it for tactical advice; they're the experts at the end of the day. But if you have a deep understanding of your product, you're able to reason about trade-offs and route information much more quickly. So deeply understanding your product is the second thing I would say.
And the third thing, and this goes back to the early part of our conversation, is: think about the culture. I know that sounds like a very vague thing to say, but the specific advice I want to give is to think about your operational cadence in particular. The rituals and practices that you set up help dictate the way your organization operates, especially as it gets larger. For example, if you create an operational excellence review, that itself starts to create the conversation: what metrics should we be looking at? How often should we be looking at them? Who should be involved? Are we bringing in the right people? Just the fact that you have an operational excellence review creates this conversation.
But then go one bit more granular: okay, in the operational excellence review, what outcomes do I care about? And how am I gonna structure the meeting agenda to drive those outcomes? Imagine just the fact that you say, hey, we're gonna stop our conversation 10 minutes before the end of the meeting, and the only thing we're gonna talk about in those last 10 minutes is action items. It completely changes the format of the meeting. You're creating a culture where this is not just a place to talk about things; we wanna do something about them.
So how can you structure your operational cadence, your rituals, to create the culture that you want? If you wanna celebrate shipping and create momentum, create a weekly demo meeting for your engineering team. No matter what you've built, you have to come and demo. You just have to demo something. Even if it's an API, come show us your code. It doesn't matter. Show your team what you're working on. Show the interesting things we're doing across the organization. That's a ritual.
Those rituals can also span other teams. Maybe a ritual is that an engineering team and the support team get together for a hackathon once a quarter. What I want out of that is the cross-pollination of information that's gonna reduce the support load and the escalations from support to engineering. That's the outcome I'm thinking about, but I'm creating a ritual that produces that natural cross-pollination between those two teams. What are the actual rituals I can run on a recurring basis so that even if I'm gone, if I'm on PTO or whatever, those systems are in place, the wheels keep turning, and the behaviors still happen the way I think about them even when I'm not in the room? So that would be my third thing: think about culture through the lens of operational cadence. Those would be my three.
Henry Suryawirawan: I really love the last one, because many times when you're in the mode of operations, production, and all that, you think only about problem solving, not necessarily the culture that you wanna build. So be intentional: create a cadence, the rituals, the meetings that you wanna have, the cross-pollination. I think that's really important for any engineering leader.
So Ganesh, if people love this conversation and wanna reach out to you, ask more questions, or find out more about your product and the things you've built, is there a place where they can find you online?
Ganesh Datta: You can find me on LinkedIn under my name. And you can also email me at ganesh@cortex.io.
Henry Suryawirawan: All right, lovely! Thank you so much for today's conversation. I believe people have learned a lot about building engineering excellence, and hopefully they can achieve it. So thanks again for that.
Ganesh Datta: Thank you for having me. I enjoyed the conversation.
– End –