#234 - Building for Reliability: Durable Execution & Insights from Temporal's Report - Preeti Somal

 

   

“Developers spend a lot of time worrying about their code, all of the error conditions. How do they make it reliable? Studies say between 40% and 60% of a developer’s time is spent on all kinds of scaffolding versus the core business logic.”

How much of your code exists only to prevent failures? Discover a new paradigm for building reliable applications.

In this episode, Preeti Somal, SVP at Temporal, explores a paradigm shift that can dramatically boost productivity and give developers peace of mind. Drawing on her experience leading massive infrastructure at Yahoo and HashiCorp, she explains Temporal’s concept of durable execution that helps developers focus on business logic and remove reliability concerns. Preeti also discusses key findings from Temporal’s first State of Development Report.

In this episode, you will learn about:

  • Lessons from operating large-scale systems at Yahoo and HashiCorp
  • Why reliability ranks higher than cost for most engineering teams
  • How durable execution removes reliability complexity from developer concerns
  • Why unlearning old patterns proves harder than learning Temporal’s model
  • Creating a strong incident response culture through blameless post-mortems
  • Nurturing psychological safety in infrastructure teams and on-call engineers
  • Building security and compliance from day one versus retrofitting later

Timestamps:

  • (02:20) Career Turning Points
  • (04:43) Key Learnings from Operating Large Scale Infrastructure
  • (07:56) Key Learnings on Platform Engineering
  • (09:59) Key Learnings on Maintaining High Reliability
  • (12:02) Key Highlights Working at HashiCorp
  • (13:52) Running Infra as Code using Temporal
  • (15:28) Key Principles for Managing a Strong Incident Response
  • (18:37) The Importance of Nurturing Psychological Safety within Infra Team
  • (21:13) Temporal’s State of Development Report
  • (22:39) The State of AI Usage & Adoption
  • (23:54) Using Temporal for Building AI Applications
  • (26:06) The Complexities Involved in Building AI Applications
  • (28:51) Key Learnings from Temporal’s State of Development Report
  • (31:03) The Choice of Developer Tooling Misalignment
  • (33:12) Integrating Security, Compliance, and Cost into Your Engineering Mindset
  • (33:39) Building with Security and Compliance-First Mindset
  • (36:57) Temporal Paradigm Shift
  • (39:14) How Temporal Hides Away The Complexities of Building Reliable Applications
  • (42:47) Unlearning Required for Using Temporal Programming Model
  • (46:33) Getting Started Building with Temporal
  • (48:34) Temporal’s Durable Execution Guarantee
  • (51:23) The Concern About Temporal Lock-In
  • (54:09) Temporal’s Strong Developer Focus
  • (56:16) The Compliance and Security Aspect of Temporal Cloud
  • (58:41) 3 Tech Lead Wisdom

_____

Preeti Somal’s Bio
Preeti is Senior Vice President of Engineering at Temporal. Preeti is passionate about building great products, growing world-class organizations, and solving complex problems. Prior to Temporal, Preeti led the Platform, Security, and IT engineering organizations at HashiCorp. Her extensive career includes engineering leadership roles at Yahoo!, VMware, and Oracle. While at Yahoo!, Preeti was VP of Cloud Services in the Platform organization, delivering highly scalable services used by engineers across Yahoo! to build and operate applications with improved agility, reliability, and security. These services power Yahoo!’s consumer and advertising business.



Quotes

Key Learnings from Operating Large Scale Infrastructure

  • At Yahoo, the scale was literally running the entire set of internal cloud services. I’ll pick on one system, which was the monitoring service. I’m talking about 2008 to 2013, so it’s been a little bit of time. It was called yamas, and it was consuming billions of time series data points on a daily basis. In fact, in terms of the scale, it was getting to a point where we were looking at perhaps taking this data and moving it to the Hadoop big data clusters as well.

  • From a learnings point of view, it’s really about taking scale and reliability seriously, because these are mission critical systems and outages are very, very costly. When I say seriously, I mean in your design, your build, how you iterate, how you test, and what your incident response culture is. All of these pieces ultimately result in a culture of strong engineering ownership.

  • This was one of the things I really learned at Yahoo, that there was a tremendous amount of ownership and pride in the work we did. If you think about it, a lot of the infrastructure work is really invisible. Yahoo would launch a new set of consumer applications, but nobody in infrastructure gets any visibility around that. So you really need to have a culture where your engineers are deriving that satisfaction from the scale and the reliability that they’re able to hit.

Key Learnings on Platform Engineering

  • Platform engineering is now becoming this big pattern. But 10 or 12 years ago, the team I was part of was literally called platforms. A lot of the learnings are still very relevant, and they are really about making sure that you are working very closely with the consumers of your platform.

  • Pick a few use cases. Make your customers successful. Nowadays, we would call it like developer relations or having a product manager on board.

  • Set some goals and targets around adoption. Make sure that you are providing visibility into how you are doing around the objectives that you’ve set out for and talk about them.

  • Going back to the invisible theme we were talking about, one thing we would do at Yahoo was put posters up in the kitchens about the work that the platform team had done, just to create more visibility into the work that you’re doing and the value you are providing as well.

Key Learnings on Maintaining High Reliability

  • I wish, first of all, that we had Temporal back when I was at Yahoo. It would’ve really made my life so much easier back then.

  • Really, I think the pieces we would worry about a lot are the patterns that break. How do we make sure we have fault-tolerant systems? How do we make sure we’ve got all the alerting and monitoring in place so that we can respond and recover quickly? There were thundering herd problems, and making sure that we had all the right rate limits in place so that we weren’t overwhelming downstream systems. All of the patterns that have come up around distributed systems and engineering.

  • The thing that is so magical about Temporal is that the Temporal platform itself takes care of all of these reliability concerns.
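
The rate limits mentioned above are one of the patterns teams typically build by hand. A minimal sketch of a token-bucket limiter follows, assuming a single-threaded caller. The class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical names chosen for illustration; none of this comes from the episode or from Temporal.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter (illustration only).

    Allows bursts of up to `capacity` calls and refills at `rate`
    tokens per second.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a call may proceed now, consuming one token."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `bucket.allow()` before each request to a downstream system and queue or reject the request when it returns False.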

Key Highlights Working at HashiCorp

  • The highlights really were around how strong the HashiCorp community open source presence was. Especially when you talk about Terraform. Terraform has sort of become a verb. It’s a standard and everybody knows Terraform.

  • There is power in taking this concept of infrastructure provisioning and enabling it as code, so that it can go in GitHub and get all of the benefits that code brings, and in the role that the developer plays. Prior to something like Terraform, you had to either go ask for permission or speak to somebody in IT.

  • The interesting piece also is that the technology trend around cloud becoming more ubiquitous was happening around the same time. The highlight really is the community: how we worked with them, how we got contributions, and just the sheer scale of the standardization that’s happened around the HashiCorp products.

Running Infra as Code using Temporal

  • It was definitely great to have the code part of it. But the elements of being able to reason about your infrastructure, and to have those feedback loops on whether things were working or not, scaling or not, are definitely advancing a lot more now than in the early Terraform days.

  • Fun fact is we are also using Terraform here at Temporal. The workflow orchestration around Terraform is being done by the Temporal workflows in the Temporal platform as well. And we can do a lot more there in terms of being able to really see what’s happening in the provisioning flows and where we are running into issues and how we resolve them.

Key Principles for Managing a Strong Incident Response

  • My number one principle there is you have to lead by example. When an incident happens, I get paged as well. I am joining the incident rooms. You really need to show up for your team and help them understand that this is really important. You care. You are there supporting them. I may not be able to actually help with the details, but I am there and that sets a pattern.

  • Second would be really have strong processes in place, and just reinforce those processes every time an incident happens. For us, those processes are around, what’s the on-call chain? How quickly does that on-call get escalated? Where are the runbooks? We’ve put all our runbooks in GitHub so that they’re available to everybody and we can take improvements and enhancements from anyone within engineering.

  • Finally, it’s around being able to measure, report, and build a strong culture around this. One of the things we do at Temporal is every two weeks we have a meeting that’s open to the whole company where we pick some incidents and the owner, the person who was the incident lead basically runs through the incident timeline, the root cause analysis, what worked well, what didn’t, how we’re improving. Just sort of making it really normal to talk about this in a way that is really looking at how we get better. Building that culture through not just engineering, but through the whole company around reliability is really important in having a strong incident response.

The Importance of Nurturing Psychological Safety within Infra Team

  • It’s very important. The way that we approach building that psychological safety is around striving towards a blameless culture, really focusing on, what was the information that the engineer had and how they made decisions? Was there a need to do better in the tooling, the information gathering? And also really around what is it that we learned from this incident.

  • One thing that’s really interesting that we’ve started doing is, at the end of the quarter, we actually send out a survey to all of the on-call engineers, and it’s an anonymous survey. It really is a way for us to gauge not just how they feel about on-call and their effectiveness and their quality of life with on-call, but also the satisfaction that really comes from feeling empowered and feeling that they have that psychological safety as well. That survey has been really well received in that, okay, we care about that on-call experience and we wanna hear from people and we wanna make it better for the engineers.

  • I feel like the more we talk about this, the more we can share and learn from others as well. So I’m always open to new ideas around how we can do better.

Temporal’s State of Development Report

  • We’re super excited to have it out. Our focus at Temporal is around making it easier for developers to build these reliable mission critical applications. And we have a really strong community of developers that is growing.

  • What we wanted to do with the survey was really tap into the knowledge of these engineers and developers and try to understand, especially with AI, what are some of the patterns and trends? How are they viewing the technology that they have? What are their challenges? What’s important to them? That was really the impetus for getting the survey out. We hope to do this regularly so we can hear back in a very data-driven way and then publish and share those learnings as well.

The State of AI Usage & Adoption

  • It’s interesting that the data is a little bimodal, in the sense that 94% of respondents reported using some kind of AI tool, whether it’s a Copilot kind of tool or ChatGPT. There’s some sort of usage of AI tooling in their workflows.

  • But only about 39% said that they’ve built any sort of major AI project themselves at scale. So the usage is definitely increasing. But building the infrastructure to use it more natively in the work that they’re doing, or in the customer applications they’re delivering, is still early.

Using Temporal for Building AI Applications

  • Within Temporal, we are certainly also doing a lot of work with some of the tooling. One really interesting thing is we are seeing a lot of developers, engineers, companies, building AI tools on Temporal.

  • One of the biggest things that we are seeing is that reliability is really, really important for AI and agents to really be adopted in widespread manner. It’s really interesting because these problems around reliability with AI agents are all very similar patterns to the reliability of distributed systems applications. And so we’re seeing a lot of customers really coming to Temporal for solving these reliability problems.

The Complexities Involved in Building AI Applications

  • When you think about the core of it, what the agentic AI system is doing, it’s really looking at what is a customer problem at hand that they want to solve. Whether it’s booking a flight or pick any example. And then looking at calling out to one of the models around what are the various options, what is the recommendation, and then calling to some tools around making the booking or the changes.

  • It’s essentially a multi-step process where each step can be error prone. It’s very common, for instance, to call out to a model, get rate limited, and not get your response back. Each of these steps is somewhere you could run into issues. Some of these steps could be fairly long running, especially when agents are learning based on your usage pattern or when you need a human to do some approvals in that workflow.

  • The big reliability issue here is that if any one of those steps crashes or returns errors, what do you do? Do you start all over again? We know that some of these steps are really expensive as well. This is where we’ve been really fortunate to have, in Temporal, a product that has been running in production at scale. The core of the problem that Temporal solves is how you take this multi-step process and do all of the reliability, state management, and recovery pieces within the Temporal platform, versus every engineer having to worry about it.
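
To make the rate-limited model call concrete, here is a minimal sketch of the retry scaffolding a developer would otherwise write by hand around each step. It is not Temporal code. The `RateLimited` exception, the `call_with_retries` helper, and its parameters are all hypothetical names, and the exponential backoff schedule is an assumption chosen for illustration.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a model client might raise when throttled."""


def call_with_retries(step, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `step()`; on RateLimited, wait with exponential backoff and retry.

    Re-raises RateLimited once `max_attempts` attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except RateLimited:
            if attempt == max_attempts:
                raise
            # Delay doubles on each failed attempt: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Every step of the agent flow (the model call, the booking tool call) would need a wrapper like this, plus somewhere to store progress between steps, which is the kind of boilerplate the quote describes moving into the platform.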

Key Learnings from Temporal’s State of Development Report

  • One of the key learnings from the report is how important reliability is. About 36% of the engineers and leaders that were surveyed had reliability and compliance as their top needs, even more than cost or performance.

  • The other finding is around how engineers make tooling decisions versus the people who have the budget. How they make the tooling decisions continues to remain misaligned. This is an age old problem. But that came up in the survey as well.

  • Last but not least is around how failures represent real business risk, and we all know that an outage is potentially an existential threat for a company. The survey validated that these failures come with this business risk that is sometimes hard to quantify. But it’s a very real feeling and effect of a failure.

The Choice of Developer Tooling Misalignment

  • On the misalignment, honestly, we are not sure. The survey, besides highlighting the fact that there is misalignment, didn’t give us insight into why.

  • The key here is to empower the developers. Our suspicion is that some of this stems from all of the compliance pieces that organizations a lot of times have to follow, where some of the tools that developers want to get might not have the SOX compliance or the HIPAA or whatever other compliance pieces are in play for your company.

  • One of the things that is very real in our industry is in order to really get broader adoption within these bigger, more regulated companies, how can some of the new tooling fit into their compliance and security frameworks as well? I suspect that some of that misalignment is coming from some of these requirements that the developer may not see, but the decision maker has to comply with.

Building with Security and Compliance-First Mindset

  • I feel strongly that you really need to be clear about what your monetization strategy is going to be. And if you are really looking at an enterprise market, you have to have the security, compliance pieces in mind.

  • Security these days is a no-brainer. You really need to be thinking security first. Compliance is something that you could argue, like SOC 2 compliance, can come as you mature. But your architecture and your design for the platform and the product you’re building really need to be very security aware and security first.

  • For instance, one example I’ll give you is Temporal. It is an open source tool, but we actually don’t see any of the customer’s code or the customer’s data. That all runs in the customer’s infrastructure. That design decision was made very, very early on, and it has really resulted in much faster adoption, because it took a whole conversation off the table when it came to a using and buying decision.

  • My advice basically would be you have to think about security from the ground up. You have to make sure you’re not going through what I’d call a one-way door on compliance. But you don’t necessarily need to have all of the compliance pieces lined up early on.

  • On cost, it’s really important to think about how you’re going to monetize. At the end of the day, we know all these tools are competing for some share of budget. Being very clear on the value your tool brings versus the cost and how that lines up. And then, also working hard towards lowering costs for developers.

Temporal Paradigm Shift

  • Temporal is a paradigm shift. Essentially, today developers spend a lot of time worrying about their code, all of the error conditions. How do they make it reliable? There are some studies that would say anywhere between 40% to 60% of the developer’s time is on all of the scaffolding versus the core business logic.

  • The paradigm shift with Temporal is that we take care of all of that for you. So basically we shield the developer from all of the complexities of running and building reliable code, and the developer can focus on their business logic.

  • The way that we do this is that you code to our SDKs, and the Temporal server handles all of the task dispatching, the retries, and all of the state management, all these pieces that are often distributed over lots of systems and very hard to manage and trace through. We do all of that for you.

How Temporal Hides Away The Complexities of Building Reliable Applications

  • Our two co-founders have been working on this problem for a lot of years. They got together at Uber and built the open source project that then became Temporal. A lot of their lessons around running distributed systems at scale are what Temporal solves for. The main point here is that they’ve been thinking about this problem for about 15 to 20 years. So it’s not an easy problem to solve, and it’s definitely one where the solution has been production tested.

  • How we do that is essentially we introduce some abstractions around what we call a workflow, an activity, and a task. With these abstractions, we create a programming model where the developer is just focusing on their business logic. They have some rules that they need to abide by. For instance, anything that is error prone, they would put in an activity. These abstractions then let us do things like retries. We save the state of the execution of that workflow, so we know at any given point in time where that code is, what has passed, and what’s coming next. If something were to crash, we have a concept called replay where we can go in and run again.

  • There’s a lot of power in the programming model. The core of the paradigm shift is that there are a few abstractions that developers need to learn, and then we take care of the rest.

  • Very frequently what we hear developers say is that understanding that paradigm shift was a little bit of effort. Because their natural instinct is, oh, where is my, if error do x block? But once they get it, then they say that they can’t ever look back. It’s just such an elegant model that then they are like Temporal developers for life.
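
The workflow, activity, and replay ideas above can be illustrated with a toy model. This is a deliberately simplified sketch and not Temporal’s SDK or implementation: the `History` class, `run_activity`, and `booking_workflow` are hypothetical names, and the "durable" log is just an in-memory dict standing in for state that a real platform would persist.

```python
class History:
    """Hypothetical log of completed step results.

    An in-memory dict here; a real system would persist it durably.
    """

    def __init__(self):
        self.results = {}


def run_activity(history, name, fn, *args):
    """Run an error-prone step once and record its result under `name`."""
    # On replay, return the recorded result instead of re-running the step.
    if name in history.results:
        return history.results[name]
    result = fn(*args)
    history.results[name] = result
    return result


def booking_workflow(history, search, book):
    """Business logic only: search for options, then book the first one."""
    options = run_activity(history, "search", search)
    return run_activity(history, "book", book, options[0])
```

If `book` crashes, calling `booking_workflow` again with the same `History` skips the already recorded `search` step and resumes at `book`, which is the intuition behind replay. The workflow function itself contains no error handling, mirroring the point that the error-prone work lives in activities.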

Unlearning Required for Using Temporal Programming Model

  • It’s not difficult to learn, but it is difficult to unlearn the patterns from wherever you came from. What I mean by that is what we see is that developers go through a journey where at some point they don’t believe it and then there’s an aha. And then they’re like, oh my God, is this for real? And then they’ll try it. Once they really understand it, then there is a tremendous amount of excitement.

  • The programming model itself is actually really simple, really elegant. But what typically happens is that developers who are new to Temporal expect to have to do all of this work around reliability, and unlearning those habits takes some time.

  • One thing that we are trying to do, and we’ve put a lot of focus on, is do more community education. We have a Slack channel where our community joins in and shares what they’ve built. We’ve got something called community exchange where developers can share the code that they’ve built and link to their GitHub repo. So we’re really working towards just showing more code and doing more education around the model.

Getting Started Building with Temporal

  • One is definitely join our community. Second is sign up for Temporal Cloud. For signup, we have credits that we give to developers and we have a quick start guide that helps you get set up and run samples.

  • The third thing would be there’s a lot of talks that our customers, developers who have built on Temporal have done at our conferences. Just listen to another developer talking about how they’re using Temporal. That really will help you get started because then you can start envisioning how you could bring Temporal to the problems you have at hand.

  • What we hear a lot is not just that development teams can go faster. For instance, we’ve had customers tell us that they’ve been able to get 6x developer productivity because all of that boilerplate is gone. It is also this aspect of peace of mind and knowing that reliability is taken care of. Failures happen. We know that in any distributed system, failures will happen. But recovering from those failures, Temporal will handle that for you. That element of how much peace of mind this brings to the teams that are on-call is also a really significant part of the value we believe Temporal brings.

Temporal’s Durable Execution Guarantee

  • In the early days of Temporal, the natural instinct for developers was to try and compare us with technology they know. Are you a message bus? No. Are you like a Visual Studio type thing? No. Durable execution for us really encapsulates the core value that once you’ve built your code using the Temporal model, we will execute it for you. That durability guarantee is really, really strong.

The Concern About Temporal Lock-In

  • The way that we live by our values is around the open source commitment. Essentially, our guarantee to a developer who’s building with the Temporal SDK is that they can run their code on the open source server in their own infrastructure, or they can bring it to cloud without any code changes. We guarantee that code compatibility between Temporal Cloud and open source.

  • From a vendor lock-in point of view, the programming model is sticky, so if you wanna use a different programming model, you will have to go change your code. But you don’t have to rely on Temporal the company, because the server is available in open source. And we have a very strong commitment to code compatibility between cloud and the open source server.

Temporal’s Strong Developer Focus

  • Our commitment is really to make developers’ lives easier, to help them just focus on their business logic and not have to worry about all of these really complex, messy pieces we’ve been talking about. We’re committed to open source. We are available under the MIT license. We have a really strong community that is growing quite a bit.

  • Our focus on developers is so incredibly strong that we build our SDKs across multiple languages. So we have Python, TypeScript, Java, Go, .NET as well. We wanna meet the developers where they are and we’re truly driven by making their lives better and more productive.

  • On the Temporal Cloud side, we are available in multiple regions across the globe. Our goal with Temporal Cloud is to run the Temporal service so that you don’t have to worry about the reliability of the Temporal service itself.

  • Become a part of the community. The best way to learn is by looking at samples and listening to others.

The Compliance and Security Aspect of Temporal Cloud

  • The first piece to understand is that when you build your code using our SDKs, you are building what we call a worker, and the worker is running in your infrastructure. We don’t see your code. What we are doing on the Temporal Cloud side is essentially dispatching the tasks and the retries.

  • That immediately overcomes a lot of the questions around, do we see your data, do we see your code? The answer is no. The task dispatching pieces obviously need connection that would typically go over something like a private link. On the Temporal Cloud side, we are SOC 2 certified.

  • We have had some requirements from customers around having their disaster recovery and primary region in the same country. We can build a region wherever the cloud providers exist. This elegance of the model around security has really enabled us to move much faster and make inroads in banks using Temporal Cloud pretty quickly as well.

3 Tech Lead Wisdom

  1. Being clear about the problem you’re trying to solve.

    • Think deeply and from a first-principles viewpoint.

    • You are solving the problem in a way that’s very impactful. And really hold that high bar around the problem and how you solve it.

  2. Don’t shy away from hard things.

    • Sometimes it’s easy to take shortcuts, but we’re in this for the long term.

    • Solve the hard problems and really don’t lose track of that long term piece that you’re building.

  3. Your team is the key to solving any technology problem.

    • The team that you have and the people you work with, they are critical in solving any technology problem.

    • Your engineering team, the talent, is your biggest asset. It’s how you can solve these problems.

    • So make sure that you are hiring well, you’re taking care of your team and your engineers, and you’re motivating them and helping them understand the big picture and empowering them to have an impact.

Transcript

[00:01:28] Introduction

Henry Suryawirawan: Hello, guys. Welcome back to another new episode of the Tech Lead Journal podcast. Today, I have with me an SVP at Temporal, a company that I’ve been hearing a lot of good things about in recent months, right? Preeti Somal is here with me today. I’m really looking forward to learning from you, from your career journey and about Temporal. And I know that Temporal just recently launched a State of Development Report as well. So yeah, looking forward to our chat today, Preeti.

Preeti Somal: Yeah, likewise. Thank you for having me on your podcast, Henry. And uh, I’m really excited to talk more about myself and technology and Temporal.

[00:02:20] Career Turning Points

Henry Suryawirawan: Yeah. So maybe let’s start from yourself first, right? I always love to invite my guests to share some career highlights or turning points, learnings that you have throughout your journey that you think we all can learn from you.

Preeti Somal: Yeah, that sounds great. I think, for me, uh, you know, I started off my career in enterprise software. I started in Oracle, spent some time at VMware. I would say though, that the turning points were sort of the three roles after. The first one, of course, was Yahoo. And, um, Yahoo for me was the first time that I was on the other side, both building and operating tremendous web scale infrastructure for mission critical services. Uh, and so it was my first sort of opportunity to really learn about running services at scale using open source. Honestly, Yahoo, as you probably know is a big user and contributor back to open source. And so that was fantastic learning. My customers at Yahoo were other engineering teams, so really the developers within Yahoo that used all of these platforms I was building.

From there, my next turning point was HashiCorp. And, uh, I got to HashiCorp early in 2018 when the company was still 150 people. And really I think it was an amazing journey, again, open source developer, but also scale how we scaled the organization, the teams, and continued to really remain close to the developer community. So I spent five years there.

And when I left HashiCorp, uh, you know, this opportunity at Temporal came up. And we at HashiCorp were using Temporal. And so I was really familiar with the technology and, uh, another turning point because we run Temporal as a cloud service. And so mission critical scale reliability is like our number one value. And just learning to see how our customers rely on Temporal has really brought me a lot of perspective on the place we have in that technology stack.

[00:04:43] Key Learnings from Operating Large Scale Infrastructure

Henry Suryawirawan: Wow, I think, uh, we can see like quite an illustrious career so far, right? So I think, and also you joined at the right time, I would say, right? So maybe Yahoo back then was still kind of like big, right? And then moving to HashiCorp, we know where HashiCorp is now, right? And then lastly, joining new pretty exciting, uh, companies, right? I think it’s gonna be a big thing as well, Temporal.

So yeah, maybe we can learn a little bit, from these three companies, uh, in your career, right? So you are operating and running, you know, a big infrastructure. Maybe, cloud, uh, engineering teams. Uh, what are some of the key learnings, you know, seeing because you do not just handle a smaller scale, kind of a traffic and kind of infrastructure. So tell us a little bit more about this journey. What kind of scale you were running back then, and what are some of the key learnings?

Preeti Somal: Yeah, yeah, absolutely. Uh, so at Yahoo, the scale was literally running, uh, you know, the entire sort of internal cloud services. And I’ll pick on one system, which was the monitoring service. And, uh, this monitoring service. And I’m talking about, you know, 2008 to 2013, so it’s been a little bit of time. But it was called yamas and it was consuming billions of sort of time series data points on a daily basis. And in fact, in terms of the scale, it was getting to a point where we were, uh, really looking at perhaps taking this data and moving it to the Hadoop big data clusters as well. Uh, so that’s a quick idea on scale, but I think from a learning’s point of view, it’s really around taking kind of scale reliability seriously, because these are mission critical systems and outages are very, very costly. And when I say seriously, I mean both in your design, build, how you iterate, how you test, what your incident response sort of culture is. All of these pieces ultimately result in a culture of like strong engineering ownership.

And this was one of the things I really learned at Yahoo, that there was a tremendous amount of ownership and pride in like the work we did. ‘Cause you think about it, a lot of the infrastructure work is really invisible, right? Like Yahoo would launch a new consumer set of application. But, you know, nobody in infrastructure gets any visibility around that. And so you really need to have a culture where your engineers are deriving that satisfaction because of the scale and the reliability that they’re able to hit.

Henry Suryawirawan: Yeah, I was about to say that when you mentioned about invisible, right? Yes, pretty much when things are all right. But when things are not all right, like outages, I think pretty much infra team is the most visible. Uh, I had that experience as well back then, um, in my previous company. And yeah, I think it’s uh, quite funny when you said that, right?

[00:07:56] Key Learnings on Platform Engineering

Henry Suryawirawan: So I think back then you kind of like building this internal engineering platform, I would say, right, platform engineering is quite famous term these days and internal developer portals. So you were running it back then long time ago. So any kind of learnings that you think are still pretty relevant to what we call platform engineering now?

Preeti Somal: Absolutely! And it’s actually really interesting because platform engineering now is becoming like this big sort of pattern. But, you know, back, you know, 10, 12 years ago, like literally the team I was part of was called platforms, right? And so a lot of the learnings I think are very relevant, which is really around make sure that you are sort of working very closely with the consumers of your platform. Pick a few use cases. Make your customers successful. Nowadays, we would call it like developer relations or you know, having a product manager on board. Set some goals and targets around adoption. Make sure that you are providing visibility into how you are doing around kind of the objectives that you’ve set out for and talk about them. You know, going back to the invisible, uh, sort of theme we were talking about. One thing we would do at Yahoo was we would put posters up in the kitchens about like the work that the platform team had done and just, you know, create more visibility into the work that you’re doing and the value you are providing as well.

Henry Suryawirawan: Wow, I think, uh, it’s pretty interesting, right? The putting the posters, uh, for your internal platforms. I remember Google also has this something they call testing in the toilet kind of thing. So…

Preeti Somal: Right.

Henry Suryawirawan: I used to work at Google as well, so whenever we went to toilet you can see some posters about testing patterns or these, you know, good design pattern for testability. So I think it is pretty similar thing. I guess like you should put visibility so that people kind of like unexpectedly learn about your platform.

[00:09:59] Key Learnings on Maintaining High Reliability

Henry Suryawirawan: When you mentioned about reliability as a key thing. Back then Yahoo has a lot of engineering teams, I suppose, globally, right? And a lot of internet scale as well. What kind of key things that you think is the most critical thing that people should know about reliability and how to kind of like do a good pattern to ensure that reliability is kind of high?

Preeti Somal: Yeah. Um, great question and you know, I think reliability for me has been an area that I now spend a lot of time in, given my role at Temporal. I wish, first of all, that we had Temporal back when I was at Yahoo. It would’ve really made my life so much easier back then. But really I think the pieces we would worry about a lot are the patterns that break. How do we make sure we have like fault tolerant systems? How do we make sure we’ve got, you know, all the alerting and monitoring in place so that we can respond and recover quickly? Thundering herd problems, you know, making sure that we had all the right, like, rate limits in place so that we weren’t overwhelming downstream systems. All of the patterns that, you know, have kind of come up around distributed systems and engineering. And just to segue to my current role at Temporal, uh, the thing that is so magical about Temporal is that the Temporal platform itself takes care of all of these reliability concerns. And that’s sort of why I was saying, you know, this has followed me to my current role as well.
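
The rate limits mentioned here, used to keep callers from overwhelming a downstream system, are often built as a token bucket. The following is a minimal sketch of that general technique, not a description of what Yahoo or Temporal actually ran; the class name `TokenBucket` and its parameters are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`.

    Hypothetical helper for illustration only.
    """

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True if the caller may proceed, False if it should back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `try_acquire()` before each downstream request and delay or drop the request when it returns `False`.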

Henry Suryawirawan: Yeah, I think it’s pretty hard to run highly reliable systems, especially when you have a distributed system, many components running altogether, right? Just to piece together what is actually happening for serving one request, for example, I think it’s also not easy. Uh, and the nature of, you know, high traffic as well makes it even more difficult.

[00:12:02] Key Highlights Working at HashiCorp

Henry Suryawirawan: So maybe let’s go back a little bit to HashiCorp experience, right? I don’t know whether HashiCorp back then runs, uh, like a high, highly scalable system, right? From my point of view, it’s mainly the open source products that they have, right? Starting with Terraform, you know, Vault, uh, and all that. But lately, they have also cloud platform. So tell us, what was your main highlights when you work in HashiCorp? What kind of things that you think are your biggest achievement there?

Preeti Somal: Yeah, I think for me, the highlights really were around how strong the HashiCorp community open source presence was. Especially when you talk about Terraform. You know, Terraform has sort of become a verb. It’s a standard and everybody knows Terraform. And so I think the power of like taking this concept of infrastructure provisioning and enabling it as code so that it can go in GitHub and, you know, all of the benefits that code brings forward, right? And the, the role that the developer plays versus, you know, prior to something like Terraform, you had to either go ask for permission or you had to speak to somebody in IT. And the interesting piece also is that the technology trend around cloud becoming more ubiquitous was also happening around the same time, right? So, um, to me, I think the highlight really is community, how we worked with them, how we got contributions, and just the sheer sort of scale of the standardization that’s happened around the HashiCorp products as well.

[00:13:52] Running Infra as Code using Temporal

Henry Suryawirawan: Yeah. And how do you see the, you know, the kind of like the infrastructure as code side, right? So you were running platform engineering back then. How did you automate some of these, you know, server creations and all that? And how do you think tools like Terraform, you know, like infrastructure as code, maybe these days is more declarative as well. Like how do you see some kind of things that uh, you think will be great to have back then?

Preeti Somal: Oh, absolutely. I think, uh you know, I think the parts around… You know, it was definitely great to have like the code part of it. But the elements of being able to reason about your infrastructure and to be able to have those feedback loops as things were working or not working, scale or not. You know, I think those pieces are definitely advancing a lot more now than sort of in the early Terraform days.

Fun fact is we are also using Terraform here at Temporal. But the workflow orchestration around Terraform is being done by Temporal, the Temporal workflows in the Temporal platform as well. And we can do a lot more there in terms of being able to really see what’s happening in the provisioning flows and where we are running into issues and how we resolve them.

Henry Suryawirawan: Wow, it’s pretty interesting. Uh, so using Terraform in your workflow as well. So I must assume it’s more like a dynamic infra provisioning and all that, that is part of the Temporal workflow.

[00:15:28] Key Principles for Managing a Strong Incident Response

Henry Suryawirawan: So you mentioned about incident response, I think for someone running infra engineering org, especially a big scale, right? Definitely you cannot run away from like outages, issues, incidents, and all that. So tell us a little bit, how do you build such ownership and, you know, maybe a great incident response as part of your learnings in all these organizations?

Preeti Somal: Yeah. I think my number one principle there is you have to lead by example. And so you know, when an incident happens, I get paged as well. I am joining the incident rooms. And you really need to show up for your team and help them understand that, you know, this is really important. You care. You are there supporting them. I may not be able to actually help with the details, but I am there and that sets a pattern. So I’d say lead by example.

Second would be really have strong processes in place, and just reinforce those processes every time an incident happens. Uh, so for us, those processes are around, okay, what’s the on-call chain? How quickly, uh, does that on-call get escalated? Where are the runbooks? We’ve put all our runbooks in GitHub as well, so that, you know, they’re available to everybody and we can take sort of improvements and enhancements from anyone within engineering. So have a really, really strong sort of process in place.

And then finally, it’s around being able to measure, report, and build a strong culture around this. So one of the things, for instance, that we do at Temporal is every two weeks we have a meeting that’s open to the whole company where we pick some incidents and the owner, the person who was kind of the incident lead basically runs through the incident timeline, the root cause analysis, what worked well, what didn’t, how we’re improving. And just sort of making it, uh, really normal to talk about this in a way that is really looking at how we get better. Uh, so building that culture through not just engineering, but through the whole company around reliability is really important in having a strong incident response.

Henry Suryawirawan: Yeah, so I must also highlight the, you know, the RCA, the root cause analysis, post-mortem culture part, right? Because if we don’t normalize talking about it, making sure that, you know, first, we handle it well, so hopefully the incident got resolved, right? And then what key learnings from there, how we can iterate and have this feedback loop so that we can improve the process. And I think the most important thing is probably the so-called the psychological aspect and kind of like making sure people are not very, very disappointed about failures, right? So obviously we don’t want to have failures, but the key thing is how can we learn and improve the processes from there?

[00:18:37] The Importance of Nurturing Psychological Safety within Infra Team

Henry Suryawirawan: So maybe anything about psychological safety, especially infra team, right? They always get a lot of scrutiny by everyone when things are going wrong and maybe people think, oh, you’re not doing your job well. So tell us a little bit more on this side. Like, how do you ensure the infra engineers feel safe? And they are okay whenever incidents happen.

Preeti Somal: Yeah, absolutely. It’s really important as you, as you know, and I can see that you’ve kind of been through this and you feel it. It’s very important. The way that we approach building that sort of psychological safety is around really, you know, striving towards a blameless culture, really focusing on, you know, what was the information that the engineer had and how they made decisions? And was there sort of, was there a need to do better in the tooling, the information gathering? And also really around what is it that we learned from this incident.

And then one thing that’s really interesting that we’ve started doing is, at the end of the quarter, we actually send out a survey to all of the on-call engineers, and it’s an anonymous survey. But it really is a way for us to gauge kind of like, not just how they feel about on-call and their effectiveness and kind of their sort of quality of life with on-call, but also the satisfaction that really comes from feeling empowered and feeling that they have that psychological safety as well. And that survey is, has been really well received in that, okay, you know, we care about that on-call experience and we wanna hear from people and we wanna make it better for the engineers. Uh, so I think these are some of the things that we’re doing. And of course, I feel like the more we talk about this, the more we can share and learn from others as well. So I’m always open to new ideas around how we can do better.

Henry Suryawirawan: Yeah, I think the survey part is really important, right? Especially people may not talk. Especially this part of the world, right? I’m in the Asian part. So some people actually don’t like to talk about failures, you know, their frustrations, so they think it’s a part of the job. And I think anonymous surveys, probably every leader can do, right? Just, you know, build that culture where people can submit their feelings, their feedback, their frustrations anonymously and maybe, yeah, we can take actions around that.

[00:21:13] The Temporal’s State of Development Report

Henry Suryawirawan: So let’s move on a little bit to another topic that we would like to discuss today. I know that Temporal recently just released this State of Development Report, right? Every time I’m seeing in the industry, there’s always this new state of something report, right? It’s very exciting. But this is, I think the first time Temporal is doing it. So tell us a little bit more about this report and why are you guys publishing this report?

Preeti Somal: Yeah, absolutely. We’re super excited to have it out. And uh, essentially, you know, our focus at Temporal is around making it easier for developers to build these reliable, uh, mission critical applications. And we have a really strong community of developers that is growing. So what we wanted to do with the survey was really sort of tap into the knowledge of these engineers and developers and really try and understand, especially with AI, you know, what are some of the patterns and trends? How are they viewing the technology that they have? What are their challenges? What’s important to them? And so that was really kind of the impetus of getting the survey out. And we hope to do this regularly so we can kind of, you know, hear back in a very data-driven approach and then publish and share those learnings as well.

Henry Suryawirawan: Wow. Yeah. Looking forward to learning from this report. So…

Preeti Somal: Yeah.

[00:22:39] The State of AI Usage & Adoption

Henry Suryawirawan: You mentioned about building, you know, maybe building and using AI kind of thing, right? So that’s pretty trendy these days, right? So. Um, do you see, first of all, the usage of AI? You know, the adoption rate is really high, as part of your survey? And secondly, how many teams are actually building AI-related, you know, systems or solutions in their, you know, engineering teams?

Preeti Somal: Yeah, great question. So it’s interesting that the data is a little bimodal in the sense of, it’s showing that 94% of the respondents report they’re using some kind of AI tools. You know, whether it’s like a Copilot kind of tool or ChatGPT. You know, there’s some sort of usage of AI tooling in their workflows. But only about 39% said that they’ve sort of built any major AI projects themselves at scale. So the usage is definitely sort of increasing. But, like, the infrastructure to use it more sort of natively in the work that they’re doing or the customer application they’re delivering, that is still early.

[00:23:54] Using Temporal for Building AI Applications

Henry Suryawirawan: Yeah, so I would say many people would have adopted, you know, some kind of AI tools, especially the chat part, obviously it’s quite common, ubiquitous, right? So what’s your takeaway as a leader, right? So do you use also AI often and can we actually use AI reliably for infrastructure-related stuff?

Preeti Somal: Oh, I wish. I think, uh. So within Temporal, we are certainly also doing a lot of work with some of the tooling. One really interesting thing is we are seeing a lot of developers, engineers, companies, building AI tools on Temporal. And one of the biggest things that we are seeing is that reliability is really, really important for kind of AI and agents to really be adopted in any kind of a, sort of, uh, widespread manner. It’s really interesting because these problems around reliability with AI agents are all very, very similar patterns to the reliability of distributed systems applications. Uh, and so we’re seeing a lot of customers really coming to Temporal for solving these reliability problems. Now on the other hand for your question, sort of, are we using any infrastructure AI tooling at the moment? No. Um. We may, we may build some ourselves for our own uses because we haven’t seen anything coming out that we think really fits our needs. Um, but you know, we don’t have anything in production that we use internally at the moment.

Henry Suryawirawan: Right. So yeah, definitely very interesting to see this space. I’m sure there will be lots of new inventions coming. Troubleshooting, debugging and all that, I think is still pretty much the, maybe the most use case that people use, right. Especially when you see kind of like peculiar error messages. Sometimes like we don’t know what is going on with that product, right? So I think having an AI can help us.

[00:26:06] The Complexities Involved in Building AI Applications

Henry Suryawirawan: So, um, you mentioned about building, operating AI kind of like systems, right? Personally, I haven’t done it myself, right? So what do you think are some complexities that involved in, you know, building AI systems, maybe agentic systems as well that, you know, requires like good reliability?

Preeti Somal: Yeah, I think the, you know, when you think about like at the core of it, what the agentic AI system is doing, it’s really, uh, sort of looking at what is a sort of customer problem at hand that they want to solve. You know, whether it’s like booking a flight or pick any example. And then looking at kind of calling out to one of the models around what are the various options, what is the recommendation, and then calling to some tools around making the booking or the changes. You know, it’s a… We think of it as, it’s essentially like a multi-step process where each step can be error prone. It’s very common, for instance, to call out to a model and get rate limited and not get your response back, right? So the, each of these steps is something that you could run into issues. Some of these steps could be fairly long running, especially, you know, one of the things that is coming up is like agents that are learning based on your usage pattern or when you need a human to do some approvals in that workflow.

And so the big sort of reliability issue here is that if any one of those steps crashes or returns errors, what do you do? You know, do you start all over again? And we know that some of these steps are really expensive as well. And so this is where, you know, we kind of have been really fortunate to have a product that has been running in production at scale with Temporal, where the core of the problem that Temporal solves is how do you take like this multi-step process and you do all of the reliability, state management recovery pieces within the Temporal platform versus every engineer having to worry about it.
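
The rate-limited model call described above is a good example of the scaffolding each engineer otherwise writes by hand. A minimal sketch of retrying with exponential backoff and jitter follows; `RateLimited` and `call_with_backoff` are hypothetical names for illustration, not part of any model provider's or Temporal's API.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when a model endpoint rejects a call."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay,
            # so that many clients retrying at once do not all hit the service together.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Note what this sketch does not cover: if the process crashes between steps, the retry state and any earlier results are lost, which is the "do you start all over again?" problem raised above.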

Henry Suryawirawan: Yeah, so I think agentic systems definitely require a lot of multi-step. Sometimes even the steps are not well defined in the first place because the model will kind of like build, you know, what steps that, uh, it needs to, you know, do in order to solve the problem. And I can see like the complexities, right? So there’s a non-deterministic thing happening as well. And plus you dunno the kind of failure, I dunno, patterns that might happen, uh, with all those steps.

[00:28:51] Key Learnings from Temporal’s State of Development Report

Henry Suryawirawan: So I wanna save the time to talk about Temporal at the end. But what are the key highlights or key learnings that you think, we can talk about from the report?

Preeti Somal: Yeah, I think, one of the kind of key learnings from the report that actually mirrors a lot of our conversation is how important reliability is. And so, you know, about 36% of the engineers and leaders that were surveyed kind of had reliability and compliance as their top needs, even more than, for instance, cost or performance. So that was one finding from the survey.

The other finding is, this is not going to be new to you and me, but it’s around how engineers make tooling decisions versus, uh, the people who have the budget. You know, how they make the tooling decisions and that continues to remain misaligned. I think this is an age old problem. But that kind of came up in the survey as well.

And, uh, you know, last but not least is, again, not a surprise, but it’s around how failures represent like real business risk, right? And we all know that an outage is potentially like an existential sort of threat for a company. And so again, the survey validated that these failures come with like this business risk that is sometimes hard to kind of quantify. But it’s a very, very real, uh, sort of feeling and, uh, effect of a failure.

Henry Suryawirawan: Right. I must say that I’m pretty, uh, kind of like not surprised by this. You know, reliability is the key thing. You know, compliance, especially, uh, security, privacy and all that is definitely top of mind. And that failures, maybe incidents result in like business kind of, uh, risk, right? Especially if you are running like AI agentic system, first of all, how do you know it’s accurate? How do you know it’s reliable and all that? So definitely those part is, I think it’s unsurprising.

[00:31:03] The Choice of Developer Tooling Misalignment

Henry Suryawirawan: But what still surprised me is the choice of developer tooling, like what you mentioned, the misalignment. I know that it happens, uh, throughout various different organizations. But what I can say is like, typically, like if the developers choose the tooling, right, it can help a lot in terms of, you know, choosing the best tool that can serve the purpose really, really well. Maybe in terms of, uh, adoption rate as well, it will be much, much higher. So tell us why, you know, the decision makers or the executives still think they can choose, you know, the best tools on behalf of the engineering team. So tell us a little bit more about this problem, misalignment, how we can solve this?

Preeti Somal: Yeah. So first isn’t it great when the survey kind of validates how you’re thinking about a problem, right? But second, on the misalignment, honestly, we are not sure. Like the survey didn’t, besides highlighting the fact that there is misalignment, didn’t give us insight into why. And I think, you know, I, for instance, a hundred percent agree with what you’re saying, uh, you know. The key here is to empower the developers. Our suspicion is that some of this stems from all of the compliance pieces that organizations sometimes, you know, a lot of times kind of have to follow, where some of the tools that developers want to get might not have the SOX compliance or the HIPAA or, you know, whatever other compliance pieces are in play for your company.

Yeah, so I think, I think one of the things that is very real in our industry is in order to really get broader adoption within these bigger, more regulated companies, how can some of the sort of new tooling fit into their compliance and security frameworks as well? So I suspect that some of that misalignment is coming from some of these requirements that the developer may not see, but the decision maker has to comply with.

[00:33:12] Integrating Security, Compliance, and Cost into Your Engineering Mindset

Henry Suryawirawan: Yeah, so compliance, security, definitely, uh, typically top of, you know, their concerns. And the second one is probably cost, right? Because some of these developer tools may be deemed as expensive. Even though sometimes it’s really hard to quantify the, you know, the value coming out from the cost, right? If it can save a lot of, you know, I dunno, engineering hours, I think that tool is worth that much, right? Um, but definitely, yeah, these are some of the things I also experienced myself back then, right?

[00:33:39] Building with Security and Compliance-First Mindset

Henry Suryawirawan: One concern is always about, you know, like let’s say the management always think, okay, this tool is not compliant, not secure. The cost is pretty high. But every time I see new technologies coming. Obviously they don’t put a lot of focus on this area first. Because like if you think about open source, it’s more about functionality first, features, developer experience first maybe. So tell us how for engineers who build these solutions to have this mindset about compliance, security, and maybe cost as part of their design as well.

Preeti Somal: Yeah, it’s an excellent observation and I think, you know, I feel strongly that you really need to be clear about what your monetization strategy is going to be. And if you are really looking at sort of an enterprise market, you have to have the security, compliance pieces in mind. I think, security is like, these days it is a no-brainer, as in you really need to be thinking security first, right? Compliance is something that you could argue, you know, like SOC 2 compliance or these things can come as you mature. But your architecture, your design for the platform and the product you’re building really needs to be very, very security aware and security first.

And so for instance, one example I’ll give you is Temporal, it is an open source tool. But we actually don’t see any of the customer’s code or customer’s data. That all runs in the customer’s infrastructure. And so that sort of design decision was made very, very early on, but has really resulted in much faster adoption, because it took a whole conversation off the table when it came to a using and buying decision.

So my advice basically would be you have to think about security from the ground up. You have to make sure you’re not going through any, what I’d call a one-way door on compliance. But you don’t necessarily need to have all of the compliance pieces lined up early on. And on cost, it’s again, really important to think about just how you’re going to monetize. And, you know, at the end of the day, we know all these tools are competing for some share of budget. And being very clear on the value your tool brings versus the cost and how that lines up. And then, you know, also working hard towards lowering costs for developers, right?

Henry Suryawirawan: Yeah, so hopefully, you know, uh, everyone that is building, you know, this kind of tooling, developer tools and all that can also learn from this thing, right? So personally for compliance as well, I think it is, uh, useful if let’s say you already know what are the controls that the, you know, whatever compliance certifications that you are aiming for, right? Like what are kind of controls that are coming. Think maybe how you can bridge the gap, right, in the future, right? Rather than having to retrofit, which is always difficult, especially if you’re already running with customers and data and all that, right? So I think that is also a tip from me.

[00:36:57] Temporal Paradigm Shift

Henry Suryawirawan: Let’s maybe dive deeper into Temporal itself, right? So sometimes, every few years, uh, we will hear about this new paradigm in the engineering world where it is kind of like new, novel, like the way for solving problems. So we can think of, I dunno, like cloud is one, infrastructure as code is one, right? Service mesh is probably another one. And I think Temporal is kind of like somewhere in this space as well. First of all, there’s a paradigm shift. So maybe let me ask you like what kind of paradigm shift that people need to have whenever they think about Temporal.

Preeti Somal: Yeah, yeah, absolutely. Uh, so you are absolutely right. Temporal is a paradigm shift. Essentially, the thing that is happening is today developers spend a lot of time worrying about their code, all of the error conditions. How do they make it reliable? And there are some studies, for instance, that would say, you know, anywhere between 40% to 60% of the developer’s time is on all of the kind of scaffolding versus the core business logic. And the paradigm shift with Temporal is that we take care of all of that for you. So basically we shield the developer from all of the complexities of running and building reliable code, and the developer can focus on their business logic. Now the way that we do this is that you code to our SDKs and we, the Temporal server, handles all of the task dispatching, retries, all of the state management, all these pieces that are often distributed over lots of systems and very hard to manage and like trace through. We do all of that for you. Sounds good, right? Doesn’t it sound great?

Henry Suryawirawan: Yeah, I think it sounds too good to be true because, uh, I don’t know like how many developers have built this kind of workflow. I would say if you are doing some kind of workflow, distributed systems, you definitely can relate to this problem, right? But if you are just building a CRUD monolith, probably it’s less so.

[00:39:14] How Temporal Hides Away The Complexities of Building Reliable Applications

Henry Suryawirawan: I’m quite curious when you mention about this kind of complexity, right? Because I actually want to borrow the model like service mesh. I think it’s pretty good analogy as well. Because in the past we used to, you know, worry about networking, how do we call, you know, uh, other services out there, the retries, and also the back off, you know, exponential back off, and things like that. And we put that, bake that in the code, right? But with thing like service mesh, right? It is all transparent. You don’t actually see it. And I see the same thing with Temporal as well, right? When you create your business logic, your workflow, you just focus on the business logic and you don’t actually see all this other concerns, right? Like, be it the retries, be it the state management, and be the dispatching of the task. So this is, I think first of all is the most important thing. How do you actually do that, right? How do you actually make sure that developers can really focus on just business logic?

Preeti Somal: Yeah. So you captured kind of the, the paradigm shift really, really well. How we do that is kind of, first I think I wanted to spend a minute on like the history of Temporal. So Temporal. Our two co-founders have been kinda working on this problem for a lot of years. They got together at Uber and sort of built this open source project that then became Temporal. And a lot of their sort of lessons around running distributed systems at scale are what Temporal solves for. So the main point here is that they’ve been thinking about this problem for about 15, 20 years. So it’s not an easy problem to solve and, um, it’s definitely one where it has been production tested.

Now how we do that is essentially we introduce some abstractions around what we call a workflow, an activity, and a task. And we, in this abstraction, essentially we create a programming model where the developer is just focusing on their business logic. They have some rules that they need to kind of abide with. So for instance, anything that is error prone, they would put in an activity. And these abstractions then let us do things like retries. And we, we save the state of kind of the execution of that workflow. So we know at any given point in time where that code is. And what’s, like, past and what’s coming next. And so if something were to crash, we have a concept called replay where we can go in and run again. And this, you know, I’m simplifying this down quite a bit, but you know, there’s a lot of power in the programming model.

But this is kind of the core of the paradigm shift: there are a few abstractions that the developers need to learn, and then we take care of the rest. And very frequently what we hear developers say is that understanding that paradigm shift took a little bit of effort, because their natural instinct is, oh, where is my “if error, do X” kind of block, right? But once they get it, they say that they can’t ever look back. It’s just such an elegant model that they become Temporal developers for life.
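
To make the replay idea above concrete, here is a minimal Python sketch. It is a toy and does not use the Temporal SDK: `ToyWorkflowRunner`, `run_activity`, `charge_card`, and `ship_order` are hypothetical names invented for illustration. It assumes that each activity result is recorded in a history, and that after a crash the workflow function is simply run again, with recorded results returned instead of repeating the side effects.

```python
import json


class ActivityFailed(Exception):
    """Raised when an activity keeps failing after all retry attempts."""


class ToyWorkflowRunner:
    """Toy runner (hypothetical, not Temporal's API): records activity results."""

    def __init__(self, history=None):
        # `history` stands in for state that a real system would persist.
        self.history = list(history or [])
        self._cursor = 0

    def run_activity(self, name, fn, *args, max_attempts=3):
        # Replay path: this step already completed, so reuse its recorded result.
        if self._cursor < len(self.history):
            event = self.history[self._cursor]
            assert event["activity"] == name, "workflow code must be deterministic"
            self._cursor += 1
            return event["result"]
        # First execution: run the error-prone code, retrying on failure.
        last_error = None
        for _ in range(max_attempts):
            try:
                result = fn(*args)
                break
            except Exception as exc:
                last_error = exc
        else:
            raise ActivityFailed(name) from last_error
        self.history.append({"activity": name, "result": result})
        self._cursor += 1
        return result


calls = []  # tracks which side effects really ran


def charge_card(order_id):
    calls.append("charge")
    return f"charged:{order_id}"


def ship_order(order_id):
    calls.append("ship")
    return f"shipped:{order_id}"


def order_workflow(runner, order_id, crash_after_charge=False):
    # Business logic only: no retry loops or state bookkeeping here.
    receipt = runner.run_activity("charge_card", charge_card, order_id)
    if crash_after_charge:
        raise RuntimeError("simulated worker crash")
    tracking = runner.run_activity("ship_order", ship_order, order_id)
    return receipt, tracking


# First attempt crashes after the card has been charged.
first = ToyWorkflowRunner()
try:
    order_workflow(first, "o-1", crash_after_charge=True)
except RuntimeError:
    pass
saved = json.dumps(first.history)  # pretend this was written to durable storage

# Replay: a fresh runner loads the history and runs the same function again.
second = ToyWorkflowRunner(json.loads(saved))
result = order_workflow(second, "o-1")
```

In this sketch the card is charged only once even though the workflow function runs twice, because the second run reads the charge result from the history. The workflow body also contains no error-handling block, which is the habit developers describe having to unlearn.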

[00:42:47] Unlearning Required for Using Temporal Programming Model

Henry Suryawirawan: Yeah, so I remember whenever I dealt with this kind of new programming model, for example Hadoop, or Dataflow with the Apache Beam programming model, the first key difficulty was definitely understanding the model itself. And it depends on how good the documentation is and how intuitive the programming model is. There’s always this hurdle, right? So from your experience looking at your customers, or at developer adoption of the open source version, do you think the Temporal programming model and SDK are difficult to learn?

Preeti Somal: It’s not difficult to learn, but it is difficult to unlearn the patterns from wherever you came from. And what I mean by that is what we see is that developers kind of go through like a journey where at some point they don’t believe it and then there’s an aha. And then they’re like, oh my God, do you know, is this for real? And then they’ll try it. And once they really kind of understand it, then there is a tremendous amount of excitement. And you know, we’ve had someone come up to me at a conference and say, I couldn’t sleep that night. The moment I finally understood Temporal, I was so excited, right?

So, you know, the programming model itself is actually really simple, really elegant. But what typically happens is that developers who are new to Temporal expect to have to do all of this work around reliability, and unlearning those habits takes some time.

One thing that we are trying to do, and we’ve put a lot of focus on, is more community education. We have a Slack channel where our community joins in and shares what they’ve built. We’ve got something called community exchange where developers can share the code that they’ve built and link to their GitHub repo, et cetera. So we’re really working towards showing more code and doing more education around the model as well.

Henry Suryawirawan: Right. That’s a very interesting insight: it’s much more difficult to unlearn what we know, especially for more senior developers who have the battle scars from building such systems, right? So for people who cannot relate, let me imagine some of the problems that we typically have.

Whenever you have a business workflow with states, it could be very simple, for example three states: start, doing, and stop. If it involves events and multiple distributed systems, it’s always very difficult to build reliability into these transitions and into calls to other workflows. We used to have a messaging system with a DLQ (dead-letter queue). And then you need to take care of the DLQ. How do you know whether the messages are there and somebody is picking them up? All of this seems to be magically handled by Temporal, right? So I think that’s a very big thing. I haven’t actually played with Temporal, but I can imagine this being great for developers to work with, simply because it removes all these unnecessary complexities, the boilerplate, the things that we don’t like to do, right? Plus, when you mentioned persisting the state and all that, I assume it’s very easy to troubleshoot. Because if you don’t have this, you only rely on logs, and we know that reading logs is not fun and is always very difficult.

Preeti Somal: Yes. Absolutely, yes.

[00:46:33] Getting Started Building with Temporal

Henry Suryawirawan: Yeah, yeah. So if people are sold on this new programming model and paradigm, what do you think are some of the first steps they can take to actually try Temporal?

Preeti Somal: Yeah. So a few things. One is definitely join our community. Second is sign up for Temporal Cloud. When you sign up, we give credits to developers, and we have a quick start guide that helps you get set up and run samples, et cetera. And I think the third thing would be that there are a lot of talks that our customers, developers who have built on Temporal, have given at our conferences. So, you know, just listen to another developer talking about how they’re using Temporal. That, I think, will really help you get started, because then you can start envisioning how you could bring Temporal to the problems you have at hand.

One of the other things that we didn’t talk about, but that we hear a lot, is not just that development teams can go faster. For instance, we’ve had customers tell us that they’ve been able to get like 6x developer productivity because all of that boilerplate is gone. It’s also this aspect of peace of mind, of knowing that reliability is taken care of. And, you know, failures happen. We know that in any distributed system, failures will happen. But recovering from those failures, Temporal will handle that for you. So how much peace of mind this brings to the teams that are on-call is also a really significant part of the value we believe Temporal brings.

[00:48:34] Temporal’s Durable Execution Guarantee

Henry Suryawirawan: Yeah, thanks for the plug, right? Recovering from failures is, again, not a fun thing. And I think the term that you use is durable execution, right? So I submit an execution and I can just trust that it will happen at some point, especially if the system doesn’t crash all the time, right? I assume that as long as the system can recover, the execution is guaranteed to run, right?

Preeti Somal: Yes, yes.

Henry Suryawirawan: So maybe a little bit about this term. How did you come up with this concept, durable execution? In case I didn’t explain it correctly.

Preeti Somal: Yeah, no, uh, you explained it really well. And I think essentially, what we were working towards is how do we describe the concept. Because the, you know, in the early days of Temporal, the natural instinct for developers is to try and compare you with technology they know, right? So, are you a message bus? No. Are you, you know, like a Visual Studio type thing? No, right? And so durable execution for us, really sort of encapsulates the core value that once you’ve built your code using the Temporal model, you know, we will execute it for you. Like that durability guarantee is really, really very strong.

In fact, Henry, there’s a really fun blog and video on our website. What we did was we sent our mascot with a Raspberry Pi running a Temporal workflow to space. And what you can see is the workflow is executing and at some point, it stops because the Pi loses connection. And then after a little bit of time, it starts again because the Pi has come back into orbit and has picked up connection again. So we do these fun things as well to show, very visually and in a very real way, how seriously we take that durable execution guarantee.

Henry Suryawirawan: Wow, I assume that’s a very fun video, right? We’ll definitely put that in the show notes. When you mentioned that, I also remembered something from when I studied the Temporal docs. You can even imagine a scheduled task that is years ahead. Let’s say you have a multi-year contract and you have to schedule something that runs whenever, I dunno, they wanna renew or something like that. You can submit it reliably to Temporal, and it will make sure that it executes at that point in time. Again, building a scheduling thing is not fun for developers, right? How do you ensure it’s reliable, and you still have to think about whether it gets executed or not. So I think that’s a pretty interesting thing.
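
As a rough illustration of why a timer that is years away can survive restarts, here is a small Python sketch. It is not how Temporal implements timers, and `ToyTimerStore`, `schedule`, and `due` are hypothetical names. The one assumption it demonstrates is that the absolute fire time is persisted, rather than a process sleeping in memory, so any process that later loads the stored data can tell what is due.

```python
from datetime import datetime, timedelta, timezone


class ToyTimerStore:
    """Toy stand-in for persisted timer state (hypothetical, for illustration)."""

    def __init__(self):
        # Maps timer id -> ISO 8601 timestamp, as it might be stored in a database.
        self.timers = {}

    def schedule(self, timer_id, fire_at):
        # Persist the absolute fire time, not a sleep duration held in memory.
        self.timers[timer_id] = fire_at.isoformat()

    def due(self, now):
        # Return (and remove) every timer whose fire time has been reached.
        fired = sorted(
            timer_id
            for timer_id, iso in self.timers.items()
            if datetime.fromisoformat(iso) <= now
        )
        for timer_id in fired:
            del self.timers[timer_id]
        return fired


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
store = ToyTimerStore()
store.schedule("renew-contract-42", start + timedelta(days=3 * 365))

# Simulated restart: a new process loads the same persisted data.
restored = ToyTimerStore()
restored.timers = dict(store.timers)

too_early = restored.due(start + timedelta(days=365))
on_time = restored.due(start + timedelta(days=3 * 365))
```

Checking one year in, nothing fires; checking at the three-year mark, the renewal timer fires exactly once, even though the process that scheduled it is long gone.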

Preeti Somal: Yeah.

[00:51:23] The Concern About Temporal Lock-In

Henry Suryawirawan: Maybe one concern that people have whenever they adopt a new programming model, especially when it comes with a set of infrastructure, is lock-in, right? Especially when you are embedded in a programming model that is probably not so common these days. For example, people talk about cloud lock-in, even though in cloud you have many different options these days and they’re kind of interchangeable, right? So tell us about these concerns that people have about lock-in. How do you ensure that their code can still be portable if, let’s say, one day Temporal is not the solution for them?

Preeti Somal: Yeah. It’s a very real concern. And thank you for bringing it up. I think the way that we sort of really live by our values is around the open source commitment, right? So essentially, our guarantee to a developer who’s building using the Temporal SDK is that they can run their code on the open source sort of server that they could be running in their own infrastructure or they could bring it to cloud without any code changes. And so we kind of guarantee that sort of code compatibility between Temporal Cloud and open source.

And so, you know, from a vendor lock-in point of view, you are honestly kind of, the programming model is sticky, right? So if you wanna use a different programming model, you will have to go change your code. But you don’t have to rely on Temporal the company, because the server is available in open source. And, you know, we have a very strong sort of commitment to code compatibility between cloud and the open source server.

Henry Suryawirawan: Yeah, that sounds really great, right? In the past I also had this experience with Apache Beam. I dunno how much you know about it. It’s a programming model that runs first class on Google Cloud Dataflow, the product for data pipelines. But they also encourage others, they call them runners, right? Other open source projects like Spark and Flink, for example, can also run the same code with the same programming model. So people don’t have a fear of being locked in, especially if there’s an open source solution that people can use, right?

So it’s definitely good that you addressed that, right? Because people always have this concern about lock-in. And what concerns me most is those people who have unlearned their previous habits. They got this aha moment. They’re kind of locked into that paradigm shift, right? So I think that is more concerning for me. Especially as developers, we don’t like to go back and do all these mundane tasks.

Preeti Somal: Yeah, absolutely.

[00:54:09] Temporal’s Strong Developer Focus

Henry Suryawirawan: So as we’re almost wrapping up the conversation, is there anything else about Temporal or about the developer report that you wanna cover as well?

Preeti Somal: Yeah, I think mainly, you know, I wanna just reiterate our commitment is really to make developers’ lives easier. To help them just focus on their business logic and not have to worry about all of this sort of really complex, messy, sort of pieces we’ve been talking about. We’re committed to open source. We are available under the MIT license. We have a really strong community that is growing quite a bit.

One thing we didn’t touch on is that our focus on developers is so incredibly strong that we build our SDKs across multiple languages. So we have Python, TypeScript, Java, Go, and .NET as well. The way we think about it is we wanna meet the developers where they are, and we’re truly driven by making their lives better and more productive.

On the Temporal Cloud side, we are available in like multiple regions across the globe. Uh, so we have a number of regions in the Asia Pacific area as well. And our goal with Temporal Cloud is to run the Temporal service so that you don’t have to worry about the reliability of the Temporal service itself, right?

And so again, you know, go check it out. Become a part of the community. And the best way to learn is by just sort of looking at samples and listening to others. ‘Cause, you know, Henry, you and I know I can stand here and I can talk about Temporal, but having someone who’s used it from outside the company is always a much stronger indicator of the strength of the platform, right?

[00:56:16] The Compliance and Security Aspect of Temporal Cloud

Henry Suryawirawan: Right. And when you explained that, you reminded me of the compliance piece that you mentioned earlier as well, right? How do you ensure Temporal, especially the Cloud part, is compliant and secure? I can see banks are actually starting to adopt Temporal, right? I know it’s always very difficult to have banks adopt Temporal Cloud.

Preeti Somal: Right.

Henry Suryawirawan: So tell us about this important piece of compliance and security aspect that people don’t know about, right? It’s always unintuitive, yeah.

Preeti Somal: Yeah. So I think the first piece to understand is that when you build your code using our SDKs, you are building what we call a worker, and the worker is running in your infrastructure. We don’t see your code. What we are doing on the Temporal Cloud side is essentially dispatching the tasks and the retries.

One way I like to talk about it is think about a puppet master who is sort of, you know, directing all of the puppets, but we aren’t running the puppets in our environment, that’s in yours. And we are sort of the brains that is orchestrating all of these pieces.

So that immediately kind of overcomes a lot of the questions around, do we see your data, do we see your code? And the answer is no. And then the sort of task dispatching pieces obviously need connection that would typically go over something like a private link. And then on the Temporal Cloud side, you know, we are SOC 2 certified. We have had some requirements from customers around having kind of their disaster recovery and primary region in the same country. We can sort of build a region wherever the cloud providers exist. So that’s been pretty easy for us. And I think this is the kind of the elegance of the model around security that has really enabled us to move much faster and make inroads in banks using Temporal Cloud pretty quickly as well.
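
The split described above, with orchestration on one side and customer code on the other, can be sketched in a few lines of Python. This is a toy model and not Temporal’s actual protocol or SDK: `ToyOrchestrator`, `ToyWorker`, and the `send_invoice` task are hypothetical. It assumes only what is said in the conversation, namely that the service dispatches tasks while the worker, running in the customer’s environment, holds and executes the business code. For simplicity the toy passes a plain payload through the orchestrator, which says nothing about how real payloads are protected.

```python
from collections import deque


class ToyOrchestrator:
    """Plays the 'puppet master': queues tasks and records results.

    It never holds the business functions themselves (toy model only).
    """

    def __init__(self):
        self.pending = deque()
        self.results = {}

    def schedule(self, task_id, task_type, payload):
        self.pending.append((task_id, task_type, payload))

    def poll(self):
        # Workers pull work; the orchestrator does not call into them.
        return self.pending.popleft() if self.pending else None

    def complete(self, task_id, result):
        self.results[task_id] = result


class ToyWorker:
    """Runs in the customer's environment and owns the actual code."""

    def __init__(self, orchestrator, handlers):
        self.orchestrator = orchestrator
        self.handlers = handlers  # task type -> locally defined function

    def run_once(self):
        task = self.orchestrator.poll()
        if task is None:
            return False
        task_id, task_type, payload = task
        result = self.handlers[task_type](payload)
        self.orchestrator.complete(task_id, result)
        return True


def send_invoice(payload):
    # Business logic lives only on the worker side.
    return f"invoice sent to {payload['customer']}"


orchestrator = ToyOrchestrator()
orchestrator.schedule("t1", "send_invoice", {"customer": "acme"})

worker = ToyWorker(orchestrator, {"send_invoice": send_invoice})
while worker.run_once():
    pass
```

The orchestrator only knows task names and outcomes, while `send_invoice` exists solely in the worker process, which mirrors the puppet master analogy at a very small scale.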

Henry Suryawirawan: Yeah, pretty exciting. So if people hear this conversation, maybe they’ll think it’s too good to be true, but I would invite people to just learn from the links and the community that Preeti has mentioned, right? Because I’m sure this requires a much deeper dive, more discussion, maybe some study, before you actually understand what Temporal is.

[00:58:41] 3 Tech Lead Wisdom

Henry Suryawirawan: So Preeti, I think it’s a pretty exciting talk, right? Uh, unfortunately we have to wrap up the call pretty soon. But before I let you go, I would like to ask you one question. This is like a tradition in my podcast. I call this the 3 Technical Leadership Wisdom. You can think of it just like advice you want to give to us. Maybe if you can share your version today that will be great.

Preeti Somal: Absolutely. I’d love to. And you said three, right, Henry? Yeah.

Henry Suryawirawan: That’s correct, yeah.

Preeti Somal: Okay. So I think my first one would be, you know, really being clear about the problem you’re trying to solve. Think deeply and from a first-principles viewpoint, so that you are really solving the problem in a way that’s very impactful. And really hold that high bar around the problem and how you solve it. I think that would be one.

I think the second piece of wisdom would be don’t shy away from hard things. You know, sometimes it’s easy to take shortcuts, but, you know, we’re in this for the long term. Solve the hard problems and really don’t sort of lose track of that long term piece that you’re building.

And then I think the third thing, and I know you said technical, but I think this is very related, is the team that you have and the people you work with, they are critical in solving any technology problem. Your engineering team, the talent, is your biggest sort of asset. It’s how you can solve these problems. So make sure that you are hiring well, you’re taking care of your team and your engineers, and you’re motivating them and helping them understand the big picture and empowering them to have an impact.

Henry Suryawirawan: Yeah, I would say the last one is pretty relevant, right? I think it’s pretty beautiful, especially these days, when people think about maybe replacing more people with AI. But I think an engineering team still requires a lot of people, and a lot of innovation, creativity, and ownership from those people to get it right.

So if people, uh, love this conversation, they would like to reach out to you, learn from you. Is there a place where they can find you online, Preeti?

Preeti Somal: Yeah. I am in our community Slack, so you can find me there. Also on LinkedIn and Twitter. So absolutely come find me. I’d love to hear from you.

Henry Suryawirawan: All right. Thank you so much for sharing today. I learned a lot of insights from you, from your career, from the developer report, and especially about Temporal as well. So thank you so much, Preeti.

Preeti Somal: Thank you, Henry. And I hope that your listeners enjoy the discussion today.

– End –