#60 - Software Tradeoffs and How to Make Good Programming Decisions - Tomasz Lelek

“Software engineering involves a lot of decisions, and each decision has some trade-offs. There are pros and cons. It’s not like one decision is always better than the other.”

Tomasz Lelek is the author of “Software Mistakes and Tradeoffs”. In this episode, Tomasz shared what led him to write his book and recounted one of the software mistakes from his own career. He also gave advice on how software developers should approach potential software mistakes, and explained some typical trade-offs in software engineering design decisions, such as code duplication vs flexibility, premature optimization vs optimizing the hot-path, data locality and memory, and finally delivery semantics in distributed systems.

Listen out for:

  • Career Journey - [00:05:00]
  • Why Write about Software Mistakes and Trade-offs - [00:07:42]
  • Software Mistake Experience - [00:10:16]
  • Tips for Software Developers - [00:13:08]
  • Trade-off 1: Code Duplication vs Flexibility - [00:15:24]
  • Trade-off 2: Premature Optimization vs Optimizing Hot-Path - [00:20:08]
  • Trade-off 3: Data Locality and Memory - [00:25:02]
  • Trade-off 4: Delivery Semantics in Distributed Systems - [00:33:01]
  • 3 Tech Lead Wisdom - [00:40:28]

_____

Tomasz Lelek’s Bio
Tomasz currently works at DataStax, building products around one of the world’s favorite distributed databases: Cassandra. He contributes to the Java driver, Cassandra-Quarkus, the Cassandra-Kafka connector, and Stargate. He previously worked at Allegro, an e-commerce website in Poland, on streaming, batch, and online systems serving millions of users. He is also the published author of “Software Mistakes and Tradeoffs: Making good programming decisions”, which focuses on real-world problems you may encounter in your production systems.

Our Sponsors
Are you looking for some cool new swag?

Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once it arrives.

Like this episode?
Follow @techleadjournal on LinkedIn, Twitter, Instagram.
Buy me a coffee or become a patron.

Quotes

Why Write about Software Mistakes and Trade-offs

  • From the beginning of my career, I started noticing that software engineering involves a lot of decisions, and each decision has some trade-offs. So we have pros and cons. It’s not like one decision is always better than the other. And of course, you can have more than two options and possibilities.

  • Sometimes you are not aware of a trade-off at decision time. But later, after a year or two, it becomes costly. So maybe this cost could be alleviated.

Software Mistake Experience

  • We picked one solution, one library, without realizing all its trade-offs and without properly analyzing its source code. On the one hand, this library documented that it should work properly in a distributed system. But in reality, it turned out to be problematic if you have multiple nodes. There were some problems with synchronization.

  • That could be generalized to the fact that we should know the threading model of the libraries we are importing. Sometimes, if you are using libraries that are not well suited to your threading model, you may block your main processing threads. It may turn out that this blocks your main flow and affects your performance. That should be found at the design stage.

  • If you have a quite simple flow that works in a synchronous way and you use an asynchronous library, you should be aware of the problems that may arise, because you need to think about ordering and multi-threading. Basically, when you introduce multi-threading to your code, you need to be aware of it, and it means you will have additional problems and maintenance costs.
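
A minimal Java sketch of how such hidden multi-threading slips in (data and method names are hypothetical, based on the parallel streams example Tomasz gives in the transcript): switching stream() to parallelStream() looks like no change at the code level, yet silently moves the work onto a shared thread pool.

```java
import java.util.List;
import java.util.stream.Collectors;

public class HiddenThreading {
    public static void main(String[] args) {
        List<Integer> orders = List.of(1, 2, 3, 4, 5, 6, 7, 8);

        // Synchronous: everything runs on the calling thread; ordering is predictable.
        List<Integer> sequential = orders.stream()
                .map(HiddenThreading::enrich)
                .collect(Collectors.toList());

        // One word changed: the same pipeline now runs on the JVM-wide
        // ForkJoinPool common pool. A blocking call inside enrich() would
        // stall worker threads shared by the whole application.
        List<Integer> parallel = orders.parallelStream()
                .map(HiddenThreading::enrich)
                .collect(Collectors.toList());

        System.out.println(sequential);
        System.out.println(parallel);
    }

    private static Integer enrich(Integer order) {
        // Print the executing thread to make the hidden threading visible.
        System.out.println(order + " on " + Thread.currentThread().getName());
        return order * 2;
    }
}
```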

Tips for Software Developers

  • The first one is when you are picking a framework or library that will shape and influence your entire code base.

    • It will be hard to change it to something else in the future. You need to be aware of these trade-offs.

    • Whatever distributed system you are integrating with, you need to ask the same questions, like how to influence consistency, availability, and so on.

    • For those situations, I think it’s better to try them upfront. It is possible to know them upfront by doing good research and prototyping.

  • On the other hand, there are those low-level decisions, like refactoring code or picking a specific pattern or design.

    • In this situation, maybe it’s better just to experiment with the code and compare two approaches.

  • In the second scenario, the cost of experimentation is lower, so maybe you can postpone the decision about which one to choose. In the first one, the cost is higher, because once you have it in your application, it’s really hard to migrate.

  • It’s better to read and educate yourself about all those trade-offs upfront.

Trade-off 1: Code Duplication vs Flexibility

  • I consider this problem in two ways. The first one is at the architecture level.

    • In the microservices world, each service is ideally responsible for a specific business domain and contains self-contained functionality. It is often the case that services share similar code in some way: for example, authorization, some kinds of validation, or things like that. If you have this duplication between services, the first idea might be to just remove it and unify it using, for example, a library used by both microservices.

    • But it also has a trade-off. For example, if you have common code, you have tight coupling between the two microservices and this library, which may impact the evolution of your code base. Also, the fact that the code was duplicated at the beginning doesn’t mean it was the same functionality. Maybe the copies needed to evolve in different ways (see the sketch after this list).

    • And suddenly, you have tight coupling, and the flexibility of your software is lower, so you are not able to deliver your software as quickly as possible.

    • Another solution would be to extract this functionality into a new microservice, but that has other problems, like an additional network request, response time, additional latency, a deployment that you need to create for this microservice, and so on.

    • At this architecture level, sometimes some code duplication is not a blocker.

  • The second way is to not start from the abstraction, but wait some time until things settle out.

    • If you have a couple of components that have been living for some time, they are in a quite stable state, not in a phase of frequent evolution. If you then see some abstraction, it’s a good time to extract it. At the beginning, you may be doing that too soon, and it may be hard to unwind.

    • It also comes down to the fact that you need good test coverage of your components, so this is not an excuse to not write good quality code. You also need well-tested components. If they are well-tested, it’s of course easier to refactor.
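
A small Java sketch of the coupling trade-off described above (service and validator names are hypothetical): once both services depend on one shared class, any change to it silently changes both.

```java
// Before: each service owns its validation. The logic is duplicated,
// but each copy is free to evolve independently.
class OrderService {
    boolean isValidEmail(String email) {
        return email != null && email.contains("@");
    }
}

class InvoiceService {
    boolean isValidEmail(String email) {
        return email != null && email.contains("@");
    }
}

// After: the duplication is removed via a shared library class. Both
// services are now coupled to it. If OrderService later needs stricter
// rules, changing this class silently changes InvoiceService too, and
// the shared abstraction has to grow until it fits every use case.
final class SharedEmailValidator {
    private SharedEmailValidator() {}

    static boolean isValid(String email) {
        return email != null && email.contains("@");
    }
}
```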

Trade-off 2: Premature Optimization vs Optimizing Hot-Path

  • If you have code that you don’t want to prematurely optimize, how do you make it an early optimization, or a just-in-time optimization? How do you get enough data to optimize it?

  • We need two things to make a good decision about which code path to optimize. The first one is the number of requests per second, or some expected throughput on your paths.

  • But that is not enough, because we also need latency data. Once we have latency data, for example the average latency for the first endpoint and the second one, we can multiply the number of requests by the average latency and get a distribution of overall latency across those two endpoints (see the sketch after this list).

  • Having that data, you can detect the path that is executed most often. I call it the hot-path. It means that it is executed for almost every user request.

  • In the systems I was working with, it was very often the case that they followed a Pareto principle distribution. For example, 80% of the value was delivered by 20% of the code, or it was 10/90, 30/70, and so on. It basically means that if you focus your optimization efforts on a small part of the code, you will gain a lot of benefits.

  • Basically, you need the requests per second and measured latency. Once you have the data, you can detect the hot-path and dig deeper into your code. You can go to specific parts of this hot-path, measure each of those parts, and then focus your optimization efforts on a specific thing.

  • It is basically a way to limit the scope of your work, because in production systems it is not possible to optimize every possible path. If you detect the hot-path, you can make a rational decision: it is a place where additional effort will really be worth it.

  • If your system has a defined SLA, it should state that the system should be able to accept some number of requests per second with a specified latency. If you don’t have this data, you will need to measure it somehow. To get latency, you should write benchmarks that validate your application.

  • The number of requests per second you need should come from the SLA, or from product information: what kind of traffic are you expecting? How will the traffic grow over time? Then you can adapt your tests and measure again.
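
A minimal sketch of that calculation in Java (endpoint names and numbers are made up, echoing the two-endpoint example from the episode): multiplying requests per second by average latency shows where the overall time actually goes.

```java
import java.util.Map;

public class HotPathDetector {
    public static void main(String[] args) {
        // {requests per second, average latency in ms} per endpoint (hypothetical numbers)
        Map<String, double[]> endpoints = Map.of(
                "/search",   new double[]{10, 50},   // 10 req/s * 50 ms = 500 ms of work per second
                "/checkout", new double[]{1, 500});  // 1 req/s * 500 ms = 500 ms of work per second

        double total = endpoints.values().stream()
                .mapToDouble(e -> e[0] * e[1])
                .sum();

        // Each endpoint's share of the overall latency budget reveals the hot-path:
        // despite 10x fewer requests, /checkout contributes just as much total time.
        endpoints.forEach((path, e) -> {
            double contribution = e[0] * e[1];
            System.out.printf("%s: %.0f ms/s (%.0f%% of total)%n",
                    path, contribution, 100 * contribution / total);
        });
    }
}
```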

Trade-off 3: Data Locality and Memory

  • This applies to big data systems, but also to all data-intensive systems. Data locality means that you move your computations to the data, and not the data to the computations.

  • You are not fetching the data from the data nodes to perform operations on the computation nodes. Instead, you are serializing the computations and sending them to the data nodes. Sending computations is fast because they do not take a lot of data.

  • It basically means serializing your Java, Scala, or other language program into a binary format and sending it over the wire to the data nodes. On the data nodes, some kind of executor needs to be running to deserialize the computation and apply it to the data on the local node.

  • It can be done in a distributed way: multiple data nodes, each processing part of the data. Then there is another phase of coalescing the results. Sometimes this phase involves sending data, depending for example on the partitioning scheme, but the amount of data will usually be substantially reduced (see the sketch after this list).

  • The second aspect here is partitioning, which is important: picking the proper partitioning scheme for your processing. You need to fit your partitions to your access patterns, choosing the partitioning depending on the processing you plan to execute.

  • It can also, of course, be applied to microservices. Microservices often need to fetch a lot of data from other microservices to create an end result for a customer, or maybe for another microservice.

  • The computations will be done on the second microservice, and only the result will be returned. In that case, you save network bandwidth and also CPU processing time, and so on.

  • Stored procedures had a couple of problems: mainly testing, the stability of it, and they could also be very vendor-specific, so you are coupled to a specific vendor, among other problems. So the implementation of the idea was not ideal.

  • There is a trend to move computations back to the data, and all big data frameworks are doing that. But they are doing it at a bit higher level, to provide some abstraction and not tie you to a specific vendor.
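
A sketch of this idea in Spark’s Java API (paths and column names are hypothetical, borrowing the accounts-and-purchases join example from the conversation below): broadcasting the small data set ships it to the nodes that already hold the large one, so the join runs locally instead of shuffling the large data set across the network.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.broadcast;

public class DataLocalityJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("data-locality-sketch")
                .master("local[*]") // for local experimentation only
                .getOrCreate();

        // Large dataset: stays partitioned across the data nodes.
        Dataset<Row> purchases = spark.read().parquet("/data/purchases");
        // Small dataset: cheap to send over the wire.
        Dataset<Row> accounts = spark.read().parquet("/data/accounts");

        // broadcast() ships the small side to every executor, so the join is
        // computed where the purchase partitions already live, avoiding a
        // network shuffle of the large dataset.
        Dataset<Row> joined = purchases.join(broadcast(accounts), "accountId");

        joined.groupBy("accountId").count().show();
    }
}
```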

Trade-off 4: Delivery Semantics in Distributed Systems

  • The system you are interacting with delivers data to you, and the question is whether it will be delivered at-least-once, at-most-once, or exactly-once. If you are making network calls or interacting with any distributed system, these delivery semantics are important for you.

  • You can design a system to be idempotent, so it doesn’t matter if you replay a message; it will result in the same outcome.

  • But some operations are harder to design to be idempotent, like updating a counter, because there will be inconsistencies. So we have at-least-once in that situation. If the consuming system is not aware of this, those duplicates may leave the data in an inconsistent state.

  • Lastly, you can have effective exactly-once, because when the network is involved, you will need to make a decision about retries at some point; that happens even at the protocol level, the TCP layer, and so on. In effective exactly-once, the deduplication is hidden somewhere, or there is some kind of transaction built into your system. It’s hard to implement, it has some problems and complexities, and usually a performance overhead as well.

  • Also, what’s important in microservice architectures, where multiple microservices connect with each other: even if there is effective exactly-once between two of them, if the next one does not provide effective exactly-once, you will have the same problems. So your whole pipeline needs to work this way.

  • We can remove the need for deduplication by designing our system to be idempotent. If you design your system to be idempotent, at-least-once will work very well. For example, each event or message can have a unique ID, such as a UUID, and on the consumer side you can keep some kind of table, a persistent map, where you map the ID to whether it was processed or not (see the sketch after this list).

  • Kafka is nice in that it allows you to configure your producer and consumer behavior to work in either of those ways. You can have at-least-once or at-most-once, depending on your need. You can tune your settings properly, commit offsets in the proper way, and design your end-to-end pipeline to work in either of those situations.
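
A minimal sketch of the UUID deduplication idea above (class and method names are hypothetical; an in-memory map stands in for the persistent, atomically updated store a production system would need):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class DeduplicatingConsumer {
    // In production this map must be persistent and updated atomically
    // (e.g. with a compare-and-set on the database); in-memory is only a sketch.
    private final Map<UUID, Boolean> processed = new ConcurrentHashMap<>();

    public void onMessage(UUID messageId, String payload) {
        // putIfAbsent is atomic: only the first delivery of a given ID wins,
        // so at-least-once redeliveries no longer produce duplicate side effects.
        if (processed.putIfAbsent(messageId, Boolean.TRUE) != null) {
            return; // duplicate delivery, already handled
        }
        handle(payload);
    }

    private void handle(String payload) {
        System.out.println("processing: " + payload);
    }
}
```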

3 Tech Lead Wisdom

  1. Leading by example.

    • If you want your team to enforce some specific standards, you should set the example yourself.

  2. Always ask the question “Why?”

    • If you are given something to do, to implement a specific feature that will result in complexity, you should always be aware of what value it will bring. Does it give you any value or not? What’s the end-to-end flow of this new feature? How will it impact your customers, and so on. So, be aware of this, not only the technical things.

    • Ask why we even need to implement it. Every piece of code is a maintenance overhead and cost. So if you are able to solve the problem without code at all, that is ideal.

  3. Create a blameless culture, so that no single person is blamed.

    • If you are delivering software and there was an error or failure in production, chances are high that it was a problem with the process, not with a single person.

    • It means you should change your process, fix the process so the same mistake is not made in the future, and learn from the mistakes.

Transcript

[00:00:53] Episode Introduction

[00:00:53] Henry Suryawirawan: Hello, everyone. Welcome back to another episode of the Tech Lead Journal podcast with me, your host, Henry Suryawirawan. Thank you for tuning in and spending your time with me today listening to this episode. If you’re new to this podcast, please follow Tech Lead Journal on your podcast app and social media on LinkedIn, Twitter, and Instagram. Also consider supporting the show by subscribing as a patron at techleadjournal.dev/patron, and support me to continue producing great content every week.

Speaking about software mistakes, I’m pretty sure that many of us have made such mistakes in our software projects, in whatever shapes or forms. Or how about having to decide on important trade-offs to incorporate in your software design? If you look back at those experiences, some of you who got them right will be patting yourself on the back for being able to choose the right solutions for your software problems. However, I also believe that a lot of us could point back in time and say, “Yeah, I think I got that one wrong. And what a big mistake that was.” Personally, I’ve made a number of those costly mistakes, and how I wish I had had some guidance on how to avoid typical and common software mistakes and trade-offs.

Fortunately, our guest for today’s episode has written a book on just that topic! Tomasz Lelek is the author of “Software Mistakes and Tradeoffs”. In this episode, Tomasz shared what led him to write his book, and also shared one of his past software mistakes taken from his career experience. He also gave advice on how we as software developers should approach the potential software mistakes we could make, and explained some trade-offs that we typically face when making software engineering design decisions these days, such as code duplication versus flexibility, premature optimization versus optimizing the hot-path, data locality and memory, and finally different delivery semantics and their trade-offs in building distributed systems.

There is so much to learn from Tomasz about software trade-offs in this episode, and I hope you learn a lot from this episode as well. And if you do, help the show by giving it a rating and review on your podcast app or share some comments on the social media channels. It may sound trivial, but those reviews and comments are one of the best ways to get this podcast to reach more listeners. And hopefully they can also benefit from all the contents in this podcast. So let’s start our episode right after our short sponsor message.

[00:04:01] Introduction

[00:04:01] Henry Suryawirawan: Hello, everyone. Welcome back to another episode of the Tech Lead Journal podcast. Today I have with me Tomasz Lelek. He’s the author of a soon-to-be-published book by Manning called “Software Mistakes and Trade-Offs.” When we write software, most of the time we care about correctness, performance, and things like that. But sometimes we also need to be aware of the possible mistakes or trade-offs we should think about when writing software, so that it performs based on the context we build it for: the business requirements, functional correctness, and also performance. Today, Tomasz will discuss a lot about this, and hopefully we can learn from him about all the mistakes he has seen throughout his career, or maybe some of the epic stories. Tomasz currently works at DataStax, the company behind products built around Cassandra and a few other things. Tomasz, really happy to have you on the show. Looking forward to this conversation.

[00:04:59] Tomasz Lelek: Hi, welcome all.

[00:05:00] Career Journey

[00:05:00] Henry Suryawirawan: So Tomasz, maybe in the beginning for you to introduce yourself, maybe telling us more about your career journey or maybe any highlights or turning points.

[00:05:08] Tomasz Lelek: Yeah, sure. I started at Schibsted, a company in the Scandinavian market. People all over Scandinavia were reading our content: a newspaper and also a website providing news. But there were only around 2 million users, so the scale was not so big. After that, I progressed to Allegro, an e-commerce website here in Poland, the biggest one. We had 20 million active users, and there were a lot of interesting problems there, big data problems. We were collecting user traffic and saving it for future analytics, machine learning, and so on. The data spanned eight years or so, so there were petabytes of data. I was involved in streaming solutions, mainly using Kafka, and also a microservices architecture. It was a true microservices architecture, with hundreds of microservices; I think there were 700 of them. Each microservice was deployed to multiple nodes, with auto scaling and rolling deployments, all those patterns that are well known in this ecosystem. And also big data solutions based mainly on Spark, because it’s quite fast and has some advantages over the older solutions.

After almost four years, I progressed to DataStax, because here we have a much bigger scale. As you mentioned, we provide an ecosystem around Cassandra and some tools around it. For example, right now the Stargate API provides a variety of APIs for our customers to talk to Cassandra: not only the native CQL, but also, for example, gRPC, GraphQL, and so on. I was also involved in the Kafka connector, which was based on the Connect framework provided by Confluent. We provided a way to insert data into Cassandra and DataStax Enterprise from Kafka. Performance was critical there as well, so we needed to take performance considerations seriously, and I was involved in some performance testing too. And the main product was also the Java driver, the main entry point to Cassandra from the JVM ecosystem, with design trade-offs and interesting decisions that I was involved in as part of the team. So that’s a summary of my journey. Besides that, I’m also a technical trainer here in Poland. That’s also an interesting experience, because I can teach folks from different companies, but I can also learn from them and their questions. By doing so, I get better at what I’m teaching, for example, Kafka, caching solutions, performance tests, and so on.

[00:07:42] Why Write About Software Mistakes and Trade-offs

[00:07:42] Henry Suryawirawan: Thanks for sharing your journey. It seems like throughout your career, you have worked a lot on big data problems, distributed systems, microservices, and also on optimizing at the low level. You mentioned building connectors, drivers, and things like that. So obviously, you have seen a lot of different types of software, mostly with high performance characteristics. But in the first place, why are you interested in writing a book about software mistakes and trade-offs? This is quite interesting, because I don’t see many books like it, except maybe some about anti-patterns. So why did you want to publish this kind of book?

[00:08:18] Tomasz Lelek: Yeah. So, from the beginning of my career, I started noticing that software engineering involves a lot of decisions, and each decision has some trade-offs. We have pros and cons and so on. It’s not like one decision is always better than the other. And of course, you can have more than two options and possibilities. When I was involved in this variety of systems, strictly business e-commerce systems, big data, stream processing, and so on, the team tackled a lot of those decisions and trade-offs at different levels. There are decisions at the low level, like code patterns, that every software engineer needs to make almost every day. But there are also more architectural decisions that you discuss with your team each sprint or every two weeks, and even higher-level decisions that will influence the evolution of your software: its maintenance, flexibility, and also performance.

During those turning points in my career, when we chose one solution over another, I started to keep a personal list of those decisions and how they turned out from a time perspective: for example, how a decision impacted our software after one year, after two years, and so on. I also had the pleasure of working with those systems not only during development, but also during maintenance, when the system was deployed to production and I saw actual usage of it. After that, I had this list of decisions; there were about 15 of them that were really important. And after some time, I saw that those rules are quite generic. So I decided to share them with other engineers, architects, and so on, so they will not make the same mistakes and will be aware of those trade-offs. Because sometimes you are not aware of a trade-off at decision time, but later, after a year or two, it is costly. So maybe this cost could be alleviated after reading one of the chapters of my book.

[00:10:16] Software Mistakes Experience

[00:10:16] Henry Suryawirawan: So, it’s interesting that you started this from like curating your personal list of decisions and trade-offs in your career. Maybe if you can share maybe one or two, the costliest mistakes that you have seen in your career based on this list?

[00:10:29] Tomasz Lelek: Yeah. Okay. The first one, I touch on in chapter 10 of my book. We were building a streaming solution, and we needed some kind of scheduling library, a scheduling framework that would work efficiently; we needed to handle at least 10 thousand requests per second. Our system was distributed; it was based on Hazelcast back then. We picked one solution, one library, without realizing all its trade-offs and without properly analyzing its source code. On the one hand, this library documented that it should work properly in a distributed system. But in practice, it turned out to be problematic if you have multiple nodes; there were some problems with synchronization. The library’s domain was totally different, and it was not well prepared for that. It was a costly mistake, because we needed to debug those problems. After some time, it turned out it was a lot easier to just implement the minimal product that provided those functionalities for us, instead of using third-party software that we didn’t fully understand and didn’t have enough influence to fix. So I think that was one of the mistakes that could have been avoided.

That could be generalized to the fact that we should know the threading model of the libraries we are importing. For example, there is a trend right now to implement microservices using non-blocking solutions, like Netty or NodeJS. Sometimes, if you use libraries that are not well suited to that threading model, you may block your main processing threads. It may turn out that this blocks your main flow and impacts your performance. That should be found at the design stage. And also the other way around: if you have a quite simple flow that works in a synchronous way, when using an asynchronous library you should also be aware of the problems that may arise, because you need to think about ordering and multithreading. Basically, when you introduce multithreading to your code, you need to be aware of it, and it means you will have additional problems or maintenance costs. Sometimes they may be hidden. For example, in chapter five, I have this example of the Java Streams API and the parallel stream abstraction, which hides the internal fork-join pool that does the processing in a multi-threaded way. At first glance, at the code complexity level, there is no change at all: you are just changing the stream method invocation to parallel stream. But underneath, you are creating a lot of threads, and your code is now multithreaded. I touch on that in two chapters of my book, basically.

[00:13:08] Tips for Software Developers

[00:13:08] Henry Suryawirawan: As software developers, I know that a lot of times we tend to work based just on business requirements, ensuring functional correctness, that the input produces the correct output. So in your opinion, what should be the attitude of developers or software engineers towards possible software mistakes and trade-offs? Is it something they have to be aware of from the beginning, before they even write the software? Or during the implementation? How would you advise software developers to look at all these potential software mistakes and trade-offs?

[00:13:43] Tomasz Lelek: Yes. I think there are two different approaches to that. The first one is when you are picking a framework or library that will shape and influence your entire code base. For example, in the JVM world, if you pick the Spring framework, it will influence your code base, and it will be hard to change it to something else in the future. Also, for example, if you are picking a queue solution and you pick Kafka, you need to be aware of its trade-offs. It favours availability over consistency, depending on the acknowledgement settings. Whatever distributed system you are integrating with, you need to ask the same questions, like how to influence consistency, availability, and so on. For those situations, I think it’s better to try things upfront. It is possible to know them upfront by doing good research and prototyping.

On the other hand, there are those low-level decisions, like refactoring code and picking a specific pattern or design. In this situation, maybe it’s better just to experiment with the code and compare two approaches, right? Maybe it’s not so costly to, for example, implement the solution using inheritance, but also implement another solution using composition. It can even be implemented by two engineers, and then you can compare your solutions and pick the better one. So basically, in the second scenario, the cost of experimentation is lower, so maybe you can postpone the decision about which one to choose. In the first one, the cost is higher, because once you have it in your application, it’s really hard to migrate. So it’s better to read and educate yourself about all those trade-offs upfront.

[00:15:24] Trade-off 1: Code Duplication vs Flexibility

[00:15:24] Henry Suryawirawan: So maybe let’s go into some of the software mistakes and trade-offs that you covered in the book. Let’s start with the first one, which is code duplication versus flexibility. Maybe tell us more about this kind of trade-offs, like what do you mean by duplication versus flexibility?

[00:15:40] Tomasz Lelek: I consider this problem in two ways. The first one is at the architecture level; I’ll focus on that first. In the microservices world, or with multiple services, each service is ideally responsible for a specific business domain and contains self-contained functionality. If you have, for example, two microservices, it is often the case that they share similar code in some way. For example, authorization, setting up a token and submitting it, or maybe some kinds of validation, or things like that. If you have this duplication between services, the first idea might be to just remove it and unify it using, for example, a library used by both microservices. But that also has a trade-off. For example, if you have common code, you have tight coupling between the two microservices and this library, which may impact the evolution of your code base. Also, the fact that this code was duplicated at the beginning doesn’t mean it was the same functionality. Maybe the copies needed to evolve in different ways. In that situation, it will be really hard to evolve, because you will have an abstraction that doesn’t properly fit all use cases. You will need to add additional content to the library, develop it, and so on. And suddenly, you have tight coupling, and the flexibility of your software is lower, so you are not able to deliver your software as quickly as possible.

Another solution would be to extract this new functionality into a separate microservice, but that also has problems, like an additional network request, response time, additional latency, a deployment that you need to create for this microservice, and so on. So at this architecture level, sometimes some code duplication is not a blocker. For example, you have two components doing a similar job, and you suddenly start seeing some abstraction in those two components, and you decide, for example, to use inheritance to reduce the duplication. Both components suddenly depend on this new component via inheritance. The same situation can happen here: eventually this abstraction doesn’t fit the evolution of those two components. At that stage, it may be really hard to get back to the previous solution and have all those components independent again. Similarly, you will introduce tight coupling and impact your speed of delivery and the flexibility of your code.

[00:18:06] Henry Suryawirawan: I think this is quite a common case, right? As software engineers, we are taught that you have to remove duplication: Don’t Repeat Yourself, the DRY principle. Also, when you look around the internet, there are so many open source projects; people are basically creating libraries, frameworks, reusable components, and things like that. So we are kind of taught that duplication might not be a good idea, and that reusability is quite important. So in the first place, when tackling a particular problem, what is your advice for software engineers looking at this trade-off? Should you start with just a simple implementation first, and then over time see what kind of abstraction could be built around it? Or should you identify the abstraction at the beginning, based on the particular context that you know? What’s your approach here?

[00:18:53] Tomasz Lelek: Yeah. I think the better solution is the second one, right? We don’t start from the abstraction, but wait some time until things settle out. If you have a couple of components that have been living for some time, they are in a quite stable state, not in a phase of frequent evolution. If you then see some abstraction, it’s a good time to extract it. At the beginning, you may be doing that too soon, and it may be hard to unwind.

[00:19:19] Henry Suryawirawan: But there’s also the argument that if you don’t start early with making some of this stuff reusable, it can also be costly in the future, when you have to tweak your existing software and refactor it to make it more reusable, generic, and things like that. Any thoughts on this?

[00:19:35] Tomasz Lelek: There are no simple answers here. If you go to either extreme, your code will not be good. But I think it also comes down to the fact that you need good test coverage of your components, so this is not an excuse to not write good quality code. You also need well-tested components; if they are well-tested, it’s of course easier to refactor. But there are some functionalities that you can be sure will be reused: for example, simple string manipulation like what you have in standard SDKs, connections to external systems, and so on.

[00:20:08] Trade-off 2: Premature Optimization vs Optimizing Hot-Path

[00:20:08] Henry Suryawirawan: Cool. So let’s move on to the next trade-offs, which is what you call premature optimization versus optimizing hot-path. Maybe the first term almost all engineers will be familiar with. Premature optimization is the root of all evil, they all say. The second term here, optimizing hot-path. What do you mean by that?

[00:20:27] Tomasz Lelek: That’s all in chapter five, where I propose a framework. If you have code that you don’t want to prematurely optimize, how do you make it an early, or just-in-time, optimization? How do you get enough data to optimize it? We need two things to make a good decision about which code path to optimize. The first one is the number of requests per second, or some expected throughput on your paths. Assume we have two endpoints. Let’s say one endpoint handles one request per second, and the second handles 10 requests per second. Having that data, you might be tempted to just go and optimize the endpoint that is executed 10 times per second. But it is not enough, because we also need latency data. Once we have latency data, for example the average latency for the first endpoint and the second one, we can multiply the number of requests by the average latency and get a distribution of overall latency across those two endpoints.

Having that data, you can detect the path that is executed most often. I call it the hot-path. It means it is executed for almost every user request. In the systems I was working with, it was very often the case that they followed a Pareto principle distribution. For example, 80% of the value was delivered by 20% of the code, or it was 10/90, 30/70, and so on. It basically means that if you focus your optimization efforts on a smaller part of the code, you will gain a lot of benefits. What I propose there is how to detect this hot-path, the smaller part of the code where optimizing will result in huge benefits, and how to make this decision. So basically, you need the requests per second and measured latency. Once you have the data, you can detect the hot-path and dig deeper into your code. You can go to specific parts of this hot-path, measure each of those parts, and then focus your optimization efforts on a specific thing.

It is basically a way to limit the scope of your work, because in production systems it is not possible to optimize every possible path. If you detect the hot-path, you can make a rational decision: it is a place where additional effort will really be worth it. After optimizing the hot-path, it may turn out that you have reduced latency so much that it is no longer responsible for most of the overall utilization of your system. But it may also turn out that optimizing this hot-path still gives you better benefits than optimizing other paths in your code. Even if, for example, the first endpoint, with one request per second, has an average latency of, say, 500 milliseconds, and the hot-path is at 50 or even 10, once you multiply by the number of requests and have the full context, it may turn out that the hot-path is still worth optimizing.

[00:23:24] Henry Suryawirawan: Maybe a little bit of recap. So yeah, you should know the throughput of your system, request per second, and also the latency of each of that request. And then you multiply it, get a sense of distribution, right? What’s your P95? Maybe your average, your maximum and things like that. Maybe a little bit of tips here because sometimes engineers, when they write software, I don’t think they have this in mind in the first place. What kind of tools or what kind of techniques that they should introduce to their software in order to get all these numbers easily?

[00:23:53] Tomasz Lelek: If your system has a defined SLA, it should state that the system should be able to accept some number of requests per second with a specified latency. If you don’t have this data, you will need to measure it somehow. To get latency, you should write benchmarks that validate your application. We can, for example, use the Gatling tool or the wrk tool; there are a lot of them. Cassandra also has its own testing framework, which you can of course use to extract the latency. But the number of requests per second you need to get from the SLA, or from some product information: what kind of traffic are you expecting? How will the traffic grow over time? Then you can adapt your tests and measure again.

[00:24:39] Henry Suryawirawan: So sometimes also I think these days, people classify these things as observability, right? So having good monitoring or metrics as part of your software. Maybe instrument your code base so that you can get this data easily, maybe even a profiler. Some tools this day, they can embed instrumentation code inside your code so that you can get this profiling statistics easily. So thanks for highlighting this importance about optimizing hot-path.

[00:25:02] Trade-off 3: Data Locality and Memory

[00:25:02] Henry Suryawirawan: Let’s move on to the next one, which is what you mentioned about data locality and memory. Maybe you can share a little bit more about this.

[00:25:10] Tomasz Lelek: That applies to big data systems, but also to all data-intensive systems. There is an important optimization that you should make, and all of the recent frameworks are doing it. It also affects the way big data frameworks like Spark, Hadoop, and so on are built, and explains some of the complexities in them. Someone new to the big data ecosystem may think those complexities are unnecessary, or that the barrier to entry is really high. But data locality means that you move your computations to the data, and not the data to the computations. What does that mean? For example, you have your data saved on multiple nodes: Hadoop nodes, file system nodes, even database nodes. If you have some big data processing logic, for example finding the average of some value, and your processing is on different nodes, then you do not fetch the data from those nodes to perform operations on the computation nodes. Instead, you serialize the computations and send them to the data nodes. Sending computations is fast because they do not take a lot of data.

It’s basically serializing your Java, Scala, or other language program into a binary format and sending it over the wire to the data nodes. On the data nodes, some kind of executor needs to be running to deserialize the computation and apply it to the data on the local node. It can be done in a distributed way: multiple data nodes, each processing part of the data. Then there’s another phase of coalescing the results. Sometimes this phase involves sending data, depending for example on the partitioning scheme, but the amount of data will usually be substantially reduced compared to sending the data at the beginning. That’s basically the simplest description of data locality, and its implementation is provided within those distributed systems. I focus on Apache Spark, because it’s a recent technology that gives a lot of benefits over the older Hadoop-based technologies, mainly because it does a lot of processing in RAM, not on disk.

A good example of data locality is the join operation. If you’re joining data sets, you might have one big data set and a second, maybe smaller one. For example, you have a list of accounts and all purchases made by those accounts; there will be a lot of purchases, and the number of accounts will be smaller. Leveraging data locality also means making a sane decision about which data is sent to other nodes. In that case, we would send the smaller data set to where the bigger data set lives, to reduce the network traffic. This is called network shuffling, and it’s one of the biggest things you need to focus on to optimize big data processing. But it can also be generalized to just normal applications, right? If you need to fetch some data, you should ask the question: maybe it’s better to tell the other system to compute it and send me only the results.

Also, the second aspect here is partitioning. That is important: picking the proper partitioning scheme for your processing. You need to fit your partitions to your access patterns. For example, if you need to analyze data per month, it would be nice if one data node kept the data for the whole month. If that’s feasible, you can send a computation to that data node, the computations will be local to the whole month, and they can be executed in a local way. On the other hand, if your data is partitioned by, for example, account ID, and you need to extract data for the whole month, you will need to scan data from each data node, and that involves network traffic. I show a couple of those examples in this chapter, and how to pick the partitioning depending on the processing you plan to execute.

[00:29:16] Henry Suryawirawan: Thanks for sharing all these details about how big data distributed systems work. For small amounts of data, people are normally not really aware of such trade-offs. But it makes a lot of difference if you are processing a lot of data, and where you store the data matters, especially if your data cluster is large. The thing I pick up from your explanation: instead of sending the data to the computation, why not do it the other way around? You send the computation, which is your code, serialize it into binary, send it to the data, do some local processing there, and just send the result back. Or maybe you do some kind of join and reduce operation. In terms of the Hadoop paradigm, MapReduce: you basically send all these map operations to different nodes, and then you reduce them and aggregate them into one result.

So looking at this data locality, could you actually also apply this to microservices and that kind of architecture pattern? Is there something around that, or is this just applicable to big data?

[00:30:13] Tomasz Lelek: Yeah, I think it can, of course, also be applied to microservices. Microservices often need to fetch a lot of data from other microservices to create an end result for a customer, or maybe for another microservice. You might start noticing that your microservice needs to fetch a lot of data from multiple services, each returning thousands of records, and then you are just iterating over those records, filtering a subset of them, or even coalescing them into one result. Maybe that’s a situation where you could ask the second microservice to just compute this value and return one value. In that case, you will reduce the traffic. The computations will be done on the second microservice, and only the result will be returned. You will save network bandwidth and also CPU processing time, and so on.

[00:31:05] Henry Suryawirawan: Maybe a little viewpoint on this. In the past, in the traditional way, we used to have database stored procedures to process database queries, so you could actually compute on the database node. But over time, people moved away from that strategy and put the logic in the application code, so to speak, a backend API, where you just query the data and compute on the application backend. What’s your view on this? We used to do it near the data, in the database itself, and maybe over time, because of scalability or other considerations, people moved that kind of computation away from where the data resides. Do you have any thoughts on that?

[00:31:45] Tomasz Lelek: Stored procedures had a couple of problems: mainly testing, the stability of it, and they could also be very vendor-specific, so you are coupled to a specific vendor, among other problems. So the implementation of the idea was not ideal. But still, the direction had some pluses. There are those trade-offs, and you may want to keep the computations on the data node. As you mentioned, there is a trend to move computations back to the data, and all big data frameworks are doing that. But they are doing it at a bit higher level, to provide some abstraction and not tie you to a specific vendor. For example, in Spark, it doesn’t matter what format your data is in. It can be Avro, Parquet, or just plain text files on a file system. For Cassandra, there is a database connector that provides the abstraction, and all the logic is in an actual programming language. So it’s easier to test and easier to maintain. There are some pros to these approaches, of course.

[00:32:41] Henry Suryawirawan: Yeah. Looking at distributed data frameworks, I must say it’s also exciting. I’ve used Apache Beam and Dataflow as well, and Spark. There are lots of innovations around that. I’m really glad that some people are actually writing this kind of software, so that end users, so to speak, software engineers, just need to understand the paradigm and write the code appropriately.

[00:33:01] Trade-off 4: Delivery Semantics in Distributed Systems

[00:33:01] Henry Suryawirawan: Let’s move on to the next trade-off, which is what you call delivery semantics in distributed systems. What’s the importance of delivery semantics? Or maybe, to begin, explain to us: what are delivery semantics?

[00:33:14] Tomasz Lelek: Yeah. The system you are interacting with delivers data to you, and the question is whether it will be delivered at-least-once, at-most-once, or exactly-once, basically. If you are making network calls or interacting with any distributed system, these delivery semantics are important for you. Even if you have two simple HTTP microservices, and one is sending a request to the second one, there could be a network partition, for example, while the response is getting back to the first microservice. In that case, the second microservice accepted the data, processed it properly, maybe saved it to the database, but the first service doesn’t know what happened. And in that case, you need to make a decision from the perspective of the first service: what to do? We can retry the request. In that case, we will have a consistent view of the system once the request succeeds. But it means you may have submitted more than one request, so you may have duplicates. This is at-least-once: you will deliver the message, but you could deliver it more than once. It gives good guarantees on both systems, but it also has some problems, because the second system needs to be prepared for that; there could be duplicates.

You can design a system to be idempotent, so it doesn’t matter if you replay a message; it will result in the same outcome. For example, deleting the value for a specific primary key may be idempotent, because it doesn’t matter if you delete it once or 10 times; it will just be deleted. But some operations are harder to design to be idempotent, like updating some counter, because there will be inconsistencies. So we have at-least-once in that situation. If the consuming system is not aware of this, those duplicates may leave the data in an inconsistent state. On the other hand, on the producer side, you might decide not to retry in such a situation. But that also has a problem, because the fact that the response didn’t come back to the first service can also mean it was not processed at all; there was a failure. In that situation, the first service will not retry, and it might turn out that the second one didn’t get the value at all, so it was lost. That’s at-most-once: the message could be delivered zero times or one time. There will be no duplicates, but there is the possibility of zero deliveries. Of course, as everyone can imagine, that can cause problems when you lose some data in transmission. But there are use cases where it is good enough: for example, collecting metrics every millisecond or so; if you lose some of them, it may not be a problem.

Lastly, you can have effective exactly-once, because when the network is involved, you will need to make a decision about retries at some point; that happens even at the protocol level, the TCP layer, and so on. In effective exactly-once, the deduplication is hidden somewhere, or there is some kind of transaction built into your system. It’s hard to implement. It has some problems and complexities, and usually a performance overhead as well. You need to build it and be aware of that. Also, what’s important in microservice architectures, where multiple microservices connect with each other: even if there is effective exactly-once between two of them, if the next one does not provide effective exactly-once, you will have the same problems. So your whole pipeline needs to work this way.

[00:36:42] Henry Suryawirawan: So maybe just to recap again, there are three delivery semantics: at-least-once, at-most-once, and exactly-once. From my experience in my career, most systems, especially distributed systems, opt for the at-least-once delivery semantics, maybe because it seems much cheaper to accomplish and less complex to guarantee. So, in your point of view, is it fair to say that for most use cases, we should design our systems using the at-least-once delivery semantics?

[00:37:14] Tomasz Lelek: Yes. Yes. I do think that’s correct. We can remove the need for deduplication by designing our system to be idempotent. It requires some thinking. For example, if you have an event-based system, it requires some thinking, but it’s feasible. You can, for example, send the whole state instead of only updates. If you have the whole state, you don’t need to do the counter updates I mentioned, so there will be no possibility of making the data inconsistent. If you design your system to be idempotent, at-least-once will work in a very good way. You can also build a deduplication mechanism. For example, each event, each message, can have a unique ID. It could be a UUID, and on the consumer side, you can keep some kind of table, a persistent map, where you map the ID to whether it was processed or not.

Although that has its own problems, because you might be operating with a distributed database, which may also involve network partitions. That’s why the update recording whether an event was processed or not also needs to be done in an atomic way. I cover this in the chapter as well. You need to apply this modification on the database itself; if nodes fetch the state and do some intermediate operations, there needs to be one atomic operation, for example, some kind of compare-and-set on the database, and NoSQL databases offer that as well. So we can implement that mechanism.

[00:38:39] Henry Suryawirawan: Another thing that is commonly related to this delivery semantics is when you choose your messaging queue. I think maybe these days, a lot of messaging queues opt for at-least-once as well. But there are some products that offer different delivery semantics. Maybe from your point of view, any kind of thought, what kind of messaging queues should people opt for? Or maybe some kind of technologies that people should be aware of with all these different delivery semantics?

[00:39:01] Tomasz Lelek: In this chapter, I refer to Kafka, because I have the most experience with that technology. Kafka is nice in that it allows you to configure your producer and consumer behavior to work in either of those ways. You can have at-least-once or at-most-once, depending on your need. You can tune your settings properly, commit offsets in the proper way, and design your end-to-end pipeline to work in either of those situations. Because, as you mentioned, there are use cases where at-most-once is good enough, but also use cases where you need at-least-once. Using Kafka, you can have both. It’s only a matter of configuration, and configuring consumers and producers properly. I cover this in chapter 11 of my book.
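As a sketch of what that looks like with the Kafka Java consumer client (broker address, group, and topic are placeholders): with auto-commit disabled, committing offsets after processing gives at-least-once, while committing before processing would give at-most-once.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "payments");                // placeholder group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Manual offset commits let us choose the delivery semantics ourselves.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());
                }
                // Commit AFTER processing: a crash before this line means the batch is
                // re-delivered, i.e. at-least-once. Committing BEFORE processing would
                // instead give at-most-once: a crash could lose the batch, never duplicate it.
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) {
        System.out.println("processed: " + value);
    }
}
```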

[00:39:50] Henry Suryawirawan: Thanks for sharing that. For those of you who are interested in more details about delivery semantics, make sure you check out chapter 11, in particular about messaging queues with Kafka and all the configuration you can do.

So Tomasz, it’s been a pleasure learning from you about all these different trade-offs. As software engineers, I hope we can all upskill ourselves to be aware of what kind of mistakes could possibly happen, and what kind of trade-offs we should think about before we implement something, so that we don’t make costly decisions that are quite difficult to retrofit, or sometimes quite impossible to fix. Because when data is actually involved, sometimes fixing data is not so easy.

[00:40:28] 3 Tech Lead Wisdom

[00:40:28] Henry Suryawirawan: So Tomasz, before I let you go, normally I have this one last question that I always ask all my guests, which is to share your three technical leadership wisdom for all of us to learn from, maybe from your journey or from your career. Can you share your three technical leadership wisdom?

[00:40:44] Tomasz Lelek: Yeah. So, the first one would be leading by example. If you want your team to enforce some specific standards, you should set the example. For example, if you want good test coverage, you should also do it. If you want good descriptions of your work, you should also do it, and others will follow in that direction.

Second one: always ask the question “Why?” If you are given something to do, to implement a specific feature that will result in complexity, you should always be aware of what value it will bring. Does it give you any value or not? What’s the end-to-end flow of this new feature? How will it impact your customers, and so on. So, be aware of this, not only the technical things. Ask why we even need to implement it. Every piece of code is a maintenance overhead and cost. So if you are able to solve the problem without code at all, that would be ideal.

And the third one is to create a blameless culture, so that no single person is blamed. If you are delivering software and there was some error or failure in production, chances are high that it was a problem with the process and not with a single person. It means you should change your process, fix the process so the same mistake is not made in the future, and learn from the mistakes. So create some meeting to discuss how it happened, and try to incorporate that into a new process that prevents it.

[00:42:15] Henry Suryawirawan: Thanks for sharing all this wisdom. I find it interesting, especially the second one: always ask why, especially if it involves some kind of complex or big effort, before you actually go down the rabbit hole and realize, oh, actually it’s not so important, or there’s an alternative way where you don’t need to write so much code. So thanks again, Tomasz. For people who want to reach out to you or learn more about the book, when is it going to be published? Where can they find you online?

[00:42:42] Tomasz Lelek: Yeah. So on LinkedIn, and also Twitter. My Twitter handle is tomekl007, and it’s the same for GitHub. I will respond to any question, so feel free to ask. Regarding the book, you can also join the Manning Slack channel. This week, our book is the book of the week, which means everyone can ask any question regarding the book, and I’m answering them.

[00:43:05] Henry Suryawirawan: That’s cool. So thanks again, Tomasz. I wish you good luck with the publication of the book. Looking forward to it.

– End –