#131 - Data Essentials in Software Architecture - Pramod Sadalage
“The notion of transaction, consistency, and ACID compliance are many times tech imposed. It should be the business that makes the decision. We as technologists should not make that decision.”
Pramod Sadalage is a Director at ThoughtWorks and the co-author of the Jolt Award winning “Refactoring Databases”. In this episode, we discussed data essentials in software architecture. Pramod started by explaining why dealing with data is hard in software architecture and some data related concerns we should think about when making architecture decisions. He then shared the thought process of how we can choose the right database for our purpose and shared insights on data modeling differences between SQL and NoSQL. Pramod also touched on the important considerations in managing transactions and the trade-offs between ACID and eventual consistency. Towards the end, Pramod shared practical, step-by-step advice on how we can split a monolithic database through database refactoring.
Listen out for:
- Career Journey - [00:04:23]
- Data is Hard - [00:15:57]
- Data Related Architecture Concerns - [00:18:36]
- Choosing the Right Database - [00:24:19]
- Data Modeling in SQL vs NoSQL - [00:30:28]
- Managing Transactions - [00:37:31]
- Tradeoff Between ACID & Eventual Consistency - [00:44:06]
- Refactoring Database - [00:46:58]
- 3 Tech Lead Wisdom - [00:54:58]
_____
Pramod Sadalage’s Bio
Pramod Sadalage is Director at ThoughtWorks where he enjoys the rare role of bridging the divide between database professionals and application developers. In the early 00’s he developed techniques to allow relational databases to be designed in an evolutionary manner based on version-controlled schema migrations. He is co-author of Software Architecture: The Hard Parts: Modern Trade-Off Analyses for Distributed Architectures, co-author of Building Evolutionary Architectures - Automated Software Governance, co-author of Refactoring Databases: Evolutionary Database Design, co-author of NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, author of Recipes for Continuous Database Integration and continues to speak and write about the insights he and his clients learn.
Follow Pramod Sadalage:
- Twitter – @pramodsadalage
- LinkedIn – linkedin.com/in/pramodsadalage
- Website – sadalage.com
- Database Refactoring – databaserefactoring.com
- DevOps for DBA – devopsfordba.com
- Agile Data – agiledata.org
Mentions & Links:
- 📚 Refactoring Databases – https://www.amazon.com/gp/product/0321774515/
- 📚 NoSQL Distilled – https://martinfowler.com/books/nosql.html
- 📚 Recipes for Continuous Database Integration – https://www.thoughtworks.com/insights/books/recipes-for-continuous-database-integration
- 📚 Software Architecture: The Hard Parts – https://www.oreilly.com/library/view/software-architecture-the/9781492086888/
- 📚 Building Evolutionary Architectures – https://www.oreilly.com/library/view/building-evolutionary-architectures/9781491986356/
- FAIR principles – https://datamanagement.hms.harvard.edu/news/fair-data-principles-how-can-i-apply-them-my-study
- CAP Theorem – https://en.wikipedia.org/wiki/CAP_theorem
- MapReduce – https://en.wikipedia.org/wiki/MapReduce
- Strangler pattern – https://microservices.io/patterns/refactoring/strangler-application.html
- Dr. Gajanan Chinchwadkar – https://www.linkedin.com/in/dgraph-cto/
- Martin Fowler – https://martinfowler.com/aboutMe.html
- Ward Cunningham – https://en.wikipedia.org/wiki/Ward_Cunningham
- Scott Ambler – https://en.wikipedia.org/wiki/Scott_Ambler
- Riak – https://en.wikipedia.org/wiki/Riak
- Neo4J – https://neo4j.com/
- Amazon Neptune – https://aws.amazon.com/neptune/
- DGraph – https://dgraph.io/
- TigerGraph – https://www.tigergraph.com/
- Apache Cassandra – https://cassandra.apache.org/_/index.html
Career Journey
-
By that time, I had come up with this concept of every person working on the code base should have their own schema. They shouldn’t be like sharing schemas. Because early in the beginning, I noticed that if they’re sharing schemas and I make a change for Henry, then John here gets affected because the schema is a single point of failure.
-
And then at some point in time, it so happened that this was all going good, but now we are trying to figure out, okay, how do I deploy QA? How do I deploy to prod? How do I figure out what was last deployed? What are the diffs between that deployment? So that’s where the whole concept of version control, migration scripts came about.
-
One problem was how do I know what change was when this build was made? So then we started publishing what was the last change that was available when this build was (published), and really took that, put it as a tag in the build itself, and published the build package. So whoever unpacked the build knew what change was needed for that build to work.
-
I did give that talk and I felt like that was a good way to share knowledge. Like you learn something, it’s better to share instead of just keeping it to yourself.
-
Around 2008, I would say this whole NoSQL movement was like really firing up and lots of stuff like the Hadoop paper (and CAP theorem) had come about. I started talking, thinking about the way you design objects in object-oriented languages and how you store them in relational databases, the translation that needs to happen. It’s not the same with NoSQL databases. Like object-oriented stuff, how does it translate into documents? How does it translate to key value stores? How does it translate to column stores? How does it translate to graph stores? So that gave me an idea. We should probably talk about how this modeling works and why you would choose one over the other.
-
Relational databases picked consistency over anything else. They picked a bunch of that kinda stuff (configuration) for you, and you didn’t have to worry about anything else. So by default you get those things and developers didn’t have to worry about other choices, because there was no choice. But once you go to a NoSQL database, there’s a lot of choice. So you have to decide what is the choice that you want. The availability of choices creates lots of options, and at the same time, also creates lots of pitfalls. What if you make the wrong choice?
-
How we should dispel people from like, “Oh, NoSQL is schema-less, so I don’t have to worry about schema.” That’s not true. The schema is there in your code. It’s not in the database, but it’s still there in your code, so you have to keep track of it. And how you shouldn’t have to be like, “Is it SQL or is it NoSQL?” Within the same enterprise or within the same system, you could be having more than one. And that’s where the polyglot word came about. You could have more than one type of storage.
Data is Hard
-
I personally think many times, the data people are excluded from architectural decisions. And it may be by choice or it may be like, “Oh, once we decide on the architecture, we can just store data somewhere.” And I think that’s a misnomer. You should be including data architects, people who deal with data into the architecture decisions. Because long term where data is stored, how it’s stored affects a lot of different things.
-
Ultimately, all business really wants to see is not your code, but data. Because business runs on data. Data is what is stored. It persists. Applications come and go. So how to like, make sure data is available and findable, interoperable.
-
Inclusion of the data architects in this whole enterprise architecture, or architecture generally, should be encouraged. And even data architects should be saying, “I wanna be there in that discussion.”
Data Related Architecture Concerns
-
The first thing, probably a higher level thing that you should be thinking about is what are the forums where data people should be invited? And a bunch of clients that we go at, I tell them anywhere where there’s an architect involved, you should have a data architect.
-
The area of concern is that you want to give people enough freedom to innovate, to do things. But at the same time, it shouldn’t be such a blank canvas that they can do whatever they want. You should be having some guard rails saying, okay, you could use certain things within these parameters and we won’t ask anything. But if you have to go outside of this, then you have to come and justify why. We are not saying no, but at least come and justify why.
-
At some places, we have come up with a flowchart for picking the right database. Architecture governance matters a lot in this kind of scenario, because for the solution’s sake, that may be great. But for maintenance’s sake, for operations’ sake, that other thing over there is a hindrance on the operational side. So you have to have very good reasons why that is necessary.
-
The other is sometimes we tend to pick technologies or styles or architecture styles based on our own biases of understanding. If I am a data type of person, my bias is to make sure all data is correct. I have taken care of query load vs write load vs read load and all that kind of stuff. And then do the right kind of design. But somebody else coming from the other side may not be thinking about those things. They may be thinking about modularity, they may be thinking about coupling vs cohesion or orchestration vs choreography and a bunch of other stuff. All valid concerns. So that’s why I say like when any kind of new stuff comes up, they (software and data architects) should pair and plan.
-
One other place where I think this pairing really helps is when product managers or business people are coming up with new ideas and they’re saying, okay, let’s at least scope out and see how much this would cost. Is this even feasible? That’s where the data architect should also be, because they can tell you where can I get the data from? Where is the data available? Do we even have this kind of data available?
-
The other place is also to talk about, especially on the analytics side, like our reporting side or things like that. How is data being given to you? What is the lineage of the data? How or what kind of transformations are being applied? What are the standards being followed? What kind of quality rules? Even in application development, I think like naming standards, quality standards, and a bunch of those kinds of stuff is where data people can really help you.
-
The other thing also is probably coaching and mentoring. There should be a constant flow of ideas from the software side to the data side, from the data side to the software side. There may be some techniques I can use on the data side that can scale this much easier versus you scaling the app side like crazy.
Choosing the Right Database
-
The first one is productivity. Right now, human time costs more than anything else. So if you are working longer hours, that cost is much more than the cost of any other technology you buy. Productivity is one thing that drives a lot of these decisions. Like, oh, this thing makes me productive. I don’t need to worry about other stuff. And then the other decisions, then optimize later, but that’s the primary decision thing that we’re going after.
-
The second is that some use cases need a lot of thought to pick the right database. Let’s say there is some use case where you have to store like trillions of rows, and that won’t fit in a single machine, so I need to find something that can distribute itself, easily shard it, or whatever. What are my read patterns and what are my write patterns, right? So you need to analyze that a little bit and figure out what is the style of database that works for that.
-
20 years ago, we used to think a lot about disk space, how much disk space we are using, and the cost of the disk space and all that stuff. Today, the amount of time you spend in thinking about that will pay for a disk. So the cost of the disk space is no longer a question. So it’s more about how much can I write? How much can I read? How fast can I read, how fast do I need to read, and a bunch of that kinda questions come into play. And then you pick the right kind of database for that.
-
Like some databases you write only once, but read millions of times. So then maybe you need something that is read performant. In some other places, the write efficiency may matter a lot, because data is coming so fast at you and you want to make sure you can write it. So then you pick something that caters to write efficiency.
-
And the third style is, what is it you want to get outta that database? Like for analytical purposes, are you doing lots of rows of aggregations and that kind of stuff? Or are you doing like graph traversal or graph analytics or relational analytics or that kinda stuff?
-
Nowadays, there are many databases that are coming up, but if you really look deeper, classically there are like five types at a very broad level, like relational databases, key value stores, document style databases, wide column stores, and graph stores. And then you can decide which type of database that you want to look at. And then, within that type, there may be multiple choices available to you. And then you pick based on that.
-
Now how do you pick that choice is a question that always comes up. And then to do that, generally, what I tend to do is set up a scoring metrics of what is it that I want to pay? Because every company can have different measurements. Some people may say familiarity is very important. Some people may say, oh, I don’t want to leave the AWS landscape that we are in. Some people may say, oh, open source matters to me, or distributed databases matter to me, and that kinda stuff.
-
That, I think, is what you need to set up for your own company or your own situation and come up with what are you going to measure the products on. So set up a rubric for yourself and then score these products. And the answer will just come up in a spreadsheet and then you pick one based on that.
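The rubric described above can be sketched as a small weighted scoring matrix. This is only an illustration under stated assumptions: the criteria, the weights, the 1 to 5 scores, and the candidate names are all hypothetical placeholders, not ratings of real products.

```python
# Hypothetical criteria and weights; every team would choose its own.
WEIGHTS = {
    "team_familiarity": 0.35,
    "fits_cloud_landscape": 0.25,
    "open_source": 0.15,
    "horizontal_scaling": 0.25,
}

# Hypothetical 1-5 scores for placeholder candidates, not real product ratings.
SCORES = {
    "candidate_a": {
        "team_familiarity": 5,
        "fits_cloud_landscape": 3,
        "open_source": 5,
        "horizontal_scaling": 2,
    },
    "candidate_b": {
        "team_familiarity": 2,
        "fits_cloud_landscape": 5,
        "open_source": 1,
        "horizontal_scaling": 5,
    },
}


def weighted_score(scores, weights):
    """Return the weighted sum of one candidate's scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(candidates, weights):
    """Return (name, score) pairs sorted from best to worst."""
    totals = {name: weighted_score(s, weights) for name, s in candidates.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, total in rank(SCORES, WEIGHTS):
        print(f"{name}: {total:.2f}")
```

Changing the weights is the interesting part of the exercise: a team that values scaling over familiarity would get a different ranking from the same scores, which is what makes the trade-off visible.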
Data Modeling in SQL vs NoSQL
-
Many times, people pick up new types of databases for themselves. They have picked a new database, but their design thinking still is in the relational world. So I have seen a lot of document databases where the collections have been designed as if they’re tables. And then the complaint comes in, like, I can’t join these collections.
-
A document database is not supposed to join collections. You have to have like an aggregate version of the collections. That’s where this domain-driven design way of thinking really helps. So in the NoSQL Distilled book, we talk about what is the boundary of your aggregate.
-
When you go back to the standard relational way of thinking and you split too much, of course, the database is not performant. The first thing I recommend when people pick a new database style is to do a pet project. Play with it. Try to understand, okay, how do I need to think about it? How do I need to worry about it? If you need to get training, get training. If you need to like talk to someone who’s experienced in that database, talk to them. Understand these design patterns that are different.
-
Figure out how this tool works, get some training, talk to someone who has experience, play with it a little bit, solve a sample problem, and understand how this thing works.
-
One other thing I’ve also done is if you know that in the space of this database type, I need to pick one technology—I was talking about the rubric before—so performance can also be a rubric on that.
-
Setting up that experiment may take a little bit of time, but it’ll save you time in the future. So create a framework or create a harness, which puts in all this data in there, and then you run your query load. And this is where knowing your query pattern makes a lot of sense. Like if you don’t know how many writes and how many reads you have, you cannot set up a performance harness.
-
Set that up and run it and see what it does. And then you’ll figure out, okay, should I change my design pattern? Should I change my product? Should I change my database type? Should I change my technology? Whatever it is, it tells you where you need to focus more.
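A performance harness of the kind described above might look roughly like the following sketch. It assumes an in-memory SQLite table as a stand-in for whichever database is being evaluated; the `orders` table, the `read_ratio` parameter, and the function names are hypothetical and chosen only for illustration.

```python
import random
import sqlite3
import time


def build_store():
    """Create an in-memory SQLite table as a stand-in for the candidate database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    return conn


def run_mix(conn, operations, read_ratio, seed=42):
    """Run a read/write mix and return counts plus total seconds per operation type."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    stats = {"reads": 0, "writes": 0, "read_seconds": 0.0, "write_seconds": 0.0}
    for _ in range(operations):
        customer_id = rng.randint(1, 100)
        is_read = rng.random() < read_ratio
        start = time.perf_counter()
        if is_read:
            conn.execute(
                "SELECT COUNT(*), SUM(total) FROM orders WHERE customer_id = ?",
                (customer_id,),
            ).fetchone()
        else:
            conn.execute(
                "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                (customer_id, rng.uniform(1, 500)),
            )
        elapsed = time.perf_counter() - start
        kind = "read" if is_read else "write"
        stats[kind + "s"] += 1
        stats[kind + "_seconds"] += elapsed
    conn.commit()
    return stats


if __name__ == "__main__":
    # Assumed workload: 80% reads, 20% writes.
    print(run_mix(build_store(), operations=10_000, read_ratio=0.8))
```

The `read_ratio` argument is where knowing the query pattern comes in: without an estimate of how many reads and writes the system sees, there is nothing meaningful to pass here.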
Managing Transactions
-
That is always such a use case specific discussion that needs to be had with the business. One thing I have realized in the past probably 15 years is, this whole notion of transaction and consistency and ACID compliance and all that stuff is many times tech imposed. All the business wants is don’t lose my transaction. They don’t really care if it’s like ACID compliant or this or that.
-
Many times, I tend to go back to this discussion of how much money do you want to put? Like money is a proxy for time here, basically. How much money do we wanna invest to make this ACID compliant or make this 100% consistent? And the business should be driving that. They should be saying, oh, it’s okay if it’s 5 minutes late. Okay, then I have a different solution for that. Versus someone saying, oh, I need to make sure all stores are consistent all the time. And that costs a lot and is a totally different solution.
-
We as technologists cannot make that decision or should not make that decision. It’s the business that should make that decision, because they’re the ones who are deciding how is it they want their data? How is it they want their systems? And how much do they want to invest in that desire that they have? So we should expose this to the business.
-
The second thing is, let’s say someone says, I need the highly consistent database or highly consistent data. What are the patterns in which we can implement the consistency that comes into play? And then within a microservice, if you have, let’s say, more than one database, then how do I make sure the consistency of the writes that are going on? And in most distributed systems, even if you’re writing to the same database, you’ll have consistency issues.
-
You still have to think about, even if it’s the same database, what are the consistency requirements that I consider. So every write could have different consistency requirements.
-
So every write, you can decide what is your consistency level. And many of these databases, especially if you’re writing to a single database, provide you with different types of consistency guarantees. Like quorum is one where it says, the majority of the nodes got it. And then you can say, oh, just like as long as one node got it, I’m fine. Or as long as there’s more than two nodes having it, I’m fine. Or like in MongoDB, I think for example, you can just say, oh, as long as the node received it, I’m fine. I don’t even care if it was written on the disk or not.
-
A bunch of these options are available and you can pick which option you want based on each different type of write, even within the same application. Making these kinds of decisions at probably a tech lead level or at architecture level, you can say, oh, this type of write will always go with this type of consistency. And probably encode it and don’t make the developer think for every write. Just have some kind of convention about it.
-
The other part is when you’re writing into two different databases. Like one probably is like, let’s say you write to a relational database and the other is going to a graph database or something. If I want to maintain consistency, like 100% consistency, what are the patterns that are available? Implementing those patterns, like you can use ZooKeeper or a transaction coordinator or something. A two-phase commit is a really hard problem. And you really need to think if I want it and what are the types of challenges that I am willing to take on to make that happen?
-
Many times in a polyglot setup, those things can be eventually consistent, and the relational side can be like 100% consistent. So you can even make a decision between that, like what is the use case in which I’m writing to multiple databases? How wrong is it for the recommendation to not be based on this last order that you wrote?
-
That’s a tradeoff. And you say, no, no, no, it has to be like 100%. All the orders that are committed here, the recommendation has to be based on that. Then you pick a transaction coordinator, you put all this, the architectural complexity increases because of that. That’s a tradeoff you make. And is it okay?
-
You should ask the business, is it okay? And then if they say yes, that’s how I want it, then yes. Then you increase the architectural complexity to meet that business need. Then introduce a transaction coordinator. And then, of course, those come with their own trade-offs in introducing transaction coordinator or like ZooKeeper or something. But then you made an informed choice about using it.
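The per-write consistency convention mentioned earlier in this section could be encoded roughly as below, so that developers do not have to decide for every write. This is a sketch under assumptions: the level names, the write types, and the policy table are hypothetical, and the code does not use or mirror any particular database driver's API.

```python
from enum import Enum


class Consistency(Enum):
    ONE = "one"        # a single replica acknowledged the write
    QUORUM = "quorum"  # a majority of replicas acknowledged the write
    ALL = "all"        # every replica acknowledged the write


# Hypothetical team convention, agreed with the business: write type -> level.
WRITE_POLICY = {
    "order_placed": Consistency.QUORUM,
    "payment_captured": Consistency.ALL,
    "page_view_logged": Consistency.ONE,
}


def required_acks(level, replicas):
    """Return how many replica acknowledgements a level needs."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if level is Consistency.ONE:
        return 1
    if level is Consistency.QUORUM:
        return replicas // 2 + 1  # strict majority
    return replicas


def consistency_for(write_type):
    """Look up the agreed level; unknown write types must be decided explicitly."""
    try:
        return WRITE_POLICY[write_type]
    except KeyError:
        raise ValueError(f"no consistency convention for {write_type!r}") from None


if __name__ == "__main__":
    for write_type in WRITE_POLICY:
        level = consistency_for(write_type)
        print(write_type, level.value, required_acks(level, replicas=3))
```

Failing on an unknown write type, rather than silently defaulting, is a deliberate choice in this sketch: it forces the conversation about what that write actually needs.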
Tradeoff Between ACID & Eventual Consistency
-
I like to think about everything as a tradeoff. When you say ACID compliant, it makes my life easy. Great. But you traded off scaling for that, right? So when scale comes in, then you start having these replicated databases in relational databases, setting up replication, and making sure replication works properly. All of that is extra stuff you took on because you traded off scaling for ACID compliance. So when that happens, unknowingly, you have traded something for something else.
-
And I am saying make that visible for yourself. I picked ACID compliance. What did I give up? And once you start thinking about those terms, it makes a lot more sense.
-
That’s why I was saying like pairing with someone else in this time makes sense, because you have your own biases and everyone has. Even I have my own biases.
3 Tech Lead Wisdom
-
Keep learning new things, whatever it is, in your own work area, or it could be outside of the work area.
-
Make sure you are mentoring someone, always. Make sure you’re always mentoring others. And I have found that mentoring others gives me new perspectives on things. Because I may be telling them things, but I’m learning a lot while I’m telling them.
-
Whatever decisions you are making, involve the business, because it has cost implications.
-
I have said this probably three times already in this episode. And as technology people, I don’t think we can decide on cost. It is the business’s decision to take, so give them the choices. Because sometimes, they don’t know the choices and they just assume. Like using big words like, do you want ACID versus eventual consistency? They have no idea what you’re talking (about and what consequences the decision creates).
-
Give them the choices, show them like a rubric or show them competitive analysis. Show them if I choose this, these are the implications. If I choose that, those are the implications. And here’s the cost for A versus cost for B. Now pick one. Because that gives them lots of information on which they can make a decision. And it’ll make you a better technologist, because you are thinking as a neutral person, giving them both choices.
-
And that forces you to do deep analysis and research when you’re putting in those two choices. If one choice has lots of pros and the other choice has lots of cons, then they know that you basically want this and not that. And that makes you focus more on how do I analyze this? How do I make sure there are pros, proper pros on both sides, proper cons on both sides, so that people can make the right decision? And I think it makes you a better technologist when you make someone else take the decision and you are giving all the information.
[00:01:03] Episode Introduction
Henry Suryawirawan: Hey again, everyone. Welcome to the Tech Lead Journal podcast, the podcast where you can learn about technical leadership and excellence from my conversations with great thought leaders in the tech industry. If this is your first time listening to this show, subscribe and follow the show on your podcast app and social media on LinkedIn, Twitter, and Instagram. And for those of you longtime listeners who want to appreciate and support my work, subscribe as a patron at techleadjournal.dev/patron or buy me a coffee at techleadjournal.dev/tip.
My guest for today’s episode is Pramod Sadalage. Pramod is a Director at ThoughtWorks and the coauthor of several data and architecture books, such as “Software Architecture: The Hard Parts” and the Jolt Award winning “Refactoring Databases”. In this episode, we discussed data essentials in software architecture. Pramod started by explaining why dealing with data is hard in software architecture and some data related concerns we should think about when making architecture decisions. He then shared the thought process of how we can choose the right database for our purpose and shared insights on data modeling differences between SQL and NoSQL. Pramod also touched on the important considerations in managing transactions and the trade-offs between ACID and eventual consistency. Towards the end, Pramod shared practical advice on the step-by-step how we can split a monolithic database through database refactoring.
This is such a great discussion with Pramod to discuss all things about data and databases, and I hope you enjoy it as much as I do. And if you do, I would appreciate it if you can help share this episode with your colleagues, friends, communities, so that more people would also be able to learn from this episode. Please also leave a five-star rating and review on Apple Podcasts and Spotify, which I will highly appreciate. Let’s go to my conversation with Pramod after a few words from our sponsors.
[00:03:22] Introduction
Henry Suryawirawan: Hey, everyone. Welcome back to another new episode of the Tech Lead Journal podcast. Today, I have with me a consultant at ThoughtWorks. He has been there for 25 years. That’s pretty long, in my view. His name is Pramod Sadalage. He’s actually someone who is very experienced in dealing with data, databases, the intersection between data and applications.
He’s also the well-known author of this book called Refactoring Databases. Even though it has been written a long time ago, I think it’s still applicable and relevant these days, whenever you wanna deal with RDBMS and how to make changes and evolution towards your schema. He’s also the co-author of NoSQL Distilled, Recipes for Continuous Database Integration, Software Architecture: The Hard Parts, and the recent book Evolutionary Architecture, which is the second edition.
So Pramod, welcome to the show. Really looking forward for this conversation to talk a lot about databases and data.
Pramod Sadalage: Yeah. Thank you, Henry. Data and databases have been my passion for almost 30 years now, so I’m happy to talk about it.
[00:04:23] Career Journey
Henry Suryawirawan: Nice. So Pramod, in the beginning, I always love to ask my guests to share a little bit more about yourself first. Maybe if you can share some career journey or any highlights, turning points that you feel are worth sharing with the audience here.
Pramod Sadalage: My journey started back when I was in seventh grade with computers. Our school had these two computers for the whole school, and you could go play with them and do some programming. So I did a little bit of BASIC and a little bit of playing games and that kinda stuff. That got me hooked onto computers. And so all through high school, I did computers. And then I got into computer engineering at my college undergrad level and finished that. And during that time, we had this amazing professor, Dr. Gajanan Chinchwadkar, and he really inspired me on how to think about computers, operating systems, or distributed computing, and that kinda stuff. So I did a project on distributed computing, (trying to answer) what is the most efficient way to distribute work among a cluster of computers and that kinda stuff. So really interesting work.
And then after that, job hunt and stuff, ultimately ended up in ThoughtWorks around ‘99, three or four years in the middle at different places. (My first project at ThoughtWorks,) we were doing this amazing project called Atlas, where we had an existing database that was being worked on, a big upfront design kind of database that was already designed, 600+ tables. And then this whole notion of Agile and XP kind of like was catching fire and Martin Fowler and Ward Cunningham came in to coach us. And those coaching sessions over a period of time were so mind changing that you would think, “Why was I doing upfront design in the first place?” C’mon, like, why did I even do that, right?
So we had this first iteration planning. And we said, “Oh, we’ll do this much functionality, whatever (is necessary).” And I went back to my desk and deleted all the tables. That’s the first thing I did. And that’s a scary thought for a data person. But I did. And every iteration as we planned new stuff, we would do new functionality, we would add things, we would add things to the table and that kinda stuff, or move, split tables, do a bunch of that constant change. And there are about 30 or so developers in that team, plus testers, plus BAs, plus a bunch of other people. And I was the only data person. So any change that came to the database, I literally walked into their space, discussed with them, did the change with them, and then I came up with the scheme of like, let’s say Henry makes a change, how do the rest of the 30 people get that change? Because by that time I had come up with this concept of every person working on the code base should have their own schema. They shouldn’t be like sharing schemas, right? Because early in the beginning, I noticed that if they’re sharing schemas and I make a change for Henry, then John here gets affected because the schema is a single point of failure.
So I said, okay, let’s split everyone into (their own) schemas. I came up with like a bunch of scripts that they can create their own schema, they can create their own, like populate that schema and all kinds of bunch of batch scripts. I wouldn’t call them DevOps, but a bunch of batch scripts on Windows machines that you could run, like create-my-schema.bat, and it would create a schema for them and that kinda stuff. And once that happened, and then we came up with this concept of, oh, I want to modify, or I wanna split, I would pair with them and come up with techniques.
And then at some point of time it so happened that this was all going good, but now we are trying to figure out, okay, how do I deploy QA? How do I deploy to prod? How do I figure out what was last deployed? What are the diffs between that deployment? So that’s where the whole concept of version control, migration scripts came about. So we didn’t really have the number of things that people do nowadays, or timestamped based, those things that they do. I just kept track of change one, change two, change three myself. And I kept applying them and I knew what change was applied, and I started tracking metadata in those schema saying what schema is at what level. A bunch of that kinda stuff. And then ultimately we figured out progression of changes, right?
And so when you are figuring out a new path somewhere, like you encounter a bunch of problems as you go, right? And then we are just trying to like, find out solutions to those problems. Like one problem was, okay, how do I know what change was when this build was made? Right? So then we started publishing what was the last change that was available when this build was (published). And really like took that, put it as a tag in the build itself and published the build package, right? So whoever unpacked the build knew what change was needed for that build to work. So a bunch of that small stuff like that, if you really think of those, all of them were inconsequential, but a collection of them was really effective. So lots of those kinds of stuff.
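The scheme described in the last two paragraphs (numbered changes applied in order, metadata in the schema recording its level, and a build tagged with the change it needs) can be sketched as follows. This is an illustration only, assuming SQLite as a stand-in database; the `schema_changes` table, the sample changes, and the function names are hypothetical and are not the scripts used on the project.

```python
import sqlite3

# Hypothetical numbered changes, applied strictly in order.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
    (3, "CREATE TABLE customer_order (id INTEGER PRIMARY KEY, customer_id INTEGER)"),
]


def current_level(conn):
    """Return the highest change number applied to this schema (0 if none)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_changes (change_number INTEGER PRIMARY KEY)"
    )
    row = conn.execute("SELECT MAX(change_number) FROM schema_changes").fetchone()
    return row[0] or 0


def migrate(conn, target=None):
    """Apply pending changes in order, up to `target` (default: all); return the new level."""
    level = current_level(conn)
    for number, statement in MIGRATIONS:
        if number <= level or (target is not None and number > target):
            continue
        conn.execute(statement)
        # Record the change so the schema itself knows what level it is at.
        conn.execute("INSERT INTO schema_changes (change_number) VALUES (?)", (number,))
        conn.commit()
    return current_level(conn)


def build_can_run(conn, required_change):
    """Compare the change number tagged in a build with the schema's level."""
    return current_level(conn) >= required_change


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print("schema level:", migrate(conn))
    print("build tagged with change 3 can run:", build_can_run(conn, 3))
```

Because each schema records its own level, the same runner can bring a developer's schema, QA, or production up to whatever change a given build was tagged with.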
And Martin at some point looked at the whole scheme I had set up and said, this is really interesting, Pramod. Like you should speak about this, talk about this, let other people know about it. And so he took me to Madison Java User Group, and there were like 600 people in the big auditorium there. And everyone had come to listen to Martin, not me. But Martin put me in front of those 600 people for the first time. And I was shaking and all kinds of stuff. But I did give that talk and I felt like that was a good way to share knowledge. Like you learn something, it’s better to share instead of just keeping it to yourself. So I started sharing. And then from there on like lots of other talks like at XPCon and XP days, different places, and I refined based on questions. Like in your project, there’s a certain context, and when you’ve solved all that context, there are no more new problems to solve. So when we go out and speak to people, give conference talks or generally speak to people, you hear different contexts, right? And then you want to learn how to solve those, like branching and merging. We didn’t do branch and merge because at that time we always did trunk-based stuff. But some people did that (branching). And like how do you manage change sets when you’re branching and merging, right? So really interesting stuff like that.
So ultimately I think around 2004 or something, Addison-Wesley and Martin came up with this idea of whatever I was doing, and similar stuff was also being done by Scott Ambler. He (Martin) basically put us both together and said, why don’t you write this book, the “Refactoring Databases: Evolutionary Database Design” book that we talked about. So we then took all the changes that I was doing on projects like move column, split column, add table, split table, and codified them and came up with a structure that, just like how Martin’s Refactoring book said, if you tell a name, like extract method, if someone says that, you know exactly what you’re talking about, right? So a language or a naming convention to a bunch of things that would happen. So something similar to that.
So by that time, IntelliJ had shortcut keys to extract method and a bunch of that kinda stuff. We were inspired by that, so we came up with that naming convention and came out with about 70 or so patterns that we put in the book. And then I talked a lot about the book and later on I wrote about Continuous Integration and that kinda stuff.
Around 2008, I would say this whole NoSQL movement was like really firing up and lots of stuff like the Hadoop paper (and CAP theorem) had come about, and lots of people were doing different things with it. MongoDB, I think 0.5 or 0.8 had come out. So I did a project with MongoDB. We also did a project with Riak.
And then I started talking, thinking about the way you design objects in object oriented languages and how you store them in relational databases, the translation that needs to happen. It’s not the same with NoSQL databases, right? Like object oriented stuff, how does it translate to documents? How does it translate to key value stores? How does it translate to column stores? How does it translate to graph stores? So that gave me an idea, we should probably talk about how this modeling works and why you would choose one over the other.
I like to talk about this concept of like automatic car, like gears. Like you put in drive and you just drive. You don’t worry about what’s underneath, right? You don’t worry about first gear, second gear, third gear, clutch, and this and that. Like automobile reference here.
That’s what relational databases were. They picked consistency over anything else. They picked a bunch of that kinda stuff (configuration) for you, and you didn’t have to worry about anything else. So by default you get those things and developers didn’t have to worry about other choices, right? Because there was no choice. But once you go to NoSQL databases, there’s a lot of choice. So you have to decide what is the choice that you want. Because you can’t just go blindly saying, I don’t care about choice here. Once you go into that, the availability of choices creates lots of options. And at the same time also creates lots of pitfalls, right? What if you make the wrong choice, right?
So that’s why we thought about writing that book, the NoSQL Distilled book. And it was basically a broad survey of what types of NoSQL databases are there, how would you pick them? What is consistency? The CAP Theorem, distribution, Map Reduce, a bunch of that kinda stuff all together. And how we should dispel people from like, “Oh, NoSQL is schemaless, so I don’t have to worry about schema.” That’s not true. The schema is there in your code. It’s not in the database, but it’s still there in your code, so you have to keep track of it and that kind of stuff. And how you shouldn’t have to be like, is it SQL or is it NoSQL? It’s not that. It’s just like within the same enterprise or within the same system, you could be having more than one. And that’s where the polyglot word came about, is you could have more than one type of storage. And we give like a couple of examples and that kinda stuff. So that was pretty good.
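The point that the schema lives in code even when the database is "schemaless" can be illustrated with a small sketch. The `Customer` class, the field names, and the older `full_name` spelling are all hypothetical, chosen only to show where the implicit schema ends up.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: str
    name: str
    email: str = ""  # hypothetical field added in a later version of the app

def customer_from_document(doc: dict) -> Customer:
    """Turn a stored document into a Customer.

    The database would accept any shape; this function is where the
    schema is actually enforced and where older shapes are handled.
    """
    return Customer(
        customer_id=doc["customer_id"],
        # Assumption: older documents stored the name under "full_name".
        name=doc.get("name") or doc["full_name"],
        email=doc.get("email", ""),
    )

old_doc = {"customer_id": "c-1", "full_name": "Example Customer"}
new_doc = {"customer_id": "c-2", "name": "Another Customer", "email": "a@example.com"}

print(customer_from_document(old_doc))
print(customer_from_document(new_doc))
```

Every reader of those documents needs the same knowledge, which is why the schema still has to be tracked even though the database does not enforce it.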
And then later on, I wrote something with Neal Ford around software architecture, “Software Architecture: The Hard Parts”, about how do you pick (these databases). Because I personally think many times, the data people are excluded from architectural decisions. And it may be by choice or it may be like, “Oh, like once we decide the architecture, we can just store data somewhere.” And I think that’s a misnomer. You should be including data architects, people who deal with data, into the architecture decisions. Because long term where data is stored, how it’s stored affects a lot of different things.
Ultimately, all business really wants to see is not your code, but data, right? Because business runs on data. Data is what is stored, it persists. Applications come and go. So how to like, make sure data is available and findable, interoperable like the FAIR principles. How can they be applied and all that kinda stuff. So inclusion of the data architects in this whole enterprise architecture or architecture generally, I think, should be encouraged. And even data architects should be saying, I wanna be there in that discussion.
Yeah. So that’s my journey. I’ve been here at ThoughtWorks for almost 25 years. I’m still enjoying it. And right now, I’m doing a bunch of stuff around data mesh type of architectures and implementations of that idea.
Henry Suryawirawan: Wow. Thanks for sharing your story. It’s really interesting how you got this all started. Right? So from refactoring databases, from project experience, having to share it in front of the audience, for that large amount of audience, I think that’s pretty scary to me as well in the beginning, I believe.
And I think since then, yeah, you have dealt with a lot of data challenges and wrote a few books along the way. So I had Neal a couple of episodes before, talking about “Software Architecture: The Hard Parts”. And I think, with having you here in this episode as well, I wanna continue our conversation about the architecture, but dealing with data.
[00:15:57] Data is Hard
Henry Suryawirawan: I think, as we all know, dealing with data is very difficult. In fact, many people think it’s the most difficult thing. Because yeah, applications come and go like what you said, but the data always stays there. So the first thing is probably, if you can share with the audience here, how much has that changed these days with all the advancement of technology, different types of databases. Is data still the hard part of the architecture?
Pramod Sadalage: Data is still hard, because a bunch of data is locked inside commercial systems that are bought off the shelf. So a bunch of stuff just sits there. And then how do you get it out, so it’s useful to other parts of the organization? It’s really important. Many a times data is also stuck behind some kind of, I would say walls, because oh, it’s available in that data lake, but you cannot get to it. Or it’s available there, but you don’t have the right keys, you don’t have the right way to get it out, and that kinda stuff.
So right now, more than the desire, I think the paths to get to the data are difficult. And people are working on that. Like I think that’s why data mesh has become so popular, is because it gives you like this domain oriented way at getting at data. And you can deal with data in piecemeal stuff, instead of like, “Oh, I need to build this whole data warehouse before I get access to it.” Now people are saying, okay, can we like talk about it in smaller chunks? And time to market has reduced because of that, right? And that I think is making life easier.
There’s also this notion nowadays all cloud providers give you some type of storage mechanisms. And those are also like all the way from like file storage, block storage, all the way to analytics based storage kind of stuff. Again, there’s a bunch of choice there. But as people are using cloud technologies, interoperability of that data has become a little bit easier. And having access to that data has become easier.
But we are still dealing with the problem of people deciding to put stuff all over the place. And then knowing what did I store? Where was it stored? What’s the definition of that? What’s the meaning of that column or table or whatever. It’s still harder, right? So we need to be a little bit more disciplined and structured about how do I store stuff? I mean, the basics of computer science is naming things, right? Like, did you name your table right, column right? All kinds of stuff. Like, even if you have a JSON, really name the JSON attributes correct. So bunch of that kinda stuff still needs to be followed diligently, so that someone who has not worked with you can understand what that structure is.
Because again, we are so remote and split and all kinds of stuff. And someone may be looking at the JSON that you publish and they may not even know that you exist, right. So how do they get what the JSON means without spending a lot of cycles is the challenge.
[00:18:36] Data Related Architecture Concerns
Henry Suryawirawan: Right. And specifically, you also invited, like just now when you were sharing in the beginning, right? You invited people to include the data architects in the beginning whenever people are talking or deciding about the architecture, so that they’re involved. Maybe specifically in your experience, what will be some of the areas of concerns for a bunch of architects and data architects to discuss whenever they wanna make data related decisions?
Pramod Sadalage: Yeah, so I think the first thing, probably a higher level thing that you should be thinking about is what are the forums where data people should be invited? And a bunch of clients that we go to, I tell them anywhere where there’s an architect involved, you should have a data architect.
Having said that, the areas of concerns is generally you want to give people enough freedom to innovate, to do things, and that kinda stuff. But at the same time, it shouldn’t be such a blank canvas that they can do whatever they want. So a good example of this is if you’re a specific cloud usage kind of a company, and someone says, “Hey, that other cloud gives me that feature, I’m just gonna go use that.” So you should be having some guard rails saying, okay, you could use certain things within these parameters and we won’t ask anything. But if you have to go outside of this, then you have to come and justify why. We are not saying no, but at least come and justify why.
So at some places we have come up with a flow chart for picking the right database, for example, right? So you go through this tree of information decision making tree, and then you arrive at something. And if you wanna use that, just fine, like we won’t ask anything, you just start using it. But if you arrive at that decision and you say, oh, that other thing over there is better than this for my use case, then you should have a conversation with someone before you pick that up.
So architecture governance matters a lot in this kind of scenario, because for solutions sake purposes that may be great. But for maintenance sake, operational sake, that other thing over there is a hindrance on the operational side. So you have to have very good reasons why that is necessary. So that’s one.
The other is sometimes we tend to pick technologies or styles or architecture styles based on our own biases of understanding. If I am a data type of person, my bias is to make sure all data is correct. I have taken care of query load versus write load versus read load and all that kind of stuff. And then do the right kind of design. But somebody else coming from the other side may not be thinking about those things. They may be thinking about modularity, they may be thinking about coupling versus cohesion or orchestration versus choreography and a bunch of other stuff. All valid concerns.
But when you don’t pair that with the data side of the shop, then you end up like, Hey, this is all on software side, is all well designed, but I’m taking all of that and storing it as a blob somewhere that nobody else can read. And then you have difficult purpose of moderating, because now on the data side, you’re not moderating. So stuff like that matters. So that’s why I say like when any kind of new stuff comes up, they (software and data architects) should pair and plan.
One other place where I think this pairing really helps is when product managers or business people are coming up with new ideas and they’re saying, okay, let’s at least scope out and see, okay, how much this would cost, is this even feasible and that kinda stuff. And they generally tend to talk to the architects and say, oh, is this like workable? That’s where the data architect should also be, because they can tell you where can I get the data from? Where is the data available? Do we even have this kind of data available? And stuff like that then would also give you a way to make the feasibility question answered in a complete sense.
So the other place is also to talk about, especially on the analytics side, like our reporting side or things like that. How is data being given to you? What is the lineage of the data? How or what kind of transformations are being applied? What are the standards being followed? What kind of quality rules? Even in application development, I think like naming standards, quality standards, and a bunch of those kinds of stuff is where data people can really help you.
Because if the app has gone through development or any kinda software has gone through development for ten iterations. And then you tell, oh, this quality rule, this standard has to be applied, nobody’s gonna listen to you because there’s no time now. Because the PO or the product manager or the product owner is gonna say, I don’t have time, I need to take this to market. And now you’re carrying the tech debt forever, right? Instead, if there was earlier, you would’ve talked to them, you would’ve gotten those naming standards, you would’ve put all those things in place, so you would not have that tech debt in your space. So stuff like that, at a developer type level, those things matter a lot for a period of time.
The other thing also is probably coaching and mentoring, right? There should be a constant flow of ideas from the software side to the data side, from the data side to the software side. And those could be like, what can I use for certain things, right? There may be some techniques I can use on the data side that can scale this much easier versus you scaling the app side like crazy, right? There may be some things where I can say, “Hey, if I package this data up and give it to you, then you can distribute it much easier than me coming up on the data side design.” So with that give and take, I think, as a team you can come up with a better solution.
Henry Suryawirawan: Thank you for sharing all these areas of concerns. And yeah, I have to say for me personally as well, dealing with data related tech debt is not fun. First, it’s like we assume it’s risky also. And second thing is how we deal with the existing data, right? Sometimes we have to patch them, migrate them, and do all those stuffs in order to comply with the new decisions right. So I think it’s always not fun. And I think, like you said, if you can deal with it earlier, that is always the best preferred options.
[00:24:19] Choosing The Right Database
Henry Suryawirawan: You mentioned earlier as well about picking the right database types, right? And these days, I think I remember when I read your book NoSQL Distilled long time ago, there was probably only four types of databases, like the document, the graph, the key value store. But these days there are so many. In fact, there are plenty of new database companies being formed with their different type of characteristics.
So maybe if you can guide us also, for people here who are new with many different types of database as well, what will be some of the attributes or characteristics that we should think about whenever we want to pick the right database for our workload or for our use case? Maybe some guidance here will be great.
Pramod Sadalage: Sure. I think there are three or four things at the very high level that you need to think about. And then you can go down. As you go down, you can have more choices to worry about. The one is productivity, right? Like right now, human time costs more than anything else. So if you are working longer hours, that cost is much more than the cost of any other technology you buy.
So if something makes you more productive, like for example, if you’re writing JavaScript and NodeJS and that kind of stuff, I have a JSON, I just need to store the JSON and read the JSON, go with document type style databases. Kinda a simple choice, right? So I think productivity is one thing that drives a lot of these decisions. Like, oh, this thing makes me productive, I don’t need to worry about other stuff. And then the other decisions, then optimize later, but that’s the primary decision thing that we’re going after.
The second is sometimes, use cases which need a lot of thought to pick the right database. What I mean by that is, let’s say there is some use case where you have like, oh, I need to store like trillions of rows, and that won’t fit in a single machine, so I need to find something that can distribute itself, easily shard it, or whatever. So let me find the right type of database. Or there may be some instances where you say, oh, I have this external identifier that I always have and I can get data based on that always, or write data based on that always. Then you pick the style for that type of database, right? So like what are my read patterns and what are my write patterns, right? So you need to analyze that a little bit and figure out what is the style of database that works for that.
And the reason I’m saying is 20 years ago we used to think a lot about disk space, how much disk space we are using, and the cost of the disk space and all that stuff. Today, the amount of time you spend in thinking about that will pay for a disk, right? So the cost of the disk space is no longer a question. So it’s more about how much can I write? How much can I read? How fast can I read, how fast do I need to read, and a bunch of that kinda questions come into play. And then you pick the right kind of database for that.
Like some databases you write only once, but read millions of times. So then maybe you need something that is read performant. And I may want to write the same data three, four times in different styles, like different aggregations and that kind of styles. That’s fine because I’m reading so many times and I want to make it read efficient. In some other places, the write efficiency may matter a lot, because data is coming so fast at you and you wanna make sure you can write it. So write efficiency matters. So then you pick something that caters to write efficiency. And the third style is what is it that you want to get outta that database, right? Like for analytical purposes, are you doing lots of rows of aggregations and that kind of stuff? So let me pick something that helps that. Or are you doing like graph traversal or graph analytics or relational analytics or that kinda stuff, right? So let me pick something that works for that.
And nowadays, there are many databases that are coming up, but if you really look deeper, classically there are like five types built in at a very broader level, like relational databases, key value stores, document style databases, wide column stores, and graph stores. So most of them kind of fit in this style. Like even a block storage is nothing but a key value, right? In a block, the file name is a key. Everything inside is a value, right? If you think about it that way, then there are five big types and then you can decide which type of database that you want to look at. And then you can, within that type, then there may be multiple choices available to you. And then you pick based on that, right?
So if you take an example of, oh, I need a graph database because I’m doing graph traversal. Fine. Now you have options of Neo4J, Neptune, DGraph, TigerGraph, and a bunch of other choices. Now how do you pick that choice is a question that always comes up, right? And then to do that, generally what I tend to do is set up a scoring matrix of what is it that I want to pay? Like in this graph sense, I can pick Neo4J, TigerGraph, DGraph, and Neptune, put them on like a header type of stuff. And I have metrics on which I wanna measure them, because every company can have different measurements, right?
Some people may say familiarity is very important. And I have people who know Neo4J, so no discussion for me, I’ll just pick Neo4J. Some people may say, oh, I don’t want to leave the AWS landscape that we are in. So I’ll just pick Neptune and we are done. Some people may say, oh, open source matters to me, or distributed databases matter to me, and that kinda stuff. My nodes are gonna be really large, so I need a distributed graph database. So people say, okay, we’ll pick DGraph.
So that, I think, is what you need to set up for your own company or your own situation and come up with what are you gonna measure the products on, right? So set up a rubric for yourself and then score these products. It’s very easy, because most of them you can try. You can run a POC, you can do some research, and then set up that rubric and score yourself. And the answer will just come up in a spreadsheet and then you pick one based on that.
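The rubric described above is essentially a weighted score per candidate. A minimal sketch follows. The criteria, the weights, and every score are placeholders invented for illustration, and the candidates are deliberately anonymous so that nothing here reads as a rating of a real product.

```python
# Hypothetical criteria and weights; each organization would choose its own.
WEIGHTS = {
    "team_familiarity": 3,
    "fits_cloud_platform": 2,
    "open_source": 1,
    "distributed": 2,
}

# Placeholder scores on a 1-5 scale, e.g. filled in after research or a POC.
SCORES = {
    "candidate_a": {"team_familiarity": 5, "fits_cloud_platform": 2, "open_source": 4, "distributed": 2},
    "candidate_b": {"team_familiarity": 2, "fits_cloud_platform": 5, "open_source": 1, "distributed": 3},
    "candidate_c": {"team_familiarity": 1, "fits_cloud_platform": 3, "open_source": 5, "distributed": 5},
}

def weighted_total(scores, weights=WEIGHTS):
    """Sum of weight * score over every criterion in the rubric."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)

def rank(candidates, weights=WEIGHTS):
    """Candidate names ordered from highest to lowest weighted total."""
    return sorted(
        candidates,
        key=lambda name: weighted_total(candidates[name], weights),
        reverse=True,
    )

for name in rank(SCORES):
    print(name, weighted_total(SCORES[name]))
```

Changing the weights is how the same spreadsheet yields different answers for different companies: a team that values familiarity above all simply gives that criterion a much larger weight.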
Henry Suryawirawan: Thank you for the guideline. I think it’s really interesting the way you first mentioned about the ease of use or the productivity, right? So based on the developers or the language stack that you’re using.
[00:30:28] Data Modeling in SQL vs NoSQL
Henry Suryawirawan: One question related to that. In my experience working in multiple projects and different types of products, one thing that always happens is that I’ve heard multiple times the developer chooses a certain database. It works well in the beginning, but after a while it didn’t perform or it doesn’t scale. And a lot of learnings as well also point out that eventually, actually many people referred back to RDBMS, you know. Things starting to become like ubiquitous again, right? With RDBMS, maybe either MySQL, Postgres, or even cloud databases these days also have RDBMS compliant interfaces, although maybe the way they store or query is different.
So I’m interested to hear your thoughts about this coming back to SQL, coming back to RDBMS. Is it still the preferred default choice that people should choose? And when you talk about read write patterns as well, maybe if you can give rough ideas, when does RDBMS actually not scale, so that people need to start thinking about other types of database?
Pramod Sadalage: So many of times where people pick up new types of databases for themselves. They have picked a new database, but their design thinking still is in the relational world. So I have seen a lot of document databases where the collections have been designed as if they’re tables. And then the complaint comes in, like, I can’t join these collections. A document database is not supposed to join collections. You have to have like an aggregate version of the collections, right? That’s where this domain-driven design way of thinking really helps. So in the NoSQL Distilled book, we talk about what is the boundary of your aggregate.
So a good example of this would be, let’s say an e-commerce system. You have a customer that has multiple orders. And each order has order line items, has payment information, shipping information, probably packaging information on it, like is it a gift, not a gift, and that kinda stuff, right? In a relational way of thinking, you have a customer table, you have an address table, you have order table, you have order item table, probably have a payment information and a bunch of these kinds of tables, right? And when you go to document databases, you can’t say, oh, I’ll take that idea and just have multiple collections of these things. Like I have a customer collection, I have address collection and that kinda stuff. Because when you put this back together, you have split the aggregate into really small parts and now you have to put it all back together. And the database was not built for putting all those things back together.
So where do you pick your aggregate boundary? One example would be, I have a customer aggregate. Inside the customer aggregate, I have all the customer information and I have a collection or a list of orders inside it. And then each order in turn, the JSON object if you think about, it’ll have the order header, will have the order items, will have the payment, and everything all inside it. So when you go to the database and you pull the customer out, everything comes with it. So that’s one.
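A customer aggregate with its orders embedded might look roughly like the sketch below. The field names and values are invented for illustration, and a plain Python dict stands in for a document collection.

```python
# One document holds the customer and everything that belongs to it.
# All field names and values are hypothetical.
customer_aggregate = {
    "customer_id": "c-1",
    "name": "Example Customer",
    "orders": [
        {
            "order_id": "o-1",
            "items": [{"sku": "book-1", "quantity": 2, "price": 30.0}],
            "payment": {"method": "card", "amount": 60.0},
            "shipping": {"city": "Chicago"},
            "gift": False,
        }
    ],
}

# Stand-in for a document collection: one key maps to one whole aggregate.
customers = {customer_aggregate["customer_id"]: customer_aggregate}

def order_total(order):
    """Total price of an order, computed from its embedded line items."""
    return sum(item["quantity"] * item["price"] for item in order["items"])

loaded = customers["c-1"]  # a single read returns the customer and all orders
print(loaded["name"], [order_total(o) for o in loaded["orders"]])
```

Nothing has to be joined back together on read, which is the property the aggregate boundary is chosen for.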
And somebody may say, oh, what if a customer has thousands of orders? Then my objects will become too big, right? Like you’re transporting all of that over the wire, sending it all back. Okay. Now it’s time to think about how do I split this aggregate, right? It’s too big. So then you can say, oh, I want to have a customer aggregate and I want to have orders aggregate. So you have one object for customers that has all the customer information, probably their preferred shipping address, preferred packaging, whatever. And then you have one object for each order. And the order has a reference back to the customer. So when you go back to when you want to show all the orders about the customer, you go to the customer, pick up the customer, and then go to the orders list and say, gimme all the orders that have this customer ID. So now you have split the aggregate, but not too much, just a little bit so that it can manage this work.
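The split version, with one object per customer and one per order that refers back to the customer, can be sketched like this. The two dicts stand in for two collections, and all identifiers and field names are assumptions for the example. In a real document database the lookup by `customer_id` would normally be an indexed query rather than a scan.

```python
# Customer aggregate: customer details only (hypothetical fields).
customers = {
    "c-1": {
        "customer_id": "c-1",
        "name": "Example Customer",
        "preferred_shipping": {"city": "Chicago"},
    },
}

# Order aggregate: one object per order, each carrying a customer reference.
orders = {
    "o-1": {"order_id": "o-1", "customer_id": "c-1", "items": [{"sku": "book-1", "quantity": 1}]},
    "o-2": {"order_id": "o-2", "customer_id": "c-1", "items": [{"sku": "pen-7", "quantity": 3}]},
    "o-3": {"order_id": "o-3", "customer_id": "c-2", "items": [{"sku": "book-9", "quantity": 1}]},
}

def orders_for_customer(customer_id, order_store=orders):
    """Return every order that references the given customer."""
    return [o for o in order_store.values() if o["customer_id"] == customer_id]

customer = customers["c-1"]                            # first read: the customer
their_orders = orders_for_customer(customer["customer_id"])  # second read: the orders
print(customer["name"], [o["order_id"] for o in their_orders])
```

The aggregate is split once, along the line where it was growing without bound, instead of being broken down into one collection per relational table.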
So you have to think about these kinds of stuff, because when you go back to the standard relational way of thinking and you split too much, of course, the database is not performant. So I think the first thing generally I recommend when people pick a new database style is do a pet project. Like don’t do your actual project, do a pet project. Play with it. Try to understand, okay, how do I need to think about it? How do I need to worry about it? If you need to get training, get training. If you need to like talk to someone who’s experienced in that database, talk to them. Understand these design patterns that are different, right?
So again, a classic example would be like graph traversal, for example. Some people may say, oh, graph traversal, I can do this in databases, like relational databases. I just put pointers and that kinda stuff. And after two or three levels of pointer, you’ll figure out that relational databases don’t work for it, right? And then they go to graph databases. And then in graph databases, they start treating each node as if it is a table. It is not. It is a single instance of a row that is connected to something else, right? So we have to think about these things a little bit differently. And then you realize the gains of using that technology.
So I think that is what I would recommend people. Like figure out how this tool works, get some training, talk to someone who has experience, play with it a little bit, solve a sample problem or something, and understand how this thing works.
One other thing I’ve also done is if you know that in the space of this database type, I need to pick one technology. I was talking about the rubric before, right? So performance can also be a rubric on that. Like does it handle a billion nodes, for example? Or does it handle a billion documents? So put a billion documents. Nowadays, you can create fake data. Put it in the database and see how it works. Setting up that experiment may take a little bit of time, but it’ll save you time in the future. So create a framework or create like a harness, which puts in all this data in there, and then you run your query load. And this is where knowing your query pattern makes a lot of sense. Like if you don’t know how many writes and how many reads you have, you cannot set up a performance harness, right?
So you basically, once you’ve set up like a billion documents, for example, and then, you know, okay, I’m gonna do like 60% of my load is read, 40% is write. Okay, so what does that mean in one hour of production time, I’m gonna do this many writes, I’m gonna do this many reads, right? Okay. Set that up and run it and see what it does. And then you’ll figure out, okay, should I change my design pattern? Should I change my product? Should I change my database type? Should I change my technology? Whatever it is, it tells you where you need to focus more.
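A very small version of such a harness might look like the sketch below. It assumes a 60/40 read/write mix as in the example above; the hourly operation count, the in-memory dict standing in for the database, and the document shape are all placeholders. A real harness would issue the same mix against the candidate database with realistic fake data.

```python
import random
import time

def operations_per_hour(total_ops_per_hour, read_fraction):
    """Split an hourly operation budget into (reads, writes)."""
    reads = round(total_ops_per_hour * read_fraction)
    return reads, total_ops_per_hour - reads

def run_workload(store, total_ops, read_fraction, seed=0):
    """Run a randomized read/write mix against a dict-like store."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    keys = list(store)
    counts = {"read": 0, "write": 0}
    started = time.perf_counter()
    for _ in range(total_ops):
        if keys and rng.random() < read_fraction:
            _ = store[rng.choice(keys)]          # read an existing document
            counts["read"] += 1
        else:
            key = f"doc-{len(store)}"            # write a new fake document
            store[key] = {"payload": rng.random()}
            keys.append(key)
            counts["write"] += 1
    counts["seconds"] = time.perf_counter() - started
    return counts

# Hypothetical target: one million operations per production hour.
print(operations_per_hour(1_000_000, 0.6))

fake_db = {f"doc-{i}": {"payload": i} for i in range(1_000)}
print(run_workload(fake_db, total_ops=10_000, read_fraction=0.6))
```

The value is less in the absolute numbers than in being able to rerun the same load after changing the design, the product, or the database type.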
Henry Suryawirawan: I think your sharing about the aggregate boundary is gold, I feel. This will save a lot of developers time and effort as well. I think it’s actually a very good advice that we have to look maybe from the use case, the business use case, product requirements, what kind of aggregate boundary that we are dealing with, right? And then not to use the old paradigms, the RDBMS, this data modeling applied to different types of databases, which is probably where the majority of performance issues come from, right? So the impedance mismatch between the data model and how you query them. So thanks for sharing that.
[00:37:31] Managing Transactions
Henry Suryawirawan: The other thing that I want to ask is about, whenever we have all these data types now. People tend to think about storing special type of data into different databases or what we call also polyglot persistence, right? Especially now with the microservices. Maybe one service could have multiple databases, but it also comes with the hard challenge, right? It’s about the transaction. I think it’s always very important to think about transaction. And especially with polyglot persistence, multiple data types, right? How do you have this in a distributed manner.
So maybe from your view, how should we think about managing all these transactions? Maybe multiple databases if they’re involved, right? Is there any special thing that we have to think about?
Pramod Sadalage: Yeah. So I think that is always such a use case specific discussion that needs to be had with the business. Like one thing I have realized in the past probably 15 years is, this whole notion of transaction and consistency and ACID compliance and all that stuff, is many of times tech imposed. As in all the business wants is don’t lose my transaction. They don’t really care if it’s like ACID compliant or this or that, right?
So many of times, I always tend to go back to this discussion of how much money do you want to put? Like money is a proxy for time here, basically. How much money do we wanna invest to make this ACID compliant or make this 100% consistent or whatever. And the business should be driving that, right? They should be saying, oh, it’s okay if it’s 5 minutes late. Okay, then I have a different solution for that. Versus someone saying, oh, I need to make sure all stores are consistent all the time. And that costs a lot and is a totally different solution.
And we as technologists cannot make that decision or should not make that decision. It’s the business that should make that decision, because they’re the one who are deciding how is it that they want their data? How is it that they want their systems? And how much do they want to invest in that desire that they have? So we should expose this to the business. That’s the first thing I wanna say in there.
The second thing is, let’s say someone says, I need the highly consistent database or highly consistent data. Then okay, what are the patterns in which we can implement the consistency that comes into play? And then within a microservice, if you have, let’s say, more than one database, then how do I make sure the consistency of the writes that are going on, right? And in most distributed systems, even if you’re writing to the same database, you’ll have consistency issues.
Because let’s say you’re writing into Cassandra, which is a distributed database, and the data may reach one node, and you may say, oh, it’s done. But if that node fails, then you have lost the data. So even if it’s the same database, you still have to think about what consistency requirements apply. Every write could have different consistency requirements. Say I’m writing just some kind of audit log. That may have slightly different consistency requirements versus when I’m writing an order.
So every write, you can decide what is your consistency level. And many of these databases, especially if you’re writing to a single database, provide you different types of consistency guarantees, right? So I think they’re called, like quorum is one where it says, the majority of the nodes got it. And then you can say, oh, as long as one node got it, I’m fine. Or as long as more than two nodes have it, I’m fine. Or like in MongoDB, I think for example, you can just say, oh, as long as the node received it, I’m fine. I don’t even care if it was written to the disk or not.
So a bunch of these options are available and you can pick which option you want based on each different type of write, even within the same application. So I think making these kinds of decisions at probably a tech lead level or at an architecture level, you can say, oh, this type of write will always go with this type of consistency. That type of write will go with that type of consistency. And probably encode it and don’t make the developer think for every write. Just have some kind of a convention about it and you go.
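As a rough illustration of encoding such a convention once, here is a minimal Python sketch. It is not tied to any real database driver: the `Consistency` levels, the `WRITE_POLICY` table, and the write type names are all hypothetical, and the quorum rule simply assumes "majority of replicas".

```python
from enum import Enum


class Consistency(Enum):
    ONE = "one"        # a single replica acknowledged the write
    QUORUM = "quorum"  # a majority of replicas acknowledged the write
    ALL = "all"        # every replica acknowledged the write


def required_acks(level: Consistency, replicas: int) -> int:
    """Number of replica acknowledgements needed before a write counts as done."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if level is Consistency.ONE:
        return 1
    if level is Consistency.QUORUM:
        return replicas // 2 + 1  # strict majority
    return replicas


# Hypothetical team convention: decided once, looked up on every write.
WRITE_POLICY = {
    "audit_log": Consistency.ONE,
    "order": Consistency.QUORUM,
}


def consistency_for(write_type: str) -> Consistency:
    # Unknown write types fall back to the stricter level, not the cheaper one.
    return WRITE_POLICY.get(write_type, Consistency.QUORUM)
```

With three replicas, `required_acks(consistency_for("order"), 3)` asks for two acknowledgements, while an audit log write is satisfied by one. Developers call `consistency_for` instead of choosing a level on every write.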
The other part is when you’re writing into two different databases. Like one probably is like, let’s say you write to relational database and the other is going to a graph database or something. Then now you are in this, like if I wanna maintain consistency, like 100% consistency, what are the patterns that are available? And implementing those patterns, like you can use ZooKeeper or a transaction coordinator or something. Like a two-phase commit is a really hard problem. And you really need to think if I want it and what are the types of challenges that I am willing to take on to make that happen?
So many times what happens in a polyglot, like the relational and graph as an example, right? You’re writing to relational for operational purposes and you’re writing to graph for graph traversal, analytical, probably recommendation engine, that kinda purpose, right? And those things can be eventually consistent. And relational can be like 100% consistent. So you can even make a decision between that, like what is the use case in which I’m writing to multiple databases, right?
So someone writes, takes an order, you persist the order, and the data eventually within the next 5, 10 seconds goes to the graph database. And some other person is going on a webpage and you wanna show them a recommendation based on the graph. How much wrong is it for the recommendation to be not based on this last order that you wrote. That’s a trade off, right?
And you say, no, no, no, it has to be like 100%. All the orders that are committed here, the recommendation has to be based on that. Then you pick a transaction coordinator, you put all this, the architectural complexity increases because of that, right? And that’s a trade off you have to take on like, oh, I cannot even be five seconds behind, so I’m increasing my architectural complexity. That’s a trade off you make. And is it okay?
You should ask the business, is it okay? And then if they say yes, that’s how I want it, then yes. Then you increase the architectural complexity to meet that business need. Then introduce a transaction coordinator. When the write happens here, you go there, write that side. And then of course, introducing a transaction coordinator or ZooKeeper or something comes with its own trade offs. But then you made an informed choice about using it.
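The eventually consistent option from the order and recommendation example can be sketched as follows. Pramod does not name a specific technique, so this sketch swaps in a transactional outbox as one common way to do it. It uses Python’s built-in `sqlite3` as the relational store and a plain dict as a stand-in for the graph database; the table names and functions are hypothetical.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, product TEXT)")
    conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
    return conn


def place_order(conn: sqlite3.Connection, customer: str, product: str) -> int:
    # One local transaction: the order and its outbox event commit together or not at all.
    with conn:
        cur = conn.execute(
            "INSERT INTO orders (customer, product) VALUES (?, ?)", (customer, product)
        )
        order_id = cur.lastrowid
        event = {"order_id": order_id, "customer": customer, "product": product}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return order_id


def relay_outbox(conn: sqlite3.Connection, graph: dict) -> int:
    """Forward pending events to the graph stand-in; returns how many were relayed."""
    rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        event = json.loads(payload)
        # Adding to a set is idempotent, so a retried event does no harm.
        graph.setdefault(event["customer"], set()).add(event["product"])
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)
```

Between `place_order` and the next `relay_outbox` run, the graph lags behind the relational store. That window is the "5, 10 seconds" the business has to accept or reject.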
Henry Suryawirawan: Thank you for sharing these guidelines, right? I think it’s really key for people to hear, right? So always involve the business whenever you make this type of decision. Maybe sometimes the business actually doesn’t require real time, 100% consistent within milliseconds, right? Sometimes it could be delayed, they accept delay. And knowing this is actually key whenever you create your solution, right? Because then you can have different types of options.
[00:44:06] Tradeoff Between ACID & Eventual Consistency
Henry Suryawirawan: So I have one interesting thought, right? You said in the beginning that it’s tech imposed, and talked about how much money or effort you associate with choosing this, right? But actually many of these decisions, like you said, are tech imposed. It’s developers who make the decision. And actually from their argument, choosing an ACID compliant database is much, much faster and saves a lot of effort. Because yeah, when you deal with eventual consistency, that means you need to write a lot more code, you need to do a lot more testing.
So maybe a little bit of advice for developers here, since I’m sure we have plenty of developer listeners, on how we should think about the effort here. Because storing into an ACID compliant database is pretty well known these days. There are a lot of frameworks, and it’s very easy to do, versus having to do it in an eventually consistent manner and the effort that requires. So maybe some advice here for developers as well.
Pramod Sadalage: Yeah, certainly. I mean, again, I like to think about everything as a trade off, right? So you, when you say ACID compliant, it makes my life easy. Great. But you traded off scaling for that, right? So when scale comes in, then you start having this replicated databases in relational databases, setting up replication, and making sure replication works properly. All of that is extra stuff you took on because you traded off scaling for ACID compliance. So when that happens, unknowingly, you have traded something for something else. And I am saying make that visible for yourself. Like, okay, I picked ACID compliance. What did I give up? And once you start thinking about those terms, it makes a lot more sense, right?
Someone may say, oh, my database doesn’t have to go beyond 50 gigabytes. Oh, fine. Like that trade off is perfect for you, right? But if someone’s saying, oh, my business is huge across the world. I’ll have billion, millions and billions of transactions, rows, reads, writes, whatever; at that time, you cannot just blindly say, oh, I picked ACID compliance. Because that comes with some trade off choices. And making those visible to yourself. That’s why I was saying like pairing with someone else in this time makes sense, because you have your own biases and everyone has. Even I have my own biases. So I generally say, oh, let’s sit and work together, because they will expose my biases, which is good.
Henry Suryawirawan: Wow. I think that’s a very interesting insight, right? So yeah, sometimes we developers tend to have our own biases and we tend to think of the problem maybe from a limited perspective, right? So having other people go through the same thought process, maybe from different perspectives, maybe through a design review process or even pairing, whenever you come up with this decision, I think is always important.
And yeah, like you said, in the end, there is always a trade off. Like Neal also said that, right? So architecture is all about trade off. So knowing that when we pick something basically, we also maybe compensate it for other aspects, which we may not realize, but sometimes it happens later on, right? Which is normally the performance, scalability issues.
[00:46:58] Refactoring Database
Henry Suryawirawan: So another thing that is commonly discussed in this data world is, we all have microservices, right? But most of the time, startups start with a monolith. And hence we need to break up our databases, you know, doing more like evolutionary schema changes. You wrote this book Refactoring Databases a long time ago, and I think it’s very applicable whenever we split a monolith into microservices. What would be some of your advice here for people who are going to embark or are embarking on this journey to split their so-called giant monolithic databases into multiple databases? So is there any key advice that you wanna give here?
Pramod Sadalage: Yeah, definitely. I mean there are a lot of stepwise things you should probably do. I think, in the Software Architecture: The Hard Parts book, we put out like a five step process for this. The first step is to identify what are those boundaries between domains, right? Like, if you’re gonna take a monolith application and split it, let’s say into three, for example sake, then there are probably gonna be three domains or three services. And then you should decide within your tables which service owns which tables.
So let’s assume for simplicity sake there are 18 tables. So each service is gonna get six tables. Or maybe some service gets four, some service gets eight, whatever. What are those tables that belong to each one of those, right? So create a boundary kind of a situation. And then once you have that kind of defined, then there are edge cases here because one service may be writing to the table and the other service may be reading from the table. So you need to decide who is the owner of this table, so that you can clearly split it off. So first step, you can do that.
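One way to make this ownership exercise concrete is to list which services write to and read from each table. The sketch below is only illustrative: the table and service names are made up, and the rule "a table with exactly one writer is owned by that writer" is an assumption, not something stated in the episode. Tables with several writers are flagged for a human decision.

```python
from collections import defaultdict

# Hypothetical access map: table -> which services write to it and read from it.
ACCESS = {
    "orders": {"writers": {"order_service"}, "readers": {"order_service", "billing_service"}},
    "invoices": {"writers": {"billing_service"}, "readers": {"billing_service"}},
    "customers": {"writers": {"order_service", "billing_service"}, "readers": {"order_service"}},
}


def assign_owners(access: dict):
    """Return (tables owned per service, tables that need an explicit ownership decision)."""
    owned = defaultdict(list)
    contested = []
    for table, usage in sorted(access.items()):
        writers = usage["writers"]
        if len(writers) == 1:
            owned[next(iter(writers))].append(table)
        else:
            contested.append(table)
    return dict(owned), contested


def foreign_readers(access: dict, table: str, owner: str) -> set:
    # Services that read a table they do not own; they will need another way to get the data.
    return access[table]["readers"] - {owner}
```

Here `customers` comes back as contested, and `foreign_readers(ACCESS, "orders", "order_service")` shows that the billing service reads a table it will not own.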
And the second step is, in the same database, start moving these tables owned by service A into service A’s schema. Many database products, especially relational database products, give you the facility to move tables between schemas. Like you can ALTER table move schema, so you can just move it to a different schema, and it just gets allocated to a different schema. So you are logically segregating, not physical segregation yet.
And the monolith is still talking to all of them. Probably you’ll have to do some kind of a synonym or something so that the underlying movement is transparent to the application, right? So once you have done that, you have logically separated the tables, but the application is still talking to one database schema via synonyms or something like that.
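A small sketch of what that move could look like, assuming PostgreSQL. `ALTER TABLE ... SET SCHEMA` is PostgreSQL syntax. PostgreSQL has no synonyms, so a simple view under the old name is used as a stand-in here; other products offer `CREATE SYNONYM` instead. The helper and the schema names are hypothetical, and the target schema is assumed to exist already.

```python
def _check(name: str) -> str:
    # Reject anything that is not a plain identifier so the generated SQL stays safe.
    if not name.isidentifier():
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def move_table_statements(table: str, old_schema: str, new_schema: str) -> list[str]:
    """Build the statements that move one table and leave an alias behind."""
    table, old_schema, new_schema = _check(table), _check(old_schema), _check(new_schema)
    return [
        # Logical move only: the data stays in the same database.
        f"ALTER TABLE {old_schema}.{table} SET SCHEMA {new_schema};",
        # Stand-in for a synonym: the monolith keeps using the old name.
        f"CREATE VIEW {old_schema}.{table} AS SELECT * FROM {new_schema}.{table};",
    ]
```

Calling `move_table_statements("orders", "public", "service_a")` returns the two statements to run in one migration, so the monolith’s existing queries against `public.orders` keep working.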
So this is where now you start to work on the application side. Say, okay, now all of these tables, this part of the code is talking to that. So maybe we should create a separate connection pool for that, or maybe create a separate thing for that, that is talking directly to that one schema. And you do that on the application side and you deploy it and make sure everything’s working. And throughout this journey, our goal is always to iteratively keep working, right? It shouldn’t be that you do all these first five steps and deploy once. It’s every step, you deploy. Every step, you deploy. Because something goes wrong, rollback is easy. So second step, you do those stuff.
And in the third step, you now have three schemas in your sample application, three schemas in the same database that are physically in the same place, but logically separate, right? And the application is independently talking to them. Service A is talking to schema A, service B is talking to schema B, service C is talking to schema C. For all practical purposes, they’re independent. The only thing they still share is the machine, because they’re all running on one machine, right?
So now this is where you start bringing in replication. You install two other or maybe three other databases of the same type, and replicate schema A to machine A, replicate schema B to machine B, replicate schema C to machine C. And you just set up replication. Don’t do anything else. And then whatever is happening on here, like writes get replicated to other side. And one fine day, you can just switch connections to those other servers. So now service A is talking to machine A, service B is talking to machine B, and service C is talking to machine C. And you can get rid of this database, break the replication, get rid of this database. Now those become the main databases and now they’re all split into separate places, right. So it’s a five step process.
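The final "switch connections" moment can be thought of as a routing change per service. The sketch below is a simplified illustration, not a real replication tool: `ConnectionRouter`, the host names, and the "only cut over when replication lag is zero" rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionRouter:
    """Maps each service to the database host it should connect to."""
    routes: dict = field(default_factory=dict)

    def host_for(self, service: str) -> str:
        return self.routes[service]

    def cut_over(self, service: str, new_host: str, replication_lag_seconds: float) -> bool:
        # Only switch when the replica has caught up; otherwise keep the old route.
        if replication_lag_seconds > 0:
            return False
        self.routes[service] = new_host
        return True


router = ConnectionRouter({"service_a": "monolith-db", "service_b": "monolith-db"})
router.cut_over("service_a", "machine-a", replication_lag_seconds=0.0)
```

After this call, service A connects to `machine-a` while service B still uses the original database, which matches the idea of moving one service at a time and deploying after every step.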
There are some complications in this, of course, especially like I said, one service is writing to one table and other services is reading from that table. That service now needs to figure out where do I get the data from, because it’s in a different database. So you have to create probably some API changes to read. Don’t make the service do a cross database connection, because that defeats the purpose of microservices. So you have to expose an API to do that, and a bunch of that kinda dependencies will come in. But it at least gives you a starting path to think about this like how do I split? What are the paths and how do I get to the end goal here?
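To illustrate the "expose an API instead of a cross database connection" point, here is a minimal in-process sketch. The service names, the `OrderReader` interface, and the dict standing in for the order service’s private database are all hypothetical; in a real system the call would go over the network.

```python
from typing import Optional, Protocol


class OrderReader(Protocol):
    """The only view of orders that other services are allowed to depend on."""
    def get_order(self, order_id: int) -> Optional[dict]: ...


class OrderService:
    """Owns the orders table; the dict stands in for its private database."""

    def __init__(self) -> None:
        self._orders: dict = {}

    def create_order(self, order_id: int, total: float) -> None:
        self._orders[order_id] = {"id": order_id, "total": total}

    def get_order(self, order_id: int) -> Optional[dict]:
        order = self._orders.get(order_id)
        return dict(order) if order else None  # hand out a copy, never the stored row


class BillingService:
    """Reads orders through the API, never through the order service's database."""

    def __init__(self, orders: OrderReader) -> None:
        self._orders = orders

    def invoice_total(self, order_id: int) -> float:
        order = self._orders.get_order(order_id)
        if order is None:
            raise LookupError(f"order {order_id} not found")
        return order["total"]
```

Because `BillingService` depends only on `get_order`, the order service can move its tables to another machine without the billing side noticing.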
And many times, what we do is you don’t split everything in one go. You just take one piece of functionality, take it out, and then do whatever needs to be done. Just take that one thing out. So if you think about this monolith, 10% has gone out to a different place, 90% is still a monolith. And then you think of what’s the next thing I can work on?
Again, business value plays a big role here, because within the monolith, the reason to split generally is because of architectural components like scalability, auditability, deployability; a bunch of those kinds of things are being hindered because of a monolith. And that’s why you want to split. And if scaling is the only problem you’re trying to solve, you take one or two of these out, and the rest of the application is fine because it doesn’t have that much load. The two that you took out have a lot of load and you can scale them out and whatever. You probably don’t need to split the remaining 80%.
So you have to think about it in this way. What is it that I wanna split first? Split it out. And then do I still need to split more? Okay. Then what is it that I need to split next? Does the remaining need to be split? If the answer is yes, keep going down that path. If the answer is no, just stop. Stop right there because you don’t need to split anymore.
So I generally tend to think about it like, did I achieve what I wanted to achieve? Like scalability, deployability, auditability, whatever it is that you are trying to fix. Once that’s fixed, you can stop splitting further.
Henry Suryawirawan: I think in the application software world, it is also known as a strangler pattern, right?
Pramod Sadalage: Yes, exactly! I think that was written, maybe around 2008 or something.
Henry Suryawirawan: Yeah. And I think your advice here is, don’t do it in one go, because sometimes we want to do it in one go, because we feel it’s risky and we don’t wanna deal with this kind of work, because it tends to be quite long, right? Whenever we wanna evolve our database schema, right? So I think don’t do it in one go. Do it iteratively, right? Know what the target is as well, like you said. So maybe we don’t need to split everything into different databases. Some could still exist in the monolith as long as you have the proper boundaries, and you define the ownership really well. While for some, maybe different quality attributes require you to actually split them. And hence, yeah, you can do this migration.
And I think all your books here have the recipes, I would say. The Software Architecture: The Hard Parts, I think like you mentioned, there are five steps you can think of how you deal with this splitting of databases. And whenever we deal with eventual consistency and distributed transaction, it’s also covered there. And eventually also, if you deal with RDBMS and you need to evolve your schema, we can always go back to your classic book Refactoring Databases where you can have splitting column, renaming columns, and things like that. There are 70 patterns that you mentioned in the beginning. So I think for people here who are dealing with all these data journey, please do check out Pramod’s resources, books, right?
And also lately about data mesh, right, I also had an episode with Zhamak. I think it’s also becoming trendy, especially mixing the operational and analytical concerns and also how to get data out of it easily, right? So I think all these definitely are some of the trends in the data space.
[00:54:58] 3 Tech Lead Wisdom
Henry Suryawirawan: So, Pramod, thank you so much for this conversation. I really love and learn a lot about how to deal with data better. We are reaching the end of our conversation, but I have one last question that I normally ask for all my guests, which I call the three technical leadership wisdom. You can think of it like an advice for people here to learn and listen from you, right? What will be your three technical leadership wisdom?
Pramod Sadalage: The first one I would say is keep learning new things, whatever it is. Like in your own work area, or it could be outside of the work area or whatever it is. Keep learning new things.
The second is make sure you are mentoring someone, always. It could be other developers. It could be the data people. It could be someone outside. Make sure you’re always mentoring others. And I have found that mentoring others gives me new perspectives on things. Because I may be telling them things, but I’m learning a lot while I’m telling them. I find that really interesting. I do enjoy mentoring and that kind of stuff.
The third is, always, and I have said this probably three times already in this, whatever decisions you are making, involve the business, because it has cost implications. And as technology people, I don’t think we can decide on cost. It is the business’s decision to take, so give them the choices. Because sometimes, they don’t know the choices and they just assume. Like using big words like do you want ACID versus eventual consistency. They have no idea what you’re talking (about and what consequences the decision creates).
So give them the choices, show them like a rubric or show them competitive analysis. Show them if I choose this, these are the implications. If I choose that, those are the implications. And here’s the cost for A versus cost for B. Now pick one, right? Because that gives them lots of information on which they can make a decision. And it’ll make you a better technologist, because you are thinking as a neutral person, giving them both choices, right? And that forces you to do deep analysis and research when you’re putting those two choices. If one choice is lots of pros and the other choice is lots of cons, then they know that you basically want this and not that. And that makes you focus more on how do I analyze this? How do I make sure there are pros, proper pros on both sides, proper cons on both sides, so that people can make the right decision. And I think it makes you a better technologist when you make someone else take the decision and you are giving all the information.
Henry Suryawirawan: Wow! I really love that. A better technologist is not the one who makes the decision, but actually the one giving people the chance to make the decision, based on the well-informed analysis given by us as technologists. I think it’s very, very lovely.
And also what you mentioned, sometimes yeah, the business doesn’t know what options exist. All they hear is jargon here, jargon there, and yeah, they will just leave it to developers to decide. So I think as developers, we have to give them the information in maybe an easily understandable way. And yeah, decisions should be taken collectively, I would say, not just by the business, but also with the technologists.
Pramod Sadalage: What are the implications of that decision? I think that’s where we need to be. Making decision is probably you just toss a coin and make a decision. But what are the implications of that decision, downstream how it affects you, is the key part of that.
Henry Suryawirawan: Yeah, correct. So implication is always something that we regret whenever we reach that stage. So I think, yeah, the implications should be well thought out and discussed in the beginning.
So Pramod, if people would love to continue the conversation here, would like to ask you or reach out to you asking about data related stuff, is there a place where they can reach out and contact you online?
Pramod Sadalage: Yeah, I do have a Twitter handle. It’s Pramod Sadalage, @pramodsadalage. I also maintain a website, sadalage.com. So you can reach out to me from there. And it’s kind of very easy to spot me on Twitter and LinkedIn and my website. So any of those forums are fine. And I generally tend to reply, because I’m super interested in data and how it affects people.
Henry Suryawirawan: All right. Thank you so much for your time. I really love our conversation and, again, I hope people here are much better educated about dealing with data, and don’t think it’s a hard problem anymore. So thanks again, Pramod.
Pramod Sadalage: Great. Thank you, Henry. Thank you for this time and thank you for inviting me.
– End –