#203 - Building Effective and Thriving Machine Learning Teams - David Tan & Dave Colls

 

   

“The ability to build the thing wasn’t always the major success factor. Often it was making sure there was alignment between technical teams, business teams, and product teams on the right thing to build.”

Is your ML project stuck in POC hell and failing to deliver real value?

In this episode, I sit down with David Tan and Dave Colls, co-authors of the book “Effective Machine Learning Teams”, to discuss the complexities of building and deploying ML models, how to build and manage effective ML teams, and how to ensure successful ML projects.

Key topics discussed:

  • The key differences between ML engineering and traditional software engineering.
  • Why so many ML projects fail to reach production or deliver real business value, and how to overcome common challenges.
  • The ideal ML team composition and how to apply concepts like team topologies for effective collaboration.
  • The importance of product thinking in ML projects and best practices for designing robust, valuable ML products.
  • Essential engineering practices for successful ML projects, including automated testing, continuous delivery, and MLOps.

Listen out for:

  • (00:02:35) Career Turning Points
  • (00:04:54) Writing “Effective Machine Learning Teams”
  • (00:08:29) ML Engineering vs Other Types of Engineering
  • (00:12:24) ML and LLM
  • (00:14:59) Why Many ML Projects Fail
  • (00:19:53) ML Success Modes
  • (00:23:32) Ideal ML Engineering Team Composition
  • (00:31:39) Building the Right ML Product
  • (00:39:23) ML Engineering Best Practices
  • (00:49:14) MLOps
  • (00:52:44) Make Good Easy
  • (00:53:56) 3 Tech Lead Wisdom

Tune in to learn how to build and deploy ML products that truly make a difference!

_____

David Tan’s Bio
David Tan is a lead ML engineer with more than six years of experience in practicing Lean engineering in the field of data and AI across various sectors such as real estate, government services and retail. David is passionate about engineering effectiveness and knowledge sharing, and has also spoken at several conferences on how teams can adopt Lean and continuous delivery practices to effectively and responsibly deliver AI-powered products across diverse industries.


Dave Colls’ Bio
Dave loves building new things and helping others build too – products and services, people and teams – using technology, Data & AI.

His work building the Thoughtworks Data & AI practice shows how these elements combine: creating a new business, consulting to leaders and making a home for data people to grow. Currently, Dave is working at Nextdata, pushing the frontiers of how organisations work with data.

Dave thrives on navigating complex challenges and evolving environments. His extensive experience in large-scale digital transformations and deep tech R&D has honed his ability to align efforts and drive impactful results.

A dedicated mentor, Dave is passionate about guiding data professionals in their career growth. He actively shares his expertise as a speaker, blogger, and author of the book “Effective Machine Learning Teams.”



Our Sponsor - JetBrains
Enjoy an exceptional developer experience with JetBrains. Whatever programming language and technology you use, JetBrains IDEs provide the tools you need to go beyond simple code editing and excel as a developer.

Check out FREE coding software options and special offers on jetbrains.com/store/#discounts.
Make it happen. With code.
Our Sponsor - Manning
Manning Publications is a premier publisher of technical books on computer and software development topics for experienced developers and new learners alike. Manning prides itself on being independently owned and operated, and on paving the way for innovative initiatives, such as early access book content and protection-free PDF formats that are now industry standard.

Get a 45% discount for Tech Lead Journal listeners by using the code techlead24 for all products in all formats.
Our Sponsor - Tech Lead Journal Shop
Are you looking for new, cool swag?

Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference and delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive it.

 

Like this episode?
Follow @techleadjournal on LinkedIn, Twitter, Instagram.
Buy me a coffee or become a patron.

 

Quotes

Writing “Effective Machine Learning Teams”

  • This book started as a blog post. Back in 2019, we wrapped up a project where we were involved in a lot of refactoring of data science codebases. We had data scientists who wrote code in Jupyter notebooks, with no tests and a lot of lines of code. It was good; it solved a real business problem. And then we had to productionize that. So we started writing about how we did it. How do we encapsulate code into functions? How do we add automated tests? How do we do CI/CD for those changes? (A minimal sketch of that kind of extraction follows this list.)

  • We wrote a blog post, and it exploded on Twitter. There was this desire in the data science and ML community to have more engineering rigor and robust practices. Then we started writing more about our work to scale ML training pipelines and high-volume ML products, and about the practices that helped us build reliably, ship ML products that are safe, and get fast feedback on changes.

  • In my experience working with data-intensive ML initiatives or other technology specializations, I found that the ability to build the thing wasn’t always the major success factor. Often it was understanding what the right thing to build was, and making sure there was alignment between technical teams, business teams, and product teams on the right thing to build, which could be complex when people don’t speak the same language. So what sort of processes and techniques can you use to build that alignment, especially when there’s a lot of uncertainty about both the problem and the solution?

  • Also, because of the specialized nature of this work, it’s often hard to build a team that can execute end-to-end. So then the factors around either building in a single team, how do you bring all of those different perspectives together, or when you have to rely on multiple teams to get a piece of work out the door, what’s the best way to achieve fast flow and high quality in a way that’s sustainable for the people working in that environment?

  • Those were the perspectives I wanted to bring to machine learning and product development, which is an essentially multidisciplinary activity. How do we do that better as teams?
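
As an illustration of the kind of notebook refactoring described above, here is a minimal, hypothetical sketch of pulling logic out of a notebook cell into a plain function with a fast unit test (pytest-style). The column names and the rule are invented for the example; they are not from the project discussed in the episode.

```python
# features.py -- a hypothetical function extracted from a notebook cell
import pandas as pd

def add_price_per_sqm(listings: pd.DataFrame) -> pd.DataFrame:
    """Derive a price-per-square-metre feature, guarding against invalid rows."""
    result = listings.copy()
    valid_area = result["floor_area_sqm"] > 0
    result["price_per_sqm"] = (result["price"] / result["floor_area_sqm"]).where(valid_area)
    return result

# test_features.py -- a cheap, point-level test that can run in CI on every change
def test_add_price_per_sqm_handles_zero_area():
    listings = pd.DataFrame({"price": [500_000, 300_000], "floor_area_sqm": [100, 0]})
    result = add_price_per_sqm(listings)
    assert result.loc[0, "price_per_sqm"] == 5_000
    assert pd.isna(result.loc[1, "price_per_sqm"])
```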

ML Engineering vs Other Types of Engineering

  • I think ML systems are fundamentally orders of magnitude harder and more complex than software engineering in some ways. Of course, software engineering is hard: you have microservices, distributed computing, a lot of things; it’s an art and a whole discipline in itself. In ML, the difficulties manifest in some common ways. It’s not just one service or one component; usually there’s a data pipeline with upstream dependencies, which is complex in itself. You have scale and compute requirements for your ML workload, which can be met in many ways.

  • You can have a really large instance with really large memory, do everything there. Or at some point that breaks down, so how do you architect or make it simple to scale up large workloads?

  • There’s also the monitoring. When you think about software products, typically your monitoring is your four golden signals: success rate, latency, things like that. You can do that for ML, and it can be all green and good, but that says nothing about the quality or correctness of the predictions the models are producing. So it adds another layer of challenges that the MLOps community has come to solve in recent years. (See the monitoring sketch after this list.)

  • Even further down, yes, we have built this product. It’s giving accurate predictions, to an extent. What is the business impact or what levers or dials is it moving? That’s the fundamental difference between ML products and software engineering products.

  • One of the characteristics of building an ML feature is there’s a lot less certainty about how it’s going to perform upfront. Often you have to integrate this exploratory data analysis or building prototypes or architecture search as it might be described into the product delivery process as well. As you converge on a potential solution, then you’re also dealing with another vector of change, which is the data and the world changing as well.

  • Often that changes slowly, but sometimes we can see step changes. COVID was a classic example, with recommender systems moving from lowest price to highest, to best availability. And ML models needing to catch up with that.
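
Operational monitoring alone, as noted above, can be all green while model quality quietly degrades. Below is a minimal, hypothetical sketch of a job that joins predictions with (often delayed) ground-truth labels to compute quality metrics alongside the four golden signals. The field names, metrics, and reporting shape are assumptions for illustration, not a prescribed setup.

```python
# Hypothetical daily job: compute model-quality metrics from predictions joined
# with delayed ground-truth labels, to complement success-rate/latency dashboards.
import pandas as pd
from sklearn.metrics import roc_auc_score

def model_quality_report(preds: pd.DataFrame, labels: pd.DataFrame) -> dict:
    """preds: columns [request_id, score]; labels: columns [request_id, outcome]."""
    joined = preds.merge(labels, on="request_id", how="inner")
    return {
        "auc": roc_auc_score(joined["outcome"], joined["score"]),
        "positive_rate": joined["score"].gt(0.5).mean(),      # watch for prediction drift
        "label_coverage": len(joined) / max(len(preds), 1),   # share of predictions with outcomes so far
    }

# In practice these numbers would be emitted as metrics and alerted on
# (e.g. if AUC drops below an agreed floor), next to latency and error rate.
```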

Why Many ML Projects Fail

  • ML projects tend to fail in a set of common failure modes, and they are preventable types of failure. If we learn from experience and from history, we can bring in the right tactics. That’s the main thesis of the book: how have we failed, what have we learned, and how can we do better?

  • One common failure mode in ML projects is what I call POC hell. It’s a combination of factors; maybe the business doesn’t have enough appetite to release something to users. We might take the first easy step to build a prototype, test it, maybe run an internal demo of it. But the resources needed or the risk involved to productionize it and put it in the hands of users is too great.

  • That lack of commitment means that teams on the ground build POC after POC after POC. That’s detrimental to the business; you’re missing out on opportunities. It’s detrimental to the morale of the team, too. You do a few, and after a while you ask, what’s the point? So you move on, and then you have churn and onboarding, and you’re losing talent.

  • Another common failure mode is that it’s easy to get maybe the first thing out of the door. You hustle, you get the data, you get the compute, train the model, you deploy the thing. But then it’s stuck together with duct tape and sticky gum. It’s hard to test or change as you evolve it based on new customer feedback. That comes down to continuous delivery and CI/CD: how confident can we be to test and deploy a change? This set of problems is solved by MLOps and continuous delivery.

  • Is there any value in doing this at all? That can be really challenging when the responsibility is divided across different teams and might even be in different parts of the organization. While each group can do what’s within their control to move a little bit towards a future state, it’s not aligned or stitched up in a cohesive way that allows us to establish whether there is some value quickly with cheap and low-risk tests.

  • Once you have established there’s some value, if you have productionized something that’s not supported by a lot of these good practices around MLOps and continuous delivery, which is often the case to understand if there is value there, then you start to understand how much value you’re leaving on the table. This can be an opportunity to invest in those practices to be able to maximize the lifetime value of an ML product. But that requires a certain degree of maturity and a certain ability to work across existing organizational boundaries or reshape them.

ML Success Modes

  • There’s also the flip side of the failure modes: success. We have also worked with and seen ML teams successfully deliver really great ML products. What tactics did they use to succeed? We could see things like customer testing and user testing: before they build out a prototype or an MVP, which is a really expensive way to test something, even just talking to users, understanding what the pain points are and what their jobs to be done are. Where could ML help?

  • Combating the failure mode Dave mentioned, of multiple teams throwing work across to each other: instead of throwing code between data scientists and MLOps engineers, we do the inverse Conway maneuver, where we have a cross-functional team shipping end-to-end for the initiative.

  • When we talked to the team members who worked on that project, they enjoyed that end-to-end ownership, the satisfaction of putting something out and releasing it to customers. You get to touch different parts of the stack, learn about data science, and learn about Ops. You get fast feedback, you have the same standup, you ship things iteratively, and you showcase it every two or three sprints. That feeling of flow and speed was amazing.

  • When something is released, you have continuous delivery and production monitoring for when things are not going well. You have safety for changes. If I need to do a refactoring or upgrade a library, and the CI checks are all green and the model monitoring dashboard says performance is good, then you merge that PR and you don’t feel nervous or stressed about any release.

Ideal ML Engineering Team Composition

  • We look at multiple levels of team effectiveness. One of the levels we look at is an individual in a team, the types of practices you can use, the expertise you can build as an individual.

  • The next level we look at is within teams: how practices around technology, delivery, and product augment teams, and also the broader team dynamics. The third level we look at is between teams: how can teams be effective in their interactions with one another? The best makeup of a team depends a bit on the shape of the team and its interactions with other teams.

  • This is where we use the team topologies model to identify the different types of ML teams that you might sit in. You might have a stream-aligned team; this is the basic unit to think of for a first ML project: a stream-aligned team that can deliver end-to-end. You’re not going to be conducting groundbreaking ML research; you’re going to be using tried and true techniques. Or, at the other end, where you have an established ML ecosystem internally, you can add another stream-aligned team that draws on those existing internal services quite easily.

  • As you scale beyond one team, there are two routes you might take. You might go down the route where you have low-level concerns common across different initiatives; that might be the opportunity for a technology platform to support ML at a lower level, like compute, data, and feature engineering. That would be a platform team in the team topologies model. It would aim to provide those capabilities as a service, and it would be composed differently from a stream-aligned team.

  • The other route you might take to scale is to identify business clusters of ML needs. There might be something around audience engagement, asset valuation, or content moderation. Those are specific sets of business needs that also come with an associated set of ML paradigms, and that don’t necessarily have common support in a technology platform at the right business level of abstraction. So then you might have what’s called a complicated subsystem team.

  • As you grow a bigger ecosystem, there will be a range of problems that come up of a similar nature but with a unique presentation each time, which might be around, say, privacy, ethical use of data, or optimizing particular techniques. This is where you might have an enabling ML team as well, one that acts as a consultancy to other ML teams with the aim of making itself redundant.

  • If we treat everything as a stream-aligned team, like in software engineering, the failure mode there is Conway’s Law. We have three teams; we’re going to build three sets of architecture and rebuild the same tools we need. That’s where a platform team comes in to abstract all of that.

  • What we have seen work really well in one particular case was a complicated subsystem team. ML is hard, with a lot of moving parts, and they took on the effort to build the ML product. That was how that team scaled: rather than having to integrate with each consumer individually, they encapsulated that complicated subsystem behind a formal set of APIs, either batch or real-time, which helped the team achieve more and get more knowledge out of the ML product they built. (A minimal sketch of such a service API follows this list.)

  • A complicated subsystem team is probably going to have a bunch of specialists. To scale their impact in the organization, they do need that business domain expertise. They do need to deliver as a service instead of constantly collaborating with other teams. Maybe some product thinking helps with that service definition as well.

  • One of the opening quotes in our book is from W. Edwards Deming: a bad system will beat a good person every time.

  • The whole thesis of our book is how do we create those systems to help teams build the right thing? Solve the right problem, build the thing right. In a way that’s right for people. It’s not just shipping and shipping, but in a way that has the right team shape, right collaboration mode, right trust, psychological safety, and also the right processes to deliver, ship early and often.
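
To make the “team API” idea above concrete, here is a minimal, hypothetical sketch of a complicated-subsystem team exposing its ML capability as a real-time service that other teams can self-serve against. It uses FastAPI as an example framework; the valuation use case, endpoint, payload, and the placeholder heuristic standing in for a trained model are all illustrative assumptions.

```python
# A hypothetical real-time "team API": web, mobile, or email-marketing teams
# integrate against this contract instead of the model internals.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="valuation-service")

class ValuationRequest(BaseModel):
    make: str
    model: str
    year: int
    mileage_km: float

class ValuationResponse(BaseModel):
    estimated_value: float
    version: str

@app.post("/valuations", response_model=ValuationResponse)
def valuate(req: ValuationRequest) -> ValuationResponse:
    # A real service would call the trained model here; a placeholder heuristic stands in.
    estimate = 30_000 - 1_500 * (2024 - req.year) - 0.05 * req.mileage_km
    return ValuationResponse(estimated_value=max(estimate, 500.0), version="example-0.1")
```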

Building the Right ML Product

  • It is hard, and it goes beyond the technical, although the technical informs it. I find it useful to think about the essential characteristics of ML solutions. We’re exploring them because we think they’re going to be superhuman in some aspects. They’re probably not going to beat the best experts in a field. That’s maybe one myth that we should tackle straight up. But they’ll be able to do things at speed and scale that no human could.

  • We need to consider that they’ll make mistakes by their very nature. We wouldn’t consider an ML solution if we knew the right answer every time. We would write some rules instead. So there’ll be some percentage of mistakes, however small, in any ML solution, by design, as well as the unforeseen mistakes.

  • When it comes to those mistakes by design, we really need to understand the cost sensitivity: what value do we get out of a bunch of right answers, and what is the impact of maybe a very small number of wrong answers? Those wrong answers could have a huge impact across all sorts of dimensions: financial, as well as security, bias, and fair treatment of all our stakeholders.

  • We need to start with products that are designed to handle that failure. They have some upside from when ML gets it right, but they’re robust to the times when ML gets it wrong, because it will. Starting from that perspective, we can then design some experiments around how well this needs to work. We can start with very simple baselines. (A minimal sketch of a baseline-with-fallback design follows this list.)

  • Ideally, to resolve this uncertainty, we’re getting into some real data and understanding the predictive potential of that data. This is where it’s a challenging multidisciplinary exercise: we’re trying to proceed on multiple fronts. Are we building the right thing? Can our proposed solution support the performance that we expect? And so on.

  • That emphasizes the importance of the cross-functional nature of the work. If we frame this purely as a data science or ML problem, then we can try our level best to go from, let’s say, 55% accuracy to 99%. You’ll never get to a hundred, and you’ll spend many, many weeks and months trying to get there. So it’s not just an ML problem where we try to improve the model’s accuracy, precision, or recall; it’s also about how we can design for these failure modes of the ML model.

  • It’s about designing the product in a way that mitigates these failure modes of the ML model. If teams don’t have the right capabilities or skill sets, it’s easy to try to solve everything with the model: if you are a hammer, everything is a nail. It’s very costly to try to level up the accuracy, and we may never get there. Having a cross-functional approach means asking: how do we design it in a different way? How do we talk to users? Do users find this okay? That’s a more holistic way to solve the problem.

  • That hammer-and-nail point is really important as you start to shift to: how do we make this viable, both from the perspective of solving the problem effectively and of being economically sustainable to maintain? ML solutions are narrow; the very training process is to optimize a loss function, which is a narrow definition of success. But they’re composable.

  • This is where Gen AI can be really interesting. I have described Gen AI as a stone soup for innovation.

  • You might have a spark of an idea and be able to prototype it easily, but it might actually be made up of many different components composed together to produce something that looks like it behaves intelligently. That needs to be factored into the process as well.

  • One of those big components will be the differentiated data that you bring. The step from prototyping something in an experimental environment to actually plumbing those data pipelines in a way that’s sustainable for production use can be a major one, and it needs to be considered upfront.

  • When we explored AI and ML initiatives, we put a heavy weighting on that factor of where will the data come from and how will you deliver it to the product or application. It’s one of those laws where it always takes longer than you expect, even when you plan for it to take longer than you expected.
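
As a concrete illustration of designing a product that is robust to model mistakes, here is a minimal sketch of a baseline-with-fallback pattern: use the ML estimate only when it is confident enough, otherwise fall back to a simple, explainable baseline. The names, threshold, and the idea of tagging the source are assumptions for the example, not the authors’ specific design.

```python
# Minimal sketch: degrade gracefully to a simple baseline when the model is unsure.
from dataclasses import dataclass

@dataclass
class Suggestion:
    value: float
    source: str  # "model" or "baseline", surfaced for measurement and debugging

def suggest_price(model_estimate: float, model_confidence: float,
                  recent_median_price: float, confidence_threshold: float = 0.8) -> Suggestion:
    if model_confidence >= confidence_threshold:
        return Suggestion(value=model_estimate, source="model")
    # The baseline is also a useful first release to measure the model against later.
    return Suggestion(value=recent_median_price, source="baseline")
```

Tagging each suggestion with its source makes it cheap to measure how often the model is actually used and to compare outcomes between model-backed and baseline-backed suggestions.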

ML Engineering Best Practices

  • As a healthy team, if we do those right practices, with automated testing, with continuous delivery, automated deployment, then nobody should need to work late nights.

  • Number one, test automation is a big part. Every ML engineer and data scientist we have worked with has enjoyed the automated tests that we introduced. At this point, it’s a problem of information asymmetry: we have pockets of teams and people who know how to do automated testing for ML systems. On every team I have joined and worked with, the reaction is, oh, I didn’t know this; oh, you could do that. I think there’s a desire and demand for more automated testing of ML systems.

  • You touched a little bit on software design or code design as well. Any code base, not just ML, is susceptible to that. It’s about taking the next level of discipline to ask: can I extract this function? Can I give this a readable variable name, not just df or x? All of those software hygiene practices, like the single responsibility principle.

  • My favorite one is the open-closed principle: can you design something that is open to extension, so you don’t have to modify it every time? The design, or lack of design, sometimes stems from the lack of tests. If there are no tests, nobody can refactor; refactoring is so scary and risky that nobody does it. We take the path of least resistance.

  • With the teams we have worked with, the moment we added that safety harness, when you have tests on the path to production, then a lot of things can happen.

  • A lot of these practices have been emerging. It’s just about spreading them more, and that’s why we wrote the book: so that teams don’t do the late nights writing code that I did on one project. Just enjoy the flow, keep a work-life balance, and have zero-stress production deployments because tests are passing and you have production monitoring. A lot of these engineering practices really help the team feel the joy and flow of building ML products.

  • The call-out around testing is really crucial, and so is allocating your effort effectively. In a regular software product, we might use the test pyramid to direct effort, so that we have a large number of cheap, low-level tests, a medium number of integration-level tests, and a small number of end-to-end tests.

  • When we’re looking at ML applications and other data-intensive applications, we can add a second dimension to that, so it’s a grid instead of a pyramid. That dimension is the data dimension. We might have a lot of small cheap tests around individual data points. We might also then have samples, tests at a sample level. These are going to give us a little bit more insight than a point data test. They’re going to have some variability, so there’s going to be some tuning there, but they offer us much faster feedback than the final level, which might be a kind of global, let’s test on all the data that we have.

  • Often people don’t think hard enough about balancing tests across that spectrum of data. You might be able to do a very quick training run on a very small subset of data. If anything’s misconfigured in the training run and the training doesn’t work properly, you’ll get that feedback really quickly rather than waiting hours for it. The training might pass, and you might have a lower benchmark or threshold for acceptable performance on that small run, but at least you have tested end-to-end with a sample data set and got some feedback that it’s likely to work at the large scale. It’s all about bringing that feedback forward. (See the smoke-test sketch after this list.)

  • Also, being thoughtful about how you test under those conditions of uncertainty when you don’t even know what it is you’re looking for in exploratory data analysis. It’s moving from the unknown-unknowns to known-unknowns to known-knowns through testing.

  • Visualization is really key to being able to look at the data and understand what it’s telling you or use automated tools to find relationships in the data. That can actually be a form of testing. I have called it visualization-driven development at times.

  • Once you understand qualitatively what you’re looking for, there’s a whole range of data science techniques you can use to turn that into a binary expectation that can pass or fail (see the expectation-check sketch after this list). Really thinking hard about how you use testing to get fast feedback all the way through the life cycle is pretty crucial.

  • LLMs have been all the rage for the past two years. On one of the teams we worked with, we had to innovate and think about how to shift left. Full evaluation can be costly: it takes time and it can cost money, especially for LLMs.

  • What we ended up doing was to shift that left again. Before we kick off the big eval or any further deployments, we had an integration test. One of the challenges the team raised was: how can you test something that’s non-deterministic, where the answer is different every time?

  • We had to write an assertion function that asserts on intent rather than vocabulary, so we can still evaluate that, given these conditions, the response matches the intent we expected. (A hedged sketch of such an assertion follows this list.)
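
To illustrate the small-sample, end-to-end smoke test described above, here is a hedged sketch using scikit-learn as a stand-in for a real training pipeline. It runs in seconds, fails fast on misconfiguration, and uses a deliberately relaxed threshold compared to a full-data acceptance bar; the data, model, and threshold are all illustrative.

```python
# Smoke test: train end-to-end on a tiny sample so misconfiguration surfaces in
# seconds, not after hours of training on the full dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_model(X, y):
    """Stand-in for the project's real training entry point."""
    model = LogisticRegression(max_iter=200)
    model.fit(X, y)
    return model

def test_training_smoke_run_on_small_sample():
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1_000, 5))                      # tiny sample instead of the full dataset
    y = (X[:, 0] + 0.1 * rng.normal(size=1_000) > 0).astype(int)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
    model = train_model(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    assert auc > 0.6  # relaxed bar: did the pipeline run end-to-end and learn anything at all?
```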
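
As an example of turning a qualitative observation from exploratory analysis into a binary expectation, here is a minimal sketch of a data expectation check; the column name and bounds are illustrative and would come from whatever visualization of the real data showed.

```python
# Turn "prices looked positive and roughly in this range" into a pass/fail check.
import pandas as pd

def check_price_expectations(df: pd.DataFrame) -> None:
    assert df["price"].notna().all(), "unexpected missing prices"
    assert (df["price"] > 0).all(), "non-positive prices found"
    # A sample-level, distribution-shaped expectation rather than a point check.
    assert df["price"].quantile(0.99) < 5_000_000, "price tail far outside the expected range"

check_price_expectations(pd.DataFrame({"price": [350_000.0, 720_000.0, 1_100_000.0]}))
```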
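
Finally, here is a hedged sketch of what an intent-level assertion for a non-deterministic LLM response could look like. The episode does not say how the team implemented theirs; this version uses embedding similarity via the sentence-transformers library, and the model choice and threshold are illustrative and would need tuning for a real test suite.

```python
# Assert on intent rather than exact wording: the response text can vary between
# runs, but it should stay semantically close to the expected behaviour.
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def assert_intent(response: str, expected_intent: str, threshold: float = 0.5) -> None:
    similarity = util.cos_sim(_encoder.encode(response), _encoder.encode(expected_intent)).item()
    assert similarity >= threshold, (
        f"response does not match expected intent (similarity={similarity:.2f}): {response!r}"
    )

# Usage in an integration test:
assert_intent(
    "I'm sorry, I can't share details about other customers' accounts.",
    "politely refuses to disclose another customer's information",
)
```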

MLOps

  • It’s yet another tool in our toolkit, which I very much welcome. Back in the day, we had to wrangle with and think about how to solve large-scale distributed processing. Now there are MLOps tools that abstract away that concern.

  • Compute is one part of the MLOps stack. There’s also experiment tracking, which has really helped us: every pull request runs an experiment that reports results we can check over time. (See the experiment-tracking sketch after this list.)

  • The challenge here is that there are too many tools, and it’s hard to navigate. At the end of the day, MLOps is about abstracting away complexity so that you can focus on solving the right problem, not dealing with undifferentiated labor in your day-to-day work.

  • Coming back to the testing perspective as well, one of the things we highlight in the book is that, to get the most out of MLOps automation and abstraction, you also need to ensure you’re doing the right testing to give you confidence that when you’re moving fast, you’re doing so safely.

  • You can’t MLOps your problems away, just like you can’t DevOps your problems away. How do you get faster feedback loops? How do you manage cognitive load? How do you get into a flow state? MLOps helps in a few ways, but MLOps is not going to write your tests for you. It’s not going to talk to users and make sure you’re implementing the right features. It’s not going to make sure your code is nicely factored and readable so that you can stay in the flow.

  • It’s another tool, but it needs to be coupled with these other disciplines that Dave and I mentioned in this podcast and in the book.
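
As an example of the experiment tracking mentioned above, here is a minimal sketch using MLflow, assuming a CI job runs it on every pull request so results can be compared over time. The experiment name, run name, and metric values are placeholders for whatever a real training and evaluation step would produce.

```python
# Log parameters and evaluation metrics for each PR's training run so they can
# be compared across runs in the MLflow UI.
import mlflow

mlflow.set_experiment("pricing-model-pr-checks")   # hypothetical experiment name

with mlflow.start_run(run_name="pr-1234"):         # e.g. tagged with the PR number by CI
    mlflow.log_param("model_type", "gradient_boosting")
    mlflow.log_param("training_rows", 250_000)
    mlflow.log_metric("validation_auc", 0.87)
    mlflow.log_metric("validation_log_loss", 0.31)
```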

Make Good Easy

  • One key takeaway is: how do we make good easy? As engineering leaders and ML practitioners, we have talked about a lot of different practices. If we can make good easy, then by following team practices and exemplar repos, teams get that for free: their CI/CD setup, their test strategy, even hygiene checks like talking to users, or putting a business case together before you start asking people to work on something for six months.

  • It can get out of hand easily with so many moving parts. As engineering leaders, how do we make good easy, so that when teams on the ground get the mission to do a certain piece of work, good practice is already built into the way of working?

3 Tech Lead Wisdom

  1. In line with some of the philosophy of the book, being able to take different perspectives on technical problems is really key for your leadership growth. Looking for opportunities to play different roles in projects, even for a short time, gives you that understanding of what other stakeholders require and how to make them successful as you aim to be successful yourself.

  2. Focus, function, and fire. This was an idea I got from Todd Henry in his book, Herding Tigers.

    • We are building things every day, and teams can get distracted by many things. To set our team up for success, how can we give them focus? That clear mission, the milestones, the why: what we’re doing for the customers and the benefit for the business. The business case: how will this benefit the metrics that we care about?

    • Function, the way of working. Instead of throwing work or communications over to teams, have we got the right way of working?

    • Fire, meaning the implicit motivation: why are we doing this? How is this helping people? How is this helping the business?

  3. Trust and psychological safety.

    • When trust is absent from the room, a simple conversation becomes a process of bureaucracy. When psychological safety is not there, team members are afraid to voice concerns about how things might fail, and we continue heading in the wrong direction.

    • As tech leaders, how do we make sure we embody and encourage and ensure that we have that psychological safety in the team so that we can all do our best work?

    • That idea of trust is also really important in a multidisciplinary team and innovative initiatives under conditions of uncertainty. We want anyone to be able to speak out with good ideas or concerns that things might be broken. To be able to create those serendipitous moments, as well as the well-understood moments that trust facilitates, is a pretty key focus for leaders.

Transcript

[00:01:45] Introduction

Henry Suryawirawan: Hello, everyone. Welcome back to another new episode of the Tech Lead Journal podcast. Today, I have with me two Davids, very excited. One is David Tan. He is actually my ex-colleague in ThoughtWorks. And the other one is David Colls. So they are the co-authors, together with another one, Ada, who couldn’t join us today. They wrote a book titled Effective Machine Learning Teams. Even though it’s a machine learning as part of the book, but I think we are not going to cover solely just machine learning, because I’m also not an ML expert. But we’ll discuss things like how we can build effective machine learning teams, what kind of practices that we should be aware of, and things like that.

So welcome to the show. I’ll maybe mention Dave as David Colls, right? And David for David Tan. So welcome to the show, guys.

David Tan: Yeah, thanks for having us Henry, happy to be here.

Dave Colls: Thanks, Henry. Nice to be here.

[00:02:35] Career Turning Points

Henry Suryawirawan: Right. I always love to ask my guests maybe first to share a little bit of themselves by sharing your career turning points that you think we all can learn from that. So any one of you can start first.

David Tan: Cool. Um, so I’m currently engineering manager in the AI products group in Xero and my career into tech hasn’t always been straight. Like I was actually non-technical when I graduated from university. So I was working in government for maybe two, three years. And I got, by accident, hand, foot and mouth disease and I had to like be quarantined for eight days. And from there I started like, you know, watching some YouTube videos about data analytics, about coding, and then tried some tutorials and, you know, it was really cool. So then decided to, you know, join a programming bootcamp, quit my job, did that, and then joined ThoughtWorks. And yeah, learned a lot about Agile, about test driven development, refactoring, CI/CD, got into ML engineering and so, you know, finally got into this exciting space of ML engineering, DevOps, building ML products. So yeah, I have to thank this 8 day quarantine back in the day for my career turning point.

Henry Suryawirawan: Yeah, maybe. How about you, Dave?

Dave Colls: If I’m to pick on an external theme like that, I’d probably choose deciding to play Ultimate Frisbee early in my career, where I met a number of people involved in the software scene. But, in particular, a ThoughtWorker who encouraged me to interview at ThoughtWorks. And when I think about turning points, I guess that kind of leads me into the fact that I started as a developer in technical, data intensive roles. I then, you know, looked for ways to make more impact in the leadership space. And when I joined ThoughtWorks, although I’d actually passed a coding interview, I joined as a project manager. Then I’ve worked for a number of years in Agile transformations and organizational transformation before then picking up the data practice. So I guess, my career has turned from deeply technical to leadership positions and organizational design a number of times over that period.

Henry Suryawirawan: Thank you so much for sharing your story. Very interesting, right? So the kind of serendipity that could happen throughout our career, right? And what led us to where we are at the moment.

[00:04:54] Writing “Effective Machine Learning Teams”

Henry Suryawirawan: So today we are going to discuss topics from your book, Effective Machine Learning Teams. Maybe in the beginning, let’s share what made you write this book.

David Tan: Yeah. This book started like as a blog post, actually. So back in 2019, we had wrapped up a project. We’re involved a lot of refactoring of data science codebases. So we have data scientists wrote code in the Jupyter Notebook, no test, a lot of lines. You know, it was good, like it was solving a real business problem. And then we had to kind of productionize that. And so we started writing about how we did that. How do we encapsulate code into functions? How do we have automated tests? So we are, you know, clear about that, about the quality. How do we do CI/CD for those changes?

And then we wrote a blog post and it kind of like exploded in, back then Twitter, someone from the R community shared it. And a lot of likes and it’s like, okay, I think we’re onto something here. There’s this desire, I think, in the data science community or ML community to have more kind of engineering and robust practices. So yeah, you know, then we started writing more and more about… from the subsequent projects or work we did to kind of large scale ML training pipelines, large scale high volume ML products. And, you know, the practices that helped us build that reliably, build ML products that are safe, help you get fast feedback on your changes.

And some of those things worked really well. So we wanted to share it more broadly with our readers. Uh, and that’s just the engineering space. And I think Dave also has some other kind of, yeah, aspects that, you know, we started with this thread, calling engineering practices for ML. And then we discovered this whole kind of world of other practices that also help ML teams.

Dave Colls: Yeah. So, you know, in my experience of working with, I guess, data intensive ML initiatives or other technology specializations, I’d found that the ability to build the thing wasn’t always the major success factor. Often it was understanding what the right thing to build was, making sure that there was alignment between technical teams and business teams and product teams on the right thing to build, which could be complex when people don’t speak the same language. So, you know, what sort of processes and techniques can you use to build that alignment, especially when there’s a lot of uncertainty both about the problem and the solution.

And then also, because of the specialised nature of this work, it’s often hard to build a team that can execute end to end. So then the factors around either building in a single team, how do you bring all of those different perspectives together, or when you have to rely on multiple teams to get a piece of work out the door, what’s the best way to achieve fast flow and high quality in a way that’s sustainable for the people working in that environment.

So for me, those were the perspectives I wanted to bring to machine learning and product development, which is an essentially multidisciplinary activity. You know, how do we do that better as teams?

Henry Suryawirawan: Right. When I read your book, I must admit that I was thinking that I was going to learn a lot of the technical things about specific ML, you know, projects and ML products and things like that. But actually, you cover kind of like the holistic approach. not just the engineering, also like product, delivery, and things like that. And actually there are so many things that any software engineering team, I believe, could learn just by reading the book as well. I mean, putting aside some parts of the ML.

[00:08:29] ML Engineering vs Other Types of Engineering

Henry Suryawirawan: But let’s start maybe in the first place, because I’m sure many listeners here not coming from the ML background. What are the differences between, you know, ML engineering and the typical software engineering these days where people build websites, APIs, and things like that?

David Tan: Yeah, that’s a great question. And I think ML systems fundamentally is orders of magnitude, I think, harder and more complex than software engineering in some ways. Like, of course, software engineering is hard, you know, you’ve got microservices, you’ve got distributed computing, you’ve got a lot of things, like in itself it’s an art and whole discipline. ML itself, the difficulties manifest themselves in some common ways, like, you know, it’s not just one service or one component. Usually, it’s like a data pipeline, dependencies upstream, that, you know, it’s complex in itself. You’ve got your scale or compute requirements for your own ML workload, which is also, can be done in many ways.

You know, you can have a really large instance with really large memory, do everything there. Or, you know, at some point that breaks down, so how do you architect or make it simple to scale up large workloads? And then there’s also the monitoring, like when you think about software products, typically your monitoring is kind of your four golden signals, right? Success rate, latency, things like that. You can do that for ML, and you can be all green and good, but that says nothing about like the quality of the predictions that models are producing. And then the correctness of those. So it adds another layer of challenges that I think the MLOps community have come to solve in recent years. And then further, even further down, like, yes, we’ve built this product, it’s giving accurate predictions, to an extent. What is the kind of business impact or what levers or dials is it moving? So that’s like kind of essentially the fundamental difference between the ML product and software engineering products.

There’s a small… There’s this diagram from Google where like ML is just like a little box. And there’s like infra, there’s monitoring, there’s data quality, data skew. So a lot of moving parts and that’s why we wanted to bring a lot of practices that kind of help you tame the complexity in this different sub-components.

Dave Colls: And, yeah, when you look at how to manage the work as well, then, I guess, one of the characteristics of building an ML feature is there’s a lot less certainty about how it’s going to perform up front. And so often you have to integrate this exploratory data analysis or building prototypes or architecture search as it might be described into the product delivery process as well. And then, as you converge on a potential solution, then you’re also dealing with another vector of change, which is the data and the world changing as well as David said around monitoring. Often that changes slowly, but sometimes we can see step changes. COVID was a classic example, with recommender systems moving from lowest price to highest, to best availability. And you know, ML models needing to catch up with that. So yeah, you both have less certainty and you have another vector of change to deal with when it comes to managing that work.

Henry Suryawirawan: Yeah, so thanks for sharing some of these complexities, right? So definitely for people who may not be able to relate. I think typically from what I can see, right, there are so many data pipelines when you build ML. You probably need more than just a small data, right? Something like a medium data or big data and the kind of compute that you need. Sometimes it’s more like a distributed computing with large scale kind of pipelines. And I think typically what I can see as well, the feedback loop might be long in ML projects, simply because you have this training kind of thing.

[00:12:24] ML and LLM

Henry Suryawirawan: In the first place, maybe let’s clarify when you say ML, I think we are not just associating it with LLM, right? These days people are crazy about LLM and maybe they associate ML now as LLM. So maybe a little bit of insight here, like what do you mean by ML projects?

Dave Colls: Yeah, I guess, so yeah, great observation, Henry, and I was thinking, you know, as you talked about feedback cycles and big data and all of those elements. Yes, those are the things we traditionally associate with supervised machine learning, which I guess is our sort of primary focus in the book. Which might be anything from a binary classification problem, like is this transaction a fraudulent transaction? And you know, to solve that with a supervised paradigm, we get a lot of historical data about transactions, their amount, the age of the account. You know, maybe some information about the network around that account and then we label those transactions, whether they were fraudulent or not. That’s actually a nice example of where the labels might be easy to get because we might get them from customer reports of fraud. But in general, that effort of labeling the data set can be really intensive as well.

So that’s the main paradigm we’re looking at, but we think that, you know, looking through this lens of working with uncertainty, working with data, working with specializations. But there’s a whole lot of other paradigms of ML or AI that might fit into that. So obviously working with a large language model in inference mode, that’s still a challenging environment. The supervised paradigm might apply to fine tuning or even training from scratch your own large learning model as well. But then beyond that sort of paradigm, we’re also looking at initiatives that might be, might use reinforcement learning, for instance. It’s another approach to recommender systems. And this is where you might not need big data to start with. You might not have historical data to start with, but you might actually build or tune your data set as you go.

We might also look further afield to techniques like simulation or operations research for optimisation, they share a lot of the same characteristics. Simulation is actually very similar to machine learning when it comes to productionising it. It’s just that instead of a learned model of the world, you’re working with an explicit model of the world. But you still need to go through feature engineering, ability to monitor and assess business impact and align to, um, stakeholder expectations too. So, while, yeah, while we’re focused on supervised machine learning as a paradigm, yes, there’s a big world out there. Anything that’s decision making driven off data or models of the world or requires specialization might fit into this view that we’ve put forward in the book as well.

[00:14:59] Why Many ML Projects Fail

Henry Suryawirawan: Thanks for the clarification. So maybe let’s start going deeper into the book, right? I think one thing that piqued my interest when I read the book in the first few chapters of the book, right, I think you cover some statistics here. The first one is actually many ML projects doesn’t make it to production, right? So and the second one, even though they reach production, they actually don’t solve the real business problem or they don’t bring any value, right? So maybe let’s share a little bit why this is the case. What do you see out there as ML practitioners, right? So why is it so hard to actually bring ML projects to production?

David Tan: Yeah. Thanks, Henry. Like that’s the crux of the book, in that ML projects tend to fail in this common like failure modes that we describe and I can go through them. And it’s kind of preventable types of failure, like if we can learn from experience from history and that we would bring in the right tactics to make sure, like, as you mentioned, let’s say a project that, you know, let’s say we invest half a year, nine months into it and ship it. And we found that, oh, actually it’s not solving the right problem. Or like users didn’t really care about it. It’s like, then what tactics can we bring in there with prototype testing or customer research? So that’s kind of the whole kind of the main thesis of the book is like, how have we failed and what have we learned and how can we do better?

So one common failure mode in ML projects is what I call POC hell. It’s like a combination of factors, maybe business don’t have enough of the appetite to release something to users. So we might, you know, take the first easy step to build a prototype, to test it, maybe internal demo of it. But the resources or risk to productionize that and put it in the hands of users is kind of too great. So with the lack of commitment there means that teams on the ground just kind of build POCs after POCs after POCs. So that’s detrimental to, you know, the business in some ways, like you’re missing out on opportunities. So detrimental to the morale of the team. You do a few and then, you know, after a while you say, what’s the point? So, you move on and then you’ve got churn and you’ve got onboarding and you’re losing talent, right?

Another common, I think, failure mode is that it’s easy to get maybe the first thing out of the door. You hustle, you get the data, you get the compute, train the model, you deploy the thing. But then it’s kind of stuck together with duct tape and sticky gum. And it’s hard to test or change after you evolve based on, you know, new customer feedback, things like-that. So then that comes to kind of continuous delivery, CI/CD, how confident can we be to test and deploy a change. And so this set of problems is solved by MLOps, by continuous delivery, from machine learning. There’s probably a couple more, uh, I don’t want to hog the mic. Dave, was there any other failure modes that came to mind or these challenges?

Dave Colls: Yeah, those are two big ones. It’s, you know, is there any value in doing this at all? And that can be really challenging when the responsibility is divided across different teams and might even be in different parts of the organization. There might be a, I guess, the release of ChatGPT sparked a lot of curiosity in how do we best use LLMs in products. So, but we might have responsibility diffused across the organisation. We might have an ML team that’s interested in looking at the use of large language models. We might have a product team that has their own roadmap, where they don’t have these features anywhere on that roadmap. And then we might rely on data, especially in the case of Gen AI, that’s unstructured data that’s not well governed, that’s hard to make available in a responsible way to these systems. And so while each group can do what’s within their control to move a little bit towards a future state, it’s not aligned or stitched up, you know, in a cohesive way that allows us to establish whether there is some value quickly with cheap and low risk tests.

And then, as David said, once you’ve established there’s some value, you know, if you have productionized something that’s not supported by a lot of these good practices around MLOps and continuous delivery, which is often the case to understand if there is value there, then you start to understand how much value you’re leaving on the table. And then this can be an opportunity to invest in those practices to be able to maximise the lifetime value of an ML product. But then that requires a certain, again, a certain degree of maturity and a certain ability to work across existing organisational boundaries or reshape them.

[00:19:53] ML Success Modes

David Tan: Yeah. And there’s also kind of the flip side of the failure mode is the success, right? So we’ve also worked with and seen teams, ML teams, successfully deliver, you know, really great ML products. And what tactics did they use to succeed? Like we could see things like customer testing, user testing, so before they build out a prototype or a MVP, which is a really expensive way to test something, you know, even just talking to users, understanding what is the pain point, what are their jobs to be done. Where could ML help? And then, you know, once they’re bought in, they say, okay, this is the right ML product that will solve those problems. Then combating that failure mode of what Dave mentioned, like multiple teams kind of throwing across each other. If we remember the DevOps comic, where you have devs on one side, you have ops. So instead of throwing code between scientists and, you know, MLOps engineers, you know, we do the inverse Conway maneuver where we have a cross functional team shipping end to end for this initiative.

And when we talked to the team members who worked on that project, they enjoyed that end to end ownership, the satisfaction of putting something, releasing to-customers. You get to touch different parts of the stack, learn about data science, you learn about Ops. You know, so they help them, you know, you can get fast feedback, you have the same standup, you ship things iteratively, you showcase it every, you know, two sprints, three sprints. So that feeling of flow and speed was like really amazing.

And then, you know, when something is released, you know, you’ve got your continuous delivery, production monitoring, when things are not going well. Yeah, and you’ll have safety for changes. If I need to make a refactoring, upgrade the library. If your CI checks are all green, your model monitoring dashboard is saying performance is good, then, you know, you merge that PR and you don’t feel nervous or stressful about any releases. So yeah, there are also bright spots in how teams deliver ML products.

Henry Suryawirawan: Right. So I think when I heard you mentioned, you know, in the beginning about POC hell, you know, so many duct tapes when you bring production. I could relate to some other ML projects that I knew of before. So I think many, many, ML teams probably are in this mode, right? They keep building POC because maybe the expectation from stakeholders or from the business also kind of like with a lot of hype, right? Because these days people think AI can do a lot of magic, right? And especially with all the success of ChatGPT and all that, they even think that it is easy to do. But I think the reality may not be the case, right?

And the other thing is about the practices, right? So many ML teams, first, either they don’t have many good software engineers, so they are like data scientists or, you know, ML engineers who just have a lot of expertise in building the models. But actually they don’t have all these other skills like refactoring, continuous delivery that you mentioned, right? Testing as well. Or the other flip side, which is a lot of software engineers being turned into ML engineers. So they are not necessarily an ML engineer, but they just come from, you know, like web application development and turned into ML engineer. So I think some of these things definitely I can relate.

David Tan: Yep, and just to add to that, there’s also the flip side of engineers or ML engineers being excellent ML engineers, but not treating it as a data science problem where there is uncertainties, the need for experimentation to get fast feedback. So yeah, I think the, kind of emphasizes the need, I think, for cross functional collaboration from both, like data science, from MLOps, from, you know, SRE, things like that.

[00:23:32] Ideal ML Engineering Team Composition

Henry Suryawirawan: So maybe let’s go there, right? So the composition of the team, you mentioned a few times about cross functional teams and, you know, these silos between, for example, data science, maybe the ML Ops or whoever operates the system in the end. Maybe there are also other functions, like for example other teams, because ML typically also relies a lot of dependencies like data or maybe some kind of model, things like that. So maybe what are the best compositions here in your experience? Maybe you can advise us.

Dave Colls: I guess here that there’s another question that is, and it’s what we tackle towards the end of the book. We look at sort of multiple levels of team effectiveness. And one of the levels we look at is an individual in a team, the types of practices you can use, the expertise you can build as an individual. The next level we look at is within teams, again, how those practices around technology, delivery, and product. Augment teams, but also broader team dynamics. But then the third level we look at is between teams. So how can teams be effective between teams? And so the best makeup of a team depends a bit on the shape of the team and its interactions with other teams.

And so this is where we use the team topologies model to identify the different types of ML teams that you might sit in. And so to run through it quickly and we can come back and dive into details. You know, you might have a stream aligned team, and this, we’d say, this is the basic unit to think of, first ML project, a stream aligned team that can deliver end to end. You’re not going to be conducting groundbreaking ML research, you’re going to be using tried and true techniques, in this case. Or at the other end where you have an established ML ecosystem internally, you can add another stream aligned team that draws on those existing services internally quite easily.

Then as you scale beyond one team, there might be two routes that you take. And so you might go down the route of where you have a sort of low level, common concerns across different initiatives, that might be the opportunity for a technology platform to support ML at a lower level like compute and data and feature engineering. And so that would be a platform team in the team topologies team shape. And that would aim to provide as a service, and that would be composed differently from a stream aligned team.

The other route you might take to scale is you might identify business clusters of ML needs. So there might be something around audience engagement or there might be something around asset valuation. Or there might be something around content moderation. And so those are a sort of specific set of business needs that also come with an associated set of ML paradigms as well, that don’t necessarily have a common support in a technology platform at that right business level of abstraction. So then you might have what’s called a complicated subsystem team. And again, the makeup of that might be a little bit different.

And then as you, you know, as you have a bigger ecosystem, then there’ll be a range of problems that come up of a similar nature, but like have a unique presentation each time, which might be around, say, privacy, or ethical use of data, or optimizing particular techniques. And this is where you might have an enabling ML team as well that sort of acts as a consultancy to other ML teams to make themselves redundant.

And so that was a long way of saying the ideal makeup of a team depends. But you might assume it’s a stream-aligned team and talk about the roles that sit in that, and maybe I’ll throw to David.

David Tan: I think team topologies is a really useful set of constructs, the ones that Dave went through, because it helps teams scale through the team’s API. As Dave mentioned, if we treat everything as a stream-aligned team, like in software engineering, we’re very familiar with cross-functional teams. But the failure mode there is Conway’s Law, right: we’ve got three teams, we’re going to build three sets of, you know, basically architecture, and rebuild certain tools that we need. That’s where maybe a platform team comes in to abstract all of that.

So then, with that, what we’ve seen work really well in one particular case was this complicated subsystem team. As you mentioned, ML is hard, a lot of moving parts. They took on the effort to build this ML product. Let’s imagine, I can’t say the specific example, but let’s say it’s car valuations, right? That’s kind of an ML product, and the API to this team or this product is: I will give you the value of cars, and now that is self-serviceable by other teams. The mobile team could integrate with this API and expose the ML capability. A web team can do the same, or the email marketing team can do the same and send personalized emails about car valuations. So that was how that team scaled: rather than having to integrate with each particular thing, encapsulating that complicated subsystem as a formal set of APIs, either batch or real time, helped the team achieve more and get more out of the ML product they built.
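
To make the “team API” idea concrete, here is a minimal sketch of what such a self-serviceable valuation endpoint could look like; the framework (FastAPI), the field names, and the placeholder predict_valuation function are illustrative assumptions, not the team’s actual service.

```python
# A hedged sketch of a complicated-subsystem team's "team API":
# other teams (mobile, web, email) call this endpoint instead of
# integrating with the ML internals directly. Names are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Valuation Service")

class ValuationRequest(BaseModel):
    make: str
    model: str
    year: int
    mileage_km: float

class ValuationResponse(BaseModel):
    estimated_value: float
    confidence: float

def predict_valuation(req: ValuationRequest) -> ValuationResponse:
    # Placeholder for the team's real trained model; a dummy formula
    # keeps the sketch runnable.
    base = 30_000 - (2024 - req.year) * 1_500 - req.mileage_km * 0.05
    return ValuationResponse(estimated_value=max(base, 500.0), confidence=0.7)

@app.post("/valuations", response_model=ValuationResponse)
def valuation(req: ValuationRequest) -> ValuationResponse:
    """Real-time endpoint; a batch interface could reuse predict_valuation."""
    return predict_valuation(req)
```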

Dave Colls: Yes, yes, that’s a great example. That complicated subsystem team is probably going to have a bunch of specialists; it’s going to look maybe most like what people imagine an ML team looks like. But as David said, to scale their impact in the organization, they do need that business domain expertise. They do need to deliver as a service instead of constantly collaborating with other teams. And, you know, maybe some product thinking helps with that service definition as well. So that might be one team composition.

David Tan: Yeah, that’s right. And if you replace that example of car valuations, which not everybody can relate to, let’s say it’s travel recommendations or product recommendations, then that one team that spent all the effort in building product recommendations can now impact multiple parts of the business by having the right team topology, you know, having that fracture plane around, okay, my team is doing product recommendations. And applying the kind of data product disciplines, exposing this as an API and encapsulating the details.

Henry Suryawirawan: Very exciting to hear team topologies mentioned for ML products and ML teams, right? Back then it was kind of a revolutionary approach to how we create different teams. And I’m glad you brought it up, because I believe in many companies they still think, okay, we want to build an ML product or ML project, so they just hire a few data scientists or, you know, these ML experts, and ask them to build the model first without actually involving the other aspects of software engineering. It could be the stream-aligned team from the product side. Or it could be, you know, the data, or it could be anything, right?

But I think that tends to have its challenges. So bringing in concepts such as team topologies to actually think holistically about how we are going to deliver ML projects is something that’s really, really important.

David Tan: Yeah, that’s a really great point, and one of the opening quotes in our book was from W. Edwards Deming: a bad system will beat a good person every time. So you can hire the smartest data scientists, but if you put them in an environment that isn’t the right system, then, you know, we get what you described there.

So the whole thesis of our book is: how do we create those systems to help teams build the right thing, you know, solve the right problem, and build the thing right, with engineering, ML engineering, data science, and do it in a way that’s right for people? I think Dave puts it in a really good way: it’s not just shipping and shipping, but shipping with the right team shape, the right collaboration mode, the right trust and psychological safety, and also the right processes to deliver, to ship early and often. So yeah, I was really intrigued when you mentioned that you can’t just put a group of data scientists or engineers together. You really have to create that system where they can ship the right thing and solve the right problems.

[00:31:39] Building the Right ML Product

Henry Suryawirawan: Yeah, thanks for adding that. To come back to the theme of this product discipline, right, we know that a lot of ML projects don’t make it into production, or they solve the wrong problem, or maybe don’t bring value, right? Like Dave mentioned in the beginning, there are a lot of uncertainties when you actually build an ML model or ML product, because the way it works is kind of like prediction; there’s some kind of ambiguity inside. With LLMs, there’s hallucination, right? So how can you come up with an approach such that, when you’re first building the ML product, you can actually build something that brings business value, either to the users or to the organization? I think this is probably one hard aspect as well. So typically how would you run this?

Dave Colls: Yeah, it is hard, and it goes beyond the technical, although the technical informs it. I find it’s useful to think about the essential characteristics of ML solutions. We’re exploring them because we think they’re going to be superhuman in some aspects. They’re probably not going to beat the best experts in a field; that’s maybe one myth we should tackle straight up. But they’ll be able to do things at a speed and scale that no human could. Then we also need to consider that they’ll make mistakes by their very nature. Again, we wouldn’t consider an ML solution if we knew the right answer every time; we’d write some rules instead. So there’ll be some percentage of mistakes, however small, in any ML solution, by design, as well as the unforeseen mistakes.

And then when it comes to those mistakes by design, we really need to understand the cost sensitivity: what value do we get out of a bunch of right answers, and what is the impact of maybe a very small number of wrong answers that could still have a huge impact across all sorts of dimensions, the financial, as well as security, bias, and fair treatment of all our stakeholders. So we need to consider that fallible nature as well. And so we need to start with products that are designed to handle that failure. They have some upside from when ML gets it right, but they’re robust to the times when ML gets it wrong, because it will. Starting from that perspective, we can then identify some experiments about how well this needs to work. We can start with very simple baselines, sometimes even just predicting the majority class or random guessing, and seeing how that works as a product. But ideally, to resolve this uncertainty, we’re getting into some real data and understanding the predictive potential of that real data. And so this is, again, where it’s a challenging multidisciplinary exercise. We’re trying to proceed on multiple fronts: are we building the right thing? Can our proposed solution support the performance that we expect? And so on.
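
As a concrete illustration of the baseline idea, here is a minimal sketch using scikit-learn’s DummyClassifier; the dataset and candidate model are illustrative stand-ins, not from the episode.

```python
# Measure what majority-class or random guessing achieves before investing
# in a more complex model; the candidate only earns its complexity if it
# clearly beats these baselines.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "majority_class": DummyClassifier(strategy="most_frequent"),
    "random_guess": DummyClassifier(strategy="uniform", random_state=42),
    "candidate_model": LogisticRegression(max_iter=5000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy={accuracy:.2f}")
```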

David Tan: Yeah. And just to add to that, I think that emphasizes the importance of the cross-functional nature of the work as well. If we frame this purely as a data science or ML problem, then we can try our level best to go from, let’s say, 55% accuracy to 99. You’ll never get to a hundred, and we will spend many, many weeks and months trying to get there. So it’s not just the ML problem of improving the model’s accuracy or recall or precision, but also how we can design for these failure modes of the ML model. For example, designing the product in a way that shows users, okay, this is not a confident prediction, or this is a prediction, would you like to correct it? Or maybe even giving users options: the model’s top-one prediction maybe isn’t where we want it to be, but its top three is much higher.

So designing the product in a way that mitigates these failure modes of the ML model. It’s easy for teams, if they don’t have the right capabilities or skillsets, to try to solve it with the model alone; if you’re a hammer, everything is a nail. And it’s very costly to try to level up the accuracy; we may never get there. But having that cross-functional approach, like how do we design it in a different way, how do we talk to users, do users find this okay, that’s a more holistic way to solve the problem.

Dave Colls: Yeah, that hammer and nail point is really important as you start to shift to, okay, how do we make this viable? Viable from the perspective of solving the problem effectively, but also of being economically sustainable to maintain. And I guess there’s another essential characteristic around that: the fact that ML solutions are narrow. The very training process is to optimize a loss function, which is a narrow definition of success. But they’re composable.

And, you know, this is, I think, where Gen AI can be really interesting. I’m not the only one to take this perspective, but I’ve described Gen AI as a stone soup for innovation. The story of stone soup, if you haven’t heard it, is that a weary traveler arrives at a village late at night, and all they have in their knapsack is a stone. So they go to the first house in the village and ask the villager, could I have an onion to make stone soup? You know, I’ve got the stone; all I need from you is an onion. The villager says, oh, great, yeah, I’ll provide an onion. They go to the next house and repeat, asking for carrots, and proceed around the village. By the end of that process, they cook up an amazing soup that feeds the whole village. And it was all cooked from a stone.

Considering that, you might have a spark of an idea and be able to prototype it easily, but it might actually be made up of many different components composed together to produce something that looks like it behaves intelligently, and that needs to be factored into the process as well. One of those big components will be the differentiated data that you bring. And often the step from prototyping something in an experimental environment to actually plumbing those data pipelines in a way that’s sustainable for production use can be a major step that needs to be considered upfront. So when we’ve explored AI and ML initiatives, we put a heavy weighting on that factor of where the data will come from and how you will deliver it to the product or application. That’s, again, one of those laws where it always takes longer than you expect, even when you plan for it to take longer than you expected.

Henry Suryawirawan: Like any software engineering project out there, right? So I think the most important thing is that for an ML product or ML project, you still need to have that product thinking concept from the very beginning. We’ve mentioned a little bit about user interviews, experiments, building prototypes, and making sure the data that you feed into the ML training is appropriate, right? And I like the mention of failure modes, because unlike other products out there, where we kind of know the inputs and outputs that we want the features to have, right?

ML products typically could fail in, I don’t know, unpredictable ways, right? We have seen so many things mentioned in the news, for example Google’s image classification, or ChatGPT, or Gemini or Bard giving wrong, hallucinated answers. So how do you tackle that? Plus, I think there’s the whole aspect of data security and bias, and the other associated aspects of using fair data in training. That’s another thing where you should apply your product thinking holistically, so that you come up with a very useful and valuable product.

[00:39:23] ML Engineering Best Practices

Henry Suryawirawan: So let’s go to the other discipline, which is the engineering side. I think what I could see in the past as well is that a lot of ML code is kind of highly unstructured, I would say. It’s procedural, there’s no proper modeling, it’s just function calls over function calls, very complex to trace. So maybe in your view, and you mentioned in the beginning that you took on a project to refactor an ML project, what are the disciplines that are typically lacking in the engineering aspect of ML products? And how could we do better?

David Tan: Yeah, I can personally resonate and relate with that experience. I had to get glasses recently, maybe because of old age, but I think mainly because of looking at too much code all the time, and sometimes into late nights because of stress or other things. As a healthy team, if we do the right practices, with automated testing, continuous delivery, automated deployment, then nobody should need to work late nights. So I think there are a couple of things in what you mentioned there.

Number one, I think test automation is a big part. Every ML engineer and data scientist we’ve worked with has enjoyed the automated tests that we introduced and added. And I think at this point it’s a problem of information asymmetry. We’ve got pockets of teams of people who know how to do automated testing for ML systems, and every team that I’ve joined and worked with is just like, oh, I didn’t know this, oh, you could do that. So I think there’s that desire and demand for more automated testing of ML systems.

One encapsulating story: we had a project that had close to zero test coverage. It was an LLM system. So over time we added the test pyramid, the simple things like unit tests and integration tests that touch our whole LLM application and check that it’s okay. Those are still point-based tests. Then we also have the kind of deeper model eval test, a suite of however many examples. We run it, and five minutes later we know, okay, model accuracy is 75%, whatever number that is.

Then we had an automated dependency manager for upgrades, like Snyk, or Renovate, or Dependabot. So Snyk opened up a PR saying, you need to upgrade this. The PR automatically had all these green ticks; tests were passing. We automatically triggered the model eval, so we knew, okay, performance is just as good. So in 15 minutes, we could merge the PR. No stress, no effort. That’s how we grew the capacity of the team, just by having this automated testing and model eval.
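
A minimal sketch of what such a model-eval check could look like as a pytest test on the path to production; the eval examples, the keyword-based predict stand-in, and the threshold are illustrative assumptions, not the team’s actual harness.

```python
# If a Renovate/Dependabot/Snyk upgrade PR degrades the model, this test
# fails the build, so the PR cannot be merged on green ticks alone.
ACCURACY_THRESHOLD = 0.75  # the "75%, whatever number that is" from the story

# Stand-in eval set; a real team would load a versioned file of examples.
EVAL_EXAMPLES = [
    {"text": "refund my order", "expected": "refund"},
    {"text": "where is my parcel", "expected": "delivery"},
    {"text": "cancel my subscription", "expected": "cancellation"},
    {"text": "I want my money back", "expected": "refund"},
]

def predict(example):
    # Stand-in for the real model/LLM call; keyword rules keep the sketch runnable.
    text = example["text"].lower()
    if "refund" in text or "money back" in text:
        return "refund"
    if "parcel" in text or "delivery" in text:
        return "delivery"
    return "cancellation"

def test_model_accuracy_meets_threshold():
    correct = sum(predict(ex) == ex["expected"] for ex in EVAL_EXAMPLES)
    accuracy = correct / len(EVAL_EXAMPLES)
    assert accuracy >= ACCURACY_THRESHOLD, f"accuracy dropped to {accuracy:.2f}"
```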

You touched a little bit on software design or code design as well. And yeah, I think any code base, not just ML, is susceptible to that. So it’s about taking that next level of discipline to ask: can I extract a function here? Can I have a readable variable name, not just df or x? All of those software hygiene practices, which we can link in the show notes. The single responsibility principle. My favorite one is the open-closed principle: can you design something that is open to extension, so you don’t have to modify it every time? Probably I’m going into a bit too much detail there, but I think the design, or lack of design, often stems from the lack of tests. Because if there are no tests, nobody can refactor; refactoring is so scary, so risky, that nobody does it. We take the path of least resistance.
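
A minimal sketch of the open-closed idea applied to ML preprocessing; the registry pattern, the column names, and the transformation functions here are illustrative, not code from the book.

```python
# New preprocessing steps are added by registering new functions (extension)
# without editing the preprocess() loop (modification stays closed).
from typing import Callable, Dict, List
import pandas as pd

Transform = Callable[[pd.DataFrame], pd.DataFrame]
TRANSFORMS: Dict[str, Transform] = {}

def register_transform(name: str):
    def decorator(fn: Transform) -> Transform:
        TRANSFORMS[name] = fn
        return fn
    return decorator

@register_transform("fill_missing_mileage")
def fill_missing_mileage(vehicles: pd.DataFrame) -> pd.DataFrame:
    return vehicles.assign(
        mileage_km=vehicles["mileage_km"].fillna(vehicles["mileage_km"].median())
    )

@register_transform("add_vehicle_age")
def add_vehicle_age(vehicles: pd.DataFrame) -> pd.DataFrame:
    return vehicles.assign(vehicle_age=2024 - vehicles["year"])

def preprocess(vehicles: pd.DataFrame, steps: List[str]) -> pd.DataFrame:
    for step in steps:
        vehicles = TRANSFORMS[step](vehicles)
    return vehicles

raw = pd.DataFrame({"year": [2018, 2015], "mileage_km": [60000.0, None]})
print(preprocess(raw, ["fill_missing_mileage", "add_vehicle_age"]))
```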

So with the teams that we’ve worked with, the moment we added that safety harness, when you have those tests on the path to production, a lot of things can happen. One time we did a massive refactoring of a variable that was linked in all different places. You use an IDE shortcut, replace it in like 150 places, and then tests pass, commit, done, right? The effort reduced from maybe days of testing to, again, minutes. And in 20 minutes, it was running in production.

So yeah, a lot of these practices have been emerging. I think it’s just about spreading them more, and that’s why we wrote the book: so that our teams don’t do late nights writing code like I did on one project, and can just enjoy the flow, work out that work-life balance, and have zero-stress production deployments, because tests are passing and you’ve got production monitoring. So a lot of these engineering practices really help teams feel the joy and flow of building ML products.

Dave Colls: Yeah, I think the call-out around testing is really crucial, and it’s actually about allocating your effort effectively. In a regular software product, we might use the test pyramid to direct effort, so that we have a large number of cheap, low-level tests, a medium number of integration-level tests, and then a small number of end-to-end tests. When we’re looking at ML applications and other data-intensive applications, we can add a second dimension to that, so it’s a grid instead of a pyramid. And that dimension is the data dimension.

So we might have a lot of small, cheap tests around individual data points; canaries, I guess you might also describe those as. We might then have tests at a sample level. These are going to give us a little bit more insight than a point data test. They’re going to have some variability, so there’s going to be some tuning there, but they offer us much faster feedback than the final level, which might be a kind of global test: let’s test on all the data that we have.

And often people don’t think hard enough about balancing the tests across that spectrum of data. For example, you might be able to do a very quick training run on a very small subset of data. If anything’s misconfigured in the training run and the training doesn’t work properly, that will break and you’ll get that feedback really quickly rather than waiting hours for it. Or the training might pass, and you might have a lower benchmark or threshold for acceptable performance on that small run, but at least you’ve tested end to end with a sample data set and got some feedback that it’s likely to work at the large scale. So it’s all about bringing that feedback back.
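
A minimal sketch of that sample-versus-global split as two tests with different thresholds; the dataset, model, and thresholds are illustrative assumptions, not from the episode.

```python
# A fast smoke test on a small data subset catches misconfigured training in
# seconds; the full-data test runs less often with the real benchmark.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_and_score(sample_fraction: float) -> float:
    X, y = load_digits(return_X_y=True)
    if sample_fraction < 1.0:
        # Keep only a stratified fraction of the data for the quick run.
        X, _, y, _ = train_test_split(
            X, y, train_size=sample_fraction, stratify=y, random_state=0
        )
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

def test_training_smoke_on_small_sample():
    # Sample level: fast feedback that training runs end to end and learns
    # something, with a deliberately lower bar than production.
    assert train_and_score(sample_fraction=0.1) > 0.7

def test_training_on_full_data():
    # Global level: slower, run less often, with the real benchmark.
    assert train_and_score(sample_fraction=1.0) > 0.9
```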

And I think when we come to that uncertainty at the front end as well, it’s about being thoughtful about how you test under those conditions of uncertainty, when you don’t even know what you’re looking for in exploratory data analysis. There, it’s about moving from unknown-unknowns to known-unknowns to known-knowns through testing. Visualization is really key, being able to look at the data and understand what it’s telling you, or to use automated tools to find relationships in the data. And then, once you sort of understand what the data is telling you, that visualization, does it look right, does it look like I expect, can actually be a form of testing. I’ve called it visualization-driven development at times as well.

But then once you understand qualitatively what you’re looking for, there’s a whole range of data science techniques that you can use to turn that into a binary expectation that can pass or fail. That might be useful in an exploratory environment, but it might also be something that you promote into a production integration pipeline, as David was describing. So really, thinking hard about how you use testing to get fast feedback all the way through the life cycle is pretty crucial.

David Tan: And just to add to that, LLMs, as you mentioned, have been all the rage for the past two years. With one of the teams we worked with, we had to innovate and think about how to shift left. As Dave mentioned, full evaluation can be costly; it takes time and it can cost money, especially for LLMs. So what we ended up doing was to shift that even further left. Before we kickstart the big eval or any further deployments, we had an integration test. And one of the challenges the team raised was, how can you test something that’s non-deterministic? The answer is different every time. So we wrote an assertion function that asserts on intent rather than vocabulary. We can still evaluate that this response, given these conditions, matches the intent we had expected. We were using PyHamcrest for that kind of matcher-style extension, but you could use other tools as well. Sometimes, at the forefront of this new capability, you have to be a bit creative about how to shift left and get the feedback that Dave mentioned.
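
A minimal sketch of an intent-based assertion in PyHamcrest’s custom-matcher style; the keyword-based classify_intent function is a trivial stand-in for whatever intent classifier a real team would plug in, which the episode doesn’t specify.

```python
# Two differently worded LLM outputs both pass, because the assertion is on
# intent rather than exact vocabulary.
from hamcrest import assert_that
from hamcrest.core.base_matcher import BaseMatcher
from hamcrest.core.description import Description

def classify_intent(response: str) -> str:
    # Stand-in intent classifier; a real team might call a smaller model here.
    text = response.lower()
    if "refund" in text or "money back" in text:
        return "offer_refund"
    if "sorry" in text or "apolog" in text:
        return "apology"
    return "other"

class HasIntent(BaseMatcher):
    def __init__(self, expected_intent: str):
        self.expected_intent = expected_intent

    def _matches(self, item) -> bool:
        return classify_intent(item) == self.expected_intent

    def describe_to(self, description: Description) -> None:
        description.append_text(f"a response with intent '{self.expected_intent}'")

def has_intent(expected_intent: str) -> HasIntent:
    return HasIntent(expected_intent)

assert_that("We'll refund your purchase right away.", has_intent("offer_refund"))
assert_that("Your money back is on its way!", has_intent("offer_refund"))
```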

Henry Suryawirawan: Right. Thanks for mentioning some of these techniques. I’m always intrigued, like you mentioned, by how you can test something that is non-deterministic, with so many variables that could come into play. So thanks for bringing up the importance of automated testing.

I also like what you mentioned about information asymmetry. Maybe there are some ML engineers who have never been exposed to some of these techniques, and once they know them, they can actually follow the discipline and make sure the products get better and better. And I like the approach of slicing the data for different stages of tests. I think that’s also key in making sure that ML projects still behave as we expect, because, for example, you can tweak the model a little bit and the output can change so much, right? We don’t want that to happen in production.

[00:49:14] MLOps

Henry Suryawirawan: The other aspect of ML that people always talk about lately is MLOps: building platforms, how to actually deploy a model and operate it. So maybe a little bit here: what do you think about MLOps? Is it something that we all need for building ML products? And what problem does it solve?

David Tan: Yeah, I think it’s yet another tool in our toolkit, which I very much welcome. Back in the day, we had to wrangle and think about how to solve large-scale distributed processing. But now there are these MLOps tools that let you abstract away that concern. In one case, for one team, we built an ML platform where the data scientists, anyone who doesn’t know anything about infrastructure or AWS or Kubernetes, can just write plain Python and say, I want to scale vertically this way, I want to fan out, and all of that is done in Python.
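
The episode doesn’t name the platform behind “just write plain Python and fan out”; Ray is one common open-source option, so this is a hedged sketch of that style of API rather than the team’s actual platform.

```python
# Fan-out in plain Python: the distributed execution is abstracted away by
# the framework rather than hand-rolled infrastructure code.
import ray

ray.init(ignore_reinit_error=True)  # locally; on a cluster this would connect to it

@ray.remote
def score_partition(rows):
    # Stand-in for per-partition feature engineering or batch scoring.
    return [r * 2 for r in rows]

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
futures = [score_partition.remote(p) for p in partitions]  # fan out
results = ray.get(futures)                                  # gather
print(results)
```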

Compute is one part of the MLOps stack. There’s also experiment tracking, which has really helped us as well. Every pull request runs an experiment that reports results we can check over time, so if David creates a new PR, we can see whether it’s better than our champion model. That MLOps practice was really, really welcome as well.
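
The episode doesn’t name the experiment-tracking tool either; MLflow is one widely used option, so this is a hedged sketch of the per-PR challenger-versus-champion pattern, not the team’s actual setup. The experiment name and metric are illustrative.

```python
# Log a metric for each PR's evaluation run and compare it against the best
# accuracy recorded so far ("champion").
import mlflow

EXPERIMENT = "pr-evaluations"
mlflow.set_experiment(EXPERIMENT)

def beats_champion(accuracy: float) -> bool:
    runs = mlflow.search_runs(experiment_names=[EXPERIMENT])
    if runs.empty:
        return True
    return accuracy > runs["metrics.accuracy"].max()

def log_challenger(pr_number: int, accuracy: float) -> None:
    with mlflow.start_run(run_name=f"PR-{pr_number}"):
        mlflow.log_param("pr_number", pr_number)
        mlflow.log_metric("accuracy", accuracy)

if __name__ == "__main__":
    new_accuracy = 0.81                      # would come from the model eval in CI
    promote = beats_champion(new_accuracy)   # compare before logging this run
    log_challenger(pr_number=123, accuracy=new_accuracy)
    print("Promote?", promote)
```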

The challenge here is that there are too many tools and it’s hard to navigate. Thoughtworks had an article called “A guide to evaluating ML platforms”, which we can link in the show notes. That really helped with thinking about the capabilities. Some platforms try to do everything, some are narrow. So how do you pick what’s right? And how do you avoid things like shotgun surgery and vendor coupling?

But yeah, end of the day, I think MLOps is about abstracting away complexity so that you can focus on solving the right problem, not having to deal with undifferentiated labor, you know, in your day to day work.

Dave Colls: And I think, yeah, coming back to the testing perspective as well, I think one of the things we highlight in the book, to get the most out of MLOps automation and abstraction, you also need to ensure that you’re doing the right testing to give you confidence that when you’re moving fast, you’re doing so safely.

David Tan: Yep. And to add to that, a key point we make in the book is that you can’t MLOps your problems away, just like how you can’t DevOps your problems away. Last week, or last episode, you had DX with Laura Tacho on the show, talking about DevEx. I really like that diagram of the DevEx triangle: how do you get faster feedback loops, how do you manage cognitive load, how do you get in the flow state? MLOps helps in a few ways, but MLOps is not going to write your tests for you. It’s not going to talk to users and make sure you’re implementing the right features. It’s not going to make sure your code is nicely factored and readable so that you can stay in the flow. So yeah, I think it’s another tool, but it needs to be coupled with these other disciplines that Dave and I mentioned in this podcast and in the book.

Henry Suryawirawan: Yeah, thanks for the plug for developer experience as well, right? So don’t forget that any kind of ML product is essentially also a software engineering problem. It’s socio-technical, so don’t forget the aspects of feedback loops and psychological safety that were also mentioned in the beginning. Like I said at the very start, it’s not just the ML technicals that you need to understand. In the end, it’s a software engineering thing that you have to handle really, really well.

[00:52:44] Make Good Easy

Henry Suryawirawan: So we have talked about a lot of things. As we move towards the end, is there anything that we haven’t covered that you think should be mentioned as well?

David Tan: I think one key takeaway is: how do we make good easy? As engineering leaders, as ML practitioners, we’ve talked about a lot of different practices. If we can make good easy, then teams, just by following the team practices and exemplar repos, get that for free in their CI/CD setup, their test strategy, maybe even hygiene checks like talking to users, or having a business case together before you start asking people to work on something for six months. It can get out of hand easily with so many moving parts. So as engineering leaders, how do we make good easy, so that for teams on the ground, when they get the mission to do a certain piece of work, it’s kind of built into the way of working?

Henry Suryawirawan: Yeah, I like it. Make good easy, right? Sometimes we all get excited about the technology, so many moving parts, so many technologies that we can play with, but we forget the aspect of making it easy for people to adopt, and easy to get the buy-in as well. So thanks for mentioning that.

[00:53:56] 3 Tech Lead Wisdom

Henry Suryawirawan: So it’s been an exciting conversation. I learned a lot about what it takes to actually build an ML project, which I find really complicated. But as we reach the end of our conversation, I have one last question that I’d like to ask you. I call this the three technical leadership questions. Just treat it like advice that you want to give to us as listeners. So what would be the three technical leadership wisdom that you can share with us?

Dave Colls: Shall I go first?

David Tan: Yeah, if you like.

Dave Colls: I think, again, in line with some of the philosophy of the book, being able to take different perspectives on technical problems is really key for your leadership growth. So look for opportunities to play different roles in projects, even for a short time. It gives you that understanding of what other stakeholders require and how to make them successful as you aim to be successful yourself.

David Tan: Yep. I thought of two. One is what I call focus, function, and fire. This was an idea I got from Todd Henry in his book, I think, Herding Tigers. We are building things every day, and teams can get distracted by many things. So to set our teams up for success, how can we give them that focus: the clear mission, the milestone, the why we’re doing it, for the customers’ benefit and for the business, the business case, how this will move the metrics that we care about? Function: the way of working. Instead of throwing work or communications over to teams, have we got the right way of working? And fire: the kind of implicit motivation. Why are we doing this? How is this helping people? How is this helping the business? So that, for me, is one principle I take to my teams.

The second one, which was really interesting and which we also covered in the book, is about trust and psychological safety. When it’s absent from the room, a simple conversation becomes a process of bureaucracy: okay, I’ve got to write up this documentation, you’ve got to pre-read it, and let’s have a meeting in two weeks to discuss. And when it’s not there, team members are afraid to voice concerns about how things might fail, so we continue down the wrong direction. So as tech leaders, how do we embody, encourage, and ensure that we have that psychological safety in the team, so that we can all do our best work?

Dave Colls: On David’s final point, that idea of trust is also really important in a multidisciplinary team and in innovative initiatives under conditions of uncertainty. We want anyone to be able to speak up with good ideas or with concerns that things might be broken. And so being able to create those serendipitous moments, as well as the well-understood moments that trust facilitates, is a pretty key focus for leaders.

Henry Suryawirawan: Yeah, I like the last aspect that you mentioned about psychological safety. You also mentioned that if it’s not there, things tend to become bureaucratic, right? So I think that’s really a good plug.

So if people want to learn more about these exciting things that you mentioned in the book, or if they want to discuss with you, maybe is there a place where they can find either or both of you online?

Dave Colls: Yeah, you can find me on LinkedIn, David Colls.

David Tan: Yep. And I’m Davified, and we can share a link in the show notes. For our book, Effective Machine Learning Teams, the preface is available for free at a link that we can share in the show notes. That gives you an overview of everything we’ve talked about in this podcast in seven pages. And we can also link the book itself in the show notes, where you can read or listen to it.

Henry Suryawirawan: Thank you so much. It’s been a pleasure to have both of you Davids on the show. I hope people learned a lot about the aspects of machine learning models and, in the end, good software engineering practices.

David Tan: Thanks so much, Henry. This was a great chat.

Dave Colls: Thanks for having us, Henry. That was great.

– End –