#175 - How to Solve Real-World Data Analysis Problems - David Asboth

20-May-2024 57 mins David Asboth

included in AI/ML & Data Engineering AI/ML Data Analytics

“All data scientists and analysts should spend more time in the business, outside the data sets, just to see how the actual business works. Because then you have the context, and then you understand the columns you’re seeing in the data.”

David Asboth, author of “Solve Any Data Analysis Problem” and co-host of the “Half Stack Data Science” podcast, shares practical tips for solving real-world data analysis challenges. He highlights the gap between academic training and industry demands, emphasizing the importance of understanding the business problem and maintaining a results-driven approach.

David offers practical insights on data dictionary, data modeling, data cleaning, data lake, and prediction analysis. We also explore AI’s impact on data analysis and the importance of critical thinking when leveraging AI solutions. Tune in to level up your skills and become an indispensable, results-driven data analyst.

Listen out for:

Career Journey - [00:01:38]
Half Stack Data Science Podcast - [00:06:33]
Real-World Data Analysis Gaps - [00:10:46]
Understanding the Business/Problem - [00:15:36]
Result-Driven Data Analysis - [00:18:28]
Feedback Iteration - [00:21:44]
Data Dictionary - [00:23:48]
Data Modeling - [00:27:18]
Data Cleaning - [00:30:43]
Data Lake - [00:35:05]
Common Data Analysis Tasks - [00:36:50]
Prediction Analysis - [00:40:23]
The Impact of AI on Data Analysis - [00:43:15]
Importance of Critical Thinking - [00:47:05]
Common Tasks Solved by AI - [00:50:07]
3 Tech Lead Wisdom - [00:53:10]

_____

David Asboth’s Bio
David is a “data generalist”, currently a freelance data consultant and educator with an MSc in Data Science and a background in software and web development. With over 6 years of teaching experience, he has taught everyone from junior analysts up to C-level executives in industries like banking and management consulting how to successfully apply data science, machine learning, and AI to their day-to-day roles. He co-hosts the Half Stack Data Science podcast about data science in the real world and is the author of Solve Any Data Analysis Problem, a book about the data skills that aspiring analysts actually need in their jobs, which will be published by Manning in 2024.

Follow David:

LinkedIn – linkedin.com/in/david-asboth-9256772
Website – davidasboth.com
Podcast – halfstackdatascience.com

Mentions & Links:

📚 Solve Any Data Analysis Problem – https://www.manning.com/books/solve-any-data-analysis-problem
🎙️ Half Stack Data Science – https://halfstackdatascience.com/
Predicting the survivors of Titanic – https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8
SQL – https://en.wikipedia.org/wiki/SQL
Power BI – https://www.microsoft.com/en-us/power-platform/products/power-bi
Tableau – https://www.tableau.com/
Matplotlib – https://matplotlib.org/

Our Sponsor - Manning

Manning Publications is a premier publisher of technical books on computer and software development topics for both experienced developers and new learners alike. Manning prides itself on being independently owned and operated, and for paving the way for innovative initiatives, such as early access book content and protection-free PDF formats that are now industry standard.

Get a 40% discount for Tech Lead Journal listeners by using the code techlead24 for all products in all formats.

Our Sponsor - Tech Lead Journal Shop

Are you looking for some cool new swag?

Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference and will be delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once you receive it.

Like this episode?

Follow @techleadjournal on LinkedIn, Twitter, Instagram.

Buy me a coffee or become a patron.


Quotes

Career Journey

Over time, I sort of started to prefer doing that part of the job, because I found that I was closer to the value generating aspect of the business. I was closer to real business problems, where software development in a lot of cases is a little bit devolved somehow from the business.
If you’re a software developer, you don’t necessarily have to understand how the business runs, right? You don’t necessarily have to understand how does the business make profit. Who are the customers? How do the customers make profit? What is the operating model? Those things are not that relevant to software developers a lot of the time. Whereas if you’re a data person, you can’t provide any value unless you know how the business works.

Half Stack Data Science Podcast

Very early on in that job, we both quickly realized that what we thought the job was going to be is not at all what the job is like in reality. A lot of data science education is focused on tools, techniques, algorithms, and so you get this picture that I’m going to be deriving formulas and doing all this complicated machine learning at work.
And then you go in and it turns out, a lot of the job is navigating the complexities of an enterprise environment working with a sort of office politics. And things like, oh, the data is not actually available, and no one actually has any solid research questions. So we have to find those as well.
We were very, very quickly hit on this difference between education and reality. And so we started having these conversations internally about what are the things we’ve learned about how industry is different? What are the skills that people should actually be trained on or at least be warned about up front? So people have a better picture of what the job looks like.
One day we just said, well, we’ve had these conversations quite a lot. And every time we went to a meetup, we would talk to like-minded people and have these same conversations. And we thought we might as well just put them on the internet for other people to learn from.
It’s a response to the full-stack idea. One thing we never liked was this idea that a single data scientist has to be this unicorn who does everything in a company. That’s just not the reality. That would never function in a business, especially a solid decade old enterprise. You can’t just drop a single data science unicorn in there and hope that they’ll make the company lots of money with a machine learning model immediately.
There was one presentation we gave somewhere and Shaun, we talked about things like the difference between academic data science and business data science, and just somewhere on those slides, he coined the phrase half-stack data science. He just put it on the slide. And it just sort of stuck.
And the idea is that because data science is so different in the real world as opposed to in education, then you need people with a sort of this hybrid skill set. A more generalist skill set. And we don’t really think that having someone be full-stack is realistic. Or at least being an expert in everything from data cleaning to statistics to software development to business strategy development.

Real-World Data Analysis Gaps

I’ve taught accelerators and boot camps and I was faced with this problem of trying to teach the right skills, but also within the framework of what I was expected to teach.
There’s a list of technical topics that you absolutely have to teach in order to make someone a data analyst:
- You need to be able to read data, combine data, clean it, identify missing values, outliers, all this kind of technical stuff.
- We usually teach SQL and some kind of tool like Power BI or Tableau. Ideally there’s some kind of programming language in there too, maybe Python or maybe R.
- These are sort of technical skills that you just have to teach in the foundational training, because otherwise you can’t do the job right.
- Excel counts as well, so all the different things you can do in Excel. And then sometimes, depending on the course, you’d also teach machine learning.
- How to build a machine learning model, how to do some of the sort of the practitioner things like cross validation, other things like that.
But technical skills are not the whole job. Software engineers don’t spend most of their time writing code and data analysts don’t necessarily spend most of their time analyzing data. There’s all sorts of other things to do, like identifying problems to solve in the first place. That’s not something we teach in foundational training.
How do you have a conversation with another human and read between the lines of what problems they’re actually trying to solve? And you don’t even necessarily realize that is a skill you’re going to need. And my problem with that kind of thing is that we often hand wave it away and say, oh, well, of course, people would just learn that on the job. But how? There’s no one to actually teach them in any sort of formal way.
Anything about navigating priorities in a company where five different stakeholders ask you for seven different projects. How do you know what to work on? How do you generate a value statement for any analytical work? How do you think about the actual quantifiable value that this project is going to have? What kind of impact it’s going to have?
And then the other thing is, so you’ve got building all these machine learning models, but then how are these models going to be used by the company? And so sort of planning after the first model component as well. That’s something that you just sort of have to learn on the job rather than being prepared for it.
Finally, the other thing that I try and teach as much as possible is like the most realistic data sets that you can work on rather than the toy examples that we teach. The fact that the data sets in education are not that realistic is one of the problems. And the other one is often they’re actually quite clean.
There’s a meme in data science, which is 80 percent of data science is spent cleaning data, and the other 20 percent is spent complaining about the fact that we have to clean data.
It’s true that often we’re the ones who have to either collect the data or find where it is in the first place, combine it, document it for the first time. And these are all things that take time, and these are all things that we don’t think about much in the classroom. And one reason for that is we just don’t have the time. We have to teach these technical skills. But I think there is room to make education better at the foundational level by incorporating more of these real world elements.

Understanding the Business/Problem

All data scientists and all analysts should spend more time in the business, outside of the data sets, just in the actual business to see how it works. They should be shadowing their colleagues who are in charge of either entering the data or just doing business operations. The sales people, the customer engagement people, everyone who’s contributing some way to the company. I think data science should understand all those functions, because then you have the context, and then you understand the columns that you’re seeing in the data.
In education, we normally say, okay, here’s the data set, and then here’s something called the data dictionary, where each column is labeled, and then we tell you what each of the columns means, which is great for educational purposes. That thing doesn’t exist in the real world. Usually we have to write our own data dictionary. But even then, that’s a very narrow view of looking at it just as a table of numbers.
What we should look at it as is the wider context in the business. Where does this data come from? Who enters it? When do these records get generated? So just understanding that data generating process is really important.
It’s really important for a data scientist to understand the domain they’re working in. And that’s the kind of thing you can learn on the job, for sure. Like you don’t need to spend a year as a used car salesman before you go into a data science job in the used car industry. But once you’re in there, having that additional curiosity and context about the domain is definitely important and not something that you can delegate to other colleagues.

Result-Driven Data Analysis

As data people, we need to remember that our primary goal in any company is to provide a value of some sort, whether that’s clearly monetary or saving time through automation, whatever it is, that is our primary goal.
Something I say to students is if you can solve a stakeholder’s problem, answer their question with a single bar chart, then fine. It’s the problem solving that matters. It’s not the level of sophistication in your tools.
That’s what I’m trying to get across in the book is that the key thing is to have an end goal to start with. One of the first things you need to do is define an end goal that you want to reach. And it might be like the simplest version of the problem. It might be the smallest possible answer. I call it the minimum viable answer in the book, which is what is the absolute minimum amount of work that you can do to get something, a result that will take you to the next step.
And then it’s usually then a conversation, as you said, a collaboration with stakeholders to say, look, this is what I did based on our conversation. What do you think? What direction should we take this in? And just having that at the forefront of your mind throughout the analysis, I think is really helpful because as data people, we can go down the rabbit holes.
You’re working on a data set. Usually, you get more questions generated than answers. Every time you look at a new column, you’re like, oh, there’re missing values here. Oh, there’s a relationship between these columns. And so you can keep exploring it forever. And as you say, never get to a result.
If you already know upfront what your results should be, the whole exploration has a goal that you’re directing towards, which makes it quicker to get to an answer. And it will also make the answer more useful and contextual.

Feedback Iteration

In the software world, what I was used to is that we got pretty good at estimating how long a task would take. And we were pretty close most of the time for these little chunks of work.
In the world of data analysis, there’s so much uncertainty, partly because the problem is ill-defined, partly because we don’t know the data very well, partly because we could find anything of interest and go down all these rabbit holes, that it becomes very difficult to estimate how long something will take.
But you can’t just say that to a stakeholder. You can’t just say I have absolutely no idea when I’m going to get back to you on this. We didn’t call it a minimum viable answer at the time. We just said we’ll do some work to get towards this particular answer, which we’ve all agreed on looks like a plausible first step. So in a week’s time, we’ll report back or in a few days we’ll report back. And then, rather than saying when the work will be finished, we would just check in at intervals.
And the check-in might be, here it is. Here’s your minimum viable answer. Or the check-in is this question that you had actually is very difficult to answer with the data we have. Here’s the other kind of data that we need to collect before we can give you an answer. Again, it’s about collaborating and keeping your stakeholders in the loop. Shorter iteration cycles basically are what you’re after.

Data Dictionary

In the real world, it is rare to have a document like that. You might have pieces of it scattered around but not collected together in something as coherent as a data dictionary.
The purpose of the data dictionary is at a surface level to record all the different columns in the data and what they mean, what data type they should be, and then what they represent. Sometimes, it’s because the column names are abbreviated, so it just tells you sort of what the column actually means or what the abbreviation stands for. But deeper than that, what you want a data dictionary to tell you as well is the process that generates those columns.
It needs to give that deeper context of why we have these columns in the first place and what are the possibilities of the different ways that the values could be filled in. So ideally, the data dictionary also talks about the data generating process and how the sort of the business operations translated to this particular data set.
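As a rough illustration of the surface-level part of a data dictionary, the Python sketch below drafts one entry per column from a small list of records. The column names (dlr_id, sale_amt, chan), the sample rows, and the draft_data_dictionary helper are all hypothetical and not taken from the book. The description and data-generating-process fields are deliberately left blank, because only people who know the business can fill them in.

```python
# Hypothetical sample records; in practice these would come from a database or file.
rows = [
    {"dlr_id": 101, "sale_amt": 12500.0, "chan": "web"},
    {"dlr_id": 102, "sale_amt": None, "chan": "phone"},
    {"dlr_id": 101, "sale_amt": 8900.0, "chan": "web"},
]


def draft_data_dictionary(records):
    """Draft one data dictionary entry per column from a list of dict records."""
    # Collect column names in the order they first appear.
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)

    entries = []
    for name in columns:
        values = [record.get(name) for record in records]
        present = [v for v in values if v is not None]
        entries.append({
            "column": name,
            "observed_types": sorted({type(v).__name__ for v in present}),
            "missing_count": len(values) - len(present),
            "example_values": sorted(set(present), key=str)[:3],
            # These two fields need a human who understands the business.
            "description": "",
            "how_it_is_generated": "",
        })
    return entries


data_dictionary = draft_data_dictionary(rows)
for entry in data_dictionary:
    print(entry)
```

The automated part only covers names, types, and missing values. The empty fields are where the deeper context about the data generating process would go.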

Data Modeling

One of the ways I think about this is that there’s a good definition of data science, which is the process of turning data into information and information into insights. And it’s this first step of turning data into information.
As you said earlier, you alluded to operational transactions that happen have to be stored in a database. They weren’t collected with analytics in mind necessarily, they’re just database records that power some kind of customer facing application. And as analysts, we want to come in and analyze that data, but in its raw form, it’s almost never usable. We need to do something to it. And when we do things like what we might call data cleaning as part of an analysis, what we would like to do is to do that data cleaning once and have a cleaned version of that data stored somewhere.
Part of data modeling is doing that data cleaning once so that the logic of the cleaning is encoded already in the data that we use. Lots of companies will have this problem where, for example, you have a bunch of Tableau dashboards, and in every dashboard, there’s a formula that calculates some relevant metric, but that calculation is duplicated across every dashboard. So if you ever make a change to it, you need to remember which dashboard it’s also in.
- If you have a clean data model where that metric is already pre-calculated, and the dashboards just read from that clean data model, then you’ve got that problem in one place. So if you need to change the metric, all the dashboards will update.
The other reason to do data modeling is to sort of capture business entities in the right way. So one of the problems we had to work with was, what is a customer in our business? And that sounds like a very simple question.
- We had different business areas that worked with individuals. So individual used car dealers. So they were people. But then we also had customers who were entities, and they might be a single dealership or they might be like a parent group. And so, of those different entities, which one is a customer? That depends on who you ask, and it depends on the purpose that you want to use the data for. And so what you need to do then is have some sort of customer data model where everybody agrees on the definition of a customer.
And this, again, is something that we don’t really talk about in foundational training. We say, yes, you need to clean your data. But we don’t say once you’ve cleaned your data, you should probably have a clean version of it somewhere and stop cleaning your data multiple times. Stop repeating yourself.
That’s why I dedicated a whole project in the book to data modeling, to sort of practice this idea of taking raw data and turning it into a specific structure, which is tailored, again specifically tailored, to the questions you’re going to ask in the business.
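One possible way to picture the "calculate the metric once" idea is a database view that every dashboard reads from. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The raw_sales table, the dealer_margin view, the margin formula, and the two dashboard functions are assumptions made up for illustration, not the setup described in the episode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_sales (dealer TEXT, revenue REAL, cost REAL);
    INSERT INTO raw_sales VALUES
        ('North Motors', 1000.0, 800.0),
        ('North Motors', 500.0, 300.0),
        ('City Cars', 2000.0, 1500.0);

    -- The metric is defined once here, not inside each dashboard.
    CREATE VIEW dealer_margin AS
    SELECT dealer,
           SUM(revenue) AS total_revenue,
           (SUM(revenue) - SUM(cost)) / SUM(revenue) AS margin
    FROM raw_sales
    GROUP BY dealer;
""")


def dashboard_margin_table(connection):
    """First hypothetical dashboard: margin per dealer."""
    return connection.execute(
        "SELECT dealer, margin FROM dealer_margin ORDER BY dealer"
    ).fetchall()


def dashboard_best_dealer(connection):
    """Second hypothetical dashboard: the dealer with the highest margin."""
    return connection.execute(
        "SELECT dealer FROM dealer_margin ORDER BY margin DESC LIMIT 1"
    ).fetchone()[0]


print(dashboard_margin_table(conn))
print(dashboard_best_dealer(conn))
```

If the definition of margin changes, only the view needs editing, and both dashboards pick up the new definition.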

Data Cleaning

What’s funny is we used to have this term big data to describe data that cannot be processed on your laptop. And you don’t see that term around very much. That’s because processing power, even on individual laptops, has grown so much. And even getting access to a remote cluster that has a lot more resources than your laptop is pretty easy these days. So I don’t think we often have that problem where you can only clean half of your data and the other half you can only revisit when you run some code later or something.
Some companies obviously have data that’s so huge that they need special methods, but I think in most cases it’s not the case anymore.
When it comes to data cleaning, one thing I would tell students is don’t try to clean the whole thing at once before you do anything with it. Because you’re going to find some issues down the line, anyway. It’s just this pragmatism of figure out what part of the data you need right now and have a look at it.
There are some checks you can do. There are some surface level checks you can do, like what are the unique values in this column? Are there missing values? Are there outliers? But some of the more complex problems or patterns in the data, the ones that will either invalidate your analysis or require you to redo some of your work, you won’t notice until you start working.
Don’t be wedded to this idea that you have to make a perfectly clean data set before you start working. Do the minimum that you need and just start doing the analysis with the knowledge that you’re probably going to have to go back to step one again and again.
If you would ever watch me doing an analysis, it’s never like I have the analysis in my head and I just have to type it out. It’s an active process where you do some stuff, and you go, oh, no, this doesn’t make sense at all because I found something in the column. I have to go right the way back to the top and start again.
You will find data issues throughout the process. So don’t worry about getting it perfect the first time.
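The surface-level checks mentioned above (unique values, missing values, outliers) can be sketched in a few lines of standard-library Python. The column contents and the helper names surface_checks and iqr_outliers are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers, not a method named in the episode.

```python
import statistics
from collections import Counter

# Hypothetical column values; None marks a missing entry.
fuel_type = ["petrol", "diesel", "Petrol", None, "petrol", "electric"]
mileage = [42000, 38500, 51000, None, 47000, 45500, 44000, 49500, 990000]


def surface_checks(values):
    """Return unique-value counts and the number of missing values for one column."""
    present = [v for v in values if v is not None]
    return {
        "unique_counts": Counter(present),
        "missing": len(values) - len(present),
    }


def iqr_outliers(values, k=1.5):
    """Flag numeric values more than k * IQR below Q1 or above Q3."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]


fuel_report = surface_checks(fuel_type)
print(fuel_report["unique_counts"])  # shows both "petrol" and "Petrol"
print(fuel_report["missing"])
print(iqr_outliers(mileage))
```

Checks like these catch inconsistent labels and implausible values quickly, but they say nothing about the deeper patterns that only show up once the analysis is under way.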

Data Lake

It’s very tempting to take absolutely any data that you have lying around and dumping it somewhere and saying, oh, we’ll come back to it when we need it. The technology is there to allow people to do that quite easily. The problem with that is that there’s no thought given again to the end product of like what are we going to use this data for? And dumping stuff into a data lake because at some point in the future we might need it is not necessarily the best approach. This can create a lot of problems down the line.
If you think about the phrase data science, half of it is the word science. So that’s not how science works. Science, when you need to collect data, you have a hypothesis; you set up an experiment, you actually have a sort of theoretical framework to build around before you even think about the data part.
And some of that could be applied to the business world, where we don’t dump stuff into a data lake for the sake of it. We think a bit more about what is the problem we’re actually trying to solve, therefore, what is the data that we need, therefore where should we store what information?

Common Data Analysis Tasks

I picked the projects in the book specifically to address topics that I thought were missing from foundational data training, but that actually come up a lot in the business world.
You mentioned time series forecasting. That’s one of the things I talk about a lot with students. And we’ll teach them a little bit about how to reshape time data, how to think about time data differently from tabular data, and how some of the methods are different. The opportunity to forecast things in the real world, there are a lot of those opportunities and it’s actually much bigger than we let on in basic training. So that’s why I have a project dedicated to time series data.
There’s also this other idea of working with categorical data. Now that’s something that we mention as an aside in foundational training. Sometimes your data is categorical and here are a couple of methods that you can use to transform that data into something else. If you work with operational data, like people filling in forms and entering things in, records into a system, anytime there’s a drop down, you’ve got categorical data.
We spend a lot of time talking about correlation. We spend a lot of time looking at distributions and things for continuous data. But we don’t talk about methods for categorical data enough. And one problem with that is then you’re not equipped to deal with all these columns that you’ll actually see in the real world. But the other problem is that people accidentally shoehorn continuous methods into categorical data.
Remember that you need to think through your methodology harder because the computer is not going to tell you otherwise.
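A small Python sketch of what shoehorning a continuous method into categorical data can look like. The drop-down field, its integer codes, and the labels are invented for illustration.

```python
import statistics
from collections import Counter

# Hypothetical drop-down field, stored as integer codes by the source system.
CHANNEL_LABELS = {1: "web", 2: "phone", 3: "walk-in"}
channel_codes = [1, 1, 3, 3, 3, 1, 3]

# This runs without any error, but the result is not a meaningful quantity:
# the codes are labels, not measurements.
misleading_mean = statistics.mean(channel_codes)

# Methods suited to categorical data: frequency counts and the most common value.
channel_counts = Counter(CHANNEL_LABELS[code] for code in channel_codes)
most_common_channel = channel_counts.most_common(1)[0][0]

print(misleading_mean)
print(channel_counts)
print(most_common_channel)
```

The mean lands close to code 2, which stands for "phone", a channel that never appears in the data. The computer raises no complaint either way, so the choice of method has to come from the analyst.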

Prediction Analysis

Prediction is obviously something everybody says they want. The question is, what is the output of that work?
That’s something we found out the hard way on a project. I built a predictive model for something that was, from a technical point of view, accurate enough to use. And when we tried to put it into production, we found various organizational barriers to it. Like the data that the predictive model requires doesn’t arrive in time. So we can only make the prediction when it’s too late. And then the clients that we would use this with didn’t actually have the levers in their business to change anything based on our predictions. So it was a sort of two-fold failure from an organizational point of view.
From then on, we were much more strict about starting with the end of why do you want us to make these predictions? And I think as a data person, that’s a question you should ask immediately when somebody says, I want you to build a predictive model or we should be predicting this thing, is, okay, but what are you going to do with the predictions? What is going to change in the business? How are you going to respond to these predictions? And so it’s nice to have that conversation up front, because then you know your stakeholders are forced to think about, okay, if we had this predictive model, what would we actually do with it?
It’s not a technical challenge. Because I think the technical challenge of prediction is pretty well-catered for. There are lots of tips and tricks out there. But the organizational side of it is really where these projects are won or lost. So my biggest advice would be to have that human conversation of what are you actually going to do with your predictions, first and foremost.

The Impact of AI on Data Analysis

Some tools are more advanced in data analysis than others. And on the one hand, it’s great to democratize the ability. That’s great because you don’t have to put people through technical training to get there. But I think what it creates is the necessity that everybody understands how data analysis is done from a more sort of theoretical point of view. To understand what is possible, what are the limitations, what are the biases to look out for, what are the biases, societal biases that will be baked into the data?
And this is true for any output these AI tools generate, but also for any analysis that comes out and also any analysis you do, regardless of AI or not. As you know, there’s going to be these biases in there.
On the one hand, it’s good that there is transparency in these tools. It actually gives you the code it ran to get to that analytical answer. So you can double check and you can check its homework. So in that sense, it’s less of a black box.
But if you don’t have the required data literacy or statistical training, even that little bit of statistical training to understand what does an outlier mean, what is the difference between a mean and a median, you wouldn’t necessarily see what was wrong. And you might be trapped into just believing the answer that comes out.
Just like with any response from an AI tool, people just need the healthy skepticism of just reviewing the answer and checking whether it makes sense.
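To make the mean versus median point concrete, here is a minimal Python sketch with made-up sale prices, one of which is assumed to be a data-entry error. It is only an illustration of why a little statistical training helps when reviewing an answer, whether it came from an AI tool or not.

```python
import statistics

# Hypothetical sale prices with one suspected data-entry error (an extra zero).
sale_prices = [9500, 10200, 9900, 10100, 10300, 99000]

mean_price = statistics.mean(sale_prices)
median_price = statistics.median(sale_prices)

print(mean_price)
print(median_price)
```

The single extreme value pulls the mean to more than double the median, while the median stays close to the typical price. Someone who does not know the difference might accept either number without question.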

Importance of Critical Thinking

It’s just the same as with programming is, for very basic tasks, I can see automation happening and I can see some of these tools taking some of the work away, potentially. Generating boilerplate code, generating a chart, getting up to speed with that and getting the right chart out of the other end. You can definitely see AI accelerating that process. But just like with writing code, if you don’t understand fundamentally how that task is done, you can’t check the outputs and you can’t debug any problems.
And the problems with data analysis are even trickier than with programming, because you don’t get an error message. You’ll get some kind of answer. You can average a numeric column and still get an answer, whether or not that makes sense from a methodological point of view.
The future might be that we have AI built in to help us accelerate those little bits of code and little bits of manual tasks that we might want to automate away. But I don’t see an analyst’s job changing, fundamentally, because we’re still supposed to be trusted business advisors, we’re still supposed to be generating value for the business, and AI is just going to be another tool in our toolbox to do that.

Common Tasks Solved by AI

I do have some AI examples in the book as well. I try to identify places where AI will actually accelerate the process. One of the examples, there’s a chapter where the data set is actually a bunch of PDF files. And so the task is to extract data from these PDFs and then do the analysis. And that’s not something that we usually train people on. It’s quite a niche thing, although PDFs are everywhere.
If it’s a domain I’m not familiar with, if there’s a problem I’m trying to solve where I think there must be a Python library for this, somebody else must have solved this problem in a way that I can use it. AI acts as like a superpowered search. Because it’s not just a search query, you can actually give it some context about what you’re trying to do.
And then just accelerating smaller tasks. I mentioned creating charts. If you want to figure out how to do a specific kind of data visualization, AI can give you the starter code for it.
Although one problem, and this is I think true for Copilot and other kinds of tools that generate code, is that it’s only trained on what it’s found on the internet. And there’s a lot of not even incorrect code, but maybe inefficient code or maybe not quite the right way to do it.
Even for tasks like using a library, you gotta be careful about what the training data is out there.

3 Tech Lead Wisdom

One thing that differentiates good analysts from the best analysts is curiosity. Just wanting to find out the answer to something.
- I don’t know if you can teach that to someone. I don’t know if you can learn to be more curious. But even if just mechanically you try to find the answer to things and persevere beyond the first answer or beyond the obvious answer, that is really such an important skill in life, but particularly in data analysis.
- Really wanting to dig in to find the answer and not resting until you’re satisfied with the answer is a particularly good skill.
Practicing your skills. Whatever your tech skills are, just constantly practicing them is so important.
- Once you’ve got the foundations, the best way to learn is to apply your knowledge and practice. And so just keep doing that.
- In the data realm, that just might mean solving problems for yourself, even if it’s optimizing your fantasy football team or whatever. It doesn’t have to be a money-making business opportunity every time, just something that solves a problem with data is great.
- So practicing your skills to stay relevant and to keep them fresh and to learn new things is vital.
Have a good reason to do data work in the first place.
- On the sort of more organizational side of things, I think it’s very important to have a good reason to do data work in the first place.
- There is a maybe mistaken belief in business that doing data analysis is inherently useful. It’s just inherently a good thing that we should be doing. But if you don’t have a purpose, if you don’t have an end goal, if you don’t have a good reason to do it, it’s not going to be successful.
- And that’s true for analysts as much as for entire organizational strategies.

Transcript

[00:01:04] Introduction

Henry Suryawirawan: Hello, guys. Welcome to another episode of Tech Lead Journal podcast. Today, I have David Asboth here. He’s the author of “Solve Any Data Analysis Problem”. So any data analysis, so I’m sure today you know that the topic we are going to cover is about data analytics or maybe data science. Whatever data problems that you’re facing right now, right? So I think, I hope today David will be able to give some insights how we can actually learn from his experience. So David also has a podcast. Probably, we will touch on a little bit about his podcast. So welcome to the show, David.

David Asboth: Hi, Henry. It’s great to be here.

[00:01:38] Career Journey

Henry Suryawirawan: Right. David, in the beginning, I always love to ask my guests maybe share a little bit more about yourself, right? So if you can mention any highlights or turning points that you think we all can learn from you.

David Asboth: Sure. So I’ve changed careers a little bit a few times. I mean, it’s always been in the tech space. I started off as a software developer. My undergraduate degree was actually in video games programming, because I thought that’s what I want to do. I like video games, so I thought well obviously I’d love to make them. And it turns out it’s really difficult. You have to program some really difficult things around like graphics and there’s actually a lot more maths involved. And so you know that was, it was a very interesting degree to do. And the thing I definitely learned from it is that I really like coding. That’s something I wasn’t really exposed to before that degree. And so I became a software developer and I did that for a few years. I was really enjoying it, writing sort of enterprise software.

As it sort of happens if you’re in a small team, I was also in charge of the reporting. Eventually that became one of my roles as well is that the sort of, we didn’t have a data team as such in the company, and so I took on a lot of that responsibility and ended up delivering answers to internal customers that they had by pulling data from our database. And over time, I sort of started to prefer doing that part of the job, because I found that I was closer to the value generating aspect of the business. I was closer to real business problems, where software development in a lot of cases is a little bit devolved somehow from the business.

Like if you’re a software developer, you don’t necessarily have to understand how the business runs, right? You don’t necessarily have to understand how does the business make profit. Who are the customers? How do the customers make profit? You know, what is the operating model? Those things are not that relevant to software developers a lot of the time. Whereas if you’re a data person, I mean, you can’t provide any value unless you know how the business works. And that’s something I learned in that role. And so I thought, oh, well, maybe I should make a career of working with data instead. So that was my, I guess that was my second pivot from games into software and then from software into data.

And then I did a Master’s in data science, because it turns out it’s called data science. That’s what I found at the time. So that’s something that I should do. I was promised it was the sexiest job of the 21st century and all that kind of stuff. And so I thought, okay, I’ll study that. And so I sort of did a Master’s degree and then transitioned into being a data scientist in industry. And I did that for a few years. Learned some very important, very interesting things about the difference between data science education and data science in practice, which I’m sure we can talk about.

And then again a sort of undercurrent this whole time in my career changes was that I always wanted to teach. Like education is one of the things I’m really passionate about. And I was trying to find various opportunities over the years, but it wasn’t until I landed in the data world that I found my niche of data science and education. And when the pandemic hit, it was the part of it that was fortuitous for me. It was a lot of teaching. All teaching became online. And so I suddenly had all these teaching opportunities that meant I didn’t have to leave the house. It just made things logistically a lot easier. And so in late 2020, I quit my job and started teaching full time as a consultant.

And so that’s sort of what I do these days. I call myself a data generalist, because I’ve done all these different things and I haven’t really pigeonholed myself into any particular role. Mostly these days I do educational work. So designing and delivering workshops, anything from half a day lectures to 10 week accelerators in data science and Python and things like that. And I’m really enjoying it because there’s a variety of clients to work with, a variety of problems that people are trying to solve that I can help with. That’s where I’ve sort of landed. I mean, I don’t know if anybody can learn from that. I mean, what I’ve learned is that just follow your interests. Like whatever has interested me, I just put everything else down and just went towards that. And that’s how I’ve sort of ended up in my like fourth job.

Henry Suryawirawan: Thank you for sharing your story. I think it is very interesting, right? I think many people started their computer science study because of the interest in gaming, right? So maybe people love playing games. I also did my computer graphics course back then. I think it was, yeah, difficult if you didn’t get the math, I guess.

[00:06:33] Half Stack Data Science Podcast

Henry Suryawirawan: So I think the career that you took probably is also quite common for some people, right? So they started by being a generalist software developer, but found their way into a specific area, dived deep into that and became a specialist. And you also host a podcast, Half Stack Data Science. So maybe tell us a little bit more about that. What can we learn from that podcast?

David Asboth: Yeah, so that podcast grew from my first real data science job after finishing my degree. And my co-host, Shaun was actually the guy who hired me. He was the hiring manager who hired me into that role at the time. And you know, very early on in that job, I realized, and Shaun was the same, he sort of came from academia into this kind of job. And we both quickly realized that what we thought the job was going to be is not at all what the job is like in reality. You know, a lot of data science education is focused on tools, techniques, algorithms, and so you get this picture that, okay, well, I’m going to be deriving formulas and doing all this complicated machine learning at work. And then you go in and it turns out, you know, a lot of the job is navigating the complexities of an enterprise environment working with sort of office politics. And things like, oh, the data is not actually available, and no one actually has any solid research questions. So we have to find those as well.

And so we were very, very quickly hit on this difference between education and reality. And so we started having these conversations internally about, okay, what are the things we’ve learned about how industry is different? What are the skills that people should actually be trained on or at least be warned about up front? So people have a better picture of what the job looks like. And one day we just said, well, we’ve had these conversations quite a lot. And every time we went to a meetup, we would talk to like-minded people and have these same conversations. And we thought, well, we might as well just put them on the internet for other people to learn from.

And so initially the podcast started off with the two of us booking a meeting room at work, and taking our work Samsung phone, putting it on the table and just having a chat. And then eventually we became a little more professional and got some proper tooling and proper microphones and things, but that’s how it started. So currently we’re running a season where we’re talking to educators in the data space. Because I think at this point in time, getting the education of future analysts and future data scientists right is really important. So we’ve spoken to people who are Python trainers, but also people who are spreading the idea of data literacy. You know, we’ve talked to a variety of people and it’s been very interesting to see what their perspective is. And there is a lot of commonality with our philosophy, which is trying to morph education into something that is much more applied and much more ready for the real world.

Henry Suryawirawan: I’m quite interested in the name itself, like halfstack. Why are you calling it halfstack? I mean, is that the opposite of full-stack, right? Why data science is half-stack? So probably you can explain a little bit.

David Asboth: Yeah, no, that’s a good question. It’s a response to the full-stack idea. I mean, one thing we never liked was this idea that a single data scientist has to be this unicorn who does everything in a company. I mean, that’s just, that’s not the reality. That would never function in a business, especially a solid decades old enterprise. You can’t just drop a single data science unicorn in there and hope that they’ll make the company lots of money with a machine learning model immediately. I just remember there was one presentation we gave somewhere and Shaun and I talked about things like the difference between academic data science and business data science and just somewhere on those slides, he coined the phrase half-stack data science. He just put it on the slide. And it sort of, it just sort of stuck. And the idea is that because data science is so different in the real world as opposed to in education, then you need people with this sort of hybrid skill set. A more generalist skill set. And we don’t really think that having someone be full-stack is realistic. Or at least, you know, being an expert in everything from data cleaning to statistics to software development to business strategy development.

Henry Suryawirawan: I think these days it’s pretty hard to be full-stack. And even the stack gets deeper and deeper, right? So I think there’s so many technologies that these days people need to learn. So I think, yeah, hence probably half-stack, a quarter-stack probably makes…

David Asboth: Yeah, exactly! Fractional stack, yeah.

[00:10:46] Real-World Data Analysis Gaps

Henry Suryawirawan: So let’s go into the topic of today’s conversation, which is about the data analysis. So I think in the beginning you mentioned, you realized there is a big gap between what, normally, data analysts or data scientists learn throughout their education or maybe bootcamp courses, whatever that is, compared with the real life problems, right? So what are the typical gaps that you see, challenges for people from academia or maybe learning from their study and thrown into the deep end, into real world problems. So what are typical gaps or challenges that we have to think about?

David Asboth: Yeah, so I’ve taught accelerators and boot camps and I was faced with this problem as well of trying to teach the right skills, but also within the framework of what I was expected to teach. And so, you know, there’s a list of technical topics that you absolutely have to teach in order to make someone a data analyst, right? You need to be able to read data, combine data, clean it, identify missing values, outliers, all this kind of technical stuff. You know, we usually teach like SQL, some kind of business intelligence tool like Power BI or Tableau. Maybe Python, ideally have some kind of programming language in there or maybe R. So these are sort of technical skills that you just have to teach in the foundational training, because otherwise you can’t do the job right. Excel counts as well, so all the different things you can do in Excel. And then sometimes depending on the course you’d also teach machine learning. You know, how to build a machine learning model, how to do some of the sort of the practitioner things like cross validation, other things like that.

But technical skills are not the whole job, right? And I’m sure multiple guests on your podcast have probably said the same thing that software engineers don’t spend most of their time writing code and data analysts don’t necessarily spend most of their time analyzing data. There’s all sorts of other things to do, like identifying problems to solve in the first place. That’s not something we teach in foundational training. How do you have a conversation with another human and read between the lines of what problems they’re actually trying to solve? And you don’t even necessarily realize that that’s a skill you’re going to need. And my problem with that kind of thing is that we often hand wave it away and say, oh, well, of course, people would just learn that on the job. But how? Right, there’s no one to actually teach them in any sort of formal way. It’s just sort of, oh, you just pick up these skills as you go.

So anything about navigating, like, priorities in a company where five different stakeholders ask you for seven different projects. How do you know what to work on? How do you generate a value statement for any analytical work? So how do you think about the actual quantifiable value that this project is going to have? What kind of impact it’s going to have? And then the other things is, so you’ve got building all these machine learning models, but then how are these models going to be used by the company? And so sort of planning after the first model component as well. That’s something that you just sort of have to learn on the job rather than being prepared for it.

And then finally the other thing that I try and teach as much as possible is like the most realistic data sets that you can work on rather than the toy examples that we teach. Now almost everyone who’s taken some kind of data science course has predicted the survivors of the Titanic, right? That’s one of the classic data science, machine learning examples. And I think the real world applicability of that problem is not that high. So the fact that the data sets in education are not that realistic is one of the problems. And the other one is that often they’re actually quite clean. You know, there’s a meme in data science, which is 80 percent of data science is spent cleaning data, and the other 20 percent is spent complaining about the fact that we have to clean data.

But it’s true that often, you know, we’re the ones who have to either collect the data or find where it is in the first place, combine it, document it for the first time. And these are all things that take time, and these are all things that we don’t think about much in the classroom. And, you know, one reason for that is we just don’t have the time. We have to teach these technical skills. But I think there is room to make education better at the foundational level by incorporating more of these real world elements.

Henry Suryawirawan: Yeah, I didn’t study data science, data analytics, but I did some kind of data projects before, right? Data reporting, real-time analytics, and things like that. I think the level of complexity and ambiguity, I guess, increases as the amount of data sources, right? And also depending on the cleanliness of the data, right? So I think when you study in a, maybe I don’t know, bootcamp or a course, right? Typically you’re given a data set, which is kind of like clean enough and, you know, well-defined and things like that. But I think as soon as you hit the real industry, right? So you realize that it’s actually not as simple as that. Sometimes also like identifying the problems to solve, right? The question becomes much more abstract, probably, right? There’s no like real formula. And maybe there’s no even like 100% solution that you can actually come up with.

[00:15:36] Understanding the Business/Problem

Henry Suryawirawan: And hence, the first thing that you mentioned is actually identifying the problem. So how can people who learn more about technical stuff, right? Because typically it’s quite straightforward in a bootcamp, like, okay, here’s the data set. Here’s what I want you to find. And it’s kind of like straightforward. But actually in the real business world, sometimes identifying the problem is a challenge, not to mention as well that you don’t understand the domain of the business. So maybe from your experience, some tips that you can teach us here.

David Asboth: Yeah. I mean I was lucky in the company I worked for. So we worked in the used car industry. And like on I think my first day was at an actual used car auction that the company was holding. So I actually got to see the company in operation. There’s no mention of data on that day. It was all about like this is how the business operates. And I think that’s a really good model. I think all data scientists and all analysts should spend more time in the business, outside of the data sets, just in the actual business to see how it works. They should be shadowing their colleagues who are in charge of either entering the data or just doing business operations. The sales people, the customer engagement people, everyone who’s contributing some way to the company. I think data science should understand all those functions, because then you have the context, and then you understand the columns that you’re seeing in the data.

Again, in education, we normally say, okay, here’s the data set, and then here’s something called the data dictionary, where each column is labeled, and then we tell you what each of the columns means, which is great, right? For educational purposes, it’s great. That thing doesn’t exist in the real world. Usually we have to write our own data dictionary. But even then, that’s a very narrow view of looking at it just as a table of numbers. What we should look at it as is the wider context in the business, right? So where does this data come from? Who enters it? When do these records get generated? So just understanding that data generating process is really important.

And yeah, you said about understanding the domain. It’s really important for a data scientist to understand the domain they’re working in. And that’s the kind of thing you can learn that on the job, for sure. Like you don’t need to spend a year as a used car salesman before you go into a data science job in the used car industry. But once you’re in there, you know, having that additional curiosity and context about the domain is definitely important and not something that you can delegate to other colleagues.

Henry Suryawirawan: Yeah, sometimes I think also, like, data scientists or data analysts typically, right, they love playing with data or their tools, their SQL or BI tools, whatever. Don’t forget that you should also collaborate, right? You’re good with crunching the data, cleaning the data, but if you collaborate with the domain expert, maybe sitting side by side, show the data and ask questions, why this matters or what kind of data that you are dealing with, right? So maybe if you collaborate more, you’ll get to learn much better, because otherwise you’ll probably be stuck identifying the problem and also giving the solutions.

[00:18:28] Result-Driven Data Analysis

Henry Suryawirawan: Which brings me to the next topic of discussion, right, in your book. You mentioned this result-driven approach being pragmatic when you come up with data analysis. Because sometimes I can see we crunch data, we use sophisticated tools and techniques, right? But doesn’t necessarily bring results. You know, after few weeks of time you come up with the result, maybe the business or the stakeholders don’t really get it, right? Or don’t feel satisfied with the answers. So tell us more a little bit about your result-driven approach, so that we can be more pragmatic in our data analysis.

David Asboth: Yeah, I’m glad you mentioned collaboration, because that’s partly what it comes down to, is as data people we need to remember that our primary goal in any company is to provide value of some sort, whether that’s clearly monetary or saving time through automation, whatever it is, that is our primary goal. And something I say to students is if you can solve a stakeholder’s problem, answer their question with a single bar chart, then fine. It’s the problem solving that matters. It’s not the level of sophistication in your tools.

And that’s what I’m trying to get across in the book is that the key thing is to have an end goal to start with. Like it’s one of the first things you need to do is define an end goal that you want to reach. And it might be like the simplest version of the problem. It might be the smallest possible answer. I call it the minimum viable answer in the book, which is, you know, what is the absolute minimum amount of work that you can do to get something, a result that will take you to the next step. And then, you know, it’s usually then a conversation, as you said, a collaboration with stakeholders to say, look, this is what I did based on our conversation. What do you think? What direction should we take this in? And just having that at the forefront of your mind throughout the analysis, I think is really helpful because as data people, we can go down the rabbit holes.

If you’re working on a data set, usually you get more questions generated than answers. Every time you look at a new column, you’re like, oh, there’s missing values here. Oh, there’s a relationship between these columns. And so you can keep exploring it forever. And as you say, never get to a result. So if you already know upfront what your results should be the whole exploration has a goal that you’re directing towards, which, you know, makes it quicker to get to an answer. And it will also make the answer more useful and contextual.

Henry Suryawirawan: So very interesting that you mentioned we should start with an end goal in mind, right? So I think typically, maybe, from my experience, I didn’t see many data analysis work that way. You know, giving an end goal first. And typically it’s like stages, right? So you start with the first crunching of the data, the milestone, and then you give the preliminary result, right? But I think understanding the end goal, how should probably the result look like, confirming with the stakeholder, that’s what you advise in your book, actually, right? So you do the first iteration, do as minimum as possible, which is the minimum viable answer that you have. And ask for feedback and then shape from there, right, rather than going down the rabbit hole.

I think many people, especially for me when I work with data, it’s always fun to crunch data, you know. Firing different SQL statements and wait for the results and storing it to somewhere, right? It’s always fun and maybe generating reports or charts which are fancy. But I think sometimes it doesn’t solve the problem.

[00:21:44] Feedback Iteration

Henry Suryawirawan: So I think it’s very, very important: have a goal in mind, the end goal in mind. Clarify that with stakeholders and iterate, right? The iteration here, how would you suggest people to do, right? How short should the iteration be? Or what is too long for you so that you have to be wary about? So maybe a little bit of tips on the iteration.

David Asboth: Yeah, that’s a great question and something that we often wrangle with. On our podcast, we had a whole episode dedicated to estimating time, because in the software world, what I was used to is that we got pretty good at estimating how long a task would take, right? It’s like, oh, we need to add two buttons to a web page. And you’re like, okay, I need to write some functions in the background. Maybe I need to like create a new database column. I need to write this code. It’s probably going to take me a day. And we were pretty close most of the time for these little chunks of work.

In the world of data analysis, there’s so much uncertainty, partly because the problem is ill-defined, partly because we don’t know the data very well, partly because we could find anything of interest and go down all these rabbit holes, that it becomes very difficult to estimate how long something will take. But you can’t just say that to a stakeholder, right? You can’t just say, I have absolutely no idea when I’m going to get back to you on this. It’s unfortunately not viable in the real world. So what we usually said was, again, we didn’t call it a minimum viable answer at the time. We just said, we’ll do some work to get towards this particular answer, which we’ve all agreed on looks like a plausible first step. So in a week’s time, we’ll report back or in a few days we’ll report back. And then rather than saying when the work will be finished, we would just check in at intervals.

So that’s one way to do it is to just say we’re going to tackle this problem, we’re going to work on it for a while. This is the end goal we have in mind, and we’ll check in in a few days. And the check-in might be, here it is. Here’s your minimum viable answer. Or the check-in is this question that you had actually is very difficult to answer with the data we have. Here’s the other kind of data that we need to collect before we can give you an answer. Again it’s about collaborating and keeping your stakeholders in the loop. So, yeah, shorter iteration cycles basically are what you’re after.

[00:23:48] Data Dictionary

Henry Suryawirawan: So I think in the typical software development project or maybe product development, right? Many teams actually don’t start with a good data design, so to speak, right? So they come from the operational point of view, you know, they just do transactions, store the data, and that’s it, right? So there will be tables, mostly relational tables, probably. And then these tables will be given to the data analyst to derive some insights, right? So I think in your book, in all the problems that you have in each chapter, right, you always come up with the data dictionary. I know probably it’s a bit luxury in the real world to actually see a good data dictionary. But tell us really the importance of this data dictionary. And is it just defining, you know, table, column types, and column description, or is there something else beyond just that?

David Asboth: Yeah, in the real world, it is rare to have a document like that. I mean, you might have pieces of it scattered around but not collected together in something as coherent as a data dictionary. I mean the purpose of the data dictionary is at a surface level to record all the different columns in the data and what they mean, what data type they should be and then what they represent. Sometimes, it’s because the column names are abbreviated, so it just tells you sort of what the column actually means or what the abbreviation stands for. But deeper than that, what you want a data dictionary to tell you as well is the process that generates those columns.

So, for example, just to give you an example, we had sale data from the used car industry. Every time there was a sale at an auction that was recorded and we had a couple of different columns. One of them was called the sold date and the other one was called the date sold, which just sounds like we have just this redundancy of two columns that measure the same thing. But it turned out they don’t measure the same thing. What it turned out was one of them was the date that the sale happened. So when the auction happened, but there was another date that could be in the future, because sometimes there’s a dispute around a used car. You know, like they buy it and they look at it and there’s a scratch and they didn’t see it before and so they dispute with the vendor. Maybe they’ll take the negotiation offline and agree on a different price in a couple of days and then that date gets stamped separately.

And so if you have a very high level surface level data dictionary that just says this is the date sold and then for the second one it also says something like this is the sold date. That’s not useful. It needs to give that deeper context of why we have these columns in the first place and what are the possibilities of the different ways that the values could be filled in. So ideally, the data dictionary also talks about the data generating process and how the sort of the business operations translated to this particular data set.
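As a rough illustration of the point above, here is a minimal Python sketch of a data dictionary that records the data generating process alongside each column's meaning. The class, the field names, the column names `sold_date` and `date_sold`, and the assignment of which column holds which date are all assumptions made for the example; the conversation does not say which of the two columns was which.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEntry:
    """One row of a data dictionary (field names are illustrative)."""
    name: str
    dtype: str
    meaning: str
    generating_process: str  # who or what writes the value, and when


# Hypothetical entries: which column holds which date is assumed here.
DATA_DICTIONARY = [
    ColumnEntry(
        name="sold_date",
        dtype="date",
        meaning="Date the auction sale happened.",
        generating_process="Stamped when the auction takes place.",
    ),
    ColumnEntry(
        name="date_sold",
        dtype="date",
        meaning="Date the final price was agreed.",
        generating_process=(
            "Stamped separately when a disputed sale is renegotiated; "
            "can fall days after the auction."
        ),
    ),
]


def undocumented_columns(entries):
    """Return names of columns whose generating process is left blank."""
    return [e.name for e in entries if not e.generating_process.strip()]


def describe(entries):
    """Render the dictionary as plain text, one column per paragraph."""
    return "\n\n".join(
        f"{e.name} ({e.dtype})\n  Meaning: {e.meaning}\n  Process: {e.generating_process}"
        for e in entries
    )


print(describe(DATA_DICTIONARY))
```

Keeping the process as its own field makes it possible to check, with something like `undocumented_columns`, that no column is described only by a restatement of its name.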

Henry Suryawirawan: And I think, in particular, it’s becoming more important if let’s say you have a multi stages kind of process that derives the data, right? Hence, probably these days, people refer to it as data lineage, right? Where you start from a typical business process that generates the data but then it goes through different transformations, maybe different systems, different processes, until it gets to the final sink or data, the last place where the data gets stored, right.

So I think the data dictionary will be much more important, because you don’t just see, you know, the column names and the values, right? Which is sometimes misleading, just like what you mentioned, right? I think date sold, sold date is not just a common thing, right? But it’s also like, probably you can see it all over different data sources because different people, maybe, are creating the column, different teams creating the column. And also maybe different departments are using the same term, but it actually means different things, right? Hence, probably the domain-driven design kind of a practice makes more sense.

[00:27:18] Data Modeling

Henry Suryawirawan: And I think not just data dictionary in your book, you also mentioned this thing called data modeling. So you mentioned this is probably the most important step as well before you start data analysis. Tell us a bit more about data modeling. What is this step and what should people do in the data modeling exercise?

David Asboth: So one of the ways I think about this is that there’s a good definition of data science, it’s been around for a while, which is the process of turning data into information and information into insights. And it’s this first step of turning data into information. And again, the terminology is very sort of blurred, but when I think of data turning into information, it means the data is whatever you have lying around. As you said earlier, you alluded to operational transactions that happen have to be stored in a database. They weren’t collected with analytics in mind necessarily, they’re just database records that power some kind of customer facing application. And as analysts, we want to come in and analyze that data, but in its raw form, it’s almost never usable. We need to do something to it. And when we do things like what we might call data cleaning as part of an analysis, what we would like to do is to do that data cleaning once and have a cleaned version of that data stored somewhere.

Part of data modeling is doing that data cleaning once so that the logic of the cleaning is encoded already in the data that we use. I mean lots of companies will have this problem where, for example, you have a bunch of Tableau dashboards, and in every dashboard, there’s a formula that calculates some relevant metric, but that calculation is duplicated across every dashboard. So if you ever make a change to it, you need to remember which dashboard it’s also in, and it becomes a sort of mess that you don’t necessarily have a handle on. If you have a clean data model where that metric is already pre-calculated, and the dashboards just read from that clean data model, then you’ve got that problem in one place. So if you need to change the metric, all the dashboards will update. I mean, that’s the very sort of simplistic way to look at it.
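The dashboard example above can be sketched with Python's built-in `sqlite3` module. This is only a toy under stated assumptions: the table `raw_sales`, the view `clean_sales`, the `margin` metric and the two "dashboard" functions are invented for illustration and are not taken from the book or from any real Tableau setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_sales (sale_id INTEGER, region TEXT, sale_price REAL, cost REAL);
    INSERT INTO raw_sales VALUES
        (1, 'north', 5000, 4000),
        (2, 'north', 3000, 2700),
        (3, 'south', 8000, 6000);

    -- The metric is defined once, here, instead of inside every dashboard.
    CREATE VIEW clean_sales AS
    SELECT sale_id, region, sale_price, sale_price - cost AS margin
    FROM raw_sales;
""")


def margin_by_region(conn):
    """Stand-in for one dashboard: reads the precomputed metric."""
    rows = conn.execute(
        "SELECT region, SUM(margin) FROM clean_sales GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


def total_margin(conn):
    """Stand-in for a second dashboard: reads the same metric."""
    return conn.execute("SELECT SUM(margin) FROM clean_sales").fetchone()[0]


print(margin_by_region(conn))
print(total_margin(conn))
```

If the definition of `margin` ever changes, only the view needs editing, and both consumers pick up the change.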

And the other reason to do data modeling is to sort of capture business entities in the right way. So one of the problems we had to work with was, what is a customer in our business? And that sounds like a very simple question. Like obviously a business knows what their customer is, but we had different business areas that worked with individuals. So individual used car dealers. So they were people. But then we also had customers who were entities, and they might be, again, they might be a single dealership or they might be like a parent group. And so, of those different entities, which one is a customer? Well that depends on who you ask, and it depends on the purpose that you want to use the data for. And so what you need to do then is have some sort of customer data model where everybody agrees on the definition of a customer. And then if somebody needs to know how many customers do we have? Ideally, all they have to do is just count star from that table. That’s the sort of dream scenario where you’ve done all the business logic and all the work upfront to have a clean data model that can then be analyzed much more easily.
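The "count star" scenario might look something like the sketch below, again using `sqlite3`. The `raw_accounts` table, its columns and the sample rows are hypothetical, and the definition used (a customer is any account without a parent group) is just one possible agreement chosen for the example, not the definition David's company settled on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_accounts (
        account_id INTEGER PRIMARY KEY,
        name TEXT,
        kind TEXT,            -- 'individual', 'dealership' or 'group'
        parent_id INTEGER     -- owning group, NULL if independent
    );
    INSERT INTO raw_accounts VALUES
        (1, 'Acme Motor Group', 'group', NULL),
        (2, 'Acme North', 'dealership', 1),
        (3, 'Acme South', 'dealership', 1),
        (4, 'Corner Cars', 'dealership', NULL),
        (5, 'J. Smith', 'individual', NULL);

    -- Assumed definition: a customer is any account with no parent.
    -- The business logic lives here, once, not in each analysis.
    CREATE TABLE customer AS
    SELECT account_id AS customer_id, name, kind
    FROM raw_accounts
    WHERE parent_id IS NULL;
""")


def customer_count(conn):
    """With the model in place, counting customers is a plain COUNT(*)."""
    return conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]


print(customer_count(conn))
```

A different agreed definition would change only the `WHERE` clause that builds `customer`; the counting query stays the same.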

And this, again, is something that we don’t really talk about in foundational training. We say, yes, you need to clean your data. But we don’t say once you’ve cleaned your data, you should probably have a clean version of it somewhere and stop cleaning your data multiple times. Stop repeating yourself. And so that’s why I dedicated a whole project in the book to data modeling, to sort of practice this idea of taking raw data and turning it into a specific structure, which is tailored, again specifically tailored, to the questions you’re going to ask in the business.
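A minimal sketch of the "clean once, then just count" idea described above, using Python's built-in `sqlite3` module. The table names, the sample rows, and the rule that a customer is a distinct dealership name are all assumptions made for illustration; they are not taken from the book or from David's project.

```python
import sqlite3

# Hypothetical raw operational records: the same dealer appears with
# inconsistent spelling, and one row is a parent group rather than a dealership.
raw_accounts = [
    (1, " Acme Motors ", "dealership"),
    (2, "acme motors", "dealership"),
    (3, "Best Cars", "dealership"),
    (4, "Best Cars Group", "parent_group"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_accounts (id INTEGER, name TEXT, account_type TEXT)")
conn.executemany("INSERT INTO raw_accounts VALUES (?, ?, ?)", raw_accounts)

# The cleaning rules and the agreed definition of "customer" live in one place.
# Assumed definition for this sketch: a customer is a distinct dealership name
# after trimming whitespace and lower-casing.
conn.execute(
    """
    CREATE TABLE customers AS
    SELECT DISTINCT lower(trim(name)) AS customer_name
    FROM raw_accounts
    WHERE account_type = 'dealership'
    """
)

# Anyone who needs the number of customers now only has to count rows.
customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(customer_count)
```

If the definition of a customer changes, only the statement that builds `customers` has to change; every report that reads from that table picks up the new logic.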

[00:30:43] Data Cleaning

Henry Suryawirawan: Specifically about data cleaning, actually this is probably like what you said, the meme, right. 80 percent of your effort probably is spent, first, understanding how dirty your data is, and then like, because there are many variations, right, sometimes it could be the user input that is probably not clean. Second thing is there’s no validation in the software that captures it. And the third thing, for whatever reason, right, people put different formats, you know, like from different systems probably. They don’t have a uniform format, so they just use whatever format that make sense for them.

So data cleaning probably is something that is really, really hard. And first of all, right, if you have millions of records, for example, you probably won’t understand how clean they are, because you might look at the first, I don’t know, 100 rows and you just deduce, okay, this is typically the data. But actually there are many other columns or many other data that you don’t see, right? So maybe a little bit of tips. How can we actually do this data cleaning much more efficiently so that we don’t fall into the gotcha, where actually you clean maybe 50 percent of the data, but the other 50 percent is something else, you know, like a different rubbish altogether. So maybe from your practical world example.

David Asboth: Yeah, I think what’s funny is we used to have this term big data, right? To describe data that cannot be processed on your laptop. And you don’t see that term around very much. And that’s because processing power, even on individual laptops, has grown so much. And even getting access to a remote cluster that has a lot more resources than your laptop is pretty easy these days. So I don’t think we often have that problem where you can only clean half of your data and the other half you can only revisit when you run some code later or something. Now, I mean some companies obviously have data that’s so huge that they need special methods, but I think in most cases that’s not the case anymore.

But when it comes to data cleaning, I mean, one thing I would tell students is don’t try to clean the whole thing at once before you do anything with it. Because you’re going to find some issues down the line, anyway. So again, it’s just this pragmatism of figure out what part of the data you need right now and have a look at it.

And there are some checks you can do, right? There are some surface level checks you can check, like what are the unique values in this column? Are there missing values? Are there outliers? You can do those things. But some of the more complex problems or patterns in the data that will either invalidate your analysis or require you to redo some of your work. You won’t notice them until you start working. And so, again, don’t be wedded to this idea that you have to make a perfectly clean data set before you start working. Do again the minimum that you need and just start doing the analysis with the knowledge that you’re probably going to have to go back to step one again and again. So if you would ever watch me doing an analysis, it’s never like I have the analysis in my head and I just have to type it out. It’s an active process where you do some stuff, and you go, oh no, this doesn’t make sense at all because I found something in the column. I have to go right the way back to the top and start again.

And so you might have to rewrite some of your code, you might just have to rewrite a bit at the top and keep going. And there are cases in the book where, you know I have example solutions for each project. And I go down a specific particular rabbit hole that I follow to get my particular answer. And you’ll see things in there where I say, oh, it turns out this is the case. There’s an e-commerce data set in there, and then we have some products that are miscategorized. And in the way that I wrote the example solution, it’s trying to be as realistic as possible.

So I haven’t done the analysis and then written it up cleanly. I sort of write it up as the real process. And so halfway through, you’re like, oh, these labels don’t make sense. We need to go back and fix this quality issue in the data before we can carry on. That’s the realistic way to think about it: you will find data issues throughout the process. So don’t worry about getting it perfect the first time.
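The surface-level checks mentioned above (unique values, missing values, outliers) can be sketched with only the Python standard library. The rows below are made up, and the 1.5 × IQR fence is one common rule of thumb for flagging outliers, not a method the conversation prescribes.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical (category, price) rows from a messy e-commerce export.
rows = [
    ("Shoes", "49.99"), ("shoes", "52.00"), ("Hats", ""),
    ("Hats", "19.50"), ("Shoes", "55.00"), ("Hats", "21.00"),
    ("Shoes", "4999.00"), ("Hats", "22.00"), ("Shoes", "48.00"),
    ("Shoes", "58.00"),
]

# 1. Unique values: reveals inconsistent labels such as "Shoes" vs "shoes".
category_counts = Counter(category for category, _ in rows)

# 2. Missing values: count the empty strings in the price column.
missing_prices = sum(1 for _, price in rows if price == "")

# 3. Outliers: flag prices that fall outside 1.5 * IQR beyond the quartiles.
prices = [float(price) for _, price in rows if price != ""]
q1, _, q3 = quantiles(prices, n=4)
iqr = q3 - q1
outliers = [p for p in prices if p < q1 - 1.5 * iqr or p > q3 + 1.5 * iqr]

print(category_counts)
print(missing_prices)
print(outliers)
```

Checks like these only catch what is visible column by column. The miscategorized products David describes would still surface later, during the analysis itself.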

Henry Suryawirawan: Yeah, I think it’s worth emphasizing, right? Probably 100% accuracy, sometimes it’s not possible, right? Especially if you’re dealing with really large data, right? So maybe some kind of percentage where you would accept, okay, maybe these are the anomaly kind of data. Maybe state that assumption or maybe state that signal that you can see from the data. And I think there are plenty of useful tools these days that can actually give you a sense of like a distribution. For example, in a column, how different or how is the variance of the data inside, right? It can give you a statistical distribution, or it can give you some kind of patterns that you can probably deduce what kind of data is inside. So use those kinds of tools.

[00:35:05] Data Lake

Henry Suryawirawan: And I think speaking about big data, I think these days people want to build like a data lake in a company where you put everything together into a data lake. Maybe in your practical experience, is there some kind of different challenges that people have to deal with dealing with a data lake? Or maybe big data in general as well?

David Asboth: Yeah, I think it’s very tempting to take absolutely any data that you have lying around and dumping it somewhere and saying, oh, we’ll come back to it when we need it. And so the technology is there to allow people to do that quite easily. And the problem with that is that there’s no thought given again to the end product of like what are we going to use this data for? And dumping stuff into a data lake because at some point in the future we might need it, is not necessarily the best approach. This can create a lot of problems down the line. And if you think about the phrase data science, half of it is the word science. So that’s not how science works, right? Science, when you need to collect data, you have a hypothesis, you set up an experiment, you actually have a sort of theoretical framework to build around before you even think about the data part. And I think some of that could be applied to sort of the business world, where we don’t dump stuff into a data lake for the sake of it. We think a bit more about, you know, what is the problem we’re actually trying to solve, therefore, what is the data that we need, therefore where should we store what information.

Henry Suryawirawan: Thanks for the tips. I think, yeah, because of these cloud technologies and potentially storage cost is cheap, right? So they will just dump everything and maybe think about it later, how we can use the data. But I think sometimes it’s not wise, simply because, yeah, the amount of data is just large, right? And how to deal with it, what kind of insights probably is just difficult if you start with that big amount of data.

[00:36:50] Common Data Analysis Tasks

Henry Suryawirawan: And I think these kinds of challenges are quite typical in a day-to-day world. But maybe from your experience, what are the typical business problems that a data analyst should know about, should equip themselves with? Maybe in your book you mentioned things like categorization, dealing with time series. Or maybe what are some of the favorite typical problems that people should be aware of?

David Asboth: Yeah, I picked the projects in the book specifically to address topics that I thought were missing from foundational data training, but that actually come up a lot in the business world. You mentioned time series forecasting. That’s one of the things I talk about a lot with students is, you know, we usually have like maybe one session on time series forecasting. And we’ll teach them a little bit about how to reshape time data, how to think about time data differently from tabular data, and how some of the methods are different. We don’t spend a lot of time talking about like econometrics or anything which is where there’s a lot of time series forecasting problems. But I think the opportunity to forecast things in the real world is actually, there’s a lot of those opportunities and it’s actually much bigger than we let on in basic training. So that’s why I have a project dedicated to time series data.

And then there’s also this other idea of working with categorical data. Now that’s something that we mention as an aside in foundational training. We’ll say, yes, sometimes your data is categorical and here are a couple of methods that you can use to transform that data into something else. But if you work with operational data, like people filling in forms and entering things in, records into a system, anytime there’s a drop down, you’ve got categorical data. And so it’s actually a lot more prevalent than we let on. We spend a lot of time talking about correlation. We spend a lot of time looking at distributions and things for continuous data. But we don’t talk about methods for categorical data enough. And one problem with that is then you’re not equipped to deal with all these columns that you’ll actually see in the real world. But the other problem is that people accidentally shoehorn continuous methods into categorical data.

So I even break down an example in that chapter in the book. There’s a famous heart disease data set where you’re trying to predict whether someone has heart disease based on various different measurements. And like the entire data set is numeric. So it looks like, oh great, we have all this continuous data, we can just throw a correlation at it. We can throw all these continuous methods at it. But if you actually read the data dictionary, again, going back to what we said before, you actually see that most of those values are categories. And they’re like, one of them is something like the slope of, I guess, like a table or a treadmill or something that was during the test. And it’s not a measurement. It’s not an angle of the slope. It’s just one of some values.

So they’re not on a continuous scale, so if you start applying methods that are meant for continuous data on that column, you’re going to make incorrect inferences from it. And the really difficult thing about this is that you don’t get an error message if you do that. The data analysis tools are not going to tell you, hmmm, are you sure your methodology is correct? No, because you’ve just said, I want an average of this column, but it doesn’t make sense. It doesn’t make sense in that context to average that column. And so the difficulty here is to remember that you need to think through your methodology harder because the computer is not going to tell you otherwise.
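A small sketch of the "no error message" trap described above. The column name, the codes, and their labels are hypothetical stand-ins for what a data dictionary might say; they are not the actual definitions from the heart disease data set.

```python
from collections import Counter
from statistics import mean

# Hypothetical data dictionary entry: the integers are category codes,
# not measurements on a continuous scale.
slope_labels = {1: "type_a", 2: "type_b", 3: "type_c"}
slope_codes = [1, 1, 2, 3, 3, 3, 1, 2, 3]

# This runs without any error or warning, but the result is meaningless:
# the codes are labels, so their average does not correspond to any category.
meaningless_average = mean(slope_codes)

# A frequency table respects the categorical nature of the column.
slope_counts = Counter(slope_labels[code] for code in slope_codes)

print(meaningless_average)
print(slope_counts)
```

Both calculations succeed, which is exactly the difficulty: only the data dictionary tells you that the first one should not have been run.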

Henry Suryawirawan: Yeah, hence, I think the data dictionary, again, that you mentioned, it’s very, very important, right? Understand where the data gets generated, right? Which business process, which system, what kind of inputs that can be possible, not just looking at the data and create your own assumption.

[00:40:23] Prediction Analysis

Henry Suryawirawan: So I think that’s pretty dangerous. And I think when you mentioned about prediction, there are a lot of problems that a data analyst has to come up with which is to actually derive predictions or maybe models to actually predict a result, right? And this is typically an unknown problem where you don’t actually know the accuracy of what you come up with.

So how do you deal with that kind of ambiguity, first of all? And the second thing is, how can you come up with a much better prediction? So maybe something in the typical real world, do you do much more rapid iteration and test it in production before you actually come back and derive a second iteration of what you did? So maybe some tips here as well.

David Asboth: Yeah, no, that’s a great question, because prediction is obviously something everybody says they want. The question is, you know, what is the output of that work? That’s something we found out the hard way on a project is, you know, I built a predictive model for something that was, from a technical point of view, it was accurate enough to use. And when we tried to put it into production, we found various organizational barriers to it. Like the data that the predictive model requires doesn’t arrive in time. So we can only make the prediction when it’s too late. And then the clients that we would use this with didn’t actually have the levers in their business to change anything based on our predictions. So it was a sort of two-fold failure from an organizational point of view.

And from then on, we were much more strict about, again, starting with the end, of like why do you want us to make these predictions? And I think as a data person, that’s a question you should ask immediately when somebody says, I want you to build a predictive model or we should be able, or we should be predicting this thing is, okay, but what are you going to do with the predictions? What is going to change in the business? How are you going to respond to these predictions? And so it’s nice to have that conversation up front, because then you know your stakeholders are forced to think about, okay, if we had this predictive model, what would we actually do with it?

So again, it’s not a technical challenge. Because I think the technical challenge of prediction is pretty well-catered for. There are lots of libraries to do machine learning. There are lots of tips and tricks out there. But the organizational side of it is really where these projects are won or lost. So my biggest advice would be again to have that human conversation of what are you actually going to do with your predictions, first and foremost.

Henry Suryawirawan: Yeah, you mentioned something very interesting, right? So organizational challenge, so not necessarily all the time it’s a technical problem or the data problem, but actually organizational challenge. And I think what you mentioned also very, very insightful in my opinion, right? Don’t just build any predictive model as if like you just want to learn different algorithms and tools, right? And use whatever fancy techniques. But actually thinking about how is the model going to be used in the real life scenario. What kind of value can be derived from there? Is it even possible to be used by the business?

[00:43:15] The Impact of AI on Data Analysis

Henry Suryawirawan: And speaking about predictive model machine learning, I mean, the topic of AI these days. There are so many discussions about using AI to do some kind of mundane analysis. Can AI be used also for data analysis? What have you seen in the industry? Typically how AI is gonna change the landscape of data analysis?

David Asboth: That’s a very interesting question. I’ve played around with the data analysis capabilities of these various tools. Some of them are more sophisticated at this point in time. For anyone listening in a few months’ time, it’s going to have changed anyway, so there’s no point naming tools specifically. But you know, some tools are more advanced in data analysis than others. And on the one hand, it’s great to democratize the ability to say, here’s a somewhat messy spreadsheet. Give me some information about it. Give me some insights. Give me the biggest drivers to success or to a sale or something to, you know, what drives property prices based on this spreadsheet of retail transactions, that kind of thing.

On the one hand, that’s great because you don’t have to put people through technical training to get there. But I think what it does create is the necessity that everybody understands how data analysis is done from a more sort of theoretical point of view. To understand what is possible, what are the limitations, what are the biases to look out for, what are the biases, societal biases that will be baked into the data. And this is true for any output these AI tools generate, but also for any analysis that comes out and also any analysis you do, regardless of AI or not, as you know, there’s going to be these biases in there.

One very, very basic example that I showed on a course was uploading survey results, right? So imagine you’ve done some kind of online survey and you’ve got a CSV or a spreadsheet of some sort of the responses that you can download from this tool. And so yes, you can upload it to one of these AI tools and say, you know, give me some information. And one of the things I demoed was telling the AI to give me the average response time. So what is the average time that people took to fill in this survey? And so what the AI tool did is it identified that there is a start time and end time column. It understood that those are supposed to be dates, it understood that they should be differenced, so you can tell what the difference is, and then that difference should be averaged.

And so the answer we got from this, it was a realistic, it was a real survey data set. And the answer, it was like, oh, the average response time was something like 49 minutes from this survey data. And, you know, I said to the participants, you should never believe what comes out from the data analysis, because you should think about does that answer make sense in context? We were all very skeptical. It shouldn’t be 49 minutes, it was a very short survey. And you go into the data, and sure enough, there’s one outlier. Somebody who left their computer on all day and so their particular response time was eight hours. Everybody else’s was like five to 10 minutes. But when you said to the computer, you know, I want the average, it just took the mean of that column. And so that was heavily skewed by that one outlier. And obviously as an analyst, you should have maybe taken the median, something that’s more robust to outliers and got a more realistic value.

On the one hand, it’s good that there is transparency in these tools. It actually gives you the code it ran to get to that analytical answer. So you can double check and you can check its homework. So in that sense, it’s less of a black box. But if you don’t have the required data literacy or statistical training, even that little bit of statistical training to understand what does an outlier mean, what is the difference between a mean and a median, you wouldn’t necessarily see what was wrong. And you might be trapped into just believing the answer that comes out. So just like with any response from an AI tool, people just need the healthy skepticism of just reviewing the answer and checking whether it makes sense.
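The survey example can be reproduced in a few lines of standard-library Python: difference the start and end timestamps, then compare the mean with the median. The timestamps below are invented to mimic the situation described (short responses plus one respondent who left the form open for eight hours); they are not the real survey data.

```python
from datetime import datetime
from statistics import mean, median

# Made-up survey export: (start_time, end_time) for each respondent.
responses = [
    ("2024-01-08 09:00", "2024-01-08 09:05"),
    ("2024-01-08 09:10", "2024-01-08 09:17"),
    ("2024-01-08 09:30", "2024-01-08 09:36"),
    ("2024-01-08 10:00", "2024-01-08 10:10"),
    ("2024-01-08 10:15", "2024-01-08 10:23"),
    ("2024-01-08 11:00", "2024-01-08 11:09"),
    ("2024-01-08 11:30", "2024-01-08 11:35"),
    ("2024-01-08 12:00", "2024-01-08 12:07"),
    ("2024-01-08 13:00", "2024-01-08 13:06"),
    ("2024-01-08 09:00", "2024-01-08 17:00"),  # left the form open all day
]

fmt = "%Y-%m-%d %H:%M"
durations_min = [
    (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60
    for start, end in responses
]

mean_minutes = mean(durations_min)      # pulled far upward by the one outlier
median_minutes = median(durations_min)  # robust to that outlier

print(mean_minutes, median_minutes)
```

With these invented numbers the mean lands far above any typical response while the median stays within the 5 to 10 minute range, which is the sanity check David suggests making before trusting the answer.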

[00:47:05] Importance of Critical Thinking

Henry Suryawirawan: Very interesting example that you gave, right? So I typically use AI these days to generate code, you know, like a coding assistant. I think it’s much more well-defined problem, right? You can actually test it straight away. Given an input you can see the output. But dealing with data, I think it’s a different kind of a problem. Because maybe you don’t even know the answer, right. So if you just believe what AI is giving you, I think that’s a danger. That’s the first thing.

Second thing, I think all these LLM tools are not deterministic. So maybe you ask the first time, it gives you this answer. Maybe the second time it’s a different thing. How can you actually work with that kind of request-response model, which is probably different every time you ask?

And the third thing is the assumption that is baked into how AI actually processes the data, right? Which is why the critical thinking aspect is very, very important. Do you think that data analysts should feel that their job is safer because of this? Or how should they equip themselves with the AI so that they can have an AI assistant that can power their data analysis process much better?

David Asboth: I like the way you phrased it, which is, is the job safer? Most, most of the time, people will ask whether the jobs are going to be replaced by AI. I think it’s just the same as with programming is, for very basic tasks, I can see automation happening and I can see some of these tools taking some of the work away, potentially. Generating boilerplate code. Generating like, I want to create a chart to do this. I’m not familiar with this particular visualization library. Getting up to speed with that and getting the right chart out of the other end. You can definitely see AI accelerating that process. But just like with writing code, if you don’t understand fundamentally how that task is done, you can’t check the outputs and you can’t debug any problems.

And the problems with data analysis, as we said earlier, are even trickier than with programming, because you don’t get an error message; you’ll just get some kind of answer. You can average a numeric column and still get an answer, whether or not that makes sense from a methodological point of view. So I think the future might be that we have AI built in to help us accelerate those little bits of code and little bits of manual tasks that we might want to automate away. But I don’t see an analyst’s job changing, fundamentally, because we’re still supposed to be trusted business advisors, we’re still supposed to be generating value for the business, and AI is just going to be another tool in our toolbox to do that.

Henry Suryawirawan: Yeah, I think in the, I don’t know, like in the non-tech world, some people predict that they can replace all people by AI. Maybe like a question and answer model, right, where you can just give a data and then you start questioning them and they just give you an answer. But I think this is quite dangerous if you actually don’t really understand the analysis. And like, for example, it’s a very simple example, you know, survey average response time, right? While you have outlier, the result can be really, really different. So I think here, the critical thinking is very, very important, right? Don’t just assume that everything AI generates is actually valid.

[00:50:07] Common Tasks Solved by AI

Henry Suryawirawan: So maybe from your experience, I don’t know how much you have applied AI. So any kind of problems, like your favorite problems that can be solved by AI more effectively? Maybe you can share some of your AI power-user experience, I guess.

David Asboth: Yeah, I do have some AI examples in the book as well. I try to identify places where AI will actually accelerate the process. One of the examples, there’s a chapter where the data set is actually a bunch of PDF files. And so the task is to extract data from these PDFs and then do the analysis. And that’s not something that we usually train people on. It’s quite a niche thing, although PDFs are everywhere. It is quite a niche thing to have to analyze data from PDFs. Coming into that problem with an AI assistant means that you can accelerate the process of finding, for example, the right Python library. That’s one of the examples I have in the book is, I want to extract data from PDFs. What are my options in Python? Like, I could go away and Google it, but it would take me a lot more time.

So I do like using it for that. If it’s a domain I’m not familiar with, if there’s a problem I’m trying to solve where I think there must be a Python library for this, somebody else must have solved this problem in a way that I can use it. AI acts as like a superpowered search. Because it’s not just a search query, you can actually give it some context about what you’re trying to do. That’s definitely one aspect I see.

And then just, you know, accelerating smaller tasks. I mentioned creating charts. If you want to figure out how to do a specific kind of data visualization, and again, you’re unfamiliar with the library, you need a bit of help, you know, AI can give you the starter code for it.

Although one problem, and this is I think true for Copilot and other kinds of tools that generate code, is that it’s only trained on what it’s found on the internet. And there’s a lot of sort of not even incorrect code, but maybe inefficient code or maybe not quite the right way to do it. Again, a very specific example is there’s a Python plotting library called Matplotlib, and there’s sort of two different ways to use it. One of them is the old MATLAB style. If anybody listening has ever used MATLAB, there’s a specific way to create plots in MATLAB which is how Matplotlib was originally written and that’s how you create plots in it.

But now there is a much more modern object-oriented interface for building with Matplotlib where you create your chart object and you assign the properties and stuff like more how you would sort of write software in general. But unfortunately, a lot of the code on the internet uses the old style. And so the AI might perpetuate the use of the outdated, less modern style of code. So even for tasks like using a library, you gotta be careful about what the training data is out there.
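For readers who have not seen the two Matplotlib styles side by side, here is a minimal sketch. It assumes Matplotlib is installed; the data and titles are made up, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar"]
sales = [10, 14, 9]  # made-up numbers

# Implicit, MATLAB-style interface: each pyplot call acts on a hidden
# "current" figure and axes.
plt.figure()
plt.plot(months, sales)
plt.title("Sales (pyplot style)")
plt.ylabel("Units")

# Explicit, object-oriented interface: create Figure and Axes objects,
# then call methods on those objects directly.
fig, ax = plt.subplots()
ax.plot(months, sales)
ax.set_title("Sales (object-oriented style)")
ax.set_ylabel("Units")
```

Both halves draw the same chart. The explicit style tends to be easier to reason about once a figure has several subplots, because every call names the axes it modifies.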

Henry Suryawirawan: That’s practical tips for people, maybe some creative idea how you apply AI in your day-to-day job, right? So I think extracting text from PDF. Also like coming up with different charts. Oh, I could also imagine, like for example, downloading data from a typical API, right? Because if you’re not familiar with the API, maybe AI can help accelerate that. Or maybe just like transforming data into like a different format. That is probably also another small task that you can use to solve the problem using AI.

[00:53:10] 3 Tech Lead Wisdom

Henry Suryawirawan: So David, it’s been quite an insightful conversation about data analysis. As we wrap up the conversation, I have one last question that I would like to ask you, which is something that I call three technical leadership wisdom. So if you can think of it just like an advice that you want to give to the listeners here. Maybe if you can share your version of three technical leadership wisdom.

David Asboth: Yeah, sure. So one of the things that I think differentiates like good analysts from the best analysts is curiosity. Just wanting to find out the answer to something. I don’t know if that’s good advice because I don’t know if you can teach that to someone. I don’t know if you can learn to be more curious. But even if just mechanically you try to find the answer to things and persevere beyond the first answer or beyond the obvious answer, that is really such an important skill in life, but particularly in data analysis. Like really wanting to dig in to find the answer and not resting until you’re satisfied with the answer is a particularly good skill.

And just from that, I think practicing your skills. Just whatever your tech skills are, just constantly practicing them, is so important. I mean, I’m just writing, you know, this book is a project based book full of practice opportunities for people. Because I think that’s really, once you’ve got the foundations, the best way to learn is to apply your knowledge and practice. And so just keep doing that. In the data realm, that just might mean solving problems for yourself, even if it’s optimizing your fantasy football team or whatever. It doesn’t have to be a money-making business opportunity every time, just something that solves a problem with data is great. So practicing your skills to stay relevant and to keep them fresh and to learn new things is vital.

And just on the sort of more organizational side of things, I think it’s very important to have a good reason to do data work in the first place. I think there is a maybe mistaken belief in business that doing data analysis is inherently useful. It’s just inherently a good thing that we should be doing. But if you don’t have a purpose, if you don’t have an end goal, if you don’t have a good reason to do it, it’s not going to be successful. And that’s true for analysts as much as entire organizational strategies. Just have a good reason to do data work in the first place.

Henry Suryawirawan: Very interesting last wisdom there. So I think I can actually relate with that kind of advice that you just gave, right? Because many people think, okay, we have the data, we have the data. So let’s just come up with the insights. What insights? Maybe they don’t know what kind of insights they want to derive from it. So I think, yeah, know the reason why you want to tackle data problem. I think it’s really, really important.

So for people who love this conversation, they want to learn from you further or maybe they just want to find more about yourself and your book. Maybe is there a place where they can find you online?

David Asboth: Yeah, I think probably following me on LinkedIn is probably the best place to look. So my name’s pretty uncommon so you can put it in the search bar and find it pretty easily. Yeah, I don’t post much on other social media outlets anymore. So I think LinkedIn is probably the way to see what I’m doing more day-to-day. I also have a website where you can check out the book and there’s a link to the podcast and other things that I’ve written. And yeah, the book is available on the Manning website, and when it comes out, it’ll be available on Amazon.

Henry Suryawirawan: Thank you. I wish you good luck in the process of publishing that. And I hope people who listened today are much better equipped to solve any data analysis problem, just like the title of your book.

David Asboth: Thanks, Henry.

– End –