#16 - Responsible AI and Building Trust in AI - Liu Feng-Yuan
“Having the conversation within the business, the data science teams, and the technology teams about what problems are you trying to solve? What can AI do with the data that you have? Sometimes business comes with a lot of problems that are like science-fiction.”
Feng-Yuan is the co-founder and CEO of BasisAI, a Singapore-headquartered augmented intelligence software company that helps data-driven enterprises deploy AI responsibly. He has vast experience in the tech sector, working with the Land Transport Authority Singapore to make public transport more efficient, and with GovTech pushing data initiatives for Singapore’s Smart Nation projects.
In this episode, I talked to Feng-Yuan about responsible AI and how to build trust in artificial intelligence, including the possibilities, challenges and dangers that AI and ML offer to businesses. We began by talking about his company, BasisAI, which offers bespoke AI solutions that are built responsibly. Feng-Yuan explained why it’s important to differentiate between what is interesting and what is useful when it comes to AI trends. We also spoke at length about deepfakes, the dangers that come with them, and how to prevent such instances. At the end, Feng-Yuan also shared some wisdom about effective communication in the age of AI and ML.
Listen out for:
- BasisAI - [00:04:48]
- Feng-Yuan’s career journey - [00:06:23]
- Feng-Yuan’s interesting projects at GovTech - [00:11:57]
- The fear of AI/ML - [00:17:29]
- AI/ML current trends & challenges - [00:20:15]
- The danger of AI/ML - [00:23:07]
- Responsible AI - [00:25:00]
- Explainable AI - [00:28:24]
- Challenges for implementing AI - [00:30:14]
- Managing expectations for AI projects - [00:33:12]
- Productionizing AI - [00:35:07]
- Role of ML engineers in product team - [00:38:31]
- Data Scientist and ML Engineer - [00:41:03]
- Hyper-personalized AI - [00:43:16]
- 3 Tech Lead Wisdom - [00:45:50]
Liu Feng-Yuan’s Bio
Feng-Yuan Liu is the co-founder and CEO of BasisAI, a Singapore-headquartered augmented intelligence software company that helps data-driven enterprises deploy AI responsibly. In his previous capacity, he was responsible for leading and driving Smart Nation data initiatives for the Singapore government, including setting up and growing the data science and AI capabilities within GovTech.
- Email – email@example.com
- LinkedIn – https://www.linkedin.com/in/feng-yuan-liu-9b09aa42/
- Twitter – https://twitter.com/fengyuanliu
- Website – https://basis-ai.com
- LinkedIn – https://www.linkedin.com/company/basis-ai/
- Twitter – https://twitter.com/basis_ai
- Instagram – https://www.instagram.com/basisai/
Mentions & Links:
- “AI Governance: The path to responsible adoption of artificial intelligence” whitepaper – https://resources.basis-ai.com/learn/ai-governance-responsible-adoption-of-artificial-intelligence
- Temasek – https://www.temasek.com.sg
- Sequoia – https://www.sequoiacap.com
- Land Transport Authority Singapore – https://www.lta.gov.sg
- EZ-Link cards – https://www.ezlink.com.sg
- Public Transport Quality – https://www.ptc.gov.sg/about/overview
- Excess wait time – https://www.lta.gov.sg/content/ltagov/en/newsroom/2017/9/2/bus-services-continue-to-improve-since-transition-to-bcm.html
- GovTech – https://www.tech.gov.sg
- Smart Nation – https://www.smartnation.sg
- IDA – https://www.imda.gov.sg
- Smart Nation Sensor Platform – https://www.tech.gov.sg/products-and-services/smart-nation-sensor-platform/
- Smart Nation Video Analytic Platform – https://www.tech.gov.sg/files/media/speeches/2017/05/Factsheet%20Smart%20Nation%20Sensor%20Platform.pdf
- Spark – https://spark.apache.org
- Hadoop – https://hadoop.apache.org
- Elasticsearch – https://www.elastic.co/elasticsearch/
- Internet of Things (IoT) – https://en.wikipedia.org/wiki/Internet_of_things
- Naveen Jain – https://en.wikipedia.org/wiki/Naveen_Jain
- Dota – https://en.wikipedia.org/wiki/Dota
- Atari – https://en.wikipedia.org/wiki/Atari
- Slack – https://slack.com
- Terminator – https://en.wikipedia.org/wiki/The_Terminator
- The Matrix – https://en.wikipedia.org/wiki/The_Matrix
- Minority Report – https://en.wikipedia.org/wiki/Minority_Report_(film)
- Jarvis – https://en.wikipedia.org/wiki/J.A.R.V.I.S.
- Avengers – https://en.wikipedia.org/wiki/The_Avengers_(2012_film)
- GPT-3 – https://en.wikipedia.org/wiki/GPT-3
- Turing test – https://en.wikipedia.org/wiki/Turing_test
- Natural Language Processing (NLP) – https://en.wikipedia.org/wiki/Natural_language_processing
- Deepfakes – https://en.wikipedia.org/wiki/Deepfake
- Chatbots – https://en.wikipedia.org/wiki/Chatbot
- Responsible AI – https://www.accenture.com/_acnmedia/PDF-92/Accenture-AFS-Responsible-AI.pdf
- Deep learning – https://en.wikipedia.org/wiki/Deep_learning
- Neural networks – https://en.wikipedia.org/wiki/Artificial_neural_network
- Shapley values – https://en.wikipedia.org/wiki/Shapley_value
- LIME – https://towardsdatascience.com/understanding-model-predictions-with-lime-a582fdff3a3b
- Backtesting – https://en.wikipedia.org/wiki/Backtesting
- A/B testing – https://en.wikipedia.org/wiki/A/B_testing
- Precision and recall – https://en.wikipedia.org/wiki/Precision_and_recall
- TensorFlow – https://www.tensorflow.org
- TensorFlow Playground – https://playground.tensorflow.org/
- scikit-learn – https://scikit-learn.org/stable/
- PyTorch – https://pytorch.org
- Kubernetes – https://kubernetes.io/
- MLOps – https://en.wikipedia.org/wiki/MLOps
- Airflow – https://airflow.apache.org/
- Jupyter notebooks – https://jupyter.org
- D3.js – https://d3js.org/
- WebGL – https://en.wikipedia.org/wiki/WebGL
- Econometrics – https://en.wikipedia.org/wiki/Econometrics
- Bioinformatics – https://en.wikipedia.org/wiki/Bioinformatics
- Geospatial – https://www.omnisci.com/learn/geospatial
- Typography – https://en.wikipedia.org/wiki/Typography
- Data visualization – https://en.wikipedia.org/wiki/Data_visualization
- HR analytics – https://www.hrtechnologist.com/articles/hr-analytics/what-is-hr-analytics/
- Hyper-personalized AI – https://www.convinceandconvert.com/research/hyper-personalization/
Are you a startup in software development which is less than 5 years old?
If yes, our sponsor at JetBrains has a 50% startup discount offer which allows Startups to purchase multiple products and subscriptions for up to 10 unique licenses over a period of months.
To find out more, go to https://www.jetbrains.com/store/startups.
There’s a lot of potential to harness data, to use ML, and to try and scale it. We see a lot of data science teams who are really struggling to productionize and make their models effective. And I think that was one big challenge that we saw. The second thing is that there’s a lot of concern over how AI and ML systems can be like black boxes. And we wanted to help solve that problem. How do you make your ML systems a lot more robust, but a lot more trustworthy?
We want to make sure we build a great data science and engineering organization and solve important problems for the future. A lot of the time, it’s how you use and accelerate the use of AI, but how to do it in a responsible way. And that’s our mission.
When you’re able to calculate the right metrics and when you’ve got the ability to use big data technology, distributed computing to calculate these metrics, then suddenly there’s a lot of impact on the public good, on how you can incentivize and make operations more optimal.
When you outsource everything, then you lose a lot of capability. It’s very hard to specify what exactly you want.
Interesting Projects at GovTech
It was the creative use of visualization and different ways of data-driven thinking that allowed for a data-driven approach to help isolate a problem that a technical team for six months couldn’t figure out. Rather than trying to understand the mechanical and (electrical) engineering aspects of the system, using a more data-driven variable, we find some hints to what the solution was.
It doesn’t mean that in a way expert knowledge is not helpful and that data scientists can find things that experts can’t. It’s the combination of both. It’s the combination that helps.
Any expert by definition knows the domain very well, but any expert also has certain assumptions or mental models of the way the world works. Typically, that mental model works because from the experience, they’ve seen it in 90% or 95% of the time.
For the unobvious cases, sometimes you need a different lens, a different way to tackle the problem. And that’s why starting from a more empirical approach on data, can sometimes help, because engineers always go down the rabbit hole, and then you can’t solve the problem because you’ve gone down that path already.
When you have a multidisciplinary team, you have a team that respects everybody’s perspective. I respect that you are a domain expert. You respect that we are data experts. We’ve worked from many different data problems and then we see how to work together.
It’s also about timing, though. Because all these people who’d think in different ways usually at around the right time they’re disrupting. There’s also some shift in the way technology has evolved, or the way the economics of a particular industry has evolved. When these two coincide, that’s when this disruptor is no longer this annoying person, but this person really did something, which serves real value.
The Fear of AI/ML
Artificial Intelligence and Machine Learning are pretty good at relatively narrow tasks where you have a lot of data and information to go through.
It has a lot of data points to learn from. So for these sorts of problems, where all the data’s online and (easily) captured, then systems can do very well.
AI is not so good when you’d need general reasoning, or that the data you require is not yet instrumented or captured.
Machine learning systems rely on data. It’s not magic. In some ways, it’s all brute force. There’s actually something very inelegant about machine learning. You’re just brute forcing the answer.
You think of AI as some sort of organism, some sort of robot with consciousness. I just see a bunch of algorithms and cubes of numbers, rather than any sort of consciousness and all that.
AI/ML Current Trends & Challenges
Every now and then, AI has such a mythology that people are always looking for these cool new use cases people can demonstrate. The trick is that you must have something very visual.
You need to separate what is interesting from what is useful.
Be really careful about these Turing test type examples. Always ask whether it’s cherry-picked.
The biggest challenge is how to make machine learning more data-driven and more ubiquitous across organizations.
Just because you have a lot of tools and technologies with a lot of hammers, doesn’t mean you can build the house.
The Danger of AI/ML
The models learn from data, and they can only learn from the boundaries of the data that are given to them. So the key (to test these models) is to throw it some left-field question, or something that’s out of the ordinary that you wouldn’t expect it to have been trained on. Whereas human beings have a much more broad common sense lived-the-experience. They see more of the world. If you throw them some left-field question, they will be able to figure it out.
The general principle is just throw it something that might be quite rare. And then if it really messes up, then you know that it’s a robot.
At some level, this is one of these adversarial problems. It will always just keep getting better. And there are better ways of detecting it. And it’s like cyber security. And this is how the technology evolves over time.
There’re a couple of angles to Responsible AI. The first thing is: as you’re building your systems, make sure you’re able to know what goes into training the model. Are you tracking what data has been used in a model? What parameters and models are being used? What assumptions were being made?
When you’re feeding data, sometimes you have to drop rows or columns, sometimes you have to make some tradeoffs. Be quite explicit and careful about these things, because these aspects may affect the eventual model that’s being trained. Make sure you have a good engineering process.
There’re now a set of methods to help uncover the black box of AI. There are ways to visualize what’s going on within the neural network. It helps data scientists debug the models. But some of these can also be used to summarize the results of very complex models, and to present them, to explain what are the underlying drivers.
AI algorithms will take any feature you feed to them and use it to try to predict more accurately.
With any machine, any model, you have a certain accuracy or error rate that you would generate. No model is completely 100% correct.
Just like any form of monitoring that you have in any software system, you need to be able to monitor the algorithmic performance of your models.
The problem with learning systems and AI systems is that the algorithms can learn the wrong thing, and they can continue to perform fine.
You have to think about different approaches and SLAs for machine learning systems than for traditional software.
Challenges for Implementing AI
The first challenge is getting their data in place. Have they been collecting data? Have they been storing and persisting it? Have they captured enough labels of what good or bad examples are?
The second bottleneck is having the conversation within the business, the data science teams, and the technology teams about what problems you are trying to solve. What can AI do with the data that you have? This is a very iterative process.
Sometimes business comes with a lot of problems that are like science-fiction. Others, they throw a lot of resources solving solvable problems, but they are research level problems.
You have a process. Maybe 80% of the process, AI can handle. The last 20%, let the human beings solve that last 20%. But you save the human being 80% of the time, rather than trying to solve the entire process.
This translation and ideation step is the part that I see as the greatest bottleneck. You need business people who have some intuition about how AI works. It’s not magic. You’re learning from the data. You need the right data scientists who appreciate what the business needs, but understand deeply what the state of the art is. And then also, people who understand what data is available. Because a lot of things are theoretically possible, but if you haven’t been collecting the right data, then that’s not possible.
Managing Expectations for AI Projects
You need to think about the validation of the algorithm first, before you try to productionize it. A lot of people try and build the end software first or build the analytics platform, without thinking about the use case or validating the business impact.
You have to start very close to the business and validate the accuracy of the model, and how the accuracy translates to a business impact.
Implementing AI/ML is a multidisciplinary problem.
Data scientists are very good at building algorithms, statistical knowledge, understanding the data. They are scientists by training. But they’re not engineers. They don’t have to keep the lights on. They don’t understand containerization or Kubernetes. They don’t understand networking, cloud infrastructure, site reliability.
The engineering team or infrastructure team haven’t been trained to understand ML systems, how they work, how they need to be retrained, how you need to monitor them in different ways. It’s not just throughput and latency. It’s also algorithmic accuracy. It’s also fairness.
It is a lot about self-service, but also about giving the right flexibility when it’s needed, and the right positive constraints when it’s needed.
All they want to know is what resources they need. So abstract away the things they don’t need to worry about and then allow them full flexibility on the stuff that really allows them to get their work done.
Role of ML Engineers in Product Team
A data scientist role is often the bridge between the real world and the data.
The data scientist has to work with the business to understand what the data stored in the database means, and how that can be mapped to business needs. And that’s very contextual. And that’s going to differ every single time. And so the mindset of a data scientist is always about being a bit more creative, understanding the context a bit more, and then building the right models and optimizing for what the business wants.
Data Scientist and ML Engineer
The data scientist needs to understand the domain. What does the business want? How does the real world work? How am I gathering this data? How do I interpret this data? And how do I build a model that solves what business needs? They’re not thinking about scaling. They’re not thinking about building it once and then generalizing and scaling it.
Data scientists who can code, who can write and build models, but who can also talk to the business, are super important, a premium in any team that wants to be data driven.
The whole idea with machine learning is to allow greater personalization, which means tailoring recommendations.
With more data understanding about people’s preferences and choices, you can narrow down and give people what they want more quickly.
Human beings are more exploratory, more open to new ideas, new inspirations and new things. If you’re locking people into showing them more of what they want, but not giving them the opportunity to learn about new things, there’s a danger with this.
At the high level, you need to think about whether people know exactly what they want, whether you’re optimizing to show them stuff and get them addicted and hyper-engaged. Is that a good thing?
3 Tech Lead Wisdom
When you’re working in data and software, communication is super important. Because it’s so abstract. Because it’s all in your head. It’s important to collaborate to be effective in larger teams. Being able to share what’s in your head with your team members is super important.
- Think hard about how you make yourself effective in communication, and there are many different ways to communicate.
Make sure you’re always learning. That’s the reality of being in technology and software.
- You’re always having to keep up. You’re always having to learn something new. I always suggest to the people I work with to look for adjacent areas.
Have good relationships.
A lot of engineers are very happy talking to machines. Sometimes, talking to machines is very different from talking to other human beings. Because machines tend to be a lot more explicit. There’s less hidden information. Or at least you can probe it to find out. With human beings, sometimes if you probe, there’s always a filter. And there’s always this issue of emotions. So in my experience working in all sorts of teams, whether teams are functional or dysfunctional, really depends on emotional reasons.
If you’re going on a technical career, make sure you think about communicating not with the machines, but invest in communicating with your teammates. And a lot of it is forming that emotional connection, that trust, working through some of these emotional responses that people have, and that you have, and being aware of that.
If you believe that individuals can’t scale themselves and you need to work well in teams, then you need to think of all different solutions to make teams effective. Some of that is communication, my first point. Some of that is making sure you have multi-disciplinary skills, my second point. But a third thing is having good relationships.
Episode Introduction [00:00:45]
Henry Suryawirawan: Hello, all of you my listeners out there. It’s really great to be back here with another new episode for the week. My name is Henry Suryawirawan, and welcome to the Tech Lead Journal show. Before we start our episode today, I am so happy to share that this show has reached a new milestone recently. A few days ago, our Tech Lead Journal community on LinkedIn reached 1,000 followers. And I would like to say thank you from the bottom of my heart for liking the show and being part of this community. It is my biggest mission to keep delivering great content and conversations with great technical leaders, whom we all can learn from and to continue growing this community in order to reach more people and spread the amazing knowledge from my wonderful guests so that all of us can benefit from them. You can also play a part in my mission by sharing your favorite episodes and learnings on the show’s social media channels, be it on LinkedIn, Twitter, or Instagram. I’m always looking forward to hearing your feedback, and interacting with your posts and comments there. And for those of you who are long-time listeners and have become avid fans of the show, consider becoming a patron by pledging your support at techleadjournal.dev/patron, and support me towards achieving my goal, which you can find further details on that page.
Our guest for today’s episode is Liu Feng-Yuan. Feng-Yuan is the co-founder and CEO of BasisAI, a Singapore headquartered augmented intelligence software company that helps data-driven enterprises deploy AI responsibly. He has vast experience in the tech sector, working with the Land Transport Authority Singapore to make public transport more efficient, and with GovTech, pushing data initiatives for Singapore’s Smart Nation projects.
In this episode, I talk to Feng-Yuan all things AI and ML, about having responsible AI, building trust in AI, also including AI possibilities, challenges, and potential dangers. We began by talking about his company, BasisAI, and how they are helping enterprises to become more data-driven and to deploy AI solutions responsibly. Feng-Yuan also explained why it is important to differentiate between what is interesting and what is useful when it comes to AI trends. We also spoke at length about deepfakes, the dangers that come with them, and how to detect and prevent such potential dangers when an AI implementation is built with harmful intent. At the end, Feng-Yuan also shares some wisdom about the importance of having effective communication and connecting with each other in the age of AI and ML.
I hope that you will enjoy this great episode. Please consider helping the show in the smallest possible way, by leaving me a rating and review on Apple Podcasts and other podcast apps that allow you to do so. Those ratings and reviews are one of the best ways to get this podcast to reach more listeners, and hopefully the show gets featured on the podcast platform. Let’s get the episode started right after our sponsor message.
Henry Suryawirawan: Hi Feng-Yuan, welcome to the Tech Lead Journal show. It’s very exciting to have you here because you are one of the top popular people in ML and AI in Singapore. So I’m really excited to talk about AI and ML today with you.
Liu Feng-Yuan: [00:04:42] Thanks for having me, Henry. I’m really glad to be on your program. Hope we have an interesting conversation about all things AI/ML.
Basis AI [00:04:48]
Henry Suryawirawan: [00:04:48] So you’re currently working in your own startup called Basis AI. Maybe as a start, can you introduce about the company, what is Basis AI?
Liu Feng-Yuan: [00:04:56] So Basis AI, we are a startup about two years old now. I’m one of the co-founders of the company and the CEO. My two co-founders also have an extensive Data Science and ML background. They used to work out in the Valley, companies like Uber and Twitter. And in my past life, I used to lead the Data Science and ML team at GovTech. When Linus and Silvanus decided to move back to the region, we decided to set up this company because we think there’s a lot of potential to harness data, to use ML, I think 1) to try and scale it. So we see a lot of data science teams really struggling to productionize and make their models effective. And I think that was one big challenge that we saw. The second thing is obviously there’s a lot of concern over how AI and ML systems can be like black boxes. And we really wanted to help solve that problem. How do you make your ML systems a lot more robust, but a lot more trustworthy? I’m sure we’ll talk a lot more about this later on. And so when we founded the company, we saw a lot of potential talent in the region, we feel, Engineering talent, Data Science talent, even Marketing and Business Development. We saw a lot of opportunity. And that’s why the company is headquartered in Singapore. We were fortunate to have institutional investors like Temasek and Sequoia back us. We’re about 25 strong now. And we have a couple of interesting projects in the pipeline that we’re excited to try and grow. And you know there’re not that many enterprise SaaS companies in the region. And so, you know, it’s lonely, but also unique. We just want to make sure we build a great data science and engineering org and solve important problems for the future. A lot of the theme is how you really use and accelerate the use of AI, but do it in a responsible way. And I think that’s our mission.
Career Journey [00:06:23]
Henry Suryawirawan: [00:06:23] So I’m sure it’s an exciting journey, especially in AI/ML. It’s very trending. But before we go into deep AI/ML stuff, maybe for the audience here, can you share with them what’s your career journey? Maybe you have some highlights or maybe you have major turning points that you want to share with us for all of us to learn.
Liu Feng-Yuan: [00:06:39] I have a slightly unconventional background for tech. So before I joined the startup, most of my past experience had been with the Singapore government, doing a whole variety of different things. I would say that in the early part of my career, I was doing a lot more policy related work. But I had a background in econometrics. Part of the challenge is you never had enough data to work with back in the day, right? So I’m talking 2004. And back then, you have a hundred data points to try and draw a straight line through it, and then do a linear regression. So it’s not very interesting to try and use all the techniques you learnt from school. And that’s why I like to say I’m kind of one of the lapsed economists. But the part where data really started getting exciting was when I joined the Land Transport Authority Singapore in 2011. And in 2011, that was when the public transport situations in both (buses and trains) were terrible. The trains were overcrowded. The buses were terrible. So what happened was I joined a team called Public Transport Quality. So I was in charge of a team of 20 people, and we were supposed to enforce and regulate the bus companies and the train companies, based on their metrics and how well they were performing. They have certain SLAs around how crowded the bus is. And then you realize that transportation is fascinating from a data point of view. Because there’s so many different ways to measure public transport performance. And I will go into some of it.
So for example, when I went to the LTA, we realized we had all the tap-in and tap-out data from the EZ-Link cards in Singapore. So actually you can calculate the number of people on a bus in time and space, at any one point in time. But you know, when I used to have to go and explain to the minister, he says, “Just tell me, Feng-Yuan, which is the most crowded bus?” I wanted to tell him, “Sir, it’s a four or five dimensional problem.” Because there are 200 bus lines. One direction going to the city, one direction going away from the city. There’s weekday, weekend. There’s time of day. There is which bus stop it’s crowded at. So it’s not a straightforward problem to figure out where is the most crowded point in this space. So measuring crowding on buses is actually fairly complex. Then how would you measure, for example, frequency of buses. So, you know you and I get very frustrated when the bus is supposed to come four times an hour, and then all four buses come at the same time and then for 50 minutes, there’s no bus. Now, how do you measure this? So it turns out that there is a metric known as excess wait time, where you assume that passengers arrive at a constant rate at the bus stop, and then you estimate the expected time that they would have to wait for the bus to arrive. And you look at what the expected wait time is under the planned schedule, and you look at what the wait time is under the actual performance. And you look at the difference. And the difference is the excess wait time. Now it turns out that in order to calculate this metric, you need a lot of granular data, right? I need to know the planned schedule. I need to know the exact arrival times across the bus system. One of my projects in the LTA was with all this data we had on GPS and tap-in tap-out, we found a way to calculate this excess wait time metric across the entire bus system. And then we implemented an incentive scheme for the bus companies. So now I had a metric.
I used big data to calculate the metric, so I can calculate the excess wait time for all 200 buses and all 4,000 bus stops. And then I can use this to say to the bus companies, you know what, if your excess wait time score is good, you get more incentives. And if your excess wait time score is bad, you get a penalty. And the outcome is, you don’t realize this, but all the bus companies whose buses you take now have this at the back of their mind, because they know there’s a metric that they’re assessed against.
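The excess wait time idea described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LTA's actual implementation: it assumes passengers arrive at a constant rate (as described above), treats one bus stop in isolation, and the function names `headways`, `expected_wait` and `excess_wait_time` are hypothetical. Under the constant-arrival assumption, a gap of h minutes between buses collects passengers in proportion to h, and each of them waits h / 2 on average, so the expected wait is sum(h²) / (2 · sum(h)).

```python
from typing import Sequence


def headways(arrival_minutes: Sequence[float]) -> list[float]:
    """Gaps, in minutes, between consecutive bus arrivals at one stop."""
    times = sorted(arrival_minutes)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def expected_wait(arrival_minutes: Sequence[float]) -> float:
    """Average wait for passengers who arrive at a constant rate.

    Assumption (not from the source): wait = sum(h^2) / (2 * sum(h)),
    which follows from uniform passenger arrivals over each gap h.
    """
    gaps = headways(arrival_minutes)
    total = sum(gaps)
    if total <= 0:
        raise ValueError("need at least two distinct arrival times")
    return sum(h * h for h in gaps) / (2 * total)


def excess_wait_time(scheduled: Sequence[float], actual: Sequence[float]) -> float:
    """Actual expected wait minus scheduled expected wait, in minutes."""
    return expected_wait(actual) - expected_wait(scheduled)


# Hypothetical data: a bus planned every 15 minutes versus buses that bunch up.
scheduled = [0, 15, 30, 45, 60]
actual = [0, 2, 4, 6, 60]
print(round(excess_wait_time(scheduled, actual), 1))
```

With the evenly spaced schedule the expected wait is 7.5 minutes, while the bunched arrivals leave one 54-minute gap that dominates the average, so the excess wait time comes out at roughly 17 minutes. A real system would repeat this per route, direction, stop and time band, which is where the granular GPS and schedule data comes in.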
This is a bit of a long story, but the really cool thing about transportation and data is, 2011 was a time when data science was called the sexiest profession in the world. It started to become quite hot. Big data. Back then in LTA, there was no Spark. Hadoop was just emerging. But that was almost 10 years ago. But there were a lot of exciting possibilities. And when you’re able to calculate the right metrics, when you’ve got the ability to use big data technology, distributed computing to calculate these metrics, then suddenly there’s a lot of impact on the public good, on how you can incentivize and make operations more optimal. That was a pretty important part of my career, because that’s when I really got excited about data. And I realized that in order to make it work, I needed to bring together multidisciplinary teams. I needed the data scientists. I needed the engineers. I needed operations people. And we really needed to work together to try and see how we could harness data in a better way.
It was after that stint that I went to GovTech. This was around the time when there was a big Smart Nation push. In the past, GovTech used to be called IDA, and we never used to hire engineers or data scientists. We used to outsource everything. And when you outsource everything, then you lose a lot of capability, and so it’s very hard to specify what exactly you want. I started the data science team with five people. After four years, it grew to about 50. And that was a pretty proud career milestone for me, because we managed to get this team to work on data science for the public good. Our customers are all internal government agencies. But, we also obviously started not just doing analysis, but building software, building platforms real time. This is for… maybe the government agencies as well.
So that was also a pretty exciting period for me in the journey. And it was during that time that I started reaching out to diaspora from Southeast Asia, from Singapore, from the region who were working in the US at the Silicon Valley companies and saying, “Hey, what are you guys doing? How are you guys doing it? Are we in the Singapore government, do you think, doing things the right way?” And that’s how I met my current co-founders. I tried to hire them to join the government. They were like “No way in hell that’s going to happen, but we really kind of like what you’re doing, and why don’t you join us?” So that’s my story. I’m just the sales guy, the complement of the team. But I think it requires many different types of skills to make technology effective. That’s what really motivates me, and a lot of the data problems are really cool for me.
Henry Suryawirawan: [00:11:50] Yeah, it’s interesting that you tried to poach them, but you were being poached out.
Liu Feng-Yuan: [00:11:54] Yeah. That’s how it works.
Interesting Projects at GovTech [00:11:57]
Henry Suryawirawan: [00:11:57] So maybe can you share with us some of your biggest projects in GovTech? Is it like Data.gov or any other platforms that you built that are interesting?
Liu Feng-Yuan: [00:12:06] Yeah. There was a whole range of different things. A couple of the things that we were working on were the Smart Nation Sensor Platform and the Smart Nation Video Analytic Platform. So these are central platforms that are putting together lots of IoT and sensor data and helping government agencies plan better. Using geospatial data to plan better, to run operations in a better way. There was another thing that we were working on. There was one, which I think is public information now, where we were helping the police force to basically do better case investigations and searches. So this is data that they already have in general systems. It was a combination of using Elasticsearch together with some graph algorithms to link information together. Whereas in the past, they had a couple of siloed systems that couldn’t talk to each other, and there was manual effort to pull everything together. We got very excited by using simple, well-known technology, but used in applications where it just saves the officers a lot of time. So a lot of the officers were working overtime in order to try and crack these cases, and we were just able to improve the user experience. Taking the time that they would take to solve a case, or to investigate a case, from like 40 minutes, trawling through all the information, to basically like 15 minutes to piece all the information together. I think there were a lot of internal systems. Data.gov was very exciting because we were really on a mission to help with better information, transparency and open data, which I feel is really important.
There were some ad hoc cases. There was one case, I think in 2013, 2014, when the Circle Line broke down for a while. And for about six months, they couldn’t figure out what the cause was. This is more my team rather than me getting personally involved, but the team of data scientists were called in. They analyzed the data. And in about half a day, they were able to pinpoint the cause of the fault to a single train. This was actually not a big data analysis. It was pretty small data, but it was creative use of visualization and different ways of data-driven thinking that basically allowed a data-driven approach to help isolate a problem that a technical team couldn’t figure out for six months. Rather than trying to understand the mechanical engineering and (electrical) engineering aspects of the system, using a more data-driven approach, we found some hints to what the solution was. The team was able to isolate the cause of these breakdowns to a single train. It turns out that as this train was going through the system and passing other trains, it was causing interference and almost like killing these trains that went by. We didn’t know the underlying root cause then, but what we could see was this pattern where the faults were propagating through the system at the speed of a train. And then, looking at the data, we finally formed a hunch that maybe it’s a train on the other side of the track that’s propagating through the system, and then as it passes other trains, somehow it’s killing them. Ultimately, it was the train engineers who figured out the underlying reason for the interference. But it’s pretty cool in that case that using data, even small amounts of data, could help the investigation and point the investigation to the right path. So that was a pretty cool project.
Henry Suryawirawan: [00:14:41] So I still remember this news during that time. The circle breakdown and yeah, the investigation took a long time until suddenly, a bunch of, I think people from GovTech, with you and your team, suddenly figured it out. So I think that was really interesting, especially with the whole Smart Nation movement of Singapore, that kind of highlighted, “Oh, actually, the technology could help in doing so many other stuffs”. Even like for transportation, figuring out why the root cause. Even the experts, they couldn’t figure out. But I think this also highlights the thing that you mentioned initially that we need to have a multidisciplinary team in order to solve a problem. Because not necessarily just the experts, the one who are doing it for a long time, but also like maybe, adding data visualizations or whatever that is, and work together to solve a common problem, which I think is pretty good journey and explanation on how tech can actually improve our lives in the future.
Liu Feng-Yuan: [00:15:28] Yeah, I think you’re spot on there, but I do want to say that it doesn’t mean that expert knowledge is not helpful, or that data scientists can find things that experts can’t. I rather like to think that it’s the combination of both. That observation and combination is what helps. Because any expert by definition knows the domain very well, but any expert also has certain assumptions or mental models of the way the world works. Typically that mental model works because, from experience, they’ve seen it hold 90% of the time, 95% of the time. But for these hard problems, these new problems where you’ve searched very hard, you’ve exhausted the obvious things that would happen in 90-95% of the cases. So for these unobvious cases, sometimes you need a different lens, a different way to tackle the problem. And that’s why starting from a more empirical approach on data, I think, sometimes helps. Because, you know, sometimes engineers go down the rabbit hole, and then you can’t solve the problem because you’ve gone down that path already. So you step all the way back up, and then tackle the problem from a different approach. And that is what you are able to do when you have a multidisciplinary team, and you have a team that respects everybody’s perspective. I respect that you are a domain expert. You respect that we are data experts, and we’ve worked on many different data problems. And then we see how to work together.
Henry Suryawirawan: [00:16:32] I learned about this elsewhere. There’s this guy called Naveen Jain. He was explaining about how to think big, bold ideas, becoming a disruptor. So he also has this opinion that most of the disruptions that happen in the world, did not come from the experts. Simply because what you’re saying, the assumption, the mental model is there. “We have tried that before and we think it won’t work.” While there’s a new guy coming out of the domain experts, and he could come with a …, maybe call it disruptive idea, and then be able to challenge and prove that actually, “Hey, it works.” And that’s where the disruptor comes actually in the world. I think like Uber or Airbnb, and things like that , some of the examples of this big disruption.
Liu Feng-Yuan: [00:17:08] And it’s also about timing though. Because all these people who’d think in different ways usually at around the right time they’re disrupting. There’s also some shift in the way technology has evolved, or the way the economics of a particular industry has evolved. And when these two kind of coincide, that’s when this disruptor is no longer this annoying person, but this person really did something, which actually serves real value, right. So it’s also a combination of different factors.
The Fear of AI/ML [00:17:29]
Henry Suryawirawan: [00:17:29] So this also brings us into a good segue to the AI world. A lot of people, I’m sure many people in the world are worried. I’m an expert in my area. But now it seems, “Hey, this AI coming, AI and ML, and I’m scared. Will I be replaced one day with all these AI/ML thing? And there’s a less emphasis on the expert knowledge.” What do you think about that?
Liu Feng-Yuan: [00:17:51] I think where Artificial Intelligence and Machine Learning stand is that they’re pretty good at relatively narrow tasks where you have a lot of data and information to go through. Right, so there’s a lot of interest now in language models, because you’ve got a lot of language on the internet that you can mine, and you can mine the entire internet, because machines don’t get tired, software doesn’t get tired. Unlike human beings, these systems can just work through millions or billions of hours and learn. All the examples of reinforcement learning, where you can see how AI can play Dota and play chess, are essentially AIs playing against each other or themselves. And it can play millions of games, because if you die in a game, then you can just restart. You can try again. So it has a lot of data points to learn from. So for these sorts of problems, where all the data’s online and (easily) captured, then, yes, systems can do very well. So computer vision, recognizing faces, it’s all just pixels, right. It’s just images and you can process this, and that works really well.
Now where AI is not so good is where you need some sort of general reasoning, or where the data you require is not yet instrumented or captured. So think about all of the interactions that you and I have that aren’t yet captured in Gmail or other systems. I’ll give you an example. People always talk about HR analytics: “I’d like to predict when a person’s going to quit.” And the truth is, look, when you quit a job, you always know it’s either your boss or your colleagues you don’t get along with, and things like that. Now, even if I mine all the email systems, you can get some of it, but a lot of this doesn’t happen in email. Maybe if you mine the Slack channels nowadays, because Slack is so easy, you’ll be able to figure out if someone’s unhappy. But if you don’t have a lot of latent information about the emotional connections between workers and colleagues, and that is the underlying reason why someone will quit, then you’re not going to be able to build a decent HR model for those who are going to quit. I know there’s a lot of coverage on it. I haven’t seen any convincing examples of that.
So I’d like to think that it’s only when we reach a time when people are installing microchips into our brains, and having that ability to download whatever we’re thinking, whatever we’re feeling, that we should be really worried. Maybe that time is not so far away. But yeah, machine learning systems rely on data. It’s not magic. In some ways, it’s all brute force. There’s actually something very inelegant about machine learning. You’re just brute-forcing the answer. When you think of AI, what do you think of? You think of some sort of organism, some sort of robot with consciousness. I just see a bunch of algorithms and cubes of numbers, rather than any sort of consciousness and all that. So people need to figure that out. You know, when people think about AI, they think of conscious beings. That’s not the case. It’s algorithms at the end of the day. What you put it to do is another question.
Henry Suryawirawan: [00:20:09] So yeah. I mean the day where the Terminator comes, I think, is still pretty far, I hope.
Liu Feng-Yuan: [00:20:14] Yeah.
AI/ML Current Trends & Challenges [00:20:15]
Henry Suryawirawan: [00:20:15] You have worked with big data and recently with this Basis AI, working on AI and ML. So what are some of the current trends that you see are growing in terms of AI and ML? Any kind of like space being hot areas for AI and ML to implement? Or is there any other new technologies, new innovations that come out of the AI and ML?
Liu Feng-Yuan: [00:20:35] Yeah. Every now and then, you know, AI has such a mythology that people are always looking for these cool new use cases they can demonstrate. And the trick is you must have something very visual. When reinforcement learning could play Atari games, that’s pretty cool! When it can play Dota. These are all very visual examples. But I think you need to separate what is interesting from what is useful. I talked about the example where reinforcement learning systems can play computer games, and that’s great, but in those use cases, when you make a mistake in a computer game, you die, and you can regenerate yourself. But if you are using reinforcement learning to, for example, teach a robot how to put items on a shelf, can you afford a million mistakes, where the robot goes crazy and then basically knocks down the shelf or kills a human being? You can’t generate that kind of learning data. I think we need to be careful when thinking about what’s exciting that we see in the news.
So again, there was recently a lot of excitement over this new language model called GPT-3, which is a language-based AI model that apparently took tens of millions (of dollars) of computing resources to train. Your colleagues at GCP will be quite happy with that, right. Because they have to retrain a model, and that’s lots of compute. You would read the article about GPT-3. And then at the end, they tell you this article is actually written by GPT-3 itself. So you know, we can fool human beings into thinking it’s actually another human being that wrote the article. So I think these kinds of Turing test examples are really red herrings. Because a lot of the time, these cases are cherry-picked. And it’s also not always really clear to me how these systems can be used, because they’re so brittle and you need to cherry-pick the output, and it might not be that productive or useful. So I always like to say, be really careful about these Turing test type examples. Always ask whether it’s cherry-picked. While people would say, “Hey, this shows a lot of cool new developments in language models, and the like,” I like to think that the biggest challenge now is how do you make machine learning, and actually being more data-driven, more ubiquitous across organizations. And yes, you might need a point facial recognition algorithm here, or a point NLP algorithm there. But these are point solutions, right? How do I use this to make the way we operate really more data-driven and more fact-based? And I think that’s the thing that a lot of organizations are struggling with. Just because you have a lot of tools and technologies, a lot of hammers, doesn’t mean you can build the house. So that’s where I think the really interesting problems lie.
Henry Suryawirawan: [00:22:42] Yeah, I can see this as well. A lot of, in the industry, people probably want to jump the curve saying that, “Oh, there are so many cool AI/ML stuffs! Let’s do AI/ML!” But whenever we want to build something, because an ML requires a lot of data right, like what you mentioned, but they just don’t have it. I think what you mentioned is very spot on, like being more data driven, having all these data and metrics being collected, I think is still like one of the crucial steps before you can actually successfully adopt AI and ML.
The Danger of AI/ML [00:23:07]
Henry Suryawirawan: I also have one thing, like you mentioned about this GPT-3, it’s very interesting. Quite recently as well, I heard about someone who can produce blog posts just by having AI/ML doing it. And we all know about this deep fakes, where you have the videos of popular people talking about stuffs, which are generated by robots or AI/ML. So for us layman people, right? How do you actually spot this kind of, I don’t know whether you call it fake or you want to call it arts, it’s like borderline. So how can we spot this? And not fall into the trap that whatever that they are producing actually can really cause a catastrophic impact. Because if it’s misused, then what about it?
Liu Feng-Yuan: [00:23:41] There are two questions here. One is, how do you spot these deep fakes or this AI-generated content? And then two is, how do you prevent this? How do you stop it? First of all, how do you detect it? Because being able to detect it may not mean that we can stop it. But one way to test these models is this: the models learn from data, and they can only learn within the boundaries of the data that is given to them. So the key is to throw it some left-field question, or something that’s out of the ordinary that you wouldn’t expect it to have been trained on. Human beings have a much broader, common-sense lived experience. They see more of the world. So if you throw them some sort of left-field question, they will be able to figure it out. But if it is an AI, as with these chatbots, throw it some question that it may not have seen before. So yeah, the general principle is just to throw it something that might be quite rare. And then if it really messes up, then you know that it’s a robot. Which is why people always ask rude questions of chatbots, because you’re testing (whether) the instinct and the intuition is the same. Throw it something it probably hasn’t seen before, and you can test for those issues.
Are there more systematic ways to detect deep fakes and things like that? Yeah, I think there are. There’s a lot of research into the technical details around this. At some level, this is one of these adversarial problem. It will always just keep getting better. And there are better ways of detecting it. And it’s like cyber security. And I think this is how the technology just evolved over time. I think the larger problem of how to distinguish between what’s real and what’s fake, is a really important one. And I think there are no easy answers to it.
Responsible AI [00:25:00]
Henry Suryawirawan: [00:25:00] So how can we prevent it? Which is the second question. And coming back again, I think I saw you a lot talking about Responsible AI. I think this is also touching upon that, right? How do we prevent that? And how can we ensure that whenever we implement AI, is as responsibly as possible, is as safe and without any harm intention behind it?
Liu Feng-Yuan: [00:25:17] I think if you are developing your own machine learning models and your own systems, and you are a responsible actor, I think there are ways to do it. I mean, obviously if there’s a kind of malicious actor on the other side, then they may not want to do so. There are a couple of angles to Responsible AI. The first thing is: as you’re building your systems, just make sure you’re able to know what goes into training the model. Are you tracking what data has been used in a model? What parameters? What models are being used? What assumptions were being made? Even things like, when you’re feeding data, sometimes you have to drop rows or columns, sometimes you have to make some tradeoffs. So just being quite explicit and careful about these things, because these aspects may affect the eventual model that’s being trained. Part of it, I think, and I’m sure we’ll come to this later, is just making sure you have a good engineering process. Version control, tracking parameters. These are all just hygiene factors that software engineers have been using in other domains for a long time.
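As a rough illustration of the tracking described above, a minimal Python sketch might record what went into a training run alongside the model. The function names, field names and example values here are hypothetical and are not BasisAI’s tooling; the sketch only assumes the training rows can be serialized to JSON.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint_rows(rows):
    """Return a stable SHA-256 hash of the training rows (hypothetical helper)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_training_record(rows, params, dropped_columns, assumptions):
    """Collect what went into a training run in one JSON-serializable dict."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data_sha256": fingerprint_rows(rows),   # which data was used
        "row_count": len(rows),
        "params": params,                        # which model and parameters
        "dropped_columns": sorted(dropped_columns),
        "assumptions": assumptions,              # tradeoffs made along the way
    }


# Illustrative values only.
rows = [{"age": 34, "income": 5200}, {"age": 51, "income": 7100}]
record = build_training_record(
    rows,
    params={"model": "logistic_regression", "C": 1.0},
    dropped_columns=["postal_code"],
    assumptions=["rows with missing income were removed"],
)
print(json.dumps(record, indent=2))
```

Checking a record like this into version control next to the training code is one simple way to answer later what data, parameters and assumptions produced a given model.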
The second thing is there’s now a set of methods to help uncover the black box of AI. So if we talk about deep learning networks and neural networks, there are ways to visualize what’s going on within the neural network. It helps data scientists debug the models. But some of these can also be used to summarize the results of very complex models, and to present them, to explain what the underlying drivers are. Now, explanations are a tricky topic. They always say, if you’re trying to give a convincing explanation to a social scientist, to a physicist, and to a mathematician, they are not satisfied with the same level of explanation. They have different demands of what’s a good explanation. It’s the same with AI. What a good explanation is to a data scientist will be different from what it is to a product manager, to a consumer, to a regulator. So this is, again, ongoing research. But this idea that you should be able to at least explain where something goes wrong, so that you can try and fix it, I think is an important aspect. And increasingly there is this idea of fairness. AI algorithms will take any feature you feed to them and use it to try and predict more accurately. So if you feed it gender or race, it will try and use these features to make a personalized recommendation. But again, there is a whole realm of ethics, and morally, human beings feel that some of these features shouldn’t be used to distinguish and to make decisions. And so this is a human ethical constraint that we need to put on the way we develop these models.
Now there are ways to quantify the impact. So with any machine, any model, you have a certain accuracy or error rate that you would generate. No model is completely 100% correct. Now, the question is: is the model making errors on women versus men differently? And if it is, you need to have visibility of this. Just like any form of monitoring that you have in any software system, you need to be able to monitor the algorithmic performance of your models, and whether they are treating men and women the same way. So our approach is really, how do we get better visibility into these systems? How do we get visibility into the explainability metrics, into the fairness metrics? That is also part of the engineering process to build these models. And that, I think, gives you the foundations for building Responsible AI systems and responsible software, because you want to be able to know when things go wrong. One observation that I think might be interesting for people to be aware of is, typically you design software to fail spectacularly. When things go wrong, you want it to fail first, so that it doesn’t do more harm. The problem with learning systems and AI systems is that the algorithms can learn the wrong thing, and they can continue to perform fine. So the microservice will still be up. It’ll be 99.9% available. But it will give you a really bad prediction 99.9% of the time. So you have to think about different metrics and SLAs for machine learning systems than for traditional software.
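The question of whether a model makes errors on one group differently from another can be sketched as a small monitoring check. This is only an illustration in plain Python: the labels, predictions, group codes and the alert threshold below are made-up assumptions, and real fairness monitoring would usually look at more than a single error-rate gap.

```python
from collections import defaultdict


def error_rate_by_group(y_true, y_pred, groups):
    """Return the fraction of wrong predictions for each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        if truth != pred:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def max_gap(rates):
    """Largest difference in error rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Made-up labels and predictions for two groups, "F" and "M".
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

ALERT_THRESHOLD = 0.1  # hypothetical tolerance, to be agreed with the business

rates = error_rate_by_group(y_true, y_pred, groups)
gap = max_gap(rates)
print(rates, gap)
if gap > ALERT_THRESHOLD:
    print("Error rates differ across groups; investigate before shipping.")
```

A check like this can run next to the usual uptime and latency monitoring, so that a service that is still up but treating groups differently does not go unnoticed.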
Explainable AI [00:28:24]
Henry Suryawirawan: [00:28:24] I’m quite curious about this. How you do the explainability, the fairness thing, because all I thought is that AI/ML is a black box. You train it over a long period of time. You throw it a bunch of data. And it just comes with an algorithm out of the box. But how can you actually come up with all these explanation, maybe metrics to quantify “Oh, these are the actual features that the algorithm puts a lot of emphasis on”? How do you do that actually? Maybe going a bit technical here.
Liu Feng-Yuan: [00:28:49] What I suggest, and maybe you can put a link to this, is that you have a look at TensorFlow Playground. TensorFlow Playground allows you to train a neural network in your browser. What’s really cool is that as you’re training the algorithm, it gives you an insight into what is going on in the individual neurons within the neural network. So a neural network is in some ways organized based on how your brain is supposed to work, and how neurons in the brain work. What the visualization here is showing is actually how each of the neurons is thinking. And you see that in the early layers of the neural network, the thinking is quite simplistic. So in this neuron, any dot that’s above a certain straight-line boundary, it considers to be orange. And it tries to classify between orange and blue dots. Anything above a certain threshold is orange, and the other one is blue. And that’s a more linear way of thinking. But as you go from layer to layer in the neural network, what is essentially the thinking, almost, is a lot more sophisticated. It’s able to draw boundaries that are less linear and more nuanced based on the data. So this, to some extent, gives you some intuition, and there are techniques and methods to allow data scientists to inspect what’s going on behind these very complex algorithms. And so that gives you some intuition into how these things can work. Those who are curious can look into research methods like Shapley values or LIME. So, LIME will be able to say, “Hey, I’ve predicted that this image is a doll,” for example, and it can point to the pixels in the image that led the algorithm to make that prediction. So it does give you some sort of a link or explanation of why that particular prediction was made.
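To give a feel for what Shapley values compute, here is a minimal brute-force sketch in Python. It is not the SHAP or LIME library: the toy model, the all-zero baseline and the function names are assumptions made for illustration, and enumerating every feature subset like this is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction, relative to a baseline input."""
    n = len(instance)
    features = range(n)

    def value(subset):
        # Features in `subset` take the instance's values; the rest stay at baseline.
        x = [instance[i] if i in subset else baseline[i] for i in features]
        return predict(x)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(x):
    # Hypothetical scoring function with an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The per-feature values add up to the difference between the model’s output on the instance and on the baseline, which is what lets them be read as each feature’s share of the prediction.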
Challenges for Implementing AI [00:30:14]
Henry Suryawirawan: [00:30:14] So coming to the actual implementation in the industry. Obviously everyone wants to do this AI/ML now. I think we touched a bit, the first problem is about the data. But what are the challenges that you see in terms of, in the industry, how people want to adopt AI, but there are so many hurdles that they need to go through? Maybe you can share some of the big challenges there.
Liu Feng-Yuan: [00:30:33] Yeah, I think the first thing is: there are many ways that people use AI systems. So sometimes they just consume an API. So if a cloud provider has a ready translation service or facial recognition service, and it trains its model off all the data on the internet, then you just call the API. You send the API an image, and it tells you what’s in the image. In these sorts of situations, I think it’s quite straightforward. But then the use cases tend to be very generic. So if I have a very generic object detection use case, “Is it a cat? Is it a dog?”, then these things would probably work well. But if I’m in a manufacturing context, and I want to distinguish between various manufacturing defects, it’s unlikely that all that information is on the internet, or that Google has mined all this data on Google Photos. That object detection algorithm would probably not be good enough in the manufacturing context. So in these situations, if you are a manufacturing company, then you need your own data to train the models on. And then the question is: have they been collecting this data? Have they been storing it, persisting it? Have they captured enough labels of what good or bad examples are, in order to train the AI/ML systems? So I think the first thing is just getting their data in place.
The second bottleneck I actually see is: having the conversation within the business, the data science teams, and the technology teams about what problems are you trying to solve? What can AI do with the data that you have? And this is a very iterative process. Sometimes business comes with a lot of problems that are like science-fiction. So some financial companies come to me and say, “Hey, can you help me predict how the stocks in the stock market are going to move?” Usually, I giggle. Then I say, “If I could do that, I wouldn’t be standing here talking to you. I’d be trading, I’d be a trader already, right!” It’s such a good problem. So again, some are science fiction. Others throw a lot of resources at solving solvable problems, but they are research-level problems. And I’m like, “Hey, do you really need to be solving this problem? Can we simplify the problem?” I think there’s a lot of that back and forth where the business side would want a perfect solution, whereas I’m saying, “You know what? You have a process. Maybe 80% of the process, AI can handle. The last 20%, let the human beings solve that last 20%. But you save the human being 80% of the time, rather than trying to solve the entire process.” So a lot of that is, how do you find the right solution that meets what the business needs, that can leverage the data that you have, versus something that can be leveraged off either transfer learning or open source, where you don’t need to reinvent everything from scratch.
This discussion, this translation and ideation, is the part where I see the greatest bottleneck. You need business people who have some intuition about how AI works. It’s not magic. You’re learning from the data. You need the right data scientists who appreciate what the business needs, but understand deeply what the state of the art is. And then also, people who understand what data is available. Because a lot of things are theoretically possible, but if you haven’t been collecting the right data, then that’s not possible.
Henry Suryawirawan: [00:32:58] I can totally understand, like scoping the problem itself. So obviously we are all living in a science fiction, right? We’ve seen The Matrix, the Minority Reports and things like that. Yeah. Sometimes people tend to be influenced by those. But scoping the problem, I think is still a challenge.
Managing Expectations for AI Projects [00:33:12]
Henry Suryawirawan: The other thing I probably see is that, whenever you explain, try to explain this to the stakeholders – they are not necessarily the most technical people – sometimes there are a lot of unknowns, uncertainties. Compared to building software. You kind of know what will be the first beginning and the end output. Once you achieve all the output. Yes. Okay. Tick. 100%. This is your payment, because you deliver the solution. But what about AI/ML? I can see that sometimes, because it needs a lot of iterative process, a lot of probably analysis, sometimes the results just totally bomb, right? It’s very disappointing. So how do you actually manage expectations? Maybe from your experience with Basis AI as well, how do you manage this?
Liu Feng-Yuan: [00:33:47] So you need to think about the validation of the algorithm first, before you try to productionize it. A lot of people try and build the end software first, or build the kind of analytics platform, without thinking about the use case or validating the business impact. So what we would do is say, “Okay, you’ve got a problem you’re trying to solve. Maybe you want to increase your conversion rate or click-through rate on your e-commerce platform. We’ll build you a personalization engine to do that, let’s try it out.” So let’s do a couple of sprints, pick a month, maybe three months, build an MVP of the model. Validate the accuracy of the model. For each case that comes in, what was your old conversion rate? Can we backtest it? Can we A/B test it? And then convince the user on a small scale, “Hey, this works.” And then very quickly from there, if they’re convinced, you then take that and say, “Okay, I’ll guarantee these accuracies, the baseline and the benchmark. We continue to assure you of that, but let’s help you productionize that. And we help to maintain and productionize that later on.” So you have to start very close to the business, validating the accuracy of the model. And again, how the accuracy translates to a business impact. So you can talk about the precision-recall curve, all the various metrics, but that doesn’t mean anything to business. It’s about, okay, taking this, let’s run a business simulation. Let’s make similar assumptions about the business. And based on 80% (accuracy), this is the kind of cost savings we can generate for you. And you break it down for them. Then you have the conversation. So again, you’re absolutely right. Traditional software systems are more functional. Is the user experience pleasant? You can test the functionality. Whether a model works is a very different proposition.
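The step from model accuracy to business impact can be sketched as a small back-of-the-envelope simulation. This Python sketch is only an assumption of what such a breakdown might look like: the function name, the cost model and every number in the example are illustrative, not figures from the conversation.

```python
def estimated_monthly_savings(cases_per_month, automation_rate, accuracy,
                              cost_per_manual_case, cost_per_error):
    """Rough monthly savings from letting a model handle part of a process.

    Assumes each correctly automated case saves one manual handling cost,
    and each wrongly automated case incurs a fixed correction cost.
    """
    automated = cases_per_month * automation_rate
    correct = automated * accuracy
    wrong = automated - correct
    return correct * cost_per_manual_case - wrong * cost_per_error


# Illustrative assumptions: the model handles 80% of cases, humans the rest.
savings = estimated_monthly_savings(
    cases_per_month=10_000,
    automation_rate=0.8,
    accuracy=0.9,
    cost_per_manual_case=5.0,
    cost_per_error=20.0,
)
print(f"Estimated monthly savings: {savings:,.0f}")
```

Rerunning the same function with a lower accuracy or a higher cost per error shows stakeholders how sensitive the business case is to model quality, which tends to be an easier conversation than one about precision and recall.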
Productionizing AI/ML [00:35:07]
Henry Suryawirawan: [00:35:07] So you keep mentioning this productionizing of AI/ML. Could it be one of the other challenges when you want to adopt AI/ML? Maybe you can explain it for the audience here who are not familiar with this problem.
Liu Feng-Yuan: [00:35:19] I think ultimately it’s a multidisciplinary problem. Because data scientists are very good at building algorithms, at statistical knowledge, at understanding the data. They’re scientists by training, but they’re not engineers. They don’t have to keep the lights on. They don’t understand containerization, Kubernetes. They don’t understand networking, cloud infrastructure, site reliability. These things are very foreign to them. And the truth is, a lot of data scientists are very frustrated, because they build this model, and they want to put it into the wild, they want to test it, and then it almost has to be thrown over the wall to the engineering team or infrastructure team, and these guys haven’t been trained to understand ML systems, how they work, how they need to be retrained, how you need to monitor them in different ways. It’s not just throughput and latency. It’s also algorithmic accuracy. It’s also fairness. And so a lot of things get lost in translation. And as I mentioned, I think part of it is also organizational.
So 10 years ago, when data science was getting started, they put data science into IT or technology. And that was a mistake, because data scientists should sit very close to the business. So they moved data scientists closer to the business. Understanding the use case and what the data means is very important, and for that, you need to sit close to the business. But now that they sit close to the business, once they build the model and want to productionize it, they need to go back to the IT teams, the technology teams. And then there’s another barrier to go back to those teams.
So this is where we’ve seen a bottleneck. I’ve had experiences where I’ve got a great model. My data scientists trained a model. Feel very happy with it. It takes another six to nine months for an enterprise to work with a vendor, hand over the model, get them to productionize it, maintain it. Very frustrating process. And then when something breaks, debugging what’s going on is a real pain. So I think this scalability and productionization is a challenge. Now you can stand up a whole machine learning engineering team just to do this for you. But we’ve tried to develop some technology to help the data scientists be more productive. So help the data scientists abstract away Kubernetes concepts. It’s containerized for them. They don’t really need to worry about those. But they can easily get to a microservice and an API that they can run. Now, obviously there are trade-offs. When you do this, you obviously can’t optimize at the very low level. But for most of the use cases, maybe that’s sufficient.
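The earlier point in this answer, that monitoring an ML system covers algorithmic accuracy and fairness as well as throughput and latency, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the record keys (`latency_ms`, `group`, `predicted`, `actual`), the thresholds, and the choice of a simple gap in positive-prediction rates between groups as the fairness signal. It does not describe any specific product.

```python
from collections import defaultdict
from statistics import mean


def health_report(records, max_latency_ms=200.0, min_accuracy=0.8, max_gap=0.1):
    """Summarise a batch of logged predictions.

    Each record is a dict with hypothetical keys:
    latency_ms, group, predicted, actual.
    """
    latency = mean(r["latency_ms"] for r in records)
    accuracy = mean(1.0 if r["predicted"] == r["actual"] else 0.0 for r in records)

    # Positive-prediction rate per group; the spread between groups is a
    # simple demographic-parity style fairness signal.
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(1.0 if r["predicted"] == 1 else 0.0)
    rates = {g: mean(v) for g, v in by_group.items()}
    gap = max(rates.values()) - min(rates.values())

    alerts = []
    if latency > max_latency_ms:
        alerts.append("latency")
    if accuracy < min_accuracy:
        alerts.append("accuracy")
    if gap > max_gap:
        alerts.append("fairness")
    return {"latency_ms": latency, "accuracy": accuracy,
            "parity_gap": gap, "alerts": alerts}


batch = [
    {"latency_ms": 40, "group": "a", "predicted": 1, "actual": 1},
    {"latency_ms": 55, "group": "a", "predicted": 1, "actual": 0},
    {"latency_ms": 60, "group": "b", "predicted": 0, "actual": 0},
    {"latency_ms": 45, "group": "b", "predicted": 0, "actual": 1},
]
report = health_report(batch)
print(report)
```

In this toy batch the service is fast enough, so a latency-only dashboard would look healthy, while the accuracy and fairness checks both raise alerts.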
Henry Suryawirawan: [00:37:13] Is it like, probably one way to solve this problem is to make it more like self-service kind of thing? For the data scientists.
Liu Feng-Yuan: [00:37:19] Yeah, you’re right. It is a lot about self-service, but also about giving the right flexibility when it’s needed, and the right positive constraints when they’re needed. So for example, data scientists use many different open-source frameworks: TensorFlow, scikit-learn, PyTorch, and we want to be quite agnostic about that. So we want to leave it to them to figure out what tools they want to use. At the same time, data scientists don’t want to have to wrangle with, like I mentioned, Kubernetes and Spark. All they want to know is what kind of resources they need. So abstract away the things that they don’t need to worry about, and then allow them full flexibility on the stuff that really allows them to get their work done. The other thing is obviously data scientists like working in notebooks. It’s very interactive, but that’s not great for productionizing something, because it’s very hard to version control a notebook. And so sometimes you have to insist, as a positive constraint, that you need to check everything into a Git repository before you productionize. So sometimes it’s putting in some discomfort, but for good reasons, because the context of experimentation is very different from when you want to productionize. So a lot of our conversations internally are really around what the data scientist’s mental model is and what the engineer’s mental model is, and we realize that they’re very different. But sometimes, you need to nudge certain groups to think in the way that engineers think, because I think software engineering has a much longer history of good engineering process, whereas data science is a much newer domain and discipline.
Role of ML Engineers in Product Team [00:38:31]
Henry Suryawirawan: [00:38:31] So I also think that, like what you mentioned, data science is still in the early days, if I can put it that way. And also, let’s say you want to build a very sophisticated AI and ML system, where you can use the real-time data coming in, have it train by itself and update the model on the fly as well, and provide the API for people to use that in real time. So I think there are a lot of things that need to be done, like how to get the data in real time, how to store it, maybe clean it, wrangle it, and also feed it into the training model, use the elastic infrastructure. I can see this can get very complicated. Would it be logical then in the future that the ML engineer is just one part of the team? We call it a feature team or whatever. These days people adopt Agile, so they like to build a feature team kind of thing. Would there be a day where the ML engineer would become just one member in the team?
Liu Feng-Yuan: [00:39:19] I can see a future where that’s the case. Again, in a world where we just have an even greater ability to collect information and data, when IoT becomes even more ubiquitous. Yeah, so the ability to handle data, to process it, make sense of it, I think will always be there. The interesting thing about, maybe not so much the ML engineer, but the data scientist role is that a data scientist is often the bridge between the real world and the data. A database admin doesn’t really need to understand what the data means. Obviously it has to be valid, but the meaning of the data isn’t important. Whereas the data scientist has to work with the business to understand what the data stored in the database means, and how that can be mapped to business needs. And that’s very contextual. And that’s going to differ every single time. And so the mindset of a data scientist is always about being a bit more creative, understanding the context a bit more, and then building the right models and optimizing for what the business wants. So that part is something that you can’t automate or industrialize that easily. Now the engineering parts, the MLOps components, the right monitoring, the right kind of pipelines, the right building for scale, the right kind of ingestion for real-time streaming, those things, you can engineer. Those things are much more within the realm of engineering. To your point, when you’re thinking about what roles are going to be embedded in every team, it’s also important to think about what roles are really building your core engine and platforms that can be reproducible and replicable and scalable, and that’s going to be a central team that maintains the infrastructure almost. And what stuff tends to require some level of contextualization. And then every product team is going to need that role.
So I think every product team is likely to need a data scientist, and probably an ML engineer, although depending on how technology evolves, maybe that is something that’s going to be part of the core infrastructure team, rather than something that every product team has. We’ll see how technology in the industry evolves.
Data Scientist and ML Engineer [00:41:03]
Henry Suryawirawan: [00:41:03] I have a related question to that. These days, especially because AI/ML is so hot, many people would love to be a data scientist, or do anything related to ML. So my first question will be: is it possible for someone who doesn’t have any background in mathematics or algorithms or even engineering to be, how should I say, a good person in AI/ML? And the second thing is: is it also wise to just go with data science? I’ve heard mixed opinions about this. We probably don’t need a lot of data scientists. We need more people who can productionize it. So what are your thoughts on all this?
Liu Feng-Yuan: [00:41:36] You know, unfortunately in any new discipline, the titles are often confusing. Or they mean different things. So they’re overloaded, either that or not specific enough. My experience has been that a lot of data scientists have a stronger scientific background. They understand statistics, the algorithms. A lot of them aren’t necessarily computer scientists. Some of them are. But they come from other disciplines, like bioinformatics, a lot of good data scientists come from bioinformatics. Some from econometrics. Some from other engineering disciplines. That is one role that I think of as the data scientist. And in that realm, the data scientist needs to understand the domain. So the skillsets and the aptitudes, it’s more the inclination, which I feel is: What does the business want? How does the real world work? How am I gathering this data? How do I interpret this data? And how do I build a model that solves what the business needs? They’re not thinking about scaling. They’re not thinking about, how do I build this once and then generalize and scale it. The aptitude and the inclination of the data scientist is, I think, one skillset. That’s very important, because that often is a translation role that I feel is missing. So a data scientist that can code, can write models, can build models, but can also talk to the business. Super important, a premium in any team that wants to be data driven.
The second thing is obviously you need the ML engineering team, when you need to productionize things. And this is things like: How do I set up Airflow, Spark clusters? How do I wrangle with Kubernetes? And again, that requires getting your infrastructure teams on board as well. These guys have completely different training and background and inclination. The instincts are completely different. I feel like tech is one of those areas where you have to keep learning. So it’s about finding what suits your own inclination and aptitude more than the knowledge, because if you’re motivated and you have that spark, you’ll pick it up. But again, I think ML engineers are less inclined to think about the context or the business use case. They are trying to build systems that can scale, that are resilient and fault-tolerant. And so it’s a very different kind of mindset almost.
Hyper-personalized AI [00:43:16]
Henry Suryawirawan: [00:43:16] So I saw one of your recent talks, it’s about having a hyper-personalized AI experience. Maybe can you share with us, what do you mean by that? Is it like one day I could have someone like Jarvis? For people who watch the Avengers movies, like Jarvis, helping me, assisting me in anything about me and myself? Rather than having a more generic kind of AI/ML model for business and for whatever problems in the world. Could we one day have something like Jarvis for us?
Liu Feng-Yuan: [00:43:42] There are many different connotations when you take stuff from the movies, so you’ve got to be careful and quite precise about what we mean there. But yeah, the whole idea with machine learning is it allows greater personalization, which means tailoring whatever recommendation. You have so much choice. How do I filter down the choice? So it’s a search problem to some extent. How do I filter down the choice and show you the thing that is most relevant? And because of what you’ve seen on e-commerce, on Netflix, I think with more data and understanding about people’s preferences and choices, you can narrow down that field and give people what they want more quickly. There was an interview or talk that I gave that said there’s a danger with this, obviously. Because if your choices are primarily a function of what you used to click on or what you used to want, then I’m just feeding you more of the same. And if you think about human beings as more exploratory, as being open to new ideas and new inspirations and new things, then this is a path dependency. You’re locking people into showing them more of what they want, but not giving them the opportunity to learn about new things. Now, somebody who builds the most sophisticated recommendation engines will say there are ways to surprise users. And of course that’s true. But I think at the high level, you need to think about: are you in a situation where people know exactly what they want and you’re optimizing to show them stuff and get them addicted and hyper-engaged? Is that a good thing? In some cases, it is. In other cases, maybe you don’t want to do that. Because engagement sounds like a good thing. But addiction sounds like less of a good thing. And it really depends on the context.
I mean, a lot of education, if you think about when you were going to school and all that, there were lots of things you didn’t like to learn, or you didn’t like to experience, but because you were forced to, after a bit of that you realize, “Hey, this is really cool and really expands my knowledge.” With all this concern over people being in their own kind of bubbles and echo chambers, can hyper-personalization exacerbate that? Because you only get shown stuff that you want to see, and then suddenly you think, “Oh, everybody’s like you”. And then when someone else says something different, you get outraged, because how can you even think that? Everybody knows that this is the case. And that’s fake news. The other side is seeing all the things that we think is true, and anything that they see that is new is fake news. And that’s a very dangerous kind of world to be in.
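One of the “ways to surprise users” mentioned above is to reserve a share of recommendation slots for items outside what the user already engages with. The Python sketch below shows that idea only; it is not how any particular recommendation engine works. The `recommend` function, the `explore_share` parameter, and the toy catalog are all hypothetical.

```python
import random


def recommend(scores, seen_categories, k=5, explore_share=0.2, rng=None):
    """Pick k items: mostly top-scored, plus a share from unseen categories.

    scores: dict mapping item -> (category, predicted relevance).
    seen_categories: categories the user has already engaged with.
    """
    rng = rng or random.Random()
    ranked = sorted(scores, key=lambda item: scores[item][1], reverse=True)
    # Items from categories the user has never engaged with
    novel = [i for i in ranked if scores[i][0] not in seen_categories]

    # Reserve a fixed share of the k slots for exploration
    n_explore = min(len(novel), round(k * explore_share))
    explore = rng.sample(novel, n_explore)
    # Fill the remaining slots with the highest-scored items
    exploit = [i for i in ranked if i not in explore][: k - n_explore]
    return exploit + explore


catalog = {
    "thriller_1": ("thriller", 0.95),
    "thriller_2": ("thriller", 0.90),
    "thriller_3": ("thriller", 0.85),
    "thriller_4": ("thriller", 0.80),
    "thriller_5": ("thriller", 0.75),
    "documentary_1": ("documentary", 0.30),
    "poetry_1": ("poetry", 0.20),
}
picks = recommend(catalog, seen_categories={"thriller"}, k=5,
                  rng=random.Random(0))
print(picks)
```

A pure relevance ranking would return five thrillers here. With one slot reserved, the user still gets mostly what they want, plus one item from a category they have never tried. How large that share should be is the contextual judgment discussed above.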
Henry Suryawirawan: [00:45:33] Yeah. So I can see this, we all know this known-unknown quadrant thing, right? Could it be possible that with all this hyper-personalized experience, you will be having more unknowns? Because you keep being fed all these. The first choice that you make will be the path that you go down, without you actually knowing there is so much other interesting stuff out there.
3 Tech Lead Wisdom [00:45:50]
Henry Suryawirawan: Okay. I think it’s a very exciting talk. I think due to the time constraint, I have to ask my last question. As usual for all my guests. Would you be able to share your 3 technical leadership wisdom for all of us?
Liu Feng-Yuan: [00:46:00] I’m not sure I have any kind of wisdom, but I’ll share three ideas, and hopefully your audience can see if this is useful to them. I think one thing that I’ve found working with a lot of data scientists and engineers is: when you’re working in data and software, communication is super important. Because it’s so abstract. Because it’s all in your head. And because it’s important to collaborate to be effective in larger teams. Being able to share what’s in your head with your team members is super important, right? If there’s a tax every time there’s some form of I/O, then it’s going to be a very unproductive team. So again, I always ask, especially the engineers and data scientists, think hard about how you make yourself effective in communication, and there are many different ways. Some people talk better. Other people like to write docs, asynchronous docs where everyone chips in. Other people have very good ways of commenting code. Some data scientists write very nice Jupyter notebooks. They’re very nicely organized and visualized. Other guys do really good technical talks, sharing with their team. And others do very nice dashboards, where you can interact with the data and understand what’s going on. The point is: there are many different ways to communicate. Some people say, “Oh, I don’t like to talk, so I’m not a good communicator.” That’s not true. Everyone has a different flavor of how they communicate, but investing in how you communicate is particularly important, I think, in the world of data and software, because everything’s so abstract. If you’re building a bridge, you just point and say, “There!” But here it’s a bit harder. Not everyone can see a kind of physical elephant or physical structure. So communication is one thing that I emphasize quite a bit.
The second thing I would really think about is making sure you’re always learning. I think that’s the reality of being in technology and software. And you need this innate hunger to learn. If learning is tiring for you, then go into a different job where technology evolves only once every 10 years. Because it’s tough, right? You’re always having to keep up. You’re always having to learn something new. I always suggest to the people I work with: look for adjacent areas. So you’re focusing on one area. Look for something adjacent to that. It’s kind of related to what you do, maybe a different angle, and try that out. And then it’s kind of a search through the different technology space. Over time you find yourself in a very different world. But there’s some loose connection between all of them. For me, in university I used to be very interested in typography. Because I used to read books. And I noticed the typefaces were all very different. And then, randomly I thought, data visualization is like typography, but for data. Then I got very interested in data visualization. And then I got very interested in, “Oh, what are the different technologies you can use to visualize data?” So, with my team, we were looking at D3.js and then obviously various tools like WebGL. You learn different ideas from one starting point. Typography to visualization to frontend technology, and then you go from there. And sometimes you end up in very strange places. But that I think is what makes technology interesting.
The third point I would make is: I think a lot of engineers are very happy talking to machines. Sometimes, talking to machines is very different from talking to other human beings. Because machines tend to be a lot more explicit. There’s less hidden information. Or at least you can probe it to find out. With human beings, sometimes if you probe, there’s always a filter. And there’s always this issue of emotions. So in my experience working in all sorts of teams, whether teams are functional or dysfunctional really depends on emotional reasons. It’s emotional relationships. It’s anger. It’s frustration. It’s ego. It’s jealousy. This is why if we were all robots, life would be perfect in terms of human teams. So I would say even to engineers, and if you’re going on a technical career, make sure you think about communicating not just with the machines, but invest in communicating with your teammates. And a lot of it is forming that emotional connection, that trust, working through some of these emotional responses that people have, and that you have, and being aware of that. It sounds very touchy-feely and very non-technical. But if you believe that individuals can’t scale themselves and you need to work well in teams, then you need to think of all different sorts of solutions to make teams effective. Some of that is communication, my first point. Some of that is making sure you have multi-disciplinary skills, my second point. But a third thing is having good relationships, and that’s the more emotional angle to it, that you shouldn’t shy away from. This is my advice.
Henry Suryawirawan: [00:49:38] Right. Wow. I think it’s a very good thing, especially for all engineers out there. Don’t just talk to the machines. Talk to the humans as well.
So Feng-Yuan, thank you so much for your time. I really enjoyed this conversation, especially as I’m not a very sophisticated AI/ML expert. So I learned a lot from your experience and your sharing today. And I hope all the listeners out there can also learn about AI and ML, and whether it is something that you would want to pursue for your career as well. Thanks again Feng-Yuan. Is there any place for people to find you online?
Liu Feng-Yuan: [00:50:08] I try to keep a kind of low profile online. But you can look for BasisAI. I think on Instagram and on LinkedIn, and on Twitter as well. So do follow us and keep track of what BasisAI is doing. And, yeah, I’d be happy to follow up. If people want to reach out, my email address is Feng-Yuan, firstname.lastname@example.org. So if you want to reach out and have a chat, happy to do that.
Henry Suryawirawan: [00:50:28] Thanks for your time, Feng-Yuan. I hope to see you again in a future episode.
Liu Feng-Yuan: [00:50:31] Thanks. Thanks Henry.
– End –