#216 - Practical Data Privacy: Enhancing Privacy and Security in Your Application - Katharine Jarmul

 

   

“Essentially, privacy is a lot about having the autonomy and the control to decide who I want to show up as and what I want to share with whom, under what circumstances or contexts.”

Brought to you by Swimm.io
Start modernizing your mainframe faster with Swimm.
Understand the what, why, and how of your mainframe code.
Use AI to uncover critical code insights for seamless migration, refactoring, or system replacement.

Feeling uneasy about how your personal data is used, and wondering if companies are doing enough to protect it?

In this episode, Katharine Jarmul, author of “Practical Data Privacy,” dives deep into one of the most critical and rapidly evolving topics today. Discover how data privacy impacts you as a user and what organizations should be doing to protect your information responsibly. Learn why simply blaming users isn’t the answer and how we can build a more trustworthy technological future.

Key topics discussed:

  • Understanding Data Privacy: The meaning of data privacy and how it links to autonomy, trust, and choice
  • More Than Just PII: The full scope of sensitive data needing protection
  • The “Spying” Phone Feeling: How too much data collection can be used to infer sensitive details
  • Organizational Responsibility: Shifting data protection burden from users to companies building and deploying technology
  • Privacy by Design: Embedding privacy into tech right from the start
  • Essential Data Governance: Why knowing your data is key to privacy
  • Practical Privacy Techniques: Pseudonymization, anonymization, data masking, and more
  • Privacy Enhancing Technologies: Exploring tools like differential privacy, federated learning, and encrypted computation
  • AI & Privacy Challenges: Using AI responsibly with sensitive information
  • Navigating Privacy Laws: Understanding GDPR, data sovereignty, and global regulations
  • Building a Privacy Culture: Fostering a culture of learning, psychological safety, and risk awareness around privacy

Tune in to learn how we can build a safer, more responsible, and trustworthy digital future for everyone.  

Timestamps:

  • (01:20) Career Turning Points
  • (02:14) Data Privacy Landscape
  • (07:45) PII (Personally Identifiable Information)
  • (11:33) Data Privacy Risk in Current Technologies
  • (14:13) Data Utility vs Privacy
  • (19:01) Privacy by Design
  • (24:19) Data Governance
  • (29:06) Retention Schedule
  • (31:10) Data Privacy Practices & Techniques
  • (34:09) Privacy Enhancing Technologies
  • (38:52) Fostering Data Privacy Practice & Culture
  • (47:05) The Legal Aspects of Data Privacy
  • (51:10) AI and Data Privacy
  • (56:08) 3 Tech Lead Wisdom

_____

Katharine Jarmul’s Bio
Katharine Jarmul is a Principal Data Scientist at Thoughtworks Germany and author of the recent O'Reilly book Practical Data Privacy. Previously, she held numerous roles at large companies and startups in the US and Germany, implementing data processing and machine learning systems with a focus on reliability, testability, privacy, and security.

She is a passionate and internationally recognized data scientist, programmer, and lecturer. Katharine is also a frequent keynote speaker at international software and AI conferences.

Follow Katharine:

Mentions & Links:

 

Our Sponsor - JetBrains
Enjoy an exceptional developer experience with JetBrains. Whatever programming language and technology you use, JetBrains IDEs provide the tools you need to go beyond simple code editing and excel as a developer.

Check out FREE coding software options and special offers on jetbrains.com/store/#discounts.
Make it happen. With code.
Our Sponsor - Manning
Manning Publications is a premier publisher of technical books on computer and software development topics for both experienced developers and new learners alike. Manning prides itself on being independently owned and operated, and for paving the way for innovative initiatives, such as early access book content and protection-free PDF formats that are now industry standard.

Get a 45% discount for Tech Lead Journal listeners by using the code techlead24 for all products in all formats.
Our Sponsor - Tech Lead Journal Shop
Are you looking for a new cool swag?

Tech Lead Journal now offers you some swags that you can purchase online. These swags are printed on-demand based on your preference, and will be delivered safely to you all over the world where shipping is available.

Check out all the cool swags available by visiting techleadjournal.dev/shop. And don't forget to show off your swag once it arrives.

 

Like this episode?
Follow @techleadjournal on LinkedIn, Twitter, Instagram.
Buy me a coffee or become a patron.

 

Quotes

Data Privacy Landscape

  • I’m not trying to diminish the importance of law and regulation in the field of privacy. But I think we can all relate to privacy, because each of us has some personal understanding of our own privacy. And often that’s informed by cultural influences we grew up around, where privacy and trust played a role in societal bonds we grew up with or where we live now.

  • Sometimes people hear data privacy, and they just think, oh, that’s for lawyers.

  • Essentially, privacy is about having the autonomy and control to decide who I want to show up as and what I want to share with whom, under what circumstances or contexts.

  • We all know there have been times in our lives when information we didn’t want to share with a particular person or group got out, sometimes through technology, sometimes through a person, maybe both. And we all know that icky feeling when your privacy is violated. That feeling teaches us how much privacy is about trust, context, and having a choice.

  • We have regulatory or judicial legal understanding of privacy, and we have a technical understanding of privacy. The best part is when all of those work together. When the technology is reflecting not only what we legally need to do, but also what we socially and culturally understand.

PII (Personally Identifiable Information)

  • PII is a very small view of data that we might think of as sensitive data. Within the field of sensitive data, we have PII. That’s things like email addresses, birthdates, names, things that might be unique to an individual, and certainly in combination are unique to an individual.

  • And then we have what I call person-related data. Things like what I click on, what I buy can be person related. What I enter into search terms can be person related. When we think of all these things in combination with each other, those can point to uniquely identifiable characteristics of a person.

  • And you mentioned Cambridge Analytica. Some of the original research asked: can we use things like Facebook likes, or short Facebook surveys, or combinations of them to infer how a person might vote, or whether they’re easily influenced in their voting, characteristics we would think of as sensitive?

  • And then we also have other sensitive data that isn’t person related that we call proprietary or confidential data. Often, we don’t think of that when we think of privacy, because it doesn’t really have to do with privacy. Yet the types of protections that we use for privacy can also be applied to things like corporate secrets, other proprietary confidential information that we don’t want to share outside of a particular context.

Data Privacy Risk in Current Technologies

  • Obviously, everybody is different. There are different levels of trust that people have with different services, and you might decide the usefulness of a service is enough for whatever trade-off you make.

  • We run into this problem, and it happens a lot in information security or cybersecurity, where we decide that it’s a user problem: it’s your job to figure out which email service is the most secure and which tool is the best.

  • I call this a blame-the-victim kind of culture: if I want to talk to a person and they only use one particular messaging application, why is it my job to work out what is or isn’t the most secure? Why can’t we hold a higher bar for organizations to create secure and private-by-design services, so that people can choose whatever they like?

  • If we’re looking at privacy more holistically, it’s less about “you must use this software” and more about holding software companies and product companies to a higher standard, so that people can choose whatever they like. I’m very much for giving people that choice.

  • There’s lots of good advice out there, how to inform yourself, doing things like reading through privacy policies. There’s plenty of good advice to inform yourself on how to make those choices. But it focuses the blame on the individual who has to use things. At the end of the day, I want people to be able to connect and use technology in cool ways and not have this privileged burden of reading through every privacy notice to figure out which best aligns. By the way, then it gets updated six months later, then you have to read through and compare all of them over again.

  • Mozilla has a great resource for people looking for what criteria to use called Privacy Not Included, and that reviews things like automobiles, connected devices, home assistants, and other things. You can look at what criteria they use and maybe start to build your own criteria to inform your choices should you have the time and energy to do so. And there’s no shame if you don’t, because at the end of the day, we need to hold the orgs accountable.

Data Utility vs Privacy

  • There’s a popular shift toward consumers thinking more about privacy, largely because the big data era of collecting at will has led to things that make people feel creeped out, that make people feel like their phone is spying on them.

  • In some cases, that might be true. In other cases, we have so much data that we can infer sensitive things about people without their knowledge and without our intention, especially when we think about recommendation algorithm development or search response algorithm development.

  • There can be latent variables. What I mean is because you follow and like these five things, we have learned your gender or age or family status or other things. We didn’t explicitly try to learn them, but we learned them by amassing so much data and putting an algorithm on top without controlling for privacy. And then you get super creepy ads, and you’re convinced your phone is spying on you, and you must hide it.

  • I don’t want people to be afraid of tech. I don’t think any of us want people to be afraid of technology. I also don’t want to be afraid to have a mobile phone. It’s a nice thing to have.

  • One thing you can start talking about in an organization that may not have a culture around thinking about privacy is how we reason about the trust that we’re creating.

  • This is great for tech leads and product people. When you’re a technical lead, principal, or senior person making architecture, product, and data decisions, or a top product person, you can have a real conversation about whether we want to talk with our customers about this problem of privacy.

  • There are businesses where privacy may not play a huge role by default, but you can at least start that conversation. What we often talk about in technical privacy is balancing between the most privacy we can offer, which would be collecting no data, and utility: what we think of as the privacy-utility balance or trade-off.

  • I would argue, as a data person if you don’t know what you’re going to use the data for, you’re probably just amassing large cloud computing fees for no reason. It’s better to have smaller experiments with less data and then grow it over time.

Privacy by Design

  • Privacy by design was developed in the early 2000s by a Canadian researcher, Ann Cavoukian. When you read the original Privacy by Design, there are seven design principles, and you see a lot of software thinking. It talks about the same themes we’re talking about now: allowing user choice and transparency, but also building things by default that respect the user and respect that privacy is informed by context.

  • If I put something in a form, it doesn’t necessarily mean I want that to be used to train an algorithm. We need to understand how the user imagines what they’re doing and build technology that mirrors that or makes it obvious what is happening.

  • And then security principles, not only application security, but also infrastructure security principles, so we don’t end up exposing something we’ve collected in trust that we’re using responsibly, but then open the system up for potential attacks or ways of exfiltrating that data for another purpose.

  • We can take privacy by design, use that to architect and build our systems, and then evolve that thinking for new things like building large-scale data systems, algorithmic systems.

  • And then we have a huge problem now around third-party data systems, which aren’t private by design. When we think of third party, it can be services we’re using, where we are the intermediary between the vendor and our users, with whom we have a direct relationship and trust.

  • That’s fine, as long as we’re reviewing the third-party vendors we use for privacy. Or it’s the other way around, where we are the third-party vendor and have zero access to the users. That describes a lot of us who work in the B2B space.

  • How many layers until we get to a human who can choose whether they want their data used that way? And do we have a good review process to make sure that chain of trust is verified?

Data Governance

  • Data governance is critical, even outside of privacy. If you don’t know what data you have, where it lives, what pathways it goes through, where it’s being processed, how it’s being processed, by whom, along the way, then you end up in a place where privacy is not possible to some degree.

  • Hopefully, you have a strong practice of vendor review and vendor assessment, because thinking through risk means asking: can we create a reasonable way, as an organization, to think about which third parties, intermediaries, and SaaS providers we work with?

  • There’s no magical right choice, but there needs to be a commitment to some choice and thinking through why you choose one vendor over another that relates to principles like privacy and data security.

  • Beyond that, if we don’t understand the data flows through our systems, if we don’t have data properly tagged - some people use tagging, others use categorization. If we don’t have a reasonable way of organizing the data catalogs and data stores that we have, then we don’t have any reasonable way to think through retention schedules.
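
To make the tagging and cataloging idea above concrete, here is a minimal sketch of what a catalog entry with an owner, a sensitivity label, and a retention period might look like. The field names, sensitivity categories, and example datasets are illustrative assumptions, not a prescribed schema or any particular tool’s format.

```python
from dataclasses import dataclass, field

# Illustrative sensitivity taxonomy; a real program defines its own.
SENSITIVITY = ("public", "internal", "person_related", "pii", "confidential")

@dataclass
class CatalogEntry:
    """One dataset (or column) tracked by a simple data catalog."""
    name: str                  # e.g. "orders.customer_email" (hypothetical)
    owner: str                 # team accountable for the data
    sensitivity: str           # one of SENSITIVITY
    purpose: str               # why this data is collected
    retention_days: int        # how long it may be kept
    processors: list[str] = field(default_factory=list)  # vendors/systems that touch it

    def __post_init__(self) -> None:
        if self.sensitivity not in SENSITIVITY:
            raise ValueError(f"unknown sensitivity: {self.sensitivity}")

# Even this much lets you answer: what do we have, who owns it,
# which vendors see it, and when must it be deleted?
catalog = [
    CatalogEntry("orders.customer_email", "checkout-team", "pii",
                 "order confirmation emails", retention_days=730,
                 processors=["email-provider"]),
    CatalogEntry("clickstream.page_views", "analytics-team", "person_related",
                 "product analytics", retention_days=90),
]
```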

Retention Schedule

  • What is a retention schedule? It means that we collect data for a certain purpose for a given timeline. Maybe we collect purchase data for every customer. And we say we retain purchase data for as long as you’re a customer or for up to two years after you close your account.

  • How do we enforce that if we haven’t bothered to connect the data to a catalog in the first place? Then somebody has to write some nasty SQL query, which may not even work if somebody did bad data entry. This all relates back to better data governance, which relates to things like data cataloging.

  • But also things like thinking through data quality metrics and how we ensure we’re getting value out of the data we collect.

  • If you improve trust relationships with your customers, you will get better data quality. That’s proven time and time again. If you’re just spamming, collecting whatever you can, especially through some intermediary, you’re probably getting pretty low quality data.

  • If the signal you’re trying to measure is something like “is this customer interested in ABC,” and you have a good enough relationship with the customer that you could just ask them directly, then you have both high data quality and high trust. You’ve created a good privacy and security relationship with your customer. And you know that the data you’re storing is accurate.

  • It is not a trivial undertaking to do data governance at scale. Most organizations are not doing it very well. They’re just starting right now.
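
As a rough illustration of what enforcing a retention schedule can look like once data is cataloged, here is a small sketch of a scheduled purge job. The `purchases` table, its `account_closed_at` column, and the two-year window are hypothetical examples, not taken from the episode; real pipelines would also handle backups, replicas, and audit logging.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical policy: purchase records are kept for at most two years
# after the related account was closed.
RETENTION = timedelta(days=730)

def purge_expired_purchases(conn: sqlite3.Connection) -> int:
    """Delete purchase rows whose retention window has passed.

    Assumes a `purchases` table with an ISO-8601 `account_closed_at`
    column; both names are made up for this sketch.
    """
    cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
    cursor = conn.execute(
        "DELETE FROM purchases "
        "WHERE account_closed_at IS NOT NULL AND account_closed_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount  # rows purged; worth writing to an audit log

# Run this from a scheduled job (cron, Airflow, etc.) so enforcement is
# routine rather than an ad hoc "nasty SQL query".
```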

Data Privacy Practices & Techniques

  • These are not decisions to be made in a vacuum about what controls get applied, where or how we decide what we want to do with what data.

  • How do we decide how sensitive a piece of data is? Hopefully your organization has the ability to create a data governance board or practice, or a data privacy practice, where multiple disciplines are in the room. You need people who are business informed and know the business goals of why we’re trying to do this thing, plus people who are data informed, software and tech stack informed, regulatory informed, and privacy informed. You can also hire privacy professionals who are not legal experts, but instead focus on privacy by design.

  • Then we get into the fun bits, which we call privacy enhancing technologies - different technologies we can use to help meet these goals of privacy in our organization. Most of us have seen pseudonymization that can take many forms. It can be simple masking. Identify things like PII and mask them.

  • Masking can take several forms. You’ve probably already used some hashing mechanism, maybe a one-way hash or a reversible hash; these can be ways to pseudonymize information. You’ve probably also used redaction: simply removing certain types of sensitive data from a reporting tool or BI dashboard to say, this is ready for organization-wide consumption, or this is ready for our marketing report. Each of these is a small privacy mitigation, and they can lead up to more serious privacy mitigations like anonymization, or thinking through local-first data processing, distributed data, or encrypted computation.

  • First, you have to shape the problem, as every developer knows. If you don’t know what problem you’re trying to solve, then just throwing a technology at it is not going to solve it.
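
To ground the masking, hashing, and redaction ideas above, here is a minimal sketch in Python using only the standard library. The email-focused helpers and the hard-coded key are illustrative assumptions; production systems would use a managed secret, a broader PII detector, and a vetted library rather than these toy functions.

```python
import hashlib
import hmac
import re

# Hypothetical secret for keyed hashing; in practice this comes from a secrets manager.
PSEUDONYM_KEY = b"replace-me-with-a-managed-secret"

def mask_email(email: str) -> str:
    """Masking: keep just enough shape for debugging, e.g. 'k********@example.com'."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}{'*' * max(len(local) - 1, 1)}@{domain}"

def pseudonymize(value: str) -> str:
    """One-way keyed hash (HMAC-SHA256): the same input always maps to the
    same token, so joins still work, but the original value cannot be
    recovered without the key."""
    return hmac.new(PSEUDONYM_KEY, value.encode(), hashlib.sha256).hexdigest()

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    """Redaction: strip email addresses out of free text before it lands in
    a report or dashboard."""
    return EMAIL_RE.sub("[REDACTED EMAIL]", text)

if __name__ == "__main__":
    print(mask_email("katharine@example.com"))
    print(pseudonymize("katharine@example.com"))
    print(redact("Contact katharine@example.com for details."))
```

None of these alone makes data anonymous; they are the small mitigations the quote describes, layered on top of governance and access controls.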

Privacy Enhancing Technologies

  • There have been cool developments over the past 10 years, and the ones I’m most excited about are the ones that show up in the book - technologies like differential privacy, which is a way we can reason technically about privacy.

  • I call it measurable privacy or rigorous privacy, because we’re using a scientific way of thinking about and shaping the problem. This works well for things like releasing data publicly or thinking through anonymization. It can work in many different use cases, but it can never work if you’re trying to say customer A wants B. It has to work in some form of aggregation of an idea or a person.

  • When we think of anonymization, we cannot say that we release something anonymized that can be tracked back to one individual. Because then it’s not anonymized by definition.

  • There’s an increasing number of open source libraries that allow you to think through differential privacy from a development context.

  • The next most famous one in recent years has been federated learning or federated analytics. Sometimes I call it distributed learning. It relates to what I call the field of local first software.

  • Martin Kleppmann and a group of folks have popularized this movement of local-first data. Now we have edge devices and edge compute like mobile devices, and within whatever AWS cluster you run, you have plenty of compute. Can we push processing as far down to the edge as we need, and therefore push the data as far down to the edge as we need? This might mean cross-silo learning or cross-silo analytics. Say you’re a multinational company or in a B2B situation, and you allow every company to house their data within their own premises.

  • Within their own cloud system boundaries, they don’t exchange data, they exchange analysis or output. This could be machine learning processing, simple analytics, or building a mobile application where I keep all data local, only send certain artifacts, and push that back up to centralized compute or redundancy compute. Depending on how that works, you could offer things like end-to-end encryption.

  • You’d do this because A, you’re greatly reducing your exposure risk. You don’t have centralized data, just centralized insights. How attractive is it to hack your system? Much less so with less data. And B, if you can get the same insights, you’re saving a huge amount on cloud compute.

  • Think how differently we could engineer and architect our systems if we thought through what data we actually need. Can we do that in a privacy-respecting manner, so the data stays out of centralized setups and out of data-sharing arrangements where companies exchange sensitive data at scale?

  • Then there’s the third type of privacy enhancing technology I’m excited about - encrypted computation. People might have heard about it as homomorphic encryption or secure multi-party computation.

  • This enables use cases based on cryptography, which allows you to compute insights on data without decrypting it. Eventually somebody will decrypt it. The insight gets decrypted and used at some point. But the actual processing can be done in an encrypted state. That provides a quality we call secrecy, which is different than privacy.

  • In our cloud compute, we only computed in encrypted form, then moved back to your device where you personally decrypted it. By design, we never saw the data in an unencrypted state except once it was on device or in a data sharing landscape. We only process data with another company by keeping it encrypted so neither company could learn anything other than the final results.

  • Any intermediary results or anything that might leak individual level privacy, we cover up with a secrecy blanket. When we decrypt and remove that secrecy blanket, we’ve ensured the end result has enough privacy.
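
As a toy illustration of the differential privacy idea described above, here is a sketch of a single noisy counting query. It is only meant to show the shape of the technique; the epsilon value and data are made up, and real releases would use a vetted open-source library (for example OpenDP or Google’s differential-privacy library) and track a privacy budget across queries.

```python
import random

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise: the difference of two Exponential(1)
    draws follows a standard Laplace distribution."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person's
    record changes the result by at most 1), so Laplace(1/epsilon) noise
    hides any individual's presence. Smaller epsilon means more noise and
    more privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: report roughly how many users opted in, without the exact count
# revealing whether any one person is in the data.
opted_in = [True, False, True, True, False, True]
print(dp_count(opted_in, lambda x: x, epsilon=1.0))
```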
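
Encrypted computation is harder to sketch briefly, but additive secret sharing, one building block of secure multi-party computation, shows the flavor of computing a result that no single party could learn on its own. This toy example ignores networking, collusion, and malicious parties; the salary figures are invented for illustration.

```python
import secrets

Q = 2**61 - 1  # large modulus; all arithmetic is done mod Q

def share(value: int, n_parties: int) -> list[int]:
    """Split a private non-negative value into n additive shares that sum to it mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def secure_sum(private_values: list[int]) -> int:
    """Each party splits its value into shares and hands one share to every
    party; each party adds up the shares it holds; only the total is ever
    reconstructed, so no party sees another party's raw value."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]        # party i shares its value
    partials = [sum(all_shares[i][j] for i in range(n)) % Q   # party j's local sum
                for j in range(n)]
    return sum(partials) % Q                                   # reveal only the total

salaries = [52_000, 61_000, 58_500]   # each value stays with its owner
print(secure_sum(salaries))            # 171500, computed without pooling raw data
```

The same "exchange outputs, not data" shape is what federated analytics aims for, just with model updates or aggregates instead of a single sum.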

Fostering Data Privacy Practice & Culture

  • If you’re starting with basic data governance, you’re not going to deploy homomorphic encryption tomorrow. You have to start with the building blocks.

  • From lean product thinking, think through what types of advanced technologies fit the problems we have. Maybe it’s exploring local first development. Then pull in small use cases, small product features or new product launches and say, we’ll cordon off 20% of the planned development time to experiment. Could we use this new technology to help expedite future products so they are launched in a private by design or secure by default mentality?

  • What you’re doing is, A, giving people interested in the topic space to start exploring. As you know from any new development technology, if you give people space and they have interest or motivation, they can go far. You’re developing this culture of learning and experimentation, which we all need. And B, you’re boxing it in enough that it doesn’t become a blocker, with the technology you’ve always used still there as a backup. You’re growing the experience and knowledge.

  • When you start focusing on the process rather than the outcomes, you’ll see the rewards happen over time.

  • How do we explain to business stakeholders? Maybe you don’t have to explain. It will become evident by allowing time in the process. You can say simply that we need to stay up to date. We need to modernize how we think about privacy. This will pay off in the long run by allowing us to build platforms that do privacy at scale, which is the end goal.

  • The advocates that come from allowing your developers or data people or technologists to learn more deeply about this will pay off in itself. Those people will become advocates that educate others in the org about how this works. They will find clever ways of explaining it if you give them time and space to learn.

The Legal Aspects of Data Privacy

  • It’s already complicated, and it’s probably going to get more complicated rather than less.

  • When I talk about this with high-level stakeholders, we have to think about it as future-proofing the data strategy. If we develop a data or AI strategy only within what we know today, we’re not going to be prepared for the future. Even today, we have a very fragmented legal landscape around almost everything data-related, certainly data privacy. This also applies to the data security and data governance protections that different organizations are looking at.

  • One big trend that I don’t see going away soon is data sovereignty laws, which are about jurisdictional control of data. For example, data cannot leave a certain nation state or group of nation states. Or data cannot travel across untrusted nation states. If you work in government or government adjacent, you already know this pain. Now the pain is coming to you, whether you’re government adjacent or deal with highly secure systems or critical infrastructure. It’s ballooning out to other areas.

  • At the end of the day, this is a policy problem internally at our organization to decide our stance towards regulation. Where do we exist, what’s important to us? The legal experts, not technology experts, get to decide our risk posture and policies related to what we know about the data, how it’s stored, where it’s stored, what it’s used for. Then we step in as technologists at that policy and principles level and say, this is possible, this is not possible. This is a good idea, this isn’t because… Then we can talk about architectures, clouds we use, infrastructure questions alongside privacy technologies.

  • There’s a non-trivial number of legal and privacy professionals who want to learn more about technology or already know a lot about it. If you build those bridges, you can have useful conversations about what this looks like. That’s probably the healthiest approach, rather than avoiding the topic because it’s scary. You don’t want to talk with lawyers because they might say no. Then five years down the line, they ask why we architected it this way, and there’s no answer other than we didn’t have the conversation early enough.

AI and Data Privacy

  • There are new developments in thinking through A, how do we build machine learning or AI systems that can utilize private data without problems. And B, if we don’t build the AI models ourselves, which most organizations still don’t, how do we build a protective blanket around these interfaces to third party AI systems or AI APIs?

  • This is a growing trend and problem. It will come for everyone, including people who are not machine learning experts.

  • This is a growing field of practice and you don’t have to learn it to the greatest depth. Just start to inform yourself. What might it look like if we built this RAG system with privacy or security? How could that look different? Do we have the capabilities now? If not, what capabilities would we need to grow?

3 Tech Lead Wisdom

  1. Even if you’re not inspired to learn, I ask that you empower people on your team or in your organization to learn.

    • The field is quite deep and intensive. It’s not necessarily a field that many people specialized in during their studies, so you want to open doors for people to grow and learn, foster that ability to specialize, and build that culture of learning. As tech leads, if it’s coming from you, it’s much more powerful.

    • Foster that learning. Support your people in learning, create those spaces for learning, and try to push back on delivery deadlines to allow for that learning cushion.

  2. I don’t think all risk conversations have to be scary.

    • It’s natural, and I empathize with it being scary, because it is a big and serious job to talk about the risk to people’s data and that trust relationship.

    • Yet if we can normalize conversations around risk, if we can allow space to have scary topics be part of our road mapping, part of our planning, where we regularly do privacy risk reviews, security risk reviews, auditing reviews. If we can normalize that like we do with testing for bugs or figuring out if something is deployed properly, then we make it more tangible. It becomes less scary by default, and we make it less uncertain. Most of the scariness comes from this uncertainty problem.

  3. Related to both of these: creating a team culture and organization culture of psychological safety.

    • We’re not going to get everything 100% right all the time. We can foster learning, foster risk discussions, and build more private and secure systems by having real conversations, by escalating problems when they exist, when we see them. And by empowering people to have reporting up and down about what we think about problems like privacy and security.

    • By fostering this safety in conversations around these problems, we end up creating better software systems and products in terms of privacy and security.

    • Make friends with privacy people at your org. Be nice to them. Just reach out. If they have a legal background, they’re never going to say yes or no, which is fine. That’s what they’re trained to do. But they’ll give you advice, steering ideas, guiding questions. It’s going to help you. So be friendly with them.

Transcript

[00:01:20] Introduction

Henry Suryawirawan: Hey, guys. Welcome back to another new episode of the Tech Lead Journal podcast. Today we are going to cover a topic that is becoming trendy in most parts of the world, data privacy. I have Katharine Jarmul here. She is the author of a book titled Practical Data Privacy: Enhancing Privacy and Security in Data. So Katharine has a lot of background expertise in machine learning and data science. But today, we’ll try to cover the topics in more generalist terms. And maybe we can discuss the importance of data privacy and maybe some potential risks that you need to be aware of whenever you build your projects or products. So Katharine, welcome to the show.

Katharine Jarmul: Thanks so much, Henry. I’m excited to be here.

[00:02:14] Career Turning Points

Henry Suryawirawan: Right. Katharine, I’d love to ask my guest to maybe share some turning points in your career that you think we all can learn from that.

Katharine Jarmul: I think one of the turning points in my career, the early ones, is I started my job in technology, as a data journalist, or what we would today call a data journalist. Kind of like how do we use data and interactives to tell stories and to support investigative journalism. I was at the Washington Post doing that. There was certainly a large turning point where I left kind of the field of media and journalism and I went into startups. And at the time, I went into startups that were focused on media companies as a customer base. And that led me into what I would today call large scale natural language processing. Or very early models compared to what we now call AI models, but still a lot of exposure to how do we do language processing at scale. And a lot of exposure to thinking through things like parallel computation. Those were the days of Hadoop and all of these other types of problems that came with how do we process large scale document stores and use them for some sort of other service.

And then, around 2014 is when I moved from Los Angeles, California, where I was based, to Berlin, Germany, where I now live since then. And this was also the time when NLP, so the space of language processing, was moving from more simple model designs or what we would today maybe call more simple model designs, into deep learning which basically fuels a lot of the architectures that we talk about today when we talk about AI. And I basically took a year off of working and just studied deep learning at that time to kind of update. I already knew a lot of the math behind it, because thankfully I always loved math, but to really upscale myself from how do we think of the problem from a statistical point of view, statistical learning. At the end of the day, deep learning is also statistical learning, but how do we move from that space to maybe a more kind of notion of applying concepts of calculus and linear algebra to a learning model. And that was a good idea because yeah, kind of deep learning then ate most of the world, um, of machine learning.

Then the final career change was maybe about three or four years after that of doing deep learning, I started thinking or became enmeshed in the problem of kind of how do we think about what we would today call trustworthy and responsible AI development. Then most of the time we use the concept of ethical machine learning. And in looking into that problem, I became very interested in data privacy and data security of machine learning systems and models themselves. And that kind of fueled what eventually led to writing the book and obviously my current work, which is as a specialist in privacy and security and machine learning models.

Henry Suryawirawan: Well, thank you for sharing your story. I find it quite interesting the journey that you had, right? And I think it’s very unique that you took a career break simply to study deep learning and all that, right? So most people maybe took a break to do something else, you know, beyond work. Some did go study, but I think it’s still pretty rare to go deep into a certain area and just to study and do maybe more things to get expert on that.

[00:07:45] Data Privacy Landscape

Henry Suryawirawan: So today’s topic, we are going to cover about data privacy. So I think in a lot of parts of the world, this topic or this kind of thing has become like the mainstream topics, right? We heard a lot in news, data breaches, you know, data leaks, people’s privacy, you know, the Cambridge Analytica kind of thing also is part of the privacy thing. So maybe if you can give us an overview first, what is the current landscape of the data privacy thing and what should people know about it?

Katharine Jarmul: Yeah, I mean, a lot of times I try to orient people first with the idea of kind of personal, or we can think of like sociological privacy. Because sometimes people hear data privacy, and they just think, oh, that’s for lawyers. Like, those people are responsible, they studied privacy law. Those people are really, really important in the field of privacy.

I’m not trying to diminish the importance of law and regulation in the field of privacy. But I think we can all relate to privacy, because each one of us probably has some sort of personal or what we might call individual understanding of our own privacy. And often that’s informed by whatever cultural influences we grew up around, whatever society like where did privacy play or where did trust play in whatever societal bonds we kind of grew up with and we’re socialized with or maybe even where we live now.

So like for example, now I live in Germany, has a very different relationship with privacy than where I come from, California. And so we can also be influenced by kind of shifts or changes along our life that allow us to see the problem differently. But essentially, privacy is a lot about having the autonomy and the control to decide who I want to show up as and what I want to share with whom, under what circumstances or contexts. And I think we all know that there’s been times probably in our lives when information that we didn’t want to share with a particular person or with groups of persons got out, some which way, sometimes through technology, sometimes through a person, maybe both. And we all know that icky feeling when your privacy is violated. And that’s how it can also teach us how much it is about trust, how much it is about context, how much it is about us having a choice. And I think that you can see kind of the social constructs in a lot of the regulation that you then read. Because then, of course, we have regulatory or what I would call like judicial legal understanding of privacy, and then we have a technical understanding of privacy. And kind of the best part is when all of those work together. When the technology is reflecting not only what we legally need to do, but also what we socially and culturally understand and knowing that that can shift depending on all sorts of different things, down to the individual level.

Henry Suryawirawan: I like the three things that you mentioned, right? First, it maybe relates to the context and society, right? So privacy in one particular area might be different than maybe the norm or the culture in some other parts of the world. I also like the trust thing because like most of the time when we use software, right? The concept of trust is a little bit abstract, it’s vague, right? I know that I submit my details, but there’s not necessarily a trust being created or established when I use the software. And the last one is about choice, right? I do have a choice to actually maybe take out my data, delete my data, or whatever that is, right? So I think those are the key terms that I just picked from what you explained just now.

[00:11:33] PII (Personally Identifiable Information)

Henry Suryawirawan: And a lot of people actually associate privacy with PII, right? Personally Identifiable Information. Do you think that privacy relates just to PII or is it more than that?

Katharine Jarmul: Yeah, I mean, PII is a great place to start. I don’t want to PII bash here. But I think PII is a very small view of data that we might think of as what I like to call sensitive data. Within the field of sensitive data, of course, we have PII. That’s things, just for people who don’t know that term, that’s things like email addresses, birthdates, names, things that we would say might be unique to an individual, and certainly in combination are unique to an individual. And then we might have what I tend to call person related data. Not everybody calls it that. Some people want to call it personal data, some people call it, yeah, data related to persons. Anyways, there’s all sorts of terms for it, especially when you get into the legal realm.

But I think of person related data as like things like what I click on, what I buy can be person related. What I enter into search terms can be person related. Because when we think of all these things in combination with each other, probably when we combine them, those can also point to uniquely identifiable characteristics of a person. And you already mentioned Cambridge Analytica, some of the original research was focused on, can we use things like Facebook likes or very short Facebook surveys or those combinations to decide how this person might vote, whether they’re easily influenced in voting and these types of characteristics, which we probably would think of as sensitive.

And then we also have other sensitive data that isn’t really person related that we might call proprietary or confidential data. And a lot of times, we don’t think of that when I think we think of privacy, because obviously it doesn’t really have to do with privacy. And yet the types of protections that we use for privacy can also be applied to things like corporate secrets, other proprietary confidential information that we also don’t want to share outside of a particular context. And so these all kind of can fall under some idea of data that we might need to protect in a different way than other data.

Henry Suryawirawan: Nice. Thanks for all the classifications of different person related, right? It could be the sensitive, highly sensitive ones. It could be just the person related. It could be like confidential data. So I think that totally makes sense, right? So not necessarily just like person related. It could be entity related.

[00:14:13] Data Privacy Risk in Current Technologies

Henry Suryawirawan: And I think a lot of times when we talk about privacy these days, there are a lot of technologies that could expose us to this privacy risk, right? I mean just to mention a few things like social media, definitely. And then you have the web technologies like cookies, you know, when you browse a certain things, right? It tends to follow you from one page, one website to the other. You have the connected device, you know, we talk about, you know, this Alexa or, you know, the Google Home thing, right? And lately also AI, right? So people just, you know, ask AI about certain stuff. Sometimes, it could take your data away. Maybe looking at those technologies, what will be your advice for people when they use those technologies, right? Because it’s such a ubiquitous thing now. What should people care about in terms of their privacy?

Katharine Jarmul: Yeah, that’s, um, that’s a hard question, because obviously everybody might be different. And again, like we said, like there’s also different levels of trust that people might have with different services, right? So you might really, really like Copilot or whatever it is that you use every day and you might decide, you know what, the usefulness of this is enough for whatever trade off that I have.

At the end of the day, a lot of my work is focused around informing organizations what they should be doing to better protect the people’s privacy that use their services and products. And I think we run into this problem, and I think this happens a lot also in information security or cyber security, where we kind of decide that it’s a user problem and that like it’s your job to figure out like what you think is the most secure email service and what is like the best thing. And like, oh, you’re bad if you use WhatsApp because WhatsApp’s not secure, which is not necessarily true, right? But the, this kind of culture, I call it kind of like blaming the victim type of culture where, why is it that I, if I want to talk to a person and they only use one particular messaging application, why is it my job to make sure to try to inform people what is the most secure or not most secure and this. Why can’t we hold a higher bar for organizations to create secure and private by design services? So that people can choose whatever one they like. If they like the colors, I don’t care. They like the interface. They like whatever product. They like… they want to buy a package deal from a certain cloud provider. If we’re looking at privacy more holistically, it’s less about like you must use this software than we need to hold software companies and product companies to a higher standard so that people can choose whatever it is they like, which I’m like very pro choice on that one of, uh.

Obviously, there’s lots of good advice out there, how to inform yourself, doing things like reading through the boring privacy policies and figuring out how to make it a fun game for yourself. There’s plenty of good advice to inform yourself for how do you make those choices. But I think it almost focuses like the blame on the individual who has to use things. And at the end of the day, I want people to, yeah, be able to connect and use technology in cool ways and not have this very privileged burden of reading through every single privacy notice to figure out which best aligns. And oh, by the way, then it gets updated six months later, then you got to read through and compare all of them over again.

Mozilla has a great resource, by the way, for people who might be looking for what criteria might I use called Privacy Not Included, and that reviews things like, I think they did one on autos, like automobiles, I think they did one on like connected devices, home assistants, like you mentioned, and a few other things. And then at least you can look, like what criteria did they use and maybe start to build your own criteria to inform your choices should you have the time and energy to do so. And there’s no shame if you don’t, because at the end of the day, we got to hold the orgs accountable.

Henry Suryawirawan: Right. So very interesting the way you explain that, right? So it used to be, you know, like it’s a user’s problem, right? So the one who is responsible to take care about the data that you share. And I think we can read all this privacy policy and all that. Sometimes in most of, and I don’t know, most software, but many software, we don’t have a choice. We simply just have to check, agree. And you know, like, you know, whatever language that they provide, we just agree, right? Otherwise, we can’t use anything from the software at all.

[00:19:01] Data Utility vs Privacy

Henry Suryawirawan: So I think that’s a very, yeah, that’s a very good segue now to actually move into towards the organizations or the teams who build the software and the products, right?

So the first thing is I came from some, you know, belief probably, right, last time. Especially in the era of big data, we have to collect as many data as possible, you know, from the users so that we can derive insights, we can analyze things. And now this seems to be clashing, you know, with data privacy. So in this era, these days, what would you advise people in terms of balance, right, about collecting as many data as possible versus, you know, treating privacy more seriously?

Katharine Jarmul: I think, first off, I want to say, like I’m not trying to get anybody to lose their job. So, if any of the advice I gave you think would make you lose your job, don’t do it, right? So, just a mini disclaimer.

Um, but I think, I think as you mentioned, I think there’s, there’s starting to be, I think, a more popular shift into, even just consumers, so to speak, the average person thinking more about privacy. Namely, because I think this trend, as you mentioned, of the big data era of just collecting, you know, at will and doing all sorts of things that, at the end of the day, make people feel creeped out, you know. Make people feel like their phone is spying on them and all of these things are happening, right? And in some cases that might be true. And in other cases, it might be a fact that we have so much data that we can actually infer sensitive things about people without their knowledge and also without our intention, right? When we think about recommendation algorithm development, search response algorithm development. So search algorithms and content algorithms.

When we think about a lot of this, there can be latent, what we call latent variables. And what I mean by that is, you know, because you follow and you like these five things, we kind of have learned your gender or your age or your family status or these other things. We didn’t explicitly try to learn them but we kind of learned them by amassing, as you say, so much data and then putting an algorithm on top and not controlling for anything like privacy. And then you get super creepy ads and you’re like convinced your phone is spying on you and then you must hide it or… you know what I mean? Like, I don’t want people to be afraid of tech. I don’t think any of us want people to be afraid of technology. I also don’t personally want to be afraid to have a mobile phone. It’s a nice thing to have.

And so when I think about, like this shift that’s happening, maybe one of the things that you can start talking about in an organization that may or may not already have a culture around thinking about privacy is how do we reason about the trust that we’re creating. And I think this is great for tech leads, as you aptly point out, and product people. Because at the end of the day, when you’re like a technical lead or a principal or a senior person in the type of architecture decisions, product decisions, data decisions that get made, and you’re a top product person, those people can have a real conversation about do we want to talk with our customers about this problem of privacy?

Do we want to talk with our customers about what’s the mental model that they think about how our service works? What parts of this is up for debate? What parts of this is not? Because there are like indeed businesses that by default probably privacy is not gonna play a huge role, right? But you can at least start that conversation. And what we often talk about in the field of technical privacy is balancing between the most privacy we can offer, which would be collecting no data, and utility, right, is what we think about privacy utility balance or trade offs. And utility there we might think of as this old idea of we’re going to collect every single mouse movement or single character you type in, in the hopes of getting some sort of insight, which I would also argue, as a data person if you don’t know what you’re going to use the data for, you’re probably just amassing large cloud computing fees for no reason. And so it’s better to have smaller experiments with less data and then to grow it over time.

Again, if you will lose your job for doing that, I’m sorry. And maybe you can find a new job one day.

Henry Suryawirawan: Right. So I think you mentioned something very important, right? And many, many privacy experts also say the same thing, right? If you don’t know what you collect that data for, then don’t do it, right? Don’t store it, don’t do it, don’t ask for it, right? So I think that’s also very, very, you know, important, right? For everyone who builds software these days, right? Especially the technologies, right? And also when you build AI model, right? I mean, last time we also simply say, for machine learning it needs as many variables as much as possible, right? So that it can train the model better. But again, like, if you don’t know how you want to use the data, maybe it’s better not to collect them.

[00:24:19] Privacy by Design

Henry Suryawirawan: And this comes back to this thing called privacy by design. I think, um, it’s mentioned a lot of times, right? In security, there’s a secure by design. In privacy, now there’s a privacy by design. So maybe it can help explain what is privacy by design and how we should use it?

Katharine Jarmul: Yeah, privacy by design is actually like a quite old concept, which is pretty cool. I think it was developed first in the early 2000s by a researcher in Canada, Cavoukian, I don’t know if I pronounced the name right, anyways. And, uh, she developed these principles as part of research, I think she was working on as somebody that both understood some of the privacy regulatory aspects and then also the software aspects. And when you read the original Privacy by Design, I think there’s seven design principles, you see a lot of software thinking. This is like really pre data pre algorithms as like a normal part of software design. And a lot of it talks about the same types of themes we’re talking about now, allowing user choice, allowing transparency, but also by default building things that respect the user, that respects the fact that privacy does have all these, you know, is informed by context.

And so if I put something in a form, it doesn’t necessarily mean I want that to be used to train an algorithm, right? That we understand like how the user is imagining what it is they’re doing and that we’re building technology that hopefully mirrors that or if not makes it more obvious what is happening. And then security principles, so basic, not only application security, but also infrastructure security principles so that we don’t end up exposing something that we’ve collected in trust that we are even using responsibly but then open the system up for potential attacks or other ways of exfiltrating that data and using it for another purpose.

And this is where I think we can take privacy by design, we can use that to architect and build our systems, and then we can also evolve that thinking for new things like building large scale data systems, algorithmic systems.

And then we have a huge problem now around third party data systems of, you know, which I don’t think by any means private by design. And so when we think of third party, it can either be services we’re using, and we are the intermediary then between our users who maybe we have a direct relationship with and direct trust with. And then we want to use a third party system, which is fine as long as we’re really clear. And as long as we’re actually reviewing the third party vendors that we use for things like privacy. Or the other way around where we’re actually a third party vendor. We have zero access to the users. And this is a lot of us that work in the B2B space. We don’t actually, I mean, usually the companies that we’re interacting with, they might have customers or they might even have customers who have customers, right? And so how many layers until we get to a human who can make a choice about whether they want their data used that way or not? And do we have a good review process to make sure that that kind of chain of trust is verified? And that it’s mentally sound, both the way that we’re dealing with these trust relationships and also how we’re eventually communicating all the way down to a person who can say, you know what, I’d actually rather not. If I could opt out, I’d rather use this service, but without the interface to ChatGPT, for example. And I think we kind of see this in some of the new products that come out that there’s a little bit more transparency of, hey, we built it this way. Here’s the services we’re using. Do you want to use this particular feature or not?

At least, I see that a lot in the way that Apple is choosing to design some pieces of Apple Intelligence and so forth is trying to have that conversation in the open. Does it always go well? Maybe not. But at least trying to start that conversation with users in the open about how they can better understand and control the way that the data flows through the systems. And I would call that like today’s updated version of thinking through privacy by design.

Henry Suryawirawan: Wow. So after you mentioned about, you know, third parties and how data, their data intermediaries, or even people these days, we use a lot of SaaS, right? I mean, software product teams, right? We use a lot of SaaS, you know, maybe managed database or serverless database and things like that. So that kind of like implies the data can be stored in most of, uh, other places, you know, beyond our premise, right?

[00:29:06] Data Governance

Henry Suryawirawan: So I think this also speaks very, very true, the fact that sometimes if your organization grow really large, it’s very difficult to actually know what data have you collected, where they reside, the retention, you know, who people get access to, right? So tell us about the importance of this, you know, data governance because in many parts of conversation I was in about data privacy, I think the first step is always to know the inventory of your data, right, and govern that particular inventory. So maybe tell us the importance of data governance.

Katharine Jarmul: Yeah, and I think data governance is such a critical, even outside of privacy. But you said it perfectly. I mean, if you don’t know what data you have, if you don’t know where it lives, if you don’t know what pathways it goes through, like when we talk about data engineering, if you don’t know where it’s being processed, how it’s being processed, by whom, along the way, then you kind of end up in a place where privacy is not possible to some degree, right?

Hopefully, you’ve kind of, again, I’ll re-mention having strong practice of vendor review and vendor assessment, because I think, like, thinking through risk always means, can we create a reasonable way that we as an organization want to think about what third parties we work with, what intermediaries we work with, what SaaS, as you point out, we work with. And there’s not going to be any magical right choice, but there needs to be a commitment to some sort of choice and perhaps some thinking through of why it is that you choose one vendor over another that relates to principles like privacy and data security. But then beyond that, if we don’t understand the data flows that are going through our systems, if we don’t have data properly tagged, like some people might use tagging, other people use categorization, other people use all sorts of things. But if we don’t have a reasonable way of organizing the quote unquote data catalogs that we have and data stores that we have, then we also don’t have any reasonable way to think through retention schedules.

[00:31:10] Retention Schedule

Katharine Jarmul: What is a retention schedule? It means that we say that we collect data for a certain purpose for a given timeline. Maybe we collect purchase data for every customer, right? They go to our site, they buy something. Or maybe we hold other people’s purchase data. It doesn’t matter. The purchase data exists, right? And maybe we say we retain purchase data for as long as you’re a customer or for up to two years after you close your account or something like this. Well, how do we even enforce that if we haven’t bothered to like connect? Then somebody has to write like some really nasty SQL query. It may or may not work if somebody did bad data entry. And this all relates to better data governance, which relates to things like data cataloging.

But also things like thinking through data quality metrics and how we ensure that we're actually getting value out of the data we collect. And what I will say is that if you improve trust relationships with your customers, you will get better data quality. That's been proven time and time again. If you're just spamming, collecting whatever you can, especially through some sort of intermediary, you're probably getting pretty low quality data. If the signal you're trying to measure is something like "is this customer interested in ABC", and you have a good relationship with that customer, you could just directly ask them, right? Then you have both high data quality and high trust. You've created a good privacy and security relationship with your customer, and you know that the data you're storing is actually accurate.

So I think in the end these all end up supporting each other. And I just want to say it is not a trivial undertaking to do data governance at scale. Most organizations are not doing it very well; a lot of organizations are just starting right now, I think. So it's okay to decide: we want our next few years to have a goal of better data governance, through which we can also have better privacy and data quality.

Henry Suryawirawan: Yeah, I can imagine companies that have been around for years, with production data and lots of customers, will definitely find this a challenge. Especially since these days people also work remotely over the internet: something gets downloaded from the system onto a device and sent over WhatsApp or something like that. Or you have data pipelines and a data warehouse where data moves from one side to the other. It's really challenging, and I can empathize. I have this kind of problem as well: cataloging data, knowing the lineage, governing whether it's sensitive or not, and whether parts of it can be shared with other people. So it's definitely a challenge, but if we build awareness over time, I'm sure we can improve the data governance that we have.

[00:31:10] Data Privacy Practices & Techniques

Henry Suryawirawan: So maybe let's go into the techniques. How can we actually improve our privacy practice? I think the first things that are always mentioned are pseudonymization, anonymization, and data masking. So for people who are not familiar with these terms, what are those techniques, and how can we apply them practically?

Katharine Jarmul: Yeah, I want to add a mention here that maybe also helps bridge data governance to actually implementing what we might call mitigations or controls. These are not decisions to be made in a vacuum: what controls get applied where, or how do we decide what we want to do with what data?

How do we even decide what sensitivity a piece of data has? Hopefully your organization has the ability to create a data governance board or practice, or even a data privacy practice, where there can be multiple disciplines of people in the room. Because I think you have to have people who are business informed, who know the business goals of why we're trying to do this thing, as well as people who are data informed, software and tech stack informed, and then, of course, regulatory informed and privacy informed. You can also hire privacy professionals who are not legal experts, but instead focus on topics like privacy by design.

So if you have these people together, maybe also somebody from InfoSec, you can start to say: you know what, we think the biggest challenges for our data governance, or our data privacy, or our data security, or all three, are these top three new systems that we want to develop. So we're going to put them on our collective roadmap for the year and prioritize them. And then, once you understand the risk space that you have, the biggest problems you have with data privacy, you can start thinking: do we need to pseudonymize? Do we need to anonymize? Do we need to run some other type of setup?

And then we get into the fun, fun bits, which I would say all fall under these types of privacy controls, or start to go into what we would call privacy enhancing technologies, which are different technologies we can use to help meet these goals of privacy in our organization. Most of us have seen pseudonymization at some point, and it can take many forms. It can be simple masking; maybe you've used one of the popular libraries, Microsoft Presidio, to try to identify things like PII and mask them. Masking can take several forms. You've probably already used some hashing mechanism, maybe a one-way hash or a reversible, two-way scheme. These can be ways to pseudonymize information. You've probably also already used redaction: simply removing certain types of sensitive data from, let's say, a reporting tool or a BI dashboard, so you can say this is now ready for organization-wide consumption, or for our marketing report, or whatever it is. So the data is scheduled for release publicly, or at least semi-publicly within an org.
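
To make the simpler end of that spectrum concrete, here is a minimal sketch of keyed-hash pseudonymization and naive redaction using only the standard library. In practice you might lean on a library like Presidio rather than hand-rolled patterns, and the key handling here is purely illustrative:

```python
import hmac
import hashlib
import re

# Assumption: in a real system this key lives in a secrets manager, not in code.
PSEUDONYM_KEY = b"rotate-me-regularly"

def pseudonymize(value: str) -> str:
    """One-way keyed hash: the same input always yields the same token,
    so joins and counts still work, but reversal requires the key plus brute force."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_emails(text: str) -> str:
    """Naive redaction: strip anything that looks like an email before wider sharing."""
    return EMAIL_RE.sub("<EMAIL REDACTED>", text)

print(pseudonymize("alice@example.com"))                    # stable token for joins
print(redact_emails("Ticket raised by alice@example.com"))  # redacted free text
```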

Any of these things are small privacy mitigations. And those can lead up to more serious privacy mitigations like anonymization, or even things like local-first data processing, distributed data, or encrypted computation. It gets very space age the further you want to go, and those only fit certain types of problems. So first you have to shape the problem, as every developer knows; I don't have to say it. If you don't know what problem you're trying to solve, then just throwing a technology at it is unfortunately not going to solve the problem.

Henry Suryawirawan: Yeah, that’s a very very good advice, right? So I like the in the beginning you mentioned that it’s not something one, like one particular team or, you know, person to decide, right? So it’s like a multi departmental kind of a effort, right? And there are many departments that should be involved, right? You mentioned a few that are very important, right? It could be the privacy, you know, department, the technology, definitely the product, right? Maybe the legal side. Infosec, definitely. Data and security is kind of like interrelated with each other and there could be many other parties, right? And especially in the organization, I think it’s very important to have this so called decision making process to actually identify whether something is sensitive or not sensitive.

[00:34:09] Privacy Enhancing Technologies

Henry Suryawirawan: During the explanation just now, you mentioned this term that is quite trendy these days: privacy enhancing technologies. In tech, we all love to know what technologies are out there so that we can apply and try them. So what are some of the privacy enhancing technologies that are quite modern these days, that people are using to solve this privacy problem?

Katharine Jarmul: I think there’s been some really cool developments, I would say, over the past 10 years, and I think the ones I’m most excited about are also the ones that show up in the book that I know you’re now a bit familiar with, which is technologies like differential privacy, which is a way that we can reason technically about privacy. I like to call it for people the idea of measurable privacy or rigorous privacy, because we’re using a scientific way of thinking about or trying to shape the problem, and then trying to decide that. And that works well for things like if you have to release data publicly. If you want to think through something like anonymization. And it can work in many different types of use cases, but it can never work if you’re trying to say customer A wants B. It has to work in some form of aggregation of an idea or a person. But of course, when we think of anonymization, we cannot ever say that we release something anonymized that can be tracked back to one individual. Because then it’s not anonymized, right, by definition. So I think differential privacy, really cool technology, lots of different approaches, lots of different algorithms to think through. And also an increasing number of really cool open source libraries that allow you to think through differential privacy from a development context.

Then probably the next most famous one of recent years has been federated learning, or federated analytics. Sometimes I like to call it distributed learning. It also relates to what I would call the field of local-first software. I don't know if you've already had local-first software experts on your show, but Martin Kleppmann, who's famous for the book…

Henry Suryawirawan: Designing Data-Intensive Applications, or something like that.

Katharine Jarmul: Yes, yes, thank you. He and a group of folks have popularized this movement of local-first data (it probably started before them), which is thinking through exactly what we were talking about before. We now have edge devices and edge compute like mobile phones, and certainly within whatever AWS cluster you run you have plenty of compute, so can we push processing as far down to the edge as we need? And therefore, can we push the data as far down to the edge as we need? This might mean what we call cross-silo learning or cross-silo analytics, which says: you're a multinational company, or you're in some sort of B2B situation, and you allow every company to at least house their data within their own premises, so to speak, within the boundaries of their own cloud system or whatever it is they're using. They don't actually exchange data; they exchange some sort of analysis or output. This could be machine learning processing, or it could just be simple analytics, but we try to push it there. Or it could go all the way down to: I'm building a mobile application, I'm going to keep all of the data local, and I'm only going to send certain artifacts back up to some sort of centralized compute or redundancy compute. And depending on how that works, you could even offer things like end-to-end encryption and other types of things.

Why would you go to this bother? Well, A, you're greatly reducing your exposure risk. You actually don't have centralized data; you just have some centralized insights. So how attractive is it to try to hack your system? Much less so, the less data that you have. And B, if you can get the same insights, you're also saving a huge amount on cloud compute. Because if every two seconds I open an app and you ping just to check whether I'm still at my house, like, cool. I mean, not so cool, I don't think so. But you're wasting a whole bunch of network, compute, and storage just to find out that, guess what, between the hours of 9 p.m. and whatever time in the morning, I'm at my house. When with aggregate analytics you probably could have already figured out that that's true of 80% of your user base or something like this. I don't really know why you need to know that, but it's a rough example of how differently we could engineer and architect our systems if we thought through, again, this question of what data you actually need. And can I do that in some privacy-respecting manner, so that most of the data stays far away from some sort of centralized setup, and certainly far away from some sort of data sharing setup where companies are just exchanging sensitive data at scale?
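
As a toy sketch of that cross-silo idea (simplified far beyond real federated learning frameworks), each silo computes an update over data that never leaves it, and only those small artifacts are combined centrally:

```python
import numpy as np

def local_update(local_data: np.ndarray) -> tuple[float, int]:
    """Runs inside each silo or device: raw records never leave,
    only a summary statistic (here, a mean) and a sample count do."""
    return float(local_data.mean()), len(local_data)

def federated_average(updates: list[tuple[float, int]]) -> float:
    """Runs centrally: combine only the shared artifacts, weighted by sample count."""
    total = sum(count for _, count in updates)
    return sum(value * count for value, count in updates) / total

# Three companies keep their data within their own boundaries.
silos = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 12.0]), np.array([4.0])]
updates = [local_update(data) for data in silos]
print(federated_average(updates))  # a global insight without pooling raw records
```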

And then that leads to the third type of privacy enhancing technology I'm personally really excited about, which is encrypted computation. People might have heard about it under terms like homomorphic encryption or secure multi-party computation. This essentially enables a whole bunch of use cases based on cryptography, which allows you to compute insights on data without decrypting it. Of course, eventually somebody will decrypt it; the insight gets decrypted and used at some point in time. But the actual processing of the data can be done in an encrypted state. And that provides a quality that we like to call secrecy, which is different from privacy. But secrecy we can use to say, you know what?

In our cloud compute, we only ever computed on encrypted data, and then it moved back to your device, where you personally decrypted it. So by the design we have offered, we actually never saw the data in an unencrypted state except once it was on your device. Or, in a data sharing landscape: we only ever processed data with another company by keeping that data encrypted, so that neither company could learn anything other than the final released results.

So any intermediary results, anything that we think might leak individual-level privacy, we can cover up with a secrecy blanket. And at the end, when we decrypt and pull off that secrecy blanket, we've already ensured that the end result has enough privacy that we can remove the secrecy. I hope that was understandable. You tell me if so.
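
To make this a bit more tangible, here is a toy sketch of additive secret sharing, one building block behind secure multi-party computation: values are split into random-looking shares, the computation happens on shares, and only the final result is ever reconstructed. Real protocols involve far more machinery; this is purely illustrative:

```python
import random

MODULUS = 2**61 - 1  # all arithmetic happens modulo a large prime

def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into shares that individually look random but sum to it (mod p)."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % MODULUS

# Two companies secret-share their values; the compute parties only ever see shares.
company_a_value, company_b_value = 42, 58
shares_a = share(company_a_value, 3)
shares_b = share(company_b_value, 3)
summed_shares = [(a + b) % MODULUS for a, b in zip(shares_a, shares_b)]
print(reconstruct(summed_shares))  # 100: only the combined result is revealed
```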

Henry Suryawirawan: Wow, all these advanced technologies definitely sound really cool. I may not have been exposed to many of them, but they sound really promising. Especially the local-first concept, I think, will be very important if you don't want your organization to be exposed to a lot of sensitive data. And encryption is still kind of the go-to strategy for securing data: it could be encryption at rest, in transit, or, as you just mentioned, homomorphic encryption and things like that. These days cloud providers also come up with things like secure chips, where you can do more secure computing on the virtual machine. So it's definitely a space where interested people can follow the technology trends.

[00:38:52] Fostering Data Privacy Practice & Culture

Henry Suryawirawan: You mentioned all of this, and for engineers or product teams it definitely makes it more complicated to come up with a solution. How can we talk to stakeholders about the need to improve our solution by implementing all these complex things? That might increase the effort and the amount of resources we need, with all these technologies and the development effort involved.

Katharine Jarmul: Ah, yeah. One thing I think your podcast has done well so far is talk about lean principles, about creating buy-in over time, developing a practice and eventually developing platforms. Because I think that's the evolution you have to work with. If you're starting with even just basic data governance, you're not going to deploy homomorphic encryption tomorrow; that's not going to happen for you. So you have to start with some of the building blocks. But from a lean product-thinking perspective, try to think through which of these more advanced technologies actually fit the problems that you have. Maybe it is exploring local-first development or something. And then pull in very small use cases, very small product features, or maybe even new product launches, and say: you know what, we're going to cordon off an extra 20% of the planned development time to think through an experiment. Could we use this new technology to help expedite future products so that they are launched with a private-by-design or secure-by-default type of mentality?

And what you’re doing there is, A, you’re giving people who are interested in the topic enough space so that they can actually start exploring, right? And as you know from any new development technology, new type of thinking around development, if you give people a little bit of space and they already have some interest or motivation, they can actually go pretty far. And you’re also developing this culture of learning and experimentation, which we all need. And then B, you’re also like boxing it enough that it doesn’t become a blocker so that there’s some sort of backup technology that you’ve always used that’s there. And that you’re kind of giving, you know, you’re growing the experience and knowledge.

And I think as soon as you start focusing on the process rather than the outcomes, you're going to see the rewards happen over time. You asked: how do we explain this to the business stakeholders? Well, maybe you don't even have to explain it yourself. It will become evident by allowing time in the process for it. You can say something as simple as: we need to stay up to date, we need to modernize the way that we think about privacy. And I think this is going to pay off in the long run by allowing us to build platforms that let us do privacy at scale, which is the end goal, right?

But along the way, we're going to add some extra time and cushion so that we can build in these new, exciting technologies that, I promise you, you will be able to talk about in some press release one day, and show that we're forward thinking and advancing this space. But you have to start out and allow for that. And I think the advocates that will come out of allowing some of your developers, your data people, or other technologists at your org to learn more deeply about this stuff will pay off in and of themselves. Those people will end up being the advocates that educate the other people in the org about how this stuff works. And, I promise you, they will find clever ways of explaining it if you give them the time and space to learn.

Henry Suryawirawan: Yeah, I think that's very important advice: start small, and focus on the process, not necessarily just the outcome. And I think everyone can build awareness within the team and within the organization about privacy, and treat privacy more seriously.

[00:47:05] The Legal Aspects of Data Privacy

Henry Suryawirawan: Obviously, one easy way to explain this to stakeholders is regulations, right? I'm sure in Europe it's easier to bring up privacy because there's the GDPR, which is very strict. In some parts of the world, they are also starting to adopt these kinds of stringent rules about personal data protection and things like that. So let's go to that legal aspect. What trends do you see these days? What I know is that many countries are now starting to come up with these kinds of GDPR-like regulations. Is this something we are going to see going forward, with all countries having their own regulations? I think it will become very complicated if, let's say, you build a product that works in multiple jurisdictions. So maybe tell us a little bit of your insights on this.

Katharine Jarmul: Yeah, I think it's already pretty complicated, and it's probably going to get more complicated rather than less. That's why, when I talk about this, especially with high-level stakeholders, like the C-level of some multinational or a company that wants to become multinational, we have to think about it as future proofing the data strategy. Because if we just develop a data or AI strategy within what we know today, we're not going to be prepared for the future. And even with what we know today, we have a very fragmented legal landscape around almost everything data related, but certainly data privacy. That also applies to data security, and it applies to the data governance protections that different organizations are looking at.

And one of the big trends that, unfortunately, I don't see going away anytime soon is data sovereignty laws. These data sovereignty laws are about the jurisdictional control of data. For example, that data cannot leave a certain nation state or group of nation states, or that data cannot travel across certain untrusted nation states. If you already work in government or government-adjacent settings, you probably already know this pain, and you're thinking: you're not saying anything new, Katharine. Well, now the pain is coming to you, whether or not you're government adjacent, and whether or not you deal with highly secure systems or critical infrastructure. I think it's ballooning out to other areas.

And again, at the end of the day, a lot of this is a policy problem internally at our organization: deciding what our stance towards regulation is, where we operate, and what's important to us. Then obviously the legal experts, not the technology experts, get to decide the risk posture and the policies we want, related to what we know about the data, how it's stored, where it's stored, and what it's used for. And then we as technologists get to step in at that policy and principles level and say: this is possible, this is not possible; this is a good idea, this is not such a good idea, because… And then we can talk about things like architectures, the clouds we use, and infrastructure questions, alongside cool stuff like privacy technologies.

And I would say there’s a non trivial number of legal professionals and privacy professionals that want to also learn more about technology or might even already know a lot about technology. And you, if you build those bridges, you can have a really useful conversation around what does this actually look like? And yeah, I think that can be probably the most healthy approach, rather than just avoiding the topic because it’s scary. And you don’t want to talk with the lawyers because they might say no. And then five years down the line, they’re like, why did we architect it this way? And, um, there’s like no answer other than we didn’t have the conversation early enough.

Henry Suryawirawan: Yeah, you mentioned something really important: to have the legal team, the legal aspect, involved when you build products, when you architect the solution, when you architect your data storage, and things like that. We're used to discussing, whenever we choose a cloud provider, whether the data center is in a particular country. Now I think that's not enough, because you also have to look at data transfer: if, let's say, you use a SaaS provider that resides in another country, another jurisdiction, maybe the data sovereignty law doesn't allow you to do that, and you can get into trouble when transferring data over. A lot of countries also implement this differently, and it's very hard to keep up. That's why the legal team is probably still best placed to advise the team on how to implement a better solution for privacy.

[00:51:10] AI and Data Privacy

Henry Suryawirawan: So, Katharine, we have spoken a lot about privacy, technologies and all that. Is there something else that you think listeners here should know about data privacy?

Katharine Jarmul: I think lately I get a lot of questions around AI and data privacy. So I just want to note that there's a lot of new development and thinking around, A, how do we build machine learning or AI systems that can utilize private data without problems. But B, if we don't build the AI models ourselves, which most organizations still don't at this point in time, how do we actually build a kind of protective blanket around the interfaces to these third-party AI systems or AI APIs?

And I just want to suggest, obviously, that I think my book is a great resource, and I also have a newsletter on the topic, but just start to inform yourself. Because what I'm seeing in the circles I'm in is that this is a growing trend and a growing problem, and it will come for everyone. It will come for people who are not machine learning experts soon enough, if it's not already a problem you're facing: you're sitting there with some other team, you want to use a third-party API, and all of a sudden a questionnaire comes in from the legal department, or the question of how you actually build that with privacy or security if you're using proprietary documents. Let's say you want to build a RAG system and you want to have access to a bunch of sensitive documents that the company has. How do we make sure that we build that responsibly?

So I just want to point to the fact that this is a growing field of practice, and you don't have to learn it to the greatest depth. But start, when it pleases you, to inform yourself lightly: hmm, what might it look like if we built this RAG system with privacy or security? How could that look different? Do we have the capabilities to do that now? If not, what capabilities would we need to grow? And, as tech leads, just start to allow that topic to percolate in your head a bit, because I think it's of growing importance in how we think about integrating machine learning into normal workflows.
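
One minimal sketch of what such a protective blanket around a third-party AI API could look like: scrub obvious identifiers before the prompt ever leaves your boundary. The patterns and the call_llm_api function below are placeholders rather than any real vendor's SDK; a production setup would more likely use a detector like Presidio plus proper review:

```python
import re

# Placeholder patterns; a real system would use a dedicated PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def scrub(text: str) -> str:
    """Redact obvious identifiers before a prompt leaves the organization."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def call_llm_api(prompt: str) -> str:
    """Stand-in for whatever third-party AI API or vendor SDK you actually use."""
    return f"(would send {len(prompt)} scrubbed characters to the provider)"

def ask_with_privacy_guard(question: str, context_document: str) -> str:
    prompt = (
        "Answer using only this context:\n"
        f"{scrub(context_document)}\n\nQuestion: {scrub(question)}"
    )
    return call_llm_api(prompt)

print(ask_with_privacy_guard(
    "What did the customer ask for?",
    "Mail from jane.doe@example.com (+49 151 1234567): requesting a refund.",
))
```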

Henry Suryawirawan: Yeah, I can imagine many teams now are scrambling to implement some sort of AI in their systems. And when you do that, sometimes you don't think further ahead, especially if you're dealing with, say, customer support or customer confidential data. You think just implementing AI is going to make life easier, but you're not sure whether it will expose certain things. Also, when you use third-party systems these days, whenever these companies adopt AI, sometimes the default choice is to enable AI training using your data. So be aware of that. I think those things need to be more transparent and explicit so that we can build trust with the third-party systems that we use.

[00:56:08] 3 Tech Lead Wisdom

Henry Suryawirawan: So Katharine, it's been a great conversation. I learned a lot about data privacy, although it scares me a little bit thinking about how to build a more private-by-design system. Unfortunately, we've reached the end of our conversation. One thing I would like to ask you to end the conversation is what I call the three technical leadership wisdoms. You can think of them just like advice. Maybe you can share your version with the listeners here.

Katharine Jarmul: Yeah, I hope maybe I've sparked some people who might be interested in learning more about data privacy and data security topics. But the number one piece of advice is: even if you're not inspired to learn, I ask that you empower people on your team or in your organization to learn. And that's because, as you can probably tell from my advice today, the field is actually quite deep and very intensive, and it's not necessarily a field that a lot of people specialize in because they studied it. So you want to open doors for people to grow and to learn, and to foster that ability to specialize and that culture of learning. And for you all as tech leads, if it's coming from you, it's going to be much more powerful than from a junior who just joined the team and who's like: oh wow, cool, I learned about differential privacy in college (because sometimes they do that now), I want to try this thing out. If you can give that person space, if you can shield them and give them some ability to learn, they're going to level up the entire org eventually. Certainly level up your team, but eventually this spreads and creates the cultural changes that you need. So foster that learning, support your people in learning, create those spaces for learning, and try to give some back pressure to delivery deadlines to allow for some of that learning cushion.

Another thing I want to share is that I don't think all risk conversations have to be scary. I think it's totally natural, and I empathize so much with it being scary, because it is a big and serious job to talk about the risk to people's data and that trust relationship. And yet, at the same time, if we can normalize conversations around risk, if we can allow some of that space for scary topics to be part of our road mapping and our planning, where we regularly do privacy risk reviews, security risk reviews, and auditing reviews, if we can normalize that the way we do testing for bugs or checking whether something is deployed properly, then we also make it more tangible. And I think then it becomes less scary by default, and we make it less uncertain. Most of the scariness comes from this uncertainty problem, so try to focus on that.

And finally, both of these relate to creating a team culture, and hopefully an organization culture, of psychological safety. By knowing that we're not going to get everything 100% right all the time, we can foster learning, we can foster risk discussions, and we can certainly build much, much more private and secure systems: by allowing ourselves to have real conversations, to escalate problems when we see them, and by empowering people to report up and down the hierarchical chain of the org about what we think of problems like privacy and security. By fostering this kind of safe conversation around these types of problems, I think we end up creating, by default, better software systems and products in terms of privacy and security.

Henry Suryawirawan: Wow, very lovely. It's all interrelated, right? I especially love the second one: not all risk needs to be scary. Sometimes it's just because of the uncertainty, because we don't know about that particular risk. But if you dive deep, if you learn, and you allow people to experiment and try to solve the problem with psychological safety, I think we can all improve together. So thanks for that beautiful wisdom.

So Katharine, for people who love to talk more about privacy, is there a place where they can find you online or maybe resources where they can learn more about privacy?

Katharine Jarmul: Yeah, I have a newsletter. It's called Probably Private, so if you're more of an email newsletter person, you can find me there. I'm also starting to produce some YouTube videos, mainly around machine learning systems and privacy in machine learning systems, so you can find me there too. My book is mainly written for data people, but I've heard feedback from software engineers and architects that parts of it can be useful for them as well.

And then there’s such an amazing wealth of resources online around privacy. I would even like, even just start with your cloud provider, like your main cloud provider. Or if you have your main service providers, if you have major services. And just start looking around of like, what settings do they have around privacy? What do they make available to you? What knobs can you turn? Even just starting there or even having like somebody on your team, like the art, the cloud architect or cloud engineer on your team, focus on that, could end up paying dividends later of just starting to know, hey, it’d be really easy for us to turn on, you know, this or turn off this and that would provide better privacy.

So I think those are some easy ways to get started. And make friends with the privacy people at your org. Be nice to them. Just reach out and set up a one-to-one 15-minute coffee, because they can be a huge resource if you develop that relationship. Maybe you already have the relationship, but if you can develop it, that's a person you can quickly double check your thinking with, or think out loud with privately. If they have a legal background, they're never going to say yes or no, which is fine; that's what they're trained to do, good for them. But they're going to give you advice, steering ideas, and guiding questions. It's going to help you. So be friendly with them.

Henry Suryawirawan: Wow, thanks for the plug. I'm sure if there are legal people listening to this, they will feel happy. So yeah, make friends with the legal and compliance team and the security team. Those people are not there to make the job harder for you; they actually help you and the organization to improve. So, Katharine, I loved this conversation. Thank you so much for your time. I hope the listeners here learned a lot about data privacy today. So thanks again.

Katharine Jarmul: Thanks so much, Henry.

– End –