#73 - Continuous Architecture (Part 3) - Security and Resilience - Eoin Woods


“Because we ship stuff now almost immediately into public facing clients, almost as soon as we’re writing a line of code, we need to be thinking about how we make sure that it’s a secure line of code and it will be deployed and operated securely as well."

Eoin Woods is the co-author of “Continuous Architecture in Practice” and the CTO at Endava. In this last of a three-part series of “Continuous Architecture” episodes, Eoin shared the remaining two important quality attributes covered in the book: security and resilience. Eoin explained why we should treat security as a critical quality attribute, the changes in the security landscape that make security more challenging, the concept of threat modeling, how to do continuous threat modeling, and his 10 secure by design principles. Eoin then discussed resilience as a quality attribute, how to differentiate resilience from high availability, some common resilience techniques that we can implement in our systems, and the importance of embracing a failure mindset.

Listen out for:

  • Career Journey - [00:05:42]
  • Software Architecture - [00:09:43]
  • Quality Attribute: Security - [00:12:19]
  • Security Landscape Changes - [00:14:08]
  • Availability as Security Objective - [00:18:59]
  • Threat Modeling - [00:20:51]
  • Continuous Threat Modeling - [00:23:59]
  • Secure by Design - [00:26:56]
  • Quality Attribute: Resilience - [00:31:14]
  • Resilience and High Availability - [00:33:38]
  • Resilience Techniques - [00:35:36]
  • Allowing for Failures - [00:40:18]
  • 3 Tech Lead Wisdom - [00:41:23]

_____

Eoin Woods’s Bio
Eoin is CTO at Endava, based in London. In previous professional lives, he has developed databases, created security software and designed way too many systems to move money around. Outside his day job, he is a regular conference speaker. He is interested in software architecture, software security and DevOps, and has co-authored a couple of books on software architecture.

Follow Eoin:

Mentions & Links:


Our Sponsors
Are you looking for some cool new swag?

Tech Lead Journal now offers swag that you can purchase online. Each item is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.

Check out all the cool swag available by visiting techleadjournal.dev/shop. And don't forget to show it off once your order arrives.


Like this episode?
Follow @techleadjournal on LinkedIn, Twitter, Instagram.
Buy me a coffee or become a patron.


Quotes

Career Journey

  • Some themes that you see a lot at the moment are where people are either starting out with nothing, and they need to build very scalable, resilient, reliable, secure platforms from nothing, but really very quickly.

  • Then at the other end, there are an awful lot of large organizations who today need to produce, for example, new SaaS digital platforms for themselves or their clients. So they need to enter new markets, or whatever the driver is. They also have to work in this way. But they not only have the problem of building something new. They’ve got a lot of stuff already, which is often slightly unkindly called legacy.

  • I always define legacy as the software that pays the bills. Whereas the new shiny software is probably massively in debt. It’s probably not paid for anything yet.

  • On a micro-level, how do you design good software for the cloud first era? Dealing with very fast moving cloud platforms, where you build on a service today, and 12 months later, the vendor announces that they’re not going to invest in it anymore. They want you to move to another service and it’s totally incompatible.

  • People expect systems today to be almost magical. They require them to have really good quality attributes, but also to be very sophisticated. But I’m seeing lots of people finding just the complexity of the software they need to build today really, really challenging.

Software Architecture

  • For me, the key concerns for architecture are two.

    • One is, there are many people who care about our system. Let’s call them stakeholders, cause that’s what the buzzword is. So there are people who care. Some of them care very positively and they want it to be a success. And people often forget that some people don’t really want our system to be a success. Or they’re neutral. For example, internal compliance groups. To them, our new system has just increased risk.

      • So there are positive and negative stakeholders. They have a lot of different concerns, and many of those concerns conflict.

      • The corporate security group obviously wants it to be totally secure all the time with absolutely no chance of anything going wrong, and users want it to be absolutely easy to use, totally intuitive, with no barriers to using it.

      • Those concerns need to be balanced. And it’s very hard to balance those at a feature level, service level, team level.

    • The other dimension which those people don’t think a lot about, but actually is very important to them, is what we call the quality attributes, also known as the non-functionals: security, resilience, availability, performance; in some cases, regulatory compliance, the ability to evolve, and so on. All of those are inherently cross-cutting concerns.

      • So getting those quality attributes achieved, and again trading off between them. Using that to meet the needs of those stakeholders. You’ve got all these different desires and opinions and needs that all conflict. Those things are, to me, the essence of software architecture. And the reason they are the essence of software architecture is that they are cross-cutting. One part of the development team, however empowered, cannot achieve them by themselves.

Quality Attribute: Security

  • Security, in my humble opinion (being a security guy, I would say this), is one of the most difficult and one of the most critical quality attributes. It’s one of those where it’s very hard to know if you are secure. One very small mistake can suddenly make you dramatically insecure.

  • Whereas with performance, I’d argue, it’s easier to test that you’ve got acceptable performance, and performance tends to degrade. You tend not to go off a cliff. Now, I know there are exceptions. But I think it’s more common in my experience that performance degrades, and so you can address it while it’s going on.

  • I think security is something that really has to be handled at system level. It’s got to be thought of as a system quality, rather than a quality that any team in the system development organization can just do either for themselves or everyone else.

Security Landscape Changes

  • There are a couple of factors, both internal and external. The key external one is that the threat landscape, in the security jargon, has never been worse or more dangerous.

    • Firstly, as you point out, most systems today are inherently embracing the internet.

      • We went through three phases. Systems in the data centers. All safe and sound. Very little connectivity. Then we had systems in the data center, safe and sound, with very clear connections to the internet. And then we’ve moved to the “Internet is the System” era, where the internet just pervades the system. So different bits of the same system are connected across the internet. Public visibility is now a given.

      • The problem with that, of course, is that, well, it’s very convenient. It means that you can access your system from anywhere, so you can be attacked from anywhere. It’s almost trivial.

    • Second thing is in the threat landscape.

      • The thing that I always get extremely concerned about is how quite sophisticated attacks, quite sophisticated threats are being weaponized, as they call it, meaning packaged up.

      • What’s happened more recently is that it’s become very much the norm that common and sophisticated attacks get packaged up into point and click tools and sold on the Dark Web for Bitcoin. That means that people with relatively modest amounts of money and some time can actually buy the expertise to do the attacks, and just do them themselves.

      • And then on our side, as we said several times, systems are now hung on the internet. We also tend to expose self-describing interfaces. Now, when we use the question mark “?json” on the end of the URL, the interface tells you all about itself and says, “Here’s how to attack me.”

    • And then the third thing is people are often today using security services. For example, authentication, authorization, distributed automatic protection of the data at rest and all of that. Actually, I think that’s a really good thing because it’s much, much better than building your own.

      • The complexity for a team, though, is that by the time they’ve got something from a cloud manufacturer, something from a SaaS service, something from the database provider, and something from the application server, there’s quite a complex web of pieces that they probably don’t understand very deeply, because they’re not security engineers. They’ve got to put that together safely. In the past, only a certain number of people could actually put the security stuff together, and they were real experts.
    • That in itself was actually quite a big problem because it slowed us down a lot. It made security engineers probably too powerful, and often limited what we could do. Or people just went ahead with almost no security because they didn’t have security engineers.

    • Today, you probably don’t need a dedicated security engineer. But on the other hand, if you actually don’t understand all the pieces, you could create a very insecure system out of very secure services.

  • People just need to be much more aware today than they used to be of the security of their systems, from the first time they write a line of code.

  • Historically, there’s always been a tendency for security engineering to look at the requirements, give some broad recommendations and a few sketches of how things might work, then go away and come back quite a long time later, when quite a lot has been developed, to tell everyone it was wrong and what they had to change to put the security in.

  • Because we ship stuff now almost immediately into public facing clients, we just can’t do that anymore. So, almost as soon as we’re writing a line of code, we need to be thinking about how do we make sure that it’s a secure line of code and it will be deployed and operated securely as well.

Availability as Security Objective

  • There’s a number of classical models for security. And the so-called CIA Triad, the Confidentiality Integrity Availability is one of those classic models. I didn’t invent it. It’s as old as security, really.

  • The reason it’s there is because one of the common attacks on systems classically has been to simply take them out of operation.

  • We normally think availability more about guarding against self-inflicted or accidental internal failures and keeping it running. But actually, there’s a security dimension too, and that’s to make sure that an adversary cannot use the system’s availability against us in some way. In other words, they attack the system to make it unavailable.

  • What they’re doing is they’re attacking the availability of the system for them to get a benefit. It’s a different way of thinking about availability. It’s about threats from external factors to our availability, rather than threats from internal failure or internal mistakes to our availability.
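
A standard mitigation for this class of availability threat is to cap how fast any single client can consume capacity. Below is a minimal token-bucket rate limiter sketch in Python; it illustrates the general technique rather than anything prescribed in the episode, and the client IDs, rates, and limits are invented for illustration.

```python
import time


class TokenBucket:
    """Allow a steady request rate with limited bursts."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens (requests) replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject this request


# One bucket per client, so a single adversary cannot slow everyone down.
buckets: dict[str, TokenBucket] = {}

def handle_request(client_id: str) -> str:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=10, capacity=20))
    return "processed" if bucket.allow() else "429 Too Many Requests"
```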

Threat Modeling

  • Threat modeling is a very straightforward process. It’s trying to work through, with a combination of references to common catalogs of threats and also just thinking creatively ourselves, what threats our system could face.

  • [Charles Weir]’s just trying to raise awareness that you don’t need a complex process. You don’t need a load of tools. You don’t need a load of security experts. You know your system. Just think through what it is that could affect its confidentiality, its integrity, or its availability.

  • Because actually, when you sit most teams down and you say, how could information get out of our system maliciously? They’ve got all kinds of ideas how it might happen. They know their system because they wrote it.

  • It’s not that the security specialists haven’t got a lot to offer, but actually, just get started. You probably have a very good idea what threats your system faces from a security perspective.
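
To make that concrete, the output of such a session can be as lightweight as a ranked list the team owns and revisits. Here is a minimal sketch of a threat model kept as code; the threats, scores, and mitigations are entirely hypothetical examples, not content from the book.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    description: str  # what could happen
    affects: str      # "confidentiality", "integrity", or "availability"
    likelihood: int   # 1 (rare) to 5 (expected)
    impact: int       # 1 (minor) to 5 (severe)
    mitigation: str   # how the team plans to reduce the risk

    @property
    def risk(self) -> int:
        return self.likelihood * self.impact


# Hypothetical threats for an imaginary order-processing system.
threats = [
    Threat("API keys leak through verbose error pages", "confidentiality",
           3, 4, "Return generic errors; keep details in server-side logs"),
    Threat("Oversized messages starve the queue consumers", "availability",
           2, 4, "Reject payloads over a size limit at the gateway"),
    Threat("A replayed payment request is applied twice", "integrity",
           2, 5, "Require idempotency keys on all mutating endpoints"),
]

# Rank by risk so the team knows which mitigations to tackle first.
for t in sorted(threats, key=lambda t: t.risk, reverse=True):
    print(f"[risk {t.risk:>2}] ({t.affects}) {t.description} -> {t.mitigation}")
```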

Continuous Threat Modeling

  • It’s not, unfortunately, something we can just put in the pipeline, however sophisticated our tooling. It’s a design activity or analysis activity. The reason I mention that is that threat modeling, historically, has tended to be something where you created a complete formal design for your system before you’d written much code, and the security high priests (and priestesses, of course) arrived, looked over what you had made, did a threat model, and handed it back to you.

  • The point of the threat model, whether you do it or somebody else does, is that you know what risks you need to mitigate. Because the output of the threat model is effectively the set of risks that you need to rank and think about how to mitigate. And then you mitigate them in your design and then you build your system.

  • That wasn’t all that useful, because all kinds of things changed during delivery, and nobody went back to talk to the security group. The other problem was where people hadn’t done much security work at all. So really late in the day, they sort of do a threat model, and they realize that actually, they’ve built a little thing that’s quite hard to secure.

  • To avoid these two extremes, neither of which is helpful, the idea of continuous threat modeling is that rather than threat modeling being this big, complicated thing that has to happen once a year, a bit like undergoing an audit, all painful and unpleasant, you just make thinking about threats part of your standard delivery process.

  • The first time you start talking about a story, start thinking, does this have new threats or could this introduce new threats? As you’re thinking about changing a piece of infrastructure, as you’re thinking about changing the layering of the system, as you’re thinking about, perhaps we’ve got a set of brand new services, and they’re going to have new sorts of database this time, and we’re going to interact with this by messaging - we stop and just go, “Okay, so how could people extract information? How could people make this unavailable? How could people execute operations that they’re not meant to?”

  • So really, it’s part of the development cycle of each piece, rather than this big thing that happens very occasionally.

  • The term I like to use is democratizing it. Make it continuous and make it everyone’s activity. Just like testing. Everyone should be testing. Part of the testing should be security, and part of working out what threats you’ve got should be threat modeling.
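
One lightweight way to democratize this is to attach a few standing threat questions to every story, the same way a definition of done attaches tests. The sketch below is a hypothetical illustration of that idea, not a tool from the book; the prompts and the example story are invented.

```python
# Standing threat questions, asked whenever a story touches data,
# interfaces, or infrastructure.
THREAT_PROMPTS = {
    "confidentiality": "Could this change let information leak out?",
    "integrity": "Could someone corrupt or forge data through this change?",
    "availability": "Could someone use this change to take the system down?",
    "authorization": "Could someone execute operations they are not meant to?",
}


def review_story(story: str, answers: dict[str, str]) -> list[str]:
    """Return the threat questions the team has not yet answered."""
    missing = [q for key, q in THREAT_PROMPTS.items() if not answers.get(key)]
    if missing:
        print(f"Story '{story}' needs threat answers before it is done:")
        for question in missing:
            print(f"  - {question}")
    return missing


review_story(
    "Add messaging between the order and billing services",
    {"confidentiality": "Broker traffic is encrypted in transit (TLS)"},
)
```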

Secure by Design

  • That’s a set of 10 principles that I came up with. I’ve always liked principles as a way of driving behavior, because the great thing about principles is that they make it clear what’s important and what’s less important, but they don’t tell you what to do.

  • I don’t really like people making design decisions for me, but I do recognize that I don’t know everything. In fact, I’m a very long way from knowing everything. So guidance is good. So I came across the idea of design principles.

  • So what I did was I read them all and thought about them. And there was everything from sets of six or eight principles, literally through to sets of hundreds of principles. But I noticed that certain themes emerged quite quickly. They were all talking about least privilege. They were all talking about the importance of trusting carefully. They were all talking about failing securely. They were all talking about the importance of making sure that default behavior was secure. When I boiled them down, I got to about 10.

  • That’s why I use principles in general. I think they’re really good aide memoires. They guide, they don’t command, and that’s good because you empower people. You don’t tell them what to do, because we can’t tell them what to do. They know more than you do.

  • They help people to remember what’s really important about security. People have limited brain space. So it gives them accessible, pretty jargon-free things to remember when they’re building their systems.
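
A few of those recurring themes, least privilege, secure defaults, and failing securely, are easy to show in code. Here is a small hypothetical sketch of an authorization check that grants nothing by default and denies when the check itself fails; the roles and resources are invented.

```python
# Explicit grants; anything not listed is denied (secure default).
# Imagine GRANTS being loaded from a policy store that could fail or be stale.
GRANTS = {
    ("analyst", "report"): {"read"},
    ("admin", "report"): {"read", "write"},
}


def is_allowed(role: str, resource: str, action: str) -> bool:
    try:
        # Least privilege: only explicitly granted actions are permitted.
        return action in GRANTS.get((role, resource), set())
    except Exception:
        # Fail securely: if the check itself breaks, deny rather than allow.
        return False


assert is_allowed("analyst", "report", "read")
assert not is_allowed("analyst", "report", "write")  # not granted: denied
assert not is_allowed("intern", "report", "read")    # unknown role: denied
```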

Quality Attribute: Resilience

  • When I started doing software architecture and design, I think we mainly talked about availability. We thought about everything in terms of three 9s, four 9s, five 9s. And the thinking really was: we don’t get many failures, and when we do get failures, let’s make the recovery time acceptable and minimize the data loss. And these systems were just being used by folks in our organization, so it was not a big deal.

  • What’s happened is that now our systems are cloud first, they’re internet connected, they’re used by people all over the world. They’re SaaS-based products. We can’t just say, “Oh, we’ve had a bit of an outage. We’ll be back at lunchtime.” You just can’t do that anymore.

  • Because the failures will still happen, the thinking has completely flipped on its head. It’s now about embracing the failure. It’s about building systems that deal with the failure and stay in operation rather than try to mask the failure. But if the failure is really serious, accept that they’ll just go down for a while, while they recover from it.

  • I think the things that make resilience so important today are that we’re building systems with lots of independent pieces. The simple mathematics tells you things will go wrong. Because, you know, the mean time between failures can only be so long. If you’ve got enough pieces, something’s going to go wrong at some point.

  • And then because things are now so critical and internet facing, we can’t take long downtime periods to do recoveries and fail back. So we have this new approach: we have to live with failure in the system and have the different parts of the system deal with failures in the things they’re relying on. Hence, the overall system becomes resilient. That has to be the way that we go for certain classes of system today. That’s not to say that every system has to work this way, but many systems do.

Resilience and High Availability

  • High availability is about saying failures are infrequent. When they happen, we will use technology that tries to mask the failure, normally by moving to a replica of what’s failed. If there’s a bit of downtime, that’s probably okay. Because they’re infrequent events, if you like, we don’t optimize around them.

  • Resilience is more saying we’ve now got so many moving parts. There’s always going to be something not working, so everything else has to continue working.

  • Whereas in the high availability scenario, when something has failed, there is an acceptance that the system is unavailable. Even if it’s for a short period. The system is unavailable for a bit, while it sorts itself out. Whereas resilience is about spreading responsibility across all the pieces.

  • I think quite a few years ago, [John Allspaw] said that we keep optimizing for mean time between failures, but failures happen anyway. We should focus just as much on how long it takes us to recover from a failure. That is, I think, the resilience thinking.

  • Rather than saying, “Why don’t we make the mean time between failures as long as possible? When something happens, we’ll just accept we’ll be out for a while,” it’s saying, “Yes, of course we’ll maximize our time between failures. Why wouldn’t we? But it’s just as important that when a failure occurs, it has minimal impact. So the time to recover should be very short.”
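
The arithmetic behind that flip is simple: steady-state availability is MTBF / (MTBF + MTTR), so shrinking recovery time improves availability just as surely as stretching the time between failures. A quick illustration with invented numbers:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Classic HA thinking: make failures rare, accept a slow recovery.
print(f"{availability(mtbf_hours=1000, mttr_hours=4):.5f}")    # 0.99602

# Resilience thinking: same failure rate, but recover in six minutes.
print(f"{availability(mtbf_hours=1000, mttr_hours=0.1):.5f}")  # 0.99990
```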

Resilience Techniques

  • I think the fundamental ones are about knowing what’s going on, which is the field of monitoring and observability, bringing different threads of system awareness together. Health checks, watchdogs, and alerts: those kinds of tactics are not very difficult.

  • Actually, they’re the foundation upon which you build the capability of observability, and they really underpin resilience. Until you know what’s going on and can spot it quickly, it’s very hard to have a resilient system.

  • If you don’t know there’s been a failure, thinking about all the sophisticated, clever stuff you could do to minimize the impact of failure just seems to be the wrong way round.

  • Similarly with security, it’s really common for people to start by saying, “Here’s some security technology we’re going to use.” And my first question is, “That’s great. What threat is that mitigating?” They’ll look blankly and go, “Well, it’s security.” Yes, but do you have a threat that this mitigates? Because often you find they’ve added quite a lot of complexity. None of this stuff comes for free. And actually, they’re very unclear whether they really have that threat and whether it’s important.

  • Same with resilience, really: if you can’t get good visibility of the system, and you don’t know what kinds of things go wrong, knowing what to prioritize in terms of mitigations is really quite difficult. That’s why I start with the basics of making sure you can get as much information about your system as possible. Make sure that it can health check itself. Make sure that you can build in simple watchdogs that can tell you when things go wrong, and make sure you’ve got good alerting in the system.

  • I think it’s [the lack of health checks on the application] perfectly understandable. People do miss it, because no one’s asking for it, are they? Even if you’ve got Ops integrated into your teams, they’re often not thinking about that either. Often the mistake is worrying about things like alerts and watchdogs for the infrastructure components, when actually, at the application level, there isn’t enough knowledge about what’s going on inside the app.

  • It’s all a risk mitigation exercise. But you could argue architecture is fundamentally about mitigating technical risk. For nearly all of these things, these tactics have been created because people spotted there was a risk and sometimes, that risk occurred.
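
To ground the earlier point about application-level health checks: the application should report on its own internals, not merely whether the process is up. Here is a minimal, hypothetical sketch of a health endpoint using only the Python standard library; the individual checks are placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    # Placeholder: e.g., run "SELECT 1" against the connection pool.
    return True


def check_queue() -> bool:
    # Placeholder: e.g., verify the message backlog is below a threshold.
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        checks = {"database": check_database(), "queue": check_queue()}
        healthy = all(checks.values())
        body = json.dumps({"healthy": healthy, "checks": checks}).encode()
        # A 503 lets load balancers and watchdogs react without parsing JSON.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```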

Allowing for Failures

  • This is the resilience approach as opposed to the availability approach. It’s changing the mindset. When we used to build a system with HA technology, we were effectively pushing the allowance for failure out of the application software and into the HA system. The HA system will spot there’s a problem. It will shut things down, and it will start things up over here. The application itself is unaware of what is going on.

  • If you’ve got lots and lots of pieces, something’s always going wrong. You can’t do the sort of cold fail-over type pattern. The individual parts of the system have to take a lot more responsibility for continuing to run when things go wrong.

  • That’s what I mean by embracing failure, if you like: you have to allow for the fact that failures can occur when you’re designing the application software, as opposed to saying we’ll keep the application software very simple and put all this complexity into the High Availability manager.
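
In application code, embracing failure typically shows up as explicit timeouts, retries, and graceful degradation around anything remote, rather than relying on an HA layer to hide the fault. A hypothetical sketch, where the remote call and the fallback are invented for illustration:

```python
import random
import time


def fetch_recommendations(user_id: str) -> list[str]:
    """Stand-in for a remote call that can fail or time out."""
    if random.random() < 0.3:
        raise TimeoutError("recommendation service unavailable")
    return ["item-1", "item-2"]


def recommendations_with_fallback(user_id: str, retries: int = 2) -> list[str]:
    for attempt in range(retries + 1):
        try:
            return fetch_recommendations(user_id)
        except TimeoutError:
            if attempt < retries:
                time.sleep(0.05 * 2 ** attempt)  # brief exponential backoff
    # Degrade gracefully: an empty shelf keeps the page up when the
    # dependency is down, instead of failing the whole request.
    return []


print(recommendations_with_fallback("user-42"))
```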

3 Tech Lead Wisdom

  1. Focus on fundamentals, right through your career.

    • Things change all the time. The individual technologies move at a very fast rate.

    • I especially worry sometimes about people who come into our industry through the fast conversion courses, the so-called bootcamps: they might know quite a bit about today’s technologies, but there just hasn’t been time to give them the fundamentals and the historical context.

    • Go back and learn what the fundamentals are. Why was it created? How does it work? What are its strengths and weaknesses?

  2. Holistic thinking.

    • Some people say step back, some people say see the big picture.

    • When you’re working on a micro problem or a micro design, or a micro piece of user interaction, whatever it is, it’s quite easy to lose context and purpose.

    • That’s one of the things often that I find when we have architecture problems, teams have done great work at the micro-level, and no one has been stepping back and trying to work out what the macro-level looks like.

  3. Remember that you’re not the smartest person in the room, even if you think you are.

    • There’s almost certainly somebody who knows more about nearly everything than you. So, be as good at listening as you are at doing the talking.

Transcript

[00:01:37] Episode Introduction

[00:01:37] Henry Suryawirawan: Hello again, to all of you, my friends and listeners. Welcome back to another new episode of the Tech Lead Journal podcast with me your host, Henry Suryawirawan. Really excited to be back here again, to share my conversation with another great technical leader in the industry. If you’re new to Tech Lead Journal, subscribe and follow the show on your podcast app and join our social media communities on LinkedIn, Twitter, and Instagram. And if you have been regularly listening and enjoying this podcast and its contents, help support the show by subscribing as a patron at techleadjournal.dev/patron and support me to continue producing great Tech Lead Journal episodes every week.

My guest for today’s episode is Eoin Woods. He is the co-author of “Continuous Architecture in Practice” and the CTO at Endava. This episode is the last of our three-part series on continuous architecture. How have you been enjoying them so far? If you still remember, previously we had episode 67, the first in this series, with Murat Erder, where he shared his insights on continuous architecture’s principles and essential activities. Then in episode 70, we had Pierre Pureur, where he shared the first two important quality attributes covered in the book, which are scalability and performance.

And in this episode, as the last part of the series, Eoin shared the remaining two important quality attributes, which are security and resilience. Eoin started by explaining why we should treat security as a critical quality attribute, and the changes in the overall security landscape these days that make security more challenging, such as the pervasiveness of the internet and the availability of ready-made attack tools. Eoin also shared the threat modeling concept, how we should do continuous threat modeling, and his 10 secure by design principles that we can apply in our software development life cycle.

In the next half of the conversation, Eoin then shared about resilience as a quality attribute, how we should differentiate resilience from high availability, some common resilience techniques that we can implement in our system, and the importance of embracing a failure mindset so that we can account for failures in our software architecture and design.

I enjoyed my conversation with Eoin, wrapping up my knowledge on the remaining quality attributes of continuous architecture. If you also enjoy and find this conversation useful, I encourage you to share it with someone you know who would also benefit from listening to this episode. You can also leave a rating and review on your podcast app, or share some comments on social media about what you enjoyed. So let’s get our episode started right after our sponsor message.

[00:04:54] Introduction

[00:04:54] Henry Suryawirawan: Hello, everyone. Welcome back to a new episode of the Tech Lead Journal podcast. Today I have with me a guest named Eoin Woods. He is the CTO of Endava, based in London. Eoin is a regular conference speaker. He has spoken a lot at previous conferences before the pandemic. So I’ve seen him talk at GOTO. I’ve seen him at other conferences as well. He is actually, in particular, interested in software architecture, software security, and DevOps, and also co-authored a couple of books on software architecture. So today actually is a continuation from the previous episode that we covered, which is about “Continuous Architecture in Practice”. Today we’ll be talking more on the security and resilience aspects of architecture. So Eoin, really looking forward to having this conversation. Welcome to the show.

[00:05:40] Eoin Woods: Thank you. It’s really good to be here. Thank you for the invitation.

[00:05:42] Career Journey

[00:05:42] Henry Suryawirawan: So Eoin, maybe for the people who haven’t known you yet, you can introduce yourself, probably telling us about highlights or turning points in your career.

[00:05:51] Eoin Woods: Certainly. Thank you. So I had a very conventional start. I did a Computer Science degree. I then spent about 10 years in software product development. I worked on products that sort of are still around today. Things like Tuxedo, which was a pioneering TP monitor for transaction processing, the precursor for today’s app servers. I’ve worked on the Sybase database, which is now part of SAP’s software stable. And I also worked in Digital Rights Management. That’s slightly contentious today: the control of audio and video, primarily. That technology lives on today in a number of consumer devices. It’s very well hidden. And that’s where I got my interest in software security.

After that, I moved into the financial domain. I spent about another 10 years in capital markets, working primarily for a large investment bank. But also some time for an asset manager, which was also very good. And then in January 2015, I joined Endava. And that’s completely different because we serve other people by writing software for them, and that was something I’ve never done before. So, it’s great to work at Endava. Another completely different experience.

[00:06:46] Henry Suryawirawan: Thanks for sharing your journey. So it’s interesting that you went from the normal route of software engineer to now being a CTO at Endava, helping people build software. So what are some of the challenges that you have seen lately, in terms of helping people build software?

[00:07:00] Eoin Woods: That’s a great question. We’ve got a lot of clients. It’s very hard to generalize, actually. But some themes that you see a lot at the moment are where people are either starting out with nothing, and they need to build very scalable, resilient, reliable, secure platforms from nothing but really very quickly. They’re often under pressure, of course, because they have investors or there’s a market opportunity they have to meet. Years ago, when you were a startup, you were often building a software product. You had multiple years of funding to go away quietly without (interruptions to) develop your software and ship it, to understanding beta customers who would tell you about all the problems (in it). And now, it’s very frequently SaaS products. It’s just out there on day one. That’s pretty challenging. Those are the people building from nothing.

Then at the other end, there are an awful lot of large organizations who today need to produce, for example, new SaaS digital platforms for themselves or their clients. So they need to enter new markets, or whatever the driver is. They also have to work in this way. But they not only have the problem with building something new. They’ve got a lot of stuff already, which is often slightly unkindly called legacy. I always define legacy as the software that pays the bills. Whereas the new shiny software is probably massively in debt. It’s probably not paid for anything yet. But they’ve got all the software and hardware, and it varies a lot. It may not inherently be bad software, but it’s probably been around a long time. It’s probably much less flexible than the brand new software, because maybe they had no idea what they needed to do with it back then. Maybe the technologies weren’t there. So they’ve got to square the circle, as we say, and they’ve got to somehow maximize the benefit that you can get from lots of existing hardware, software, much of it hard to change, and also going and playing in the digital world.

So at a macro-level, those are the kinds of trends I see. On a micro-level, they wouldn’t be surprising. I think they’re the same ones everywhere. How do you design good software for the cloud first era? They know how to do it until they put it into production, live, and then they discover it was harder than they thought. Dealing with very fast moving cloud platforms, where you build on a service today, and 12 months later, the vendor announces that they’re not going to invest in it anymore. They want you to move to another service and it’s totally incompatible. It works in a totally different way, and they haven’t defined a migration path. Those kinds of problems. And just the ongoing kind of complexity: people expect systems today, by the standards of the systems I wrote when I was a graduate, to be almost magical. They require them to have really good quality attributes, but also be very sophisticated. For example, very adaptive and intuitive user interfaces. People aren’t really prepared to spend any time and effort learning a user interface anymore, which is great as a user. I love that. But I’m seeing lots of people finding just the complexity of the software they need to build today really, really challenging.

[00:09:27] Henry Suryawirawan: I really love the description that you gave of legacy. I’ve heard a couple of different definitions in the past, like legacy software is software that has technical debt, or something like that. But the way you describe it, as the software that pays the bills, is a really great description, and one I haven’t heard before.

[00:09:43] Software Architecture

[00:09:43] Henry Suryawirawan: So speaking about all these challenges, definitely, these days, the pace of technology is really rapid, and the complexity, as you mentioned, is becoming enormous. Especially with the cloud era, digitization, and all these new technologies being introduced as well. So, we are going to the topic of software architecture. What is your definition of software architecture in this era? And why is it still important to look at the aspects of the architecture of the software?

[00:10:09] Eoin Woods: Software architecture has many definitions from very kind of complex academic ones through to very pragmatic ones. For me, the key concerns for architecture are two. One is, there are many people who care about our system. Let’s call them stakeholders, cause that’s what the buzzword is. So there are people who care. Some of them care very positively, and they want it to be a success. And people often forget that some people don’t really want our system to be a success. Or they’re neutral. I mean, for example, internal compliance groups. To them, our new system has just increased risk. They know it has to happen, but actually, if you say to them, would it be great if we did this tomorrow? They go, ah, well, if you must. As opposed to, of course, the people who want to use it and go, we need this all tomorrow.

So there are positive and negative stakeholders. They have a lot of different concerns, and many of those concerns conflict. The corporate security group obviously want it to be totally secure all the time with absolutely no chance of anything going wrong, and users want it to be absolutely easy to use, totally intuitive, with no barriers to use it. Clearly, those two things are in total conflict. Those concerns need to be balanced. And it’s very hard to balance those at a feature level, service level, team level, those kinds of levels of things. That, to me, is a big part of software architecture.

And then, in achieving that, the other dimension which those people don’t think a lot about, but actually is very important to them, is what we call the quality attributes. Also known as the non-functionals. Security, resilience, availability, performance; in some cases, regulatory compliance, the ability to evolve, and so on. All of those are inherently cross-cutting concerns. And again, it’s very hard to point to one empowered service team and go, “You do the security then.” They’ll just look at you and go, “What do you mean ‘do the security’? I mean, we’ll make sure our service follows whatever pattern, styles and approaches we’ve all agreed for security.” But that will not mean that the system is secure, or that the service (will be secure when it) is composed as part of something more complex. Well, that’s something that team doesn’t really know. They just know that they’ve fulfilled one small part of the contract.

So getting those quality attributes achieved, and again trading off between them. Using that to meet the needs of those stakeholders. You’ve got all these different desires and opinions and needs that all conflict. Those things are, to me, the essence of software architecture. And the reason they are the essence of software architecture is that they are cross-cutting. One part of the development team, however empowered, cannot achieve them by themselves.

[00:12:19] Quality Attribute: Security

[00:12:19] Henry Suryawirawan: Hearing your description of architecture, we can totally sense that it’s really quite complicated. It’s a cross-cutting thing. You need to find the right trade-offs based on the business requirements and also a few other quality attributes that sometimes people don’t think about. But when it happens that the system doesn’t scale or doesn’t perform well, they will look back and ask, okay, why didn’t we think about it? Which brings us to the two quality attributes that will be covered today: security and resilience. So, the book actually mentions that security is an architectural concern. I mean, we all know about security these days, especially with a lot of vulnerabilities and a lot of major incidents happening because of security. Why do you think it should be part of the architectural concern from the beginning?

[00:13:05] Eoin Woods: Yes. Good question. I mean, we sort of touched on it before. It’s easy to view security as something that every part of your overall team (say you’re structured around services, every service team) will look after within its own part of the architecture. The problem is that security, in my humble opinion (being a security guy, I would say this), is one of the most difficult and one of the most critical. It’s one of those where it’s very hard to know if you are secure. One very small mistake can suddenly make you dramatically insecure.

Whereas with performance, I’d argue, it’s easier to test that you’ve got acceptable performance, and performance tends to degrade. You tend not to go off a cliff. Now, I know there are exceptions. You can absolutely take the performance of a system off a cliff. But I think it’s more common in my experience that performance degrades, and so you can address it while it’s going on. With security, almost all the time, you’ve got a disaster. So, I think security is something that really has to be handled at system level. I’m not suggesting one person should do it. Don’t get me wrong, that would be a terrible idea. But it’s got to be thought of as a system quality, rather than a quality that any one team in the system development organization can just do either for themselves or everyone else.

[00:14:08] Security Landscape Changes

[00:14:08] Henry Suryawirawan: And speaking about all these recent changes in technology, like cloud, for example. The prevalence of cloud technology is now putting people’s workloads into new territory. Let’s put it this way: you might need to expose them over the internet, whereas previously you had only data centers, for example. So what else has actually changed recently that makes the security landscape more and more risky?

[00:14:31] Eoin Woods: There are a couple of factors, both internal and external. The key external one is that the threat landscape, in the security jargon, has never been worse or more dangerous. There are a couple of reasons for that. Firstly, as you point out, most systems today are inherently embracing the internet. We went through three phases. Systems in the data centers, all safe and sound, very little connectivity. Then we had systems in the data center safe and sound with very clear connections to the internet. And then we’ve moved, to coin a phrase, “The Internet is the System” era where actually, the internet just pervades the system. So different bits of the same system are connected across the internet. Public visibility is just now a given. And the problem with that, of course, is well, it’s very convenient. It means that you can access your system from anywhere else. So you can be attacked from anywhere. Whereas before, actually, it was quite hard for someone in, shall we say, North Korea, to attack someone in Germany. Now, it’s almost trivial.

Second thing is in the threat landscape. The thing that I always get extremely concerned about is how quite sophisticated attacks, quite sophisticated threats are being weaponized, as they call it, meaning packaged up. Again, 10-15 years ago when I was first working in security, it’s not that there weren’t people out there who could exploit very subtle vulnerabilities. I mean, back then, with compiled languages, stack smashing, for example, was something that lots and lots of people out there could do. But actually, when I say lots and lots, it was a small community of very skilled people, who probably all went to the same security conferences, that could actually do it. What’s happened more recently is that it’s become very much the norm that common and sophisticated attacks get packaged up into point and click tools and sold on the Dark Web for Bitcoin. That means that people with relatively modest amounts of money and some time can actually buy the expertise to do the attacks, and just do them themselves. For example, Botnets. Again, to build a botnet is a real non-trivial task. It requires real skill. But if you can rent somebody else’s, that requires almost no skill. So the threat landscape has become really serious.

And then on our side, as we said several times, systems are now hung on the internet. We also tend to expose self-describing interfaces. I mean, I’m old enough to have worked not exactly on mainframes, but alongside mainframes. Back in those days, if you didn’t have the protocol description, it was extremely difficult to actually interact with anything. And then there were huge, big thick books (describing the protocols), and you had to just go and learn them. Whereas now, we use the question mark “?json” on the end of the URL. It tells you all about itself and says here’s how to attack me.

And then the third thing is people are often today using security services. For example, authentication, authorization, distributed automatic protection of the data at rest and all of that. Actually, I think that’s a really good thing because it’s much, much better than building your own. The complexity for a team, though, is that by the time they’ve got something from a cloud manufacturer, something from a SaaS service, something from the database provider and something from the application server, there’s quite a complex web of pieces that they probably don’t understand very deeply because they’re not security engineers. This isn’t what they do. They’ve got to put that together safely. Again, in the past there were only a certain number of people who could actually put the security stuff together and they were real experts. They were security engineers. And that in itself was actually quite a big problem because it slowed us down a lot. It made security engineers probably too powerful, and often limited what we could do. Or people just went ahead with almost no security cause they didn’t have security engineers. Today, all the bits you need are there. You probably don’t need a dedicated security engineer. But on the other hand, if you actually don’t understand all the pieces, you could create a very insecure system out of very secure services.

So that’s, I think, why people just need to be much more aware today than they were, of the security of their systems from the first time they write a line of code. And again, historically, there’s always been a tendency for security engineering to look at the requirements, give some broad recommendations, a few sketches of choices about how things might work and go away, and then come back quite a long time later when quite a lot has been developed, then tell everyone it was wrong, and here’s what they had to change to put the security in. So the security was sort of all bolted on later on. Well, because we ship stuff now almost immediately into public facing clients, we just can’t do that anymore. So, almost as soon as we’re writing a line of code, we need to be thinking about how do we make sure that it’s a secure line of code and it will be deployed and operated securely as well.

[00:18:29] Henry Suryawirawan: So thanks for sharing all of these different aspects. Definitely, I learned a lot of things as well. For example, exposing yourself unnecessarily through self-describing interface. I didn’t really realize that, but yeah, actually that’s true. And the other thing is complexities of different multiple technologies that we are integrating now, like you mentioned, authorization or authentication as a service, even managed services from the cloud or Platform as a Service. So all those things, we need to know different permutations of how they actually interact and basically find the loopholes where the security danger is. Thanks for sharing that.

[00:18:59] Availability as Security Objective

[00:18:59] Henry Suryawirawan: One thing that I read from the chapter in the book about security is that you mentioned three objectives of security, which are the common ones: confidentiality, integrity, and availability. Availability here is quite different, in my opinion. So can you explain why availability is actually an objective of security?

[00:19:17] Eoin Woods: Sure. Yeah. So there are a number of classical models for security. And the so-called CIA Triad, the Confidentiality Integrity Availability, is one of those classic models. I didn’t invent it. It’s as old as security, really. But the reason it’s there is because one of the common attacks, if you like, on systems classically has been to simply take them out of operation. Obviously, we normally think of availability more as guarding against self-inflicted or accidental internal failures and keeping the system running. But actually, there’s a security dimension too, and that’s to make sure that an adversary cannot use the system’s availability against us in some way. In other words, they attack the system to make it unavailable. And so that’s why for a very long time, it’s been part of that CIA triad. For example, suppose you are in some time sensitive business. You’re in stock trading or online gambling, or even perhaps auction bidding. If you can, without even preventing the system being available, simply slow it down significantly, then an adversary can gain a significant benefit. That is a security concern. What they’re doing is they’re attacking the availability of the system for them to get a benefit. And so that’s why it’s included in security. You’re quite right that it’s a different way of thinking about availability. It’s about threats from external factors to our availability, rather than threats from internal failure or internal mistakes to our availability.

[00:20:36] Henry Suryawirawan: So thanks for clarifying that. Yeah. I also now think about the DDoS attack, for example. If a massive surge of traffic goes into your server, performance becomes slow or, even worse, the system could become unavailable, and people will not be able to use it. And hence, the availability part of security.

[00:20:51] Threat Modeling

[00:20:51] Henry Suryawirawan: So you mentioned threat modeling a couple of times. For people who may not be familiar with security terms and jargon, can you help explain what threat modeling actually is?

[00:21:00] Eoin Woods: Of course. Yeah. So threat modeling is actually very simple. As you say, you’re quite right to call it out as jargon. Every subfield has its own terminology and jargon, doesn’t it? So threat modeling is a very straightforward process. It’s trying to work through, really with a combination of references to common catalogs of threats and also just thinking creatively ourselves, what threats our system could face. So, for example, what threats could there be to our information integrity? How could somebody corrupt information? How could somebody perhaps damage information in some way? How could somebody undermine the operation of the system by intercepting messages, for example? Or could they do something at the application level, perhaps, to put dangerous data or unvalidated data into the system to stop something working?

Confidentiality, of course, is about keeping things private. That’s the classic security concern, actually. It’s keeping things private. And then availability, what could people do to affect the availability of our system? So, for example, as you say, DDoS attack, that will be a very classical example and you might take it from a catalog. For our particular system, we might recognize that actually, we’re very sensitive to the size of one particular message type because of what we do with it. We realized, to our horror, that if somebody sent us something modest but big, like a 100 MB message and they sent us five of them in quick succession, they actually might prevent us processing messages for 20 minutes, which might be quite long enough in some scenarios to actually cause us a problem. So that’s all threat modeling is.

There are a number of ways of going about it. There’re even a number of philosophies and schools as to how you should go about it. Probably the best known reference would be Adam Shostack’s book, simply called “Threat Modeling”. Adam Shostack worked for Microsoft for a number of years. I mean, he’d been a security guy well before that, but he is most famous, I think, for his work at Microsoft, improving their security practice and threat modeling was a big part of that.

And also people like Charles Weir, of Lancaster University with his securedevelopment.org site, where he’s just trying to raise awareness that you don’t need a complex process. You don’t need a load of tools. You don’t need a load of security experts. You know your system. Just think through what it is that could affect its confidentiality, its integrity, or its availability. For the team, unwrap those terms to (work out) how could we leak information? How could somebody steal information? How could somebody destroy information? And so on. Because actually, when you sit most teams down and you say, how could information get out of our system maliciously? They’ve got all kinds of ideas how it might happen. They know their system because they wrote it. So actually, it’s often made to look like something the security “high priests” have to come and do to you. And actually, people like Charles Weir have been doing a great job of pointing out that, no, it’s not that the security specialists haven’t got a lot to offer, but actually, you can just get started yourself. You probably have a very good idea what threats your system faces from a security perspective.

[00:23:31] Henry Suryawirawan: That’s actually a very good point because often in my career, I’ve seen a lot of different development teams, for example, product teams. They just put their hands in the air and say, I don’t know anything about security. But actually, if you think about it through the lens of typical users or typical owners of the product, you probably would be able to find out where the potential attacks could come from, or which things may not be as secure as you think. So I think, yeah, that’s really right that everyone should know where the potential attacks could be.

[00:23:59] Continuous Threat Modeling

[00:23:59] Henry Suryawirawan: Speaking about the movement of DevSecOps and about shifting left on security, there’s this thing in your book also, where you mentioned continuous threat modeling and mitigation. So there’s this continuous part of that threat modeling. So what do you mean by that? Is it something that we have to codify in our secure software delivery pipeline, or is it something like an activity that you always have to schedule in your sprints? What do you mean by this continuous threat modeling?

[00:24:23] Eoin Woods: Yeah. Great question, actually. So it’s not, unfortunately, something we can just put in the pipeline, however sophisticated our tooling. It’s a design activity or analysis activity, really. The reason I mention that is that threat modeling, historically, has tended to be something where you created a complete formal design for your system before you’d written much code. And again, you’ve got the security high priests (and priestesses, of course) to arrive. And they then looked over what you had made, did a threat model and they handed you back the threat model. The point of the threat model, by the way, whether you do it, or somebody else does, is that you know what risks you need to mitigate. Because the output of the threat model is effectively the risks that you need to rank, and thinking about how you mitigate them. And then you mitigate them in your design and then you build your system. Job done. Well, of course, that wasn’t all that useful because all kinds of things changed during delivery, and nobody went back to talk to the security group. The other problem was where people hadn’t done much security work at all. Then quite late in the day, somebody maybe says, “So just take me through the security model.” And they go, “gasp security”. And then somebody says, “Well, you should have a threat model”. So really late in the day, they sort of do a threat model, and they realize that actually, they’ve built a little thing that’s quite hard to secure.

So, to avoid these two extremes, neither of which is helpful, the idea of continuous threat modeling is that rather than threat modeling being this big, complicated thing that has to happen once a year, a bit like undergoing an audit, all painful and unpleasant, you actually just make thinking about threat modeling part of your standard delivery process. The first time you start talking about a story, start thinking, does this have new threats or could this introduce new threats? As you’re thinking about changing a piece of infrastructure. As you’re thinking about changing the layering of the system. As you’re thinking about, perhaps we’ve got a set of brand new services, and they’re going to have new sorts of database this time, and we’re going to interact with this by messaging. We stop and just go, okay. So how could people extract information? How could people make this unavailable? How could people execute operations that they’re not meant to? So really, it’s part of the development cycle of each piece, rather than this big thing that happens very occasionally and, honestly, maybe isn’t that effective because it’s done at a very high level. The term I like to use is democratizing it. Make it continuous and make it everyone’s activity. Just like testing. Everyone should be testing. Part of the testing should be security, and part of working out what threats you’ve got should be threat modeling.

[00:26:33] Henry Suryawirawan: So, yeah. Speaking about continuous testing and all that, I think most people get it these days. Developers tend to want to test their software now, and product managers want to test too. So hopefully one day we’ll also see developers and product managers thinking about security every time, rather than it being a special activity or something only a few specialists can advise on.

[00:26:55] Secure by Design

[00:26:55] Henry Suryawirawan: I saw one of your talks in the previous conferences about secure by design principles. I think this is quite interesting. And you mentioned there are 10 principles. Why do you think there are these principles that we have to think about whenever we design our software? Can you explain a little bit about this secure by design concept?

[00:27:13] Eoin Woods: Of course. Yes. So the first thing to say is that it’s a set of 10 principles that I came up with. The reason I came up with them, and this is quite some years ago now, probably when I was in banking, is that I’ve always liked principles as a way of driving behavior. The great thing about principles is that they make it clear what’s important and what’s less important, but they don’t tell you what to do. I don’t really like people making design decisions for me, but I do recognize that I don’t know everything. In fact, I’m a very long way from knowing everything. So guidance is good. That’s how I came to the idea of design principles. There have been software design principles for a very long time, and they’re used in other fields too. I came across them in a couple of places.

I’d done some work for a security software company called Intertrust Technologies, who are primarily a DRM and trust vendor, kind of a research lab. And then I got into banking, where I assumed that security was job one and everyone would be an expert on security. But in reality, no one knew much about security. The perimeter was brilliant, but within the bank the knowledge was very mixed. What really helped, I must say, was that the culture there was brilliant: very cooperative, with very service-oriented security engineering teams. But in the delivery teams, generally, there was still this very unhelpful mindset of: oh, we’ve got these helpful security engineers, they will just work magic after we’ve developed the software. “Now this is going to end so badly.” Luckily, we didn’t actually have lots of security problems. But if you ran scans and things on systems, you often found quite elementary problems.

So I started working out that I didn’t want to just tell people what to do, and anyway, I couldn’t: there were too many systems. By this point, I was quite a senior architect, and there were lots and lots of systems. And I thought, well, thinking back, I always used to find design principles really useful when I was learning how to design software. So I went off and looked for some security principles. And I found that there are loads and loads of them. I mean, far too many. The National Cyber Security Centre in the UK has got a set. The systems engineering folks have got a set, and some of the vendors have got sets. Oh no. This is no good. I can’t give people hundreds of principles. They just won’t read them.

So what I did was read them all and think about them. There was everything from sets of six or eight principles, literally through to sets of hundreds of principles. But I noticed that certain themes emerged quite quickly. They were all talking about least privilege. They were all talking about the importance of trusting carefully. They were all talking about failing securely. They were all talking about the importance of making sure that default behavior was secure. I felt, well, actually, these principles are saying the same things in many different ways, and when I boiled them down, I got to about 10. Now obviously, I ended up with 10 because 10 is a nice number for a conference talk! Many people have pointed out that one of them, “fail securely and have secure defaults”, is really two principles. And they’re absolutely right. You’ve caught me bang to rights. I put two together to get to 10; otherwise, I had 11.
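As an illustration of two of those themes, failing securely and having secure defaults, here is a minimal Python sketch. The check_access function, the policy_store object, and its lookup method are hypothetical; the point is simply that the code denies by default and denies on error.

```python
# A minimal sketch of "fail securely" and "secure defaults".
# policy_store and its lookup() method are illustrative placeholders.

def check_access(user, resource, policy_store) -> bool:
    """Deny by default; deny on error (fail closed, not open)."""
    try:
        policy = policy_store.lookup(resource)  # may raise during an outage
    except Exception:
        # Fail securely: if we can't evaluate the policy, refuse access
        # rather than letting the request through.
        return False
    if policy is None:
        # Secure default: a resource with no explicit policy is private.
        return False
    return user.id in policy.allowed_user_ids
```

The insecure variants of both decisions are easy to write by accident: returning True when the lookup fails, or treating a missing policy as "anyone can read this".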

But the reason I came up with them is that I thought: these are the essence of all these other sets of principles, written by people far cleverer than me, and I’ll try to capture that. If I can boil them down to 10, and make them reasonably short and reasonably entertaining while I explain them, there’s some chance people might remember them. And what do you know? It seems to have worked. I was very surprised. I did this talk, and I don’t remember where I did it first. It might have been ACCU in the UK, which is a very technical conference, very much at the implementation level, and they really liked them. Then I think I submitted it to one other place, and then I just started getting invitations to present it. They really seem to have landed. And I’ve done a couple of other talks around principles since then. So that’s why I use principles in general. I think they’re really good aide-mémoires. They guide, they don’t command, and that’s good because you empower people. You don’t tell them what to do, because we can’t tell them what to do; they know more than you do. But principles help people remember what’s really important about security. People have limited brain space, so this gives them an accessible, pretty jargon-free thing to remember when they’re building their systems. And so that’s how I came up with them, and that’s why I’m still using them today.

[00:30:40] Henry Suryawirawan: I also personally find those really useful, and I like the way you think about principles: things that drive behavior, not necessarily telling people exactly what they should do. That’s something I’ve sometimes found in the security industry, where people just point and say, here’s what you need to do; don’t leave your bucket open for public access, for example. A principle is something we can all agree with, and people might find different ways to implement it. So thanks for sharing this. I’ll make sure to put it in the show notes as well, for people who want to know about these 10 (or 11) secure by design principles.

[00:31:13] Quality Attribute: Resilience

[00:31:13] Henry Suryawirawan: So moving on to the next section, which is about the other quality attribute: resilience. Again, the same type of question. Why do you think resilience should be an architectural concern?

[00:31:23] Eoin Woods: So, yeah, that’s a great question. When I started doing software architecture and design, I think we mainly talked about availability. We thought about everything in terms of three nines, four nines, five nines, those things. And the thinking really was: we don’t get many failures, and when we do, let’s make the recovery time acceptable and minimize the data loss. These systems were just being used by folks in our organization; if they didn’t have them for a couple of hours, well, they’d just use post-it notes for a bit and catch up afterwards. It was not a big deal. For most organizations, that was the trade-off. And you could buy your way out of it: if you were a stock exchange, you bought a Tandem, because Tandem machines could run literally nonstop. That’s what they called them (“Tandem NonStop”). They would run non-stop, but they were very expensive.

I had a friend who worked for Tandem at one time, and he told a funny story that they used to go to customers and say, “How much downtime can you take?” And the customer would say, “Oh, none, ever.” They’d go, “Okay, well, it’ll be $20 million (per year) then.” And the customer would go, “Oh, hang on a minute. That’s a lot of money.” So they’d go, “Yeah. Could you take 10 minutes a year?” And the customer would go, “Oh yeah; that’s too many dollars.” The thinking back then was: if you needed it always on, it was Tandem, and everybody else just went, “Well, we’ll have some downtime and it’ll be fine.” And then, of course, what’s happened is that now our systems, as we’ve been talking about, are cloud first, internet connected, used by people all over the world; they’re SaaS-based products. We can’t just say, “Oh, we’ve had a bit of an outage. We’ll be back at lunchtime.” I remember literally doing that inside organizations. You just can’t do that anymore. The failures will still happen, but the thinking has completely flipped on its head. It’s now about, to coin a phrase, embracing the failure. It’s about building systems that deal with failure and stay in operation, rather than trying to mask the failure or, if the failure is really serious, simply accepting that they’ll go down for a while to recover from it.

I think the things that make resilience so important today are, first, that we’re building systems with lots of independent pieces. Simple mathematics tells you things will go wrong: mean time between failures can only be so long, and if you’ve got enough pieces, something is going to be failing at any given point. I would imagine that in the Amazon estate, at absolutely any time, something is not working, simply because there are so many moving parts. And second, because things are now so critical and internet facing, we can’t take long downtime periods to do recoveries and fail back. So this new approach, where we live with failure in the system and have the different parts deal with failures in the things they rely on, so that the overall system becomes resilient, has to be the way we go for certain classes of system today. That’s not to say every system has to work this way, but many systems do.

[00:33:38] Resilience and High Availability

[00:33:38] Henry Suryawirawan: You mentioned availability at the beginning, and reliability; some also say high availability. What’s the difference between all of these and resilience?

[00:33:46] Eoin Woods: My view of it is that the difference is the philosophy. High availability is about saying failures are infrequent. When they happen, we will use technology that tries to mask the failure, normally by moving to a replica of what has failed. If there’s a bit of downtime, that’s probably okay. Because they’re infrequent events, if you like, we don’t optimize around them. Resilience is more saying we’ve now got so many moving parts. There’s always going to be something not working, so everything else has to continue working.

Whereas in the high availability scenario, when something has failed, there is an acceptance that the system is unavailable, even if only for a short period, while it sorts itself out, hopefully semi-automatically, using high-availability technology. That sort of technology works best with a relatively small number of relatively large parts, a few big app servers and a few big databases, rather than myriad microservices all running independently, on which it’s very hard to do centralized HA. Resilience is about spreading the responsibility across all the pieces.

[00:34:42] Henry Suryawirawan: Is it also fair to say that the resilience part of the system architecture is actually trying to optimize MTTR? In the sysadmin and operations world, we have these concepts of MTTR and MTBF. Is it correct to say that resilience is actually about ensuring that the MTTR is as small as possible?

[00:35:02] Eoin Woods: Yeah, exactly. I think it was John Allspaw, quite a few years ago (I believe he’s since said he would put it differently today), who said: we keep optimizing for mean time between failures, but failures happen anyway, so we should focus just as much on how long it takes us to recover from a failure. That is, I think, the resilience thinking. It’s moving from saying, why don’t we make the mean time between failures as long as possible, and when something happens we’ll just accept we’ll be out for a while, to saying: yes, of course we’ll maximize our time between failures, why wouldn’t we? But it’s just as important that when a failure occurs, it has minimal impact. So the time to recover should be very short.
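A quick back-of-the-envelope calculation shows why recovery time matters as much as time between failures. Steady-state availability is commonly approximated as MTBF / (MTBF + MTTR); the numbers below are made up purely for illustration.

```python
# Illustrating why MTTR matters as much as MTBF.
# Steady-state availability ~= MTBF / (MTBF + MTTR).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Baseline: one failure a month (MTBF = 720h), 4-hour recovery.
print(availability(720, 4))    # ~0.9945
# Doubling MTBF helps a bit:
print(availability(1440, 4))   # ~0.9972
# Cutting recovery to 30 minutes helps far more:
print(availability(720, 0.5))  # ~0.9993
```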

[00:35:36] Resilience Techniques

[00:35:36] Henry Suryawirawan: So let’s go over a few of these resilience techniques. I think you mentioned several tactics in the book. Do you have any favorites that you want to cover?

[00:35:44] Eoin Woods: Oh, that’s a great question. Do I have any favorites? I think the fundamental ones are about knowing what’s going on, which of course is the field of monitoring and observability, bringing the different threads of system awareness together, I suppose. Health checks and watchdogs and alerts, those kinds of tactics. For one thing, they are not very difficult. Some resilience approaches are really quite sophisticated and can be quite difficult to do, but building in watchdogs for things that can go wrong, building in alerts, and building good health checks into your system are not difficult.

Actually, they’re the foundation on which you build the capability of observability, and they really build resilience. Until you know what’s going on and can spot it quickly, it’s very hard to have a resilient system. And yet it’s quite rare, actually. You go into project teams and you say, “How are your health checks?” or “Are all of these components monitored at runtime?” There is often a bit of a pause and they go, “Yeah. Yeah. We really should have figured that out. We really do need to do that. Thank you for asking.” “What you mean is you don’t know?” And they go, “Yeah, we don’t know. We know we have to do it, but it’s just one of those things that fell off the bottom of the list.” And yet you’ll often find they’ve been agonizing about building in bulkheads or something, where you separate parts of the system so that failure can’t cascade. Very sensible; I’m not suggesting you shouldn’t do that. But if you don’t know there’s been a failure, thinking about all the sophisticated, clever stuff you could do to minimize the impact of failure just seems to be the wrong way round.

Similarly with security. It’s really common that people start by saying, “Here’s some security technology we’re going to use.” And my first question is, “That’s great, what threat is that mitigating?” They’ll look blankly and go, “Well, it’s security.” Yes, but do you have a threat that this mitigates? Some things are like “we’re going to authenticate users”. Yep, I accept you need to authenticate users; you don’t need a big-brain thought about why, because you need to know identity to have any other security. But there are lots of other things. People start applying cryptography, for example, and I quite often say, “Which threat do you think this is mitigating?” “Oh, it’s to keep it secure.” Yeah, I think that’s absolutely the right instinct, but what is the threat that this will mitigate? Because often you find they’ve put in quite a lot of complexity, and none of this stuff comes for free, yet they’re very unclear whether they really have that threat and whether it’s important.

Same with resilience, really. If you can’t get good visibility of the system and you don’t know what kinds of things go wrong, knowing what to prioritize in terms of mitigations is really quite difficult. So that’s why I start with the basics: make sure you can get as much information about your system as possible, make sure it can health check itself, build in simple watchdogs that can tell you when things go wrong, and make sure you’ve got good alerting in the system.
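As a sketch of how simple those basics can be, here is a minimal /health endpoint using only the Python standard library. The dependency checks are stubs (a real one might run "SELECT 1" against the database); an external watchdog, such as a Kubernetes probe or a load balancer, would poll this endpoint and alert or restart the service when it reports degraded.

```python
# A minimal health-check endpoint, standard library only.
# check_database / check_broker are placeholders for real dependency checks.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    return True  # stub: e.g. run "SELECT 1" against the real database

def check_broker() -> bool:
    return True  # stub: e.g. ping the message broker

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        checks = {"database": check_database(), "broker": check_broker()}
        healthy = all(checks.values())
        body = json.dumps({"status": "ok" if healthy else "degraded",
                           "checks": checks}).encode()
        # 503 on degraded lets load balancers and probes react automatically.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```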

[00:38:07] Henry Suryawirawan: Actually, after listening to what you said, I’ve come to believe that as well. Even though platforms these days have the capability of enabling health checks, like Kubernetes and some cloud providers do, I often see that the application itself doesn’t have a way of saying, okay, the application is healthy. Which means it’s actually an afterthought; it’s not something the team prepared for up front, knowing what needs to be done to provide the capability to say this application is healthy. That’s still true in my experience, even recently. So thanks for pointing that out. Even though health checks and alerts are simple, many people still miss them in their software development.

[00:38:45] Eoin Woods: And I think it’s perfectly understandable that people miss it, because no one’s asking for it, are they? Even the Ops group, if you’ve got Ops integrated into your teams, often aren’t thinking to ask for it, because they’re worrying about alerts and watchdogs for the infrastructure components. Very important, but not enough. Quite often in my career, actually, I’ve been involved in quite complex incidents. When I was in banking, people told me I had a calm voice, so I somehow used to get dragged into handling incidents. I wasn’t calm, I can assure you; I just sounded calm. On quite a number of occasions, real-life cases, there was lots and lots of monitoring, but it was all at the infrastructure level, or what was there at the application level was so trivial that something could be quite seriously locked up in the middle of an application and nobody could tell, until they literally went in and started looking at log lines and debuggers. Afterwards you’d think, crikey, if you’d just had an end-to-end transaction check, you’d have known you had an impending problem several hours before you had an incident. Because everything was green, everyone thought everything was fine. But at the application level, there wasn’t enough knowledge about what was going on inside the app.

[00:39:46] Henry Suryawirawan: Another thing you mentioned just now is explaining why you do something. Why do you implement encryption, for example? Why do you implement a bulkhead? The key question is: what threat model, or what resilience aspect, are you trying to address, so that you don’t introduce that complexity unnecessarily? I think sometimes we neglect that, because we think, yeah, it’s secure, so we need to do it.

[00:40:06] Eoin Woods: It’s all a risk mitigation exercise. You could argue architecture is fundamentally about mitigating technical risk. Nearly all of these tactics were created because people spotted there was a risk, and sometimes because that risk had actually occurred.

[00:40:18] Allowing for Failures

[00:40:18] Henry Suryawirawan: Another aspect I see as quite important in the book, when you architect for resilience, is something called allowing for failure. What do you mean by that, allowing for failures?

[00:40:29] Eoin Woods: This is the resilience approach as opposed to the availability approach; it’s changing the mindset. When we used to build systems with HA technology, we effectively pushed the allowance for failure out of the application software and into the HA system. The HA system would spot there was a problem, shut things down, and start things up over here. The application itself was unaware of what was going on. And of course, as we said, if you’ve got lots and lots of pieces, something’s always going wrong, so you can’t do that sort of cold-failover pattern. The individual parts of the system have to take a lot more responsibility for continuing to run when things go wrong. That’s what I mean by embracing failure, if you like: you have to allow for the fact that failures can occur when you’re designing the application software, as opposed to saying, we’ll keep the application software very simple and put all this complexity into the high availability manager.
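Here is a minimal sketch of what allowing for failure can look like inside application code, rather than in an external HA manager: a short timeout, a bounded retry, and a graceful fallback. The service URL and the fetch_recommendations function are invented for this example; a fuller treatment might add a circuit breaker so a struggling dependency isn’t hammered with retries.

```python
# A minimal sketch of application code that allows for failure:
# fail fast with a timeout, retry a bounded number of times, then degrade
# gracefully instead of failing the whole request.

import time
import urllib.request
from urllib.error import URLError

def fetch_recommendations(user_id: str) -> list[str]:
    url = f"https://recs.example.internal/users/{user_id}"  # hypothetical service
    for attempt in range(3):                 # bounded retries, not infinite
        try:
            with urllib.request.urlopen(url, timeout=0.5) as resp:  # fail fast
                return resp.read().decode().splitlines()
        except (URLError, OSError):
            time.sleep(0.2 * (attempt + 1))  # brief backoff between attempts
    # Fallback: serve default content rather than propagating the failure.
    return ["bestsellers"]
```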

[00:41:14] Henry Suryawirawan: I think that totally makes sense. These days we have to expect that failures will happen and design for what happens afterwards, which reduces the MTTR aspect of our operations.

[00:41:23] 3 Tech Lead Wisdom

[00:41:23] Henry Suryawirawan: So, it’s been a very insightful conversation, talking about security and talking about resilience. Unfortunately, we have to cut it short this time. But I always ask this one last question of all my guests, which is about three technical leadership wisdom. This is something you want to share for people to learn from you: from your journey, maybe from your wisdom, or maybe from your past mistakes as well. Would you be able to share your three technical leadership wisdom?

[00:41:46] Eoin Woods: I’ll try to. I don’t know how wise they are, but I’ll certainly try. The first thing I find important is to focus on fundamentals, right through your career. As we said earlier, things change all the time, and individual technologies move at a very fast rate. I especially worry sometimes about people who come into our industry through the fast conversion courses, the so-called bootcamps: they might know quite a bit about today’s technologies, but there just hasn’t been time to give them the fundamentals and the historical context. So go back and learn the fundamentals. The technology I’m using today is almost certainly an instance of a type of technology. Why was it created? How does it work? What are its strengths and weaknesses?

The second thing I’ve found valuable throughout my career is holistic thinking. I hesitate to call it systems thinking, because that’s actually a technical field, but some people say “step back”, some people say “see the big picture”. When you’re working on a micro problem, a micro design, or a micro piece of user interaction, whatever it is, it’s quite easy to lose context and purpose. That’s often what I find when we have architecture problems: teams have done great work at the micro level, but no one has been stepping back and trying to work out what the macro level looks like.

And the third thing is: remember that you’re not the smartest person in the room, even if you think you are. For nearly everything, there’s almost certainly somebody who knows more about it than you. So be as good at listening as you are at doing the talking.

[00:42:59] Henry Suryawirawan: Lovely wisdom, indeed. I love all of them. Especially the second one, holistic thinking. So I think these days, we tend to go micro, but sometimes we should step back and look at the macro side as well. So Eoin, for people who want to continue this discussion online with you, or maybe find the books, where can they find you online?

[00:43:15] Eoin Woods: They can find me on my website, which is EoinWoods.info, E O I N W O O D S.info, which I know is not obvious, but that is how you spell it. There’s also ContinuousArchitecture.com, the site for the book, which all three authors are currently putting content into. So you can contact us through there, or you can find me via my website. There are links on my website to the normal places.

[00:43:34] Henry Suryawirawan: I’ll make sure to put all those in the show notes. So Eoin, thanks so much for coming to this conversation. It’s been a pleasure.

[00:43:39] Eoin Woods: Thank you, Henry. It was really fun to talk to you. Thank you again for the invitation.

– End –