#5 - Infrastructure as Code - Kief Morris

 

“With Infrastructure as Code, you’re not trying to kind of reverse engineer or understand what ended up somehow onto each system. You’re actually saying: this is how the system is built, because it’s built from that code. So there is no difference.”

In this episode, I had an in-depth discussion with Kief Morris, the author of the O’Reilly “Infrastructure as Code” book. We started from what Infrastructure as Code is and why we should implement this important concept for managing our infrastructure, especially in the cloud era. We also discussed Infrastructure as Code principles, patterns, anti-patterns, pipelines, testing, and recent new tools in this space. Kief also talked about his upcoming second edition of the Infrastructure as Code book and the new changes he is introducing. Do not miss our Pet vs Cattle discussion!

Listen out for:

  • Kief’s career journey and how he started doing infrastructure as code - [00:03:22]
  • How Kief got into writing Infrastructure as Code Book - [00:07:24]
  • What is Infrastructure as Code and why - [00:10:35]
  • Pet vs Cattle - [00:18:56]
  • Infrastructure as Code principles & patterns - [00:20:15]
  • Automation fear - [00:27:06]
  • Refactoring infrastructure code - [00:30:49]
  • Infrastructure as Code pipeline & testing - [00:36:17]
  • Pulumi and CDK - [00:48:29]
  • Infrastructure as Code anti-pattern example - [00:50:07]
  • 2nd edition of Infrastructure as Code book - [00:51:40]
  • Infrastructure as Code reverse engineering - [00:53:40]
  • Kief’s 3 Tech Lead Wisdom - [00:55:02]

_____

Kief Morris’s Bio
Kief is a Global Director of Cloud Engineering at ThoughtWorks. He enjoys helping organisations adopt cloud age technologies and practices. This usually involves buzzwords like cloud, digital platforms, infrastructure automation, DevOps, and Continuous Delivery. Originally from Tennessee, Kief has been building teams to deliver software as a service in London since the dotcom days. He is the author of “Infrastructure as Code”, published by O’Reilly.

Follow Kief:

Mentions & Links:

 

Like this episode?
Subscribe and leave us a rating & review on your favorite podcast app or feedback page.
Follow @techleadjournal on LinkedIn, Twitter, and Instagram.
Pledge your support by becoming a patron.

 

Quotes

On What Is Infrastructure as Code And Why

  • You define your infrastructure in files that you can manage and treat like source code, and apply software engineering practices like Test Driven Development, Continuous Integration, and Continuous Delivery.

  • It’s not so much that one type of language or the other is the right language to use for infrastructure code. I think there are different concerns, different things going on in an infrastructure code base. And some of them are more appropriate to use a declarative language and some are more appropriate to use a more kind of general purpose, procedural or object oriented language.

  • The value of Infrastructure as Code is in the reproducibility and consistency.

  • Without Infrastructure as Code, what you often find is that, especially for a system of any significant size, you end up with lots of different things and lots of different versions of things running.

  • With Infrastructure as Code, it makes it very simple to roll out patches and keep everything at a consistent level.

  • With Infrastructure as Code, you’re not trying to kind of reverse engineer or understand what ended up somehow onto each system. You’re actually saying: this is how the system is built, because it’s built from that code. So there is no difference.

On Infrastructure as Code Anti-Patterns

  • The Snowflake system is the one that everybody’s afraid to touch or you can’t rebuild it.

  • Pets are like systems that you really love. You’ve built them by hand very carefully.

  • Cattle are like, you don’t have an emotional attachment to them because you have too many of them and you’re going to just eat them anyways.

  • You should treat all of your systems as disposable.

  • The idea of unreliability: treating everything as reproducible and assuming the system is unreliable.

  • The idea of reducing variability, you try not to have too many different versions of things running. You try to keep everything kind of minimized.

  • One anti-pattern is using these configuration management tools as Infrastructure as Code tools, kind of ad-hoc, to make a particular change.

  • I think the reason that people end up going and making configuration changes by hand is because it’s lack of confidence that the tools are really going to work and that things might break.

  • Automation fear spiral: the more you’re afraid that your tool is going to break something, the more likely you are to do something by hand, which means that it’s more likely the next time that the automation will break it.

  • An important component of GitOps is Continuous Synchronization.

  • Continuous synchronization, the idea that your tool just runs in a loop and continuously looks at the version of the code that should be applied to your infrastructure and will reapply it again and again, even if nothing has changed.

  • The blast radius of a change, how much could you break with the particular change that you’re making right now?

  • Reduce the blast radius by shrinking the kind of scope of the pieces of your system.

On Infrastructure Pipeline

  • Make a pipeline for each piece of your infrastructure.

  • By having a pipeline, it forces you to keep that loosely coupled thing going with your system parts.

  • Having a pipeline forces you to think about what we did in the software world.

On Manually Editing Infrastructure State

  • Infrastructure surgery, it’s like you’re doing brain surgery on something and you’re at risk of killing the patient.

On Infrastructure as Code Repository Structure

  • I tend to start from thinking about what the applications are, the system structure, the infrastructure architecture that you want, and your team structures; you should kind of align those up.

  • Monorepo isn’t about just having all of your code in one repository. What it’s really about is how you build your code.

  • I think that the monorepo is really a build strategy rather than a code organization strategy necessarily.

  • What you tend to want to do is you’re trying to break your system apart into small pieces and to be able to test each of those pieces individually.

On Infrastructure as Code Testing

  • People talk about test driven development as not so much about the tests necessarily as about design, forcing you to have a better design. And it’s the same for infrastructure as for software.

  • With tests, you have to be very pragmatic and think about what’s the value of this test? What’s the risk I’m trying to address here?

  • I think at the declarative code level, there’s not so much value in test, particularly that unit test level.

On Infrastructure as Code Reverse Engineering

  • What I tend to recommend is essentially taking piece by piece of existing infrastructure and rebuilding it by writing the code, clean and well-structured. Because the reverse engineered code, I don’t think it’s necessarily going to be well-structured, well-factored, won’t include tests and all those kinds of things.

Kief’s 3 Tech Lead Wisdom

  1. Harness and trust the power of your team and the people around you. Don’t over-manage people and think that you need to think for them.
  2. Provide the right level of guidance and support to people.
  3. Be aware that different situations and needs might call for different approaches.

Transcript

Episode Introduction [00:01:32]

Henry Suryawirawan: Hello everyone. Welcome to another episode of the Tech Lead Journal with me, your host, Henry Suryawirawan. Tech Lead Journal is also now available as a YouTube channel. So for those of you who love listening to your podcasts on YouTube, please know that this podcast is also available for you to like and subscribe. And if you are enjoying and have been benefiting from this podcast so much, consider becoming a patron of the Tech Lead Journal by visiting techleadjournal.dev/patron. I will sincerely appreciate your support so much. And your valuable support will help me a lot to make the production of the upcoming episodes more sustainable and frequent. As a patron, you also get exclusive access to patron-only content, including direct personal access to me. So please, please, please pledge your support at techleadjournal.dev/patron.

So, I’m very excited to have Kief Morris for today’s episode. Kief is the Principal Cloud Technologist at ThoughtWorks and also the author of the O’Reilly Infrastructure as Code book, a book that I highly recommend if you’re into DevOps, cloud, and infrastructure automation. In this episode, I had an enjoyable, in-depth discussion with Kief regarding Infrastructure as Code. And I personally learned a lot from his massive experience and expertise in this area. We discussed things like Infrastructure as Code principles, patterns, anti-patterns, pipelines, Pet versus Cattle, and the latest new tools in Infrastructure as Code. We also discussed his upcoming second edition of the Infrastructure as Code book.

So without further ado, let’s jump right into the episode.

Introduction [00:03:16]

Henry Suryawirawan: Hi Kief. Thanks so much for joining me in this episode, in the Tech Lead Journal podcast.

Kief Morris: [00:03:21] Hi Henry. Thanks for having me on.

Career Journey [00:03:22]

Henry Suryawirawan: [00:03:22] Yeah. So, it’s a pleasure to meet you, because I read your book a few years back, “Infrastructure as Code”. I must say that is one of my favorite technical books. I read it end to end, and I got the opportunity to have the book signed by you as well a few years back. So before we go into Infrastructure as Code, I want to ask you about your career journey. How did you start your career? What were the major turning points in your career, and what led you to dealing with Infrastructure as Code and ending up writing the book?

Kief Morris: [00:03:52] Okay. So it depends on how far back we want to go. I mean, I started out not in IT, but I was doing it as a hobby. I was running a bulletin board and doing that kind of thing, and then decided I was enjoying that more than what I was doing. I was managing movie cinemas; I was in a movie cinema company. And so I decided to go back to university and got my Master’s degree in Computer Science. That was around the mid-nineties. The internet was a thing, but it was only just becoming a big thing in the U.S., at least, before it started growing bigger. So it was still quite new, still kind of a niche, but it was a lot of fun. And so I learned about Unix and I learned a bit of programming and all this kind of stuff. I got a job actually in my Computer Science department, in the systems administration team. So at the same time that I was taking classes and learning how to program and understanding computer science, I was also getting to actually work with our computer systems. And the vendors like Sun and HP, particularly the Unix vendors, would give us equipment. So we got to do quite a lot of interesting stuff and play around with lots of things. That was how I got started in infrastructure, in a way. And coding was a big part of it. It was scripting. The way I learned in that team was very old school. We compiled the applications to make them available to users. So we had to learn how to do that, which meant we sometimes had to learn how to debug code. It was, I think, a good introduction. I started writing Perl scripts and things like that to do systems administration tasks.

And after that, around 1998, I finished and I moved to London, and I got a job in industry. This was when the dot-com days were just starting. In London, it was a bit slower to pick up than in the U.S., but it then did pick up and became a very big thing. And so I worked in different companies over probably 10, 12 years. I worked for several different companies that were what I would call post-startup. So these were companies that were maybe 25 to 50 people when I joined, and then they grew to maybe a few hundred people. So it was very much about taking something that was already established, but maybe not very mature, and then building it out. I worked across roles in software development and systems administration. I might work in either role, because this was before DevOps, so before you could do both of those things. And I always found myself drawn to the other side of things. If I was a systems administrator, I’d be looking at: okay, how do we build and deploy applications? And if I was a software developer, I was looking at: well, how do we configure our environments so that things stop breaking in one environment when they work in a different environment? How do we make our environments more consistent? So I was always attracted to that kind of thing.

And then, it was probably around 2008 or so, I think, when DevOps became a thing and Infrastructure as Code became a thing. And the cloud was very, very early on becoming a thing; AWS started up. So all of these things kind of converged, and they’re all right in my interest area. I was like, oh, this is perfect. And then I came across the Continuous Delivery book. Actually, it was in early release, as we’ve since done with my books, where you can go onto what was called Safari, and is now called the O’Reilly Online Learning Platform or something like that, and get a pre-released draft of the book. So I found the Continuous Delivery book and I was like, wow, this is such exciting stuff for me. It was written by a couple of guys, Jez Humble and Dave Farley, who worked at ThoughtWorks. And so I said: hey, I want to go work with that company. I want to go work with the people who are doing this stuff. And so I joined ThoughtWorks. That was about 10 years ago now. I’ve been at ThoughtWorks ever since.

How Kief Got Into Writing Infrastructure as Code Book [00:07:24]

Henry Suryawirawan: [00:07:24] Right. It’s very interesting history. I didn’t know that you were actually interested in writing the book because of the Continuous Delivery book. Both books are really awesome, actually. So, yeah, it’s one of those must read books, I must say.

Kief Morris: [00:07:36] Yeah. It definitely inspired me. I don’t think at the time I had in mind that I would write a book, but it’s one of those inspiring things. And certainly, you know, I did decide to write a book that was one of the ones I looked at, as well as obviously Martin Fowler’s books for inspiration. So in terms of, you know, how the book kind of came about, it’s funny, because as I said, I’ve had all that interest in this area and writing “Infrastructure as Code” and using cloud and virtualization and everything. And working at ThoughtWorks, I would go and work with clients where they’re asking us to help them to do Continuous Delivery. And so we have to work with their infrastructure teams, their operation teams. Just say, okay, we need to have environments that are set up in a very consistent way and hey, here’s a cool thing you can do to make that easier for you. You could do Infrastructure as Code. And so I would be trying to kind of persuade people to do that.

And I always thought, oh, I was waiting for one of the experts, right? Because there are all these people who are the well-known experts in Infrastructure as Code: people like Andrew Clay Shafer and Patrick Debois, all these DevOps people. And I was expecting one of them would eventually write this book, which I could then give to the people I was working with, to explain: here’s how to manage your environments in a nice, consistent way so that you can do Continuous Delivery. And eventually, at some point, after I’d done some blog posts, an editor approached me and asked me if I would want to write a book on DevOps. And I said, I’ll tell you what, I know a book I would like to try to write. And so that’s what happened. Yeah, I did that.

Henry Suryawirawan: [00:08:50] Right, right. The rest is history, as they say. So during that time, what kind of tools actually existed that led to the creation of the modern tools? Back when you were doing Perl scripting, were there any tools that stand out from the past that led to the creation of the multiple modern tools we see these days?

Kief Morris: [00:09:12] I think, the first one that I really got a hold of. So I’ll tell you that my source of information for this stuff, before Infrastructure as Code was a thing, was a website called, I can’t remember, infrastructures.org or infrastructure.org, with or without an S. They are different sites, but it hasn’t been changed since, I think, 2007. So it’s interesting that just as all this was taking off, that site, whoever the maintainers were, didn’t really get carried on. But it talked about automatically provisioning servers and automatically configuring them. The two tools it talked about, that I started using when I read it, were: one was called FAI Install, which was for Debian Linux. I was using physical servers for this. You would take a server that was in a rack and boot it. Depending on which brand of hardware it was, you might hit the F12 button or the F2 button when it boots. And you would have configured a DHCP server so that when you do that and it boots, it would download a little installer and then install the OS. And the other tool was CFEngine, which is written by a guy called Mark Burgess, who’s probably, you know, the godfather, the grandfather of Infrastructure as Code, again before it was called that. And that was a server configuration tool. Puppet and Chef and Ansible all followed in the footsteps of CFEngine.

So that was kind of how I started doing this stuff. And yeah CFEngine was certainly one of the precursors to all this. And then, I discovered Puppet when that came out and I thought, Oh, this was really cool. And then later on, in a team that I was working with, one of my colleagues found Chef and said, Hey, let’s try using this and AWS, and it just kind of went from there.

Infrastructure as Code [00:10:35]

Henry Suryawirawan: [00:10:35] Right. So let’s start with what is Infrastructure as Code then?

Kief Morris: [00:10:39] Okay. Infrastructure as Code. So you can define it different ways; people define it different ways. I think it’s basically about the idea that the way you manage your infrastructure, you define it in files that you can manage and treat like source code. And so this is the idea that these things are external to the tool itself: you can put them in a source control system, you can use any text editor, you can use any kind of tools that manage and interact with text files to do the cool and interesting stuff. And then that enables you to take software engineering practices, things like Test Driven Development, Continuous Integration, Continuous Delivery, and start using these techniques for your infrastructure. So that’s my basic definition of it.

And now I think these days there’s a big kind of, I’m not sure if controversy is the right word, but there’s a big kind of movement around: well, should it be code or is it configuration? So I’m working on the second edition of the Infrastructure as Code book right now, and one of the topics I’m trying to articulate in there is: what are the different types of languages you can use, and when is each appropriate? Because the tools, starting with CFEngine, Puppet, Chef, and then the more cloud-oriented tools like Terraform and CloudFormation, they’re all declarative, right? Before that, we wrote scripts where you mixed in how to create the infrastructure with what you want. So you would have the kind of loops that create the thing, provision a server, and then you would loop on: is it created yet? Wait for it to be created. Was there an error? You had to do all that kind of logic around it. But these tools separated that out so that the tool handles the logic. You just say: I want my server to look like this, with this much RAM, this many CPUs, and so on. Just declare that. And now it’s interesting that we’re seeing an emergence of tools like Pulumi and AWS CDK, which are bringing in general purpose, real programming languages: you can use TypeScript, JavaScript, Python, whatever, within these tools. And that’s interesting, because in a way it looks like it’s going back to the old scripting of infrastructure, but I don’t think it really is. And the reason these tools are popular, I think, is that if you’ve worked much with these infrastructure tools, if you work with Terraform or Ansible, you’ll see people trying to turn them into procedural languages. You see the makers of the languages themselves start adding in things like: oh, we need to have conditionals and loops and things like that. And then the code gets just really terrible, because it’s kind of declarative and kind of not. It’s a big mess.

But my thinking on this is that it’s not so much that one type of language or the other is the right language to use for infrastructure code. I think it’s that there are different concerns, different things going on in an infrastructure code base. And some of them are more appropriate for a declarative language, and some are more appropriate for a more general purpose, procedural or object-oriented language. And I think, as a field, Infrastructure as Code is all still very new. I don’t think we’ve found the right forms yet to say: okay, here’s how to organize your code into different concerns, and use different tools and different languages for each. I think we’re still discovering that, which is quite exciting, really, to be doing this at this time and exploring and discovering as an industry.
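
The contrast Kief draws here can be sketched in a few lines of Python. This is purely illustrative; the provisioning "API" below is invented. The procedural style spells out how (create, then loop and wait), while the declarative style only states the desired result and leaves the create-or-update logic to a generic engine.

```python
# Invented in-memory "cloud API" standing in for a real provider.
_cloud = {}

def api_create(name, spec):
    _cloud[name] = dict(spec, status="running")

def api_get(name):
    return _cloud.get(name)

# Procedural style: the script owns the HOW -- create, poll, handle errors.
def provision_procedurally(name, ram_gb, cpus):
    if api_get(name) is None:
        api_create(name, {"ram_gb": ram_gb, "cpus": cpus})
    while api_get(name)["status"] != "running":
        pass  # a real script would sleep, time out, and check for errors

# Declarative style: the code states the WHAT; a generic engine owns the logic.
desired = {"app-server": {"ram_gb": 8, "cpus": 2}}

def apply(desired):
    """Create or update resources until they match the declaration."""
    for name, spec in desired.items():
        current = api_get(name)
        if current is None or any(current.get(k) != v for k, v in spec.items()):
            api_create(name, spec)

apply(desired)
```

Running `apply` twice is harmless, which is the property that tools like Terraform rely on; the procedural version has to re-derive that safety by hand in every script.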

Henry Suryawirawan: [00:13:35] Yeah. I find myself as well, like, every few months there are new things coming up in terms of Infrastructure as Code. So previously I used a lot of Ansible scripts, and over time moved over to Terraform, and then, with cloud native, Deployment Manager. And also coming up are things like Kubernetes Config Connector. And Kubernetes itself is also, in a way, a format of Infrastructure as Code, right? The YAMLs?

Kief Morris: [00:13:59] Yeah.

Henry Suryawirawan: [00:13:59] So in the book you mentioned Infrastructure as Code is like managing your infrastructure, just like, you’re treating a software and that’s why you apply software development best practices. So what kind of software development best practices that normally are applicable for Infrastructure as Code? Is it all of them or just some of them?

Kief Morris: [00:14:17] Well, I think it kind of depends. And it goes back a little bit to that point of what the different concerns in your code base are. Typically, the things I talk about as practices to bring in draw very much from the Agile software engineering practices: Test Driven Development, Continuous Integration, Continuous Delivery, Pair Programming, all those kinds of things. As well as, I think, some of the design principles of how to design good software, how to keep software loosely coupled, and so on. We want that for infrastructure, because what I’ve seen a fair bit of these days is people who’ve been doing this for a little while having very large code bases.

For example, I worked with a client in London who had a Terraform project where, when you run “terraform apply”, it could take two hours to finish. It was just this massive project that grew. And it was exactly like what we talk about with software monoliths. It was a monolithic project. So what we helped them to do was break that apart into smaller pieces, taking that almost microservices mentality of: how can we make it into smaller pieces that are loosely coupled and independently deployable, or appliable, as it were?
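
The decomposition Kief describes can be pictured with a toy sketch. The stack names and apply functions are made up; the point is only the shape: one monolithic apply touches everything on every change, while per-stack applies each have a small scope (and, later in the conversation, a small blast radius).

```python
# Hypothetical per-stack apply functions; in reality each would be its own
# Terraform project (or similar) with its own state and pipeline.
def apply_networking():
    return "networking applied"

def apply_database():
    return "database applied"

def apply_app_cluster():
    return "app cluster applied"

# Monolithic: any change re-applies everything -- slow, and risky to all of it.
def apply_monolith():
    return [apply_networking(), apply_database(), apply_app_cluster()]

# Decomposed: each loosely coupled piece is independently "appliable".
stacks = {
    "networking": apply_networking,
    "database": apply_database,
    "app-cluster": apply_app_cluster,
}

def apply_stack(name):
    return stacks[name]()
```

With the decomposed shape, a database change runs only `apply_stack("database")` instead of the whole two-hour monolith.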

Why Infrastructure as Code [00:15:18]

Henry Suryawirawan: [00:15:18] So for people who are probably new to this Infrastructure as Code, can you probably explain why should they even learn and do this Infrastructure as Code?

Kief Morris: [00:15:26] Sure. I think the value of Infrastructure as Code is it’s kind of in the reproducibility and in the consistency. So most of the environments, certainly that I’ve worked in have been ones where we’re doing software delivery. We have test environments, maybe staging and pre-production, all these kinds of environments. And so one of the kind of first things that infrastructure code can help you to do is to make sure that those are consistent, they’re built the same way at each point.

And then I think other things Infrastructure as Code can help with are around making sure that you can manage the amount of stuff you have. One of the things that you see as you move from a data center, where all of your servers are hardware and maybe virtual machines, is that they have a physical limit. There are only so many virtual machines you can cram onto the hardware that you have in your data center. But when you go onto the cloud, you can create so many more, and it can get out of control. So how do you manage all of those things? I think Infrastructure as Code gives you a very good way to manage that.

So rather than using maybe an automation tool, that’s like a GUI and then it has the configuration information inside in its internal data files. Having it as code in the source control that says, this is what my infrastructure should look like means that anybody can look at it. So new members of the team can join in and they can look at the infrastructure code and say, Oh, this is how we configure an application server, this is what our databases look like. Very easy to see that.

And then also the processes: okay, how do we provision a system? How do we deploy software? Being able to look at the code for that is quite helpful. And that helps as well with more controlled environments, where we have to worry about compliance, like in financial services and so on. Because that stuff can be auditable. You can show an auditor: look, here’s the code that defines what our system looks like. Here’s where we define our security rules. Here’s how changes are made, and here’s what changes have been made, because it’s in our source control history. And if you’re using something like a Continuous Delivery pipeline, you can say: here are the logs of every change that went through and was made to our system, who made it, and what tests we ran. So that becomes a very strong story for compliance and auditing as well.

Henry Suryawirawan: [00:17:21] Yeah. So what kind of problems arise if, let’s say, people do not adopt Infrastructure as Code? I mean, traditionally, going back years, we have seen that without Infrastructure as Code people can still build and manage their infrastructure. But what kind of problems normally exist if they stay on and do without Infrastructure as Code?

Kief Morris: [00:17:40] I think without Infrastructure as Code, what you often find is that, especially for a system of any significant size, you end up with lots of different things and lots of different versions of things running. If you’re looking at patch levels and so on, you might have all kinds of different versions of, say, Java running across your different systems. And then when a security vulnerability is published, how do you roll out the patches? With Infrastructure as Code, it’s very simple to roll out patches and keep everything at a consistent level. Without that, you tend to have a little bit more of a chaotic environment, and then it can be difficult when you’re deploying an application and it works in one environment and not another because there are differences. And it’s hard to understand what those differences are, because you maybe have to run some tools to try to analyze what versions of different things are on there. But with Infrastructure as Code, you’re not trying to reverse engineer or understand what ended up somehow onto each system. You’re actually saying: this is how the system is built, because it’s built from that code. So there is no difference.

Henry Suryawirawan: [00:18:35] Right. And also, I think, the term Snowflake servers. It’s one of the challenges.

Kief Morris: [00:18:39] Yeah, exactly. The Snowflake server. The Snowflake system is like the one that everybody’s afraid to touch or you can’t rebuild it. You’re like, Oh, nobody really knows how to rebuild it. We’ve tried to kind of move off of that and onto a new version of the operating system, but we still have to run the old operating system version that isn’t supported anymore because we can’t figure out how to rebuild it.

Pet vs Cattle [00:18:56]

Henry Suryawirawan: [00:18:56] So there’s this one term, very commonly mentioned when we talk about Infrastructure as Code, which is Pet versus Cattle. Can you explain to us, like, what is Pet versus Cattle?

Kief Morris: [00:19:06] Yeah, so a Pet is like a system that you really love. You’ve built it by hand very carefully. And I used to love this, right back before, even in the early days when I was using infrastructure code, like I mentioned, with hardware. We gave every server a special name. We would have a theme. At one place I worked, all of our Linux servers were named for planets. We didn’t have very many, right, but Jupiter and Saturn and Pluto and so on. We had some other servers that would be named for moons: Ganymede and Callisto and so on. And that was fun, right? We loved doing that. You would work on Saturn and make sure that it was configured nicely and everything.

And then it was somebody, I mentioned it in the book and I can’t remember the name of the person, somebody I think from CERN in Switzerland, who did a talk. At least this is where I learned it. It was a talk they gave around treating your servers like cattle rather than pets. Cattle are like, you don’t have an emotional attachment to them, because you have too many of them and you’re going to just kill them anyway and eat them or whatever. So the idea emerged that your servers and other parts of your infrastructure, you should be able to destroy very easily, and recreate and reproduce them. And we used to talk about Utility Computing, and it’s kind of like that. It’s not a specially hand-built thing. It’s more of an industrial era for infrastructure.

Infrastructure as Code Principles [00:20:15]

Henry Suryawirawan: [00:20:15] Right. So when we adopt this IaC, do we have a few principles in mind? For example, a few things that I can think of: it seems like Infrastructure as Code is supporting disposability, and you also mentioned a couple of times, reproducibility. So are there actually principles behind Infrastructure as Code?

Kief Morris: [00:20:33] Yeah. So some of the principles, as you mentioned, there’s the idea that you should treat all of your systems as disposable. And this helps with being able to kind of know that you can rebuild something. So it helps kind of with the disaster recovery and continuity to know if any part of my system is destroyed or compromised or whatever, I know I can just destroy it and build it again. That’s an easy thing to do and a common thing to do. So that’s one of the principles.

And that kind of underlines the idea of assumed unreliability. So originally, the idea of the cloud was that the hardware might not be that reliable, and it shouldn’t matter. You should be able to run reliable systems on top of it anyway. And if you have that mentality, when there’s a failure of a cloud, like when one of the regions in AWS has a failure, you see some organizations are down and other organizations are not. Netflix is the classic example. They’re really good at making sure that they carry on running even if different parts of AWS go down. And that’s because of that mentality: they treat everything as reproducible and assume the system is unreliable. That’s being able to reproduce everything, kind of on demand.

Then there’s consistency, and trying to reduce variability. I don’t think it’s something I necessarily put in the first edition of the book, but it’s the idea that you try not to have too many different versions of things running. You try to keep everything minimized. And this helps with things like, we saw this with the Heartbleed security vulnerability that got a lot of publicity, with OpenSSL, the library that’s installed on systems. People found that across their various servers, they had maybe six, seven different versions of the OpenSSL library installed. And that makes it difficult to keep everything a) consistent, and also to make sure that everything is patched to the latest version. So I think that’s another important principle.

Infrastructure as Code Patterns [00:22:04]

Henry Suryawirawan: [00:22:04] Right. When we treat Infrastructure as Code just like any software development practice, are there any sort of patterns? Like in the software world, we have design patterns or some kind of architectural patterns. Are there any patterns in IaC?

Kief Morris: [00:22:19] Yeah, there definitely are. So in both editions of the book, I use patterns for various parts, for the various concerns. I’ve got a few patterns around how to provision servers, for instance. So there’s the Push versus the Pull idea.

So the Push idea is that you have a tool, and Ansible tends to be designed to work this way, although you can use the different tools in different ways. The idea is that you spin up a new server in the cloud, or a virtual machine or what have you. And then you have a process sitting outside that connects in, maybe logs into the machine by SSH, and then configures it. So that’s a Push way to provision and configure a machine.

And then the Pull pattern is basically where, when the machine boots up, it has something already pre-installed on it that will trigger and download the latest configuration files and apply them to itself. So it basically pulls its configuration down onto itself. And as with any design patterns in software engineering, there are advantages and disadvantages to each, and places where each is more appropriate.

A lot of people like the Push pattern because you don’t have to pre-install anything on your virtual machines. As long as SSH is configured and you can connect in, Ansible can then go and configure that machine regardless of how it was before.

Whereas the Pull pattern tends to work well with servers that are provisioned automatically, say by auto scaling, because the platform decides to spin up a new virtual machine on its own. So it’s useful that the machine is able to configure itself, rather than needing something else to know, “Oh, it’s time to go and find this new machine and configure it.”
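To make the two patterns concrete, here is a rough sketch of what each might look like with Ansible (the inventory, playbook, and repository names are hypothetical, and these are command fragments, not a complete setup). The push variant is driven from outside over SSH; the pull variant is something you would bake into the machine image so it runs at boot:

```shell
# Push: a controller machine connects in over SSH and applies the
# playbook to the servers listed in the inventory.
ansible-playbook -i inventory.ini appserver.yml

# Pull: pre-installed on the image and triggered at boot (e.g. from
# cloud-init); the server fetches the latest playbook from version
# control and applies it to itself.
ansible-pull --url https://git.example.com/infra-code.git appserver.yml
```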

Henry Suryawirawan: [00:23:46] Right. From my personal experience doing Infrastructure as Code, it often happens that people just learn the tools. For example, the most popular one obviously is Terraform. They learn it, they apply it, and the infrastructure is set up nicely. But then over time, they go back to making changes through the GUI console, or maybe they even SSH into the servers. So what are some tips to avoid these kinds of practices?

Kief Morris: [00:24:13] Yeah, that’s quite common, and it’s certainly the way I tended to start out: using the automation to build things, but not so much to change them afterwards. And what tends to happen is you’ll create a few different application servers, and they’ve all got Java, maybe they’re all running Tomcat. But then one of them is getting higher traffic and its configuration needs to be tweaked to handle the traffic level. So maybe you write your Ansible code to go and make that configuration change to that one application server, but you don’t run it on the other application servers. And then later on you say, “Oh, I need to do an upgrade to Tomcat,” so you write your Ansible playbook and try to use it to upgrade Tomcat, and it breaks on some of the servers, because you’ve made different changes to them in the meantime. So I think the anti-pattern here is using these configuration management tools, these Infrastructure as Code tools, ad hoc, to make one particular change.

And one of the stronger patterns that we see, to make it possible to manage changes to your infrastructure more reliably and make the infrastructure code the easiest way to do that, is the Continuous Synchronization pattern. This is the idea that your tool runs in a loop and continuously looks at the version of the code that should be applied to your infrastructure, and reapplies it again and again, even if nothing has changed. Tools like Puppet and Chef were designed to do this: they have an agent that runs on the server and polls maybe every hour or so, and makes sure that it has the latest version of the code. If the code has changed, it gets the latest version; if it hasn’t changed, it just reapplies the existing version. And that makes sure that everything is pulled to the same version all of the time.

When you do go and make a change, and you say, “I want to change this one server to optimize it for high levels of traffic,” it forces you to figure out how to do that within the code. Maybe I’m going to make two different classes of application server now, one for high traffic and one for low traffic. Or maybe I make the configuration change even on the other servers: all of the servers will now be tuned for high performance, even if they don’t necessarily need it. And so that way it’s all built in and all consistent.
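The heart of the pattern is just a convergence loop: fetch the currently released definition, reapply it whether or not anything changed, and let any manual drift be overwritten. Here is a toy Python sketch of that idea; the fetch and apply steps are stand-ins for what a real agent like Puppet or Chef does, and all names are invented for illustration:

```python
import time

def fetch_desired_config(repo):
    """Stand-in for pulling the currently released code from version control."""
    return repo["released"]

def apply_config(server, desired):
    """Reapply the desired state unconditionally, overwriting any drift."""
    server.clear()
    server.update(desired)

def synchronization_loop(repo, server, iterations, interval=0):
    """A real agent would loop forever, polling e.g. every hour."""
    for _ in range(iterations):
        desired = fetch_desired_config(repo)
        apply_config(server, desired)
        time.sleep(interval)

# A hand-made tweak gets reverted on the next sync cycle:
repo = {"released": {"tomcat_heap": "512m"}}
server = {"tomcat_heap": "2g"}          # drift introduced by hand
synchronization_loop(repo, server, iterations=1)
print(server["tomcat_heap"])            # -> 512m
```

The point the sketch makes is the same one Kief makes: because the loop wins, the only durable way to change a server is to change the code it is synchronized from.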

Because I think the reason that people end up making configuration changes by hand is a lack of confidence that the tools are really going to work, and a fear that things might break. It’s what I call the automation fear spiral: the more you’re afraid that your tool is going to break something, the more likely you are to do something by hand, which means it’s more likely that the next time, the automation will break it. And so it just goes downhill from there.

And this is also a principle of GitOps, actually, which is one of the schools of thought around Infrastructure as Code, and tends to be applied more in the Kubernetes world. But this is a key part of it that people often talk about less. People talk about GitOps as, “Oh, I’m using git branches to manage the different versions of my infrastructure code for my different environments,” and they think, “Okay, I’m doing GitOps.” But actually, an important component of GitOps is that Continuous Synchronization. Whether it’s a tool like the Weaveworks tool, which supports this way of working, or something else, there’s something there that is continuously checking the code against the reality and making sure that they match up. So that’s an important thing to do, I think.

Automation Fear [00:27:06]

Henry Suryawirawan: [00:27:06] Yeah, you mentioned something around automation fear. There’s obviously, for example, the case you mentioned of a customer where “terraform apply” runs for like two hours. And as your Infrastructure as Code gets bigger and bigger, there’s this tendency for fear to come in. For example, with one line of code change, sometimes you don’t know: it might affect other things, it might destroy the existing infrastructure and recreate it somehow.

So obviously this is one thing that is not so easy and not so straightforward, especially for people who are new and still learning the tools. I myself have seen it several times: I made some changes and, “Oh, it’s going to destroy this and then recreate this.” So obviously there’s a lack of confidence on my part as well when I do that. So how do you overcome that kind of challenge?

Kief Morris: [00:27:45] There are a number of things you can do for that. One, as we talked about in that case of the two-hour Terraform project, is breaking it down into smaller pieces, so that when you’re running your “terraform apply”, it’s only running against a smaller subset of the whole. And that gives you a little bit of confidence. You’re still kind of scared, because you can still break that thing, and maybe it can have ripple-on effects. But it’s a little bit better. It’s that blast radius. People talk about the blast radius of a change: how much could you break with the particular change that you’re making right now? So one thing is reducing the blast radius by shrinking the scope of the pieces of your system.
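As an illustration (the directory names here are invented), splitting one big Terraform project into several smaller ones, each with its own state file, means each “terraform apply” has a much smaller blast radius:

```text
infra/
  networking/     # VPC, subnets, routes - applied and versioned on its own
  database/       # database instances and backups
  application/    # app servers, load balancer, security groups
```

Each directory is a separate project with its own state, so an apply in application/ cannot modify the networking resources directly.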

Another thing, and this is the reason I talk a lot about automated tests and the pipeline for Infrastructure as Code, which is not really a mainstream thing yet; not many teams are really doing this in anger, for a couple of reasons we can talk about. But on the idea of why to do it: one of the conversations I have with teams who are trying to break apart their infrastructure is, make a pipeline for each piece of your infrastructure. And that does a couple of things. One is, obviously, you can run some automated tests on your Terraform so that you have some confidence: okay, we spun up this subset of the system that this Terraform project manages, we ran some automated tests against it, and they passed. So we have more confidence that we can now apply this to production with less chance of it failing.

The other thing that having this pipeline does is it forces you to keep the parts of your system loosely coupled. What will often happen if you break your system into multiple parts but don’t have the pipeline is: okay, we’re going to make a change to our application server Terraform code. Well, in order to test that, we might have to create our networking stack and our database stack and all these other parts. So again, like that two-hour Terraform project I mentioned: in order to test one part of it, we might still have to run two hours’ worth of Terraform. Even though it’s now a whole bunch of different little projects rather than one big project, we still have to create all of the infrastructure that’s the prerequisite for the one piece that we’re actually working on changing.

And so having a pipeline forces you to think about what we did in the software world with unit tests. We design the components of our software so that we can create an instance of each one by itself. To run a unit test, you shouldn’t need to spin up and connect to a database. That was one of the things people struggled with in unit testing software: “Oh, we have too much other stuff that we have to create in order to run this one unit test on one component.”

And so the answer was to force yourself to do better design and say: okay, how can we refactor this component so that it does the same thing, but doesn’t need all those dependencies? Maybe we can do some dependency injection or other techniques. We can use mocks and fakes and things like that to test just the piece that we’re working on, in isolation.

And so this kind of stuff comes together: to break things into small pieces, you need to design them to be loosely coupled, and then the pipeline forces you to prove that that design works. I make a change to my infrastructure code for my application server, and I push that into the pipeline. If I’ve made a hardcoded dependency on some other component, it will fail in the pipeline, because the pipeline is where I’m testing it by itself. And so then you have to go back: okay, how can I redesign my change so that it stays loosely coupled?

So people talk about test-driven development as being not so much about the tests, necessarily, as about design: forcing you to have a better design. And the same is true for infrastructure as for software.

Refactoring Infrastructure Code [00:30:49]

Henry Suryawirawan: [00:30:49] Right. So before we go deep into the pipeline design and things like that, there’s one thing that I want to ask as well, because you just mentioned it: refactoring. How do you actually refactor Infrastructure as Code? Because it involves the state of what has been created. How do you actually refactor that?

Kief Morris: [00:31:05] Yeah, that’s a really good question. And it’s one of the things that I’m adding into the second edition of the book as a topic, because it is one of the big things we run into.

As we mature and do more Infrastructure as Code, we’re finding more of what the hard parts are, and this is one of them. So there are a couple of things, and again, I think it comes from looking at the software world. There are a few techniques, things like blue-green deployments and feature flags, that say: I’m going to make a change to my infrastructure, and I can push it through into production, but it isn’t necessarily live. So maybe I’m doing a dark launch or a canary release, or I’m using a feature flag or whatever. And then also, if I’m going to change the infrastructure for my application server, maybe I’m not going to just apply Terraform to it in place. Maybe I’m going to create a new instance of that infrastructure, bring up a new application server, and test that before I flip over to it. So those are some of the options you have.

And then you mentioned state, because this is a Terraform thing in particular, and it matters especially when you’re refactoring, when you’re pulling things apart. The example I use is where you’ve got a Terraform networking project on AWS that creates a VPC and some subnets and so on. And then you’ve got other projects which create application servers that are deployed into those subnets. And then you say, “Oh, actually, the way we created our subnets in the first place is not quite right. We need to change their size,” or whatever. But that can really break things, right?

So one of the things I’ve seen people do is basically use the Expand and Contract pattern from database schema changes. If you’re not familiar with it, it’s worth looking up. It’s where you use a tool like, say, Flyway or DBDeploy, which uses scripts to make changes to a database table structure, and the question is how you do that in a way that is safer. And we can do the same thing for infrastructure, with that networking example.

Let me first say what I think is probably not the right way to do it, although it’s kind of common, which is to say: I’m going to edit the state files. You can use the Terraform command or maybe other tools, or you could even use a text editor, because it’s just a JSON file that you can go and edit. But that scares me. I’ve been calling that “infrastructure surgery”. It’s like you’re doing brain surgery on something, and you’re at risk of killing the patient. So it’s very scary, especially in an emergency situation.

So the Expand and Contract pattern, I would say, is actually a series of commits. It’s not a single change. What we’ll do is first make a commit that adds new subnets in the structure we need, maybe with a bigger address range. We add the new subnets, but we keep the existing subnets, and the servers stay in the existing subnets. Then we can push through changes to things like those servers, to move them into the new subnets. And that’s a more tractable change. In some cases, depending on what you’re actually changing, you might be able to do it without interrupting any kind of service. In some cases it might mean, okay, that server is going to get rebuilt. But you can manage it one by one, rather than having to do everything at once.

You can say: first we’ll change this one server, which is less scary, maybe it’s not public facing. We’ll change that to make sure: okay, yeah, that worked. Now we’ll go on to the scarier ones, one by one. And you do each of those changes by pushing the infrastructure code change through your pipeline with tests, so that you’re more comfortable that, yes, this is going to be okay. And then after you’ve made all those changes and everything has moved over into the new subnets, you can push the change that deletes the old subnets, to clean them out. That’s the contract: you’ve expanded by creating new subnets, moved everything over, and then contracted by removing the old subnets. So that’s a pattern I’ve seen people using with Infrastructure as Code that can work quite nicely.
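Sketched as trimmed-down Terraform-style pseudocode (resource names and address ranges are hypothetical, and required attributes are omitted for brevity), the series of commits looks something like this:

```hcl
# Commit 1 (expand): add the new subnet alongside the old one.
resource "aws_subnet" "app_old" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/26"    # the range that turned out too small
}

resource "aws_subnet" "app_new" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.2.0/24"    # the range we actually need
}

# Commits 2..n: move servers over one at a time, least scary first,
# each pushed through the pipeline separately.
resource "aws_instance" "app_1" {
  subnet_id = aws_subnet.app_new.id    # was: aws_subnet.app_old.id
}

# Final commit (contract): once nothing references aws_subnet.app_old,
# delete its block, and the next apply removes it.
```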

Henry Suryawirawan: [00:34:20] Right. So another common case of refactoring which I sometimes find difficult, again based on Terraform, is when you create something and then realize, “Oh, we can create a module that can be reusable for other use cases.” And that’s when you actually create this abstraction, where you need to add the module name plus dot something. And that ends up requiring changes to the state using some commands. Sometimes it can be very scary. So for that kind of situation, I still could not find a way to solve it. Do you have some experience around that?

Kief Morris: [00:34:48] There are a few things here. I don’t know if this relates exactly to what you’re describing, but are you talking about where you have Terraform projects that are getting their dependencies from other Terraform state files and other projects?

Henry Suryawirawan: [00:34:59] So let’s say we have pure Terraform, just without modules, and then we see a pattern emerge, like, “Oh, we can actually create a module out of these few resources.” So we extract that out as a module. But when you have already applied that code to an environment, the state doesn’t recognize the idea of the module.

And then after that, when you make that change and introduce that module, you kind of need to apply that to the state, introducing the module’s name as part of the resource address.

Kief Morris: [00:35:26] I think that might be an example of where, depending on the specific situation, you might be able to use Expand and Contract. You introduce the module into your project and create the thing, but you don’t necessarily move the old thing into the module right away. And then you make a series of code changes, which you apply one by one, that move things over to using the module.

So possibly that will help. I mean, it may also end up that you need to make those state file changes. It’s one of those things that, like I said, is scary, and we’d like to avoid it if we can at all, but sometimes you end up having to do it. I think you can potentially script those things. So think about it again like database deltas, where you say: okay, I’m going to have a script that makes that state file edit. But it’s going to be a script, and I’m going to first run it in a test environment, and then run a little test that says, did it work? And then push the change on to the next environment. So it’s a hands-off thing. At least you’re not doing the surgery by hand.
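For the module-extraction case specifically, Terraform’s own CLI can move an existing resource under a module address without destroying it. The point is to capture that in a script and push it through the test environment first, rather than typing it by hand against production. The resource and module names below are hypothetical:

```shell
#!/bin/sh
set -e

# Move the existing resource under the new module's address, so the
# next plan doesn't propose destroying and recreating it.
terraform state mv aws_instance.app module.app_server.aws_instance.app

# "Did it work?" check: -detailed-exitcode exits 0 only when the plan
# is empty, so any unexpected pending change fails this script.
terraform plan -detailed-exitcode
```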

Infrastructure as Code Pipeline [00:36:17]

Henry Suryawirawan: [00:36:17] Right, right. So yeah, let’s go into the design of this pipeline. Obviously we have application code, and now we have infrastructure code. How do we structure our repository? That’s the first thing. Do you put them all together in one repository, or do you separate them? How do you actually design the CI/CD pipeline? Are they separate pipelines that converge at some point in time? Because there are different versions of the infrastructure code, and there are versions of the application code as well.

And then lastly, about the testing. How do you apply testing in between? Like, you have a set of infrastructure code, you have a set of application code, and they are merged into one. How do you apply the testing on that?

Kief Morris: [00:36:53] Okay, good. It’s a good question. So starting with the code base design, I think this is one of those things where there are patterns, not necessarily a right or wrong answer, or a best practice, or what have you. I tend to think in terms of Conway’s Law: your system design is going to reflect your team design, or they should align. So I tend to start from thinking about the applications, the system structure, the infrastructure architecture that you want, and your team structures, and aligning those. And I think the code base design will tend to come out of that. You don’t really want to have a code base where multiple different teams are all making changes, unless you’ve got good tooling around that to manage it. So you have to think about how you manage changes to that code.

So a typical thing to do is to have separate code repositories, maybe based on teams. And then there’s the question: should my infrastructure code and my application code be in the same repository? Well, is the same team managing both of those? They might be. You might say an application team owns its own infrastructure, or the parts of the infrastructure that are relevant to the application. I think that’s often a good way to look at it. And then maybe other teams manage infrastructure that’s more shared, and you can organize your repositories that way.

So that’s a pretty common pattern. Then there’s the monorepo pattern, where everything is in one big repository, and famously, some companies like Google and Facebook use this. I was writing about this in the second edition, trying to define the monorepo. And one of the things that occurred to me as I was talking to some people about it was that a monorepo isn’t really about just having all of your code in one repository. What it’s really about is how you build your code. Facebook and Google use tools where you just run your build, and it runs the build across the entire repository, all the projects that are in it. Obviously, these are very intelligent tools that work out: okay, you only changed a very small part of the code in the repository, not everything, so it doesn’t rebuild everything from scratch every time. And similarly, it runs tests based on what’s actually changed. And that’s the key thing.

I’ve seen people put all of their code into a single repository where each of the projects was still separate: they would still compile and build as separate builds with separate pipelines. And I don’t think that’s a true monorepo, because the monorepo thing is really a build strategy rather than a code organization strategy.

So then, you asked about pipelines, and the pipelines map on from that. It’s really about the different pieces that you have and how they come together. There are a couple of things happening as you go from left to right, from the start of the pipeline to the later stages. The classic thing is that the early stages run the things that run fast: fast tests, like unit tests and so on. And I think that’s true. And then the slower tests and the broader-scoped tests, like journey tests that go across the user interface of a fully integrated application, tend to run later in the pipeline.

And then I think another dimension of that is the components. What you tend to want to do is break your system apart into small pieces and be able to test each of those pieces individually. So early pipeline stages will tend to first build a component on its own, test it and make sure it’s good, and then promote it along. And then later pipeline stages might aggregate some of those components together. And the components can be various things. They can be application code libraries, Java libraries for instance: I’m going to test my Java library first with unit tests and so on, and then I’m going to test an application that builds a WAR file by pulling different JAR files together.

Now, similarly with infrastructure: say I’ve got some code which is going to install Java and Tomcat onto a server, and it’s an Ansible playbook. You probably want a pipeline stage which just tests that playbook on its own. And it can do that very quickly, because it doesn’t need to create a server in the cloud. It can run in something like a Docker instance, where it takes a simple Linux image, runs the playbook, and then tests: did it install Java and Tomcat correctly to do what I want?

Before you then progress that on, say: okay, now I’ve got an Ansible role for an application server. And that takes in not just Java and Tomcat, but also puts on the monitoring agent and sets up user accounts and whatever else we want for an application server. So now I’m going to test all those playbooks together and see: does that create a server the way I would want? And test at that level. And then you might have the code, whether it’s Terraform or Ansible or whatever, which creates the infrastructure for the application server. It’s going to provision a virtual machine or an auto scaling group, the load balancer, some networking routes, and a security group that opens ports to it, and all those kinds of things around the application server. So I’ll provision that on the cloud, bring in the other things, and then test those together. So you can see that each stage is aggregating piece by piece, putting the pieces together step by step, and then testing that they work together: integration testing in an incremental way.
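A minimal version of that fast, early stage might just run the playbook against a throwaway container. The image, container, and playbook names here are illustrative, and tools like Molecule wrap this pattern up more completely:

```shell
# Start a throwaway container as a stand-in for a server.
docker run -d --name playbook-test ubuntu:22.04 sleep infinity

# Run just the Java/Tomcat playbook against it, using Ansible's
# Docker connection plugin instead of SSH.
ansible-playbook -c community.docker.docker -i playbook-test, tomcat.yml

# Assert on the outcome we care about: is Java actually installed?
docker exec playbook-test java -version

docker rm -f playbook-test
```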

So that’s the classic thing to do, that fan-in: we’re going to bring all these pieces together, and once we’ve tested them all together, we might progress them to a later stage. Maybe after doing the automated tests, we make it available to human QA for exploratory testing, or what have you, or UAT, so users are going to look at it, and then we approve it to go on into production.

That doesn’t necessarily work very well at very large scale, though. If you’ve got dozens of teams, much less hundreds of teams, all building stuff, having all their work come together to run a massive integration test across everything becomes unwieldy and intractable.

So what you can do there instead, and this draws from the microservices way of thinking, is to say: actually, we can push different parts of our system straight into production. We might do some contract testing, or we might test against instances of other parts of the system to make sure it’s okay, but we’re not going to try to couple the releases.

As an example, let’s say you have a team working on shared networking infrastructure, those VPCs and subnets and so on, and that’s a Terraform project. And you’ve got another team which is building application servers and the infrastructure for the applications. You might say that the networking team can go ahead and push changes to the Terraform code for the networking through the pipeline into production on their own. And they don’t have to make sure that they can run a comprehensive test across everything in the organization. And the application server team can push their applications through with any changes to their infrastructure, but it’s decoupled. That just means you need to do a little more thinking about the contracts between those different pieces and how to test them, so that you’re not breaking anything: I’m not making a change to my networking that’s going to break other people.

And I think, again, this comes down to good design. It forces you to think about: okay, what are people depending on in my networking infrastructure? Rather than people just hardcoding dependencies to wherever they want, it’s like: no, no, I’m going to publish the subnets, maybe to a registry or somewhere. These are the subnets that I create; you make sure that you use those. And then I can change anything else, as long as I keep those subnet IDs consistent.
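One way to make that contract explicit, sketched here in Terraform (the resource names and parameter path are hypothetical, and SSM Parameter Store is just one possible registry), is for the networking stack to publish its subnet IDs and for consumers to read only that published interface:

```hcl
# In the networking stack: publish only the contract, the subnet ID.
resource "aws_ssm_parameter" "app_subnet_id" {
  name  = "/networking/app-subnet-id"
  type  = "String"
  value = aws_subnet.app.id
}

# In the application stack: consume only the published parameter,
# never the networking stack's internals or its state file.
data "aws_ssm_parameter" "app_subnet_id" {
  name = "/networking/app-subnet-id"
}

resource "aws_instance" "app" {
  subnet_id = data.aws_ssm_parameter.app_subnet_id.value
}
```

The networking team is then free to restructure anything behind that parameter, as long as the published subnet ID stays valid.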

Henry Suryawirawan: [00:43:21] Right. When you talk about breaking out these components, the different parts of the whole giant system: you mentioned something in the book called a stack. Is this referring to that kind of pattern?

Kief Morris: [00:43:31] Yeah. So the stack, this is a term that I’ve used because the tool vendors don’t tend to have consistent terms. With CloudFormation on AWS, they talk about a stack, and Pulumi also uses the same term. Terraform has a project, and I’m not sure Ansible necessarily has a term for it. But it’s basically the unit of stuff on, say, a cloud or virtualization platform, a unit of change. You’ve got a blob of code, and when you apply it, it defines the scope of what might change. With Terraform it’s a project, and there’s usually a state file associated with it. Like I said, with CloudFormation it’s a bit more concrete, because they actually use that term, stack, as well.

A big part of the second edition, I think, is that I’ve expanded a lot more on this: how to integrate between stacks nicely, patterns for designing stacks so that they’re loosely coupled, and where to break stacks apart. Because this has become a bigger thing. When I wrote the first edition of the book, most people were still focused on servers. Now this level of stuff is what we tend to work with a lot, if you’re not working with Kubernetes and the layer above that, even.

Infrastructure as Code Testing [00:44:30]

Henry Suryawirawan: [00:44:30] Right. We covered a lot of things about testing. So what are some of the tools people normally use to do this infrastructure testing? I don’t even know whether you can do unit tests. For example, how do you unit test Terraform?

Kief Morris: [00:44:42] It’s a hard one, right? And I think this is something where we’re still in a lot of churn in the industry about what types of tests are appropriate, useful, and valuable, and therefore what tools to use for them. So, are unit tests useful?

Well, one of the things I’m realizing from looking at the newer tools like Pulumi and CDK, which use regular languages, is that there are really a couple of different types of projects that you can think about.

So the typical one, again, is the stack, where you’re defining infrastructure. If you’re using something like Terraform, you tend to be defining the elements of the infrastructure at a pretty granular level. You’re saying I want a virtual machine, here are my subnets, here are my routes, gateways; all the individual pieces are right there, and you have the code that directly reflects those. And again, it goes back to that thing of it being declarative code. It tends to declare: I want these pieces to be in place.

And so declarative code tends to have a little bit less value for testing, in that, if your code says, make a subnet and give it this IP address range, what is your test going to say? It’s going to say, is there a subnet and does it have that IP address range? With tests, you have to be very pragmatic and think about: what’s the value of this test? What’s the risk I’m trying to address here? So when is that test of that subnet going to fail? Well, you could have a syntax error, which you can catch beforehand. A lot of the errors that you would catch are actually going to surface when you apply the tool. When you run “terraform apply”, it’s going to tell you, oh, you can’t create that subnet because the address range is not a valid address range. And so you never even get to run the tests.

So I think at the declarative code level, there’s not so much value in tests, particularly at that unit test level. It’s more when you aggregate and you say, okay, I’ve declared my subnet and I’ve declared gateways and routes and other things. Now it’s like, well, can I actually make a connection through all that stuff and get to where I need to get to? That’s where you want to test, because that tends to be what goes wrong. When you apply your Terraform code and you point your web browser at it, and it’s: connection refused. Well, where was it refused? Was it my subnet? Was it my security group? So that’s where you want to have the tests, I think: when you aggregate those things together. So that’s for declarative code.
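The aggregate-level check Kief describes, “can I actually make a connection through all that stuff”, can be as simple as a reachability probe run after the stack is applied. A minimal sketch in Python, not a tool he names; the host and port you probe would be whatever endpoint your stack actually exposes:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Illustrative post-apply smoke check: it exercises subnets, routes,
    gateways, and security groups together, which is where declarative
    infrastructure tends to actually break.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable networks.
        return False
```

You would run this against the load balancer or gateway address output by the stack, so a refused or timed-out connection fails the pipeline before a human ever points a browser at it.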

And then I think the interesting thing, when you look at using a tool like Pulumi or the CDK, is that you’re probably not using it for the same thing. And this was kind of a realization I had, where my first assumption when I saw these tools coming out was, oh, you’re just going to make the same kind of project you make for Terraform, but it’s going to be a different language to declare all those subnets and things and so on.

But I think actually those tools are probably more appropriate for reuse. And you mentioned the example of the Terraform projects where you make modules, because you identify there’s reusable code. I think often that’s a situation where you find that the code in those modules is the code where you want to say, I want to actually pass in some parameters and maybe do different things. Maybe I want to make a different number of subnets depending on the AWS region and how many availability zones are available there. You want to do more dynamic things. Or I want to assign security groups and this and that based on what you’ve created.

So that’s where logic comes in. And I think that’s where you see the really terrible Terraform code: loops and conditionals and stuff put into your HCL, which is essentially a declarative language, trying to make it procedural. It’s because you’re actually trying to do a different type of thing. And this is where you’d want to use a tool like Pulumi to make a library.

That’s where I think testing becomes more powerful, more useful. Because it’s like, okay, I’m making a library and it’s going to create subnets for me, depending on what parameters I pass in. Again, the number of availability zones in the region, or maybe the use case: is it a public facing thing, or is it an internally facing thing? There are various things where you might want to have some variations. So you have this code which can do different things, and that’s where you want some unit tests: okay, make an instance of my little Pulumi library for subnets, pass in different parameters, and see whether it does the right things. And that’s also the kind of thing where you can probably do more with mocks and offline approaches, so they really are proper unit tests that run quickly.
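To make that concrete, here is a hypothetical sketch of the kind of parameterized logic he’s describing, written as plain Python so it can be unit tested offline, with no cloud or mocks needed. A real Pulumi or CDK library would wrap this in actual resource declarations; the function name, parameters, and CIDR-splitting rule are all invented for illustration:

```python
import ipaddress


def plan_subnets(vpc_cidr: str, az_count: int, public: bool = False) -> list:
    """Split a VPC CIDR into one subnet per availability zone.

    Hypothetical example of library logic worth unit testing: az_count
    varies by region, and `public` toggles whether instances get public
    IPs. This is not a real Pulumi/CDK API.
    """
    network = ipaddress.ip_network(vpc_cidr)
    # Add enough prefix bits to carve out at least az_count sub-blocks.
    extra_bits = max(1, (az_count - 1).bit_length())
    blocks = list(network.subnets(prefixlen_diff=extra_bits))
    return [
        {"cidr": str(blocks[i]), "az_index": i, "map_public_ip": public}
        for i in range(az_count)
    ]
```

Because the function is pure, a test can simply call it with different parameters, the “pass in different parameters and see whether it does the right things” step, and assert on the returned plan.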

So I think that’s where my thinking is coming to: where these different things come into play and where the testing becomes more important.

Pulumi and CDK [00:48:29]

Henry Suryawirawan: [00:48:29] Right. You’ve mentioned Pulumi and CDK many times now. Are they the up and coming tools that we have to explore?

Kief Morris: [00:48:38] I think so. Yeah.

Henry Suryawirawan: [00:48:39] What are some of the good things from those tools?

Kief Morris: [00:48:42] To recap a little bit about what they do: basically, they let you use a regular programming language, an off the shelf programming language. They tend to support JavaScript, and TypeScript is quite popular, Python for some. I think maybe Java for the CDK, I’m not sure.

There are different languages that they support. And I think the value is when you’re writing reusable code, basically. It’s where you’re not saying, I’m writing the code for this one environment and I know what that environment should look like. It’s more, I’m writing code that different people might use, or that I might reuse across different environments which are different, and so it needs to have a bit more logic going on. So I think that’s a valuable thing about it.

The other valuable thing: you asked me what tools there are for tests, and I don’t think I specifically listed any tools, other than to say that there aren’t as many out there. There are tools, but the space is not as mature, which is one of the reasons why people don’t do a lot of testing of infrastructure, I think. But when you’re using one of these other tools, like Pulumi or CDK, and you ask, oh, what testing tools are available to write unit tests? Well, all the tools that you have for your language. If you’re running JavaScript, there are loads of tools, right?

And then similarly for packaging up your code, versioning it, and promoting it in an artifact repository: suddenly you have access to that full ecosystem. And the things you can do in an IDE with JavaScript code, to navigate and refactor your code and all of that, are just way beyond what you can do with a more infrastructure-focused language like HCL. Although there’s support out there for HCL and similar languages, it’s not nearly as wide as it is with the general purpose languages. So I think that’s a powerful thing, a useful thing.

Infrastructure as Code Anti-Pattern Example [00:50:07]

Henry Suryawirawan: [00:50:07] Right. Obviously in your career, you have dealt a lot with Infrastructure as Code. Can you probably share with us, what are some of the most horrendous anti-patterns or code smells in Infrastructure as Code?

Kief Morris: [00:50:19] Yeah, so I think big projects are one thing: the massive project which has grown and grown and grown over time. One of the other things that we see with Terraform projects and similar is with multiple environments. So, how do you define a dev environment, test environment, staging environment, production environment with Terraform?

And I think one thing that you’ll see people do at the beginning, a common thing when you’re first starting out, is to keep one project that has all the environments in it. And then you break the project and it breaks all your environments. Or you’re making a change to your test environment and you break production. And then you realize, well, that’s not a good thing to do, let’s separate them out.

And then the next thing that people do, which I still think is not quite right, and I describe as an anti-pattern, is where you have a separate project, a separate Terraform project for each environment, even though it’s meant to be the same system that you’re replicating.

So it’s like, oh, I’m going to make an infrastructure change in my test environment, make it there and apply it. And then I copy the code change over to my staging environment, then over to my production environment. And copy and pasting is never really a good thing to do with code. The whole idea of code is that it’s reusable and consistent, and so on.

So I much prefer having a single piece of infrastructure code that defines your environment, that you apply in one environment, test, and then push the same code to the next environment. That requires you to think about how you configure parameters, because in different environments you might need different numbers of servers in your load balancer pool, for instance. But there are ways to do that, and I think it’s important to do that.
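The pattern he prefers, one stack definition promoted unchanged from environment to environment, leaves only a small per-environment parameter set. In Terraform that would typically live in a .tfvars file per environment; here is a sketch of the same idea in Python, with invented environment names and values:

```python
# Hypothetical per-environment parameters for one shared stack definition.
# Only these values differ between environments; the infrastructure code
# itself is identical and promoted unchanged from test through production.
ENVIRONMENT_PARAMS = {
    "test":       {"pool_size": 1, "instance_type": "t3.small"},
    "staging":    {"pool_size": 2, "instance_type": "t3.medium"},
    "production": {"pool_size": 6, "instance_type": "m5.large"},
}


def params_for(environment: str) -> dict:
    """Look up parameters for an environment, failing loudly on typos."""
    try:
        return ENVIRONMENT_PARAMS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
```

The design point is that a change is made once, tested in one environment, and then the identical code is applied with the next environment’s parameters, rather than being copy-pasted between per-environment projects.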

2nd Edition of Infrastructure as Code Book [00:51:40]

Henry Suryawirawan: [00:51:40] Right. Throughout this conversation, you have mentioned a lot about second edition of your book. I personally didn’t know about that until I researched about you before this podcast. So can you tell us, what makes you write the second edition and what are the major changes that we should be looking forward to?

Kief Morris: [00:51:57] Yeah. So, things have moved on a bit. It’s been about four or five years now, I think. And it’s interesting: the focus has changed. The subtitle of the first edition of the book was something like managing servers in the cloud. So it was all about servers. And now we’re talking about stacks. I mentioned stacks, or an equivalent concept, in the first edition of the book, but it’s become much more of a thing. We’re spending a lot more time now on that level: the Terraform project, the block of stuff in the cloud that you’re managing, breaking it into different pieces, and all of that.

And so I think that’s become important. I think a lot of the topics talked about today are how to manage that infrastructure and keep it loosely coupled, in smaller pieces, and more manageable, so it’s less scary to change.

I’ve learned a lot since I wrote the book, right? I think as an industry, we’ve moved on in terms of what we know and what challenges we’re facing. And so I felt like that really needed to come in there.

So I think that’s probably the biggest change: a lot more focus on that level of stuff, and there’s a fair bit on those almost design and architecture type patterns for infrastructure stack type projects. How do you integrate different stacks?

I talk a little bit about codebase organization, the monorepo and those kinds of things. I talk a little bit about the refactoring stuff that we discussed. Because again, that’s one of those things that we’ve run into a lot more, or that I’ve seen a lot more, that needed covering.

So it’s a pretty substantial rewrite, I would say. One of the things I’d like to do, if I can, is find one of those tools that compares two texts to see how much commonality there is between them. I want to see how much of the book has changed. I would say it’s probably quite a lot.

Henry Suryawirawan: [00:53:21] Right. When can we expect to see the book being published?

Kief Morris: [00:53:24] It should be out around the end of the year, or maybe the beginning of next year. And you can get access to a draft of most of the book, I think, on O’Reilly’s Learning Platform, which was formerly known as Safari. So if you’re a member of that platform, if you have access to it, you can find it on there.

Reverse Engineer Infrastructure as Code [00:53:40]

Henry Suryawirawan: [00:53:40] So I’ll just ask one question which I forgot to ask. Throughout my experience as well, there are many people interested in Infrastructure as Code, but they have already had their infrastructure set up since the beginning. And some of them actually dream about having a reverse engineering tool, where, from the infrastructure that’s already set up, you can automatically generate the code. So, is there such a tool that you know of?

Kief Morris: [00:54:03] I think there are a couple of tools that can generate Terraform code. I don’t know off the top of my head what the tools are, but I know there are tools where you point them at the cloud and they’ll spit out code that you can then edit and so on.

I tend not to recommend that approach. Maybe do it as a learning exercise, to learn a little bit about your existing infrastructure. But what I tend to recommend is essentially taking existing infrastructure piece by piece and rebuilding it by writing the code, clean and well-structured. Because the reverse engineered code, I don’t think it’s necessarily going to be well-structured or well-factored, and it won’t include tests and all those kinds of things. So I recommend starting by taking one piece, writing the code for it, writing tests for it, writing a pipeline for it, putting it out there, and then using one of the kinds of things that we’ve talked about, like expand and contract or other refactoring patterns, to move things over to the newly built thing. And then just do that piece by piece and replace your infrastructure with new, clean stuff. The same way you do with application code, when you have a monolith and you’re trying to turn it into more of a microservices or just better factored system.

3 Tech Lead Wisdom [00:55:02]

Henry Suryawirawan: [00:55:02] Right. Before we end our conversation, Kief. As always, I ask my guests to share with us the three technical leadership wisdoms. Can you share with us, what are your technical leadership wisdoms?

Kief Morris: [00:55:14] Yeah, I think my biggest one is harnessing and trusting the power of your team and the people around you.

One of the things that we didn’t really talk about in my career journey is that, particularly before ThoughtWorks, I did manage teams: teams of systems administrators, or developers, or sometimes both. And one of the most powerful things I found was not thinking that I had to manage everything and decide everything and know everything that’s going on, but just trusting the people and being there to support them.

That’s like the number one thing. It’s one of those lessons that I have to keep relearning. I know it, and I learned it, and then I’ll be in a new situation where I’m not necessarily doing it. And then I realize, oh, wait, if I just trust the people and let them decide, then it works out much better. It’s much more scalable and less stressful as well.

The second thing, to support that, is that you really need to work out the right level of guidance and support to provide to people, and that can vary in different situations. And maybe I’ll cheat a little bit and call out my third one as well: be aware that the different situations you’re in might call for different approaches, depending on things like the skill level of the team. So if you’re coming into a team and it’s not really going well at the start, like they’re struggling, you might need to take a bit more of a hands-on approach and closer involvement: provide more direct support to people, understand what that is, sit down with them and work through things, or what have you. Whereas with a team that’s more experienced, where everything is flowing well, the guidance and support you need to provide is maybe more around prioritization, and maybe even around higher level stuff. Like, look, these are the challenges that we’re facing as an organization. So if I’m the one going into management team meetings, and maybe we’re focused on cost management, or maybe we’re not focused on cost management and we need more opportunities for growth, or whatever it is, you can take that back to your team and say, hey, these are the things that we’re struggling with and that we’re focused on as an organization. So with the decisions that you’re making, with the work that you do day by day, I don’t necessarily need to come in and help you make those decisions. But this is what you need to know that might help you with that.

So I think it’s about being there to support. I guess the one, two, three is: one, don’t over-manage people and think that you need to think for them, but do provide the right level of support so they know what they should be working on, and when they need help, be able to bring that to them. And the third is to tailor that to the situation and their needs.

Henry Suryawirawan: [00:57:23] Right. Where can people find you online?

Kief Morris: [00:57:26] I’m mostly on Twitter. It’s @kief, K-I-E-F. You can see me on there a lot. I do some stuff on LinkedIn, but Twitter is probably the best. I’ve got a couple of websites that I almost never post to. infrastructure-as-code.com, with hyphens between the words, is where I tend to post things about the book. I’ll probably start posting more as I finish up the book and get into going out there and talking about it to the world.

Henry Suryawirawan: [00:57:47] Right. So, yeah, I’m looking forward to the new book, and thank you so much for your time, Kief. It was nice talking to you and learning more about Infrastructure as Code.

Kief Morris: [00:57:54] Thanks a lot for having me, Henry. It’s been good fun.

– End –