Why Infrastructure as Code sucks
Let’s admit it: infrastructure as code sucks
I have spent most of my career wrestling with it, so admitting it isn't exactly easy or fun. But the truth is, even if it's the best we have, infrastructure as code feels stuck in time, with no real improvements being made. That would be fair if it were a mature industry, but somehow, it doesn't feel like one. So I would like to start a conversation with you and imagine a world where it sucks less. I have this strange belief that we could do better. And that we deserve better. In this article, I will try to break down the reasons why I think it is broken and explore how we can finally fix it.
Over the past 40 years, software development — and infrastructure alongside it — has gone through massive changes in its role in business. It shifted from being a cost center to a business engine, from fixing office paper jams to building distributed systems that deliver your favorite TV show. Yet the way we think about infrastructure hasn't changed much. And let me say it again: it sucks.
But to understand why, first we need to know…
A Little Bit of History
In the early days all we had were physical servers, managed by different sets of scripts, and our setups were fairly static. We used to build infrastructure to last, and we designed it to be as immovable as possible. Systems were expensive reliability wonders, and sysadmins spent most of their time dancing around the servers on the outside, trying to make improvements. Only then was some time spent writing incantations on the servers themselves, making sure what ran on a server matched its hardware capabilities. Even as the pets vs. cattle approach started gaining ground, the "infrastructure as code" part mainly touched internal server configuration, with some limited provisioning capabilities (like BOOTP and, later, PXE boot). Tools like CFEngine, Puppet, and Chef emerged to bring consistency, reconciling configurations either manually or through central management nodes. Configuration files, installed packages — these were the resources we were managing.
We started virtualising things. Servers became abstract. Then came the Cloud, and it took over the world – these days, a major part of internet traffic is handled by cloud providers. Of course, the stuff we used before didn't disappear, just like radio didn't vanish when you started streaming curated playlists. The Cloud was an evolution rather than a revolution. And that primed our minds to think about virtual infrastructure in the same way we think about physical infrastructure.
The Cloud was a huge shift in the infrastructure world. It promised us flexibility — elastic computing whenever you needed it. That's what the 'E' in EC2 stands for. I remember missing a project deadline just because servers arrived at the datacenter during a change freeze (!) — and suddenly, these problems went away. Resource elasticity wasn't the only benefit: the Cloud moved us from static server setups to dynamic ones, where infrastructure could finally align with business services and evolve at a similar pace. Having dedicated racks in the data center to handle your payments system became irrelevant.
The paradox is that this dynamism came bundled with another promise — immutable infrastructure. When we talk about immutable infrastructure, we usually mean immutable server configuration, a shift that took hold around the same time as the cloud and made the configuration management systems of old obsolete. A server with its configuration became an artifact, and, along with your cloud network, databases, queues and other things, an instance of it became a resource. And suddenly we stopped caring about the server itself, which freed us to do other, more interesting and more important things for the business. As a whole, the industry lost some knowledge and gained some understanding. Here's a question for you, if you started your career after 2015: do you work in the Cloud and still know the pros and cons of RAID systems? Does it even matter anymore?
To deal with that whole confusing world of Cloud infrastructure, another set of management tools was born: CloudFormation, Terraform, CDK, Pulumi. We started declaratively defining these resources, grouping them together by business needs and deploying them in groups. Everything was still fairly static, but it could now move around in a matter of hours. We inflated our titles, ditching the old and dirty sysadmin label to declare ourselves “DevOps” Engineers or Cloud Engineers — we put on white gloves. I know this sounds a bit harsh, because we did change how we thought about the systems — we stopped just administering servers and started applying engineering principles to what we did. But the fundamentals were still the same — we were managing resources, just a little differently and in slightly fancier environments.
And then Kubernetes happened. It won the race for yet another abstraction layer we placed on top of the Cloud, much like the Cloud won the race among virtualisation technologies. FreeBSD jails, LXC, Docker, Kubernetes, and that's pretty much where we are now. Another cycle of inflated titles: now we are Platform Engineers, which usually means we know some Kubernetes and YAML. I'm being harsh again, because of course there are many other things that came with this major shift and made our infrastructure much more dynamic. And we codify everything!
So why does it suck?
In the current state, we mostly manage resources using infrastructure as code. We love abstracting things away, jumping up a layer, and yet we are still managing resources. We still believe in building infrastructure to last, even though what we really should build is infrastructure designed to change. As we have layered our platforms, the different layers no longer move at the same pace. We still treat virtual resources as if they were physical servers and switches in the datacenter. And then, when the business changes, we spend yet another cycle on rewrites and migrations.
Our tools poorly mimic software development (even though most of the stuff we manage is software), and, at best, they can be called declarative scripts. We lack appropriate abstractions and ways to extend our code. Tools like Terraform require refreshing the state of every resource to perform even simple operations, which means that even in the simplest setups you have to fragment your infrastructure into different stacks. The need for smaller, more manageable infrastructure deployments spawned workarounds like Terragrunt, but these solutions added a new layer of complexity. Now we have to deal with keeping versions updated, managing inter-stack dependencies, and ensuring consistency.
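To make that cross-stack wiring concrete, here is a minimal sketch of the same pattern in Pulumi-flavoured TypeScript (the stack name and output names are made up for illustration): once the network lives in its own stack, every consumer has to import its outputs by reference and keep those references, versions, and deployment order in sync.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// The network was split into its own stack, so we have to reach across
// the stack boundary to get anything out of it.
// "acme/networking/prod" and the "privateSubnetIds" output are assumptions.
const network = new pulumi.StackReference("acme/networking/prod");
const subnetIds = network.getOutput("privateSubnetIds") as pulumi.Output<string[]>;

// The application stack now silently depends on the networking stack's
// output names, versions and deployment order.
const appServer = new aws.ec2.Instance("app-server", {
  ami: "ami-0123456789abcdef0", // placeholder AMI id
  instanceType: "t3.micro",
  subnetId: subnetIds.apply(ids => ids[0]),
});
```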
So we need IaC-tailored CI/CD platforms like Spacelift, and while these are great at streamlining our workflows, they still fall short of addressing the core issues. What's missing isn't a better stack but a way to rethink the entire structure—one that aligns with how systems actually operate and change. If that's true and all these layers are still not enough, can we rethink the approach?
What should a better solution look like?
I don't believe that we can (or even should) stop using everything we have and create a new paradigm for how we manage infrastructure. But I do think that we can extend our current tools to introduce proper abstractions in the right places, which would allow us to think about and manage infrastructure as systems that mirror business system boundaries. And I think Cloud Providers could do much more.
To be fair, there are attempts to come up with such principles — we have Kubernetes with continuous reconciliation, adopted by tools like Crossplane, and we even have some abstractions, like Deployment Stacks in Azure. But while they are movements in the right direction, I don't think they go far enough.
Where should we look for inspiration? If we call our setups code, we should probably explore how software development principles could help us. So let's borrow some ideas from object-oriented programming and see how we could move away from managing individual resources to managing resource collections as cohesive units. That would make our code more expressive and would let us better define the relationships between different parts of the business. Not what it is, but what it does.
For lack of better terminology, let's call this new abstract unit an Aggregate Resource, since at its most basic level it's just a grouping of resources. In that sense, it is similar to a Terraform module or a Composite Resource in Crossplane. Think of an Aggregate Resource as an Object in OOP, and an Aggregate Resource Artifact, or its definition, as a Class. You should be able to nest Aggregate Resources — ideally, your entire environment should fit into one top-level resource. We should allow teams to build higher-level abstractions from smaller components. Imagine a database abstraction that is part of a backend service abstraction, which in turn belongs to a larger application abstraction.
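To make that nesting concrete, here is a hypothetical sketch in TypeScript (the AggregateResource base class and every name in it are invented for illustration, not an existing API): the database aggregate is composed into the backend service aggregate, which is composed into the application aggregate.

```typescript
// Hypothetical base class: an Aggregate Resource definition is the "class",
// a deployed instance of it is the "object".
abstract class AggregateResource {
  constructor(public readonly name: string) {}
  // Child aggregates this aggregate owns and manages as one unit.
  protected children: AggregateResource[] = [];
}

class Database extends AggregateResource {} // e.g. a managed Postgres plus backups and alerts

class BackendService extends AggregateResource {
  constructor(name: string, public readonly db: Database) {
    super(name);
    this.children = [db]; // the database belongs to this service
  }
}

class Application extends AggregateResource {
  constructor(name: string, public readonly backend: BackendService) {
    super(name);
    this.children = [backend]; // the whole app is one top-level aggregate
  }
}

// One top-level resource describes the whole environment.
const app = new Application(
  "payments",
  new BackendService("payments-api", new Database("payments-db")),
);
```

A real definition would of course carry the actual cloud resources and their configuration; the point is only that a single nested value can describe the whole environment.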
Admin operations on Aggregate Resources must be atomic, immediately providing state and sync/out-of-sync information via an API, without querying individual resources. At the same time, it should not be a black box: you should be able to look into the resources inside the Aggregate Resource if needed. So far nothing too drastic, except maybe for the fact that it needs a runtime, and ideally that runtime should be the Cloud Provider's own.
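As a rough illustration, the management API could look something like this hypothetical TypeScript interface (every name here is an assumption, not an existing cloud API): status is reported for the aggregate as a whole, with drill-down available but not required.

```typescript
type SyncState = "InSync" | "OutOfSync" | "Progressing" | "Failed";

// Hypothetical control-plane API for an Aggregate Resource instance.
interface AggregateResourceApi {
  // One call answers "is this whole unit healthy and in sync?"
  // without the client enumerating and refreshing child resources.
  getStatus(aggregateId: string): Promise<{ state: SyncState; observedVersion: string }>;

  // The aggregate is not a black box: you can still drill down when needed.
  listChildren(aggregateId: string): Promise<{ id: string; type: string; state: SyncState }[]>;

  // Admin operations apply to the unit atomically.
  apply(aggregateId: string, artifactVersion: string): Promise<void>;
  destroy(aggregateId: string): Promise<void>;
}
```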
These units should have public and private resources, along with interfaces to override or extend their behavior. This injection is probably the biggest difference between how we manage resources today and the proposed solution. For example, there is no good reason why you cannot have a Kubernetes cluster defined inside the module and still be able to inject a custom implementation when needed. In theory, you could achieve something similar by using conditionals in Terraform, but they are not expressive enough to allow for much flexibility. Also, Terraform's reliance on a single DAG limits how you can use them, as all resources need to be computed in advance. It's one of the major reasons we couple our code tightly and tend to layer it rather than map it to business entities: this is our foundation, this is our network, these are our IAM boundaries, instead of thinking in terms of application scope.
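Continuing the hypothetical TypeScript sketch from above, injection could look like this: the aggregate ships with a default Kubernetes cluster implementation as a private resource, and a caller can swap in a custom one as long as it satisfies the interface (all types here are invented for illustration).

```typescript
// The contract the aggregate depends on, not a concrete implementation.
interface ClusterProvider {
  provision(name: string): Promise<{ endpoint: string }>;
}

// Default implementation bundled with the aggregate (details elided).
class DefaultManagedCluster implements ClusterProvider {
  async provision(name: string) {
    return { endpoint: `https://${name}.example.internal` }; // placeholder
  }
}

class BackendServiceAggregate {
  // The cluster is a private resource of the aggregate, but its
  // implementation can be injected from the outside.
  constructor(
    private readonly name: string,
    private readonly cluster: ClusterProvider = new DefaultManagedCluster(),
  ) {}

  async deploy() {
    const { endpoint } = await this.cluster.provision(`${this.name}-cluster`);
    // ...wire the rest of the private resources against `endpoint`...
    return endpoint;
  }
}

// A team with special requirements provides its own implementation.
class GpuCluster implements ClusterProvider {
  async provision(name: string) {
    return { endpoint: `https://${name}.gpu-pool.internal` }; // placeholder
  }
}

const standard = new BackendServiceAggregate("payments"); // takes the default
const custom = new BackendServiceAggregate("ml-inference", new GpuCluster()); // injects its own
```

Most teams would take the default; the team with unusual requirements injects its own implementation without forking the whole module.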
It's important to note that an Aggregate Resource's definition should also be treated as an artifact, complete with its own packaging and versioning. Storing this artifact in a repository lets us track differences between internal resources as well as version compatibility. In the cloud, we would deploy and instantiate this artifact, ensuring a consistent and repeatable infrastructure deployment process.
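For illustration only, the artifact metadata might look something like the following (the field names are assumptions, loosely modelled on how container images and Terraform modules are versioned today).

```typescript
// Hypothetical manifest stored alongside the packaged definition.
const artifactManifest = {
  name: "backend-service",
  version: "2.3.0",                            // semver for the definition itself
  digest: "sha256:<content-hash>",             // placeholder content hash of the package
  compatibleWith: { application: ">=1.4.0" },  // version constraints on parent aggregates
  resources: ["cluster", "database", "queue"], // internal resources this version declares
};
```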
An Aggregate Resource should be able to be instantiated within your network, and to query and reference internal network APIs if needed.
Currently, major cloud providers offer no good way to deliver real-time, event-driven resource state notifications, which makes implementing Aggregate Resources rather difficult and expensive. We would have to depend on polling child resources rather than subscribing to state change events and reacting continuously. Unfortunately, cloud tooling today is mostly centered around monitoring, logging, and auditing, not real-time updates.
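The difference matters for both cost and latency. Here is a small TypeScript sketch of the two models, where the subscription API in the second half is purely hypothetical since nothing like it exists as a first-class cloud primitive today.

```typescript
// What we are stuck with today: poll every child resource on an interval.
async function pollAggregateState(
  children: string[],
  describe: (id: string) => Promise<string>, // one provider API call per child
): Promise<"InSync" | "OutOfSync"> {
  // N API calls per tick, per aggregate, whether or not anything changed.
  const states = await Promise.all(children.map(describe));
  return states.every(s => s === "available") ? "InSync" : "OutOfSync";
}

// What an Aggregate Resource runtime would actually want: a push-based feed.
// No provider offers this, so this stub only illustrates the shape of the missing API.
type StateChange = { resourceId: string; newState: string };
function subscribeToStateChanges(
  resourceIds: string[],
  onChange: (event: StateChange) => void,
): () => void {
  // Stand-in behaviour: pretend every resource settles after a moment.
  const timer = setTimeout(
    () => resourceIds.forEach(id => onChange({ resourceId: id, newState: "available" })),
    1000,
  );
  return () => clearTimeout(timer); // unsubscribe handle
}

const unsubscribe = subscribeToStateChanges(["db-1", "queue-1"], e =>
  console.log(`reconcile ${e.resourceId}: now ${e.newState}`),
);
```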
Each Aggregate Resource should have its own identity in the Cloud and enforce hierarchical ownership over its child resources. Ideally, subresources managed by an Aggregate Resource should not be changed outside of this ownership. The hierarchy is needed to allow the flexibility of transferring ownership from one Aggregate Resource to another during refactors. However, only Azure provides universal resource locking; other Cloud Providers can only rely on deletion protection for specific resources and on IAM.
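As one last hypothetical extension of the earlier API sketch, hierarchical ownership and transfer might look roughly like this (again, none of it exists as a cloud primitive today).

```typescript
// Hypothetical ownership model: every resource records which aggregate owns it,
// plus the chain of owners up to the top-level aggregate.
interface OwnershipRecord {
  resourceId: string;
  ownerAggregateId: string; // direct owner
  ownerChain: string[];     // e.g. ["application/payments", "backend/payments-api"]
}

interface OwnershipApi {
  // Writes to a resource are rejected unless they come through its owner.
  assertOwnedWrite(resourceId: string, callerAggregateId: string): Promise<void>;

  // During a refactor, a child can be handed from one aggregate to another
  // without destroying and recreating it.
  transferOwnership(
    resourceId: string,
    fromAggregateId: string,
    toAggregateId: string,
  ): Promise<OwnershipRecord>;
}
```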
How is this improving anything?
This concept would unlock a new level of flexibility between infrastructure and business needs. This is true modularity, composability and extensibility. It would allow us to build truly dynamic infrastructure provisioning systems; use cases like SaaS platforms that need to spin up cloud resources for new tenants on demand would become trivial – we could integrate provisioning APIs directly into the applications themselves. We could build libraries of best practices and extensible modules for common architectures and expand on them. The hierarchical ownership model would simplify complex systems, reduce the risk of errors, and make migrations and rollouts much safer and more intuitive to handle. Infrastructure would stop being just a pile of resources; it would become an agile system that can adapt and change as the business evolves. So why should we stick with the current tools?
This is where I’ll stop for now. I’ve laid out what’s broken and hinted at what might fix it. But the real challenge is in the details — how to make Aggregate Resources work, what they could mean for your infrastructure, and whether we’re ready to rethink the entire foundation.
These ideas are far from complete, but maybe that's the point. I'll dive deeper into the mechanics and challenges in the next article. And if you think these ideas are worth something, please do let me know on X (@geriBatai) or BlueSky (geribatai.bsky.social). I would love to hear your thoughts.