Rethinking Infrastructure as Code: Do artifacts matter?
This article is Part 2 in a series exploring Aggregate Resources and rethinking Infrastructure as Code. If you haven’t read Part 1 yet — Why Infrastructure as Code Sucks — I recommend starting there for more context. It lays the groundwork for the ideas we’ll be diving into here.
In the last article, I introduced the concept of an Aggregate Resource. This time, I'd like to invite you to explore it with me a little further: let's look one level below the surface, expand our understanding, and see where the idea could lead. Instead of focusing solely on the what, let's dig into the hows and whys, starting with the concept of an artifact and its role in the Aggregate Resource idea.
Expect this article to raise more questions than it answers, and that's exactly the point. You might uncover even better answers along the way; if so, do let me know. But before we can decide whether and why artifacts matter, we need to understand the context and how they fit into the bigger picture.
Zoom out: the bigger picture
Imagine we are standing next to an empty whiteboard. We know nothing yet, but we have this dream of an Aggregate Resource, and now we need to come up with a plan. You hold the whiteboard marker, and it's your turn to write the plan down. How would you know if the thing we are building is actually useful? How would you measure whether we are capable of building it? What's our technical solution? Are we ready to draw boxes and lines and arrows and clouds already? And how would we tell whether our solution is "better" than anything else?
The scope is broad, so if I were you, I wouldn't start drawing boxes and lines and arrows and clouds straight away. I would use my marker to write down some questions to give our vague idea a little more structure. Those questions should serve as waypoints for navigating the complexities of designing an Aggregate Resource:
How do we want to interact with these resources, and what tools do we want to use? That is, you will have to decide what user experience we want to provide. We could develop our own configuration syntax and expect people to adopt it, or we could let them keep using the tools they already know. Or you could even decide to support a specific tool or programming language of choice to ease adoption. And, hopefully, we can provide an API and an Aggregate Resource schema so that extensions can be built for other tools.
How would we package our Aggregate Resource to convey its meaning? What benefits should a packaged Aggregate Resource provide, and are those benefits worth the effort? How much mental juggling would you expect from our users? Would you expect engineers to deal with the internals of the artifacts we produce? This looks like additional work, so again, maybe we can rely on existing tools, say, Terraform modules, for that?
How do we run our artifacts, and what is our runtime? We can run our Infrastructure as Code on demand, we can run it periodically, or we can have continuous resource reconciliation. You could argue that writing our own runtime is a must, or, once again, maybe we can adopt or adapt something that already exists? Would you want to support different clouds, or should we focus on a single Cloud Provider? And what effort does this additional runtime layer require? Do we expect our users to provision infrastructure just to provision infrastructure?
Check those questions again. Do you have some mental model forming in your head already? What does it look like?
And now that we have these questions, what do they show? To summarize, they guide us to three critical aspects: syntax, internal representation, and runtime.
It feels a bit familiar, doesn't it? When designing Aggregate Resources, we are, in a sense, creating a small, special-purpose language. Like any programming language, it revolves around three core pillars: how the language is expressed (syntax), how its structures are internally represented and understood (semantics/representation), and how it actually ‘runs’ (runtime execution). This perspective gives us a systematic way to reason about and refine the design.
What should be our starting point? One approach is to start with how the resources will run, identify dependencies for the runtime, and then move up to representation and syntax. Alternatively, we could begin with intuitive syntax and work down toward the runtime. Either way, the order matters less than keeping the layers distinct. After all, it would not be great if we started leaking concepts from one domain into another, and our syntax ended up depending on Cloud Provider properties. That would be like developing an assembly language for the Cloud.
So, to avoid these hidden threats to the model, somewhat unconventionally, I want to start from the middle, addressing the Aggregate Resource's internal representation. By figuring out the workings of the most abstract part, we should end up with a strong foundation into which we can plug the specifics of a Cloud Provider or a particular configuration engine. And that's why we will focus on artifacts first.
Zoom in: Why do we need Artifacts?
So far we know that the artifact is the bridge between our syntax and the runtime engine that will run all our provisioning. But why do we need it, and how should it look? If we use Terraform, for example, we already have Terraform modules. We have YAML for Crossplane, we have Helm charts and Ansible modules. Can't we just declare Terraform modules or Helm charts as our artifacts and move straight on to the runtime implementation?
It's a very intuitive first hunch, and it's most definitely wrong. You see, the common theme in all these module examples is that they operate on source code, and source code is not much of an artifact. Artifacts aren't just about packaging code: they're about catching mistakes early. Think of them as a safety net that spots broken relationships, unmet constraints, and missing dependencies, all before you hit 'apply.'
Meanwhile, storing source code, like a Terraform module, doesn't really do that; it's just text with no inherent meaning. If we stored a movie script in plain English inside our module, the module would still be valid. What it does and how it works would be nonsense, but it would be a module nevertheless. This means we would gain very little by having such artifacts; at best, they're a file distribution mechanism. So maybe we don't need artifacts at all?
Now imagine the way most infrastructure engineers develop their code. Imagine yourself writing some new module, maybe a new Helm chart. Most likely, the tools you will use to validate your code will be helm template, to visually check if your template renders, and kubectl apply --dry-run, to make sure your generated YAML is valid against the Kubernetes API. But these tools only validate the structure of your templates, missing the opportunity to catch deeper issues. For example, they won't tell you if your templates reference a secret that doesn't exist. You will discover the problem, sometimes quite painfully, only after execution.
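To make that gap concrete, here is a minimal sketch in Go of the kind of cross-reference check those tools skip. Everything here is hypothetical and deliberately simplified: we pretend a rendered chart is just a list of objects, each knowing which Secrets it references, and we flag any reference to a Secret the bundle does not define.

package main

import "fmt"

// Object is a deliberately simplified stand-in for a rendered
// Kubernetes manifest: a kind, a name, and the names of any
// Secrets it references.
type Object struct {
    Kind       string
    Name       string
    SecretRefs []string
}

// checkSecretRefs flags every reference to a Secret that is not
// defined anywhere in the same bundle of rendered manifests.
func checkSecretRefs(bundle []Object) []error {
    defined := map[string]bool{}
    for _, o := range bundle {
        if o.Kind == "Secret" {
            defined[o.Name] = true
        }
    }
    var errs []error
    for _, o := range bundle {
        for _, ref := range o.SecretRefs {
            if !defined[ref] {
                errs = append(errs, fmt.Errorf(
                    "%s %q references missing Secret %q", o.Kind, o.Name, ref))
            }
        }
    }
    return errs
}

func main() {
    // A Deployment referencing a Secret the chart never defines:
    // exactly the case helm template and a dry run wave through.
    bundle := []Object{
        {Kind: "Deployment", Name: "web", SecretRefs: []string{"db-credentials"}},
    }
    for _, err := range checkSecretRefs(bundle) {
        fmt.Println(err)
    }
}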
If you're already convinced, or you feel your eyes starting to glaze over the technical details as your brain asks for too much oxygen, this is a good place to skip ahead to the next section. TL;DR: we want artifacts!
If you're still here, let's take a look at another example: a simple Terraform snippet. Say we want an AWS VPC with subnets. The code in the example below is perfectly valid:
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "web" {
vpc_id = aws_vpc.main.id
cidr_block = "192.168.0.0/24"
availability_zone = "us-west-2d"
}
It has a couple of problems. Can you spot them? With just two resources, you might notice that the subnet's CIDR block doesn't fall within the VPC's range, and that the availability zone doesn't make sense. But the only tool you have to catch these mistakes is your own eyes; you wouldn't realize the configuration is broken until you try to apply it. Now consider another, more complex scenario: a Kubernetes Deployment specifies an image stored in an Amazon ECR repository. The Deployment YAML looks good, as does the code defining the ECR repository. However, the repository permissions might not allow access from the Kubernetes cluster. What tools do we have to catch this kind of misconfiguration? Never mind that, most likely, these resources will currently be managed by separate tools.
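The first of those problems is entirely mechanical to detect, given a representation to check against. As a rough sketch, assuming we have already parsed the two cidr_block values out of the configuration, Go's standard library is enough to tell that the subnet doesn't fit inside the VPC:

package main

import (
    "fmt"
    "net/netip"
)

// subnetFitsVPC reports whether the subnet prefix lies entirely
// within the VPC prefix.
func subnetFitsVPC(vpcCIDR, subnetCIDR string) (bool, error) {
    vpc, err := netip.ParsePrefix(vpcCIDR)
    if err != nil {
        return false, err
    }
    subnet, err := netip.ParsePrefix(subnetCIDR)
    if err != nil {
        return false, err
    }
    // Contained if the subnet starts inside the VPC range and its
    // mask is at least as narrow as the VPC's.
    return vpc.Contains(subnet.Addr()) && subnet.Bits() >= vpc.Bits(), nil
}

func main() {
    ok, err := subnetFitsVPC("10.0.0.0/16", "192.168.0.0/24")
    if err != nil {
        panic(err)
    }
    fmt.Println("subnet fits inside VPC:", ok) // prints: false
}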
The problem is that current Infrastructure as Code tools function as interpreters: we can only understand what a module does during execution, and we can only validate its actions against live APIs. That's why the primary infrastructure as code development methodology is TATIW (terraform apply till it works). We've all done it, probably more times than we'd like to admit: change a line, wait 20 minutes for the apply, repeat.
Artifacts should enable us to catch most of these issues at compile time, making configurations safer and deployments less painful. They also offer a way to test infrastructure more effectively — before anything is deployed. Testing is a whole different challenge, but imagine the ability to simulate resource interactions, validate permissions, or catch missing references, all without having to spin up actual environments. This could change how we build and trust infrastructure.
I'm not trying to say these are easy problems to solve, but clearly there is value in having artifacts and in shifting our validation left to catch these errors. So if storing source code is not enough, what other options do we have?
Let's have a schema
How about coming up with a resource schema and putting a bunch of serialized resource definitions into our artifact? Think of Azure Bicep: it compiles into ARM templates that can be validated upfront. Artifacts could work similarly, giving us something tangible to check before deployment. It's definitely an improvement over storing source code as an artifact; at minimum, it guarantees that our artifact is syntactically correct. You can also ensure that required fields are present and that, in general, the code is well formed. But I still believe it does not go far enough. It lacks expressiveness, and while you could patch things like conditionals into the schema, it's rather static. How would we catch that missing IAM permission between ECR and our Kubernetes cluster, again?
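To see both what a schema buys us and where it stops, here is a hedged sketch built around a hypothetical serialized subnet definition: the check below can guarantee the definition is well formed, but it has no way of knowing whether vpc_id points at a real VPC, let alone whether an IAM policy elsewhere permits anything.

package main

import (
    "encoding/json"
    "fmt"
)

// Subnet is a hypothetical schema for one serialized resource
// definition inside our artifact.
type Subnet struct {
    VPCID     string `json:"vpc_id"`
    CIDRBlock string `json:"cidr_block"`
}

// validate checks well-formedness only: the required fields are set.
// It cannot tell whether VPCID references a real VPC, or whether the
// CIDR fits inside it; that takes cross-resource context.
func (s Subnet) validate() error {
    if s.VPCID == "" {
        return fmt.Errorf("subnet: vpc_id is required")
    }
    if s.CIDRBlock == "" {
        return fmt.Errorf("subnet: cidr_block is required")
    }
    return nil
}

func main() {
    raw := []byte(`{"vpc_id": "vpc-12345", "cidr_block": "192.168.0.0/24"}`)
    var s Subnet
    if err := json.Unmarshal(raw, &s); err != nil {
        panic(err) // the schema at least guarantees valid syntax
    }
    fmt.Println("well formed:", s.validate() == nil) // true, and possibly still wrong
}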
What we really want is to codify the meaning: to embed a richer representation of objects and their relationships in a more domain-aware way. Instead of just verifying that the right properties exist, we want to verify that dependencies, references, and constraints make sense in context. We want to catch issues like invalid resource names, unsupported cross-resource relationships, or improperly typed references before ever reaching the provider's API. Shift left.
In other words, we want our artifacts to capture semantics. We want to strip away all the unnecessary stuff and distill our infrastructure as code down to its meaning, drawing a map of the program. And we do have a method for that: the Abstract Semantic Graph.
Semantic What Now?
Abstract Semantic Graph sounds fancy, doesn't it? But don't worry, it's less intimidating than it seems; it's inspired by techniques used in compilers and static analysis tools. These systems map the relationships between code components, ensuring they interact correctly and validating their dependencies before execution. Squint at it, and you can see how your program needs to "run" and how your runtime will have to "think" about your code.
Let's take the following line in Golang: a := i + rand.Intn(2). How could the graph for it look? The example is completely made up, as Golang itself uses a different technique, but if we pretty-printed the graph, it would look something like this:
Declare(a, type=int);
Assign(
  Variable(a, type=int),
  Add(
    Variable(i, type=int),
    Function("rand.Intn",
      type=func(int) int,
      arguments=Literal(2)
    )
  )
)
Looking at this example, we can see how our runtime could handle a program: where we store variables and functions, how building the graph could verify function arguments and do type checking, and how our runtime routines would have to implement these pretty-printed nodes. And while I think our artifact format will evolve, eventually into a binary form friendly to the runtime, this looks like a good enough format for a starting point. It allows us to capture the meaning without depending too much on external things, and that should be fine for a high-level, domain-specific "language".
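To make that less abstract, here is a minimal sketch of how a runtime might represent and walk those pretty-printed nodes. The node names mirror the example above; everything else, including the stubbed-out random function, is hypothetical.

package main

import "fmt"

// Env maps variable names to values; a stand-in for runtime storage.
type Env map[string]int

// Node is anything the runtime can evaluate against an environment.
type Node interface {
    Eval(env Env) int
}

type Literal struct{ Value int }
type Variable struct{ Name string }
type Add struct{ Left, Right Node }
type FuncCall struct {
    Name string
    Arg  Node
    Impl func(int) int // resolved while building the graph
}

func (l Literal) Eval(env Env) int  { return l.Value }
func (v Variable) Eval(env Env) int { return env[v.Name] }
func (a Add) Eval(env Env) int      { return a.Left.Eval(env) + a.Right.Eval(env) }
func (f FuncCall) Eval(env Env) int { return f.Impl(f.Arg.Eval(env)) }

func main() {
    // Add(Variable(i), Function("rand.Intn", Literal(2)))
    env := Env{"i": 40}
    expr := Add{
        Left: Variable{Name: "i"},
        Right: FuncCall{
            Name: "rand.Intn",
            Arg:  Literal{Value: 2},
            Impl: func(n int) int { return n - 1 }, // a stub, not real randomness
        },
    }
    env["a"] = expr.Eval(env) // the Assign node
    fmt.Println("a =", env["a"])
}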
Remember our earlier Kubernetes example, the one where a Deployment references a non-existent secret? That's exactly the type of issue an Abstract Semantic Graph could catch. Unlike syntax validators, an ASG validates configurations within the broader context of the system. For our Deployment example, it would represent both the Deployment and its referenced Secret, flagging a validation error if the Secret doesn't exist (or maybe just a warning in this specific case, if we want to allow a reference to an existing Secret to be resolved at deployment time). Similarly, it could ensure Kubernetes has the necessary permissions to pull from an ECR repository. While we still need to define the precise semantics of this language and its type-checking mechanisms, the ASG seems like a good conceptual fit to tackle these questions later.
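A hedged sketch of that resolution step, with hypothetical names throughout: while building the graph we try to resolve every reference edge, and a dangling edge becomes an error, unless the referenced kind is one we allow to live outside the artifact, in which case it is downgraded to a warning.

package main

import "fmt"

// Ref is a typed edge in the graph: a dependency on a node
// of the given kind and name.
type Ref struct{ Kind, Name string }

type GraphNode struct {
    Kind, Name string
    Refs       []Ref
}

// resolve checks every reference edge against the nodes we know about.
// Kinds listed in external may legitimately exist outside the artifact,
// so a dangling reference to them is a warning rather than an error.
func resolve(nodes []GraphNode, external map[string]bool) (errs, warns []string) {
    known := map[Ref]bool{}
    for _, n := range nodes {
        known[Ref{n.Kind, n.Name}] = true
    }
    for _, n := range nodes {
        for _, r := range n.Refs {
            if known[r] {
                continue
            }
            msg := fmt.Sprintf("%s/%s -> unresolved %s/%s", n.Kind, n.Name, r.Kind, r.Name)
            if external[r.Kind] {
                warns = append(warns, msg)
            } else {
                errs = append(errs, msg)
            }
        }
    }
    return errs, warns
}

func main() {
    nodes := []GraphNode{
        {Kind: "Deployment", Name: "web", Refs: []Ref{{"Secret", "db-credentials"}}},
    }
    errs, warns := resolve(nodes, map[string]bool{"Secret": true})
    fmt.Println("errors:", errs, "warnings:", warns)
}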
Is that it? Do we just serialize the graph and move on? Not entirely. It would be wrong not to include a header (or metadata, if you like) from the initial version. At the very least, it should include something like formatVersion=0, ensuring the format can evolve without introducing breaking changes. The header could also store runtime-specific capabilities, for example, whether the artifact requires GCP, AWS, or Kubernetes. It's a straightforward addition, but one that ensures the artifact remains adaptable as our approach develops.
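As a rough sketch, with hypothetical field names, the header could be as small as this:

package artifact

// Header is a hypothetical artifact header: a format version so the
// format can evolve, plus the capabilities the runtime must provide
// before it can execute the graph.
type Header struct {
    FormatVersion        int      `json:"formatVersion"`        // starts at 0
    RequiredCapabilities []string `json:"requiredCapabilities"` // e.g. ["aws", "kubernetes"]
}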
There’s still so much left to uncover about artifacts and their role in infrastructure. How can they scale with complex systems? How can they connect resources to runtime engines in meaningful ways? And what would their semantics actually look like in practice? This article is just the beginning — a foundation for ideas that can grow with your insights. And if these ideas resonate with you or spark new questions, I’d love to hear your thoughts. Together, we can explore how these concepts might evolve into something practical and worth building.