Infrastructure as Code at scale: from a Terraform monolith to modules

Every IaC journey starts the same way: one directory, one state file, one team, everything in it. It works until it doesn't. The moment your first serious engineer asks why terraform plan takes 18 minutes, or why the team avoids applying on Fridays, you have hit the scaling wall. That wall is not a Terraform failure — it is doing exactly what you configured. It is a boundary failure: you have built an infrastructure monolith and dressed it up as code.

The path out is almost never "switch tools". It is: define boundaries that match how your infrastructure actually changes, enforce those boundaries through split state files and versioned module contracts, and automate the verification and deployment workflow. OpenTofu, Terragrunt, Atlantis, and OPA are multipliers on a good structure. They cannot rescue a bad one.

Why the Terraform monolith turns hostile

The monolith degrades in three ways that reinforce each other.

Plan time grows with resource count. Every terraform plan triggers a state refresh: Terraform queries the cloud API for the current state of every managed resource in the workspace. A small workspace of 50 resources completes in under a minute. At 500 resources, teams routinely report plan times of 15 to 30 minutes — a pattern documented in several public engineering post-mortems including the ThousandEyes Engineering team's account of their Terraform scaling journey. In workspaces managing thousands of resources across services with aggressive rate limits, the same refresh process can exceed two hours. The problem is not Terraform's query logic; it is that the cloud provider's throttle limits serialize what should be parallel reads.

Lock contention serialises all work. Terraform acquires an exclusive state lock before any plan or apply. One workspace means one lock. Two teams working on unrelated concerns — one updating a VPC peering rule, one deploying an ECS service — queue behind each other. In practice, coordination breaks down: someone force-unlocks mid-apply, the next apply starts on a partially-written state, and the on-call engineer spends Saturday reconstructing what the state should look like.

Blast radius is unconstrained. A misconfigured resource in one corner of the state can produce a plan diff in a completely unrelated service. A provider upgrade that breaks one resource type stops everyone from applying until the regression is isolated and fixed. The psychological effect matters as much as the technical one: a codebase where any change might affect anything is a codebase nobody trusts, and engineers who distrust their tooling work around it.

Approximate terraform plan times by workspace resource count (representative figures, not precise benchmarks)

| Workspace size | Typical plan time |
|---|---|
| Under 100 resources | ~2 min |
| ~300 resources | ~10 min |
| ~500 resources | 15–30 min |
| 800+ resources (documented extreme cases) | 45–120 min |

Source: Engineering reports: ThousandEyes Engineering, InfoWorld, MoldStud, 2024–2025

Decompose the state: boundaries first, tooling second

The most consequential decision in IaC scaling is where to draw the boundary between state files. There are three axes to evaluate simultaneously.

Rate of change is the most important axis. Resources that change daily — application service configurations, security group rules, Lambda environment variables — should never share a state with resources that change quarterly, like VPCs, Transit Gateway attachments, or DNS zones. When they do, a routine deployment locks out the quarterly change, and an emergency change to core networking requires touching the same workspace as a dozen application deploys.

Ownership follows team structure. A platform team managing Kubernetes node groups and a product team deploying application services should operate on independent locks. Where team boundaries exist, state boundaries should follow. The state file is where the Terraform ownership model becomes explicit in code rather than in conversation.

Failure domain defines blast radius at design time. If the payments infrastructure needs an emergency rollback at 3am, that operation should not require touching the same workspace as identity or notifications. Separate failure domains are separate state files.

A practical target layout for a mid-sized product company:

| State | Contents | Change frequency |
|---|---|---|
| foundation | VPC, subnets, DNS, certificates, Transit Gateway | Quarterly |
| platform | EKS/ECS cluster, RDS, ElastiCache, ECR | Monthly |
| iam | IAM roles, policies, OIDC providers | Weekly |
| app-payments | ECS services, ALB rules, SQS queues | Daily |
| app-identity | ECS services, Cognito, Lambda functions | Daily |
| observability | CloudWatch log groups, Datadog forwarders | Weekly |

Each state is independently lockable, appliable, and testable. A change to app-payments does not lock app-identity. A broken provider version in platform does not block application deployments.
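In practice, each row in the table becomes its own root configuration with its own backend key. A minimal sketch for the app-payments state, assuming the S3 bucket and DynamoDB lock table names used elsewhere in this article (the file path is illustrative):

```hcl
# stacks/app-payments/backend.tf (illustrative path)
terraform {
  backend "s3" {
    bucket         = "my-company-tfstate"
    key            = "app-payments/terraform.tfstate" # one key per state
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                # lock entries are per state key
    encrypt        = true
  }
}
```

Because the lock is taken per state object, a plan in app-payments does not wait on a lock held by app-identity or foundation.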

Monolith vs. decomposed state architecture

| IaC monolith (before) | Decomposed states (after) |
|---|---|
| Single state file with 600+ resources | Six to eight states, each under 150 resources |
| plan takes 20+ minutes end-to-end | Plans complete in under 2 minutes per state |
| One lock blocks all concurrent team changes | Independent locks per domain or team |
| One bad apply can affect every service | Blast radius limited to one domain |
| Provider refresh touches every resource on every plan | Targeted plans are fast and surgical |
| Cannot test one domain without risking another | Each state independently testable and appliable |

Source: Common IaC engineering patterns

Migrating without disrupting live infrastructure

Migrating a live monolith without downtime is the part most guides skip. The order of operations matters.

Decomposing a live Terraform monolith

1. Inventory. Run terraform state list and classify each resource by change frequency, owner, and failure domain. Export to a manifest; a spreadsheet is fine at this stage.
2. Map boundaries. Group resources into candidate target states. Validate the grouping with owning teams before writing a line of HCL. Disagreements at this step are architectural, not technical.
3. Extract modules first. Pull repeated resource patterns into versioned modules before touching state boundaries. State migration is cleaner when resources are already encapsulated.
4. Create target backends. Provision the new S3 buckets, DynamoDB tables, or HCP Terraform workspaces for the target states. Initialize empty backends first; do not move state yet.
5. Move state, not resources. Use terraform state mv with the -state and -state-out flags to move resources into new state files. Those flags operate on local state files only, so with a remote backend run terraform state pull for each state first and terraform state push the results afterwards. Never destroy and recreate live infrastructure: you will interrupt running services.
6. Verify zero drift. Run plan in each new state and expect a zero diff. Any diff means configuration was not correctly mirrored during extraction. Fix it before touching the next state.

Source: ClimsTech Engineering

A realistic migration takes one to two weeks for a team doing this for the first time. The highest-risk step is the state move: test on a non-production resource before touching any stateful infrastructure, and keep the original state locked until the new states have a clean plan.
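On Terraform 1.7 or later, the same move can also be expressed in configuration instead of with state mv: the old configuration forgets the resource with a removed block, and the new stack adopts it with an import block. This is an alternative to the CLI approach in step 5, not the method the steps above prescribe. The resource address and the cluster/service import ID are hypothetical.

```hcl
# --- In the monolith (source) configuration ---
# Delete the resource block for the service and add this, so the next apply
# drops it from state without destroying the real service.
removed {
  from = aws_ecs_service.payments

  lifecycle {
    destroy = false
  }
}

# --- In stacks/app-payments (target) configuration ---
# The resource block for aws_ecs_service.payments must also exist here.
# ECS services are imported as "<cluster-name>/<service-name>";
# "prod-cluster/payments" is a placeholder.
import {
  to = aws_ecs_service.payments
  id = "prod-cluster/payments"
}
```

Both sides are reviewable as plans before anything changes: the source plan should report the resource as no longer managed, and the target plan should report one resource to import and nothing to add, change, or destroy if the configuration matches.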

Remote state: wiring states together without coupling them

Once states are split, you need a way for one state to consume outputs from another. The terraform_remote_state data source is the standard approach:

data "terraform_remote_state" "foundation" {
  backend = "s3"
  config = {
    bucket = "my-company-tfstate"
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}
 
resource "aws_ecs_service" "payments" {
  name    = "payments"
  cluster = data.terraform_remote_state.foundation.outputs.ecs_cluster_arn

  # aws_ecs_service has no subnet_ids argument; subnets go in network_configuration
  network_configuration {
    subnets = data.terraform_remote_state.foundation.outputs.private_subnet_ids
  }
  # ...
}

This pattern is clean but has a coupling failure mode: if the foundation team renames an output, every consuming state breaks at plan time with no warning. Treat your remote state output schema as a versioned API. Mark breaking changes explicitly in the module changelog, and give consuming teams a migration window before removing old output keys.
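One way to run that migration window is to publish the new output name alongside the old one for a release or two. A sketch of the foundation stack's outputs, assuming the subnets are declared as aws_subnet.private with count, and that subnet_ids is a hypothetical older name being retired:

```hcl
# stacks/foundation/outputs.tf (illustrative)

# Current contract: consumers should read this key.
output "private_subnet_ids" {
  description = "IDs of the private subnets. Part of the foundation output contract."
  value       = aws_subnet.private[*].id
}

# Deprecated alias kept for one migration window, then removed
# in the next breaking release noted in the changelog.
output "subnet_ids" {
  description = "DEPRECATED: use private_subnet_ids."
  value       = aws_subnet.private[*].id
}
```

Consumers can switch keys on their own schedule, and removing the alias becomes an announced breaking change rather than a surprise at plan time.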

For values that are not Terraform-managed resources — a shared secret ARN, an account ID, an image tag from a separate CD pipeline — prefer AWS SSM Parameter Store or Azure Key Vault over terraform_remote_state. SSM decouples the consumer from the producer's backend configuration and does not require the consuming state to have IAM read access to the state bucket.
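As a sketch of the consumer side, assume a separate CD pipeline writes the current image tag to a parameter at a hypothetical path, /deploy/payments/image_tag. The stack then reads it with the AWS provider's aws_ssm_parameter data source:

```hcl
# Read a value published outside Terraform.
# The parameter path is an assumption for illustration.
data "aws_ssm_parameter" "payments_image_tag" {
  name = "/deploy/payments/image_tag"
}

locals {
  # The provider marks SSM values as sensitive. An image tag is not a secret,
  # so unwrap it to keep plan output readable. Do not do this for real secrets.
  payments_image_tag = nonsensitive(data.aws_ssm_parameter.payments_image_tag.value)
}
```

The consuming role needs only ssm:GetParameter on that path, not read access to another team's state bucket.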

Modules that survive production

Modules are Terraform's unit of reuse. A well-designed module makes deploying a new service a ten-line configuration. A poorly designed one becomes a maintenance liability that nobody wants to touch.

Create a module when a pattern repeats three or more times, or when a single complex resource group is owned by one team but consumed by many. An ECS service with a standard ALB listener rule, a CloudWatch log group, and an IAM task role is the right candidate. A one-off VPC peering configuration for a legacy integration is not — the abstraction cost exceeds the reuse benefit.

Keep the module contract minimal. Inputs should be the minimum required specification: the service name, the image URI, the replica count. Outputs should expose only what consumers genuinely need. Avoid pass-through modules that re-expose every attribute of every managed resource — they grow without bound and become impossible to version without breaking all consumers at once.

module "service" {
  source = "git::https://github.com/my-org/terraform-modules.git//ecs-service?ref=v2.1.0"
 
  name         = "payments"
  image_uri    = "123456789012.dkr.ecr.us-east-1.amazonaws.com/payments:${var.image_tag}"
  min_replicas = 2
  max_replicas = 10
  cpu_units    = 512
  memory_mb    = 1024
  vpc_id       = data.terraform_remote_state.foundation.outputs.vpc_id
  subnet_ids   = data.terraform_remote_state.foundation.outputs.private_subnet_ids
}
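Inside the module, a contract this small might be declared as follows. This is a sketch of part of a hypothetical variables.tf and outputs.tf for the ecs-service module, not the actual module source. The validation rule and the service_id output are assumptions chosen to illustrate the idea.

```hcl
# modules/ecs-service/variables.tf (excerpt)
variable "name" {
  description = "Service name, used to derive resource names."
  type        = string
}

variable "image_uri" {
  description = "Full container image URI including tag."
  type        = string
}

variable "min_replicas" {
  description = "Lower bound for the service's task count."
  type        = number
  default     = 2

  # Reject obviously wrong input at plan time instead of at deploy time.
  validation {
    condition     = var.min_replicas >= 1
    error_message = "min_replicas must be at least 1."
  }
}

# modules/ecs-service/outputs.tf (excerpt)
# Expose only what callers need, not every attribute of every resource.
output "service_id" {
  description = "Identifier of the ECS service."
  value       = aws_ecs_service.this.id
}
```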

Pin versions, always. A source without a ?ref= resolves to the default branch HEAD at the time of terraform init. The next commit to the modules repository then changes your production service configuration on the next fresh init (which is every run in most CI setups), with no code diff in your own repository to review. Pin to a git tag using semantic versioning: v2.1.0. Breaking interface changes increment the major version; new optional inputs increment minor; bug fixes increment patch.

Test modules. Terraform 1.6 shipped a native terraform test command with .tftest.hcl test files that run against real providers; Terraform 1.7 added mocked providers:

# tests/ecs_service.tftest.hcl
run "creates_with_correct_replica_count" {
  command = plan
 
  variables {
    name         = "test-service"
    image_uri    = "nginx:latest"
    min_replicas = 3
  }
 
  assert {
    condition     = aws_ecs_service.this.desired_count == 3
    error_message = "Expected desired_count of 3"
  }
}
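When credentials for a real account are not available in CI, Terraform 1.7 and later can run the same assertion against a mocked provider. A sketch, assuming the module gives every other input a default and maps min_replicas to desired_count as the test above implies (the file name is illustrative):

```hcl
# tests/ecs_service_mock.tftest.hcl
# mock_provider replaces the real AWS provider for this test file,
# so no API calls are made and no credentials are needed.
mock_provider "aws" {}

run "replica_count_with_mocked_provider" {
  command = plan

  variables {
    name         = "test-service"
    image_uri    = "nginx:latest"
    min_replicas = 3
  }

  assert {
    condition     = aws_ecs_service.this.desired_count == 3
    error_message = "Expected desired_count of 3"
  }
}
```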

For integration tests that provision real resources in a test account and verify live behavior, Terratest (Go) remains the most capable option — it runs a full apply, makes assertions, and destroys on completion within a single Go test function.

CI/CD for IaC: nobody applies manually

The single highest-leverage improvement most teams can make is eliminating manual terraform apply. When engineers apply from laptops, you get unreviewed plans, inconsistent local provider caches, forgotten -var-file overrides, and no audit trail. The pipeline enforces process consistently.

The baseline workflow:

1. Developer opens a pull request modifying infrastructure code.
2. CI runs terraform plan automatically and posts the diff as a PR comment.
3. Static analysis (Checkov, tfsec, or Trivy) runs in parallel; policy violations fail the check.
4. A human reviewer reads the plan output: not just the code diff, but the actual resource change list.
5. PR merges; CI runs terraform apply against the real environment.
6. Failure triggers immediate notification and blocks further applies until resolved.

Atlantis is the most widely deployed open-source orchestrator for this pattern. The atlantis.yaml in your repository controls which directory triggers which plan and supports per-project workflow overrides:

version: 3
projects:
  - name: foundation
    dir: stacks/foundation
    workspace: default
    autoplan:
      when_modified: ["**/*.tf", "**/*.tfvars", "../../modules/**/*.tf"]
      enabled: true
 
  - name: app-payments
    dir: stacks/app-payments
    workspace: default
    autoplan:
      when_modified: ["**/*.tf", "../../modules/ecs-service/**/*.tf"]
      enabled: true

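The when_modified glob is what fans out plans when a module changes. Its patterns are resolved relative to each project's dir. A change to the ecs-service module files triggers plans in every project whose patterns match them, giving reviewers a composite view of the impact before merge. For stacks that pin the module by git tag, the same fan-out happens when the ?ref= pin is bumped in each stack's own files.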

DORA elite-tier benchmarks: what top-performing teams achieve

| Metric | Elite tier |
|---|---|
| Deploy frequency | On-demand |
| Lead time for changes | <1 day |
| Change failure rate | ~5% |
| MTTR | <1 hour |

Source: Google DORA State of DevOps 2024 (39,000+ respondents)

In the 2024 DORA report (39,000+ respondents), only 19% of teams qualified as elite performers. The high-performer cluster shrank from 31% to 22% of respondents between 2023 and 2024, while the low-performer cluster grew from 17% to 25%. Teams not actively improving their delivery automation are sliding backward relative to the field. IaC automation is not a differentiator at this point — it is table stakes for staying in the top half.

Drift detection and policy as code

Automated checks catch the class of mistake that code review never will, because code review happens on diffs — not on the cumulative state of what was deployed six months ago.

Drift detection. Drift is the gap between what Terraform's state file believes the world looks like and what it actually looks like. It accumulates from console clicks, CLI overwrites, and auto-healing processes that modify resources outside Terraform's control. A daily scheduled terraform plan run (using the -detailed-exitcode flag — exit code 2 signals drift) is the minimum viable detection. Some teams run it hourly for production stateful infrastructure. Pipe the non-zero-exit output to a Slack channel; when it is clean, ignore it. When it is not, treat it as a medium-priority incident before the next deploy turns it into a high-priority one.

Static analysis in the PR pipeline. Checkov runs against HCL before plan and catches misconfigurations at the source:

checkov -d stacks/app-payments --framework terraform --compact --quiet

Plan-time policy enforcement with OPA. Conftest validates the plan JSON output — catching issues that only appear after variable interpolation, unlike static HCL analysis:

terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
conftest test plan.json --policy policies/ --all-namespaces

A minimal OPA rule enforcing mandatory cost-centre tags on all resources:

package terraform.tagging
 
required_tags := {"CostCentre", "Environment", "Team"}
 
deny[msg] {
  resource := input.resource_changes[_]
  resource.change.after != null
  missing := required_tags - {tag | resource.change.after.tags[tag]}
  count(missing) > 0
  msg := sprintf(
    "Resource %v is missing required tags: %v",
    [resource.address, missing]
  )
}

On HCP Terraform, Sentinel runs as a managed policy layer between plan and apply; alternatives such as Scalr and Spacelift provide a similar stage built on OPA. For teams not on managed platforms, the Checkov and Conftest combination covers much of the same ground and runs anywhere CI runs.

The non-negotiables — no public S3 buckets, approved instance families, encryption at rest on RDS, approved AWS regions — belong in policy checks, not in documentation. Documentation gets ignored in the middle of an incident. A failing policy check does not.

Common pitfalls and their specific fixes

These appear in nearly every IaC scaling effort, roughly in the order teams encounter them.

Rate-limiting during state refresh. At 500 or more resources, DescribeInstances, ListBuckets, and similar describe calls fire concurrently and hit the cloud provider's throttle limits. The plan serializes instead of parallelizing, and eventually times out. Fix long-term: split the state. Fix short-term: use -refresh=false for fast iteration on infrastructure you know is deployed correctly, and terraform apply -target= for surgical emergency changes. Avoid -target in normal workflows — it produces a plan that diverges from the full state graph and can cause inconsistencies on the next unconstrained apply.

Unpinned module sources. The source line points to the default branch. A teammate pushes a breaking change. The next terraform init silently pulls it, and the next plan includes an unreviewed infrastructure modification. Fix: all module sources must carry a ?ref=v{major}.{minor}.{patch} pin. Enforce this in CI:

# Fail the pipeline if any module source lacks a version pin
grep -rn 'source.*git::' stacks/ | grep -v '?ref=' && echo "ERROR: unpinned module source" && exit 1 || exit 0

Circular remote state dependencies. State A reads platform outputs; platform state reads an app output for a health-check URL. Neither can plan without the other existing first. Fix: draw the dependency graph before you write a data source. The dependency direction must flow strictly from foundation to application. Anything that would create a cycle belongs in SSM Parameter Store, a shared constants state, or a configuration management layer outside Terraform entirely.

State corruption from an interrupted apply. A SIGTERM arrives mid-apply: a CI timeout, a spot instance reclaim. The state is now partially written. Fix: enable S3 bucket versioning on your state backend on day one. After a corrupted write, roll back with aws s3api list-object-versions and get-object --version-id. HCP Terraform (formerly Terraform Cloud) keeps a history of state versions automatically. Recovering from a corruption without backend versioning is manual and error-prone.
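Enabling versioning takes a few lines if the state bucket is itself managed as code. A minimal sketch, assuming AWS provider v4 or later and the bucket name used earlier in this article; where this configuration lives (often a small bootstrap stack) is left open:

```hcl
resource "aws_s3_bucket" "tfstate" {
  bucket = "my-company-tfstate"

  # Guard against accidental deletion of the bucket that holds every state file.
  lifecycle {
    prevent_destroy = true
  }
}

# Keep every previous version of each state object so a bad write can be rolled back.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}
```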

Rename creates destroy-and-recreate. A resource is renamed in HCL. Terraform reads this as destroy-old, create-new. For a production RDS instance or Kubernetes cluster, that is catastrophic. Fix: use the moved block (available since Terraform 1.1) to declare the rename without touching the resource:

moved {
  from = aws_ecs_service.app
  to   = aws_ecs_service.payments
}

Terraform updates the state entry without issuing a destroy. This is one of the most underused features in the language, and it prevents what would otherwise be a multi-minute service interruption.
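The same block covers the module extraction step of a migration. A sketch, assuming a root-level aws_ecs_service.payments is being folded into a module call named service whose internal resource is aws_ecs_service.this (both names are illustrative):

```hcl
# Root configuration, after replacing the inline resource with a module call.
# Tells Terraform the existing object now lives at the module address.
moved {
  from = aws_ecs_service.payments
  to   = module.service.aws_ecs_service.this
}
```

The next plan should report the address change and no create or destroy actions. If it shows a replacement, the module's configuration does not yet match the original resource.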

Console changes accumulating as invisible debt. An incident is resolved by hand in the AWS console. The fix works. Nobody updates the HCL. Three months later, the next apply overwrites the manual fix and the incident recurs. Fix: enforce IaC-only changes at the cloud level using AWS Service Control Policies or Azure Policy deny effects that block resource mutations from any IAM principal except the CI pipeline's role. This is the only approach that eliminates the pattern entirely. For teams not ready for full enforcement, the daily drift-detection run makes the accumulation visible before it becomes a recurring incident.
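A Service Control Policy of this kind can itself be managed in Terraform. The sketch below denies ECS mutations for every principal except a CI role. The role name ci-terraform, the action list, and the target organizational unit are assumptions for illustration, not a recommended production policy.

```hcl
variable "workloads_ou_id" {
  description = "Organizational unit to attach the policy to (hypothetical)."
  type        = string
}

resource "aws_organizations_policy" "iac_only" {
  name = "deny-ecs-mutations-outside-ci"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyEcsMutationsOutsideCI"
      Effect   = "Deny"
      Action   = ["ecs:Create*", "ecs:Update*", "ecs:Delete*"]
      Resource = "*"
      Condition = {
        ArnNotLike = {
          # Hypothetical role assumed by the CI pipeline in each account.
          "aws:PrincipalArn" = "arn:aws:iam::*:role/ci-terraform"
        }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "iac_only" {
  policy_id = aws_organizations_policy.iac_only.id
  target_id = var.workloads_ou_id
}
```

A real policy would cover more services and would usually exempt a break-glass role so that incident response stays possible.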

The ecosystem: OpenTofu, Terragrunt, and where each belongs

OpenTofu is the open-source fork of Terraform, created after HashiCorp relicensed Terraform to the Business Source License in August 2023 and hosted by the Linux Foundation. OpenTofu 1.x is syntax-compatible with Terraform 1.x for the vast majority of configurations. Teams evaluating it are primarily those with BSL licensing constraints, open-source-only procurement policies, or a preference for community governance over a single vendor's roadmap. For the core workflow described here, the two tools are close to interchangeable today, although each has since added features the other lacks. OpenTofu's foundation governance makes it the lower-risk long-term choice for teams that care about licensing continuity.

Terragrunt solves the DRY configuration problem. If you have ten application stacks that all use the same S3 backend, the same provider pinning, and the same tagging inputs, you should not copy-paste that boilerplate ten times. Terragrunt's root terragrunt.hcl with generate blocks and inputs maps keeps the layout clean:

# terragrunt.hcl (root — inherited by all stacks)
remote_state {
  backend = "s3"
  config = {
    bucket         = "my-company-tfstate"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# stacks/app-payments/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}
 
terraform {
  source = "git::https://github.com/my-org/terraform-modules.git//ecs-service?ref=v2.1.0"
}
 
inputs = {
  name         = "payments"
  min_replicas = 2
  max_replicas = 10
}
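The generate blocks mentioned above follow the same inheritance. A sketch of a provider file generated from the root configuration, reusing the tag keys from the tagging policy earlier in this article; the tag values are placeholders:

```hcl
# terragrunt.hcl (root), alongside remote_state
# Writes provider.tf into every stack's working directory before Terraform runs.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      CostCentre  = "platform"
      Environment = "production"
      Team        = "infrastructure"
    }
  }
}
EOF
}
```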

Terragrunt's run-all plan command plans every stack in dependency order in one invocation, which is useful in CI when a module change needs to fan out across all consumers before merge. The important caveat: Terragrunt solves the organisation problem (DRY config, orchestration order), not the boundary problem (where to split states). Used on a poorly designed monolith, it produces a DRY monolith with the same blast radius, more tooling, and an additional layer to debug.
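The order Terragrunt follows when planning all stacks comes from dependency blocks in each stack. A sketch for app-payments reading from foundation, assuming the foundation stack exports vpc_id and private_subnet_ids as in the earlier examples:

```hcl
# stacks/app-payments/terragrunt.hcl (excerpt)
# Declares that this stack must be planned and applied after foundation.
dependency "foundation" {
  config_path = "../foundation"
}

inputs = {
  vpc_id     = dependency.foundation.outputs.vpc_id
  subnet_ids = dependency.foundation.outputs.private_subnet_ids
}
```

Terragrunt reads the foundation outputs at run time, so foundation must have been applied at least once (or the block must supply mock_outputs) before app-payments can plan.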

Managed platforms — HCP Terraform, Scalr, Spacelift, env0 — provide remote state hosting, RBAC, cost estimation, policy enforcement, and audit logging as managed services. For teams without the operational capacity to run Atlantis and maintain S3 backends, they are the pragmatic choice. Evaluate them on whether the plan-on-PR, apply-on-merge workflow is first-class, because that workflow is non-negotiable regardless of which platform runs it.

A slow terraform plan is a user-interface problem. Engineers who wait 20 minutes for feedback stop running plans — and the next incident traces directly back to that gap.

— A pattern observed consistently across infrastructure teams at scale

What a mature repository layout looks like

To make this concrete: a mature IaC repository for a team using Terragrunt and versioned modules looks like this:

infra/
├── terragrunt.hcl               # root: backend, provider pins, shared tags
├── modules/
│   ├── ecs-service/             # v2.1.0 tagged on git
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── tests/
│   ├── rds-cluster/
│   └── lambda-function/
└── stacks/
    ├── foundation/
    │   └── terragrunt.hcl       # VPC, subnets, DNS
    ├── platform/
    │   └── terragrunt.hcl       # EKS/ECS, RDS, ElastiCache
    ├── iam/
    │   └── terragrunt.hcl       # roles, policies, OIDC
    ├── app-payments/
    │   └── terragrunt.hcl
    ├── app-identity/
    │   └── terragrunt.hcl
    └── observability/
        └── terragrunt.hcl

Each stack is a thin terragrunt.hcl that sources a pinned module version and passes environment-specific inputs. The modules directory contains the resource logic. The stacks directory contains the decisions. The modules are the implementation behind a small, versioned interface; the stacks are the consumers that choose the inputs. Keeping that distinction sharp is what makes the repository navigable as it grows past ten stacks and three teams.

What to remember

Split states along three axes — rate of change, team ownership, and blast radius — not by arbitrary resource count.
Design your remote state dependency graph as a strict DAG. Circular remote_state references are an architectural mistake, not a Terraform limitation to work around.
Pin every module to a semantic-versioned git tag. An unpinned source is an unreviewed change waiting to happen on the next terraform init.
Eliminate manual applies. Plan-on-PR plus apply-on-merge is the baseline; add OPA/Checkov policy checks that run before the human reviewer reads the diff.
Run drift detection on a schedule — daily at minimum. Drift found before a deployment is an observation. Drift found during one is an incident.
Use the moved block for resource renames to avoid destroy-and-recreate cycles on live infrastructure. Available since Terraform 1.1 and OpenTofu 1.x.
OpenTofu and Terragrunt are multipliers on a good boundary layout — they cannot rescue a bad one. Model your state boundaries first, then reach for tooling.
Target plan times under two minutes per state. Anything longer is signal for decomposition, regardless of whether it has caused an incident yet.
