Managing Terraform state without fear

Terraform's state file is simultaneously the most important artifact in your infrastructure and the most consistently mishandled one. Every IaC horror story — the corrupted state that took a day to reconstruct, the production database that vanished mid-apply, the two engineers whose concurrent runs left state inconsistent for a week — traces back to the same root: state was not treated as the critical, shared, mutable infrastructure record that it is. The disciplines that prevent these failures are not exotic. They are the gap between engineering teams that trust their infrastructure tooling and teams that are quietly afraid to run terraform apply on a Friday afternoon.

What the state file actually is

Before covering the disciplines, it helps to be precise about what Terraform stores and why it matters. The state file (terraform.tfstate) is not a cache or a convenience artifact. It is Terraform's authoritative mapping between your configuration and real cloud resources. It stores:

Resource identifiers: the cloud-provider IDs (i-0abc123def456, arn:aws:rds:us-east-1:...) that Terraform needs to make API calls against real infrastructure
Resource attributes: every attribute Terraform knows about, including IP addresses, endpoint URLs, generated names, and computed values that exist only after a resource is created
Dependency graph: the directed edges between resources that Terraform uses to sequence creates, updates, and destroys correctly
Provider metadata: the address of the provider that manages each resource and the schema version of each stored object, which Terraform uses to upgrade state when a provider's schema changes (provider version constraints and hashes live in .terraform.lock.hcl, not in state)
Sensitive values: outputs marked sensitive = true are stored in state in plaintext in every Terraform version; sensitive only masks CLI output. Terraform 1.10's ephemeral values keep some secrets out of state entirely, but Terraform itself never encrypts the state file. Encryption at rest is the backend's job (OpenTofu 1.7 added built-in state encryption)

That last point makes state categorically different from code. Your .tf files are safe to commit to version control. The state file almost certainly is not. A real infrastructure stack's state file will typically contain database master passwords, generated TLS private keys, IAM access key secrets, Kubernetes cluster client certificates, and other credential material — all readable JSON.

State also carries two fields that most engineers ignore until something goes wrong: serial (an integer incrementing with each write) and lineage (a UUID identifying the state chain). Terraform uses these to detect when two processes have diverged, and to prevent an older state write from silently overwriting newer state. Understanding this is what makes the concurrency problem below make sense.

Remote backends: not optional

The single highest-leverage Terraform discipline is storing state in a remote backend. Local state on a developer's machine means the state is unavailable to every other team member and every CI runner, there is no locking, there is no versioning, and the file is almost certainly unencrypted. All four are fixed by switching to a remote backend.

The main options and their properties:

| Backend | Locking mechanism | Versioning | Encryption at rest | Notes |
|---|---|---|---|---|
| S3 + DynamoDB (AWS) | DynamoDB conditional writes | S3 native versioning | SSE-S3 or SSE-KMS | Most common; requires two managed resources |
| GCS (GCP) | GCS object locks | GCS object versioning | CMEK or Google-managed | Single resource; built-in locking |
| AzureRM (Azure Blob) | Azure Blob lease API | Storage versioning | Azure Storage encryption | Built-in; storage account + container |
| HCP Terraform | Platform-managed | Platform-managed | Platform-managed | Managed SaaS; adds run policies and audit logs |
| Terraform Enterprise | Platform-managed | Platform-managed | Platform-managed | Self-hosted version of HCP Terraform |

The S3 + DynamoDB configuration is worth showing in full because the DynamoDB table is frequently misconfigured:

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123def456"
 
    dynamodb_table = "terraform-state-lock"
  }
}

The DynamoDB table must have a partition key named exactly LockID — case-sensitive, type String. That is the only attribute you define; Terraform manages the rest. A common mistake is adding additional required attributes or using a different key name, which causes lock operations to fail with confusing errors:

resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}

Enable versioning on the S3 bucket. State files are small; versioning costs almost nothing. It is the only recovery path when an apply corrupts state or when you need to roll back after a mistaken state push:

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.terraform_state.id
 
  versioning_configuration {
    status = "Enabled"
  }
}

State locking and what happens without it

Remote backends provide state locking, but locking is only useful if you understand what it does and how to recover from a stuck lock.

When Terraform starts a plan or apply, it acquires a lock on the state file — writing a conditional entry to the DynamoDB table in the S3 case. Any other Terraform process attempting to acquire the lock on the same state key fails immediately, with an error showing the lock ID, the identity that holds it, and when the lock was acquired. The lock releases when the operation completes or when the process exits cleanly.

Without locking, two concurrent applies can each read the current state, each modify a different resource, and each write their version back — last write wins. The first apply's state changes are silently discarded. Infrastructure now has resources Terraform no longer believes it manages. This is one of the primary causes of orphaned resources and the cloud waste that accumulates around them.

Handling stuck locks

A lock gets stuck when a process is killed mid-apply: OOM on a CI runner, an interrupted network connection, or a SIGKILL during a long provisioner step. The stuck lock blocks all subsequent operations:

Error: Error locking state: Error acquiring the state lock: ConditionalCheckFailedException
  Lock Info:
    ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
    Path:      prod/networking/terraform.tfstate
    Operation: OperationTypeApply
    Who:       runner@ci-node-42
    Created:   2024-11-15 14:22:08.987 +0000 UTC

The fix is terraform force-unlock with the lock ID from the error message:

terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

Do not run force-unlock while the original process might still be running. Verify the process is genuinely dead first — otherwise you bypass the protection the lock was providing, allowing exactly the concurrent-write corruption you are trying to avoid.

State segmentation: the blast-radius problem

A single Terraform state file for an entire company's infrastructure is a concentrated failure point. When that file locks, nothing can apply anywhere. When it is damaged, everything is at risk. When a plan runs, it evaluates every resource — which becomes progressively slower as resource count grows, and slower plan times mean longer lock windows.

The right model is to split state by blast radius. Resources that should fail together — because they form a single logical system — share a state file. Everything else is separate. In practice this typically means:

One state per environment (dev, staging, prod) at minimum
One state per major functional boundary within an environment (networking, data tier, application tier)
Separate state for global resources that serve multiple environments (DNS zones, shared ECR registries, cross-account IAM roles)

A concrete illustration of the performance case: a state file with 400 resources typically runs terraform plan in roughly 90–120 seconds when provider API calls are involved. Splitting that into four 100-resource state files reduces each plan to roughly 25–35 seconds — a 3–4x improvement. More importantly, a change to the application tier no longer locks the networking tier, and a failed apply to staging cannot affect production state.
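As a rough sketch of the split described above, each root module carries its own backend block with a distinct key, so each stack gets its own state object and its own lock entry. The directory paths in the comments are hypothetical; the bucket and lock table names are reused from the earlier backend example.

```hcl
# infra/prod/networking/backend.tf (hypothetical path)
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

# infra/prod/application/backend.tf (hypothetical path, a separate root module)
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/application/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

Both stacks can share one bucket and one lock table because the lock is taken per state key, so an apply against the application stack does not block the networking stack.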

Workspaces vs separate root modules

This is a persistent source of confusion. The two mechanisms solve different problems:

Terraform Workspaces

Same configuration code, multiple state files under a single backend path
Appropriate for simple per-environment parameter variance (instance sizes, replica counts)
Workspace name is a runtime selection — easy to accidentally target the wrong workspace
All workspaces share the same backend bucket and DynamoDB table
Does not enforce separate IAM credentials per environment

Separate Root Modules

Separate directory per environment; state stored under separate paths or accounts
Appropriate when environments differ architecturally, not just in scale
Requires explicit directory navigation — harder to target the wrong environment by accident
Can enforce separate AWS accounts, GCP projects, or state buckets per environment
IAM isolation is enforced at the filesystem and credential level

Workspaces vs separate root modules: when to use each. Source: HashiCorp Terraform documentation

The practical rule: use workspaces for simple per-environment parameter variance where topology is identical. Use separate root modules when environments differ architecturally, when you need IAM isolation between environments, or when you manage production from a different cloud account than development. Most teams at production scale use separate root modules for environment isolation and optionally workspaces within a single environment for ephemeral feature-branch stacks.
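For the simple parameter-variance case, a minimal sketch might key a sizing map on terraform.workspace. The workspace names, instance types, replica counts, and the ami_id variable below are all illustrative assumptions, not values from any real setup.

```hcl
variable "ami_id" {
  description = "Hypothetical AMI to launch"
  type        = string
}

locals {
  # Per-workspace sizing; names and values are illustrative.
  sizing = {
    dev     = { instance_type = "t3.small", replicas = 1 }
    staging = { instance_type = "t3.medium", replicas = 2 }
    prod    = { instance_type = "m6i.large", replicas = 3 }
  }

  # Indexing with a workspace name that is not in the map (for example
  # "default") fails at plan time instead of falling back to some size.
  current = local.sizing[terraform.workspace]
}

resource "aws_instance" "app" {
  count         = local.current.replicas
  ami           = var.ami_id
  instance_type = local.current.instance_type
}
```

The topology is identical in every workspace and only the numbers change, which is the situation workspaces handle well.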

Cross-stack references

When state is segmented, stacks still need to reference each other — the application tier needs the VPC ID that the networking tier created. The mechanism is the terraform_remote_state data source:

data "terraform_remote_state" "networking" {
  backend = "s3"
 
  config = {
    bucket = "acme-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}
 
resource "aws_instance" "app" {
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
  # ...
}

The networking root module must export the value as an output. This creates a soft coupling: the application stack depends on the networking stack's output contract, not its internal implementation. An alternative that reduces coupling further is using AWS SSM Parameter Store or GCP Secret Manager as a broker — the networking stack writes IDs to parameters, and the application stack reads them — but terraform_remote_state is simpler for tightly related stacks.
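A sketch of both sides of that contract follows, assuming the networking stack defines a counted aws_subnet.private resource (not shown) and using a hypothetical parameter name for the broker variant.

```hcl
# --- networking root module ---

# Output contract consumed through terraform_remote_state.
output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = aws_subnet.private[*].id # assumes a counted aws_subnet.private
}

# Broker alternative: publish the same IDs to SSM Parameter Store.
resource "aws_ssm_parameter" "private_subnet_ids" {
  name  = "/prod/networking/private_subnet_ids" # hypothetical parameter name
  type  = "StringList"
  value = join(",", aws_subnet.private[*].id)
}

# --- application root module ---

# Reads the parameter without needing any access to the networking state.
data "aws_ssm_parameter" "private_subnet_ids" {
  name = "/prod/networking/private_subnet_ids"
}

locals {
  private_subnet_ids = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
}
```

With the broker variant, the application stack's role needs read access to the parameter only, not to the networking state object.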

Secrets in state: the silent danger

Gartner has projected that through 2027, 99% of cloud security failures will be the customer's fault, primarily through misconfiguration (Gartner, 2023). Terraform state is one of the most common misconfiguration vectors precisely because the risk is invisible: the configuration looks clean, but the state file sitting in an S3 bucket is readable by anyone with the s3:GetObject permission on that key.

Cloud security failure attribution and state hygiene rules:

99%: cloud security failures attributed to customer misconfiguration (Gartner projection through 2027)
None: state files that should live in any git repository (hard rule, no exceptions)
1.10: Terraform version introducing ephemeral values, keeping secrets out of state (HashiCorp, Nov 2024)

Source: Gartner, 2023; HashiCorp Terraform documentation

What commonly ends up in state with plaintext secrets:

aws_db_instance with an inline password attribute — the password is stored in state
tls_private_key resources — the private key material is stored in state
aws_iam_access_key — the secret attribute is stored in state
random_password — the result is stored in state
Kubernetes cluster resources — client certificates and bearer tokens are stored in state

Mitigations, in rough order of importance:

1. Use a secrets manager instead of Terraform for secrets. Let AWS Secrets Manager or HashiCorp Vault generate and store the secret; use a data source to retrieve the ARN or path. Terraform writes only the reference, never the plaintext value.

2. Encrypt the backend with a dedicated KMS key. S3 SSE-KMS means the stored state JSON is encrypted at rest. Use a key specifically created for state buckets — not the default AWS-managed key — so you can audit key usage and have a break-glass rotation path.

3. Lock down read access at the bucket policy level. The state bucket policy should deny s3:GetObject to everyone except the Terraform execution role and an explicit break-glass role. No developer should have direct read access to production state.

4. Keep secrets out of state with ephemeral values — or encrypt state client-side on OpenTofu. HashiCorp Terraform has no native state encryption in any version: Terraform 1.10 (November 2024) instead introduced ephemeral values, which let inputs and outputs marked ephemeral pass through a run without ever being persisted to state or plan files — use them for credentials wherever a provider supports it. If you run OpenTofu, version 1.7 added genuine client-side state encryption with AWS KMS, GCP KMS, or PBKDF2 key providers, encrypting state before it leaves the process — defense in depth even if the backend storage is misconfigured:

terraform {
  encryption {
    key_provider "aws_kms" "main" {
      kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123def456"
      region     = "us-east-1"
      key_spec   = "AES_256"
    }
 
    method "aes_gcm" "default" {
      keys = key_provider.aws_kms.main
    }
 
    state {
      method   = method.aes_gcm.default
      enforced = true
    }
  }
}

Note that enforced = true causes OpenTofu to refuse to read unencrypted state, which is the right setting after migration but should be staged carefully: initialize on a copy first, confirm the encrypted read succeeds, then set enforced. On HashiCorp Terraform, the equivalent posture is backend-level encryption (mitigation 2) plus ephemeral values, or HCP Terraform, which encrypts state at rest on the platform side.
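Mitigations 2 and 3 above could be sketched roughly as follows. This assumes the aws_s3_bucket.terraform_state resource referenced in the versioning example; the two role ARNs are hypothetical placeholders for your execution and break-glass roles.

```hcl
# Dedicated key for state buckets (mitigation 2).
resource "aws_kms_key" "state" {
  description         = "Terraform state encryption"
  enable_key_rotation = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.state.arn
    }
  }
}

# Deny object reads to every principal except two roles (mitigation 3).
data "aws_iam_policy_document" "state_read_lockdown" {
  statement {
    sid       = "DenyStateReadExceptApprovedRoles"
    effect    = "Deny"
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.terraform_state.arn}/*"]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "ArnNotEquals"
      variable = "aws:PrincipalArn"
      values = [
        "arn:aws:iam::123456789012:role/terraform-execution", # hypothetical
        "arn:aws:iam::123456789012:role/state-break-glass",   # hypothetical
      ]
    }
  }
}

resource "aws_s3_bucket_policy" "state" {
  bucket = aws_s3_bucket.terraform_state.id
  policy = data.aws_iam_policy_document.state_read_lockdown.json
}
```

An explicit Deny overrides any Allow granted elsewhere, so a wrong ARN here locks out legitimate readers. Try the policy on a non-production bucket first.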

State is not a byproduct of infrastructure — it is infrastructure, and it deserves the same access controls as the resources it describes.

— ClimsTech Engineering

Drift detection: closing the loop

Infrastructure drift is the gap between what Terraform's state says exists and what the cloud provider actually has running. Drift happens constantly in real operations: an on-call engineer makes an emergency security group change at 2am, a developer resizes an instance through the console, an autoscaling group changes instance counts, a managed certificate auto-renews with a new ARN.

Drift has two failure modes. The visible one: the next terraform apply reverts the emergency change, potentially taking down a service that the change was preserving. The invisible one: unreviewed changes accumulate silently until no one knows what the actual production configuration is — and the next apply is a surprise.

Drift detection workflow for production state

1. Scheduled plan. Run terraform plan -detailed-exitcode on a cron schedule (daily minimum, hourly for high-change production environments) in a read-only CI job with no apply permissions.

2. Exit code check. Exit code 0 means no changes. Exit code 1 means an error in the plan itself. Exit code 2 means changes are present (drift detected). Alert only on codes 1 and 2.

3. Triage. Determine whether drift is intentional (an emergency change that should be codified) or accidental (an undocumented change that should be reverted). Never auto-apply without human review.

4. Remediate. Codify intentional changes: write the config, commit, and apply. Revert accidental changes: update the resource to the desired state via terraform apply targeting the specific resource.

5. Close the loop. After remediation, run plan again and confirm exit code 0. Update runbooks if an emergency procedure was the source of drift, so the next on-call engineer handles it correctly in code.

Source: ClimsTech Engineering practice

A production-ready GitHub Actions job for scheduled drift detection:

name: Drift Detection
 
on:
  schedule:
    - cron: "0 6 * * *"   # 06:00 UTC daily
  workflow_dispatch:
 
jobs:
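  drift:
    runs-on: ubuntu-latest
    environment: production-read-only
    permissions:
      id-token: write   # needed to assume the AWS role via OIDC
      contents: read    # needed by actions/checkout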
 
    steps:
      - uses: actions/checkout@v4
 
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.x"
          terraform_wrapper: false   # run the real binary so $? is terraform's own exit code
 
      - name: Configure AWS credentials (read-only role)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-drift-reader
          aws-region: us-east-1
 
      - name: Terraform init
        run: terraform init -input=false
        working-directory: infra/prod/networking
 
      - name: Terraform plan (drift check)
        id: plan
        run: |
          set +e
          terraform plan \
            -detailed-exitcode \
            -out=plan.tfplan \
            -input=false
          PLAN_EXIT=$?
          echo "exitcode=$PLAN_EXIT" >> $GITHUB_OUTPUT
          exit $PLAN_EXIT
        working-directory: infra/prod/networking
        continue-on-error: true
 
      - name: Alert on drift
        if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK_PROD_ALERTS }}
          webhook-type: incoming-webhook
          payload: |
            {"text": "Drift detected in prod/networking. Review CI run ${{ github.run_id }} before next apply."}
 
      - name: Fail on plan error
        if: steps.plan.outputs.exitcode == '1'
        run: exit 1

The set +e before the plan and the explicit exit code capture are necessary because GitHub Actions runs bash steps with -e by default, so exit code 2 would terminate the script before the output is recorded. The critical detail is the read-only IAM role: the drift job needs read access to state and to the resources it refreshes (plus write access to the lock table unless you pass -lock=false), and it should hold no permission to modify the underlying resources. If it does, a compromised workflow or an accidental trigger can revert legitimate emergency changes without human review.

State surgery: the operations you will need

At some point you will need to directly manipulate what Terraform tracks. These are the operations you should know before you need them in an incident.

terraform import

Use when a resource was created outside Terraform and you want to bring it under management. Since Terraform 1.5, the preferred form is the codified import block:

import {
  to = aws_s3_bucket.existing_logs
  id = "acme-application-logs-prod"
}

Terraform 1.5 also introduced -generate-config-out, which drafts the resource configuration from the attributes of the resource being imported:

terraform plan -generate-config-out=generated_imports.tf

Review the generated configuration carefully — it often includes computed attributes that should not be in config, and default values that differ from what you actually want to manage. After importing, always run terraform plan and confirm it shows zero changes before committing the configuration. A plan that shows changes after import means your configuration does not match the actual resource state, and the first apply may modify or recreate something unintentionally.
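If you would rather write the configuration by hand than generate it, the import block needs a matching resource block at the to address. A minimal sketch, reusing the bucket from the import example (any further arguments depend on how the real bucket is configured):

```hcl
import {
  to = aws_s3_bucket.existing_logs
  id = "acme-application-logs-prod"
}

# Hand-written target for the import block above.
resource "aws_s3_bucket" "existing_logs" {
  bucket = "acme-application-logs-prod"

  # Assumption: add tags and other arguments here until the plan
  # reports the import with no further changes.
}
```

Once the import has been applied and the plan is clean, the import block can be deleted; the resource block stays.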

The classic import command still works and remains useful for one-off imports, or on Terraform versions older than 1.5 that lack import blocks:

terraform import aws_security_group.app sg-0abc123def456789a

terraform state mv

Use when you rename a resource, move it into a module, or extract it from a module. Without state mv, Terraform interprets a rename as destroy-and-create:

# Rename a resource
terraform state mv aws_instance.web aws_instance.app_server
 
# Move a resource into a module
terraform state mv aws_instance.web module.app.aws_instance.web
 
# Move a resource out of a module
terraform state mv module.app.aws_instance.web aws_instance.web

After every state mv, run terraform plan and confirm it shows zero changes (no creates, no destroys). If the plan shows a destroy/create pair for the resource you moved, the address in state mv did not match the configuration address exactly.
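On Terraform 1.1 and later, the same moves can be recorded declaratively with moved blocks, which go through code review and show up in the plan instead of being run by hand. A sketch, where aws_instance.worker is a hypothetical second resource:

```hcl
# Rename: the resource block is now declared as "aws_instance" "app_server".
moved {
  from = aws_instance.web
  to   = aws_instance.app_server
}

# Move into a module: the resource block now lives inside module "app".
moved {
  from = aws_instance.worker
  to   = module.app.aws_instance.worker
}
```

The plan should report each address as moved, with no create or destroy. The same zero-change check applies as for state mv.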

terraform state rm

Use when you want Terraform to stop managing a resource without destroying it — typically when migrating a resource to a different root module, or intentionally orphaning a resource for manual management:

terraform state rm aws_s3_bucket.old_logs

After state rm, the resource still exists in the cloud. The next plan will not show it. If the resource block is still in the configuration, Terraform will attempt to create a duplicate. Remove or comment out the configuration block immediately after the state rm, or import into the destination root module before running any plans there.
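Terraform 1.7 and later offer a declarative counterpart, the removed block. A minimal sketch for the bucket above, assuming its resource block has already been deleted from the configuration:

```hcl
# Tells Terraform to forget the bucket on the next apply
# while leaving the real bucket in place.
removed {
  from = aws_s3_bucket.old_logs

  lifecycle {
    destroy = false
  }
}
```

Because the change goes through plan and apply, reviewers see it before it takes effect, and there is no window in which the configuration still declares a resource that state no longer tracks.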

terraform state pull and push

For backend migrations and emergency state repairs:

# Download current state for inspection or as a migration artifact
terraform state pull > backup-$(date +%Y%m%d-%H%M%S).tfstate
 
# Restore a previous version (use with extreme care)
terraform state push backup-20241115-142208.tfstate

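By default, state push refuses to write when the lineage differs or when the remote state has a higher serial than the file being pushed, so restoring an older backup requires -force. The -force flag bypasses both checks, which means it can overwrite newer state with older state. This is the sharpest edge in Terraform state management. Use it only when you have confirmed the backup is the correct version and the current state is the one with the problem.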

Backend migration

When moving state between backends — local to S3, one bucket to another, S3 to HCP Terraform — the procedure is:

1. Update the backend block in configuration to the new target

2. Run terraform init -migrate-state; Terraform prompts to copy state to the new backend

3. Confirm yes

4. Run terraform plan immediately; it should show zero changes

5. Delete the old backend state only after confirming the new location is correct and versioned

# After updating the backend block
terraform init -migrate-state
 
# Verify no state was lost
terraform plan

The critical mistake to avoid: running terraform init -reconfigure instead of -migrate-state. The -reconfigure flag initializes the new backend without copying state, leaving you with an empty state that Terraform interprets as "all resources need to be created." Always take a terraform state pull backup before any migration.

Production pitfalls and their fixes

These are the failures that appear most frequently in post-mortems.

The wrong-workspace apply

A developer selects the prod workspace to check something, forgets to switch back, and runs terraform apply from a dev context targeting production state. With separate root modules this is harder because it requires navigating to the wrong directory and having the correct credentials — two barriers instead of zero.

Fix: For workspace-based setups, add a guard using a locals check that fails if the workspace name does not match an expected value, and fail fast in CI if the workspace does not match the branch. For production, use separate AWS accounts or GCP projects with separate credential sets. An IAM role that is available only in CI and not on developer laptops for production state is a structural guard that cannot be bypassed by accident.
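One possible shape for that guard, assuming Terraform 1.4 or later (for terraform_data) and a hypothetical environment variable that each tfvars file or CI job sets explicitly:

```hcl
variable "environment" {
  description = "Environment this run is meant to target (hypothetical; set per tfvars file or CI job)"
  type        = string
}

locals {
  workspace_matches = terraform.workspace == var.environment
}

# Carries no infrastructure; exists only to fail the plan on a mismatch.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = local.workspace_matches
      error_message = "Selected workspace '${terraform.workspace}' does not match var.environment '${var.environment}'."
    }
  }
}
```

Running with dev.tfvars while the prod workspace is selected then stops at plan time. This is a convenience check rather than a security boundary, so the credential separation described above still matters.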

The state file in git

It happens more often than post-mortems admit. A developer initializes Terraform locally and commits terraform.tfstate to the repository. The state file is now in history and may contain secrets. The fix has three parts: remove it from history with git filter-repo (not filter-branch), rotate any credentials that were exposed, and set up a remote backend before the second commit. Prevention is a .gitignore that includes *.tfstate, *.tfstate.backup, and .terraform/ added at project initialization, before any state file ever exists.

The silent destroy from count or for_each changes

Changing count = 3 to count = 2 destroys the resource at index 2. Changing a for_each map to remove a key destroys that resource. Terraform will show this in the plan, but if engineers have fallen into the habit of running apply -auto-approve without reading the plan output carefully, a non-trivial resource vanishes.
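A small illustration of the for_each case, with a hypothetical set of queue names:

```hcl
variable "queue_names" {
  description = "Hypothetical set of queue names"
  type        = set(string)
  default     = ["orders", "billing", "audit"]
}

resource "aws_sqs_queue" "by_name" {
  for_each = var.queue_names
  name     = "acme-${each.key}"
}
```

Removing "audit" from the set makes the next plan propose destroying aws_sqs_queue.by_name["audit"] and nothing else. The plan states this clearly, but only to someone who reads it.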

Fix: use prevent_destroy = true in the lifecycle block of every stateful production resource:

resource "aws_db_instance" "primary" {
  identifier        = "acme-prod-primary"
  engine            = "postgres"
  engine_version    = "16.3"
  instance_class    = "db.r7g.2xlarge"
  allocated_storage = 500
 
  lifecycle {
    prevent_destroy = true
    ignore_changes  = [engine_version]
  }
}

prevent_destroy = true causes terraform plan to fail with an explicit error if any planned operation would destroy the resource. It does not prevent all destroy paths — you can remove the lifecycle block and apply — but it prevents accidents where a plan is applied without reading it.

The partial apply failure

An apply fails mid-run: a provider API rate limit, a network timeout, a provider bug. Terraform has created or modified some resources and not others. The state file reflects only what completed. Subsequent plans will show the remaining changes, which is usually correct behavior. But if the partial apply left a resource in an intermediate state — a load balancer with rules but without the target group the rules reference — the state may not fully capture the inconsistency.

Fix: run terraform plan immediately after any failed apply and read the output carefully before proceeding. Do not fix the partial state by making manual console changes; import any changes or let Terraform reconcile them. Use terraform state list to verify the resource inventory against what you expect:

terraform state list | grep aws_lb

Cross-reference the output against the cloud provider console to identify anything Terraform created but did not record, or recorded but did not finish configuring.

The sensitive output that is not actually secret

A Terraform output is marked sensitive = true. It still appears in state in plaintext — the sensitive marking only suppresses it from CLI display and plan output in terminal. Anyone with state read access sees the value.

Fix: treat sensitive = true on outputs as a display guard, not a secrets boundary. For long-lived credentials or key material, use a secrets manager as the authoritative store and put only the ARN or path in Terraform outputs. Terraform outputs are appropriate for non-sensitive infrastructure metadata: VPC IDs, subnet CIDR blocks, endpoint hostnames without embedded credentials.
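As one way to apply this to a database, a reasonably recent AWS provider lets RDS generate the master password and keep it in Secrets Manager, so Terraform handles only the secret's ARN. A sketch under that assumption, with a hypothetical username:

```hcl
resource "aws_db_instance" "primary" {
  identifier        = "acme-prod-primary"
  engine            = "postgres"
  instance_class    = "db.r7g.2xlarge"
  allocated_storage = 500
  username          = "app_admin" # hypothetical

  # RDS generates the master password and stores it in Secrets Manager,
  # so no password value passes through Terraform.
  manage_master_user_password = true
}

# Only the reference is exported; consumers fetch the value at runtime.
output "db_master_secret_arn" {
  description = "ARN of the Secrets Manager secret holding the master credentials"
  value       = aws_db_instance.primary.master_user_secret[0].secret_arn
}
```

Read access to the secret is then governed by Secrets Manager and KMS policies rather than by who can read state.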

What to remember

Local state is not viable for team or production use. Pick a remote backend — S3+DynamoDB, GCS, AzureRM, or HCP Terraform — and migrate before the second engineer touches the codebase.
The DynamoDB partition key must be named exactly LockID (type String). Any other key name or type makes lock operations fail with opaque errors.
Split state by blast radius: networking, data tier, application tier, and global resources each get their own state file. A change to one tier should not lock or risk another.
State contains secrets in plaintext by default. Encrypt the backend with a dedicated KMS key, restrict read access at the bucket policy level, and never commit the state file to git.
Drift detection is a daily scheduled job in CI, not a periodic manual check. Alert on exit code 2 from terraform plan -detailed-exitcode and require human review before any remediation apply.
Use prevent_destroy = true on every stateful production resource. It will catch at least one accidental destroy per year.
Know terraform state mv, rm, pull, and push before you need them in an incident. Practice on non-production state. After any state mv or import, always run plan and confirm zero changes.
Backend migrations require -migrate-state, not -reconfigure. Always take a terraform state pull backup before migrating.