The fastest way to lose confidence in a Kubernetes cluster is to stop knowing what is actually running in it. Someone applied a hotfix directly at 2 a.m., the manifest in git is now out of date, and the next planned deploy quietly reverts the patch — or worse, carries the misconfiguration forward into the next environment. GitOps fixes this by enforcing one rule: git is the desired state of the cluster, and a controller reconciles the cluster to git continuously. No human hands between commit and live state.
That principle sounds obvious. Most teams agree with it in a slide deck. Almost none have wired it up completely — through secrets management, environment promotion, autoscaler conflicts, and database migration timing. This post is about what that wiring looks like, where it breaks, and how to measure whether you actually bought what GitOps promises.
What GitOps actually is (and isn't)
The term was coined by Alexis Richardson at Weaveworks in 2017, but the OpenGitOps project (a CNCF sandbox project, spec at opengitops.dev) formalized four principles worth holding to precisely:
Declarative. The desired system state is expressed declaratively: what you want, not a sequence of commands to run. Kubernetes YAML, Helm values, Kustomize overlays — these qualify. Shell scripts run by a pipeline do not.
Versioned and immutable. Desired state is stored in a version control system that enforces immutability and retains a complete history. Git satisfies this. A deployment artifact store alone does not.
Pulled automatically. Software agents continuously pull desired state declarations and apply them to the system. The cluster reaches out to git; git does not push into the cluster.
Continuously reconciled. Agents do not reconcile on a schedule you define and then stop. They observe actual state, compare it to desired state, and act to close any gap — on a continuous loop, not a one-shot pipe.
Notice what is absent from that definition: there is nothing about CI pipelines, nothing about Helm or Kustomize, and nothing about Argo CD or Flux by name. Those are implementation choices. GitOps is the contract — the others fulfill it or don't.
What GitOps is not:
It is not "put your manifests in git." Many teams have YAML files in a repository but still run kubectl apply from a CI runner on every merge. This satisfies the declarative and versioned principles but violates the pull principle entirely. The cluster is still trusting an external caller.
It is not a replacement for CI. Building, testing, and publishing images is still CI's job. GitOps governs what version runs in the cluster, not how that version was produced. Conflating the two leads to CI pipelines that do too much and GitOps controllers that know too little.
It is not magic drift prevention on its own. If your organization maintains break-glass kubectl access and uses it regularly, GitOps cannot protect you. Drift prevention requires holding the operational contract with the same rigor you hold the technical one.
The pull model: why the inversion matters
The conventional push-based CI/CD model works like this: a pipeline builds an image, runs tests, then calls kubectl apply or helm upgrade directly against the cluster. For this to work, the CI runner must hold cluster credentials — typically a kubeconfig or a service account token with namespace-wide apply permissions. The cluster trusts an external caller.
The GitOps pull model inverts this. A controller running inside the cluster watches a git repository, computes the diff between declared desired state and live state, and applies changes from inside the cluster boundary. The CI runner only needs write access to the git repository. Cluster credentials never leave the cluster perimeter.
Push-based CI/CD
- CI runner holds kubeconfig or service account token with cluster-apply permissions
- Blast radius of a compromised CI token is the entire cluster or namespace
- Audit trail lives in pipeline logs — may rotate, inaccessible post-incident, not linked to git history
- Rollback requires triggering a new pipeline run and waiting for full CI execution (8–15 min typical)
- Cluster state diverges silently if someone applies outside the pipeline
GitOps pull model
- Controller runs inside the cluster; CI only needs git write access
- Compromise of the CI system cannot directly modify the cluster
- Audit trail is the git commit history — permanent, searchable, linked to reviewed PRs
- Rollback is git revert; controller reconciles in under 2 minutes without pipeline involvement
- Continuous reconciliation detects and alerts on drift — or corrects it automatically
One thing that surprises teams: pull-based GitOps is not meaningfully slower than push. Argo CD polls git on a configurable interval (default: 3 minutes) but can be triggered immediately via a webhook from your CI system after it updates the manifest repository. The path from "image pushed to registry" to "pod replaced in cluster" is typically under 2 minutes when webhook triggering is active — comparable to a push-based pipeline that also has to wait for RBAC token validation and API server network round-trips.
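As a rough sketch of the polling knob mentioned above: Argo CD reads its repository polling interval from the timeout.reconciliation key in the argocd-cm ConfigMap. The 60s value is only an illustration, and the argocd namespace assumes a default installation. Defaults and jitter behavior have varied between releases, so check the documentation for the version you run.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd          # assumes the default install namespace
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # How often Argo CD re-checks git when no webhook arrives.
  # 60s is an illustrative value, not a recommendation.
  timeout.reconciliation: 60s
```

With a webhook configured on the git host, the poll interval becomes a fallback rather than the main trigger, so shortening it mostly matters when webhooks cannot reach the cluster.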
Controller mechanics: Argo CD and Flux in practice
Both Argo CD and Flux implement the reconciliation loop, but they expose meaningfully different mental models.
Argo CD structures everything around an Application resource (or ApplicationSet for fleets) that binds a git path to a cluster namespace. It exposes a web UI with diff visualization and sync status, supports multi-cluster deployments from a single control plane, and has strong multi-tenant RBAC. According to the CNCF 2025 end-user survey, Argo CD is deployed in roughly 60% of Kubernetes clusters used for application delivery — it has effectively become the default choice.
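To make the "git path bound to a cluster namespace" idea concrete, here is a minimal Application sketch. The repository URL reuses the example repository from elsewhere in this post; the services/backend path, the backend namespace, and the default project are assumptions chosen for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: backend
  namespace: argocd                      # assumes Argo CD runs in "argocd"
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests
    targetRevision: main
    path: services/backend               # hypothetical path in the repo
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: backend
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from git
      selfHeal: true   # revert manual changes to managed resources
```

Leaving out the automated block gives you drift detection without automatic correction, which some teams prefer while they build trust in the controller.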
Flux v2 is more composable and controller-per-concern: the source controller watches git through a GitRepository resource, the kustomize controller applies manifests through a Kustomization resource, and an ImageRepository plus ImagePolicy pair (together with an ImageUpdateAutomation that commits the change back to git) handles automated image tag updates. The separation of concerns is cleaner and the API design is more idiomatic Kubernetes. Flux's primary sponsor, Weaveworks, ceased operations in early 2024, though the project continues under CNCF stewardship with an active maintainer community. Teams doing fresh evaluations often ask about continuity risk; the CNCF graduation status and active contributor base are the relevant mitigations.
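The image-scanning half of that design can be sketched as follows. The registry path and the semver range are illustrative assumptions, and the API version shown is the v1beta2 one; newer Flux releases may serve these kinds under a later version, so confirm against your installed CRDs.

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: backend
  namespace: flux-system
spec:
  image: registry.example.com/backend   # hypothetical registry path
  interval: 5m                          # how often to scan for new tags
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: backend
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: backend
  policy:
    semver:
      range: ">=1.4.0 <2.0.0"           # illustrative: latest 1.x at or above 1.4.0
```

On their own these two resources only select a tag. Writing that tag back into the manifest repository needs a separate ImageUpdateAutomation resource.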
A minimal Flux v2 setup for a single application:
# clusters/prod/sources/backend-repo.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: backend-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example/k8s-manifests
  ref:
    branch: main
---
# clusters/prod/apps/backend.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: backend
  namespace: flux-system
spec:
  interval: 5m
  path: ./services/backend
  prune: true
  sourceRef:
    kind: GitRepository
    name: backend-config
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: backend
      namespace: backend
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars

Two fields deserve attention. prune: true means that resources Flux applied and that are later removed from git are deleted from the cluster. This is the drift correction you want, but it also means that deleting or renaming a manifest in git removes the live object, so set it deliberately. Objects created by hand were never in Flux's inventory and are not pruned, so they remain as untracked drift. The healthChecks block causes Flux to wait for the Deployment rollout to complete before marking the Kustomization ready. A bad image tag then surfaces as a failed reconciliation and an alert rather than silently leaving pods in a crash loop.
The reconciliation sequence for any change looks like this:
1. PR merged to manifest repo. A reviewed commit declaring the new desired state lands on the target branch. This is the single authorization gate before the change reaches the cluster; no separate deploy approval is needed.
2. Controller notified. A git webhook fires to the controller, or the controller detects the change on its next poll interval. Webhook-triggered reconciliation begins within seconds of merge.
3. Diff computed. The controller fetches the latest commit, renders manifests through Kustomize or Helm, and computes the diff against live cluster state via the Kubernetes API. Only changed resources are identified.
4. Delta applied. Only the changed resources are submitted to the Kubernetes API server, which validates and applies them. Unchanged resources are untouched, reducing unnecessary rollout churn.
5. Health check and status. The controller monitors rollout health (pod readiness, custom health expressions) and reports sync status. A failed rollout surfaces here as a degraded Application, not silently in a crash loop.
Source: Argo CD / Flux v2 controller architecture
Argo CD adds one capability worth calling out explicitly: sync waves. Resources annotated with argocd.argoproj.io/sync-wave: "-1" apply before resources in wave 0. This lets you sequence a database migration Job before the application Deployment within a single sync. Most push-based pipelines approximate this with sleep timers or multi-stage pipelines. Sync waves make the ordering explicit, version-controlled, and testable.
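A minimal sketch of that ordering, assuming a hypothetical ./migrate binary inside the application image and illustrative resource names:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # applied before wave 0
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/backend:1.4.2
          command: ["./migrate", "up"]   # hypothetical migration entrypoint
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  annotations:
    argocd.argoproj.io/sync-wave: "0"    # applied after wave -1 is healthy
spec:
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: registry.example.com/backend:1.4.2
```

Argo CD waits for the resources in a wave to become healthy before moving to the next one, so the Deployment is only applied once the Job has completed. A Job's pod template is immutable after creation, so a plain Job that changes between releases is usually paired with a hook annotation or a fresh name.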
Structuring your repositories
The most consequential early decision in a GitOps adoption is repository structure. Get it wrong and every environment promotion becomes a painful manual operation or a fragile script.
| Model | Description | Best fit | Key tradeoffs |
|---|---|---|---|
| App-of-apps (Argo CD) | Root Application generates child Applications from a directory tree | Single team, 5–30 services | Tight coupling between infra and app config |
| Per-team config repos | Each team owns a manifest repo; platform team owns cluster-level config | Multi-team org, 50+ services | Coordination overhead for cross-team dependencies |
| Fleet repo (Flux) | Single repo, environment directories, Flux Kustomizations per env | Platform engineering teams | Merge conflicts at scale without strict CODEOWNERS |
| Monorepo with Kustomize overlays | base/ and per-environment overlays/, image tags pinned per overlay | Small org wanting simplicity | Environment drift if overlays diverge from base |
For most teams under 30 engineers, a single manifest repo with environment directories and Kustomize overlays is the right default. The full audit trail lives in one repository, environment promotion is a PR that updates the image tag in the production overlay, and you avoid the cross-repo webhook complexity that multiplies with the per-team model.
Environment promotion via image tag update looks like this:
# After staging has validated backend:1.4.2 for 24 hours without incidents
cd clusters/prod/backend
# Update the production overlay to pin the validated image tag
kustomize edit set image backend=registry.example.com/backend:1.4.2
git add kustomization.yaml
git commit -m "promote backend:1.4.2 to prod (staged 2025-12-17, no P1 incidents)"
git push origin main
# GitOps controller reconciles; no pipeline run needed

That commit message is the deployment record. There is no pipeline system to query and no logs to correlate: git log --oneline -- clusters/prod/backend/ answers "what is deployed and when was it last changed" for any service at any point in time.
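For reference, the kustomize edit set image command above rewrites the images block of the overlay's kustomization.yaml. A sketch of the resulting file, where the resources path is a hypothetical location for the shared base:

```yaml
# clusters/prod/backend/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/backend        # hypothetical path to the shared base
images:
  - name: backend                # image name as written in the base manifests
    newName: registry.example.com/backend
    newTag: 1.4.2                # the only line a promotion PR changes
```

Because the promotion touches a single line, the PR diff is easy to review and the revert is equally small.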
For large organizations managing hundreds of services, Argo CD's ApplicationSet with the git-directory generator removes per-service Application definitions entirely:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: production-services
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/k8s-manifests
        revision: HEAD
        directories:
          - path: clusters/prod/services/*
  template:
    metadata:
      name: "{{path.basename}}"
    spec:
      project: production
      source:
        repoURL: https://github.com/example/k8s-manifests
        targetRevision: HEAD
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

Adding a new service is adding a directory. Removing a service is removing a directory and merging the deletion PR. Argo CD handles the Application lifecycle. This scales to hundreds of services without any ApplicationSet changes.
Secrets: where most GitOps rollouts stall
Git is not a secret store. But Kubernetes Secrets need to exist in the cluster, and the GitOps model says the cluster's desired state lives in git. This tension is where many adoptions stall, and the wrong answer — committing plaintext secrets "temporarily" — creates security incidents that outlast the temporary.
Three credible approaches, each with genuine tradeoffs:
Bitnami Sealed Secrets. A controller in the cluster holds an asymmetric keypair. You encrypt with kubeseal using the controller's public key; the resulting SealedSecret manifest is safe to commit because it is useless without the private key inside the cluster. The controller decrypts at sync time and materializes a Kubernetes Secret. The limitation: the controller's private keys are the only way to decrypt. The controller renews its sealing key periodically but keeps the old keys, so existing sealed secrets continue to decrypt. If those keys are lost, for example in a cluster rebuild without a backup, every sealed secret in the repository must be re-encrypted and re-committed. Key backup is entirely your responsibility. This is a reasonable choice for teams that need zero external dependencies and accept the key management overhead.
# Seal a secret for the production cluster
kubectl create secret generic db-credentials \
  --from-literal=password=hunter2 \
  --dry-run=client -o yaml | \
  kubeseal \
    --controller-namespace sealed-secrets \
    --controller-name sealed-secrets-controller \
    --format yaml > clusters/prod/backend/db-credentials-sealed.yaml
# This file is safe to commit: the encrypted blob is useless without the controller's private key
git add clusters/prod/backend/db-credentials-sealed.yaml
git commit -m "add sealed db credentials for backend (prod)"

External Secrets Operator (ESO) with AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault. You commit an ExternalSecret resource that declares what secret to fetch from the external store, not the value itself. ESO polls the external store on a configured interval and materializes a Kubernetes Secret. The secret value never touches git. The limitation: you now have an external dependency. The secrets manager must be reachable whenever ESO creates or refreshes the Secret, and IAM policies or Vault policies become a second control plane to maintain. For most production workloads, this is the right tradeoff: secret values inherit the access controls, versioning, rotation, and audit logging of the secrets manager, and rotations in the external store propagate to the cluster without a new commit.
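A sketch of what the committed ExternalSecret might look like. The store name, the remote key path, and the property are hypothetical, and the API version shown is v1beta1; newer ESO releases may serve the same kind under a later version.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: backend
spec:
  refreshInterval: 1h              # how often ESO re-reads the external store
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # hypothetical store, defined separately
  target:
    name: db-credentials           # name of the Kubernetes Secret to create
  data:
    - secretKey: password          # key inside the Kubernetes Secret
      remoteRef:
        key: prod/backend/db       # hypothetical path in the secrets manager
        property: password         # field within that remote secret
```

Nothing in this file is sensitive, so it can be reviewed and merged like any other manifest.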
Mozilla SOPS with age or KMS. SOPS encrypts specific values within a YAML file while leaving keys in plaintext, which preserves diff reviewability. Flux natively understands SOPS-encrypted files and can decrypt them during reconciliation using an age private key stored as a bootstrap cluster secret or a reference to a cloud KMS key. The limitation: the decryption key must be present in the cluster before reconciliation can begin.
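On the Flux side, decryption is switched on per Kustomization. A minimal sketch, assuming the age private key has already been stored in a Secret named sops-age in the flux-system namespace (both names are illustrative) and reusing the backend Kustomization from earlier:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: backend
  namespace: flux-system
spec:
  interval: 5m
  path: ./services/backend
  prune: true
  sourceRef:
    kind: GitRepository
    name: backend-config
  decryption:
    provider: sops
    secretRef:
      name: sops-age     # assumed Secret holding the age private key
```

When a cloud KMS key is used instead of age, the secretRef is typically replaced by workload identity on the controller, which removes the bootstrap secret at the cost of a cloud IAM dependency.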
A practical rule: ESO backed by a cloud-native secrets manager is the most operationally sound default for production. Sealed Secrets is the right pick for air-gapped or fully self-contained clusters where external dependencies are unacceptable.
Where GitOps fights you
GitOps works cleanly when the cluster's desired state is fully expressible as static manifests that can be applied idempotently. Four patterns in real workloads break that assumption.
Horizontal Pod Autoscaler conflicts with replica counts. If your Deployment manifest specifies replicas: 3 and the HPA scales it to 8 under load, the next GitOps sync will reset it to 3. This is technically correct GitOps behavior — the cluster drifted from declared state — but it is operationally wrong. The fix is to remove the replicas field from the Deployment spec entirely when an HPA governs the same Deployment. Kubernetes will defer to the HPA's value when the field is absent. If removing it is not possible (shared charts, third-party manifests), Argo CD's resource customizations can suppress the field from drift detection:
# In argocd-cm ConfigMap, data section
resource.customizations.ignoreDifferences.apps_Deployment: |
  jsonPointers:
    - /spec/replicas

StatefulSet PVC template mutations. The volumeClaimTemplates field on a StatefulSet is immutable once the StatefulSet exists: the Kubernetes API server will reject any attempt to modify it. If you update it in git, the GitOps controller fails to sync with a validation error that does not obviously point to the cause. Treat StatefulSet storage changes as a manual migration operation: snapshot data, delete the StatefulSet (Kubernetes will not delete the PVCs by default), re-create it with the new template, and restore data. This is intentional, because storage resizing on running stateful workloads is rarely safe to automate.
Database schema migrations. Running ALTER TABLE or equivalent schema changes is a side effect that cannot be reconciled idempotently by a controller inspecting Kubernetes resource state. Most teams handle this with a Kubernetes Job as a pre-sync hook. In Argo CD:
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: registry.example.com/backend:1.4.2
          # $(DATABASE_URL) is expanded from the container environment,
          # so the db-credentials Secret must provide a DATABASE_URL key
          command: ["./migrate", "--url", "$(DATABASE_URL)", "up"]
          envFrom:
            - secretRef:
                name: db-credentials
      restartPolicy: Never
  backoffLimit: 0

The PreSync annotation runs the Job before the Deployment rolls. HookSucceeded deletes the Job after success to prevent accumulation. The migration tool must be genuinely idempotent. Tools like Flyway and Liquibase track applied versions in a schema version table, so re-runs are safe. If you use raw SQL scripts without version tracking, re-runs will fail or corrupt data. Set backoffLimit: 0 to ensure a failed migration halts the sync immediately rather than retrying and potentially making schema state worse.
Custom Resource lifecycle operations with stateful operators. Some operators — PostgreSQL operators, Kafka operators, Redis cluster operators — trigger internal state machines on CR deletion. If a PostgreSQL CR disappears from git and prune: true is set, the controller will delete the CR, and the operator may interpret that as a drop-database instruction. This is not a theoretical edge case.
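One way to guard against this is to exempt such resources from pruning with a per-resource annotation. The sketch below uses a CloudNativePG Cluster purely as an example of an operator-managed database; the name, namespace, and sizes are hypothetical. In practice you would set only the annotation for the controller you run.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: orders-db                  # hypothetical database cluster
  namespace: backend
  annotations:
    # Argo CD: never prune this resource, even if it disappears from git
    argocd.argoproj.io/sync-options: Prune=false
    # Flux: exclude this resource from garbage collection
    kustomize.toolkit.fluxcd.io/prune: disabled
spec:
  instances: 3
  storage:
    size: 20Gi
```

With this in place, removing the manifest from git leaves the live resource untouched and out of sync, so decommissioning a database becomes an explicit manual step. What the operator does on deletion still depends on the operator, so read its documentation before relying on either behavior.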
Measuring what you actually bought
Adopting GitOps is not valuable in itself. The value is measurable improvement in DORA's four key delivery metrics, which the DORA research program has tracked across tens of thousands of engineering organizations since 2014. Here is where GitOps makes the most direct contribution.
| Metric | Elite benchmark |
|---|---|
| Deploy frequency | On-demand |
| Change lead time | <1 hr |
| MTTR | <1 hr |
| Change failure rate | ~5% |
Source: DORA State of DevOps, 2024
The change failure rate reduction is where GitOps earns its strongest claim. Because every change arrives as a reviewed PR, many misconfigurations are caught before they reach the cluster. Recovery benefits too: when rollback is git revert and the controller reconciles in under 2 minutes, the window between "detect broken deploy" and "cluster restored" shrinks to a fraction of the push-based equivalent. DORA's 2024 report found elite performers achieve MTTR under 1 hour, a benchmark that is hard to reach without a fast, routine rollback path, which GitOps provides.
To put a concrete number on the rollback time difference: with push-based CI/CD, rolling back requires triggering a new pipeline run, waiting for the build and test stages to complete (or locating the previous image tag and skipping them), and re-executing the deploy stage. Typical wall-clock time under normal CI queue conditions is 8 to 15 minutes. With GitOps: commit a git revert, push — the controller detects the change within its poll interval or via webhook and reconciles within 60 to 90 seconds. The cluster is back to the previous state in under 2 minutes, with a clean commit in the git log showing who reverted and why. That gap explains why MTTR improves materially when teams actually use GitOps rollback rather than treating it as a theoretical option.
Deployment frequency improves because the operational risk of each deploy drops. When teams fear deployments — because rollback is unclear, because "what's running in prod?" requires querying multiple systems — they batch changes and deploy infrequently. Remove that friction and deploy frequency moves naturally toward on-demand. DORA's longitudinal data consistently shows frequency and stability improve together in high-performing teams. They are not in tension, despite the intuition that deploying more often should cause more failures.
77% of respondents using GitOps in production and Argo CD appearing in roughly 60% of Kubernetes clusters are not early-adopter figures. They mean teams evaluating Kubernetes-based infrastructure today are joining a practice that has become the default, not pioneering one. The organizational question is no longer whether to adopt GitOps, but how to adopt it completely — working through secrets management, repo structure, and the autoscaler and stateful-operator edge cases described here before a production incident surfaces them instead.