DevOps was designed to remove the wall between development and operations. In many organisations it did something subtler and less helpful: it relocated the wall rather than demolishing it. Instead of a handoff between two departments, every product team now carries a full infrastructure load — Kubernetes, Terraform, CI pipelines, cloud IAM, observability stacks, secrets management, SLO budgets. The expertise diffuses but does not compound; each team re-solves near-identical problems slightly differently, and those differences accumulate as inconsistency, toil, and security debt. Platform engineering is the architectural response to that drift. Build the infrastructure capabilities once, with the same rigour you would apply to a customer-facing product, and let every product team direct their energy toward the work only they can do.
- ~19%: teams at the elite DORA performance tier (DORA 2024)
- 182×: more frequent deploys, elite vs. low performers (DORA 2024)
- 80%: large engineering orgs with platform teams by 2026 (Gartner)
Source: DORA State of DevOps 2024; Gartner, 2023; CNCF Annual Survey 2024
The Infrastructure Tax Every Team Pays
The "you build it, you run it" principle was correct — it restored accountability and eliminated the slow conveyor belt of tickets between dev and ops. The unintended consequence was the surface area each team now needed to understand.
Consider what a backend engineer at a mid-sized cloud company needs to know before deploying a new service today: Docker multi-stage builds, Kubernetes deployment manifests, Helm chart templating, Terraform resource declarations, cloud IAM role chaining, GitHub Actions workflow syntax, artifact registry authentication, environment-specific config injection, Prometheus metric exposition, distributed tracing instrumentation, log format conventions, alert routing rules, and on-call rotation setup. That is not an exaggeration — each of those is a real prerequisite in a standard Kubernetes-on-cloud stack, and most of them require meaningful expertise to configure correctly.
In an organisation with fifteen product teams, this knowledge is not shared; it is duplicated. Each team develops its own Terraform modules, its own CI templates, its own base container images. The variance accumulates: fifteen different tagging strategies break cost allocation; mismatched alert thresholds mean incidents wake the wrong team; divergent base images mean fifteen separate CVE patches when a critical vulnerability lands in a shared layer. None of this is caused by incompetence. It is the predictable result of distributing infrastructure responsibility without also centralising infrastructure investment.
Matthew Skelton and Manuel Pais formalised the cost in Team Topologies (IT Revolution, 2019): cognitive load is finite. Teams that carry more of it than their structure can support deliver less. The fix is not to remove ownership from product teams — it is to shrink the surface area they need to own by absorbing the common infrastructure into a platform that someone else operates well.
The CNCF Annual Survey 2024 (roughly 750 respondents from the cloud-native community) found that only 28% of organisations have a dedicated platform engineering team; 41% use a fragmented multi-team approach and 31% have no formal approach at all. That distribution is the shape of an industry still paying the infrastructure tax in full.
What an Internal Developer Platform Actually Is
The term "internal developer platform" gets overloaded. It is not a developer portal. It is not a ticketing system with a fancier UI. It is not a shared script directory wrapped in a React app. An IDP is the union of tooling, workflows, and documentation that lets a product team provision, deploy, observe, and run a service without needing to understand the full infrastructure stack beneath it.
A useful structural model divides an IDP into five planes:
- Developer plane — the interface layer: a service catalog, self-service scaffolding templates, and a CLI or portal. Product developers interact almost exclusively here.
- Delivery plane — the CI/CD backbone: reusable pipeline templates, artifact registries, promotion gates, and environment progression logic (typically GitOps-based).
- Resource plane — the infrastructure: cloud accounts, VPCs, databases, queues, and storage — provisioned on-demand through abstracted APIs, usually Terraform modules or Crossplane composites.
- Monitoring plane — observability as a service: pre-wired dashboards, alert rules, SLO frameworks, and distributed tracing. A new service should be observable by default on day one, not after a separate request to an observability team.
- Security plane — policy enforcement: RBAC, secrets management, container image scanning, admission controllers (OPA or Kyverno). Security controls are built into the delivery path, not appended after the fact.
The platform does not hide these planes from teams that genuinely need to go deeper. It makes the 80% case require zero manual configuration, and keeps the remaining 20% achievable without a support ticket. That boundary is the design constraint: every decision about what to abstract and what to expose is a product decision, not a technical one.
Golden Paths: Opinionated, Supported, Escapable
Spotify's engineering teams popularised the "golden path" framing — an opinionated, supported route from idea to production that covers the common case well. The critical word is supported. A golden path is not a template you fork once and never see again; it is a living product with tests, versioned releases, a changelog, and a named owner who responds to questions about it.
A well-built golden path for a new backend service does all of the following from a single scaffold command:
- 01 Repository: Creates a GitHub repository with branch protection, code-owner files, a default .gitignore, and the team auto-added as CODEOWNERS.
- 02 CI pipeline: Wires a reusable GitHub Actions workflow (lint, test, container build, SBOM generation, and image push to the organisation registry), with no per-service YAML to write.
- 03 Staging environment: Runs a Terraform workspace to provision a staging namespace with correct IAM bindings, secrets injection, and network policies already applied.
- 04 Observability: Generates a Grafana dashboard pre-populated with the four golden signals (latency, traffic, errors, saturation) and wires alert routing to the team PagerDuty service.
- 05 Service catalog entry: Registers the service in the catalog with owner, tier, dependency graph, runbook link, and on-call rotation, sourced from infrastructure state rather than filled in manually.
The result: a net-new service goes from zero to observable-in-staging in under 15 minutes rather than the two to four days it takes a team assembling the same setup by hand. More importantly, every service produced by the scaffold is structurally identical: same tagging, same log format, same alert topology. When incidents occur, that consistency pays dividends — you do not spend the first 20 minutes of an SEV-2 figuring out where the logs are or why the dashboard is blank.
The escape hatch matters as much as the path itself. A golden path that cannot be overridden is not an abstraction — it is a cage. Teams building services with unusual characteristics (real-time event pipelines, ML inference endpoints, high-throughput data ingestion) need access to the layer beneath the defaults. The platform should make this explicit: document the extension points, support the override through the same delivery pipeline, and track which teams deviate. When enough teams override the same default, that override becomes the next golden path.
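Tracking deviations can be as simple as counting how many distinct teams override each default. The following is a minimal sketch under stated assumptions: the override records, the default names, and the threshold of three teams are all hypothetical, and a real platform would gather this data from repositories or pipeline configuration rather than a hard-coded list.

```python
from collections import defaultdict

# Hypothetical override records: (team, default being overridden).
OVERRIDES = [
    ("payments", "base-image"),
    ("search", "base-image"),
    ("ml-serving", "base-image"),
    ("ml-serving", "autoscaling-policy"),
    ("ingest", "base-image"),
    ("ingest", "autoscaling-policy"),
    ("payments", "base-image"),  # a second payments service, same team
]


def promotion_candidates(overrides, min_teams=3):
    """Return defaults overridden by at least `min_teams` distinct teams."""
    teams_by_default = defaultdict(set)
    for team, default in overrides:
        teams_by_default[default].add(team)  # sets ignore repeat overrides per team
    return {
        default: sorted(teams)
        for default, teams in teams_by_default.items()
        if len(teams) >= min_teams
    }


candidates = promotion_candidates(OVERRIDES)
for default, teams in candidates.items():
    print(f"{default}: overridden by {len(teams)} teams ({', '.join(teams)})")
```

Counting distinct teams rather than raw overrides keeps one team with many services from looking like organisation-wide demand.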
Treating the Platform as a Product
Failing to treat the platform as a product is the failure mode that buries even well-funded platform engineering efforts. A platform team that builds what it finds architecturally interesting, rather than what its users actually need, will build the wrong thing. A platform team that mandates adoption before the platform is genuinely better than the alternatives will build resentment. Neither failure is technical; both are product failures.
The product model is not complicated, but it requires discipline:
Know your users specifically. Not "developers" in aggregate — identify the stream-aligned teams you serve today, their current tech stack, their deployment frequency, and the top three sources of friction they report in retros or postmortems. Build a roadmap that addresses those three things before adding features the platform team wants.
Measure friction, not output. The platform team's output — features shipped, services onboarded, PRs merged — tells you almost nothing about whether the platform is working. The signal is: time-to-first-deploy for a net-new service; percentage of CI failures attributable to infrastructure rather than application code; number of infrastructure-related tickets filed by product teams per sprint. Those numbers going down means the platform is doing its job.
Earn adoption; do not mandate it. This has empirical backing. DORA 2024 found that teams required to exclusively use internal platforms for all lifecycle tasks saw an 8% decrease in throughput and a 14% decrease in change stability compared to teams that retained autonomy. The interpretation is not that platforms are harmful — it is that mandates substitute for quality. If a team can opt out and chooses not to, the path is genuinely better. If teams opt out anyway, the path has a problem.
Maintain a public roadmap and deprecation policy. Teams building on your platform are making technical bets. Respect that by publishing a roadmap, communicating breaking changes at least 90 days in advance, and running migrations on behalf of consuming teams rather than filing tickets and expecting compliance. A deprecation policy that teams can trust is what makes it safe for them to adopt the platform in the first place.
Set SLOs for the platform itself. An IDP that is flaky does not merely inconvenience the platform team — it blocks every product team simultaneously. Define and publish SLOs for the platform's critical paths: golden path success rate, provisioning P95 latency, portal availability. Treat breaches as production incidents.
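Two of the friction signals named above, the share of CI failures caused by infrastructure and the time-to-first-deploy for new services, are straightforward to compute once the raw events are collected. This is a sketch only: the record types, field names, and the "infrastructure" / "application" labels are assumptions made for illustration, not a schema from any particular CI system.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class CiFailure:
    service: str
    cause: str  # "infrastructure" or "application"; labels are hypothetical


@dataclass
class NewService:
    name: str
    repo_created: datetime
    first_deploy: datetime


def infra_failure_share(failures):
    """Fraction of CI failures attributed to infrastructure, in [0, 1]."""
    if not failures:
        return 0.0
    infra = sum(1 for f in failures if f.cause == "infrastructure")
    return infra / len(failures)


def median_minutes_to_first_deploy(services):
    """Median minutes between repository creation and first deploy."""
    return median(
        (s.first_deploy - s.repo_created).total_seconds() / 60 for s in services
    )


failures = [
    CiFailure("payments-api", "infrastructure"),
    CiFailure("payments-api", "application"),
    CiFailure("search", "application"),
    CiFailure("search", "infrastructure"),
    CiFailure("ingest", "application"),
]
services = [
    NewService("a", datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 12)),
    NewService("b", datetime(2024, 5, 2, 14, 0), datetime(2024, 5, 2, 14, 45)),
    NewService("c", datetime(2024, 5, 3, 10, 30), datetime(2024, 5, 3, 10, 50)),
]

print(f"infra share of CI failures: {infra_failure_share(failures):.0%}")
print(f"median time to first deploy: {median_minutes_to_first_deploy(services)} min")
```

The median is used instead of the mean so that one unusually slow onboarding does not dominate the trend.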
A Reference Stack, Without the Vendor Hype
Platform tooling is a crowded market. The right tooling choice matters less than the architectural decisions those tools implement; a stable set of Makefile targets and Terraform modules beats a sprawling Backstage deployment that nobody uses. That said, certain layers have become durable enough to plan around.
| Layer | Open-source option | Hosted option | Notes |
|---|---|---|---|
| Developer portal | Backstage (CNCF incubating) | Port, Cortex | Backstage setup cost is real; budget 2 engineer-months minimum for a stable instance |
| IaC / resource plane | OpenTofu, Terraform | Pulumi Cloud | OpenTofu is the BSL-safe Linux Foundation fork with identical HCL syntax |
| GitOps promotion | ArgoCD, Flux | Harness, Codefresh | ArgoCD handles multi-cluster promotion well out of the box |
| Secrets management | Vault, External Secrets Operator | AWS Secrets Manager | ESO bridges Vault and cloud-native stores without forcing a choice |
| Observability | Grafana + Mimir + Tempo | Datadog, Honeycomb | OpenTelemetry Collector makes the backend switchable at any point |
| Admission policy | OPA, Kyverno | Styra DAS | Kyverno's Kubernetes-native syntax carries lower operational cost |
A minimal Backstage software template for a Go service demonstrates how little the product team needs to know:

    apiVersion: scaffolder.backstage.io/v1beta3
    kind: Template
    metadata:
      name: go-service
      title: Go microservice
    spec:
      owner: platform-team
      type: service
      # The only inputs a product developer is asked for.
      parameters:
        - title: Service details
          properties:
            name:
              type: string
              title: Service name
              pattern: '^[a-z][a-z0-9-]{2,30}$'
            team:
              type: string
              title: Owning team slug
      steps:
        # Render the skeleton directory with the submitted values.
        - id: fetch-template
          name: Fetch skeleton
          action: fetch:template
          input:
            url: ./skeleton
            values:
              name: ${{ parameters.name }}
              team: ${{ parameters.team }}
        - id: publish
          name: Create GitHub repository
          action: publish:github
          input:
            repoUrl: github.com?repo=${{ parameters.name }}&owner=your-org
            defaultBranch: main
            repoVisibility: internal
        # Not a built-in action: it comes from a separately installed HTTP
        # request scaffolder module, and assumes a tf-api proxy endpoint.
        - id: provision
          name: Provision staging environment
          action: http:backstage:request
          input:
            method: POST
            path: /api/proxy/tf-api/workspaces/${{ parameters.name }}-staging/runs

The Terraform module the provisioner calls exposes four variables to the product team:

    module "service" {
      # Git/GitHub sources are pinned with ?ref=; the `version` argument is only
      # valid for registry sources. `v4` is assumed to be a tag or branch that
      # the platform team advances within the 4.x line.
      source = "github.com/your-org/platform-modules//service?ref=v4"

      name        = "payments-api"
      team        = "payments"
      environment = "staging"
      tier        = "standard" # standard | high-availability | batch
    }

Those four variables are the entire interface. Behind them the module manages VPC attachment, IAM role provisioning, log group creation, CloudWatch alarms, Grafana datasource registration, and the service catalog entry: roughly 600 lines of Terraform the product team never sees. The platform team owns those lines, runs them through a test suite on every PR, and ships upgrades to consuming services without requiring those teams to touch anything. That asymmetry, where the product team sees four variables and the platform team maintains 600 lines, is exactly the leverage model that makes platforms worth building.
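The two constrained inputs in the examples above, the service-name `pattern` in the template and the `tier` values listed in the module call's comment, can be checked before anything is provisioned. The sketch below assumes a hypothetical pre-flight helper, `validate_request`, of the kind a platform CLI might run; it is not part of Backstage or Terraform.

```python
import re

# Same expression as the template's `name` pattern; fullmatch() makes the
# ^ and $ anchors unnecessary.
SERVICE_NAME = re.compile(r"[a-z][a-z0-9-]{2,30}")
# Tier values taken from the comment in the module call.
TIERS = {"standard", "high-availability", "batch"}


def validate_request(name, tier):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if not SERVICE_NAME.fullmatch(name):
        problems.append(f"invalid service name: {name!r}")
    if tier not in TIERS:
        problems.append(f"unknown tier: {tier!r}")
    return problems


print(validate_request("payments-api", "standard"))
print(validate_request("Payments", "gold"))
```

Note that the pattern allows names of 3 to 31 characters: one leading lowercase letter followed by 2 to 30 letters, digits, or hyphens.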
Platform Engineering and the DORA Metrics
DORA 2024 provides the clearest empirical picture available of how software delivery performance varies across the industry. The gap between the top and bottom tiers is not incremental:
Low performers
- Deploy between once per month and once every six months
- Lead time from commit to production: one to six months
- Change failure rate: 46–63%
- Time to restore service: one week to one month
Elite performers
- Deploy on-demand, multiple times per day
- Lead time from commit to production: under one hour
- Change failure rate: under 5%
- Time to restore service: under one hour
Elite performers deploy 182 times more often than low performers and recover from failed deployments roughly 2,300 times faster (DORA 2024). Only approximately 19% of surveyed teams qualified as elite in 2024. More concerning, the high-performance tier shrank from 31% in 2023 to 22% in 2024, while the low tier grew from 17% to 25%. More teams are falling further behind, not converging upward.
Platform engineering's contribution to these metrics runs through three mechanisms. First, a reliable golden path makes deployment a non-event: when the path from merged PR to production is automated and stable, teams stop batching changes out of fear and deploy smaller increments more frequently. Smaller changes fail less. Second, pre-wired observability means the diagnostic signal is available immediately when a change fails — you do not spend 40 minutes determining which log stream to check or why the tracing data is absent. Third, standardised infrastructure means postmortem findings from one incident can be applied across every service that shares the same scaffold; the organisation learns from failures rather than repeating them independently.
DORA 2024 quantified the productivity benefit at 10% improvement in team performance and 8% improvement in individual productivity for teams using an internal developer platform with genuine autonomy. Those are not dramatic numbers in isolation, but they add up: at scale, a 10% improvement across 20 product teams is roughly the output of two additional teams.
Pitfalls That Kill Platform Teams
These failure modes come up in practice repeatedly. Each has a specific fix.
Building the portal before building the platform
Backstage is a React frontend. An IDP is everything behind it. Teams that spend three months deploying a Backstage instance before building reliable Terraform modules ship a UI that integrates nothing, surfaces broken links, and generates tickets from frustrated users. The correct sequence: build the resource plane first, expose it through a CLI, prove it works reliably at the command line, and then add a portal as a better interface to automation that already exists. A portal built on top of working automation is genuinely useful. A portal built in anticipation of automation is a maintenance burden that undermines trust in the platform before it has done anything.
Mandating the platform before it is better than the alternative
If a product team can provision a staging environment faster by writing their own Terraform than by using the self-service flow, they will — and they are right to do so. Mandating the slower path destroys goodwill and, per DORA 2024, measurably degrades delivery performance. The readiness test before mandating: can a developer with no platform expertise use the self-service path successfully on the first attempt, without reading documentation or asking for help? If the answer is no, the path is not ready to mandate.
Treating the service catalog as a manually maintained wiki
A catalog that is hand-curated goes stale within weeks. Every service entry that was accurate six months ago and is now wrong erodes trust in the entire catalog. The fix is to derive catalog entries from infrastructure state: auto-register services from Terraform outputs, sync ownership from GitHub CODEOWNERS files, pull SLO status from Prometheus. The catalog should be a read-only view of actual system state, not a wiki that teams are expected to keep current.
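A catalog entry derived from state rather than typed by hand might be assembled along these lines. This is a simplified sketch: the JSON mimics the shape printed by `terraform output -json`, but the output names (`service_name`, `tier`, `runbook_url`), the URL, and the resulting entry fields are hypothetical, and the CODEOWNERS parsing handles only the catch-all `*` rule.

```python
import json

# Mimics the shape of `terraform output -json`; output names are hypothetical.
TERRAFORM_OUTPUT_JSON = """
{
  "service_name": {"sensitive": false, "type": "string", "value": "payments-api"},
  "tier": {"sensitive": false, "type": "string", "value": "standard"},
  "runbook_url": {"sensitive": false, "type": "string",
                  "value": "https://example.com/runbooks/payments-api"}
}
"""

CODEOWNERS = """\
# Default owners for everything in the repo
*       @your-org/payments
/docs/  @your-org/tech-writers
"""


def default_owners(codeowners_text):
    """Return the owners on the catch-all `*` rule, or [] if there is none."""
    owners = []
    for line in codeowners_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        pattern, *rule_owners = line.split()
        if pattern == "*":
            owners = rule_owners  # a later `*` rule replaces an earlier one
    return owners


def catalog_entry(tf_output_json, codeowners_text):
    """Build a read-only catalog record from Terraform outputs and CODEOWNERS."""
    outputs = {k: v["value"] for k, v in json.loads(tf_output_json).items()}
    return {
        "name": outputs["service_name"],
        "tier": outputs["tier"],
        "runbook": outputs["runbook_url"],
        "owners": default_owners(codeowners_text),
    }


entry = catalog_entry(TERRAFORM_OUTPUT_JSON, CODEOWNERS)
print(entry)
```

Because every field is computed from an existing source of truth, re-running the sync corrects drift instead of waiting for someone to notice a stale page.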
Skipping the golden path for the "edge case" that is actually 40% of traffic
Platform teams frequently defer building support for a workload type because it seems uncommon, only to discover it represents a significant fraction of the organisation's services once they audit. Before declaring something an out-of-scope edge case, count the actual services in that category. If more than three or four teams are maintaining their own variation of the same pattern, the platform needs to absorb it.
Zero platform SLOs
When the platform is unreliable, it does not just block the platform team — it blocks every product team simultaneously. A Terraform workspace provisioner that fails 20% of the time causes infrastructure-step failures across every CI run in the organisation. Define explicit SLOs for the platform's critical paths: golden path end-to-end success rate, environment provisioning P95 latency, and portal availability. Track them on a shared dashboard. Treat breaches as SEV-2 incidents, because to the affected teams they are.
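As a rough illustration of checking two of those SLOs, the sketch below computes a golden-path success rate and a nearest-rank P95 provisioning latency, then reports which targets are breached. The targets (99% success, 300 seconds) and the sample data are assumptions chosen for the example; the run outcomes reproduce the 20% failure rate mentioned above.

```python
# Hypothetical SLO targets for the platform's critical paths.
SLO_SUCCESS_RATE = 0.99
SLO_PROVISION_P95_SECONDS = 300


def success_rate(outcomes):
    """Share of runs that succeeded; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n, 1-based
    return ordered[rank - 1]


def slo_breaches(outcomes, latencies):
    """Return the names of the SLOs that the given window violates."""
    breaches = []
    if success_rate(outcomes) < SLO_SUCCESS_RATE:
        breaches.append("golden-path success rate")
    if p95(latencies) > SLO_PROVISION_P95_SECONDS:
        breaches.append("provisioning P95 latency")
    return breaches


# 20 provisioning runs: 4 failures (20%) and one very slow outlier.
runs = [True] * 16 + [False] * 4
latencies = [120, 130, 125, 140, 150, 135, 128, 132, 145, 138,
             122, 127, 133, 141, 129, 136, 131, 126, 139, 600]

print(f"success rate: {success_rate(runs):.0%}")
print(f"provisioning P95: {p95(latencies)} s")
print(f"breached SLOs: {slo_breaches(runs, latencies)}")
```

In this sample the single 600-second outlier does not breach the P95 target, while the failure rate does, which is why the two are tracked as separate SLOs.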
The platform team's job is to reduce the cognitive load required of stream-aligned teams — not to reduce their autonomy.