ClimsTech
Cloud architecture · 28 May 2026

Microservices or a modular monolith? The question to ask first

The Prime Video '90% cost cut' story is real and widely misread. This is the full engineering case: the distributed-systems tax itemised, when microservices actually justify it, the decision framework, and migration patterns that work in either direction.

ClimsTech Engineering · 18 min read

A few years ago, an engineering team at a very large streaming company published a post explaining how they rebuilt one of their monitoring tools — moving it away from serverless microservices and back into a single consolidated service — and cut its infrastructure cost by over 90%. The internet did what the internet does: declared microservices dead. That was the wrong lesson. The right one is more useful and more demanding: there is no default architecture, only a fit between a system and its problem. Getting that fit right requires understanding what you are actually buying with each approach, what forces genuinely justify the cost, and what the migration looks like when you need to move in either direction. This post works through all three.

What the Prime Video story actually says

The original design strung together AWS Step Functions orchestrating many Lambda functions, passing video frame data between them at every step. For this particular workload — high-volume, tightly-coupled video quality analysis — most of the cost and complexity was in the coordination: orchestration overhead and data serialised and transferred between components on every processing step. Collapsing it into a single ECS service that did all the work in-process removed that overhead almost entirely.

Before: distributed microservices

  • Step Functions orchestrating many Lambda functions
  • Video frame data serialised and passed between services each step
  • Each component scaled and billed independently
  • Inter-service data transfer and orchestration overhead dominated cost

After: one consolidated service

  • Work done with in-process function calls
  • Single ECS service, one deployable unit
  • Scaled as one; no cross-service data transfer cost
  • Reported over 90% lower infrastructure cost

Prime Video monitoring tool refactor: one team's workload, not a company-wide verdict. Source: Prime Video Tech Blog, 2023

The distributed-systems tax, itemised

When you decompose a process into services communicating over a network, you acquire a set of costs that in-process calls simply do not carry. Understanding what you are actually paying for is a prerequisite to deciding whether the trade is worthwhile.

Latency compounds across hops. A request that touches five services in sequence, each hop with a p99 inter-service latency of 10 ms, can accumulate up to 50 ms of network overhead before doing any real work. Under load, tail latencies stack: a chain of ten services at p99 20 ms each puts roughly 200 ms of overhead on the worst-case path before your business logic runs. At p50 this is often fine; at p99 under sustained traffic it can dominate your latency SLA.

Failure surfaces multiply. Every service boundary is a failure point. A ten-service request chain where each service individually maintains 99.9% availability gives a naive combined uptime of roughly 99.0% — nearly ten times more downtime per year than a single service at the same individual reliability, assuming independent failures that real distributed systems rarely provide. Every boundary now requires circuit breakers, retries with exponential back-off, bulkheads, dead-letter queues, and explicit timeout budgets.
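The naive availability figure above can be reproduced with a few lines of arithmetic. The Go sketch below assumes services in strict series with independent failures, the same simplification the paragraph makes; the function names are illustrative, not taken from any library.

```go
package main

import (
	"fmt"
	"math"
)

// chainAvailability returns the availability of a request that must pass
// through n services in series, each with the same availability, assuming
// their failures are independent.
func chainAvailability(perService float64, n int) float64 {
	return math.Pow(perService, float64(n))
}

// downtimeHoursPerYear converts an availability fraction into expected
// hours of downtime over an 8,760-hour year.
func downtimeHoursPerYear(availability float64) float64 {
	return (1 - availability) * 8760
}

func main() {
	single := 0.999
	chain := chainAvailability(single, 10)

	fmt.Printf("ten-service chain availability: %.4f\n", chain)
	fmt.Printf("single-service downtime: %.1f h/yr\n", downtimeHoursPerYear(single))
	fmt.Printf("ten-service downtime:    %.1f h/yr\n", downtimeHoursPerYear(chain))
}
```

Correlated failures (a shared dependency, a bad deploy, a zone outage) break the independence assumption, so treat the result as a rough planning number rather than a prediction.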

Data consistency becomes explicit engineering work. The moment two services own separate data stores, cross-service ACID transactions are gone. Anything that used to be a database transaction across two tables now requires a saga pattern, event sourcing, or accepting visible inconsistency windows. These patterns are well-understood, but they are non-trivial to implement correctly and they represent work that a monolith simply does not need.

Operational surface scales with service count. Every service needs its own CI/CD pipeline, container image, health checks, alerting rules, runbooks, and on-call assignment. A team of eight engineers running forty services is managing five services per engineer — at that ratio, each service gets incremental attention, not ownership.

Over-provisioning multiplies across every service you run. This is the least discussed cost, and it compounds at scale. The Cast.ai 2025 Kubernetes Cost Benchmark — drawn from analysis of 2,100-plus organisations across AWS, GCP, and Azure — found that 99.94% of clusters were over-provisioned, with average CPU utilisation at approximately 10% and average memory utilisation at approximately 20%.

Average resource utilisation across Kubernetes clusters: provisioned vs. actually used

| Resource | Provisioned | Used (avg) |
|---|---|---|
| CPU | 100% | ~10% |
| Memory | 100% | ~20% |

Source: Cast.ai 2025 Kubernetes Cost Benchmark Report

The numbers compound quickly. Consider a team running 50 microservices, each provisioned at two replicas with a 500m CPU request per pod — a modest, realistic configuration for high availability:

# A representative Deployment for one of 50 microservices
resources:
  requests:
    cpu: 500m      # 0.5 vCPU, used as Kubernetes scheduling baseline
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi

That configuration gives 50 services times 2 replicas times 0.5 vCPU = 50 vCPUs minimum, fleet-wide, just to keep the lights on. At 10% average CPU utilisation, the actual computational work consumes roughly 5 vCPUs. On c5.large-class instances at approximately $0.0425 per vCPU-hour:

  • Provisioned baseline: 50 vCPUs at $0.0425/hr over 8,760 hours = approximately $18,600 per year
  • Work actually performed: 5 vCPUs at $0.0425/hr over 8,760 hours = approximately $1,860 per year

The gap is roughly $16,700 per year from the CPU provisioning floor alone. That is before counting service mesh sidecars (an Envoy proxy in Istio typically adds 50–100m CPU and 50 MB RAM per pod), per-service load balancers, and the monitoring agents running in every namespace. The Prime Video team was solving a closely related problem: cost dominated by the overhead of distribution rather than by the work itself.
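The arithmetic above is easy to rerun with your own fleet numbers. This Go sketch uses only the figures from the example (50 services, 2 replicas, a 0.5 vCPU request, $0.0425 per vCPU-hour, 10% utilisation); the function and variable names are illustrative.

```go
package main

import "fmt"

const hoursPerYear = 8760

// annualCPUCost returns the yearly cost, in dollars, of holding the given
// number of vCPUs for every hour of the year at a flat per-vCPU-hour price.
func annualCPUCost(vcpus, dollarsPerVCPUHour float64) float64 {
	return vcpus * dollarsPerVCPUHour * hoursPerYear
}

func main() {
	services := 50.0
	replicas := 2.0
	requestVCPU := 0.5   // the 500m CPU request per pod
	price := 0.0425      // dollars per vCPU-hour, c5.large-class
	utilisation := 0.10  // average CPU utilisation

	provisioned := services * replicas * requestVCPU
	used := provisioned * utilisation

	provisionedCost := annualCPUCost(provisioned, price)
	usedCost := annualCPUCost(used, price)

	fmt.Printf("provisioned: %.0f vCPU, about $%.0f per year\n", provisioned, provisionedCost)
	fmt.Printf("used:        %.0f vCPU, about $%.0f per year\n", used, usedCost)
	fmt.Printf("gap:         about $%.0f per year\n", provisionedCost-usedCost)
}
```

The model ignores memory, sidecars, and reserved or spot pricing, so it understates some costs and overstates others; it is meant only to show where the floor comes from.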

When microservices genuinely justify the tax

The distributed-systems tax is real, but some systems need to pay it. The cases where it pays off follow a recognisable pattern.

Team and organisational scale. When separate teams need to deploy their components without coordinating release windows with each other, independent deployability earns its cost. Uber went from a monolith to services in the 2014–2016 period as they scaled from roughly 100 to over a thousand engineers. The forcing function was team autonomy: teams were blocking each other's deploys. Discord followed a similar path — their voice, text, and presence components had different team owners and radically different scaling requirements, and decomposing them was justified on both axes simultaneously.

Radically different scaling profiles. If one component needs to scale 100x at peak while another stays flat, coupling them in a monolith means over-provisioning the flat component or under-provisioning the scaling one. Netflix's recommendation engine, stream delivery, and billing systems carry such different traffic profiles that bundling them makes the scaling math uniformly worse.

Different compliance or isolation requirements. Payment processing subject to PCI-DSS audit scope is operationally cleaner when isolated from a content delivery service. Keeping a compliance boundary narrow and explicit is easier when the deployment boundary matches it. The same principle applies to data sovereignty requirements and per-component SLA contracts.

Consider the performance profile of elite teams. DORA's 2024 State of DevOps report found that elite performers (19% of respondents) deploy on demand, with a change failure rate of approximately 5% and recovery times under one hour. Compared with low performers, they deploy 182 times more frequently and have a change failure rate roughly 8 times lower.

Elite DevOps performer benchmarks

  • Deploy frequency (elite tier): on-demand
  • Change failure rate (elite tier): ~5%
  • Mean recovery time (elite tier): under 1h
  • More deploys vs. low tier (performance gap): 182x

Source: DORA State of DevOps, 2024

Notice what the DORA metrics do not say: they do not say microservices cause elite performance. They describe what strong independent deployability and genuine service ownership enable — whether the deployable unit is one service in a microservices fleet or a module in a well-structured monolith with its own pipeline. The architecture is a means; ownership is the actual driver.

The inverse is worth making explicit. The same DORA dataset shows that as teams add services without adding the platform maturity to run them, stability degrades faster than throughput improves. More moving parts without the operational tooling to manage them is a liability, not a feature.

Figure: Dynamics. Axes: AI adoption vs. outcome; throughput ↑, stability ↓; high platform quality closes the gap.
Adding services increases deployment independence, but past a team's operational capacity, instability rises faster than throughput gains. Source: ClimsTech analysis

What a modular monolith looks like in code

The false binary in most of these discussions is "big ball of mud" versus "fifty microservices." The pragmatic option for most teams is a single deployable with disciplined internal module boundaries — a structure where modules could be extracted when a real forcing function appears, without paying the network tax until it does.

37signals' Hey email service is a prominent live example. Hey runs as a Rails monolith on roughly ten machines. Sub-millisecond in-process calls, one deploy pipeline, deterministic behaviour under load. At their scale — a small team serving hundreds of thousands of paying customers — the forcing functions for decomposition simply do not exist, and they have said so directly.

Shopify's path was more instructive at larger scale: rather than jumping from a Rails monolith directly to microservices, they spent years enforcing hard module boundaries inside the monolith using Rails engines. Modules could not call each other's internal classes; all cross-module interaction went through declared public interfaces. The result was a codebase where extraction was possible along seams already drawn — but extraction happened only where team and scaling pressure genuinely justified it.

In Go, the same discipline maps naturally to the internal package convention. The layout makes the module boundaries structural:

cmd/
  api/
internal/
  billing/
    handler.go
    service.go
    repository.go
  notifications/
    handler.go
    service.go
  payments/
    handler.go
    service.go
  shared/
    types.go      ← shared value types only; no business logic lives here

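The rule: Go only lets code rooted at the parent of an internal/ directory import the packages beneath it, so nothing outside this module can reach billing, notifications, or payments. That alone does not stop notifications from importing internal/billing, because both sit under the same internal/ root. The module boundary comes from a stricter rule layered on top: if notifications needs to trigger a billing event, it does so through a declared interface in shared/, not by importing billing internals. Where you want the compiler itself to enforce the isolation, a module can keep its implementation in its own nested internal/ directory (for example internal/billing/internal/), which only the billing subtree can import. Either way, this is conceptually the same boundary a service extraction would enforce, at zero network cost.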
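As a rough illustration of the shared-interface rule, the single-file Go sketch below marks where each piece would live in the layout above. All type and method names (`ChargeRequester`, `billingService`, `reminderSender`) and the 500-cent fee are hypothetical; in a real codebase each section would be its own package.

```go
package main

import "fmt"

// --- would live in internal/shared ---

// ChargeRequester is the only billing capability other modules may use.
type ChargeRequester interface {
	RequestCharge(accountID string, cents int64) error
}

// --- would live in internal/billing ---

// billingService keeps its state private to the billing module.
type billingService struct {
	charged map[string]int64
}

func (b *billingService) RequestCharge(accountID string, cents int64) error {
	if cents <= 0 {
		return fmt.Errorf("invalid amount %d", cents)
	}
	b.charged[accountID] += cents
	return nil
}

// --- would live in internal/notifications ---

// reminderSender knows about the shared interface, never about billingService.
type reminderSender struct {
	billing ChargeRequester
}

func (r reminderSender) SendOverdueReminder(accountID string) error {
	// The late fee is an illustrative constant.
	return r.billing.RequestCharge(accountID, 500)
}

// --- would live in cmd/api, the only place that wires modules together ---

func main() {
	billing := &billingService{charged: map[string]int64{}}
	reminders := reminderSender{billing: billing}

	if err := reminders.SendOverdueReminder("acct-42"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("charged cents:", billing.charged["acct-42"])
}
```

If billing is later extracted, `ChargeRequester` can be satisfied by a network client without touching the notifications code.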

You can add an explicit audit layer with golangci-lint and the depguard linter, which lets you declare disallowed import paths per package and fails the build on violation:

# Enforce cross-module import restrictions
golangci-lint run --enable depguard ./...

The .golangci.yml configuration makes the rules explicit and reviewable:

linters-settings:
  depguard:
    rules:
      billing-no-notifications:
        files:
          - "**/internal/billing/**/*.go"
        deny:
          - pkg: "github.com/yourorg/app/internal/notifications"
            desc: "billing must not import notifications — use a shared interface"

In Java or Kotlin, the equivalent discipline uses ArchUnit rules that run as part of the standard test suite:

// Fail the build if billing depends on notifications internals
noClasses()
  .that().resideInAPackage("..billing..")
  .should().dependOnClassesThat().resideInAPackage("..notifications..")
  .check(importedClasses);

The specific tooling is secondary. The point is that module boundaries need to be enforced by the build system, not by convention. Convention decays under deadline pressure; compiler errors and test failures do not.

The decision framework

The question is not "micro or mono" as an abstract preference. It is a concrete set of signals about your organisation and your system today. The following maps those signals to their architectural implication.

| Signal | Lean monolith | Lean microservices |
|---|---|---|
| Team size | Under 20–30 engineers | 50-plus engineers, multiple autonomous teams |
| Deploy coupling | One team owns all affected code | Teams regularly block each other on releases |
| Scaling profile | Roughly uniform across components | Components differ by 10x or more at traffic peak |
| Data isolation | Shared schema is operationally fine | Different durability, latency, or compliance scopes |
| Compliance boundary | Single audit scope acceptable | Parts need separate audit or regulatory treatment |
| Domain understanding | Boundaries still emerging through use | Domain stable and well-understood |
| Platform maturity | Small or no dedicated platform team | Dedicated platform engineering, service mesh in place |

The most common mistake is applying the right column too early — before team size reaches the forcing threshold, before the domain is stable enough to know where the real boundaries are. A service boundary you draw in year one will often need to move in year two, and moving a boundary between services with separate data stores is substantially more expensive than moving a boundary between modules in a monolith.

Split when the friction of staying together exceeds the friction of network calls. Not before.
A principle every architecture review should apply

Migration patterns that work in either direction

Monolith to microservices: the Strangler Fig

The reliable approach to extracting a service from a monolith is the Strangler Fig pattern, described by Martin Fowler in 2004. The idea is to put a routing layer in front of the monolith and move functionality to a new service one slice at a time, while the monolith keeps handling everything that has not yet moved. Over time the new service takes over the extracted context completely. There is no big-bang cut-over.

Extracting a bounded context using the Strangler Fig pattern
  1. Identify the bounded context

     Pick the component with a clear owner, its own data, and a defined API surface. If you cannot draw a clean line around the data, do not extract yet: the boundary is not ready.

  2. Introduce a routing facade

     Put an API gateway or reverse proxy in front of the monolith. Traffic still flows to the monolith. The facade is the future seam: it exists before the new service does.

  3. Build the new service

     Implement the extracted context as a standalone service with its own data store. Start with new records and new traffic, not a data migration. Accept that it starts incomplete.

  4. Dual-write the data

     Write new records to both the monolith schema and the new service's data store. Validate parity continuously. This phase is operationally awkward: budget for it, and end it on a fixed deadline.

  5. Switch reads, then writes

     Flip reads to the new service first. Confirm correctness under production load. Then flip writes. Confirm again. Then stop writing to the monolith path.

  6. Remove the dead code

     Delete the extracted code from the monolith. A migration that never removes the source is a migration that has not finished. Set the deadline before the migration starts.

Source: Martin Fowler, Strangler Fig Application (2004)

The step most teams indefinitely defer is step six. Extracted code left in the monolith for "safety" becomes a maintenance liability that quietly drifts out of sync with the new service, causes confusion during incidents, and doubles the fix surface for bugs. Setting a hard removal deadline at the outset is not pedantry — it is the thing that actually finishes the migration.
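Steps four and five of the list, dual-writing and then flipping reads, can be sketched as a thin wrapper over two stores. Everything here is hypothetical (`Store`, `memStore`, `migratingStore`, `parityMismatches`), and the in-memory maps stand in for the monolith schema and the new service's data store. The sketch does not handle partial write failures beyond surfacing the error.

```go
package main

import "fmt"

// Store is the minimal persistence seam shared by both sides.
type Store interface {
	Put(id, value string) error
	Get(id string) (string, bool)
}

// memStore is an in-memory stand-in for a real data store.
type memStore map[string]string

func (m memStore) Put(id, value string) error { m[id] = value; return nil }

func (m memStore) Get(id string) (string, bool) { v, ok := m[id]; return v, ok }

// migratingStore writes to both stores and reads from the side that
// readFromNew selects.
type migratingStore struct {
	legacy, next Store
	readFromNew  bool
}

func (s *migratingStore) Put(id, value string) error {
	if err := s.legacy.Put(id, value); err != nil {
		return err
	}
	// A failure here leaves the stores out of step; the parity check finds it.
	return s.next.Put(id, value)
}

func (s *migratingStore) Get(id string) (string, bool) {
	if s.readFromNew {
		return s.next.Get(id)
	}
	return s.legacy.Get(id)
}

// parityMismatches returns the ids whose values differ between the stores.
func parityMismatches(legacy, next Store, ids []string) []string {
	var bad []string
	for _, id := range ids {
		a, okA := legacy.Get(id)
		b, okB := next.Get(id)
		if okA != okB || a != b {
			bad = append(bad, id)
		}
	}
	return bad
}

func main() {
	legacy, next := memStore{}, memStore{}
	store := &migratingStore{legacy: legacy, next: next}

	if err := store.Put("inv-1", "paid"); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	legacy["inv-0"] = "open" // a record written before dual-writing began

	fmt.Println("mismatches:", parityMismatches(legacy, next, []string{"inv-0", "inv-1"}))

	store.readFromNew = true // step five: flip reads once parity holds
	v, ok := store.Get("inv-1")
	fmt.Println("read from new store:", v, ok)
}
```

The mismatch on the pre-existing record is why reads should not flip until older data has been backfilled or is known to be out of scope.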

Microservices to monolith: consolidation

Consolidation is underused as a deliberate strategy. The trigger patterns are recognisable: services that always deploy together (accidental coupling disguised as microservices), services where inter-service call latency or serialisation dominates CPU time (the Prime Video pattern), or services too small to justify independent on-call ownership.

The consolidation approach:

  1. Confirm the services share a single team owner and a coherent bounded context.
  2. Merge the data stores first, or design a seam interface that lets code merge without a data migration yet.
  3. Replace network calls with in-process calls. Monitor latency and error rates — they should fall immediately.
  4. Flatten the CI/CD pipelines into one.
  5. Maintain internal module boundaries from day one of the merge: each former service becomes a module with its own declared interface. Consolidation is not permission to couple.

Point five is the discipline that makes consolidation reversible. A consolidated service with clean internal structure can be extracted again if team growth or scaling changes make it worthwhile. A consolidated service where modules are allowed to couple is a ball of mud that can only be replaced, not refactored.
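One way to picture the seam interface from step two and the in-process swap from step three is shown below. The `Pricer` interface, both implementations, and the `/price/{sku}` endpoint with its JSON shape are assumptions for illustration only. The caller, `orderTotal`, does not change when the network client is replaced by the in-process module.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// Pricer is the seam: callers depend on this interface, not on how the
// price is obtained.
type Pricer interface {
	Price(sku string) (cents int64, err error)
}

// remotePricer is the pre-consolidation implementation: one HTTP round trip
// per lookup against a hypothetical pricing service.
type remotePricer struct {
	baseURL string
	client  *http.Client
}

func (p remotePricer) Price(sku string) (int64, error) {
	resp, err := p.client.Get(p.baseURL + "/price/" + url.PathEscape(sku))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("pricing service returned %s", resp.Status)
	}
	var body struct {
		Cents int64 `json:"cents"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return 0, err
	}
	return body.Cents, nil
}

// localPricer is the post-consolidation implementation: a plain function
// call into the pricing module, with no serialisation or network hop.
type localPricer struct {
	table map[string]int64
}

func (p localPricer) Price(sku string) (int64, error) {
	cents, ok := p.table[sku]
	if !ok {
		return 0, fmt.Errorf("unknown sku %q", sku)
	}
	return cents, nil
}

// orderTotal is identical before and after consolidation.
func orderTotal(p Pricer, skus []string) (int64, error) {
	var total int64
	for _, sku := range skus {
		cents, err := p.Price(sku)
		if err != nil {
			return 0, err
		}
		total += cents
	}
	return total, nil
}

func main() {
	pricer := localPricer{table: map[string]int64{"mug": 1200, "pen": 300}}
	total, err := orderTotal(pricer, []string{"mug", "pen", "pen"})
	fmt.Println(total, err)
}
```

Keeping `Pricer` after the merge is what preserves the option to extract pricing again later.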

Five pitfalls that show up in production

Decomposing too early

The wrong service boundary is worse than no boundary at all. A team that splits along a domain boundary they will later need to move faces a much higher refactoring cost than a team moving a module boundary inside a monolith. Moving business logic across service boundaries means migrating data between separate stores, managing dual-write windows, updating API contracts across teams, and re-deploying multiple services in coordination. The fix is to start with a monolith and find the real seams through actual product development — boundaries that feel obvious at design time often shift during the first year.

The distributed monolith

A system where every request causes Service A to call Service B synchronously, which calls Service C, which calls Service D, is a distributed monolith. It has all the operational overhead of microservices and none of the independence: services cannot deploy without the whole chain working, failures cascade synchronously, and latency compounds at every hop. This pattern nearly always emerges when a monolith is decomposed along technical layers (a "frontend service," a "business logic service," a "data service") rather than along domain ownership boundaries.

The fix is to redesign the interaction model. Services should own a domain, not a layer. Use asynchronous messaging where strong consistency is not required. If a synchronous chain cannot be broken, the services probably form one bounded context with an unnecessary network boundary between them.
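The shift from a synchronous chain to asynchronous messaging can be illustrated in miniature with a buffered Go channel standing in for a message broker. This is a toy under stated assumptions: `OrderPlaced`, `placeOrder`, and `startConsumer` are hypothetical names, and a real system would use a durable broker with delivery guarantees that a channel does not provide.

```go
package main

import (
	"fmt"
	"sync"
)

// OrderPlaced is the event the ordering domain publishes.
type OrderPlaced struct {
	OrderID string
}

// startConsumer runs handle for every event until events is closed and
// returns a function that blocks until the consumer has drained it.
func startConsumer(events <-chan OrderPlaced, handle func(OrderPlaced)) (wait func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for e := range events {
			handle(e)
		}
	}()
	return wg.Wait
}

// placeOrder publishes an event and returns. It does not call the
// downstream handler or wait for it; it blocks only if the buffer is full.
func placeOrder(id string, events chan<- OrderPlaced) {
	events <- OrderPlaced{OrderID: id}
}

func main() {
	events := make(chan OrderPlaced, 16)

	var notified []string
	wait := startConsumer(events, func(e OrderPlaced) {
		notified = append(notified, e.OrderID)
	})

	placeOrder("o-1", events)
	placeOrder("o-2", events)

	close(events) // no more events in this demo
	wait()
	fmt.Println("notified:", notified)
}
```

The producer succeeds even when the consumer is slow, which is the property a synchronous A-to-B-to-C chain lacks.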

The shared database

Splitting the application layer while keeping a shared relational database gives you all the operational complexity of microservices with none of the data isolation benefit. Worse, the shared schema becomes a hidden coupling point: changing a column requires coordinating every service that reads it, removing all the independent-deploy advantage you were trying to gain.

The rule is strict: a service that cannot own its data is not ready to be a service. If data ownership is genuinely shared between two proposed services, that is a signal the bounded context is drawn incorrectly. Redraw it before extracting.

No ownership

A fleet of 80 services owned nominally by five teams — where no individual engineer can list their team's complete service inventory — is an operational liability that compounds over time. Services accumulate CVEs quietly, incidents get routed to the wrong team, and post-incident reviews reveal no one knew who was responsible for a component. The principle is simple and non-negotiable: one team owns a service end-to-end, including on-call rotation, dependency updates, and retirement when the service is no longer needed. If you cannot name the team, the service should not exist as an independent deployable.

Skipping observability

Distributed systems without distributed tracing are effectively undebuggable in production at the boundaries. A latency regression that spans three services is invisible in per-service dashboards; it requires trace sampling with propagated trace context to locate the slow span. Before decomposing a monolith, instrument with OpenTelemetry, propagate trace context across all boundaries, and confirm that your backend — Jaeger, Grafana Tempo, or a managed equivalent — can correlate spans across services.

Observability is not a nice-to-have that you add after the architecture is working. It is a prerequisite for knowing whether the architecture is working at all.