Digital media · Cloud architecture & production modernisation
Built for traffic that does not arrive gradually
A cloud-native platform designed to absorb sudden live-event demand, protect the database layer, and give engineers greater control over availability, deployment and recovery.
CricRadio
Named with the client's permission.
120,000+
concurrent-user load-test capacity
42%
faster average API response
65%
fewer manual production interventions
99.97%
peak-event availability
In brief
A real-time sports platform needed to support highly concentrated demand during live matches, where traffic could multiply within minutes around match starts and key moments. The existing environment relied on fixed-capacity compute, shared application resources and direct database reads, so under load latency rose and engineers had to intervene manually. ClimsTech redesigned the platform around containerised services on AWS: Kubernetes for independent scaling and recovery, Redis to absorb repeated reads, and observability across latency, cache behaviour, workload health and database pressure.
Working constraints
- Highly variable, event-driven traffic
- Live production users throughout the migration
- Existing MongoDB application dependency
- Limited release windows during major events
- No complete application rewrite
- Mixed stateless and stateful workloads
- Small internal operations team
- Strict response-time expectations
The problem
What was actually going wrong
Live sports products have a distinctive demand profile. Capacity cannot be planned only around averages because the most important customer moments also create the sharpest demand. The platform needed to remain responsive during sudden user surges while avoiding permanent over-provisioning during quieter periods.
What discovery surfaced
1. Multiple services were scaled together despite very different demand patterns.
2. Frequently requested score and session data repeatedly reached the primary database.
3. Application capacity was defined by server size rather than service behaviour.
4. Alerts showed infrastructure pressure but did not identify the user-facing application path.
5. Releases lacked a consistent rollback model.
6. Resource allocation varied between environments.
The engineering
What we built and changed
1. Container platform
Application services were packaged into Docker images and deployed onto Kubernetes, with workloads separated so high-demand APIs could scale independently from background workers. Health probes, resource limits, and rolling deployment strategies were defined as production requirements.
2. Performance and data protection
Redis was introduced as a caching layer for frequently accessed information, reducing repeated MongoDB reads and stabilising response times during demand spikes.
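The read path described here is commonly called cache-aside: check the cache first, fall back to the database on a miss, then store the result with a short expiry. The sketch below is illustrative only and is not the platform's code. `TTLCache` is an in-memory stand-in for Redis so the example runs without a server, and `ScoreReader`, the `score:` key prefix and the two-second TTL are all assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis (hypothetical; keeps the sketch runnable)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


class ScoreReader:
    """Cache-aside reader: only cache misses reach the database loader."""

    def __init__(
        self,
        cache: TTLCache,
        load_from_db: Callable[[str], str],
        ttl_seconds: float = 2.0,  # assumed value, not taken from the case study
    ) -> None:
        self._cache = cache
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        self.hits = 0
        self.db_reads = 0

    def get_score(self, match_id: str) -> str:
        key = f"score:{match_id}"
        cached = self._cache.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        value = self._load_from_db(match_id)  # miss: one database read
        self.db_reads += 1
        self._cache.set(key, value, self._ttl)
        return value
```

With this shape, a burst of identical requests inside one TTL window costs a single database read. The TTL is the trade-off to tune: a shorter expiry keeps data fresher but sends more reads to the database.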
3. Elastic scaling
Horizontal scaling policies were configured using CPU, memory, and application-level indicators, calibrated through staged load tests rather than default settings.
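The case study does not list the policies themselves. As a rough guide to what calibration means in practice, the Kubernetes Horizontal Pod Autoscaler documents its core rule as desired = ceil(current replicas × current metric / target metric). It skips the change when the ratio is within a tolerance of 1.0 (0.1 by default) and takes the largest proposal when several metrics are configured. The sketch below reproduces only that rule. The metric names, targets and replica bounds are hypothetical, and real behaviour also depends on stabilisation windows and pod readiness, which are omitted.

```python
import math
from typing import Dict, Tuple


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """Replica proposal for one metric, following the documented HPA ratio rule."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


def scale_decision(
    current_replicas: int,
    metrics: Dict[str, Tuple[float, float]],  # name -> (current, target)
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Take the largest per-metric proposal, then clamp to the allowed range."""
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics.values()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical spike: CPU is moderately above target, but requests per pod
# are four times the target, so the application-level metric drives scaling.
spike = {"cpu_percent": (90.0, 60.0), "requests_per_pod": (400.0, 100.0)}
replicas_after_spike = scale_decision(10, spike, min_replicas=4, max_replicas=30)
```

The example shows why application-level indicators matter: on CPU alone the proposal would be 15 replicas, while the request-rate metric asks for 40 and is capped at the configured maximum of 30.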
4. Delivery automation
A standard delivery workflow automated container build, validation, registry publication, deployment, and rollback, with environment-specific configuration separated from application images.
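Separating configuration from images usually means the same image reads its settings from the environment at start-up. The following is a minimal sketch of that idea, not the project's loader. The variable names `REDIS_URL`, `MONGO_URL` and `CACHE_TTL_SECONDS`, and the default TTL, are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    redis_url: str
    mongo_url: str
    cache_ttl_seconds: float


# Hypothetical names: settings every environment must supply.
REQUIRED = ("REDIS_URL", "MONGO_URL")


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment and fail fast if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return Settings(
        redis_url=env["REDIS_URL"],
        mongo_url=env["MONGO_URL"],
        cache_ttl_seconds=float(env.get("CACHE_TTL_SECONDS", "2.0")),
    )
```

Failing at start-up rather than on first use lets a readiness check keep a misconfigured release out of rotation, which keeps a rollback simple: redeploy the previous image with its previous configuration.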
5. Production observability
Dashboards were created covering request volume, response time, error rate, cache hit ratio, database utilisation, pod health, and scaling events, with alerts aligned to customer impact rather than infrastructure fluctuation alone.
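One way to read "aligned to customer impact" is that user-facing symptoms page an engineer, while infrastructure pressure without symptoms only raises a lower-priority signal. The sketch below illustrates that split under assumed thresholds. `RequestSample`, the 300 ms p95 budget, the 1% error budget and the 85% CPU warning level are hypothetical and do not come from the engagement.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RequestSample:
    latency_ms: float
    status: int
    cache_hit: bool


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(window: Sequence[RequestSample]) -> Dict[str, float]:
    """Reduce one window of requests to the user-facing indicators."""
    total = len(window)
    if total == 0:
        raise ValueError("empty window")
    errors = sum(1 for r in window if r.status >= 500)
    hits = sum(1 for r in window if r.cache_hit)
    return {
        "p95_ms": percentile([r.latency_ms for r in window], 95),
        "error_rate": errors / total,
        "cache_hit_ratio": hits / total,
    }


def severity(
    summary: Dict[str, float],
    cpu_utilisation: float,
    p95_budget_ms: float = 300.0,  # assumed budget
    max_error_rate: float = 0.01,  # assumed budget
    cpu_warn: float = 0.85,        # assumed warning level
) -> str:
    """Page on customer impact; raise only a ticket for infrastructure pressure."""
    if summary["p95_ms"] > p95_budget_ms or summary["error_rate"] > max_error_rate:
        return "page"
    if cpu_utilisation > cpu_warn:
        return "ticket"
    return "ok"
```

Under these assumptions, high CPU with healthy latency and error rate produces a ticket rather than a page, which is the difference between an alert on customer impact and an alert on infrastructure fluctuation.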
Before the engagement, senior engineers were needed for scaling, release coordination, and incident recovery. After the transformation, the team had standardised pipelines, shared dashboards, automated recovery, and clearer operational ownership.
The architecture
Before and after
Before
- Single load balancer
- Fixed application servers
- MongoDB
- Local logs
- Manual scaling and recovery
After
- CDN and edge protection
- Application load balancer
- Kubernetes application platform
- API workloads
- Background workloads
- Redis
- MongoDB
- Metrics, logs, traces and alerts
Judgement calls
Decisions that shaped the outcome
Why Kubernetes instead of larger virtual machines?
Larger machines would have increased total capacity but would not have allowed application components to scale independently. Kubernetes provided workload-level elasticity, standardised deployment, and automated replacement of unhealthy workload instances.
Why Redis?
The platform served high volumes of repeated information. Caching reduced avoidable demand on MongoDB and improved response consistency during traffic spikes.
Why rolling deployment?
The product needed to release updates without taking the complete platform offline. Rolling deployment allowed new versions to be introduced gradually and reversed more safely.
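The gradual replacement can be pictured with a small simulation. Kubernetes Deployments bound a rolling update with `maxSurge` (extra pods allowed above the desired count) and `maxUnavailable` (pods allowed below it). The sketch below is a simplified model, not the controller's actual algorithm: it assumes every new pod becomes ready within the step in which it starts, and the replica counts are hypothetical.

```python
from typing import List, Tuple


def rolling_update(
    desired: int, max_surge: int, max_unavailable: int
) -> List[Tuple[int, int]]:
    """Return (old_pods, new_pods) after each step of a simplified rollout."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("max_surge and max_unavailable cannot both be 0")

    floor = desired - max_unavailable  # minimum pods serving at any time
    ceiling = desired + max_surge      # maximum pods existing at any time
    old, new = desired, 0
    steps = [(old, new)]

    while old > 0 or new < desired:
        # Retire old pods only while the total stays at or above the floor.
        removable = min(old, max(0, old + new - floor))
        old -= removable
        # Start new pods only while the total stays at or below the ceiling.
        startable = min(desired - new, max(0, ceiling - (old + new)))
        new += startable
        steps.append((old, new))

    return steps


# Hypothetical rollout: 4 replicas, one extra pod allowed, none unavailable.
plan = rolling_update(desired=4, max_surge=1, max_unavailable=0)
```

In this model the number of serving pods never drops below the floor at any step, which is why the platform can stay online during a release. Rolling back follows the same mechanism in reverse, with the previous version as the target.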
How it ran
- Phase 1 (Weeks 1–2): Discovery and dependency mapping
- Phase 2 (Weeks 3–5): Kubernetes foundation and network design
- Phase 3 (Weeks 6–9): Containerisation and Redis integration
- Phase 4 (Weeks 10–12): CI/CD and observability
- Phase 5 (Weeks 13–15): Load testing and performance tuning
- Phase 6 (Week 16): Production readiness and handover
Verified outcomes
What changed for the business
- Load-tested beyond 120,000 concurrent users
- Average API response improved by 42%
- Database read traffic reduced by 38%
- Deployment time reduced from 90 minutes to 12 minutes
- Manual production intervention reduced by 65%
| Area | Before | After |
|---|---|---|
| Scaling | Manual server changes | Automated workload scaling |
| Recovery | Engineer-led restarts | Automatic workload replacement |
| Database demand | Repeated direct reads | Redis-supported caching |
| Deployment | Long, manual release process | Controlled rolling deployment |
| Visibility | Infrastructure-only alerts | Application and platform dashboards |
What this engagement proves
Scalability was not achieved by adding compute alone. The largest gains came from separating workloads, protecting the database, and aligning scaling decisions with actual service behaviour.