Online marketplace · Reliability transformation
Turning fragmented monitoring into an operating system for reliability
A unified observability model across infrastructure, applications, logs, security signals and incident response.
92 workloads standardised
14 AWS accounts included
68% reduction in alert noise
38 min MTTR (from 2h 20m)
In brief
A fast-growing marketplace operated a sizeable AWS estate with uneven visibility: some workloads were richly monitored, others relied on local logs, and several critical alerts had no clear owner. ClimsTech created a common observability framework covering infrastructure metrics, application logs, availability, security signals and escalation, and improved patching, IAM review, WAF visibility, CDN monitoring and runbooks.
Working constraints
- Multi-account AWS environment
- Different monitoring maturity across workloads
- Customer-facing services with varied criticality
- High alert volume
- Mixed application and infrastructure ownership
- Limited shared runbooks
- Security and operations data stored separately
The problem
What was actually going wrong
As platform scale increased, teams found it harder to answer basic production questions quickly: which service was failing, whether the problem was application, infrastructure, or dependency related, which alerts required action, who owned the response, and which systems were unmonitored or unpatched.
What discovery surfaced
1. Alert volume was high, but actionable signal was low.
2. Several teams monitored infrastructure without application context.
3. Logs were inconsistent and difficult to correlate.
4. Critical services did not share a common escalation model.
5. Patch compliance and monitoring coverage were not measured consistently.
6. Operational knowledge lived with a small number of engineers.
The engineering
What we built and changed
1. Monitoring standardisation
ClimsTech defined a core monitoring baseline for compute, storage, network, service availability, and workload health.
2. Central logging
Logs were centralised and enriched with application, environment, severity, and correlation data.
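As an illustration of what enrichment at the source can look like, here is a minimal sketch using Python's standard `logging` module. The field names (`application`, `environment`, `correlation_id`) and the `build_logger` helper are assumptions made for this example; the engagement's actual log schema and tooling are not described here.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the enrichment fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "message": record.getMessage(),
            "severity": record.levelname,
            # Hypothetical field names, attached by the LoggerAdapter below.
            "application": getattr(record, "application", "unknown"),
            "environment": getattr(record, "environment", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(application: str, environment: str,
                 correlation_id: str) -> logging.LoggerAdapter:
    """Return a logger whose records all carry the same context fields."""
    logger = logging.getLogger(application)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    context = {
        "application": application,
        "environment": environment,
        "correlation_id": correlation_id,
    }
    return logging.LoggerAdapter(logger, context)


if __name__ == "__main__":
    log = build_logger("checkout", "prod", "req-123")
    log.error("payment provider timed out")
```

Because every line carries the same keys, a central pipeline can filter by environment or follow one correlation ID across services without per-team parsing rules.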
3. Alert engineering
Duplicate and low-value alerts were removed, and thresholds were aligned with customer impact and escalation urgency.
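The idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the rules used in the engagement: it treats "low-value" as "not customer-impacting" and collapses repeats of the same service and check inside a five-minute window, and both of those choices are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    customer_impacting: bool
    timestamp: float  # seconds since epoch


def reduce_alerts(alerts: list[Alert], window_seconds: float = 300.0) -> list[Alert]:
    """Drop non-impacting alerts and collapse repeats within a time window.

    A repeat is suppressed when the same (service, check) pair was already
    kept less than `window_seconds` earlier.
    """
    last_kept: dict[tuple[str, str], float] = {}
    kept: list[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if not alert.customer_impacting:
            continue  # assumed definition of a low-value alert
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # duplicate inside the suppression window
        last_kept[key] = alert.timestamp
        kept.append(alert)
    return kept


if __name__ == "__main__":
    raw = [
        Alert("checkout", "error_rate", True, 0.0),
        Alert("checkout", "error_rate", True, 60.0),   # duplicate
        Alert("batch", "cpu_high", False, 90.0),       # not customer-impacting
        Alert("checkout", "error_rate", True, 400.0),  # outside the window
    ]
    for alert in reduce_alerts(raw):
        print(alert.service, alert.check, alert.timestamp)
```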
4. Incident response
Severity levels, ownership, escalation, and first-response runbooks were standardised.
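One way to make such a standard enforceable is to express it as data, so every alert resolves to an owner, an acknowledgement target and a runbook. The sketch below is hypothetical: the severity names, time targets, team names and runbook paths are invented for illustration and do not come from the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    acknowledge_within_minutes: int
    escalate_to: str


# Hypothetical policy table; real targets would be agreed with service owners.
SEVERITY_POLICIES = {
    "SEV1": SeverityPolicy(5, "incident-commander"),
    "SEV2": SeverityPolicy(15, "on-call-lead"),
    "SEV3": SeverityPolicy(60, "owning-team-queue"),
}

# Hypothetical ownership and runbook registries.
SERVICE_OWNERS = {"checkout": "payments-team", "search": "discovery-team"}
RUNBOOKS = {"checkout": "runbooks/checkout-first-response.md"}


def route_incident(service: str, severity: str) -> dict:
    """Resolve who responds, how fast, and which runbook to open first."""
    if service not in SERVICE_OWNERS:
        # An alert nobody owns is treated as a configuration error.
        raise LookupError(f"no owner registered for service {service!r}")
    policy = SEVERITY_POLICIES[severity]  # unknown severity raises KeyError
    return {
        "owner": SERVICE_OWNERS[service],
        "acknowledge_within_minutes": policy.acknowledge_within_minutes,
        "escalate_to": policy.escalate_to,
        "runbook": RUNBOOKS.get(service),  # None when no runbook exists yet
    }


if __name__ == "__main__":
    print(route_incident("checkout", "SEV1"))
```

Failing loudly on an unowned service keeps ownership gaps visible at configuration time instead of during an incident.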
5. Security operations
Monitoring was extended across IAM events, edge protection, patching, and security findings.
Teams gained a shared production view, consistent incident language, and clearer operational ownership. Investigations no longer started with searching for the correct server or account.
The architecture
Before and after
Before
- Isolated AWS accounts
- Separate dashboards
- Local logs
- Email alerts
- Manual checks
- No shared incident context
After
- AWS accounts
- Metrics pipeline
- Central log pipeline
- Security events
- Unified observability layer
- Role-based dashboards
- Actionable alerts
- Incident response workflow and runbooks
Judgement calls
Decisions that shaped the outcome
Why reduce alerts instead of adding more?
Reliability improves when teams receive meaningful signals. More alerts without prioritisation create fatigue and slower response.
Why combine application and infrastructure context?
Infrastructure utilisation alone does not explain customer impact; application latency, error rate, and dependency health were needed to understand service behaviour.
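A small Python sketch shows why the combined view matters for first-pass triage. The thresholds, field names and the order of the checks are assumptions chosen for illustration, not values from the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSnapshot:
    cpu_utilisation: float   # fraction of capacity, 0.0 to 1.0
    error_rate: float        # fraction of failed requests
    p95_latency_ms: float
    dependency_healthy: bool


def classify(snapshot: ServiceSnapshot, *, cpu_limit: float = 0.85,
             error_limit: float = 0.02,
             latency_limit_ms: float = 800.0) -> str:
    """First-pass triage that starts from customer impact, not utilisation."""
    customer_impact = (snapshot.error_rate > error_limit
                       or snapshot.p95_latency_ms > latency_limit_ms)
    if not customer_impact:
        # High CPU without user-facing symptoms is not an incident here.
        return "no customer impact"
    if not snapshot.dependency_healthy:
        return "dependency"
    if snapshot.cpu_utilisation > cpu_limit:
        return "infrastructure"
    return "application"


if __name__ == "__main__":
    busy_but_fine = ServiceSnapshot(0.95, 0.001, 120.0, True)
    failing_quietly = ServiceSnapshot(0.30, 0.08, 300.0, True)
    print(classify(busy_but_fine))
    print(classify(failing_quietly))
```

The first snapshot would page someone under a CPU-only rule despite healthy requests; the second would stay silent under that rule even though 8% of requests fail.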
Why include operational ownership?
A technically accurate alert is still ineffective when no team is responsible for acting on it.
Verified outcomes
What changed for the business
- Standardised monitoring across 92 workloads
- Included 14 AWS accounts
- Mean time to resolution reduced from 2h 20m to 38m
- Alert noise reduced by 68%
- Critical monitoring coverage reached 98%
- Patch compliance increased from 71% to 96%
- More than 25 runbooks created
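For readers who want the relative change behind two of these figures, the arithmetic can be checked in a few lines of Python. The derived percentages are computed here from the stated before and after values; they are not separately reported results.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


# MTTR: 2h 20m before, 38m after (both converted to minutes).
mttr_before_min = 2 * 60 + 20
mttr_after_min = 38
mttr_reduction = percent_reduction(mttr_before_min, mttr_after_min)

# Patch compliance: 71% before, 96% after, expressed as a point change.
patch_gain_points = 96 - 71

if __name__ == "__main__":
    print(f"MTTR reduction: {mttr_reduction:.1f}%")
    print(f"Patch compliance gain: {patch_gain_points} percentage points")
```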
What this engagement proves
Observability is not a dashboard project. It is an operating model connecting signals, ownership, action and learning.