Online marketplace · Reliability transformation

Turning fragmented monitoring into an operating system for reliability

A unified observability model across infrastructure, applications, logs, security signals and incident response.

AWS · CloudWatch · WAF · Runbooks

workloads standardised

AWS accounts included

68%

reduction in alert noise

38 min

MTTR (from 2h 20m)

In brief

A fast-growing marketplace operated a sizeable AWS estate with uneven visibility: some workloads were richly monitored, others relied on local logs, and several critical alerts had no clear owner. ClimsTech created a common observability framework spanning infrastructure metrics, application logs, availability, security signals and escalation, and improved patching, IAM review, WAF visibility, CDN monitoring and runbooks.

Working constraints

Multi-account AWS environment
Different monitoring maturity across workloads
Customer-facing services with varied criticality
High alert volume
Mixed application and infrastructure ownership
Limited shared runbooks
Security and operations data stored separately

The problem

What was actually going wrong

As platform scale increased, teams found it harder to answer basic production questions quickly: which service was failing; whether the problem lay in the application, the infrastructure, or a dependency; which alerts required action; who owned the response; and which systems were unmonitored or unpatched.

What discovery surfaced

1. Alert volume was high, but actionable signal was low.
2. Several teams monitored infrastructure without application context.
3. Logs were inconsistent and difficult to correlate.
4. Critical services did not share a common escalation model.
5. Patch compliance and monitoring coverage were not measured consistently.
6. Operational knowledge lived with a small number of engineers.

The engineering

What we built and changed

1. Monitoring standardisation

ClimsTech defined a core monitoring baseline for compute, storage, network, service availability, and workload health.

2. Central logging

Logs were centralised and enriched with application, environment, severity, and correlation data.
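As an illustration of what enrichment can look like at the source, here is a minimal sketch using Python's standard `logging` module. It is not the engagement's actual implementation. The field names (`application`, `environment`, `severity`, `correlation_id`), the `JsonFormatter` class and the `build_logger` helper are assumptions chosen to mirror the four kinds of context named above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying shared context fields.

    Field names are hypothetical; they mirror the context described in the text.
    """

    def __init__(self, application: str, environment: str) -> None:
        super().__init__()
        self.application = application
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "application": self.application,
            "environment": self.environment,
            "severity": record.levelname,
            # correlation_id is passed per call via `extra`; None when absent.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(application: str, environment: str, stream=None) -> logging.Logger:
    """Return a logger whose output is one enriched JSON line per event."""
    logger = logging.getLogger(f"{application}.{environment}")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)  # stderr when stream is None
    handler.setFormatter(JsonFormatter(application, environment))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A caller would attach the correlation identifier per request, for example `build_logger("checkout", "prod").error("payment failed", extra={"correlation_id": "abc-123"})`, so that lines from different services can be joined on that one field once they reach the central pipeline.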

3. Alert engineering

Duplicate and low-value alerts were removed, and thresholds were aligned with customer impact and escalation urgency.
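The filtering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the rules used in the engagement: the `Alert` shape, the `customer_impacting` flag as a stand-in for "low-value", and the five-minute suppression window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape used only for this sketch.
    service: str
    check: str
    timestamp: float  # seconds since epoch
    customer_impacting: bool


def filter_alerts(alerts, window_seconds=300.0):
    """Keep customer-impacting alerts, suppressing repeats of the same
    (service, check) pair that arrive within `window_seconds` of the
    last one that was kept."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if not alert.customer_impacting:
            continue  # assumption: non-impacting alerts count as low-value
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # duplicate inside the suppression window
        last_kept[key] = alert.timestamp
        kept.append(alert)
    return kept
```

In practice the suppression window and the definition of customer impact would be tuned per service; the point of the sketch is that both are explicit and reviewable, not buried in individual alarm definitions.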

4. Incident response

Severity levels, ownership, escalation, and first-response runbooks were standardised.
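One way to picture a standardised model is as a small lookup that refuses to route an incident for a service with no registered owner. The sketch below is hypothetical throughout: the severity names, acknowledgement times, team names and runbook paths are invented for illustration and do not come from the engagement.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # assumption: customer-facing outage
    SEV2 = 2  # assumption: degraded service
    SEV3 = 3  # assumption: no immediate customer impact


@dataclass(frozen=True)
class EscalationPolicy:
    acknowledge_within_minutes: int
    page_on_call: bool


# Illustrative values only; real targets would be agreed per organisation.
POLICIES = {
    Severity.SEV1: EscalationPolicy(acknowledge_within_minutes=5, page_on_call=True),
    Severity.SEV2: EscalationPolicy(acknowledge_within_minutes=15, page_on_call=True),
    Severity.SEV3: EscalationPolicy(acknowledge_within_minutes=240, page_on_call=False),
}

# Hypothetical ownership and runbook registries.
OWNERS = {"checkout": "payments-team", "search": "discovery-team"}
RUNBOOKS = {"checkout": "runbooks/checkout-first-response.md"}


def route_incident(service: str, severity: Severity) -> dict:
    """Resolve owner, escalation policy and first-response runbook."""
    owner = OWNERS.get(service)
    if owner is None:
        # An unowned service is treated as a configuration error, not a default.
        raise LookupError(f"no owner registered for service {service!r}")
    policy = POLICIES[severity]
    return {
        "owner": owner,
        "page_on_call": policy.page_on_call,
        "acknowledge_within_minutes": policy.acknowledge_within_minutes,
        "runbook": RUNBOOKS.get(service),  # None when no runbook exists yet
    }
```

Raising on a missing owner is a deliberate choice in this sketch: it surfaces ownership gaps when the registry is checked, before an incident depends on it.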

5. Security operations

Monitoring was extended across IAM events, edge protection, patching, and security findings.

Teams gained a shared production view, consistent incident language, and clearer operational ownership. Investigations no longer started with searching for the correct server or account.

The architecture

Before and after

Before

Isolated AWS accounts
Separate dashboards
Local logs
Email alerts
Manual checks
No shared incident context

After

AWS accounts
Metrics pipeline
Central log pipeline
Security events
Unified observability layer
Role-based dashboards
Actionable alerts
Incident response workflow and runbooks

Judgement calls

Decisions that shaped the outcome

Why reduce alerts instead of adding more?

Reliability improves when teams receive meaningful signals. More alerts without prioritisation create fatigue and slower response.

Why combine application and infrastructure context?

Infrastructure utilisation alone does not explain customer impact; application latency, error rate, and dependency health were needed to understand service behaviour.
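A small sketch shows why the combination matters: the same CPU reading leads to different conclusions depending on what the application and its dependencies report. The `ServiceSnapshot` fields, the thresholds and the category labels are assumptions made for illustration, not values from the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSnapshot:
    # Hypothetical point-in-time view of one service.
    cpu_utilisation: float       # 0.0 to 1.0, infrastructure signal
    error_rate: float            # fraction of failed requests
    p95_latency_ms: float        # application latency
    dependencies_healthy: bool   # downstream health


def classify(snapshot, max_error_rate=0.02, max_latency_ms=800.0, max_cpu=0.85):
    """Return a coarse label for where to start an investigation."""
    app_degraded = (
        snapshot.error_rate > max_error_rate
        or snapshot.p95_latency_ms > max_latency_ms
    )
    cpu_high = snapshot.cpu_utilisation > max_cpu
    if app_degraded and not snapshot.dependencies_healthy:
        return "dependency"
    if app_degraded and cpu_high:
        return "infrastructure"
    if app_degraded:
        return "application"
    if cpu_high:
        return "capacity-watch"  # busy hosts, but customers are unaffected
    return "healthy"
```

Under these assumptions, high CPU with normal latency and error rate does not page anyone, while a rise in errors with idle hosts still does. Neither case can be decided from infrastructure utilisation alone.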

Why include operational ownership?

A technically accurate alert is still ineffective when no team is responsible for acting on it.

What this engagement proves

Observability is not a dashboard project. It is an operating model connecting signals, ownership, action and learning.