ClimsTech

Online marketplace · Reliability transformation

Turning fragmented monitoring into an operating system for reliability

A unified observability model across infrastructure, applications, logs, security signals and incident response.

AWS · CloudWatch · WAF · Runbooks

92

workloads standardised

14

AWS accounts included

68%

reduction in alert noise

38 min

MTTR (from 2h 20m)

In brief

A fast-growing marketplace operated a sizeable AWS estate with uneven visibility: some workloads were richly monitored, others relied on local logs, and several critical alerts had no clear owner. ClimsTech created a common observability framework spanning infrastructure metrics, application logs, availability, security signals and escalation, and also improved patching, IAM review, WAF visibility, CDN monitoring and runbooks.

Working constraints

  • Multi-account AWS environment
  • Different monitoring maturity across workloads
  • Customer-facing services with varied criticality
  • High alert volume
  • Mixed application and infrastructure ownership
  • Limited shared runbooks
  • Security and operations data stored separately

The problem

What was actually going wrong

As platform scale increased, teams found it harder to answer basic production questions quickly: which service was failing, whether the problem lay in the application, the infrastructure, or a dependency, which alerts required action, who owned the response, and which systems were unmonitored or unpatched.

What discovery surfaced

  1. Alert volume was high, but actionable signal was low.
  2. Several teams monitored infrastructure without application context.
  3. Logs were inconsistent and difficult to correlate.
  4. Critical services did not share a common escalation model.
  5. Patch compliance and monitoring coverage were not measured consistently.
  6. Operational knowledge lived with a small number of engineers.

The engineering

What we built and changed

1. Monitoring standardisation

ClimsTech defined a core monitoring baseline for compute, storage, network, service availability, and workload health.

2. Central logging

Logs were centralised and enriched with application, environment, severity, and correlation data.
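
As an illustration of the enrichment described above, here is a minimal Python sketch of a log formatter that stamps every record with application, environment, severity and a correlation identifier. The class name, the field names and the `checkout` / `prod` values are assumptions made for this example, not details of the engagement.

```python
import json
import logging
import uuid


class EnrichedJsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying shared context fields."""

    def __init__(self, application: str, environment: str) -> None:
        super().__init__()
        self.application = application
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "application": self.application,
            "environment": self.environment,
            "severity": record.levelname,
            # Supplied per call through `extra`; None when the caller omits it.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(application: str, environment: str) -> logging.Logger:
    """Return a logger that writes enriched JSON lines to stderr."""
    logger = logging.getLogger(f"{application}.{environment}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(EnrichedJsonFormatter(application, environment))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", "prod")  # hypothetical service and environment
    request_id = str(uuid.uuid4())
    log.warning("payment provider timeout", extra={"correlation_id": request_id})
```

Because every line shares the same fields, a central pipeline can filter by application or environment and join entries from different services on the correlation identifier.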

3. Alert engineering

Duplicate and low-value alerts were removed, and thresholds were aligned with customer impact and escalation urgency.
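
One way to picture the alert clean-up is the Python sketch below, which drops alerts with no customer impact and collapses repeats of the same check within a time window. The `Alert` fields, the five-minute window and the sample data are assumptions for illustration only; the actual rules used in the engagement are not described in this case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: int            # seconds since an arbitrary start
    customer_impacting: bool  # hypothetical flag set by the alert rule


def deduplicate(alerts: list[Alert], window_seconds: int = 300) -> list[Alert]:
    """Keep customer-impacting alerts, collapsing repeats of the same
    (service, check) pair that fire within `window_seconds` of the last kept one."""
    last_kept: dict[tuple[str, str], int] = {}
    kept: list[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if not alert.customer_impacting:
            continue  # low-value: no customer impact
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # duplicate inside the suppression window
        last_kept[key] = alert.timestamp
        kept.append(alert)
    return kept


if __name__ == "__main__":
    sample = [
        Alert("checkout", "5xx-rate", 0, True),
        Alert("checkout", "5xx-rate", 60, True),    # repeat within window
        Alert("search", "cpu-high", 10, False),     # no customer impact
        Alert("search", "latency-p95", 20, True),
        Alert("checkout", "5xx-rate", 400, True),   # outside window, kept
    ]
    print(len(deduplicate(sample)), "of", len(sample), "alerts kept")
```

Keying on the service and check pair means a noisy check is silenced without hiding a different failure on the same service.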

4. Incident response

Severity levels, ownership, escalation, and first-response runbooks were standardised.
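
A standardised escalation model can be thought of as a lookup from service and severity to an owner, an acknowledgement target and a first-response runbook. The Python sketch below assumes hypothetical severity names, team names, timings and runbook paths; none of them come from the engagement itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    owner: str
    ack_minutes: int
    runbook: str


# Hypothetical acknowledgement targets per severity level.
SEVERITY_ACK_MINUTES = {"SEV1": 5, "SEV2": 15, "SEV3": 60}

# Hypothetical ownership register: service -> (on-call team, runbook path).
SERVICE_OWNERS = {
    "checkout": ("payments-oncall", "runbooks/checkout-errors"),
    "search": ("discovery-oncall", "runbooks/search-latency"),
}


def route(service: str, severity: str) -> EscalationRule:
    """Resolve who responds, how fast, and which runbook to open first."""
    if severity not in SEVERITY_ACK_MINUTES:
        raise ValueError(f"unknown severity: {severity}")
    if service not in SERVICE_OWNERS:
        # Surfacing the gap is the point: an alert without an owner is not actionable.
        raise LookupError(f"no owner registered for service: {service}")
    owner, runbook = SERVICE_OWNERS[service]
    return EscalationRule(owner, SEVERITY_ACK_MINUTES[severity], runbook)


if __name__ == "__main__":
    print(route("checkout", "SEV1"))
```

Failing loudly on an unregistered service turns missing ownership into something that can be found and fixed before an incident, rather than during one.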

5. Security operations

Monitoring was extended across IAM events, edge protection, patching, and security findings.

Teams gained a shared production view, consistent incident language, and clearer operational ownership. Investigations no longer began with a search for the right server or account.

The architecture

Before and after

Before
  • Isolated AWS accounts
  • Separate dashboards
  • Local logs
  • Email alerts
  • Manual checks
  • No shared incident context
After
  • AWS accounts
  • Metrics pipeline
  • Central log pipeline
  • Security events
  • Unified observability layer
  • Role-based dashboards
  • Actionable alerts
  • Incident response workflow and runbooks

Judgement calls

Decisions that shaped the outcome

Why reduce alerts instead of adding more?

Reliability improves when teams receive meaningful signals. More alerts without prioritisation create fatigue and slower response.

Why combine application and infrastructure context?

Infrastructure utilisation alone does not explain customer impact; application latency, error rate, and dependency health were needed to understand service behaviour.

Why include operational ownership?

A technically accurate alert is still ineffective when no team is responsible for acting on it.

Verified outcomes

What changed for the business

  • Standardised monitoring across 92 workloads
  • Included 14 AWS accounts
  • Mean time to resolution reduced from 2h 20m to 38m
  • Alert noise reduced by 68%
  • Critical monitoring coverage reached 98%
  • Patch compliance increased from 71% to 96%
  • More than 25 runbooks created
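
For readers who want the relative changes, the short Python calculation below derives them from the figures listed above. The derived values are simple arithmetic on those figures, not separately reported results.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


mttr_before_min = 2 * 60 + 20  # 2h 20m expressed in minutes
mttr_after_min = 38

mttr_reduction = percent_reduction(mttr_before_min, mttr_after_min)
patch_gain_points = 96 - 71  # change in percentage points

print(f"MTTR reduction: {mttr_reduction:.1f}%")               # MTTR reduction: 72.9%
print(f"Patch compliance gain: {patch_gain_points} points")   # Patch compliance gain: 25 points
```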

What this engagement proves

Observability is not a dashboard project. It is an operating model connecting signals, ownership, action and learning.