Healthcare technology · Infrastructure reliability
Detecting infrastructure risk before users report it
A proactive monitoring model across servers, application services, databases, capacity and escalation.
- 58 servers covered
- 4 min detection time (down from 22 min)
- 39% less infrastructure downtime
- 80% fewer manual daily checks
In brief
A healthcare technology organisation ran a distributed server estate where many incidents were discovered through user reports rather than monitoring. ClimsTech built an infrastructure inventory, classified workload criticality, introduced monitoring and service checks, calibrated thresholds per workload, and defined escalation runbooks — moving from routine checking to exception-based response.
Working constraints
- Mixed operating systems
- Distributed server locations
- Varying application criticality
- Existing support team processes
- Need to avoid excessive alerting
- Limited downtime windows
- Healthcare-related service sensitivity
The problem
What was actually going wrong
The organisation lacked a common view of server health and service availability. Disk exhaustion, stopped processes, resource saturation, and database issues could remain unnoticed until users were affected.
What discovery surfaced
1. Some servers had only basic availability checks.
2. Application process health was not monitored consistently.
3. Thresholds were uniform despite different workload behaviour.
4. Alert ownership was unclear.
5. Manual checks consumed engineering time.
6. Incident history was difficult to analyse.
The engineering
What we built and changed
1. Asset and service inventory
Servers, applications, databases, dependencies, and ownership were documented.
2. Monitoring baseline
CPU, memory, disk, filesystem, service availability, and selected application processes were monitored across the estate.
3. Threshold calibration
Thresholds were adapted by workload rather than applying one universal default.
4. Escalation and runbooks
Severity, routing, acknowledgement, and first-response actions were documented for structured incident response.
5. Dashboards
A shared operational view showed current health, trend, capacity risk, and unresolved alerts.
Support moved from routine checking and user-led discovery to exception-based monitoring and structured response.
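As an illustration of what a documented escalation rule might capture, the sketch below maps a severity to a route, an acknowledgement window, and a first-response action. The severity names, teams, timings, and runbook wording are all hypothetical; the engagement's actual policy is not described in this write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    route: str           # who receives the alert first
    ack_minutes: int     # time allowed before the alert escalates further
    first_response: str  # first action listed in the runbook


# Hypothetical policy: every value here is an illustrative assumption.
POLICY = {
    "critical": EscalationRule("on-call engineer", 5, "Check service status and follow the restart runbook"),
    "major": EscalationRule("operations team", 30, "Confirm impact and open an incident record"),
    "minor": EscalationRule("operations queue", 240, "Review the trend on the shared dashboard"),
}


def escalate(severity: str) -> EscalationRule:
    """Return the rule for a severity; unknown severities fall back to 'major'."""
    return POLICY.get(severity, POLICY["major"])


rule = escalate("critical")
print(f"{rule.route}: acknowledge within {rule.ack_minutes} min, then '{rule.first_response}'")
```

Falling back to a mid-level rule for unrecognised severities is one possible design choice: it keeps an unclassified alert from being dropped without paging someone for it.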
The architecture
Before and after
Before
- Distributed server estate
- Basic availability checks only
- Inconsistent application process monitoring
- Uniform alerting thresholds
- Unclear alert ownership
- Manual daily checks
After
- Application servers
- Database servers
- Integration servers
- Monitoring collectors
- Central monitoring platform
- Operational dashboards
- Severity-based alerting
- Operations team and runbooks
Judgement calls
Decisions that shaped the outcome
Why monitor services separately from servers?
A server can be available while the application process is unavailable, so server-level checks alone give an incomplete picture of service health.
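A minimal sketch of that distinction, assuming the service listens on a TCP port: one result says whether the host responds, a second says whether the service itself accepts connections, and the two are reported separately. The function names and the TCP-based check are assumptions made for illustration, not the checks used in the engagement.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(host_reachable: bool, service_reachable: bool) -> str:
    """Combine a server-level result and a service-level result into one status."""
    if not host_reachable:
        return "server down"
    if not service_reachable:
        return "server up, service down"
    return "healthy"


# A host that responds while its application port refuses connections
# is exactly the case a server-only check would miss.
print(classify(host_reachable=True, service_reachable=False))
```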
Why avoid universal thresholds?
Workloads have different normal behaviour; uncalibrated thresholds create false alarms that erode trust in the alerting system.
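To make the idea concrete, the sketch below keeps one default per metric and lets a workload class override only the values that differ. The workload names and percentages are invented for illustration and are not the calibrated values from this engagement.

```python
# Hypothetical defaults and overrides; all numbers are illustrative assumptions.
DEFAULT_THRESHOLDS = {"cpu_pct": 85.0, "disk_pct": 80.0}

WORKLOAD_OVERRIDES = {
    # Databases: alert earlier on disk, since exhaustion is harder to recover from.
    "database": {"cpu_pct": 90.0, "disk_pct": 75.0},
    # Batch integration jobs: sustained high CPU is normal behaviour.
    "batch_integration": {"cpu_pct": 98.0},
}


def threshold_for(workload: str, metric: str) -> float:
    """Return the workload-specific threshold, or the default if none is set."""
    return WORKLOAD_OVERRIDES.get(workload, {}).get(metric, DEFAULT_THRESHOLDS[metric])


def breaches(workload: str, metric: str, value: float) -> bool:
    """True when the observed value meets or exceeds the workload's threshold."""
    return value >= threshold_for(workload, metric)


# The same 92% CPU reading alerts on a general application server
# but not on a batch integration server.
print(breaches("application", "cpu_pct", 92.0))
print(breaches("batch_integration", "cpu_pct", 92.0))
```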
Why retain informational events without paging?
Operational history is valuable for analysis, but not every event should interrupt engineers — separating information from actionable alerts reduces noise.
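One way to picture this separation is a router that records every event but pages only for those marked actionable. The class and field names below are hypothetical, and the in-memory lists stand in for whatever event store and paging channel a real platform provides.

```python
from dataclasses import dataclass, field


@dataclass
class EventRouter:
    history: list = field(default_factory=list)  # every event, kept for later analysis
    pages: list = field(default_factory=list)    # only events that interrupt an engineer

    def handle(self, message: str, actionable: bool) -> None:
        """Always record the event; page only when it needs a response."""
        self.history.append(message)
        if actionable:
            self.pages.append(message)


router = EventRouter()
router.handle("Nightly backup completed", actionable=False)
router.handle("Database disk at 95%", actionable=True)
print(len(router.history), len(router.pages))
```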
Verified outcomes
What changed for the business
- Monitoring introduced across 58 servers
- Detection time reduced from 22 to 4 minutes
- Infrastructure downtime reduced by 39%
- Repeat incidents reduced by 46%
- Critical service coverage reached 100%
- Manual checks reduced by 80%
- Alert acknowledgement improved by 63%
What this engagement proves
Monitoring creates value only when detection leads to clear ownership and repeatable action.