Capability · SRE & operations
SRE & managed operations
We run what we build — reliability owned, not hoped for.
92 workloads standardised
14 AWS accounts included
68% reduction in alert noise
38 min MTTR (from 2h 20m)
Measured on one engagement — anonymised client, verified with the owner.
Sound familiar?
Two or more of these means this page is for you.
- 1. Users report incidents before your monitoring does
- 2. Dashboards exist, but alerts page the wrong people about the wrong things
- 3. Every incident is an all-hands, and recovery depends on who's awake
- 4. “Is it reliable enough?” is an argument, not a number
- 5. Backups exist; restores are assumed
- 6. After the incident, nothing structurally changes
The transformation
How this discipline behaves when it's done right
- 1. Observability foundation
Metrics, logs and traces standardised around what users feel, not what hosts emit, with a named owner on every alert.
- 2. SLOs & error budgets
Reliability targets as numbers with a policy attached, so “reliable enough” becomes a decision rule instead of an argument.
- 3. Incident response
On-call, escalation and runbooks that make recovery a procedure, and postmortems that close the whole failure class.
- 4. Recovery validation
Backup and restore drills on a schedule. Disaster recovery that is proven, not presumed.
- 5. Managed operations
We run the platform day to day as an extension of your team, with a senior engineer accountable for the trend line.
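To make the "decision rule" idea in step 2 concrete, here is a minimal sketch of an error-budget policy in Python. It is illustrative only and not our actual policy: the function name, the 75% warning threshold and the three actions are assumptions, and a real policy would be agreed with the service owner.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float  # failures the SLO tolerates in this window
    consumed: float          # fraction of the budget used; 1.0 means exhausted
    action: str              # "ship", "slow" or "freeze" (hypothetical labels)


def evaluate_error_budget(slo: float, total: int, failed: int) -> BudgetStatus:
    """Turn an SLO and request counts into a release decision.

    slo is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    The 0.75 threshold below is an assumed example value.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")

    allowed = total * (1 - slo)
    if allowed == 0:
        # No traffic in the window: nothing spent unless something failed.
        consumed = 0.0 if failed == 0 else float("inf")
    else:
        consumed = failed / allowed

    if consumed >= 1.0:
        action = "freeze"  # budget exhausted: reliability work comes first
    elif consumed >= 0.75:
        action = "slow"    # budget nearly spent: reduce release risk
    else:
        action = "ship"    # budget healthy: keep shipping
    return BudgetStatus(allowed, consumed, action)


if __name__ == "__main__":
    status = evaluate_error_budget(slo=0.999, total=1_000_000, failed=400)
    print(f"{status.consumed:.0%} of budget used -> {status.action}")
```

The point of the sketch is that the argument about "reliable enough" is settled once, when the thresholds are agreed, rather than during every incident.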
Decisions
The calls we make — and why
What do we instrument first?
The user-facing symptom. Availability and latency where the customer meets the system; infrastructure detail earns its place by explaining those signals.
How many nines?
The number the business can defend. Each nine multiplies cost — we set SLOs from consequence, not aspiration, and let the error budget arbitrate speed versus stability.
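As a rough illustration of what each extra nine buys, the sketch below converts an availability SLO into the downtime it permits. The 30-day window and the function name are assumptions chosen for illustration. It says nothing about cost; it only shows that every added nine cuts the error budget by a factor of ten.

```python
# Allowed downtime for an availability SLO over a fixed window.
# The 30-day window is an assumption; use whatever window your SLO is defined on.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: int = WINDOW_MINUTES) -> float:
    """Return the minutes of downtime an availability SLO tolerates."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% -> {minutes:.2f} minutes of downtime per 30 days")
```

Read this way, choosing a target is a question about consequence: how many minutes of user-facing failure per month can the business defend?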
Fewer alerts, or more?
Fewer, owned, actionable. An alert that doesn't demand action trains people to ignore the one that does.
Artifacts
What you hold at the end
- Dashboard
Service dashboards aligned to user-facing SLIs
- Policy
SLOs and error-budget policy
- Runbook
Incident response and escalation runbooks
- Drill
Backup, restore and recovery drill records
- Report
Monthly reliability review with actions
Evidence
What it did on a real system
Situation
A fast-growing marketplace on a 14-account AWS estate with uneven visibility: critical alerts had no clear owner, and some workloads logged only locally.
Intervention
A common observability framework across metrics, logs, availability and security signals, with ownership and escalation defined per alert.
Measured result
92 workloads standardised; alert noise down 68%; mean time to recovery fell from 2h 20m to 38 minutes over the engagement.
Verified with the engagement owner · client anonymised by agreement.
Read the full engagement
Start here
Begins with an observability and reliability review; most clients continue into managed operations, where the improvements compound month over month.
Delivery & ongoing
- Observability — metrics, logs and traces
- SLOs and error budgets
- Incident response and on-call
- Recovery and backup validation
Delivered as code with handover — or run ongoing as managed operations.
Before you engage
Do you replace our team or extend it?
Extend. Your engineers keep building product; we own the reliability discipline with them — and everything we run is visible and handed over as we go.
What does managed operations actually include?
Monitoring and incident response, recovery drills, capacity and cost review, and a monthly reliability review with a senior engineer accountable for the trend.
Not in scope
- Dashboards without ownership
- On-call we run but you can't see into
- SLOs set to look good rather than to decide
How we think about this problem
All field notes
Observability at scale: when telemetry becomes a deletion problem
At scale, observability becomes a deletion problem — put backpressure on telemetry.
16 min read
SRE & reliability
SLOs and error budgets: turning reliability into a number
Turn “is it reliable enough?” from an argument into a number with a policy.
18 min read
SRE & reliability
Blameless postmortems: turning incidents into reliability
Find what let it break, not who broke it — and close the whole failure class.
21 min read
Review your reliability operating model
Bring your last three incidents — or the dashboard nobody trusts. We'll map where recovery actually loses its minutes.