Multi-tenant SaaS · Logging transformation
Following one request across dozens of services
A central logging platform designed around correlation, structured fields, retention, access and actionable investigation.
- 180 GB of logs processed daily
- 45 min investigation time (down from 3 h)
- 36 services covered
- 84% less time on individual servers
In brief
A multi-tenant SaaS platform generated logs across applications, servers and cloud services, and engineers connected to individual machines during incidents. ClimsTech centralised collection, enriched events with structured context and request correlation, created dashboards and alerts, and set retention and access policy — moving investigation from machine-by-machine searching to shared, correlated analysis.
Working constraints
- Multiple applications and formats
- High daily log volume
- Sensitive tenant information
- Different retention requirements
- Existing server access practices
- Need for correlation
- Search performance and cost
The problem
What was actually going wrong
A multi-tenant SaaS platform generated logs across applications, servers, and cloud services. Engineers connected to individual machines during incidents and struggled to trace a request across service boundaries.
What discovery surfaced
1. Log formats differed significantly.
2. Timestamps and identifiers were inconsistent.
3. Some logs contained unnecessary sensitive data.
4. Important logs could be overwritten locally.
5. Engineers relied on server access.
6. Retention was not based on operational need.
The engineering
What we built and changed
1. Source inventory
Applications, servers, databases, and cloud services were mapped.
2. Collection and parsing
Log shippers and processing rules transformed data into consistent fields.
3. Correlation
Request, tenant, environment, service, and severity identifiers were added where appropriate.
4. Retention and access
Different categories received different retention and role-based access.
5. Operational use
Dashboards, search patterns, and alerts supported common investigation scenarios.
Incident investigation moved from machine-by-machine searching to shared, structured, and correlated analysis.
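Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the processing rules used in the engagement: the two input layouts (a JSON line with epoch milliseconds and a bracketed plain-text line), the key names `timeMs`, `lvl`, `requestId` and `tenantId`, and the output field names are all assumptions chosen for illustration. The point is only that two differently shaped lines end up as the same structured event, with the correlation fields attached.

```python
import json
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical plain-text layout used by an older service:
# "2024-05-01 12:00:03,120 ERROR [req=abc123 tenant=t-42] payment declined"
TEXT_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"(?P<level>[A-Z]+) \[req=(?P<req>\S+) tenant=(?P<tenant>\S+)\] (?P<msg>.*)$"
)


def normalise(raw: str, service: str, environment: str) -> dict:
    """Map one raw log line onto a shared set of fields."""
    raw = raw.strip()
    if raw.startswith("{"):
        # Hypothetical JSON layout: epoch milliseconds and camelCase keys.
        src = json.loads(raw)
        ts = EPOCH + timedelta(milliseconds=src["timeMs"])
        level, msg = src["lvl"], src["msg"]
        request_id, tenant_id = src.get("requestId"), src.get("tenantId")
    else:
        m = TEXT_LINE.match(raw)
        if m is None:
            raise ValueError(f"unrecognised log line: {raw!r}")
        # Assumes the text timestamps are already in UTC.
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S").replace(
            microsecond=int(m["ms"]) * 1000, tzinfo=timezone.utc
        )
        level, msg = m["level"], m["msg"]
        request_id, tenant_id = m["req"], m["tenant"]
    # One timestamp format, one severity vocabulary, one set of identifiers.
    return {
        "timestamp": ts.isoformat(timespec="milliseconds"),
        "severity": level.lower(),
        "service": service,
        "environment": environment,
        "request_id": request_id,
        "tenant_id": tenant_id,
        "message": msg,
    }
```

In a real pipeline this mapping would normally live in the log shipper or ingest rules rather than in application code; the sketch only shows the shape of the transformation.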
The architecture
Before and after
Before
- Applications and services: inconsistent log formats
- Local server log storage: overwritable
- No request correlation identifiers
- Per-server access for incident investigation
- No structured retention or access policy
After
- Applications, servers, databases, cloud services
- Log shippers
- Processing and enrichment
- Elasticsearch
- Kibana dashboards
- Alerts
Judgement calls
Decisions that shaped the outcome
Why not retain everything indefinitely?
Retention cost and search performance must align with business and operational need.
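A rough way to see the trade-off is to estimate how much data sits in the cluster at steady state under tiered retention versus a flat window. The sketch below uses the 180 GB daily volume from this case study; the category names, volume shares, retention windows and roles are hypothetical, and the estimate ignores compression, replication and index overhead.

```python
DAILY_VOLUME_GB = 180  # figure reported in the case study

# Hypothetical categories, volume shares, retention windows and roles.
POLICY = {
    "debug": {"share_pct": 50, "retention_days": 7, "roles": {"engineer"}},
    "application": {"share_pct": 40, "retention_days": 30, "roles": {"engineer", "support"}},
    "audit": {"share_pct": 10, "retention_days": 365, "roles": {"security"}},
}


def stored_gb(policy: dict, daily_gb: int) -> dict:
    """Steady-state volume per category: daily share times retention window."""
    return {
        name: daily_gb * rule["share_pct"] * rule["retention_days"] / 100
        for name, rule in policy.items()
    }


def can_read(role: str, category: str, policy: dict = POLICY) -> bool:
    """Role-based access: a role may read only the categories that list it."""
    return role in policy[category]["roles"]


tiered_total = sum(stored_gb(POLICY, DAILY_VOLUME_GB).values())
flat_total = DAILY_VOLUME_GB * 365  # keeping everything for a year

print(f"tiered: {tiered_total:.0f} GB, flat 365 days: {flat_total} GB")
```

Under these assumed shares, the tiered policy holds 9,360 GB against 65,700 GB for a flat one-year window, which is the kind of gap that makes retention a design decision rather than a default.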
Why prioritise request identifiers?
Correlation enables engineers to follow one transaction through multiple services.
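As a minimal illustration of what that enables, the Python sketch below rebuilds one request's path from a mixed stream of events. The field names `request_id`, `service` and `timestamp` are assumptions, and it relies on every timestamp being an ISO 8601 string in the same UTC format so that string order matches time order.

```python
def trace(events: list[dict], request_id: str) -> list[dict]:
    """Return the events for one request, ordered by time."""
    hits = [e for e in events if e.get("request_id") == request_id]
    # Uniform ISO 8601 UTC strings sort chronologically as plain strings.
    return sorted(hits, key=lambda e: e["timestamp"])


def service_path(events: list[dict], request_id: str) -> list[str]:
    """List the services a request touched, in the order it reached them."""
    path: list[str] = []
    for event in trace(events, request_id):
        if not path or path[-1] != event["service"]:
            path.append(event["service"])
    return path


# Hypothetical events from three services, arriving out of order.
events = [
    {"timestamp": "2024-05-01T12:00:03.300+00:00", "service": "billing", "request_id": "abc123"},
    {"timestamp": "2024-05-01T12:00:03.120+00:00", "service": "gateway", "request_id": "abc123"},
    {"timestamp": "2024-05-01T12:00:03.200+00:00", "service": "search", "request_id": "zzz999"},
]
print(service_path(events, "abc123"))
```

Without a shared identifier, the same question would mean matching timestamps by hand across each server's local log files.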
Why remove secrets and unnecessary personal data?
Logs are operational records, not unrestricted data stores.
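A redaction step in the processing stage might look like the following Python sketch. The list of sensitive keys and the email pattern are illustrative assumptions, not the rules applied in this engagement, and a real rule set would be driven by a review of what each service actually logs.

```python
import re

# Hypothetical field names whose values should never reach the log store.
SENSITIVE_KEYS = {"password", "authorization", "api_key", "token"}

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(event: dict) -> dict:
    """Return a copy of the event with secrets and email addresses masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean
```

Masking at ingestion, before events are indexed, means the sensitive values never become searchable in the first place.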
Verified outcomes
What changed for the business
- 180 GB processed daily
- Investigation reduced from 3 hours to 45 minutes
- Correlation improved by 73%
- Server-access time reduced by 84%
- Repeat errors identified 60% faster
- 36 services covered
- Critical alert response improved by 52%
What this engagement proves
Central logging becomes valuable when data is structured around investigation, not merely collected in one place.