Outage Intelligence

Unifying 10 siloed monitoring systems into one auto-RCA agent — cutting MTTD from hours to under 5 minutes

10 siloed monitoring systems, ~500 P1 incidents per year, MTTD measured in hours. One auto-RCA agent that cut detection to under 5 minutes — validated in shadow mode by SRE leads before a single line ran in production.

Company:

Global Auto Finance Leader

My Role:

AI Product Manager, Enterprise Solutions (Unframe AI)

Year:

2026

Tech Stack

Multi-Source Log Ingestion · Signal Normalization · ML Anomaly Scoring · Automated Root Cause Analysis · Agent Orchestration (Framery OS) · Splunk · ThousandEyes · Datadog · Dynatrace · AppDynamics · Azure Monitor · Elastic · ServiceNow · PagerDuty · Jira

A global auto finance company was running 10 separate monitoring and ITSM tools: Splunk, ThousandEyes, Datadog, Dynatrace, AppDynamics, Azure Monitor, Elastic, ServiceNow, PagerDuty, and Jira. When something broke, the on-call engineer spent the first 20–30 minutes just opening tabs: pulling the same time window across all 10 consoles, eyeballing which graphs moved first, and cross-referencing a separate change calendar to guess at the cause.

Mean Time to Detect (MTTD) for classified incidents stretched into hours. The client's target: cut P1 incident volume by 60% and get MTTD for classified incidents under 5 minutes. The constraint: read-only access. The client's security team would not allow an AI system to make any changes to production infrastructure.

STEP 1 — Multi-Source Data Architecture

Built a normalization layer on Unframe's Framery OS that mapped each of the 10 sources into a common signal schema (timestamp, entity, metric/event type, severity, source confidence) via the Knowledge Fabric, without requiring the client's IT org to change a single existing tool or workflow. Connectors were read-only by design: Splunk via the scheduled search API; ThousandEyes/Datadog/Dynatrace via REST polling; AppDynamics via a custom polling adapter; and ServiceNow/Jira/PagerDuty via REST for enrichment context.
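To make the common signal schema concrete, here is a minimal Python sketch of what such a record and a per-source mapper could look like. The five schema fields come from the description above; everything else (the `Signal` class, the `SEVERITY_MAP` vocabularies, the raw field names `epoch_s`, `entity`, `type`, `level`) is hypothetical and does not reflect the actual Framery OS or vendor payload formats.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Signal:
    """Common schema: one record per observation, regardless of source."""
    timestamp: datetime        # always stored in UTC
    entity: str                # host, service, or network path
    signal_type: str           # metric or event type
    severity: float            # normalized to the range 0.0-1.0
    source: str                # originating tool
    source_confidence: float   # trust placed in this source, 0.0-1.0


# Hypothetical per-source severity vocabularies mapped onto one scale.
SEVERITY_MAP = {
    "datadog": {"ok": 0.0, "warn": 0.5, "alert": 1.0},
    "splunk": {"info": 0.1, "medium": 0.5, "critical": 1.0},
}


def normalize(source: str, raw: dict, source_confidence: float) -> Signal:
    """Map a raw record (hypothetical field names) into the common schema."""
    return Signal(
        timestamp=datetime.fromtimestamp(raw["epoch_s"], tz=timezone.utc),
        entity=raw["entity"].lower(),  # canonical casing so entities join across tools
        signal_type=raw["type"],
        severity=SEVERITY_MAP[source][raw["level"]],
        source=source,
        source_confidence=source_confidence,
    )
```

The mapper only reads and reshapes data, which is consistent with the read-only constraint: nothing in this path writes back to a source system.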

STEP 2 — Noise Reduction and Anomaly Scoring

Raw signals from 10 systems without correlation logic produce noise, not insight. Co-defined the anomaly taxonomy with the client's SRE leads: what counts as a real signal vs. background noise, per source. A Datadog CPU spike means something different from a ThousandEyes path-loss event, so the model needed source-specific baselines, not one global threshold. The ML anomaly scoring layer applied those baselines continuously, surfacing only events that crossed the source-calibrated threshold.
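The idea of source-specific baselines can be illustrated with a small Python sketch. It uses a plain z-score as a stand-in for the ML scoring layer, which is an assumption made for illustration and not the production model; the `SourceBaseline` class, the sample histories, and the thresholds are all hypothetical.

```python
from statistics import mean, pstdev


class SourceBaseline:
    """Per-source baseline: each source gets its own history and threshold."""

    def __init__(self, history, z_threshold):
        self.mu = mean(history)
        # Guard against a zero standard deviation on flat histories.
        self.sigma = pstdev(history) or 1e-9
        self.z_threshold = z_threshold

    def score(self, value):
        """Distance from this source's normal, in standard deviations."""
        return abs(value - self.mu) / self.sigma

    def is_anomalous(self, value):
        return self.score(value) >= self.z_threshold


# Hypothetical histories: CPU percent vs. network path loss percent.
baselines = {
    "datadog.cpu": SourceBaseline([40, 42, 38, 41, 39], z_threshold=3.0),
    "thousandeyes.path_loss": SourceBaseline([0.0, 0.1, 0.0, 0.2, 0.2], z_threshold=4.0),
}
```

With these numbers, a path loss of 2.0 is far outside the ThousandEyes baseline, while a 2-point move in CPU is ordinary for the Datadog one. A single global threshold on raw values could not separate the two cases.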

STEP 3 — Automated Root Cause Analysis

The auto-RCA engine correlated anomalies across all 10 normalized signal streams to identify the likely originating event: the signal that moved first, with the highest confidence score. It then generated a structured RCA report: originating signal, correlated downstream effects, change events from ServiceNow/Jira that coincided with the window, and a confidence score. The report went to the on-call engineer as the first thing they saw when paged, not after 20 minutes of tab-switching.
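A simplified Python sketch of that correlation step follows. It assumes "moved first, with the highest confidence" means the earliest anomaly, with ties broken by confidence; that reading, the `Anomaly` and `ChangeEvent` types, the `build_rca` helper, and the 15-minute lookback window are all illustrative assumptions, not the engine's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    ts: float          # epoch seconds
    entity: str
    source: str
    confidence: float  # 0.0-1.0


@dataclass
class ChangeEvent:
    ts: float          # epoch seconds
    ticket: str        # e.g. a ServiceNow or Jira key (hypothetical)


def build_rca(anomalies, changes, lookback_s=900):
    """Pick the earliest anomaly (ties broken by confidence) and assemble a report."""
    if not anomalies:
        return None
    origin = min(anomalies, key=lambda a: (a.ts, -a.confidence))
    # Everything else in the window is treated as a downstream effect.
    downstream = sorted((a for a in anomalies if a is not origin), key=lambda a: a.ts)
    # Change events that landed shortly before the originating signal.
    related = [c for c in changes if origin.ts - lookback_s <= c.ts <= origin.ts]
    return {
        "originating_signal": origin,
        "downstream_effects": downstream,
        "coinciding_changes": related,
        "confidence": origin.confidence,
    }
```

The returned dictionary mirrors the four parts of the report described above, so it could be rendered directly into the page the on-call engineer sees.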

STEP 4 — Continuous Learning Loop

SRE leads scored every generated RCA for accuracy during a 2-week shadow-mode run before cutover — was the identified root cause correct, was the confidence level calibrated, did the correlated signals reflect real dependencies? Their feedback updated the source-specific baselines and correlation weights. After cutover, the same feedback loop continued: on-call engineers confirmed or corrected each RCA, and the model updated accordingly. Each incident made the next detection faster.
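One simple way such a feedback loop could adjust correlation weights is an exponential moving average, sketched below in Python. This is an assumed update rule chosen for illustration; the source does not say how the model was updated, and `update_weight`, `apply_feedback`, the default weight of 0.5, and the learning rate are hypothetical.

```python
def update_weight(weight, confirmed, learning_rate=0.1):
    """Nudge a source's weight toward 1.0 on a confirmed RCA, toward 0.0 on a correction."""
    target = 1.0 if confirmed else 0.0
    return weight + learning_rate * (target - weight)


def apply_feedback(weights, feedback, learning_rate=0.1):
    """Apply a batch of (source, confirmed) reviews without mutating the input."""
    updated = dict(weights)
    for source, confirmed in feedback:
        current = updated.get(source, 0.5)  # unseen sources start neutral
        updated[source] = update_weight(current, confirmed, learning_rate)
    return updated


# Example: two confirmations for one source, one correction for another.
reviews = [("datadog", True), ("datadog", True), ("splunk", False)]
new_weights = apply_feedback({"datadog": 0.5}, reviews)
```

Because each review moves a weight only a fraction of the way, a single wrong score from a reviewer cannot swing the model, while a consistent pattern of corrections steadily lowers a source's influence.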

RESULTS

10 previously siloed monitoring and ITSM systems unified → 1 console

MTTD for classified incidents: hours → under 5 minutes

P1 incident volume target: 60% reduction from ~500 per year

False positive suppression: automated cross-source confirmation — no manual triage

On-call tooling: 10 consoles → 1 unified triage view

2-week shadow-mode RCA accuracy scoring by SRE leads before production cutover

Next Project

Banking Agent OS

A white-label AI operating system for community banks — 10 systems unified behind a dual tool marketplace

AI & Automation
