10 siloed monitoring systems, ~500 P1 incidents per year, MTTD measured in hours. One auto-RCA agent that cut detection to under 5 minutes — validated in shadow mode by SRE leads before a single line ran in production.
Company:
Global Auto Finance Leader
My Role:
AI Product Manager, Enterprise Solutions (Unframe AI)
Year:
2026
Tech Stack:
Multi-Source Log Ingestion · Signal Normalization · ML Anomaly Scoring · Automated Root Cause Analysis · Agent Orchestration (Framery OS) · Splunk · ThousandEyes · Datadog · Dynatrace · AppDynamics · Azure Monitor · Elastic · ServiceNow · PagerDuty · Jira
A global auto finance company was running 10 separate monitoring and ITSM tools: Splunk, ThousandEyes, Datadog, Dynatrace, AppDynamics, Azure Monitor, Elastic, ServiceNow, PagerDuty, and Jira. When something broke, an on-call engineer spent the first 20–30 minutes just opening tabs: pulling the same time window across all 10 consoles, eyeballing which graphs moved first, and cross-referencing a separate change calendar to guess at the cause.
Mean Time to Detect (MTTD) for classified incidents stretched into hours. The client's targets: cut P1 incident volume by 60% and bring MTTD for classified incidents under 5 minutes. The constraint: read-only access. Their security team would not allow an AI system to make any changes to production infrastructure.
STEP 1 — Multi-Source Data Architecture
Built a normalization layer on Unframe's Framery OS that mapped each of the 10 sources into a common signal schema (timestamp, entity, metric/event type, severity, source confidence) via the Knowledge Fabric, without requiring the client's IT org to change a single existing tool or workflow. Connectors were read-only by design: Splunk via the scheduled search API; ThousandEyes, Datadog, and Dynatrace via REST polling; AppDynamics via a custom polling adapter; and ServiceNow, Jira, and PagerDuty via REST for enrichment context.
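To make the common signal schema concrete, here is a minimal Python sketch of what a normalized record and a per-source mapper could look like. It is an illustration only, not the Framery OS implementation: the `Signal` class, the 0 to 4 severity scale, the `SEVERITY_MAP` vocabularies, and the raw field names (`epoch_s`, `entity`, `type`, `level`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Signal:
    """Common signal schema, using the five fields named in Step 1."""
    timestamp: datetime       # always stored in UTC
    entity: str               # host, service, or network path
    kind: str                 # metric or event type
    severity: int             # 0 (info) to 4 (critical); the scale is an assumption
    source: str               # originating tool
    source_confidence: float  # 0.0 to 1.0


# Hypothetical per-source severity vocabularies, mapped onto one shared scale.
SEVERITY_MAP = {
    "datadog": {"ok": 0, "warn": 2, "alert": 4},
    "splunk": {"low": 1, "medium": 2, "high": 3, "critical": 4},
}


def normalize(source: str, raw: dict, confidence: float) -> Signal:
    """Map one raw record (hypothetical field names) into the common schema."""
    return Signal(
        timestamp=datetime.fromtimestamp(raw["epoch_s"], tz=timezone.utc),
        entity=raw["entity"].lower(),
        kind=raw["type"],
        severity=SEVERITY_MAP[source][raw["level"]],
        source=source,
        source_confidence=confidence,
    )


sig = normalize(
    "datadog",
    {"epoch_s": 1767225600, "entity": "Payments-API", "type": "cpu.user", "level": "alert"},
    confidence=0.9,
)
print(sig)
```

Keeping the record immutable and the timestamp in UTC means every downstream stage can compare signals from different tools without knowing where they came from.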
STEP 2 — Noise Reduction and Anomaly Scoring
Raw signals from 10 systems without correlation logic produce noise, not insight. Co-defined the anomaly taxonomy with the client's SRE leads: what counts as a real signal vs. background noise, per source. A Datadog CPU spike means something different from a ThousandEyes path-loss event, so the model needed source-specific baselines, not one global threshold. The ML anomaly scoring layer applied those baselines continuously, surfacing only events that crossed their source-calibrated threshold.
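The idea of source-specific baselines can be sketched with a simple z-score, where each source keeps its own history and its own threshold. This is a stand-in for illustration: the production scoring layer was an ML model whose details are not described here, and the `BASELINES` keys, sample histories, and thresholds below are invented values.

```python
from statistics import mean, stdev

# Hypothetical per-source calibration: each source has its own recent history
# and its own z-score threshold instead of one global cutoff.
BASELINES = {
    "datadog.cpu": {"history": [41.0, 43.0, 39.0, 42.0, 40.0], "threshold": 3.0},
    "thousandeyes.path_loss": {"history": [0.1, 0.0, 0.2, 0.1, 0.1], "threshold": 2.0},
}


def anomaly_score(source: str, value: float) -> float:
    """Distance from the source's own mean, in units of its own std deviation."""
    history = BASELINES[source]["history"]
    sigma = stdev(history)
    if sigma == 0:
        return 0.0  # a flat history gives no scale to measure against
    return abs(value - mean(history)) / sigma


def is_anomalous(source: str, value: float) -> bool:
    """Surface only events that cross the source-calibrated threshold."""
    return anomaly_score(source, value) >= BASELINES[source]["threshold"]


# A 3-point move is ordinary for CPU but a 0.4-point move is large for path loss.
print(is_anomalous("datadog.cpu", 44.0))
print(is_anomalous("thousandeyes.path_loss", 0.5))
```

With these sample numbers the CPU reading stays below its threshold while the much smaller absolute change in path loss crosses its own, which is the behavior a single global threshold cannot express.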
STEP 3 — Automated Root Cause Analysis
The auto-RCA engine correlated anomalies across all 10 normalized signal streams to identify the likely originating event: the one signal that moved first, with the highest confidence score. It generated a structured RCA report: originating signal, correlated downstream effects, change events from ServiceNow/Jira that coincided with the window, and a confidence score. The report went to the on-call engineer as the first thing they saw when paged, not after 20 minutes of tab-switching.
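A simplified Python sketch of this correlation step follows. It reflects one reading of "moved first, with the highest confidence": take the earliest anomaly above a confidence floor and break timestamp ties by confidence. The `Anomaly` and `ChangeEvent` types, the 15-minute window, the 0.5 floor, and the `CHG-EXAMPLE` ticket are hypothetical; the actual engine's correlation logic is not specified in this write-up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Anomaly:
    ts: datetime
    source: str
    entity: str
    confidence: float


@dataclass(frozen=True)
class ChangeEvent:
    ts: datetime
    ticket: str  # hypothetical ServiceNow/Jira identifier


def build_rca(anomalies, changes, window=timedelta(minutes=15), min_confidence=0.5):
    """Return a structured RCA report, or None if nothing is confident enough."""
    candidates = [a for a in anomalies if a.confidence >= min_confidence]
    if not candidates:
        return None
    # Earliest signal wins; among equal timestamps, the more confident one wins.
    origin = min(candidates, key=lambda a: (a.ts, -a.confidence))
    downstream = sorted(
        (a for a in anomalies
         if a is not origin and origin.ts <= a.ts <= origin.ts + window),
        key=lambda a: a.ts,
    )
    # Changes that landed shortly before the originating signal.
    coincident = [c for c in changes if origin.ts - window <= c.ts <= origin.ts]
    return {
        "originating_signal": origin,
        "downstream_effects": downstream,
        "coinciding_changes": coincident,
        "confidence": origin.confidence,
    }


t0 = datetime(2026, 1, 1, 12, 0)
anomalies = [
    Anomaly(t0 + timedelta(minutes=2), "datadog", "payments-api", 0.8),
    Anomaly(t0, "thousandeyes", "edge-path-1", 0.9),
    Anomaly(t0 - timedelta(minutes=1), "elastic", "batch-job", 0.2),  # below the floor
]
changes = [ChangeEvent(t0 - timedelta(minutes=5), "CHG-EXAMPLE")]
report = build_rca(anomalies, changes)
print(report["originating_signal"].source)  # prints "thousandeyes"
```

The low-confidence Elastic anomaly is earlier but is filtered out before the origin is chosen, so a noisy early blip does not get blamed for the incident.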
STEP 4 — Continuous Learning Loop
SRE leads scored every generated RCA for accuracy during a 2-week shadow-mode run before cutover — was the identified root cause correct, was the confidence level calibrated, did the correlated signals reflect real dependencies? Their feedback updated the source-specific baselines and correlation weights. After cutover, the same feedback loop continued: on-call engineers confirmed or corrected each RCA, and the model updated accordingly. Each incident made the next detection faster.
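As a rough illustration of how reviewer feedback could feed back into the model, the sketch below nudges a correlation weight toward 1.0 when an RCA is confirmed and toward 0.0 when it is corrected, and measures how far stated confidence sits from observed accuracy. The update rule, the learning rate, and the function names are assumptions for the example; the write-up does not specify how the production model incorporated feedback.

```python
def update_weight(weight: float, confirmed: bool, lr: float = 0.1) -> float:
    """Move a correlation weight a small step toward 1.0 (confirmed) or 0.0 (corrected)."""
    target = 1.0 if confirmed else 0.0
    return weight + lr * (target - weight)


def calibration_gap(reviews: list[tuple[float, bool]]) -> float:
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    stated = sum(conf for conf, _ in reviews) / len(reviews)
    observed = sum(1 for _, correct in reviews if correct) / len(reviews)
    return stated - observed


# Hypothetical weight for one source pair, updated by two reviews.
weights = {("thousandeyes", "datadog"): 0.5}
pair = ("thousandeyes", "datadog")
weights[pair] = update_weight(weights[pair], confirmed=True)   # reviewer confirmed the RCA
weights[pair] = update_weight(weights[pair], confirmed=False)  # reviewer corrected the next one

# Each review: (confidence the RCA reported, whether the reviewer judged it correct).
reviews = [(0.9, True), (0.8, True), (0.9, False), (0.6, True)]
print(round(weights[pair], 3), round(calibration_gap(reviews), 3))
```

A small learning rate keeps any single review from swinging the weights, which matters when feedback comes from different engineers across shifts.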
RESULTS
10 previously siloed monitoring and ITSM systems unified → 1 console
MTTD for classified incidents: hours → under 5 minutes
P1 incident volume target: 60% reduction from ~500 per year (to roughly 200)
False positive suppression: automated cross-source confirmation — no manual triage
On-call tooling: 10 consoles → 1 unified triage view
2-week shadow-mode RCA accuracy scoring by SRE leads before production cutover


