- What is Event Enrichment?
Event Enrichment is the process of classifying events and adding the critical information that accelerates remediation. Enrichments typically include the name and contact details of the person or team responsible for the event source (application, server, router, switch, etc.), as well as any known steps for triage and remediation.
- Why would I use Event Enrichment?
To provide a live framework for your IT Operations Runbook and to decrease remediation time for network events. Most IT Operations Runbooks are static snapshots that are not continually updated. An outdated Runbook that is missing critical information causes complications, frustration, and increased time to repair at the worst possible time.
- Give me an example of the Event Enrichment process
An event arrives from a Network Management System (NMS) denoting the failure of a critical interface on a router. Upon receipt, we add escalation and remediation information to the event. The enriched event is then forwarded to the Network Operations Center (NOC) or on-call engineer responsible for the event source for remediation.
- What does an enriched event look like?
Notification Type: PROBLEM
Info: CRITICAL – Host Unreachable (184.108.40.206)
Date/Time: Sat Jan 16 11:09:23 JST 2013
This is a CRITICAL alert which needs immediate escalation to the site DEVOPS team. Use the PagerDuty DEVOPS service.
1) Attempt to ping the host from the Nagios server
2) If ping is successful, attempt to ssh to the host (email@example.com)
3) If ssh is not successful, initiate DB_HOST_DOWN recipe sequence
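As a rough illustration, an enriched event like the one above could be held as a simple record and rendered to text just before it is forwarded. The sketch below is only an assumption about how that might look: the field names (`notification_type`, `info`, `timestamp`, `escalation`, `steps`) and the `render_event` helper are hypothetical and do not come from any specific NMS or Nagios API.

```python
def render_event(event: dict) -> str:
    """Render an enriched event (hypothetical field names) as notification text."""
    lines = [
        f"Notification Type: {event['notification_type']}",
        f"Info: {event['info']}",
        f"Date/Time: {event['timestamp']}",
        event["escalation"],
    ]
    # Number the remediation steps 1), 2), 3) ... as in the example above.
    lines += [f"{n}) {step}" for n, step in enumerate(event["steps"], start=1)]
    return "\n".join(lines)


# Sample values copied from the example event above (shortened to two steps).
sample = {
    "notification_type": "PROBLEM",
    "info": "CRITICAL – Host Unreachable (184.108.40.206)",
    "timestamp": "Sat Jan 16 11:09:23 JST 2013",
    "escalation": "This is a CRITICAL alert which needs immediate escalation "
                  "to the site DEVOPS team.",
    "steps": [
        "Attempt to ping the host from the Nagios server",
        "If ping is successful, attempt to ssh to the host",
    ],
}

if __name__ == "__main__":
    print(render_event(sample))
```

Keeping the escalation text and the steps as separate fields means the same record can be rendered differently for email, chat, or a paging service.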
- How do I implement the Event Enrichment process?
Use the Event Enrichment cycle:
Triage your events:
Categorize events! Events are either actionable or noise. If they are actionable, they need enrichment. If they are noise, eliminate them.
Get rid of the noise! If noise is overwhelming your team (and yes, network management systems absolutely excel at generating noise), critical events get lost.
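The triage step can be sketched as a simple filter that splits incoming events into actionable events and noise. This is a minimal sketch under stated assumptions: the `Event` fields and the severity-based rule in `ACTIONABLE_SEVERITIES` are hypothetical, and a real deployment would likely use rules specific to its own NMS and environment.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str    # hypothetical: host or device that raised the event
    severity: str  # hypothetical: e.g. "CRITICAL", "WARNING", "INFO"
    message: str


# Assumed rule for illustration: only these severities count as actionable.
ACTIONABLE_SEVERITIES = {"CRITICAL", "WARNING"}


def is_actionable(event: Event) -> bool:
    """Return True if the event should be enriched rather than discarded."""
    return event.severity.upper() in ACTIONABLE_SEVERITIES


def triage(events):
    """Split events into (actionable, noise), preserving arrival order."""
    actionable, noise = [], []
    for event in events:
        (actionable if is_actionable(event) else noise).append(event)
    return actionable, noise


# Hypothetical sample events.
incoming = [
    Event("db-host-01", "CRITICAL", "Host Unreachable"),
    Event("switch-02", "INFO", "Interface counters reset"),
]
actionable, noise = triage(incoming)
```

Only the `actionable` list goes on to enrichment; the `noise` list can be counted and reviewed periodically to confirm that nothing important is being dropped.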
Enrich your events:
Add critical information! Now that you know that an event is actionable, do your team a giant favor and add critical escalation and remediation information to the event before it arrives at the NOC. If groggy engineers or NOC operators get a middle-of-the-night event, why should they have to start hunting for the escalation and remediation information? If it is already in the event, they can immediately start working on resolving the problem.
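The enrichment step can be sketched as a lookup against a live runbook keyed by event source and event type. This is an illustrative sketch, not a definitive implementation: the `RUNBOOK` structure, the `enrich` function, the host name `db-host-01`, and the `HOST_UNREACHABLE` type are all hypothetical, and the runbook text is adapted from the example event earlier in this document.

```python
# Hypothetical live runbook, keyed by (event source, event type).
RUNBOOK = {
    ("db-host-01", "HOST_UNREACHABLE"): {
        "escalation": "Escalate immediately to the site DEVOPS team. "
                      "Use the PagerDuty DEVOPS service.",
        "steps": [
            "Attempt to ping the host from the Nagios server",
            "If ping is successful, attempt to ssh to the host",
            "If ssh is not successful, initiate DB_HOST_DOWN recipe sequence",
        ],
    },
}


def enrich(event: dict, runbook: dict = RUNBOOK) -> dict:
    """Return a copy of the event with escalation and remediation info attached."""
    entry = runbook.get((event["source"], event["type"]), {})
    enriched = dict(event)  # leave the original event untouched
    enriched["escalation"] = entry.get(
        "escalation", "No runbook entry found for this event."
    )
    enriched["steps"] = list(entry.get("steps", []))
    return enriched


# Hypothetical incoming event.
event = {"source": "db-host-01", "type": "HOST_UNREACHABLE"}
enriched = enrich(event)
```

Because the runbook is plain data, it can be updated whenever a team or procedure changes, which is what keeps it live rather than a static snapshot. Events with no matching entry are still forwarded, with an explicit marker that the runbook has a gap to fill.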