The Beginner’s Guide to Event Enrichment


Overview

Event Enrichment allows IT Managers and Operators to enhance their actionable IT Operations events with pertinent information such as escalation and remediation procedures. An additional benefit is a substantial decrease in the noise coming from their Network Management Systems (NMSes). The Network Operations Center (NOC) is tasked with handling and routing events on a 24×7 basis throughout the organization’s infrastructure. Hardware failures (in switches, routers, and servers), performance problems in applications, and security alerts are all examples of the types of events that NOCs handle.

Events from these sources are generated by the NMS. One of the major challenges in keeping the NOC at peak performance is that the NOC team becomes inundated with large numbers of events. Without a significant amount of tuning, most, if not all, NMSes generate far too much noise. Eliminating this noise is a fundamental benefit of Event Enrichment. The Event Enrichment process consists of three steps: Classification, Triage, and Enrichment.


Classification

To begin the Classification portion of Event Enrichment, group events according to Role (web server, app server, network device, etc.), Severity (Critical, High, Low, etc.), and Frequency (how often the event occurs). If your NMS provides a method of exporting events to a CSV or XLS file, choose a set of events (all events from the previous week is a good starting point) and export them to a file. Group each event into its appropriate category and create a tab in the spreadsheet for each category. Once this step is complete, sort the events in each tab in descending order of frequency. This spreadsheet then becomes the basis for your triage sessions.
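If you would rather script the grouping than build the spreadsheet by hand, the classification step can be sketched in a few lines of Python. This is only an illustration: the column names (role, severity, event) and the sample rows are assumptions, and a real NMS export will almost certainly use different headers.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical export: one row per event occurrence. The column names are
# assumptions; adjust them to match whatever your NMS actually exports.
SAMPLE_EXPORT = """role,severity,event
web server,Critical,HTTP check failed
web server,Critical,HTTP check failed
web server,Low,Disk usage above 80%
network device,High,Interface down
web server,Critical,HTTP check failed
network device,High,Interface down
"""


def classify(csv_file):
    """Group events by (role, severity) and count how often each one occurs."""
    groups = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        groups[(row["role"], row["severity"])][row["event"]] += 1
    # most_common() returns (event, count) pairs, highest count first,
    # which matches the "descending order of frequency" sort.
    return {category: counts.most_common() for category, counts in groups.items()}


tabs = classify(io.StringIO(SAMPLE_EXPORT))
for (role, severity), events in tabs.items():
    print(f"[{role} / {severity}]")
    for event, frequency in events:
        print(f"  {frequency:>3}  {event}")
```

Each key in tabs plays the part of one spreadsheet tab. To run this against a real export, pass an open file object to classify() instead of the sample string.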

Triage

The Triage portion of Event Enrichment typically takes place in a series of meetings between the NOC Manager and the relevant Operations groups (Systems / Network / DevOps). In order to accelerate the process, these meetings should occur twice a week during the early phase of the Event Enrichment cycle. Review each event in the chosen spreadsheet tab (beginning with the most frequent) and assign each a tag of ACTION, FIX, or SUPPRESS.

You can check out a sample event triage spreadsheet here.

Enrichment

The final step of Event Enrichment is to enrich all ACTION events, suppress all SUPPRESS events (noise that is irrelevant to normal operations), and rectify all FIX events (misconfigurations or known transient problems). For each ACTION event, add its associated escalation path and remediation information, and configure the NMS to enrich any subsequent events of the same type.

Below is an example of such an enriched event from the Nagios NMS:


From: nagios@acme.com
To: ops@acme.com
Date: Sunday, September 15, 2013 2:45:58 PM
Subject: ** SERVICE: web01 : HTTP check is CRITICAL! **

***** Nagios *****

This is a PROBLEM notice that the HTTP check on web01 is CRITICAL!

Host IP is 172.16.0.224
Performance Data: Connection refused
Duration is 0d 0h 3m 10s
Date/Time: Sun Sept 15 11:45:58 UTC 2013

Nagios Summary
Total Unhandled Host Problems:0
Total Unhandled Service Problems:2

ESCALATION: 
The web server for web01 is unreachable, which is a CRITICAL problem. Escalate this issue to the on-call Systems Engineering team NOW.

REMEDIATION: 
1) Attempt to ping the host from the OPS1 server
2) If ping is successful, attempt to SSH to the host (ops1@172.16.0.224)
3) If SSH is unsuccessful, immediately initiate the steps listed in the WEB_SRVR_DOWN recipe in the Ops Runbook.
4) If SSH is successful, initiate the steps listed in the NGINX_DOWN recipe in the Ops Runbook
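
The mechanics behind a notice like the one above can be sketched as a lookup followed by an append. The Python below is a rough illustration, not Nagios configuration: the ENRICHMENTS table, its (service, state) key, and the enrich() helper are all hypothetical names, and the table would be filled in from the results of your triage sessions.

```python
from dataclasses import dataclass


@dataclass
class Enrichment:
    escalation: str
    remediation: list


# Hypothetical lookup table keyed by (service, state). In practice it would be
# populated with every event type that was tagged ACTION during triage.
ENRICHMENTS = {
    ("HTTP", "CRITICAL"): Enrichment(
        escalation="Escalate this issue to the on-call Systems Engineering team NOW.",
        remediation=[
            "Attempt to ping the host from the OPS1 server",
            "If ping is successful, attempt to SSH to the host",
        ],
    ),
}


def enrich(body, service, state, table=ENRICHMENTS):
    """Append ESCALATION and REMEDIATION sections when the event type is known."""
    entry = table.get((service, state))
    if entry is None:
        # No enrichment recorded for this event type: pass it through unchanged.
        return body
    steps = "\n".join(f"{n}) {step}" for n, step in enumerate(entry.remediation, 1))
    return f"{body}\n\nESCALATION:\n{entry.escalation}\n\nREMEDIATION:\n{steps}\n"


notice = "This is a PROBLEM notice that the HTTP check on web01 is CRITICAL!"
print(enrich(notice, "HTTP", "CRITICAL"))
```

Because the table is keyed by event type rather than by host, every later event of the same type picks up the same escalation and remediation text automatically.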

Implementation Timeframe

Weeks 1-2:   Enrichment session twice per week

Start meeting with service managers and domain experts and triage the top 10 most frequent events.

Triage refers to the process of deciding whether a given event is ACTION, FIX, or SUPPRESS. If the event is ACTION, mark it as requiring Enrichment.

Fix all issues (misconfiguration or known temporary issues) tagged as FIX. Ignore (do NOT send to the NOC) all events that are classified as SUPPRESS.
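
One way to picture how the three tags behave once triage decisions are recorded is as a small routing function. The sketch below is an assumption-heavy illustration: the TRIAGE table, the event names, and the destination labels ("noc-enriched", "fix-queue", "noc-untriaged") are all hypothetical, and sending untriaged events to the NOC by default is a design choice made here rather than something this guide prescribes.

```python
from enum import Enum


class Tag(Enum):
    ACTION = "ACTION"
    FIX = "FIX"
    SUPPRESS = "SUPPRESS"


# Hypothetical triage results recorded during the sessions: event name -> tag.
TRIAGE = {
    "HTTP check failed": Tag.ACTION,
    "Backup job overran window": Tag.FIX,
    "Interface flap on lab switch": Tag.SUPPRESS,
}


def route(event_name, triage=TRIAGE):
    """Return a destination label for an event, or None if it should be dropped."""
    tag = triage.get(event_name)
    if tag is Tag.SUPPRESS:
        return None  # Noise: do NOT send to the NOC.
    if tag is Tag.FIX:
        return "fix-queue"  # Assumed backlog for the owning Operations group.
    if tag is Tag.ACTION:
        return "noc-enriched"  # Enrich, then forward to the NOC.
    return "noc-untriaged"  # Assumed default: unseen events still reach the NOC.


for name in list(TRIAGE) + ["Unknown new event"]:
    print(f"{name!r} -> {route(name)}")
```

Keeping the default path open for untriaged events means nothing new is silently lost while the triage backlog is still being worked through.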

Weeks 3-8:  Enrichment sessions once per week

Resolve the 10 most frequent events at each session.

Month 3 and beyond: Event Enrichment session every two weeks

Over time, these sessions will get shorter, as there will probably be only a few events left to triage.


Implementing Event Enrichment on your NMS manually is time-consuming. If you want to enrich your events on a platform specifically built for Event Enrichment, check out the new Event Enrichment Platform from Event Enrichment HQ.

