Event Enrichment : Linux : High I/O Alert


Here’s another addition to our real-life Enrichment series: the High I/O threshold warning event.

Enhancements / Comments are welcome!

Name:

High I/O Alert

Escalation:

Send to SYS on-call team  <== add your escalation destination

Remediation:

NOTE: Use copy/paste to apply the commands shown after each “>>>” in your SSH session.

SSH to samplehost.acme.com <== the host having the problem

>>> ssh samplehost.acme.com

Verify disk I/O on samplehost.acme.com using:

>>> iostat -d -c 5

Linux 3.8.6-ubuntu-12-opt (turbo-srv) 02/26/2014 _x86_64_ (1 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
          0.36  0.00  0.08     0.11   0.00 99.45

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
vda      0.67 8.01     3.66     16532906 7546340
vdb      0.00 0.00     0.00       3231     424

Record 10 iterations of iostat and send the results to the SYS on-call group.
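The recording step can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names record_iostat and busy_devices, the default output path, and the tps limit are hypothetical and not part of the original enrichment. It relies on iostat accepting an interval followed by a count, and on tps being the second column of the device report, as in the sample output.

```shell
# Capture `count` iostat reports, `interval` seconds apart, into a file.
# Function name and default output path are illustrative assumptions.
record_iostat() {
  interval=${1:-5}
  count=${2:-10}
  outfile=${3:-/tmp/iostat-$(hostname)-$(date +%Y%m%d-%H%M%S).log}
  iostat -d -c "$interval" "$count" > "$outfile" && echo "$outfile"
}

# Read iostat output on stdin and print "device tps" for every device
# whose tps (second column) exceeds the given limit.
busy_devices() {
  awk -v limit="$1" '
    /^Device/ { in_dev = 1; next }   # start of a device report
    /^$/      { in_dev = 0 }         # a blank line ends the report
    in_dev && $2 + 0 > limit + 0 { print $1, $2 }
  '
}
```

For example, record_iostat 5 10 would write ten reports taken five seconds apart and print the file name to attach to the escalation; note that the first iostat report covers the time since boot. Piping that file into busy_devices 100 would list devices above a hypothetical limit of 100 transfers per second.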

 


If you are interested in substantially cutting down on the time necessary to implement Event Enrichments, check out the new Event Enrichment Platform.

 

Why is Event Enrichment fundamental to effective IT Operations?


Monitoring systems vary widely in their capabilities. However, one thing all monitoring systems share is their objective: providing a meaningful event at the right time, so that IT teams can fix problems as quickly as possible.

A proper monitoring system has additional intricacies, such as proactive monitoring and automatic actions. At its most basic level, however, a monitoring system must keep us aware of what is going on in our IT environment.

Monitoring System Types

There are two main types of monitoring systems:

  • Status-based monitoring – This type of monitoring system compares the status of a monitored node to a predefined desired result or configured threshold. It then provides a “status” indicating whether or not the node is in its normal/expected state. The status will look something like this: “red” for bad, “green” for good, “yellow” for a warning (e.g., something is wrong but the node is not completely down). Generally, this type of system does not provide any additional information.

  • Event-based monitoring – This type of monitoring system provides an “event” which indicates the severity / relative importance of the problem, as well as a brief description of the problem.
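As a rough illustration of the difference between the two types, the sketch below maps the same reading to a status and to an event. The function names, the integer percentage readings, and the thresholds are all assumptions made for the example; real monitoring systems implement this in their own ways.

```shell
# Status-based: map an integer reading to a colour using
# hypothetical warning and critical thresholds.
status_for() {
  value=$1 warn=$2 crit=$3
  if [ "$value" -ge "$crit" ]; then
    echo red
  elif [ "$value" -ge "$warn" ]; then
    echo yellow
  else
    echo green
  fi
}

# Event-based: emit a severity and a brief description instead of a
# colour; prints nothing while the reading is below the warning level.
event_for() {
  node=$1 metric=$2 value=$3 warn=$4 crit=$5
  if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL $node: $metric is ${value}% (threshold ${crit}%)"
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING $node: $metric is ${value}% (threshold ${warn}%)"
  fi
}
```

With thresholds of 75 and 90, a reading of 95 gives the status “red” from status_for, while event_for samplehost.acme.com CPU 95 75 90 gives a one-line event carrying the severity and a description.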

To increase visibility into an IT environment, the two methods can also be combined. In this case, the final result is an event that includes a status, a severity, and a brief description. However, a critical piece is still missing.

When there is a problem, it is usually associated with a loss of functionality or service. Regardless of the specifics, an organization’s goal is to resolve the issue as quickly as possible, that is, to reduce MTTR (Mean Time To Repair). A status indication, a severity, or a short description of the problem provides neither the cause of the problem nor the steps required to resolve it.

Let’s say you are monitoring a server’s CPU usage in a status-based monitoring system. When the usage exceeds the configured threshold, the server’s status will be “red”. A “red” status could indicate that the server is down, or that it is overloaded but still up and running. The monitoring system does not provide these details. Taking this a step further, the “red” status will also appear if an application fails on the server; in this case, the problem is NOT the server itself but the application.

These examples highlight the fact that when an event shows up on a given monitoring system, it can take time to identify exactly what the problem is, and then additional time to pinpoint the fix. Even in cases where there is a description of the problem (e.g., CPU usage is above 90%), it will take time to figure out how to resolve the issue.

Increasing NMS efficiency

There are several ways we can increase the efficiency of monitoring systems; some apply to basic monitoring systems, and others to the more advanced.

  • For basic monitoring systems, we can add more information to the event. This information should provide details as to why the event was raised, what it indicates, as well as the steps that should be taken to resolve the issue. Providing this information will make the event easier to understand, and will shorten the time to resolution.

  • For example: in the event of high CPU load, the information could look like this:

“Use the task manager to detect the affecting process and restart the process/service that consumes the highest CPU. If this action doesn’t help, contact the system administrator”

  • For more advanced monitoring systems, we can create a runbook containing explicit instructions detailing what to do for each particular event that might occur. This runbook could be used for manual execution; or, in even more advanced systems, could be part of an automated system that will fix the problem on its own, or enrich the event with remediation information and then escalate the alert to the correct person.

  • For example: Assume we receive an event stating that a node’s SNMP daemon is down. The event could be enriched to provide the initial triage steps from the runbook *prior* to being sent to the NOC. The NOC, upon receipt of the event, would carry out the embedded steps and then add the results of those triage steps to their escalation to on-call engineering. Doing so would save time in two places: at the NOC level, where the initial runbook lookup is no longer needed, and at the on-call engineer’s level, where the triage results are already included in the escalation.
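The enrichment idea described above can be sketched as a lookup that attaches escalation and remediation text to an event before it is forwarded. This is a minimal sketch, not how any particular monitoring product does it: the function name enrich_event, the output layout, and the fallback entry are assumptions, while the two event names and their commands are taken from the enrichments in this series.

```shell
# Hypothetical enrichment lookup keyed on the event name.
# The fallback escalation to the NOC is an assumption for illustration.
enrich_event() {
  event_name=$1
  host=$2
  case "$event_name" in
    "High I/O Alert")
      escalation="SYS on-call team"
      remediation="ssh $host; run 'iostat -d -c 5'; record 10 iterations"
      ;;
    "Mailq threshold warning")
      escalation="SYS on-call team"
      remediation="ssh $host; run 'mailq'; inspect with 'postcat -q <queue id>'"
      ;;
    *)
      escalation="NOC"
      remediation="No runbook entry found; triage manually"
      ;;
  esac
  # Emit the enriched event: original details plus the runbook text.
  printf 'Event: %s on %s\n' "$event_name" "$host"
  printf 'Escalation: %s\n' "$escalation"
  printf 'Remediation: %s\n' "$remediation"
}
```

Calling enrich_event "High I/O Alert" samplehost.acme.com prints the event together with its escalation target and triage steps, which is the text the NOC would otherwise have to look up in the runbook.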

In our business, saving time = saving money. Every minute we shave off our event remediation efforts equates to recovering lost revenue. It is for this reason that we make extensive use of Event Enrichment in our day-to-day Operations at Playtech.

Eli Eyal – OSS Group Manager at Playtech



 

Eli has 15 years of experience in implementing advanced monitoring systems in companies around the world. Currently, Eli manages a multi-national monitoring team that designs, implements, and maintains Playtech’s monitoring systems.

 


Playtech is the world’s largest publicly-traded online gaming software supplier. Founded in 1999 and based on the Isle of Man, Playtech develops unified software platforms and content for the online and land-based gaming industries, and provides a range of ancillary services such as marketing, hosting and Customer Relationship Management (CRM). Its best-of-breed product suite includes casino, casual games, sports betting, live gaming, lottery, bingo and one of the world’s largest Poker networks. Web applications lie at the heart of Playtech’s gaming business and are the company’s primary revenue source. Flawless performance and 99.999% uptime are critical, especially during peak usage times such as sports races and other special events. Even minor performance glitches can spoil users’ gaming experiences and disrupt revenue streams.


Event Enrichment : Linux : Mailq threshold warning


Here’s another addition to our real-life Enrichment series: the Mailq threshold warning event.

Enhancements / Comments are welcome!

Name:

Mailq threshold warning

Escalation:

Send to SYS on-call team  <== add your escalation destination

Remediation:

Note: Copy/Paste the commands after >>> into your ssh session

1) Log in to mail-01.acme.com <== replace with your host

>>> ssh ops@mail-01.acme.com

2) Issue the ‘mailq’ command to view the queue

>>> mailq

Response should be similar to:

-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
AAE603481BD 3471 Thu Oct 18 14:44:30 user@domain.com
(connect to outside.com[xxx.xxx.xxx.xxx]: Connection timed out)
user2@outside.com

The Queue ID in the first column (AAE603481BD in this example) identifies the message that has a problem.

3) Review the email and investigate why it is not being sent by issuing the postcat command

>>> postcat -q AAE603481BD

4) If the email should be deleted, use the postsuper command

>>> postsuper -d AAE603481BD

OPTIONAL:

Sample script to delete no-reply emails:

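# Delete queued messages whose mailq entry matches do-not-reply@domain.com.
# tr strips the '*' or '!' marker that mailq may append to a queue ID.
for i in $(mailq | grep -B1 'do-not-reply@domain.com' | grep '^[A-Z0-9]' | awk '{print $1}' | tr -d '*!')
do
  postsuper -d "$i"
done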
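Before deleting anything, it can help to list the queue IDs that would be affected. The following is a hedged sketch of such a dry run: the function name queue_ids_matching is hypothetical, and it assumes the mailq layout shown above, where each entry starts with the queue ID in the first column and entries are separated by blank lines. Unlike the grep -B1 approach, it also matches an address that appears several lines below the queue ID.

```shell
# Read mailq output on stdin and print the queue ID of every entry
# that mentions the given address (as sender or recipient).
queue_ids_matching() {
  awk -v addr="$1" '
    /^[0-9A-Za-z]/ {            # entry header line: remember its queue ID
      id = $1
      sub(/[*!]$/, "", id)      # drop the active/hold marker, if any
      seen = 0
    }
    id != "" && !seen && index($0, addr) { print id; seen = 1 }
    /^$/ { id = "" }            # a blank line ends the entry
  '
}
```

Running mailq | queue_ids_matching 'do-not-reply@domain.com' prints one queue ID per line; once the list has been reviewed, each ID can be passed to postsuper -d as in step 4.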

 

