Event Enrichment : Linux : High I/O Alert


Here’s another addition to our real-life Enrichment series: the High I/O threshold warning event.

Enhancements / Comments are welcome!

Name:

High I/O Alert

Escalation:

Send to SYS on-call team  <= Add your escalation destination

Remediation:

NOTE: Use copy/paste to apply the commands shown after the “>>>” below in your SSH session.

SSH to samplehost.acme.com <== the host having the problem

>>> ssh samplehost.acme.com

Verify disk I/O on samplehost.acme.com using:

>>> iostat -d -c 5

Linux 3.8.6-ubuntu-12-opt (turbo-srv) 02/26/2014 _x86_64_ (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.36    0.00    0.08    0.11    0.00   99.45

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               0.67         8.01         3.66   16532906    7546340
vdb               0.00         0.00         0.00       3231        424

Record 10 iterations of iostat and send the results to the SYS on-call group.
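One way to capture those 10 iterations is to pass iostat an interval and a count and save the output to a file that can be attached to the escalation. This is a minimal sketch: the helper name record_iostat, the 5-second interval, and the log file naming are assumptions, not part of the original enrichment.

```shell
#!/bin/sh
# Record a fixed number of iostat reports to a timestamped log file.
# record_iostat is a hypothetical helper; interval and count default to 5 and 10.
record_iostat() {
    interval="${1:-5}"
    count="${2:-10}"
    outfile="iostat-$(uname -n)-$(date +%Y%m%d-%H%M%S).log"
    # iostat <interval> <count>: the first report covers the time since boot,
    # each later report covers the preceding interval.
    iostat -d -c "$interval" "$count" > "$outfile"
    # Print the file name so it can be attached to the escalation.
    echo "$outfile"
}

# Usage: record_iostat 5 10
```

With the defaults the run takes a little under a minute, since the first report is printed immediately.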

 


If you are interested in substantially cutting down on the time necessary to implement Event Enrichments, check out the new Event Enrichment Platform.

 

Seven awesome Operations Tools that you should use


Network Operations Support tools we recommend.

We’ve compiled a list of the top operations support tools that we use in 24×7 operations environments. These tools are also (I know, strange, right?) well suited for Event Enrichment.

Zenoss is an extremely versatile NMS which is open source and highly configurable. It lends itself well to integrations of many kinds and is very scalable. Zenoss is a well-rounded NMS, one which handles both events and performance metrics.

Nagios, another open source NMS, is a widely used platform which has been around for many years. It is not as feature rich as Zenoss and its core offering does not include graphing. Extensibility is Nagios’ defining feature. It is very straightforward to add support for any device using the Nagios plugin architecture.

Zabbix, another heavyweight in the open source NMS arena, has a number of strengths including excellent graphing capabilities, support for a number of databases, and built-in web application monitoring.

PagerDuty is a fantastic escalation and alerting platform which supports guaranteed delivery of critical notifications (a critical trait for Operations). Their excellent API facilitates integration with existing applications.

HipChat is an Instant Messaging platform and is great for creating Operations rooms. It has native, web, and iOS clients, and is very easy to integrate with other systems (including Hubot). Real time communication with the various members of the on-call team / NOC using HipChat is extremely useful while working on problem remediation.

Hubot is a bot that integrates with a wide variety of messaging platforms, including HipChat, IRC, and Campfire. Its many plugins integrate with a range of useful services. In the past we’ve used this tool to integrate our Operations Support System (OSS) and Instant Messaging platforms.

The Event Enrichment Platform (EEP) is the world’s only platform specifically designed for Event Enrichment. Event Enrichment allows you to minimize downtime by injecting escalation and remediation information directly into your NMS alerts. With both email and PagerDuty notification options, the EEP gives you the information that you need when you need it most.

If you have recommendations for tools that you use for Operations support, add them in the comments!

Event Enrichment-How to fix the Zenoss AWS ZenPack 403 Forbidden Error


 

Zenoss AWS ZenPack 2.0 is a nice addition to the hundreds of existing ZenPacks. In order to use it, you must configure the appropriate AWS IAM privileges. The easiest way to do this is to give the Zenoss AWS user full (administrator) access, but that is also the least secure option. Instead, we recommend a restricted policy that grants access to the specific metrics required by the ZenPack, and nothing more.

In general, when the IAM permissions are missing or have been modified incorrectly, an AWS 403 Forbidden error will result.

Sample error:

2014-04-24 19:02:14,765 ERROR zen.AWS: Cust_7_VPC: AWS: 403 Forbidden

This event is useful, but sparse in terms of providing the information necessary for resolution. Utilizing Event Enrichment, we can dramatically cut down time to remediation. In an enriched event, the information required to properly triage the problem is already embedded in the initial incident alert.

The following Event Enrichment provides the steps to fix this problem.

 

EVENT ENRICHMENT – ZENOSS AWS ZENPACK 403 FORBIDDEN ERROR


REMEDIATION

A 403 Forbidden error typically signifies that the Amazon AWS IAM policy does not provide sufficient access to the CloudWatch metrics needed by the Zenoss AWS ZenPack.

Investigate the problem using the following triage steps:

  • Log in to the AWS IAM Console
  • Check the IAM permissions for the Zenoss user (created for the AWS ZenPack)
  • Click on the Zenoss user
  • Click on the Permissions tab
  • Click on the Policy under the User Policies section. The Policy should look like this:
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "rds:Describe*"
      ],
      "Resource": "*"
    }
  ]
}

If the policy does not look as it should, note the difference, and copy / paste the Policy into your escalation findings report.
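Spotting a missing action by eye is easy to get wrong, so the comparison can also be done mechanically. The sketch below assumes both policies have been saved to local JSON files (for example, pasted from the IAM console). The helper names policy_actions and missing_actions are hypothetical, and the grep pattern is a simple heuristic for strings shaped like "service:Action*", not a full JSON parser.

```shell
#!/bin/sh
# Extract the action strings (e.g. ec2:Describe*) from a policy file, sorted.
policy_actions() {
    grep -oE '"[a-z0-9-]+:[A-Za-z*]+"' "$1" | tr -d '"' | sort -u
}

# Print actions that appear in the expected policy but not in the actual one.
# $1 = expected policy file, $2 = actual policy file
missing_actions() {
    expected_list=$(mktemp)
    actual_list=$(mktemp)
    policy_actions "$1" > "$expected_list"
    policy_actions "$2" > "$actual_list"
    # comm -23 keeps only the lines unique to the first (expected) list.
    comm -23 "$expected_list" "$actual_list"
    rm -f "$expected_list" "$actual_list"
}

# Usage: missing_actions expected-policy.json actual-policy.json
```

Any action printed is one the Zenoss user lacks; an empty result means every expected action is present, in which case the cause of the 403 lies elsewhere.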

ESCALATION

Provide the SysEng team with the following information:

Original Event Summary:

2014-04-24 19:02:14,765 ERROR zen.AWS: Cust_7_VPC: AWS: 403 Forbidden

Verified Findings:

The expected IAM Zenoss user policy and the existing one on AWS do not match.

Expected:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "rds:Describe*"
      ],
      "Resource": "*"
    }
  ]
}

Actual:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*"
      ],
      "Resource": "*"
    }
  ]
}

 


 

So there you have it: an Event Enrichment targeted at resolving the Zenoss AWS ZenPack 403 Forbidden error. Remember, the few minutes you spend creating a new enrichment reduce the time it takes to fix the problem every time it occurs. Saving time is saving money!

Speaking of enrichments, I invite you to check out our new Event Enrichment Platform (EEP). This SaaS platform dramatically simplifies implementing Event Enrichment on a wide variety of NMS platforms, including Zenoss.

 

Why is Event Enrichment fundamental to effective IT Operations?


Monitoring systems vary widely in their capabilities. However, all monitoring systems share one objective: providing a meaningful event, at the right time, so that IT teams can fix problems as quickly as possible.

There are additional intricacies to a proper monitoring system, for example: proactive monitoring, automatic actions, and many others. At its most basic level, a monitoring system must keep us aware of what is going on in our IT environment.

Monitoring System Types

There are two main types of monitoring systems:

  • Status-based monitoring – This type of monitoring system compares the status of a monitored node to a predefined desired result, or configured threshold. It then provides a “status” indicating whether or not the node is in its normal/expected state. The status will look something like this: “red” for bad, “green” for good, “yellow” for a warning (e.g., something is wrong but not completely down). Generally, this type of system does not provide any additional information.

  • Event-based monitoring – This type of monitoring system provides an “event” which indicates the severity / relative importance of the problem, as well as a brief description of the problem.

In order to increase visibility into one’s IT environment, there is also the possibility of combining the two aforementioned methods. In this case, the final result would be an event that includes: status, severity, and a brief description. However, there is still a critical piece that is missing.

When there is a problem, it is usually associated with a loss of functionality or service. Regardless of the specifics, an organization’s goal is to resolve the issue as quickly as possible (also known as reduction of MTTR – Mean Time To Repair). A status indication, a severity, or a short description of the problem does not provide the cause of the problem, or the steps required for resolution.

Let’s say you are monitoring a server’s CPU usage in a status-based monitoring system. When the usage exceeds the configured threshold, the server’s status will be “red”. A “red” status could indicate that the server is down, but it could also indicate that the server is overloaded yet still up and running. These details are not provided by the monitoring system. Taking this a step further, the “red” status will also be visible if an application fails on the server; in this case, the problem is NOT the server itself but the application.

These examples highlight the fact that when an event shows up on a given monitoring system, it could take time to identify exactly what the problem is, and then take additional time to pinpoint the fix. Even in cases where there is a description of the problem (e.g., CPU usage is above 90%), it will take time to figure out how to resolve the issue.

Increasing NMS efficiency

There are several ways we can increase the efficiency of monitoring systems; some apply to basic monitoring systems, and others to the more advanced.

  • For basic monitoring systems, we can add more information to the event. This information should provide details as to why the event was raised, what it indicates, as well as the steps that should be taken to resolve the issue. Providing this information will make the event easier to understand, and will shorten the time to resolution.

  • For example: in the event of high CPU load, the information could look like this:

“Use the task manager to detect the affecting process and restart the process/service that consumes the highest CPU. If this action doesn’t help, contact the system administrator”

  • For more advanced monitoring systems, we can create a runbook containing explicit instructions detailing what to do for each particular event that might occur. This runbook could be used for manual execution; or, in even more advanced systems, could be part of an automated system that will fix the problem on its own, or enrich the event with remediation information and then escalate the alert to the correct person.

  • For example: Assume we receive an event stating a node’s SNMP daemon is down. The event could be enriched to provide the initial triage steps from the runbook *prior* to being sent to the NOC. The NOC, upon receipt of the event, would implement the embedded steps and then add the results of those triage steps to their escalation to on-call engineering. Doing so would save time in two places, at the NOC level for the initial runbook lookup and at the on-call engineer’s level as the results of the triage would already be included in the escalation.
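The enrichment step described above can be sketched as a small runbook lookup that appends remediation and escalation text to an event before it is forwarded. Everything here is illustrative: the function names runbook_for and enrich_event, the event summaries, and the remediation wording are assumptions, not the interface of any particular NMS.

```shell
#!/bin/sh
# Hypothetical runbook lookup: map an event summary to its triage steps.
runbook_for() {
    case "$1" in
        *"High CPU"*)
            echo "Find the process consuming the most CPU and restart it." ;;
        *"SNMP daemon down"*)
            echo "Check the SNMP daemon status on the node and record the result." ;;
        *)
            echo "No runbook entry; escalate with the original event." ;;
    esac
}

# Build the enriched event: original summary plus remediation and escalation.
# $1 = event summary, $2 = escalation destination
enrich_event() {
    summary="$1"
    escalation="$2"
    printf 'Summary: %s\nRemediation: %s\nEscalation: %s\n' \
        "$summary" "$(runbook_for "$summary")" "$escalation"
}

# Usage: enrich_event "web-01: High CPU load" "SYS on-call team"
```

Keeping the lookup in one place means the NOC receives the triage steps with the alert instead of searching the runbook after the alert arrives.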

In our business, saving time = saving money. Every minute we shave off our event remediation efforts equates to recovering lost revenue. It is for this reason that we make extensive use of Event Enrichment in our day-to-day Operations at Playtech.

Eli Eyal – OSS Group Manager at Playtech



 

Eli has 15 years of experience in implementing advanced monitoring systems in companies around the world. Currently, Eli manages a multi-national monitoring team that designs, implements, and maintains Playtech’s monitoring systems.

 


Playtech is the world’s largest publicly-traded online gaming software supplier. Founded in 1999 and based on the Isle of Man, Playtech develops unified software platforms and content for the online and land-based gaming industries, and provides a range of ancillary services such as marketing, hosting and Customer Relationship Management (CRM). Its best-of-breed product suite includes casino, casual games, sports betting, live gaming, lottery, bingo and one of the world’s largest Poker networks. Web applications lie at the heart of Playtech’s gaming business and are the company’s primary revenue source. Flawless performance and 99.999% uptime are critical, especially during peak usage times such as sports races and other special events. Even minor performance glitches can spoil users’ gaming experiences and disrupt revenue streams.


Event Enrichment : Linux : Mailq threshold warning


Here’s another addition to our real-life Enrichment series: the Mailq threshold warning event.

Enhancements / Comments are welcome!

Name:

Mailq threshold warning

Escalation:

Send to SYS on-call team  <= Add your escalation destination

Remediation:

Note: Copy/paste the commands shown after >>> into your SSH session.

1) Log in to mail-01.acme.com <== replace with your host

>>> ssh ops@mail-01.acme.com

2) Issue the ‘mailq’ command to view the queue

>>> mailq

Response should be similar to:

-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
AAE603481BD 3471 Thu Oct 18 14:44:30 user@domain.com
(connect to outside.com[xxx.xxx.xxx.xxx]: Connection timed out)
user2@outside.com

The Queue ID (AAE603481BD in this example) identifies the message that is having a problem.

3) Review the email and investigate why it is not being sent by issuing the postcat command

>>> postcat -q AAE603481BD

4) If the email should be deleted, use the postsuper command

>>> postsuper -d AAE603481BD

OPTIONAL:

Sample script to delete no-reply emails:

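# Delete every queued message whose sender line contains do-not-reply@domain.com.
# tr strips the * (active) and ! (hold) markers that mailq appends to queue IDs.
for i in $(mailq | grep -B1 'do-not-reply@domain.com' | grep '^[A-Z0-9]' | awk '{print $1}' | tr -d '*!')
do
    postsuper -d "$i"
done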
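After the cleanup, it can help to confirm whether the queue is back under the alert threshold. The following is a minimal sketch that assumes Postfix-style mailq output like the sample in step 2. The helper names queued_count and check_mailq and the default threshold of 50 are assumptions; substitute the threshold configured in your NMS.

```shell
#!/bin/sh
# Count queued messages: lines that start with a queue ID (optionally marked
# with * or !) followed by the message size, as in the sample mailq output.
queued_count() {
    mailq | grep -cE '^[0-9A-Za-z]+[*!]?[[:space:]]+[0-9]+[[:space:]]'
}

# Compare the queue size to a threshold (the default of 50 is an assumption).
check_mailq() {
    threshold="${1:-50}"
    count=$(queued_count)
    if [ "$count" -gt "$threshold" ]; then
        echo "WARNING: $count messages queued (threshold $threshold)"
    else
        echo "OK: $count messages queued (threshold $threshold)"
    fi
}

# Usage: check_mailq 50
```

Include the resulting line in the escalation so the on-call team can see the queue size at the time of triage.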

 


If you are interested in substantially cutting down on the time necessary to implement Event Enrichments, check out the new Event Enrichment Platform.