Monitoring systems vary widely in their capabilities. However, one thing all monitoring systems share is their objective: providing a meaningful event at the right time, so that IT teams can fix problems as quickly as possible.
A proper monitoring system involves additional intricacies, such as proactive monitoring and automatic actions. At its most basic level, however, a monitoring system must keep us aware of what is going on in our IT environment.
Monitoring System Types
There are two main types of monitoring systems:
Status-based monitoring – This type of monitoring system compares the status of a monitored node to a predefined desired result or a configured threshold. It then provides a “status” indicating whether or not the node is in its normal/expected state. The status will look something like this: “red” for bad, “green” for good, “yellow” for a warning (e.g. something is wrong, but the node is not completely down). Generally, this type of system does not provide any additional information.
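The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the function name `evaluate_status`, the two-threshold scheme, and the 80/90 percent values are assumptions made for the example.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"    # normal / expected state
    YELLOW = "yellow"  # warning: something is wrong, but not completely down
    RED = "red"        # bad


def evaluate_status(value: float, warning: float, critical: float) -> Status:
    """Compare a measured value to two configured thresholds."""
    if value >= critical:
        return Status.RED
    if value >= warning:
        return Status.YELLOW
    return Status.GREEN


# Hypothetical CPU readings (percent) checked against assumed thresholds.
for cpu in (35.0, 82.5, 97.0):
    print(cpu, evaluate_status(cpu, warning=80.0, critical=90.0).value)
```

Note that the result carries nothing beyond the color itself, which is exactly the limitation of this type of system.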
In order to increase visibility into one’s IT environment, there is also the possibility of combining the two aforementioned methods. In this case, the final result would be an event that includes: status, severity, and a brief description. However, there is still a critical piece that is missing.
When there is a problem, it is usually associated with a loss of functionality or service. Regardless of the specifics, an organization’s goal is to resolve the issue as quickly as possible (also known as reduction of MTTR – Mean Time To Repair). Having a status indication, a severity, or a short description of the problem does not reveal the cause of the problem or the steps required for resolution.
Let’s say you are monitoring a server’s CPU usage in a status-based monitoring system. When the usage exceeds the configured threshold, the server’s status will be “red”. A “red” status could indicate that the server is down, but it could also indicate that the server is overloaded yet still up and running. These details are not provided by the monitoring system. Taking this a step further, the “red” status will also be visible if an application fails on the server; in this case, the problem is NOT the server itself but the application.
These examples highlight the fact that when an event shows up on a given monitoring system, it could take time to identify exactly what the problem is, and then take additional time to pinpoint the fix. Even in cases where there is a description of the problem (e.g. CPU usage is above 90%), it will take time to figure out how to resolve the issue.
Increasing NMS Efficiency
There are several ways we can increase the efficiency of monitoring systems; some apply to basic monitoring systems, and others to the more advanced.
For basic monitoring systems, we can add more information to the event. This information should provide details as to why the event was raised, what it indicates, as well as the steps that should be taken to resolve the issue. Providing this information will make the event easier to understand, and will shorten the time to resolution.
For example, an event reporting high CPU usage could include the following instruction:
“Use the task manager to detect the affecting process and restart the process/service that consumes the highest CPU. If this action doesn’t help, contact the system administrator.”
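As a rough sketch of what adding this information to an event might look like in code, the Python below attaches a meaning and resolution steps to a raw event. The `Event` fields, the `ENRICHMENT` table, and the `enrich` function are hypothetical names invented for this illustration, and the "meaning" text is an assumed example; only the resolution steps are taken from the instruction quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    node: str
    check: str        # which check raised the event, e.g. "cpu_usage"
    status: str       # "red", "yellow" or "green"
    description: str
    meaning: str = ""  # what the event indicates
    resolution_steps: List[str] = field(default_factory=list)


# Hypothetical enrichment table, keyed by check name.
ENRICHMENT: Dict[str, dict] = {
    "cpu_usage": {
        "meaning": "A process on the node is consuming most of the CPU.",
        "resolution_steps": [
            "Use the task manager to find the process with the highest CPU usage.",
            "Restart that process or service.",
            "If this does not help, contact the system administrator.",
        ],
    },
}


def enrich(event: Event) -> Event:
    """Copy the meaning and resolution steps into the event, if known."""
    extra = ENRICHMENT.get(event.check)
    if extra:
        event.meaning = extra["meaning"]
        event.resolution_steps = list(extra["resolution_steps"])
    return event


raw = Event(node="server-01", check="cpu_usage", status="red",
            description="CPU usage is above 90%")
for number, step in enumerate(enrich(raw).resolution_steps, start=1):
    print(f"{number}. {step}")
```

Events whose check has no entry in the table pass through unchanged, so the enrichment step can be introduced gradually.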
For more advanced monitoring systems, we can create a runbook containing explicit instructions detailing what to do for each particular event that might occur. This runbook could be used for manual execution or, in even more advanced systems, could be part of an automated system that fixes the problem on its own, or that enriches the event with remediation information and then escalates the alert to the correct person.
For example: Assume we receive an event stating a node’s SNMP daemon is down. The event could be enriched to provide the initial triage steps from the runbook *prior* to being sent to the NOC. The NOC, upon receipt of the event, would implement the embedded steps and then add the results of those triage steps to their escalation to on-call engineering. Doing so would save time in two places: at the NOC level, where the initial runbook lookup is no longer needed, and at the on-call engineer’s level, as the results of the triage would already be included in the escalation.
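The flow in this example could be sketched roughly as follows. Everything here is an assumption made for illustration: the `RUNBOOK` structure, the function names, the triage steps, and the triage results are invented, and a real system would plug in its own runbook content and ticketing integration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical runbook: event type -> triage steps, optional automated fix,
# and the team that receives the escalation. All entries are illustrative.
RUNBOOK: Dict[str, dict] = {
    "snmp_daemon_down": {
        "triage_steps": [
            "Confirm the node responds to ping",
            "Check whether the SNMP daemon process is running",
        ],
        "auto_fix": None,  # no automated action defined for this event
        "escalate_to": "on-call engineering",
    },
}


def try_auto_fix(event: dict) -> bool:
    """Run the runbook's automated fix, if one exists; report success."""
    entry = RUNBOOK.get(event["type"], {})
    fix: Optional[Callable[[str], bool]] = entry.get("auto_fix")
    return bool(fix and fix(event["node"]))


def enrich_for_noc(event: dict) -> dict:
    """Attach runbook triage steps to the event before the NOC sees it."""
    entry = RUNBOOK.get(event["type"], {})
    enriched = dict(event)
    enriched["triage_steps"] = list(entry.get("triage_steps", []))
    enriched["escalate_to"] = entry.get("escalate_to", "unassigned")
    return enriched


def build_escalation(enriched: dict, results: List[str]) -> dict:
    """Bundle the NOC's triage results with the event for the engineer."""
    return {
        "to": enriched["escalate_to"],
        "event": enriched,
        "triage_results": results,
    }


event = {"type": "snmp_daemon_down", "node": "node-01"}
if not try_auto_fix(event):
    enriched = enrich_for_noc(event)
    escalation = build_escalation(
        enriched,
        ["ping OK", "SNMP daemon process not running"],  # made-up results
    )
    print(escalation["to"], escalation["triage_results"])
```

Keeping the triage steps and the escalation target in one runbook entry means the NOC never has to look anything up, and the engineer receives the triage results together with the event.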
In our business, saving time = saving money. Every minute we shave off our event remediation efforts equates to recovering lost revenue. It is for this reason that we make extensive use of Event Enrichment in our day-to-day Operations at Playtech.
Eli Eyal – OSS Group Manager at Playtech
Eli has 15 years of experience in implementing advanced monitoring systems in companies around the world. Currently, Eli manages a multi-national monitoring team that designs, implements, and maintains Playtech’s monitoring systems.
Playtech is the world’s largest publicly-traded online gaming software supplier. Founded in 1999 and based on the Isle of Man, Playtech develops unified software platforms and content for the online and land-based gaming industries, and provides a range of ancillary services such as marketing, hosting and Customer Relationship Management (CRM). Its best-of-breed product suite includes casino, casual games, sports betting, live gaming, lottery, bingo and one of the world’s largest Poker networks. Web applications lie at the heart of Playtech’s gaming business and are the company’s primary revenue source. Flawless performance and 99.999% uptime are critical, especially during peak usage times such as sports races and other special events. Even minor performance glitches can spoil users’ gaming experiences and disrupt revenue streams.