IT Monitoring is Terrible: We Can Fix it with Machine Learning

In IT operations, we need to know when something isn’t working. But, humans are just bad at identifying anomalies over time.
FIGURE 1: Rapidly decreasing accuracy, after Mackworth & Taylor 1963.
A typical person’s ability to identify an anomaly that we know to look for can drop by more than half in the first 30 minutes on duty[1]. If that’s not bad enough, when unaided by technology, it can take us up to four times as long to recognize one[2]. We’re actually terrible at this elementary IT requirement of identifying when things go wrong, and that’s before we get to the ugly case of looking for problems we don’t expect. Combining automation and anomaly detection powered by machine learning (ML) may be the only chance we have to successfully identify and respond to the rising swell of data in IT.

In this blog post, we’ll talk about how the biology of the human brain impacts IT operations, how we can augment our teams with ML applications, and finish with two concrete examples of these applications: one offered as a service today by Red Hat, and another which (as far as I can tell) is a novel approach to assisting Root Cause Analysis with ML.

Monitoring is a Human Problem, and We Can’t Fix It Alone

Our brains are great at recognizing patterns[3]. We’re so good that sometimes we see them where there aren’t any. If you’ve ever seen a cloud that looked like a cat, or a rock that looked like your cousin, you know this is just regular human brain stuff. Our brains get used to emerging patterns very quickly through a physiological process called habituation: our brains come to expect the pattern. It helps us spend fewer cycles understanding what’s going on around us. In fact, when what’s going on around us isn’t radically changing, habituation reduces the attention we pay to the “signals,” in Signal Detection Theory lingo, from the pattern.

In the case of IT monitoring, we’re inundated with “unwanted signals”: signals that indicate everything is OK and can be ignored by operators. These unwanted signals play a valuable role in helping the monitoring systems know that the services (and the monitoring solution itself) are performing as designed, but they are detrimental to human processing. Eventually, the human brain adjusts to the idea of receiving a large number of signals, and that becomes the expected pattern. We then pay less attention to whether the signal means OK or PROBLEM. This habituation causes us to need more effort over time to identify exceptions to the expected, and makes us slower at recognizing exceptions, too. That’s long-winded, so let’s use an example.

Construction starts hammering away next door. It’s very loud, so of course you notice immediately. Over the rest of the day, you grow used to (habituated to) the noise of the hammering. When it stops for the evening, it takes a minute, but you notice that the hammering has stopped. It’s not immediate like when the hammering started. If you’d like to try out your own attention skills, here’s a 60-second selective attention test from Daniel Simons.

This predestined loss in attention, the vigilance decrement, is magnified when we’re looking for rare problems – in IT, the kind that cause unplanned downtime.

Work Smarter, Not Harder.

I hate that phrase. Said out loud, it’s too often a cop-out. It means: not enough budget, not enough headcount – go do the impossible again. In other words, keep working harder. Do you remember the good old days when our teams and budgets grew at the same rate as the work we had to get done? Me neither.

IT ops teams are asked to support ever-larger environments (more containers than VMs, more functions than containers, etc.), and also more types of things (application frameworks, development languages, etc.). This growth in scale and complexity makes support an increasingly daunting effort. So, we’re left with that despicable phrase: work smarter, not harder. When it comes to preventing errors, and especially in the world of overwhelming data that we live in, we need a systematic change to monitoring. Research shows that, rather than relying only on operators’ attention, a systematic approach can be superior for creating highly reliable operations.

With complexity growing this quickly in IT, we clearly need that systematic approach: a different way to scale IT operations that accounts for natural human variability. Our customers often use automation to build quality into IT processes. That helps, and we’ve seen spectacular results. But if we want the next big jump in improvement, automation is only half of the solution. Since we can only respond once we find an anomaly, how can we get better at recognizing them if we can’t even keep up with the incoming data? As soon as ops sits down to a shift, their ability to find something odd quickly decreases across almost every dimension: they miss more, they wrongly flag good things as bad, they grow less confident in their decisions, and it takes them longer to decide.

Enter artificial intelligence. The availability of machine learning-based anomaly detection is the start of a new way to support operations. With machine learning, operations can provide a higher level of service by identifying and eliminating more anomalies, including rarer ones, earlier and with greater accuracy (a minimal sketch of what that can look like closes out this section).

Finding the anomalies is the first step, but that alone won’t solve the problem. You have to know what the anomaly means. Machine learning and advanced analytics can help with that, too. Let’s go through two examples of locating anomalies and helping provide information about what’s going on: one Red Hat provides as SaaS, and another you can build for yourself.
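To make that idea concrete, here is a minimal sketch of ML-based anomaly detection on a single operational metric. It assumes scikit-learn’s IsolationForest; the latency stream, the injected spikes, and the contamination value are all invented for illustration, and this is not a description of how any Red Hat product works internally.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Pretend stream of per-minute latency readings (ms): mostly steady noise...
latency = rng.normal(loc=120, scale=10, size=1440)
# ...plus a few injected incidents that a tired human could easily miss.
latency[[200, 830, 1310]] += rng.uniform(80, 150, size=3)

# Simple features: the raw value plus a short rolling mean for local context.
rolling = np.convolve(latency, np.ones(5) / 5, mode="same")
X = np.column_stack([latency, rolling])

# contamination is a guess at the anomaly rate, not a measured value.
model = IsolationForest(contamination=0.005, random_state=0)
flags = model.fit_predict(X)  # -1 marks suspected anomalies

for minute in np.where(flags == -1)[0]:
    print(f"minute {minute}: latency {latency[minute]:.0f} ms looks anomalous")
```

The model never sleeps, never habituates, and flags the same spike at minute 1310 with the same accuracy as the one at minute 200, which is exactly the property human vigilance lacks.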

Red Hat Insights

A couple of years ago, we released Red Hat Insights, a predictive service that identifies anomalies, helps you understand the causes, and helps automate fixes before the causes become problems. If you subscribe to Insights, it uses a tiny[4] bit of metadata to identify the causes of pending outages in real time. With the data from well over 15 years of resolving support cases, we are able to train Insights to provide both descriptive explanations of the problems and prescriptive remedies. To take it a step further and make operators’ lives a little easier, we recently extended Insights with the ability to remediate identified problems with automation. As more and more customers use Insights for risk mitigation and automated issue resolution, the additional information makes Insights smarter every day and enables more informed actions by operations. Connect automation with machine learning to identify and resolve problems before you know to look for them.

Use machine learning to help us diagnose software.

Red Hat Insights provides exact and automatable actions to resolve the complex interactions that lead to downtime. We can also use machine learning to assist in identifying other types of software problems, and reduce the time required to discover the root cause by narrowing where to look first down to a few educated predictions – without having to pore over logs by hand. We can use machine learning to aid operators in root cause analysis by suggesting a possible dependency chain that led to the breakdown – a diagnostic map.

Applications and platforms responsible for the deployment and management of many things (VMs, containers, microservices, functions, etc.) are increasingly providing maps of the things under their control in order to give operators context. The example below shows the topology of container interactions in a Kubernetes cluster on Red Hat’s container platform, OpenShift. This works well for platforms that create the topologies, but what about trying to determine the topology for applications we don’t know or control?
FIGURE 2: CloudForms managing a Kubernetes Cluster in OpenShift

Turning one minute of laptop CPU into a diagnostic map.

System logs on Linux (and its *nix-based cousins) are great sources of information about what isn’t working well, but a single entry rarely provides much context outside the program or subsystem that generated it. In today’s world of massively interconnected systems, unless an operator already has experience with the observed problem, a log entry rarely carries enough information to understand its root cause. However, even when we’ve never seen the problem before, we can use machine learning to build a diagnostic map and help us narrow down where to look first for root causes. Here’s an example.
FIGURE 3: Diagnostic Map
Figure 3 represents part of a machine learning-derived diagnostic map of Linux programs, built from log entries in syslog. Each circle is a program that logged events in syslog. The arrows suggest an influence relationship: the program at the tail of an arrow appears to impact the behavior of the program at its head. Now you have a picture of the entire system of events behind the problematic behavior that sent you to the logs in the first place: you have context.
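For the curious, here is a rough sketch of the flavor of analysis behind a map like Figure 3. The map in the figure was machine-learned; the stand-in heuristic below just counts how often one program’s syslog messages closely precede another’s and keeps the strongest pairs as candidate arrows. The log path, the regex, and the five-second window are assumptions made for illustration, not the method used to produce the figure.

```python
import re
from collections import Counter
from datetime import datetime

SYSLOG_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ (?P<prog>[\w./-]+?)(?:\[\d+\])?:"
)
WINDOW_SECONDS = 5  # assumed: "A influences B" if B logs within 5s of A

events = []
# Path varies by distribution (e.g. /var/log/messages on RHEL-family systems).
with open("/var/log/syslog") as fh:
    for line in fh:
        m = SYSLOG_LINE.match(line)
        if not m:
            continue
        # Classic syslog timestamps omit the year; that's fine for the
        # relative ordering we need within one file.
        ts = datetime.strptime(m["ts"], "%b %d %H:%M:%S")
        events.append((ts, m["prog"]))

# Count how often one program's entry closely precedes another's
# (syslog is already in chronological order).
edges = Counter()
for i, (ts_a, prog_a) in enumerate(events):
    for ts_b, prog_b in events[i + 1:]:
        if (ts_b - ts_a).total_seconds() > WINDOW_SECONDS:
            break
        if prog_a != prog_b:
            edges[(prog_a, prog_b)] += 1  # candidate arrow: prog_a -> prog_b

# The strongest pairs become the arrows of the (very rough) diagnostic map.
for (src, dst), count in edges.most_common(15):
    print(f"{src} -> {dst}  ({count} close co-occurrences)")
```

A real pipeline would replace the naive co-occurrence count with a learned model and prune spurious edges, but even this crude version gives an operator a ranked list of "look here first" relationships for the cost of about a minute of laptop CPU.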

Diagnostic maps can reduce cognitive overload, and identify what’s important.

With more deployment types, frameworks, and rapidly evolving applications, the interdependencies of the things we support are exploding in number. Even the best IT operators can debug only some of these problems quickly. However, when we use ML to aid in identifying problems and to generate diagnostic maps, we can help reduce time to resolution, even for problems we haven’t seen before.

Not only are graphs like this valuable as a troubleshooting tool, they can also be tied into monitoring systems to help operators identify and prioritize the right alerts. When something big happens, like a cloud region going down, we’re flooded with alerts. In that case, getting an alert from your monitoring systems that every application, every container, and every VM is down doesn’t add any new information to help resolve the problem. However, each and every alert takes cognitive effort to process and decide whether it’s important. In alert floods like this, the human brain becomes overwhelmed and stops processing new alerts. If you’ve ever felt overwhelmed by the amount of email in your inbox, that’s a small version of the same principle.

With an understanding of dependencies, you can gate alerts: you don’t need any more alerts about applications being down if the VM they’re running on is also down. But, knowing that a new VM is down may be essential. (A minimal sketch of this gating rule appears at the end of the post.)

Artificial Intelligence (AI) is a rapidly evolving field, and its use in IT operations even more so. But, it’s a lot more than academic, and we’re beginning to see categories of use emerge in the market. Red Hat already uses it to offer Insights, the service that identifies and resolves infrastructure issues before your teams know about them. We also saw an emerging example of using AI to assist in root cause analysis. The field is just getting started, and these are just two of many exciting directions it may take.

We’ve seen that it’s essentially impossible for people to watch for anomalies at anything approaching a business-critical level of reliability; humans just aren’t wired for it. The good news is that machine learning is good at it: both finding anomalies, and helping your teams figure out where to look to solve them. And, you’re not alone in this need.

If you have a substantial investment in any software, call those vendors and ask what tools they have to help your teams identify, diagnose, and solve problems with their software. If you’re feeling like you want to push a little harder, ask what they’re doing to help solve problems where their software is only one piece of the puzzle.
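To close the loop on the gating idea above, here is a toy sketch of dependency-aware alert suppression. The RUNS_ON topology, resource names, and alert payloads are all invented for illustration; a real implementation would pull the topology from the platform (or from a learned diagnostic map) and sit inside your alerting pipeline.

```python
# Each entry maps a resource to the thing it runs on (invented topology).
RUNS_ON = {
    "app-frontend": "vm-101",
    "app-backend": "vm-101",
    "app-reports": "vm-102",
}

def gate_alerts(alerts):
    """Drop alerts whose underlying dependency is already alerting as down."""
    down = {a["resource"] for a in alerts if a["status"] == "down"}
    kept = []
    for alert in alerts:
        parent = RUNS_ON.get(alert["resource"])
        if parent in down:
            continue  # the VM-level alert already carries this information
        kept.append(alert)
    return kept

incoming = [
    {"resource": "vm-101", "status": "down"},
    {"resource": "app-frontend", "status": "down"},
    {"resource": "app-backend", "status": "down"},
    {"resource": "app-reports", "status": "down"},
]
for alert in gate_alerts(incoming):
    print("page the on-call for:", alert["resource"])
# Prints vm-101 and app-reports; the two alerts riding on vm-101 are suppressed.
```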
Erich Morisse
Director, Management Strategy
@emorisse
1. Mackworth, J. F. (1970). Vigilance and Attention. Penguin Books Ltd.
2. Mackworth, N. H. (1948). The breakdown of vigilance during prolonged visual search. Quarterly Journal of Experimental Psychology, vol. 1, pp. 6-21.
3. Jeff Hawkins’ “On Intelligence” is a great and accessible read on the topic.
4. Less than 5% of the data you provide for a single support case.
