Over time, software has been getting more complicated. Platforms such as Google may contain more than two billion lines of code. Even if your application is simpler, it likely contains code that you didn’t write yourself. It combines multiple programming languages, involves different libraries, incorporates APIs attached to vendor products, and so on. Your application is a petri dish, and its individual components have been combined in ways that were likely never envisioned or tested by their creators.
What this means is that your application – and your overall application stack – will likely generate anomalies over the course of its lifetime. These will likely frustrate your users and customers as they attempt to use your application, which will, in turn, be responsible for customer churn, lost revenue and lost productivity. In short, because your code is complicated, you will need an anomaly detection system to help make it run smoothly over the product lifecycle.
Manual Detection Isn’t Going to Help
Many businesses (hopefully, including your business) now recognize the need for an anomaly detection system. With that said, there’s more than one way to detect an anomaly. A lot of companies rely on manual anomaly detection. This is to say that they agree on a number of KPIs that represent the health of their most important applications, and then they keep track of them in one of two ways:
- Dashboards: Put up a bunch of graphs on a screen, and then hire a bunch of people to watch those graphs.
- Alerting: Based on an observation of your environment, you create thresholds for your KPIs. When the numbers go too high or too low, an administrator gets a text message.
You can see where these fall short, right? You’re running dozens of applications with millions of lines of code between them. At some point, you’re going to run out of people and screens (or administrators and phones). Even if you believe that your current monitoring capability is scaled appropriately, human observers can only see so much.
Even alerting based on thresholds isn’t guaranteed to work – an anomaly that doesn’t quite reach the threshold for alert status is still an anomaly. What’s more, you have thousands or even millions of metrics to monitor. Are you going to manually learn what’s “normal” for each of these metrics? Even if you attempt this undertaking, baselines tend to shift. Once the normal values for these metrics change, you’ll be swamped in false alerts.
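To make the baseline problem concrete, here is a minimal Python sketch (the latency figures, window size and multiplier are all invented for illustration) contrasting a fixed threshold with a rolling baseline:

```python
from statistics import mean, stdev

def static_alerts(values, threshold):
    """Flag every reading above a fixed, hand-tuned threshold."""
    return [i for i, v in enumerate(values) if v > threshold]

def rolling_alerts(values, window=5, k=3.0):
    """Flag readings more than k standard deviations above the
    mean of the preceding `window` readings (a moving baseline)."""
    alerts = []
    for i in range(window, len(values)):
        base = values[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if values[i] > mu + k * sigma:
            alerts.append(i)
    return alerts

# Latency (ms) drifts from ~100 to ~175; a threshold of 150 was tuned
# to the old baseline, so every post-drift reading now fires.
latencies = [100, 102, 99, 101, 100, 103, 98,
             170, 175, 172, 178, 174, 176, 240]
print(static_alerts(latencies, 150))  # [7, 8, 9, 10, 11, 12, 13] - false-alert flood
print(rolling_alerts(latencies))      # [7, 13] - the drift step and the real spike
```

The rolling detector fires once when the baseline jumps, then adapts and stays quiet until the genuine spike at the end; the fixed threshold keeps firing forever once “normal” exceeds it.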
Once you pass a certain scale, machine learning and automation are the only game in town, at least as far as anomaly detection is concerned.
Design Principles of an Automated Anomaly Detection System
Now that you know you need an automated anomaly detection system, you have yet more choices to make. After all, you don’t want just any system – you want one that can maintain application health, detect crashes, mitigate slowdowns, and alert on pricing or checkout glitches. Here are the factors you’ll need to look out for:
Timeliness: How fast do you want to detect anomalies? This isn’t a trick question, because there are always trade-offs. Detecting anomalies faster means enduring a higher rate of false positives. Slower anomaly detection uses batch processing as opposed to real-time data. It’s more accurate, but it doesn’t scale – the more records you need to process, the slower it gets.
Scale: This is another tradeoff between real-time data and batch processing. If a system is very large, mission-critical and is expected to grow larger, it’s a good idea to scan for anomalies using real-time data. Again, this is much faster if less accurate. Static systems that are less crucial can use batch processing.
Rate of change: If your company is like most, your baseline – the day-to-day activity that’s considered “normal” – may change constantly. Some companies are different. Manufacturing data, for example, is supposed to change slowly if it changes at all. If you’re part of the former group, however, then you’ll need algorithms that can adapt to changing baseline data.
Conciseness: How does your anomaly detection system alert you to anomalies? Ideally, you want to be able to tell what’s going wrong at first glance. Try to find a system that can collect multiple symptoms, aggregate them into a single related report and point you to where the true anomaly lies.
Fundamental Machine Learning Techniques for Anomaly Detection
Given the factors above, it’s very likely that you will be incorporating real-time machine learning in order to police anomalies in customer-facing applications that experience high rates of change. Once again, you have options – this time regarding the kind of model that you choose.
Unsupervised Anomaly Detection: First, you take a time series and use a mathematical model to establish the normal range for a metric. Then, you take samples and see if they fit within the model. If the sample doesn’t fit, then it’s an anomaly. This is simple to spell out, but not simple to achieve. For example, your data may experience a high rate of change. How do you establish “normal” when the value for normal can change on a day-to-day basis?
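As a minimal sketch of the unsupervised idea – assuming a simple Gaussian model of “normal,” where real systems use far richer models and the metric values here are invented:

```python
from statistics import mean, stdev

def fit_normal_range(history, k=3.0):
    """Model 'normal' as mean +/- k standard deviations of past values.
    (A toy stand-in for a real statistical model of the time series.)"""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

def is_anomaly(sample, bounds):
    """A sample that doesn't fit within the model is an anomaly."""
    lo, hi = bounds
    return not (lo <= sample <= hi)

history = [12.0, 11.5, 12.3, 11.8, 12.1, 12.4, 11.9, 12.2]
bounds = fit_normal_range(history)
print(is_anomaly(12.0, bounds))  # False: fits the model of normal
print(is_anomaly(20.0, bounds))  # True: outside the normal range
```

The hard part the text describes – a “normal” that shifts day to day – is exactly what this static fit cannot handle: the bounds are frozen at training time, so the model must be refit or made adaptive as the baseline moves.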
Supervised Anomaly Detection: In this case, you begin training your system manually. Every time your system brings up something that may be an anomaly, it’s incumbent upon the administrator to determine whether or not it’s a false positive. The system evolves based on your feedback. This method is slower – and it depends heavily on your having a good working definition of an anomaly – but it does produce fewer false positives in the end.
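A toy illustration of that feedback loop – the detector and its band-widening rule are invented for illustration, and production systems typically retrain a classifier on the labeled examples instead:

```python
def make_detector(mu, sigma, k=2.0):
    """A threshold detector whose sensitivity evolves from admin feedback."""
    state = {"k": k}

    def flag(sample):
        return abs(sample - mu) > state["k"] * sigma

    def feedback(sample, was_real_anomaly):
        # The administrator labels each flagged event; a false positive
        # widens the band so similar readings stop alerting.
        if flag(sample) and not was_real_anomaly:
            state["k"] = abs(sample - mu) / sigma + 0.1

    return flag, feedback

flag, feedback = make_detector(mu=100.0, sigma=5.0)
print(flag(112.0))                       # True: flagged under the initial band
feedback(112.0, was_real_anomaly=False)  # admin says: that reading was normal
print(flag(112.0))                       # False: the band widened
print(flag(130.0))                       # True: still clearly anomalous
```

Each round of labeling costs administrator time – which is why the text calls this method slower – but the detector converges on the organization’s working definition of an anomaly.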
Requirements of an Anomaly Detection System
In effect, the kind of model that you choose needs to fit the kind of signal that you produce. Not all data produces the same signal. Some are fairly regular, some are stepwise, some are discrete and some are all over the place. Your anomaly detection system needs to identify the kind of signal in use and then switch models on the fly to generate accurate results.
In addition, your system needs to detect seasonal patterns. If you don’t compensate for seasonal patterns, you’re going to have massive error bars – every variance within the peaks and troughs of your seasonal data is going to look normal, even when that’s clearly not the case. Unsurprisingly, there are multiple ways for your system to find seasonality, with pros and cons for each.
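One common way to find seasonality is autocorrelation: the lag at which a series best correlates with itself is a candidate period. A sketch, using an invented weekly traffic pattern – real systems would use more robust spectral or decomposition methods:

```python
from statistics import mean

def autocorr(series, lag):
    """Correlation of the series with itself shifted by `lag` samples."""
    mu = mean(series)
    num = sum((series[i] - mu) * (series[i - lag] - mu)
              for i in range(lag, len(series)))
    den = sum((v - mu) ** 2 for v in series)
    return num / den

def dominant_period(series, max_lag):
    # The period is the lag with the strongest autocorrelation.
    return max(range(2, max_lag + 1), key=lambda lag: autocorr(series, lag))

# A synthetic daily series with a period of 7 (weekend peaks), repeated 4 times.
week = [10, 12, 15, 14, 13, 30, 28]
series = week * 4
period = dominant_period(series, max_lag=10)
print(period)  # 7

# With the period known, compare each point to the same phase last cycle
# instead of to a flat baseline, so weekend peaks don't look anomalous.
residuals = [series[i] - series[i - period] for i in range(period, len(series))]
print(all(r == 0 for r in residuals))  # True: nothing anomalous here
```

Deseasonalizing this way shrinks the error bars: variance is measured against the expected peak or trough for that phase of the cycle, not against the cycle as a whole.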
Pitfalls to Watch Out For
With so many factors to choose from, it can be easy to get things wrong in unexpected ways. For example, your model can learn too quickly or too slowly. Too quickly, and the model never settles into a stable sense of the average time series – suddenly, everything is an anomaly. Too slowly, and the model doesn’t recognize even the widest variations from the mean.
Similarly, your model needs to understand how to adapt to anomalies. How does your model understand what an anomaly looks like and distinguish it from normal data? The answer here is actually to pause learning during an anomalous event. The machine learning model will not incorporate the anomalous data into its sense of what is “normal,” and it will continue to flag anomalies when they occur.
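A sketch of this “pause learning” rule on top of an exponentially weighted baseline – the parameters and readings are invented, and the alpha setting also embodies the too-fast/too-slow learning trade-off described above:

```python
class AdaptiveDetector:
    """An exponentially weighted baseline that pauses learning during
    anomalous samples, so outliers never contaminate 'normal'."""

    def __init__(self, mu, var, alpha=0.1, k=4.0):
        # alpha is the learning rate: too high and every blip reshapes
        # the baseline; too low and the model never tracks real drift.
        self.mu, self.var, self.alpha, self.k = mu, var, alpha, k

    def observe(self, x):
        sigma = self.var ** 0.5
        anomalous = abs(x - self.mu) > self.k * sigma
        if not anomalous:
            # Only normal samples update the baseline; anomalies are
            # flagged but deliberately excluded from learning.
            d = x - self.mu
            self.mu += self.alpha * d
            self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return anomalous

det = AdaptiveDetector(mu=100.0, var=4.0)
print(det.observe(101.0))  # False: normal reading, baseline adapts
print(det.observe(160.0))  # True: anomaly, learning paused
print(det.observe(100.5))  # False: baseline was not dragged up by the spike
```

Because the spike at 160 never entered the update, the model’s sense of “normal” stays near 100 and a repeat of the spike would be flagged again rather than absorbed.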
Fitting Machine Learning to Your Organization
The process of choosing the right anomaly detection system for your organization is a massive, branching decision tree – and no two organizations are going to make the same decisions. Right now, e-commerce organizations are using machine learning to detect fraud, manufacturers are using machine learning to conduct proactive maintenance and data scientists are even using machine learning to audit the output of other machine learning models. Each of these potential use cases requires an organization to configure its anomaly detection system in a different way.
This may seem like a difficult process, but it’s important to remember the introduction – the average enterprise environment is supported by hundreds of applications running millions of lines of code. There’s almost no way not to generate anomalies in this context, and so the only thing more damaging than not fine-tuning your anomaly detection system is not having one at all.
Ira Cohen is chief data scientist and co-founder of Anodot, where he develops real-time multivariate anomaly detection algorithms designed to oversee millions of time-series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has more than 12 years of industry experience.