17 January, 2020

Anomaly Detection with False Positive Suppression.

We explore the possibility of combining an anomaly detector with a supervised system to produce a heuristic modification of the anomaly score, such that datapoints that previously led to false positives receive a ‘suppressed’, low anomaly score.

by Felix Berkhahn, Lead Data Scientist

Anomaly detection is a cornerstone of relayr's analytics toolkit. A properly trained and configured anomaly detector can be used for a variety of tasks in the IoT field, including predictive maintenance, bottleneck detection, and the identification of faulty sensors.

Moreover, anomaly detection does not require labeled data, which is rare in practice. Very often we observe that companies have access to large amounts of sensor data, but no structured (human) annotations of that data; such annotations are usually time consuming and expensive to obtain.

In the context of predictive maintenance, it can also be close to impossible to collect a labeled dataset that covers all conceivable types of failures a machine could experience. There may simply be too many of them, and even a single type of failure can manifest itself differently each time. Hence, it is often more appropriate to build a model that detects any out-of-the-ordinary pattern in the sensor readings and alerts the operators of a potentially impending failure. This is exactly what anomaly detection achieves.

There are, of course, also downsides to this approach. One of them, and this will be the topic of this blogpost, is that in some circumstances anomaly detectors tend to produce false positives. If a machine exhibits some rare, but perfectly normal and healthy behavior, an anomaly detector will naturally highlight it as an anomaly (just because it is ‘rare’), hence producing a false positive. If you measured, for instance, the vibration of your car and trained a model on that data, you would most likely produce anomalies in the rare event of overtaking another car at full throttle. While the vibration data in this event indeed looks different from the vibration data under normal operating conditions, it is not a signal that the car needs maintenance. Consequently, we would like to suppress it, because it would create false alarms in a predictive maintenance context. How can that be achieved?

One way to address the problem would be dedicated feature engineering: extracting features that are insensitive to the rare events causing false positives. In the car example above, one could for instance try to normalize the vibration measurements by the engine RPM. While feature engineering is an option, it might not always be possible to come up with ‘good’ insensitive features. Very often, a residual sensitivity will remain, or the engineered feature may lose predictive power for detecting actual engine failures. Last but not least, it is hard to anticipate at the outset all the scenarios that could lead to a false positive signal. Realistically, most scenarios pop up for the first time during operation of the model. A solution that requires going back to the drawing board each time an issue appears, engineering fresh features, and training a new model is not scalable.

In this blogpost, we will instead explore the possibility of combining the anomaly detector with a supervised system to produce a heuristic modification of the anomaly score, such that datapoints that previously led to false positives receive a ‘suppressed’, low anomaly score.

What we envision is a system that allows the user to label a false positive as such, and which then triggers the training of a supervised system. This supervised system subsequently assigns a ‘false positive’ probability p to each datapoint. We now propose the following ad hoc modification of the anomaly score x:

(1)          x → (1 – p) x 

If the supervised ‘false positive classifier’ works reliably, this allows us to weed out common manifestations of false positives, as their heuristic anomaly score will be suppressed by the factor 1 – p. Obtaining a reliable classifier, however, requires labeling a number of false positive instances. If the number of samples that have to be labeled is too large, this could render the approach prohibitive.
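For concreteness, here is a minimal sketch of this heuristic in code (the function name and the example values are our own):

```python
import numpy as np

def suppress(anomaly_score: np.ndarray, p_false_positive: np.ndarray) -> np.ndarray:
    """Heuristic (1): scale each anomaly score x by (1 - p)."""
    return (1.0 - p_false_positive) * anomaly_score

x = np.array([0.9, 0.8, 0.1])    # raw anomaly scores
p = np.array([0.95, 0.05, 0.0])  # predicted false positive probabilities
suppress(x, p)                   # -> array([0.045, 0.76, 0.1])
```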

Fig. 1: architecture of the semi-supervised VAE that we are proposing

This is where the semi-supervised model developed at relayr comes into play (see for instance our blog post or paper). This semi-supervised, ‘unified’ model is essentially a normal VAE with the addition of a “pi” layer (see Fig. 1) that feeds off the last layer prior to the latent layer. The loss function contains not only the standard reconstruction loss and KL divergence, but also the classification loss of the supervised problem.

The idea is to use the reconstruction probability of this ‘unified’ model for anomaly detection, and its supervised component for the false positive suppression. In other words, in the network architecture of Fig. 1, the pi layer will be used to detect false positives, while the “normal” output layer will be used for the anomaly detection task.
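As an illustration, a minimal Keras sketch of such a unified model could look as follows. The layer sizes, the single sigmoid “pi” unit and the use of binary cross-entropy are our own simplifications rather than the exact configuration of our experiments; in practice the classification loss also has to be masked (e.g. via sample weights) for the unlabeled part of the data.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 8

class Sampling(layers.Layer):
    """Reparameterization trick; also adds the KL term to the model loss."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: two fully connected layers, as in the MNIST experiment further below.
inputs = layers.Input(shape=(784,))
h = layers.Dense(256, activation="relu")(inputs)
h = layers.Dense(64, activation="relu")(h)       # last layer before the latent layer
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
z = Sampling()([z_mean, z_log_var])

# "pi" head: false positive probability, fed from the last pre-latent layer.
pi = layers.Dense(1, activation="sigmoid", name="pi")(h)

# Decoder: mirror image of the encoder.
d = layers.Dense(64, activation="relu")(z)
d = layers.Dense(256, activation="relu")(d)
reconstruction = layers.Dense(784, activation="sigmoid", name="reconstruction")(d)

vae = Model(inputs, [reconstruction, pi])
vae.compile(
    optimizer="adam",
    loss={"reconstruction": "binary_crossentropy", "pi": "binary_crossentropy"},
    loss_weights={"reconstruction": 1.0, "pi": 1.0},
)
```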

The number of false positives is expected to be much smaller than the overall number of datapoints (otherwise, the anomaly detector wouldn't have produced a false positive for those datapoints in the first place). This is exactly the situation in which the semi-supervised setup promises substantial improvements over a setup with two independent models: the false positive classification task can leverage the ‘knowledge’ the network acquires from the large amount of unlabeled data.

From an architectural perspective it is also advantageous to have a single model handling both the anomaly detection and the classification task, as it relieves us from synchronizing and combining the results of two models in the backend pipeline (the modification (1) can be built into the tensorflow graph as a separate model output, as sketched below). The unified model allows us to start off with a pure anomaly detection task and then, once the first false positive labels arrive, to seamlessly transition into a semi-supervised setup, in which the supervised component feeds off the representations already learned in the deeper layers. We hence expect good classification results from the get-go, without the need to collect a large dataset of false positive labels.
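To illustrate the last point about the backend pipeline, heuristic (1) can be exposed as just another output of the graph. Continuing the Keras sketch above, and using a simple mean squared error as a stand-in for the reconstruction-probability-based anomaly score (our simplification):

```python
# Expose the suppressed score directly, so the serving pipeline only ever
# consumes a single tensor instead of combining two model outputs itself.
recon_error = layers.Lambda(
    lambda t: tf.reduce_mean(tf.square(t[0] - t[1]), axis=-1, keepdims=True),
    name="recon_error")([inputs, reconstruction])
suppressed_score = layers.Lambda(
    lambda t: (1.0 - t[1]) * t[0],               # heuristic (1): x -> (1 - p) x
    name="suppressed_score")([recon_error, pi])

scoring_model = Model(inputs, suppressed_score)  # used at inference time
```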

To prove the above value proposition in a simple setting, we ran the following experiments: we used the MNIST dataset with a simplistic model (two fully connected layers in the encoder, the latent layer, and two fully connected layers in the decoder). Of the 10 MNIST digits, one is excluded from the training process altogether (it represents the ‘true’ anomaly class) and another is injected only in small numbers (it represents the false positive class). Our goal is to end up with a model that performs well in identifying actual anomalies, but which does not produce high anomaly scores for samples from the false positive class.
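A sketch of this dataset construction might look as follows; the variable names and the number of injected samples are our own choices, while the digit assignments match the experiment described below (0 as anomaly, 2 as false positive):

```python
import numpy as np
from tensorflow.keras.datasets import mnist

ANOMALY_DIGIT, FP_DIGIT, N_FP = 0, 2, 500   # N_FP is varied in the experiments

(x, y), _ = mnist.load_data()
x = x.reshape(len(x), -1).astype("float32") / 255.0

normal_mask = (y != ANOMALY_DIGIT) & (y != FP_DIGIT)
fp_idx = np.where(y == FP_DIGIT)[0][:N_FP]

x_train = np.concatenate([x[normal_mask], x[fp_idx]])
# Supervised target for the "pi" head: 1 for labeled false positives, 0 otherwise.
y_fp = np.concatenate([np.zeros(normal_mask.sum()), np.ones(N_FP)])
# Samples of ANOMALY_DIGIT are held out entirely and only appear at test time.
```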

We use two metrics to evaluate the model performance: the AUC score between the normal class and the anomaly class, and the AUC score between the anomaly class and the false positive class. The former (‘normal AUC’) tells us how well the model disentangles normal data from actual anomalous data, while the latter (‘false positive AUC’) tells us how well the model disentangles the anomaly class from the false positive class. We want to show that the method of false positive suppression improves the false positive AUC without affecting the normal AUC too much.
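Both metrics can be computed with scikit-learn's roc_auc_score; the helper names below and the convention that higher scores mean ‘more anomalous’ are our own:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def normal_auc(scores_normal, scores_anomaly):
    """AUC between the normal and the anomaly class."""
    y = np.r_[np.zeros(len(scores_normal)), np.ones(len(scores_anomaly))]
    return roc_auc_score(y, np.r_[scores_normal, scores_anomaly])

def false_positive_auc(scores_fp, scores_anomaly):
    """AUC between the false positive class and the anomaly class."""
    y = np.r_[np.zeros(len(scores_fp)), np.ones(len(scores_anomaly))]
    return roc_auc_score(y, np.r_[scores_fp, scores_anomaly])
```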

Let's first consider the false positive AUC. In the following experiment, we used 0 as the anomaly class and 2 as the false positive class. In other words, the model was trained on the data of all digits 1, 3, 4, …, 9 and a varying number of samples of the class 2:

Fig. 2: comparing the performance of a VAE with and without false positive suppression.

The AUCs were computed between the ‘anomalous’ and the ‘false positive’ class. A higher score means that the model can better disentangle the two classes. For this experiment, the x-axis (‘number of samples’) was resolved in steps of 1000.

Here, the x-axis represents the number of samples of the class 2. The blue curve shows the false positive AUC for a pure anomaly detection model, while the yellow curve shows the false positive AUC of our unified model using the heuristic modification (1). Of course, if we include a sufficient number of ‘false positive’ samples in the training dataset, a normal anomaly detector will also at some point stop producing false positives, because those events will no longer be rare. This is why the blue curve also trends upwards as the number of included samples grows. However, it is beaten by far by the yellow curve, which jumps to almost 100% AUC with as few as 500 samples (note that this experiment was conducted with steps of 500 samples).

Let's zoom into the region at the beginning and see how many samples are minimally needed to achieve high false positive AUC scores. From now on we will ignore the ‘pure’ anomaly detection model, as it obviously does not yield competitive results and would require thousands of samples to achieve high false positive AUC scores. The relevant region at the beginning looks as follows:

Fig. 3: Similar to Fig. 2, but zoomed in, i.e. a much finer step size on the x-axis was used. In addition, the AUC of the normal anomaly detection problem, i.e. the AUC between the normal and the anomalous classes, is shown.

The yellow curve represents the false positive AUC. We see that already with 40 samples we obtain AUCs above 80%. With a mere 70 samples, the performance is pushed to over 90%. Note that the training dataset contained 480,000 samples overall, so this amounts to a fraction of only 0.014% labeled datapoints! The red curve, in contrast, shows the ‘normal AUC’, i.e. the ability of the model to detect actual anomalies. As we can see, the anomaly detection performance of the unified model is not affected at all. In other words, we achieved exactly what we had set out to achieve: suppressing false positives by adding a small number of labeled samples to the dataset, while at the same time retaining the ability to detect actual anomalies. This is further corroborated by experiments we conducted with other datasets and models.

It is also interesting to see how much of a performance boost is obtained from the *semi*-supervised nature of the model. If we had instead used a system with two models, i.e. a normal VAE for anomaly detection and a separate supervised model to predict false positives, what AUC would we have obtained? We ran exactly this type of experiment, and the results look as follows (40 labeled false positive samples were used):

What we can see is that the unified model suppresses false positives considerably better than the equivalent setup with two models. The unified model hence leverages the large amount of unlabeled data, as expected. Note that both setups perform the same with respect to the actual anomaly detection task.

Another interesting question we can ask is: how good are the results if we label only a subset of the false positive datapoints the model is exposed to during training? We investigated this as well, and the results look as follows:

We see that the results do indeed get a small boost from the mere presence of unlabeled false positive samples, but not as much as if those samples had been labeled.

Finally, we might ask ourselves how the above results translate to a time series model (which is the type of model we mostly use at relayr). This can nicely be showcased using the Twitter traffic dataset released for anomaly detection purposes. We again trained a semi-supervised model following the architecture of Fig. 1, except that both the encoder and decoder contain LSTM cells and time-distributed dense layers. We used a lookback of 500 timesteps and created training/test samples with maximum overlap between samples. The anomaly scores were derived from the model's reconstruction of the ‘last’ datapoint of each sample.
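As a sketch of this windowing step (the function name is ours; the LSTM encoder/decoder itself is omitted here):

```python
import numpy as np

LOOKBACK = 500

def make_windows(series: np.ndarray, lookback: int = LOOKBACK) -> np.ndarray:
    """Sliding windows with maximum overlap (stride 1); shape (n, lookback, 1)."""
    windows = np.stack([series[i:i + lookback]
                        for i in range(len(series) - lookback + 1)])
    return windows[..., np.newaxis]

# The anomaly score of each window is derived from the reconstruction of its
# last timestep, i.e. windows[:, -1, 0].
```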

Without assigning any labels to the dataset, the log of the reconstruction probabilities (which can be thought of as an ‘inverse’ of the anomaly score) on a held-out test set looks as follows:

Fig. 4: log(p) on a held out test set of the original model without false positive suppression
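For concreteness, a per-window score like the one plotted in Fig. 4 could be computed along the following lines; the assumption of a Gaussian decoder output with fixed variance is our simplification:

```python
import numpy as np
from scipy.stats import norm

def log_recon_prob(x_last: np.ndarray, mu_last: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """log p of the last timestep of each window under the decoder's Gaussian."""
    return norm.logpdf(x_last, loc=mu_last, scale=sigma)
```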

Compare that with the plot of the traffic data itself (Fig. 5):

Fig. 5: held out test set of the Twitter traffic data on which anomaly detection was performed.

There are clear anomalies visible at the beginning and end of the data series. Another anomaly is obvious at around timestep 3000. Note that there is another subtle, but unequivocal anomaly visible at around timestep 1700.

First of all, it is apparent that the model picked up most of the actual anomalies: those at the very beginning and end of the dataset, and the anomalous drop at around timestep 3000. What also becomes clear is that the model tends to produce bad reconstructions at all minima of the time series. This can blur actual anomalies and reduces the signal strength of the model. In the above example, the anomaly at around timestep 3000 produces a signal only about as strong as the false signals at the minima.

The same observations can be made on the training dataset. We therefore labeled as false positives all instances of the training dataset where the model produced false signals at the minima of the time series, and then fine-tuned the model using these extra labels. Applying the heuristic of formula (1) yields the following result on the held-out test set:

Fig. 6: effect of suppressing the anomaly score using the heuristic (1). We can see that not only did the false positive regions completely disappear, but a new anomaly appeared at around timestep ~1700. Note that these results were obtained after a retraining of the model, hence the unsuppressed results vary slightly compared to Fig. 4.

The suppression works very well. Not only did the false signals at around timesteps ~500, ~1400 and ~2700 completely disappear (with the anomaly at around timestep ~3000 now being very pronounced), but a new anomaly appeared at around timestep ~1700. Close investigation of the plot of the traffic data itself (Fig. 5) indeed confirms the existence of a subtle but unequivocal anomaly at that time. This anomaly does not occur at a minimum of the time series, so it is not the case that it was merely masked by false positives in the previous model. Remarkably, the false positive suppressing model is therefore not only better in that it indeed suppresses false positives, it also finds new anomalies. This can most likely be attributed to the fact that the false positive suppressing model is forced to find more descriptive latent representations. This seems to be another instance of the ‘semi-unsupervised’ learning that we described in https://arxiv.org/abs/1908.03015.

When retraining the unsuppressed model several times, we also find cases in which it detects the anomaly at timestep ~1700. In other cases, however, the anomaly is either not found or its signal is weaker than the noise of the false positive regions. In these latter cases, the fine-tuned, suppressed model still discovers the anomaly reliably.