21 September, 2018

One Model to Rule Them All.

AUTHORS: Felix Berkhahn, Richard Keys, Wajih Ouertani, Nikhil Shetty.


Supervised machine learning models require ‘labels’ (the target predictions) in the training dataset. For example, the label for an image classifier would be the object present in the image, whilst for a sales forecasting model, the label would be the transaction price. In general, all regression and classification problems rely on a well-labelled dataset, which can be restrictive: obtaining rich labels is time-consuming and may require the data to be labelled manually by humans.

Consider the world of predictive maintenance: how does one train a model to predict the remaining life of a machine without multiple examples of failures in the training data? In such cases, obtaining a single label can require a lot of time and domain expertise to diagnose and correctly label the problem. If acquiring a single label is so demanding, it is unrealistic to expect more than a handful of labels in the entire training set. It is therefore worth focussing more attention on models which don’t require labelled data.

Unsupervised models make use of unlabelled data: the features of the dataset in the absence of any context, e.g. raw sensor data from a machine. Unlabelled data is often readily available in bulk, especially in the modern era where everything produces readings, measurements or metrics of some kind. This makes it much easier to build an unsupervised model, but it comes at a cost: without any labels, it’s not possible to train a classification or regression model. What we do get is an anomaly detection model, such as an ‘Isolation Forest’ or an ‘Autoencoder’. Such models learn the patterns present in the unlabelled data and determine whether a new datapoint is normal or anomalous. The output can be thought of as a ‘distance score’ between the datapoint in question and the learned representation of the training data. Anomaly detectors are valuable models, providing us with a deeper look at what’s going on with our data and, more importantly, with the machine producing our data.
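To make the idea of a ‘distance score’ concrete, here is a minimal Python sketch. It is deliberately not an Isolation Forest or an autoencoder: as a much simpler stand-in, the ‘learned representation’ is just the per-feature mean and standard deviation of the unlabelled training data, and the score is the standardized Euclidean distance to it. The function names and the sensor readings are made up for illustration.

```python
import math
from statistics import mean, pstdev

def fit(training_rows):
    """Learn per-feature mean and standard deviation from unlabelled rows."""
    columns = list(zip(*training_rows))
    means = [mean(col) for col in columns]
    # Guard against zero spread so the score never divides by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def anomaly_score(row, model):
    """Standardized Euclidean distance between a row and the training data."""
    means, stds = model
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stds)))

# Hypothetical readings: (temperature in degrees Celsius, vibration in mm/s).
normal_readings = [(60.0, 1.0), (61.0, 1.1), (59.0, 0.9), (60.5, 1.0), (59.5, 1.0)]
model = fit(normal_readings)

typical = anomaly_score((60.2, 1.0), model)      # close to the training data
suspicious = anomaly_score((75.0, 2.5), model)   # far from the training data
print(f"typical: {typical:.2f}, suspicious: {suspicious:.2f}")
```

A threshold on the score would then separate normal from anomalous datapoints; how to choose that threshold depends on the application.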

The anomaly detection systems we have implemented at relayr can not only tell us that something is going wrong, but also tell us the symptoms that make the datapoint anomalous. This allows us to perform a form of automatic root-cause analysis. Supervised models, however, can still go beyond this. They can not only aggregate the symptoms to categorize the type of problem, but also predict the expected remaining lifetime of the machine. In addition, the accuracy of a supervised model is expected to be better than that of a solution relying solely on anomaly detection.

Let us take an everyday example that everybody can relate to. If we equipped our body with sensors to monitor our health, an anomaly detector would be able to give us an indication of our overall health condition (the ‘anomaly score’), and it would also tell us the symptoms we suffer from if our health condition is bad. For instance, it might tell us that our body temperature is too high and that we are coughing more often than usual. A predictive model, however, would in addition tell us that we are suffering from a bacterial lung infection, and what our expected remaining lifetime due to that infection is.

Semi-supervised models attempt to bridge this gap between unsupervised models (anomaly detectors in our case) and supervised models, in that they only require a subset of the data to be labelled. An ideal model of this kind would build a clear understanding of the data from the large volume of unlabelled data, and would then require only a small number of labels to unlock predictive capabilities similar in performance to those of a supervised model. The business value of such systems is huge: they relieve the customer of the burden of creating a big labelled dataset (which customers very often don’t want to allocate resources for), while still allowing them to benefit from the upsides of performant predictive systems.

Building on previous work from Kingma et al. at DeepMind, we’ve developed such a model, which adapts seamlessly, functioning on the full 0-100% range of available labels. The result is a ‘unified’ model in which the anomaly detection capability is improved by any available label and, conversely, the predictive capability is significantly boosted by the abundance of unlabelled data.
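One common way to let a single model run with any fraction of labels is to combine an unsupervised loss term, computed on every sample, with a supervised term that is computed only where a label exists. The Python sketch below illustrates that general idea only. It is not the architecture presented in this post, and every name in it (`combined_loss`, `weight`, the example numbers) is a hypothetical choice made for illustration.

```python
import math

def combined_loss(recon_errors, class_probs, labels, weight=1.0):
    """Average an unsupervised term over all samples and a supervised
    term over the labelled samples only.

    recon_errors: per-sample reconstruction error (unsupervised signal)
    class_probs:  per-sample predicted class probabilities
    labels:       class index per sample, or None when the label is missing
    weight:       hypothetical knob balancing the two terms
    """
    unsupervised = sum(recon_errors) / len(recon_errors)

    # Cross-entropy, computed only where a label exists.
    supervised_terms = [
        -math.log(probs[label])
        for probs, label in zip(class_probs, labels)
        if label is not None
    ]
    # With 0% labels the supervised term vanishes and only the
    # unsupervised (anomaly detection) objective remains.
    supervised = (
        sum(supervised_terms) / len(supervised_terms) if supervised_terms else 0.0
    )
    return unsupervised + weight * supervised

# Made-up values for four samples and two classes.
recon_errors = [0.10, 0.30, 0.20, 0.40]
class_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.6, 0.4]]

no_labels = combined_loss(recon_errors, class_probs, [None, None, None, None])
some_labels = combined_loss(recon_errors, class_probs, [0, None, None, 1])
all_labels = combined_loss(recon_errors, class_probs, [0, 1, 0, 1])
print(no_labels, some_labels, all_labels)
```

The same function covers the unlabelled, partially labelled and fully labelled cases, so no separate code path is needed as labels become available.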

The novel additions compared to previous work are (to the best of our knowledge) the following:

  • The model architecture we present. It is simple and allows any existing standard variational autoencoder to be turned into a semi-supervised one with just a couple of lines of code.
  • Investigation of semi-supervised variational autoencoders operating on time series data and utilizing recurrent cells.
  • Using the variational autoencoder setup to regularize a pure supervised model, improving its classification accuracy.
  • Demonstration of an ‘inverse’ semi-supervised setup: we have shown that the anomaly detection performance of our model is boosted when labels are additionally available.