21 September, 2018

One Model to Rule Them All.

Building on previous work from Kingma et al at DeepMind, we’ve developed a semi-supervised model, which adapts seamlessly – functioning on the full 0–100% range of available labels. The result is a ‘unified’ model in which the anomaly detection capability is improved by any available label and, conversely, the predictive capability is significantly boosted by the abundance of unlabelled data.

by Felix Berkhahn, Richard Keys, Wajih Ouertani, Nikhil Shetty.


Supervised machine learning models require the availability of ‘labels’ (or target prediction) in the training dataset. For example, the label for an image classifier would be the object present in the image, whilst for a sales forecasting model, the label would be the transaction price. In general, all regression and classification problems rely on a well labelled dataset, which can be restrictive, considering that obtaining rich labels is time consuming and may require that the data is manually labelled by humans.

Consider the world of predictive maintenance: how does one go about training a model to predict the remaining life of a machine without multiple examples of failures in the training data? In such cases, obtaining a single label can require a lot of time and domain expertise to diagnose and correctly label the problem. If acquiring a single label is so demanding, then it’s unrealistic to expect more than a handful of labels in the entire training set. If that’s the case, perhaps it is worth focussing more attention on models which don’t require labelled data.

Unsupervised models make use of unlabelled data – the features of the dataset in the absence of any context, e.g. raw sensor data from a machine. Unlabelled data is often readily available en masse, especially in the modern era where everything produces readings, measurements or metrics of some kind. This makes it much easier to build an unsupervised model, but it comes at a cost – without any labels, it’s not possible to train a classification or regression model. What we do get is an anomaly detection model, such as an ‘Isolation Forest’ or an ‘Autoencoder’. Such models learn the patterns hidden in unlabelled data and determine whether a new datapoint is normal or anomalous. The output can be thought of as a ‘distance score’ between the datapoint in question and the learned representation of the training data. Anomaly detectors are valuable models, providing us with a deeper look at what’s going on with our data and, more importantly, the machine producing our data.

The anomaly detection systems we have implemented at relayr can not only tell us that something is going wrong, but also tell us the symptoms that make the datapoint anomalous. This allows us to perform a form of automatic root-cause analysis. Supervised models, however, can still go beyond this. They can not only aggregate the symptoms to categorize the type of problem, but also predict the expected remaining lifetime of the machine. In addition, the accuracy of a supervised model is expected to be better than that of a solution relying solely on anomaly detection.

Let us take an everyday example that everybody can relate to. If we equipped our body with sensors to monitor our health condition, an anomaly detector would be able to give us an indication of our overall health condition (the ‘anomaly score’), and it would also tell us the symptoms we suffer from if our health condition is bad. For instance, it might tell us that our body temperature is too high, and that we have to cough more often than usual. A predictive model, however, would *in addition* tell us that we are suffering from a bacterial lung infection, and what our expected remaining lifetime due to that infection is.

Semi-supervised models attempt to bridge this gap between unsupervised models (anomaly detectors in our case) and supervised models, in that they only require a subset of the data to be labelled. A perfect example of such a model could build a clear understanding of the data from the large volume of unlabelled data, and would then only require a small number of labels to unlock predictive capabilities similar in performance to those of a supervised model. The business value for such systems is huge — it relieves the customer from the burden of creating a big labelled dataset (which they very often don’t want to allocate resources for), while still allowing them to benefit from the upsides of performant predictive systems.

Building on some previous work from Kingma et al at DeepMind, we’ve developed such a model, which adapts seamlessly – functioning on the full 0–100% range of available labels. The result is a ‘unified’ model in which the anomaly detection capability is improved by any available label and, conversely, the predictive capability is significantly boosted by the abundance of unlabelled data.

The novel additions compared to previous work are (to the best of our knowledge) the following:

  • The model architecture we present. It is simple and makes it possible to turn any existing standard variational autoencoder into a semi-supervised one with just a couple of lines of code.
  • Investigation of semi-supervised variational autoencoders operating on time series data and utilizing recurrent cells.
  • Using the variational autoencoder setup to regularize a pure supervised model, improving its classification accuracy.
  • Demonstration of an ‘inverse’ semi-supervised setup: We have shown that the anomaly detection performance of our model is boosted by the additional availability of labels.


Our model is built on top of a type of anomaly detector known as an ‘Autoencoder’, a special type of neural network well suited to tasks such as compression, denoising and anomaly detection. Autoencoders consist of two distinct parts, the encoder and the decoder, which are connected sequentially and trained together as a single network.

The encoder is essentially a cone shaped neural network, in which the layers become progressively smaller up until the final and smallest layer which can be as small as a few neurons wide. The decoder is the reverse of the encoder, in which the layers progressively increase in size. The resulting network takes on a ‘sand timer’ shape, with a bottleneck in the centre, known as the ‘latent layer’ (see Fig1).

During training, the loss function is typically the difference (RMSE, for example) between the input and the output – thus training the model to reproduce its input. This results in the encoder learning to compress the input data to a dense representation in the latent layer, whilst the decoder learns to reconstruct the original data from the latent layer. This turns out to be a suitable model for anomaly detection – the decoder will easily be able to reconstruct normal-looking data, producing a low distance score, yet struggle with anomalous data and produce a high distance score.
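To make the idea of a reconstruction-based distance score concrete, here is a minimal NumPy sketch. It is not the model described in this article: as a stand-in for a trained neural Autoencoder it uses a linear one (a PCA projection), and the toy data, the function names and the latent size are all assumptions chosen for illustration.

```python
import numpy as np

def fit_linear_autoencoder(x_train, latent_dim):
    """Fit a linear 'autoencoder' via PCA: returns the data mean and the projection basis."""
    mean = x_train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    return mean, vt[:latent_dim]

def anomaly_score(x, mean, basis):
    """RMSE between each input row and its reconstruction from the latent code."""
    latent = (x - mean) @ basis.T           # encoder: compress to the bottleneck
    reconstruction = latent @ basis + mean  # decoder: expand back to input space
    return np.sqrt(np.mean((x - reconstruction) ** 2, axis=1))

rng = np.random.default_rng(0)
# Toy 'normal' data: points close to a line in 3-D space (one degree of freedom)
t = rng.normal(size=(500, 1))
x_normal = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(500, 3))
mean, basis = fit_linear_autoencoder(x_normal, latent_dim=1)

normal_point = np.array([[1.0, 2.0, -1.0]])     # lies on the learned line
anomalous_point = np.array([[1.0, -2.0, 1.0]])  # lies far away from it
score_normal = anomaly_score(normal_point, mean, basis)[0]
score_anomalous = anomaly_score(anomalous_point, mean, basis)[0]
```

The point that follows the pattern of the training data is reconstructed almost perfectly, whereas the point that breaks the pattern cannot be expressed through the bottleneck and therefore receives a much larger score.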

Sounds great, but the classic Autoencoder suffers from some limitations due to the way data is compressed in the encoder. The compression process can be thought of as a direct mapping from the input to a specific set of activations within the latent layer. This discretized representation in the latent layer restricts the potential of the Autoencoder, especially when used for tasks such as data generation.

Consider a dataset comprising a cluster (see Fig2 chart above); the Autoencoder generates discrete mappings between each datapoint and the activations in the latent layer (represented as bars in Fig2). The loss function favors a model in which each and every datapoint is exactly reconstructed by the decoder, without rewarding the extra understanding that all of them belong to the same cluster. As a consequence, the Autoencoder is more prone to overfitting the training data and finds it harder to understand the relations between the datapoints (in this example, that they all belong to the same cluster). Whenever the encoder maps a new datapoint into a low-density region in the latent layer, the standard Autoencoder struggles to reconstruct that datapoint well, even if it falls into a low-density region within the cluster itself.

It’s easy to see why this is a problem for data generation: the decoder could be used to generate data by adding noise to the latent activations; however, there is a chance that the result is not useful, because the noisy activations can fall between the discrete mappings encoded within the latent layer.

These limitations can be overcome using Variational Autoencoders, which, like the standard Autoencoder, are composed of an encoder and a decoder joined by a bottleneck in the centre. The key difference is in the latent layer: instead of a single layer containing the compressed representation, a Variational Autoencoder forks at the centre into a mu and a sigma layer, followed by a merge into the latent layer (see Fig3).

The mu and the sigma layer are now interpreted as the mean and standard deviation of a normal distribution (hence giving them their names). In other words, every single input training data point is represented as a normal distribution with a mean and standard deviation determined by the encoder (those are visualized with the smooth lines in Fig4 below).

This simple modification allows the latent layer to represent the data as a continuous probability distribution instead of discrete mappings, providing a more abstract representation of the data.

The network can be thought of as follows: the encoder operates in real vector space and compresses the data, the latent layer maps this compressed representation into a continuous function space and finally the decoder is approximating the inverse of this continuous function by transforming the normal distribution of the latent layer into the output probability distribution. This transformation corresponds to the sequence of probability distributions depicted at the left hand side of Fig3: We start with the normal probability distribution of the latent layer, and every layer of the decoder is subsequently deforming the distribution until the output probability distribution is returned.

As the model produces probability distributions as outputs (and not just the most likely reconstruction), it can inherently model relations between datapoints. For instance in the example of Fig2, the model can learn that all datapoints of the training set belong to the same cluster. Not only that, in this example it could even represent every single datapoint of the training set with (almost) the same latent activations — which is the activation that actually corresponds to the probability distribution of the cluster. As a consequence, the variational autoencoder is less prone to overfitting the training data compared to a standard autoencoder, because it does not have to reproduce every single datapoint exactly, but only has to model the probability distribution from which it was drawn. We will see in the next section that the loss function of the variational autoencoder indeed contains a term that is pushing the model to produce such a ‘smeared out’ representation of the dataset.

How can we model the decoder’s mapping in function space with a neural network? We can’t map whole functions (probability distributions) directly using a neural network. We can, however, approximate this mapping by sampling from the probability distribution of the latent layer, and mapping the sampled points using the decoder in the standard way. This is exactly what is depicted with the red dots on the probability distributions on the left hand side of Fig3. On the second of these images, we sampled a bunch of points according to the normal probability distribution of the latent layer. These points are then propagated through the layers, where in the output layer they constitute an approximation of the output probability distribution.
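The sampling step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `toy_decoder` is a hypothetical fixed nonlinear map standing in for a trained decoder, and the values of `mu` and `sigma` are made up rather than produced by a real encoder.

```python
import numpy as np

def sample_latent(mu, sigma, n_samples, rng):
    """Draw latent samples z = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + sigma * eps

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: a fixed nonlinear map."""
    weights = np.array([[1.5, -0.5], [0.3, 2.0]])
    return np.tanh(z @ weights)

rng = np.random.default_rng(42)
mu = np.array([0.5, -1.0])    # assumed output of the mu layer for one input
sigma = np.array([0.1, 0.3])  # assumed output of the sigma layer for the same input

z_samples = sample_latent(mu, sigma, n_samples=1000, rng=rng)
outputs = toy_decoder(z_samples)  # point cloud approximating the output distribution
```

Each row of `outputs` plays the role of one of the red dots: taken together, the propagated samples approximate the output probability distribution without the network ever having to represent a whole function.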

Model overview

As mentioned, our model is inspired by the [M2 model](https://arxiv.org/pdf/1312.6114.pdf) developed by Kingma et al. The architecture is based on the Variational Autoencoder, with the addition of a classification network for labelled data. The latent layer then contains the standard continuous latent distribution from the encoder, but also a latent class label coming from either the classifier or the raw label, depending on availability.


Consider the job of the encoder in a VAE network – it creates a dense, abstract representation of the data before transforming it into a continuous function space. In a typical classification network, the output layer is determined by the activations of the previous layer – which is again a dense, abstract representation of the input. Whether or not a label is available, there is no reason why this abstract representation of the data should differ, meaning the classifier can share neurons with the encoder before the layers are split, now into mu, sigma and pi layers (where pi contains the latent class label), see Fig5. The benefit of this is twofold: firstly, the model contains fewer neurons, which improves training time; secondly, and more importantly, all of the weights in the classifier network (apart from the pi layer) are directly trained by unlabelled data by way of the encoder.
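A rough sketch of this fork, written as a plain NumPy forward pass, might look as follows. The layer sizes, the single shared hidden layer, the random weights and the choice to predict the log-variance (and derive sigma from it) are all assumptions for illustration; the article does not specify these details.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def init_params(input_dim, hidden_dim, latent_dim, n_classes, rng):
    """Random weights for a shared trunk plus three heads (all sizes are illustrative)."""
    def dense(n_in, n_out):
        return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)
    return {
        "trunk": dense(input_dim, hidden_dim),
        "mu": dense(hidden_dim, latent_dim),
        "log_var": dense(hidden_dim, latent_dim),
        "pi": dense(hidden_dim, n_classes),
    }

def encode(x, params):
    """Shared trunk followed by the fork into mu, sigma and pi."""
    w, b = params["trunk"]
    hidden = np.tanh(x @ w + b)    # representation shared by encoder and classifier
    mu = hidden @ params["mu"][0] + params["mu"][1]
    log_var = hidden @ params["log_var"][0] + params["log_var"][1]
    sigma = np.exp(0.5 * log_var)  # guarantees a positive standard deviation
    pi = softmax(hidden @ params["pi"][0] + params["pi"][1])
    return mu, sigma, pi

rng = np.random.default_rng(0)
params = init_params(input_dim=784, hidden_dim=64, latent_dim=8, n_classes=10, rng=rng)
x_batch = rng.random((5, 784))     # toy batch standing in for flattened images
mu, sigma, pi = encode(x_batch, params)
```

Only the `"pi"` weights are specific to the classifier; everything in `"trunk"` is shared, which is why unlabelled data can train most of the classification path.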

Loss function

The loss function is formed as the sum of the cross entropy between the original input and the output of the decoder, the KL divergence of the mu and sigma layers and (in the case of a label) the cross entropy of the label and the pi layer:

Here, q corresponds to the output probability density function and p to the actual probability density function (these can be delta functions in the continuous case, in which the sums would also be replaced with integrals). Similarly, pi corresponds to the actual label, and r to the probability of that label as output by the model. If the label is not known, all pi’s are set to zero for this datapoint. The index k sums over all different labels, while i sums over all different output values. If we have more than one feature, the mean of L\_ent and L\_kl over these is taken.
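Written out with the symbols defined above, one form consistent with this description is sketched below. The name L\_label for the third term, the index j over the latent dimensions, the closed-form Gaussian expression for the KL term and the equal weighting of the three terms are assumptions made for this sketch, and may differ from the exact formula used in the model.

```latex
\begin{aligned}
L &= L_{ent} + L_{kl} + L_{label} \\
L_{ent} &= -\sum_i p_i \log q_i \\
L_{kl} &= \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right) \\
L_{label} &= -\sum_k \pi_k \log r_k
\end{aligned}
```

For an unlabelled datapoint all pi’s are zero, so the third term vanishes and only the first two terms contribute.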

Let’s break down the loss function and examine from an intuitive perspective how each of the three terms helps the model to learn. Firstly, note that the first two terms comprise the loss function of a standard VAE: a cross entropy term and a KL term. The KL term pushes the latent layer to stay close to a standard Gaussian probability distribution. Without it, the Gaussian distribution may degenerate into a delta function, in which case the VAE will behave exactly the same as a normal Autoencoder. It would then be unable to capture any information beyond the datapoints seen in training. The resulting decoder would only be able to learn static mappings, but with the KL term, the decoder learns to reconstruct data from a continuous distribution function, giving it a stronger ability to generalise.
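A short worked example shows why the KL term discourages this collapse. It uses the standard closed-form KL divergence between a one-dimensional Gaussian and the standard normal distribution; the helper name and the chosen sigma values are purely illustrative.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single latent dimension."""
    return 0.5 * (mu**2 + sigma**2 - math.log(sigma**2) - 1.0)

# Penalty for a latent distribution centred at zero as it narrows towards a delta function
penalties = {sigma: kl_to_standard_normal(0.0, sigma) for sigma in (1.0, 0.1, 0.01)}
```

The penalty is zero for a standard Gaussian and grows without bound as sigma shrinks towards zero, so a latent distribution can only narrow if the reconstruction term gains more than the KL term loses.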

These first two terms take effect with every datapoint, whether a label is present or not, which results in the entire network being trained with the exception of the pi layer. The direct effect of an unlabelled data point is to thus train the entire anomaly detector and the majority of the classification network.

The third term only contributes to the loss in the event of a label. The effect of this term is twofold: firstly, it teaches the pi layer to translate the compressed representation of the data into a class label; secondly, it forces the rest of the network to learn a better representation. If the loss of the first two terms is low, i.e. the VAE thinks it has found a good representation, but the predicted labels are wrong, the loss will be higher and force the network to try again. The loss function is therefore improving all parts of the network with all types of data. Excluding the pi layer, the entire classifier is trained by unlabelled data whilst the entire network is improved by a label – resulting in a model that’s improved by any and all available data.
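The interplay of the three terms can be sketched numerically. The function below is a minimal NumPy illustration, assuming discrete output distributions, a closed-form Gaussian KL term and equal weighting of the terms; the function name, argument names and toy values are hypothetical and not taken from our implementation.

```python
import numpy as np

def semi_supervised_loss(p, q, mu, sigma, label_one_hot, r, eps=1e-12):
    """Per-datapoint loss: reconstruction cross entropy + KL term + label cross entropy.

    label_one_hot is all zeros for an unlabelled datapoint, which switches the
    third term off for that row.
    """
    l_ent = -np.sum(p * np.log(q + eps), axis=1)
    l_kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2 + eps) - 1.0, axis=1)
    l_label = -np.sum(label_one_hot * np.log(r + eps), axis=1)
    return l_ent + l_kl + l_label

# Two datapoints with identical reconstructions and latent statistics
p = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])     # actual output distribution
q = np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1]])     # distribution produced by the decoder
mu = np.zeros((2, 4))
sigma = np.ones((2, 4))
label_one_hot = np.array([[1.0, 0.0], [0.0, 0.0]])   # first row labelled, second unlabelled
r = np.array([[0.7, 0.3], [0.7, 0.3]])               # class probabilities from the pi layer

losses = semi_supervised_loss(p, q, mu, sigma, label_one_hot, r)
```

Both rows receive the same reconstruction and KL contributions; only the labelled row is additionally penalised for the probability mass the pi layer failed to put on the correct class.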

It turns out, as we shall see in more detail later on, that this architecture outperforms equivalent supervised models. In a standard classifier, the loss function is designed to achieve a single task – update the network weights such that the correct activation is triggered in the output layer. In our model, the loss is performing two distinct tasks: achieve the correct classification, but also reconstruct the original input as closely as possible.

Pure supervised application

Consider a situation in which all datapoints in the training dataset are labelled, and we are interested in building a classifier. In this context, we would usually start off by building a neural network that looks like only the red colored parts of Fig6.

What we tried instead was to still use the full Variational Autoencoder network, including the pi layer, and to train it end-to-end on the fully labelled training dataset. One can argue that the decoder (and latent layer) act as a sort of regularizer (see Fig7) in this context, as the model not only has to produce the correct label, but also has to reproduce the input. It hence has to build up a better representation of the input data in its deeper layers, which both the pi layer and the decoder feed off.

If we train the network as such, and for inference cut out the regularizer piece as in Fig6, it turns out that the remaining classifier piece outperforms an equivalent classification model that was trained in the standard way (i.e. only to produce the correct label outputs). The decoder hence indeed behaves as a regularizer, helping the classifier model to find a good local minimum.

Anomaly Detection application

In the previous sections, we saw how unlabelled datapoints aid the classification performance of a variational autoencoder. Maybe the opposite is true as well: Does the availability of labels aid tasks usually addressed by purely unsupervised systems, such as anomaly detection?

From a purely theoretical perspective, there is again good reason to believe that this is the case. If we ask the model not only to reproduce the input, but also to produce the correct label, the model is again forced to build better representations in its dense layers compared to a purely unsupervised model. This in turn should also help the reconstruction (and hence anomaly detection) task.

In a way, this is the opposite of what we did in the previous section. Instead of cutting out the decoder part after training, we now cut out the pi part (see Fig8).

We will see in the results section that the anomaly detection performance is indeed improved by the availability of labels in the training dataset.


We experimented with three different variations of this architecture:

  • The simple dense layered architecture as described above (in which both the encoder and decoder are modeled with vanilla networks)
  • A variation with convolutional layers
  • A recurrent variation with LSTM layers.

Results & Experiments

So, how do the models perform? For the static models we will be using the MNIST digit recognition dataset and for the recurrent models we shall be using the UCI-HAR (Human Activity Recognition) dataset. For each task, we shall also be comparing the results to those of an equivalent supervised / unsupervised model to really see the improvements offered by our architecture.

From here on, the models will be referred to as follows:


As we’re proposing a semi-supervised model, we’ll start by examining the performance of our model in the area in which it is designed to excel – classification with limited labels in the training data.

Static models

Firstly, we’ll examine the static models, trained for 20 epochs on the MNIST dataset. We’ll be using all 60,000 available training images, but simply removing the labels from most of them, meaning the equivalent supervised model will not be able to train on the unlabelled images.
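Removing labels in this way can be sketched as follows. The snippet is an illustration only: it uses random toy labels instead of the real MNIST labels, the helper name `mask_labels` is hypothetical, and it assumes the convention that an unlabelled datapoint is encoded as an all-zero target row.

```python
import numpy as np

def mask_labels(labels, n_classes, n_labelled, rng):
    """One-hot encode labels, keeping only n_labelled of them; the rest become all-zero rows."""
    one_hot = np.eye(n_classes)[labels]
    keep = rng.choice(len(labels), size=n_labelled, replace=False)
    mask = np.zeros(len(labels), dtype=bool)
    mask[keep] = True
    one_hot[~mask] = 0.0   # an unlabelled datapoint carries no class information
    return one_hot, mask

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)  # toy stand-in for the MNIST digit labels
targets, is_labelled = mask_labels(labels, n_classes=10, n_labelled=100, rng=rng)
```

A semi-supervised model can train on all rows of such a dataset, whereas a purely supervised model can only use the rows for which `is_labelled` is true.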


Unsurprisingly, both flavours of semi-supervised model drastically outperform their supervised counterparts, with the simple dense model showing an impressive improvement of 6.3%, and the convolutional version showing an even bigger improvement of 13.9%. Notice however that even EQ\_S\_D outperforms EQ\_S\_C, which is unexpected, as convolutional models are much better suited to such tasks. The score here simply stems from the lack of training data for the convolutional model, which requires many more samples to properly converge.

Increasing the number of available labels to 1000, we see the performance of EQ\_S\_C drastically improve as the model now has enough images to converge, yet the semi-supervised SS\_C still outperforms the model by 2.7%. Interestingly, in the case of the dense models, the gap between SS\_D and EQ\_S\_D has increased to 10.1%.

Increasing the number of labels to 100%, the gap closes significantly, but the semi-supervised architectures still score slightly higher than the supervised equivalents. This is where the regularising effect of the decoder becomes apparent: the training requirement of reconstructing the original input forces the classification part of the model to gain a deeper, more generalised representation of the data. This effect is even more pronounced when considering the log loss.

Recurrent models

The recurrent models were tested in the same way using the UCI-HAR data, with the results showing an even more drastic improvement than we saw for the static models. For the following experiments, the models were trained for 40 epochs with varying label availability.


Again, the semi-supervised model clearly outperforms its supervised equivalent in every category; in this case, however, the difference at lower label availability is significantly larger.

Anomaly detection

The anomaly detection task is set up as follows: Take a fully labelled dataset and choose one of the label classes as the one that you want to consider ‘anomalous’ in what follows. Train an anomaly detector on a sample of the datapoints of the remaining ‘normal’ classes. At inference time, compare the anomaly scores the anomaly detector produces using the ‘anomalous’ datapoints with those of a held-out set of ‘normal’ datapoints.

When we consider the VAE with pi layer, we additionally feed the label class of the ‘normal’ datapoint into the model while training.
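This protocol can be sketched as below. The snippet is a toy illustration under stated assumptions: the data are synthetic 2-D clusters rather than MNIST, the ‘detector’ is a simple distance to the training mean rather than a VAE, and ROC AUC is used as one possible way of comparing the two sets of scores (the metric, like the function names, is an assumption of this sketch).

```python
import numpy as np

def split_for_anomaly_detection(x, y, anomalous_class, holdout_fraction, rng):
    """Train on 'normal' classes only; hold out some normal points plus all anomalous ones."""
    normal_idx = rng.permutation(np.flatnonzero(y != anomalous_class))
    n_holdout = int(len(normal_idx) * holdout_fraction)
    holdout_idx, train_idx = normal_idx[:n_holdout], normal_idx[n_holdout:]
    anomalous_idx = np.flatnonzero(y == anomalous_class)
    return x[train_idx], y[train_idx], x[holdout_idx], x[anomalous_idx]

def roc_auc(scores_normal, scores_anomalous):
    """Probability that a random anomalous point scores higher than a random normal one."""
    diff = scores_anomalous[:, None] - scores_normal[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=600)
centres = np.array([[0.0, 0.0], [1.0, 0.0], [6.0, 6.0]])
x = centres[y] + 0.3 * rng.standard_normal((600, 2))

x_train, y_train, x_holdout, x_anomalous = split_for_anomaly_detection(
    x, y, anomalous_class=2, holdout_fraction=0.2, rng=rng)

# Stand-in detector: distance to the mean of the 'normal' training data
centre = x_train.mean(axis=0)

def toy_score(points):
    return np.linalg.norm(points - centre, axis=1)

auc = roc_auc(toy_score(x_holdout), toy_score(x_anomalous))
```

The labels `y_train` of the ‘normal’ datapoints are returned alongside the features so that a label-consuming model can be trained on exactly the same split as its unsupervised counterpart.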

For the dense and convolutional models, we were again using the MNIST dataset for benchmarking. The results are as follows:


These results show that, in these experiments, the model consuming labels always outperforms the equivalent unsupervised model not consuming labels, which supports the value proposition.

Data generation

A bonus feature of this model is data generation. As in a VAE, the data is represented as a probability distribution, so we can generate data with the decoder by randomly sampling from this distribution. As with the M2 model, the addition of the latent class label allows us to specify the class of data we want to generate. For example, in the case of MNIST we can specify which digit we would like to generate. Another feature retained from the M2 model is the ability to separate the class of data from the style. As the latent layer is essentially just a probabilistic representation of the data on which the model was trained, the point at which we sample from the distribution determines the source material from which our data will be generated. So, with MNIST for example, sampling from the centre of the distribution will generate the most common handwriting styles, which vary as we move to the edges of the distribution. Put simply, it’s possible to generate any digit with any handwriting style (limited to what’s available in the training set of course).

Fig9 shows an example of different styles of the digit two generated by the model:
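The separation of class and style can be sketched by looking at how a decoder input might be assembled. This is a hypothetical illustration: it assumes that the continuous latent sample and a one-hot class label are simply concatenated before being passed to the decoder, and the helper name, latent size and `spread` parameter are not taken from our implementation.

```python
import numpy as np

def build_decoder_input(digit, n_samples, latent_dim, n_classes, rng, spread=1.0):
    """Concatenate style samples z ~ N(0, spread^2) with a one-hot class label."""
    z = spread * rng.standard_normal((n_samples, latent_dim))  # handwriting 'style'
    one_hot = np.zeros((n_samples, n_classes))
    one_hot[:, digit] = 1.0                                     # requested class
    return np.concatenate([z, one_hot], axis=1)

rng = np.random.default_rng(0)
decoder_input = build_decoder_input(digit=2, n_samples=16, latent_dim=8, n_classes=10, rng=rng)
```

Every row requests the same digit but carries a different style vector; a larger `spread` would draw styles from further out in the distribution, and a trained decoder would turn each row into one generated image.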

Summary & Conclusions

In this article, we presented a new model that seamlessly operates in the full range of unsupervised, semi-supervised and supervised learning contexts. The unsupervised learning target (for instance anomaly detection) is improved by any available label, while the supervised learning target (for instance classification) is conversely improved by any datapoint, irrespective of whether it is labelled or not. Even if all datapoints are labelled, we have shown that the classification accuracy is improved compared to an equivalent supervised model, as the decoder effectively acts as a regularizer.

The business value of such a model is huge:

  • Due to the semi-supervised nature of the model, fewer labels are needed to perform classification tasks. This adds a lot of value, as labels are usually scarce and collecting them costs both time and money. The presented model hence greatly facilitates on-boarding new customers to predictive systems.
  • You only have to maintain one model even if you serve both unsupervised and supervised tasks in your company.
  • New customers can seamlessly be on-boarded to the full range of your products. Consider the situation where you start with a customer that does not have any labels yet. You can still use the presented variational autoencoder model in this situation, where it will start building up useful representations of the customer’s data, and will start producing results for anomaly detection. Once the first labels come in, you can feed them into the same model, without having to configure or train yet another model for the classification task. On the contrary, the model will automatically leverage the unlabelled datapoints it was previously exposed to for the classification task from day one.
  • The model is expected to produce better results than specialized models for unsupervised and supervised tasks.