Our model is built on top of a type of anomaly detector known as an ‘Autoencoder’, a special type of neural network well suited to tasks such as compression, denoising and anomaly detection. Autoencoders consist of two distinct parts, the encoder and the decoder, which are connected sequentially and trained together as a single network.
The encoder is essentially a cone-shaped neural network in which the layers become progressively smaller, down to the final and smallest layer, which can be as narrow as a few neurons. The decoder is the reverse of the encoder: its layers progressively increase in size. The resulting network takes on a ‘sand timer’ shape, with a bottleneck in the centre known as the ‘latent layer’ (see Fig1).
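To make the ‘sand timer’ shape concrete, here is a minimal sketch that only computes layer widths. The helper name `hourglass_layer_sizes` and the geometric shrink rule are assumptions made for illustration; the text does not say how the layer sizes of the actual model were chosen.

```python
def hourglass_layer_sizes(input_dim, latent_dim, steps):
    """Layer widths that shrink geometrically to latent_dim, then grow back.

    Hypothetical helper: the shrink rule is an illustrative assumption,
    not the sizing used by any particular model.
    """
    if not (0 < latent_dim < input_dim) or steps < 1:
        raise ValueError("need 0 < latent_dim < input_dim and steps >= 1")
    ratio = (latent_dim / input_dim) ** (1.0 / steps)
    # Encoder: widths from the input layer down to the bottleneck.
    encoder = [max(latent_dim, round(input_dim * ratio ** i)) for i in range(steps)]
    encoder.append(latent_dim)
    # Decoder: the encoder mirrored, without repeating the bottleneck.
    decoder = encoder[-2::-1]
    return encoder + decoder


sizes = hourglass_layer_sizes(input_dim=64, latent_dim=4, steps=2)
print(sizes)  # [64, 16, 4, 16, 64]
```

The smallest entry in the middle of the list is the latent layer; the list is symmetric because the decoder mirrors the encoder.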
During training, the loss function is typically a measure of the difference between the input and the output (RMSE, for example), which trains the model to reproduce its input. As a result, the encoder learns to compress the input data into a dense representation in the latent layer, whilst the decoder learns to reconstruct the original data from that representation. This turns out to be a suitable model for anomaly detection: the decoder can easily reconstruct normal-looking data, producing a low distance score, yet struggles with anomalous data and produces a high distance score.
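The scoring idea can be sketched in a few lines. This is an illustration under stated assumptions, not the model's actual scoring code: the vectors below are made up, the reconstructions are written by hand rather than produced by a trained network, and the threshold of 1.0 is arbitrary.

```python
import math


def rmse(original, reconstruction):
    """Root-mean-square error between an input vector and its reconstruction."""
    if len(original) != len(reconstruction):
        raise ValueError("vectors must have the same length")
    squared = [(o - r) ** 2 for o, r in zip(original, reconstruction)]
    return math.sqrt(sum(squared) / len(squared))


def flag_anomalies(scores, threshold):
    """Mark every distance score above the (hypothetical) threshold as anomalous."""
    return [score > threshold for score in scores]


# Made-up inputs and hand-written reconstructions, for illustration only.
normal_input, normal_output = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
odd_input, odd_output = [1.0, 2.0, 9.0], [1.0, 2.0, 3.0]

scores = [rmse(normal_input, normal_output), rmse(odd_input, odd_output)]
flags = flag_anomalies(scores, threshold=1.0)
print(flags)  # [False, True]
```

In practice the threshold would have to be chosen from the distribution of scores on data known to be normal.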
Sounds great, but the classic Autoencoder suffers from some limitations due to the way data is compressed in the encoder. The compression process can be thought of as a direct mapping from the input to a specific set of activations within the latent layer. This discretized representation in the latent layer restricts the potential of the Autoencoder, especially for tasks such as data generation.
Consider a dataset comprising a cluster (see Fig2 chart above); the Autoencoder generates discrete mappings between each datapoint and the activations in the latent layer (represented as bars in Fig2). The loss function favors a model in which every datapoint is exactly reconstructed by the decoder, without rewarding the extra understanding that all of them belong to the same cluster. As a consequence, the Autoencoder is more prone to overfitting the training data and has a harder time learning the relations between the data points (in this example, that they all belong to the same cluster). Whenever the encoder maps a new datapoint into a low-density region in the latent layer, the standard Autoencoder struggles to reconstruct that datapoint well, even if that region lies within the cluster itself.
It’s easy to see why this is a problem for data generation: the decoder could be used to generate data by adding noise to the latent activations, but the result may not be useful, because the noisy activations can fall between the discrete mappings encoded within the latent layer.
These limitations can be overcome using Variational Autoencoders, which, like the standard Autoencoder, are composed of an encoder and a decoder joined by a bottleneck in the centre. The key difference is in the latent layer: instead of a single layer containing the compressed representation, a Variational Autoencoder forks at the centre into a mu and a sigma layer, followed by a merge into the latent layer (see Fig3).
The mu and the sigma layer are now interpreted as the mean and standard deviation of a normal distribution (hence giving them their names). In other words, every single input training data point is represented as a normal distribution with a mean and standard deviation determined by the encoder (those are visualized with the smooth lines in Fig4 below).
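A minimal sketch of how the merge of the mu and sigma layers could work, assuming the commonly used approach of drawing standard normal noise `eps` and computing `z = mu + sigma * eps`. The function name `sample_latent` and the example values are hypothetical; the text does not spell out how the merge is implemented in this model.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw one latent vector z with z[i] ~ Normal(mu[i], sigma[i]).

    Assumption: the merge is done as z = mu + sigma * eps with eps ~ Normal(0, 1).
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
mu = [0.5, -1.0]         # made-up encoder outputs for a single input
sigma = [0.1, 0.3]

samples = [sample_latent(mu, sigma, rng) for _ in range(5000)]
mean_first = sum(z[0] for z in samples) / len(samples)
print(round(mean_first, 2))  # close to mu[0]
```

With sigma set to zero the sample collapses to mu, which recovers the single-point mapping of the standard Autoencoder.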
This simple modification allows the latent layer to represent the data as a continuous probability distribution instead of discrete mappings, providing a more abstract representation of the data.
The network can be thought of as follows: the encoder operates in real vector space and compresses the data, the latent layer maps this compressed representation into a continuous function space, and finally the decoder approximates the inverse of this continuous function by transforming the normal distribution of the latent layer into the output probability distribution. This transformation corresponds to the sequence of probability distributions depicted on the left-hand side of Fig3: we start with the normal probability distribution of the latent layer, and every layer of the decoder successively deforms the distribution until the output probability distribution is returned.
As the model produces probability distributions as outputs (and not just the most likely reconstruction), it can inherently model relations between data points. For instance, in the example of Fig2, the model can learn that all data points of the training set belong to the same cluster. Not only that, in this example it could even represent every single datapoint of the training set with (almost) the same latent activations, namely the activations that correspond to the probability distribution of the cluster. As a consequence, the Variational Autoencoder is less prone to overfitting the training data than a standard Autoencoder, because it does not have to reproduce every single datapoint exactly, but only has to model the probability distribution from which it was drawn. We will see in the next section that the loss function of the Variational Autoencoder indeed contains a term that pushes the model to produce such a ‘smeared out’ representation of the dataset.
How can we model the decoder’s mappings in function space with a neural network? We can’t map whole functions (probability distributions) directly using a neural network. We can, however, approximate this mapping by sampling from the probability distribution of the latent layer and mapping the sampled points using the decoder in the standard way. This is exactly what is depicted with the red dots on the probability distributions on the left-hand side of Fig3. In the second of these images, we sampled a number of points according to the normal probability distribution of the latent layer. These points are then propagated through the layers of the decoder, and at the output layer they constitute an approximation of the output probability distribution.
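The sampling approximation can be sketched as follows. Everything here is an illustrative assumption: `toy_decoder` is a hypothetical fixed nonlinear function standing in for a trained decoder, and the latent layer is one-dimensional to keep the example short.

```python
import math
import random
import statistics


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: 1-D latent to 1-D output."""
    hidden = math.tanh(2.0 * z)   # a single nonlinear 'layer'
    return 3.0 * hidden + 1.0     # a linear output 'layer'


def approximate_output_distribution(mu, sigma, n_samples, rng):
    """Sample latent points from Normal(mu, sigma) and push each through the decoder."""
    latent_points = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    return [toy_decoder(z) for z in latent_points]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
outputs = approximate_output_distribution(mu=0.0, sigma=1.0, n_samples=2000, rng=rng)

# The propagated samples form an empirical approximation of the output
# distribution, which can be summarised or plotted as a histogram.
print(statistics.mean(outputs), statistics.stdev(outputs))
```

The more points are sampled, the closer the empirical distribution of `outputs` gets to the distribution the decoder induces, at the cost of one decoder pass per sample.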