An introduction to unsupervised learning with Fastai
Introduction
As we all know, Deep Learning has really soared these past years and has become the face of “AI” today. But from both a theoretical and a practical point of view, can we really call an algorithm “intelligent” when it needs not only millions of images, but also labels for those images, just to understand what a cat is?
We know that what makes humans so efficient at learning is the ability to learn from little data, create abstractions, and apply them to several tasks. This is not the case today in the paradigm of supervised learning, where we provide massive amounts of data and labels.
In my opinion, this is not a sustainable solution. For ecological reasons first, as training huge models on huge datasets consumes a lot of resources. It also does not offer everyone a chance to create their own competitive algorithm, as we do not all have the same amount of data as Google, nor clusters of TPUs. Even though transfer learning has enabled people to (re)train effective models on custom and often small datasets, it mainly works for Computer Vision and NLP; other cases such as tabular data, recommender systems, or signals cannot be pretrained in the same way, as they are much more specific to the data at hand. Moreover, it can also happen that on your own datasets you have not only little data, but also few labels, because labels are costly to acquire, e.g. medical labels.
To address this issue of little data and few labels, we will discover semi-supervised training: we first try to learn the data itself, without asking the model to classify anything, by training an unsupervised feature extractor, and we then train a classifier on top of it.
As a huge fan of Fast.ai, I will try to adopt the same top-down approach: we will first play with the data, then gradually uncover the ideas behind it.
Unsupervised learning
Unsupervised learning differs from supervised learning in that we no longer try to predict a variable y from a variable x; we simply try to learn more about the distribution of x itself.
To get the intuition behind this, imagine you are a beginner at painting. You are provided tons of paintings from your favorite painter, and your goal is to learn more about painting. One way of achieving this, without any supervision, is simply to look at the paintings, first trying to grasp their essence, then repainting them and checking whether you managed to produce something similar. If not, you will try to learn from your mistakes: maybe you did not manage to capture the spirit of the painting, or you understood it but your painting skills were poor.
But we can expect that after doing this for some time, you will become very good at painting, having honed two skills:
- Grasping the essence of paintings, which means looking at the raw image and extracting features such as how colors are mixed, the style of the strokes, etc. This is what we will call the encoder.
- Recovering the original painting from the abstract representation extracted by the encoder. This is what we will call the decoder.
This is exactly how autoencoders work: an encoder compresses the original input x into a condensed high-level representation z, then a decoder tries to recover the original input x using only z.
In practice, our encoder and decoder will be neural networks, and we will train them jointly with a loss function, usually the mean squared error between the original input and the reconstructed input.
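To make this concrete, here is a minimal sketch of an autoencoder in PyTorch. The layer sizes and the simple fully-connected architecture are illustrative choices for this sketch, not the ones used later in the article:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """A minimal autoencoder: x -> z -> x_hat."""
    def __init__(self, n_in=784, n_hidden=16):
        super().__init__()
        # The encoder compresses x into a small representation z
        self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(),
                                     nn.Linear(128, n_hidden))
        # The decoder tries to reconstruct x from z alone
        self.decoder = nn.Sequential(nn.Linear(n_hidden, 128), nn.ReLU(),
                                     nn.Linear(128, n_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
loss_fn = nn.MSELoss()   # reconstruction loss between x and x_hat
x = torch.rand(32, 784)  # a dummy batch of flattened 28x28 images
loss = loss_fn(model(x), x)
loss.backward()
```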
In the end, by training our autoencoder, we train both the encoder and the decoder. Since we need to be able to reconstruct x from z, the encoder must learn to extract useful features, and it is this feature extractor that we will reuse for classification later.
Semi-supervised learning
To use once again the analogy of the painter, imagine that you are given all the paintings of the post-impressionists without being told who painted each one. You first learn how paintings are made through this autoencoder procedure; then we give you ten paintings by Van Gogh and ask you to identify the remaining Van Gogh paintings in the whole dataset of post-impressionist paintings. Having seen and captured the essence of post-impressionist painting, it will be much easier for you to identify the paintings by Van Gogh, because you are able to extract meaningful representations of those paintings, and you know what a few paintings by Van Gogh look like.
This is the essence of semi-supervised learning: first learning how things are generated in order to understand their underlying features, then training a classifier on top of that!
Experiments
Now that we have seen the core spirit of semi-supervised learning, where we first exploit unsupervised training without labels and then train a classifier on top of our feature extractor with all or part of the labels, we will get our hands dirty with some experiments.
As MNIST has become the “Hello world!” of machine learning, we will work on it too, but this time with a special constraint: we will use only 512 of the 60,000 samples of MNIST, so less than 1% of the total data, and even fewer labels!
I have chosen the excellent framework Fast.ai to develop a quick and intuitive interface for playing with unsupervised learning. For those of you who do not know about Fast.ai, I strongly recommend watching the amazing videos available on YouTube!
To make training autoencoders easier, I coded a little library called fastai_autoencoder, which facilitates writing autoencoders and keeps the code as clear as the usual fastai implementations.
Now let’s dive into the code ! The GitHub of the project can be found here : https://github.com/dhuynh95/fastai_autoencoder
First we import fastai. I will not go into too much detail about the custom imports from fastai_autoencoder, but to give a quick idea, the bottleneck is the way you choose to filter the information in the encoder. There are several ways to encode the information, which we will cover later. Here we will use a VAE (Variational Autoencoder), which adds a special term to the reconstruction loss: a Kullback-Leibler divergence term, which we will also cover later.
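To give a rough idea ahead of that dedicated article, here is a sketch of the two standard ingredients of a VAE: the encoder outputs a mean and a log-variance for z, a reparameterized sample keeps gradients flowing, and the loss adds the KL divergence against a standard Gaussian prior on top of the reconstruction term. This is the textbook formulation, not code taken from fastai_autoencoder:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, keeping gradients through mu and logvar."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x_hat, x, mu, logvar):
    """Reconstruction loss plus KL divergence to a standard normal prior."""
    rec = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL between N(mu, sigma^2) and N(0, 1)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```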
Finally, we import a custom Learner for autoencoders, which is a kind of wrapper that facilitates training, especially handling callbacks. I created a subclass called VisionAELearner because there are some interesting visualizations we can get with autoencoders on images.
There is a get_data function which you can have a look at, but it is a bit long: it simply selects a few random samples from MNIST, like 512, plus some validation examples, and stores the rest of the samples, which will be used as a test set for the classifier.
Here we simply create our DataBunch with get_data, which builds a DataBunch with empty labels, as we do unsupervised training first, and selects only 512 images for the training set.
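I will not reproduce the exact get_data function here, but its idea can be sketched with plain torchvision. The split sizes and variable names below are illustrative, not the exact ones from the repository:

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST(".", train=True, download=True,
                       transform=transforms.ToTensor())

# Shuffle the indices once, then carve out the three splits
idx = torch.randperm(len(mnist))
train_idx, valid_idx, test_idx = idx[:512], idx[512:1024], idx[1024:]

train_ds = torch.utils.data.Subset(mnist, train_idx)  # 512 unsupervised samples
valid_ds = torch.utils.data.Subset(mnist, valid_idx)  # small validation split
test_ds  = torch.utils.data.Subset(mnist, test_idx)   # held out for the classifier
```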
Then we create our model. You can see that I have tried to factor my code and make it easy to play with, as I experimented with a few different options and implemented different papers to compare results. For now let's skip this: all you need to know is that at the end, we create a model by combining three pieces: the encoder, the bottleneck, and the decoder. Finally we wrap everything in a fastai Learner to use the APIs we are used to.
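The exact layers live in fastai_autoencoder, but the assembly can be sketched as three plain PyTorch modules chained together. The channel sizes are illustrative, and this sketch uses a simple transposed-convolution decoder rather than the Spatial Broadcast Decoder discussed further below:

```python
import torch.nn as nn

encoder = nn.Sequential(                    # 1x28x28 image -> flat feature vector
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
)
bottleneck = nn.Linear(64 * 7 * 7, 16)      # compress to a z of size 16
decoder = nn.Sequential(                    # z -> reconstructed 1x28x28 image
    nn.Linear(16, 64 * 7 * 7), nn.ReLU(),
    nn.Unflatten(1, (64, 7, 7)),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
)
model = nn.Sequential(encoder, bottleneck, decoder)
```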
Notice here that the final size of the bottleneck is 16, which means that we take a 28*28 = 784 pixel image, compress it to 16 variables, and reconstruct a 784 pixel image. This means we have a compression factor of 49!
Now we can do the usual learn.lr_find() and learn.recorder.plot(), and get our usual little fastai plot:
Just before our model starts training, we can have a look at the reconstructions using a little function I coded called plot_rec(), a method of VisionAELearner that plots the reconstructions.
Here we can see that the reconstruction is just random noise, as we have not yet started training. So what are we waiting for? Let's go!
Here I proceeded in two phases: first I trained without the KL divergence term, then I added it back, as is suggested here: https://orbit.dtu.dk/files/121765928/1602.02282.pdf
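A sketch of this two-phase idea: weight the KL term by a coefficient beta, set it to 0 for the first phase, then raise it to 1. The loop below assumes a model returning (x_hat, mu, logvar) along with a train_loader and an optimizer built beforehand, and the epoch counts are illustrative, not my exact schedule:

```python
import torch
import torch.nn.functional as F

def train_phase(model, loader, opt, epochs, beta):
    """One training phase, with the KL term weighted by beta."""
    for _ in range(epochs):
        for x, _ in loader:                   # labels are ignored
            x_hat, mu, logvar = model(x)      # assumed model output
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
            loss = F.mse_loss(x_hat, x, reduction="sum") + beta * kl
            opt.zero_grad(); loss.backward(); opt.step()

# Phase 1: pure reconstruction. Phase 2: the full VAE objective.
train_phase(model, train_loader, opt, epochs=10, beta=0.0)
train_phase(model, train_loader, opt, epochs=10, beta=1.0)
```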
Now that training is over, let us have a look at our reconstructions:
Ah! It's much better. Even though we see some minor changes, the reconstruction is correct; it even smooths the edges a little, despite having compressed the input by a factor of 49 in the encoder!
To continue having fun, we can look at the outputs of the different steps, using a little function I created called plot_rec_steps, which visualizes the output of each reconstruction step performed in the decoder through convolutions:
Here, the output at step 0 is the output just after we receive our encoded information z of size 16, which has been broadcast to the whole grid using the Spatial Broadcast Decoder, which we will cover later. The decoder then applies several convolutions, and we can see how our image gets reconstructed: first the decoder recovers the overall shape, which can sometimes resemble other digits, then the last layers add the finer details and we recover the original image.
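We will cover the Spatial Broadcast Decoder properly later, but its core trick is short enough to sketch here: tile z across a spatial grid, append fixed x/y coordinate channels, and let plain convolutions (no transposed convolutions) do the decoding. The layer sizes below are illustrative:

```python
import torch
import torch.nn as nn

class SpatialBroadcastDecoder(nn.Module):
    def __init__(self, z_dim=16, size=28):
        super().__init__()
        self.size = size
        # z_dim channels from the broadcast code + 2 coordinate channels
        self.convs = nn.Sequential(
            nn.Conv2d(z_dim + 2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        b = z.size(0)
        # Broadcast z to every spatial position of the grid
        grid = z.view(b, -1, 1, 1).expand(-1, -1, self.size, self.size)
        # Fixed coordinate channels so the convolutions know "where" they are
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, self.size),
                                torch.linspace(-1, 1, self.size), indexing="ij")
        coords = torch.stack([xs, ys]).to(z.device).expand(b, -1, -1, -1)
        return self.convs(torch.cat([grid, coords], dim=1))
```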
Finally, we can see below the 2D projection, using t-SNE, of our encoded images z, colored by their classes:
We can see here that even though we did not have access to the labels during unsupervised training, the features our autoencoder learned make our data easily separable, as the underlying features generating our images are quite distinct between classes.
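Such a projection can be reproduced with scikit-learn. The sketch below assumes images and labels tensors, plus the encoder and bottleneck modules sketched earlier:

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

with torch.no_grad():
    z = bottleneck(encoder(images))   # encode a batch of images into codes z

z_2d = TSNE(n_components=2).fit_transform(z.numpy())
plt.scatter(z_2d[:, 0], z_2d[:, 1], c=labels, cmap="tab10", s=8)
plt.colorbar()
plt.show()
```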
Semi-supervised training
Now that we have a cool trained autoencoder, we can try training a classifier on top of it and see how it performs. We will study how well our model performs in three different settings: when we provide only 128 labels, 256 labels, or all 512 labels of the samples it was trained on during the unsupervised part.
This way, we mimic the real setting where only a portion of your images are labelled, but you still want to exploit the unlabelled data for training.
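A sketch of the pretrained setup: freeze the unsupervised feature extractor and train only a small two-layer head on the labelled subset. The head sizes are illustrative, and encoder and bottleneck refer to the modules sketched earlier:

```python
import torch
import torch.nn as nn

# Freeze the feature extractor learned during unsupervised training
feature_extractor = nn.Sequential(encoder, bottleneck)
for p in feature_extractor.parameters():
    p.requires_grad = False

# A small 2-layer classification head on top of the 16-dim codes
head = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
classifier = nn.Sequential(feature_extractor, head)

# Only the head's parameters are given to the optimizer
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
```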
One popular technique when working with little data is transfer learning. Here I compared two models: the pretrained encoder from the unsupervised training part, frozen, with a 2-layer classification head added on top of it, and a ResNet18, both fine-tuned on small labelled sets of growing size, from 128 labels to 512 labels. We finally compare the accuracy on the rest of MNIST, i.e. the 59,000 images that we did not use for unsupervised pretraining. The results are below:
Here we can notice that pretraining the model gives a boost compared to a ResNet in the very low data regime. Nonetheless, as we increase the number of available labels, the ResNet starts to surpass the pretrained model.
Just for fun, we can also see how these methods perform on Fashion MNIST, a harder dataset than MNIST, where we also used only 512 images. We first get the following reconstruction:
And finally, when retraining a classifier we get the following results :
Conclusion
So in the end, we have quickly seen how unsupervised learning works and how it can be used to pretrain a model when there is little data and few labels. This method is interesting, as it can be used to explore a dataset without needing labels first, and it also gives a boost when very few labels are present.
In the very low label regime it can be interesting, even compared to standard models such as ResNet, though this advantage is lost when more data is present.
For images or text, I do not believe that unsupervised training done the way it was done in this article is competitive with traditional transfer learning, but I believe it could be fruitful in other domains, such as signal, tabular, or recommender systems, where no big pretrained model is available and it can be useful to apply unsupervised learning to your own dataset.
I hope you liked this article. I will come up with other articles soon, presenting the theory behind VAEs and how you can implement them in fastai!
If you have any questions, do not hesitate to contact me on LinkedIn; you can also find me on Twitter!