AlexNet Revisited: A complete review


Evolution of Deep Learning - 1


Hello People!! 

I recently read the famous research paper “ImageNet Classification with Deep Convolutional Neural Networks”, commonly known as “AlexNet”. There is hardly a researcher in the deep learning and computer vision community who doesn’t know about AlexNet. The paper is very easy to understand and it was a pleasure to read. G. E. Hinton, one of the authors of this paper (along with Alex Krizhevsky and Ilya Sutskever), is considered the godfather of deep learning and won the Turing Award with Yann LeCun and Yoshua Bengio in 2018 for their work on deep learning. I want to present a short review of the paper in my first blog. Since many of you might already know about it, I named the blog AlexNet Revisited.


Introduction

AlexNet is a deep convolutional neural network trained on the ImageNet dataset. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012) and is considered a breakthrough in the field of deep learning. The authors trained a network with 5 convolutional layers followed by 3 fully connected layers and a final 1000-way softmax, consisting of about 650K neurons and 60M parameters. Training such a big network is computationally very costly, so they used multiple GPUs. I will explain the breakthroughs brought by the advent of GPUs for training AI models in my next blog.

Image recognition was a challenging task for many years. Initially there were neither big enough labeled datasets nor architectures complex enough to handle them. Datasets like LabelMe and ImageNet solved the shortage of large labeled data. To train on a very large dataset like ImageNet, a proper architecture is required. Around the time the dataset was released, convolutional neural networks had shown their advantages over feed-forward neural networks and proved themselves the best solution for image recognition, because they make strong and largely correct assumptions about the nature of images (locality of pixel dependencies and stationarity of statistics).

The authors of the paper used a highly optimized GPU implementation of the 2D convolution operation to speed up training. More details about the dataset, the architecture, the measures taken to reduce overfitting, the training process, and the results follow in the upcoming sections. For those who want a quick understanding, the figure below gives a complete overview of the whole process. (I believe one image makes things clearer than a thousand words.)


Data Preparation and GPU Training

As I mentioned earlier, the dataset used is ImageNet. ImageNet is a very large dataset prepared to satisfy data-hungry networks: it contains around 15M labeled images spread over more than twenty thousand categories. AlexNet is trained on a subset of ImageNet, the ILSVRC dataset, which contains about 1.2M training images for 1000 classes and 50K validation images. The authors did not perform much pre-processing on the image data. Since the images come in various resolutions, they downsampled them to a fixed resolution of 256 x 256. They then subtracted the mean activity over the whole training set from each pixel and trained on these centered raw RGB values.
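
Here is a minimal Python sketch of that pre-processing, assuming a precomputed per-pixel mean image (this is my own illustration, not the authors’ original pipeline):

```python
import numpy as np
from PIL import Image

# A minimal sketch of the pre-processing described above: rescale so the
# shorter side is 256 pixels, crop the central 256 x 256 patch, and
# subtract the per-pixel mean computed over the training set.
def load_and_crop(path, size=256):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left, top = (img.width - size) // 2, (img.height - size) // 2
    patch = img.crop((left, top, left + size, top + size))
    return np.asarray(patch, dtype=np.float32)        # shape (256, 256, 3)

# mean_image: a (256, 256, 3) array averaged over all training images (assumed precomputed)
# x = load_and_crop("some_training_image.jpg") - mean_image
```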

Training complex neural network architectures on huge amounts of data is not an easy task. It requires a lot of computational power, which is one of the reasons people did not train such complex models earlier. Hence they used two Nvidia GTX 580 GPUs, each with 3GB of memory. As I mentioned earlier, I will explain GPUs in my upcoming blog.

For training the model on the huge dataset of 1.2M images, they used some tricks on the GPUs. They used cross-GPU parallelization: half of the kernels (neurons) live on each GPU, and the GPUs can read from and write to each other’s memory directly, without going through host memory. Communication between the GPUs happens only in a few layers. The architecture figure below shows clearly how training happens on both GPUs at the same time.
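
To make the idea concrete, below is a rough PyTorch sketch (my own, the original used custom CUDA code) of splitting one convolutional layer’s kernels across two GPUs; it assumes two CUDA devices are available:

```python
import torch
import torch.nn as nn

# Rough sketch of cross-GPU model parallelism in the spirit of AlexNet:
# half of the first layer's 96 kernels live on each GPU.
dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
conv_half_a = nn.Conv2d(3, 48, kernel_size=11, stride=4).to(dev0)
conv_half_b = nn.Conv2d(3, 48, kernel_size=11, stride=4).to(dev1)

batch = torch.randn(128, 3, 224, 224)          # one mini-batch of images
out_a = conv_half_a(batch.to(dev0))            # 48 feature maps computed on GPU 0
out_b = conv_half_b(batch.to(dev1))            # 48 feature maps computed on GPU 1

# Only the layers that "communicate" see both halves; here the two halves are
# gathered onto one device and concatenated along the channel dimension.
merged = torch.cat([out_a, out_b.to(dev0)], dim=1)   # 96 feature maps in total
```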


Details of Architecture

ReLU non-linearity is used in the architecture. ReLUs are non-saturating non-linearities and are used mainly because they speed up training much more than saturating non-linearities like tanh. In the authors’ experiments, a network with ReLUs reached a 25% training error rate about six times faster than an equivalent network with tanh units.
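
A quick way to see the difference between a saturating and a non-saturating non-linearity (a toy illustration of mine, not from the paper):

```python
import torch

x = torch.linspace(-5.0, 5.0, steps=5)   # tensor([-5.0, -2.5, 0.0, 2.5, 5.0])
print(torch.relu(x))   # tensor([0.0, 0.0, 0.0, 2.5, 5.0]) - keeps growing for positive inputs
print(torch.tanh(x))   # values flatten out near -1 and +1, so gradients vanish at the extremes
```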

Local Response Normalization is employed in AlexNet. This type of normalization is inspired by lateral inhibition in real neurons. The output of a ReLU unit is normalized by the activities of adjacent kernel maps at the same spatial position. It can be called brightness normalization, since the mean is not subtracted. It reduced the top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
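
For the curious, here is a small NumPy sketch of the normalization formula with the paper’s constants (k = 2, n = 5, alpha = 1e-4, beta = 0.75); treat it as an illustration rather than a drop-in implementation:

```python
import numpy as np

# b_i = a_i / (k + alpha * sum over the n adjacent kernel maps of a_j^2) ** beta,
# computed independently at every spatial position (x, y).
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    N = a.shape[0]                       # a has shape (num_kernels, height, width)
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

activations = np.random.rand(96, 55, 55)  # e.g. ReLU output of the first conv layer
normalized = local_response_norm(activations)
```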

Overlapping pooling is used to summarize the outputs of the convolutional layers. There are two kinds of pooling: overlapping and non-overlapping. When the stride of the pooling layer is smaller than the kernel size, it is called overlapping pooling. The authors used a stride of 2 with a kernel size of 3, which decreased the error rate by 0.3–0.4% compared with non-overlapping pooling.
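
In PyTorch terms, the difference is just the relationship between kernel_size and stride (again a sketch of mine, not the authors’ code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                        # e.g. response-normalized feature maps

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # z = 3, s = 2: pooling windows overlap
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)

print(overlapping(x).shape)       # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)   # torch.Size([1, 96, 27, 27]) - same output size, no overlap
```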

The complete architecture of AlexNet is shown in the figure below. It consists of the convolutional blocks followed by the fully connected layers and a 1000-class softmax at the end. More details about the architecture and other things are present in the original paper.


Architecture of AlexNet
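
For readers who prefer code to diagrams, here is a rough single-GPU PyTorch sketch of the layer stack. It is my own approximation: the original splits most layers across two GPUs, and I add a little padding to the first convolution so that 224 x 224 inputs produce the 55 x 55 feature maps described in the paper.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),   # a 1000-way softmax is applied on top of this
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```

As a quick sanity check, `AlexNet()(torch.randn(1, 3, 224, 224)).shape` comes out as `torch.Size([1, 1000])`, and the total parameter count lands at roughly 60M, in line with the paper.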

Model Training and Performance Improvement 

The model is trained with a batch size of 128 images using stochastic gradient descent with momentum 0.9 and weight decay 0.0005. Weights are initialized from a zero-mean Gaussian distribution with standard deviation 0.01, and the biases of some layers are initialized to 1 while the rest are set to 0. The learning rate is initially 0.01 and is divided by 10 whenever the validation error stops improving.
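
As a sketch of what that recipe looks like in a modern framework (the authors wrote their own GPU code, so this is only an approximation), assuming the AlexNet class from the sketch above:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = AlexNet()                          # the sketch defined earlier
criterion = nn.CrossEntropyLoss()          # cross-entropy over the 1000-way softmax
optimizer = optim.SGD(model.parameters(), lr=0.01,       # initial learning rate
                      momentum=0.9, weight_decay=5e-4)   # the paper's values
# Divide the learning rate by 10 when the validation error stops improving.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# After each epoch, call scheduler.step(validation_error) with the current validation error.
```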

Performance is measured on the validation set and the hyperparameters are tuned accordingly. For an architecture trained on a large dataset and having this many parameters, overfitting is a very common problem. Hence the following techniques are employed to overcome it.

Data augmentation is used to artificially enlarge the training data, which in turn reduces overfitting. During training, random patches of size 224 x 224 are extracted from each 256 x 256 image, along with their horizontal reflections. At test time, five patches (the four corners and the center) plus their horizontal reflections are extracted, making 10 patches per image, and the network’s predictions over them are averaged. In addition, the intensities of the RGB channels are varied with the help of Principal Component Analysis (PCA) over the training set’s RGB pixel values.
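
A minimal torchvision sketch of the training-time transforms (the PCA-based colour jitter is left out for brevity, and this is my rendering of the recipe, not the authors’ code):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(224),              # random 224 x 224 patch from the 256 x 256 image
    transforms.RandomHorizontalFlip(p=0.5),  # and its horizontal reflection, half the time
    transforms.ToTensor(),
])
# At test time the paper instead averages predictions over the four corner patches,
# the centre patch, and their horizontal reflections (10 patches per image).
```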

Dropout is another jewel that came from Professor Hinton’s group. It approximates the effect of combining many models efficiently. With dropout, the output of each hidden neuron is set to zero with probability 0.5 during training, so a different architecture is sampled every time while all of these architectures share weights. AlexNet uses dropout in the first two fully connected layers. The technique helps the network learn more robust features, which increases performance and reduces overfitting.
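
A tiny PyTorch illustration of the mechanism (PyTorch uses “inverted” dropout, scaling the kept activations by 1/0.5 during training, whereas the paper multiplies the outputs by 0.5 at test time; the two are equivalent):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # e.g. tensor([2., 0., 2., 0., 0., 2., 2., 0.]) - each unit kept with prob 0.5
drop.eval()
print(drop(x))   # tensor([1., 1., 1., 1., 1., 1., 1., 1.]) - all units used at test time
```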

Results and Conclusion

The results of AlexNet need no discussion. It achieved top-1 and top-5 error rates of 37.5% and 17.0% on the ILSVRC-2010 test set, and a top-5 error rate of 15.3% in ILSVRC-2012, far ahead of the second-best entry’s 26.2%. These results were dramatically better than those of previous ILSVRC winners and began a new era in the field of deep learning. Detailed results over the various test sets are presented in the paper. In the following years many deep architectures were developed, and most of the state-of-the-art architectures benchmarked on ImageNet today build on this work. We will revisit them in further blogs. Deep learning has become very advanced, with many discoveries made over the course of these few years, and AlexNet played a huge role in this progress.


Thank you for reading! Subscribe for updates


References

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25 (2012).

(All images are taken from the original paper)




