How to use Inception Model for Image recognition

Have you ever wondered how search engines and similar tools process images? If you want to try it yourself, you’re in the right place: in this blog we discuss simple object recognition using the Inception model.

Topics covered in this Blog are:

  • What is Inception.

  • Why we use Inception.

  • Types of Inception.

  • Layer architecture of Inception v3.

  • How to use Inception v3 to recognize objects in an image, with a Python implementation.

  • Conclusion.

What is Inception?

  • The Inception model is a convolutional neural network that classifies the objects in an image.

  • The first version, Inception v1, is also known as GoogLeNet.

  • It is trained on the ImageNet dataset.

  • Inception v3 expects input images of size 299x299x3 (height x width x colour channels).

  • An Inception layer is a combination of 1×1, 3×3 and 5×5 convolutional layers whose output filter banks are concatenated into a single output vector that forms the input of the next stage.

  • It was first introduced in the paper Going deeper with convolutions, published in 2015.
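The concatenation step described above can be sketched in plain Python. This is only an illustration: the three "branches" are constant stand-in feature maps rather than real convolution outputs, and the helper names (concat_channels, constant_map) are made up for this example. It assumes every branch preserves the spatial size of its input, so only the channel counts differ.

```python
def concat_channels(*feature_maps):
    """Concatenate feature maps stored as [height][width][channels] lists
    along the channel axis. All maps must share the same height and width."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    for fm in feature_maps:
        if len(fm) != height or any(len(row) != width for row in fm):
            raise ValueError("all branches must share the same spatial size")
    return [
        [sum((fm[y][x] for fm in feature_maps), []) for x in range(width)]
        for y in range(height)
    ]


def constant_map(height, width, channels, value):
    """Hypothetical stand-in for the output of one convolution branch."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


# Stand-ins for the 1x1, 3x3 and 5x5 branch outputs (2, 3 and 1 filters).
branch_1x1 = constant_map(4, 4, 2, 0.1)
branch_3x3 = constant_map(4, 4, 3, 0.3)
branch_5x5 = constant_map(4, 4, 1, 0.5)

out = concat_channels(branch_1x1, branch_3x3, branch_5x5)
print(len(out), len(out[0]), len(out[0][0]))  # 4 4 6
```

The spatial size stays 4x4 while the channel count becomes 2 + 3 + 1 = 6, which is what "concatenating the filter banks" means in practice.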

Why we use Inception?

When the same object appears at different sizes in different images, it is difficult to extract the correct information about it with a single filter size, so we need to pick a suitable filter size. Let us understand this with an example:

Fig 1.
Fig 2.

These are two images of the same object at different sizes. In image 1, the car occupies a larger region, so it calls for larger filters; in image 2, the car occupies a smaller region, which calls for smaller filters. The Inception model therefore lets the internal layers choose which filter size is relevant for learning the required information, so even if the size of the object differs between images, the layers adapt accordingly to recognize the object.

  • Other models suffer from high computational complexity and overfitting, so we use versions of the Inception model to reduce these problems.

  • Inception v3 has attained greater than 78.1% accuracy in about 170 epochs.

  • While training, the model requires several passes through the training dataset to improve its image recognition proficiency. In the case of Inception v3, depending on the global batch size, the number of epochs needed will be somewhere in the 140 to 200 range.

  • The pre-processing code contains a multi-option stage, with different levels of complexity, that has been used successfully to train Inception v3 to accuracies in the 78.1-78.5% range.

  • It uses many tricks to push performance, in terms of both speed and accuracy. The model itself is made up of symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, filter concatenations, dropouts, and fully connected layers. Batch normalization is used extensively throughout the model and applied to activation inputs. Loss is computed via the softmax function.
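To get a feel for what 140 to 200 epochs means in optimizer steps, here is a rough back-of-the-envelope sketch. The global batch size of 1024 is an assumption chosen only for illustration, and the dataset size is the ImageNet-1k (ILSVRC 2012) training split of 1,281,167 images; the function name is made up for this example.

```python
import math

# Size of the ImageNet-1k (ILSVRC 2012) training split.
IMAGENET_TRAIN_IMAGES = 1_281_167


def training_steps(epochs, global_batch_size, dataset_size=IMAGENET_TRAIN_IMAGES):
    """Number of optimizer steps needed to run `epochs` passes over the data."""
    steps_per_epoch = math.ceil(dataset_size / global_batch_size)
    return epochs * steps_per_epoch


# Assumed global batch size of 1024 (illustrative, not taken from this blog).
for epochs in (140, 170, 200):
    print(epochs, "epochs ->", training_steps(epochs, global_batch_size=1024), "steps")
```

Doubling the global batch size roughly halves the number of steps per epoch, which is why the epoch count and the batch size are usually quoted together.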

Types of Inception:

The Inception versions covered in this blog are:

  • Inception v1

  • Inception v2

  • Inception v3

1) Inception v1 (Naïve version)

The naïve version performs convolution on an input with three different filter sizes: 1x1, 3x3 and 5x5. In addition, max pooling is performed. The outputs of these branches are then concatenated and passed to the next Inception module. An image of the naïve Inception module is given below:

Deep neural networks are computationally expensive. To make the module cheaper, the number of input channels is limited by adding an extra 1x1 convolution before the 3x3 and 5x5 convolutions. Note that in the pooling branch the 1x1 convolution is placed after the max pooling layer, rather than before it. Although adding an extra operation may seem counterintuitive, 1x1 convolutions are far cheaper than 5x5 convolutions, and the reduced number of input channels also helps.
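The saving from the extra 1x1 convolution can be checked with a little arithmetic. The sketch below counts multiply-accumulate operations for a 5x5 branch with and without a 1x1 reduction in front of it. The sizes used (a 28x28 input with 192 channels, reduced to 16 channels, producing 32 output channels) are illustrative assumptions, and the function name is made up for this example.

```python
def conv_macs(height, width, in_channels, out_channels, kernel):
    """Multiply-accumulate count for a stride-1 convolution with a
    kernel x kernel filter that preserves the spatial size."""
    return height * width * in_channels * out_channels * kernel * kernel


# Illustrative sizes (assumptions, not taken from this blog).
H, W, C_IN = 28, 28, 192
REDUCED, C_OUT = 16, 32

# Naive branch: 5x5 convolution applied directly to all 192 input channels.
direct = conv_macs(H, W, C_IN, C_OUT, kernel=5)

# Dimension-reduced branch: 1x1 down to 16 channels, then the 5x5 convolution.
reduced = conv_macs(H, W, C_IN, REDUCED, kernel=1) + conv_macs(H, W, REDUCED, C_OUT, kernel=5)

print("direct 5x5:      ", direct)
print("1x1 then 5x5:    ", reduced)
print("cost ratio:      ", round(direct / reduced, 2))
```

With these assumed sizes the reduced branch needs roughly a tenth of the operations, even though it contains one more layer.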

Using the dimension-reduced Inception module, a neural network architecture was built, popularly known as GoogLeNet (Inception v1). The architecture is shown below:

Image source: Going deeper with convolutions

GoogLeNet has 9 such Inception modules stacked linearly. It is 22 layers deep, or 27 including the pooling layers. It uses global average pooling after the last Inception module.

Because the model is very deep, it is prone to the vanishing gradient problem. To prevent the gradient in the middle part of the network from vanishing, the authors introduced two auxiliary classifiers. They essentially applied softmax to the outputs of two of the Inception modules and computed an auxiliary loss over the same labels.
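A minimal sketch of how such auxiliary losses can be combined with the main loss is shown below. The weight of 0.3 on the auxiliary losses is the value reported in the GoogLeNet paper, not something stated in this blog, and the function names and example logits are made up for illustration. The auxiliary classifiers are used only during training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Softmax cross-entropy loss for a single example."""
    return -math.log(softmax(logits)[label])


def total_loss(main_logits, aux_logits_list, label, aux_weight=0.3):
    """Main loss plus down-weighted auxiliary losses over the same label.
    aux_weight=0.3 follows the GoogLeNet paper (an assumption here)."""
    main = cross_entropy(main_logits, label)
    aux = sum(cross_entropy(aux_logits, label) for aux_logits in aux_logits_list)
    return main + aux_weight * aux


# Hypothetical logits for a 3-class problem; the true class is index 0.
main_logits = [2.0, 0.5, -1.0]
aux_logits = [[1.0, 0.8, 0.1], [0.3, 0.2, 0.1]]
print(round(total_loss(main_logits, aux_logits, label=0), 4))
```

Because the auxiliary classifiers sit in the middle of the network, their losses inject gradient directly into the earlier layers.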

2) Inception v2

Inception v2 proposes a number of changes to the filters that increase accuracy and reduce computational complexity. This version uses factorization to lower the computational cost, which lets us make the module deeper instead of wider.

  • In the Inception v2 architecture, the 5×5 convolution is replaced by two 3×3 convolutions. This reduces computation time, because a 5×5 convolution is 2.78 times more expensive than a 3×3 convolution (25 weights versus 9). So, using two stacked 3×3 layers (18 weights) instead of one 5×5 layer boosts the performance of the architecture.

  • This architecture also factorizes n×n convolutions into a 1×n convolution followed by an n×1 convolution. For example, a 3×3 convolution can be replaced by a 1×3 convolution followed by a 3×1 convolution, which is 33% cheaper than a single 3×3 convolution (6 weights versus 9).

  • To deal with the problem of the representational bottleneck, the filter banks of the module were expanded (made wider) instead of making the module deeper. This prevents the loss of information that occurs when the module is made deeper.
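The cost figures in the bullets above can be reproduced with a short calculation. The sketch counts weights per kernel only, for one input/output channel pair, and assumes the channel counts stay the same in every layer; the function names are made up for this example.

```python
def kernel_weights(height, width):
    """Weights in one height x width kernel (one input/output channel pair)."""
    return height * width


def stacked_receptive_field(kernel_sizes):
    """Receptive field (along one axis) of stride-1 convolutions stacked in sequence."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


five_by_five = kernel_weights(5, 5)        # 25
three_by_three = kernel_weights(3, 3)      # 9
two_stacked_3x3 = 2 * three_by_three       # 18
asymmetric_3x3 = kernel_weights(1, 3) + kernel_weights(3, 1)  # 6

ratio_5x5_to_3x3 = five_by_five / three_by_three
saving_stacked = 1 - two_stacked_3x3 / five_by_five
saving_asymmetric = 1 - asymmetric_3x3 / three_by_three

print("5x5 vs 3x3 cost ratio:        ", round(ratio_5x5_to_3x3, 2))
print("saving, two 3x3 vs one 5x5:   ", f"{saving_stacked:.0%}")
print("saving, 1x3 + 3x1 vs one 3x3: ", f"{saving_asymmetric:.0%}")
print("receptive field of two 3x3:   ", stacked_receptive_field([3, 3]))
```

The last line shows why the replacement is possible at all: two stacked 3×3 convolutions see the same 5×5 region of the input as a single 5×5 convolution.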

3) Inception v3

The authors noted that the auxiliary classifiers didn’t contribute much until near the end of the training process, when accuracies were nearing saturation. They argued that the auxiliary classifiers function as regularizers, especially if they have Batch Norm or Dropout operations.

So, version 3 adds the following modifications: