Single Shot Object Detection (SSD) using MobileNet and OpenCV


In this post we will take a practical look at Single Shot Object Detection (SSD) using deep learning, MobileNet, and OpenCV. Object detection is one of the hottest topics in computer vision. It is breaking into a wide range of industries, with use cases ranging from personal safety to productivity in the workplace, and it is applied in many areas, including image retrieval, security, surveillance, automated license plate recognition, optical character recognition, traffic control, medicine, agriculture, and many more.

With so many practical applications, you must be excited, right? Let's jump into the topic and expand our knowledge.

What is Object Detection?

In object detection, we categorize an image and identify where an object resides within it. For this, we have to obtain the bounding box, i.e. the (x, y)-coordinates that frame the object. Box prediction is treated as a regression problem, which makes it straightforward to formulate. Single shot detection is one of the methods of object detection.

Various Object Detection Methods

Whenever we talk about object detection, three primary detection methods usually come up:

  1. Faster R-CNN
  2. Single Shot Detector (SSD)
  3. You Only Look Once (YOLO)

Faster R-CNN performs detection on various proposed regions, and so ends up making predictions multiple times for different regions in an image. Its speed is around 5 to 7 frames per second.

There is also another type of detector called YOLO, which is quite popular for real-time object detection in computer vision. Unlike Faster R-CNN, YOLO sees the whole image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. In simple words, we pass the image once through the YOLO network and it outputs its predictions.

Properties of YOLO

  • Its processing speed is 45 frames per second, which is better than real time.
  • It makes fewer background errors compared to R-CNN.

YOLO uses a k-means clustering strategy on the training dataset to determine its default boundary boxes.

The main problem with YOLO is that its accuracy leaves much to be desired.

Single Shot Detector for Object Detection


Let us understand what single shot object detection is.

Single Shot object detection, or SSD, takes one single shot to detect multiple objects within an image. A single pass can detect, for example, a coffee cup, an iPhone, a notebook, a laptop, and glasses at the same time.

It consists of two parts:

  • a backbone network that extracts feature maps, and
  • convolution filters applied to those maps to detect objects.

SSD was developed by Google researchers to maintain the balance between the two object detection approaches above, YOLO and Faster R-CNN.

Two SSD models are commonly available:

  1. SSD300: in this model the input size is fixed at 300×300. It is used on lower-resolution images, has a faster processing speed, and is less accurate than SSD512.
  2. SSD512: in this model the input size is fixed at 512×512. It is used on higher-resolution images and is more accurate than the other model.

SSD is faster than Faster R-CNN because R-CNN needs two shots, one for generating region proposals and one for detecting objects from them, whereas SSD does both in a single shot.

The MobileNet SSD method was first trained on the COCO dataset and was then fine-tuned on PASCAL VOC, reaching 72.7% mAP (mean average precision).

For example, on the PASCAL VOC 2007 dataset, SSD300 achieves 79.6% mAP and SSD512 achieves 81.6% mAP, both higher than Faster R-CNN's 78.8% mAP.

We are using MobileNet-SSD (a Caffe implementation of the MobileNet-SSD detection network, with weights pretrained on VOC0712 and mAP = 0.727).

VOC0712 is an image dataset for object class recognition, and mAP (mean average precision) is the most common metric used in object recognition. If we merge the MobileNet architecture and the Single Shot Detector (SSD) framework, we arrive at a fast, efficient deep learning-based method for object detection.

Why use MobileNet in SSD?

The question arises: why do we use MobileNet, and why can't we use ResNet, VGG, or AlexNet?

The answer is simple. ResNet, VGG, and AlexNet all have a large network size, which increases the number of computations, whereas MobileNet has a simple architecture consisting of 3×3 depthwise convolutions followed by 1×1 pointwise convolutions.
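To make the savings concrete, here is a quick back-of-the-envelope sketch comparing parameter counts; the channel sizes (32 in, 64 out) are illustrative assumptions, not MobileNet's actual layer widths.

```python
# Parameter counts for a standard 3x3 convolution versus the
# depthwise-separable version MobileNet uses, with example channel sizes.
c_in, c_out, k = 32, 64, 3

standard = k * k * c_in * c_out   # one dense 3x3 kernel per output channel
depthwise = k * k * c_in          # one 3x3 kernel per input channel
pointwise = 1 * 1 * c_in * c_out  # 1x1 convolution that mixes channels
separable = depthwise + pointwise

print(standard)                         # 18432
print(separable)                        # 2336
print(round(standard / separable, 1))   # 7.9 (roughly 8x fewer parameters)
```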

Comparison between all Primary Object Detection Methods


Comparing the primary methods, R-FCN achieves the highest accuracy. These accuracy figures come from running the models on the PASCAL VOC and COCO datasets.

Let us take a look at the practical code implementation so we can get an overview of how to implement this single shot object detection algorithm.

  • The first step is to load a pre-trained object detection network with OpenCV's dnn (deep neural network) module.
  • This will allow us to pass input images through the network and obtain the output bounding box (x, y)-coordinates of each object in the image.
  • Now we write the code to print the name of the detected object and their confidence scores.
  • At last, we look at the output of MobileNet Single Shot Detector for our input images.

Practical example of using SSD MobileNet for Object Detection

  1. Save the code below in a Python file; it contains the object detection logic.
  2. Use the Caffe deploy prototxt file MobileNetSSD_deploy.prototxt.txt from the following link.
  3. Get the Caffe pretrained model MobileNetSSD_deploy.caffemodel from the following link.
  4. Use the images car.jpg and aero.jpg for this example.
  5. Store the files from steps 1 to 4 as shown below.

Single Shot Detection MobileNet example

  6. We also have to install the OpenCV and NumPy libraries to run our program.

NumPy is used for numerical computation and provides tools for working with arrays, while OpenCV is used to load an image, display it, and save it back.

Now let us understand the code step by step.

Step 1- Load all the required libraries

cv2 is used to load the input image and to display the output. argparse makes it easy to write user-friendly command-line interfaces; in this code we use it to parse the command-line arguments.

Step 2- The next step is to parse our command-line arguments as follows.

Now we parse the above arguments
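A sketch of the argument parsing; the flag names are reasonable assumptions, and the explicit argument list is only there so the snippet runs on its own (the real script would call ap.parse_args() with no arguments to read sys.argv):

```python
import argparse

# Define the command-line interface: the input image, the Caffe deploy
# prototxt, the pretrained model weights, and a confidence threshold.
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True, help="path to input image")
ap.add_argument("-p", "--prototxt", required=True, help="path to Caffe deploy prototxt file")
ap.add_argument("-m", "--model", required=True, help="path to Caffe pretrained model")
ap.add_argument("-c", "--confidence", type=float, default=0.2,
                help="minimum probability to filter weak detections")

# Demonstration values; in the real script, call ap.parse_args() instead.
args = vars(ap.parse_args([
    "--image", "car.jpg",
    "--prototxt", "MobileNetSSD_deploy.prototxt.txt",
    "--model", "MobileNetSSD_deploy.caffemodel",
]))
print(args["confidence"])  # 0.2 (the default threshold)
```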

Step 3- The next step is to define the class labels and color of the bounding box.

It will detect all the objects mentioned in the class list and assign a blue color to their bounding boxes.
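A sketch of this step. The 21 labels below are the standard PASCAL VOC classes used by MobileNet-SSD Caffe models; the variable names are assumptions.

```python
# The 20 PASCAL VOC object classes (plus "background") that the
# MobileNet-SSD model can detect, and a blue box color.
# OpenCV uses BGR channel order, so blue is (255, 0, 0).
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
           "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
           "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
           "sofa", "train", "tvmonitor"]
BOX_COLOR = (255, 0, 0)  # blue in BGR
```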

Step 4- After that, we load the model using the command-line arguments.

Step 5- Now we load the input image and construct an input blob (a preprocessed batch of images stored as a 4-D array) by resizing the image to a fixed 300×300 pixels, and after that we normalize it (note: the normalization constants come from the authors of the MobileNet SSD implementation).

Step 6- After that we pass the blob through our neural network

The lines of code above set the input blob on the network and then compute the forward pass for object detection and prediction.

Step 7- This step is used to determine what and where the objects are in the image.

Now we loop over all the detections and extract the confidence score for each one. We then filter out all the weak detections whose probability is below 20%, and print each detected object with its confidence score (which tells us how confident the model is that the box contains an object, and how accurate it thinks the box is).
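A runnable sketch of the filtering logic. The helper name is an assumption, and the synthetic detections array mimics the (1, 1, N, 7) layout of the SSD output, where each row is [image_id, class_id, confidence, x1, y1, x2, y2] with coordinates normalized to [0, 1]:

```python
import numpy as np

# Keep only detections above the confidence threshold and scale their
# boxes back from normalized coordinates to pixel coordinates.
def filter_detections(detections, w, h, conf_threshold=0.2):
    results = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > conf_threshold:
            class_id = int(detections[0, 0, i, 1])
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            results.append((class_id, float(confidence), box.astype("int")))
    return results

# Synthetic example: one strong "car" detection (class 7) and one weak one.
fake = np.zeros((1, 1, 2, 7), dtype="float32")
fake[0, 0, 0] = [0, 7, 0.95, 0.1, 0.2, 0.5, 0.6]
fake[0, 0, 1] = [0, 7, 0.05, 0.0, 0.0, 0.1, 0.1]
kept = filter_detections(fake, w=600, h=400)
print(len(kept))  # 1: the 5% detection is filtered out
```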

In mathematics it is calculated as

confidence score = probability of an object × IoU

IoU stands for Intersection over Union. It is the ratio of the area of overlap between the ground-truth box and the predicted box to the area of their union.

IoU = area of overlap / area of union
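This formula can be written as a small, self-contained function over boxes given as (x1, y1, x2, y2) corner coordinates:

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 (no overlap)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # 0.333... (half overlap)
```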

Step 8- At last, we use the imshow function of cv2 to display our output image on the screen until a key is pressed.

Object Detection using SSD Mobilenet : Results

In this example we detect multiple cars using deep learning-based object detection. Use the command below; you can use the car.jpg that I have uploaded here.

[Output image: a car parked in a parking lot]

Our first result shows that we have detected both cars with around 100% confidence scores.

In the next example we detect aeroplane using deep learning-based object detection:

[Output image: an aeroplane against the sky]

Our second result shows that we have detected the aeroplane with around a 98.42% confidence score.


In today's blog post we have learned about single shot object detection using OpenCV and deep learning. There are many flavors of object detection, such as YOLO and region-based convolutional neural network (R-CNN) detection. SSD speeds up the process by eliminating the region proposal network; this causes a drop in accuracy, so we combine MobileNet and SSD to get better accuracy. We have also learned that YOLO object detection has a faster processing speed than the other detection methods.
