
Want to Train Your First Object Detection Model?

Updated: Aug 5, 2021

Computer vision is the field concerned with extracting information from images and video. Training a computer vision model is an essential step in solving problems such as image classification, object detection, and segmentation. In this post, we will learn how to train a Faster RCNN model on a custom dataset.

The first step is setting up the environment

To run the object detection model on your system, you must first set up the environment. Install tensorflow-gpu inside an Anaconda virtual environment with Python 3.6. We recommend using a machine with an NVIDIA GPU.

Create a TensorFlow environment and install Protobuf, lxml, Jupyter Notebook, matplotlib, Pillow, and TensorFlow. You can install these packages with the pip command.
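Before moving on, it can help to confirm that the packages are importable from the new environment. The sketch below is one possible way to check this and is not part of the original setup. The helper names `is_importable` and `missing_packages` are hypothetical, and the list assumes the usual import names for the packages mentioned above.

```python
import importlib.util

# Import names for the packages listed above. The pip name and the import
# name differ for some of them (for example, pillow is imported as PIL).
REQUIRED_MODULES = ["tensorflow", "google.protobuf", "lxml", "matplotlib", "PIL", "notebook"]


def is_importable(module_name):
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        return False


def missing_packages(module_names):
    """Return the subset of module_names that cannot be located."""
    return [name for name in module_names if not is_importable(name)]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run the script inside the activated Anaconda environment; any module it reports as missing can then be installed with pip.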

Choosing the pre-trained model

Model selection is one of the more difficult parts, and it depends on the problem statement. Here, we are using Faster RCNN to detect and classify two categories of motorbikes, namely Harley Davidson and Hayabusa. For this purpose, we can work with either a Faster RCNN model or an SSD model.

Another factor is what you need from the model. If you want higher accuracy, a Faster RCNN model is a smart choice, although it may take more time to train. An SSD model, on the other hand, is less accurate but trains faster and is typically suitable for running on devices such as smartphones or a Raspberry Pi.

The Architecture of Faster RCNN Model:

The second step is to prepare the Faster RCNN model

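# Imports needed for downloading and unpacking the model
import os
import tarfile
import urllib.request

# Variables used
MODEL_NAME = 'faster_rcnn_inception_v2_coco_2018_01_28'
MODEL_FILE = MODEL_NAME + '.tar.gz'  # archive name (assumed; not in the original snippet)
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'  # assumed model zoo location
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'

# Download the model archive and extract only the frozen inference graph
urllib.request.urlretrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
with tarfile.open(MODEL_FILE) as tar_file:
    for file in tar_file.getmembers():
        file_name = os.path.basename(file.name)
        if 'frozen_inference_graph.pb' in file_name:
            tar_file.extract(file, os.getcwd())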

Creating the Custom Dataset

The most important step in training a model on custom images is preparing the TFRecords.

To create TFRecords, we first have to annotate each image and save the annotations as an XML file. You may use the LabelImg tool for this.

Once the annotations are saved as XML files, we convert them into a CSV file, from which we can generate the TFRecords.
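The XML-to-CSV step can be sketched as follows. This is a minimal illustration, not the exact script used for this post. It assumes the annotations are in the Pascal VOC layout that LabelImg writes; the column order, the function names, and the sample file name `bike_001.jpg` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column layout for the CSV file (an assumed, commonly used ordering).
CSV_COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def xml_to_rows(xml_text):
    """Turn one Pascal VOC annotation (as text) into CSV rows, one per bounding box."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ])
    return rows


def rows_to_csv(rows):
    """Serialize the rows, preceded by a header line, as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(CSV_COLUMNS)
    writer.writerows(rows)
    return out.getvalue()


# A made-up annotation used only to exercise the functions above.
SAMPLE_XML = """<annotation>
  <filename>bike_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>Harley Davidson</name>
    <bndbox><xmin>48</xmin><ymin>120</ymin><xmax>590</xmax><ymax>430</ymax></bndbox>
  </object>
</annotation>"""

print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

In practice you would read every XML file in the annotations folder, collect all the rows, and write one CSV file for the training images and one for the test images.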

The third step is configuring the training pipeline

The training pipeline is required to organize and automate hyperparameter tuning, pre-processing, model training, and post-processing tasks. It defines which model and which parameters will be used for training, and it points to the training images and data. For this purpose, we have used the faster_rcnn_inception_v2_pets.config file.

Protocol buffers (protobuf) are Google's extensible mechanism for serializing structured data. They are useful for exchanging data over a network or for storing it, and they can be used from Python. In TensorFlow, the tf.train.Example class represents the protocol buffer used to store data for the input pipeline.

At a high level, the config file is split into 5 parts:

  1. The model configuration. This defines what type of model will be trained (meta-architecture, feature extractor). In our case, it is the Faster RCNN model.

  2. The train_config, which decides what parameters should be used to train model parameters (SGD parameters, input preprocessing, and feature extractor initialization values).

  3. The eval_config, which determines what set of metrics will be reported for evaluation.

  4. The train_input_reader, which defines what dataset the model should be trained on.

  5. The eval_input_reader, which defines what dataset the model will be evaluated on. Typically, this should be different from the training input dataset.

A few parameters that we need to adjust are:

  1. num_classes: The number of classes present in the training dataset. In my case, there is only 1 class, i.e., Harley Davidson.

  2. num_steps: The number of training steps. Based on the size of the data, we can increase or decrease num_steps. In my case, I have set it to 1000.

  3. batch_size: The number of images fed into the model in one step. The batch size should be chosen with the hardware limitations (mainly GPU memory) in mind. In my case, I set batch_size to 10 to improve the performance of my model.

  4. fine_tune_checkpoint: The fine-tune checkpoint is used to implement transfer learning. Transfer learning is the process of taking a model pre-trained on a huge dataset and reusing its base weights for our own problem. This helps reduce overfitting when we have a small dataset. The file that contains the base weights of the faster_rcnn_inception_v2 model is assigned to the fine_tune_checkpoint parameter (/content/models/research/pretrained_model/model.ckpt).

  5. input_path (for train/test): We need to set input_path so that it points to the train.record or test.record file.

  6. label_map_path: It should contain the path to the label_map.pbtxt file that contains the ID and name of our classes.

  7. num_examples: The number of images in the evaluation (test) set. This parameter belongs to eval_config.
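The label_map.pbtxt file referenced by label_map_path is a short text file with one item per class. As a hedged sketch, the helper below (the name `build_label_map` is hypothetical) generates its contents from a list of class names. IDs start at 1 because the Object Detection API reserves 0 for the background class.

```python
def build_label_map(class_names):
    """Return the text of a label_map.pbtxt file for the given class names."""
    items = []
    # IDs start at 1; ID 0 is reserved for the background class.
    for class_id, name in enumerate(class_names, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(items)


# One class, as in the num_classes setting described above.
print(build_label_map(["Harley Davidson"]))
```

The class names here should match the names used when labeling the images, and the number of items should equal num_classes.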

Pipeline Structure:

# Faster R-CNN with Inception v2, configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_v2'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.5
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 10
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 900000
            learning_rate: .00002
          }
          schedule {
            step: 1200000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/content/models/research/pretrained_model/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  # Note: The below line limits the training process to 1000 steps. Because the
  # first learning rate change is scheduled at step 900000, this effectively
  # bypasses the learning rate schedule (the learning rate will never decay).
  # Remove the below line to train indefinitely.
  num_steps: 1000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/content/gdrive/MyDrive/Data/FaceDetection/images/annotations/train.record"
  }
  label_map_path: "/content/gdrive/MyDrive/Data/FaceDetection/images/annotations/label_map.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 3
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/content/gdrive/MyDrive/BikeData/BikeDetection/images/annotations/test.record"
  }
  label_map_path: "/content/gdrive/MyDrive/BikeData/BikeDetection/images/annotations/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
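
As a rough sanity check on the num_steps and batch_size values in this configuration, you can estimate how many passes over the training set they amount to. The sketch below is plain arithmetic and not part of the Object Detection API. The function name `approx_epochs` is hypothetical, and the figure of 50 training images is taken from the author's reply in the comments at the end of this post.

```python
def approx_epochs(num_steps, batch_size, num_train_images):
    """Approximate number of passes over the training set.

    Each step consumes batch_size images, so the model sees
    num_steps * batch_size images in total.
    """
    return num_steps * batch_size / num_train_images


# Values from the configuration above, with an assumed 50 training images.
print(approx_epochs(num_steps=1000, batch_size=10, num_train_images=50))  # prints 200.0
```

If the estimate looks very large for a small dataset, consider lowering num_steps or adding more images.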

Now we are set to train our model

Make sure your training data is unbiased and contains a variety of images. Here, we have included pictures of a single bike only. You can also use images containing multiple similar or dissimilar objects and label each of them separately.

Once the data is prepared and the pipeline has been configured, the next step is to train the model on our dataset. To begin training, run the following command from the command prompt inside the object_detection folder.

python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

This will begin training the model. The loss is calculated and displayed at each step, and it should fall as training progresses; a common target is a loss that stays below 0.05. If the loss stops falling and keeps increasing after a certain point, the model is no longer improving, which may indicate overfitting. In that case, use the checkpoint with the lowest loss from before the increase. The training output will look something like the figure below.
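
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the idea, not something the training script does for you. The function names, the `patience` parameter, and the loss values in the sample log are all hypothetical.

```python
def best_step(loss_log):
    """Return the (step, loss) pair with the lowest loss."""
    return min(loss_log, key=lambda entry: entry[1])


def loss_is_rising(loss_log, patience=3):
    """Return True if the loss increased at each of the last `patience` log entries."""
    if len(loss_log) <= patience:
        return False
    recent = [loss for _, loss in loss_log[-(patience + 1):]]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Made-up (step, loss) pairs of the kind you might copy from the training output.
sample_log = [(100, 0.90), (200, 0.40), (300, 0.12), (400, 0.04),
              (500, 0.06), (600, 0.09), (700, 0.15)]

print(best_step(sample_log))       # prints (400, 0.04)
print(loss_is_rising(sample_log))  # prints True
```

When the second check is true, the step reported by the first function is a reasonable candidate for the checkpoint to keep.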

Training can be lengthy and may take a few hours to complete, depending on the size of your dataset. You can follow its progress in TensorBoard by activating the TensorFlow environment in the Anaconda prompt and starting TensorBoard from the object_detection directory with its log directory set to the training folder.

Saving the .pb file / Inference Graph

Now that training is complete, the next step is to export the frozen inference graph. During training, TensorFlow saves a checkpoint to the training folder about every 5 minutes. The checkpoint with the highest step count is the one used to generate our classifier and export the inference graph (.pb file). In the command prompt, run the following command inside the object_detection folder.

python export_inference_graph.py --input_type image_tensor --pipeline_config_path training/faster_rcnn_inception_v2_pets.config --trained_checkpoint_prefix training/model.ckpt-XXXX --output_directory inference_graph

Here, replace XXXX with the highest step count among the checkpoints in the training folder. The generated .pb file contains the object detection classifier. In TensorFlow, the .pb format is used to hold frozen models.
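
Finding the highest step count by eye is easy to get wrong when the folder holds many checkpoints. The sketch below is a small hypothetical helper, not part of the Object Detection API. It assumes the checkpoint files follow the model.ckpt-XXXX.index naming used in the command above.

```python
import re

# Matches checkpoint index files such as "model.ckpt-1000.index".
CKPT_PATTERN = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(filenames):
    """Return the highest step count found in the file names, or None if there is none."""
    matches = (CKPT_PATTERN.match(name) for name in filenames)
    steps = [int(match.group(1)) for match in matches if match]
    return max(steps) if steps else None


# Made-up listing of a training folder.
sample_files = ["checkpoint", "model.ckpt-0.index", "model.ckpt-512.index",
                "model.ckpt-1000.index", "model.ckpt-1000.meta", "pipeline.config"]
print(latest_checkpoint_step(sample_files))  # prints 1000
```

To use it on a real folder, pass it the result of os.listdir("training") and substitute the returned number for XXXX.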

Now it is time to test our custom-trained model

Once the inference graph has been exported, we can test the model by using a Python script to run it on an image, a video, or a live webcam feed.

Thanks for your time. I hope you enjoyed the blog. Feel free to ask your questions in the comment box.
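
Whatever script you use, the raw detections usually need a little post-processing before they can be drawn on the image. The sketch below shows that step only and assumes the model has already been run. It relies on the Object Detection API convention that boxes are normalized and ordered as [ymin, xmin, ymax, xmax]. The function name, the 0.5 score threshold, and the sample values are hypothetical.

```python
def filter_detections(boxes, scores, classes, image_width, image_height, min_score=0.5):
    """Keep confident detections and convert their boxes to pixel coordinates.

    boxes holds normalized [ymin, xmin, ymax, xmax] values in the range 0 to 1.
    The returned pixel boxes are ordered as (xmin, ymin, xmax, ymax).
    """
    results = []
    for box, score, class_id in zip(boxes, scores, classes):
        if score < min_score:
            continue  # skip low-confidence detections
        ymin, xmin, ymax, xmax = box
        results.append({
            "class_id": int(class_id),
            "score": float(score),
            "box_pixels": (
                int(round(xmin * image_width)), int(round(ymin * image_height)),
                int(round(xmax * image_width)), int(round(ymax * image_height)),
            ),
        })
    return results


# Made-up output for a 640x480 image: one confident detection and one weak one.
sample_boxes = [[0.25, 0.1, 0.75, 0.9], [0.0, 0.0, 0.2, 0.2]]
sample_scores = [0.92, 0.30]
sample_classes = [1, 1]
print(filter_detections(sample_boxes, sample_scores, sample_classes, 640, 480))
```

The class_id values can be mapped back to names using the label map, and the pixel boxes can be passed to any drawing routine.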

2 comments

Shivangi Dubey
Aug 4, 2021

For the training purpose I used 50 images. However, considering that it may not give a very high accuracy you can train your model with around 100 or more images. I would suggest that you also take images with different object environments to increase the accuracy of your model.

Shireen Shaykh
Jul 26, 2021

Could you tell how many images did you use for training?