Want to train your first Object Detection Model?

Updated: Aug 5, 2021

Computer vision is a field that deals with extracting information from images. Training a computer vision model is an essential step in solving problems such as image classification, object detection, and segmentation. In this post, we will learn how to train a Faster RCNN model on a custom dataset.

First step is setting up the environment

To run the object detection model on your system, you first need to set up the environment. Install Anaconda, create a virtual environment with Python 3.6, and install tensorflow-gpu in it. We recommend a machine with an NVIDIA GPU, since tensorflow-gpu relies on CUDA.

In the TensorFlow environment, install Protobuf, lxml, Jupyter Notebook, matplotlib, Pillow, and TensorFlow. You can install these packages with the pip command.
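Before moving on, it can help to confirm that the packages listed above are actually importable from the active environment. The following is a minimal sketch, not part of the original setup: the function name find_missing and the mapping of pip names to import names are assumptions made for illustration (for example, Pillow is imported as PIL and Protobuf as google.protobuf).

```python
import importlib.util

# pip package name -> import name. Some import names differ from the pip names.
# This mapping is an assumption for illustration; adjust it to your setup.
REQUIRED_MODULES = {
    "tensorflow": "tensorflow",
    "protobuf": "google.protobuf",
    "lxml": "lxml",
    "notebook": "notebook",
    "matplotlib": "matplotlib",
    "pillow": "PIL",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip names of packages whose import name cannot be found."""
    missing = []
    for pip_name, import_name in modules.items():
        try:
            spec = importlib.util.find_spec(import_name)
        except ImportError:
            # Raised when the parent package of a dotted name is absent.
            spec = None
        if spec is None:
            missing.append(pip_name)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Anything reported as missing can then be installed with pip inside the same environment.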

Choosing the pre-trained model

Model selection is one of the more difficult parts, and it depends on the problem statement. Here, we are using Faster RCNN to detect and classify two categories of motorbikes, namely Harley Davidson and Hayabusa. For this purpose, we can work with either an RCNN model or an SSD model.

Another factor that comes into play is the requirement of the user. If you want higher accuracy, an RCNN model is a smart choice; however, it may take more time to train. An SSD model, on the other hand, is less accurate but trains faster and is typically suitable for running on devices like smartphones or a Raspberry Pi.

The Architecture of Faster RCNN Model:

Second step is preparing the Faster RCNN model

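# Variables used
import os
import tarfile
import urllib.request
MODEL_NAME = 'faster_rcnn_inception_v2_coco_2018_01_28'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'  # assumed: TensorFlow model zoo download URL
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'

# Download the model archive and extract the frozen inference graph
opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
    file_name = os.path.basename(file.name)
    if 'frozen_inference_graph.pb' in file_name:
        tar_file.extract(file, os.getcwd())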

Creating the Custom Dataset

The most important step for training a model with custom images is to prepare the TFRecords.

To create the TFRecords, we first have to annotate the images: draw a bounding box around each object and save the annotation as an XML file. You may use the LabelImg tool for this.

Once every image has an XML annotation file, we convert the XML annotations into a CSV file, from which we can generate the TFRecords.
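The XML-to-CSV step can be sketched as follows. This is an illustrative example only, assuming the annotations are in the Pascal VOC XML layout that LabelImg writes; the function names, the column order, and the sample annotation are hypothetical and not taken from the original project.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column order; adapt it to whatever your TFRecord script expects.
COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def xml_to_rows(xml_text):
    """Parse one Pascal VOC annotation and return one row per bounding box."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ])
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical annotation, shaped like LabelImg's Pascal VOC output.
SAMPLE_XML = """
<annotation>
  <filename>bike_001.jpg</filename>
  <size><width>800</width><height>600</height><depth>3</depth></size>
  <object>
    <name>Harley Davidson</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>640</xmax><ymax>520</ymax></bndbox>
  </object>
</annotation>
"""

if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

In practice you would loop over every XML file in the annotation folder, collect the rows, and write one CSV for the training images and one for the test images.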

Third step is configuring the training pipeline

The training pipeline is required to organize and automate hyperparameter tuning, pre-processing, model training, and post-processing tasks. It defines which model and what parameters will be used for training, and it points to the training images and data. For this purpose, we have used the faster_rcnn_inceptionv2_pets.config file.

Protocol buffers (protobuf) are Google’s extensible mechanism for serializing structured data. They are useful for programs that communicate with one another over a network or for storing data, and they can be read and written from Python. The pipeline config file itself is written in the protobuf text format. In TensorFlow, the tf.train.Example class represents the protocol buffer message used to store data for the input pipeline.

At a high level, the config file is split into 5 parts:

  1. The model configuration. This defines what type of model will be trained (meta-architecture, feature extractor). In our case, it is the Faster RCNN model with an Inception v2 feature extractor.

  2. The train_config, which decides what parameters should be used to train model parameters (SGD parameters, input preprocessing, and feature extractor initialization values).

  3. The eval_config, which determines what set of metrics will be reported for evaluation.

  4. The train_input_config, which defines what dataset the model should be trained on.

  5. The eval_input_config, which defines what dataset the model will be evaluated on. Typically, this should be different than the training input dataset.
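To see these parts in an actual file, the top-level blocks of the config can be listed with a few lines of Python. This is a rough sketch rather than a real protobuf parser: it assumes one block opener per line and no braces or # characters inside quoted strings. The sample text is a hypothetical, heavily shortened config. Note that in the file itself the two input sections are named train_input_reader and eval_input_reader.

```python
import re

# Hypothetical, heavily shortened config used only to illustrate the layout.
SAMPLE_CONFIG = """
model {
  faster_rcnn {
    num_classes: 1
  }
}
train_config: {
  batch_size: 10  # images per step
}
train_input_reader: {
  label_map_path: "label_map.pbtxt"
}
eval_config: {
  num_examples: 3
}
eval_input_reader: {
  label_map_path: "label_map.pbtxt"
}
"""


def top_level_sections(config_text):
    """Return the names of the blocks opened at nesting depth 0, in file order."""
    sections = []
    depth = 0
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments (assumes no '#' inside strings)
        if depth == 0:
            match = re.match(r"\s*(\w+)\s*:?\s*\{", line)
            if match:
                sections.append(match.group(1))
        # Track nesting so that inner blocks such as faster_rcnn are skipped.
        depth += line.count("{") - line.count("}")
    return sections


if __name__ == "__main__":
    for name in top_level_sections(SAMPLE_CONFIG):
        print(name)
```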

A few parameters that we need to adjust are:

  1. num_classes: The number of classes present in the training dataset. In my case, there is only 1 class, i.e., Harley Davidson

  2. num_steps: The number of training steps. Based on the size of the data, we can increase or decrease num_steps. In my case, I have set it to 1000.

  3. batch_size: The batch_size decides the number of images to be fed into the model in one step. The batch size should be set upon consideration of the hardware limitations. In my case, I have kept the batch_size as 10 to increase the performance of my model.

  4. fine_tune_checkpoint: The fine-tune checkpoint is used to implement transfer learning. Transfer learning is the process of taking a model pre-trained on a huge dataset and reusing its base weights as the starting point for our own detection problem. This helps reduce overfitting when we have a small dataset. The fine_tune_checkpoint parameter is set to the path of the file that contains the base weights of the rcnn_inceptionv2 model (/content/models/research/pretrained_model/model.ckpt).

  5. input_path (for train/test): We need to specify the input_path so that it points to the train.record/test.record file.

  6. label_map_path: It should contain the path to the label_map.pbtxt file that contains the ID and name of our classes.

  7. num_examples: The number of images in the evaluation (test) set. It is set in the eval_config section.
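These fields can be edited by hand, but when the same template is reused for several experiments it can be convenient to patch them from a script. The sketch below does this with plain text substitution. It is an assumption-laden illustration, not the Object Detection API's own tooling: the helper set_field and the sample template are hypothetical, and the approach relies on each field sitting on its own line.

```python
import re


def set_field(config_text, name, value, count=1):
    """Replace the value of a `name: value` line in config text.

    Only the first `count` matching lines are changed (input_path and
    label_map_path appear once per input reader). Strings are quoted,
    booleans become true/false, numbers are written as-is.
    """
    if isinstance(value, bool):
        rendered = "true" if value else "false"
    elif isinstance(value, str):
        rendered = '"%s"' % value
    else:
        rendered = str(value)
    pattern = re.compile(r"^([ \t]*%s[ \t]*:[ \t]*).*$" % re.escape(name), re.MULTILINE)
    # A function is used as the replacement so backslashes in paths are kept literally.
    new_text, replaced = pattern.subn(lambda m: m.group(1) + rendered, config_text, count=count)
    if replaced == 0:
        raise KeyError("field not found: %s" % name)
    return new_text


# Hypothetical fragment of a template config.
SAMPLE = """train_config: {
  batch_size: 1
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  num_steps: 200000
}
"""

updated = set_field(SAMPLE, "batch_size", 10)
updated = set_field(updated, "num_steps", 1000)
updated = set_field(
    updated,
    "fine_tune_checkpoint",
    "/content/models/research/pretrained_model/model.ckpt",
)

if __name__ == "__main__":
    print(updated)
```

Raising KeyError when a field is not found is a deliberate choice here: a silently unchanged template would otherwise train with the placeholder values.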

Pipeline Structure:

# Faster R-CNN with Inception v2, configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_v2'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.5
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 10
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 900000
            learning_rate: .00002
          }
          schedule {
            step: 1200000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/content/models/research/pretrained_model/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  # Note: The below line limits the training process to 1000 steps. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 1000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/content/gdrive/MyDrive/Data/FaceDetection/images/annotations/train.record"
  }
  label_map_path: "/content/gdrive/MyDrive/Data/FaceDetection/images/annotations/label_map.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 3
}