
Tensorflow2 Crash Course - Part V

Mong Kok, Hongkong

This set of notebooks provides the complete code needed to train and use your own custom object detection model with the Tensorflow Object Detection API.

This article is based on a Tutorial by @nicknochnack.

Github Repository

Performance Tuning

Adding more Images for low performing Classes

Add and label new images, copy them into the training folder, then re-run the training:

source tfod/bin/activate 
python 02_training_the_model.py
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.03s).
Accumulating evaluation results...
DONE (t=0.02s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.706
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.881
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.706
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.713
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.725
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.744
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.744

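Interestingly, the evaluation remained identical to what I had before. A possible explanation is that the training resumed from the existing checkpoint in the model directory, which had already reached the configured number of training steps, so no additional training took place.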
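To check whether two evaluation runs really produced the same numbers, the summary lines can be parsed and compared programmatically. The following is a minimal sketch, assuming the log text has the COCO summary format shown above; the function names `parse_coco_summary` and `changed_metrics` are made up for this illustration and are not part of the Tensorflow Object Detection API.

```python
import re

# Matches summary lines such as:
# "Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.706"
SUMMARY_LINE = re.compile(
    r"Average (Precision|Recall)\s+\((AP|AR)\)\s+@\[\s*IoU=([\d.:]+)\s*\|"
    r"\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]\s*=\s*(-?\d+\.\d+)"
)


def parse_coco_summary(log_text):
    """Return {(metric, iou, area, max_dets): value} for every summary line."""
    results = {}
    for match in SUMMARY_LINE.finditer(log_text):
        _, metric, iou, area, max_dets, value = match.groups()
        results[(metric, iou, area, int(max_dets))] = float(value)
    return results


def changed_metrics(before, after, tolerance=1e-9):
    """List the keys whose values differ between two parsed summaries."""
    shared_keys = before.keys() & after.keys()
    return sorted(k for k in shared_keys if abs(before[k] - after[k]) > tolerance)
```

An empty list from `changed_metrics(parse_coco_summary(old_log), parse_coco_summary(new_log))` means every metric present in both logs is unchanged, which is a hint that no new training actually happened.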

Rerun Training with more Steps

The training script contains the training command:

training_command = "python {} --model_dir={} --pipeline_config_path={} --num_train_steps=2000".format(TRAINING_SCRIPT, paths['CHECKPOINT_PATH'],files['PIPELINE_CONFIG'])

Increase the number of training steps to the desired value:

--num_train_steps=10000
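Instead of editing the hard-coded value each time, the step count can be turned into a parameter. This is only a sketch of the idea: `build_training_command` is a hypothetical helper, and the paths in the usage example are placeholders standing in for `TRAINING_SCRIPT`, `paths['CHECKPOINT_PATH']` and `files['PIPELINE_CONFIG']` from the training script.

```python
import shlex


def build_training_command(training_script, model_dir, pipeline_config,
                           num_train_steps=2000):
    """Assemble the training command as an argument list."""
    return [
        "python", training_script,
        f"--model_dir={model_dir}",
        f"--pipeline_config_path={pipeline_config}",
        f"--num_train_steps={num_train_steps}",
    ]


# Placeholder paths for illustration only, not the real project layout.
command = build_training_command(
    "path/to/training_script.py",
    "path/to/model_dir",
    "path/to/pipeline.config",
    num_train_steps=10000,
)
print(shlex.join(command))
```

Keeping the command as a list means it can be passed to `subprocess.run` without shell quoting issues, while `shlex.join` still gives a copy-pasteable string for logging.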

The run takes approximately 20 minutes:

INFO:tensorflow:Step 7000 per-step time 0.124s
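The run time follows roughly from the per-step time in the log. A small sketch of that arithmetic, assuming the per-step time stays constant and ignoring startup and evaluation overhead (the function name is made up for this example):

```python
def estimated_training_minutes(num_steps, seconds_per_step):
    """Rough training time in minutes, assuming a constant per-step time."""
    return num_steps * seconds_per_step / 60.0


# 10000 steps at 0.124 s per step, as reported in the log line above.
print(round(estimated_training_minutes(10000, 0.124), 1))  # 20.7
```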

After 10,000 steps I got the following metrics (Average Precision 0.759 / Average Recall 0.781):

Accumulating evaluation results...
DONE (t=0.02s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.759
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.851
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.759
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.775
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.781
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.781
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.781
  • Precision: True Positive / (True Positive + False Positive)
  • Recall: True Positive / (True Positive + False Negative)

| Eval (IoU=0.50:0.95, area=all, maxDets=100) | 2000 steps | 10000 steps |
| --- | --- | --- |
| Average Precision | 0.706 | 0.759 |
| Average Recall | 0.744 | 0.781 |
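The two formulas, together with the IoU (intersection over union) threshold that decides whether a detection counts as a true positive, can be written out as follows. This is a simplified illustration and not the COCO evaluation code itself: the COCO metrics above additionally average over IoU thresholds from 0.50 to 0.95. The counts in the usage lines are invented for the example.

```python
def precision(true_positives, false_positives):
    """True Positive / (True Positive + False Positive)."""
    detections = true_positives + false_positives
    return true_positives / detections if detections else 0.0


def recall(true_positives, false_negatives):
    """True Positive / (True Positive + False Negative)."""
    ground_truths = true_positives + false_negatives
    return true_positives / ground_truths if ground_truths else 0.0


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Invented counts: 8 correct detections, 2 wrong detections, 4 missed objects.
print(precision(8, 2))  # 0.8
print(round(recall(8, 4), 3))
# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 3))
```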

Changing model architecture using a different pre-trained model as a starting point

Detection Models: Tensorflow provides a collection of detection models pre-trained on the COCO 2017 dataset. So far I have been using the SSD MobileNet V2 FPNLite 320x320. I will now replace it with the slightly slower but more accurate SSD MobileNet V2 FPNLite 640x640:

| Model name | Speed (ms) | COCO mAP | Outputs |
| --- | --- | --- | --- |
| SSD MobileNet V2 FPNLite 320x320 | 22 | 22.2 | Boxes |
| SSD MobileNet V2 FPNLite 640x640 | 39 | 28.2 | Boxes |

How do I proceed from here? I added the model to my training script and re-ran it:

PRETRAINED_MODEL_NAME = 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8'
PRETRAINED_MODEL_URL = 'http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz'
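Since the archive name in the URL repeats the model name, the URL can be derived from the name so the two cannot drift apart when switching models. This is a sketch under the assumption that the other model lives in the same `20200711` directory of the download server; that holds for the URL above but should be checked for any other model.

```python
# Base URL taken from the PRETRAINED_MODEL_URL above. Assumption: the model
# you switch to is published under the same dated directory.
MODEL_ZOO_BASE_URL = 'http://download.tensorflow.org/models/object_detection/tf2/20200711/'


def pretrained_model_url(model_name, base_url=MODEL_ZOO_BASE_URL):
    """Build the download URL of a pre-trained model archive from its name."""
    return f"{base_url}{model_name}.tar.gz"


PRETRAINED_MODEL_NAME = 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8'
PRETRAINED_MODEL_URL = pretrained_model_url(PRETRAINED_MODEL_NAME)
print(PRETRAINED_MODEL_URL)
```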

I can see that the model was downloaded and the pipeline file updated. However, the training itself did not seem to run, and the evaluation (precision/recall) dropped to 0.478/0.550:

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.478
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.664
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.530
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.478
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.512
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.550

I deleted the entire content of Tensorflow/workspace/models/my_ssd_mobnet to get rid of all the checkpoint data from the old model and re-ran the training script. This works: I can see the training steps again, this time a lot slower than with the old model, as expected:

INFO:tensorflow:Step 4000 per-step time 0.446s
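The manual cleanup of the model directory can also be scripted, so that stale checkpoints from a previous architecture are never picked up by accident. A minimal sketch, assuming everything inside the directory (including a generated pipeline file) may be deleted and is recreated by the training script, as it was in my case; `clear_model_dir` is a made-up helper name.

```python
import shutil
from pathlib import Path


def clear_model_dir(model_dir):
    """Delete everything inside model_dir but keep the directory itself.

    Returns the sorted names of the removed entries.
    """
    model_dir = Path(model_dir)
    removed = []
    if not model_dir.is_dir():
        return removed
    for entry in model_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # sub-folders such as train/ or eval/
        else:
            entry.unlink()        # checkpoint files, config files, symlinks
        removed.append(entry.name)
    return sorted(removed)


# Usage, intentionally commented out because it deletes data:
# clear_model_dir('Tensorflow/workspace/models/my_ssd_mobnet')
```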

At 0.446 s per step, the full training run takes about 75 minutes. The result is (0.752/0.769):

Accumulating evaluation results...
DONE (t=0.02s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.752
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.876
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.752
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.762
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.762
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.769
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.769

Results

The results I am getting are a bit confusing. At first the 320 model performed slightly worse than the 640 model, as expected. I then deleted the training data for the 320 model and re-ran it. This time I got the opposite result, as presented above. Is Tensorflow storing training data outside of the designated training folder? Or is the training in general so inconsistent that it needs to run for a much longer time? Right now I am getting slightly better results (0.759/0.781) after the 20-minute training of the 320 model than after the 75 minutes for the 640 model (0.752/0.769).

'SSD MobileNet V2 FPNLite 320x320' vs 'SSD MobileNet V2 FPNLite 640x640'

| Model name | Speed (ms) | COCO mAP | Outputs | Average Precision | Average Recall |
| --- | --- | --- | --- | --- | --- |
| SSD MobileNet V2 FPNLite 320x320 | 22 | 22.2 | Boxes | 0.759 | 0.781 |
| SSD MobileNet V2 FPNLite 640x640 | 39 | 28.2 | Boxes | 0.752 | 0.769 |

Spock

The test run makes it even more confusing: the 640 model seems to perform better here. But there is actually a miss at the end.

SSD MobileNet V2 FPNLite 320x320

SSD MobileNet V2 FPNLite 640x640