Tensorflow2 Crash Course - Part V
This set of notebooks provides the complete code needed to train and use your own custom object detection model with the TensorFlow Object Detection API.
This article is based on a Tutorial by @nicknochnack.
- Tensorflow2 Crash Course Part I
- Tensorflow2 Crash Course Part II
- Tensorflow2 Crash Course Part III
- Tensorflow2 Crash Course Part IV
- Tensorflow2 Crash Course Part V
Performance Tuning
Adding more Images for low performing Classes
Add and label new images, copy them into the training folder, then re-run the training:
source tfod/bin/activate
python 02_training_the_model.py
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.03s).
Accumulating evaluation results...
DONE (t=0.02s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.706
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.881
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.706
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.713
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.725
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.744
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.744
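Interestingly, the evaluation results are identical to what I had before adding the images. One possible explanation is that the checkpoints from the previous run had already reached the configured 2000 steps, so the script did not actually train any further.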
Rerun Training with more Steps
The training script contains the training command:
training_command = "python {} --model_dir={} --pipeline_config_path={} --num_train_steps=2000".format(TRAINING_SCRIPT, paths['CHECKPOINT_PATH'],files['PIPELINE_CONFIG'])
Increase the number of training steps to the desired value:
--num_train_steps=10000
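To avoid editing the command string by hand for every run, the step count can be passed in as a parameter. The following is only a sketch: the helper name `build_training_command` and the placeholder paths are assumptions for illustration, while the three flags are the ones from the command above.

```python
def build_training_command(training_script, model_dir, pipeline_config,
                           num_train_steps=2000):
    """Assemble the training command with a configurable step count.

    Hypothetical helper; it only formats the same flags used in the
    training command shown above.
    """
    if num_train_steps <= 0:
        raise ValueError("num_train_steps must be positive")
    return (
        "python {} --model_dir={} --pipeline_config_path={} --num_train_steps={}"
        .format(training_script, model_dir, pipeline_config, num_train_steps)
    )


# Placeholder values for illustration only. The real script passes
# TRAINING_SCRIPT, paths['CHECKPOINT_PATH'] and files['PIPELINE_CONFIG'].
command = build_training_command(
    "train_script.py",
    "models/my_ssd_mobnet",
    "models/my_ssd_mobnet/pipeline.config",
    num_train_steps=10000,
)
print(command)
```

Keeping the default at 2000 preserves the original behaviour when no step count is given.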
The run takes approximately 20 minutes:
INFO:tensorflow:Step 7000 per-step time 0.124s
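The duration can be estimated up front from the per-step time in the log. This is a rough sketch that assumes the per-step time stays constant and ignores checkpointing and evaluation overhead; the function name is made up for illustration.

```python
def estimate_training_minutes(num_steps, seconds_per_step):
    """Rough training time in minutes, assuming a constant per-step time."""
    return num_steps * seconds_per_step / 60.0


# 10000 steps at the 0.124 s per step reported in the log line above
print(round(estimate_training_minutes(10000, 0.124), 1))  # 20.7
```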
After 10,000 steps I got the following metrics (AP 0.759 / AR 0.781):
Accumulating evaluation results...
DONE (t=0.02s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.759
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.851
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.759
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.775
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.781
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.781
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.781
- Precision: True Positive / (True Positive + False Positive)
- Recall: True Positive / (True Positive + False Negative)
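The two formulas above translate directly into code. The counts below are hypothetical and only illustrate the arithmetic. Note that the AP and AR values in the evaluation output are not a single ratio like this: they are averaged over several IoU thresholds (0.50:0.95), with these ratios as the building blocks.

```python
def precision(true_positives, false_positives):
    """Share of predicted detections that are correct."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


def recall(true_positives, false_negatives):
    """Share of ground-truth objects that were detected."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed objects
tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # roughly 0.667
```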
Metric (IoU=0.50:0.95, area=all, maxDets=100) | 2000 steps | 10000 steps |
---|---|---|
Average Precision | 0.706 | 0.759 |
Average Recall | 0.744 | 0.781 |
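Instead of copying the numbers for a table like the one above by hand, the summary lines of the evaluation output can be parsed. The sketch below assumes the line format shown in the logs in this article. It maps `-1.000` to `None`, on the understanding that the COCO evaluation reports -1 when there are no ground-truth objects in a category (here: no small or medium boxes). The function name is an assumption.

```python
import re

# Matches one summary line of the evaluation output shown above.
SUMMARY_LINE = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\)\s+@\["
    r"\s*IoU=([\d.:]+)\s*\|\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]"
    r"\s*=\s*(-?\d+\.\d+)"
)


def parse_coco_summary(text):
    """Return {(kind, iou, area, max_dets): value}, with None for -1.000."""
    results = {}
    for kind, iou, area, max_dets, value in SUMMARY_LINE.findall(text):
        number = float(value)
        results[(kind, iou, area, int(max_dets))] = None if number < 0 else number
    return results


log_2000_steps = """
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.706
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.744
"""
metrics = parse_coco_summary(log_2000_steps)
print(metrics[("AP", "0.50:0.95", "all", 100)])
print(metrics[("AR", "0.50:0.95", "all", 100)])
```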
Changing model architecture using a different pre-trained model as a starting point
Detection Models: TensorFlow provides a collection of detection models pre-trained on the COCO 2017 dataset. So far I have been using the SSD MobileNet V2 FPNLite 320x320. I will now replace it with the slower (39 ms vs. 22 ms) but more accurate (28.2 vs. 22.2 COCO mAP) SSD MobileNet V2 FPNLite 640x640:
Model name | Speed (ms) | COCO mAP | Outputs |
---|---|---|---|
SSD MobileNet V2 FPNLite 320x320 | 22 | 22.2 | Boxes |
SSD MobileNet V2 FPNLite 640x640 | 39 | 28.2 | Boxes |
How do I proceed from here? I added the model to my training script and re-ran it:
PRETRAINED_MODEL_NAME = 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8'
PRETRAINED_MODEL_URL = 'http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz'
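Since the archive file name in the URL above is just the model name plus `.tar.gz`, the two settings can be kept in sync by deriving one from the other. This is a sketch under the assumption that other models follow the same naming pattern and live in the same dated folder, which is not guaranteed. The helper name is made up.

```python
# Base folder taken from the URL above. Whether other models are published
# under the same dated folder is an assumption.
MODEL_ZOO_BASE = "http://download.tensorflow.org/models/object_detection/tf2/20200711/"


def pretrained_model_url(model_name, base_url=MODEL_ZOO_BASE):
    """Build the download URL for a model archive from its name."""
    return "{}{}.tar.gz".format(base_url, model_name)


PRETRAINED_MODEL_NAME = "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8"
PRETRAINED_MODEL_URL = pretrained_model_url(PRETRAINED_MODEL_NAME)
print(PRETRAINED_MODEL_URL)
```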
I can see that the model was downloaded and the pipeline file updated. However, it seems that the training was not executed, and the evaluation (precision/recall) dropped to 0.478 / 0.550:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.478
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.664
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.530
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.478
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.512
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.550
I deleted the entire content of Tensorflow/workspace/models/my_ssd_mobnet to get rid of all the checkpoint data from the old model and re-ran the training script. OK, this seems to work: I can see the training steps again. This time they are a lot slower than with the old model, as expected:
INFO:tensorflow:Step 4000 per-step time 0.446s
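Clearing the model directory before switching models, as described above, can also be scripted. This is a minimal sketch: the helper name and the optional `keep` parameter are assumptions, and the demo runs on a throwaway temporary directory so that nothing real is deleted.

```python
import os
import shutil
import tempfile


def clear_model_dir(model_dir, keep=()):
    """Delete everything inside model_dir except the names listed in keep.

    Returns the sorted list of removed entries.
    """
    removed = []
    for name in sorted(os.listdir(model_dir)):
        if name in keep:
            continue
        path = os.path.join(model_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)   # sub-folders such as training logs
        else:
            os.remove(path)       # checkpoint files and links
        removed.append(name)
    return removed


# Demo on a temporary directory with a fake checkpoint file and sub-folder
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "ckpt-1.index"), "w").close()
os.mkdir(os.path.join(demo_dir, "train"))
removed = clear_model_dir(demo_dir)
print(removed)
```

For the real run, the directory would be Tensorflow/workspace/models/my_ssd_mobnet. Double-check the path before pointing the function at it, because the deletion cannot be undone.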
At 0.446 s per step, this works out to a training time of about 75 minutes for 10,000 steps. The result is (AP 0.752 / AR 0.769):
Accumulating evaluation results...
DONE (t=0.02s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.752
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.876
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.752
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.762
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.762
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.769
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.769
Results
The results I am getting are a bit confusing. At first the 320 model performed slightly worse than the 640, as expected. I then deleted the training data for it and re-ran the 320. This time I am getting the opposite result, as presented above. Is TensorFlow storing training data outside of the designated training folder? Or is the training in general so inconsistent that it needs to be run for a much longer time? Right now I am getting better results after the 20-minute training of the 320 model (0.759 / 0.781) than after the 75 minutes for the 640 (0.752 / 0.769)...
'SSD MobileNet V2 FPNLite 320x320' vs 'SSD MobileNet V2 FPNLite 640x640'
Model name | Speed (ms) | COCO mAP | Outputs | Average Precision | Average Recall |
---|---|---|---|---|---|
SSD MobileNet V2 FPNLite 320x320 | 22 | 22.2 | Boxes | 0.759 | 0.781 |
SSD MobileNet V2 FPNLite 640x640 | 39 | 28.2 | Boxes | 0.752 | 0.769 |
The test run makes it even more confusing: the 640 seems to perform better here. But there is actually a miss there at the end :-?
SSD MobileNet V2 FPNLite 320x320
SSD MobileNet V2 FPNLite 640x640