
Tensorflow2 Model Zoo

Mong Kok, Hong Kong

What I am trying to do

I need to gain an understanding of how different models and training steps affect the accuracy of the resulting model. Gathering data in the form of screen caps and labeling them also has a huge effect on the outcome of the training. Let's see how this can be optimized.

Project Setup

I am using this Tensorflow Boilerplate. But instead of using an IP camera to film my hand and labeling hand gestures, I now want to use screen caps from popular TV shows and label the characters depicted in each scene. Apart from this change I am following the Tensorflow Tutorial, trying to end up with a model that can tell me the character names when I show it a TV show.
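Each character gets its own class in the label map that the Object Detection API trains against. A minimal sketch of generating it in Python - the character names besides Picard are placeholders, not the actual label set used here:

```python
# Sketch: write a label_map.pbtxt for the character classes.
# The names besides "Jean-Luc Picard" are placeholders - the post does not list the full label set.
labels = ["Jean-Luc Picard", "William Riker", "Data", "Worf"]

with open("annotations/label_map.pbtxt", "w") as f:
    for idx, name in enumerate(labels, start=1):
        f.write("item {\n")
        f.write(f"    id: {idx}\n")
        f.write(f"    name: '{name}'\n")
        f.write("}\n")
```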

Detection Models: Tensorflow provides a collection of detection models pre-trained on the COCO 2017 dataset. So far I have been using the SSD MobileNet V2 FPNLite 320x320. I will now replace it with the slightly slower but more accurate SSD MobileNet V2 FPNLite 640x640:

| Model name | Speed (ms) | COCO mAP | Outputs |
| -- | -- | -- | -- |
| SSD MobileNet V2 FPNLite 320x320 | 22 | 22.2 | Boxes |
| SSD MobileNet V2 FPNLite 640x640 | 39 | 28.2 | Boxes |
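To fine-tune one of these checkpoints on the character dataset, the model's pipeline.config has to point at the pretrained checkpoint, the label map and the TFRecords. A sketch using the Object Detection API's protobuf tooling - all paths, the batch size and the class count are assumptions:

```python
# Sketch: wire the downloaded checkpoint and the custom dataset into pipeline.config.
# Paths, batch size and num_classes are assumptions, not values from this post.
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

config_path = "models/my_ssd_mobilenet_v2_fpnlite/pipeline.config"
pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with open(config_path, "r") as f:
    text_format.Merge(f.read(), pipeline_config)

pipeline_config.model.ssd.num_classes = 4  # number of character labels (assumption)
pipeline_config.train_config.batch_size = 4
pipeline_config.train_config.fine_tune_checkpoint = (
    "pre-trained-models/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8/checkpoint/ckpt-0")
pipeline_config.train_config.fine_tune_checkpoint_type = "detection"
pipeline_config.train_input_reader.label_map_path = "annotations/label_map.pbtxt"
pipeline_config.train_input_reader.tf_record_input_reader.input_path[:] = ["annotations/train.record"]
pipeline_config.eval_input_reader[0].label_map_path = "annotations/label_map.pbtxt"
pipeline_config.eval_input_reader[0].tf_record_input_reader.input_path[:] = ["annotations/test.record"]

with open(config_path, "w") as f:
    f.write(text_format.MessageToString(pipeline_config))
```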

ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8

INFO:tensorflow:Step 2000 per-step time 0.117s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.302
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.449
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.401
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.318
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.629
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.679
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.679
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.679
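These COCO metrics come from the Object Detection API's evaluation job - the same thing model_main_tf2.py runs when it is given a --checkpoint_dir. A sketch of that call (paths and the timeout are assumptions):

```python
# Sketch: evaluate the latest training checkpoint against the test TFRecord.
# Paths and timeout are assumptions; this mirrors what model_main_tf2.py does
# when started with --checkpoint_dir.
from object_detection import model_lib_v2

model_lib_v2.eval_continuously(
    pipeline_config_path="models/my_ssd_mobilenet_v2_fpnlite/pipeline.config",
    model_dir="models/my_ssd_mobilenet_v2_fpnlite",
    checkpoint_dir="models/my_ssd_mobilenet_v2_fpnlite",
    wait_interval=180,
    timeout=300)
```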

After 2000 steps I only got a few positive recognitions, all of them for a single label. In general the result was pretty bad... so, another 3000 steps on top:

INFO:tensorflow:Step 3000 per-step time 0.123s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.486
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.669
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.604
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.488
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.707
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.721
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.721
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.721

This greatly improved the metrics above. But I still ran into the issue that only one of my labels was being recognized - Captain Jean-Luc Picard. The model also seemed to be obsessed with hair styles, weighting them higher than facial features...

Tensorflow Model Zoo

Expanding the ROI

Which is interesting, because during labeling he was the only character for whom I chose to make the entire head the region of interest. For the others I used a region of interest centered on their face, to prevent the model from over-emphasizing their haircut... So I went back to labeling and corrected this - potential - mistake:

Tensorflow Model Zoo

Now I re-ran the 3000 steps, and already I am getting a much better result:

INFO:tensorflow:Step 3000 per-step time 0.117s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.436
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.576
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.478
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.436
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.650
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.793
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.793
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.793

I now also found an image that - correctly - tested positive for another label. But I also lost a couple of detections that worked before:

Tensorflow Model Zoo

So let's re-run the training and add another 10,000 steps to see if this makes a difference:

INFO:tensorflow:Step 10000 per-step time 0.125s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.473
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.558
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.558
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.473
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.831
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.831
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.831
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.831

What I am noticing here is that the Recall value greatly benefits from the training. But there is almost no increase in Precision:

  • Precision: True Positive / (True Positive + False Positive)
  • Recall: True Positive / (True Positive + False Negative)
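To make the difference tangible, here is a tiny worked example with made-up counts plugged into the two formulas above:

```python
# Toy numbers (made up) to illustrate precision vs. recall.
tp = 8   # detections that match a labeled character
fp = 2   # detections of the wrong character or of background
fn = 12  # labeled characters that were not detected at all

precision = tp / (tp + fp)  # 0.80
recall = tp / (tp + fn)     # 0.40

print(f"precision={precision:.2f}, recall={recall:.2f}")
```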

This must be because recognition in general is terrible - I am getting almost no detections at all, false or true, and most characters simply go undetected. But when there is a hit, it is almost always a good one. So why doesn't this work? It might be that the 320x320 input resolution is not sufficient for the training. So let's try the higher one and see how this changes the evaluation:

| Model name | Speed (ms) | COCO mAP | Outputs |
| -- | -- | -- | -- |
| SSD MobileNet V2 FPNLite 320x320 | 22 | 22.2 | Boxes |
| SSD MobileNet V2 FPNLite 640x640 | 39 | 28.2 | Boxes |

ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8

INFO:tensorflow:Step 2000 per-step time 0.439s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.234
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.372
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.283
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.234
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.512
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.595
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.595
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.595

INFO:tensorflow:Step 11000 per-step time 0.468s

Oh, I noticed a mistake here. I thought the steps of a new run would be added on top of the steps already trained. But by re-running the command with 11,000 steps I ended up at 11,000 total, not at 13,000 as I expected. So I am now comparing 10,000 steps for the 320 model with 11,000 steps for the 640 model. The evaluation after those 11,000 steps:
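For context, a sketch of what model_main_tf2.py effectively runs under the hood - the train_steps value is the total step target, and training resumes from the latest checkpoint in model_dir, which explains the behaviour above (paths are assumptions):

```python
# Sketch: the training loop behind model_main_tf2.py. train_steps is the TOTAL
# number of steps - a rerun resumes from the latest checkpoint in model_dir and
# trains up to this number, it does not add this many steps on top.
# Paths are assumptions.
import tensorflow as tf
from object_detection import model_lib_v2

strategy = tf.compat.v2.distribute.MirroredStrategy()
with strategy.scope():
    model_lib_v2.train_loop(
        pipeline_config_path="models/my_ssd_mobilenet_v2_fpnlite/pipeline.config",
        model_dir="models/my_ssd_mobilenet_v2_fpnlite",
        train_steps=11000)
```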

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.474
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.559
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.548
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.484
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.729
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.786
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.786
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.786

| Model name | AP (2000 steps) | AR (2000 steps) | AP (10000 steps) | AR (10000 steps) |
| -- | -- | -- | -- | -- |
| SSD MobileNet V2 FPNLite 320x320 | 0.302 | 0.679 | 0.558 | 0.831 |
| SSD MobileNet V2 FPNLite 640x640 | 0.234 | 0.595 | 0.548 | 0.786 |

So the 640 model performed worse at 2000 steps but almost reached the same level as the 320 model at 11,000 steps. Given the much longer training time, this is a bit underwhelming. Going through the images, the detection rate did not even get that much better.

I let the training run overnight to get up to 90,000 steps:

INFO:tensorflow:Step 90000 per-step time 0.432s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.243
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.328
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.263
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.244
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.560
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.683
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.683
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.683

Taking a look at the results from the 20,000-step training (left) and the end result at 90,000 steps (right): for all labels that already performed (at least a little bit) I can see an improvement in confidence. There have also been a few new hits (correct and false). Overall, the training did lead to a more accurate model:

Tensorflow Model Zoo

Tensorflow Model Zoo

Tensorflow Model Zoo

Tensorflow Model Zoo

Tensorflow Model Zoo

Tensorflow Model Zoo
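These detection overlays come from running the latest checkpoint over a handful of test screencaps. A rough sketch of such a detection pass, following the standard TF2 Object Detection tutorial - paths, the label map location and the score threshold are assumptions:

```python
# Sketch: load the latest checkpoint, run detection on one screencap and draw
# boxes above a confidence threshold. Paths and threshold are assumptions.
import numpy as np
import tensorflow as tf
from object_detection.builders import model_builder
from object_detection.utils import config_util, label_map_util
from object_detection.utils import visualization_utils as viz_utils

configs = config_util.get_configs_from_pipeline_file(
    "models/my_ssd_mobilenet_v2_fpnlite/pipeline.config")
detection_model = model_builder.build(model_config=configs["model"], is_training=False)

# Restore the most recent training checkpoint.
ckpt = tf.train.Checkpoint(model=detection_model)
ckpt.restore(tf.train.latest_checkpoint("models/my_ssd_mobilenet_v2_fpnlite")).expect_partial()

@tf.function
def detect(image):
    image, shapes = detection_model.preprocess(image)
    predictions = detection_model.predict(image, shapes)
    return detection_model.postprocess(predictions, shapes)

image_np = np.array(tf.io.decode_image(tf.io.read_file("screencaps/test/sample.png"), channels=3))
detections = detect(tf.convert_to_tensor(image_np[None, ...], dtype=tf.float32))

category_index = label_map_util.create_category_index_from_labelmap("annotations/label_map.pbtxt")
viz_utils.visualize_boxes_and_labels_on_image_array(
    image_np,
    detections["detection_boxes"][0].numpy(),
    detections["detection_classes"][0].numpy().astype(int) + 1,  # label map ids are 1-based
    detections["detection_scores"][0].numpy(),
    category_index,
    use_normalized_coordinates=True,
    min_score_thresh=0.5)
```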

Labeling Done Right!

Ok, I think I figured it out. And it is pretty obvious now... Since I collected separate images for each label (character), I ended up labeling only one person per image - even if a whole group of them was in the frame. This must have confused the model a lot: every time it correctly recognized a character that wasn't labeled, the detection counted as a false positive... So I went through my stash of images and re-did the labeling, this time adding multiple labels to each image (where appropriate):

Tensorflow Model Zoo
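To make sure no image slipped through with only a single box where a whole group is visible, a quick sanity check over the annotation files - this assumes Pascal VOC XMLs (e.g. as written by labelImg), and the directory path is an assumption:

```python
# Sketch: flag annotation files that still contain only a single bounding box.
# Assumes Pascal VOC XML annotations; the glob path is an assumption.
import glob
import xml.etree.ElementTree as ET

for xml_file in sorted(glob.glob("screencaps/train/*.xml")):
    objects = ET.parse(xml_file).findall("object")
    if len(objects) < 2:
        names = [o.findtext("name") for o in objects]
        print(f"{xml_file}: only {len(objects)} box(es) -> {names}")
```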

I deleted the trained model and re-ran the training - and I am getting a much better result already, after only 10,000 steps:

INFO:tensorflow:Step 10000 per-step time 0.431s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.804
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.976
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.600
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.808
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.819
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.824
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.824
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.600
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.828

And the detections follow suit - so I added another 10,000 steps and saw a slight decrease in the numbers:

INFO:tensorflow:Step 20000 per-step time 0.454s

Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.798
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.600
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.803
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.820
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.820
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.820
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.600
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.824

Tensorflow Model Zoo

And again I am surprised - while the detection rate for some labels stayed the same, it rose to the high nineties for others, while dropping below 50% confidence for labels where it was high before:

Tensorflow Model Zoo

The next steps would be to replace the training images of badly performing labels with higher quality ones and, in general, to add more diverse training images for all labels and keep running trainings until the improvements start to level out.