Each grid cell has a depth of D. The value of D depends on the number of object classes we want to detect. With C classes and B bounding boxes per grid cell,
D = B × (5 + C) (3.1)
The output of the network looks like this. There are S × S = 13 × 13 = 169 grid cells in total, and each grid cell can detect up to B bounding boxes. Each bounding box has 5 + C properties, so a grid cell holds D = B × (5 + C) values (this is the depth).
Tensor = S × S × B × (5 + C) (3.2)
Figure 3.3: This 13x13 tensor can be considered as a 13x13 grid representing the input image, where each cell holds B = 5 boxes, each described by 5 box values and 30 class probabilities.
In our case, the number of classes is C = 30 and the number of boxes per cell is B = 5.
The logic is as follows: if an object is present in a cell, we decide which object it is by taking the largest class probability value in that cell.
Figure 3.4: In YOLOv2-tiny, each grid cell predicts B = 5 bounding boxes
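As a quick check of equations (3.1) and (3.2), the arithmetic can be sketched in a few lines of Python. The function names below are illustrative only and do not come from any YOLO implementation.

```python
def grid_cell_depth(num_boxes: int, num_classes: int) -> int:
    """Depth D of one grid cell, following D = B x (5 + C)."""
    return num_boxes * (5 + num_classes)


def output_shape(grid_size: int, num_boxes: int, num_classes: int) -> tuple:
    """Output viewed as an S x S x B x (5 + C) tensor."""
    return (grid_size, grid_size, num_boxes, 5 + num_classes)


S, B, C = 13, 5, 30            # values used in this chapter
D = grid_cell_depth(B, C)

print(D)                       # 175
print(output_shape(S, B, C))   # (13, 13, 5, 35)
print(S * S * D)               # 29575 values in the whole output
```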
x: x coordinate of the box center
y: y coordinate of the box center
w: box width
h: box height
P(obj): probability that an object exists in this box
Each grid cell is able to predict B bounding boxes. Since each bounding box prediction is composed of 5 + C values, the total length of predicted values on one grid cell is B*(5+C). I will consider the case when B = 5 and C = 30, so one grid cell has length D = 175.
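The layout described above can be illustrated with NumPy. This is only a sketch: the array is filled with random numbers rather than real predictions, and it assumes that the 5 + C values of one box are stored contiguously along the depth axis, in the order x, y, w, h, P(obj), class values. An actual framework may order them differently.

```python
import numpy as np

S, B, C = 13, 5, 30

# Stand-in for a network output: random values, not real predictions.
rng = np.random.default_rng(0)
raw = rng.random((S, S, B * (5 + C)))      # shape (13, 13, 175)

# Split the depth axis into B boxes of 5 + C values each (assumed layout).
boxes = raw.reshape(S, S, B, 5 + C)        # shape (13, 13, 5, 35)

xywh = boxes[..., 0:4]                     # x, y, w, h
p_obj = boxes[..., 4]                      # P(obj)
class_probs = boxes[..., 5:]               # C class values

print(xywh.shape, p_obj.shape, class_probs.shape)
```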
Note that x, y, w and h are not expressed in pixels, because the images on which we apply object detection do not all have the same size. For example, one image may have size 1080x1920x3, while another may have size 2160x4096x3 (where 3 stands for the RGB channels). Therefore, before we feed images to the network, we resize them to 416x416x3 so that they all have the same size. I will show later what the actual output looks like. For now, we do not need to care about the exact values.
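To make the resizing step concrete, here is a minimal nearest-neighbour resize written with NumPy only. It is a sketch under simplifying assumptions: real pipelines usually rely on an image library with bilinear interpolation, and some preserve the aspect ratio by padding. The function name `resize_nearest` is hypothetical.

```python
import numpy as np


def resize_nearest(image: np.ndarray, out_h: int = 416, out_w: int = 416) -> np.ndarray:
    """Resize an H x W x 3 image to out_h x out_w x 3 by nearest-neighbour sampling."""
    in_h, in_w = image.shape[:2]
    # For every output row/column, pick the source row/column it maps to.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols[None, :]]


# Two dummy images with the sizes mentioned in the text.
small = np.zeros((1080, 1920, 3), dtype=np.uint8)
large = np.zeros((2160, 4096, 3), dtype=np.uint8)

print(resize_nearest(small).shape)   # (416, 416, 3)
print(resize_nearest(large).shape)   # (416, 416, 3)
```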
Figure 3.5: B bounding boxes a grid cell predicts
This image represents one of the B bounding boxes a grid cell predicts. The first 5 values are fixed while C varies depending on the number of classes. P(obj) x Cᵢ becomes the probability that an object of the i-th class exists in this bounding box.
C represents conditional probabilities: given that an object exists in the box, Cᵢ is the probability that the object belongs to the i-th class:
Cᵢ = P(the object belongs to the i-th class | an object exists in this box) (3.3)
So the probability that an object of the i-th class exists in this bounding box is given by:
P(obj) × Cᵢ
If this value is greater than a threshold, we consider that the network has predicted an object of the i-th class in this bounding box.
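The decision rule for a single bounding box can be sketched as follows. The threshold of 0.3 and the three-class example are arbitrary values chosen for illustration, and `detect_class` is a hypothetical helper, not part of any YOLO library.

```python
import numpy as np


def detect_class(p_obj: float, class_probs: np.ndarray, threshold: float = 0.3):
    """Return (class index, score) for one box, or None if no class passes the threshold."""
    scores = p_obj * class_probs          # P(obj) x C_i for every class i
    best = int(np.argmax(scores))         # class with the largest score
    if scores[best] > threshold:
        return best, float(scores[best])
    return None


class_probs = np.array([0.1, 0.7, 0.2])   # made-up conditional class probabilities

print(detect_class(0.9, class_probs))     # class 1 with score about 0.63
print(detect_class(0.2, class_probs))     # None, since 0.2 x 0.7 = 0.14 is below 0.3
```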
3.3. Network Architecture
In YOLOv2-tiny, the input to the network is a 416x416x3 image. The network contains no fully connected layers.
| Layer | Kernel | Stride/Filters | Output shape |
|-------------|--------|----------------|--------------|
| Input | | | 416x416x3 |
| Convolution | 3×3 | 1/16 | 416x416x16 |
| MaxPooling | 2×2 | 2 | 208x208x16 |
| Convolution | 3×3 | 1/32 | 208x208x32 |
| MaxPooling | 2×2 | 2 | 104x104x32 |
| Convolution | 3×3 | 1/64 | 104x104x64 |
| MaxPooling | 2×2 | 2 | 52x52x64 |
| Convolution | 3×3 | 1/128 | 52x52x128 |
| MaxPooling | 2×2 | 2 | 26x26x128 |
| Convolution | 3×3 | 1/256 | 26x26x256 |
| MaxPooling | 2×2 | 2 | 13x13x256 |
| Convolution | 3×3 | 1/512 | 13x13x512 |
| MaxPooling | 2×2 | 1 | 13x13x512 |
| Convolution | 3×3 | 1/1024 | 13x13x1024 |
| Convolution | 3×3 | 1/1024 | 13x13x1024 |
| Convolution | 1×1 | 1/175 | 13x13x175 |
Table 3.1: Details of Network
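The output shapes of the network can be traced with a short script. This is a sketch, not the actual network definition: the layer list is transcribed from Table 3.1 with filter counts read from the output-shape column, and it assumes 'same' padding in every layer, so that the spatial size after a layer is ceil(size / stride). Pooling keeps the number of channels unchanged, so only convolutions alter the depth.

```python
# (kind, kernel size, stride, filters); filters is None for pooling layers.
LAYERS = [
    ("conv", 3, 1, 16),   ("pool", 2, 2, None),
    ("conv", 3, 1, 32),   ("pool", 2, 2, None),
    ("conv", 3, 1, 64),   ("pool", 2, 2, None),
    ("conv", 3, 1, 128),  ("pool", 2, 2, None),
    ("conv", 3, 1, 256),  ("pool", 2, 2, None),
    ("conv", 3, 1, 512),  ("pool", 2, 1, None),
    ("conv", 3, 1, 1024),
    ("conv", 3, 1, 1024),
    ("conv", 1, 1, 175),  # 175 = B x (5 + C) with B = 5, C = 30
]


def trace_shapes(height, width, channels, layers):
    """Return the list of (H, W, C) shapes, starting with the input shape."""
    shapes = [(height, width, channels)]
    for kind, _kernel, stride, filters in layers:
        # Assumed 'same' padding: spatial size becomes ceil(size / stride).
        height = -(-height // stride)
        width = -(-width // stride)
        if kind == "conv":
            channels = filters   # pooling leaves the channel count unchanged
        shapes.append((height, width, channels))
    return shapes


shapes = trace_shapes(416, 416, 3, LAYERS)
for shape in shapes:
    print("x".join(str(v) for v in shape))
```

The last line printed is 13x13x175, which matches the depth D = 175 per grid cell on a 13x13 grid.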
Chapter 4 Experimental Results
4.1. Dataset
In our project we used the FlickrLogos-32 dataset. It contains photos showing brand logos and is meant for the evaluation of multi-class logo detection/recognition as well as logo retrieval methods on real-world images. Logos of 32 different classes and 6000 negative images were collected by downloading them from Flickr. The dataset includes images, ground truth, annotations (bounding boxes plus binary masks), evaluation scripts and pre-computed visual features.