An FPGA-Based Reconfigurable CNN Accelerator For YOLO

deployment to embedded devices. Among one-stage models, SSD does not support multi-scale training, and its generalization ability is worse than that of YOLOv2 and YOLOv3 [18]. YOLOv3 has a more complex network structure and is slower than YOLOv2. Therefore, YOLOv2 and YOLOv2-Tiny [20] are selected for deployment on the FPGA platform.

TABLE I. NETWORK STRUCTURE OF YOLOV2-TINY

Layer   Filters   Size / Stride / Padding   Input     Output
Conv1   16        3×3 / 1 / 1               416×416   416×416
Pool1             2×2 / 2 / 0               416×416   208×208
Conv2   32        3×3 / 1 / 1               208×208   208×208
Pool2             2×2 / 2 / 0               208×208   104×104
Conv3   64        3×3 / 1 / 1               104×104   104×104
Pool3             2×2 / 2 / 0               104×104   52×52
Conv4   128       3×3 / 1 / 1               52×52     52×52
Pool4             2×2 / 2 / 0               52×52     26×26
Conv5   256       3×3 / 1 / 1               26×26     26×26
Pool5             2×2 / 2 / 0               26×26     13×13
Conv6   512       3×3 / 1 / 1               13×13     13×13
Pool6             2×2 / 1 / 0               13×13     13×13
Conv7   1024      3×3 / 1 / 1               13×13     13×13
Conv8   512       3×3 / 1 / 1               13×13     13×13
Conv9   425       1×1 / 1 / 0               13×13     13×13

YOLOv2 was proposed by Joseph Redmon et al. [4]; it removes the fully connected layers from YOLO and uses the anchor boxes of Faster R-CNN to predict bounding boxes [17]. YOLOv2 is based on Darknet-19 (comprising 19 Conv layers and 5 max-pooling layers), adds three 3×3 Conv layers and one 1×1 Conv layer, and borrows the route layer idea from ResNet [20]. The input image is divided into S×S cells; each cell predicts 5 bounding boxes, and each bounding box predicts (C + 5) numbers: the probabilities of the C classes, the position and size of the bounding box, and the probability that the bounding box contains a real object. The network output size is therefore S×S×5×(C + 5). YOLOv2-Tiny differs from YOLOv2 only in that its network is more simplified; the pre-processing and post-processing steps are the same. For space reasons, we only show the network structure of YOLOv2-Tiny in Table I. YOLOv2-Tiny has 9 Conv layers and 6 max-pooling layers, and each of the first six Conv layers is followed by a max-pooling layer.
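The output-size calculation above can be sketched as a short reference computation (function name and values are illustrative; 80 classes and 5 anchor boxes correspond to the standard COCO configuration):

```python
def yolo_output_size(S, B, C):
    # Each of the S*S grid cells predicts B bounding boxes; each box
    # predicts C class probabilities plus 4 coordinates and 1 objectness score.
    return S * S * B * (C + 5)

# A 416x416 input is down-sampled to a 13x13 final feature map (S = 13).
# With B = 5 anchor boxes and C = 80 classes:
print(yolo_output_size(13, 5, 80))  # 13 * 13 * 5 * 85 = 71825
```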

In a Conv layer, multiple convolution kernels are used to extract higher-level features from the input feature maps. The number of convolution kernels determines the number of output feature maps. The kernel size K×K is generally 1×1 or 3×3. To keep the sizes of the input and output feature maps consistent, a padding operation is often performed around the input feature maps, with padding number (K−1)/2. The Conv layer can be expressed as:

Out(f, h, w) = Σ_{c=0}^{C−1} Σ_{i=0}^{K−1} Σ_{j=0}^{K−1} W(f, c, i, j) · In(c, h·S+i, w·S+j) + b(f)    (1)

where Out(f, h, w) is the value at row h, column w of the f-th output feature map, In is the input feature map, W is the convolution kernel, b is the bias, C is the number of input feature maps, K is the convolution kernel size, and S is the stride. The result of the convolution usually needs to be activated. The activation function of YOLOv2 and YOLOv2-Tiny is Leaky ReLU.
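The convolution and Leaky ReLU operations described above can be written as a minimal NumPy reference model (a software sketch for checking the hardware, not the accelerator implementation; array names are illustrative):

```python
import numpy as np

def conv2d(inp, weights, bias, stride=1, pad=0):
    """Direct convolution: inp is C x H x W, weights is F x C x K x K."""
    C, H, W = inp.shape
    F, _, K, _ = weights.shape
    x = np.pad(inp, ((0, 0), (pad, pad), (pad, pad)))
    Ho = (H + 2 * pad - K) // stride + 1
    Wo = (W + 2 * pad - K) // stride + 1
    out = np.zeros((F, Ho, Wo))
    for f in range(F):                      # loop over output feature maps
        for h in range(Ho):
            for w in range(Wo):
                window = x[:, h*stride:h*stride+K, w*stride:w*stride+K]
                out[f, h, w] = np.sum(window * weights[f]) + bias[f]
    return out

def leaky_relu(x, alpha=0.1):
    # YOLOv2 / YOLOv2-Tiny use Leaky ReLU with a negative slope of 0.1
    return np.where(x > 0, x, alpha * x)
```

With K = 3, stride 1 and padding (K−1)/2 = 1, the output feature map keeps the input's spatial size, matching the Conv rows of Table I.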

Max pooling is used for down-sampling the feature maps, which improves the robustness of feature extraction. A K×K sampling window is established, and the maximum value in the window is saved. The window size is generally 2×2, and the stride is generally 1 or 2.
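A matching NumPy sketch of the max-pooling operator (again a reference model under the same C x H x W layout assumption as above):

```python
import numpy as np

def maxpool2d(inp, K=2, stride=2):
    """Max pooling over a C x H x W tensor with a K x K window."""
    C, H, W = inp.shape
    Ho = (H - K) // stride + 1
    Wo = (W - K) // stride + 1
    out = np.zeros((C, Ho, Wo))
    for c in range(C):
        for h in range(Ho):
            for w in range(Wo):
                out[c, h, w] = inp[c, h*stride:h*stride+K,
                                      w*stride:w*stride+K].max()
    return out
```

With K = 2 and stride 2, each pooling layer halves the spatial dimensions, which produces the 416 → 208 → 104 → 52 → 26 → 13 progression of Table I.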

Batch Normalization (BN) layer:

The BN layer can significantly reduce the training time of deep CNNs [13] and alleviates the problem caused by the shift of the data distribution in the intermediate layers during training. For a mini-batch B = {x_1, ..., x_m}, the mean μ_B = (1/m) Σ_{i=1}^{m} x_i and the variance σ²_B = (1/m) Σ_{i=1}^{m} (x_i − μ_B)² are calculated, and each input is normalized as x̂_i = (x_i − μ_B) / √(σ²_B + ε). In order to recover the distribution of the original network features, the learnable parameters γ and β are introduced to obtain the output y_i = γ·x̂_i + β, where ε is a constant added to the mini-batch variance for numerical stability [22]. The BN layer is located in front of the activation layer. When deployed in hardware, the BN layer can generally be fused into the weights or bias [13].
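Since γ, β, the recorded mean and variance, and ε are all constants at inference time, the BN transform is an affine function and can be folded into the preceding Conv layer's per-channel weights and bias. A minimal sketch of this fusion (array shapes are illustrative; weights are F x C x K x K as in the convolution reference model):

```python
import numpy as np

def fuse_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into conv parameters.

    BN computes y = gamma * (conv(x) + bias - mean) / sqrt(var + eps) + beta,
    which equals conv(x) with scaled weights plus an adjusted bias.
    """
    scale = gamma / np.sqrt(var + eps)              # one factor per output channel
    fused_w = weights * scale[:, None, None, None]  # broadcast over C x K x K
    fused_b = (bias - mean) * scale + beta
    return fused_w, fused_b
```

After fusion, the BN layer disappears from the hardware dataflow: only a multiply-accumulate with the fused weights and bias remains before the Leaky ReLU.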

In recent years, much progress has been made in deploying CNN-based object detection models to hardware. Through model compression, SqueezeNet is less than 0.5 MB while maintaining accuracy [10]. MobileNet adopts depthwise separable convolution, and ShiftNet replaces spatial convolution with a simple shift operation [9][23]. These efficient models reduce both computation and model size. [11]-[16] explored the parallel space of convolution; among them, [11] proposes dynamically combining parallelism within a convolution, intra-output parallelism, and inter-output parallelism, which fully improves computing performance. In this paper, we focus on the following three points: (1) hardware architecture exploration and parallel-space exploration; (2) quantizing the model while maintaining accuracy; (3) reducing data bandwidth, i.e., reducing the number of weight values and the bit width of intermediate data so as to reduce the number of off-chip memory accesses.

III.

More than 90% of the computing time of a CNN is spent in the Conv layers, which consist of multiple nested loops. Each
