Research on OpenCL optimization for FPGA deep learning application

In recent years, with the development of computer science, deep learning has come to be regarded as capable of solving inference and learning problems in high-dimensional spaces, and it has therefore received unprecedented attention from both academia and industry. Compared with CPUs and GPUs, FPGAs have attracted much attention for deep learning because of their high energy efficiency, short development cycle and reconfigurability. However, because research on OpenCL optimization of deep learning algorithms on FPGA is limited, OpenCL tools and models designed for CPU/GPU cannot be used directly on FPGA. This makes it difficult for software programmers to obtain good performance when implementing deep learning algorithms on FPGA. To solve this problem, this paper proposes an OpenCL computational model based on an FPGA template architecture to optimize the time-consuming convolution layer in deep learning. A comparison between the program that applies the computational model and the corresponding optimized program provided by Xilinx shows that the performance of the former is 8-40 times higher than that of the latter.


Introduction
Recently, artificial intelligence technology has attracted worldwide attention. As a new field of artificial intelligence, deep learning is particularly strong at solving complex learning problems [1] [2]. However, with the progressive innovation of technology, the number of neural network models has also increased rapidly. From 2012 to 2018, as the number of neural network models grew, both the number of model parameters and the amount of computation increased rapidly. The sizes of the AlexNet model designed in 2012 and the VGG-16 model designed in 2014 exceeded 200 MB and 500 MB respectively [3], while the number of model parameters increased from 60 million to 138 million. Hundreds of millions of operations are required for each run. To improve performance, scholars have turned to designing more efficient deep neural networks [4].
In accelerating deep learning applications, FPGAs have attracted a lot of attention because of their advantages over GPUs and ASICs. Unlike GPU acceleration, FPGA acceleration is a hardware design, so its power consumption is lower and it can achieve higher performance per watt. For example, in the inference stage of a convolutional neural network, a Microsoft team used an FPGA (Stratix V D5) to process 134 images per second at a power consumption of only 25 watts. With a more capable FPGA (Arria 10 GX1150), the throughput is expected to reach 233 images per second with essentially unchanged power consumption. A high-performance GPU implementation (Caffe + cuDNN) processes 500-824 images per second at a power consumption of 235 watts [5]. This means that FPGAs have better energy efficiency than GPUs [6] [7] [8] [9]. Unlike GPUs and ASICs, which have fixed hardware architectures, an FPGA is reconfigurable hardware: developers can connect the logic blocks within the FPGA through programmable interconnect to implement the desired function [10]. This programmability enables developers to adjust their hardware design at any time to suit the deep learning algorithm. However, FPGA-based hardware acceleration requires software developers to have a certain amount of hardware expertise, which is a high threshold for them. In recent years, the FPGA programming environment has improved greatly, and developers without such hardware expertise can now program FPGAs in high-level languages such as C, C++ and OpenCL. To some extent this reduces the difficulty of FPGA development, shortens the development cycle and provides convenience for researchers and developers [11].
To reduce the difficulty of FPGA development, the key technologies in automated high-level synthesis tool chains have been studied. This research can be classified from different perspectives. In terms of the input language used by the user, it can be divided into work based on C and work based on C-like languages. Research that uses C/C++ as the input language [12] [13] [14] falls into two categories when it comes to automatically generating the FPGA hardware architecture. The first category builds a complete automated generation tool chain. The process of generating the hardware architecture is fully controllable, but the tools are insufficiently general [14]. The second category uses current mainstream high-level synthesis tool chains for hardware generation [12] [13], but it requires in-depth study of the automatic code generator: the user's C/C++ code is translated into the input language supported by the commercial tool chain. The main research question in this category is how to map one high-level language to another (such as OpenCL). Another line of work directly uses a C-like language (such as OpenCL) as the input language and focuses on different architectures of the CNN accelerator [15] [16] [17]. However, because the same program function can be implemented by different OpenCL code, the hardware architecture generated by the automated tool differs from one version to another. To obtain efficient hardware circuits, developers need to try many combinations of optimization configurations, and even push-button automated tools require iterative optimization. The key contribution of this paper is a computational model that helps software engineers choose design parameters rationally without additional third-party tools, reduces the number of iterations spent rewriting OpenCL code, and generates efficient hardware based on a deeply loop-pipelined architecture.

Convolutional neural network
Widely used in image processing, the convolutional neural network (CNN) is one of the most classical models in deep learning. Given an image, filters are applied to extract features, and the result is an image called a feature map [17]. For the most commonly used valid convolution, assuming that the input feature and the filter are both one-dimensional, with lengths n and m respectively, the output is given by Eq (1), where t = 1, 2, ..., n−m+1 and n > m.
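As a minimal sketch of the valid one-dimensional case described above, the following Python function slides a filter of length m over an input of length n and returns n − m + 1 outputs. It assumes the cross-correlation form commonly used in CNNs (the filter is not flipped); the function name `valid_conv1d` is illustrative and is not taken from the paper.

```python
def valid_conv1d(x, w):
    """Valid 1D convolution (cross-correlation form, filter not flipped)."""
    n, m = len(x), len(w)
    if n <= m:
        raise ValueError("input must be longer than the filter (n > m)")
    # The output index t covers n - m + 1 positions (0-based here).
    return [sum(w[k] * x[t + k] for k in range(m)) for t in range(n - m + 1)]


print(valid_conv1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```

With n = 5 and m = 3 the output has 5 − 3 + 1 = 3 elements, matching the range of t given above.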
In neural networks other than CNNs, most neurons in a layer are fully connected. Although fully connected neurons can recognize more complex images, they suffer from problems such as a lack of flexibility and high computational complexity. The convolution operation reduces unnecessary weight connections, which makes the transformed images more robust.
The pooling layer, also called the subsampling layer, usually follows the convolution layer. By exploiting the principle of local correlation, it improves the robustness of the system on the one hand and reduces the computation on the feature map on the other.
The propagation process in the max pooling layer is given by Eq (2), where L_1 and L_2 represent the size of the pooling kernel. Pooling is usually applied after convolution, with an activation function placed between the two. The activation function is a simple non-linear operation that improves the network's capacity for non-linear representation. Through the sequence of convolution, activation function and pooling, a CNN obtains more robust features.
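The max pooling step can be sketched as follows. This is only an illustration: it assumes a non-overlapping window (stride equal to the window size L_1 × L_2), which the text does not state, and the name `max_pool2d` is hypothetical.

```python
def max_pool2d(fmap, l1, l2):
    """Non-overlapping max pooling with an l1 x l2 window (stride = window size)."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - l1 + 1, l1):
        pooled_row = []
        for c in range(0, cols - l2 + 1, l2):
            # Collect the l1 x l2 window and keep its largest value.
            window = [fmap[r + i][c + j] for i in range(l1) for j in range(l2)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 1],
]
print(max_pool2d(feature_map, 2, 2))  # [[4, 5], [3, 7]]
```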
One problem that deep convolutional neural networks bring is that the convolution layer consumes a lot of memory [18], especially during training, because back-propagation needs all the intermediate values of the forward pass. If the size of the input image is M × N and the filter size is m × n, the convolution can be expressed as Eq (3), where w is the weight of the kernel. However, this equation is not sufficient when multiple convolution layers are considered, so a channel parameter is added to the kernel. The modified equation is Eq (4), where c represents the image channel. If the number of kernels is k and the number of channels is c, each convolved image obtained through the above equation has size (M − m + 1) × (N − n + 1).
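For reference, one common way to write the single-channel and multi-channel valid convolutions is sketched below. The equations themselves are not reproduced in this text, so the index convention (1-based, filter not flipped), the symbols x and y, the channel index q and the kernel superscript (k) are all assumptions rather than the paper's notation.

```latex
% Single-channel valid convolution: input M x N, kernel m x n
y_{i,j} = \sum_{u=1}^{m} \sum_{v=1}^{n} w_{u,v}\, x_{i+u-1,\; j+v-1},
\qquad 1 \le i \le M-m+1,\quad 1 \le j \le N-n+1

% Multi-channel form: sum over the c input channels for kernel k
y^{(k)}_{i,j} = \sum_{q=1}^{c} \sum_{u=1}^{m} \sum_{v=1}^{n}
                w^{(k)}_{q,u,v}\, x_{q,\; i+u-1,\; j+v-1}
```

Under this convention the index ranges give an output of size (M − m + 1) × (N − n + 1) for each kernel, in agreement with the size stated above.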
Suppose that the convolution kernel is 5×5 and that 200 feature maps of size 150×100 are to be output. If the input has three channels, the whole process requires 225 million floating-point multiplications. Because this process involves a large number of multiply-add operations, a reasonable computational model is needed to improve system performance. In practice, however, optimization must consider not only the computation but also whether the on-chip storage resources of the FPGA can deliver, at one time, the data needed for the multiply-add operations [19]. Let ThroughputRate denote the throughput of the system; it is affected by both computation and memory access. The relationship is shown in Eq (5), where CalculatedPeak is the peak computing power of the computing resources and MemoryPeak is the maximum floating-point performance that the memory can sustain. According to this equation, the overall system throughput cannot exceed the smaller of the computation and memory access terms [20]. Computation depends on data. The input data required by the convolution layer are usually copied from Host memory to the FPGA off-chip global memory by the Host program. When the FPGA uses the data, they usually have to be read from the off-chip global memory, and the results produced by the FPGA must be written back to Host memory through the off-chip global memory. Because the off-chip global memory is not on the FPGA chip, data exchange with it takes considerable time. Without optimization, this usually has a great impact on program performance [21].
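The multiplication count and the throughput bound can be checked with a short sketch. The function names are illustrative, and the two peak values passed to `throughput_bound` are made-up numbers used only to show the min relationship of Eq (5).

```python
def conv_multiplications(kernel_h, kernel_w, in_channels, out_maps, out_h, out_w):
    """Multiplications of a direct convolution: one per weight per output pixel."""
    return kernel_h * kernel_w * in_channels * out_maps * out_h * out_w


def throughput_bound(calculated_peak, memory_peak):
    """Upper bound on system throughput: the smaller of the two peaks."""
    return min(calculated_peak, memory_peak)


# 5x5 kernel, 3 input channels, 200 output feature maps of 150x100 pixels
print(conv_multiplications(5, 5, 3, 200, 150, 100))  # 225000000

# Hypothetical peaks (same unit, e.g. GFLOPS): memory access is the limit here
print(throughput_bound(100.0, 40.0))  # 40.0
```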
In addition, convolution involves a relatively large amount of data, much of which is reused during the calculation, and this results in low computational efficiency. The paper [22] proposed a parallel acceleration strategy for CNNs on FPGA with OpenCL using Xilinx SDAccel. However, it gives no optimization details or method for configuring the parameters, so it is difficult for researchers to reproduce. To deal with these problems, this paper proposes an OpenCL optimization strategy that effectively addresses the low memory efficiency and the extra overhead caused by repeatedly reading and writing data in the FPGA off-chip memory.

Loop pipelined architecture
Using OpenCL tools to implement deep learning algorithms on FPGA greatly reduces the designers' work. Because the mapping onto the internal hardware architecture of the FPGA does not have to be considered, the development threshold is lowered. However, the process by which the development tools convert OpenCL into an FPGA bitstream is hidden from the designers, which makes it hard for them to add better hardware modules written in other languages to the project. In addition, most designers do not know how to configure the bit width reasonably, how to increase parallelism to make full use of the FPGA, or how data transfer affects performance [23]. As a result, the deep learning algorithms they design have no obvious performance advantage. In this section, a loop pipelined architecture template is proposed for optimizing the performance of convolutional neural networks. Experiments verify that using the optimized architecture template together with the configuration of specific technical parameters effectively improves the performance of FPGA-accelerated deep learning algorithms. The following sections describe the optimized template architecture and the configuration of the technical parameters in detail.

Template parameters
According to the parameterized optimization architecture diagram, the following parameters need to be calculated and determined:
• Determine the amount of data in the parameter, N_d; the number of elements in the vector data type is N_e.
• Determine the bit width of the data port, bit_width.
In the equation, num_bit is the number of data bits corresponding to the data type.
• The theoretical value of the total number of data transfers is N_t, and the average amount of data per transfer is K_t.
In the equation, N_i is the total number of data transfers for each variable and B_i is the number of data bits.
• The number of DSPs needed for the calculation is dsp_need.
In the equation, N_unroll is the unrolling degree in the loop unrolling scheme and K is the number of DSPs used by each loop iteration. This value can be obtained from the resource usage report. Alternatively, the number of DSPs required for each multiplication can be obtained from the development board documentation, from which K can be calculated.
• Calculate the number of sub-parts into which the array's data memory is divided, p_num. In the equation, num_d is the total number of data elements in the array, num is the number of data elements with consecutive addresses used in each calculation, num_max is the upper limit of array partitions supported by the compiler, and addr_i is the address interval between adjacent data elements.
• The loop boundary L_b.
• Determine the number of configured computing units, N.
In the equation, B and D are the percentages of BRAM resources and DSP resources consumed by a single computing unit respectively, and N_k is the maximum number of computing units allowed by the compiler.
• The number of memory ports for storing data is num_port, and the theoretical parallelism of the data calculation is v_cal. The maximum parallelism of data reading/writing is v_data. These three parameters can be obtained from the program execution information.

Computational model
1. The relevant parameters are obtained from the hardware architecture template diagram and the parameter calculation equations. The specific OpenCL optimization techniques and related parameters are selected by the following steps.
(1) Judge whether the addresses at which the corresponding parameter is stored in the off-chip global memory are continuous. If so, enter (2); if not, data vectorization is not applied.
(2) Judge whether the amount of data contained in the parameter is suitable for data vectorization. N_e represents the number of elements in a vector data type supported by the OpenCL compiler, N_e ∈ {2, 3, 4, 8, 16}. Traverse the values of N_e. If there exists K_n ∈ N⁺ such that N_d and N_e satisfy Eq (6), record the value of K_n. After the traversal is completed, all recorded values of K_n form a set called L. If the set L is not empty, enter (3). If it is empty, data vectorization is not applied.
(3) Record the minimum value in the set as L_min. Group the data in ascending order of address; the number of groups is L_min and the number of data elements in each group is N_d / L_min. After the grouping, each group of data is used as a whole to replace the original data in the kernel. If the replacement is equivalent and does not affect the correct execution of the program, enter (4). Otherwise, remove L_min from the set L and repeat (3).
(4) Carry out data vectorization. N_d / L_min is the number of elements contained in the vector type data.
2. Configure the number of data ports and the bit width. The bit width of a data port is usually related to the data type of the data transmitted through the port. At present, the OpenCL compiler supports bit widths of 32, 64, 128, 256 and 512 bits. If the number of bits of the data type satisfies num_bit ∈ {32, 64, 128, 256, 512}, the port bit width is set to num_bit. Otherwise, the default setting is kept, in which the OpenCL compiler automatically configures the bit width of the data port according to the actual situation.
3. Assume that n global variables are involved in data transfer (read) and that the total numbers of data transfers (read) per variable are N_1, N_2, ..., N_n. The data bits are B_1, B_2, ..., B_n, and the burst lengths of data reads and writes are Bu_1, Bu_2, ..., Bu_n. The burst length is usually 16 in the burst read-write mode and 1 in the non-burst read-write mode.
Whether a new round of optimization is needed is judged from the program execution report. The steps are as follows:
(1) Record the total number of data transfers (reading/writing) in the program execution report and the average amount of data (reading/writing) per data transfer. The values are N_r, N_w, K_r and K_w respectively.
(2) According to Eqs (8) and (9), calculate the total number of data transfers (reading/writing) and the average amount of data (reading/writing) per data transfer after memory optimization. The values obtained are N_tr, N_tw, K_tr and K_tw respectively.
(3) If N_tr is less than N_r (or K_tr is greater than K_r) or N_tw is less than N_w (or K_tw is greater than K_w), and the difference is large, the optimization needs to be adjusted; otherwise, there is no need to re-optimize.

4. The set A consists of all loops in the nested loop to be analyzed. Loop unrolling and array partition optimization are carried out according to the following process:
(1) Start the analysis from the innermost loop in set A. If the current loop is already the outermost loop, or its order cannot be exchanged with that of the innermost loop, record the innermost loop and all loops that can exchange order with it as set B, remove the elements of set B from set A and enter (2); otherwise, analyze the next outer loop.
(2) According to Eq (10), calculate the number of DSPs, dsp_need, required for the scheme, and compare it with the total number of on-chip DSPs, dsp_total. If dsp_need < dsp_total and an array partition that matches the computational parallelism of the scheme can be realized, enter (3); otherwise, analyze the next scheme.
(3) In this scheme, if all loops in set B are fully unrolled and set A is not empty, enter (1) and calculate the degree of parallelism; otherwise, enter (4).
(4) Carry out the optimization according to the loop unrolling scheme and the corresponding array partition scheme. Analyze whether an array partition that satisfies the computational parallelism in (2) can be achieved. To do so, analyze the data accessed after loop unrolling and group them so that the arrays stored in the same FPGA on-chip memory form one group. Analyze each group in turn, selecting the analysis method according to whether the data are stored on the FPGA as a one-dimensional or a multi-dimensional array. If all arrays can be partitioned so as to satisfy the computational parallelism, an efficient array partition exists for the loop unrolling scheme; otherwise, no effective array partition can be carried out.
(5) Array partitioning is analyzed for two cases: one-dimensional arrays and multi-dimensional arrays. The steps of one-dimensional array analysis are as follows:
(a) Analyze the address characteristics of the data involved in each calculation after loop unrolling. If the addresses are continuous, apply cyclic partitioning to the array and enter (c). If the addresses are not continuous but the interval is uniform, apply block partitioning to the array and enter (c). If the address characteristics meet neither condition, enter (b). The number of sub-parts into which the array's data memory is divided is calculated by Eq (11).
(b) If num_d < num_reg, partition the array completely and enter (c); otherwise, it cannot be partitioned effectively. Here num_reg is the total number of FPGA on-chip registers available.
(c) Verify whether the parallelism of data reading/writing after array partitioning satisfies the computational parallelism of the loop unrolling scheme. If it does, the array partition is effective and the array partition scheme is recorded; otherwise, the array partition is invalid.
The steps of multi-dimensional array analysis are as follows:
(a) For each array dimension, partition the array completely along that dimension and verify whether the parallelism of data reading/writing after array partitioning satisfies the computational parallelism of the loop unrolling scheme.
(b) If there is a dimension for which this can be realized, regard the multi-dimensional array as a one-dimensional array along that dimension and analyze it with the one-dimensional method; otherwise, enter (c). Note that in the analysis of complete partitioning, the constraint to consider is not the number of registers on the FPGA chip but the number of sub-parts of the array partition, p_num < num_max.
(c) If num_d < num_reg, partition the multi-dimensional array completely; otherwise, it cannot be partitioned effectively.
(6) Optimization based on loop pipelining is as follows:
(a) In loop unrolling, if there is a loop boundary L_b which is x, no optimization is made; otherwise, enter (b).
(b) If there is a loop whose inner loops can all be fully unrolled, treat those loops as its sub-loops, select the innermost such loop and enter (c); otherwise, loop pipelining is not applied.
(c) If the L_b of all the sub-loops in the inner loop is c, loop pipelining is applied; otherwise, it is not.
(8) According to the program execution report, the number of memory ports storing data is num_port, and the theoretical parallelism of data calculation is v_cal. If the data are not stored in registers and num_port = 1, the computation does not make full use of the data reading/writing parallelism, because the OpenCL compiler can allocate at most two data ports to each memory. In this situation it is generally necessary to re-optimize the calculation. If num_port = 2, the parallelism of data reading/writing is fully utilized. At this point, compare the values of v_cal and v_data. If v_cal > v_data, repartition the array.
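The port check in the last step above can be written as a small decision function. This is a sketch only: the function name and the returned strings are hypothetical, and the inputs are assumed to be read by hand from the program execution report.

```python
def port_check(num_port, v_cal, v_data, stored_in_registers=False):
    """Suggest the follow-up action from the execution-report values."""
    if not stored_in_registers and num_port == 1:
        # Only one of the (at most two) ports per memory is in use.
        return "re-optimize computation"
    if num_port == 2 and v_cal > v_data:
        # Both ports are used, but reads/writes still lag the computation.
        return "repartition array"
    return "keep current scheme"


# A 5x5 kernel gives a computational parallelism of 25; an unpartitioned
# array with two ports gives a read parallelism of 2.
print(port_check(num_port=2, v_cal=25, v_data=2))  # repartition array
print(port_check(num_port=1, v_cal=25, v_data=1))  # re-optimize computation
```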

Experimental section
The compiler tool used in this experiment is the Xilinx SDx tool, and the FPGA development board is the ADM-PCIE-7V3 produced by Alpha Data. The Host runs Linux. The specific environment of this experiment is shown in Table 1 and the specific configuration of the ADM-PCIE-7V3 board is given in Table 2.

Optimization example
This section first introduces an OpenCL example of a convolution layer on FPGA. The computational model is then applied to this example, and the application is explained in detail. Finally, the execution results of the optimized program are given.

Convolution layer OpenCL example on FPGA.
In a convolutional neural network model, every convolution layer performs the same operation, namely convolution. The layers differ only in the data processed and the scale of the data, so the optimization methods and ideas are basically the same for different convolution layers. Accordingly, this section focuses on the example of a single convolution layer. The example program given in this section is an ordinary convolution layer program without any optimization, whose parameters are shown in Table 3.
The number of convolution kernel channels is 48, and the number of convolution kernels is 256. The convolution layer is mainly implemented in the OpenCL kernel program, while the Host is mainly responsible for configuring the environment required by the kernel program, calling the kernel program, transferring data to and from the kernel, and so on. The pseudo code of the implementation is shown in Algorithm 1. The original convolution layer program is compiled, deployed and run to obtain the data transfer information shown in Table 4, where it can be seen that there are many data transfers between the convolution layer kernel and the off-chip global memory and that the average amount of data per transfer is only 4 bytes. The data transfer efficiency is therefore low.

Algorithm 1: Realization of pseudo code in convolution layer
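The body of Algorithm 1 is not reproduced in this text. As a rough stand-in, the sketch below shows an unoptimized direct convolution as a six-level loop nest, written in Python rather than OpenCL so that it can be run directly. The loop order, array layout and names are assumptions, and the tiny shapes in the usage lines are not the Table 3 parameters.

```python
def conv_layer(inp, weights, stride=1):
    """Direct convolution: inp[c][y][x], weights[k][c][u][v] -> out[k][y][x]."""
    channels, in_h, in_w = len(inp), len(inp[0]), len(inp[0][0])
    kernels, k_h, k_w = len(weights), len(weights[0][0]), len(weights[0][0][0])
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(kernels)]
    # First three loops: one iteration per output pixel.
    for k in range(kernels):
        for oy in range(out_h):
            for ox in range(out_w):
                acc = 0
                # Last three loops: multiply-accumulate for a single output pixel.
                for c in range(channels):
                    for u in range(k_h):
                        for v in range(k_w):
                            acc += weights[k][c][u][v] * inp[c][oy * stride + u][ox * stride + v]
                out[k][oy][ox] = acc
    return out


ones_input = [[[1] * 3 for _ in range(3)] for _ in range(2)]     # 2 channels, 3x3
ones_kernel = [[[[1] * 2 for _ in range(2)] for _ in range(2)]]  # 1 kernel, 2 channels, 2x2
print(conv_layer(ones_input, ones_kernel))  # [[[8, 8], [8, 8]]]
```

The split into three outer and three inner loops mirrors the way the optimization discussion treats the per-pixel calculation separately from the iteration over output pixels.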
Example optimization scheme. The basic convolution layer program is optimized according to the hardware architecture diagram and computational model proposed in the third and fourth sections. The main work of this section is to apply the computational model to the example program and to give the parameters needed for each step together with the detailed optimization scheme.
According to the first step of the computational model, the amount of data N_d contained in the parameter is 307200. Traversing N_e with the parameter calculation Eq (6) gives K_n ∈ {19200, 38400, 76800, 102400, 153600}, so the set is not empty. Take the minimum value, 19200. The data are grouped in ascending order of address; the number of groups is 19200, and each group contains 16 data elements. Because the grouped data can equivalently replace the original data used in the kernel, vectorization is applied to them.
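The traversal of N_e in this first step can be reproduced with a few lines of Python. The sketch assumes that the vectorization condition is exact divisibility, N_d = K_n × N_e, which is consistent with the K_n values listed above; the equivalence check on the kernel code is not modeled, and the names are illustrative.

```python
SUPPORTED_VECTOR_WIDTHS = (2, 3, 4, 8, 16)  # N_e values supported by the OpenCL compiler


def vectorization_candidates(n_d, widths=SUPPORTED_VECTOR_WIDTHS):
    """Return {K_n: N_e} for every vector width N_e that divides N_d exactly."""
    return {n_d // n_e: n_e for n_e in widths if n_d % n_e == 0}


candidates = vectorization_candidates(307200)
l_min = min(candidates)          # fewest groups, i.e. the widest vector
print(sorted(candidates))        # [19200, 38400, 76800, 102400, 153600]
print(l_min, candidates[l_min])  # 19200 16
```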
According to the second step of the computational model, the global parameter is declared as __global int16 *pre, and the bit width of data transfer between the computing unit and the memory interconnect / memory controller is set to 16×32, namely 512 bits. According to the parameter Eq (7), the bit width is determined to be 512 bits, which supports a single data transfer of 512 bits. The number of data elements per transfer thus increases from 1 to 16, which effectively reduces the number of memory transfers.
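The port width rule of this second step can be sketched as below. The helper name is hypothetical, and treating the vector width as element bits × N_e is an assumption drawn from the 16×32 figure in the text.

```python
SUPPORTED_PORT_WIDTHS = (32, 64, 128, 256, 512)  # bits, as supported by the OpenCL compiler


def port_bit_width(element_bits, n_e):
    """Port width for a vector of n_e elements, or None to keep the compiler default."""
    num_bit = element_bits * n_e
    return num_bit if num_bit in SUPPORTED_PORT_WIDTHS else None


width = port_bit_width(32, 16)  # int16: sixteen 32-bit integers
print(width, width // 8)        # 512 64
print(port_bit_width(32, 3))    # None
```

A 512-bit port moves 64 bytes per transfer, sixteen times the 4 bytes moved per transfer by the unoptimized program.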
According to the third step of the computational model, N_t and K_t are calculated by the parameter calculation Eqs (8) and (9). Because N_tr is larger than N_r in the program execution report and N_tw is larger than N_w in the execution report in this example, there is no need to re-optimize at this point.
The data transfer between the convolution layer kernel and the off-chip global memory after the first three optimization steps of the computational model is shown in Table 5, which indicates that the number of data transfers between the convolution layer kernel program and the off-chip global memory is greatly reduced and that the average amount of data per transfer is increased to 64 bytes.
Table 4. Related information of data transfer in the convolution layer basic program.
Columns: Transfer Type | Number of Transfers | Transfer Rate (MB/s) | Avg Bandwidth Utilization (%) | Avg Size (KB) | Avg Time (ns)
According to the fourth step of the computational model, dsp_need is calculated to be 251 by the parameter calculation Eq (10). The total number of on-chip DSPs is 3600. The number of DSPs needed is less than the total, and the loops in set B are fully unrolled. Fig 2 shows the use of the optimization instruction and the equivalent code after unrolling. Graph (a) shows the equivalent code when the unrolling factor is 2, and graph (b) shows the equivalent code when the default unrolling factor is used.
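The DSP feasibility check of the fourth step (dsp_need against the on-chip total, falling through to the next scheme on failure) can be sketched as follows. The dictionary layout and the first, oversized scheme are hypothetical; only the values 251 and 3600 come from the example.

```python
def first_feasible_scheme(schemes, dsp_total):
    """Return the first scheme whose DSP demand fits and whose arrays can be partitioned."""
    for scheme in schemes:
        if scheme["dsp_need"] < dsp_total and scheme["partition_ok"]:
            return scheme
    return None


schemes = [
    {"name": "hypothetical oversized scheme", "dsp_need": 4800, "partition_ok": True},
    {"name": "scheme from the example", "dsp_need": 251, "partition_ok": True},
]
chosen = first_feasible_scheme(schemes, dsp_total=3600)
print(chosen["name"])                             # scheme from the example
print(round(100 * chosen["dsp_need"] / 3600, 1))  # 7.0 (percent of on-chip DSPs)
```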
Because the convolution kernel is 5×5, the theoretical parallelism of computation is 25. However, the input feature map data involved in the calculation and the convolution kernel data are stored locally as one-dimensional arrays. Without array partition optimization, the OpenCL compiler assigns at most two ports to each array. That is, the read parallelism is 2, which is far lower than the computational parallelism, so the arrays need to be partitioned.
According to the fifth step of the computational model, the p_num of cyclic partitioning and block partitioning is calculated by Eq (11). Since the total number of data elements num_d in the array is less than the number of available on-chip registers num_reg, the array is completely partitioned. In the loop unrolling and array partitioning optimization, the last three loop levels of the convolution layer implementation (the calculation of a single pixel in the output feature map) are optimized and the corresponding array partitioning is carried out. For convenience of optimization, this section splits the last three loop levels into two three-level loop nests, one for the convolution multiplications and one for the additions. Because the two three-level loop nests have the same structure, the corresponding optimization strategies are nearly identical. The code for the optimization of the convolution multiplication is shown in Fig 3. In this optimization, the inner and outer loops are completely unrolled, and the theoretical computational parallelism is 48×5, namely 240. The theoretical parallelism of the data reading/writing involved in the calculation is 48×5×2, namely 480. Since the parallelism of data reading and writing is greater than that of computation, the outer loop is partially unrolled with an unrolling factor of 2 so that the two match.
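To illustrate why continuous addresses call for cyclic partitioning and uniformly spaced addresses for block partitioning, the sketch below maps array indices to memory banks and checks whether one set of parallel accesses hits distinct banks. The bank-mapping formulas are the usual definitions of cyclic and block partitioning; the array size of 25 and the factor of 5 are illustrative values, not taken from Eq (11).

```python
import math


def cyclic_bank(index, p_num):
    """Cyclic partition: consecutive elements go to consecutive banks."""
    return index % p_num


def block_bank(index, num_d, p_num):
    """Block partition: each bank holds one contiguous chunk of the array."""
    return index // math.ceil(num_d / p_num)


def conflict_free(indices, bank_of):
    """True if every accessed element lives in a different bank."""
    banks = [bank_of(i) for i in indices]
    return len(set(banks)) == len(banks)


contiguous = range(10, 15)  # five consecutive addresses
strided = range(0, 25, 5)   # five addresses with a uniform interval of 5

print(conflict_free(contiguous, lambda i: cyclic_bank(i, 5)))    # True
print(conflict_free(contiguous, lambda i: block_bank(i, 25, 5))) # False
print(conflict_free(strided, lambda i: block_bank(i, 25, 5)))    # True
print(conflict_free(strided, lambda i: cyclic_bank(i, 5)))       # False
```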
According to the sixth step of the computational model, the loop pipelining attribute __attribute__ ((xcl_pipeline_loop)) can be used when writing an OpenCL program. Placed immediately before a for-loop, this attribute makes the FPGA execute the iterations of the loop in a pipelined manner. All loop boundaries are constants, so the L_b obtained from Eq (12) is c and loop pipelining is applied. Fig 4 shows the use of the loop attribute and the execution of the loop before and after it is applied. Figure (a) shows the loop execution without loop pipelining, and figure (b) shows the execution with loop pipelining.
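The benefit of loop pipelining can be estimated with the standard cycle-count model sketched below. The trip count of 25 and the iteration latency of 4 cycles are made-up illustration values, and an initiation interval of 1 is assumed; actual figures depend on the generated hardware.

```python
def loop_cycles(trip_count, iteration_latency, pipelined, initiation_interval=1):
    """Estimated cycles for a loop, with or without pipelining."""
    if not pipelined:
        # Each iteration starts only after the previous one has finished.
        return trip_count * iteration_latency
    # A new iteration starts every initiation_interval cycles.
    return iteration_latency + (trip_count - 1) * initiation_interval


print(loop_cycles(25, 4, pipelined=False))  # 100
print(loop_cycles(25, 4, pipelined=True))   # 28
```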
According to the seventh step of the computational model, the values of 1/B and 1/D in Eq (13) are 22.1 and 14.3 respectively according to the execution report, and the compiler limits the number of computing units to 10. According to the parameter calculation Eq (13), N does not exceed 10. Within this range, six computing units work best for this example program, so six computing units are configured. The kernel program is split into 6 work-groups, each containing only one work-item; the specific OpenCL kernel optimization code is shown in Fig 5. The output feature map data are stored in channel order. So that the address space of the output data in the off-chip global memory is continuous when each computing unit exchanges data with the off-chip global memory, the kernel program is divided according to the number of channels in the output feature map.
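The resource bound on the number of computing units can be sketched as below. Eq (13) is not reproduced in this text, so the form used here, N ≤ min(⌊1/B⌋, ⌊1/D⌋, N_k), is an assumption based on the parameter definitions; it gives only an upper limit, and the choice of six units within that limit is the paper's empirical result.

```python
import math


def max_compute_units(inv_bram_share, inv_dsp_share, compiler_limit):
    """Upper bound on computing units from BRAM share, DSP share and compiler limit."""
    return min(math.floor(inv_bram_share), math.floor(inv_dsp_share), compiler_limit)


# 1/B = 22.1, 1/D = 14.3, compiler limit N_k = 10
print(max_compute_units(22.1, 14.3, 10))  # 10
```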
According to the eighth step of the computational model, loop unrolling and array partition optimization are carried out for the last three loop levels of the convolution operation. The parallelism of the optimized convolution multiplication is 480, which is two fifths of that of the ideally optimized convolution multiplication. Multiple computing unit optimization is carried out for the outermost of the first three loop levels, and the resulting parallelism of the output feature map pixel calculation is 6, which is much less than that of the ideally optimized pixel calculation. Loop pipelining is applied to the inner loop. Finally, comparing v_cal with v_data shows that there is no need to repartition the array.
Optimization performance analysis of the program. The example code is optimized according to the proposed computational model, and the result is compared with the latest Xilinx optimization program [17] in Table 6. From the runtimes of each cell and of the whole kernel, it can be seen that these four cells execute essentially in parallel.
The final optimization result of this example program is shown in Table 7. After optimization, the example program executes in 9.76 milliseconds. This paper also tests the convolution layer optimization program provided by Xilinx, as summarized in Table 5, which shows that the optimized program performs 29 times better than the program provided by Xilinx.
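The relation between the two reported figures can be made explicit with a short calculation. The sketch assumes that "29 times" denotes a ratio of execution times, which the text does not state, so the implied baseline runtime is an estimate derived from rounded figures rather than a measured value.

```python
def speedup(baseline_ms, optimized_ms):
    """Speed-up ratio, taken here as baseline time over optimized time."""
    return baseline_ms / optimized_ms


def implied_baseline_ms(optimized_ms, ratio):
    """Baseline runtime implied by an optimized runtime and a speed-up ratio."""
    return optimized_ms * ratio


optimized_ms = 9.76   # reported execution time after optimization
reported_ratio = 29   # reported improvement over the Xilinx program
baseline_ms = implied_baseline_ms(optimized_ms, reported_ratio)
print(f"implied baseline runtime: about {baseline_ms:.0f} ms")
```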
The final optimization results of this experiment are compared with the CPU implementation [20] in Table 8. In the table, 1-thread denotes single-threaded execution, 16-thread denotes execution with 16 threads, and -O3 denotes the compiler optimization level. Table 8 shows that the optimized convolution on FPGA performs 9.76 times better than the single-thread CPU and 2.8 times better than the 16-thread CPU. It also shows that the energy consumption of the convolution program optimized with the proposed computational model and implemented on FPGA is significantly lower than that of the CPU.

Comparison and analysis of different scale convolution programs
To analyze the performance of convolution programs of different scales, eight convolution layer programs are set up in ascending order of scale, as shown in Table 9. Take Layer 1 as an example: the numbers of input and output channels are 64 and 128, respectively. The input sample has size 11 × 11 × 64, each convolution kernel has size 3 × 3 × 64, there are 128 convolution kernels, and the stride is 1. The output of the convolution operation has size 11 × 11 × 128. The other seven scales are analyzed in the same way as Layer 1.
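The Layer 1 sizes can be checked with the standard output-size formula for convolution. The sketch reads the sizes as 11 × 11 × 64 for the input, 3 × 3 × 64 for each kernel, and 11 × 11 × 128 for the output. A padding of 1 is an assumption: the text does not state it, but it is the value that keeps an 11 × 11 output with a 3 × 3 kernel at stride 1. The function names are illustrative.

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    """Output length along one spatial dimension of a convolution."""
    return (in_size + 2 * pad - kernel) // stride + 1


def mac_count(out_h, out_w, out_ch, kernel, in_ch):
    """Multiply-accumulate operations for one convolution layer."""
    return out_h * out_w * out_ch * kernel * kernel * in_ch


out_side = conv_output_size(11, 3, stride=1, pad=1)  # assumed padding of 1
macs = mac_count(out_side, out_side, 128, 3, 64)
print(out_side, macs)
```

Without padding the same layer would produce a 9 × 9 output, so the stated 11 × 11 output implies that the input is padded.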
The proposed computational model is applied to convolution layers of different scales, and its performance is compared with the corresponding optimization program provided by Xilinx, as shown in Table 10. At each scale, the program optimized with the proposed model takes significantly less time than the optimization program by Xilinx. Fig 6 shows the speed-up ratio for Layer1 to Layer8; the higher the speed-up ratio, the better the optimization effect. Accordingly, the optimized program outperforms the optimization program of Xilinx. The code is openly accessible so that other researchers can study it and explore new acceleration methods for deep neural networks. It can be found at the following link: https://github.com/PoetryAndWine/FPGA_CNN_Acceleration.

Conclusion
This paper proposes a computational model based on OpenCL that enables OpenCL models written for CPU/GPU to be transferred to FPGA. The computational model helps software programmers without fundamental hardware knowledge to quickly implement high-performance deep learning algorithms on FPGA. In terms of performance, the computational model not only reduces the cost of data interaction, but also improves the efficiency of data calculation. In terms of adaptability, the computational model is flexible and suitable for convolution layers of different sizes. When applied to convolution layers of different scales, the proposed computational model achieves 8-40 times the performance of the corresponding optimization program provided by Xilinx.

Supporting information
S1 File. Convolution layer optimization code and the performance data (Xilinx and this paper). (ZIP)