Paper Summary. OpenCL Caffe: Accelerating and Enabling a Cross Platform Machine Learning Framework

This 2016 paper presents an OpenCL branch/port of the deep learning framework Caffe. More specifically, this branch replaces the CUDA-based backend of Caffe with an open standard OpenCL backend. The software was first located at https://github.com/amd/OpenCL-caffe, then graduated to https://github.com/BVLC/caffe/tree/opencl.

Once we develop a DNN model, we would ideally like to be able to deploy it for different applications across multiple platforms (servers, NVIDIA GPUs, AMD GPUs, ARM GPUs, or even smartphones and tablets) with minimum development effort. Unfortunately, most deep learning frameworks (including Caffe) are integrated with CUDA libraries for running on NVIDIA GPUs, and that limits portability across multiple platforms.

OpenCL helps with portability of heterogeneous computing across platforms, since it is supported by a variety of commercial chip manufacturers: Altera, AMD, Apple, ARM Holdings, Creative Technology, IBM, Imagination Technologies, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, ZiiLABS, etc. In order to enable compatibility between different platforms, an OpenCL program detects the specific devices and compiles at runtime.
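To make the detect-then-compile point concrete, here is a minimal OpenCL host-side sketch (my illustration, not code from the paper): it queries for a platform and a GPU device, then hands a kernel source string to clBuildProgram, which compiles it at runtime for whatever device was found.

```cpp
// Minimal sketch of OpenCL runtime device detection and compilation.
// Error checking trimmed for brevity.
#include <CL/cl.h>
#include <cstdio>

static const char* kSource =
    "__kernel void scale(__global float* x, float a) {"
    "  int i = get_global_id(0); x[i] *= a;"
    "}";

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, NULL);  // e.g. AMD, NVIDIA, Intel, ARM

  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

  // The same source string is compiled for whatever device was detected above.
  cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, NULL, NULL);
  clBuildProgram(prog, 1, &device, "", NULL, NULL);

  cl_kernel kernel = clCreateKernel(prog, "scale", NULL);
  printf("kernel built for the detected device\n");

  clReleaseKernel(kernel);
  clReleaseProgram(prog);
  clReleaseContext(ctx);
  return 0;
}
```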

OpenCL was originally developed by Apple, Inc. and later submitted to the Khronos Group. It is also supported by many operating systems, including Android, FreeBSD, Linux, macOS, and Windows.

OpenCL Backend Porting and Optimization

The Caffe framework is originally written in C++ and CUDA. The CUDA layer of Caffe handles optimization of hardware resource allocation and utilization, e.g., CPU-GPU task assignment, memory management, and data transfer. Since CUDA and OpenCL differ in hardware device abstraction, memory buffer management, synchronization, and data transfers, the OpenCL backend porting is not a straightforward process.
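As a rough illustration of why the port is not mechanical, here is the same logical operation, allocating a device buffer and copying host data into it, in both APIs. This is a hedged sketch of the general API difference, not code from the Caffe branch (which wraps such calls in its SyncedMemory abstraction):

```cpp
// Same logical operation in both APIs: allocate device memory and copy
// host data to it. Sketch only; error checking omitted.
#include <cstddef>

#ifdef USE_CUDA
#include <cuda_runtime.h>

void* to_device(const float* h_data, size_t bytes) {
  float* d_data = NULL;
  cudaMalloc(&d_data, bytes);  // raw device pointer, implicit context
  cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
  return d_data;
}
#else
#include <CL/cl.h>

cl_mem to_device(cl_context ctx, cl_command_queue queue,
                 const float* h_data, size_t bytes) {
  cl_int err;
  // OpenCL buffers are opaque objects tied to a context, not raw pointers,
  // and every transfer goes through an explicit command queue.
  cl_mem d_data = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
  clEnqueueWriteBuffer(queue, d_data, CL_TRUE /*blocking*/, 0, bytes,
                       h_data, 0, NULL, NULL);
  return d_data;
}
#endif
```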

The paper breaks the OpenCL porting process into two phases. Phase 1 achieves a layerwise porting of three layers, namely the C++ machine learning interfaces, OpenCL wrappers, and GPU kernels. Layerwise porting means the layers are ported one by one and unit tested by using the originals of the other layers, to guarantee correctness and convergence of the DNN algorithm.

After all layers are ported to OpenCL in Phase 1, Phase 2 focuses on performance optimization. Profiling the OpenCL port from Phase 1 (via the AMD profiling tool CodeXL, assisted with OpenCL events and printf) reveals some big bottlenecks. OpenCL's online compilation frequently calls clBuildProgram to create each GPU kernel: for 100 iterations of Cifar training, there were 63 clBuildProgram calls that took about 68% of the time. Another bottleneck was that the convolutional layers take up most of the computation time, and BLAS performance suffered from the irregular tall-and-skinny matrix sizes coming from the different layers.

To handle these, the paper proposes three key optimization techniques: kernel caching to avoid OpenCL online compilation overheads, a batched data layout scheme to boost data parallelism, and multiple command queues to boost task parallelism. The optimization techniques effectively map the DNN problem size onto existing OpenCL math libraries, improve hardware resource utilization, and boost performance by 4.5x.
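The paper does not show its code, but a plausible minimal shape for the kernel-caching technique looks like the following sketch (mine, not the authors'): compile each distinct (source, build options) pair once, memoize the resulting cl_program, and create kernels from the cached program thereafter, so clBuildProgram stops dominating the run time.

```cpp
#include <CL/cl.h>
#include <map>
#include <string>

// Cache compiled programs keyed by (source, build options) so that
// clBuildProgram runs once per distinct kernel instead of once per use.
class KernelCache {
 public:
  KernelCache(cl_context ctx, cl_device_id dev) : ctx_(ctx), dev_(dev) {}

  // Caller releases the returned kernel with clReleaseKernel.
  cl_kernel get(const std::string& src, const std::string& opts,
                const char* name) {
    std::string key = src + "\n//opts:" + opts;
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      const char* s = src.c_str();
      cl_program prog = clCreateProgramWithSource(ctx_, 1, &s, NULL, NULL);
      clBuildProgram(prog, 1, &dev_, opts.c_str(), NULL, NULL);  // pay once
      it = cache_.emplace(key, prog).first;
    }
    // clCreateKernel on an already-built program is cheap compared to
    // clBuildProgram, which is where the reported 68% went.
    return clCreateKernel(it->second, name, NULL);
  }

 private:
  cl_context ctx_;
  cl_device_id dev_;
  std::map<std::string, cl_program> cache_;
};
```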

Evaluation

The evaluation uses the AlexNet DNN model for ImageNet. It compares the performance of Caffe with CUDA (including both cuBLAS and cuDNN v2) versus OpenCL (including both the original clBLAS and the batched data layout optimization) on an NVIDIA TitanX and an AMD R9 Fury, using the AlexNet model with mini-batch size 100. As shown in Figure 3, OpenCL Caffe with the optimizations over clBLAS matches the performance of cuBLAS Caffe.


Compared to the highly optimized machine learning cuDNN library, OpenCL Caffe still has a performance gap of 2x, as it lacks those optimizations. The authors argue that, given the current performance, OpenCL Caffe is still competitive in terms of performance per dollar, considering the market price difference between the AMD R9 Fury (about 560 dollars) and the NVIDIA TitanX (about 1000 dollars): the 2x performance gap against cuDNN roughly cancels against the nearly 2x price difference.

Cross Platform Capability Analysis

A natural question that arises is: would their OpenCL port of Caffe, which is tested on AMD GPUs, automatically work with ARM MALI GPUs as well? That would be a good test of the portability of the OpenCL port of Caffe. This has not been answered in the paper.

However, the authors caution about minor problems in compatibility. "There are some differences in specific manufacturers' extension and keywords. For example, caffe uses a lot of template in GPU kernels to support different floating point precision. But it turns out the template keywords for different manufactures are different, which adds more difficulty for the same code to run on different platform without modifications."
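To see why templates are the sticking point: standard OpenCL C is C99-based and has no C++ templates, so a templated CUDA kernel has no direct equivalent. A common workaround (my sketch, not necessarily what the branch actually does) is to write the kernel against a Dtype macro and compile one program per precision via the build options:

```cpp
#include <CL/cl.h>
#include <cstdio>

// In CUDA, Caffe can write: template <typename Dtype> __global__ void ...
// OpenCL C has no templates, so the kernel is written against a Dtype macro
// and each precision is compiled as its own program at runtime.
static const char* kReluSrc =
    "__kernel void ReLUForward(const int n, __global const Dtype* in,\n"
    "                          __global Dtype* out) {\n"
    "  int i = get_global_id(0);\n"
    "  if (i < n) out[i] = in[i] > (Dtype)0 ? in[i] : (Dtype)0;\n"
    "}\n";

cl_program build_for_precision(cl_context ctx, cl_device_id dev,
                               const char* dtype /* "float" or "double" */) {
  char opts[64];
  snprintf(opts, sizeof(opts), "-DDtype=%s", dtype);
  cl_program prog = clCreateProgramWithSource(ctx, 1, &kReluSrc, NULL, NULL);
  // Note: double additionally requires the cl_khr_fp64 extension on
  // many devices.
  clBuildProgram(prog, 1, &dev, opts, NULL, NULL);
  return prog;
}
// Usage: build_for_precision(ctx, dev, "float") and ..., "double") together
// replace the single templated CUDA kernel ReLUForward<Dtype>.
```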

OpenCL support in deep learning frameworks is still not great, but hopefully it is getting better every day, as this paper shows.

The presentation slides for the paper are also available.
