heal.abstract |
Artificial Intelligence (AI) is one of the fastest growing areas of research and has revolutionized a variety of application domains. Modern artificial neural networks (ANNs) impose increased computational complexity, and as a result, general-purpose CPUs struggle to provide sufficient performance. For this reason, developers who bring AI to broader customer bases are turning to smaller, more power-efficient AI microchips and accelerators. Anticipating this trend, Google provides the Tensor Processing Units (TPUs) to accelerate AI inference in data centers and at the edge.
In this thesis, targeting embedded AI, we focus on the Edge TPU. The Edge TPU is a small Application-Specific Integrated Circuit (ASIC) that delivers high performance in a small physical and power footprint, enabling the deployment of high-accuracy AI at the edge. It is dedicated hardware that parallelizes certain computations in order to speed up inference. The Edge TPU processor is capable of performing 4 trillion operations per second (4 TOPS), using 0.5 Watt for each TOPS (2 TOPS per Watt). However, the architecture and the instructions of such an AI-specific accelerator impose hardware challenges and limitations on non-AI, general-purpose workloads. Our goal is to address this challenge by proposing a custom methodology for building Edge TPU-compatible networks for general-purpose calculations. Moreover, we propose a solution for overcoming the barrier of 8-bit-only operations on the TPU by breaking N-bit algebraic computations into 8-bit parts. In this way, we support both element-wise and matrix multiplications for larger bit-widths without a significant decrease in performance.
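The idea of breaking N-bit computations into 8-bit parts can be illustrated with a minimal sketch. The abstract does not specify the exact decomposition used in the thesis, so the code below is an assumption: it handles unsigned 16-bit operands, runs in NumPy on a CPU rather than on the Edge TPU, and uses hypothetical helper names (`split_u16`, `mul16_via_8bit`, `matmul16_via_8bit`). It relies on the identity (256·a_h + a_l)(256·b_h + b_l) = 65536·a_h·b_h + 256·(a_h·b_l + a_l·b_h) + a_l·b_l, which holds for element-wise products and, by linearity, for matrix products.

```python
import numpy as np

def split_u16(x):
    """Split unsigned 16-bit values into (high, low) 8-bit parts."""
    x = np.asarray(x, dtype=np.uint16)
    high = (x >> 8).astype(np.uint8)
    low = (x & 0xFF).astype(np.uint8)
    return high, low

def _widen(*parts):
    # Partial products need wider accumulators than 8 bits;
    # int64 is an illustrative choice, not a hardware claim.
    return [p.astype(np.int64) for p in parts]

def mul16_via_8bit(a, b):
    """Element-wise 16-bit multiply built from four 8-bit x 8-bit products."""
    ah, al = split_u16(a)
    bh, bl = split_u16(b)
    ah, al, bh, bl = _widen(ah, al, bh, bl)
    return (ah * bh) * 65536 + (ah * bl + al * bh) * 256 + al * bl

def matmul16_via_8bit(A, B):
    """16-bit matrix multiply built from four matmuls on 8-bit operands."""
    Ah, Al = split_u16(A)
    Bh, Bl = split_u16(B)
    Ah, Al, Bh, Bl = _widen(Ah, Al, Bh, Bl)
    return (Ah @ Bh) * 65536 + (Ah @ Bl + Al @ Bh) * 256 + Al @ Bl

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(0, 2**16, size=(4, 5), dtype=np.uint16)
    B = rng.integers(0, 2**16, size=(5, 3), dtype=np.uint16)
    reference = A.astype(np.int64) @ B.astype(np.int64)
    print(np.array_equal(matmul16_via_8bit(A, B), reference))
```

In this sketch each 16-bit operation costs four 8-bit operations plus a few shifts and additions. On an accelerator, the four partial products would be the part mapped to 8-bit hardware, while the recombination step would need wider arithmetic.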
Initially, we benchmark the TPU to explore and evaluate its capabilities, using both pre-trained and custom networks. For our Ship Detection network, we achieve 1000-2000 FPS with no significant accuracy loss. The experimental results reveal significant acceleration in comparison to the ARM A53 co-processor and other embedded devices. Overall, the Edge TPU provides remarkable speedup for medium- and large-sized CNNs and MLPs, as well as for custom models dominated by matrix multiplications. Matrix multiplication is accelerated by up to 4x compared to 8-bit quantized ARM execution and by up to 7x compared to 32-bit floating-point execution. Moreover, for classic Digital Signal Processing (DSP) operations, such as the Sobel Edge Detector and Image Binning, the Edge TPU provides up to 6x better performance than the ARM A53. |
en |