Acceleration of computer based simulation, image processing, and data analysis using computer clusters with heterogeneous accelerators


Chong Chen

Date of Award


Degree Name

Ph.D. in Electrical Engineering


Department of Electrical and Computer Engineering


Advisor: Tarek Taha


With the limits to frequency scaling in microprocessors due to power constraints, many-core and multi-core architectures have become the norm over the past decade. The goal of this work is to accelerate key computer simulation tools, data processing algorithms, and data analysis algorithms on multi-core and many-core computer clusters, and to analyze their accelerated performance. The main contributions of this dissertation are:

1. Acceleration of vector bilateral filtering for hyperspectral imaging with GPGPU: a GPGPU-based implementation of vector bilateral filtering, called vBF_GPU, was developed in this dissertation. vBF_GPU uses multiple threads to process one pixel of a hyperspectral image, which improves the efficiency of the cache memory. The memory access operations of vBF_GPU were optimized to reduce the data transfer cost of the GPGPU program. The experimental results indicate that vBF_GPU can provide up to 19x speedup over a multi-core CPU implementation and up to 3x speedup over a naive GPGPU implementation of vector bilateral filtering. vBF_GPU can process hyperspectral images with up to 266 spectral bands, and the window size of the bilateral filter is unlimited.

2. Optimized acceleration of the alternating least squares algorithm using a GPGPU cluster: this study presents an optimized implementation of the alternating least squares (ALS) algorithm for large-scale, matrix-factorization-based recommendation systems. A GPGPU-optimized implementation was developed to run the batch solver in the ALS algorithm. An equivalent mathematical form of the equations was used to reduce the computational complexity of the ALS algorithm. A distributed version of this implementation was also developed and tested on a cluster of GPGPUs. The experimental results indicate that the application running on a GPGPU can achieve up to 3.8x speedup over an 8-core CPU, and that the distributed implementation scales well on a computer cluster with multiple GPGPU accelerators.

3. Accelerating a preconditioned iterative solver for a very large sparse linear system on clusters with heterogeneous accelerators: this study presents a parallelized preconditioned conjugate gradient solver for large sparse linear systems on clusters with heterogeneous accelerators. The primary accelerator examined in this study is the Intel Xeon Phi, which is based on the Many Integrated Core (MIC) architecture. A highly optimized parallel solver was also implemented on clusters with NVIDIA GPGPU accelerators and on clusters of Intel Xeon CPUs with the Sandy Bridge architecture. Several approaches were applied to reduce the communication cost between compute nodes in the cluster. A lightweight load balancer was developed for the Xeon Phi based solver. Our results show that the Xeon Phi based iterative solver is faster than the GPGPU-based solver on one to two compute nodes when the balancer is applied, particularly when the non-zero elements are unevenly distributed. For a larger number of compute nodes, however, the GPGPU cluster performed better. An analysis of the scalability of the iterative solver on distributed-memory systems is also presented.

The experimental results and analysis indicate that the acceleration work completed in this dissertation leads to substantial speedups, and that the optimization methods used in these studies can benefit future research in high-performance computing.
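For readers unfamiliar with the solver named in the third contribution, the following is a minimal, single-process sketch of a preconditioned conjugate gradient iteration in Python. It is not the dissertation's implementation: the Jacobi (diagonal) preconditioner, the CSR storage layout, the function names `csr_matvec` and `pcg_jacobi`, and the small test matrix are all assumptions chosen for illustration, and the sketch contains no cluster communication or accelerator code.

```python
def csr_matvec(indptr, indices, data, x):
    """Multiply a sparse matrix in CSR form by a dense vector."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def pcg_jacobi(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite CSR matrix A.

    Assumption: a Jacobi preconditioner (inverse of the diagonal of A)
    stands in for whatever preconditioner the dissertation uses.
    """
    n = len(b)
    inv_diag = [0.0] * n
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            if indices[k] == row:
                inv_diag[row] = 1.0 / data[k]

    x = [0.0] * n
    r = list(b)                                  # residual b - A x, with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # preconditioned residual
    p = list(z)                                  # initial search direction
    rz = dot(r, z)

    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        Ap = csr_matvec(indptr, indices, data, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical 3x3 system: A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]],
# with b chosen so that the exact solution is [1, 2, 3].
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
b = [6.0, 10.0, 8.0]
x, iterations = pcg_jacobi(indptr, indices, data, b)
```

In a distributed setting, the sparse matrix-vector product and the dot products are the steps that typically require communication between nodes, which is presumably where the communication-reduction approaches mentioned in the abstract apply.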


Heterogeneous distributed computing systems, Parallel computers, Multiprocessors, Computer Engineering, parallel computing, distributed computing, GPGPU, Xeon Phi, Preconditioned Iterative Solver, ALS, bilateral filtering

Rights Statement

Copyright 2016, author