Design Space Exploration and Architecture Design for Inference and Training Deep Neural Networks

Date of Award


Degree Name

Ph.D. in Electrical and Computer Engineering


Department of Electrical and Computer Engineering


Tarek Taha


Deep Neural Networks (DNNs) are widely used across application domains and achieve remarkable results. However, DNNs require a large number of computations for both the inference and training phases, so hardware accelerators are designed and implemented to compute DNN models efficiently. Many accelerators have been proposed for DNN inference, while only a limited set of DNN training accelerators exists, and almost all of them are highly custom-designed and limited in the types of networks they can process. This dissertation focuses on designing novel architectures and tools for the efficient training of deep neural networks, particularly for edge applications. We propose several novel architectures and a design space exploration tool. The proposed architectures enable efficient processing of DNNs, and the design space exploration model helps DNN architects explore the design space of accelerator architectures for both inference and training and home in on optimal architectures under different hardware constraints.

The first area of contribution in this dissertation is the design of Socrates-D-1, a digital multicore on-chip learning architecture for deep neural networks. Its processing unit design demonstrates the capability to process the training phase of DNNs efficiently. A statically time-multiplexed routing mechanism and a co-designed mapping method are also introduced to improve overall throughput and energy efficiency. Experimental results show a 6.8 to 22.3 times speedup and more than a thousand times better energy efficiency over a GPGPU. The proposed architecture is also compared with several DNN training accelerators and achieves the best energy and area efficiencies.

The second area of contribution is the design of Socrates-D-2, an enhanced version of Socrates-D-1. This architecture presents a novel neural processing unit design: a dual-ported eDRAM memory replaces the double eDRAM memory used in Socrates-D-1. In addition, a new mapping method utilizing neural network pruning techniques is introduced and evaluated on several datasets. The co-designed mapping methods help the architecture achieve both throughput and energy efficiency without loss of accuracy. Compared with Socrates-D-1, the new architecture shows on average 1.2 times higher energy efficiency and 1.25 times better area efficiency.

The third area of contribution is the development of TRIM, a design space exploration model for DNN accelerators. TRIM is an infrastructure model that can explore the design space of DNN accelerators for both training and inference. It utilizes a highly flexible hardware template that can model a wide range of architectures. TRIM explores the design space of data-partition and reuse strategies for each hardware architecture and estimates the optimal time and energy. Experimental results show that TRIM achieves more than eighty percent accuracy in its time and energy estimates. To the best of our knowledge, TRIM is the first infrastructure to model and explore the design space of DNN accelerators for both training and inference.

The fourth area of contribution is a set of design space explorations using TRIM. Through several case studies, we explored the design space of DNN accelerators for training and inference, compared different dataflows, and showed the impact of dataflow on the efficient processing of DNNs. We showed how to use TRIM to optimize the dataflow, explored the design space of spatial architectures, and presented the results of varying hardware choices. Based on these exploration results, several high-throughput, energy-efficient DNN training accelerators are presented.

The fifth area of contribution is the design of an FPGA-based training accelerator for edge devices. We designed a CPU-FPGA accelerator that operates under 5 W. TRIM is utilized for dataflow optimization and hardware parameter selection. Experimental results show a 1.93 times speedup and 1.43 times better energy efficiency for end-to-end training over a CPU implementation.
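The exploration loop the abstract describes, enumerating data-partition (tiling) choices against an analytical time/energy model and picking the best configuration, can be sketched in miniature as follows. This is an illustrative assumption, not TRIM's actual cost model: the layer shape, energy constants, and cost formulas below are hypothetical placeholders chosen only to show the structure of such a search.

```python
# Illustrative sketch (NOT TRIM's actual model): exhaustively search a tiny
# design space of output tilings for one convolution layer and pick the
# configuration with the lowest estimated energy-delay product.
from itertools import product

# Hypothetical layer and hardware parameters (assumptions for illustration).
OUT_CH, IN_CH, OUT_HW, K = 64, 64, 56, 3       # conv layer shape
PES = 256                                      # number of processing elements
E_MAC, E_SRAM, E_DRAM = 1.0, 6.0, 200.0        # relative energy per operation/access

def estimate(tile_oc, tile_hw):
    """Rough analytical time/energy model for one (tile_oc, tile_hw) choice."""
    macs = OUT_CH * IN_CH * OUT_HW * OUT_HW * K * K
    # Utilization drops if the tile under-fills the PE array.
    util = min(1.0, (tile_oc * tile_hw * tile_hw) / PES)
    latency = macs / (PES * util)
    # Assume weights are re-fetched from DRAM once per output tile.
    n_tiles = (OUT_HW // tile_hw) ** 2 * (OUT_CH // tile_oc)
    dram = n_tiles * tile_oc * IN_CH * K * K
    energy = macs * E_MAC + macs * E_SRAM + dram * E_DRAM
    return latency, energy

# Enumerate the design space and minimize the energy-delay product.
best = min(
    product([8, 16, 32, 64], [7, 14, 28]),
    key=lambda cfg: estimate(*cfg)[0] * estimate(*cfg)[1],
)
print("best tiling (out-channels, tile-size):", best)
```

A real infrastructure like TRIM would replace the toy `estimate` function with a validated multi-level memory and dataflow model, but the outer structure, enumerating partitions and ranking them by estimated time and energy, is the same.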


Electrical Engineering, Computer Engineering, Artificial Intelligence, deep neural network, DNN, computer architecture, DNN accelerator, design space exploration, edge computing, hardware architecture

Rights Statement

Copyright 2021, author.