Attention Based Temporal Convolutional Neural Network for Real-time 3D Human Pose Reconstruction

Date of Award


Degree Name

Ph.D. in Engineering


Department of Electrical and Computer Engineering


Advisor: Vijayan Asari


Computer vision and artificial intelligence aim to give computers a high-level understanding of images and videos. By imitating the way the human brain perceives and understands multimodal information, a neural network can implicitly learn the intricate structure of large-scale data. Deep learning allows computational models composed of multiple processing layers to learn representations of data at multiple levels of abstraction. The main objective of this dissertation research is to develop robust deep learning architectures for human detection, pose estimation, and 3D pose reconstruction. 3D human pose estimation is a classic vision task that enables numerous applications, from activity recognition to human-robot interaction and virtual/augmented reality. We present a deep convolutional neural network architecture that incorporates a multi-scale feature fusion strategy for human detection against complex backgrounds. To detect the human pose in 2D images and project it into 3D space for 3D pose reconstruction, we need to obtain human keypoints such as facial landmarks and the joints of the hands and body. We present a deep convolutional neural network architecture for human keypoint detection and 2D pose estimation. Our approach to 3D pose prediction from 2D image measurements is based on two key observations: (1) predicting each frame individually often yields temporally incoherent and jittery estimates; (2) the error rate can be remarkably reduced with an enhanced 2D pose input. Therefore, we propose an attention-based temporal convolutional neural network (ATCN) that guides the network to adaptively identify important frames. ATCN can also extract a more significant portion of the intermediate output from each processing layer to estimate the 3D pose. A multi-scaled dilated convolution (MDC) method is employed to model long-range dependencies among frames and achieve large temporal receptive fields. MDC helps to handle partial occlusions, fast motion, and complex background conditions. The ATCN architecture is built so that it can be easily adapted to a causal model, enabling real-time performance. We tested the human detector and 2D pose estimator on the MS COCO dataset and observed outstanding performance compared with several state-of-the-art methods. We performed an extensive quantitative evaluation of ATCN with MDC on standard benchmark datasets such as Human3.6M and HumanEva for 3D pose estimation, and observed that our method outperforms all the state-of-the-art 3D pose estimation systems with a significant improvement in accuracy. Future directions focus on 3D pose reconstruction of multiple persons in monocular video through detection, re-identification, and tracking of human keypoints.
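
The three temporal ideas named in the abstract (dilated convolution for a large receptive field, a causal variant for real-time use, and attention that weights frames) can be illustrated with a small sketch. This is not the dissertation's ATCN or MDC implementation: the abstract gives no kernel sizes, dilation rates, or attention formulation, so the function names, the kernel size of 3, the dilation rates, and the softmax weighting below are all assumptions chosen only for illustration.

```python
import math


def receptive_field(kernel_size, dilations):
    """Number of input frames that influence one output of stacked dilated 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_conv1d(x, weights, dilation, causal=False):
    """Dilated 1D convolution over a list of floats, zero-padded to keep the length.

    causal=False centres the kernel on frame t (uses past and future frames).
    causal=True shifts the kernel so only frame t and earlier frames are used.
    """
    k = len(weights)
    span = (k - 1) * dilation            # distance between first and last tap
    left = span if causal else span // 2  # how far the first tap sits before t
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - left + i * dilation
            if 0 <= idx < len(x):         # taps outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def attention_pool(frames, scores):
    """Softmax-weighted average of per-frame values; higher score means more weight."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = sum(w * f for w, f in zip(weights, frames))
    return pooled, weights


if __name__ == "__main__":
    # Hypothetical stack: kernel size 3 with dilations 1, 3, 9, 27.
    print(receptive_field(3, [1, 3, 9, 27]))

    impulse = [0.0] * 4 + [1.0] + [0.0] * 4
    print(dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2))
    print(dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2, causal=True))
```

In the impulse example, the symmetric version responds at frames before and after the impulse, whereas the causal version responds only at the impulse frame and later, which is the property that lets a model of this kind run on a live stream without waiting for future frames.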


Computer Engineering

Rights Statement

Copyright © 2019, author