Learning latent temporal manifolds for recognition and prediction of multiple actions in streaming videos using deep networks

Date of Award


Degree Name

Ph.D. in Electrical Engineering


Department of Electrical and Computer Engineering


Advisor: Vijayan K. Asari


Recognizing multiple types of actions appearing in a continuous temporal order from a streaming video is the key to many possible applications, ranging from real-time surveillance to egocentric motion for human-computer interaction. Current state-of-the-art algorithms focus either on holistic video representation or on finding a specific activity in video sequences. Their major drawback is that they work only for applications pertaining to unconstrained video search from the web and require the complete sequence before reporting which actions are present.
In this dissertation, we propose an algorithm to detect and recognize multiple actions in a streaming sequence at every instant. The approach recognizes the type of action being performed and also provides the percentage of completion of that action at every instant, in real time. The system is invariant to the number of frames and to the speed at which the action is performed. Apart from these benefits, the proposed model can also predict the motion descriptors at future instants corresponding to the action present. Since human motion is inherently continuous, the algorithm presented in this dissertation computes novel motion descriptors based on the dense optical flow at every instant and evaluates their variations along the temporal domain using deep learning techniques.
For each action type, we compute a non-linear transformation from the motion descriptor space into the latent temporal space using stacked autoencoders, where the transformation is learned from the training patterns of that action. The latent features thus obtained form a temporal manifold, and the transitions along it are modeled using Conditional Restricted Boltzmann Machines (CRBMs). Using these trained autoencoders and CRBMs for every action type, we can perform inference on multiple latent temporal action manifolds at any instant from a set of streaming input frames.
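The per-action pipeline described above (encode a motion descriptor into a latent space, then model transitions in that space with a CRBM) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the `ActionModel` class, the single tanh encoding layer standing in for a stacked autoencoder, the first-order CRBM with a one-step mean-field prediction, the random weights standing in for trained parameters, the layer sizes, the action labels, and the rule of picking the action whose model best predicts the current latent feature.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; placeholders for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


class ActionModel:
    """Hypothetical per-action model: one encoder layer plus a first-order CRBM."""

    def __init__(self, n_desc, n_latent, n_hidden, seed):
        rng = random.Random(seed)
        self.enc = rand_matrix(n_latent, n_desc, rng)  # descriptor -> latent
        self.W = rand_matrix(n_hidden, n_latent, rng)  # CRBM visible-hidden weights
        self.A = rand_matrix(n_latent, n_latent, rng)  # past latent -> visible bias
        self.B = rand_matrix(n_hidden, n_latent, rng)  # past latent -> hidden bias

    def encode(self, descriptor):
        """Map a motion descriptor to a latent feature (single tanh layer)."""
        return [math.tanh(a) for a in matvec(self.enc, descriptor)]

    def predict_next(self, z_prev):
        """One mean-field CRBM step: predict the next latent feature from z_prev."""
        vis_bias = matvec(self.A, z_prev)  # dynamic visible bias from the past frame
        hid_bias = matvec(self.B, z_prev)  # dynamic hidden bias from the past frame
        # Initialise the visible units at z_prev and compute hidden probabilities.
        h = [sigmoid(a + b) for a, b in zip(matvec(self.W, z_prev), hid_bias)]
        # Mean of Gaussian visible units: dynamic bias plus W^T h.
        wt_h = [sum(self.W[j][i] * h[j] for j in range(len(h)))
                for i in range(len(z_prev))]
        return [a + b for a, b in zip(vis_bias, wt_h)]


def score_actions(models, desc_prev, desc_curr):
    """Assumed decision rule: lowest latent prediction error wins."""
    errors = {}
    for name, model in models.items():
        z_prev, z_curr = model.encode(desc_prev), model.encode(desc_curr)
        z_pred = model.predict_next(z_prev)
        errors[name] = sum((p - c) ** 2 for p, c in zip(z_pred, z_curr))
    return min(errors, key=errors.get), errors


# Two illustrative action models and two random stand-in descriptors.
models = {
    "walking": ActionModel(n_desc=8, n_latent=3, n_hidden=4, seed=0),
    "running": ActionModel(n_desc=8, n_latent=3, n_hidden=4, seed=1),
}
rng = random.Random(42)
prev = [rng.random() for _ in range(8)]
curr = [rng.random() for _ in range(8)]
best, errors = score_actions(models, prev, curr)
```

With trained weights, the same `predict_next` step could in principle be iterated to roll the latent trajectory forward, which is one way to read the predictive capability mentioned above; how the dissertation actually performs inference and prediction is not specified in this abstract.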
Our model achieved a per-frame action recognition accuracy of 93% on the KTH dataset and predicted future action instances with an accuracy of 84%. On the UCF Sports dataset, it achieved a per-frame recognition accuracy of 84% and a prediction accuracy of around 69%. We therefore believe that the proposed model can benefit applications in human-computer interaction, gaming, and IP surveillance, where action classification using temporal manifolds and its predictive capability are crucial.


Human activity recognition--Data processing, Image processing--Digital techniques, Pattern recognition systems, Computer Engineering, Computer Science, Electrical Engineering, Statistics, motion descriptors, shape descriptors, principal component analysis, neural networks, autoencoder, restricted Boltzmann machine, conditional restricted Boltzmann machine, deep learning, action recognition, latent temporal manifold, action localization

Rights Statement

Copyright 2015, author