Human behavior understanding is a fundamental problem in computer vision. It is an important component of numerous real-life applications, such as human-computer interaction, sports analysis, and video search. In this thesis we work on the problem of action recognition and localization, which is a crucial part of human behavior understanding. Action recognition explains what a human is doing in the video, while action localization indicates where and when in the video the action is happening. We focus on two important aspects of the problem: (1) capturing intra-class variation of action categories and (2) inference of action location. Manual annotation of videos with fine-grained action labels and spatio-temporal action locations is a nontrivial task, so weakly supervised learning approaches are of interest.

Real-life actions are complex, and the same action can look different in different scenarios. A single template cannot capture such data variability. Therefore, for each action category we automatically discover small clusters of examples that are visually similar to each other. A separate classifier is learnt for each cluster, so that more class variability is captured. In addition, we establish a direct association between a novel test example and examples from the training data, and demonstrate how metadata (e.g., attributes) can be transferred to test examples.

Weakly supervised learning for action recognition and localization is another challenging task: it requires automatic inference of action location for all training videos during learning. Initially, we simplify this problem and try to find discriminative regions in videos that lead to better recognition performance. The regions are inferred such that they are visually similar across all videos of the same category.
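The per-category clustering idea above can be sketched in a few lines. This is only an illustrative stand-in, not the thesis's actual method: k-means plays the role of the automatic cluster discovery, and each cluster's mean acts as a simple per-cluster template in place of a learnt classifier; all function names here are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns cluster labels and centers for the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every example to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # re-estimate each center as the mean of its assigned examples
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def train_per_class_templates(X, y, k=2):
    """For each action class, discover k small clusters among its examples
    and keep one template (the cluster mean) per cluster -- a stand-in for
    training a separate classifier per cluster."""
    return {c: kmeans(X[y == c], k)[1] for c in np.unique(y)}

def predict(templates, x):
    """A class's score is its best (closest) cluster template's response,
    so multi-modal classes are handled by whichever mode matches."""
    dists = {c: ((centers - x) ** 2).sum(-1).min()
             for c, centers in templates.items()}
    return min(dists, key=dists.get)
```

With two clusters per class, a class whose examples form two distinct visual modes (e.g. two camera viewpoints of the same action) is matched by whichever mode is closest, which a single template could not do.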
Ideally, the regions should correspond to the action location; however, there is a gap between inferred discriminative regions and semantically meaningful regions representing action location. To fill the gap, we incorporate human eye gaze data to drive the inference of regions during learning. This allows inferring regions that are both discriminative and semantically meaningful. Furthermore, we use the inferred regions and learnt action model to assist top-down eye gaze prediction.
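The gaze-driven region inference described above can be caricatured as a scoring rule. This is a hedged sketch under assumed details (the combination weight, the overlap measure, and all names are hypothetical, not taken from the thesis): each candidate region's appearance-model response is combined with the fraction of eye-gaze fixations it covers, and the top-scoring region is selected as the latent action location.

```python
import numpy as np

def gaze_fraction(region, fixations):
    """Fraction of (x, y) fixation points that fall inside
    region = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    inside = [x0 <= x <= x1 and y0 <= y <= y1 for x, y in fixations]
    return sum(inside) / max(len(fixations), 1)

def select_region(candidates, appearance_scores, fixations, lam=0.5):
    """Pick the candidate maximizing appearance score + lam * gaze coverage,
    so the inferred region is both discriminative and gaze-supported."""
    scores = [s + lam * gaze_fraction(r, fixations)
              for r, s in zip(candidates, appearance_scores)]
    return int(np.argmax(scores))
```

Under this toy rule, two regions that the appearance model scores equally are disambiguated by where the observers actually looked, which is the intuition behind using gaze to make the inferred regions semantically meaningful.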
Copyright is held by the author.
The author granted permission for the file to be printed and for the text to be copied and pasted.
Thesis advisor: Mori, Greg