Learning compositional models for activity understanding

Resource type
Thesis type
(Thesis) Ph.D.
Date created
Compositionality serves as a key design principle in artificial intelligence algorithms. In this thesis, we focus on developing compositional models for activity understanding. The core idea is to design compositional representations for human activity videos that are specific to the downstream task and are learned from the different types of compositional information available at various granularities of the videos. We apply this idea to a diverse set of video tasks aimed at understanding realistic activities. First, we introduce the task of generating human-object interactions in a zero-shot compositional setting and propose a generative model that uses an object-centric spatio-temporal scene graph to generate videos. Second, we address temporal action localization and develop an end-to-end learnable transformer model that represents the input video as graphs over video segments and the output space of actions as graphs of abstract learnable entities. Third, we focus on long-term action anticipation and design a transformer-based model, trained with a two-stage learning approach, that employs segment-level and video-level representations for action anticipation. Overall, we demonstrate the benefits of designing compositional representations for human activity videos.
132 pages.
Copyright statement
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Mori, Greg
Member of collection
Download file Size
etd22405.pdf 8.95 MB
