Resource type
Thesis
Thesis type
Ph.D.
Date created
2023-04-18
Authors/Contributors
Author: Nawhal, Megha
Abstract
Compositionality serves as a key design principle in artificial intelligence algorithms. In this thesis, we focus on developing compositional models for activity understanding. The core idea of this thesis is to design compositional representations for human activity videos that are specific to the downstream task and are learned using different types of compositional information available at various granularities of the videos. We apply this idea to a diverse set of video tasks aimed at understanding realistic activities. First, we introduce the task of generating human-object interactions in a zero-shot compositional setting and propose a generative model that uses an object-centric spatio-temporal scene graph to generate videos. Second, we address the problem of temporal action localization and develop an end-to-end learnable transformer model that represents the input video as graphs over video segments and the output space of actions as graphs of abstract, learnable entities. Third, we focus on the task of long-term action anticipation and design a transformer-based model trained with a two-stage learning approach to employ segment-level and video-level representations for action anticipation. Overall, we demonstrate the benefits of designing compositional representations for human activity videos.
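The localization approach in the abstract can be illustrated with a brief sketch. What follows is a minimal, hypothetical example (assuming PyTorch; the class name `SegmentGraphTransformer` and all hyperparameters are invented for illustration and are not the thesis's actual architecture or code) of the general idea of encoding a video as a graph over segment features via self-attention and decoding actions through a small set of learnable entities.

```python
# Hypothetical sketch, not the thesis implementation: segment features act as
# nodes of a fully connected graph (realized by self-attention), and a fixed
# set of learnable query entities decodes action labels and temporal spans.
import torch
import torch.nn as nn

class SegmentGraphTransformer(nn.Module):
    def __init__(self, feat_dim=256, num_queries=20, num_actions=10):
        super().__init__()
        # Self-attention over segments: edges of a fully connected segment graph.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Abstract learnable entities that attend to the encoded segment graph.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.classify = nn.Linear(feat_dim, num_actions + 1)  # +1 for "no action"
        self.localize = nn.Linear(feat_dim, 2)                # (center, width) in [0, 1]

    def forward(self, segment_feats):
        # segment_feats: (batch, num_segments, feat_dim) pre-extracted features.
        nodes = self.encoder(segment_feats)
        q = self.queries.unsqueeze(0).expand(segment_feats.size(0), -1, -1)
        entities = self.decoder(q, nodes)
        return self.classify(entities), self.localize(entities).sigmoid()

# Example: 2 videos, each summarized by 32 segment features.
model = SegmentGraphTransformer()
logits, spans = model(torch.randn(2, 32, 256))
print(logits.shape, spans.shape)  # (2, 20, 11) (2, 20, 2)
```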
Document
Extent
132 pages.
Identifier
etd22405
Copyright statement
Copyright is held by the author(s).
Supervisor or Senior Supervisor
Thesis advisor: Mori, Greg
Language
English
| Download file | Size |
|---|---|
| etd22405.pdf | 8.95 MB |