Lan, Tian

Resource type

Thesis

Thesis type

(Thesis) Ph.D.

Date created

2013-08-29

Authors/Contributors

Author: Lan, Tian

Abstract

Visual recognition is a fundamental problem in computer vision. It is important to many applications such as surveillance, security, entertainment, and health care. The field has seen tremendous growth over the past decade. However, visual recognition remains a challenging problem for computers. One of the main reasons is the clear gap between human descriptions of the visual world and the output of current visual recognition systems. The semantic space humans use to describe the visual world is highly structured: besides naming an object (or action), humans additionally describe it at multiple levels of detail, ranging from fine-grained descriptions (e.g. color, shape) to higher-level relationships among multiple objects (or actions). How to represent and learn the rich structures in visual data is the focus of this dissertation. We address two fundamental problems in visual recognition: understanding human activities and understanding images. For both problems, we start with flat structures and move towards richer hierarchical structures. First, we develop figure-centric models for joint action recognition and localization that capture the spatio-temporal arrangement of an action over video sequences. Then, we propose hierarchical models for recognizing multi-person activities in entire scenes. Multiple levels of detail, including actions, social roles, and a scene-level event, are encoded in a unified learning framework. For understanding images, we follow the same route, first developing flat models to capture the spatial structures in object queries for image retrieval, and then moving towards hierarchical models to handle more complex multi-level semantic labelings for object detection. This dissertation contributes to visual recognition by learning structured models, in particular hierarchical models for multi-level activity recognition and object detection.
The work presented in this dissertation attempts to provide insights into several critical yet open questions in visual recognition: How should a visual entity (action, object, scene) be labeled? How many levels of detail should we consider? How should a recognition problem be represented? How should the complex structures be modeled? What is the desirable output of a recognition system?

Identifier

etd8031

Copyright statement

Copyright is held by the author.

Permissions

The author granted permission for the file to be printed, but not for the text to be copied and pasted.

Scholarly level

Graduate student (PhD)

Supervisor or Senior Supervisor

Thesis advisor: Mori, Greg

Member of collection

Computing Science Theses

Download file	Size
etd8031_TLan.pdf	41.2 MB

From Flat to Hierarchical: Modeling Structures in Visual Recognition
