A grand challenge of computer vision is to enable machines to ``see people''. A solution to this challenge will enable numerous applications in fields such as security, surveillance, entertainment, human-computer interaction, and biomechanics. This dissertation focuses on two problems in the general area of ``looking at people'': human pose estimation and human action recognition. The first problem is to identify the body parts of a person from a still image. The second problem is to recognize the actions of a person from a video sequence. We formulate the solutions to these problems as learning structured models. In particular, we propose models and algorithms to address the following structures: (1) human pose estimation as a structured output problem, for which we propose a boosted multiple tree model that captures the spatial and occlusion constraints between human body parts; (2) temporal structure in human action recognition, for which we present two models based on the ``bag-of-words'' representation that capture the temporal structures of video sequences; and (3) human action recognition as classification with hidden structures, for which we develop a model based on the hidden conditional random field. We also propose a max-margin learning method for training this model. The learning method is general enough to be applied to many other applications in computer vision, and even to other areas of computer science.
Copyright is held by the author.