Many visual recognition tasks can be formulated as multiple instance problems. Two examples are image categorization and video classification, where the instances are image segments and video frames, respectively. In this setting, detecting and counting the instances of interest can improve recognition in a variety of applications. For example, classifying the collective activity of a group of people can be guided by counting the actions of the individual people. Further, encoding cardinality-based relations can reduce sensitivity to clutter or ambiguity, such as individuals not involved in a group activity or irrelevant segments/frames in an image/video. Multiple instance learning (MIL) aims to exploit these counting relations in order to recognize patterns from weakly supervised data. Contrary to standard supervised learning, where each training instance is labeled, in the MIL paradigm a bag of instances shares a single label, and the individual instance labels are hidden. This weak supervision significantly reduces the cost of full annotation in many recognition tasks, but it makes learning and recognition more challenging.

In a general MIL problem, three major issues arise: how to infer instance labels without full supervision; how cardinality relations between instance labels contribute to predicting the bag label; and how the bag as a whole entity, which integrates the instances, is labeled. In this thesis, we address all of these challenges. To this end, we first propose a boosting framework for MIL that can model a wide range of soft and linguistic cardinality relations. Next, we propose a probabilistic graphical model that captures the interactions among instances, instance labels, and the whole bag. This is a general and flexible model that can encode any cardinality-based relation.
To train this model, we introduce novel algorithms based on latent max-margin classification, kernel learning, and gradient boosting, yielding rich, high-capacity models for bag classification. We evaluate the proposed methods in various applications, including image classification, human group activity recognition, human action recognition, video recognition, unconstrained video event detection, and video summarization.
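To make the cardinality-based view of MIL concrete, the following is a minimal illustrative sketch (not the thesis's actual model): it shows how different aggregation rules over hidden instance labels, from the classic "at least one positive" MIL assumption to count-based and linguistic quantifiers such as "most", produce different bag labels. The function name and rule names are hypothetical, chosen only for this example.

```python
def bag_label(instance_labels, rule="standard", k=2):
    """Aggregate hidden binary instance labels (0/1) into a bag label.

    Hypothetical illustration of cardinality-based bag labeling rules:
      - "standard": classic MIL assumption, at least one positive instance
      - "count":    cardinality rule, at least k positive instances
      - "majority": linguistic quantifier "most" instances are positive
    """
    count = sum(instance_labels)
    if rule == "standard":
        return int(count >= 1)
    if rule == "count":
        return int(count >= k)
    if rule == "majority":
        return int(count > len(instance_labels) / 2)
    raise ValueError(f"unknown rule: {rule}")

# A bag of five instances, two of them positive:
labels = [0, 1, 0, 1, 0]
print(bag_label(labels, "standard"))       # positive under the classic rule
print(bag_label(labels, "count", k=3))     # negative: fewer than 3 positives
print(bag_label(labels, "majority"))       # negative: positives are not a majority
```

The point of the sketch is that the same bag can receive different labels depending on the cardinality relation assumed, which is why learning these relations (rather than fixing the classic "at least one" rule) matters for the applications above.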
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.