The success of machine learning relies heavily on data, and is therefore limited when sufficient annotation cannot be provided for a standard supervised training pipeline. Weakly-supervised learning addresses this lack of sufficiently annotated training data by relaxing the annotation requirement to a level weaker than the desired output. We study the problem of weakly-supervised localization and grounding of actions and objects, which enables training the corresponding machine learning models without ground-truth location annotations. We propose to exploit structural information in weakly-supervised data to facilitate the learning of such models, and present three novel approaches to these tasks. In the first work, we explore the temporal structure of videos and design an attention-based loss function that helps action localization focus on distinctive moments, improving robustness and performance under the weakly-supervised setting. In the second work, we utilize the contextual structure between visual and textual data and propose an iterative context-aware refinement of the textual and visual representations for the weakly-supervised visual grounding task, giving the semantic embeddings the flexibility to resolve ambiguity and adapt to different grounding scenarios. In the third work, we take advantage of higher-level relational structure across data to extend a previous interpretability method to embedding networks for localization; the result also serves as a visual explanation for interpreting this particular type of neural network.
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Mori, Greg