Recently, several deep learning models have been proposed that operate on graph-structured data. These models, known as graph neural networks, emphasize methods for reasoning about non-Euclidean data. By combining end-to-end and handcrafted learning, graph neural networks can supply both relational reasoning and compositionality, which are extremely important in many emerging tasks. This new paradigm is also consistent with the attributes of human intelligence: a human represents complicated systems as compositions of simple units and their interactions. Another important feature of graph neural networks is that they can often support complex attention mechanisms and learn rich contextual representations by sending messages across different components of the input data. The main focus of this thesis is to solve several multimodal learning tasks by either introducing new graph neural network architectures or extending existing graph neural network models and applying them to the tasks. I address three tasks: visual question answering (VQA), scene graph generation, and automatic image caption generation. I show that graph neural networks are effective tools for achieving better performance on these tasks. Despite all the hype and excitement about the future influence of graph neural networks, an open question remains: how can we obtain the (structure of the) graphs that graph neural networks operate on? That is, how can we transform sensory input data such as images and text into graphs? A second main emphasis of this thesis is, therefore, to introduce new techniques and algorithms to address this issue. We introduce a generative graph neural network model based on reinforcement learning and recurrent neural networks (RNNs) to extract a structured representation from sensory data.
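As a rough illustration of what "sending messages across different components of the input data" can mean, the sketch below runs one round of message passing on a toy directed graph. It is not the architecture used in this thesis: the function name `message_passing_step`, the toy graph, and the fixed mean-and-average update are assumptions made for illustration, and learned graph neural networks replace this fixed update with trainable functions.

```python
def message_passing_step(features, edges):
    """One round of mean-aggregation message passing (illustrative only).

    features: dict mapping node -> list of floats (the node's feature vector)
    edges:    list of directed (sender, receiver) pairs
    Returns a new dict in which each node's vector is the average of its own
    vector and the mean of the vectors sent by its incoming neighbours.
    """
    dim = len(next(iter(features.values())))

    # Collect the messages arriving at each node.
    incoming = {node: [] for node in features}
    for sender, receiver in edges:
        incoming[receiver].append(features[sender])

    updated = {}
    for node, own in features.items():
        messages = incoming[node]
        if not messages:
            # A node with no incoming edges keeps its current features.
            updated[node] = list(own)
            continue
        mean_msg = [sum(m[i] for m in messages) / len(messages) for i in range(dim)]
        updated[node] = [(own[i] + mean_msg[i]) / 2 for i in range(dim)]
    return updated


# Hypothetical three-node graph: "a" and "c" send to "b", and "b" sends to "a".
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
edges = [("a", "b"), ("c", "b"), ("b", "a")]
updated = message_passing_step(features, edges)
```

After one round, each node that receives messages carries information about its neighbours; repeating the step spreads context further across the graph.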
The specific contributions are the following:
We introduce a new neural network architecture, Multimodal Neural Graph Memory Networks (MN-GMN), for the VQA task. A key issue for VQA is how to reason about information from different image regions that is relevant for answering the question. Our novel approach uses a graph structure with different region features as node attributes and applies a recently proposed, powerful graph neural network model, the Graph Network (GN), to reason about objects and their interactions in the scene context. The flexibility of GNs allows us to integrate bimodal sources of local information, textual and visual, both within and across each modality. Experiments show that MN-GMN outperforms the state of the art on the Visual7W and VQA v2.0 datasets and achieves results comparable to the state of the art on the CLEVR dataset.
We propose a new algorithm, called Deep Generative Probabilistic Graph Neural Networks (DG-PGNN), to generate a scene graph for an image. The input to DG-PGNN is an image, together with a set of region-grounded captions (RGCs) and object bounding-box proposals for the image. To generate the scene graph, DG-PGNN constructs and updates a new model, called a Probabilistic Graph Network (PGN). A PGN can be thought of as a scene graph with uncertainty: it represents each node and each edge by a CNN feature vector and defines a probability mass function (PMF) over the node type (object category) of each node and the edge type (predicate class) of each edge. DG-PGNN sequentially adds a new node to the current PGN by learning the optimal ordering in a Deep Q-learning framework, where states are partial PGNs, actions choose a new node, and rewards are defined based on the ground truth. After adding a node, DG-PGNN uses message passing to update the feature vectors of the current PGN by leveraging contextual relationship information, object co-occurrences, and language priors from captions. The updated features are then used to fine-tune the PMFs.
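The idea of a PGN as "a scene graph with uncertainty" can be sketched as a small data structure that stores a feature vector and a PMF for every node and edge. This is only a minimal sketch under stated assumptions: the class name `ProbabilisticGraph`, its methods, the softmax used to obtain PMFs, and the toy labels and scores are all hypothetical and are not taken from the thesis, and the Deep Q-learning node ordering and message passing updates are not modelled here.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability mass function (PMF)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ProbabilisticGraph:
    """Hypothetical container: a feature vector and a PMF per node and edge."""

    def __init__(self, node_types, edge_types):
        self.node_types = node_types  # candidate object categories
        self.edge_types = edge_types  # candidate predicate classes
        self.node_features, self.node_pmf = {}, {}
        self.edge_features, self.edge_pmf = {}, {}

    def add_node(self, node_id, feature, type_scores):
        self.node_features[node_id] = feature
        self.node_pmf[node_id] = softmax(type_scores)

    def add_edge(self, src, dst, feature, type_scores):
        self.edge_features[(src, dst)] = feature
        self.edge_pmf[(src, dst)] = softmax(type_scores)

    def most_likely_scene_graph(self):
        """Read off a plain scene graph by taking the most probable label everywhere."""
        def best(labels, pmf):
            return labels[max(range(len(pmf)), key=pmf.__getitem__)]

        nodes = {n: best(self.node_types, p) for n, p in self.node_pmf.items()}
        triples = [(nodes[s], best(self.edge_types, p), nodes[d])
                   for (s, d), p in self.edge_pmf.items()]
        return nodes, triples


# Toy example with made-up features and scores.
pgn = ProbabilisticGraph(node_types=["person", "horse", "hat"],
                         edge_types=["riding", "wearing"])
pgn.add_node(0, [0.2, 0.9], [2.0, 0.1, 0.1])
pgn.add_node(1, [0.7, 0.3], [0.1, 1.5, 0.2])
pgn.add_edge(0, 1, [0.4, 0.4], [1.2, 0.3])
nodes, triples = pgn.most_likely_scene_graph()
```

Keeping full PMFs rather than hard labels means that later updates to the features can shift probability between categories before a final scene graph is read off.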
Our experiments show that DG-PGNN significantly outperforms the state of the art on the Visual Genome dataset for scene graph generation.
We present a novel context-aware, attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes visual features of an image as input and learns complex spatial patterns based on a two-dimensional context by selecting or ignoring its input. The Grid LSTM can be seen as a graph neural network model with a grid structure, and it has not been applied to the image caption generation task before. Another novel aspect is that we leverage a set of local RGCs obtained by transfer learning. The RGCs often describe the properties of the objects in an image and their relationships. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer uses a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word by word while considering the caption context around a word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state of the art.
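For context on the comparison with soft attention, the sketch below shows a generic soft attention step over a set of image region features. It illustrates the baseline idea only, not the dynamic spatial attention mechanism proposed in this thesis; the function name `soft_attention`, the dot-product scoring, and the toy vectors are assumptions made for illustration (published captioning models typically score regions with a small learned network).

```python
import math


def soft_attention(query, region_features):
    """Generic soft attention over image regions (illustrative only).

    query:           list of floats, e.g. a decoder state
    region_features: list of feature vectors, one per image region
    Returns (weights, context): weights form a PMF over regions and
    context is the weighted average of the region features.
    """
    # Score each region by its dot product with the query.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in region_features]

    # A softmax turns the scores into attention weights that sum to one.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The context vector is the attention-weighted sum of the region features.
    dim = len(region_features[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, region_features))
               for i in range(dim)]
    return weights, context


# Toy example: two regions and a query that matches the first one.
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, context = soft_attention([2.0, 0.0], regions)
uniform_weights, uniform_context = soft_attention([0.0, 0.0], regions)
```

In this sketch the regions are treated as an unordered set: shuffling them only shuffles the weights, so where the regions lie in the image plays no role.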
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Schulte, Oliver