Computing Science - Theses, Dissertations, and other Required Graduate Degree Essays

Automating data preparation with statistical analysis

Date created: 
2021-03-18
Abstract: 

Data preparation is the process of transforming raw data into a clean and consumable format. It is widely regarded as the bottleneck in extracting value and insights from data, owing to the number of possible tasks in the pipeline and the factors that can strongly affect the results, such as human expertise, application scenarios, and solution methodology. Researchers and practitioners have devised a great variety of techniques and tools over the decades, yet many of them still place a significant burden on humans to configure suitable input rules and parameters. In this thesis, with the goal of reducing manual human effort, we explore using the power of statistical analysis techniques to automate three subtasks in the data preparation pipeline: data enrichment, error detection, and entity matching. Statistical analysis is the process of discovering underlying patterns and trends in data and deducing properties of an underlying probability distribution from a sample, for example, by testing hypotheses and deriving estimates. We first discuss CrawlEnrich, which automatically formulates the queries for data enrichment via web API data by estimating the potential benefit of issuing a given query. Then we study how to derive reusable error detection configuration rules from a web table corpus, so that end users get results with no effort. Finally, we introduce AutoML-EM, which aims to automate the entity matching model development process; entity matching is the task of finding records that refer to the same real-world entity. Our work provides powerful techniques for automating various data preparation steps, and we conclude the thesis by discussing future directions.
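
The benefit-estimation idea behind enrichment planning can be illustrated with a small sketch. The following Python snippet is a hypothetical, minimal illustration of ranking candidate enrichment queries by their expected payoff; the function names, the `coverage_rate` estimate, and the greedy ranking are assumptions for illustration, not CrawlEnrich's actual algorithm:

```python
import pandas as pd

def expected_benefit(df, query_col, target_col, coverage_rate):
    """Expected number of missing `target_col` cells an enrichment API would
    fill if queried with the keys in `query_col`, given an estimated
    probability `coverage_rate` that the API covers any one key."""
    answerable = df[target_col].isna() & df[query_col].notna()
    return answerable.sum() * coverage_rate

def rank_candidate_queries(df, candidates):
    """Greedily order candidate enrichment queries by estimated benefit.
    `candidates` is a list of (query_col, target_col, coverage_rate) triples."""
    return sorted(candidates, key=lambda c: expected_benefit(df, *c), reverse=True)
```

Issuing the top-ranked query first and then re-estimating benefits yields a simple benefit-driven enrichment loop.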

Document type: 
Thesis
Supervisor(s): 
Jiannan Wang
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Multilingual unsupervised word alignment models and their application

Date created: 
2021-03-05
Abstract: 

Word alignment is an essential task in natural language processing because of its critical role in training statistical machine translation (SMT) models, error analysis for neural machine translation (NMT), building bilingual lexicons, and annotation transfer. In this thesis, we explore models for word alignment, how they can be extended to incorporate linguistically motivated alignment types, and how they can be neuralized in an end-to-end fashion. In addition to these methodological developments, we apply our word alignment models to cross-lingual part-of-speech projection. First, we present a new probabilistic model for word alignment in which word alignments are associated with linguistically motivated alignment types. We propose the novel task of jointly predicting word alignments and alignment types, and develop semi-supervised learning algorithms for this task. We also solve the subtask of predicting the alignment type of a given aligned word pair. The proposed joint generative models (alignment-type-enhanced models) significantly outperform models without alignment types in terms of word alignment and translation quality. Next, we present an unsupervised neural hidden Markov model (HMM) for word alignment, where emission and transition probabilities are modeled using neural networks. The model is simpler in structure, allows for seamless integration of additional context, and can be used in an end-to-end neural network. Finally, we tackle part-of-speech (POS) tagging in the zero-resource scenario, where no POS-annotated training data is available. We present a cross-lingual projection approach in which neural HMM aligners are used to obtain high-quality word alignments between resource-poor and resource-rich languages. High-quality neural POS taggers are then used to annotate the resource-rich side of the parallel data and to train a tagger on the projected data. Our experimental results on truly low-resource languages show that our methods outperform their corresponding baselines.
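
To ground the discussion, the classical starting point for such aligners is the IBM Model 1 EM procedure, which HMM-based aligners extend with transition (distortion) probabilities. Below is a minimal, self-contained Python sketch of that baseline; it is the textbook algorithm, not the thesis's neural HMM:

```python
from collections import defaultdict

def ibm_model1(bitext, n_iters=10):
    """EM training of IBM Model 1 lexical translation probabilities t(f | e).
    `bitext` is a list of (source_tokens, target_tokens) sentence pairs."""
    t = defaultdict(lambda: 1.0)  # flat initialization; normalized by the E-step
    for _ in range(n_iters):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizers per target word
        for f_sent, e_sent in bitext:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    p = t[(f, e)] / z  # posterior that f aligns to e
                    count[(f, e)] += p
                    total[e] += p
        t = defaultdict(float, {fe: count[fe] / total[fe[1]] for fe in count})
    return t

# A source word's best link in a sentence pair is then argmax over e of t[(f, e)].
```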

Document type: 
Thesis
Supervisor(s): 
Anoop Sarkar
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Towards event analysis in time-series data: Asynchronous probabilistic models and learning from partial labels

Date created: 
2021-03-10
Abstract: 

In this thesis, we contribute in two main directions: modeling asynchronous time-series data and learning from partially labelled data. We first propose novel probabilistic frameworks that improve the flexibility and expressiveness of current approaches to modeling complex real-world asynchronous event sequence data. Second, we present a scalable approach to learning a deep multi-label classifier end-to-end from partial labels. To evaluate the effectiveness of our proposed frameworks, we focus on visual recognition applications; however, our frameworks are generic and can be applied to general settings of modeling event sequences and learning multi-label classifiers from partial labels. Visual recognition is a fundamental component of machine intelligence and has a wide range of applications, such as human activity analysis, autonomous driving, surveillance and security, and health-care monitoring. Through a wide range of experiments, we show that our proposed approaches help build more powerful and effective visual recognition frameworks.
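
A common way to make multi-label training tolerate partial labels is to mask unobserved entries out of the loss. The PyTorch sketch below shows that masking idea; it is a generic formulation for illustration, not necessarily the specific loss developed in the thesis:

```python
import torch
import torch.nn.functional as F

def partial_bce(logits, targets):
    """Multi-label binary cross-entropy where target entries are 1 (positive),
    0 (negative), or -1 (unobserved); unobserved entries are masked out so
    they contribute no gradient."""
    observed = (targets >= 0).float()
    per_label = F.binary_cross_entropy_with_logits(
        logits, targets.clamp(min=0).float(), reduction="none")
    return (per_label * observed).sum() / observed.sum().clamp(min=1)
```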

Document type: 
Thesis
Supervisor(s): 
Greg Mori
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Explaining inference queries with Bayesian optimization

Date created: 
2021-04-08
Abstract: 

Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this work, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO, a technique for finding the global optimum of a black-box function, is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. Our experiments show that the predicates found by BOExplain have a higher degree of explanation than those found by state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on three real-world datasets.
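
The search loop can be sketched with an off-the-shelf BO library. The snippet below uses scikit-optimize as a stand-in for BOExplain's own optimizer; the `age` column, the `run_pipeline` callback, and the search ranges are hypothetical names introduced purely for illustration:

```python
from skopt import gp_minimize            # pip install scikit-optimize
from skopt.space import Real

def make_objective(df, run_pipeline):
    """Objective over candidate predicates of the form `age in [lo, hi]`.
    run_pipeline(data) is assumed to retrain the model, re-run the aggregate
    query, and return how far the result still is from the analyst's
    expectation (lower is better)."""
    def objective(bounds):
        lo, hi = sorted(bounds)
        remaining = df[~df["age"].between(lo, hi)]  # tuples the predicate keeps
        return run_pipeline(remaining)
    return objective

# res = gp_minimize(make_objective(df, run_pipeline),
#                   [Real(0, 100), Real(0, 100)], n_calls=30, random_state=0)
# res.x holds the best predicate's bounds; res.fun its objective value.
```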

Document type: 
Thesis
Supervisor(s): 
Jiannan Wang
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Towards memory-efficient incremental processing of streaming graphs

Date created: 
2021-03-03
Abstract: 

With growing interest in efficiently analyzing dynamic graphs, streaming graph processing systems rely on stateful iterative models: they track intermediate state as execution progresses in order to incrementally adjust the results upon graph mutation, reflecting the changes in the latest version of the graph. We observe that the intermediate state tracked by these stateful iterative models significantly increases the memory footprint of these systems, which limits their scalability on large graphs. Given the ever-increasing size of real-world graphs, it is crucial to develop solutions that actively limit the memory footprint while still delivering the benefits of incremental processing. We develop memory-efficient stateful iterative models that demand much less memory capacity to efficiently process streaming graphs while delivering the same results as existing stateful iterative models. First, we propose a Selective Stateful Iterative Model in which the memory footprint is controlled by selecting a small portion of the intermediate state to maintain throughout execution; the selection can be configured based on the capacity of the system's memory. Then, we propose a Minimal Stateful Iterative Model that further reduces the memory footprint by exploiting key properties of graph algorithms. We develop incremental processing strategies for both models that correctly compute the effects of graph mutations on the final results even when intermediate state is unavailable. Our evaluation shows that these memory-efficient models are effective at limiting the memory footprint while retaining most of the performance benefits of traditional stateful iterative models, and hence can scale to larger graphs that the traditional models could not handle.
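
As a concrete illustration of incremental processing, consider single-source shortest paths under edge insertions: only vertices whose results actually change need to be revisited. The Python sketch below shows this classic idea with full intermediate state (a distance per vertex); the thesis's selective and minimal models go further by maintaining only part of such state:

```python
import heapq

def insert_edge(graph, dist, u, v, w):
    """Incrementally repair shortest-path distances after inserting edge
    (u, v, w). `graph` is an adjacency dict {node: [(nbr, weight), ...]};
    `dist` holds the current distances from the source."""
    graph.setdefault(u, []).append((v, w))
    if dist.get(u, float("inf")) + w >= dist.get(v, float("inf")):
        return  # result unaffected; nothing to recompute
    dist[v] = dist[u] + w
    heap = [(dist[v], v)]           # propagate only from the changed vertex
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue                # stale heap entry
        for y, wy in graph.get(x, []):
            if d + wy < dist.get(y, float("inf")):
                dist[y] = d + wy
                heapq.heappush(heap, (dist[y], y))
```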

Document type: 
Thesis
Supervisor(s): 
Keval Vora
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Neural disjunctive normal form: Vertically integrating logic with deep learning for classification

Date created: 
2020-12-14
Abstract: 

Inspired by the limitations of pure deep learning and of symbolic logic-based models, in this thesis we consider a specific type of neuro-symbolic integration, called vertical integration, to bridge logical reasoning and deep learning and address their respective limitations. The idea of vertical integration is to treat perception and reasoning as two separate stages of computation while still allowing simple and efficient end-to-end learning. It uses a perceptive deep neural network (DNN) to learn abstract concepts from raw sensory data, and a symbolic model that operates on these abstract concepts to make interpretable predictions. As a preliminary step in this direction, we tackle the task of binary classification and propose the Neural Disjunctive Normal Form (Neural DNF). Specifically, we use a perceptive DNN module to extract features from data and, after binarization (to 0 or 1), feed them into a Disjunctive Normal Form (DNF) module that performs logical rule-based classification. We introduce the BOAT algorithm to optimize these two normally incompatible modules in an end-to-end manner. Compared to a standard DNF, a Neural DNF can handle prediction tasks over raw sensory data (such as images) thanks to the neurally extracted concepts. Compared to a standard DNN, a Neural DNF offers improved interpretability via an explicit symbolic representation while achieving comparable accuracy despite its reduced model flexibility, and it is particularly suited to classification tasks that require some logical composition. Our experiments show that BOAT can optimize Neural DNF in an end-to-end manner, i.e., jointly learn the logical rules and concepts from scratch, and that in certain cases the learned rules and the meanings of the concepts align with human understanding. We view Neural DNF as an important first step towards more sophisticated vertical integration models that use symbolic models with more powerful rule languages for advanced prediction and algorithmic tasks, beyond DNF (propositional logic) for classification. The BOAT algorithm introduced in this thesis can potentially be applied to such advanced hybrid models.
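
The two-stage architecture can be sketched compactly. The PyTorch module below pairs a small perceptive encoder with a differentiable DNF layer, using a straight-through estimator for binarization and fuzzy AND/OR relaxations; the encoder sizes and the particular relaxation are illustrative assumptions, and BOAT's actual optimization scheme is not shown:

```python
import torch
import torch.nn as nn

class STEBinarize(torch.autograd.Function):
    """Hard 0/1 binarization with a straight-through gradient."""
    @staticmethod
    def forward(ctx, x):
        return (x > 0.5).float()
    @staticmethod
    def backward(ctx, grad):
        return grad  # pass gradients straight through the step function

class NeuralDNFSketch(nn.Module):
    def __init__(self, in_dim, n_concepts=10, n_conjunctions=5):
        super().__init__()
        self.encoder = nn.Sequential(                 # perceptive DNN module
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, n_concepts), nn.Sigmoid())
        # soft membership of each literal (a concept or its negation) in each conjunction
        self.conj_w = nn.Parameter(torch.randn(n_conjunctions, 2 * n_concepts))

    def forward(self, x):
        c = STEBinarize.apply(self.encoder(x))        # binarized concepts in {0, 1}
        literals = torch.cat([c, 1 - c], dim=-1)      # concepts and their negations
        m = torch.sigmoid(self.conj_w)                # memberships in [0, 1]
        # fuzzy AND: a conjunction is violated to the extent a member literal is 0
        conj = torch.prod(1 - m * (1 - literals.unsqueeze(1)), dim=-1)
        return 1 - torch.prod(1 - conj, dim=-1)       # fuzzy OR over conjunctions
```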

Document type: 
Thesis
Supervisor(s): 
Martin Ester
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Unsupervised annotation of regulatory domains by integrating functional genomic assays and Hi-C data

Date created: 
2020-12-03
Abstract: 

In each cell type, chromosomes are organized into a specific 3D structure that controls the function of the cell through different mechanisms, including domain-scale regulation. Because of the correlation between genome structure and function, different methods have been proposed to integrate 1D functional genomic data with 2D Hi-C data to identify domain types. Existing methods rely on the assumption that directly connected genomic regions are more likely to have the same domain type; however, the spatial clustering of genomic regions is based on both their first-order and second-order proximities. Here, we present an integrative approach that uses 1D functional genomic features and 3D interactions from Hi-C data to assign genomic regions labels that discriminate both spatial and functional genomic patterns. We use graph embedding to learn latent variables for nodes (genomic regions) that preserve the second-order proximity of the Hi-C graph. These latent variables summarize the spatial information in the Hi-C data, and we feed them, together with the existing 1D functional features, to Segway, a genome annotation method, to infer domain states. We show that our labels distinguish combinations of the spatial and functional states of genomic regions; for example, loci located in the nucleus interior can be further clustered into significantly and moderately expressed domains. We also assess the importance of the spatial and functional features in explaining different cell activities, including replication timing and gene expression profiles, and show how coupling the two feature types improves the prediction of such activities. Finally, we show that incorporating spatial features allows us to find domain types that are co-regulated even at large genomic distances from each other. Our framework can be generalized to aggregate different 1D genomic assays and 3D interactions from Hi-C to uncover the mechanisms behind the association between genome 3D structure and epigenetic profiles.
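
One concrete way to obtain proximity-preserving embeddings from a Hi-C contact matrix is a random-walk graph embedding. The Python sketch below uses the off-the-shelf node2vec package as a stand-in; the thesis may well use a different embedding method (for example LINE, which targets second-order proximity directly), and the contact threshold and walk parameters here are illustrative assumptions:

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

def embed_hic_bins(hic, threshold=1.0, dim=16):
    """Embed genomic bins from a (hypothetical) n x n Hi-C contact matrix so
    that bins with similar contact neighborhoods get similar vectors."""
    n = hic.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if hic[i, j] >= threshold:          # keep sufficiently strong contacts
                g.add_edge(i, j, weight=float(hic[i, j]))
    walks = Node2Vec(g, dimensions=dim, walk_length=40, num_walks=10, workers=2)
    model = walks.fit(window=5, min_count=1)    # skip-gram over the random walks
    # one latent vector per genomic bin, to append to the 1D functional features
    return {i: model.wv[str(i)] for i in g.nodes}
```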

Document type: 
Thesis
Supervisor(s): 
Maxwell Libbrecht
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Unsupervised learning of latent edge types from multi-relational data

Date created: 
2021-05-10
Abstract: 

Many relational datasets, including relational databases, feature links of different types (e.g., actors act in movies, users rate movies), and are known as multi-relational, heterogeneous, or multilayer networks. Edge types/network layers are often not explicitly labeled, even when they influence the underlying graph generation process. For example, IMDb lists Tom Cruise as a cast member of Mission: Impossible, but not as its star. Inferring latent layers is useful for relational prediction tasks (e.g., predicting Tom Cruise's salary or his presence in other movies). This thesis presents the Latent Layer Generative Framework (LLGF), a generative framework for learning latent layers that generalizes Variational Graph Auto-Encoders (VGAEs) with arbitrary node representation encoders and link generation decoders. The decoder treats the observed edge-type signal as a linear combination of latent layer decoders; the encoder infers parallel node representations, one for each latent layer. We evaluate LLGF on eight benchmark graph learning datasets. Four of the datasets are heterogeneous (originally labeled with edge types); we apply LLGF after removing the edge labels to assess how well it recovers the ground-truth layers. LLGF increases link prediction accuracy, especially on the heterogeneous datasets (by up to 5% AUC), and recovers the ground-truth layers exceptionally well.
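
The decoder idea (a linear combination of per-layer link decoders over parallel node representations) can be sketched in a few lines of PyTorch. The exact parameterization below is an illustrative assumption, and the encoder and the VGAE training loop are omitted:

```python
import torch
import torch.nn as nn

class MixtureOfLayerDecoders(nn.Module):
    """Scores a link (u, v) as a weighted combination of per-latent-layer
    inner-product decoders, in the spirit of a latent-layer decoder."""
    def __init__(self, n_layers):
        super().__init__()
        self.mix = nn.Parameter(torch.zeros(n_layers))  # mixing weights (logits)

    def forward(self, z_u, z_v):
        # z_u, z_v: (batch, n_layers, dim) -- one representation per latent
        # layer, as produced by an encoder inferring parallel representations
        per_layer = torch.sigmoid((z_u * z_v).sum(dim=-1))   # (batch, n_layers)
        return per_layer @ torch.softmax(self.mix, dim=0)    # (batch,)
```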

Document type: 
Thesis
Supervisor(s): 
Oliver Schulte
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

SANet: Scene agnostic network for camera localization

Date created: 
2021-03-04
Abstract: 

This thesis presents a scene-agnostic neural architecture for camera localization, in which the model parameters and the scenes are independent of each other. Despite recent advances in learning-based methods with scene coordinate regression, most approaches require training on each scene one by one, which makes them unsuitable for online applications such as SLAM and robotic navigation, where a model must be built on the fly. Our approach learns to build a hierarchical scene representation and predicts a dense scene coordinate map for a query RGB image on the fly, given an arbitrary scene. The 6-DoF camera pose of the query image can then be estimated from the predicted scene coordinate map. Additionally, the dense prediction can be used for other online robotics and AR applications, such as obstacle avoidance. We demonstrate the effectiveness and efficiency of our method on both indoor and outdoor benchmarks, achieving state-of-the-art performance among methods that work on arbitrary scenes without retraining or adaptation.
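
The pose-estimation step mentioned above is typically a Perspective-n-Point (PnP) solve with RANSAC over the predicted 2D-to-3D correspondences. Here is a minimal Python sketch of that standard post-processing step with OpenCV; it is not SANet's network itself:

```python
import cv2
import numpy as np

def pose_from_scene_coords(scene_coords, pixel_coords, K):
    """Estimate a 6-DoF camera pose from a predicted scene coordinate map.
    scene_coords: (N, 3) predicted 3D points; pixel_coords: (N, 2) their pixel
    locations; K: 3x3 intrinsics. RANSAC rejects poorly predicted coordinates."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        scene_coords.astype(np.float64), pixel_coords.astype(np.float64),
        K.astype(np.float64), None)          # None: assume no lens distortion
    if not ok:
        raise RuntimeError("PnP failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)               # rotation vector -> rotation matrix
    return R, tvec, inliers
```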

Document type: 
Thesis
Supervisor(s): 
Ping Tan
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

SigTools: An exploratory visualization tool for genomic signals

Date created: 
2021-01-20
Abstract: 

With the advancement of sequencing technologies, genomic data sets are constantly being expanded with high volumes of different data types. One recently introduced data type in genomic science is genomic signals, in which genomic coordinates are associated with a score or probability indicating some form of biological activity. An example of genomic signals is epigenomic marks, which represent short-read coverage measurements over the genome and are used to locate functional and non-functional elements in genome annotation studies. To understand and evaluate the results of such studies, one needs to explore and analyze the characteristics of the input data. Information visualization is an effective approach that leverages human visual ability in data analysis. Several visualization applications have been developed for this purpose, such as the UCSC Genome Browser, deepTools, and Segtools; however, we believe there is room for improvement in terms of required programming skills and the visualizations offered. Sigtools is an R-based exploratory visualization package designed to enable users with limited programming experience to produce statistical plots of continuous genomic data. It provides several statistical visualizations, such as value distribution, correlation, and autocorrelation, that give insight into the behavior of a group of signals over larger regions – such as a chromosome or the whole genome – as well as around a specific point or short region. To demonstrate Sigtools in use, we first visualize five histone modifications downloaded from the Roadmap Epigenomics data portal and show that Sigtools accurately captures their characteristics. We then visualize five chromatin state features, probabilistically generated genome annotations, to show how Sigtools can assist in the interpretation of new and unknown signals.
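
Sigtools itself is an R package; as a language-neutral illustration of two of the summaries it describes (value distribution and autocorrelation), here is a short Python sketch over a hypothetical per-bin signal array:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_signal_summary(signal, max_lag=200):
    """Plot the value distribution and autocorrelation of a genomic signal
    given as a 1D array of per-bin scores."""
    signal = np.asarray(signal, dtype=float)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
    ax1.hist(signal, bins=50)
    ax1.set(xlabel="signal value", ylabel="count", title="Value distribution")
    s = signal - signal.mean()
    # autocorrelation at lags 0 .. max_lag-1, in units of genomic bins
    acf = [1.0] + [np.corrcoef(s[:-k], s[k:])[0, 1] for k in range(1, max_lag)]
    ax2.plot(range(max_lag), acf)
    ax2.set(xlabel="lag (bins)", ylabel="autocorrelation", title="Autocorrelation")
    fig.tight_layout()
    return fig
```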

Document type: 
Thesis
Supervisor(s): 
Kay C. Wiese
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.