Computing Science - Theses, Dissertations, and other Required Graduate Degree Essays

Receive updates for this collection

Multidimensional Parallelization for Streaming Text Processing Applications Based on Parabix Framework

Author: 
Date created: 
2017-12-15
Abstract: 

Streaming text processing is important for transforming and analyzing the rapidly growing data in modern society. Unfortunately, text processing software written using the sequential byte-at-a-time processing model fails to take full advantage of the resources available on modern processors for many reasons, including significant branch misprediction penalties due to the input-dependent branching structures of text processing applications, cache miss penalties for table-based operations applied per byte of input, and logical complexity that makes it difficult to process more than a single-byte at a time. Common solutions to process text streams that have dependencies from start to end may involve state machine or recursive algorithms, which are generally considered hard to parallelize and hence ill-suited for multicore or manycore processors. However, the Parabix approach to text processing has recently been shown to offer a promising alternative, based on the concept of bitwise data parallelism: a transform representation of text that uses the full width of available processor registers at a density of one bit per input byte. This dissertation investigates the further development of the Parabix framework to incorporate multidimensional parallelization, combining Parabix methods with several different models of multithreading such as task parallelism, data parallelism and pipeline parallelism as well as with GPU-based SIMT processing. A form of data-pipeline parallelism is developed and shown to be beneficial for text-processing applications even with strong sequential state dependencies. Compilers for both pure pipeline parallelism and data-pipeline parallelism are developed and integrated into Parabix framework to provide automated multithreading support for any Parabix applications. Methods for task parallelism and data parallelism are also developed, but need to be customized by the programmer for specific applications. GPU support is added to Parabix framework by translating LLVM IR into PTX, which can be compiled into binary code and run on GPU devices. Programmers can simply choose to use NVPTX driver instead of CPU driver for code that is executed on GPU. Several applications based on Parabix are implemented and tested with different parallelization techniques to analyze the advantages and limits of multidimensional parallelization extensions of Parabix framework. In data-pipeline mode, we are able to achieve 215% speedup compared with the sequential version on a quad-core machine. The GPU implementation has its limitations but can give up to 310% speed-up.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Arrvindh Shriraman
Robert Cameron
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Subspace Clustering Methods for Understandable Information Organization

Author: 
Date created: 
2017-12-08
Abstract: 

Clustering has been widely used to identify possible structures in data and help users understand data in an unsupervised manner. Although clustering methods can provide group information for users, it is still challenging for users to efficiently and effectively understand clustering structures. Subspace clustering methods address this challenge by providing each clustering an understandable feature subspace. Moreover, as high-dimensional data has become more and more popular in real-world applications due to the advances of big data technologies, some subspace clustering methods are especially scalable for high-dimensional data. In this thesis, we study the subspace clustering problem in different application scenarios (e.g., semi-supervised and unsupervised) to bridge the semantic gap between low-level clustering structures and high-level understandable meanings, that is, providing understandable information organization for end users. Specifically, we develop a series of novel subspace clustering methods for different application purposes. Taking image retrieval in CBIR without query input as an example, we study how subspace clustering methods can help retrieve relevant images by relevance feedback, by efficient iterative search, or by data summarization. To tackle the challenges arising from real-world applications, we develop three efficient and effective subspace clustering methods to provide preferred or understandable clustering structures for end users. First, we present a semi-supervised subspace clustering method to discover a feature subspace, in which a preferred clustering structure is hidden. Second, we propose a subspace hierarchical clustering method that can generate a balanced hierarchy with semantics to help users search friendly and efficiently. Third, we develop a subspace multi-clustering method that can automatically discover a certain number of feature subspaces, where different clustering structures reside. Comprehensive empirical studies using synthetic and real data sets demonstrate the effectiveness of our proposed methods.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Jian Pei
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

REP3D: 3D Human Motion Capture Dataset for Athletic Movement

Author: 
Date created: 
2017-12-12
Abstract: 

The field of human 3D pose estimation suffers from a small population of diverse public motion capture datasets, each with a low number of environments and subjects. We propose a new dataset including 45 participants and 22 environments, using motion capture technology that allows data collection in arbitrary locations. The dataset is composed of video and motion capture data for athletic actions selected from golf and baseball, recorded from a plurality of angles and distances. The annotation process for semi-automatically aligning video data with ground truth 3D joint locations is fully outlined. The performance of a modern human 3D pose estimation model on a subset of the dataset is reported.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Greg Mori
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Computational study on a branch decomposition based exact distance oracle for planar graphs

Author: 
Date created: 
2017-10-19
Abstract: 

We present a simple exact distance oracle for the point-to-point shortest distance problem in planar graphs. Given an edge weighted planar graph G of n vertices, we decompose G into subgraphs by a branch-decomposition of G, compute the shortest distances between each vertex in a subgraph and the vertices in the boundary of the subgraph, and keep the shortest distances in the oracle. Let bw(G) be the branchwidth of G. Our oracle has O(bw(G)) query time, O(bw(G)n log(n)) size and O(n^2 log(n)) pre-processing time. Computational studies show that our oracle is much faster than Dijkstra’s algorithm for answering point-to-point shortest distance queries for several classes of planar graphs.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Qianping Gu
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Computational exploration of transmission and acquisition of drug-resistant tuberculosis

Author: 
Date created: 
2017-09-07
Abstract: 

Causing the deaths of millions of people annually, tuberculosis is now a universally shared threat. The emergence and spread of drug-resistant tuberculosis strains make effective treatment and control of tuberculosis complicated. Drug-resistance can arise from two sources: acquisition and transmission. Whole genome sequencing now plays an important role in bioinformatics and the research on infectious diseases. Based on the whole genome sequencing data of 110 tuberculosis isolates, a phylogenetic tree was built and the ancestral sequence was reconstructed. A simulation model was then designed to explore the development of drug-resistant tuberculosis in a population and resulted in about 300,000 tuberculosis patients. The tuberculosis patient clustering algorithm, which achieved desirable performance, was then proposed to cluster these tuberculosis patients into epidemiologically related groups. This research explored the transmission and acquisition of drug-resistant tuberculosis, and used modeling and algorithms in support of treatment and control of tuberculosis.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Leonid Chindelevitch
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Model for police dispatching and shift scheduling

Date created: 
2017-12-08
Abstract: 

Emergency services have been in existence since the beginning of recorded history, yet most efficient and effective use of resources in this field is still considered an open problem. In this thesis, we explore the challenges involved in police services, and present an approach for determining police force capacity for a given call for service data. We use mathematical programming for modeling police dispatching and shift scheduling, and test our method on real occurrence data.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Binay Bhattacharya
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

A high-throughput dependency parser

Date created: 
2017-10-17
Abstract: 

Dependency parsing is an important task in NLP, and it is used in many downstream tasks for analyzing the semantic structure of sentences. Analyzing very large corpora in a reasonable amount of time, however, requires a fast parser. In this thesis we develop a transition-based dependency parser with a neural-network decision function which outperforms spaCy, Stanford CoreNLP, and MALTParser in terms of speed while having a comparable, and in some cases better, accuracy. We also develop several variations of our model to investigate the trade-off between accuracy and speed. This leads to a model with a greatly reduced feature set which is much faster but less accurate, as well as a more complex model involving a BiLSTM simultaneously trained to produce POS tags which is more accurate, but much slower. We compare the accuracy and speed of our different parser models against the three mentioned parsers on the Penn Treebank, Universal Dependencies English, and Ontonotes datasets using two different dependency tree representations to show how our parser competes on data from very different domains. Our experimental results reveal that our main model is much faster than the 3 external parsers while also being more accurate in some cases; our reduced feature set model is significantly faster while remaining competitive in terms of accuracy; and our BiLSTM-using model is somewhat faster than CoreNLP and is significantly more accurate.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Anoop Sarkar
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Advanced techniques for bounded and unbounded repetition in parabix regular expression search

Author: 
Date created: 
2017-09-12
Abstract: 

The three-level architecture of the regular expression search tool named icGrep, which is based on a pure parallel Parabix framework, has shown great speedup compared to conventional search tools. This thesis proposed some advanced techniques for bounded and unbounded repetition in Parabix regular expression search. We first accelerated the bounded repetition type of Unicode unit-length regular expressions by utilizing a log2 technique with the UTF8-to-UTF32 pipeline. To reduce the overhead brought about by the UTF-8-to-UTF-32 transformation, the multiplexed character classes concept was proposed. For the unbounded repetition part, we have reviewed finite automata theory for application to Parabix regular expression matching and proposed a totally different compile pipeline for the local language. In the meanwhile, we proposed star-normal-form optimization to make the RE abstract syntax tree less complex and less ambiguous. All of these innovative techniques have demonstrated their performance dealing with repetition against the basic pipeline.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Robert D. Cameron
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Probabilistic Analysis of Distributed Processes with Focus on Consensus

Date created: 
2017-09-22
Abstract: 

This thesis is devoted to the study of stochastic decentralized processes. Typical examples in the real world include the dynamics of weather and temperature, of traffic, the way we meet our friends, etc. We take the rich tool set from probability theory for the analysis of Markov Chains and employ it to study a wide range of such distributed processes: Forest Fire Model (social networks), Balls-into-Bins with Deleting Bins, and fundamental consensus dynamics and protocols such as the Voter Model, 2-Choices, and 3-Majority.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Petra Berenbrink
Claire Mathieu
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Speed versus accuracy in neural sequence tagging for natural language processing

Author: 
Date created: 
2017-09-12
Abstract: 

Sequence Tagging, including part of speech tagging, chunking and named entity recognition, is an important task in NLP. Recurrent neural network models such as Bidirectional LSTMs have produced impressive results on sequence tagging. In this work, we first present a Bidirectional LSTM neural network model for sequence tagging tasks. Then we show a simple and fast greedy sequence tagging system using a feedforward neural network. We compare the speed and accuracy between the Bidirectional LSTM model and the greedy feedforward model. In addition, we propose two new models based on Mention2Vec by Stratos (2016): Feedforward-Mention2Vec for named entity recognition and chunking, and BPE-Mention2Vec for part-of-speech tagging. Feedforward-Mention2Vec predicts tag boundaries and corresponding types separately. BPE-Mention2Vec uses the Byte Pair Encoding algorithm to segment words first and then predicts the part-of-speech tags for the subword spans. We carefully design the experiments to demonstrate the speed and accuracy trade-off in different models. The empirical results reveal that the greedy feedforward model can achieve comparable accuracy and faster speed than recurrent models for sequence tagging, and Feedforward-Mention2Vec is competitive with the fully structured BiLSTM model for named entity recognition while being more scalable in the number of named entity types.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Anoop Sarkar
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.