Computing Science - Theses, Dissertations, and other Required Graduate Degree Essays


Employing neural hierarchical model with pointer generator networks for abstractive text summarization

Author: 
Date created: 
2019-10-31
Abstract: 

As the growth of online data continues, automatic summarization is integral to generating a condensed version of a text while preserving the meaning of the original input. Although most earlier work on automatic summarization uses extractive approaches to identify the most important information in a document, recent research focuses on the more challenging task of making the summaries abstractive. Sequence-to-sequence models with attention have been quantitatively shown to be effective for abstractive summarization, but the quality of the generated summaries is often poor, with incorrect and redundant information. In this thesis, we present an end-to-end neural network framework which combines a hierarchical content selector and a pointer-generator network abstractor through a multi-level attention mechanism that uses the sentence importance scores from the former model to help the word-level attention of the latter model make better decisions when generating the output words. Hence, words from key sentences are attended to more than words in less salient sentences of the input. Our approach is motivated by human writers, who tend to focus only on the relevant portions of an article when summarizing while ignoring anything irrelevant that might degrade the output quality. We conduct experiments on the challenging CNN/Daily Mail dataset, which consists of long newswire articles paired with multiple-sentence summaries. Experimental results show that our end-to-end architecture outperforms extractive systems and the strong lead-3 baseline, and achieves ROUGE and METEOR scores competitive with previous abstractive systems on the same dataset. Qualitative analysis of test data shows that the generated summaries are fluent as well as informative.
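The core of the multi-level attention mechanism can be sketched numerically: word-level attention weights are rescaled by the importance score of the sentence each word belongs to, then renormalized. This is an illustrative toy example, not the thesis's exact formulation; the scores, sentence ids, and importance values below are made up.

```python
import numpy as np

def multi_level_attention(word_scores, sent_ids, sent_importance):
    """word_scores: raw word-level attention logits (1D);
    sent_ids: sentence index of each word;
    sent_importance: importance score per sentence from the content selector."""
    word_attn = np.exp(word_scores - word_scores.max())
    word_attn /= word_attn.sum()                    # softmax over words
    scaled = word_attn * sent_importance[sent_ids]  # rescale by sentence salience
    return scaled / scaled.sum()                    # renormalize to a distribution

word_scores = np.array([1.0, 0.5, 0.2, 0.8, 0.3])
sent_ids = np.array([0, 0, 1, 1, 1])        # words 0-1 in sentence 0, rest in 1
sent_importance = np.array([0.9, 0.1])      # sentence 0 judged far more salient

attn = multi_level_attention(word_scores, sent_ids, sent_importance)
print(attn.round(3))  # attention mass shifts toward words of the salient sentence
```

After rescaling, words in the salient sentence dominate the distribution even when their raw word-level scores are comparable to the rest.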

Document type: 
Thesis
File(s): 
Senior supervisor: 
Fred Popowich
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Inference of tumor subclonal composition and evolution by the use of single-cell and bulk DNA sequencing data

Author: 
Date created: 
2019-08-21
Abstract: 

Cancer is a genetic disease characterized by the emergence of genetically distinct populations of cells (subclones) through the random acquisition of mutations at the single-cell level, and by shifting prevalences at the subclone level through selective advantages conferred by driver mutations. This interplay creates complex mixtures of tumor cell populations which exhibit different susceptibility to targeted cancer therapies and are suspected to be the cause of treatment failure. Therefore, it is of great interest to obtain a better understanding of the evolutionary histories of individual tumors and their subclonal composition. In this thesis, we present three methods for the inference of tumor subclonal composition and evolution from bulk and/or single-cell DNA sequencing data. First, we present CTPsingle, a method which aims to infer tumor subclonal composition from single-sample bulk sequencing data. CTPsingle consists of two steps: (i) robust clustering of mutations using beta-binomial mixture modelling and (ii) inference of tumor phylogenies using integer linear programming. On simulated data, we show that CTPsingle is able to infer the purity and the clonality of single-sample tumors with high accuracy even when restricted to a coverage depth as low as 30x. CTPsingle is currently used to infer clonality as part of the Evolution and Heterogeneity Working Group of the Pan Cancer Analysis of Whole Genomes project, where sequencing data of over 2700 tumors are analyzed. Next, we present B-SCITE, the first available computational approach that infers tumor phylogenies from combined single-cell and bulk sequencing data. B-SCITE is a probabilistic method which searches for the tumor phylogenetic tree maximizing the joint likelihood of the two data types. Tree search in B-SCITE is performed using a customized MCMC search over the space of labeled rooted trees.
Using a comprehensive set of simulated data, we show that B-SCITE systematically outperforms existing methods with respect to tree reconstruction accuracy and subclone identification. On real tumor data, mutation histories generated by B-SCITE show high concordance with expert-generated trees. In the third part, we introduce PhISCS, the first method which integrates single-cell and bulk sequencing data while accounting for the possible existence of mutations affected by undetected copy number aberrations, as well as mutations for which the commonly used and recently debated Infinite Sites Assumption is violated. PhISCS is a combinatorial method and, in contrast to the available alternatives, which are mostly based on probabilistic search schemes, it can provide a guarantee of optimality for the reported solutions. We provide two different implementations of PhISCS: (i) one based on integer linear programming and (ii) one based on constraint satisfaction programming. We show that the latter has lower running time on most of the instances that we used to assess the performance of the two implementations. These results suggest that in some applications constraint satisfaction programming might be a viable alternative to the commonly used integer linear programming. We also demonstrate the utility of PhISCS in analyzing real sequencing data, where it reports more plausible and parsimonious tumor phylogenies than the available alternatives.
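The mutation-clustering step of CTPsingle rests on beta-binomial modelling of variant read counts. A toy sketch of the idea, not CTPsingle itself: assign each mutation's (variant, total) read count to the candidate cellular-frequency cluster under which its beta-binomial likelihood is highest. The cluster frequencies and overdispersion parameter s below are illustrative assumptions.

```python
from math import lgamma

def betaln(a, b):
    # log of the beta function via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_logpmf(k, n, f, s=50.0):
    """Log-likelihood of k variant reads out of n under a beta-binomial
    with mean frequency f and overdispersion s."""
    a, b = f * s, (1.0 - f) * s
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + betaln(k + a, n - k + b) - betaln(a, b)

def assign(reads, clusters):
    """reads: list of (variant, total) read counts;
    clusters: candidate cellular frequencies."""
    return [max(range(len(clusters)),
                key=lambda j: beta_binom_logpmf(k, n, clusters[j]))
            for k, n in reads]

reads = [(15, 30), (14, 28), (3, 30), (4, 31)]  # ~50% and ~10% VAF at ~30x
clusters = [0.5, 0.1]
print(assign(reads, clusters))  # first two -> cluster 0, last two -> cluster 1
```

A full mixture model would also learn the cluster frequencies and weights (e.g. by EM); the hard assignment here only illustrates the likelihood computation at the heart of that step.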

Document type: 
Thesis
File(s): 
Senior supervisor: 
Leonid Chindelevitch
S. Cenk Sahinalp
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Cost-sensitive ranking: Evolving data analytics in power systems and risk management

Author: 
Date created: 
2019-08-29
Abstract: 

This thesis develops Cost-Sensitive Ranking: supervised learning that directly minimizes the costs and risks of real industrial resource deployment policies, supported by two case studies with the power utility BC Hydro: Cost-Sensitive Learning to Rank and Risk Clearance. Our first solution is motivated by the need to guide resource deployment toward the top-k highest predicted impact events in sets of possible threats/items. For example, British Columbia storms can leave hundreds of thousands of customers without power, and global yearly storm outage losses exceed billions of dollars. To prevent the most costly power outages before a storm, BC Hydro could apply data analytics models to recommend the top-k highest damage areas so it can act in these areas with its limited response resources. Though regression could return the top-k highest predicted outage damage areas, regression's emphasis on numeric prediction accuracy, even on "low damage events", can compromise the ranking and thus the resulting preventative actions. Similarly, existing ranking solutions in Learning to Rank focus on web page search domains with vastly different priorities from deploying preventative resources, as shown in our extensive studies. We thus define Cost-Sensitive Learning to Rank, a ranking problem of directly maximizing the cost-saving a company derives from a ranking model; our suite of Cost-Sensitive Adaptations adapts most existing ranking models to this new problem, achieving up to 20% more cost-saving than competitors for identifying forest fires, crimes, quality assurance issues, and outages. While outage prevention was limited by available resources, our second problem is constrained by meeting regulatory standards while minimizing expensive costs.
For example, BC Hydro's asset management for toxic electrical transformers must balance the cost of identification, destroying $100,000s worth of equipment to test for contaminated oil, against the risk of leaving a hazard unidentified: the health and environmental damage of an unlikely toxic transformer leak. Because of the extreme difficulty of estimating health/environmental risks as a dollar value, company decision-making is instead guided by the "clearance threshold", a maximum acceptable probability of toxicity for transformers with unknown outcomes. Such "under threshold" transformers are within safety guidelines and are cleared from expensive tests. Other domains, such as regulations on the maximum tolerable risk of device failures, rely on similar clearance thresholds to guide recalls or other actions. We therefore develop data analytics via clearance thresholds: Categorical Risk Clearance, which clears many future low-risk items to conserve expensive actions given a binary training set of hazards and non-hazards, and Numeric Risk Clearance, clearance with a numeric data set where an item is a hazard only if a numeric value exceeds a maximum, such as 50 parts per million of toxins. The Categorical version only considers the all-or-nothing hazard/non-hazard label, while the Numeric version can exploit the more informative numeric toxicity values. Because current classification or regression solutions assume inapplicable costs or fail to adapt to the specific goals of varying thresholds, we develop clearance-guided solutions: CUT+ for Categorical Risk Clearance and RiskClear for Numeric Risk Clearance; in experiments, our methods are more cost-effective in 56 of 57 tests across varying thresholds.
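The evaluation idea behind cost-sensitive ranking can be made concrete: with resources to act on only k items, the value of a ranking is the true cost averted by its top-k picks, compared against the best achievable. This is an illustrative sketch under assumed data, not the thesis's exact metric or models.

```python
def cost_saving_at_k(scores, true_costs, k):
    """Fraction of the maximum possible cost-saving captured by acting on
    the top-k items of the predicted ranking."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    saved = sum(true_costs[i] for i in top_k)          # cost averted by our picks
    ideal = sum(sorted(true_costs, reverse=True)[:k])  # best possible picks
    return saved / ideal

true_costs = [100.0, 5.0, 80.0, 1.0, 2.0]  # damage averted per item if acted on
good_rank = [0.9, 0.1, 0.8, 0.0, 0.2]      # ranks the costly items first
poor_rank = [0.1, 0.9, 0.2, 0.8, 0.7]      # accurate only on cheap items
print(cost_saving_at_k(good_rank, true_costs, 2))  # 1.0
print(cost_saving_at_k(poor_rank, true_costs, 2))  # far below 1.0
```

The second ranking may have decent average prediction accuracy on the cheap items yet deliver almost no cost-saving, which is the gap cost-sensitive training targets directly.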

Document type: 
Thesis
File(s): 
Senior supervisor: 
Ke Wang
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Accurate alignment-free inference of microbial phylogenies

Author: 
Date created: 
2019-08-21
Abstract: 

Classifying, clustering, or building a phylogeny on a set of genomes without the expensive computation of sequence alignment involves calculating pairwise distances by an appropriate metric. One such metric is the normalized compression distance (NCD), an approximation of the true information distance between two objects. Despite NCD's universal applicability, it has seen few applications in bioinformatics and, to the best of our knowledge, no existing tool applies NCD to whole-genome datasets. We introduce Sequence Non-Alignment Compression and Comparison (snacc), a pipeline specifically tailored for computing pairwise distances between genomic sequences. snacc employs the NCD with a variety of compression algorithms, alongside an integer linear programming approach for selecting a sequence's reverse complement. We investigate the use of snacc with five common compression algorithms, and apply it to several bacterial and viral datasets with varying properties. Our results show that snacc achieves comparable accuracy relative to other metrics, demonstrates a large improvement over previous NCD implementations, and can be successfully used to reconstruct microbial phylogenies. In addition, snacc is flexible enough to incorporate almost any compression algorithm in a simple manner. snacc is an open-source tool and is available at https://github.com/SweetiePi/snacc/.
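The NCD itself has a compact standard definition: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the compressed length under some real compressor. A minimal implementation, using zlib purely as a convenient stand-in compressor (snacc supports several):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using zlib compressed lengths."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

seq_a = b"ACGTACGTACGT" * 50
seq_b = b"ACGTACGTACGT" * 50   # identical content: distance near 0
seq_c = b"TTAGGCATCGAA" * 50   # equally repetitive but unrelated content
print(round(ncd(seq_a, seq_b), 3), round(ncd(seq_a, seq_c), 3))
```

The intuition: if y shares information with x, compressing their concatenation costs little more than compressing x alone, so the distance is small; unrelated sequences give a distance near 1.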

Document type: 
Thesis
File(s): 
Senior supervisor: 
Leonid Chindelevitch
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Enabling SQL-ML Explanation to Debug Training Data

Author: 
Date created: 
2019-08-22
Abstract: 

Supporting SQL-ML queries in database systems has recently attracted great attention in industry. A SQL-ML query treats ML models as user-defined functions and embeds them in a SQL query. Since ML models do not always produce perfect predictions, a user may find the answer to a SQL-ML query different from what she expects and ask the system to provide an explanation. Although SQL-only and ML-only explanation have been well studied in the literature, to the best of our knowledge, we are the first to study the SQL-ML explanation problem. This thesis makes two major contributions. First, we propose a formal definition of the SQL-ML explanation problem. Intuitively, our definition aims to trace the query answer back to the training data and identify a small number of training examples that have the biggest impact on the query answer. Second, we study how to extend existing explanation frameworks and discuss their limitations in solving our problem. To overcome these limitations, we propose InfComp, a novel influence-function-based approach for SQL-ML explanation. We find that InfComp is a powerful tool for debugging training data (i.e., detecting corrupted features and mislabeled instances). We conduct extensive experiments using three real applications (Entity Resolution, Image Classification, and Spam Detection) and compare with state-of-the-art approaches. Results show that InfComp can identify erroneous training examples more accurately than the baselines in an efficient manner.
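The influence-function idea behind this style of debugging can be sketched on a simple model (this is in the spirit of, not the implementation of, InfComp): for a ridge-regression model, rank training examples by their self-influence, gradᵢᵀ H⁻¹ gradᵢ, where H is the training-loss Hessian. Mislabeled points tend to have outsized self-influence. The data below is synthetic with one deliberately corrupted label.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
y[7] += 10.0                                   # plant one corrupted label

lam = 1e-3
theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)  # fit ridge model
H = 2 * (X.T @ X) / len(X) + 2 * lam * np.eye(3)             # loss Hessian

grads = 2 * (X @ theta - y)[:, None] * X       # per-example loss gradients
self_influence = np.einsum("ij,ij->i", grads @ np.linalg.inv(H), grads)
print(int(np.argmax(self_influence)))          # flags the corrupted example: 7
```

For models where the Hessian is expensive, practical systems approximate H⁻¹-vector products rather than inverting H explicitly; the tiny closed-form version here just makes the ranking criterion visible.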

Document type: 
Thesis
File(s): 
Senior supervisor: 
Jiannan Wang
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Missing MRI pulse sequence synthesis using multi-modal generative adversarial network

Author: 
Date created: 
2019-08-22
Abstract: 

Magnetic resonance imaging (MRI) is increasingly utilized to assess, diagnose, and plan treatment for a variety of diseases. The ability to visualize tissue in varied contrasts, in the form of MR pulse sequences in a single scan, provides valuable insights to physicians and enables automated systems to perform downstream analysis. However, many issues such as prohibitive scan time, image corruption, differing acquisition protocols, or allergies to certain contrast materials may hinder the process of acquiring multiple sequences for a patient. This poses challenges to both physicians and automated systems, since complementary information provided by the missing sequences is lost. In this thesis, we propose a variant of the generative adversarial network (GAN) capable of leveraging redundant information contained within multiple available sequences in order to generate one or more missing sequences for a patient scan. The proposed network is designed as a multi-input, multi-output network which combines information from all the available pulse sequences, implicitly infers which sequences are missing, and synthesizes the missing ones in a single forward pass. We demonstrate and validate our method on two brain MRI datasets, each with four sequences, and show the applicability of the proposed method in simultaneously synthesizing all missing sequences in any scenario where one, two, or three of the four sequences are missing. We compare our approach with competing unimodal and multi-modal methods, and show that it outperforms both quantitatively and qualitatively.
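The multi-input, multi-output structure can be illustrated with a toy forward pass (an illustration of the general shape, not the proposed GAN): each available sequence is encoded, the encodings are fused over available inputs only, and one decoder per sequence produces all outputs in a single pass. Missing inputs are zeroed, and an availability mask tells the fusion step to ignore them. The linear "encoders" and "decoders" here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n_seq, d = 4, 16                                    # four pulse sequences
enc = rng.normal(size=(n_seq, d, d)) / np.sqrt(d)   # per-sequence encoders
dec = rng.normal(size=(n_seq, d, d)) / np.sqrt(d)   # per-sequence decoders

def forward(seqs, available):
    """seqs: (n_seq, d) inputs (missing rows may hold anything);
    available: boolean mask of which sequences were acquired."""
    codes = np.stack([enc[i] @ (seqs[i] * available[i]) for i in range(n_seq)])
    fused = codes[available].mean(axis=0)           # fuse available encodings only
    return np.stack([dec[i] @ fused for i in range(n_seq)])  # synthesize all four

seqs = rng.normal(size=(n_seq, d))
available = np.array([True, True, False, True])     # third sequence missing
out = forward(seqs, available)
print(out.shape)                                    # (4, 16): all outputs produced
```

The key property mirrored from the abstract is that the same network handles any missingness pattern without retraining, since only the mask changes between scenarios.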

Document type: 
Thesis
File(s): 
Senior supervisor: 
Ghassan Hamarneh
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Understanding RNN States with predictive semantic encodings and adaptive representations

Author: 
Date created: 
2019-07-26
Abstract: 

Recurrent Neural Networks are an effective and prevalent tool for modelling sequential data such as natural language text. However, their deep nature and massive number of parameters pose a challenge for those intending to study precisely how they work. This is especially the case for researchers who have the expertise to understand the mathematics behind these models at a macroscopic level, but who often lack the tools to expose the microscopic details of the information they internally represent. We present a combination of visualization and analysis techniques that shows some of the inner workings of Recurrent Neural Networks and facilitates their study at a fine level of detail. Specifically, we use an auxiliary model to interpret the meaning of hidden states with respect to the task-level outputs. A visual encoding is designed for this model that can be quickly interpreted and relates to other elements of the visual design. We also introduce a consistent visual representation for vector data that adapts to the available visual space. When combined, these techniques provide unique insight into RNN behaviours, allowing both architectural and detail views to be visualized in concert. These techniques are leveraged in a fully interactive visualization tool which is demonstrated to improve our understanding of common Natural Language Processing tasks.
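The auxiliary-model idea is essentially probing: train a simple model to predict task-level outputs from hidden states, and treat high probe accuracy as evidence that the states encode that output. A minimal sketch on synthetic "hidden states" (not the thesis's model or data):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 32))            # stand-in RNN hidden states
w_true = rng.normal(size=32)
labels = (hidden @ w_true > 0).astype(float)   # task output encoded in the states

# Least-squares linear probe (the auxiliary model) from hidden state to output,
# fit against +/-1 targets and thresholded at zero.
w, *_ = np.linalg.lstsq(hidden, labels * 2 - 1, rcond=None)
acc = ((hidden @ w > 0) == labels.astype(bool)).mean()
print(acc)  # well above chance: the states are predictive of the output
```

In the visualization setting, the probe's per-state predictions, rather than just its aggregate accuracy, are what get mapped to a visual encoding alongside the network's architecture.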

Document type: 
Thesis
File(s): 
Senior supervisor: 
Fred Popowich
Steven Bergner
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Traffic-engineered distribution of multimedia content

Author: 
Date created: 
2019-08-15
Abstract: 

The amount of video traffic transmitted over the Internet has been steadily increasing over the past decade, owing to the recent interest in streaming high-definition (HD), 4K, and immersive videos to many users. Streaming such multimedia content at large scale faces multiple challenges. For example, streaming systems need to handle user heterogeneity and interactivity under varying network conditions. In addition, large-scale video streaming stresses the ISP network that carries the video traffic, because the ISP needs to carefully direct traffic flows through its network to satisfy various service-level agreements. This increases the complexity of managing the network and system resources. To address these challenges, we first propose a novel client-based rate adaptation algorithm for streaming multiview videos. Our algorithm achieves high video quality, smooth playback, and efficient bandwidth utilization. For example, it achieves up to 300% improvement in average quality compared to the algorithm used by YouTube. Then, we propose an algorithm to efficiently solve the resource management problem in emerging ISP-managed Content Distribution Networks (CDNs). Our solution achieves up to 64% reduction in inter-domain traffic. To reduce the network load of live streaming sessions, ISPs use multicast to efficiently carry the traffic through their networks. Finally, we propose a label-based multicast forwarding approach to implement traffic-engineered multicast trees in ISP networks. We prove that the proposed approach is scalable, as it imposes minimal state and processing overheads on routers. We implemented the proposed multicast approach in a high-speed network testbed, and our results show that it can support thousands of concurrent multicast sessions.
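The appeal of label-based multicast forwarding is that the tree travels with the packet, so routers keep no per-session state. A toy illustration of that general idea (not the thesis's encoding): the header is a nested list of labels, each naming an output link and the sub-header to send on it.

```python
def forward(router, header, delivered):
    """Simulate one packet traversing a multicast tree carried in its header.
    header: list of (next_hop, sub_header) labels for this router."""
    for next_hop, sub_header in header:
        if sub_header:                    # interior link: keep forwarding below
            forward(next_hop, sub_header, delivered)
        else:                             # empty sub-header: next hop is a leaf
            delivered.append(next_hop)

# Tree rooted at R1: R1 -> {A, R2}, R2 -> {B, C}
header = [("A", []), ("R2", [("B", []), ("C", [])])]
delivered = []
forward("R1", header, delivered)
print(delivered)  # ['A', 'B', 'C']
```

Because routers only read the labels in the packet, adding or tearing down a session touches no router tables, which is what makes thousands of concurrent sessions feasible; the practical cost shifts to header size, which a real encoding must keep small.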

Document type: 
Thesis
File(s): 
Senior supervisor: 
Mohamed Hefeeda
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.

Understanding help-seeking diversity in modern feature-rich applications

Author: 
Date created: 
2019-06-17
Abstract: 

For most modern feature-rich software, considerable external help and learning resources are available on the web (e.g., documentation, tutorials, videos, Q&A forums). But how do users new to an application discover and make use of such resources? We conducted in-lab and diary studies with 26 software newcomers from a variety of backgrounds who were all using Fusion 360, a 3D modeling application, for the first time. Our results illustrate newcomers’ diverse needs, perceptions, and help-seeking behaviors. We found a number of distinctions in how technical and non-technical users approached help-seeking, including: when and how they initiated the help-seeking process, their struggles in recognizing relevant help, the degree to which they made coordinated use of the application and different resources, and how they perceived the utility of different help formats. We also created customized visualizations of our newcomers’ navigation patterns between the 3D modeling application and help resources, and evaluated our initial visualization design with software developers. We discuss implications for moving beyond “one-size-fits-all” help resources towards more structured, personalized, and curated help and learning materials.

Document type: 
Thesis
File(s): 
Senior supervisor: 
Parmit Chilana
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.

Band and quality selection for efficient transmission of hyperspectral images

Date created: 
2019-08-14
Abstract: 

Due to recent technological advances in capturing and processing devices, hyperspectral imaging is becoming available for many commercial and military applications, such as remote sensing, surveillance, and forest fire detection. Hyperspectral cameras provide rich information, as they capture each pixel along many frequency bands of the spectrum. The large volume of hyperspectral images, as well as their high dimensionality, makes transmitting them over limited-bandwidth channels a challenge. To address this challenge, we present a method to prioritize the transmission of various components of hyperspectral data based on the application needs, the level of detail required, and the available bandwidth. This is unlike current works, which mostly assume offline processing and the availability of all data beforehand. Our method jointly and optimally selects the spectral bands and their qualities to maximize the utility of the transmitted data. It also enables progressive transmission of hyperspectral data, in which approximate results are obtained with a very small amount of data and can be refined with additional data. This is a desirable feature for large-scale hyperspectral imaging applications. We have implemented the proposed method and compared it against the state of the art in the literature using hyperspectral imaging datasets. Our experimental results show that the proposed method achieves high accuracy, transmits only a small fraction of the hyperspectral data, and significantly outperforms the state of the art; improvements in accuracy of up to 35% were achieved.
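The underlying selection problem can be framed concretely: each spectral band can be sent at one of several qualities, each with a transmission cost and a utility, and the goal is to maximize total utility within a bandwidth budget. The thesis solves this jointly and optimally; the sketch below instead uses a simple greedy upgrade heuristic, with made-up costs and utilities, purely to illustrate the band/quality trade-off.

```python
def select(options, budget):
    """options: {band: [(cost, utility), ...]} with qualities sorted by cost.
    Greedily apply the upgrade with the best utility gain per unit bandwidth."""
    chosen = {b: -1 for b in options}    # -1 means the band is not transmitted
    spent = 0.0
    while True:
        best, best_rate = None, 0.0
        for b, opts in options.items():
            i = chosen[b]
            if i + 1 < len(opts):        # a higher quality is still available
                dc = opts[i + 1][0] - (opts[i][0] if i >= 0 else 0.0)
                du = opts[i + 1][1] - (opts[i][1] if i >= 0 else 0.0)
                if spent + dc <= budget and du / dc > best_rate:
                    best, best_rate = b, du / dc
        if best is None:                 # no affordable upgrade remains
            return chosen, spent
        i = chosen[best]
        spent += options[best][i + 1][0] - (options[best][i][0] if i >= 0 else 0.0)
        chosen[best] += 1

options = {"band1": [(2.0, 5.0), (5.0, 7.0)],  # (cost, utility) per quality level
           "band2": [(3.0, 4.0)],
           "band3": [(4.0, 2.0)]}
chosen, spent = select(options, budget=8.0)
print(chosen, spent)  # band1 at high quality, band2 at low, band3 dropped
```

Stopping the loop early yields a coarser but usable selection, which is the essence of the progressive-transmission property: utility accumulates with every increment of bandwidth spent.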

Document type: 
Thesis
File(s): 
Senior supervisor: 
Mohamed Hefeeda
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.