
# Computing Science - Theses, Dissertations, and other Required Graduate Degree Essays

## Localizing violations of approximate constraints for data error detection

Error detection is key for data quality management. Leveraging domain knowledge in the form of user-specified constraints is one of the major approaches to error detection. A recent trend in error detection has been to utilize approximate constraints (ACs), which a relation is expected to satisfy only to a certain degree rather than completely. An example is the recently introduced class of statistical constraints, which allow the user to specify which correlations among attributes they expect to be present or absent in the data. Statistical constraints allow the user to express a broad range of statistical and causal domain knowledge. Extensive empirical investigations indicate that even traditional integrity constraints, such as functional dependencies, hold only approximately in real-world datasets. Approximate functional dependencies (AFDs) have been a data cleaning tool for some time. This thesis introduces a new technique for enhancing error detection with approximate constraints. Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient algorithm, based on a recursive tree partitioning scheme, for identifying distinct data regions that violate given ACs to different degrees. The learned trees describe the error regions in terms of data attributes that are easily interpreted by users (e.g., all records before 2003), which helps explain to the user why some records were identified as likely errors. After identifying error regions, we can apply error detection methods to each error region separately, rather than to the dataset as a whole.
Our empirical evaluation, conducted on four datasets containing both real-world and synthetic errors, shows that identifying error regions increases both the precision and recall of error detection based on ACs. Error regions can be combined not only with constraint-based error detection, but also with other approaches, such as those based on machine learning. Our experiments provide evidence that error regions boost the performance of machine learning methods as well.
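As a concrete illustration of the context-sensitivity the thesis exploits, the sketch below measures how strongly an approximate functional dependency X → Y is violated (using the classical g3 measure: the fraction of rows that must be removed for the FD to hold exactly) and flags sub-populations whose violation degree exceeds that of the data as a whole. The single-attribute split and all names here are illustrative simplifications, not the thesis's actual tree-partitioning algorithm.

```python
from collections import Counter, defaultdict

def afd_violation(rows, x, y):
    """g3 measure: fraction of rows to remove so that X -> Y holds exactly."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[x]][r[y]] += 1
    # For each X-value, keep the rows agreeing with the majority Y-value.
    keep = sum(c.most_common(1)[0][1] for c in groups.values())
    return 1 - keep / len(rows)

def error_regions(rows, x, y, split_attr, threshold):
    """Flag sub-populations (one per split_attr value) whose AFD violation
    exceeds the whole-dataset violation by more than `threshold`."""
    base = afd_violation(rows, x, y)
    regions = defaultdict(list)
    for r in rows:
        regions[r[split_attr]].append(r)
    out = {}
    for value, sub in regions.items():
        viol = afd_violation(sub, x, y)
        if viol > base + threshold:
            out[value] = viol
    return out
```

For instance, if the dependency zip → city holds well overall but badly among pre-2003 records, splitting on the year attribute surfaces that sub-population as an error region.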

## VisR: An interactive visualization framework for analysis of sequencing data

Several tools have been developed to enable biologists to perform initial browsing and exploration of sequencing data. However, the computational tool set for further analyses often requires significant computational expertise, and many of the biologists with the knowledge needed to interpret these data must rely on programming experts. In this thesis, we address this limitation through visualization tools for exploratory analysis of sequencing data and contribute the design and development of two novel systems that are flexible enough to allow a high degree of analysis power while remaining easy to use for non-programmers: (1) a general-purpose framework that bridges the gap between biologists and bioinformaticians through a system of visual analysis modules that can be rapidly developed and connected together, and (2) a first-of-its-kind system that facilitates visual parameter space analysis for a wide variety of computer models. We start by providing a characterization of the data and an abstraction of the domain tasks in the field of epigenetics, and present a design study on the development and evaluation of ChAsE, an interactive tool that facilitates the analysis and visualization of epigenetic datasets. We then discuss VisR, a general framework for the analysis of sequencing datasets that provides a computationally rich yet accessible environment for integrative and interactive analyses through modules called R-apps, which utilize packages in R and repositories such as Bioconductor. Our framework provides means for interactive exploration of the results of the R-apps and supports linking apps to create more complex workflows. It also provides an ecosystem for extending and sharing apps. We finally present ModEx, a general-purpose system for exploring the parameters of a variety of computer models. We discuss how the system offers the key components of visual parameter space analysis frameworks, including parameter sampling, derivation of output summaries, and an interactive and customizable exploration interface, and explain how it can be used for rapid development of custom solutions for different application domains.

## Service chaining for multicast traffic

Multicast service chaining refers to the orchestration of network services for multicast traffic. The paths of a multicast session that span the source, destinations, and required services form a complex structure that we refer to as the multicast distribution graph. In this thesis, we propose a new path-based algorithm, called Oktopus, that runs in the control plane of the ISP network to calculate the multicast distribution graph for a given session. Oktopus aims to minimize the routing cost of each multicast session, and it consists of two steps. The first step generates a set of network segments for the ISP network, and the second uses these segments to efficiently calculate the multicast distribution graph. Oktopus has fine-grained control over the selection of links in the distribution graphs, which leads to significant improvements in the quality of the calculated graphs. Specifically, Oktopus increases the number of allocated sessions because it can reach ISP locations that have the required services and include them in the calculated graph. Moreover, Oktopus can reduce the routing cost per session because it carefully chooses the links belonging to the graph. We compared Oktopus against the optimal algorithm and the closest existing algorithms in simulations using real ISP topologies. Our results show that Oktopus has an optimality gap of 5% on average, and that it computes distribution graphs multiple orders of magnitude faster than the optimal algorithm. Moreover, Oktopus outperforms the closest algorithm in the literature in terms of the number of allocated multicast sessions by up to 37%.
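To make the service-chaining idea concrete, here is a toy illustration (not the Oktopus algorithm itself): routing from a source through an ordered service chain to a destination can be reduced to a shortest-path query on a layered copy of the network, where layer i means "the first i services have already been applied". The graph, service placement, and all names below are invented for the example.

```python
import heapq

def dijkstra(adj, src):
    """Standard Dijkstra over an adjacency dict {node: [(nbr, weight)]}."""
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, prev

def chain_route(adj, services, src, dst):
    """Min-cost path src -> dst that visits `services` in order.
    services[i] is the node hosting the i-th required service."""
    k = len(services)
    layered = {}
    for u, nbrs in adj.items():                     # copy the graph per layer
        for i in range(k + 1):
            layered.setdefault((u, i), []).extend(((v, i), w) for v, w in nbrs)
    for i, s in enumerate(services):                # zero-cost "apply service" edge
        layered.setdefault((s, i), []).append(((s, i + 1), 0))
    dist, prev = dijkstra(layered, (src, 0))
    node, path = (dst, k), [(dst, k)]
    while node != (src, 0):
        node = prev[node]
        path.append(node)
    nodes = [u for u, _ in reversed(path)]
    # collapse the duplicate node produced by each zero-cost service edge
    return dist[(dst, k)], [u for i, u in enumerate(nodes)
                            if i == 0 or u != nodes[i - 1]]
```

In a real system such as the one described here, the per-destination paths would be merged into a single distribution graph and link sharing across destinations would be exploited; this sketch only shows the per-path building block.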

## Hierarchical action classification with network pruning

Research on human action classification has made significant progress in the past few years. Most deep learning methods focus on improving performance by adding more network components. We propose, instead, to better utilize auxiliary mechanisms, including hierarchical classification, network pruning, and skeleton-based preprocessing, to boost model robustness and performance. We test the effectiveness of our method on five commonly used datasets: NTU RGB+D 60, NTU RGB+D 120, Northwestern-UCLA Multiview Action 3D, UTD Multimodal Human Action Dataset, and Kinetics 400, the last of which is substantially more challenging than the others. Our experiments show that our method achieves comparable or better performance on the first four datasets. In particular, our method establishes a new baseline for NTU 120, the largest of the first four. We also analyze our method with extensive comparisons and ablation studies.
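Of the auxiliary mechanisms mentioned above, network pruning is easy to illustrate in isolation. The sketch below shows generic magnitude-based pruning, where the smallest-magnitude weights are zeroed out; it is a standard textbook formulation, not the thesis's specific pruning scheme.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute value (ties at the threshold are pruned as well)."""
    k = int(sparsity * len(weights))
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

In practice the surviving weights are then fine-tuned, and the sparsity level is chosen per layer; those steps are omitted here.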

## An intermediate representation for transforming and optimizing the microarchitecture of application accelerators

In recent years, the computing landscape has seen a shift towards specialized accelerators, since the scaling of computational capacity is no longer guaranteed for every technology generation. Reconfigurable architectures like FPGAs are promising for accelerator implementation. FPGAs allow implementation of arbitrary logic functions for different classes of applications with better performance per watt than CPUs and GPUs, without re-spinning circuits like fixed-function ASICs. Unfortunately, the software programmer community has stayed away from this technology, primarily because of the abstraction gap that exists between software languages and hardware design. Hardware description languages (HDLs) are very low-level, and a hardware designer must think about the design in terms of low-level building blocks such as gates and registers. The alternative to HDLs is high-level synthesis (HLS). HLS frameworks synthesize hardware from a high-level description of an algorithm in the form of untimed mathematical expressions and nested, pipelined, and parallel loops in software languages. The primary limitation of HLS is that functionality and microarchitecture are conflated in a single language. As a result, making changes to the accelerator design requires code restructuring, with microarchitecture optimizations constrained by program correctness. In this thesis, we propose two new abstractions to decouple functionality from microarchitecture. The first abstraction is a hierarchical intermediate representation for describing parameterized accelerator microarchitecture, targeting reconfigurable architectures. In this abstraction, we represent the accelerator as a concurrent structural graph in which components roughly correspond to microarchitecture-level hardware blocks. We describe the methods we used to lower the entire application graph into a parameterized intermediate hardware abstraction, μIR. We describe the implementation of this intermediate abstraction and an associated pass framework, μopt. We then discuss some of the compiler optimizations that μIR enables, including timing, spatial, and higher-order optimizations. The final system is a compiler stack that can take a high-level program as input and translate it into an optimized, synthesizable hardware design. The second abstraction is a sequence of instructions that conveys the producer-consumer locality between dependent instructions. We show that this new abstraction adapts to different acceleration behaviors by varying the length and the types of fused instructions.
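To convey the flavor of a structural, graph-based IR with an analysis pass running over it, here is a minimal sketch. The classes and the pass are our own invented illustration and do not reflect the actual μIR/μopt implementation or its component vocabulary.

```python
class Node:
    """One structural component (roughly, a hardware block) in the graph."""
    def __init__(self, name, op, inputs=()):
        self.name, self.op, self.inputs = name, op, list(inputs)

class Graph:
    """A concurrent structural graph: named nodes wired by input lists."""
    def __init__(self):
        self.nodes = {}
    def add(self, name, op, inputs=()):
        self.nodes[name] = Node(name, op, inputs)
        return self.nodes[name]

def fanout_pass(graph):
    """A trivial analysis pass: count the consumers of each node — the
    kind of structural fact a microarchitectural optimizer might use,
    e.g. to size output buffers or duplicate high-fanout producers."""
    fanout = {n: 0 for n in graph.nodes}
    for node in graph.nodes.values():
        for src in node.inputs:
            fanout[src] += 1
    return fanout
```

A real pass framework would schedule many such passes over the graph and rewrite it in place; this only shows the graph-plus-pass shape.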

## Partitioning cographs into forests and independent sets

To determine the chromatic number of a graph, we seek to partition the vertices into a minimum number of independent sets. Similarly, for arboricity, we seek to partition the vertices into a minimum number of sets, each of which induces a forest. Both problems partition the vertices into sets that induce a sparse subgraph, and both are NP-hard in general but solvable in polynomial time on cographs. In this thesis, we consider a mixed problem, in which a graph is partitioned into p forests and q independent sets. It is known that for each p and q, the partition problem has a finite complete set of minimal cograph obstructions. For the cases p = 0 and p = 1, a minimal obstruction characterization of (p, q)-partitionability of cographs was previously known. However, it was also known that the number of minimal obstructions grows exponentially with p. We consider the next case, p = 2 and q = 1, and provide a complete list of minimal cograph obstructions. We also provide polynomial-time certifying algorithms for the cases p = 1 with any q, and p = 2 with q = 1. We also consider a vertex deletion version of the partition problem, in which r vertices may be deleted so that the remaining graph admits a partition into p forests and q independent sets. For this problem, we provide a complete list of minimal cograph obstructions when p = q = r = 1, and when p = r = 1, q = 2.
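Checking the "yes" side of a certificate for this problem is straightforward, which is part of what makes certifying algorithms attractive. The sketch below verifies that a proposed partition really is a (p, q)-partition: each forest part must induce an acyclic subgraph and each independent part must contain no edge. This is a certificate checker only, not the thesis's obstruction-based algorithm.

```python
def induces_forest(edges, part):
    """Union-find acyclicity check on the subgraph induced by `part`."""
    parent = {v: v for v in part}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    for u, v in edges:
        if u in parent and v in parent:
            ru, rv = find(u), find(v)
            if ru == rv:
                return False                # edge closes an induced cycle
            parent[ru] = rv
    return True

def is_independent(edges, part):
    return not any(u in part and v in part for u, v in edges)

def is_pq_partition(edges, forests, independents):
    """True iff every forest part induces a forest and every
    independent part is an independent set."""
    return (all(induces_forest(edges, set(f)) for f in forests) and
            all(is_independent(edges, set(s)) for s in independents))
```

For example, the 4-cycle admits a (1, 1)-partition by placing three vertices in the forest part and the remaining vertex in the independent part.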

## Inverse reinforcement learning for team sports: Valuing actions and players

A major task of sports analytics is to rank players based on the impact of their actions. Recent methods have applied reinforcement learning (RL) to assess the value of actions from a learned action value or Q-function. A fundamental challenge for estimating action values is that explicit reward signals (goals) are very sparse in many team sports, such as ice hockey and soccer. This paper combines Q-function learning with inverse reinforcement learning (IRL) to provide a novel player ranking method. We treat professional play as expert demonstrations for learning an implicit reward function. Our method alternates single-agent IRL to learn a reward function for multiple agents; we provide a theoretical justification for this procedure. Knowledge transfer is used to combine learned rewards and observed rewards from goals. Empirical evaluation, based on 4.5M play-by-play events in the National Hockey League (NHL), indicates that player ranking using the learned rewards achieves high correlations with standard success measures and temporal consistency throughout a season.
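Illustrative only: the thesis combines neural Q-function learning with an alternating multi-agent IRL procedure, but the core idea of mixing a learned reward with the sparse observed goal reward inside a value update can be shown in a tabular sketch. All names, the action set, and the mixing weight below are invented for the example.

```python
def q_update(Q, events, actions, learned_reward,
             alpha=0.1, gamma=0.9, mix=0.5):
    """One pass of tabular Q-learning over play-by-play events.
    events: list of (state, action, goal_reward, next_state) tuples.
    The per-event reward mixes the sparse goal signal with an
    IRL-style learned reward for the (state, action) pair."""
    for s, a, r_goal, s2 in events:
        r = mix * r_goal + (1 - mix) * learned_reward.get((s, a), 0.0)
        best_next = max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))
    return Q
```

With the learned reward densifying the signal, actions far from goals (zone entries, passes) still receive credit, which is what makes the resulting Q-values usable for ranking players.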

## User-assisted video reflection removal

Reflections in videos are obstructions that often occur when videos are taken behind reflective surfaces such as glass. These reflections reduce the quality of such videos, lead to information loss, and degrade the accuracy of many computer vision algorithms. A video containing reflections is a combination of a background layer and a reflection layer; reflection removal is thus equivalent to decomposing the video into two layers. This problem is ill-posed, as there is an infinite number of valid decompositions. To address this, we propose a user-assisted approach for video reflection removal. We rely on both spatial and temporal information and utilize sparse user hints to improve the separation. The key idea of the proposed method is to use motion cues to separate the background layer from the reflection layer with minimal user assistance. We show that user assistance significantly improves the layer separation results. We implement and validate the proposed method through quantitative and qualitative results on real and synthetic videos. Our experiments show that the proposed method successfully removes reflections from video sequences, does not introduce visual distortions, and significantly outperforms state-of-the-art reflection removal methods in the literature.
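The two-layer model underlying this decomposition is commonly written additively (a standard formulation in the reflection-removal literature; the notation is ours, not necessarily the thesis's):

```latex
I_t(x) = B_t(x) + R_t(x), \qquad t = 1, \dots, T
```

where $I_t$ is the observed frame at time $t$, $B_t$ the background layer, and $R_t$ the reflection layer. The ill-posedness is immediate: for any candidate $B_t$, setting $R_t = I_t - B_t$ yields a valid decomposition, which is why additional cues such as differing layer motion and sparse user hints are needed to pick out the intended one.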

## MCS2: Minimal coordinated supports for fast enumeration of minimal cut sets in metabolic networks

Constraint-based modeling of metabolic networks helps researchers gain insight into the metabolic processes of many organisms, both prokaryotic and eukaryotic. Minimal Cut Sets (MCSs) are minimal sets of reactions whose inhibition blocks a target reaction in a metabolic network. Most approaches for finding the MCSs in constraint-based models require, either as an intermediate step or as a byproduct of the calculation, the computation of the set of elementary flux modes (EFMs), a convex basis for the valid flux vectors in the network. Recently, Ballerstein et al. proposed a method for computing the MCSs of a network without first computing its EFMs, by creating a dual network whose EFMs are a superset of the MCSs of the original network. However, their dual network is always larger than the original network and depends on the target reaction. Here we propose the construction of a different dual network for the same purpose, one that is typically smaller than the original network and is independent of the target reaction. We prove the correctness of our approach, MCS2, and describe how it can be modified to compute the few smallest MCSs for a given target reaction.
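The cut-set notion has a standard formal statement, reproduced here to fix ideas (our notation, not necessarily the thesis's: $S$ is the stoichiometric matrix, $n$ the number of reactions, and $\operatorname{supp}(v)$ the set of reactions carrying nonzero flux in $v$):

```latex
\begin{align*}
\mathcal{F} &= \{\, v \in \mathbb{R}^n : S v = 0,\; v_i \ge 0 \text{ for all irreversible reactions } i \,\} \\
C \text{ is a cut set for target } r &\iff \forall v \in \mathcal{F}:\; v_r \neq 0 \Rightarrow \operatorname{supp}(v) \cap C \neq \emptyset
\end{align*}
```

$C$ is a *minimal* cut set if, in addition, no proper subset of $C$ is a cut set for $r$; inhibiting the reactions in $C$ then blocks every steady-state flux through the target.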

## Genotyping and copy number analysis of immunoglobulin heavy chain variable genes using long reads

One of the remaining challenges in describing an individual's genetic variation lies in highly heterogeneous and complex genomic regions, which impede the use of classical reference-guided mapping and assembly approaches. One such region is the immunoglobulin heavy chain locus (IGH), which is critical for the development of antibodies and the immune system. Presented here is ImmunoTyper, the first PacBio-based genotyping and copy-number calling tool specifically designed for IGH V genes (IGHV). ImmunoTyper's multi-stage clustering and combinatorial optimization approach is demonstrated to be the most comprehensive IGHV genotyping approach published to date, through validation using gold-standard IGH reference sequence. This preliminary work establishes the feasibility of fine-grained genotype and copy number analysis using error-prone long reads in complex multi-gene loci, and opens the door for in-depth investigation into IGHV heterogeneity using accessible and increasingly common whole-genome sequencing data.