Computing Science, School of

Receive updates for this collection

Aristotle: Stratified Causal Discovery for Omics Data

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2022-01-15
Abstract: 

Background

There has been a simultaneous increase in demand and accessibility across genomics, transcriptomics, proteomics and metabolomics data, known as omics data. This has encouraged widespread application of omics data in life sciences, from personalized medicine to the discovery of underlying pathophysiology of diseases. Causal analysis of omics data may provide important insight into the underlying biological mechanisms. Existing causal analysis methods yield promising results when identifying potential general causes of an observed outcome based on omics data. However, they may fail to discover the causes specific to a particular stratum of individuals and missing from others.

Methods

To fill this gap, we introduce the problem of stratified causal discovery and propose a method, Aristotle, for solving it. Aristotle addresses the two challenges intrinsic to omics data: high dimensionality and hidden stratification. It employs existing biological knowledge and a state-of-the-art patient stratification method to tackle the above challenges and applies a quasi-experimental design method to each stratum to find stratum-specific potential causes.

Results

Evaluation based on synthetic data shows better performance for Aristotle in discovering true causes under different conditions compared to existing causal discovery methods. Experiments on a real dataset on Anthracycline Cardiotoxicity indicate that Aristotle’s predictions are consistent with the existing literature. Moreover, Aristotle makes additional predictions that suggest further investigations.

Document type: 
Article
File(s): 

Investigating the Role of Culture on Negative Emotion Expressions in the Wild

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2021-12-08
Abstract: 

Even though culture has been found to play some role in negative emotion expression, affective computing research primarily takes on a basic emotion approach when analyzing social signals for automatic emotion recognition technologies. Furthermore, automatic negative emotion recognition systems still train data that originates primarily from North America and contains a majority of Caucasian training samples. As such, the current study aims to address this problem by analyzing what the differences are of the underlying social signals by leveraging machine learning models to classify 3 negative emotions, contempt, anger and disgust (CAD) amongst 3 different cultures: North American, Persian, and Filipino. Using a curated data set compiled from YouTube videos, a support vector machine (SVM) was used to predict negative emotions amongst differing cultures. In addition a one-way ANOVA was used to analyse the differences that exist between each culture group in-terms of level of activation of underlying social signal. Our results not only highlighted the significant differences in the associated social signals that were activated for each culture, but also indicated the specific underlying social signals that differ in our cross-cultural data sets. Furthermore, the automatic classification methods showed North American expressions of CAD to be well-recognized, while Filipino and Persian expressions were recognized at near chance levels.

Document type: 
Article
File(s): 

INGOT-DR: An Interpretable Classifier for Predicting Drug Resistance in M. Tuberculosis

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2021-08-10
Abstract: 

Motivation

Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data.

Contribution

In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions, interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time.

Results

We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that it has a higher or comparable accuracy to that of commonly used machine learning models, and is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via The Python Package Index (Pypi) under ingotdr. This package is also compatible with most of the tools in the Scikit-learn machine learning library.

Document type: 
Article
File(s): 

A Deeper Look at Autonomous Vehicle Ethics: An Integrative Ethical Decision-Making Framework to Explain Moral Pluralism

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2021-05-04
Abstract: 

The autonomous vehicle (AV) is one of the first commercialized AI-embedded robots to make autonomous decisions. Despite technological advancements, unavoidable AV accidents that result in life-and-death consequences cannot be completely eliminated. The emerging social concern of how an AV should make ethical decisions during unavoidable accidents is referred to as the moral dilemma of AV, which has promoted heated discussions among various stakeholders. However, there are research gaps in explainable AV ethical decision-making processes that predict how AVs’ moral behaviors are made that are acceptable from the AV users’ perspectives. This study addresses the key question: What factors affect ethical behavioral intentions in the AV moral dilemma? To answer this question, this study draws theories from multidisciplinary research fields to propose the “Integrative ethical decision-making framework for the AV moral dilemma.” The framework includes four interdependent ethical decision-making stages: AV moral dilemma issue framing, intuitive moral reasoning, rational moral reasoning, and ethical behavioral intention making. Further, the framework includes variables (e.g., perceived moral intensity, individual factors, and personal moral philosophies) that influence the ethical decision-making process. For instance, the framework explains that AV users from Eastern cultures will tend to endorse a situationist ethics position (high idealism and high relativism), which views that ethical decisions are relative to context, compared to AV users from Western cultures. This proposition is derived from the link between individual factors and personal moral philosophy. Moreover, the framework proposes a dual-process theory, which explains that both intuitive and rational moral reasoning are integral processes of ethical decision-making during the AV moral dilemma. Further, this framework describes that ethical behavioral intentions that lead to decisions in the AV moral dilemma are not fixed, but are based on how an individual perceives the seriousness of the situation, which is shaped by their personal moral philosophy. This framework provides a step-by-step explanation of how pluralistic ethical decision-making occurs, reducing the abstractness of AV moral reasoning processes.

Document type: 
Article
File(s): 

Predicting the Clinical Management of Skin Lesions Using Deep Learning

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2021-04-08
Abstract: 

Automated machine learning approaches to skin lesion diagnosis from images are approaching dermatologist-level performance. However, current machine learning approaches that suggest management decisions rely on predicting the underlying skin condition to infer a management decision without considering the variability of management decisions that may exist within a single condition. We present the first work to explore image-based prediction of clinical management decisions directly without explicitly predicting the diagnosis. In particular, we use clinical and dermoscopic images of skin lesions along with patient metadata from the Interactive Atlas of Dermoscopy dataset (1011 cases; 20 disease labels; 3 management decisions) and demonstrate that predicting management labels directly is more accurate than predicting the diagnosis and then inferring the management decision (13.73±3.93% and 6.59±2.86% improvement in overall accuracy and AUROC respectively), statistically significant at p<0.001. Directly predicting management decisions also considerably reduces the over-excision rate as compared to management decisions inferred from diagnosis predictions (24.56% fewer cases wrongly predicted to be excised). Furthermore, we show that training a model to also simultaneously predict the seven-point criteria and the diagnosis of skin lesions yields an even higher accuracy (improvements of 4.68±1.89% and 2.24±2.04% in overall accuracy and AUROC respectively) of management predictions. Finally, we demonstrate our model’s generalizability by evaluating on the publicly available MClass-D dataset and show that our model agrees with the clinical management recommendations of 157 dermatologists as much as they agree amongst each other.

Document type: 
Article
File(s): 

Introduction to Causality

Peer reviewed: 
No, item is not peer reviewed.
Date created: 
2020-12-18
Abstract: 

In scientific discovery, engineering, imaging, and machine learning, it is often critical to understand what causes an event or observation, rather than focusing on correlation/association alone. In order to make this complex topic more accessible I would like to share what I learned on how causality can be applied and what concepts are essential in doing so. I will introduce the graphical causal model (Bayesian network), and show how you can translate human intuition on causality into formal axioms that fuse the causal graph with the probability space from observed events. We will discuss counterfactual causality, and end with an overview of recommendations on how to use causality in practice as well as current open issues and relevant papers that tackle those questions.

Document type: 
Lecture / Talk
File(s): 

HASLR: Fast Hybrid Assembly of Long Reads

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2020-08-21
Abstract: 

Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads formany of the analysis tasks. Here,we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.

Document type: 
Article
File(s): 

Deep Learning for Edge Computing Applications: A State-of-the-Art Survey

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2020-03-23
Abstract: 

With the booming development of Internet-of-Things (IoT) and communication technologies such as 5G, our future world is envisioned as an interconnected entity where billions of devices will provide uninterrupted service to our daily lives and the industry. Meanwhile, these devices will generate massive amounts of valuable data at the network edge, calling for not only instant data processing but also intelligent data analysis in order to fully unleash the potential of the edge big data. Both the traditional cloud computing and on-device computing cannot sufficiently address this problem due to the high latency and the limited computation capacity, respectively. Fortunately, the emerging edge computing sheds a light on the issue by pushing the data processing from the remote network core to the local network edge, remarkably reducing the latency and improving the efficiency. Besides, the recent breakthroughs in deep learning have greatly facilitated the data processing capacity, enabling a thrilling development of novel applications, such as video surveillance and autonomous driving. The convergence of edge computing and deep learning is believed to bring new possibilities to both interdisciplinary researches and industrial applications. In this article, we provide a comprehensive survey of the latest efforts on the deep-learning-enabled edge computing applications and particularly offer insights on how to leverage the deep learning advances to facilitate edge applications from four domains, i.e., smart multimedia, smart transportation, smart city, and smart industry. We also highlight the key research challenges and promising research directions therein. We believe this survey will inspire more researches and contributions in this promising field.

Document type: 
Article
File(s): 

Deconvoluting the Diversity of Within-host Pathogen Strains in a Multi-locus Sequence Typing Framework

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2019-12-17
Abstract: 

Background

Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging.

Results

We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen.

Conclusions

Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.

Document type: 
Article
File(s): 

A New Resolution Function to Evaluate Tree Shape Statistics

Peer reviewed: 
Yes, item is peer reviewed.
Date created: 
2019-11-21
Abstract: 

Phylogenetic trees are frequently used in biology to study the relationships between a number of species or organisms. The shape of a phylogenetic tree contains useful information about patterns of speciation and extinction, so powerful tools are needed to investigate the shape of a phylogenetic tree. Tree shape statistics are a common approach to quantifying the shape of a phylogenetic tree by encoding it with a single number. In this article, we propose a new resolution function to evaluate the power of different tree shape statistics to distinguish between dissimilar trees. We show that the new resolution function requires less time and space in comparison with the previously proposed resolution function for tree shape statistics. We also introduce a new class of tree shape statistics, which are linear combinations of two existing statistics that are optimal with respect to a resolution function, and show evidence that the statistics in this class converge to a limiting linear combination as the size of the tree increases. Our implementation is freely available at https://github.com/WGS-TB/TreeShapeStats.

Document type: 
Article
File(s):