Inference of tumor subclonal composition and evolution by the use of single-cell and bulk DNA sequencing data

Resource type
Thesis type
(Thesis) Ph.D.
Date created
Cancer is a genetic disease characterized by the emergence of genetically distinct populations of cells (subclones) through the random acquisition of mutations at the level of single cells, and by shifting prevalences at the subclone level through selective advantages conferred by driver mutations. This interplay creates complex mixtures of tumor cell populations which exhibit different susceptibility to targeted cancer therapies and are suspected to be the cause of treatment failure. Therefore, it is of great interest to obtain a better understanding of the evolutionary histories of individual tumors and their subclonal composition. In this thesis, we present three methods for the inference of tumor subclonal composition and evolution by the use of bulk and/or single-cell DNA sequencing data. First, we present CTPsingle, a method which aims to infer tumor subclonal composition from single-sample bulk sequencing data. CTPsingle consists of two steps: (i) robust clustering of mutations using beta-binomial mixture modelling and (ii) inference of tumor phylogenies by the use of integer linear programming. On simulated data, we show that CTPsingle is able to infer the purity and the clonality of single-sample tumors with high accuracy even when restricted to a coverage depth as low as 30x. CTPsingle is currently used to infer clonality as a part of the Evolution and Heterogeneity Working Group of the Pan-Cancer Analysis of Whole Genomes project, where sequencing data of over 2700 tumors are analyzed. Next, we present B-SCITE, the first available computational approach that infers tumor phylogenies from combined single-cell and bulk sequencing data. B-SCITE is a probabilistic method which searches for the tumor phylogenetic tree maximizing the joint likelihood of the two data types. Tree search in B-SCITE is performed by the use of a customized MCMC search over the space of labeled rooted trees.
Using a comprehensive set of simulated data, we show that B-SCITE systematically outperforms existing methods with respect to tree reconstruction accuracy and subclone identification. On real tumor data, mutation histories generated by B-SCITE show high concordance with expert-generated trees. In the third part, we introduce PhISCS, the first method which integrates single-cell and bulk sequencing data while accounting for the possible existence of mutations affected by undetected copy number aberrations, as well as mutations for which the commonly used and recently debated Infinite Sites Assumption is violated. PhISCS is a combinatorial method and, in contrast to the available alternatives, which are mostly based on probabilistic search schemes, it can provide a guarantee of optimality for the reported solutions. We provide two different implementations of PhISCS: (i) an implementation based on integer linear programming and (ii) an implementation based on constraint satisfaction programming. We show that the latter has a lower running time on most of the instances that we used to assess the performance of the two implementations. These results suggest that in some applications constraint satisfaction programming might be a viable alternative to the commonly used integer linear programming. We also demonstrate the utility of PhISCS in analyzing real sequencing data, where it reports more plausible and parsimonious tumor phylogenies than the available alternatives.
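As an illustration of what a violation of the Infinite Sites Assumption looks like in single-cell data, the sketch below applies the standard three-gamete test to a noise-free binary genotype matrix (rows are cells, columns are mutations). A pair of mutation columns conflicts when some cells carry both mutations, some carry only the first, and some carry only the second; such a pair cannot be placed on a tree in which every mutation is gained exactly once and never lost. This is not the PhISCS formulation, which the abstract does not describe; the function name `find_conflict` and the example matrices are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def find_conflict(matrix):
    """Return the first pair of mutation columns (i, j) that violates the
    three-gamete rule, or None if no pair does.

    matrix: list of rows, one per cell, each a list of 0/1 mutation calls.
    Assumes noise-free calls with no missing entries (an illustrative
    simplification; real single-cell data has false negatives and dropouts).
    """
    if not matrix:
        return None
    n_mutations = len(matrix[0])
    forbidden = {(1, 1), (1, 0), (0, 1)}
    for i, j in combinations(range(n_mutations), 2):
        # Collect the distinct (mutation i, mutation j) patterns seen across cells.
        seen = {(row[i], row[j]) for row in matrix}
        if forbidden <= seen:
            return (i, j)
    return None


# Hypothetical example: mutation 0 is clonal, mutations 1 and 2 define two branches.
clean = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]

# Same matrix with one entry flipped, so mutations 0 and 1 now conflict.
noisy = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]

print(find_conflict(clean))
print(find_conflict(noisy))
```

A matrix that passes this test admits a perfect phylogeny. When a conflict is found, it may reflect sequencing noise, an undetected copy number aberration, or a genuine violation of the assumption, which are the cases the abstract says PhISCS is designed to account for.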
Copyright statement
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Chindelevitch, Leonid
Thesis advisor: Sahinalp, S. Cenk
Member of collection