Tree shape statistics and their applications

Resource type
Thesis type
(Thesis) Ph.D.
Date created
Phylogenetic trees are frequently used in biology to study the relationships among several species or organisms. The shape of a phylogenetic tree contains useful information about patterns of speciation and extinction, so powerful tools are needed to investigate the shape of a phylogenetic tree. Tree shape statistics are a common approach to quantify the shape of a phylogenetic tree by encoding it with a single number. Tree shape statistics such as the Sackin and Colless indices have been used in a variety of contexts. Their applications start from differentiating between trees conforming to different stochastic evolution models to more recent developments in phylodynamics, where the tree structures are used to predict short-term growth and fitness. In contrast to the vast application range of tree shape statistics, existing statistics often do not suffice to distinguish between essential scenarios such as trees corresponding to different viral pathogens or different geographical scales for the same pathogen. In this dissertation, we study tree shape statistics from different aspects. First, we propose a new resolution function to evaluate the power of different tree shape statistics to distinguish between dissimilar trees. Second, we propose two classes of new tree shape statistics. For the first one, we use network science, a well-developed branch of data science, to inspire five novel tree shape statistics. For the second one, we introduce a linear combination of two existing statistics that are optimal with respect to a resolution function and show evidence that the statistics in this class converge to a limiting linear combination as the size of the tree increases. Lastly, we investigate the problem of using tree shape statistics and machine learning tools applied to phylogenetic trees to predict the success of individual influenza virus subtrees. Furthermore, we study the distribution of the Robinson-Foulds metric. We modify the dynamic programming algorithm for computing the distribution of the Robinson-Foulds distance for a given tree by leveraging the Number-Theoretic Transform (NTT), and improve the running time from o(n^5) to o(n^3 log⁡(n)), where n is the number of tips of the tree.
Copyright statement
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Chindelevitch, Leonid
Member of collection
Attachment Size
etd20727.pdf 4.07 MB