With the advent of high-throughput sequencing technologies, a great deal of work has gone into producing large-scale biological data. Data integration is one of the fundamental steps for informing clinical intuition. Data preparation is a costly and time-consuming process in which different labs or devices gather results separately. During laboratory preparation of a biological specimen, artifactual effects may be introduced into the data. Many normalization approaches have been designed to correct these biases so that the dynamic range of the data is stabilized. However, existing normalization approaches do not stabilize the variance of the data. It has been shown that the variance of these data is a function of their mean. This property complicates analysis; for example, the difference between the signals in two replicates is neither a reliable measure of the magnitude of change nor a proper metric for deciding whether a locus is significant. To stabilize the variance, many analyses employ log or inverse hyperbolic sine (asinh) transformations. However, these transformations assume a specific mean-variance relationship in the data. For example, the log transformation stabilizes the variance only when the standard deviation is proportional to the mean, that is, when the variance grows quadratically with the mean. In this thesis, we show that existing transformations do not fully stabilize variance in genomic data sets. To solve the variance-instability issue, we first propose VSS, a method that produces variance-stabilized signals for sequencing-based genomic signals. VSS learns the empirical relationship between the mean and variance of a given signal data set and produces transformed signals that normalize for this dependence.
Then, we propose VSS-Hi-C, a method that stabilizes the variance of inter-chromosomal and intra-chromosomal Hi-C contact frequency matrices. We show that VSS and VSS-Hi-C successfully stabilize the variance of the data and that doing so improves downstream applications. These signal transformation methods eliminate the need for downstream methods to implement complex models of the mean-variance relationship. Moreover, we investigate biological experiments to determine whether any underlying factors are responsible for the mean-variance dependency in data sets. We also introduce a new quality control metric that uses VSS signals to measure replicable granularity in signal strength. Finally, we build a framework to transform unreplicated experiments using the mean-variance relationship derived from replicated experiments.
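The general idea of learning a mean-variance relationship from replicates and transforming signals to remove it can be illustrated with a minimal sketch. This is not the VSS implementation: the function names, the equal-count binning, and the simulated data (noise variance equal to the mean) are all assumptions made for illustration. The sketch estimates the standard deviation as a function of the mean from two replicates, then applies the standard variance-stabilizing construction t(x) = ∫ du / sd(u), evaluated numerically.

```python
import numpy as np


def fit_mean_sd(rep1, rep2, n_bins=50):
    """Estimate sd as a function of mean from two replicates (hypothetical helper)."""
    mean = (rep1 + rep2) / 2.0
    # Sample variance of two observations equals (difference squared) / 2.
    var = (rep1 - rep2) ** 2 / 2.0
    order = np.argsort(mean)
    # Equal-count bins along the sorted mean; average within each bin.
    bin_mean = np.array([b.mean() for b in np.array_split(mean[order], n_bins)])
    bin_var = np.array([b.mean() for b in np.array_split(var[order], n_bins)])
    return bin_mean, np.sqrt(bin_var)


def variance_stabilize(x, bin_mean, bin_sd, eps=1e-8):
    """Apply t(x) = integral of 1/sd(u) du, using the trapezoid rule on the bin grid."""
    inv_sd = 1.0 / np.maximum(bin_sd, eps)
    steps = np.diff(bin_mean) * (inv_sd[:-1] + inv_sd[1:]) / 2.0
    t_grid = np.concatenate([[0.0], np.cumsum(steps)])
    # Values outside the fitted range are clamped to the end points by np.interp.
    return np.interp(x, bin_mean, t_grid)


def high_low_variance_ratio(a, b, mean):
    """Variance of (a - b) in the high-mean half divided by that in the low-mean half."""
    diff = a - b
    high = mean > np.median(mean)
    return diff[high].var() / diff[~high].var()


def demo(seed=0, n=20000):
    """Simulate two replicates whose noise variance equals the mean (an assumption)."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(5.0, 100.0, size=n)
    rep1 = mu + rng.normal(0.0, np.sqrt(mu))
    rep2 = mu + rng.normal(0.0, np.sqrt(mu))
    bin_mean, bin_sd = fit_mean_sd(rep1, rep2)
    t1 = variance_stabilize(rep1, bin_mean, bin_sd)
    t2 = variance_stabilize(rep2, bin_mean, bin_sd)
    mean = (rep1 + rep2) / 2.0
    raw_ratio = high_low_variance_ratio(rep1, rep2, mean)
    stabilized_ratio = high_low_variance_ratio(t1, t2, mean)
    return raw_ratio, stabilized_ratio
```

Calling `demo()` returns two ratios: the first compares replicate-difference variance between high-signal and low-signal positions on the raw scale, and the second does the same after transformation. A ratio close to 1 indicates that the difference between replicates no longer depends on signal strength, which is the property the abstract describes.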
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Thesis advisor: Libbrecht, Maxwell