Segmentation and genome annotation (SAGA) algorithms such as ChromHMM and Segway are widely used to annotate genomes from epigenomic datasets. These algorithms rely on probabilistic graphical models: they take a collection of genomics datasets as input, partition the genome into segments, and assign a label to each segment such that positions with the same label have similar patterns in the input data. The output is an annotation that assigns each genomic position an activity label, such as Enhancer or Transcribed. Despite the widespread application of SAGA methods, there is currently no principled way to evaluate the statistical significance of SAGA label assignments. In this study, we apply principles of reproducibility analysis to assess the statistical significance of, and the confidence that should be ascribed to, genome annotations obtained from SAGA algorithms. Moreover, by investigating individual variables that affect reproducibility, we aim to delineate the different sources of irreproducibility in genome annotations. We hypothesize that reproducibility measurements provide more realistic confidence estimates for SAGA annotations, which will uncover irreproducible elements in existing annotations and remove doubt about those that withstand this statistical scrutiny.
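As a rough illustration of what a reproducibility measurement on annotations could look like, the sketch below compares two annotations of the same positions and reports, for each label, the fraction of positions that keep that label in the second annotation. It is a minimal example under stated assumptions, not the method developed in the thesis: the function name `label_reproducibility`, the toy data, and the `Quiescent` label are hypothetical, and the two annotations are assumed to use a shared label vocabulary (real SAGA runs typically emit integer states that would first need to be matched across runs).

```python
from collections import Counter


def label_reproducibility(rep1, rep2):
    """Per-label fraction of positions in rep1 that get the same label in rep2.

    Assumes both annotations cover the same positions in the same order
    and already share label names (a simplifying assumption).
    """
    if len(rep1) != len(rep2):
        raise ValueError("annotations must cover the same positions")
    total = Counter(rep1)  # positions per label in rep1
    agree = Counter(a for a, b in zip(rep1, rep2) if a == b)  # matching positions
    return {label: agree[label] / n for label, n in total.items()}


# Toy per-position annotations from two hypothetical replicate runs.
rep1 = ["Enhancer", "Enhancer", "Transcribed", "Transcribed", "Quiescent", "Quiescent"]
rep2 = ["Enhancer", "Quiescent", "Transcribed", "Transcribed", "Quiescent", "Quiescent"]

scores = label_reproducibility(rep1, rep2)
for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```

In this toy case the Enhancer label is reproduced at only half of its positions, while the other two labels are fully reproduced. Raw overlap of this kind does not correct for agreement expected by chance, so it should be read as a starting point rather than a significance estimate.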
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Libbrecht, Maxwell