Large genomic resources such as UK Biobank include hundreds of thousands of subjects and are being established for prospective epidemiological cohort studies, with the goal of improving the screening and treatment of disease. Genome-wide association studies (GWAS) on these resources face time and space efficiency issues, which are amplified at the population scale. We introduce two new methods to mitigate these issues. First, we present a new compressed file format and associated software that exploit properties of the statistical distribution of population genetics files, making GWAS computationally faster and their storage footprint smaller, which reduces the cost of GWAS research. We benchmark this method on Thousand Genomes Project data against the current state of the art and find a significant increase in space efficiency. Second, we present software implementing an efficient clustering method for the associations discovered by such studies. We apply this method to GWAS of nearly 4,000 brain imaging phenotypes from UK Biobank, yielding results associated with pathways involved in various diseases.
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Elliott, Lloyd