Composite Likelihood Method for Genome-Wide and Phenome-Wide Association Studies
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract: With advances in sequencing technology and the wide adoption of electronic health records (EHR) nationally and internationally, massive amounts of genetic and phenotypic data are now accessible at an affordable price. For example, common genomic variants, such as common single-nucleotide polymorphisms (SNPs), can be measured with a commercial SNP chip for under $100 per sample (Weedon et al., 2021). In addition, disease diagnosis codes (e.g., International Statistical Classification of Diseases (ICD) codes) and lab measures stored in EHR allow deep phenotyping. We are therefore better equipped to identify associations between human genetic variation and diseases. However, as the sample sizes of current genetic studies have increased exponentially, there is a greater demand for efficient analytical tools that scale to ultra-large sample sizes and can simultaneously adjust for confounding factors. In this dissertation, I develop a suite of approaches based on the composite likelihood framework to handle sample relatedness and imbalanced case-control proportions in biobank-scale genetic association studies. Linear mixed-effect models (LMM) are commonly used to model sample relatedness (i.e., polygenic background). In the first project, we introduce the composite likelihood approach to the LMM. By using the composite likelihood, we avoid inverting the huge covariance matrix and reduce memory usage in genome-wide association studies (GWAS). First, we estimate the variance components in the LMM with method of moments (MOM) and stochastic gradient descent (SGD) algorithms. Second, we derive Wald and score tests for the association between a single SNP and a continuous phenotype. Our method breaks the high-dimensional correlated data into many one-dimensional pieces and then solves the problem by combining the information across those pieces.
The simulation studies indicate that memory usage for estimating the variance components is substantially lower than with existing methods. Notably, with our approach, millions of tests can be distributed across multiple clusters and evaluated simultaneously. In phenome-wide association studies (PheWAS), diagnosis codes are known to be potentially inaccurate and rare (i.e., imbalanced case-control proportions), and both problems lead to biased results. In the second project, we address these problems by taking advantage of multiple correlated diagnosis codes and their ontology, applying the composite likelihood approach together with generalized estimating equations (GEE). First, the SNP effect is estimated by GEE within the composite likelihood framework. Second, the corresponding Wald test statistics are derived for testing the association. Our methods use two-dimensional correlated phenotypes to capture the characteristics of the higher-dimensional correlation structure among diagnosis codes. Simulation studies show that this approach controls the type I error well, has power comparable to the traditional GEE, and is much more powerful than existing PheWAS tools. In the third project, we apply our methods to the UK Biobank study. We implement several existing methods for GWAS and PheWAS analysis and compare their results with ours. The GWAS results are consistent between our methods and the existing methods. Our novel approach detects PheWAS associations with greater power than the traditional methods, and its findings are supported by the current literature.
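To make the method-of-moments step of the first project more concrete, here is a minimal Python sketch. The abstract does not give the estimator, so this assumes a Haseman-Elston-style moment estimator for the model y ~ N(0, σg² K + σe² I), where K = G Gᵀ / m is built from standardized genotypes G; the dissertation's actual estimator may differ. The function names `simulate_lmm` and `he_variance_components` and all simulation settings are hypothetical. The sketch matches the off-diagonal products y_i y_j to σg² K_ij, and it computes the required sums from the m × m matrix GᵀG, so the n × n matrix K is never formed or inverted.

```python
import numpy as np


def simulate_lmm(n, m, sigma_g2, sigma_e2, seed=0):
    """Simulate standardized genotypes G (n x m) and a phenotype y = g + e.

    Hypothetical toy setup: Var(g | G) = sigma_g2 * K with K = G G^T / m,
    and e is independent noise with variance sigma_e2.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.2, 0.5, size=m)             # allele frequencies
    G = rng.binomial(2, freqs, size=(n, m)).astype(float)
    G = (G - G.mean(axis=0)) / G.std(axis=0)          # standardize each SNP
    beta = rng.normal(size=m)
    g = G @ beta * np.sqrt(sigma_g2 / m)              # polygenic effect
    e = rng.normal(scale=np.sqrt(sigma_e2), size=n)   # residual noise
    return G, g + e


def he_variance_components(G, y):
    """Moment estimates of (sigma_g2, sigma_e2) without forming the n x n K.

    Uses E[y_i y_j] = sigma_g2 * K_ij for i != j and
    E[y_i^2] = sigma_g2 * K_ii + sigma_e2.
    """
    n, m = G.shape
    y = y - y.mean()
    k_diag = np.einsum("ij,ij->i", G, G) / m          # K_ii for each sample

    # sum_{i != j} K_ij y_i y_j = y^T K y - sum_i K_ii y_i^2
    Gty = G.T @ y
    num = Gty @ Gty / m - np.sum(k_diag * y**2)

    # sum_{i != j} K_ij^2 = ||G^T G||_F^2 / m^2 - sum_i K_ii^2
    GtG = G.T @ G
    den = np.sum(GtG**2) / m**2 - np.sum(k_diag**2)

    sigma_g2 = num / den
    sigma_e2 = np.mean(y**2) - sigma_g2 * k_diag.mean()
    return sigma_g2, sigma_e2


# Toy run with true variance components 0.5 and 0.5.
G, y = simulate_lmm(n=1000, m=200, sigma_g2=0.5, sigma_e2=0.5, seed=0)
sigma_g2_hat, sigma_e2_hat = he_variance_components(G, y)
print(f"sigma_g2 estimate: {sigma_g2_hat:.3f}, sigma_e2 estimate: {sigma_e2_hat:.3f}")
```

In this toy setting the working memory is dominated by G and the m × m cross-product, which is the kind of saving the abstract attributes to avoiding the full covariance matrix. A moment estimator of this form is generally less statistically efficient than a full likelihood fit.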
Degree Program: Graduate College