Statistical Methods for the Analysis of Large-Scale and Single-Cell RNA-Sequencing Data
Supplementary data set 1
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo: Release after 08/13/2019
Abstract: RNA-sequencing (RNA-seq), based on next-generation sequencing (NGS) technologies, has become the preferred tool for transcriptome analysis in the past two decades. The continued maturation and decreasing cost of high-throughput sequencing technologies have led to new types of data, such as large-scale RNA-seq and single-cell RNA-seq (scRNA-seq) data. The analysis of these new types of RNA-seq data presents both new opportunities and new challenges. In this dissertation, I present three novel statistical works that focus on these types of RNA-seq data, each motivated by a different research interest. The first project, MDSeq, introduces the first gene expression variability analysis for large-scale RNA-seq count data. MDSeq uses a novel reparametrization of the negative binomial distribution to provide flexible generalized linear models (GLMs) on both the mean and the dispersion, and simultaneously addresses the challenges of analyzing large-scale RNA-seq data by modeling technical excess zeros, identifying outliers efficiently, and evaluating differential expression at biologically interesting levels. The last two works, scDoc and scGSA, are analysis tools that approach the recently emerging scRNA-seq data from different perspectives. scDoc is a statistical tool that accurately and robustly imputes drop-out events in scRNA-seq data. It is the first drop-out imputation method that incorporates drop-out information when estimating cell-to-cell similarity, a step that is crucial in scRNA-seq drop-out imputation but has not been appropriately examined. scGSA is a novel gene set analysis tool for scRNA-seq data. Without any prior knowledge of class labels (e.g., cell-type labels), which all existing gene set analysis approaches require, scGSA can detect significant gene sets related to biologically meaningful heterogeneity among cells.
Across comprehensive simulation studies, all three proposed methods demonstrated the highest power among the methods compared, while keeping type I error well controlled. All three methods are also applied to real data, where the results suggest that they are useful tools with distinctive advantages.
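To make the idea of modeling both the mean and the dispersion concrete, the following is a minimal sketch of one mean-dispersion parametrization of the negative binomial distribution. The abstract does not state the exact form used by MDSeq, so the relation variance = phi * mu (with phi > 1) is an assumption made here for illustration, and the function names are hypothetical rather than part of any package.

```python
import math


def nb_params(mu, phi):
    """Map (mean mu, dispersion phi) to the usual (size r, prob p).

    Assumes variance = phi * mu, which requires phi > 1 (overdispersion).
    This form is an illustrative assumption, not taken from the abstract.
    """
    if mu <= 0 or phi <= 1:
        raise ValueError("need mu > 0 and phi > 1")
    p = 1.0 / phi            # success probability
    r = mu / (phi - 1.0)     # size (number of successes)
    return r, p


def nb_logpmf(y, mu, phi):
    """Log-probability of count y under the mean-dispersion form."""
    r, p = nb_params(mu, phi)
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p) + y * math.log1p(-p)
    )


def nb_moments(mu, phi, y_max=2000):
    """Numerically sum the pmf up to y_max to check mass, mean, variance."""
    probs = [math.exp(nb_logpmf(y, mu, phi)) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return total, mean, var


if __name__ == "__main__":
    total, mean, var = nb_moments(mu=10.0, phi=3.0)
    print(total, mean, var)
```

Under this assumed form the mean and the dispersion are separate parameters, so each could in principle be linked to its own set of covariates in a GLM, which is the kind of flexibility the abstract describes.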
Degree Program: Graduate College