Novel Computational and Statistical Approaches in Metagenomic Studies
Author
Sohn, Michael B.
Issue Date
2015
Keywords
Statistics
Advisor
An, Lingling
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract
Metagenomics has great potential to reveal previously unattainable information about microbial communities. The simplest, yet extremely powerful, approach for studying the characteristics of a microbial community is the analysis of differential abundance, which tries to identify features (e.g., species or genes) that are differentially abundant across communities. For instance, detecting differentially abundant microbes across healthy and diseased groups can help identify potential pathogens or probiotics. However, the analysis of differential abundance can be misleading about the characteristics of microbial communities if the counts or abundances of features on different scales are not properly normalized within and between communities. An important prerequisite for the analysis of differential abundance is an accurate estimate of the composition of microbial communities, commonly known as the analysis of taxonomic composition. Most prevalent approaches rely solely on the results of an alignment tool such as BLAST, limiting their estimation accuracy to high ranks of the taxonomy tree. In this study, two novel methods are developed: one for the analysis of taxonomic composition, called Taxonomic Analysis by Elimination and Correction (TAEC), and the other for the analysis of differential abundance, called Ratio Approach for Identifying Differential Abundance (RAIDA). TAEC utilizes the alignment similarity between known genomes in addition to the similarity between query sequences and sequences of known genomes. It is comprehensively tested on various simulated datasets with bacterial structures of diverse complexity. Compared with other available methods designed for estimating taxonomic composition at a relatively low taxonomic rank, TAEC demonstrates greater accuracy in estimating the abundance of bacteria in a given microbial sample.
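The normalization concern raised above can be made concrete with a toy calculation. The sketch below is not taken from the dissertation and is not the RAIDA procedure. It uses made-up absolute abundances for three hypothetical features (`f1`, `f2`, `f3`) to show how relative abundances can shift for features whose absolute abundance never changed, while the ratio between two such features stays fixed.

```python
import math

# Made-up absolute abundances for two hypothetical communities.
# Only f3 differs between them; f1 and f2 are unchanged.
absolute_a = {"f1": 100.0, "f2": 200.0, "f3": 700.0}
absolute_b = {"f1": 100.0, "f2": 200.0, "f3": 1700.0}


def relative(abundance):
    """Convert absolute abundances to proportions that sum to 1."""
    total = sum(abundance.values())
    return {name: value / total for name, value in abundance.items()}


def ratio(abundance, numerator, denominator):
    """Ratio between the abundances of two features."""
    return abundance[numerator] / abundance[denominator]


rel_a = relative(absolute_a)
rel_b = relative(absolute_b)

# f1 looks "less abundant" in community B on the relative scale,
# even though its absolute abundance is identical in both communities.
f1_relative_changed = not math.isclose(rel_a["f1"], rel_b["f1"])

# The f1/f2 ratio is the same whether computed from relative or
# absolute abundances, and it is the same in both communities.
ratio_invariant = (
    math.isclose(ratio(rel_a, "f1", "f2"), ratio(absolute_a, "f1", "f2"))
    and math.isclose(ratio(rel_b, "f1", "f2"), ratio(absolute_b, "f1", "f2"))
    and math.isclose(ratio(rel_a, "f1", "f2"), ratio(rel_b, "f1", "f2"))
)

print(f1_relative_changed, ratio_invariant)
```

In this toy setting, comparing proportions alone would flag `f1` and `f2` as changed, whereas their ratio correctly shows no change between the two communities.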
RAIDA utilizes an invariant property of the ratio between the abundances of features: the ratio between the relative abundances of two features is the same as the ratio between their absolute abundances. In comprehensive simulation studies, RAIDA is consistently powerful, and in some situations it greatly surpasses other existing methods for the analysis of differential abundance in metagenomic studies.
Type
text
Electronic Dissertation
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Statistics