Aspects of Phylogenetic Inference: Missing Data and Rate Variation
Advisor: Sanderson, Michael J.
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract: That the use of large, multi-locus phylogenetic data sets in molecular evolutionary research provides new insights into evolutionary processes, and that these data sets raise special methodological challenges, are themes common to the work presented here. In Chapter 1 (attached as Appendix A), The Prevalence of Terraced Treescapes in Phylogenetic Data Sets, we characterized the scale of one such methodological problem. When there are sampling gaps (missing taxon samples) in sequence data matrices used for phylogenetic tree inference, and the inference method satisfies certain conditions, the pattern of presence and absence of taxa in the sequence sample can induce terraces: sets of trees with identical optimality scores. We investigated the properties of terraces in a large collection of published data sets of the kind widely assembled for tree reconstruction. Our results showed that terraces can be large, that the taxon sampling density of a data set affects terrace size, and that metrics of data set sampling properties derived from terrace theory can predict characteristics of the terraces found. Exploring the methodological implications of terraces further, we found that terraces in bootstrap samples reduce bootstrap support. Because the presence of terraces depends in part on the inference model used, we examined whether the type of maximum likelihood model sufficient to induce terraces is preferred in model choice tests to an alternative model. The model was preferred for some, but not all, of the data sets sampled for our study. In Chapter 2 (attached as Appendix B), Pervasive Among-branch Evolutionary Rate Variation in Insect Proteins, we leveraged a large phylotranscriptomic data set to investigate the variability of molecular evolutionary rates across the insects.
Using ANOVA, we found that variation among branches and among genes is significant, that branches contribute more overall rate variation than genes do, and that rate variation is distributed broadly among branches and among genes. The presence of strong branch-specific impacts on rates of evolution should encourage the development of divergence time inference models incorporating explicit terms for these effects. We formulate such a model patterned on the ANOVA linear model. In Chapter 3 (attached as Appendix C), Complex Per-gene Substitution Dynamics in Insect Proteins, we again use phylotranscriptomic data to examine molecular evolutionary patterns: we evaluate per-gene indexes of dispersion in hundreds of insect genes. In the context of molecular evolutionary substitution events, the per-gene index of dispersion is defined as the ratio of the variance to the mean of a gene’s branch-wise substitution counts; a value over 1 signals a departure from a Poisson substitution process. Using a conventional weighting scheme to eliminate “lineage effects”, the confounding impacts of the phylogeny on the per-gene indexes of dispersion, we found that index values exceeded Poisson expectations on average, as expected in deep phylogenies. Taking advantage of previously published divergence time data and the branch-rate effects estimates developed in Chapter 2, we decomposed the lineage effects. Our results showed that, although time contributed a large portion of these effects, as expected at deep evolutionary time scales, branch-specific rate impacts constituted no portion of the lineage effects. We hypothesize that differences in the branch-rate and lineage effects weighting measures, coupled with the impact of strong gene effects on the lineage-effects measure, may account for the observed disparity between lineage and branch-rate effects.
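The definition of the per-gene index of dispersion given above (variance divided by mean of a gene's branch-wise substitution counts) can be illustrated with a minimal sketch. This is only the plain, unweighted ratio: it does not include the lineage-effects weighting scheme used in the thesis, and the function name and the count values are hypothetical, chosen purely for illustration.

```python
from statistics import mean, variance


def index_of_dispersion(counts):
    """Unweighted ratio of sample variance to mean for one gene's
    branch-wise substitution counts (no lineage-effects weighting)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("index of dispersion is undefined when the mean count is zero")
    return variance(counts) / m


# Hypothetical branch-wise substitution counts (illustrative values only).
even_gene = [4, 5, 6, 5, 4, 6]       # counts cluster tightly around the mean
bursty_gene = [0, 1, 12, 0, 15, 2]   # same mean, far more spread across branches

for name, counts in [("even_gene", even_gene), ("bursty_gene", bursty_gene)]:
    print(name, round(index_of_dispersion(counts), 3))
```

Both hypothetical genes have the same mean count, but only the second has an index above 1, the pattern the abstract describes as a departure from a Poisson substitution process.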
To learn more about per-gene substitution dynamics, we identified the rank of each gene on the continuum of gene rates within each branch, and found that individual genes shift rate ranks frequently. This rate instability might also contribute to the discordance between lineage effects and branch-rate effects. To further identify processes and mechanisms driving per-gene dispersion, we characterized the correlation of the per-gene indexes of dispersion and average per-gene substitution counts, and found the correlation consistent with a negative binomial substitution process model. This result is compatible with fluctuating per-gene substitution rates, and highlights the need for divergence time inference models patterned on a negative binomial substitution process. Our analysis reveals complex patterns of per-gene, inter-branch substitutions, despite relatively stable among-gene rate differences.
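As a brief aside on why a relationship between the index of dispersion and the mean count can point to a negative binomial process, the following sketch uses one common parameterization of the negative binomial distribution. The symbols $N$, $\mu$, and $k$ are assumptions introduced here for illustration and are not taken from the thesis.

```latex
% N: substitution count for a gene on a branch (assumed notation)
% \mu: mean count, k > 0: negative binomial dispersion parameter
\[
  \operatorname{Var}(N) = \mu + \frac{\mu^{2}}{k}
  \quad\Longrightarrow\quad
  I(N) = \frac{\operatorname{Var}(N)}{\mathbb{E}[N]} = 1 + \frac{\mu}{k}.
\]
% Poisson limit: as k \to \infty, Var(N) \to \mu and I(N) \to 1.
```

Under this parameterization the index of dispersion exceeds 1 and grows with the mean count, whereas a Poisson process gives an index of 1 regardless of the mean.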
Degree Program: Graduate College
Ecology & Evolutionary Biology