Computationally Efficient, Cost-Effective, and Interpretable Machine Learning Methods for Population Genomic Inference
Author
Tran, Linh Ngo Hoang
Issue Date
2024
Advisor
Gutenkunst, Ryan N.
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract
The advent of next-generation sequencing (NGS) technology has given researchers unprecedented access to genomic data that can reveal the evolutionary histories and future trajectories of natural populations, from microbes to humans. Because of the large quantity of sequencing data and the complexity of evolutionary models, such inference often requires sophisticated computational tools. Advances in computational methods are therefore essential for learning efficiently from the wealth of genomic data and drawing insights from it. In this work, I developed several methods that aim to improve various aspects of existing approaches for making inferences from genomic variation data. The first method, donni, is a supervised machine learning (ML) approach based on dadi, a widely used likelihood-based framework for demographic history inference. By managing the data generated by dadi's underlying models more efficiently, donni significantly improves computational efficiency while maintaining similar inference accuracy. In addition, donni includes an accompanying uncertainty quantification method, which most existing ML-based approaches in the field lack. The second method, ConfuseNN, focuses on improving the interpretability of inference results generated by existing trained convolutional neural networks (CNNs), an increasingly popular approach that has been applied to many population genomic inference tasks. With ConfuseNN, we designed and implemented a suite of tests that each disrupt a specific feature in the genomic image data (e.g., allele frequency or linkage disequilibrium) to assess how that feature affects CNN performance. I applied these tests to CNNs developed by other groups as well as to the CNN we developed in our group for inference of the distribution of fitness effects (DFE), DFEnn, identifying the fundamental population genomic features that drive inference for each network.
Finally, our group has recently developed an extension of dadi to improve demographic history inference from low-coverage whole-genome sequencing data. Low-coverage sequencing is a cost-effective method of data acquisition, but it can bias downstream analysis if not properly accounted for. I contributed to the validation of this approach by applying it to whole-genome sequencing data from the 1000 Genomes Project and comparing its performance to that of an existing approach, ANGSD.
Type
Electronic Dissertation
text
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Genetics