Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo
Release after 02/06/2022
Abstract
Random forest and gradient boosting models are commonly found in publications using prediction models. They are referenced almost interchangeably within data competitions as easy methods for analyzing big data. This thesis compared the prediction accuracy, sensitivity, and specificity of the two methods using simulated data covering a variety of data characteristics. Gradient boosting and random forest had similar accuracy when the two classes of the binary outcome had equal numbers of observations. However, gradient boosting greatly outperformed random forest as the sample size and the number of variables increased. Gradient boosting also had markedly higher sensitivity and specificity regardless of data characteristics when the outcome classes were equally represented. Both methods had low values on all three measures when the binary outcomes were not equally represented; however, gradient boosting still had better prediction sensitivity and specificity than random forest. We illustrated the methods using real data from a study of human experts identifying musk-like aromatic molecules. The data contain chemical properties that could potentially be used to predict whether a molecule could be classified as musk without expert identification. As demonstrated by the simulation studies, the two methods had similar accuracy, but random forest had slightly higher sensitivity and higher mean prediction specificity.
Type
text
Electronic Thesis
Degree Name
M.S.
Degree Level
masters
Degree Program
Graduate College
Biostatistics
