Using data mining in educational research: A comparison of Bayesian network with multiple regression in prediction
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract
Advances in technology have altered data collection and popularized large databases in areas including education. To turn the collected data into knowledge, effective analysis tools are required. Traditional statistical approaches have shown some limitations when analyzing large-scale data, especially data sets with a large number of variables. This dissertation introduces educational researchers to a new data analysis approach called data mining, an analytic process at the intersection of statistics, databases, machine learning/artificial intelligence (AI), and computer science that is designed to explore large amounts of data in search of consistent patterns and/or systematic relationships between variables. To examine the usefulness of data mining in educational research, one specific data mining technique--the Bayesian Belief Network (BBN), based on Bayesian probability--is used to construct an analysis model, in contrast to the traditional statistical approaches, to answer a pseudo research question about faculty salary prediction in postsecondary institutions. Four prediction models--a multiple regression model with theoretical variable selection, a regression model with statistical variable extraction, a data mining BBN model with wrapper feature selection, and a combination model that uses variables selected by the BBN in a multiple regression procedure--are constructed to analyze a data set called the National Study of Postsecondary Faculty 1999 (NSOPF:99), provided by the National Center for Education Statistics (NCES). The algorithms, input variables, final models, outputs, and interpretations of the four prediction models are presented and discussed.
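To make the idea of wrapper feature selection concrete, the following is a minimal sketch, not the procedure used in the dissertation. It assumes a discrete naive Bayes classifier as a stand-in for a full BBN (naive Bayes is the simplest Bayesian network structure), a greedy forward search rather than a stochastic one, and synthetic binary data. The variable names (`senior_rank`, `noise_a`, and so on) are hypothetical and are not NSOPF:99 variables.

```python
import random
from collections import Counter, defaultdict

def train_nb(rows, labels, features):
    """Count class and per-feature value frequencies for a discrete naive Bayes model."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, class) -> value -> count
    for row, y in zip(rows, labels):
        for f in features:
            value_counts[(f, y)][row[f]] += 1
    return class_counts, value_counts

def predict_nb(model, row, features, n_values=2):
    """Return the class with the highest (unnormalized) posterior score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y, cy in class_counts.items():
        score = cy / total  # class prior
        for f in features:
            # Laplace smoothing avoids zero probabilities for unseen values.
            score *= (value_counts[(f, y)][row[f]] + 1) / (cy + n_values)
        if score > best_score:
            best, best_score = y, score
    return best

def holdout_accuracy(train, test, features):
    """Fit on the training split and score on the held-out split."""
    model = train_nb(train[0], train[1], features)
    hits = sum(predict_nb(model, r, features) == y for r, y in zip(*test))
    return hits / len(test[1])

def wrapper_forward_selection(train, test, candidates):
    """Greedily add the feature that most improves holdout accuracy; stop when none does."""
    selected = []
    best_acc = holdout_accuracy(train, test, selected)  # majority-class baseline
    improved = True
    while improved:
        improved = False
        best_f = None
        for f in candidates:
            if f in selected:
                continue
            acc = holdout_accuracy(train, test, selected + [f])
            if acc > best_acc:
                best_acc, best_f, improved = acc, f, True
        if improved:
            selected.append(best_f)
    return selected, best_acc

# Synthetic data (an assumption for illustration): the binary outcome follows
# "senior_rank" with 5% label noise; the other variables are pure noise.
random.seed(0)
names = ["senior_rank", "noise_a", "noise_b", "noise_c"]
rows, labels = [], []
for _ in range(600):
    row = {n: random.randint(0, 1) for n in names}
    y = row["senior_rank"]
    if random.random() < 0.05:
        y = 1 - y
    rows.append(row)
    labels.append(y)

train = (rows[:400], labels[:400])
test = (rows[400:], labels[400:])
selected, best_acc = wrapper_forward_selection(train, test, names)
print("selected variables:", selected)
print("holdout accuracy:", round(best_acc, 3))
```

The "wrapper" label refers to the fact that each candidate subset is scored by actually fitting and evaluating the predictive model, rather than by a model-independent statistic. In the spirit of the combination model, the variables returned in `selected` could then be passed to a separate multiple regression procedure.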
The results indicate that, with a nonmetric approach, the BBN can effectively handle a large number of variables through a process of stochastic subset selection; uncover dependence relationships among variables; detect hidden patterns in the data set; minimize the sample size as a factor influencing the amount of computation in data modeling; reduce data dimensionality by automatically identifying the most pertinent variable from a group of different but highly correlated measures in the analysis; and select the critical variables related to a core construct in prediction problems. The BBN and other data mining techniques have drawbacks; nonetheless, they are useful tools with unique advantages for analyzing large-scale data in educational research.
Type
text
Dissertation-Reproduction (electronic)
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Educational Psychology