High-Confidence Learning from Uncertain Data with High Dimensionality
Author: Washburn, Ammon
Issue Date: 2018
Advisors: Fan, Neng; Zhang, Helen Hao
Publisher: The University of Arizona
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo: Release after 12/07/2020

Abstract
Some of the most challenging issues in big data are size, scalability, and reliability. Big data, such as pictures, videos, and text, have inherent structure that does not fit the standard data table. The sources are often the internet or other domains where accuracy cannot be guaranteed, so important information is likely to be missing or unmeasurable. This leads to situations where identifying the important part of the data would yield good solutions, yet with all of the data the tasks become ill-posed. In other cases all of the data is useful, but it carries important or hidden structure that classical methods are not equipped to exploit.

Many methods have been developed to deal with either data uncertainty or ill-posed problems. Data uncertainty can arise from missing or distributional data. Data imputation combined with uncertainty quantification allows standard statistical and machine learning methods to be applied and then verified. Other methods combine these steps in a robust way that directly informs the model; this last approach is common in chance-constrained, robust, and distributionally robust programs from the mathematical optimization community.

Well-posed problems have a solution that is unique and changes slowly and continuously with the initial conditions. For standard machine learning models, a data set with many irrelevant features gives rise to ill-posed problems. Regularization and feature selection are two long-standing ways to deal with them. Regularization approaches include penalties such as Lp norms or the matrix trace, each of which induces particular properties in the solution. Feature selection has been achieved in many ways, ranging from preprocessing steps that rank and select features and stepwise regression to modern techniques such as the LASSO.
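To make the connection between regularization and feature selection concrete, the following sketch implements the LASSO by coordinate descent on synthetic data with irrelevant features. This is the generic textbook algorithm, not the dissertation's own method, and all names, data, and parameters here are illustrative assumptions:

```python
import numpy as np

# Synthetic regression problem: 10 features, only the first 3 relevant.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + 0.1 * rng.standard_normal(n)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_w ||y - Xw||^2 / (2n) + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n        # per-feature curvature
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual with feature j's contribution removed
            rho = X[:, j] @ r / n
            # Soft-thresholding: coefficients of irrelevant features are driven exactly to zero.
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

w_hat = lasso_cd(X, y, lam=0.1)
```

The soft-thresholding step is what distinguishes the L1 penalty from an L2 penalty: it produces exact zeros, so regularization and feature selection happen simultaneously.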
For many applications, the data have both uncertainty and a high-dimensional component. By combining methods that address both issues and then deriving fast computational algorithms, we formulate robust, highly generalizable machine learning models that achieve very good results. Two of our classification models treat a whole sample of points as a single unit to be classified. Traditional classification models assign a label to one point at a time, but in an interesting data set from karyometry, several hundred points must be consolidated into one classification. One of the algorithms can also exploit a nested structure in this data set to extract further information useful to doctors. The third model deals with data and label uncertainty in classification; we handle it in a data-driven, distributionally robust way that yields confidence intervals on the classification. A large part of this dissertation also concerns the algorithms used to solve these optimization formulations. We extend solution path algorithms to general classes of convex programs into which many machine learning models fall. We also develop three methods to solve a multiclass generalization of the SVM that has hitherto been considered very difficult. Throughout, we focus on support vector machines and how to reformulate them to address these issues while remaining computationally tractable. Because our approach is stated in the general form of Tikhonov regularization, it also applies to any machine learning model that fits the Tikhonov framework.
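The Tikhonov framework the abstract refers to can be illustrated by its most familiar instance, ridge regression, which has a closed-form solution. This is a minimal sketch of that generic special case, not the dissertation's formulation; the function name and toy data are assumptions for illustration:

```python
import numpy as np

def tikhonov(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    # Regularized normal equations: (X^T X + lam * I) w = X^T y.
    # The ridge term makes the system well-posed even for ill-conditioned X.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy demonstration: heavier regularization shrinks the solution norm.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
w_light = tikhonov(X, y, lam=0.01)
w_heavy = tikhonov(X, y, lam=100.0)
```

Swapping the squared-norm penalty for other regularizers yields the other models in the Tikhonov family, which is why methods stated at this level of generality transfer across models.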
Type: Electronic Dissertation (text)
Degree Name: Ph.D.
Degree Level: doctoral
Degree Program: Graduate College; Applied Mathematics