Natural Language Processing for Complex Tasks: Challenges and Solutions for Small Datasets in the Era of Deep Learning
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo: Release after 08/04/2023
Abstract: Unstructured text data encodes massive amounts of information about our world. With advances in machine learning (ML), especially deep learning, we have been able to solve increasingly complex natural language processing (NLP) tasks. Deep learning models have achieved state-of-the-art performance in tasks such as part-of-speech tagging and machine translation. However, some domain-specific tasks require significant levels of interpretation, reasoning, and background knowledge in addition to the ability to comprehend text. Combined with the increasing complexity of deep learning models, this makes it difficult to estimate or understand whether an ML model will be successful given a specific dataset. Moreover, deep learning algorithms have more substantial requirements for labeled training data than conventional machine learning, and creating additional training data is usually costly and time-consuming. This dissertation considers these challenges in the context of diagnosing autism spectrum disorder (ASD). Identifying diagnostic criteria of ASD is a difficult task that can only be completed by trained clinicians, gold-standard training data is scarce, and the healthcare domain requires transparency. An ML solution for this task needs to overcome all of these challenges, and this clinical task therefore provides an ideal testbed. In this dissertation, we use two essays to address two common questions that arise in the course of designing an ML NLP system: estimating the difficulty of an ML problem and curating training data. These problems are highly relevant for researchers and practitioners who hope to leverage ML for NLP in their specific domains, where the costs of creating data may be prohibitive. The first essay proposes a set of measures that use data-driven characteristics to explain and predict the performance of an ML NLP system.
The second essay evaluates factors that can determine the success of an automated training data labeling framework used to address the shortage of gold-standard data by experimenting with multiple labeling systems and data sources. We finish with an integrated analysis that applies the measures proposed in the first essay to the system-labeled data in the second essay, which serves as further evaluation and analysis of the artifacts proposed in this work. Designing and optimizing ML systems that support complex decisions from text data involve many components, personnel, and systems. This research focuses on better understanding and leveraging training data in support of developing evidence-based guidelines toward more effective ML.
Degree Program: Graduate College
Management Information Systems