Evaluating and Ameliorating Data Set Quality for Low-Resource Natural Language Processing
Author
Zupon, Andrew Lee
Issue Date
2024
Advisor
Carnie, Andrew
Surdeanu, Mihai
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract
This dissertation focuses on data set quality for natural language processing for low-resource languages, that is, languages with limited online or computational data available. The topic is addressed in three papers. The first paper evaluates a set of existing corpora in terms of their suitability for natural language processing research. It further explores ways to improve the quality of the data sets using text normalization as an example task. The languages covered in this paper are a set of low-resource African languages that vary genetically, geographically, and orthographically. The second paper investigates automatic methods of detecting and correcting annotation inconsistencies in dependency syntax treebanks. The paper proposes three correction methods that compare examples in a low-resource corpus with similar examples in a resource-rich corpus of the same language. These methods are evaluated by retraining two dependency parsers on the corrected and uncorrected data. In several testing conditions, the parsers trained on the corrected data outperform the parsers trained on the uncorrected data. This paper uses a simulated low-resource English data set and a different large English data set. The third paper extends the methods in the second paper to a multilingual scenario by leveraging aligned word vectors to identify potentially incorrect annotations. As in the second paper, this multilingual method is evaluated by retraining two dependency parsers on the corrected and uncorrected data. The parsers perform better when trained on the corrected data in some but not all testing conditions. This paper uses a low-resource Catalan data set and two different large Spanish data sets. Taken together, these three papers contribute to our understanding of how data set quality can affect natural language processing applications, particularly in low-resource contexts.
The approaches described in these papers utilize and expand on common natural language processing methodologies and apply them to new low-resource use cases.
Type
Electronic Dissertation
text
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Linguistics