We are upgrading the repository! A content freeze is in effect until December 6th, 2024 - no new submissions will be accepted; however, all content already published will remain publicly available. Please reach out to repository@u.library.arizona.edu with your questions, or if you are a UA affiliate who needs to make content available soon. Note that any new user accounts created after September 22, 2024 will need to be recreated by the user in November after our migration is completed.
Publisher
The University of Arizona.Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.Abstract
There is an abundance of information being generated constantly, most of it encoded as unstructured text. The information expressed this way, although publicly available, is not directly usable by computer systems because it is not organized according to a data model that could inform us how different data nuggets relate to each other. Information extraction provides a way of scanning unstructured text and extracting structured knowledge suitable for querying and manipulation. Most information extraction research focuses on machine learning approaches that can be considered black boxes when deployed in information extraction systems. We propose a declarative language designed for the information extraction task. It allows the use of syntactic patterns alongside token-based surface patterns that incorporate shallow linguistic features. It captures complex constructs such as nested structures, and complex regular expressions over syntactic patterns for event arguments. We implement a novel information extraction runtime system designed for the compilation and execution of the proposed language. The runtime system has novel features for better declarative support, while preserving practicality. It supports features required for handling natural language, like the preservation of ambiguity and the efficient use of contextual information. It has a modular architecture that allows it to be extended with new functionality, which, together with the language design, provides a powerful framework for the development and research of new ideas for declarative information extraction. We use our language and runtime system to build a biomedical information extraction system. This system is capable of recognizing biological entities (e.g., genes, proteins, protein families, simple chemicals), events over entities (e.g., biochemical reactions), and nested events that take other events as arguments (e.g., catalysis). Additionally, it captures complex natural language phenomena like coreference and hedging. Finally, we propose a rule learning procedure to extract rules from statistical systems trained for information extraction. Rule learning combines the advantages of machine learning with the interpretability of our models. This enables us to train information extraction systems using annotated data that can then be extended and modified by human experts, and in this way accelerate the deployment of new systems that can still be extended or modified by human experts.Type
textElectronic Dissertation
Degree Name
Ph.D.Degree Level
doctoralDegree Program
Graduate CollegeComputer Science
Degree Grantor
University of ArizonaCollections
The following license files are associated with this item: