Advisor: Myers, Eugene W.
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract: String pattern matching is an extensively studied area of computer science. Over the past few decades, many important theoretical results have been discovered, and a large number of practical algorithms have been developed for efficiently matching various classes of patterns. A variety of general pattern matching tools and specialized programming languages have been implemented for applications in areas such as lexical analysis, text editing, or database searching. Most recently, the field of molecular biology has been added to the growing list of applications that make use of pattern matching technology. The requirements of biological pattern matching differ from traditional applications in several ways. First, the amount of data to be processed is very large, and hence highly efficient pattern matching tools are required. Second, the data to be searched is obtained from biological experiments, where error rates of up to 5% are not uncommon. In addition, patterns are often averaged from several biologically similar sequences. Therefore, to be useful, pattern matching tools must be able to accommodate some notion of approximate matching. Third, formal language notations such as regular expressions, which are commonly used in traditional applications, are insufficient for describing many of the patterns that are of interest to biologists. Hence, any conventional notation must be significantly enhanced to accommodate such patterns. Taken together, these differences render most existing pattern matching tools inadequate, and have created a need for specialized pattern matching systems. This dissertation presents a pattern matching system that specifically addresses the three issues outlined above. A notation for defining patterns is developed by extending the regular expression syntax in a consistent way. Using this notation, virtually any pattern of interest to biologists can be expressed in an intuitive and concise manner.
The system further incorporates a very flexible notion of approximate pattern matching that unifies most of the previously developed concepts. Last, but not least, the system employs a novel, optimized backtracking algorithm, which enables it to efficiently search even very large databases.
Degree Program: Computer Science