SINGLE-SEQUENCE PROTEIN SECONDARY STRUCTURE PREDICTION BY NEAREST-NEIGHBOR CLASSIFICATION OF PROTEIN WORDS
Author
PORFIRIO, DAVID JONATHAN
Issue Date
2016
Advisor
Kececioglu, John
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract
Protein secondary structure prediction takes a sequence of amino acids as input and predicts the secondary structure class of each position in the sequence. Our approach extracts protein words of a fixed length from protein sequences, then uses nearest-neighbor classification to predict the secondary structure class of the middle position in each word. We present a new formulation for learning a distance function on protein words based on position-dependent substitution scores on amino acids. These substitution scores are learned by solving a large linear programming problem on examples of words with known secondary structures. We evaluated this approach on a database of 5519 proteins with a total length of approximately 3,000,000 amino acids. At present, a test scheme using words of length 23 achieves a uniform average accuracy over word positions of 65.2%. The average accuracy in the test was 63.1% for alpha-classified words, 56.6% for beta-classified words, and 71.6% for coil-classified words.
Type
text
Electronic Thesis
Degree Name
B.S.
Degree Level
Bachelors
Degree Program
Honors College
Computer Science
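
The word-extraction and nearest-neighbor steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The thesis learns position-dependent substitution scores by linear programming, whereas this sketch substitutes a fixed weighted-mismatch distance as a stand-in. The padding symbol, the function names (`extract_words`, `build_training`, `mismatch_distance`, `predict`), the class letters H/E/C, and the word length of 3 used in the example (the thesis reports results for length 23) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

PAD = "-"  # hypothetical padding symbol for positions beyond the sequence ends


def extract_words(seq: str, k: int) -> List[str]:
    """Return one length-k word per position, centered on that position."""
    if k % 2 == 0:
        raise ValueError("word length must be odd so a middle position exists")
    half = k // 2
    padded = PAD * half + seq + PAD * half
    return [padded[i:i + k] for i in range(len(seq))]


def build_training(seq: str, labels: str, k: int) -> List[Tuple[str, str]]:
    """Pair each word with the known class of its middle position."""
    if len(seq) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    return list(zip(extract_words(seq, k), labels))


def mismatch_distance(weights: Sequence[float]) -> Callable[[str, str], float]:
    """Stand-in distance: sum of per-position weights where two words differ.

    The thesis instead uses learned position-dependent substitution scores.
    """
    def dist(u: str, v: str) -> float:
        return sum(w for w, a, b in zip(weights, u, v) if a != b)
    return dist


def predict(seq: str, k: int,
            training: List[Tuple[str, str]],
            dist: Callable[[str, str], float]) -> str:
    """Label each position with the class of its nearest training word."""
    predicted = []
    for word in extract_words(seq, k):
        _, label = min(training, key=lambda example: dist(word, example[0]))
        predicted.append(label)
    return "".join(predicted)


# Toy usage: one labeled sequence as training data, center position weighted most.
training = build_training("ACDE", "HHEC", 3)
dist = mismatch_distance([1.0, 2.0, 1.0])
prediction = predict("ACDE", 3, training, dist)
```

A linear scan over all training words is used here for clarity. With a training set on the order of millions of words, as in the evaluation described in the abstract, a faster nearest-neighbor search structure would likely be needed.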