AuthorBerry, Jeffrey James
KeywordsConditional Random Fields
Deep Belief Networks
Articulatory Speech Data
Automatic Speech Recognition
AdvisorArchangeli, Diana B.
Fasel, Ian R.
MetadataShow full item record
PublisherThe University of Arizona.
RightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
AbstractHumans make use of more than just the audio signal to perceive speech. Behavioral and neurological research has shown that a person's knowledge of how speech is produced influences what is perceived. With methods for collecting articulatory data becoming more ubiquitous, methods for extracting useful information are needed to make this data useful to speech scientists, and for speech technology applications. This dissertation presents feature extraction methods for ultrasound images of the tongue and for data collected with an Electro-Magnetic Articulograph (EMA). The usefulness of these features is tested in several phoneme classification tasks. Feature extraction methods for ultrasound tongue images presented here consist of automatically tracing the tongue surface contour using a modified Deep Belief Network (DBN) (Hinton et al. 2006), and methods inspired by research in face recognition which use the entire image. The tongue tracing method consists of training a DBN as an autoencoder on concatenated images and traces, and then retraining the first two layers to accept only the image at runtime. This 'translational' DBN (tDBN) method is shown to produce traces comparable to those made by human experts. An iterative bootstrapping procedure is presented for using the tDBN to assist a human expert in labeling a new data set. Tongue contour traces are compared with the Eigentongues method of (Hueber et al. 2007), and a Gabor Jet representation in a 6-class phoneme classification task using Support Vector Classifiers (SVC), with Gabor Jets performing the best. These SVC methods are compared to a tDBN classifier, which extracts features from raw images and classifies them with accuracy only slightly lower than the Gabor Jet SVC method.For EMA data, supervised binary SVC feature detectors are trained for each feature in three versions of Distinctive Feature Theory (DFT): Preliminaries (Jakobson et al. 1954), The Sound Pattern of English (Chomsky and Halle 1968), and Unified Feature Theory (Clements and Hume 1995). Each of these feature sets, together with a fourth unsupervised feature set learned using Independent Components Analysis (ICA), are compared on their usefulness in a 46-class phoneme recognition task. Phoneme recognition is performed using a linear-chain Conditional Random Field (CRF) (Lafferty et al. 2001), which takes advantage of the temporal nature of speech, by looking at observations adjacent in time. Results of the phoneme recognition task show that Unified Feature Theory performs slightly better than the other versions of DFT. Surprisingly, ICA actually performs worse than running the CRF on raw EMA data.
Degree ProgramGraduate College