An Approach to Automatic and Human Speech Recognition Using Ear-Recorded Speech
MetadataShow full item record
PublisherThe University of Arizona.
RightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
EmbargoRelease after 28-Aug-2018
AbstractSpeech in a noisy background presents a challenge for the recognition of that speech both by human listeners and by computers tasked with understanding human speech (automatic speech recognition; ASR). Years of research have resulted in many solutions, though none so far have completely solved the problem. Current solutions generally require some form of estimation of the noise, in order to remove it from the signal. The limitation is that noise can be highly unpredictable and highly variable, both in form and loudness. The present report proposes a method of recording a speech signal in a noisy environment that largely prevents noise from reaching the recording microphone. This method utilizes the human skull as a noise-attenuation device by placing the microphone in the ear canal. For further noise dampening, a pair of noise-reduction earmuffs are used over the speakers' ears. A corpus of speech was recorded with a microphone in the ear canal, while also simultaneously recording speech at the mouth. Noise was emitted from a loudspeaker in the background. Following the data collection, the speech recorded at the ear was analyzed. A substantial noise-reduction benefit was found over mouth-recorded speech. However, this speech was missing much high-frequency information. With minor processing, mid-range frequencies were amplified, increasing the intelligibility of the speech. A human perception task was conducted using both the ear-recorded and mouth-recorded speech. Participants in this experiment were significantly more likely to understand ear-recorded speech over the noisy, mouth-recorded speech. Yet, participants found mouth-recorded speech with no noise the easiest to understand. These recordings were also used with an ASR system. Since the ear-recorded speech is missing much high-frequency information, it did not recognize the ear-recorded speech readily. However, when an acoustic model was trained low-pass filtered speech, performance improved. These experiments demonstrated that humans, and likely an ASR system, with additional training, would be able to more easily recognize ear-recorded speech than speech in noise. Further speech processing and training may be able to improve the signal's intelligibility for both human and automatic speech recognition.
Degree ProgramGraduate College
Degree GrantorUniversity of Arizona
Showing items related by title, author, creator and subject.
Clear Speech Modifications in Children Aged 6-10Taylor, Griffin Lijding (The University of Arizona., 2017)Modifications to speech production made by adult talkers in response to instructions to speak clearly have been well documented in the literature. Targeting adult populations has been motivated by efforts to improve speech production for the benefit of the communication partners, however, many adults also have communication partners who are children. Surprisingly, there is limited literature on whether children can change their speech production when cued to speak clearly. Pettinato, Tuomainen, Granlund, and Hazan (2016) showed that by age 12, children exhibited enlarged vowel space areas and reduced articulation rate when prompted to speak clearly, but did not produce any other adult-like clear speech modifications in connected speech. Moreover, Syrett and Kawahara (2013) suggested that preschoolers produced longer and more intense vowels when prompted to speak clearly at the word level. These findings contrasted with adult talkers who show significant temporal and spectral differences between speech produced in control and clear speech conditions. Therefore, it was the purpose of this study to analyze changes in temporal and spectral characteristics of speech production that children aged 6-10 made in these experimental conditions. It is important to elucidate the clear speech profile of this population to better understand which adult-like clear speech modifications they make spontaneously and which modifications are still developing. Understanding these baselines will advance future studies that measure the impact of more explicit instructions and children's abilities to better accommodate their interlocutors, which is a critical component of children’s pragmatic and speech-motor development.
Individual Differences in Degraded Speech PerceptionCarbonell, Kathy M. (The University of Arizona., 2016)One of the lasting concerns in audiology is the unexplained individual differences in speech perception performance even for individuals with similar audiograms. One proposal is that there are cognitive/perceptual individual differences underlying this vulnerability and that these differences are present in normal hearing (NH) individuals but do not reveal themselves in studies that use clear speech produced in quiet (because of a ceiling effect). However, previous studies have failed to uncover cognitive/perceptual variables that explain much of the variance in NH performance on more challenging degraded speech tasks. This lack of strong correlations may be due to either examining the wrong measures (e.g., working memory capacity) or to there being no reliable differences in degraded speech performance in NH listeners (i.e., variability in performance is due to measurement noise). The proposed project has 3 aims; the first, is to establish whether there are reliable individual differences in degraded speech performance for NH listeners that are sustained both across degradation types (speech in noise, compressed speech, noise-vocoded speech) and across multiple testing sessions. The second aim is to establish whether there are reliable differences in NH listeners' ability to adapt their phonetic categories based on short-term statistics both across tasks and across sessions; and finally, to determine whether performance on degraded speech perception tasks are correlated with performance on phonetic adaptability tasks, thus establishing a possible explanatory variable for individual differences in speech perception for NH and hearing impaired listeners.