An Approach to Automatic and Human Speech Recognition Using Ear-Recorded Speech
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo: Release after 28-Aug-2018
Abstract: Speech in a noisy background presents a challenge for the recognition of that speech both by human listeners and by computers tasked with understanding human speech (automatic speech recognition; ASR). Years of research have resulted in many solutions, though none so far has completely solved the problem. Current solutions generally require some form of estimation of the noise in order to remove it from the signal. The limitation is that noise can be highly unpredictable and highly variable, both in form and loudness. The present report proposes a method of recording a speech signal in a noisy environment that largely prevents noise from reaching the recording microphone. This method utilizes the human skull as a noise-attenuation device by placing the microphone in the ear canal. For further noise dampening, a pair of noise-reduction earmuffs is used over the speakers' ears. A corpus of speech was recorded with a microphone in the ear canal, while speech was simultaneously recorded at the mouth. Noise was emitted from a loudspeaker in the background. Following the data collection, the speech recorded at the ear was analyzed. A substantial noise-reduction benefit was found over mouth-recorded speech. However, this speech was missing much high-frequency information. With minor processing, mid-range frequencies were amplified, increasing the intelligibility of the speech. A human perception task was conducted using both the ear-recorded and mouth-recorded speech. Participants in this experiment were significantly more likely to understand ear-recorded speech than the noisy, mouth-recorded speech. Yet participants found mouth-recorded speech with no noise the easiest to understand. These recordings were also used with an ASR system. Since the ear-recorded speech is missing much high-frequency information, the system did not recognize it readily. However, when an acoustic model was trained on low-pass filtered speech, performance improved.
These experiments demonstrated that humans, and likely an ASR system with additional training, would be able to recognize ear-recorded speech more easily than speech in noise. Further speech processing and training may be able to improve the signal's intelligibility for both human and automatic speech recognition.
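The abstract reports that ASR performance improved when the acoustic model was trained on low-pass filtered speech, presumably because such speech resembles the ear-recorded signal, which lacks high-frequency information. The following is only a minimal sketch of what low-pass filtering does to a signal. The filter type (a first-order IIR filter), the 2000 Hz cutoff, the 16 kHz sample rate, and all function names are illustrative assumptions; the abstract does not state which filter or settings were used.

```python
import math


def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz):
    """Apply a first-order IIR low-pass filter (an assumed, simple stand-in)."""
    # Recurrence: y[n] = y[n-1] + alpha * (x[n] - y[n-1])
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)
    filtered = []
    previous = 0.0
    for x in samples:
        previous = previous + alpha * (x - previous)
        filtered.append(previous)
    return filtered


def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def sine(freq_hz, sample_rate_hz, n_samples):
    """Generate a unit-amplitude test tone."""
    return [
        math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz)
        for i in range(n_samples)
    ]


SAMPLE_RATE = 16000  # Hz, hypothetical value
CUTOFF = 2000        # Hz, hypothetical value (not taken from the abstract)

# Compare how much of a low tone and a high tone survives the filter.
low_tone = sine(300, SAMPLE_RATE, SAMPLE_RATE)
high_tone = sine(6000, SAMPLE_RATE, SAMPLE_RATE)

low_gain = rms(one_pole_lowpass(low_tone, CUTOFF, SAMPLE_RATE)) / rms(low_tone)
high_gain = rms(one_pole_lowpass(high_tone, CUTOFF, SAMPLE_RATE)) / rms(high_tone)

print(f"300 Hz gain:  {low_gain:.2f}")
print(f"6000 Hz gain: {high_gain:.2f}")
```

The low tone passes through nearly unchanged while the high tone is strongly attenuated. Training data filtered this way would share the reduced high-frequency content of ear-recorded speech. A practical pipeline would more likely use a steeper filter from a signal-processing library.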
Degree Program: Graduate College
Degree Grantor: University of Arizona
Showing items related by title, author, creator and subject.
Clear Speech Modifications in Children Aged 6-10
Taylor, Griffin Lijding (The University of Arizona, 2017)
Modifications to speech production made by adult talkers in response to instructions to speak clearly have been well documented in the literature. Targeting adult populations has been motivated by efforts to improve speech production for the benefit of the communication partners; however, many adults also have communication partners who are children. Surprisingly, there is limited literature on whether children can change their speech production when cued to speak clearly. Pettinato, Tuomainen, Granlund, and Hazan (2016) showed that by age 12, children exhibited enlarged vowel space areas and reduced articulation rate when prompted to speak clearly, but did not produce any other adult-like clear speech modifications in connected speech. Moreover, Syrett and Kawahara (2013) suggested that preschoolers produced longer and more intense vowels when prompted to speak clearly at the word level. These findings contrasted with adult talkers, who show significant temporal and spectral differences between speech produced in control and clear speech conditions. Therefore, it was the purpose of this study to analyze changes in temporal and spectral characteristics of speech production that children aged 6-10 made in these experimental conditions. It is important to elucidate the clear speech profile of this population to better understand which adult-like clear speech modifications they make spontaneously and which modifications are still developing. Understanding these baselines will advance future studies that measure the impact of more explicit instructions and children's abilities to better accommodate their interlocutors, which is a critical component of children’s pragmatic and speech-motor development.
ACCURACY, SPEED AND EASE OF FILTERED SPEECH INTELLIGIBILITY.
Downs, David Wayne (The University of Arizona, 1982)
Nineteen normal-hearing university undergraduates performed an "objective" and a "subjective" test of speech intelligibility accuracy (SIA), speed (SIS) and ease (SIE) for different levels of low-pass filtered speech. During objective testing subjects listened to monosyllabic words low-pass filtered through an earphone, and repeated words as correctly and quickly as possible. They simultaneously turned off a probe light as quickly as possible whenever it appeared. Objective SIA was assessed as percentage of incorrectly-repeated phonemes, objective SIS as elapsed time between word presentation and a subject's voice response, and objective SIE as probe-reaction time to turning off the light. During subjective testing subjects listened to common sentences low-pass filtered through a loudspeaker in a background of competing speech. Subjective SIA, SIS and SIE were assessed using magnitude estimation in which subjects assigned numbers to how accurately, quickly or easily they understood the sentences. The most important finding was generally improved accuracy, speed and ease of objectively- and subjectively-measured speech intelligibility with decreased filtering. The experimenter further analyzed results by determining how well each measure of SIA, SIS and SIE met assumptions of test sensitivity, selectivity, reliability, convergence, discriminability and sufficiency. Overall, the objective SIA measure best met assumptions, followed by the three subjective measures, the objective SIS measure, and the objective SIE measure. Results have clinical and research implications for testing and understanding normal and impaired speech intelligibility and perception. First, results are encouraging for audiologists who use objective SIA and subjective measures to test speech intelligibility of their patients. Second, results suggest that persons listening to degraded speech, or persons with auditory problems, may have difficulties in SIS and SIE as well as problems already documented for SIA. Accordingly, audiologists should consider SIS and SIE during audiologic evaluations, aural rehabilitation, and auditory research. Finally, a few subjects showed exceptionally fast voice-response and probe-reaction times, which has implications for understanding the nature and limits of human auditory processing.
Individual Differences in Degraded Speech Perception
Carbonell, Kathy M. (The University of Arizona, 2016)
One of the lasting concerns in audiology is the unexplained individual differences in speech perception performance even for individuals with similar audiograms. One proposal is that there are cognitive/perceptual individual differences underlying this vulnerability and that these differences are present in normal hearing (NH) individuals but do not reveal themselves in studies that use clear speech produced in quiet (because of a ceiling effect). However, previous studies have failed to uncover cognitive/perceptual variables that explain much of the variance in NH performance on more challenging degraded speech tasks. This lack of strong correlations may be due to either examining the wrong measures (e.g., working memory capacity) or to there being no reliable differences in degraded speech performance in NH listeners (i.e., variability in performance is due to measurement noise). The proposed project has three aims. The first is to establish whether there are reliable individual differences in degraded speech performance for NH listeners that are sustained both across degradation types (speech in noise, compressed speech, noise-vocoded speech) and across multiple testing sessions. The second is to establish whether there are reliable differences in NH listeners' ability to adapt their phonetic categories based on short-term statistics both across tasks and across sessions. The third is to determine whether performance on degraded speech perception tasks is correlated with performance on phonetic adaptability tasks, thus establishing a possible explanatory variable for individual differences in speech perception for NH and hearing impaired listeners.