An Approach to Automatic and Human Speech Recognition Using Ear-Recorded Speech
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo: Release after 28-Aug-2018
Abstract: Speech in a noisy background presents a challenge for the recognition of that speech both by human listeners and by computers tasked with understanding human speech (automatic speech recognition; ASR). Years of research have resulted in many solutions, though none so far have completely solved the problem. Current solutions generally require some form of estimation of the noise in order to remove it from the signal. The limitation is that noise can be highly unpredictable and highly variable, both in form and loudness. The present report proposes a method of recording a speech signal in a noisy environment that largely prevents noise from reaching the recording microphone. This method utilizes the human skull as a noise-attenuation device by placing the microphone in the ear canal. For further noise dampening, a pair of noise-reduction earmuffs is worn over the speaker's ears. A corpus of speech was recorded with a microphone in the ear canal while speech was simultaneously recorded at the mouth. Noise was emitted from a loudspeaker in the background. Following the data collection, the speech recorded at the ear was analyzed. A substantial noise-reduction benefit was found over mouth-recorded speech. However, this speech was missing much high-frequency information. With minor processing, mid-range frequencies were amplified, increasing the intelligibility of the speech. A human perception task was conducted using both the ear-recorded and mouth-recorded speech. Participants in this experiment were significantly more likely to understand ear-recorded speech than the noisy, mouth-recorded speech. Yet participants found mouth-recorded speech with no noise the easiest to understand. These recordings were also used with an ASR system. Since the ear-recorded speech is missing much high-frequency information, the system did not recognize the ear-recorded speech readily. However, when an acoustic model was trained on low-pass filtered speech, performance improved.
These experiments demonstrated that humans, and likely an ASR system given additional training, would be able to recognize ear-recorded speech more easily than speech in noise. Further speech processing and training may be able to improve the signal's intelligibility for both human and automatic speech recognition.
Degree Program: Graduate College
Degree Grantor: University of Arizona
Showing items related by title, author, creator and subject.
Clear Speech Modifications in Children Aged 6-10
Bunton, Kate; Taylor, Griffin Lijding; Bunton, Kate; Story, Brad; Plante, Elena (The University of Arizona, 2017)
Modifications to speech production made by adult talkers in response to instructions to speak clearly have been well documented in the literature. Targeting adult populations has been motivated by efforts to improve speech production for the benefit of the communication partners; however, many adults also have communication partners who are children. Surprisingly, there is limited literature on whether children can change their speech production when cued to speak clearly. Pettinato, Tuomainen, Granlund, and Hazan (2016) showed that by age 12, children exhibited enlarged vowel space areas and reduced articulation rate when prompted to speak clearly, but did not produce any other adult-like clear speech modifications in connected speech. Moreover, Syrett and Kawahara (2013) suggested that preschoolers produced longer and more intense vowels when prompted to speak clearly at the word level. These findings contrasted with adult talkers, who show significant temporal and spectral differences between speech produced in control and clear speech conditions. Therefore, the purpose of this study was to analyze the changes in temporal and spectral characteristics of speech production that children aged 6-10 made in these experimental conditions. It is important to elucidate the clear speech profile of this population to better understand which adult-like clear speech modifications they make spontaneously and which modifications are still developing. Understanding these baselines will advance future studies that measure the impact of more explicit instructions and children's abilities to better accommodate their interlocutors, which is a critical component of children's pragmatic and speech-motor development.
ACCURACY, SPEED AND EASE OF FILTERED SPEECH INTELLIGIBILITY.
Skinner, Paul; Downs, David Wayne; Matkin, Noel; Glattke, Theodore; Antia, Shirin D.; Putnam, Anne (The University of Arizona, 1982)
Nineteen normal-hearing university undergraduates performed an "objective" and a "subjective" test of speech intelligibility accuracy (SIA), speed (SIS) and ease (SIE) for different levels of low-pass filtered speech. During objective testing subjects listened to monosyllabic words low-pass filtered through an earphone, and repeated words as correctly and quickly as possible. They simultaneously turned off a probe light as quickly as possible whenever it appeared. Objective SIA was assessed as percentage of incorrectly-repeated phonemes, objective SIS as elapsed time between word presentation and a subject's voice response, and objective SIE as probe-reaction time to turning off the light. During subjective testing subjects listened to common sentences low-pass filtered through a loudspeaker in a background of competing speech. Subjective SIA, SIS and SIE were assessed using magnitude estimation in which subjects assigned numbers to how accurately, quickly or easily they understood the sentences. The most important finding was generally improved accuracy, speed and ease of objectively- and subjectively-measured speech intelligibility with decreased filtering. The experimenter further analyzed results by determining how well each measure of SIA, SIS and SIE met assumptions of test sensitivity, selectivity, reliability, convergence, discriminability and sufficiency. Overall, the objective SIA measure best met assumptions, followed by the three subjective measures, the objective SIS measure, and the objective SIE measure. Results have clinical and research implications for testing and understanding normal and impaired speech intelligibility and perception.
First, results are encouraging for audiologists who use objective SIA and subjective measures to test speech intelligibility of their patients. Second, results suggest that persons listening to degraded speech, or persons with auditory problems, may have difficulties in SIS and SIE as well as problems already documented for SIA. Accordingly, audiologists should consider SIS and SIE during audiologic evaluations, aural rehabilitation, and auditory research. Finally, a few subjects showed exceptionally fast voice-response and probe-reaction times, which has implications for understanding the nature and limits of human auditory processing.