Robust Speech Recognition Based on Binaural Auditory Processing
Abstract
This paper discusses a combination of techniques for improving
speech recognition accuracy in the presence of reverberation
and spatially separated interfering sound sources. Interaural
Time Delay (ITD), the difference in the arrival times
of a sound at the two ears, is an important feature
used by the human auditory system to reliably localize and separate
sound sources. In addition, the “precedence effect” helps
the auditory system differentiate between the direct sound and
its subsequent reflections in reverberant environments. This paper
uses a cross-correlation-based measure across the two channels
of a binaural signal to isolate the target source by rejecting
portions of the signal corresponding to larger ITDs. To overcome
the effects of reverberation, the steady-state components
of speech are suppressed, effectively boosting the onsets, so as
to retain the direct sound and suppress the reflections. Experimental
results show a significant improvement in recognition
accuracy when both techniques are used. Cross-correlation-based
processing and steady-state suppression are carried out separately,
and the order in which these techniques are applied produces
differences in the resulting recognition accuracy.
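
As a rough illustration of the two ideas named above, the following Python sketch estimates a per-frame ITD from the peak of the cross-correlation between the two channels, zeroes frames whose ITD magnitude exceeds a threshold, and suppresses the steady-state part of a power envelope by subtracting a running average. This is a simplified sketch under stated assumptions, not the processing used in this paper: the function names, the frame-level time-domain binary mask, the first-order running average, and the default values of `forgetting` and `floor` are all hypothetical choices made for illustration.

```python
import math


def estimate_itd(left, right, fs, max_lag):
    """Estimate the ITD (in seconds) of one frame as the lag, in samples,
    that maximizes the cross-correlation between the two channels.
    A positive value means the right channel lags the left channel."""
    n = min(len(left), len(right))
    best_lag, best_score = 0, -math.inf
    # Visit lags in order of increasing magnitude so that ties
    # (for example, an all-zero frame) resolve to the smallest |lag|.
    for lag in sorted(range(-max_lag, max_lag + 1), key=abs):
        lo, hi = max(0, -lag), min(n, n - lag)
        score = sum(left[i] * right[i + lag] for i in range(lo, hi))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs


def reject_large_itd(left, right, fs, frame_len, max_lag, itd_threshold):
    """Keep frames whose |ITD| is within itd_threshold (seconds) and zero
    the rest. Kept frames are the average of the two channels. Trailing
    samples that do not fill a whole frame are dropped.
    Assumption: the target source is in front of the listener (ITD near 0)."""
    out = []
    for start in range(0, min(len(left), len(right)) - frame_len + 1, frame_len):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        keep = abs(estimate_itd(l, r, fs, max_lag)) <= itd_threshold
        out.extend((a + b) / 2 if keep else 0.0 for a, b in zip(l, r))
    return out


def suppress_steady_state(power, forgetting=0.4, floor=0.01):
    """Subtract a first-order running average from a power envelope so that
    onsets stand out and sustained (steady-state) energy is attenuated.
    The output never drops below floor * power, which keeps it positive."""
    out = []
    avg = power[0]
    for p in power:
        avg = forgetting * avg + (1 - forgetting) * p
        out.append(max(p - avg, floor * p))
    return out
```

In this sketch, a frame in which both channels are identical yields an ITD of zero and is kept, while a frame in which one channel is delayed by more than the threshold is zeroed. Applying `suppress_steady_state` to an envelope that steps from a low to a high level produces a large value at the step and a small floored value once the level has settled, which is the onset-boosting behavior described in the abstract.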