Speaker Identification based on Automatic Crossmodal Fusion of Audio and Visual Data

Richard Reilly, University College Dublin

Abstract
It has long been reported that multimodal integration enhances our ability to detect, locate and discriminate external stimuli. Translating this automatic integration process to electronic systems and devices for speech or speaker recognition is of great research interest.

In this paper an audio-visual speaker identification system employing multimodal integration is reported, with the audio and visual speech modalities combined using automatic classifier fusion. The visual modality uses the speaker's lip information. The fusion employs a feedback mechanism that automatically adapts the weighting of the audio and visual information based on reliability estimates derived from the outputs of the audio and visual feedforward recognisers.
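
To illustrate the kind of reliability-weighted late fusion described above, the Python sketch below combines audio and visual classifier scores using per-modality reliability estimates. The entropy-based reliability measure and the weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def reliability(posteriors):
    """Assumed reliability measure: 1 minus the normalised entropy of the
    classifier's posterior distribution, so a peaked (confident) output
    yields a value near 1 and a flat (uncertain) output a value near 0."""
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p))
    return 1.0 - entropy / np.log(len(p))

def fuse_scores(audio_post, visual_post):
    """Late fusion of audio and visual speaker posteriors, weighted by the
    relative reliability of each modality (adaptive weighting sketch)."""
    ra, rv = reliability(audio_post), reliability(visual_post)
    alpha = ra / (ra + rv)                      # audio weight adapts to relative reliability
    fused = alpha * audio_post + (1.0 - alpha) * visual_post
    return int(np.argmax(fused)), fused         # identified speaker index and fused scores

# Example: hypothetical 251-class posteriors from each modality for one test utterance
rng = np.random.default_rng(0)
audio_post = rng.dirichlet(np.ones(251))
visual_post = rng.dirichlet(np.ones(251))
speaker_id, _ = fuse_scores(audio_post, visual_post)
```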

The robustness of the system was assessed using additive white Gaussian noise for the audio modality and ten levels of JPEG compression for the visual modality. Experiments were carried out on a large data set of 251 subjects from an international audio-visual database (XM2VTS). The results show improved audio-visual speaker identification at all tested levels of audio and visual mismatch, compared with speaker identification using either modality alone. By combining multisensory information in this way, the system achieves audio-visual speaker identification accuracies ranging from 99.2% with no audio or visual noise to 71.4% at the most severe mismatch levels.
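
As a sketch of how the audio modality might be degraded for such robustness tests, the following Python function adds white Gaussian noise at a target signal-to-noise ratio; the specific SNR levels used in the experiments are not given here and are left as a parameter.

```python
import numpy as np

def add_awgn(signal, snr_db):
    """Add white Gaussian noise to an audio signal at a target SNR in dB
    (illustrative degradation for robustness testing)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```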

The automatic fusion of information from the different modalities, based on this physiologically inspired model, offers substantial benefits for speaker identification, speech recognition and other applications.
