Simultaneous-speaker voice activity detection and localization using mid-fusion of SVM and HMMs

Vicente P. Minotto, Claudio R. Jung, Bowon Lee

Research output: Contribution to journal › Article › peer-review

33 Scopus citations

Abstract

To communicate effectively, humans can extract the speech signals they need to understand from a mixture of background noise, interfering sound sources, and reverberation. Voice Activity Detection (VAD) and Sound Source Localization (SSL) are key signal processing tasks that humans perform using the sound received at both ears, sometimes aided by visual cues obtained by locating the speaker and observing lip movements. Both VAD and SSL are crucial design elements for applications involving human speech. For example, systems with microphone arrays can use them for robust speech capture in video conferencing, or for speaker identification and speech recognition in Human-Computer Interfaces (HCIs). Designing and implementing robust VAD and SSL algorithms for practical acoustic environments remains challenging, particularly when multiple simultaneous speakers share the same audiovisual scene. In this work we propose a multimodal approach that uses Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) to assess the video and audio modalities captured by an RGB camera and a microphone array. By analyzing the individual speakers' spatio-temporal activities and mouth movements, we propose a mid-fusion approach that performs both VAD and SSL for multiple active and inactive speakers. We tested the proposed algorithm in scenarios with up to three simultaneous speakers, achieving an average VAD accuracy of 95.06% and an average error of 10.9 cm when estimating the three-dimensional locations of the speakers.

Original language: English
Article number: 6737222
Pages (from-to): 1032-1044
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 16
Issue number: 4
State: Published - Jun 2014

Keywords

  • Beamforming
  • SRP-PHAT
  • hidden Markov model
  • multimodal fusion
  • optical-flow
  • sound source localization
  • support vector machine
  • voice activity detection
