SCALE 2017

Speaker Recognition in Multimedia Data

June 5 – August 4

With rapid advances in handheld devices, internet bandwidth, and social media platforms, video is becoming a nearly ubiquitous form of online communication. As the medium grows increasingly common, it is important that technologies like speaker recognition adapt to the challenges of this domain and take full advantage of its benefits.

Video presents opportunities to advance speaker recognition in two ways. First, video content is completely unconstrained, so the channels, codecs, and acoustic environments it contains are far more diverse than most traditional speech corpora allow. Developing robustness to this wider range of conditions is critical for effective analytics. Second, the presence of multiple modalities (audio and image) opens new opportunities for fusion and joint processing.

In SCALE ’17, we explored both of these elements of the multimedia domain. Using challenging corpora, we aimed to improve state-of-the-art performance in audio-only speaker recognition under diverse noise, unknown numbers of speakers, and severe channel mismatch. We also extended the system to incorporate information from the video imagery via face detection and recognition. This offers not only the potential to improve both speaker recognition and face recognition with complementary information, but also the ability for each technology to succeed when one mode is ineffective (obstructed faces or excessively degraded audio).
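
As a concrete illustration of this kind of multimodal combination, the sketch below shows simple weighted score-level fusion of a speaker-verification score and a face-verification score, with a fallback to the remaining modality when one is unusable. It is a minimal example only; the function name, weights, and score conventions are illustrative assumptions and do not describe the systems built during the workshop.

    from typing import Optional

    def fuse_scores(speaker_score: Optional[float],
                    face_score: Optional[float],
                    w_speaker: float = 0.6,
                    w_face: float = 0.4) -> float:
        """Return a fused verification score.

        Scores are assumed to be calibrated log-likelihood ratios;
        None means the modality was unusable (e.g., an obstructed
        face or excessively degraded audio).
        """
        if speaker_score is None and face_score is None:
            raise ValueError("no usable modality")
        if speaker_score is None:      # fall back to face only
            return face_score
        if face_score is None:         # fall back to audio only
            return speaker_score
        # Weighted linear fusion of the two calibrated scores.
        return w_speaker * speaker_score + w_face * face_score

    if __name__ == "__main__":
        print(fuse_scores(2.1, 1.4))   # both modalities present
        print(fuse_scores(2.1, None))  # audio-only fallback

In practice the weights (or a more general fusion model) would be trained on held-out trials, but the fallback logic captures the key benefit: the combined system degrades gracefully rather than failing when one modality is missing.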

Research areas of interest include:

  • Speaker recognition and diarization
  • Robustness in noisy conditions and across domains
  • Face detection, tracking, and recognition in videos
  • Speech activity detection
  • Speech separation and enhancement
  • Multimodal processing of video
  • Deep learning

Human Language Technology Center of Excellence