Processing videos by combining visual and audio cues

December 12, 2019

Videos carry information in both the visual and audio domains, so video processing techniques should draw on both for more effective solutions. The HLTCOE has been researching this strategy since 2017, when recognizing individuals in videos using both voice and face was a topic at the SCALE summer workshop. The work resulting from that effort was later published at ICASSP 2018 (Sell et al., "Audio-Visual Person Recognition in Multimedia Data from the IARPA Janus Program").


The recent 2019 NIST Speaker Recognition Evaluation (SRE) included a challenge for recognizing individuals in video using both vision and audio analytics. HLTCOE researchers again applied the techniques outlined in the 2018 paper, this time with updated state-of-the-art face and speaker recognition systems, and the resulting system was highly effective: the HLTCOE submission made almost no mistakes on the evaluation set despite the data’s difficulties, such as low video quality and multiple faces and voices. Even in these challenging conditions, combining visual and audio processing can yield very powerful solutions.


Human Language Technology Center of Excellence