Within the growing sea of digital media available on the web, ranging from YouTube videos to podcasts, automatically connecting consumers with content of interest is an increasingly important goal. Content-based recommender systems leverage an individual user's preference ratings to model their interests and information needs in order to make future recommendations. In the context of spoken documents, speech-to-text systems are traditionally used to generate the lexical features on which these user models are based. However, training high-accuracy speech-to-text systems requires large collections of manually transcribed speech audio, a prerequisite that limits the reach of recommender system technology.
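As a concrete illustration of this pipeline, the sketch below builds a user profile from rated ASR transcripts and ranks unseen documents against it. The TF-IDF representation, rating-weighted profile, and cosine scoring are illustrative assumptions for exposition, not the specific models used in this work.

```python
# Minimal content-based recommender over ASR transcripts (illustrative sketch).
import math
from collections import Counter

def tf_idf_vectors(transcripts):
    """Map each transcript (a list of tokens) to a sparse TF-IDF vector."""
    n = len(transcripts)
    df = Counter(term for doc in transcripts for term in set(doc))
    vectors = []
    for doc in transcripts:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def user_profile(rated):
    """Aggregate (vector, rating) pairs into one profile vector.

    Ratings are assumed to be centered, so disliked items subtract
    weight from the profile rather than adding to it."""
    profile = {}
    for vec, rating in rated:
        for term, w in vec.items():
            profile[term] = profile.get(term, 0.0) + rating * w
    return profile

def recommend(profile, candidates, k=5):
    """Rank unseen documents {doc_id: vector} by similarity to the profile."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: cosine(profile, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Because the document features are drawn from recognizer output, transcription errors propagate directly into these vectors, which is why degraded speech-to-text quality matters downstream.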
We investigated content-based recommender systems for spoken documents specifically in low-resource settings, where access to recognizer training data is limited. We measured the downstream impact of degraded speech-to-text quality by varying not only the quantity of available training data but also the language of application and the fidelity of the audio. In this constrained setting, alternative information sources less commonly considered in recommender systems can play an increasingly valuable role. Thus, in conjunction with transcript quality, we evaluated the utility of speaker/language/gender recognition, acoustic event detection, entity linking, context analysis, and zero-resource linguistic discovery, and we developed the requisite information retrieval back-end techniques to fuse these noisy and heterogeneous feature inputs. Our efforts aimed to produce a benchmark of the current state of the art in content-based recommender systems and to identify new recipes for maximizing their performance in low-resource settings.
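One common way to fuse such noisy, heterogeneous evidence in a retrieval back-end is late fusion: each evidence stream scores documents independently, and normalized scores are combined with per-stream weights. The stream names, min-max normalization, and weights below are illustrative assumptions, not the actual configuration of our system.

```python
# Late fusion of per-stream relevance scores (illustrative sketch).
def fuse_scores(stream_scores, weights):
    """Combine {stream: {doc_id: score}} into a single fused ranking.

    Scores are min-max normalized per stream so that streams on
    different scales (and with different noise levels) can be
    weighted comparably before summation."""
    fused = {}
    for stream, scores in stream_scores.items():
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        w = weights.get(stream, 1.0)
        for doc_id, s in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)

# Example: transcript evidence dominates, but speaker identity and
# acoustic events still contribute when transcripts are noisy.
ranking = fuse_scores(
    {"transcript": {"d1": 0.9, "d2": 0.4},
     "speaker":    {"d1": 0.2, "d2": 0.8},
     "events":     {"d1": 0.5, "d2": 0.5}},
    weights={"transcript": 0.6, "speaker": 0.25, "events": 0.15},
)
```

Down-weighting a stream as its reliability degrades (e.g., transcripts from a recognizer trained on little data) is one natural recipe for adapting such a fusion back-end to low-resource conditions.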