June 1st to August 7th
******************************************************************************
As audiovisual media becomes the dominant format for capturing and sharing information online, the ability to discover and understand real-time, multilingual video content has become increasingly important. From smartphone footage of natural disasters to publicly available livestreams situated near roads or high-risk infrastructure, these unedited clips offer firsthand evidence of events as they unfold. Together with their audio, speech, and embedded text, they form a rich multimodal source of information that remains underutilized in current retrieval-augmented generation systems.
Building on a key finding from SCALE 2024 that retrieving raw event footage is substantially harder than retrieving edited videos [1,2], SCALE 2026 shifts the focus to raw, real-world multimodal understanding at scale. Given specific information needs over unedited video, we aim to develop systems that not only retrieve relevant content but also understand and synthesize it. This includes identifying and describing physical events (concrete, observable real-world occurrences) and, when possible, incorporating analyst background knowledge to support richer inferences and sensemaking.
To support this, we will directly evaluate modality-specific technologies for detecting and extracting relevant signals from raw video data. First-stage research areas of interest include:
These signals will then feed into and inform the second stage, our primary task of multimodal retrieval-augmented generation. Given a physical or analytical information need and a large collection of raw multilingual videos, the system must retrieve relevant segments and generate a coherent summary of the most user-relevant content. Second-stage research areas of interest include:
By integrating robust visual and audio processing into a retrieval-augmented generation framework, SCALE 2026 offers a realistic setting for advancing real-world multimodal understanding.
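As a rough illustration of the second-stage task only, and not a description of any planned SCALE system, the sketch below shows the retrieve-then-generate shape in minimal Python. Every component is a hypothetical placeholder: embed_text stands in for a multilingual encoder, score for per-modality fusion, and generate_summary for an evidence-conditioned generator.

    # Minimal sketch of a two-stage multimodal RAG loop: retrieve video
    # segments relevant to an information need, then summarize them.
    # All embedding/generation functions are hypothetical stand-ins.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Segment:
        video_id: str
        start: float       # segment start, in seconds
        end: float         # segment end, in seconds
        transcript: str    # ASR output for the segment
        ocr_text: str      # embedded/scene text for the segment

    def embed_text(text: str) -> np.ndarray:
        # Placeholder encoder: a real system would use learned multilingual
        # text/speech/video encoders rather than a hash-seeded random vector.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        vec = rng.normal(size=384)
        return vec / np.linalg.norm(vec)

    def score(query_vec: np.ndarray, seg: Segment) -> float:
        # Fuse per-modality similarities with a simple weighted average.
        speech_sim = float(query_vec @ embed_text(seg.transcript))
        ocr_sim = float(query_vec @ embed_text(seg.ocr_text))
        return 0.6 * speech_sim + 0.4 * ocr_sim

    def retrieve(query: str, segments: list, k: int = 5) -> list:
        # Rank all segments against the information need and keep the top k.
        q = embed_text(query)
        return sorted(segments, key=lambda s: score(q, s), reverse=True)[:k]

    def generate_summary(query: str, evidence: list) -> str:
        # Placeholder for an LLM call conditioned on the retrieved segments.
        cited = ", ".join(f"{s.video_id}[{s.start:.0f}-{s.end:.0f}s]" for s in evidence)
        return f"Summary for '{query}' grounded in: {cited}"

    if __name__ == "__main__":
        collection = [
            Segment("vid_001", 0.0, 30.0, "flood waters rising near the bridge", "Route 9 closed"),
            Segment("vid_002", 45.0, 90.0, "crowd gathering outside the stadium", "Gate B"),
        ]
        need = "flooding near major roads"
        top = retrieve(need, collection, k=1)
        print(generate_summary(need, top))

A real system would replace the random-projection placeholder with learned video, speech, and OCR encoders and fuse modality-specific rankings, for example via reciprocal rank fusion as in [3].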
******************************************************************************
For Additional Information or to Apply
We invite interested researchers and students to apply to the SCALE 2026 program – a funded 10-week research workshop hosted at the Human Language Technology Center of Excellence (HLTCOE) at Johns Hopkins University. The workshop will be held in person in Baltimore, Maryland.
Please contact us at [email protected]. Interested participants should send a CV along with a short message detailing their interest.
Applications will be considered until February 15, 2026, but decisions are made on a rolling basis, so you are encouraged to apply as soon as possible.
******************************************************************************
References
[1] Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco, Alexander Martin, Ronald Colaianni, Nolan King, Eugene Yang, and Benjamin Van Durme. MultiVENT 2.0: A Massive Multilingual Benchmark for Event-centric Video Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[2] Kate Sanders, David Etter, Reno Kriz, and Benjamin Van Durme. MultiVENT: Multilingual Videos of Events and Aligned Natural Text. In the Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023.
[3] Saron Samuel, Dan DeGenaro, Jimena Guallar-Blasco, Kate Sanders, Tanner Spendlove, Seun Eisape, Arun Reddy, Alexander Martin, Andrew Yates, Eugene Yang, Cameron Carpenter, David Etter, Efsun Kayi, Matthew Wiesner, Nolan King, Paul Boegner, Steven Triplett, Thankam Abish, Kenton Murray, and Reno Kriz. MMMoRRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion. In the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025.
[4] Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, and Benjamin Van Durme. WikiVideo: Article Generation from Multiple Videos. arXiv preprint, 2025.