SCALE 2024

Video-Based Event Retrieval

June 3rd to August 9th

************************************************************************************************
Information dissemination for current events has traditionally consisted of professionally collected and produced materials, leading to large collections of well-written news articles and high-quality videos. As a result, most prior work in event analysis and retrieval has focused on leveraging this traditional news content, particularly in English. However, much of the event-centric content today is generated by non-professionals, such as on-the-scene witnesses to events who hastily capture videos and upload them to the internet without further editing.

SCALE’24 will focus on the retrieval of event-based visual content found in both professional and non-professional videos. Our goals for this workshop are to understand how well current state-of-the-art computer vision technologies perform on the retrieval of multilingual event-based visual content and to explore how different modalities can help with this task. Among other research directions, we will consider the efficacy of large pre-trained multimodal models, e.g., CLIP, InternVideo, and GPT-4V. Recent research¹ has shown that a multilingual version of CLIP shows promise, but this remains a challenging and open research question.

In addition, we plan to leverage modality-specific models from optical character recognition (OCR), automated speech recognition (ASR), computer vision (CV), and natural language processing (NLP), and determine if these models can be combined. Within these modalities, other secondary tasks we will consider include:

- Video-based event classification, to search for general event types, e.g., {floods in Australia}.
- Identifying key frames that contain relevant information about an event, e.g., who has been affected by the flood.
- Extracting relevant OCR information from key frames, e.g., a banner of text describing where the flood occurred².
- Identifying the location of specific information relevant to an event within a key frame.
- Transcribing speech and extracting relevant event information from speech, e.g., a reporter discussing the flood.

************************************************************************************************

For Additional Information or to Apply

We invite interested researchers and students to apply to the SCALE 2024 program — a funded 10-week research workshop hosted at the Human Language Technology Center of Excellence (HLTCOE) at Johns Hopkins University. The workshop will be in-person in Baltimore, Maryland.

Please contact us at [email protected]. Interested participants should send CVs along with a short message detailing their interest.

The application deadline is 15 April 2024, but we will make decisions on a rolling basis as applications are received, so we encourage you to apply as soon as possible.
