ACL 2026 · La Jolla · In-person only

Video Retrieval

A casual, high-signal ACL 2026 session for researchers thinking about video retrieval, multimodal retrieval, and efficient video understanding, from zero-shot systems to scalable search over millions of videos without melting GPUs.

Session Chair: Sourajit Saha
Location: La Jolla
Mode: In-person only

Why this session?

Video is everywhere, but retrieving the right video moment remains expensive, messy, and surprisingly unsolved. This session is for people who want systems that are not only accurate, but also scalable, interactive, and deployable.

🎯

Retrieval that actually works

Zero-shot, few-shot, and instruction-following retrieval systems that can handle diverse video collections and complex user intent.

⚡

Efficiency without sadness

Memory-efficient search, compressed representations, smarter candidate filtering, and practical ways to avoid brute-force everything.

💬

Friendly academic chaos

Informal discussion, idea sharing, open problems, and probably too many opinions about embeddings, rerankers, and compute budgets.

Topics we want to discuss

Bring your papers, failed experiments, weird benchmark observations, and half-formed ideas. Especially the half-formed ideas.

Zero-shot video retrieval · Few-shot retrieval · CoT-guided retrieval · Multimodal reranking · Memory-efficient search · Compressed representations · Test-time adaptation · Interactive retrieval · Scalable video recognition · Annotation-efficient learning · Retrieving from millions of videos · Deployable retrieval systems

Suggested format

The session is designed to be lightweight and discussion-heavy. The times below are placeholders until the ACL schedule is finalized.

00–10 min

Opening: what are we even retrieving?

A quick framing of current video retrieval problems: queries, videos, moments, events, captions, embeddings, and where current systems break.

10–30 min

Lightning idea sharing

Short informal contributions from attendees: recent work, promising directions, painful bottlenecks, and open questions.

30–50 min

Open discussion

How do we retrieve from huge video databases? How do we reduce annotation burden? When should we use rerankers? What should future benchmarks measure?

50–60 min

Takeaways + possible collaborations

Wrap up with promising problems, shared resources, and people who should probably talk to each other after the session.

Slides & resources

Slides, notes, and related links will be posted here after the session.

Come for the retrieval. Stay for the embeddings. Leave with at least one dangerously good research idea.

📍

Where

La Jolla. This session is available in-person only.

👥

Who should come?

Researchers working on video retrieval, multimodal retrieval, efficient video understanding, representation learning, reranking, or scalable search.

🧠

What to bring

Questions, opinions, open problems, negative results, benchmark frustrations, and ideas that are not fully baked yet.