voyage-multimodal-3.5: a new multimodal retrieval frontier with video support

Summary

TL;DR – We’re excited to introduce voyage-multimodal-3.5, our next-generation multimodal embedding model built for retrieval over text, images, and videos. Like voyage-multimodal-3, it embeds interleaved text and images (screenshots, PDFs, tables, figures, slides), but it now adds explicit support for video frames. It’s also the first production-grade video embedding model to support Matryoshka embeddings for flexible dimensionality. voyage-multimodal-3.5 attains 4.56% higher retrieval accuracy than Cohere Embed v4 across 15 visual document retrieval datasets and 4.65% higher than Google Multimodal Embedding 001 across 3 video retrieval datasets, while matching state-of-the-art text models on pure-text search.

We released voyage-multimodal-3, the industry’s first production-grade multimodal model capable of embedding interleaved text and images, over a year ago. Since then, voyage-multimodal-3 has enabled numerous customers to build search and retrieval pipelines over text, PDFs, figures, tables, and other documents rich with visuals. Today, we’re excited to announce voyage-multimodal-3.5, which introduces support for embedding videos while further improving on voyage-multimodal-3’s retrieval quality.

Model architecture. Like voyage-multimodal-3, voyage-multimodal-3.5 adopts an architecture in which both visual and text modalities are passed through a single transformer encoder. This unified architecture preserves contextual relationships between visual and textual information, enabling effective vectorization of interleaved content such as document screenshots, complex PDFs, and annotated images. This stands in contrast to CLIP-based models (such as earlier Cohere multimodal models), which route images and text through separate, independent model towers. CLIP-like models generate embeddings with a well-documented problem known as the modality gap, which we discussed in our voyage-multimodal-3 blog post.
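The Matryoshka property mentioned above means a full-dimensional embedding can be shortened to a prefix of its coordinates and re-normalized, trading a little accuracy for lower storage and faster search. Below is a minimal sketch of that truncation step using NumPy; the 1024-dimensional size and the random vector are illustrative assumptions, not the model's actual output.

```python
import numpy as np


def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of a Matryoshka embedding and
    re-normalize so cosine similarity remains meaningful."""
    truncated = embedding[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated


# Stand-in for a full-dimension embedding returned by the model
# (dimension chosen arbitrarily for illustration).
full = np.random.default_rng(0).normal(size=1024)
full /= np.linalg.norm(full)

# A 256-d prefix that can be indexed in place of the full vector.
short = truncate_matryoshka(full, 256)
```

Because both the full and truncated vectors are unit-normalized, the same cosine-similarity search code works at either dimensionality.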
In practice, this means a text query will of...
