
PaperCodex


Multimodal Representation Learning

LanguageBind: Unify Video, Audio, Depth, Thermal & Text in One Language-Aligned Multimodal Space


Imagine building an AI system that understands not just images and text—but also video, audio, infrared (thermal), and depth data—all…

01/13/2026 · Cross-Modal Retrieval, Multimodal Representation Learning, Zero-shot Transfer Learning
ULIP-2: Scalable Multimodal 3D Understanding Without Manual Annotations


Imagine building a system that understands 3D objects as intuitively as humans do—recognizing a chair from its point cloud,…

01/13/2026 · 3D Classification, Multimodal Representation Learning, Zero-shot Learning
ONE-PEACE: A Single Model for Vision, Audio, and Language with Zero Pretraining Dependencies


In today’s AI landscape, most multimodal systems are built by stitching together specialized models—separate vision encoders, audio processors, and language…

12/26/2025 · Cross-Modal Retrieval, Multimodal Representation Learning, Zero-shot Transfer Learning
FlowTok: Unified Text-to-Image and Image-to-Text Generation with Compact 1D Tokens


FlowTok reimagines cross-modal generation by collapsing the traditionally complex boundary between text and images into a single streamlined, efficient process. Unlike…

12/19/2025 · Image-to-text Generation, Multimodal Representation Learning, Text-to-Image Generation
Copyright © 2026 PaperCodex.