Awesome Multimodal Visual Question Answering Papers and Source Codes

Perception Encoder: One Vision Model to Rule Image, Video, and Language Tasks – Without Task-Specific Training 1809

Perception Encoder (PE) redefines what’s possible with a single vision encoder. Unlike legacy approaches that demand different pretraining strategies for…

12/19/2025Dense Visual Prediction, Multimodal Visual Question Answering, Zero-shot Image Classification