PaperCodex | Page 47 of 53 | Find Awesome Papers and Source Codes

Ovis: Align Vision and Language Embeddings for Superior Multimodal Reasoning Without Proprietary Lock-in 1373

Multimodal Large Language Models (MLLMs) are increasingly vital for tasks that bridge vision and language—yet many struggle to truly fuse…

12/17/2025 | Multimodal Fine-tuning, Multimodal Reasoning, Vision-language Alignment

Parallax: Run LLMs on Decentralized Devices Without Costly GPU Clusters 1004

Deploying large language models (LLMs) today often means relying on expensive, centralized infrastructure—specialized GPU clusters, high-bandwidth data centers, and recurring…

12/17/2025 | Decentralized Inference, Edge AI, Large Language Model Serving

Video-ChatGPT: Enable Accurate, Detailed Video Understanding with Multimodal Conversational AI 1444

Video-ChatGPT is a state-of-the-art multimodal AI system that bridges the gap between video content and human-like conversation. Built by researchers…

12/17/2025 | Multimodal Dialogue, Video Question Answering, Video Understanding

UFO: Automate Multi-App Windows Workflows with Natural Language and Zero Human Intervention 7659

Imagine telling your computer what you want it to do—like “Summarize this PDF, email the summary to my manager, and…

12/17/2025 | Cross-application Task Execution, GUI Automation, Multimodal Reasoning

SALMONN-omni: A Standalone Full-Duplex Speech LLM That Enables Natural, Codec-Free Voice Conversations 1366

Building truly natural voice interfaces has long been a holy grail in AI—yet most current systems fall short when it…

12/17/2025 | Conversational AI, Full-duplex Speech Dialogue, Spoken Language Understanding

PaSa: Autonomous Academic Paper Search Agent That Finds More Relevant Papers Than Google Scholar or ChatGPT 1457

Searching for academic papers is a daily reality for researchers, engineers, and students—but traditional tools often fall short. Google Scholar…

12/17/2025 | Academic Paper Search, LLM-based Agent, Retrieval-augmented Reasoning

HealthGPT: Unified Medical Vision-Language Understanding and Generation in a Single Model 1567

HealthGPT is a cutting-edge Medical Large Vision-Language Model (Med-LVLM) designed to tackle a long-standing challenge in AI for healthcare: the…

12/17/2025 | Medical Image Generation, Medical Vision-language Modeling, Visual Question Answering

UNetFormer: Real-Time, High-Accuracy Semantic Segmentation for Urban Remote Sensing Imagery 1007

Semantic segmentation of urban remote sensing imagery—such as aerial photos from drones or satellites—is essential for applications like land cover…

12/17/2025 | Remote Sensing Image Analysis, Semantic Segmentation, Vision Transformer

Hunyuan3D 2.0: Open-Source High-Resolution Textured 3D Generation from Images and Text 12640

Hunyuan3D 2.0 is a powerful, open-source system developed by Tencent for generating high-resolution, textured 3D assets from either images or…

12/17/2025 | 3D Generation, Image-to-3D, Texture Synthesis

AniSora: The First Open-Source Animation Video Generator Built Specifically for Anime-Style Motion and Consistency 2283

While general-purpose video generation models like Sora, Kling, and CogVideoX have revolutionized photorealistic video synthesis, they consistently underperform when it…

12/17/2025 | Animation Video Generation, Controllable Video Synthesis, Stylized Motion Modeling