
PaperCodex

Kimi-VL: High-Performance Vision-Language Reasoning with Only 2.8B Active Parameters

For teams building real-world AI applications that combine vision and language—whether it’s parsing scanned documents, analyzing instructional videos, or creating…

12/27/2025 · AI Agent Automation, Multimodal Reasoning, Vision-Language Modeling
Search-R1: Train LLMs to Reason and Search Like Human Researchers Using Open-Source Reinforcement Learning

In the rapidly evolving landscape of large language models (LLMs), a critical limitation persists: despite their impressive fluency, LLMs often…

12/27/2025 · Reinforcement Learning for LLMs, Retrieval-Augmented Generation, Tool-Augmented Reasoning
GLM-V: Open-Source Vision-Language Models for Real-World Multimodal Reasoning, GUI Agents, and Long-Context Document Understanding

If your team is building AI applications that need to see, reason, and act—like desktop assistants that interpret screenshots, UI…

12/27/2025 · Multimodal Agents, Multimodal Reasoning, Vision-Language Modeling
Kimi-Audio: A Unified, Open-Source Foundation Model for Speech, Sound, and Spoken Dialogue

Building voice-enabled applications today often means stitching together separate models for speech recognition, sound classification, audio captioning, and spoken response…

12/27/2025 · Audio Understanding, Speech Recognition, Spoken Dialogue Generation
LightZero: One Lightweight Framework for MCTS + Deep Reinforcement Learning Across Games, Control, and Multi-Task Planning

If you’re evaluating tools for building intelligent agents that combine planning and learning—whether for games, robotics, scientific discovery, or general…

12/27/2025 · Deep Reinforcement Learning, Monte Carlo Tree Search, Multi-Task Planning
Step-Audio 2: Open-Source Multimodal LLM for Paralinguistic-Aware, Tool-Enhanced Speech Understanding and Conversation

Step-Audio 2 is an open-source, end-to-end multimodal large language model (MLLM) purpose-built for real-world audio understanding and natural speech conversation.…

12/27/2025 · Audio Understanding, Paralinguistic Reasoning, Speech-to-Speech Conversation
Sa2VA: Unified Vision-Language Model for Accurate Referring Video Object Segmentation from Natural Language

Sa2VA marks a significant step forward in multimodal AI by seamlessly integrating the strengths of SAM2—Meta’s state-of-the-art video object segmentation…

12/27/2025 · Multimodal Grounding, Referring Video Object Segmentation, Vision-Language Modeling
Classiq: Accelerate Quantum Algorithm Development with High-Level Abstraction and Automated Circuit Synthesis

Quantum computing holds immense promise—but building, optimizing, and executing quantum circuits remains a formidable challenge for most developers, researchers, and…

12/27/2025 · Quantum Algorithm Design, Quantum Machine Learning, Quantum State Preparation
Loghub: Real-World System Log Datasets to Power AI-Driven Log Analytics and Research

In the world of software systems—whether they’re cloud-native applications, distributed infrastructures, or legacy enterprise platforms—logs are the lifeblood of observability.…

12/26/2025 · Anomaly Detection, Failure Prediction, Log Parsing
Mini-Omni2: Unified Vision, Speech, and Text Interaction Without External ASR/TTS Pipelines

In today’s open-source AI landscape, building truly multimodal applications often means stitching together separate models for vision, speech recognition (ASR),…

12/26/2025 · End-to-End Voice Assistant, Multimodal Understanding, Speech-to-Speech Interaction

Copyright © 2026 PaperCodex.