PaperCodex

FramePack: Generate Long, High-Quality Videos on a Laptop—Without Cloud Costs or Drifting Artifacts 16308

Creating long, coherent, and visually rich videos with AI has long been bottlenecked by computational complexity, memory constraints, and error…

12/27/2025 Diffusion Models, Next-frame Prediction, Video Generation

Second-Me: Your Private, Persistent AI Self That Eliminates Repetitive Data Entry and Reclaims Your Digital Identity 14752

In a world where AI assistants increasingly mediate our interactions with apps, services, and even other people, a critical problem…

12/27/2025 AI Personalization, Decentralized AI Agents, Personal Knowledge Management

MultiTalk: Generate Realistic Multi-Person Conversational Videos from Audio with Precise Speaker Binding 2704

Creating lifelike videos of people talking has long been dominated by “talking head” technologies—tools that animate a single face from…

12/27/2025 Audio-Driven Video Generation, Conversational Video Synthesis, Multi-Person Animation

DGM: Self-Improving AI Agents That Evolve Their Own Code Without Human Redesign 1762

Most AI systems today are stuck in time. Their architectures, prompts, and tooling are all hand-crafted by engineers—once deployed, they…

12/27/2025 Automated Code Evolution, Open-ended Learning, Self-improving Agents

Kimi-VL: High-Performance Vision-Language Reasoning with Only 2.8B Active Parameters 1122

For teams building real-world AI applications that combine vision and language—whether it’s parsing scanned documents, analyzing instructional videos, or creating…

12/27/2025 AI Agent Automation, Multimodal Reasoning, Vision-Language Modeling

Search-R1: Train LLMs to Reason and Search Like Human Researchers Using Open-Source Reinforcement Learning 3614

In the rapidly evolving landscape of large language models (LLMs), a critical limitation persists: despite their impressive fluency, LLMs often…

12/27/2025 Reinforcement Learning For LLMs, Retrieval-Augmented Generation, Tool-augmented Reasoning

GLM-V: Open-Source Vision-Language Models for Real-World Multimodal Reasoning, GUI Agents, and Long-Context Document Understanding 1899

If your team is building AI applications that need to see, reason, and act—like desktop assistants that interpret screenshots, UI…

12/27/2025 Multimodal Agents, Multimodal Reasoning, Vision-Language Modeling

Kimi-Audio: A Unified, Open-Source Foundation Model for Speech, Sound, and Spoken Dialogue 4373

Building voice-enabled applications today often means stitching together separate models for speech recognition, sound classification, audio captioning, and spoken response…

12/27/2025 Audio Understanding, Speech Recognition, Spoken Dialogue Generation

LightZero: One Lightweight Framework for MCTS + Deep Reinforcement Learning Across Games, Control, and Multi-Task Planning 1481

If you’re evaluating tools for building intelligent agents that combine planning and learning—whether for games, robotics, scientific discovery, or general…

12/27/2025 Deep Reinforcement Learning, Monte Carlo Tree Search, Multi-Task Planning

Step-Audio 2: Open-Source Multimodal LLM for Paralinguistic-Aware, Tool-Enhanced Speech Understanding and Conversation 1252

Step-Audio 2 is an open-source, end-to-end multimodal large language model (MLM) purpose-built for real-world audio understanding and natural speech conversation.…

12/27/2025 Audio Understanding, Paralinguistic Reasoning, Speech-to-speech Conversation