PaperCodex

ShowUI: Open-Source Vision-Language-Action Model for Human-Like GUI Automation from Screenshots 1509

In today’s digital workflows, automating interactions with graphical user interfaces (GUIs)—whether on websites, mobile apps, or desktop software—is a high-value…

12/17/2025 GUI Automation, Vision-Language-Action Modeling, Zero-Shot UI Grounding

VideoRAG: Unlock Long-Form Video Understanding with Retrieval-Augmented Generation for AI-Powered Insights 1356

Imagine being able to ask questions like “What did the professor say about quantum entanglement in Lecture 3?” or “Show…

12/17/2025 Multimodal Reasoning, Retrieval-Augmented Generation, Video Understanding

InspireMusic: Generate High-Fidelity, Long-Form Music from Text or Audio with LLM and Super-Resolution 1254

InspireMusic is an open-source framework that redefines what’s possible in AI-powered music generation. By seamlessly integrating a large language model…

12/17/2025 Audio Super-resolution, Music Continuation, Text-to-music Generation

StoryDiffusion: Generate Consistent Long-Form Visual Stories from Text Without Retraining Models 6351

Creating visually coherent sequences of images or videos from text prompts has long been a bottleneck in AI-powered storytelling. While…

12/17/2025 Text-to-Image Generation, Video Generation, Visual Storytelling

TikZero: Generate Editable, Precise Scientific Figures from Text—No Paired Training Data Needed 1650

Creating publication-ready scientific diagrams often requires deep familiarity with vector graphics tools or typesetting systems like LaTeX and TikZ. While…

12/17/2025 Multimodal Learning, Program Synthesis, Zero-shot Generation

YuE: Open-Source Foundation Model for Full-Length, Lyrics-Aligned Song Generation in Multiple Languages 5810

Creating a complete, coherent song—complete with expressive vocals, lyrics that match the melody, and stylistically consistent accompaniment—has long been a…

12/17/2025 In-context Learning For Audio, Long-form Music Generation, Lyrics-to-song Generation

MMaDA: One Unified Model for Text Reasoning, Multimodal Understanding, and Image Generation 1518

Imagine running a single model that can answer complex reasoning questions, understand images and text together, and generate high-quality images…

12/17/2025 Diffusion Language Models, Multimodal Reasoning, Text-to-Image Generation

WebThinker: Autonomous Web Research for Large Reasoning Models That Need Real-Time, Multi-Source Knowledge Synthesis 1366

In today’s fast-evolving information landscape, even the most advanced large reasoning models (LRMs)—such as OpenAI-o1 or DeepSeek-R1—are constrained by their…

12/17/2025 Autonomous Web Research, Deep Reasoning Agent, Retrieval-Augmented Generation

SkyReels-V2: The First Open-Source Model for Infinite-Length, Cinematic-Quality Video Generation 5119

Video generation has seen remarkable progress in recent years, yet most models remain limited to short clips—typically 5 to 10…

12/17/2025 Image-to-Video Synthesis, Long-form Video Generation, Text-to-Video Generation

rStar2-Agent: A 14B Math Reasoning Model That Outsmarts 671B Models with Smarter, Tool-Aware Agentic Reasoning 1356

In the rapidly evolving landscape of large language models (LLMs), bigger isn’t always better—smarter is. Enter rStar2-Agent, a 14-billion-parameter reasoning…

12/17/2025 Agentic Tool Use, Mathematical Reasoning, Reinforcement Learning For Reasoning