PaperCodex

ZipVoice-Dialog: Generate Realistic Spoken Dialogues Instantly—No Fine-Tuning, No Templates

Creating natural-sounding spoken dialogues between two people has long been a pain point in AI-driven voice applications. Traditional approaches either…

01/05/2026 | Non-autoregressive TTS, Spoken Dialogue Generation, Zero-shot Text-to-Speech

YOLOv13: Boost Real-Time Object Detection Accuracy Without Sacrificing Speed or Efficiency

For engineers, researchers, and product teams building real-time vision systems—whether for surveillance cameras, autonomous drones, or mobile apps—achieving high detection…

01/05/2026 | Edge AI, Object Detection, Real-time Computer Vision

UniAnimate-DiT: High-Fidelity Human Animation from a Single Image and Pose Sequence – No Full Retraining Needed

Animating a static human image into a realistic, temporally coherent video used to require massive datasets, complex pipelines, or retraining…

01/05/2026 | Diffusion Transformer, Human Image Animation, Video Generation

360-LLaMA-Factory: Plug-and-Play Sequence Parallelism for Long-Context SFT and DPO Without Rewriting Your Workflow

Training large language models (LLMs) on long sequences—whether for document-level instruction tuning, multi-modal reasoning, or complex alignment tasks—has long been…

01/05/2026 | Direct Preference Optimization, Long-Context Training, Supervised Fine-tuning

DeepResearcher: Train AI Research Agents That Think, Verify, and Adapt in the Real Web Environment

In today’s AI landscape, many organizations rely on large language models (LLMs) to automate complex research tasks—such as competitive analysis,…

01/05/2026 | Autonomous Research Agents, Reinforcement Learning For Information Retrieval, Web-grounded Reasoning

LLM×MapReduce: Generate Coherent Long-Form Articles from Extremely Long Inputs Using LLMs Efficiently

If you’ve ever tried using a large language model (LLM) to synthesize a detailed technical report from hundreds of research…

01/05/2026 | Document Synthesis, Long-context Reasoning, Long-form Generation

Waver: Generate Lifelike, High-Motion Videos in 1080p with One Unified Model

In the rapidly evolving world of generative AI, video generation has remained a particularly challenging frontier—especially when it comes to…

01/05/2026 | Image-to-Video Synthesis, Multimodal Generative Modeling, Text-to-Video Generation

VGGT-Long: Scalable Monocular 3D Reconstruction for Kilometer-Scale Real-World Sequences Without Retraining or Calibration

Monocular 3D reconstruction has seen rapid advances thanks to foundation models capable of inferring rich geometric structure from single images.…

01/05/2026 | Large-scale SLAM, Monocular 3D Reconstruction, Vision Foundation Models

SimpleVLA-RL: Boost Robotic Task Performance with Minimal Data Using Reinforcement Learning

Building capable robotic systems that understand vision, language, and action—commonly referred to as Vision-Language-Action (VLA) models—has become a central goal…

01/05/2026 | Reinforcement Learning, Robotic Manipulation, Vision-Language-Action Modeling

PUSA: Generate High-Quality Video from Text or Images for $500—Not $100,000

Video generation has long been bottlenecked by two stubborn realities: astronomical training costs and rigid temporal modeling. Most state-of-the-art image-to-video…

01/05/2026 | Image-to-Video Synthesis, Multi-condition Video Diffusion, Text-to-Video Generation