PaperCodex

InstantStyle: Effortless, Tuning-Free Style Preservation for Text-to-Image Generation 1969

InstantStyle is a breakthrough framework that enables high-fidelity, style-consistent image generation without requiring any model retraining or per-image tuning. Built…

12/19/2025 Image Stylization, Style Transfer, Text-to-Image Generation

InternGPT: Solve Vision-Centric Tasks with Clicks, Scribbles, and ChatGPT-Level Reasoning 3221

In today’s AI landscape, large language models (LLMs) like ChatGPT have transformed how we interact with software—through natural language. But…

12/19/2025 Interactive Image Editing, Multimodal Reasoning, Vision-language Modeling

Marco-o1: Open-Source Reasoning Models That Reduce Hallucination and Over-Thinking in Complex Tasks 1528

As large reasoning models (LRMs) like OpenAI’s o1 demonstrate unprecedented capabilities in math, code, and planning, a critical gap remains:…

12/19/2025 Agentic Planning, Chain-of-thought Distillation, Reasoning Models

DragDiffusion: Precise, Interactive Image Editing for Real and AI-Generated Photos Using Diffusion Models 1234

DragDiffusion is an open-source framework that brings pixel-precise, point-based image manipulation to both real-world photographs and AI-generated images—without requiring users…

12/19/2025 Diffusion Models, Image Editing, Interactive Manipulation

OmniGen: One Unified Model for All Image Generation Tasks—No Plugins, No Preprocessing, Just Prompts 4282

Modern image generation is powerful—but fragmented. Depending on your goal—generating from text, editing existing images, preserving a person’s identity, or…

12/19/2025 Image Editing, Subject-driven Generation, Text-to-Image Generation

AniPortrait: Generate Photorealistic Talking-Head Videos from a Single Image and Audio Clip 5006

Creating lifelike, animated human faces used to require complex pipelines—motion capture rigs, professional voice actors, or hours of post-production. But…

12/19/2025 Audio-driven Animation, Face Reenactment, Portrait Animation

GaussianObject: High-Quality 3D Reconstruction from Just Four Images—No COLMAP Required 1120

Creating photorealistic 3D models of real-world objects typically demands dozens—or even hundreds—of input images captured from carefully calibrated viewpoints. This…

12/19/2025 3D Object Reconstruction, Gaussian Splatting, Sparse-view Synthesis

AM-RADIO: Unify Vision Foundation Models into One High-Performance Backbone for Multimodal, Segmentation, and Detection Tasks 1357

In modern computer vision, practitioners often juggle multiple foundation models—CLIP for vision-language alignment, DINOv2 for dense feature extraction, and SAM…

12/19/2025 Object Detection, Semantic Segmentation, Vision-language Understanding

Semantic Operators: Declarative, Fast, and Accurate AI-Powered Data Processing for Unstructured and Structured Data 1484

Processing unstructured data—like free-form text, documents, or multimodal inputs—with large language models (LLMs) has become essential across industries, from biomedical…

12/19/2025 LLM-powered Analytics, Semantic Data Processing, Unstructured Data Transformation

NeedleBench: Rigorously Evaluate LLM Retrieval and Reasoning in Long-Context Scenarios 6409

Evaluating how well large language models (LLMs) retrieve critical facts and perform reasoning over long documents remains a major challenge…

12/19/2025 Complex Reasoning, Long-context Retrieval, Synthetic Benchmarking