Skip to content

PaperCodex

Subscribe
RCG: High-Fidelity Unconditional Image Generation Without Labels

RCG: High-Fidelity Unconditional Image Generation Without Labels 929

For years, unconditional image generation—creating realistic images without relying on human-provided class labels—has lagged significantly behind its class-conditional counterpart in…

01/13/2026Diffusion Models, Self-Supervised Representation Learning, Unconditional Image Generation
JoyVASA: Animate Human and Animal Portraits from Audio with Diffusion-Based Lip Sync and Head Motion

JoyVASA: Animate Human and Animal Portraits from Audio with Diffusion-Based Lip Sync and Head Motion 840

Audio-driven facial animation has long been a challenging yet highly valuable capability—from building expressive virtual agents to creating personalized pet…

01/13/2026Audio-driven Facial Animation, Cross-species Avatar Synthesis, Diffusion-based Video Generation
Stereo Anything: Zero-Shot Stereo Matching That Works Across Any Domain Without Retraining

Stereo Anything: Zero-Shot Stereo Matching That Works Across Any Domain Without Retraining 803

Stereo matching—the task of finding corresponding pixels between left and right images to infer depth—is foundational to 3D vision systems…

01/13/20263D Vision, Stereo Matching, Zero-shot Generalization
SQuARE: Boost LLM Reasoning with Self-Generated Questions—No Retraining Needed

SQuARE: Boost LLM Reasoning with Self-Generated Questions—No Retraining Needed 757

Solving complex reasoning problems is a persistent challenge for Large Language Models (LLMs). While techniques like Chain-of-Thought (CoT) prompting have…

01/13/2026Chain-of-thought Prompting, Question Answering, Reasoning Enhancement
FoleyCrafter: Generate Lifelike, Synchronized Sound Effects for Silent Videos—Automatically

FoleyCrafter: Generate Lifelike, Synchronized Sound Effects for Silent Videos—Automatically 630

Silent videos—whether from AI-generated content, archival footage, gameplay recordings, or unfinished film prototypes—often lack the immersive quality that sound brings.…

01/13/2026Multimodal Synthesis, Neural Foley, Video-to-audio Generation
SAMMO: Optimize LLM Prompt Programs Like Code—Structure-Aware, Compile-Time Tuning for RAG, Instruction Refinement, and Prompt Compression

SAMMO: Optimize LLM Prompt Programs Like Code—Structure-Aware, Compile-Time Tuning for RAG, Instruction Refinement, and Prompt Compression 731

Modern LLM applications increasingly rely on complex, structured prompts—especially in scenarios like Retrieval-Augmented Generation (RAG), instruction-based tasks, and data labeling…

01/13/2026Instruction Tuning, Prompt Optimization, Retrieval-Augmented Generation (RAG)
ChangeMamba: High-Accuracy, Low-Cost Change Detection for Remote Sensing Without CNN or Transformer Trade-Offs

ChangeMamba: High-Accuracy, Low-Cost Change Detection for Remote Sensing Without CNN or Transformer Trade-Offs 515

Change detection in remote sensing—identifying what has changed between two satellite or aerial images taken at different times—is a critical…

01/13/2026Binary Change Detection, Building Damage Assessment, Semantic Change Detection
DaGAN++: Generate Realistic Talking Head Videos with Depth-Aware AI—No 3D Labels Needed

DaGAN++: Generate Realistic Talking Head Videos with Depth-Aware AI—No 3D Labels Needed 995

Creating lifelike talking head videos—where a static portrait speaks and moves in sync with an audio or driving video—has long…

01/13/2026Depth-aware Video Synthesis, Facial Reenactment, Talking Head Generation
OCRBench: The Definitive Benchmark for Evaluating Real-World OCR Capabilities in Large Multimodal Models

OCRBench: The Definitive Benchmark for Evaluating Real-World OCR Capabilities in Large Multimodal Models 726

Large Multimodal Models (LMMs) like GPT-4V and Gemini promise powerful vision-language understanding—but how well do they actually read text in…

01/13/2026Key Information Extraction, Optical Character Recognition (OCR), Visual Question Answering
ULIP-2: Scalable Multimodal 3D Understanding Without Manual Annotations

ULIP-2: Scalable Multimodal 3D Understanding Without Manual Annotations 547

Imagine building a system that can understand 3D objects as intuitively as humans do—recognizing a chair from its point cloud,…

01/13/20263D Classification, Multimodal Representation Learning, Zero-shot Learning

Posts pagination

Previous 1 … 9 10 11 … 53 Next
Copyright © 2026 PaperCodex.
  • Facebook
  • YouTube
  • Twitter

PaperCodex