PaperCodex

SpargeAttention: Universal, Training-Free Sparse Attention for Faster LLM, Image & Video Inference Without Retraining 814

Large AI models—from language generators to video diffusion systems—are bottlenecked by the attention mechanism, whose computational cost scales quadratically with…

01/13/2026 | Image Generation, Language Modeling, Video Generation

ConsisID: Generate Identity-Preserving Videos from Text and a Single Image—No Fine-Tuning Required 790

Creating videos that faithfully preserve a person’s identity from just a text prompt and a reference image has long been…

01/13/2026 | Diffusion Transformer Control, Identity-preserving Video Generation, Text-to-video Synthesis

Liquid: One Unified Language Model for Text and Images—No CLIP, No Compromises 633

What if a single large language model (LLM) could both understand and generate high-quality images—without relying on external vision encoders…

01/13/2026 | Multimodal Generation, Text-to-Image Generation, Visual Understanding

GaussianFormer-2: Efficient 3D Semantic Occupancy Prediction for Vision-Centric Autonomous Driving 574

In autonomous driving systems that rely primarily on camera inputs—so-called vision-centric setups—accurately understanding the 3D structure and semantics of the…

01/13/2026 | 3D Semantic Occupancy Prediction, Sparse Scene Representation, Vision-based 3D Perception

Ultra-Fast-Lane-Detection-V2: Real-Time, Robust Lane Detection for Autonomous Driving and ADAS Systems 759

Lane detection is a foundational capability in autonomous driving and advanced driver-assistance systems (ADAS). Traditional approaches often rely on pixel-wise…

01/13/2026 | Lane Detection, Ordinal Classification, Real-Time Perception

Tip-Adapter: Boost Few-Shot Image Classification Without Any Training 647

In the era of foundation models, CLIP (Contrastive Language–Image Pretraining) has revolutionized how we approach vision-language tasks—especially zero-shot image classification.…

01/13/2026 | Few-shot Image Classification, Vision-language Adaptation, Zero-shot Learning

DeepRetrieval: Boost Search Accuracy by 2.6× Without Any Labeled Data—Powered by Reinforcement Learning 679

Imagine you’re building a retrieval-augmented generation (RAG) system, a scientific literature assistant, or a natural-language interface to a clinical trial…

01/13/2026 | Information Retrieval, Query Rewriting, Retrieval-Augmented Generation

HART: Generate 1024×1024 Images Faster and More Efficiently Than Diffusion Models 635

For teams building AI-powered visual applications—whether in creative tools, digital content platforms, or rapid prototyping—the trade-off between image quality, speed,…

01/13/2026 | Autoregressive Modeling, High-resolution Synthesis, Image Generation

X-Pose: Detect Any Keypoint Across Humans, Animals, and Objects with One Unified Model 728

Keypoint detection—identifying specific, semantically meaningful points on objects like joints on a human body, facial landmarks on an animal, or…

01/13/2026 | Keypoint Detection, Multi-object Pose Estimation, Promptable Vision Models

SqueezeLLM: Deploy High-Accuracy LLMs in Half the Memory Without Sacrificing Performance 704

Deploying large language models (LLMs) like LLaMA, Mistral, or Vicuna often demands multiple high-end GPUs, complex inference pipelines, and substantial…

01/13/2026 | Efficient Inference, Large Language Model Deployment, Model Quantization