PaperCodex

Large Language Model Inference

SmoothQuant: Accurate 8-Bit LLM Inference Without Retraining – Slash Memory and Boost Speed

Deploying large language models (LLMs) in production is expensive—not just in dollars, but in compute and memory. While models like…

01/05/2026 · Efficient LLM Deployment, Large Language Model Inference, Post-training Quantization

SageAttention3: 5x Faster LLM Inference on Blackwell GPUs with Plug-and-Play FP4 Attention and First-Ever 8-Bit Training Support

Attention mechanisms lie at the heart of modern large language models (LLMs) and multimodal architectures—but their quadratic computational complexity remains…

12/19/2025 · Efficient Training, Large Language Model Inference, Multimodal Generation
Copyright © 2026 PaperCodex.