Awesome Efficient LLM Deployment Papers and Source Code

OmniQuant: Near-Lossless LLM Quantization for Real-World Deployment on GPUs and Mobile Devices 857

Deploying large language models (LLMs) in real-world applications remains a major engineering challenge. While models like LLaMA-2, Falcon, and Mixtral…

01/09/2026 | Efficient LLM Deployment, Large Language Model Quantization, Post-training Quantization

SmoothQuant: Accurate 8-Bit LLM Inference Without Retraining – Slash Memory and Boost Speed 1576

Deploying large language models (LLMs) in production is expensive—not just in dollars, but in compute and memory. While models like…

01/05/2026 | Efficient LLM Deployment, Large Language Model Inference, Post-training Quantization

BitNet: Run 1.58-Bit LLMs Locally on CPUs with 6x Speedup and 82% Less Energy 24452

Running large language models (LLMs) used to require powerful GPUs, expensive cloud infrastructure, or specialized hardware—until BitNet changed the game.…

12/12/2025 | Efficient LLM Deployment, On-device Inference, Text Generation