Awesome Model Quantization Papers and Source Codes

SqueezeLLM: Deploy High-Accuracy LLMs in Half the Memory Without Sacrificing Performance 704

Deploying large language models (LLMs) like LLaMA, Mistral, or Vicuna often demands multiple high-end GPUs, complex inference pipelines, and substantial…

01/13/2026Efficient Inference, Large Language Model Deployment, Model Quantization

TorchAO: Unified PyTorch-Native Optimization for Faster Training and Efficient LLM Inference 2559

Deploying large AI models in production often involves a fragmented toolchain: one set of libraries for training, another for quantization,…

12/19/2025Efficient Inference, Large Language Model Optimization, Model Quantization