PaperCodex

Model Quantization

SqueezeLLM: Deploy High-Accuracy LLMs in Half the Memory Without Sacrificing Performance

Deploying large language models (LLMs) such as LLaMA, Mistral, or Vicuna often demands multiple high-end GPUs, complex inference pipelines, and substantial…

01/13/2026 · Efficient Inference, Large Language Model Deployment, Model Quantization
TorchAO: Unified PyTorch-Native Optimization for Faster Training and Efficient LLM Inference

Deploying large AI models in production often involves a fragmented toolchain: one set of libraries for training, another for quantization,…

12/19/2025 · Efficient Inference, Large Language Model Optimization, Model Quantization
Copyright © 2026 PaperCodex.