Awesome Large Language Model Deployment Papers and Source Codes

SqueezeLLM: Deploy High-Accuracy LLMs in Half the Memory Without Sacrificing Performance 704

Deploying large language models (LLMs) like LLaMA, Mistral, or Vicuna often demands multiple high-end GPUs, complex inference pipelines, and substantial…

01/13/2026Efficient Inference, Large Language Model Deployment, Model Quantization

S-LoRA: Serve Thousands of Task-Specific LLMs Efficiently on a Single GPU 1879

Deploying dozens—or even thousands—of fine-tuned large language models (LLMs) has traditionally been a costly and complex endeavor. Each adapter typically…