PaperCodex

Efficient LLM Inference

LServe: Accelerate Long-Context LLM Inference with Unified Sparse Attention—No Accuracy Trade-Off

Deploying large language models (LLMs) to handle long documents, extensive chat histories, or detailed technical manuals remains a major bottleneck…

01/13/2026 · Efficient LLM Inference, Long-context Language Modeling, Sparse Attention

TEQ: Accurate 3- and 4-Bit LLM Quantization Without Inference Overhead

Deploying large language models (LLMs) in production often runs into a hard trade-off: reduce model size and latency through quantization,…

12/22/2025 · Efficient LLM Inference, Large Language Model Quantization, Weight-only Quantization