Deploying multiple fine-tuned large language models (LLMs) used to mean multiplying your GPU costs—until Punica arrived. If you’re managing dozens of LoRA (Low-Rank Adaptation) models for different customers, experiments, or product variants, you’ve likely hit a wall: each LoRA needs its own copy of the base model in memory, wasting GPU capacity and killing throughput.
Punica solves this by enabling multi-tenant LoRA serving: run many LoRA-adapted LLMs simultaneously using only the memory and compute resources of a single base model. Built on a novel CUDA kernel called Segmented Gather Matrix-Vector multiplication (SGMV), Punica batches inference across different LoRA models without duplicating the underlying LLM. The result? Up to 12x higher throughput compared to state-of-the-art serving systems like vLLM or Hugging Face Transformers—with just ~2ms added latency per token.
For teams building personalized AI services, running A/B tests on adapters, or offering tenant-specific models in production, Punica isn’t just an optimization—it’s a game-changer in cost efficiency and scalability.
The Hidden Cost of Traditional LoRA Serving
LoRA is celebrated for its parameter efficiency: a fine-tuned model may add only 1% to the base LLM’s memory footprint. But in practice, serving multiple LoRAs traditionally forces you to load the full base model once per LoRA. Even if each LoRA is tiny, the base model—often tens or hundreds of gigabytes—gets duplicated across GPU memory.
This leads to:
- Low GPU utilization: Most memory holds redundant base weights.
- Poor batching: Requests for different LoRAs can’t be batched together, wasting compute.
- High operational costs: More GPUs needed to handle the same workload.
Without a smart serving system, LoRA’s theoretical efficiency vanishes at scale.
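To make the cost concrete, here is a rough back-of-envelope sketch in Python. The model size, precision, LoRA fraction, and tenant count are illustrative assumptions, not measurements from the Punica paper:

# Rough, illustrative numbers -- assumptions, not benchmark results.
BASE_PARAMS = 7e9        # e.g., a 7B-parameter base model
BYTES_PER_PARAM = 2      # fp16/bf16 weights
LORA_FRACTION = 0.01     # each LoRA adds ~1% of the base model's footprint
NUM_TENANTS = 20         # LoRA-adapted variants to serve

base_gb = BASE_PARAMS * BYTES_PER_PARAM / 1e9
lora_gb = base_gb * LORA_FRACTION

naive_gb = NUM_TENANTS * (base_gb + lora_gb)   # one base-model copy per LoRA
shared_gb = base_gb + NUM_TENANTS * lora_gb    # Punica-style: one shared base

print(f"naive:  {naive_gb:.0f} GB")   # ~283 GB
print(f"shared: {shared_gb:.0f} GB")  # ~17 GB

Under these assumptions, naive per-LoRA duplication needs roughly 17x the memory of a shared-base design for the same 20 tenants.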
How Punica Works: Shared Base, Segmented LoRA
Punica decouples the base model from the LoRA adapters. Mathematically, LoRA transforms the original weight W into W + A·B, so inference becomes:
y = x·W + x·A·B
When serving n LoRA models with inputs (x₁, x₂, …, xₙ), Punica computes:
Y = X·W + (x₁·A₁·B₁, x₂·A₂·B₂, …, xₙ·Aₙ·Bₙ)
The key insight: X·W (the base model pass) can be batched efficiently across all requests, while the LoRA-specific term is handled by SGMV—a custom CUDA kernel that executes xᵢ·Aᵢ·Bᵢ for each request in a single, fused operation without materializing separate model copies.
This design preserves the strong batching benefits of modern LLM serving while adding minimal overhead for LoRA adaptation.
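To make the math above concrete, here is a plain-PyTorch sketch of what SGMV computes. The shapes and rank are illustrative assumptions, and the per-request Python loop only spells out the semantics; Punica’s actual kernel fuses this work on the GPU:

import torch

# Illustrative sizes (assumptions): n requests, hidden size h, LoRA rank r.
n, h, r = 8, 4096, 16
X = torch.randn(n, h)      # one input row per request
W = torch.randn(h, h)      # shared base-model weight
A = torch.randn(n, h, r)   # LoRA A_i gathered for each request's adapter
B = torch.randn(n, r, h)   # LoRA B_i gathered for each request's adapter

# Base pass: a single batched matmul shared by every request.
Y = X @ W

# LoRA pass: the term SGMV fuses into one kernel, written here as an explicit loop.
for i in range(n):
    Y[i] += X[i] @ A[i] @ B[i]

In Punica, requests that target the same LoRA are grouped into contiguous segments, so the kernel can gather each adapter’s weights once per segment rather than once per request; that grouping is what the “Segmented Gather” in SGMV refers to.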
Performance That Speaks Volumes
Benchmarks in the Punica paper show dramatic gains across realistic multi-LoRA workloads:
- 12x higher throughput than vLLM, FasterTransformer, DeepSpeed, and Hugging Face Transformers.
- Consistent gains whether LoRA requests are identical, distinct, or follow real-world skewed popularity distributions.
- Only ~2ms extra latency per token—negligible for most interactive or batch applications.
These aren’t synthetic wins. They reflect real deployment scenarios where each request might target a different fine-tuned model.
Ideal Use Cases for Punica
Punica excels when you need to serve many LoRA models concurrently:
- Enterprise AI platforms: Provide each client with a custom fine-tuned model without provisioning dedicated GPUs.
- Research labs: Rapidly test dozens of LoRA variants (e.g., domain-specific, safety-aligned, or prompt-tuned) on shared infrastructure.
- Product teams: Run A/B tests on adapter performance or offer personalized chatbots per user segment.
- Cost-sensitive deployments: Maximize GPU utilization in cloud or on-prem clusters by consolidating LoRA workloads.
If your workflow involves only a single LoRA, or relies on non-LoRA fine-tuning methods (e.g., full fine-tuning, adapters, or IA³), Punica offers little benefit. But for LoRA-centric multi-tenant scenarios, it’s unmatched.
Getting Started: Install, Demo, Benchmark
Punica integrates smoothly into PyTorch-based workflows:
Installation (for CUDA 12.1, Python 3.10/3.11):
pip install ninja torch
pip install punica -i https://punica-ai.github.io/whl/cu121/
Or build from source for custom CUDA/PyTorch environments.
Try the multi-LoRA demo:
python examples/tui-multi-lora.py
Run benchmarks:
python -m benchmarks.bench_textgen_lora --system punica --batch-size 32
The project also includes examples for finetuning, converting models to Punica-compatible format, and integration patterns.
Current Limitations
Punica is purpose-built for LoRA and assumes:
- GPU architecture compatibility (e.g., Ampere or newer: compute capability ≥8.0); a quick check is shown below.
- Matching CUDA, PyTorch, and Python versions (prebuilt wheels support common combinations).
- Use of standard LoRA matrices (A and B) without exotic modifications.
It does not support non-LoRA adaptation methods or CPU-only inference.
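If you’re unsure whether a GPU meets the Ampere-or-newer requirement, PyTorch can report its compute capability. This is a generic PyTorch check, not a Punica API:

import torch

# Generic PyTorch check (not a Punica API). Punica targets compute capability >= 8.0 (Ampere or newer).
if not torch.cuda.is_available():
    print("No CUDA GPU visible; Punica requires an NVIDIA GPU.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    ok = (major, minor) >= (8, 0)
    print(f"GPU 0 compute capability {major}.{minor}: {'OK for Punica' if ok else 'below the 8.0 requirement'}")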
Summary
Punica transforms multi-LoRA deployment from a resource drain into a scalable, efficient operation. By sharing one base model across dozens of adapters and batching their inference via a specialized CUDA kernel, it delivers up to 12x higher throughput with a near-zero latency penalty. For any team managing multiple fine-tuned LLMs—whether for customers, experiments, or products—Punica slashes GPU costs while simplifying infrastructure. If your stack uses LoRA and you’re ready to serve more models on fewer GPUs, Punica is the missing piece.