GLM-130B: A Truly Open, Bilingual 130B-Parameter Language Model That Runs on Consumer GPUs

Paper: GLM-130B: An Open Bilingual Pre-trained Model (ICLR 2023) · Code: THUDM/GLM-130B

If you’re evaluating large language models (LLMs) for real-world deployment—especially in multilingual settings—you’ve likely hit a wall: most top-performing models are either closed-source, prohibitively expensive to run, or only support English. Enter GLM-130B, a groundbreaking open-source, bilingual (English and Chinese) language model with 130 billion parameters that not only rivals GPT-3 in performance but also runs on relatively affordable hardware like four RTX 3090 GPUs. Developed by THUDM and accepted at ICLR 2023, GLM-130B breaks the myth that 100B+ models require enterprise-grade infrastructure. For technical decision-makers seeking transparency, reproducibility, and cross-lingual capability without vendor lock-in, GLM-130B offers a rare combination of scale, openness, and practicality.

Why GLM-130B Stands Out Among Open 100B+ Models

Unlike many large models that are open only in name—with inaccessible weights, incomplete training logs, or no inference code—GLM-130B delivers full transparency. Its weights, training logs, evaluation scripts, and toolkits are all publicly available on GitHub. But what truly sets it apart is how it balances three critical dimensions: performance, bilingual competence, and hardware accessibility.

Superior Benchmark Performance in Both Languages

GLM-130B consistently outperforms other open 100B+ models on standard benchmarks:

  • In English, it beats GPT-3 (175B), OPT-175B, and BLOOM-176B on LAMBADA by +4.0%, +5.5%, and +13.0% respectively, and slightly surpasses GPT-3 on MMLU (+0.9%).
  • In Chinese, it significantly outperforms ERNIE TITAN 3.0 (a 260B-parameter model) by +24.26% on zero-shot CLUE and +12.75% on FewCLUE benchmarks.

This dual-language strength isn’t an afterthought—it’s baked into the pretraining data, which includes 200 billion tokens each from English and Chinese corpora.

Hardware-Efficient Inference via INT4 Quantization

One of GLM-130B’s most practical innovations is its ability to support INT4 quantization without post-training fine-tuning—a first among 100B-scale models. This means you can run the full 130B model with almost no performance degradation on:

  • 4 × RTX 3090 (24GB each), or
  • 8 × RTX 2080 Ti (11GB each)

For context, most comparable models require 8× A100 (40GB) setups—hardware that costs 5–10× more. This makes GLM-130B uniquely accessible to labs, startups, and independent researchers.
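To make the footprint concrete, 130 billion parameters at 4 bits each comes to roughly 65 GB of weights, which is why four 24 GB cards can hold the model with room left for activations. Below is a minimal, self-contained sketch of symmetric (absmax) INT4 weight quantization in plain PyTorch; it illustrates the general technique only, and names like quantize_int4 are hypothetical rather than the repository's actual kernels.

    # Illustrative absmax INT4 weight quantization -- not GLM-130B's actual kernel.
    # Rough arithmetic: 130e9 params * 0.5 bytes/param ~= 65 GB of weights.
    import torch

    def quantize_int4(weight: torch.Tensor):
        """Symmetric per-row quantization of an FP16 weight matrix to 4-bit codes."""
        scale = weight.abs().amax(dim=1, keepdim=True) / 7.0        # signed int4 range is [-8, 7]
        q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
        return q, scale                                              # real kernels pack two codes per byte

    def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(scale.dtype) * scale

    w = torch.randn(4096, 4096, dtype=torch.float16)
    q, s = quantize_int4(w)
    print((dequantize_int4(q, s) - w).abs().mean())                  # small reconstruction error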

Fast and Flexible Inference

GLM-130B supports two inference backends:

  • SAT (SwissArmyTransformer) – the native framework used for development
  • FasterTransformer – NVIDIA’s highly optimized library that delivers up to 2.5× faster generation

This flexibility allows teams to choose the best balance of speed and maintainability for their stack.

Practical Applications Where GLM-130B Delivers Real Value

Bilingual AI Systems

If your application serves both English and Chinese users—such as customer support chatbots, cross-lingual content summarization, or multilingual knowledge bases—GLM-130B eliminates the need to maintain two separate models. Its unified architecture handles both languages natively, with strong zero-shot transfer.

Reproducible Research and Internal Benchmarking

Because GLM-130B open-sources its results on 30+ evaluation tasks, along with the evaluation code and model checkpoints, it’s ideal for teams that need to validate claims or compare model behavior under controlled conditions. Unlike proprietary APIs, you can inspect every layer, reproduce every result, and audit every decision.

Hybrid Generation Tasks

GLM-130B supports two distinct generation modes out of the box:

  • Left-to-right generation using [gMASK] for long-form content like articles, stories, or code
  • Blank-filling (infilling) using [MASK] for factual QA, document completion, or structured data extraction

This dual capability stems from its General Language Model (GLM) architecture, which unifies autoregressive and infilling tasks under one training objective—a design that enhances data efficiency and downstream adaptability.
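In practice, the mode is selected simply by how the prompt is written. A minimal sketch of the two prompt shapes follows; the example sentences are invented, but [MASK] and [gMASK] are the actual special tokens the model uses, and either prompt can be fed to the interactive session described in the next section.

    # Illustrative prompts for GLM-130B's two generation modes (example text is invented).

    # Blank filling: the model predicts a short span to substitute for [MASK] in place.
    infill_prompt = "The capital of France is [MASK]."

    # Left-to-right generation: [gMASK] at the end of the context marks where
    # open-ended continuation should begin (articles, stories, code, ...).
    generate_prompt = "Write a short story about a robot learning to paint. [gMASK]"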

Getting Started: Inference in Practice

Deploying GLM-130B is straightforward if you meet the hardware requirements. Here’s the typical workflow:

  1. Download the model: The checkpoint is split into 60 parts (total ~260GB). After downloading, merge and extract them:

    cat glm-130b-sat.tar.part_* > glm-130b-sat.tar
    tar xvf glm-130b-sat.tar
    
  2. Set up the environment: Requires Python 3.9+, PyTorch 1.10+, DeepSpeed 0.6+, and SwissArmyTransformer ≥0.2.11.
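
  A quick way to sanity-check these dependencies before launching anything heavy is a small script like the following (an illustrative helper, not part of the repository; the version pins mirror the requirements above):

    # check_env.py -- hypothetical helper, not shipped with GLM-130B
    import sys
    from importlib.metadata import version, PackageNotFoundError

    assert sys.version_info >= (3, 9), "Python 3.9+ is required"
    for pkg, minimum in [("torch", "1.10"), ("deepspeed", "0.6"),
                         ("SwissArmyTransformer", "0.2.11")]:
        try:
            print(f"{pkg}: {version(pkg)} installed (need >= {minimum})")
        except PackageNotFoundError:
            print(f"{pkg}: MISSING (need >= {minimum})")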

  3. Run inference: Use the provided script for interactive or file-based generation:

    bash scripts/generate.sh --input-source interactive
    
  4. Choose generation strategy:

    • For creative or open-ended tasks: use BaseStrategy with top-k, top-p, and temperature
    • For factual or constrained output: use BeamSearchStrategy with configurable beam width and length penalty

For maximum speed, integrate with FasterTransformer—documentation is included in the repo.
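To make the choice in step 4 concrete, the sketch below shows what BaseStrategy-style sampling does to a single step's logits: temperature scaling followed by top-k and nucleus (top-p) filtering. It is plain PyTorch for illustration only and does not call the SwissArmyTransformer API, whose exact constructor arguments may differ between versions; BeamSearchStrategy instead tracks the highest-scoring partial sequences and applies a length penalty, which is why it suits factual or constrained outputs.

    # Illustrative temperature / top-k / top-p filtering -- not the SAT implementation.
    import torch

    def sample_next_token(logits: torch.Tensor, temperature=0.7, top_k=40, top_p=0.9):
        """Pick one token id from a 1-D tensor of vocabulary logits."""
        logits = logits / temperature                        # sharpen (<1) or flatten (>1)
        if top_k > 0:                                        # keep only the k most likely tokens
            kth_best = torch.topk(logits, top_k).values[-1]
            logits[logits < kth_best] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        # Drop the tail once the cumulative mass before a token already exceeds top_p.
        tail = torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p
        sorted_probs[tail] = 0.0
        sorted_probs /= sorted_probs.sum()
        return sorted_ids[torch.multinomial(sorted_probs, 1)]

    next_id = sample_next_token(torch.randn(150_000))        # vocabulary-sized logits for one step

In the repository these knobs are set through the arguments of scripts/generate.sh, so switching strategies is typically a change to the script invocation rather than to any Python code.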

Important Limitations to Consider

While GLM-130B is remarkably accessible for a 130B model, it’s not without constraints:

  • Hardware demands remain significant: Even with INT4, you need at least four high-end consumer GPUs. It’s not suitable for single-GPU laptops or edge devices.
  • Language scope is limited: Only English and Chinese are supported. Other languages (even in multilingual tasks) are not covered.
  • Focus is on inference, not training: The repository emphasizes evaluation and generation. Fine-tuning or retraining from scratch is possible but not the primary use case.
  • Alternative hardware support is pending: While the project mentions compatibility with Hygon DCU, Ascend 910, and Sunway, these implementations are not yet publicly released.

Summary

GLM-130B redefines what’s possible with open, large-scale language models. It’s not just another research artifact—it’s a production-ready, bilingual LLM that delivers GPT-3-level (or better) performance while running on hardware within reach of many organizations. If your work involves English-Chinese NLP, requires model transparency, or demands high-quality generation without cloud API dependencies, GLM-130B is a compelling—and perhaps the only—open 100B+ option that checks all these boxes. With full reproducibility, quantization support, and flexible inference modes, it empowers practitioners to build, validate, and deploy with confidence.