YuLan is an open-source large language model (LLM) series developed by the Gaoling School of Artificial Intelligence (GSAI) at Renmin University of China. Unlike many open-weight models that offer limited insight into their training process, YuLan—particularly its latest 12B-parameter variant—was trained entirely from scratch and accompanied by a detailed technical report. This commitment to transparency directly addresses a critical pain point in the LLM community: the black-box nature of model development that hinders reproducibility, trust, and further innovation.
Designed with strong bilingual capabilities in both English and Chinese, YuLan delivers competitive performance across standard benchmarks while offering practical features like extended context length and lightweight variants. All code, models, and training methodologies are publicly released under the MIT License, though usage is restricted to academic purposes, making YuLan a valuable asset for researchers, educators, and developers working on multilingual or Chinese-centric natural language processing tasks.
Strong Bilingual Performance Across Key Benchmarks
One of YuLan’s standout strengths is its balanced proficiency in English and Chinese—a rarity among open-source LLMs, which often prioritize English at the expense of other languages. This dual-language competence isn’t just theoretical; it’s validated through rigorous evaluations on widely recognized benchmarks:
- On MMLU, a comprehensive English-language test of multitask knowledge, YuLan-Chat-3-12B achieves an average score of 55.7, outperforming several comparable LLaMA-2-based Chinese chat models.
- On C-Eval, a challenging Chinese knowledge benchmark, the same model scores 50.5 overall and 37.7 on the “Hard” subset—demonstrating robust understanding of complex domain-specific content in Chinese.
- In the AGI-Eval Gaokao (China’s national college entrance exam) challenge, YuLan-Chat-3-12B reaches 49.5 average, with particularly strong results in history (69.4) and geography (57.3), showing its ability to reason over culturally and linguistically nuanced material.
These results confirm that YuLan isn’t merely “Chinese-friendly”—it’s genuinely competitive in both linguistic spheres. For teams building applications targeting Chinese-speaking users or managing bilingual workflows, this eliminates the need to maintain separate models for each language.
Practical Features for Real-World Use
Beyond raw performance, YuLan incorporates design choices that directly address deployment challenges:
Expanded Chinese Vocabulary and 4K Context
The model vocabulary has been extended to 51,190 tokens, with dedicated inclusion of high-frequency Chinese characters and words. Combined with a 4,096-token context window, this enables more accurate tokenization and generation for longer Chinese inputs—such as technical documents, customer service transcripts, or academic essays—where shorter contexts or poor tokenization often degrade performance in standard LLMs.
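To see why the extended vocabulary matters, consider a toy token-count comparison. This is a purely illustrative sketch, not YuLan's actual tokenizer: it contrasts a worst-case byte-level fallback (up to three tokens per Chinese character under UTF-8) with dedicated whole-word vocabulary entries like those added to YuLan's 51,190-token vocabulary.

```python
# Illustrative sketch only (not YuLan's real tokenizer): why dedicated
# Chinese vocabulary entries shrink sequence length. A byte-level fallback
# can spend 3 tokens per Chinese character (one per UTF-8 byte), while a
# dedicated vocabulary entry covers a whole word in a single token.

def byte_fallback_token_count(text: str) -> int:
    """Worst case: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))

def dedicated_vocab_token_count(words: list[str], vocab: set[str]) -> int:
    """One token per word found in the vocabulary; byte fallback otherwise."""
    return sum(1 if w in vocab else byte_fallback_token_count(w)
               for w in words)

# Hypothetical pre-segmented sentence: "模型 生成 文本" ("the model generates text")
words = ["模型", "生成", "文本"]
vocab = {"模型", "生成", "文本"}  # pretend these words are in the extended vocabulary

fallback = byte_fallback_token_count("".join(words))   # 6 chars * 3 bytes = 18 tokens
extended = dedicated_vocab_token_count(words, vocab)   # 3 tokens
print(fallback, extended)
```

A 6x reduction of this kind is an upper bound, but it captures the mechanism: fewer tokens per Chinese word means more of the 4,096-token window is available for actual content.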
Lightweight Option: YuLan-Mini
Recognizing that not all projects require a 12B-parameter model, the team released YuLan-Mini (2.4B parameters) in December 2024. Trained on 1T tokens, it offers a nimble alternative for resource-constrained environments, edge deployment, or rapid prototyping—without sacrificing core bilingual functionality.
Easy Integration and Developer-Friendly Usage
YuLan significantly lowers the barrier to entry for experimentation and integration:
- Hugging Face Compatibility: Models like `YuLan-Chat-3-12B` are available on the Hugging Face Hub and loadable with just a few lines of code using `transformers`, mirroring the standard LLaMA workflow.
- Command-Line Inference: A simple `inference.py` script allows immediate testing without complex pipelines.
- 8-Bit Quantization Support: With the `--load_in_8bit` flag, the 13B model runs on a single RTX 3090 (24GB), and the 65B version fits on an A100 (80GB), making powerful inference accessible even without multi-GPU setups.
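The Hugging Face loading pattern described above can be sketched roughly as follows. This is a minimal sketch, not the project's official snippet: the repository id `yulan-team/YuLan-Chat-3-12b` is an assumption and should be verified on the Hub, and the `load_in_8bit` keyword mirrors the `--load_in_8bit` CLI flag (it requires `bitsandbytes` and may be superseded by a quantization-config argument in newer `transformers` releases).

```python
# Hedged sketch of loading a YuLan chat model via Hugging Face transformers.
# The model id below is an assumption; check the Hub for the exact repo name.

def load_yulan(model_id: str = "yulan-team/YuLan-Chat-3-12b",
               load_in_8bit: bool = False):
    """Return (tokenizer, model). Setting load_in_8bit=True quantizes
    weights to 8 bits, letting larger variants fit on a single 24 GB GPU
    (requires the bitsandbytes package)."""
    # Imported lazily so the function can be defined without the heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",          # spread layers across available devices
        load_in_8bit=load_in_8bit,  # mirrors the --load_in_8bit CLI flag
    )
    return tokenizer, model
```

Usage would then follow the standard LLaMA-style generate loop: tokenize a prompt, call `model.generate`, and decode the output.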
This ease of use contrasts sharply with many open-source LLMs that require custom loaders, patching, or undocumented preprocessing steps.
Clear Limitations and Responsible Use
The YuLan team is transparent about the model’s constraints:
- Like all probabilistic language models, YuLan may generate biased, inaccurate, or harmful content, despite extensive alignment training via curriculum learning and human preference data.
- The MIT License restricts usage to academic purposes only. Commercial deployment is not permitted under current terms.
This upfront disclosure helps technical decision-makers assess risk and compliance—especially important in research or educational contexts where ethical AI use is paramount.
When (and When Not) to Choose YuLan
Ideal for:
- Academic research requiring full training transparency and reproducibility
- Bilingual (English–Chinese) chatbots, tutoring systems, or content generation
- Chinese NLP tasks needing long-context understanding (e.g., document summarization, QA)
- Lightweight LLM experimentation via YuLan-Mini
Not recommended for:
- Commercial products (due to academic-use-only licensing)
- Safety-critical applications (e.g., medical diagnosis, legal advice) without rigorous downstream safeguards
- Purely English-only projects where smaller or more specialized models (e.g., Mistral, Llama-3) might offer better efficiency
Summary
YuLan stands out in the crowded open-source LLM landscape by combining full-from-scratch training transparency, strong bilingual English–Chinese performance, and practical deployment features, all under an openly published (though academically restricted) license. For researchers, educators, and developers working on multilingual AI, especially those focused on Chinese language processing, YuLan offers a rare blend of capability, clarity, and accessibility that empowers informed, responsible innovation.