S1: Boost Reasoning Performance with Just 1,000 Examples and Smart Test-Time Scaling

Paper & Code: s1: Simple test-time scaling (simplescaling/s1, 2025)

In the rapidly evolving landscape of large language models (LLMs), achieving strong reasoning capabilities often comes at the cost of massive datasets, complex training pipelines, or proprietary techniques. OpenAI’s o1 model demonstrated that allocating extra compute during inference—a strategy known as test-time scaling—can significantly enhance reasoning performance. However, its methodology remains undisclosed, leaving researchers and practitioners searching for accessible alternatives.

Enter S1: a minimalist, open-source approach that delivers o1-preview-level reasoning using only 1,000 carefully curated examples and a clever inference-time control mechanism called budget forcing. Built on top of Qwen2.5-32B-Instruct, S1 not only matches but exceeds o1-preview on competition math benchmarks like MATH and AIME24—by up to 27%. Crucially, everything is open: model weights, training data, code, and evaluation scripts. For teams seeking transparent, reproducible, and plug-and-play reasoning upgrades without hidden tricks or massive infrastructure, S1 offers a compelling solution.

How S1 Works: Simplicity Meets Strategic Inference

The s1K Dataset: Quality Over Quantity

S1’s foundation is the s1K dataset—a compact collection of 1,000 questions paired with high-quality reasoning traces. These traces aren’t randomly gathered; they’re selected based on three rigorously validated criteria: difficulty, diversity, and reasoning quality. Ablation studies in the paper confirm that each criterion contributes meaningfully to final performance. This focus on signal-rich examples enables strong generalization from minimal supervision—a stark contrast to conventional fine-tuning that demands tens or hundreds of thousands of samples.
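As a rough illustration, the three-criterion selection could be sketched as follows. This is a hypothetical sketch, not the paper's actual pipeline: the field names (`quality_ok`, `hard`, `domain`) and the simple round-robin used for diversity are assumptions for clarity.

```python
# Illustrative sketch of s1K-style selection: filter by reasoning quality
# and difficulty, then pick diversely across domains. Field names and the
# round-robin strategy are hypothetical, not taken from the paper.

def select_examples(pool, n=1000):
    # Quality and difficulty act as hard filters.
    kept = [ex for ex in pool if ex["quality_ok"] and ex["hard"]]
    # Group the survivors by domain for the diversity criterion.
    by_domain = {}
    for ex in kept:
        by_domain.setdefault(ex["domain"], []).append(ex)
    chosen = []
    # Round-robin across domains until n examples are chosen.
    while len(chosen) < n and any(by_domain.values()):
        for exs in by_domain.values():
            if exs and len(chosen) < n:
                chosen.append(exs.pop())
    return chosen

pool = [
    {"quality_ok": True,  "hard": True,  "domain": "geometry", "id": 1},
    {"quality_ok": True,  "hard": False, "domain": "geometry", "id": 2},
    {"quality_ok": True,  "hard": True,  "domain": "algebra",  "id": 3},
    {"quality_ok": False, "hard": True,  "domain": "algebra",  "id": 4},
]
print([ex["id"] for ex in select_examples(pool, n=2)])  # → [1, 3]
```

The point of the sketch is the ordering: quality and difficulty prune the pool, while diversity shapes which survivors make the final cut.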

Budget Forcing: Intelligent Control at Inference Time

The real innovation of S1 lies in budget forcing, a test-time technique that dynamically regulates how long the model “thinks” before finalizing its answer.

Here’s how it works:

  • During generation, the model may prematurely signal completion (e.g., by emitting an end token).
  • Budget forcing intercepts this signal and appends the word “Wait” to the output, effectively prompting the model to reconsider.
  • This can be repeated multiple times, extending the reasoning trace within a predefined token budget.
  • Often, this extra “thinking time” allows the model to catch and correct earlier errors—leading to substantially improved accuracy, especially on complex multi-step problems.
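The loop above can be sketched in a few lines. This is a minimal toy with a stubbed-out model; a real setup would call vLLM or Transformers and count actual tokens rather than the made-up counts used here.

```python
# Toy sketch of budget forcing. `fake_model` simulates an LLM that stops
# prematurely until nudged with "Wait"; a real implementation would call
# an actual decoding backend (vLLM, Transformers) here.

def fake_model(prompt, max_tokens):
    """Simulated model: returns (text, ended_early, tokens_used)."""
    if "Wait" not in prompt:
        return ("The answer is 4.", True, 6)  # premature stop
    return (" Re-checking... the answer is 5.", False, 9)

def budget_forced_generate(question, model, token_budget=32, max_waits=2):
    trace, remaining = "", token_budget
    for _ in range(max_waits + 1):
        chunk, ended_early, used = model(question + trace, remaining)
        trace += chunk
        remaining -= used
        if not ended_early or remaining <= 0:
            break
        # Suppress the early stop and append "Wait" so the model
        # reconsiders its reasoning within the remaining budget.
        trace += "Wait"
    return trace

print(budget_forced_generate("What is 2 + 3? ", fake_model))
```

The stub exaggerates the effect for illustration, but the control flow is the essential part: intercept the early stop, append "Wait", and continue decoding until the budget or the wait limit runs out.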

Importantly, budget forcing doesn’t require retraining. It’s applied purely during inference, making it a flexible, low-overhead enhancement that can be toggled on or off based on available compute or latency constraints.

Why S1 Stands Out

1. Exceptional Performance with Minimal Resources

S1-32B achieves 57% accuracy on AIME24—a notoriously difficult high school math competition—surpassing o1-preview. Even more impressively, this gain is unlocked not through larger models or more training data, but through smarter use of existing capacity at test time. When scaled with budget forcing, performance consistently improves beyond the baseline, demonstrating reliable extrapolation.

2. Full Openness and Reproducibility

Unlike closed models that hint at capabilities without sharing methods, S1 provides:

  • The fine-tuned model (simplescaling/s1-32B and improved s1.1-32B)
  • The s1K and s1K-1.1 datasets
  • Training, inference, and evaluation scripts
  • Detailed documentation for vLLM and Hugging Face Transformers

This transparency empowers researchers to validate claims, reproduce results, and build upon the work—accelerating collective progress in reasoning-focused LLM development.

3. Practical and Lightweight Integration

Integrating S1 into existing workflows is straightforward. Whether you’re using vLLM for high-throughput serving or Transformers for experimentation, the model loads like any standard LLM. Activating budget forcing requires only a few lines of code that manage token budgets and inject “Wait” tokens—no architectural changes needed.

Ideal Use Cases

S1 excels in scenarios demanding structured, multi-step reasoning, such as:

  • Mathematical problem solving (e.g., competition math, symbolic reasoning)
  • Algorithmic code generation requiring correctness guarantees
  • Scientific question answering where intermediate justification matters
  • Educational tools that benefit from verifiable reasoning traces

It’s particularly valuable for teams that:

  • Operate under fixed model budgets but can afford modest inference-time overhead
  • Need o1-like reasoning without access to proprietary systems
  • Prioritize auditability and want to understand how answers are derived

Getting Started

To run S1 with Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("simplescaling/s1.1-32B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("simplescaling/s1.1-32B")

# Apply the standard chat template, then generate as with any causal LM
messages = [{"role": "user", "content": "How many positive divisors does 360 have?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))

For vLLM with budget forcing, you:

  1. Set a max token limit for the “thinking” phase (e.g., 32,000 tokens)
  2. Generate the initial reasoning trace
  3. Append “Wait” and regenerate if the model ends early
  4. Finalize the answer after the desired number of self-check iterations
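The decision step in that workflow can be isolated as a small pure function. The helper name below is hypothetical, not taken from the repository's scripts, which may structure this differently.

```python
# Decide whether to keep "thinking" under budget forcing. Hypothetical
# helper for illustration, not the s1 repository's actual code.

WAIT = "Wait"

def next_prompt(prompt, completion, stopped_early, budget_left):
    """Return the follow-up prompt to continue reasoning, or None to stop."""
    if not stopped_early or budget_left <= 0:
        return None
    # Append "Wait" so the model revisits its reasoning instead of stopping.
    return prompt + completion + WAIT

print(next_prompt("Q: ...\n", "draft reasoning", True, 1000))   # continues with "Wait"
print(next_prompt("Q: ...\n", "final reasoning", False, 1000))  # None: accept the answer
```

With vLLM, `stopped_early` would typically be derived from the request output's finish reason, and `budget_left` from the tokens consumed so far against the configured thinking limit.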

The repository includes ready-to-run scripts for training, evaluation (via a modified lm-evaluation-harness), and data recreation—making experimentation accessible even to those new to test-time scaling.

Limitations and Considerations

While powerful, S1 has important constraints:

  • It’s built specifically on Qwen2.5-32B-Instruct; the approach isn’t yet generalized to arbitrary base models.
  • In high-temperature settings, vLLM may encounter out-of-vocabulary token errors due to extended generations—though a simple patch can mitigate this.
  • Budget forcing requires careful token budget management to avoid hitting context limits or wasting compute.
  • Performance gains are most pronounced on structured reasoning tasks; benefits on open-ended or creative tasks may be less consistent.

Summary

S1 proves that breakthrough reasoning performance doesn’t require secrecy or scale. By combining a tiny, high-signal dataset with intelligent test-time control, it delivers o1-preview-beating results in an open, reproducible, and easy-to-adopt package. For researchers, engineers, and product teams seeking reliable, verifiable, and high-performance reasoning—without the black box—S1 is a compelling and practical choice.