Evaluating reinforcement learning (RL) agents—especially those designed for multi-task or meta-learning scenarios—requires benchmarks that are consistent, well-documented, and technically accessible. Unfortunately, many existing environments suffer from undocumented changes, inconsistent APIs, or poor reproducibility, making fair algorithm comparisons nearly impossible.
Meta-World+ directly addresses these challenges. As an improved and standardized version of the original Meta-World benchmark, it offers a reliable, open-source platform for developing and evaluating RL agents in continuous-control robotic manipulation tasks. Built on the widely adopted Gymnasium API and fully backward-compatible with historical results, Meta-World+ eliminates ambiguity while giving researchers and engineers fine-grained control over task composition and execution modes.
Whether you’re testing a single policy across 50 manipulation tasks or evaluating few-shot adaptation to novel goals, Meta-World+ provides the structure, flexibility, and rigor needed for credible, scalable experimentation.
What Problem Does Meta-World+ Solve?
Since its initial release, the original Meta-World benchmark has been a go-to standard for multi-task and meta-RL research. Over time, however, subtle and often undocumented modifications to environment dynamics, observation spaces, or reward functions made it difficult to reproduce published results or compare new methods fairly.
Meta-World+ was created to restore trust in benchmarking. By freezing and documenting all environment behaviors, it ensures that results reported today will remain reproducible years from now. This standardization is essential for scientific progress and industrial adoption alike—especially in fields like robotics, where policy robustness and transferability are non-negotiable.
Key Features That Set Meta-World+ Apart
Full Reproducibility of Past Results
Every environment in Meta-World+ has been carefully version-controlled and validated against historical data. This means you can confidently compare your algorithm against prior work without worrying about hidden environmental drift.
Seamless Integration with Gymnasium
Meta-World+ adheres strictly to the Gymnasium API—the modern standard for RL environments. If you’ve used Gym or Gymnasium before, you’ll immediately recognize the interface:
import gymnasium as gym
import metaworld  # importing metaworld registers the Meta-World environments with Gymnasium

# Single-task benchmark (MT1) using the reach-v3 manipulation task
env = gym.make("Meta-World/MT1", env_name="reach-v3")
This lowers the barrier to entry and simplifies integration into existing RL pipelines.
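From here, the standard Gymnasium interaction loop applies unchanged. The sketch below runs one episode of the reach-v3 task with random actions (the seed and the random policy are arbitrary choices for illustration):

obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # stand-in for your policy's action
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()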
Flexible Benchmark Configuration
Meta-World+ ships with six standardized benchmarks:
- Multi-Task: MT1 (1 task), MT10 (10 tasks), MT50 (50 tasks)
- Meta-Learning: ML1 (goal variation within one task), ML10 (10 train + 5 test tasks), ML45 (45 train + 5 test tasks)
Each benchmark supports both synchronous (low memory, single process) and asynchronous (parallelized, multi-process) execution, letting you match resource constraints with performance needs.
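In practice, switching modes is a one-argument change. A minimal sketch, assuming the asynchronous mode is selected with vector_strategy='async' by analogy with the synchronous 'sync' value shown later (check the documentation for the exact string):

import gymnasium as gym
import metaworld

# Single-process execution: lower memory footprint, simpler debugging
mt10_sync = gym.make_vec("Meta-World/MT10", vector_strategy="sync")

# Multi-process execution: tasks step in parallel workers ('async' value assumed)
mt10_async = gym.make_vec("Meta-World/MT10", vector_strategy="async")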
Moreover, you can build custom benchmarks by combining any subset of the 50 available tasks—perfect for targeted evaluation or incremental scaling.
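The exact entry point for custom benchmarks is documented upstream; as a rough illustration of the idea, you can also assemble a targeted suite from single-task environments yourself. A minimal sketch, assuming the task names push-v3 and pick-place-v3 exist alongside reach-v3 in the v3 registry:

import gymnasium as gym
import metaworld

# Hand-picked subset of the 50 tasks (task names assumed; check the registry)
subset = ["reach-v3", "push-v3", "pick-place-v3"]
custom_suite = {name: gym.make("Meta-World/MT1", env_name=name) for name in subset}

for name, env in custom_suite.items():
    obs, info = env.reset(seed=42)
    print(f"{name}: observation shape {obs.shape}")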
Who Should Use Meta-World+?
Meta-World+ is ideal for:
- RL Researchers developing multi-task policies that must generalize across diverse manipulation skills (e.g., pick, push, pull, hammer).
- Meta-Learning Practitioners testing agents that adapt quickly to new tasks or goal conditions with minimal experience.
- Robotics Engineers prototyping control policies in simulation before real-world deployment.
- Algorithm Benchmarkers seeking a stable, community-recognized standard for fair performance comparisons.
If your work involves continuous-action control in simulated robotic environments—and you care about reproducibility, scalability, and compatibility—Meta-World+ is built for you.
Getting Started Is Simple
Installation takes one command:
pip install metaworld
From there, you can instantiate any benchmark in seconds. For example, to run the MT10 benchmark synchronously:
# Ten manipulation tasks, stepped together in a single process
envs = gym.make_vec('Meta-World/MT10', vector_strategy='sync', seed=42)
obs, info = envs.reset()
actions = envs.action_space.sample()  # one random action per sub-environment
obs, reward, terminated, truncated, info = envs.step(actions)
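Continuing from the snippet above, a full random-policy rollout is just a loop over envs.step. A minimal sketch (the 500-step horizon is arbitrary, and Gymnasium's vector environments auto-reset sub-environments that finish):

total_reward = 0.0
for _ in range(500):
    actions = envs.action_space.sample()  # one random action per task
    obs, reward, terminated, truncated, info = envs.step(actions)
    total_reward += reward.sum()          # reward is an array with one entry per task
print(f"Cumulative random-policy reward across 10 tasks: {total_reward:.2f}")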
For meta-learning, simply create separate training and testing environments:
train_envs = gym.make_vec('Meta-World/ML10-train', vector_strategy='sync')  # 10 training tasks
test_envs = gym.make_vec('Meta-World/ML10-test', vector_strategy='sync')    # 5 held-out test tasks
This clear separation enforces rigorous evaluation protocols without adding undue complexity.
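To make the protocol concrete, here is a schematic of how a few-shot evaluation might be organized; policy and adapt are hypothetical placeholders for your agent and its adaptation rule, not part of the Meta-World+ API:

def evaluate_few_shot(policy, adapt, test_envs, adaptation_steps=10, eval_steps=500):
    # Phase 1: a small adaptation budget on the held-out test tasks
    obs, info = test_envs.reset(seed=0)
    for _ in range(adaptation_steps):
        actions = policy(obs)
        next_obs, reward, terminated, truncated, info = test_envs.step(actions)
        adapt(policy, obs, actions, reward, next_obs)  # update from limited experience
        obs = next_obs
    # Phase 2: measure post-adaptation performance with the adapted policy
    total_reward = 0.0
    for _ in range(eval_steps):
        obs, reward, terminated, truncated, info = test_envs.step(policy(obs))
        total_reward += reward.sum()
    return total_reward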
Limitations and Practical Notes
While powerful, Meta-World+ has well-defined boundaries:
- Platform Support: Officially tested and supported on Linux and macOS (Python 3.8–3.11). Windows support is community-driven and not guaranteed.
- Domain Focus: Exclusively targets continuous-control robotic manipulation. It is not suitable for discrete-action tasks (e.g., Atari games) or non-robotic domains (e.g., finance, NLP).
- Meta-Learning Workflow: Requires explicit handling of train/test environment splits—slightly more setup than single-environment benchmarks, but necessary for valid few-shot evaluation.
These constraints ensure the benchmark remains focused, high-fidelity, and aligned with real-world robotic challenges.
Summary
Meta-World+ is more than an incremental update—it’s a commitment to rigor in reinforcement learning research. By standardizing environment behavior, embracing modern APIs, and offering granular control over task composition, it removes common roadblocks that hinder reproducibility and scalability. For anyone working on multi-task or meta-RL in robotic control, Meta-World+ provides a trustworthy, flexible, and future-proof foundation for innovation.
Start with MT1 for quick validation, scale to MT50 for generalization stress tests, or probe adaptation limits with ML45—knowing that your results will stand the test of time and scrutiny.