WebDancer: Build Autonomous Web Agents That Solve Complex, Multi-Step Research Tasks

Paper & Code

WebDancer: Towards Autonomous Information Seeking Agency

2025 • Alibaba-NLP/DeepResearch

★17544

Most large language models today give one-shot answers—but real-world problems rarely fit into a single prompt. Imagine trying to answer: “Compare the safety profiles of Drug A and Drug B based on clinical trials published in the last 18 months.” This isn’t a query you can answer with a static knowledge base. It requires autonomous browsing, iterative reasoning, tool use, and evidence synthesis—capabilities that standard LLMs lack.

WebDancer, developed by Alibaba-NLP as part of the WebAgent project, is specifically engineered to tackle these long-horizon, deep information-seeking challenges. Built on the ReAct framework and trained through a systematic four-stage pipeline, WebDancer doesn’t just respond—it researches, navigates, and reasons like a human analyst. For developers, researchers, and technical leaders seeking AI that can truly “do the work,” WebDancer offers a robust, open, and replicable path forward.

The Core Innovation: A Structured Training Paradigm for Agentic Intelligence

Unlike ad-hoc agent implementations that rely on prompt engineering or brittle tool chaining, WebDancer is trained end-to-end using a data-centric approach with four key stages:

Browsing data construction: Large-scale collection of real or synthetic web interaction traces.
Trajectory sampling: Selecting high-quality action sequences that demonstrate effective reasoning paths.
Supervised fine-tuning (SFT): Providing a reliable “cold start” so the agent doesn’t flounder at the first step.
Reinforcement learning (RL): Using a customized on-policy RL method—Group Relative Policy Optimization—with token-level gradients and negative-sample filtering to improve generalization in dynamic environments.

This pipeline ensures the agent learns not just what to say, but how to explore, when to search, when to read, and when to compute—making it far more reliable than reactive systems.

Benchmarked Performance: Proven on Real-World Challenges

WebDancer isn’t theoretical—it’s evaluated on some of the toughest agentic benchmarks available:

GAIA: A benchmark requiring multi-step reasoning across web, file, and tool interactions. WebDancer achieves a Pass@3 score of 61.1%.
WebWalkerQA: Tests autonomous web traversal and question answering. WebDancer scores 54.6%, demonstrating strong navigation and comprehension.

These results validate that WebDancer can handle the ambiguity, complexity, and tool dependency of real information-seeking tasks—far beyond what off-the-shelf LLMs can manage.

Ideal Use Cases: Who Should Adopt WebDancer?

WebDancer shines in scenarios where answers require active investigation, not passive recall:

Academic & Scientific Research: Automate literature reviews by autonomously fetching, comparing, and summarizing recent papers.
Business Intelligence: Monitor competitor product launches, patent filings, or financial reports across scattered web sources.
Technical Support Automation: Resolve user queries that require checking documentation, logs, or knowledge bases across multiple domains.
AI Assistants with “Web Legs”: Build next-generation assistants that don’t guess—but go look it up, read, analyze, and report back with sources.

If your task involves multiple tools, dynamic environments, or unstructured information sources, WebDancer provides the scaffolding to make it work.

Getting Started: From Demo to Local Deployment

While WebDancer itself is a research agent, it’s accessible through Tongyi DeepResearch—the 30.5B-parameter agentic model it helped inspire. Here’s how to try it:

Quick exploration: Use the online demos on Hugging Face or ModelScope—no setup needed. (Note: these are rate-limited and best for testing.)
Local inference: Clone the repository, set up a Python 3.10.0 environment, and run bash run_react_infer.sh after configuring your API keys.
Cloud inference via OpenRouter: The alibaba/tongyi-deepresearch-30b-a3b model is available on OpenRouter—ideal if you lack GPU resources.

The system supports both JSON and JSONL input formats. For document-based queries (e.g., “report.pdf: What are the key conclusions?”), simply place files in the eval_data/file_corpus/ directory and reference them in the question field.

Key Requirements and Setup Notes

To run WebDancer-powered inference locally, you’ll need:

Python 3.10.0 (other versions may cause dependency conflicts).
API keys for external services:
- Serper.dev for search
- Jina.ai for webpage parsing
- OpenAI-compatible endpoint for summarization
- DashScope for file parsing (PDFs, Excel, etc.)
A sandboxed Python interpreter endpoint (e.g., via SandboxFusion) for code execution.

All credentials are managed via a .env file—kept out of version control for security.

Limitations to Keep in Mind

WebDancer is powerful, but not magic:

Dependence on external APIs: Search and parsing quality hinge on third-party services, which may impose rate limits or latency.
Long execution times: Deep research tasks can take minutes—not seconds—due to multi-step interactions.
Not a chatbot: It’s optimized for research, not conversation. Don’t expect it to play 20 questions.
Stable deployment requires local GPUs: While OpenRouter offers access, production use demands controlled environments for reliability.

Part of a Larger Vision: The WebAgent Ecosystem

WebDancer isn’t a standalone experiment. It’s one node in Alibaba’s expanding Web Agent family, which includes WebWalker (benchmarking), WebWeaver (dynamic outlining), WebSailor (reasoning scaling), and more. This ecosystem reflects a long-term commitment to general agentic intelligence—not just another demo.

By open-sourcing the framework and publishing detailed training methodologies, the team enables others to replicate, extend, and industrialize these approaches—making WebDancer a strategic asset for teams building the next generation of autonomous AI.

Summary

WebDancer solves a critical gap in applied AI: autonomous, multi-step information seeking. It moves beyond static responses to deliver agents that browse, reason, and synthesize evidence from real-world sources. With a rigorous training pipeline, strong benchmark performance, and practical deployment options, it offers a credible foundation for anyone building AI systems that must do research, not just recite facts.

For technical leaders evaluating agent frameworks, WebDancer represents one of the most systematic and reproducible approaches available today—especially for tasks where correctness, traceability, and depth matter more than speed.