In the rapidly evolving landscape of large language models (LLMs), a critical limitation persists: despite their impressive fluency, LLMs often…
Tool-augmented Reasoning
BrowseComp: A Focused Benchmark for Evaluating Web-Browsing Capabilities in AI Agents
Evaluating whether an AI agent can truly browse the web—navigating across pages, persisting through dead ends, and extracting entangled facts—is…
EASYTOOL: Streamline LLM Agent Tool Usage with Concise, Unified Instructions
Building capable AI agents that interact with real-world tools—like APIs, software libraries, or external services—is a core challenge in deploying…