
PaperCodex

AgentBench: Objectively Evaluate LLMs as Real-World Agents Across 8 Practical Environments

As large language models (LLMs) increasingly power autonomous agents—from customer service bots to system administration tools—a critical question arises: Can…

01/04/2026 · Agent Evaluation, Interactive Reasoning, Tool-Use Benchmarking