2308.03688_AgentBench: Evaluating LLMs as Agents¶

Home page: https://arxiv.org/abs/2308.03688
PDF: https://arxiv.org/pdf/2308.03688
Citations: 236 (as of 2025-08-09)
Organization: Tsinghua University
GitHub: https://github.com/THUDM/AgentBench

Figure 2: AgentBench is the first systematic benchmark to evaluate LLM-as-Agent on a wide array of real-world challenges and 8 distinct environments. In total, 27 LLMs are examined in this edition.

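Summary¶

Dataset
- A multi-dimensional, evolving benchmark
- Comprises 8 distinct environments for evaluating the reasoning and decision-making abilities of LLM-as-Agent in multi-turn, open-ended generation settings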
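The multi-turn setting can be pictured as a loop in which the agent sees the interaction history, emits an action, and receives an observation until the episode ends. The sketch below is only an illustration of that loop: `CountdownEnv`, `run_episode`, and `scripted_agent` are hypothetical names invented here, and none of this is the AgentBench codebase's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CountdownEnv:
    """Toy environment (hypothetical): say 'step' n times, then 'done'."""
    steps_required: int = 3
    turn: int = 0

    def reset(self) -> str:
        self.turn = 0
        return f"Reply 'step' {self.steps_required} times, then 'done'."

    def step(self, action: str) -> Tuple[str, bool, float]:
        """Return (observation, finished, reward) for one agent action."""
        self.turn += 1
        if action == "done":
            success = self.turn == self.steps_required + 1
            return "episode over", True, 1.0 if success else 0.0
        return f"turn {self.turn} acknowledged", False, 0.0


def run_episode(env: CountdownEnv,
                agent: Callable[[List[str]], str],
                max_turns: int = 10) -> float:
    """Alternate agent actions and environment observations for one episode."""
    history = [env.reset()]
    for _ in range(max_turns):
        action = agent(history)
        observation, finished, reward = env.step(action)
        history += [action, observation]
        if finished:
            return reward
    return 0.0  # turn budget exhausted without finishing


def scripted_agent(history: List[str]) -> str:
    """Stand-in for an LLM: actions sit at the odd indices of the history."""
    steps_taken = sum(1 for message in history[1::2] if message == "step")
    return "step" if steps_taken < 3 else "done"


print(run_episode(CountdownEnv(), scripted_agent))
```

In a real evaluation the `agent` callable would wrap an LLM call, and the turn budget is one way an episode can fail even when every individual action is valid.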

From Deepseek¶

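1. Research background¶

Large language models (such as GPT-4 and ChatGPT) show promise as autonomous agents that carry out complex tasks, but a systematic framework for evaluating them in that role has been missing.
Existing benchmarks mostly target static tasks (such as question answering and reasoning) and overlook how agents behave in dynamic, interactive environments.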

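2. Research goals¶

Propose AgentBench, a multi-dimensional evaluation framework that tests how LLMs perform as agents in simulated real-world environments spanning code-grounded, game-grounded, and web-grounded tasks.
It covers 8 environments, matching the dataset examples later in this note:
- Operating System (OS): completing tasks with bash commands
- Database (DB): answering questions with SQL
- Knowledge Graph (KG): reasoning with KG-querying tools
- Digital Card Game (DCG)
- Lateral Thinking Puzzles (LTP)
- House-holding (HH)
- Web Shopping (WS)
- Web Browsing (WB)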

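3. Key contributions¶

Multi-dimensional evaluation: the first systematic evaluation of LLMs in interactive, dynamic environments, rather than static text generation alone.
Open-source tooling: standardized test environments and automated evaluation tools (the code is open source).
Performance analysis: tests on mainstream LLMs (such as GPT-4, Claude, and LLaMA) find that:
- Current models are weak at long-term planning and at adapting to the environment.
- Model performance depends heavily on domain-specific knowledge and on memory of the context.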

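4. Main findings¶

GPT-4 leads on most tasks, but its performance drops on tasks that require many sequential steps, such as the game environments.
Open-source models (such as LLaMA-2) lag significantly behind closed-source models, especially in complex environments.
Prompting strategies (such as Chain-of-Thought) significantly affect agent performance.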

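5. Significance and outlook¶

Provides a standardized evaluation benchmark for developing LLM agents.
Points to future directions: improving models' dynamic decision-making and the robustness of their interaction with environments.

The paper moves LLM evaluation from "static generation" toward "dynamic agent behavior" and is a useful reference for research on AI agents and embodied intelligence.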

Dataset examples¶

Operating System (OS)¶

Task: “Find the total number of non-empty directories inside the ‘/etc’ directory.”
Action space: Any valid bash commands
Observation: System standard output
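To make the OS task concrete, here is a minimal Python sketch of the quantity the agent has to compute. It assumes "inside" means immediate children of the directory (the task wording does not say whether nested directories count), and `count_non_empty_dirs` is a hypothetical helper name. In the benchmark itself the agent would issue bash commands instead; one plausible equivalent with GNU find would be `find /etc -mindepth 1 -maxdepth 1 -type d ! -empty | wc -l`.

```python
from pathlib import Path


def count_non_empty_dirs(root: str) -> int:
    """Count immediate subdirectories of `root` that contain at least one entry."""
    count = 0
    for entry in Path(root).iterdir():
        # Skip regular files and symlinks; only real directories are counted.
        if not entry.is_dir() or entry.is_symlink():
            continue
        try:
            if any(entry.iterdir()):
                count += 1
        except PermissionError:
            # Unreadable directories are skipped rather than guessed at.
            pass
    return count


if __name__ == "__main__":
    print(count_non_empty_dirs("/etc"))
```

The result depends on the machine and on the permissions of the user running it, which is one reason an interactive environment with a fixed container image is needed to score such tasks.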

Database (DB)¶

Task: “What was the total number of medals won by United States?”, given the table ‘Olympic Medals’
Action space: Any valid SQL commands
Observation: MySQL CLI interface output
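The DB task boils down to an aggregate query over one table. The sketch below uses Python's built-in `sqlite3` as a stand-in for the MySQL interface the environment actually exposes. The column names (`country`, `gold`, `silver`, `bronze`) and every row are placeholders invented for illustration; they are not the benchmark's schema and not real medal counts.

```python
import sqlite3

# In-memory database standing in for the MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Olympic Medals" '
    "(country TEXT, gold INTEGER, silver INTEGER, bronze INTEGER)"
)
# Placeholder rows, NOT real medal data.
conn.executemany(
    'INSERT INTO "Olympic Medals" VALUES (?, ?, ?, ?)',
    [("United States", 3, 2, 1), ("Canada", 1, 0, 2)],
)


def total_medals(connection: sqlite3.Connection, country: str) -> int:
    """Sum gold, silver, and bronze medals for one country (0 if absent)."""
    row = connection.execute(
        'SELECT COALESCE(SUM(gold + silver + bronze), 0) '
        'FROM "Olympic Medals" WHERE country = ?',
        (country,),
    ).fetchone()
    return row[0]


print(total_medals(conn, "United States"))
```

In MySQL the table name with a space would typically be quoted with backticks rather than double quotes; the parameterized query is a habit worth keeping in either dialect.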

Knowledge Graph (KG)¶

Task: “Find tropical cyclones that are similar to Hurricane Marie and affected Eastern North America.”
Action space: Basic KG-querying tools
Observation: Query results
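A KG question like this one is usually answered by chaining a few small queries and intersecting their results. The sketch below shows that pattern on a toy triple store. The triples, the relation names (`similar_to`, `affected_by`), and the helper functions are all hypothetical and only loosely modeled on the idea of "basic KG-querying tools"; they should not be read as the benchmark's tool API or as facts about real cyclones.

```python
from typing import List, Set, Tuple

# Toy (subject, relation, object) triples, invented for illustration.
TRIPLES: List[Tuple[str, str, str]] = [
    ("Hurricane Marie", "similar_to", "Cyclone A"),
    ("Hurricane Marie", "similar_to", "Cyclone B"),
    ("Eastern North America", "affected_by", "Cyclone B"),
    ("Eastern North America", "affected_by", "Cyclone C"),
]


def get_relations(entity: str) -> Set[str]:
    """List the outgoing relations available from an entity."""
    return {rel for subj, rel, _ in TRIPLES if subj == entity}


def get_neighbors(entity: str, relation: str) -> Set[str]:
    """Follow one relation from an entity and return the reached entities."""
    return {obj for subj, rel, obj in TRIPLES
            if subj == entity and rel == relation}


def intersection(left: Set[str], right: Set[str]) -> Set[str]:
    """Keep only entities present in both intermediate results."""
    return left & right


similar = get_neighbors("Hurricane Marie", "similar_to")
affecting = get_neighbors("Eastern North America", "affected_by")
answer = intersection(similar, affecting)
print(sorted(answer))
```

Each helper call corresponds to one agent turn, so a single question can take several rounds of action and observation before the final answer is available.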

Digital Card Game (DCG)¶

Task: “Compete against another player using four ‘fish’ cards in ‘Aquawar’ game.”
Action space: Four ‘fish’ cards and Assertion
Observation: Battle process, status of ‘fish’

Lateral Thinking Puzzles (LTP)¶

Task: “A man sleeps with the lights off, and the next morning he suicides after opening windows. Why?”
Action space: Any binary questions
Observation: ‘Yes’, ‘No’, or ‘Irrelevant’

House-holding (HH)¶

Task: “Clean some soapbar and put it in countertop”
Action space: A list of allowed actions in the room, or other accessible rooms
Observation: Results after the action.

Web Shopping (WS)¶

Task: “Looking for a queen size bedspread set in the color redwood, and price lower than 70.”
Action space: Search (generate keywords) and Click (choose from all clickable buttons)
Observation: Products’ descriptions; the webpage
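Because the action space has only two verbs, the environment has to turn the model's free-form text into either a search or a click before it can act. The sketch below assumes a `search[...]` / `click[...]` textual format and a hypothetical `parse_action` helper; the exact syntax the benchmark accepts may differ, so treat this as an illustration of validating agent output rather than the official parser.

```python
import re
from typing import Optional, Tuple

# Assumed textual format: search[keywords] or click[button label].
ACTION_PATTERN = re.compile(r"^(search|click)\[(.+)\]$", re.IGNORECASE)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (verb, argument) for a well-formed action, else None."""
    match = ACTION_PATTERN.match(text.strip())
    if match is None:
        return None
    verb = match.group(1).lower()
    argument = match.group(2).strip()
    # An action with a blank argument is treated as invalid.
    return (verb, argument) if argument else None


print(parse_action("search[queen size bedspread set redwood]"))
print(parse_action("Click[Buy Now]"))
print(parse_action("I would like to buy this one"))
```

Returning `None` for malformed output lets the evaluation loop count invalid actions separately from wrong but well-formed ones, which is useful when diagnosing instruction-following failures.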

Web Browsing (WB)¶

Task: “Find a latest post with more than 10k upvotes in r/announcements community and upvote it.”
Action space: 1) Choose one out of all HTML elements in the webpage; 2) Click, Type, or Select Options
Observation: Page HTML (optional: screenshot)