# 2308.03688_AgentBench: Evaluating LLMs as Agents

* Homepage:
* PDF:
* Citations: 236 (2025-08-09)
* Organization: Tsinghua University
* GitHub:

![](https://img.zhaoweiguo.com/uPic/2025/08/HS6mS2.png)

Figure 2: AgentBench is the first systematic benchmark to evaluate LLM-as-Agent on a wide array of real-world challenges and 8 distinct environments. In total, 27 LLMs are examined in this edition.

## Summary

* Dataset
    * A multi-dimensional, evolving benchmark
    * Comprises 8 distinct environments for evaluating the reasoning and decision-making abilities of LLM-as-Agent in multi-turn, open-ended generation settings

## From Deepseek

### 1. **Background**

- Large language models (such as GPT-4 and ChatGPT) show potential as autonomous agents that carry out complex tasks, but a systematic evaluation framework has been missing.
- Existing benchmarks mostly target static tasks (such as question answering and reasoning) and overlook how agents behave in dynamic, interactive environments.

### 2. **Research Goals**

- Propose **AgentBench**, a multi-dimensional evaluation framework that tests how LLMs perform as agents in simulated real-world environments (such as operating systems, databases, games, and the web).
- Cover **8 distinct environments**:
    - **Operating System (OS)**: tasks solved with bash commands
    - **Database (DB)**: questions answered with SQL commands
    - **Knowledge Graph (KG)**: reasoning with KG-querying tools
    - **Digital Card Game (DCG)**
    - **Lateral Thinking Puzzles (LTP)**
    - **House-holding (HH)**
    - **Web Shopping (WS)**
    - **Web Browsing (WB)**

### 3. **Key Contributions**

- **Multi-dimensional evaluation**: the first systematic evaluation of LLMs in interactive, dynamic environments rather than in static text generation alone.
- **Open-source tooling**: standardized test environments and automated evaluation tools (the code is open source).
- **Performance analysis**: tests of mainstream LLMs (such as GPT-4, Claude, and LLaMA) show that:
    - Current models are weak at **long-term planning** and **adapting to the environment**.
    - Model performance depends heavily on **domain-specific knowledge** and **contextual memory**.

### 4. **Main Findings**

- **GPT-4** leads on most tasks, but its performance drops on tasks that require many sequential steps, such as the game environments.
- Open-source models (such as LLaMA-2) trail closed-source models by a wide margin, especially in complex environments.
- **Prompt engineering** (such as Chain-of-Thought) has a significant effect on agent performance.

### 5. **Significance and Outlook**

- Provides a standardized evaluation benchmark for developing LLM agents.
- Points to future directions: improving models' **dynamic decision-making** and **robustness when interacting with environments**.

The paper shifts LLM evaluation from "static generation" toward "dynamic agent behavior" and is a useful reference for research on AI agents and embodied intelligence.

## Dataset Examples

### Operating System (OS)

```
Task: “Find the total number of non-empty directories inside the ‘/etc’ directory.”
Action space: Any valid bash commands
Observation: System standard output
```

![](https://img.zhaoweiguo.com/uPic/2025/08/wi1RLe.png)

### Database (DB)

```
Task: “What was the total number of medals won by United States?”, given the table ‘Olympic Medals’
Action space: Any valid SQL commands
Observation: MySQL CLI interface output
```

![](https://img.zhaoweiguo.com/uPic/2025/08/B77ZLJ.png)

### Knowledge Graph (KG)

```
Task: “Find tropical cyclones that are similar to Hurricane Marie and affected Eastern North America.”
Action space: Basic KG-querying tools
Observation: Query results
```

![](https://img.zhaoweiguo.com/uPic/2025/08/WfY5s2.png)

### Digital Card Game (DCG)

```
Task: “Compete against another player using four ‘fish’ cards in ‘Aquawar’ game.”
Action space: Four ‘fish’ cards and Assertion
Observation: Battle process, status of ‘fish’
```

![](https://img.zhaoweiguo.com/uPic/2025/08/3lNDrB.png)

### Lateral Thinking Puzzles (LTP)

```
Task: “A man sleeps with the lights off, and the next morning he suicides after opening windows. Why?”
Action space: Any binary questions
Observation: ‘Yes’, ‘No’, or ‘Irrelevant’
```

![](https://img.zhaoweiguo.com/uPic/2025/08/VuUvtL.png)

### House-holding (HH)

```
Task: “Clean some soapbar and put it in countertop”
Action space: A list of allowed actions in the room, or other accessible rooms
Observation: Results after the action.
```

![](https://img.zhaoweiguo.com/uPic/2025/08/xYQMyt.png)

### Web Shopping (WS)

```
Task: “Looking for a queen size bedspread set in the color redwood, and price lower than 70.”
Action space: Search (generate keywords) and Click (choose from all clickable buttons)
Observation: Products’ descriptions; the webpage
```

![](https://img.zhaoweiguo.com/uPic/2025/08/dSLFB0.png)

### Web Browsing (WB)

```
Task: “Find a latest post with more than 10k upvotes in r/announcements community and upvote it.”
Action space: 1) Choose one out of all HTML elements in the webpage; 2) Click, Type, or Select Options
Observation: Page HTML (optional: screenshot)
```

![](https://img.zhaoweiguo.com/uPic/2025/08/c2IDY0.png)
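
## Sketch: the shared interaction loop

Every example above has the same shape: a task, an action space, and an observation returned after each action. The sketch below illustrates that multi-turn loop using the Database (DB) task as a model. It is not AgentBench's actual code. The names `SQLiteEnv`, `run_episode`, and `scripted_policy` are hypothetical, `sqlite3` stands in for the MySQL CLI mentioned above, the table rows are made-up toy values, and the scripted policy is a placeholder for a real LLM call.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# One turn of history: (action sent by the agent, observation from the environment).
Turn = Tuple[str, str]


class SQLiteEnv:
    """Toy stand-in for the DB environment: action = SQL string, observation = text."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        # Made-up rows for illustration only; they are not AgentBench data.
        self.conn.execute(
            "CREATE TABLE olympic_medals (country TEXT, medal TEXT, total INTEGER)"
        )
        self.conn.executemany(
            "INSERT INTO olympic_medals VALUES (?, ?, ?)",
            [
                ("United States", "gold", 3),
                ("United States", "silver", 2),
                ("France", "gold", 1),
            ],
        )

    def step(self, action: str) -> str:
        """Execute one SQL command and return its rows (or the error) as text."""
        try:
            rows = self.conn.execute(action).fetchall()
        except sqlite3.Error as exc:
            return f"ERROR: {exc}"
        return "\n".join(" | ".join(str(value) for value in row) for row in rows)


def run_episode(
    env: SQLiteEnv,
    policy: Callable[[str, List[Turn]], Optional[str]],
    task: str,
    max_turns: int = 5,
) -> List[Turn]:
    """Alternate between policy actions and environment observations."""
    history: List[Turn] = []
    for _ in range(max_turns):
        action = policy(task, history)
        if action is None:  # the policy signals that it has finished
            break
        history.append((action, env.step(action)))
    return history


def scripted_policy(task: str, history: List[Turn]) -> Optional[str]:
    """Placeholder for an LLM call: replays a fixed plan, one action per turn."""
    plan = [
        "SELECT name FROM sqlite_master WHERE type = 'table'",
        "SELECT SUM(total) FROM olympic_medals WHERE country = 'United States'",
    ]
    return plan[len(history)] if len(history) < len(plan) else None


history = run_episode(
    SQLiteEnv(),
    scripted_policy,
    "What was the total number of medals won by United States?",
)
for action, observation in history:
    print(f"> {action}\n{observation}")
```

A real agent would replace `scripted_policy` with a function that sends the task and the history so far to a model and parses the next action from its reply. The other environments would differ mainly in what `step` accepts and returns, for example bash commands and standard output for the OS tasks.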