# 2308.03688_AgentBench: Evaluating LLMs as Agents

* Homepage: <https://arxiv.org/abs/2308.03688>
* PDF: <https://arxiv.org/pdf/2308.03688>
* Citations: 236 (as of 2025-08-09)
* Organization: Tsinghua University
* GitHub: <https://github.com/THUDM/AgentBench>


![](https://img.zhaoweiguo.com/uPic/2025/08/HS6mS2.png)

Figure 2: AgentBench is the first systematic benchmark to evaluate LLM-as-Agent on a wide array of real-world challenges and 8 distinct environments. In total, 27 LLMs are examined in this edition.

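## Summary

* Dataset
    * A multi-dimensional, evolving benchmark
    * Comprises 8 distinct environments for evaluating the reasoning and decision-making abilities of LLM-as-Agent in multi-turn, open-ended generation settings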
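
The multi-turn setting can be pictured as a loop in which the agent emits an action, the environment returns an observation, and the exchange repeats until the episode ends. The sketch below is only an illustration of that loop: `GuessNumberEnv`, `binary_search_agent`, and `run_episode` are hypothetical names, the number-guessing task is not one of the benchmark environments, and a rule-based function stands in for the LLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GuessNumberEnv:
    """Toy stand-in for an interactive environment (not part of AgentBench)."""
    secret: int
    max_turns: int = 10
    turns: int = 0

    def reset(self) -> str:
        self.turns = 0
        return "Guess an integer between 1 and 100."

    def step(self, action: str) -> Tuple[str, bool, float]:
        """Apply one action and return (observation, done, reward)."""
        self.turns += 1
        guess = int(action)
        if guess == self.secret:
            return "correct", True, 1.0
        if self.turns >= self.max_turns:
            return "out of turns", True, 0.0
        return ("too low" if guess < self.secret else "too high"), False, 0.0


def binary_search_agent(history: List[Tuple[str, str]]) -> str:
    """Rule-based placeholder for an LLM: narrows the range from past feedback."""
    low, high = 1, 100
    for action, observation in history:
        if observation == "too low":
            low = int(action) + 1
        elif observation == "too high":
            high = int(action) - 1
    return str((low + high) // 2)


def run_episode(env: GuessNumberEnv, agent) -> Tuple[float, int]:
    """Alternate actions and observations until the environment signals the end."""
    env.reset()
    history: List[Tuple[str, str]] = []
    while True:
        action = agent(history)
        observation, done, reward = env.step(action)
        history.append((action, observation))
        if done:
            return reward, len(history)
```

Calling `run_episode(GuessNumberEnv(secret=37), binary_search_agent)` plays one episode and returns the final reward together with the number of turns used. In a real evaluation the agent function would wrap an LLM call and the history would be rendered into the prompt.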


## From Deepseek

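### 1. **Research Background**
   - Large language models (such as GPT-4 and ChatGPT) show potential as autonomous agents that carry out complex tasks, but a systematic evaluation framework has been missing.
   - Existing benchmarks mostly target static tasks (such as question answering and reasoning) and overlook agent behavior in dynamic, interactive environments.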

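### 2. **Research Goals**
   - Propose **AgentBench**, a multi-dimensional evaluation framework for testing LLMs as agents in simulated real-world environments (such as web navigation, games, and coding).
   - It covers **8 distinct environments**:
     - **Operating System (OS)**: completing tasks with bash commands
     - **Database (DB)**: answering questions with SQL queries
     - **Knowledge Graph (KG)**: reasoning with basic KG-querying tools
     - **Digital Card Game (DCG)**: playing the "Aquawar" card game against another player
     - **Lateral Thinking Puzzles (LTP)**: solving a puzzle by asking yes/no questions
     - **House-holding (HH)**: carrying out household tasks in a simulated home
     - **Web Shopping (WS)**: searching for and selecting products
     - **Web Browsing (WB)**: operating on the HTML elements of real web pages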

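### 3. **Key Contributions**
   - **Multi-dimensional evaluation**: the first systematic evaluation of LLMs in interactive, dynamic environments rather than static text generation alone.
   - **Open-source tooling**: standardized test environments and automated evaluation tools, with the code released publicly.
   - **Performance analysis**: tests of mainstream LLMs (such as GPT-4, Claude, and LLaMA) show that:
     - Current models are weak at **long-term planning** and at **adapting to the environment**.
     - Model performance depends heavily on **domain-specific knowledge** and **context memory**.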

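### 4. **Main Findings**
   - **GPT-4** leads on most tasks, but its performance drops on tasks that require many steps (such as the games) or that depend on continuous feedback from the environment.
   - Open-source models (such as LLaMA-2) trail closed-source models by a wide margin, especially in complex environments.
   - **Prompt engineering** (such as Chain-of-Thought) has a marked effect on agent performance.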

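### 5. **Significance and Outlook**
   - Provides a standardized evaluation benchmark for developing LLM agents.
   - Points to future directions: improving models' **dynamic decision-making** and their **robustness when interacting with environments**.

The paper moves LLM evaluation from "static generation" toward "dynamic agent behavior" and is a useful reference for research on AI agents and embodied intelligence.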

## Dataset Examples

### Operating System (OS)
```
Task: “Find the total number of non-empty directories inside the ‘/etc’ directory.”
Action space: Any valid bash commands
Observation: System standard output
```

![](https://img.zhaoweiguo.com/uPic/2025/08/wi1RLe.png)
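
To make the OS task concrete, here is a minimal sketch of the computation the agent has to arrive at. In the benchmark the agent would issue bash commands; Python is used here only so the logic can be checked. The function name `count_nonempty_dirs` is hypothetical, and the sketch assumes "inside" means direct subdirectories and that symlinks are not followed, neither of which the task statement specifies.

```python
import os
import tempfile


def count_nonempty_dirs(root: str) -> int:
    """Count direct subdirectories of `root` that contain at least one entry."""
    count = 0
    with os.scandir(root) as entries:
        for entry in entries:
            # Assumption: a symlink pointing at a directory is not counted.
            if entry.is_dir(follow_symlinks=False):
                try:
                    with os.scandir(entry.path) as children:
                        if next(children, None) is not None:
                            count += 1
                except PermissionError:
                    # Unreadable directories are skipped in this sketch.
                    pass
    return count


def demo() -> int:
    """Build a throwaway tree with two non-empty directories and one empty one."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("a", "b", "empty"):
            os.mkdir(os.path.join(root, name))
        for name in ("a", "b"):
            with open(os.path.join(root, name, "file.txt"), "w") as handle:
                handle.write("x")
        return count_nonempty_dirs(root)
```

`demo()` runs against a temporary directory rather than `/etc`, so the result does not depend on the host system. Calling `count_nonempty_dirs("/etc")` gives the answer for the machine at hand under the assumptions stated above.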


### Database (DB)

```
Task: “What was the total number of medals won by United States?”, given the table ‘Olympic Medals’
Action space: Any valid SQL commands
Observation: MySQL CLI interface output
```

![](https://img.zhaoweiguo.com/uPic/2025/08/B77ZLJ.png)
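
The kind of query the DB task calls for can be sketched as follows. The environment described above exposes a MySQL CLI; this example swaps in Python's built-in `sqlite3` so it runs without a server. The column names (`country`, `gold`, `silver`, `bronze`) and all rows are assumptions made for illustration, not the schema or data of the benchmark table.

```python
import sqlite3


def total_medals(conn: sqlite3.Connection, country: str) -> int:
    """Sum the medal columns for one country; returns 0 if the country is absent."""
    row = conn.execute(
        'SELECT COALESCE(SUM(gold + silver + bronze), 0) '
        'FROM "Olympic Medals" WHERE country = ?',
        (country,),
    ).fetchone()
    return row[0]


def demo() -> int:
    """Create an in-memory table with hypothetical columns and query it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "Olympic Medals" '
        "(country TEXT, gold INTEGER, silver INTEGER, bronze INTEGER)"
    )
    # Made-up rows for illustration only; these are not real medal counts.
    conn.executemany(
        'INSERT INTO "Olympic Medals" VALUES (?, ?, ?, ?)',
        [
            ("United States", 3, 2, 1),
            ("United States", 1, 0, 4),
            ("Testland", 5, 5, 5),
        ],
    )
    try:
        return total_medals(conn, "United States")
    finally:
        conn.close()
```

The table name contains a space, so it is quoted in every statement. The parameter placeholder keeps the country name out of the SQL string, which matters when the value comes from model output.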


### Knowledge Graph (KG)
```
Task: “Find tropical cyclones that are similar to Hurricane Marie and affected Eastern North America.”
Action space: Basic KG-querying tools
Observation: Query results
```

![](https://img.zhaoweiguo.com/uPic/2025/08/WfY5s2.png)
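
One way to picture how basic KG-querying tools combine to answer such a question is a lookup followed by a set intersection. Everything below is a toy: the triples are invented, the relation names `category` and `affected_area` are hypothetical, the helper functions are not the benchmark's actual tool API, and "similar" is assumed to mean "shares a category".

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Invented triples for illustration; not real knowledge-graph data.
TRIPLES: List[Triple] = [
    ("Hurricane Marie", "category", "Category X"),
    ("Cyclone A", "category", "Category X"),
    ("Cyclone B", "category", "Category X"),
    ("Cyclone A", "affected_area", "Eastern North America"),
    ("Cyclone C", "affected_area", "Eastern North America"),
]


def get_neighbors(triples: List[Triple], entity: str, relation: str) -> Set[str]:
    """Objects reachable from `entity` through `relation`."""
    return {o for s, r, o in triples if s == entity and r == relation}


def get_subjects(triples: List[Triple], relation: str, obj: str) -> Set[str]:
    """Subjects that point at `obj` through `relation`."""
    return {s for s, r, o in triples if r == relation and o == obj}


def similar_cyclones_affecting(
    triples: List[Triple], seed: str, area: str
) -> Set[str]:
    """Cyclones sharing a category with `seed` that also affected `area`."""
    same_category: Set[str] = set()
    for category in get_neighbors(triples, seed, "category"):
        same_category |= get_subjects(triples, "category", category)
    same_category.discard(seed)  # the seed itself is not an answer
    affected = get_subjects(triples, "affected_area", area)
    return same_category & affected
```

In the multi-turn setting each helper call would be a separate agent action, with the returned set shown to the model as the observation before it decides on the next call.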


### Digital Card Game (DCG)
```
Task: “Compete against another player using four ‘fish’ cards in ‘Aquawar’ game.”
Action space: Four ‘fish’ cards and Assertion
Observation: Battle process, status of ‘fish’
```

![](https://img.zhaoweiguo.com/uPic/2025/08/3lNDrB.png)


### Lateral Thinking Puzzles (LTP)
```
Task: “A man sleeps with the lights off, and the next morning he suicides after opening windows. Why?”
Action space: Any binary questions
Observation: ‘Yes’, ‘No’, or ‘Irrelevant’
```

![](https://img.zhaoweiguo.com/uPic/2025/08/VuUvtL.png)


### House-holding (HH)

```
Task: “Clean some soapbar and put it in countertop”
Action space: A list of allowed actions in the room, or other accessible rooms
Observation: Results after the action.
```

![](https://img.zhaoweiguo.com/uPic/2025/08/xYQMyt.png)

### Web Shopping (WS)
```
Task: “Looking for a queen size bedspread set in the color redwood, and price lower than 70.”
Action space: Search (generate keywords) and Click (choose from all clickable buttons)
Observation: Products’ descriptions; the webpage
```

![](https://img.zhaoweiguo.com/uPic/2025/08/dSLFB0.png)
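
Because the action space has only two verbs, the model's free-text output has to be parsed into a search or a click before the environment can act on it. The sketch below assumes a bracketed `search[...]` / `click[...]` text format; that format, the `Action` type, and the functions `parse_action` and `is_valid` are assumptions for illustration rather than the benchmark's confirmed interface.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed textual format: "search[keywords]" or "click[button label]".
ACTION_PATTERN = re.compile(r"^\s*(search|click)\[(.+)\]\s*$", re.IGNORECASE)


class Action(NamedTuple):
    kind: str      # "search" or "click"
    argument: str  # keywords for a search, button label for a click


def parse_action(text: str) -> Optional[Action]:
    """Return the parsed action, or None if the text matches neither form."""
    match = ACTION_PATTERN.match(text)
    if match is None:
        return None
    return Action(match.group(1).lower(), match.group(2).strip())


def is_valid(action: Action, clickable: List[str]) -> bool:
    """In this sketch a search is always allowed; a click must name a visible button."""
    if action.kind == "search":
        return True
    return action.argument.lower() in {label.lower() for label in clickable}
```

For example, `parse_action("search[queen size bedspread set redwood]")` yields a search action carrying the keywords, while a reply such as `"buy it"` yields `None` and can be counted as an invalid action.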


### Web Browsing (WB)
```
Task: “Find a latest post with more than 10k upvotes in r/announcements community and upvote it.”
Action space: 1) Choose one out of all HTML elements in the webpage; 2) Click, Type, or Select Options
Observation: Page HTML (optional: screenshot)
```

![](https://img.zhaoweiguo.com/uPic/2025/08/c2IDY0.png)
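
The two-part action space (pick an element, then pick an operation) can be modeled as a small record with a validity check. This is a sketch under assumptions: `WebAction`, `validate`, and the element identifiers are hypothetical, and the rule that Type and Select need a value while Click does not is inferred from the operation names rather than taken from the benchmark code.

```python
from dataclasses import dataclass
from typing import List, Optional

OPERATIONS_NEEDING_VALUE = {"type", "select"}
OPERATIONS = {"click"} | OPERATIONS_NEEDING_VALUE


@dataclass(frozen=True)
class WebAction:
    element_id: str              # hypothetical identifier of the chosen HTML element
    operation: str               # "click", "type", or "select"
    value: Optional[str] = None  # text to type or option to select


def validate(action: WebAction, candidate_ids: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the action is well-formed."""
    problems = []
    if action.element_id not in candidate_ids:
        problems.append("element is not on the page")
    operation = action.operation.lower()
    if operation not in OPERATIONS:
        problems.append("unknown operation")
    elif operation in OPERATIONS_NEEDING_VALUE and not action.value:
        problems.append("operation needs a value")
    return problems
```

Returning a list of problems instead of a boolean makes it possible to feed the specific error back to the model as the next observation.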