HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face¶

https://arxiv.org/abs/2303.17580
组织：浙江大学，Microsoft亚洲研究院
https://github.com/microsoft/JARVIS
Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we present HuggingGPT, an LLM-powered agent that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards the realization of artificial general intelligence.
解决具有不同领域和模态的复杂 AI 任务是迈向通用人工智能的关键一步。虽然有许多 AI 模型可用于各种领域和模式，但它们无法自主处理复杂的 AI 任务。考虑到大型语言模型（LLMs）在语言理解、生成、交互和推理方面表现出卓越的能力，我们主张可以LLMs作为控制器来管理现有的人工智能模型，以解决复杂的人工智能任务，而语言则作为通用接口来增强这种能力。基于这一理念，我们提出了 HuggingGPT，这是一种LLM驱动的代理，它利用LLMs（例如 ChatGPT）连接机器学习社区（例如 Hugging Face）中的各种 AI 模型来解决 AI 任务。具体来说，我们在收到用户请求时，使用 ChatGPT 进行任务规划，根据它们在 Hugging Face 中可用的功能描述选择模型，使用选定的 AI 模型执行每个子任务，并根据执行结果总结响应。通过利用ChatGPT强大的语言能力和Hugging Face中丰富的AI模型，HuggingGPT可以处理跨越不同模态和领域的各种复杂AI任务，并在语言、视觉、语音和其他具有挑战性的任务中取得令人印象深刻的成果，为实现通用人工智能铺平了一条新的道路。

https://img.zhaoweiguo.com/uPic/2024/08/aBwtln.png — 语言用作LLMs（例如 ChatGPT）连接众多 AI 模型（例如 Hugging Face 中的模型）以解决复杂 AI 任务的接口。在这个概念中，a LLM 充当控制器，管理和组织专家模型的合作。LLM首先根据用户请求规划任务列表，然后为每个任务分配专家模型。专家执行任务后，收集LLM结果并响应用户。¶

四个阶段
- 任务规划：使用 ChatGPT 分析用户的请求，了解他们的意图，并将它们分解成可能的可解决任务。
- 模型选择：为了解决计划的任务，ChatGPT 根据模型描述选择托管在 Hugging Face 上的专家模型。
- 任务执行：调用并执行每个选定的模型，并将结果返回给 ChatGPT。
- 响应生成：最后，利用 ChatGPT 来整合来自所有模型的预测并为用户生成响应。

https://img.zhaoweiguo.com/uPic/2024/08/OK54jh.png — LLM 以（如ChatGPT）为核心控制器，以专家模型为执行器¶

https://img.zhaoweiguo.com/uPic/2024/08/2a6Oto.png — 各种任务的案例研究¶

备注

【限制】该论文指出了一些局限性，包括需要提高LLM的规划能力，由于多种互动而导致的效率挑战以及代币长度限制的问题。 -它还指出了LLM的不稳定性，这可能导致错误的输出，强调在未来的设计中需要更好的控制机制。