2501.11733_Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

Abstract

  • 背景:

    • 现在手机在生活中非常重要,但用手机完成复杂、需要多步骤的任务仍然很痛苦,很费时间。

    • 最近出现了一些基于大规模多模态模型(LMM)移动代理(mobile agents),它们能帮人“看”和“动手”操作手机上的东西。

    • 但是,这些现有方法还有很多不足

      • 不够好地解决真实世界中的人类需求;

      • 面对推理复杂、步骤长的任务容易失败;

      • 不能从过去的经验中学习和改进

  • 他们做了什么?

    • 提出了一个新系统,叫 Mobile-Agent-E

    • 它是一个分层的多智能体框架,能靠以往经验自我进化(self-evolution)。

  • 分层(Hierarchical): 把任务分成两个层次:

    • 高层Manager(管理者):负责整体规划,把一个大任务拆成小目标(subgoals)。

    • 低层(四个子代理)分别负责:

      • Perceptor(感知者):负责看手机界面,理解视觉信息;

      • Operator(执行者):负责操作手机,点、滑、打字等;

      • Action Reflector(行动反思者):检查是否出错,验证操作结果;

      • Notetaker(笔记员):负责记录总结信息。


  • 创新点:

    • 引入了一个自我进化模块(self-evolution module),通过两种方式积累和应用经验:

      • Tips(提示):总结出来的通用经验教训,比如“遇到验证码时先刷新页面”这样的操作建议。

      • Shortcuts(捷径)可以复用的一段小操作流程,比如“打开微信→搜索好友→发送文件”,以后遇到类似的子任务直接用,不用从头推理。

    • 他们还提出了一个新的评测标准,叫 Mobile-Eval-E

      • 这个评测包括了复杂、真实世界的手机任务,而且需要跨多个App长时间操作。

  • 实验结果:

    • 用三种不同基础大模型做实验,Mobile-Agent-E在性能上比之前的最先进方法提升了22个百分点

    • 他们还做了详细分析,讨论了自我进化机制的效果,并给了未来的研究方向建议。

  • 总结一句话: Mobile-Agent-E 是一个可以自我学习、越用越聪明的“手机智能助理”,通过分层设计和积累小技巧,不但能完成更复杂的任务,还比以前的方法快又准。

1. Introduction

Figure 1:We propose Mobile-Agent-E, a novel hierarchical multi-agent mobile assistant that outperforms previous state-of-the-art approaches (Zhang et al., 2023; Wang et al., 2024b, a) on complex real-world tasks. Mobile-Agent-E disentangles high-level planning and low-level action decision with dedicated agents. Equipped with a newly introduced self-evolution module that learns general Tips and reusable Shortcuts from past experiences, Mobile-Agent-E demonstrates further improvements in both performance and efficiency.

  • 期望目标:像人类一样从经验中学习。

    • 人类用一个新App,一开始可能不会,但用几次后就能更快更准。

    • 现有的移动代理每次都像第一次用一样:

      • 每次都重新分配同样的资源

      • 每次都重复犯一样的错

      • 完全不会积累经验

    • 这种缺乏经验积累导致它们在复杂、长任务中表现很差。

  • 关键创新:自我进化模块

    • Mobile-Agent-E有长期记忆,通过两个东西不断积累经验:

      • Tips(提示):总结过往任务的经验教训,比如“搜索时优先点最近的结果”。

      • Shortcuts(捷径):一段可以直接执行的小操作流程,比如“打开淘宝→搜索商品→查看价格”。

    • 每完成一个任务后,有**Experience Reflectors(经验反射器)**负责:

      • 更新Tips

      • 根据操作记录提出新的Shortcuts

    • 这样下次遇到类似任务,能做得更快、更好。

    • 灵感来源:

      • Tips像人类的情景记忆(episodic memory)

      • Shortcuts像人类的程序性记忆(procedural knowledge)

  • 新的评测标准:Mobile-Eval-E

    • 现有基准太简单了,无法真正考验代理的能力。

    • 于是他们提出了新的评测集Mobile-Eval-E,特点是:

      • 每个任务涉及的操作步骤是原来的两倍以上

      • 更多多App交互(比如淘宝、京东、微信一起用)

    • 还新设计了一个更符合人类体验的指标:

      • Satisfaction Score(满意度得分):不是单纯成功/失败,而是看“完成了多少关键步骤+探索了多少有用信息”。

      • SSS曲线(Satisfaction Score vs Steps):用来衡量代理在不同任务步骤数下的效率。

2. Mobile-Agent-E

Figure 2: An overview of the Mobile-Agent-E framework, where the Manager, Perceptor AP, Operator, Action Reflector, and Notetaker are involved in the main agent loop for each task, while two Experience Reflectors contribute to updating long-term memory across tasks. Decision-making at each step is disentangled into high-level planning by the Manager and low-level actions by the Operator. The Action Reflector verifies the outcome of each action, tracks progress, and provides error feedback. The Notetaker aggregates important information during navigation. A detailed example illustrating one step in the agent loop and the self-evolution process is presented in Figures 3 and 4.

Figure 3:A detailed breakdown of one inference step t with Mobile-Agent-E, showing the inputs and outputs of each agent. Omitted information indicates no change.

2.1. Hierachical Multi-Agent Framework

  • 这段内容描述了一个分层式的智能体体系,每个智能体有不同的角色和职责,整体协作来完成复杂任务。体系里主要有三个智能体(Manager、Perceptor、Operator),每个智能体都是基于冻结的大型多模态模型(LMM)(比如 GPT-4o)实例化的,除了感知模块(Perceptor)。

  • 总体框架

    • 整体是一个主循环,每一步都会由不同智能体分别负责高层计划、视觉感知、低层执行。

    • 除了感知器(Perceptor)是专门视觉感知模块,其他智能体(Manager、Operator)都是由大型多模态模型(比如 GPT-4o)冻结后生成的推理智能体。

  • Manager(高层计划制定者)

    • Manager 负责高层规划

      • 根据用户的输入指令 I

      • 当前的截图 st

      • 上一步的整体计划 WPt1 和子目标 WSt1

      • 当前的执行进度 WGt1

      • 之前记录的重要信息 WNt1

      • 还有长时记忆中提取出来的快捷信息 LS

    • 综合这些信息,更新

      • 当前的整体计划 WPt

      • 当前需要立刻去完成的子目标 WSt

    • 🔵 注意

      • Manager 不直接看 Perceptor 提取的细粒度视觉信息,因为高层规划不需要过多细节,避免噪声干扰。

    • 🔵 异常处理

      • 如果连续观察到 k 次动作失败(比如 k=2),Action Reflector 会触发错误升级标志(WEF)

      • 这时,Manager 会收到最近失败动作的记录 WE[-k:],重新思考整体计划或调整子目标,避免陷入死循环。

  • Perceptor(细粒度视觉感知)

    • Perceptor 负责感知手机界面的详细视觉信息

      • 输入:当前截图 st

      • 输出:识别到的所有文本、图标,以及它们的屏幕坐标,记作 WVt

    • 🔵 组成

      • OCR(识别文字)

      • 图标定位模型(找出哪些地方有按钮、图标)

      • 图标描述模型(给图标打标签)

    • 🔵 特别说明

      • Perceptor 只通过图像本身进行感知,不用手机界面的 XML 文件,保证通用性。

      • 后续智能体(比如 Operator)仍然能看到完整截图,作为整体视觉参考。

  • Operator(低层动作决策者)

    • Operator 负责执行具体操作,比如点击、输入文字等。

    • 它需要综合以下信息来决策当前具体要采取的动作(比如“点击屏幕上某个位置”)

      • 用户指令 I

      • Manager 给出的整体计划 WPt 和当前子目标 WSt

      • 上一时刻的执行进展 WGt1

      • 之前记下的重要笔记 WNt1

      • 细粒度的感知结果 WVt

      • 当前截图 st

    • 🔵 错误处理机制

      • 如果一个动作失败,Operator 会先自己尝试处理

      • 如果连续失败并且无法恢复,会向 Manager 升级错误,请求高层干预。

  • 小结

    • 这套分层系统就像一家公司:

      • Manager 是老板,负责定目标、定大方向。

      • Perceptor 是情报员,侦查现场情况。

      • Operator 是一线员工,负责根据计划具体执行动作。

    • 而且有机制保证:如果连续出错,底层员工(Operator)先自救,自救不了再上报老板(Manager)重新定方向。

2.2. Self-Evolution Module

Figure 4:Illustration of the inputs and outputs to the Experience Reflectors for a single self-evolution step, including a concrete example of the newly generated Shortcuts and Tips.

  • 🧠 核心思想

    • 灵感来自人类使用智能手机越用越熟练的过程。

    • Mobile-Agent-E 也设计了一个可以跨任务保留的长期记忆机制,并引入两个智能代理(经验反思者)来复盘任务表现,不断优化自己的行为和策略。

  • 长期记忆中包含两类关键知识:

    1. Tips(经验提示)

      • 是从过去任务中总结出的通用性建议,如“如果出错了可以尝试刷新页面”。

      • 类似人类的“情节记忆(episodic memory)”,能帮助在新任务中借鉴以往的经验。

      • 作用在高层策略(如规划、判断)上,增强 agent 的“意识”。

    2. Shortcuts(操作快捷方式)

      • 是将一串基础操作封装成的可复用子程序,比如“点击 -> 输入 -> 回车”。

      • 类似人类的“程序性知识(procedural memory)”,支持高效执行重复性任务。

      • 每个 Shortcut 都带有前置条件,确保只在合适环境下使用,比如“只有有输入框时才可使用输入快捷方式”。

  • 🧩 如何实现自我进化?

    • 通过两个 Experience Reflector(经验反思代理) 来更新上述记忆:

      • AET | 负责更新 Tips(高层经验)

      • AES | 负责生成/优化 Shortcuts(底层操作)

    • 输入数据包括:

      • 用户任务输入 I

      • 最终计划 WPτ 和进度状态 WGτ

      • 行动历史 WA 和错误历史 WE

      • 现有的 Tips / Shortcuts

      • 可选的未来任务 TF

    • 输出数据:

      • 新的 Shortcuts(结构化 JSON 格式)

      • 更新后的 Tips(自然语言形式)

  • 🧭 演化流程总结

    1. 任务执行后,系统收集任务表现数据。

    2. 使用反思代理 AETAES 分析任务表现。

    3. 自动生成新的经验 Tips 和 Shortcuts。

    4. 在后续任务中,Manager 和 Operator 会使用这些进化后的知识,提高策略水平操作效率

  • 📌 总结一句话

    • 这个 Self-Evolution 模块使 Mobile-Agent-E 像一个不断学习的“数字人类”,通过总结经验教训(Tips)和优化操作套路(Shortcuts),让它随着任务不断“进化”,变得更聪明、更高效。

3. Experiments

Figure 5:Satisfaction Score(评分量规列表, a list of rubrics)

  • 基准测试:Mobile-Eval-E

    • 共 25 个任务,分布在 5 个现实生活场景(比如餐馆推荐、信息搜索、旅游规划等)。

    • 每个任务要求智能体进行 复杂推理(reasoning-intensive)、长时序操作(long-horizon),并涉及 多个 App 的交互(76% 任务涉及多 App,以前不到 10%)。

    • 平均操作数是旧 benchmark 的两倍,任务总复杂度显著提升。

  • 更好的指标

    • Satisfaction Score (SS) – 满意度得分

      • 已完成的评分量规数除以评分量规总数,由人工评估员评判

      • 具体的评分量规数参见 Figure 5

    • Action Accuracy (AA) – 动作准确率

      • 看 agent 实际点击或操作是否合理,人工审核界面截图。

    • Reflection Accuracy (RA) – 自我反思准确率

      • 若 agent 有“反思器”,则看它是否正确识别错误、调整策略。

    • Termination Error (TE) – 异常退出率(越低越好)

      • 如果 agent 因错误退出(如操作重复、循环错误、三次连续失败等),就记为异常。

  • 关键实验结果总结

模型

满意度(SS↑)

动作准确率(AA↑)

反思准确率(RA↑)

异常退出率(TE↓)

AppAgent

25.2%

60.7%

-

96.0%

Mobile-Agent-v1

45.5%

69.8%

-

68.0%

Mobile-Agent-v2

53.0%

73.2%

96.7%

52.0%

Mobile-Agent-E

75.1%

85.9%

97.4%

32.0%

Mobile-Agent-E + Evo

86.9%

90.4%

97.8%

12.0%

  • Mobile-Agent-E 有两个版本:

    • 普通版:每个任务独立执行。

    • +Evo 版(演化版):多个任务共享记忆(long-term memory),智能体可在每次任务后进行反思更新。

  • 模型细节

    • 多模态模型(backbones

      • GPT-4o、Claude-3.5-Sonnet、Gemini-1.5-Pro

    • 视觉感知模块(Perceptor)

      • OCR(DBNet、ConvNextViT)、图标定位(GroundingDINO)、图标描述(Qwen-VL-Plus)

Table 4: 多模态模型测试结果

4. Results

Figure 7:Case study example where relevant Shortcuts and Tips are automatically retrieved from the previously evolved long-term memory and subsequently leveraged to complete an unseen, challenging task. The action trajectory also includes an example where the agent recovers from an error.

  • Progressive impact of self-evolution over time

    • 核心观点:随着任务经验积累,自进化模块会持续提升智能体的表现,特别是在序列后期的任务中效果更明显。

  • 捷径(Shortcuts)可以减少计算开销(Shortcut reduces computational overhead

    • 定义: 将多个原子动作组合为一个复合动作。例如将“点击 Tap”、“输入 Type”、“回车 Enter”三个步骤合并为一个动作 Tap_Type_and_Enter

  • Unique impact from Tips

    • 不使用新 Shortcut 的任务,只用 Tips,也能显著提升代理完成复杂任务的能力

  • A Closed-Loop Self-Evolving Agent

    • 背景:在大量任务和场景中运行后,它积累的 Tips 和 Shortcuts 可能会变得非常多,以至于 无法一次性将它们全部加载进推理决策的上下文(context)中

    • 解决方案:引入两个“经验检索器”代理(Experience Retriever Agents)

      • 𝒜_{ERT}: 用于从所有 Tips 中选出当前任务相关的 Tips

      • 𝒜_{ERS}: 用于从所有 Shortcuts 中选出当前任务相关的 Shortcut

    • 这两个组件可以根据当前任务,从“长时记忆”中选出最有帮助的经验,就像人类大脑从记忆中提取相关经验一样。

6. Conclusion and Future Work

  • 📌 1. 论文主张与亮点

    • 提出了 Mobile-Agent-E,一个新型的移动助手系统,具备以下两个关键特性

      • 分层多智能体框架(hierarchical multi-agent framework)

        • 不同智能体承担不同角色(如计划、执行、反思)

        • 更适合长期任务规划与复杂决策

      • 自我进化模块(self-evolution module)

        • 能根据执行经验优化“Shortcuts”(快捷操作)

        • 实现技能积累与智能提升

    • 这两大模块共同带来了三方面提升:

    能力

    说明

    📅 长期规划(long-term planning)

    能更好地分步执行复杂任务

    🛠️ 错误恢复(error recovery)

    执行中出错时能反思并调整策略

    ⚡ 执行效率(efficiency)

    通过 Shortcuts 等手段减少重复劳动

  • ⚠️ 2. 当前局限性(Remaining Limitations)

    1. Shortcut 条件判断失误

      • 有些快捷操作被错误地用于不满足其前提条件的场景

      • 比如:某个 App 并不支持“下滑刷新”,却依然调用了这个动作

    2. 智能体生成错误 Shortcuts

      • 比如:总结出一个无效或误导性的快捷操作

      • 可能是因为对环境感知不足或上下文理解错误

  • 🧭 3. 未来工作方向(Future Work)

    • 未来将重点优化 Shortcut 相关机制:

    阶段

    改进方向

    生成(generating)

    如何更稳健地产生有效快捷操作

    调用(invoking)

    何时、在哪个场景中安全调用

    修正(revising)

    出错后如何改进已有的 Shortcut

    • 🔧 增强个性化,让系统能更好地理解并适配每个用户的偏好与习惯

    • 🛡️ 加强安全机制,以便实现更可靠的人机协同,包括:

      • 防止误操作

      • 提供用户确认环节

      • 可解释性反馈

  • 总结:Mobile-Agent-E 是一个具备分层智能体结构和自我进化能力的移动助手,显著提升了复杂任务中的规划、执行与容错能力。尽管当前在 Shortcut 使用上仍存在前提判断错误与策略生成问题,未来将通过优化 Shortcut 管理机制、增强个性化和安全性,进一步提升用户体验与系统可靠性。

Appendix A Full Trajectory Comparison Example with Previous SOTA

Figure 8:Full trajectory comparison between the previous state-of-the-art, Mobile-Agent-v2 (Wang et al., 2024a), and Mobile-Agent-E.

Appendix B Error Recovery with Escalation to Manager

Figure 9:Error recovery with escalation. The task requires the agent to search for three different items on Walmart and note their sales information. At the step shown in the figure, the agent has already searched for ribeye steak and intends to search for fresh oranges next. However, the Operator erroneously calls the Shortcut that inputs text into the search bar and performs a search without clearing the previously entered text. Although the Action Reflector raises an error, the subgoal remains unchanged, and the Operator fails to rectify the error on the second attempt. After observing two consecutive errors, the error is escalated to the Manager, which correctly identifies the problem and revises the subgoal with detailed, decomposed steps to address the error. This helps the Operator correctly recover from the previous error by first tapping the “×” icon to clear the previous search query.

Appendix C Remaining Limitations

Figure 10:Example of misuse of Shortcuts in an invalid state. At the current step, as shown in the figure, the agent intended to switch back to Walmart to search for the final item requested by the user. While it correctly performs the “Switch_App” operation, it then calls a Shortcut for searching without realizing that it is not yet in the App where the search bar is available.

  • C.1 Misuse of Shortcuts due to Incorrect Perception of Phone State

Figure 11:Example of imperfect (above) and erroneous (below) generated Shortcuts. The “Search_Location_in_Maps” Shortcut includes an unnecessary Tap action in the operation sequence, while the “Switch_App_And_Search” Shortcut omits a Tap action needed to first enter the desired App before performing the search.

  • C.2 Errors and Imperfections in Self-Evolved Shortcuts

Appendix D All Tasks in Mobile-Eval-E Benchmark

Table 7: All task queries in Mobile-Eval-E

Scenario

Task ID

APPs

Input Query

Restaurant Recommendation

1_late_night_korean_food

Maps

Find the best-rated late-night Korean restaurant in Champaign, IL that opens beyond 9pm on Google Maps.

1_nearest_bakery

Maps

Get directions to the nearest Bakery that has a rating higher than 4.0 on Google Maps. Stop at the screen showing the route.

1_thai_duck

Maps, Notes

Find the best-rated Thai restaurant in Urbana, IL that serves duck cuisine on Google Maps. Review customer comments and compile a summary of positive and negative feedback in Notes.

1_bakery_birthday_cake

Maps, Notes

Find me a Bakery that is within 10min drive near me and does birthday cakes on Google Maps. Find the phone number and create a new note in Notes for that.

1_chinese_ohare

Maps, X, Notes

Find me a popular Chinese restaurant near Chicago O’Hare airport on Google Maps. Check X for recent posts about their signature dishes and write a summary in Notes. Then get directions to that restaurant on Google Maps. Stop at the screen showing the route.

Information Researching

2_segment_anything_cited

Chrome

Find the most-cited paper that cites the paper ‘Segment Anything’ on Google Scholar. Stop at the screen showing the paper abstract.

2_llm_agents_survey

Chrome, Notes

Find at least three representative survey papers on LLM agents on Google Scholar, and add their titles to the Notes.

2_recipes_chinese

Chrome, YouTube

I have some onions, beef, and potatoes in my refrigerator. Can you find me a Chinese-style recipe that uses all three ingredients and can be prepared in under an hour? And find me a video tutorial on YouTube for that. Stop at the screen displaying the video.

2_mcdonalds_deals

McDonald’s, Maps

Can you check the McDonald’s APP to see if there are any Rewards or Deals including Spicy McCrispy. If so, help me add that to Mobile Order (Do not pay yet, I will do it myself). And then check the pickup location and get directions on Google Maps. Stop at the screen showing the route.

2_headphones_reviews

Amazon, Notes

Find three detailed user reviews of the Bose QC45 headphones from Amazon. Summarize the general sentiment in the Notes.

Online Shopping

3_oled_tv

Best Buy

Find the best deal on a 55-inch 4K OLED TV at Best Buy. Stop at the screen displaying the best deal you find.

3_laptop_nvidia_gpu

Amazon Shopping

Find me a laptop on Amazon that is under $1000 with an Nvidia GPU and more than 8GB RAM.

3_ninja_air_fryer

Amazon Shopping, Walmart

Compare the price of a Ninja air fryer 8 qt at Walmart and Amazon. Stop at the screen displaying the best deal you find.

3_walmart_sale_items

Walmart, Notes

Check if any of the following items are on sale at Walmart: ribeye steak, fresh oranges, or toilet paper. If any are on sale, add a note in Notes with their prices.

3_nintendo_switch_joy_con

Amazon Shopping, Best Buy, Walmart

I want to buy a brand-new Nintendo Switch Joy-Con. Any color is fine. Please compare the prices on Amazon, Walmart, and Best Buy. Find the cheapest option and stop at the screen where I can add it to the cart.

What’s Trending

4_x_black_myth_wukong

X, Notes

Find the top posts about the game ‘Black Myth Wukong’ on X and summarize the key highlights in Notes.

4_x_trending_news

X, Notes

Check the top 3 trending news on X. Read a few posts to figure out what’s happening. And create a new Note to summarize your findings.

4_watercolor_painting_tutorial

Lemon8, Notes

I want to learn how to paint watercolor. Find me some content creators to follow on Lemon8 that has highly liked posts about watercolor painting tutorials. List their account names in Notes.

4_movie_trending

Fandango, Notes

Check the top 5 trending movies on Fandango that are currently in theaters. Compare their ratings and create a note in Notes for the highest-rated one, including its name and showtimes.

4_horror_movie_reviews

Fandango, Lemon8, Notes

Find me the latest horror movie currently in theaters on Fandango. Check some reviews on Lemon8 about the movie and create a note in Notes with the general sentiment.

Travel Planning

5_cheap_flights_newyork

Booking

Find the cheapest round-trip flight from Chicago to New York City in the next month on Booking. Stop at the screen showing the best deal.

5_things_to_do_la

Tripadvisor, Notes

Suggest some interesting things to do in LA. Find the top 3 attractions on Tripadvisor. Save the list in Notes.

5_palo_alto_tour

Tripadvisor, Notes

Plan a one-day itinerary for Palo Alto, CA using Tripadvisor. Choose the attractions and dining recommendations, but keep in mind that I don’t like seafood and I love museums. Write the plan in Notes.

5_local_food_chicago

Tripadvisor, Notes

Find a highly recommended local restaurant in Chicago on Tripadvisor. Check the reviews about must-try dishes and summarize in Notes.

5_hotel_champaign

Booking, Maps

Help me find a hotel in Champaign, IL on Booking that is under $200 for a queen bed. Make sure that the rating is higher than 7.0. Double-check on Google Maps to see if it is close to Green Street. Show me your final choice on Booking.

$$

Appendix E Atomic Operation Space

Table 8. Atomic operations space.

Appendix F Full list of Self-Evolved Shortcuts

Figure 12:Full list of Shortcuts generated by Mobile-Agent-E (with GPT-4o) after self-evolution.

Appendix G Full list of Self-Evolved Tips

Figure 13:Full list of Tips generated by Mobile-Agent-E (with GPT-4o) after self-evolution.