OpenAI's First Agent: A Deep-Dive Review of Phone-Use Capabilities
Summary: OpenAI has released its first consumer-facing Agent product, and the standout feature is Phone-Use—the ability for AI to directly operate mobile apps and complete tasks like ordering food, hailing rides, and tracking packages on your behalf. I spent a week testing it in depth and benchmarking it against Manus and AutoGLM. The verdict: OpenAI holds a clear edge in task completion rates and operational fluidity, and the qualitative shift from "answering questions" to "getting things done" is genuinely underway. However, several hurdles remain before we can truly "let go of the wheel."
1. Why This Agent Is Different
In May 2026, OpenAI officially launched its first consumer-facing Agent product. This is not ChatGPT with a new skin—it is an AI that can actually "do things." It directly operates the apps on your phone, completing tasks you assign to it.
The core capability is called Phone-Use: the AI understands and manipulates UI elements on your phone screen, clicking, swiping, and typing on your behalf to accomplish everything from ordering takeout to booking flights.
This is fundamentally different from earlier AI assistants. Siri is largely limited to system-level operations and the actions that apps choose to expose to it. ChatGPT answers questions inside a chat box. OpenAI's Agent directly controls third-party apps: it opens Meituan, browses restaurants, selects dishes, places an order, and completes payment. All you have to say is "Order me a KFC Spicy Chicken Burger combo."
The significance of this capability lies in one transformation: AI's output has shifted from "information" to "action." Previously, you asked AI "What's good nearby?" and it gave you a recommendation list; you then placed the order yourself. Now you say "Order me some food," and it handles the entire process.
This paradigm shift is not incremental. It is a categorical change in what AI systems are designed to deliver. For the past decade, the AI industry has optimized for better answers—more accurate, more comprehensive, more conversational. Phone-Use reorients the entire value proposition: the goal is no longer to inform you but to act for you. The implications ripple across every sector where human-computer interaction currently serves as a bottleneck.
2. Testing Scenarios and Results
I designed a set of test tasks covering everyday life scenarios. Each task was repeated three times; for each task I report the number of successful attempts and the completion time. Here are the detailed results.
2.1 Food Ordering
Task: Order takeout from a specified restaurant on Meituan, with delivery within 30 minutes.
Process: Agent opens Meituan → searches restaurant name → enters the store page → browses the menu → finds the specified dish → adds to cart → confirms address → selects delivery time → places order.
Result: 2 out of 3 attempts successful. The one failure occurred at the "select delivery time" step—the interface used a custom scroll component, and the Agent's tap did not trigger the correct time slot.
Time: Average completion time 47 seconds; manual operation takes approximately 90 seconds.
Highlight: The Agent demonstrated "error correction" capability during the restaurant search. I deliberately input a restaurant name with a typo, and the Agent automatically performed fuzzy matching to find the correct store. This suggests the underlying model goes beyond exact string matching and uses context to resolve ambiguity in real time, although a single test cannot show how robust that behavior is.
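To make "fuzzy matching" concrete, here is a minimal sketch using plain string similarity from Python's standard library. It is only a baseline illustration: how the Agent actually resolves typos is not documented in this review, and the restaurant names, the `resolve_name` helper, and the 0.6 cutoff are all assumptions made for the example.

```python
from difflib import get_close_matches
from typing import Optional, Sequence

# Hypothetical restaurant names, standing in for search results on screen.
CANDIDATES = ["KFC", "Burger King", "Haidilao Hot Pot", "Pizza Hut"]

def resolve_name(query: str, candidates: Sequence[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the candidate most similar to the query, or None if none is close enough."""
    # Compare case-insensitively, but return the original spelling.
    by_lowercase = {name.lower(): name for name in candidates}
    matches = get_close_matches(query.lower(), list(by_lowercase), n=1, cutoff=cutoff)
    return by_lowercase[matches[0]] if matches else None

print(resolve_name("Haidilao Hot Pto", CANDIDATES))  # typo in the last word
print(resolve_name("Starbucks", CANDIDATES))         # not in the list at all
```

A character-level matcher like this recovers from transposed letters but knows nothing about meaning, which is why the behavior observed in the test (if it generalizes) would point to something richer than string similarity.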
2.2 Ride Hailing
Task: Call a fast ride on Amap (Gaode Maps) from current location to a specified address.
Process: Agent opens Amap → inputs destination → selects fast ride option → confirms the call → waits for driver acceptance.
Result: 3 out of 3 attempts successful.
Time: Average completion time 32 seconds; manual operation takes approximately 45 seconds.
Highlight: When confirming the destination, the Agent proactively verified whether the address was reasonable (for instance, asking for confirmation when the destination appeared to be in a different city). This prevents the awkward scenario of "AI hailing you a ride to the airport when you meant to go downtown." This kind of commonsense validation is a subtle but important differentiator—it shows the system is not blindly executing instructions but applying a layer of situational judgment.

2.3 Package Tracking
Task: Check the logistics status of the most recent package on Taobao.
Process: Agent opens Taobao → enters "My Taobao" → taps "Pending Receipt" → views logistics details → reads out the logistics information.
Result: 3 out of 3 attempts successful.
Time: Average completion time 28 seconds; manual operation takes approximately 40 seconds.
Highlight: The Agent not only retrieved the logistics status but also proactively determined "It should arrive today" and communicated this, demonstrating understanding beyond simple instruction execution. This is a small but telling example of how Agent systems can add value by synthesizing information rather than merely relaying it. The logistics page shows a series of timestamps and locations; the Agent interpreted these data points to generate a useful prediction.
2.4 Compound Scenario Testing
Task: Arrange tomorrow's lunch—first find a Japanese restaurant rated 4.5+ on Dianping, then check Meituan for any group-buy discounts, and finally see how far it is from the office on the map.
Result: 1 out of 3 fully successful, 2 partially completed. The primary failure cause was context loss during cross-app transitions—the Agent sometimes "forgot" the restaurant name it had found earlier when switching between apps.
Time: The successful attempt took 2 minutes 15 seconds. Manual completion of the same operations takes approximately 5–8 minutes.
Analysis: Compound scenarios represent the biggest challenge for current Agent systems. Single-app operations are already quite mature, but information transfer and context persistence across apps still need optimization. This exposes a core issue: the Agent's "memory" is not yet coherent enough. When you or I perform this task, we hold the restaurant name, the discount details, and the distance calculation in working memory simultaneously, seamlessly weaving them together. The Agent, by contrast, treats each app interaction as a somewhat isolated episode. Bridging this gap—creating persistent, cross-app working memory—is arguably the most important technical challenge the Agent industry faces today.
3. Head-to-Head: OpenAI vs. Manus vs. AutoGLM
To evaluate OpenAI's Agent more comprehensively, I benchmarked it against two major competitors using the same test tasks.
3.1 Task Completion Rate Comparison
| Task Type | OpenAI Agent | Manus | AutoGLM |
|---|---|---|---|
| Single-app simple tasks | 92% | 78% | 72% |
| Single-app complex tasks | 75% | 60% | 55% |
| Cross-app compound tasks | 33% | 25% | 20% |
Methodology: Each task type was tested 6 times; completion rate = successful completions / total attempts.
OpenAI leads across all scenarios, but the margin narrows on the hardest tasks: its lead over the two competitors is 14–20 percentage points on simple single-app tasks, 15–20 points on complex single-app tasks, and 8–13 points on cross-app compound tasks. For foundational operation understanding and execution, OpenAI is the strongest of the three; for complex reasoning and cross-scenario coordination, everyone is still near the starting line. OpenAI's 33% cross-app completion rate, while ahead of the pack, is far from production-grade reliability. The current generation of Agent products has mastered the "see and click" layer but not yet the "plan and coordinate" layer.
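The size of the lead can be read straight off the table. The short sketch below just subtracts each competitor's rate from OpenAI's, in percentage points; the numbers are copied from the table, while the `lead_over_rivals` helper is a name invented for this example.

```python
# Completion rates (%) copied from the table in 3.1.
RATES = {
    "Single-app simple tasks": {"OpenAI Agent": 92, "Manus": 78, "AutoGLM": 72},
    "Single-app complex tasks": {"OpenAI Agent": 75, "Manus": 60, "AutoGLM": 55},
    "Cross-app compound tasks": {"OpenAI Agent": 33, "Manus": 25, "AutoGLM": 20},
}

def lead_over_rivals(row: dict, leader: str = "OpenAI Agent") -> dict:
    """Percentage-point gap between the leader and every other product in a row."""
    return {name: row[leader] - rate for name, rate in row.items() if name != leader}

for task_type, row in RATES.items():
    print(task_type, lead_over_rivals(row))
```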
3.2 Operational Fluidity Comparison
I introduced an "operation step redundancy rate" to measure fluidity:
Redundancy Rate = (Agent's actual operation steps − Human's optimal operation steps) / Human's optimal operation steps
- OpenAI Agent: Redundancy rate approximately 18%. Occasionally takes one extra click or a small detour, but overall smooth.
- Manus: Redundancy rate approximately 35%. Frequently exhibits "click in then back out" ineffective operations, suggesting weaker situational awareness.
- AutoGLM: Redundancy rate approximately 42%. Operational logic is sometimes unclear, with repeated attempts at the same action, indicating limited ability to learn from failed interactions within a single session.
The redundancy rate is a telling metric because it reflects not just efficiency but the depth of the Agent's understanding. A low redundancy rate means the Agent "knows where it's going"—it has formed a mental model of the app's structure and navigates purposefully. A high redundancy rate suggests the Agent is operating more by trial and error, clicking around until something works.
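For readers who want to reproduce the metric, the formula translates directly into a few lines of Python. The step counts in the usage line are hypothetical, chosen only to land near the 18% figure; they are not measurements from the tests.

```python
def redundancy_rate(agent_steps: int, optimal_steps: int) -> float:
    """(agent steps - human-optimal steps) / human-optimal steps."""
    if optimal_steps <= 0:
        raise ValueError("optimal_steps must be positive")
    return (agent_steps - optimal_steps) / optimal_steps

# Hypothetical example: the agent needs 13 steps where a human needs 11.
print(f"{redundancy_rate(13, 11):.0%}")
```

A rate of 0% means the Agent matched the human-optimal path exactly; the metric says nothing about whether the task ultimately succeeded, so it is best read alongside the completion rate.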
3.3 Error Tolerance and Recovery
When encountering abnormal situations (app pop-ups, network delays, interface changes):
- OpenAI Agent: Can identify most pop-ups and click to close them; automatically retries when encountering loading waits. However, it occasionally "freezes" when confronted with entirely new interface layouts it has not seen before.
- Manus: Handles known pop-up types well but easily falls into an infinite loop when encountering unknown pop-ups, repeatedly dismissing and re-encountering the same dialog.
- AutoGLM: Weakest error tolerance; abnormal situations frequently require human intervention to resolve.
Error tolerance is perhaps the most underappreciated dimension of Agent quality. In a controlled demo, everything works. In the real world, apps crash, networks stutter, and UIs change overnight. The difference between a 92% and a 72% single-app completion rate is largely explained by how gracefully the system handles these perturbations.
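The infinite-loop failure described above is the kind of problem a simple repeat counter can guard against. The sketch below is one possible approach, not a description of how any of the three products work: `observe` and `dismiss` are hypothetical callbacks standing in for screen reading and tapping, and the screen is simulated.

```python
from collections import Counter
from typing import Callable, List, Optional

def dismiss_popups(observe: Callable[[], Optional[str]],
                   dismiss: Callable[[str], None],
                   max_repeats: int = 2,
                   max_steps: int = 10) -> str:
    """Dismiss pop-ups until the screen is clear; escalate if one keeps coming back."""
    seen: Counter = Counter()
    for _ in range(max_steps):
        popup = observe()              # pop-up identifier, or None if the screen is clear
        if popup is None:
            return "clear"
        seen[popup] += 1
        if seen[popup] > max_repeats:  # same dialog again: stop and hand over to the user
            return "escalate"
        dismiss(popup)
    return "escalate"

def make_screen(popups: List[str]):
    """Simulated screen: the 'update' dialog ignores dismissal, everything else closes."""
    queue = list(popups)
    def observe() -> Optional[str]:
        return queue[0] if queue else None
    def dismiss(popup: str) -> None:
        if popup != "update":
            queue.pop(0)
    return observe, dismiss

print(dismiss_popups(*make_screen(["promo"])))
print(dismiss_popups(*make_screen(["promo", "update"])))
```

The design choice is to bound both the total number of steps and the number of times any single dialog may be retried, so an unknown pop-up degrades into a request for help rather than a loop.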
3.4 Why OpenAI Leads
OpenAI's advantage does not stem from a single technical breakthrough but from comprehensive leadership across several dimensions:
Superior UI understanding. OpenAI's Agent uses a vision-based UI understanding approach (similar to interpreting a screenshot to understand interface elements) rather than relying on Accessibility Tree parsing. This approach is closer to how humans "look at the screen" and is more adaptable to custom UI components. The Accessibility Tree approach, while more structured, breaks down when apps use custom views that don't properly expose their semantics. Vision-based understanding trades some precision for robustness—a trade-off that pays off in the messy reality of third-party app ecosystems.
Smarter operation strategy. OpenAI's Agent employs an "observe first, act second" strategy—when encountering an uncertain interface, it scrolls and examines before clicking. This reduces the probability of misoperation. It is the digital equivalent of reading the menu before ordering, rather than pointing at random items. This patience may add a few seconds to each step, but it dramatically reduces the need for error recovery, which is far more costly in terms of time and user patience.
Larger context window. A longer context window means the Agent can "remember" more of its operation history, making it less likely to lose context during longer tasks. In practical terms, this means the Agent can maintain coherence across 15–20 step sequences without degrading, while competitors begin losing the thread after 8–10 steps.
4. The Technical Breakthroughs Behind Phone-Use
4.1 From GUI to Action: A Three-Layer Interface Understanding Model
OpenAI's Phone-Use capability is built on a three-layer interface understanding model:
Layer 1: Element Identification. The model must identify every interactive element on the screen—where the buttons are, where the input fields are, whether a list is scrollable. This is the most basic level, analogous to human "seeing." Technically, this involves object detection and layout analysis on the visual input, identifying bounding boxes and element types with high precision.
Layer 2: Semantic Understanding. After identifying elements, the model must understand the meaning of each element—is this button "Submit" or "Cancel," does this list display search results or browsing history? This is the "comprehension" level. It requires the model to map visual patterns to functional semantics, drawing on its vast pre-training knowledge of UI conventions across thousands of apps.
Layer 3: Operation Planning. After understanding the interface, the model must plan an operation sequence—click the search box, input keywords, click the search button, wait for results to load, select the target item. This is the "action" level. It requires not just understanding the current state but projecting forward through a sequence of state transitions, anticipating what each action will produce.
The difficulty is not in each individual layer but in the real-time coordination of all three layers. Mobile interfaces are dynamically changing—every operation changes the interface state, and the model must rapidly complete three-layer understanding on each new screen before deciding the next action. This creates a tight perception-action loop that must execute reliably at 1–3 Hz (one action every 0.3–1 seconds) to match human interaction speed. Any latency in the loop creates a compounding delay that degrades the entire task.
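The three layers and the loop that ties them together can be sketched as a toy simulation. Everything here is an assumption made for illustration: the screens are hard-coded dictionaries rather than screenshots, and `identify`, `understand`, `plan`, and `run` are hypothetical names, not OpenAI's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class Element:
    kind: str   # element type, e.g. "button" or "input"
    label: str  # visible text on the element

# Hypothetical screens and app behavior; a real system would work from screenshots.
SCREENS: Dict[str, List[Element]] = {
    "home": [Element("input", "Search")],
    "results": [Element("button", "Add to cart")],
    "cart": [Element("button", "Place order")],
}
TRANSITIONS: Dict[str, str] = {"Search": "results", "Add to cart": "cart", "Place order": "done"}

def identify(screen: str) -> List[Element]:
    """Layer 1: list the interactive elements on the current screen."""
    return SCREENS[screen]

def understand(elements: List[Element]) -> Dict[str, str]:
    """Layer 2: map each element's label to its role."""
    return {e.label: e.kind for e in elements}

def plan(semantics: Dict[str, str], pending: List[str]) -> Optional[str]:
    """Layer 3: pick the first pending step that is available on this screen."""
    return next((label for label in pending if label in semantics), None)

def run(goal_steps: List[str], start: str = "home", max_steps: int = 10) -> Tuple[str, list]:
    screen, pending, trace = start, list(goal_steps), []
    for _ in range(max_steps):
        if screen == "done" or not pending:
            break
        action = plan(understand(identify(screen)), pending)
        if action is None:
            break  # nothing on screen matches the plan: stop rather than guess
        trace.append((screen, action))
        pending.remove(action)
        screen = TRANSITIONS[action]  # every action changes the screen, so re-perceive
    return screen, trace

print(run(["Search", "Add to cart", "Place order"]))
```

The point of the sketch is the shape of the loop: all three layers rerun after every action, which is why per-step latency compounds over a task.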
4.2 Safety Mechanisms: Knowing When to Let Go
The biggest safety risk of Phone-Use is "the AI doing something it shouldn't." OpenAI has designed multi-layer safety mechanisms to address this:
Sensitive operation confirmation. For irreversible operations involving payment, deletion, or sending messages, the Agent pauses and requests user confirmation. During testing, a confirmation prompt appeared before every payment action. This is a deliberate friction point—sacrificing some automation for safety—a trade-off that is clearly correct at this stage of Agent maturity.
Permission boundaries. The Agent can only operate apps explicitly authorized by the user and cannot modify system settings or access private files. This containment model ensures that even if the Agent's decision-making goes awry, the blast radius is limited to the approved app sandbox.
Operation rollback. For reversible operations, the Agent supports "undo previous action." While not all apps support undo, this mechanism provides a safety net where possible. The rollback capability is implemented at the action-sequence level, meaning the Agent can unwind multiple steps if a later step reveals an earlier error.
Anomaly detection. When the Agent detects that an operation's result deviates from expectations (e.g., an error page appears after placing an order), it automatically stops and notifies the user. This is effectively a runtime assertion system—checking post-conditions after each significant action and halting execution when conditions are violated.
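The four mechanisms compose naturally into a single guarded executor. The following is a minimal sketch under stated assumptions: the `Step` structure, `run_guarded`, and the simulated cart are all hypothetical, and the only details taken from the text are the sensitive categories (payment, deletion, sending messages) and the general behavior of each mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

SENSITIVE_KINDS = {"payment", "delete", "send_message"}  # categories named in the text

@dataclass
class Step:
    app: str
    kind: str                          # e.g. "tap" or "payment"
    do: Callable[[], None]
    undo: Callable[[], None]           # no-op for steps that change nothing
    postcondition: Callable[[], bool]  # expected state after the step

def run_guarded(steps: List[Step], authorized_apps: Set[str],
                confirm: Callable[[Step], bool]) -> str:
    """Run steps in order; return "done" or the reason execution stopped."""
    undo_stack: List[Callable[[], None]] = []
    for step in steps:
        if step.app not in authorized_apps:               # permission boundary
            return "blocked: app not authorized"
        if step.kind in SENSITIVE_KINDS and not confirm(step):  # confirmation gate
            return "stopped: user declined"
        step.do()
        undo_stack.append(step.undo)
        if not step.postcondition():                      # anomaly detection
            while undo_stack:                             # rollback, newest first
                undo_stack.pop()()
            return "halted: unexpected result, rolled back"
    return "done"

def demo(authorized_apps: Set[str], user_confirms: bool,
         order_page_ok: bool = True) -> Tuple[str, List[str]]:
    """Simulated order: add an item, then pay. A failed payment changes nothing."""
    cart: List[str] = []
    steps = [
        Step("Meituan", "tap", lambda: cart.append("burger"),
             lambda: cart.remove("burger"), lambda: "burger" in cart),
        Step("Meituan", "payment", lambda: None, lambda: None, lambda: order_page_ok),
    ]
    return run_guarded(steps, authorized_apps, lambda step: user_confirms), cart

print(demo({"Meituan"}, user_confirms=True))
print(demo({"Meituan"}, user_confirms=False))
print(demo({"Amap"}, user_confirms=True))
print(demo({"Meituan"}, user_confirms=True, order_page_ok=False))
```

In the last call the simulated order page reports an error, so the executor halts and unwinds the earlier cart change. Real payments are not reversible, which is exactly why the confirmation gate sits in front of them.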
5. From "Works" to "Works Well": The Remaining Hurdles
The test results are overall impressive, but I also identified several clear problems.
5.1 Context Forgetting
This is the most prominent issue. During cross-app operations, the Agent frequently "forgets" information it previously obtained. For example, after finding a restaurant on Dianping and switching to Meituan, it sometimes cannot recall which restaurant to search for.
This is not simply a "memory capacity" problem—it is that the Agent's attention mechanism when switching between apps is not mature enough. The current workaround is to repeat key information in the instruction, but this increases the user's burden. Fundamentally, the Agent needs an external working memory module—a persistent scratchpad that survives app switches and maintains structured state across the entire task lifecycle. This is an active area of research in the Agent community, and solutions like retrieval-augmented memory and structured state graphs are being explored.
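As a rough illustration of what a "persistent scratchpad" could look like, the sketch below keeps task facts in a small structure that is serialized and restored around each app switch. This is one possible design, not a description of any shipping Agent; the `TaskMemory` class and the restaurant name are invented for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class TaskMemory:
    """Structured state that outlives any single app interaction."""
    goal: str
    facts: Dict[str, Any] = field(default_factory=dict)

    def remember(self, key: str, value: Any) -> None:
        self.facts[key] = value

    def recall(self, key: str) -> Any:
        return self.facts[key]

    def dump(self) -> str:
        """Serialize so the state can be re-injected after an app switch."""
        return json.dumps({"goal": self.goal, "facts": self.facts}, sort_keys=True)

    @classmethod
    def load(cls, blob: str) -> "TaskMemory":
        data = json.loads(blob)
        return cls(goal=data["goal"], facts=data["facts"])

# In Dianping: record what was found (hypothetical restaurant).
memory = TaskMemory(goal="Arrange tomorrow's lunch")
memory.remember("restaurant", "Sakura Sushi")
memory.remember("rating", 4.6)
saved = memory.dump()

# After switching to Meituan: restore the state and build the next query from it.
restored = TaskMemory.load(saved)
print(f"{restored.recall('restaurant')} group-buy discounts")
```

Keeping the state outside the model's context means an app switch cannot silently drop it; the cost is that the Agent must decide explicitly which facts are worth writing down.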
5.2 Custom UI Adaptation
App UIs vary enormously, especially Chinese super-apps (WeChat, Alipay, Meituan) which feature deep UI hierarchies, numerous custom components, and frequent redesigns. The Agent's adaptation to mainstream apps is acceptable, but success rates drop noticeably with niche apps or new app versions. The vision-based approach helps here—it is more resilient to layout changes than accessibility-tree-based approaches—but it is not a panacea. When an app introduces an entirely new interaction paradigm (e.g., replacing a scroll list with a card carousel), even vision-based models can struggle to generalize from prior experience.
5.3 Network and Performance Dependency
Phone-Use relies on real-time screen capture and remote inference, making it highly sensitive to network latency. On 4G networks, the average response time per step is approximately 1.5 seconds; on Wi-Fi, approximately 0.8 seconds. While this may seem small, a typical task requires 10–20 steps, which adds up to roughly 8–30 seconds of additional waiting (8–16 seconds on Wi-Fi, 15–30 seconds on 4G). This is one reason why on-device Agent execution is such an attractive long-term direction: eliminating the network round-trip would bring per-step latency down to the 100–200 ms range, making the Agent feel far more responsive.
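The per-step figures above multiply out as follows. This is simple arithmetic on the numbers quoted in the text (0.8 s and 1.5 s per step, 10 to 20 steps, and a 100–200 ms on-device target); the `added_wait` helper is just a name for the multiplication.

```python
def added_wait(steps: int, per_step_seconds: float) -> float:
    """Total waiting time accumulated over a task."""
    return steps * per_step_seconds

# (low, high) per-step latency in seconds for each scenario.
SCENARIOS = {
    "Wi-Fi": (0.8, 0.8),
    "4G": (1.5, 1.5),
    "On-device (projected)": (0.1, 0.2),
}

for name, (low, high) in SCENARIOS.items():
    shortest = added_wait(10, low)   # 10-step task at the lower latency
    longest = added_wait(20, high)   # 20-step task at the higher latency
    print(f"{name}: {shortest:.0f}-{longest:.0f} s")
```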
5.4 User Trust
This is the hardest to quantify but potentially the most critical issue. Allowing AI to operate your phone and spend your money requires a considerable degree of trust. Even with OpenAI's multi-layer safety mechanisms, the unease you feel the first time you let AI place an order for you is real and visceral.
Trust takes time to build. My experience suggests a progressive approach: start with low-risk tasks (checking packages, checking weather), gradually transition to medium-risk tasks (ordering food, hailing rides), and only then move to high-risk tasks (booking flights, making transfers). Incremental trust-building is more realistic than a leap of faith. This is not unlike how people gradually increased their trust in online payments—from buying a book on Amazon to paying rent via bank transfer, the progression was organic and earned through repeated positive experiences.
6. The Agent Era: A Paradigm Shift from Q&A to Execution
OpenAI's first Agent product marks a fundamental transition for AI from "Q&A mode" to "execution mode." The impact of this shift extends far beyond the product itself.
6.1 A Revolution in Interaction
Over the past 20 years, human-computer interaction has evolved through "keyboard → touchscreen → voice," but the fundamental dynamic has remained the same: humans actively operate, machines passively respond. Phone-Use breaks this relationship—humans only need to express intent, and machines handle execution.
This is a step change in interaction efficiency: the gain comes not from "operating faster" but from "not needing to operate at all." Consider the implications for accessibility: individuals with motor impairments who struggle with touch interfaces can now delegate the physical interaction to the Agent entirely. The Agent becomes a universal accessibility layer that works across every app, without requiring each developer to implement their own accessibility features.
6.2 New Possibilities for Work Automation
When AI can directly operate apps, the barrier to work automation drops dramatically. No API integration needed, no scripting required—just tell the AI what you want done.
Imagine these scenarios: every morning, AI automatically opens DingTalk, checks unread messages, and sends you a digest; every Friday, AI automatically submits the week's expenses in the reimbursement system; every month, AI automatically downloads bank statements from the banking app and categorizes them. None of this requires any development—just "teach" the AI the operation flow once, and it can execute automatically thereafter. This is RPA (Robotic Process Automation) democratized—no longer the province of enterprise IT departments but available to any individual with a smartphone.
6.3 The Agent Computer: The Ultimate Form of AI Agents
The current mobile Agent is just the beginning. A more complete Agent form would be the Agent Computer—a system that runs continuously, makes autonomous decisions, and operates multiple tools. Not "it moves when you tell it to," but "it proactively monitors and executes on your behalf."
KaiheAiBox is building exactly this kind of Agent Computer: available 24/7, capable not only of operating mobile apps but also of processing emails, managing schedules, and executing data analysis—functioning as a true "digital employee." When Agents evolve from "following instructions" to "autonomous planning," from "single tasks" to "continuous workflows," AI's value will be truly unleashed.
The Agent Computer represents a convergence of several trends: the maturation of Agent frameworks (like LangChain and AutoGen), the proliferation of tool APIs, and the increasing reliability of LLM-based planning and reasoning. The phone-use capability we see today is the visible tip of this iceberg—the same orchestration intelligence, applied across a broader toolset and running persistently rather than on-demand, is what will define the next generation of AI products.
7. Conclusion: The Qualitative Shift Has Arrived, but the Road Is Long
After a week of testing, my core conclusions are:
- OpenAI's Agent can already complete most daily mobile operation tasks, with a single-app simple task completion rate exceeding 90%. This is not a demo—it is a usable product for everyday tasks.
- Cross-app compound tasks are the biggest current shortcoming, with only a 33% completion rate. Context memory is the key bottleneck, and solving it will likely require architectural innovations rather than incremental improvements.
- OpenAI has a clear advantage over Manus and AutoGLM, but the gap at the basic operation level is not enormous. The real differentiator is in error tolerance and recovery capability—the ability to handle the messy, unpredictable real world.
- Safety mechanisms are well-designed, with sensitive operation confirmation, permission boundaries, and anomaly detection all properly implemented. But trust building takes time and repeated positive experiences.
The qualitative shift from AI "answering questions" to "getting things done" has genuinely occurred. This is not a proof of concept; it is a product you can actually use. But a clear gap remains before fully hands-free use, in which AI autonomously handles all daily operations.
If you are interested in AI Agents, my advice is: start using them now, beginning with simple tasks. Technology is advancing quickly, but building usage habits takes time. The earlier you start, the sooner you adapt to this new paradigm of "AI doing things for you."
KaiheAiBox | The Agent Computer for Everyone · AI Agent Tracker