Baidu Wenxin Agent 2.0 Hands-On: Can AI Really Book Restaurants by Phone or Is It Just Hype?

Published on: 2026-05-26


Abstract: Baidu reports nearly 4x growth in users of the AI phone call feature in Wenxin Assistant, along with 150,000 enterprise developers on the Wenxin Agent Platform and a 16x increase in distribution volume. For the first time, an AI Agent is delivering a genuinely usable product in the high-frequency, real-world scenario of making phone calls to book restaurants and check on orders. But real-world testing reveals a more complicated picture: AI phone calls work well in standard cases, but they are far from "safe to fully delegate."


I asked Baidu Wenxin Assistant to book a table for six at a Sichuan restaurant near my office for 7 PM. Here is what happened:

The Agent searched for highly rated Sichuan restaurants nearby, shortlisted three that met the criteria, and called the first one. After the call connected, the AI explained the booking request in natural speech. The restaurant confirmed availability. The Agent completed the booking. It then pushed the confirmation to my phone.

Total elapsed time: 2 minutes and 17 seconds. I did nothing.

The experience left me both excited and uneasy. Excited because this is genuinely the first time an AI Agent has delivered a usable product in the "phone call" scenario, one of the most common, most human, and most technically challenging real-world interactions. Uneasy because I kept asking what would have gone wrong on a harder call. What if the person on the other end had said "we don't have a table for six, but we have a table for five, would that work?" What if no one had answered? What if the person had a heavy accent?

This is the state of Baidu Wenxin Agent 2.0: impressive in standard scenarios, but still requiring a human safety net in non-standard ones.

AI Phone Calls: The Gap Between Demo and Product

Baidu's published metrics are eye-catching: nearly 4x growth in users of the AI phone call feature on Wenxin Assistant. Growth on that scale points to real demand: people are not trying the feature out of curiosity, they are coming back because they want to use it.

The technical stack behind AI phone calls includes:

  • Automatic Speech Recognition (ASR): Transcribing the other party's speech to text in real time
  • Dialogue Management (DM): Deciding what to say next based on conversation context
  • Text-to-Speech (TTS): Synthesizing the Agent's response into natural-sounding speech
  • Intent Recognition: Determining whether the other party agreed, refused, or proposed a condition
  • Exception Handling: No answer, busy signal, request to transfer to a human agent
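To make the division of labor concrete, here is a minimal sketch of how those five pieces could fit into a single conversational turn. It is a toy illustration, not Baidu's implementation: every function and class name is hypothetical, the ASR and TTS stages are stubs that pass text through, and intent recognition is reduced to keyword matching.

```python
from dataclasses import dataclass, field

# Hypothetical keyword lists; a real system would use a trained intent model.
TRANSFER_CUES = ("human", "manager")
REFUSE_CUES = ("sorry", "fully booked", "unavailable", "not available")
AGREE_CUES = ("yes", "available", "sure")


@dataclass
class CallState:
    """Conversation context carried across turns."""
    request: str
    history: list = field(default_factory=list)
    outcome: str = "pending"  # pending | booked | declined | handoff


def asr(audio: str) -> str:
    # Stub: a real ASR engine would turn audio frames into text.
    return audio.strip()


def recognize_intent(text: str) -> str:
    # Refusals are checked before agreements so "not available" is not a yes.
    lowered = text.lower()
    if any(cue in lowered for cue in TRANSFER_CUES):
        return "transfer"
    if any(cue in lowered for cue in REFUSE_CUES):
        return "refuse"
    if any(cue in lowered for cue in AGREE_CUES):
        return "agree"
    return "unclear"


def decide_reply(state: CallState, intent: str) -> str:
    # Dialogue management: pick the next utterance and update the outcome.
    if intent == "agree":
        state.outcome = "booked"
        return "Great, please hold the table. Thank you."
    if intent == "refuse":
        state.outcome = "declined"
        return "Understood, thank you for checking."
    if intent == "transfer":
        state.outcome = "handoff"  # exception path: bring in the user
        return "One moment, I will connect you to my user."
    return f"Sorry, could you confirm: {state.request}?"


def tts(text: str) -> str:
    # Stub: a real TTS engine would return synthesized audio.
    return f"<speech>{text}</speech>"


def handle_turn(state: CallState, audio: str) -> str:
    heard = asr(audio)
    intent = recognize_intent(heard)
    reply = decide_reply(state, intent)
    state.history.append((heard, intent, reply))
    return tts(reply)
```

Calling `handle_turn(CallState(request="a table for six at 7 PM"), "Yes, we have a table available.")` sets the outcome to `booked`. Anything the keyword matcher cannot place leaves the outcome at `pending` and triggers a clarifying question.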

In standard workflows (place call → state request → confirm → complete), this technical stack performs quite smoothly. To my ear, Baidu's TTS quality in Chinese scenarios is now virtually indistinguishable from a human speaker, and the dialogue rhythm is well-calibrated.

But "standard workflows" do not cover all scenarios. Typical problems encountered in testing include:

Poor handling of ambiguous responses. When a restaurant says "we probably have a table, call back to confirm before you come," the Agent cannot tell whether this counts as booked or not booked.
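One way to handle this is to stop forcing every reply into booked or not booked and add a third, tentative state that triggers a follow-up. The sketch below is my own illustration of that idea, not a description of how Wenxin Agent works. The cue phrases and action names are hypothetical, and a production system would use a trained classifier rather than substring matching.

```python
from enum import Enum


class BookingStatus(Enum):
    CONFIRMED = "confirmed"
    TENTATIVE = "tentative"  # a soft yes that still needs a follow-up
    DECLINED = "declined"


# Hypothetical cue phrases, chosen only for illustration.
HEDGE_CUES = ("probably", "should be", "call back", "not sure", "might")
REFUSAL_CUES = ("fully booked", "no table", "sorry")
CONFIRM_CUES = ("reserved", "booked", "confirmed", "see you")


def classify_reply(text: str) -> BookingStatus:
    lowered = text.lower()
    # Hedges are checked first: a "probably yes" is not a yes.
    if any(cue in lowered for cue in HEDGE_CUES):
        return BookingStatus.TENTATIVE
    if any(cue in lowered for cue in REFUSAL_CUES):
        return BookingStatus.DECLINED
    if any(cue in lowered for cue in CONFIRM_CUES):
        return BookingStatus.CONFIRMED
    # Without an explicit confirmation, do not claim the table is booked.
    return BookingStatus.TENTATIVE


def next_action(status: BookingStatus) -> str:
    return {
        BookingStatus.CONFIRMED: "notify_user_booked",
        BookingStatus.TENTATIVE: "schedule_callback_and_warn_user",
        BookingStatus.DECLINED: "try_next_restaurant",
    }[status]
```

The design choice that matters is the default: when nothing in the reply clearly confirms the booking, the function falls back to tentative instead of assuming success.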

Weak multi-conditional negotiation. "No table for six, but we have a table for five, or an eight-person private room." This requires the Agent to understand three options and make a choice. Current performance on this is inconsistent.
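Once the counter-offers have been parsed, the choice itself can be fairly mechanical if the user's constraints are known up front. The following sketch assumes the offers are already extracted into structured form, which is the hard part in practice. The `Offer` type, the fee field, and the selection rule are all my own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    label: str
    seats: int
    extra_fee: float = 0.0  # e.g. a private-room surcharge (hypothetical)


def choose_offer(offers, party_size: int, max_extra_fee: float) -> Optional[Offer]:
    """Pick the smallest offer that seats everyone within the fee limit."""
    viable = [
        o for o in offers
        if o.seats >= party_size and o.extra_fee <= max_extra_fee
    ]
    if not viable:
        return None  # nothing fits: ask the user instead of guessing
    # Prefer the tightest fit, then the cheaper option.
    return min(viable, key=lambda o: (o.seats, o.extra_fee))
```

With a table for five and an eight-person private room on offer, a party of six gets the private room if its surcharge is within budget. Otherwise the function returns `None`, which is the signal to hand the decision back to the user.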

Dialects and accents. In non-Mandarin scenarios, ASR accuracy drops noticeably, causing subsequent dialogue to diverge from expectations.
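A common mitigation is to gate the dialogue on transcription confidence so that a shaky transcript never drives a decision. This sketch assumes the ASR engine reports a confidence score between 0 and 1. The thresholds and action names are illustrative placeholders, not measured values.

```python
from dataclasses import dataclass


@dataclass
class AsrResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical ASR engine


def route_transcript(result: AsrResult,
                     accept_at: float = 0.85,
                     retry_at: float = 0.60) -> str:
    """Decide what to do with a transcript based on its confidence."""
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if result.confidence >= accept_at:
        return "continue"          # trust the transcript
    if result.confidence >= retry_at:
        return "ask_to_repeat"     # politely ask the speaker to say it again
    return "handoff_to_human"      # too unreliable to act on
```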

The technical bottleneck in AI phone calls is not "speaking" but "listening." Generating a natural-sounding sentence is largely a solved problem. Understanding an ambiguous response is not: a high school student still handles it more reliably than the AI does.

The Wenxin Agent Platform: An Ecosystem of 150,000 Enterprises

The AI phone call feature is only the tip of the iceberg for the Wenxin Agent Platform. As of May 2026, the platform has aggregated 150,000 enterprise developers, with distribution volume growing 16x year-over-year.

The strategic logic here is important: Baidu is not just building a "help you make phone calls" feature. It is building an Agent ecosystem where anyone can create their own Agent, plug into Baidu's capabilities (search, maps, voice, etc.), and deliver services to users.

The platform recently integrated DeepSeek models, which means developers can choose between Baidu's own Wenxin models and DeepSeek as the underlying inference engine. This multi-model openness is a genuine plus for developers who want to optimize for cost, latency, or capability depending on their use case.
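In practice, that choice often comes down to a simple rule: among the models that are capable and fast enough, take the cheapest. The sketch below shows such a rule. The profile fields, the model names, and every number in the catalog are made-up placeholders, not published figures for Wenxin or DeepSeek models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float
    latency_ms: int
    capability: int  # 1 (basic) to 5 (strongest), an arbitrary scale


# Placeholder catalog; all values are invented for illustration.
CATALOG = [
    ModelProfile("model-a", cost_per_1k_tokens=0.8, latency_ms=300, capability=5),
    ModelProfile("model-b", cost_per_1k_tokens=0.2, latency_ms=700, capability=4),
]


def pick_model(profiles, min_capability: int, max_latency_ms: int) -> ModelProfile:
    """Return the cheapest model that meets the capability and latency limits."""
    eligible = [
        p for p in profiles
        if p.capability >= min_capability and p.latency_ms <= max_latency_ms
    ]
    if not eligible:
        raise LookupError("no model satisfies the constraints")
    return min(eligible, key=lambda p: p.cost_per_1k_tokens)
```

A latency-sensitive task such as a live phone call would pass a tight `max_latency_ms`, while a batch job could relax it and get the cheaper model.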

Comparison: Wenxin Agent vs. KAIHE AI Box A1 + OpenClaw

Wenxin Agent 2.0 and the KAIHE AI Box A1 running OpenClaw represent two distinct philosophies of Agent product design.

Wenxin Agent: Cloud + Vertical Scenarios. All inference happens on Baidu's cloud. The advantages are strong model capabilities and deep integration with Baidu's ecosystem (search, maps, voice). The disadvantages are cloud dependency, token costs, and data privacy considerations.

KAIHE AI Box A1 + OpenClaw: Local + General-Purpose Automation. Inference happens locally on the device. The advantages are zero token costs, data never leaving the device, and 24/7 stable operation. The disadvantages are that local model capabilities have a ceiling and that cloud-native services like Baidu Maps cannot be accessed directly.

The two approaches are not in conflict. They are complementary. A sensible Agent architecture might be:

  • KAIHE AI Box A1 handles daily local automation (file organization, data processing, scheduled tasks)
  • When phone calls, search, or maps capabilities are needed, call the Wenxin Agent API
  • Local OpenClaw acts as the orchestration layer, deciding when to invoke which Agent

This "local orchestration + cloud capabilities" hybrid architecture balances privacy, cost, and capability. It is the pragmatic direction for Agent deployment.
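The routing logic in such a setup can be sketched in a few lines. Everything here is hypothetical: the task names, the field allow-list, and `call_cloud_agent`, which stands in for a real HTTP request to a cloud Agent API and is not an actual Wenxin or OpenClaw interface.

```python
# Task types grouped by where they run (illustrative names only).
LOCAL_TASKS = {"file_organization", "data_processing", "scheduled_task"}
CLOUD_TASKS = {"phone_call", "search", "maps"}

# Only these payload fields are allowed to leave the device.
CLOUD_ALLOWED_FIELDS = {"party_size", "time", "cuisine", "query"}


def run_locally(task: str, payload: dict) -> dict:
    # Placeholder for work done by the on-device agent.
    return {"handled_by": "local", "task": task, "payload": payload}


def call_cloud_agent(task: str, payload: dict) -> dict:
    # Placeholder for a request to a cloud Agent API (hypothetical).
    return {"handled_by": "cloud", "task": task, "payload": payload}


def dispatch(task: str, payload: dict) -> dict:
    """Route a task to the local agent or the cloud agent."""
    if task in LOCAL_TASKS:
        return run_locally(task, payload)
    if task in CLOUD_TASKS:
        # Strip everything not on the allow-list before it leaves the device.
        safe = {k: v for k, v in payload.items() if k in CLOUD_ALLOWED_FIELDS}
        return call_cloud_agent(task, safe)
    raise ValueError(f"unknown task type: {task}")
```

The allow-list is where the privacy trade-off shows up: the cloud agent gets what it needs to make the call, and anything else stays on the box.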

The Next Phase of AI Phone Calls

Baidu's investment in AI phone call capabilities will not stop here. Judging by how products like this usually evolve, the next phase is likely to focus on three areas:

Improved exception handling. Standard workflows are already good enough. The real differentiation will come from handling non-standard scenarios. This requires more sophisticated dialogue policy models and more granular intent recognition.

Agent-to-Agent communication. As more businesses deploy AI phone agents to handle incoming calls, the "human on both ends" assumption breaks down. When an AI calls a restaurant and an AI answers, the efficiency and accuracy of AI-to-AI dialogue could be substantially higher than AI-to-human. This requires standardized communication protocols.
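To show why structured exchange could beat speech in that situation, here is a sketch of two agents trading JSON messages instead of sentences. No such standard is cited in this article. The message types, field names, and version number are invented for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BookingRequest:
    party_size: int
    time: str  # ISO 8601 local time, e.g. "2026-05-26T19:00"


def encode(req: BookingRequest) -> str:
    """Caller side: serialize the request as a versioned JSON message."""
    message = {"type": "booking_request", "version": 1, **asdict(req)}
    return json.dumps(message, sort_keys=True)


def answer(raw: str, free_tables: list) -> str:
    """Restaurant side: reply with a confirmation or the alternatives."""
    msg = json.loads(raw)
    if msg.get("type") != "booking_request":
        return json.dumps({"type": "error", "reason": "unsupported message"})
    # Smallest free table that seats the whole party, if any.
    fits = sorted(s for s in free_tables if s >= msg["party_size"])
    if fits:
        return json.dumps(
            {"type": "booking_confirmed", "seats": fits[0], "time": msg["time"]}
        )
    return json.dumps(
        {"type": "booking_declined", "alternatives": sorted(free_tables)}
    )
```

With typed messages there is no "probably" to interpret: the reply is a confirmation, a refusal with alternatives, or an error.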

Multimodal understanding. Phone calls are only the beginning. For video calls and face-to-face interactions, Agents need to understand not just speech but visual information as well.

Baidu Wenxin Agent 2.0 supports a clear conclusion: AI phone calls are not a gimmick. But they are also not the destination. They are a critical interface for Agents to move from the digital world to the physical world. That interface is still rough today, but the direction is correct.

When AI starts making phone calls for you (booking restaurants, checking orders, handling customer service), it ceases to be a "chat tool" and becomes an "action agent." The significance of this transition is far greater than any incremental increase in model parameter count.

