I Tested OpenAI's First Agent: It Actually Gets Work Done — And Done Well
Summary: OpenAI Codex has evolved from a "code completion tool" into a true AI agent that understands multi-step tasks, operates applications, and collaborates across devices. I tested it for a week. The verdict: it genuinely gets work done — if you know how to use it.
From "Writing Code" to "Getting Things Done": What Changed in Codex
If you still think of Codex as the 2021 model that powered GitHub Copilot, your picture is out of date.
The 2026 Codex is a software engineering agent. It no longer works on the "you type a line, it suggests the next" model. Instead, you give it a task, and it autonomously breaks the task down, executes, verifies, and iterates until the work is complete.
What can it actually do? OpenAI positions it as an "AI software engineer":
- Understands the entire codebase context, not just the current file
- Edits multiple files in one pass — handles changes spanning a dozen files
- Runs tests, reads error output, and fixes bugs autonomously in a closed loop
- Generates Pull Requests with commit messages written for you
- Handles multiple tasks in parallel, each in its own sandbox environment
In essence, it evolved from a "completion tool" into a "colleague who can independently finish development tasks."
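The closed loop in the list above (run the tests, read the error output, fix, repeat) can be sketched roughly as follows. This is an illustration of the general pattern only, not Codex's actual implementation: `run_tests`, `fix_until_green`, and the `propose_fix` callback are hypothetical names, and the callback stands in for whatever the agent does to edit the code.

```python
import subprocess
from typing import Callable


def run_tests(cmd: list[str]) -> tuple[bool, str]:
    """Run a test command and return (passed, combined stdout and stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def fix_until_green(
    cmd: list[str],
    propose_fix: Callable[[str], None],  # hypothetical: edits code based on the error output
    max_rounds: int = 5,
) -> bool:
    """Re-run the tests, handing each failure's output to propose_fix.

    Returns True as soon as the tests pass, or False once max_rounds
    fix attempts have been made and the tests still fail.
    """
    for attempt in range(max_rounds + 1):
        passed, output = run_tests(cmd)
        if passed:
            return True
        if attempt < max_rounds:
            propose_fix(output)
    return False
```

The cap on rounds is the important design choice: without it, an agent that cannot fix a failure would loop indefinitely and keep consuming tokens.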
What I Actually Tested

I put Codex through three real-world scenarios:
Scenario 1: Batch Image Renaming Script (Entry Level)
I said one sentence: "Write a Python script to batch rename images in a folder, sorted by creation date, supporting JPG/PNG/WebP formats."
Codex produced a complete Python script that used PIL for EXIF date extraction and argparse for CLI handling, with exception handling included. On the first run I hit two small bugs (import path issues), which Codex corrected itself in one iteration. Total time: about 15 minutes.
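For a sense of what such a script looks like, here is a minimal sketch of the Scenario 1 task. It is not the script Codex generated. To stay within the standard library it uses the file's modification time as a stand-in for the creation date instead of reading EXIF data through PIL, and the names `plan_renames`, `apply_renames`, and the `img` prefix are my own assumptions.

```python
import argparse
from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".webp"}


def plan_renames(folder: Path, prefix: str = "img") -> list[tuple[Path, Path]]:
    """Return (old, new) path pairs, ordered by file timestamp, oldest first."""
    images = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    ]
    # Assumption: st_mtime stands in for the creation date (no EXIF lookup).
    # The file name breaks ties so the order is deterministic.
    images.sort(key=lambda p: (p.stat().st_mtime, p.name))
    width = max(3, len(str(len(images))))
    return [
        (p, folder / f"{prefix}_{i:0{width}d}{p.suffix.lower()}")
        for i, p in enumerate(images, start=1)
    ]


def apply_renames(pairs: list[tuple[Path, Path]]) -> None:
    """Rename in two passes so a new name never collides with a pending source."""
    staged = []
    for old, new in pairs:
        tmp = old.with_name(old.name + ".renaming")
        old.rename(tmp)
        staged.append((tmp, new))
    for tmp, new in staged:
        tmp.rename(new)


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Batch rename images by date.")
    parser.add_argument("folder", type=Path)
    parser.add_argument("--prefix", default="img")
    parser.add_argument("--dry-run", action="store_true",
                        help="print the planned renames without touching files")
    args = parser.parse_args(argv)

    pairs = plan_renames(args.folder, args.prefix)
    for old, new in pairs:
        print(f"{old.name} -> {new.name}")
    if not args.dry_run:
        apply_renames(pairs)
```

Calling `main(["./covers", "--dry-run"])` (the folder name is only an example) prints the planned renames without changing anything, which is a sensible first step before letting any generated script loose on real files.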
Scenario 2: React Task Management Page (Advanced)
Requirement: a task management frontend with CRUD, drag-and-drop sorting, and data persistence.
Codex's solution: react-beautiful-dnd for drag-and-drop, localStorage for persistence, and useReducer for state management. The code quality was on par with a solid mid-level frontend engineer: the architecture was reasonable and edge-case handling was adequate. I added some style tweaks and put it into production.
Scenario 3: Letting Codex Operate My Mac
This was the most impressive part. After the May 2026 update, Codex can operate Mac applications without hogging the mouse cursor. I had it batch-process cover images in Photoshop in the background while I continued browsing — both operations ran without interference.
Previously, getting AI to operate your computer meant either using tools like OpenClaw (which required IM integration) or surrendering mouse control entirely. Codex now achieves "background execution" — a qualitative leap.
Four Usage Modes, Covering Different Scenarios
Codex now ships in four forms, each suited to different users:
| Mode | Target User | Use Case |
|---|---|---|
| CLI | Command-line users | Local development, automation scripts |
| App (macOS) | Mac users | Multi-agent orchestration, visual monitoring |
| Web | Casual / remote | Travel, device switching, quick validation |
| IDE Plugin | VS Code users | In-editor direct invocation |
Mobile control is another highlight. The Codex entry point in the ChatGPT iOS and Android apps lets you send instructions from your phone while Codex on your home desktop responds in real time. Thought of a bug fix while commuting? Pull out your phone and assign the task, with no remote desktop and no file transfers. Instructions and results close the loop entirely within Codex.
What Are Its Limitations?
Several hard constraints became apparent during testing:
High token consumption. This is especially true of the Chronicle feature (screen memory), which runs background agents to capture and process screen content, so tokens burn quickly. For individual users, this may be a larger hidden cost than the subscription fee.
Free tier is thin. The free Codex quota is tight for its intended use cases, and heavy users hit the limit quickly. This is why many people use Cursor or Windsurf as alternatives.
Context understanding still has boundaries. When tasks involve highly specific business logic or internal company frameworks, Codex needs substantial guidance. It's not a "say one sentence and it reads your entire three-year codebase" magic tool.
Security considerations. Chronicle requires screen captures for OCR and memory extraction. Although OpenAI claims raw screenshots auto-delete after 6 hours and aren't used for cloud training, this remains a risk factor requiring evaluation for sensitive data scenarios.
How Does It Compare to Competitors?
Placing Codex in the 2026 AI coding tool landscape:
- vs. Cursor: Cursor is more like a "super editor"; Codex is more like a "remote colleague." The former assists beside you; the latter takes the task and works on it independently on the side.
- vs. OpenClaw: OpenClaw is a general-purpose agent (can operate the entire system); Codex focuses on software engineering scenarios. The former is broader; the latter is more specialized.
- vs. Claude Code: Similar positioning, but Codex has OpenAI's official backing and GPT-5.5's model capabilities, which in my view give it a slight edge in code understanding and multi-file coordination.
The Verdict: Can It Actually Get Work Done?
Yes. But with a caveat — you need to know how to use it.
Codex isn't a tool where "you say one thing and everything is solved." It needs a human partner who can break down tasks, review outputs, and make judgment calls at key decision points. The optimal usage pattern isn't "throwing all the work at it," but rather "handing off the repetitive, logically clear parts to it while you focus on architectural decisions and critical logic."
For developers, not using an AI coding assistant in 2026 is like not using Git in 2016 — you can get work done, but your efficiency is an order of magnitude behind.
KaiheAiBox · The Agent Computer for Everyone · AI Agent Zone