GPT-5.4 Deep Dive — The First General-Purpose Model That Actually Uses Your Computer
OpenAI released GPT-5.4 with 1M token context, native Computer Use achieving 75% on OSWorld (surpassing humans), and a full model family. Complete specs, benchmarks, and competitive analysis.

From "AI That Answers" to "AI That Acts"
On March 5, 2026, OpenAI released GPT-5.4, the first general-purpose AI model with native Computer Use capabilities. Not a separate agent product. Not an experimental beta. Computer Use is built into the model itself, available through both the API and ChatGPT.
GPT-5.4 scores 75.0% on OSWorld-Verified (surpassing human experts at 72.4%), offers a 1-million-token context window (roughly 8x the previous generation's 128K), and reduces factual errors by 33% while using 33% fewer reasoning tokens. With this release, AI crosses the line from "tool that helps you think" to "agent that does things for you."
Background: The Road to Computer Use
| Timeline | Model/Product | What Happened |
|---|---|---|
| October 2024 | Anthropic Claude Computer Use | First Computer Use concept demo (beta) |
| January 2025 | OpenAI Operator | Separate agent product for Computer Use |
| March 2025 | Google Project Mariner | Chrome browser automation agent |
| March 2026 | OpenAI GPT-5.4 | First general-purpose model with native Computer Use |
The key difference: all previous Computer Use implementations were either separate agent products or experimental betas. GPT-5.4 bakes it into the core model, available as a standard feature across API and ChatGPT.
Core Specs
| Metric | GPT-5.4 | GPT-5.2 | Change / Notes |
|---|---|---|---|
| Context Window | 1M tokens (922K in + 128K out) | 128K | ~8x |
| OSWorld-Verified | 75.0% | 47.3% | +27.7pp (human: 72.4%) |
| WebArena-Verified | 67.3% | — | Browser automation |
| GDPVal | 83.0% | — | Human expert level |
| Per-claim error reduction | -33% | baseline | — |
| Reasoning token usage | -33% | baseline | — |
OSWorld-Verified tests whether AI can navigate real desktop environments — opening browsers, filling forms, managing files, switching between applications. Human experts score 72.4%. GPT-5.4 scores 75.0%, crossing the human threshold for the first time.
GDPVal measures performance on economically valuable tasks: emails, spreadsheets, report drafting, data cleaning. At 83.0%, GPT-5.4 performs at roughly human expert level on this kind of work, which is the point where an AI can plausibly generate real economic value.
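The context window split in the specs table (922K input, 128K output) is easy to turn into a pre-flight check. The following is a minimal sketch, assuming the two limits are enforced independently; the function names and the token counts in the usage note are hypothetical and not part of any OpenAI SDK.

```python
# Limits copied from the specs table; treating them as independent
# caps is an assumption made for this sketch.
MAX_INPUT_TOKENS = 922_000
MAX_OUTPUT_TOKENS = 128_000


def fits_context(input_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request stays within both stated limits."""
    if input_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens <= MAX_INPUT_TOKENS
        and requested_output_tokens <= MAX_OUTPUT_TOKENS
    )


def remaining_input(input_tokens: int) -> int:
    """Input tokens left before the stated input limit is reached."""
    return max(MAX_INPUT_TOKENS - input_tokens, 0)
```

For example, `fits_context(900_000, 100_000)` passes, while a 930,000-token prompt would be rejected before any request is sent.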
The Model Family
GPT-5.4 ships as a family, not a single model:
| Model | Key Feature | Target | API Pricing (per 1M tokens) |
|---|---|---|---|
| GPT-5.4 Thinking | Reasoning-first, shows plan before solving | ChatGPT Plus/Team/Pro | Included |
| GPT-5.4 Pro | High-performance + Computer Use | Pro/Enterprise | Included |
| GPT-5.4 (API) | Full capability | Developers | $3 input / $15 output |
| GPT-5.4 mini | Fast coding/reasoning, 2x+ speed | High-volume API | $0.40 / $1.60 |
| GPT-5.4 nano | Ultra-lightweight, edge devices | Mobile/embedded | $0.10 / $0.40 |
The nano pricing is remarkable: $0.10 per million input tokens — cheaper than GPT-3.5 was, with performance exceeding GPT-4 on many benchmarks. Computer Use is no longer a premium-only feature.
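To make the price gap concrete, here is a small cost estimator built on the API prices in the table above. It is a sketch only: the dictionary keys are hypothetical labels chosen for this example, not confirmed API model identifiers, and the estimate ignores any caching or batch discounts.

```python
# USD per 1M tokens as (input, output), copied from the pricing table.
# The keys are illustrative labels, not confirmed API model names.
PRICES = {
    "gpt-5.4": (3.00, 15.00),
    "gpt-5.4-mini": (0.40, 1.60),
    "gpt-5.4-nano": (0.10, 0.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200K input tokens, 20K output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 200_000, 20_000):.3f}")
```

At these list prices, the hypothetical 200K-in / 20K-out request costs about $0.90 on the full model, about $0.11 on mini, and about $0.03 on nano.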
Tool Search and Financial Plugins
GPT-5.4 introduces Tool Search — the model autonomously discovers and selects the right tool from a large set of APIs, plugins, and functions for the current task. Previously, developers had to pre-specify which tools were available.
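The article does not describe how Tool Search works internally, so the following is only a toy illustration of the general idea: picking relevant tools from a registry at request time instead of hard-coding them. The `Tool` class, the `search_tools` function, and the keyword-overlap scoring are all assumptions made for this sketch; a real system would more likely use embeddings or the model's own ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def search_tools(task: str, registry: list[Tool], top_k: int = 2) -> list[Tool]:
    """Rank tools by how many task words appear in their description."""
    task_words = set(task.lower().split())

    def score(tool: Tool) -> int:
        return len(task_words & set(tool.description.lower().split()))

    # sorted() is stable, so ties keep their registry order.
    ranked = sorted(registry, key=score, reverse=True)
    # Drop tools that share no words with the task at all.
    return [tool for tool in ranked[:top_k] if score(tool) > 0]


# Hypothetical registry; the tool names are invented for illustration.
REGISTRY = [
    Tool("send_email", "send an email message to a recipient"),
    Tool("make_chart", "generate a chart from spreadsheet data"),
    Tool("pivot_table", "build a pivot table from spreadsheet data"),
]
```

Calling `search_tools("build a pivot table", REGISTRY, top_k=1)` selects the `pivot_table` tool, while a task that shares no words with any description returns an empty list, which a caller could treat as "no suitable tool found".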
According to VentureBeat, GPT-5.4 also ships with native financial plugins for Microsoft Excel and Google Sheets — analyzing financial data, generating charts, and building pivot tables from natural language instructions.
Competitive Landscape
| Benchmark | GPT-5.4 | Claude 4.6 Opus | Gemini 3.1 Pro |
|---|---|---|---|
| OSWorld-Verified | 75.0% | — | — |
| BrowseComp | 82.7 | 84.0 | 85.9 |
| GDPVal | 83.0% | — | — |
Anthropic pioneered Computer Use with Claude in October 2024. Claude 4.6 Opus scores 84.0 on BrowseComp (vs. GPT-5.4's 82.7), excelling at pixel-level screen manipulation. But GPT-5.4 leads on OSWorld's broader desktop automation tasks.
Google focuses on browser automation through Project Mariner. Gemini 3.1 Pro tops BrowseComp at 85.9 but lags behind in general-purpose desktop Computer Use. Google's advantage is deep Workspace integration.
The Computer Use market is still early. The winning model will be the one that can reliably handle real-world business tasks — not just benchmark scenarios.
OpenAI's Current Position
GPT-5.4 arrives at a pivotal moment for OpenAI. The company's annualized revenue has surpassed $25 billion — the fastest revenue scaling in software history. For comparison, Google took 5 years and Facebook took 7 years to reach similar milestones. OpenAI achieved it in roughly 3.5 years from ChatGPT's November 2022 launch. Reports suggest OpenAI is considering an IPO by late 2026. The company remains unprofitable, but the revenue trajectory is unprecedented.
What This Means for Developers
Agent architectures are changing. The combination of 1M token context and Tool Search shifts the paradigm from "pre-define your tools" to "let the model find what it needs."
RPA is being disrupted. As Computer Use matures, significant portions of the traditional RPA (Robotic Process Automation) market — valued at $13B+ — will shift to AI agents that understand context rather than following rigid scripts.
Security becomes critical. AI controlling actual computer interfaces means prompt injection and misaligned instructions can have real-world consequences. Sandboxing and permission systems will be essential.
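One simple form a permission system can take is a deny-by-default gate that every proposed action must pass before it touches the real interface. The sketch below is an assumption-laden illustration, not a complete sandbox: the `Action` type, the allowed action kinds, and the blocked path prefixes are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy: only these action kinds may run at all.
ALLOWED_ACTIONS = {"click", "type", "screenshot"}
# Hypothetical policy: targets under these paths are always refused.
BLOCKED_PATH_PREFIXES = ("/etc", "/root")


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click", "type"
    target: str  # UI element or file path the action applies to


def is_permitted(action: Action) -> bool:
    """Deny by default: unknown kinds and blocked targets are rejected."""
    if action.kind not in ALLOWED_ACTIONS:
        return False
    return not action.target.startswith(BLOCKED_PATH_PREFIXES)
```

With this policy, `Action("click", "button#submit")` is allowed, while `Action("delete_file", "/tmp/x")` and `Action("type", "/etc/passwd")` are both refused. A check like this limits the damage from a prompt injection but does not replace OS-level isolation.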
Costs keep falling. The mini and nano variants signal that Computer Use is democratizing rapidly. What cost hundreds of dollars per task in 2025 may cost pennies in 2027.
