GPT-5.4 hits OSWorld-V 75% — autonomy goes mainstream
OpenAI unveiled GPT-5.4 with a 1M-token context and multi-step autonomous workflows. New SOTA on OSWorld-V at 75%.

75%
OSWorld-V at 75%. That's the headline number for GPT-5.4. OSWorld-V scores models on multi-step tasks in a real desktop environment: opening files, editing, saving, and switching apps. The prior generation (GPT-5 baseline) sat near 51%, and the previous SOTA, Claude Sonnet 4.5, was 65%.
This release isn't about "longer context." It's about execution.
OpenAI's autonomy bet
OpenAI consolidated its model lineup at 5.0 in Q1 2025, then iterated through 5.x. 5.1 and 5.2 hardened multimodal alignment; 5.3 improved tool-call accuracy; 5.4 targets autonomous workflows.
Sam Altman has repeated a line for months: "the next leap is from answers to actions." The 75% OSWorld-V score is the first hard measurement of that thesis.
Jakub Pachocki, now central to architecture decisions after Mira Murati's departure, has emphasized that 5.4's training recipe elevates tool-use traces to a primary signal — a point Greg Brockman reinforced in a recent interview.
[IMG#1]
The spec sheet
| Spec | GPT-5.4 | GPT-5 | Gemini 3.1 Ultra | Claude 4.5 Opus |
|---|---|---|---|---|
| Context (tokens) | 1,000,000 | 256,000 | 2,000,000 | 500,000 |
| OSWorld-V | 75% | 51% | not disclosed | 65% |
| SWE-bench Verified | 71% | 64% | 68% | 70% |
| Multi-step autonomy | ✅ | partial | ✅ | ✅ |
| Input price ($ per 1M tokens) | $5.00 | $5.00 | $1.25 | $15.00 |
| Output price ($ per 1M tokens) | $15.00 | $15.00 | $5.00 | $75.00 |
OpenAI retakes the top slot on agent benchmarks. Pricing held flat, but Gemini's $1.25 input price is a quarter of GPT-5.4's $5.00, and its $5.00 output price is a third of GPT-5.4's $15.00. The new shape: OpenAI sells capability, Google sells price.
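To see what the price gap means per job, here is a minimal sketch that applies the per-million-token prices from the spec table to a single agent run. The workload size (200k input tokens, 20k output tokens) and the `job_cost` helper are illustrative assumptions, not figures from any vendor.

```python
# Per-1M-token prices in USD, copied from the spec table: (input, output).
PRICES = {
    "GPT-5.4": (5.00, 15.00),
    "GPT-5": (5.00, 15.00),
    "Gemini 3.1 Ultra": (1.25, 5.00),
    "Claude 4.5 Opus": (15.00, 75.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at the listed per-1M-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens and 20k output tokens per run.
    for model in PRICES:
        print(f"{model}: ${job_cost(model, 200_000, 20_000):.2f}")
```

On that assumed workload GPT-5.4 comes to $1.30 per run and Gemini 3.1 Ultra to $0.35, a gap of roughly 3.7x once output tokens are included.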
What "multi-step autonomy" actually means
Imagine: open five files, compare three, log results to Notion, then ping Slack — all from one prompt. GPT-5.4 demos completed that flow in 4-7 tool calls and 2-4 app switches.
The crucial advance is error recovery. When a tool call fails or an app stalls, the model now backs off and retries cleanly. The previous generation either froze on first failure or fell into retry loops.
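The back-off-and-retry behavior described above resembles a pattern builders often implement around tool calls themselves. The sketch below is a generic bounded exponential backoff, not OpenAI's implementation; `ToolCallError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ToolCallError(Exception):
    """Hypothetical error raised when a tool call fails or an app stalls."""


def call_with_backoff(
    tool: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run tool(), retrying on ToolCallError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolCallError:
            if attempt == max_attempts:
                raise  # bounded: give up instead of looping forever
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The attempt cap addresses the retry-loop failure mode, and the growing delay gives a stalled app time to recover. The `sleep` parameter is injectable so the function can be tested without real waiting.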
Who wins
OpenAI — agent-bench leadership recovered, but pricing pressure from Google forces the "premium for capability" framing.
Enterprise automation — UiPath, Workato, Zapier and similar players have a viable backend; "agent RPA" cements as a category within twelve months.
[IMG#2]
Past benchmark curves
OSWorld was introduced by Tianbao Xie et al. (2024). At launch, GPT-4 scored 12% and Claude 3 scored 14%. One year later: 65%. Eight months after that: 75%.
The arc is familiar. SWE-bench shows the same shape: Devin debuted at 13.86% in early 2024, and frontier models now score around 70%. The "1.5-2 years from launch to 60-75%" curve is now the norm.
Counter-moves
Google — Gemini 3.1 Ultra leans on 2M context plus code execution. Notably, Google has not yet published an OSWorld-V score.
Anthropic — Claude Sonnet 4.6 emphasizes coding/tool-use accuracy. SWE-bench gap to GPT-5.4 has narrowed to ~1pp, but OSWorld-V trails by 10pp.
Meta — Llama 5 is rumored to push "open-weight autonomous agents," with self-hosting as the differentiator.
Stakes
- Wins: OpenAI — agent-bench top spot, restored leverage on Enterprise renewals.
- Wins: Automation SaaS — viable model backbone for RPA-style use cases.
- Loses: Thin LLM wrappers — autonomous execution as default erodes wrapper differentiation.
- Watching: Regulators — autonomy editing files/emails creates ambiguous GDPR/SOC2 boundaries.
- Watching: Internal IT — RBAC controls must catch up to autonomous execution.
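The RBAC point above can be made concrete with a deny-by-default check that runs before an agent executes any action. This is a minimal sketch under stated assumptions: the role names, action names, and the `is_allowed` and `split_plan` helpers are all hypothetical, and a real deployment would load permissions from its own policy system.

```python
from dataclasses import dataclass

# Hypothetical role-to-action mapping, hard-coded here for illustration only.
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "reader": frozenset({"read_file"}),
    "editor": frozenset({"read_file", "write_file"}),
    "admin": frozenset({"read_file", "write_file", "delete_file", "send_email"}),
}


@dataclass(frozen=True)
class AgentAction:
    name: str    # e.g. "delete_file"
    target: str  # e.g. a file path or an email recipient


def is_allowed(role: str, action: AgentAction) -> bool:
    """Deny by default: unknown roles and unlisted actions are rejected."""
    return action.name in ROLE_PERMISSIONS.get(role, frozenset())


def split_plan(
    role: str, plan: list[AgentAction]
) -> tuple[list[AgentAction], list[AgentAction]]:
    """Partition a planned action list into (allowed, blocked) for the role."""
    allowed = [a for a in plan if is_allowed(role, a)]
    blocked = [a for a in plan if not is_allowed(role, a)]
    return allowed, blocked
```

Checking the whole plan before execution means a blocked destructive step can be surfaced for human review before any earlier step has run.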
Skeptical view
Andrej Karpathy: "OSWorld-V is curated; production task distributions differ. 75% on the bench isn't 75% in your stack."
Yann LeCun (Meta): "Track whether hallucination and tool-misuse rates rise alongside benchmark scores — autonomy turns hallucination from 'wrong text' into 'wrong files deleted.'"
What changes for you
For builders: the GPT-5.4 Tools API elevates multi-step autonomy to a first-class feature. Default to session-based multi-step flows rather than a single LLM call.
For founders: RPA and automation SaaS lose entry-barrier moats; differentiation now lives in domain data, policy, and integrations.
For investors: Microsoft's Q2 results will surface ChatGPT Enterprise renewals and Azure AI workload revenue, the cleanest public signal.
For end users: ChatGPT's Tasks feature gets a deeper autonomy upgrade. Try it on recurring weekly workflows.
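For the builders' note above, the difference between a single call and a session is that each decision sees the results of earlier steps. The sketch below shows that shape only. It does not use the GPT-5.4 Tools API; `Session`, `Step`, `run_session`, and the `decide` callable standing in for the model are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    tool: str      # key into the tools mapping
    argument: str  # single string argument, kept simple for the sketch


@dataclass
class Session:
    goal: str
    history: list[tuple[Step, str]] = field(default_factory=list)


# decide(session) stands in for a model call; it returns None when the goal is met.
Decide = Callable[[Session], Optional[Step]]


def run_session(
    goal: str,
    decide: Decide,
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> Session:
    """Loop: ask for the next step, run it, record the result, repeat."""
    session = Session(goal)
    for _ in range(max_steps):  # hard cap so a confused planner cannot run forever
        step = decide(session)
        if step is None:
            break
        result = tools[step.tool](step.argument)
        session.history.append((step, result))
    return session
```

Because the history travels with the session, the planner can react to a failed or surprising result on the next turn, which a single prompt-and-response call cannot do.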
3-Line Summary
- GPT-5.4 sets a new OSWorld-V SOTA at 75%, anchored on multi-step autonomous execution.
- Pricing held flat; Gemini's $1.25 input price is a quarter of GPT-5.4's $5.00, framing the contest as capability vs. price.
- Automation/RPA category cements; enterprise RBAC and regulation become the next constraint.
References
- OpenAI — GPT-5.4 announcement
- OSWorld benchmark — official site
- TechCrunch — GPT-5.4 hands-on
- Bloomberg — OpenAI revenue update
- Andrej Karpathy — benchmark commentary
