GPT-5.4 hits OSWorld-V 75% — autonomy goes mainstream

75%

OSWorld-V at 75%. That's the headline number for GPT-5.4. OSWorld-V scores models on multi-step tasks in a real desktop environment: opening files, editing, saving, switching apps. The prior generation (GPT-5 baseline) sat near 51%, and the previous SOTA, Claude Sonnet 4.5, was 65%.

This release isn't about "longer context." It's about execution.

OpenAI's autonomy bet

OpenAI consolidated its model lineup at 5.0 in Q1 2025, then iterated through 5.x. 5.1 and 5.2 hardened multimodal alignment; 5.3 improved tool-call accuracy; 5.4 targets autonomous workflows.

Sam Altman has repeated a line for months: "the next leap is from answers to actions." A 75% score on OSWorld-V is the first hard measurement of that thesis.

Jakub Pachocki, now central to architecture decisions after Mira Murati's departure, has emphasized that 5.4's training recipe elevates tool-use traces to a primary signal — a point Greg Brockman reinforced in a recent interview.

[IMG#1]

The spec sheet

Spec	GPT-5.4	GPT-5	Gemini 3.1 Ultra	Claude 4.5 Opus
Context (tokens)	1,000,000	256,000	2,000,000	500,000
OSWorld-V	75%	51%	not disclosed	65%
SWE-bench Verified	71%	64%	68%	70%
Multi-step autonomy	✅	partial	✅	✅
Input price ($/1M tokens)	$5.00	$5.00	$1.25	$15.00
Output price ($/1M tokens)	$15.00	$15.00	$5.00	$75.00

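OpenAI retakes the top slot on agent benchmarks. Pricing held flat at $5.00 input and $15.00 output per 1M tokens, but Gemini's $1.25 input price is a quarter of that (4x cheaper) and its $5.00 output price is a third. The new shape: OpenAI sells capability, Google sells price.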
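To make the price gap concrete, here is a minimal sketch that turns the per-1M-token prices from the spec sheet into a per-run cost. The workload size (200,000 input tokens, 20,000 output tokens) is a made-up assumption for illustration, not a figure from any vendor, and `run_cost` is a hypothetical helper name.

```python
# Prices in USD per 1M tokens, copied from the spec sheet above.
PRICES = {
    "GPT-5.4": {"input": 5.00, "output": 15.00},
    "GPT-5": {"input": 5.00, "output": 15.00},
    "Gemini 3.1 Ultra": {"input": 1.25, "output": 5.00},
    "Claude 4.5 Opus": {"input": 15.00, "output": 75.00},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run at the listed per-1M-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical agent run: 200k tokens in, 20k tokens out (an assumption).
for model in PRICES:
    print(f"{model}: ${run_cost(model, 200_000, 20_000):.2f}")

# Input-price ratio behind the "4x cheaper" claim.
ratio = PRICES["GPT-5.4"]["input"] / PRICES["Gemini 3.1 Ultra"]["input"]
print(f"GPT-5.4 input costs {ratio:.0f}x Gemini's")
```

Under that assumed workload the same run costs $1.30 on GPT-5.4 and $0.35 on Gemini 3.1 Ultra, so the gap per run is a bit under 4x once output tokens are counted.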

What "multi-step autonomy" actually means

Imagine: open five files, compare three, log results to Notion, then ping Slack — all from one prompt. GPT-5.4 demos completed that flow in 4-7 tool calls and 2-4 app switches.

The crucial advance is error recovery. When a tool call fails or an app stalls, the model now backs off and retries cleanly. The previous generation either froze on first failure or fell into retry loops.
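The difference between "backs off and retries cleanly" and "falls into a retry loop" is easiest to see in code. The sketch below is a generic bounded exponential backoff, not OpenAI's implementation; `ToolError`, `call_with_backoff`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class ToolError(Exception):
    """Hypothetical error type for a tool call that failed but may be retried."""


def call_with_backoff(tool, *args, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call tool(*args), waiting longer after each failure, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(*args)
        except ToolError:
            if attempt == max_attempts:
                raise  # bounded: surface the failure instead of looping forever
            # Waits are 0.5s, 1s, 2s, ... doubling after every failed attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a tool that stalls twice, then succeeds. Delays are recorded, not slept.
attempts = {"count": 0}
delays = []


def flaky_save(path):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ToolError("app stalled")
    return f"saved {path}"


result = call_with_backoff(flaky_save, "report.txt", sleep=delays.append)
print(result, delays)
```

The attempt cap is what prevents the runaway loop described above, and re-raising on the last attempt is what prevents the silent freeze: the caller always gets either a result or an error.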

Who wins

OpenAI — agent-bench leadership recovered, but pricing pressure from Google forces the "premium for capability" framing.

Enterprise automation — UiPath, Workato, Zapier and similar players have a viable backend; "agent RPA" cements as a category within twelve months.

[IMG#2]

Past benchmark curves

OSWorld was introduced by Tianbao Xie et al. (2024). At launch, GPT-4 scored 12% and Claude 3 scored 14%. One year later the top score stood at 65%; eight months after that, 75%.

A familiar arc — same shape on SWE-bench, where Devin debuted at 13.86% in early 2024 and frontier models now sit in the 70%s. The "1.5-2 years from launch to 60-75%" curve is now the norm.

Counter-moves

Google — Gemini 3.1 Ultra leans on 2M context plus code execution. Notably, Google has not yet published an OSWorld-V score.

Anthropic — Claude Sonnet 4.6 emphasizes coding/tool-use accuracy. SWE-bench gap to GPT-5.4 has narrowed to ~1pp, but OSWorld-V trails by 10pp.

Meta — Llama 5 is rumored to push "open-weight autonomous agents," with self-hosting as the differentiator.

Stakes

Wins: OpenAI — agent-bench top spot, restored leverage on Enterprise renewals.
Wins: Automation SaaS — viable model backbone for RPA-style use cases.
Loses: Thin LLM wrappers — autonomous execution as default erodes wrapper differentiation.
Watching: Regulators — autonomy editing files/emails creates ambiguous GDPR/SOC2 boundaries.
Watching: Internal IT — RBAC controls must catch up to autonomous execution.
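What "RBAC catching up" could look like in practice: every tool call an agent proposes passes through a role check before it runs. This is a minimal sketch under stated assumptions. The role names, tool names, and the `authorize` and `execute` helpers are all hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str    # hypothetical tool identifier, e.g. "files.delete"
    target: str  # what the tool would act on


# Hypothetical mapping from agent role to the tools it may invoke.
ROLE_PERMISSIONS = {
    "agent-readonly": {"files.read"},
    "agent-editor": {"files.read", "files.write"},
}


def authorize(role: str, call: ToolCall) -> bool:
    """Allow a call only if the role explicitly lists the tool (deny by default)."""
    return call.tool in ROLE_PERMISSIONS.get(role, set())


def execute(role: str, call: ToolCall, tools: dict):
    """Run the tool only after the role check passes."""
    if not authorize(role, call):
        raise PermissionError(f"{role} may not call {call.tool}")
    return tools[call.tool](call.target)


# Demo with stub tools that only return strings.
tools = {
    "files.read": lambda target: f"read {target}",
    "files.delete": lambda target: f"deleted {target}",
}
print(execute("agent-editor", ToolCall("files.read", "q3.xlsx"), tools))
print(authorize("agent-editor", ToolCall("files.delete", "q3.xlsx")))
```

Deny-by-default matters here: an unknown role or an unlisted tool is rejected, so a newly added destructive tool is not available to agents until someone grants it.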

Skeptical view

Andrej Karpathy: "OSWorld-V is curated; production task distributions differ. 75% on the bench isn't 75% in your stack."

Yann LeCun (Meta): "Track whether hallucination and tool-misuse rates rise alongside benchmark scores — autonomy turns hallucination from 'wrong text' into 'wrong files deleted.'"

What changes for you

For builders, the GPT-5.4 Tools API elevates multi-step autonomy to a first-class feature. Default to session-based multi-step runs rather than a single LLM call.
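A session-based run, as opposed to a single call, boils down to a loop: ask for the next action, execute it, feed the result back, and stop at a finish signal or a step budget. The sketch below is model-agnostic and makes no claims about the GPT-5.4 Tools API. `run_session`, the action dictionary shape, and `scripted_policy` (a stand-in for a real model call) are all assumptions for illustration.

```python
def run_session(next_action, tools, goal, max_steps=10):
    """Drive a multi-step session and return (result, history)."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = next_action(history)  # stand-in for a model call
        if action["type"] == "finish":
            return action["result"], history
        # Run the requested tool and record its output for the next turn.
        output = tools[action["tool"]](**action["args"])
        history.append((action["tool"], output))
    raise RuntimeError("step budget exhausted before the task finished")


def scripted_policy(history):
    """Hypothetical policy: read one file, then finish with what was read."""
    if len(history) == 1:
        return {"type": "tool", "tool": "read_file", "args": {"path": "a.txt"}}
    return {"type": "finish", "result": f"read: {history[-1][1]}"}


tools = {"read_file": lambda path: f"contents of {path}"}
result, history = run_session(scripted_policy, tools, "summarize a.txt")
print(result)
print(len(history))
```

The step budget is the design choice worth copying: a session that cannot finish fails loudly instead of running indefinitely.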

For founders — RPA and automation SaaS lose entry-barrier moats; differentiation now lives in domain data, policy, and integrations.

For investors, Microsoft's Q2 results will surface ChatGPT Enterprise renewals and Azure AI workload revenue, the cleanest public signal available.

For end users — ChatGPT's Tasks feature gets a deeper autonomy upgrade. Try it on recurring weekly workflows.

3-Line Summary

GPT-5.4 sets a new OSWorld-V SOTA at 75%, anchored on multi-step autonomous execution.
Pricing held; Gemini's $1.25 input is 4x cheaper, framing capability vs. price.
Automation/RPA category cements; enterprise RBAC and regulation become the next constraint.

References

OpenAI — GPT-5.4 announcement
OSWorld benchmark — official site
TechCrunch — GPT-5.4 hands-on
Bloomberg — OpenAI revenue update
Andrej Karpathy — benchmark commentary
