spoonai
TOPOpenAIGPT-5Agents

GPT-5.4 hits OSWorld-V 75% — autonomy goes mainstream

OpenAI unveiled GPT-5.4 with a 1M-token context and multi-step autonomous workflows. New SOTA on OSWorld-V at 75%.

·4분 소요·blog.mean.ceoblog.mean.ceo
공유
GPT-5.4 OSWorld-V benchmark chart — 75% vs prior generation
Source: OpenAI

75%

OSWorld-V at 75%. That's the headline number for GPT-5.4. OSWorld-V scores models on real desktop multi-step tasks: open files, edit, save, switch apps. The prior generation (GPT-5 baseline) sat near 51%, and the previous SOTA, Claude Sonnet 4.5, was 65%.

This release isn't about "longer context." It's about execution.

OpenAI's autonomy bet

OpenAI consolidated its model lineup at 5.0 in Q1 2025, then iterated through 5.x. 5.1 and 5.2 hardened multimodal alignment; 5.3 improved tool-call accuracy; 5.4 targets autonomous workflows.

Sam Altman has repeated a line for months: "the next leap is from answers to actions." OSWorld-V 75% is the first hard measurement on that thesis.

Jakub Pachocki, now central to architecture decisions after Mira Murati's departure, has emphasized that 5.4's training recipe elevates tool-use traces to a primary signal — a point Greg Brockman reinforced in a recent interview.

[IMG#1]

The spec sheet

Spec GPT-5.4 GPT-5 Gemini 3.1 Ultra Claude 4.5 Opus
Context 1,000,000 256,000 2,000,000 500,000
OSWorld-V 75% 51% not disclosed 65%
SWE-bench Verified 71% 64% 68% 70%
Multi-step autonomy partial
Input price ($/1M) $5.00 $5.00 $1.25 $15.00
Output price ($/1M) $15.00 $15.00 $5.00 $75.00

OpenAI re-takes the agent-bench top slot. Pricing held flat — but Gemini's $1.25 input is 4x cheaper. The new shape: OpenAI sells capability, Google sells price.

What "multi-step autonomy" actually means

Imagine: open five files, compare three, log results to Notion, then ping Slack — all from one prompt. GPT-5.4 demos completed that flow in 4-7 tool calls and 2-4 app switches.

The crucial advance is error recovery. When a tool call fails or an app stalls, the model now backs off and retries cleanly. The previous generation either froze on first failure or fell into retry loops.

Who wins

OpenAI — agent-bench leadership recovered, but pricing pressure from Google forces the "premium for capability" framing.

Enterprise automation — UiPath, Workato, Zapier and similar players have a viable backend; "agent RPA" cements as a category within twelve months.

[IMG#2]

Past benchmark curves

OSWorld was introduced by Tianbao Xie et al. (2024). At launch, GPT-4 scored 12% and Claude 3 scored 14%. One year later: 65%. Eight months after that: 75%.

A familiar arc — same shape on SWE-bench, where Devin debuted at 13.86% in early 2024 and frontier models now sit in the 70%s. The "1.5-2 years from launch to 60-75%" curve is now the norm.

Counter-moves

Google — Gemini 3.1 Ultra leans on 2M context plus code execution. Notably, Google has not yet published an OSWorld-V score.

Anthropic — Claude Sonnet 4.6 emphasizes coding/tool-use accuracy. SWE-bench gap to GPT-5.4 has narrowed to ~1pp, but OSWorld-V trails by 10pp.

Meta — Llama 5 is rumored to push "open-weight autonomous agents," with self-hosting as the differentiator.

Stakes

  • Wins: OpenAI — agent-bench top spot, restored leverage on Enterprise renewals.
  • Wins: Automation SaaS — viable model backbone for RPA-style use cases.
  • Loses: Thin LLM wrappers — autonomous execution as default erodes wrapper differentiation.
  • Watching: Regulators — autonomy editing files/emails creates ambiguous GDPR/SOC2 boundaries.
  • Watching: Internal IT — RBAC controls must catch up to autonomous execution.

Skeptical view

Andrej Karpathy: "OSWorld-V is curated; production task distributions differ. 75% on the bench isn't 75% in your stack."

Yann LeCun (Meta): "Track whether hallucination and tool-misuse rates rise alongside benchmark scores — autonomy turns hallucination from 'wrong text' into 'wrong files deleted.'"

What changes for you

For builders — GPT-5.4 Tools API elevates multi-step autonomy to a first-class feature. Default to session-based multi-step rather than single LLM call.

For founders — RPA and automation SaaS lose entry-barrier moats; differentiation now lives in domain data, policy, and integrations.

For investors — Microsoft Q2 will surface ChatGPT Enterprise renewals and Azure AI workload revenue, the cleanest shareable signal.

For end users — ChatGPT's Tasks feature gets a deeper autonomy upgrade. Try it on recurring weekly workflows.

3-Line Summary

  • GPT-5.4 sets a new OSWorld-V SOTA at 75%, anchored on multi-step autonomous execution.
  • Pricing held; Gemini's $1.25 input is 4x cheaper, framing capability vs. price.
  • Automation/RPA category cements; enterprise RBAC and regulation become the next constraint.

References

관련 기사

무료 뉴스레터

AI 트렌드를 앞서가세요

매일 아침, 엄선된 AI 뉴스를 받아보세요. 스팸 없음. 언제든 구독 취소.

매일 30개+ 소스 분석 · 한국어/영어 이중 언어광고 없음 · 1-클릭 해지