
GPT-5.4 Deep Dive — The First General-Purpose Model That Operates Your Computer

OpenAI has released GPT-5.4 with native Computer Use, a 1M-token context window, and a 75% OSWorld score that beats human experts. Full specs, benchmarks, and competitive analysis.

Source: Introducing GPT-5.4 | OpenAI
[Image: GPT-5.4 Computer Use demonstration. Credit: OpenAI]

From "AI that answers" to "AI that acts"

On March 5, OpenAI released GPT-5.4 — a frontier model that unifies reasoning, coding, and agentic workflows into a single system. Most importantly, it's the first general-purpose model with native Computer Use: it can see your screen and control your mouse and keyboard to complete complex tasks across applications.

Previous Computer Use attempts — Anthropic's Claude (October 2024), OpenAI's own Operator (January 2025), Google's Project Mariner — were either experimental betas or separate agent products. GPT-5.4 makes Computer Use a built-in capability available in both the API and ChatGPT.
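At its core, Computer Use is an observe-think-act loop: screenshot the screen, have the model choose the next mouse or keyboard action, execute it, and repeat. The sketch below illustrates that loop in miniature. Everything in it (the `Action` type, `plan_action`, the simulated screen) is hypothetical scaffolding, not GPT-5.4's actual API, which the article does not show.

```python
# Illustrative observe-think-act loop behind Computer Use.
# All names here are hypothetical stand-ins, not the real GPT-5.4 interface.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    payload: str = ""


def plan_action(screen: str, goal: str) -> Action:
    """Stand-in for the model call: look at the screen, pick the next action."""
    if goal in screen:
        return Action("done")
    if "Search box" in screen:
        return Action("type", goal)
    return Action("click", "Search box")


def run_agent(goal: str, max_steps: int = 5) -> list[Action]:
    screen = "Browser home page with an empty Search box"
    trace: list[Action] = []
    for _ in range(max_steps):
        action = plan_action(screen, goal)
        trace.append(action)
        if action.kind == "done":
            break
        # A real agent would move the mouse / send keystrokes here, then
        # re-screenshot; we just simulate the screen updating in response.
        if action.kind == "type":
            screen = f"Results page for: {goal}"
        else:
            screen = screen + " with a Search box"
    return trace


print([a.kind for a in run_agent("quarterly report")])  # → ['type', 'done']
```

The `max_steps` cap matters in practice: because each action depends on a fresh observation, a misread screen can otherwise trap the agent in an infinite retry loop.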

Key Specs

| Metric | GPT-5.4 | GPT-5.2 | Improvement |
|---|---|---|---|
| Context window | 1M tokens | 128K tokens | ~8x |
| OSWorld-Verified | 75.0% | 47.3% | +27.7pp (beats humans at 72.4%) |
| Per-claim error reduction | -33% | baseline | — |
| Full-response error reduction | -18% | baseline | — |
| Reasoning token usage | -33% | baseline | — |
| Image recognition | 10.24M pixels | — | — |
| GDPVal | 83.0% | — | Human expert level |

The 1 million token context is a game-changer for agent workflows. It means long-running agents can maintain context across complex, multi-step tasks without losing track of earlier work — analyzing entire codebases, processing hundreds of pages of legal documents, or orchestrating multi-application workflows in a single session.
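To get a feel for the scale, a back-of-the-envelope check using the common (but rough) ~4 characters-per-token heuristic shows what fits in 1M tokens versus 128K. The heuristic is an assumption; real tokenizers vary by language and content.

```python
# Rough estimate of whether a body of text fits in a context window,
# using the ~4 characters-per-token heuristic (an approximation).

CONTEXT_WINDOW = 1_000_000  # GPT-5.4, per the article
OLD_WINDOW = 128_000        # GPT-5.2


def estimated_tokens(num_chars: int) -> int:
    return num_chars // 4


def fits(num_chars: int, window: int = CONTEXT_WINDOW) -> bool:
    return estimated_tokens(num_chars) <= window


# A ~2 MB codebase (~500K estimated tokens) fits in 1M but not in 128K:
codebase_chars = 2_000_000
print(fits(codebase_chars))              # → True
print(fits(codebase_chars, OLD_WINDOW))  # → False
```

By this estimate, the jump from 128K to 1M is the difference between feeding an agent one large file at a time and handing it an entire mid-sized repository at once.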

OSWorld — The Real Computer Use Test

OSWorld-Verified measures whether AI can perform complex tasks in actual computer environments: opening browsers, filling forms, managing files, switching between applications. Human experts score 72.4%. GPT-5.4 scored 75.0%, surpassing humans. GPT-5.2 managed only 47.3% — a 27.7 percentage point leap in one generation signals that Computer Use has crossed from experimental to practical.

GDPVal — "Can AI Do Economically Valuable Work?"

GDPVal benchmarks AI on tasks with real economic value: drafting emails, analyzing spreadsheets, writing reports, cleaning data. GPT-5.4 scored 83.0%, reaching human expert level on the most realistic test of whether AI can generate actual business value.

Model Family

| Model | Target | Key Feature | Availability |
|---|---|---|---|
| GPT-5.4 Thinking | ChatGPT Plus, Team, Pro | Reasoning with visible plans | March 5 |
| GPT-5.4 Pro | Pro, Enterprise | Computer Use, high performance | March 5 |
| GPT-5.4 mini | API bulk processing | 2x+ faster than GPT-5 mini | March 17 |
| GPT-5.4 nano | Mobile, edge | Ultra-lightweight | March 17 |

Tool Search and Financial Plugins

GPT-5.4 introduces Tool Search — the model autonomously discovers and selects from available tools (APIs, plugins, functions) based on the task at hand, rather than requiring developers to pre-specify which tools to use. According to VentureBeat, it also ships with native financial plugins for Microsoft Excel and Google Sheets, enabling natural-language financial analysis, chart generation, and pivot table creation.
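The idea behind Tool Search can be sketched as scoring a registry of tool descriptions against the task and picking the best match, rather than the developer pre-wiring a fixed tool list. The toy keyword-overlap scorer below is purely illustrative; how GPT-5.4 actually ranks tools is not documented in the article, and the registry entries are made up.

```python
# Toy sketch of Tool Search: rank a registry of tool descriptions against a
# task and return the best matches. Scoring here is naive keyword overlap;
# the real mechanism inside GPT-5.4 is not public.

TOOL_REGISTRY = {
    "sheets_pivot": "create pivot table in google sheets spreadsheet",
    "excel_chart": "generate chart from excel financial data",
    "web_search": "search the web for current information",
}


def search_tools(task: str, registry: dict[str, str], top_k: int = 1) -> list[str]:
    task_words = set(task.lower().split())
    # Score each tool by how many task words appear in its description.
    scored = sorted(
        ((len(task_words & set(desc.split())), name) for name, desc in registry.items()),
        reverse=True,
    )
    return [name for score, name in scored[:top_k] if score > 0]


print(search_tools("build a pivot table from this spreadsheet", TOOL_REGISTRY))
# → ['sheets_pivot']
```

The design point the sketch captures: the tool set can grow without any change to the calling code, since selection happens per-task at runtime instead of per-integration at development time.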

Competitive Landscape

| Benchmark | GPT-5.4 | Claude 4.6 Opus | Gemini 3.1 Pro |
|---|---|---|---|
| OSWorld-Verified | 75.0% | — | — |
| BrowseComp | 82.7 | 84.0 | 85.9 |
| GDPVal | 83.0% | — | — |

The Computer Use market is still early. The winner will be determined not by benchmarks alone, but by which model can reliably handle real-world tasks at scale.
