GPT-5.4 Deep Dive — The First General-Purpose Model That Operates Your Computer
OpenAI has released GPT-5.4 with native Computer Use, a 1M-token context window, and a 75.0% OSWorld score that surpasses human experts. Full specs, benchmarks, and competitive analysis.

From "AI that answers" to "AI that acts"
On March 5, OpenAI released GPT-5.4 — a frontier model that unifies reasoning, coding, and agentic workflows into a single system. Most importantly, it's the first general-purpose model with native Computer Use: it can see your screen and control your mouse and keyboard to complete complex tasks across applications.
Previous Computer Use attempts — Anthropic's Claude (October 2024), OpenAI's own Operator (January 2025), Google's Project Mariner — were either experimental betas or separate agent products. GPT-5.4 makes Computer Use a built-in capability available in both the API and ChatGPT.
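Conceptually, every Computer Use system runs the same observe-act loop: capture the screen, let the model choose a mouse or keyboard action, execute it, and repeat until the task is done. The sketch below illustrates that control pattern only; the function names, action format, and scripted "policy" are hypothetical stand-ins, not OpenAI's actual API.

```python
# Toy observe-act loop illustrating the Computer Use pattern:
# screenshot -> model picks an action -> execute -> repeat until done.
# The "policy" here is a scripted stub; a real agent would call a model API.
from typing import Callable

Action = dict  # e.g. {"type": "click", "x": 120, "y": 40} or {"type": "done"}

def run_agent(observe: Callable[[], str],
              decide: Callable[[str], Action],
              execute: Callable[[Action], None],
              max_steps: int = 20) -> list[Action]:
    """Drive the loop until the policy emits a 'done' action."""
    trace = []
    for _ in range(max_steps):
        action = decide(observe())
        trace.append(action)
        if action["type"] == "done":
            break
        execute(action)
    return trace

# Scripted stub: click a button on the first screen, then finish.
screens = iter(["login page", "dashboard"])
policy = lambda screen: ({"type": "click", "x": 100, "y": 200}
                         if screen == "login page" else {"type": "done"})
trace = run_agent(lambda: next(screens), policy, lambda a: None)
print([a["type"] for a in trace])  # ['click', 'done']
```

What separates "experimental beta" from "built-in capability" is how reliably the `decide` step holds up across thousands of such iterations, which is exactly what OSWorld measures.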
Key Specs
| Metric | GPT-5.4 | GPT-5.2 | Improvement |
|---|---|---|---|
| Context window | 1M tokens | 128K tokens | ~8x |
| OSWorld-Verified | 75.0% | 47.3% | +27.7pp (beats humans at 72.4%) |
| Per-claim error reduction | -33% | baseline | — |
| Full-response error reduction | -18% | baseline | — |
| Reasoning token usage | -33% | baseline | — |
| Image recognition | 10.24M pixels | — | — |
| GDPVal | 83.0% | — | Human expert level |
The 1 million token context is a game-changer for agent workflows. It means long-running agents can maintain context across complex, multi-step tasks without losing track of earlier work — analyzing entire codebases, processing hundreds of pages of legal documents, or orchestrating multi-application workflows in a single session.
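To get a feel for what a 1M-token budget means in practice, here is a rough back-of-the-envelope check using the common ~4-characters-per-token heuristic (a real tokenizer would give exact counts; the reserve size is an illustrative assumption):

```python
# Rough token-budget check for a single long-context request.
# Assumes ~4 characters per token, a common heuristic for English text.

CONTEXT_WINDOW = 1_000_000  # GPT-5.4's reported context size

def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(documents: list[str], reserve_for_output: int = 16_000) -> bool:
    """True if all documents plus an output reserve fit in one window."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve_for_output <= CONTEXT_WINDOW

# Example: 300 documents of ~10,000 characters each (~2,500 tokens apiece)
docs = ["x" * 10_000] * 300
print(fits_in_context(docs))  # True: ~750,000 tokens plus reserve fits
```

By this estimate, roughly 750 pages of dense text fit in one request, which is why entire codebases and long legal document sets become single-session workloads.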
OSWorld — The Real Computer Use Test
OSWorld-Verified measures whether AI can perform complex tasks in actual computer environments: opening browsers, filling forms, managing files, switching between applications. Human experts score 72.4%. GPT-5.4 scored 75.0%, surpassing humans. GPT-5.2 managed only 47.3% — a 27.7 percentage point leap in one generation signals that Computer Use has crossed from experimental to practical.
GDPVal — "Can AI Do Economically Valuable Work?"
GDPVal benchmarks AI on tasks with real economic value: drafting emails, analyzing spreadsheets, writing reports, cleaning data. GPT-5.4 scored 83.0%, reaching human expert level on the most realistic test of whether AI can generate actual business value.
Model Family
| Model | Target | Key Feature | Availability |
|---|---|---|---|
| GPT-5.4 Thinking | ChatGPT Plus, Team, Pro | Reasoning with visible plans | March 5 |
| GPT-5.4 Pro | Pro, Enterprise | Computer Use, high performance | March 5 |
| GPT-5.4 mini | API bulk processing | 2x+ faster than GPT-5 mini | March 17 |
| GPT-5.4 nano | Mobile, edge | Ultra-lightweight | March 17 |
Tool Search and Financial Plugins
GPT-5.4 introduces Tool Search — the model autonomously discovers and selects from available tools (APIs, plugins, functions) based on the task at hand, rather than requiring developers to pre-specify which tools to use. According to VentureBeat, it also ships with native financial plugins for Microsoft Excel and Google Sheets, enabling natural-language financial analysis, chart generation, and pivot table creation.
Competitive Landscape
| Benchmark | GPT-5.4 | Claude 4.6 Opus | Gemini 3.1 Pro |
|---|---|---|---|
| OSWorld-Verified | 75.0% | — | — |
| BrowseComp | 82.7 | 84.0 | 85.9 |
| GDPVal | 83.0% | — | — |
The Computer Use market is still early. The winner will be determined not by benchmarks alone, but by which model can reliably handle real-world tasks at scale.