
GPT-5.5 Ships: Agentic Coding and Computer Use Just Stepped Up a Level

OpenAI released GPT-5.5 with major upgrades to multi-step agentic coding and computer use. SWE-Bench Verified passes 75% and OSWorld leaps to 56% — the largest single-generation jump for OpenAI in agent benchmarks.

GPT-5.5 — agentic coding and computer use benchmark chart
Source: llm-stats.com

When GPT-5.0 shipped last summer, the loudest critique was that it didn't earn its name. SWE-Bench Verified came in around 65%, well below Claude Sonnet 4.5. Nine months later, OpenAI launched GPT-5.5, and the same benchmark broke 75%. Not just a score bump; a credible step change in agentic coding capability.

Two main upgrades. First, multi-step agentic coding. The model takes a PR-level task: write code, run tests, debug failures, retry, ship. Second, computer use. The model controls a browser and OS directly — Anthropic's Computer Use idea (October 2024), refined another notch in OpenAI's stack.
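A minimal sketch of that first loop, in Python. Everything here is illustrative, not OpenAI's API: generate_patch stands in for whatever model call produces a diff, and the pytest/git plumbing is one way to wire it. The shape is the point: generate a patch, run the tests, feed failures back, repeat until green or out of budget.

    import subprocess
    from typing import Callable

    def run_tests() -> tuple[bool, str]:
        # Run the test suite; return (passed, combined stdout/stderr).
        proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def apply_patch(patch: str) -> None:
        # Apply a unified diff to the working tree (stand-in for real tooling).
        subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)

    def agent_fix_loop(generate_patch: Callable[[str, str], str],
                       task: str, max_attempts: int = 5) -> bool:
        # generate_patch(task, test_output) wraps the model call; hypothetical.
        feedback = ""
        for _ in range(max_attempts):
            patch = generate_patch(task, feedback)  # write/debug step
            apply_patch(patch)                      # apply the model's diff
            passed, feedback = run_tests()          # run tests, collect failures
            if passed:
                return True                         # ship: all tests green
        return False                                # budget spent; hand to a human

What 5.5 reportedly changes is not this scaffold but how many iterations the model needs inside it.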

Sam Altman wrote in the launch post: "5.5 is the first model that finishes the task instead of describing it." Marketing voice — but the benchmarks and demo videos do back it up to a meaningful degree.

Why each side cares

For OpenAI, 5.5 is the redemption release after 5.0's lukewarm reception. While OpenAI worked on it, Anthropic took the coding lead with Sonnet 4.5 and Computer Use, and Google caught up on multimodal with Gemini 2.5 and 3.0. 5.5 fills the gap.

For Anthropic, the SWE-Bench lead is gone for now. Sonnet 4.5 sits around 73%; 5.5 reaches 75.2%. First time OpenAI is ahead of Anthropic on a flagship coding benchmark. Single-benchmark wins matter less than developer satisfaction over a quarter — but the flag has moved.

For Google, Gemini 3.1 Ultra (announced the same day, with a 2M-token context window) competes on a different axis: reasoning over very large codebases. Different battlefield from agentic per-PR coding.

For users, the bigger shift is that agent-shaped IDE workflows finally feel production-ready. Cursor, Codex, and Claude Code have been moving in this direction for a year; 5.5 is the model-side reinforcement.

Benchmark snapshot

Benchmark               | GPT-5.5 | GPT-5.0 (prev.) | Claude Sonnet 4.5 (rival) | Gemini 2.5 Pro (rival)
SWE-Bench Verified      | 75.2%   | 64.5%           | 72.8%                     | 65.0%
MMLU-Pro                | 87.5%   | 84.0%           | 86.2%                     | 85.5%
GPQA Diamond            | 81.0%   | 76.5%           | 79.0%                     | 78.0%
OSWorld (computer use)  | 56.0%   | n/a             | 42.5%                     | 38.0%
WebArena (browser)      | 68.2%   | 58.0%           | 64.5%                     | 60.5%
AIME 2025 (math)        | 92.5%   | 88.0%           | 90.5%                     | 89.0%

The biggest single jump is OSWorld: 5.0 couldn't really run this benchmark; 5.5 lands at 56%, a 13.5 point lead over Claude Sonnet 4.5. WebArena moves to 68.2% — first model above 65%. These two benchmarks measure whether an agent can actually replace a human inside a GUI environment. Six months ago, no model cleared 50% on either.

Pricing holds at GPT-5.0 levels: $2.50/M input tokens, $10/M output. Context window grows from 200K → 256K. Computer-use mode is metered separately by action.
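At those rates, a rough cost sketch for a single PR-level agent run. The token counts below are invented but plausible; agent loops resend code context plus test output each iteration, so input tokens dominate.

    # Published rates: $2.50 per 1M input tokens, $10 per 1M output tokens.
    INPUT_RATE = 2.50 / 1_000_000
    OUTPUT_RATE = 10.00 / 1_000_000

    # Hypothetical 5-iteration debug loop: each pass resends ~40K tokens of
    # context and emits ~4K tokens of patch/commentary.
    input_tokens = 5 * 40_000    # 200K input tokens across the loop
    output_tokens = 5 * 4_000    # 20K output tokens

    # Note: computer-use actions are metered separately and not included here.
    cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    print(f"${cost:.2f}")  # -> $0.70

Well under a dollar per PR attempt at list price, which is why the pricing hold matters as much as the benchmark numbers.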

Who wins what

OpenAI. Reclaims footing in coding. Cursor and similar IDEs make backend-model decisions partly off SWE-Bench scores; this matters. Computer-use leadership opens the door for OpenAI to become the backend default for agent-shaped SaaS.

Developers. Same task, shorter debug cycle. Early users report ~30% faster average iteration loops on test-fail → debug → retry chains.

SaaS companies. Computer use lets a single agent stitch workflows across SaaS surfaces. That accelerates the RPA market's collision with LLM agents; pressure on UiPath and Automation Anywhere.

OpenAI employees. Morale recovery after the 5.0 cycle. Late-2025 IPO rumors gain credibility if 5.5 lands well.

What history says about generational jumps

  • GPT-3 → GPT-3.5 (2022): a 0.5 step that enabled ChatGPT via RLHF. Not just bigger; a methodology shift.
  • Claude 3 → 3.5 (2024): Sonnet 3.5 outperformed the larger Claude 3 Opus on coding. Methodology won over size.
  • Llama 2 → 3 (2024): a major data scale-up (2T → 15T tokens).

5.5 looks like a methodology jump (synthetic agent trajectories, modified RLHF / process rewards) rather than a parameter scale-up. That pattern matters: parameter increases are predictable; methodology jumps are not, and they're harder to reproduce.

How rivals counter

  • Anthropic: Sonnet 5.0 expected in June. Recovering the coding lead is goal #1, and Computer Use v3 will need to close the OSWorld gap.
  • Google: Gemini 3.1 Ultra leans into its 2M-token context, a different bet (whole-codebase reasoning) rather than agent loops.
  • xAI / DeepSeek / Qwen: compete on price. That OpenAI isn't cutting prices yet signals it doesn't feel the pressure, but a pricing cycle is likely 6-12 months out.
  • Cursor / Codex / Claude Code: differentiation moves up the stack to context management, MCP, and multi-agent orchestration.

What this changes for you

  • Engineers: switch backend models in your IDE and measure your own debug-cycle deltas. Same price, possibly real time savings.
  • Founders: map any user workflow where the model could plausibly finish the job (not describe it). Computer use enables flows you couldn't ship six months ago.
  • Investors: UiPath et al. forward guidance is the immediate read; OpenAI's next-round price is the medium-term read.
  • Users: ChatGPT will more often "just do" the task instead of teaching you how. Tasks like "extract data from this PDF and put it in a spreadsheet" finish in one round more often.

Stakes

  • Wins: OpenAI (coding lead recovered), agent SaaS (better backend), developers (cycle time)
  • Loses: Anthropic (coding lead lost, for now), traditional RPA (UiPath etc., facing accelerated displacement)
  • Watching: Cursor / Claude Code default-model decisions, Gemini 3.1 Ultra's whole-codebase use cases

Skeptics, named

Simon Willison wrote on X right after launch: "The benchmark jump is real but SWE-Bench Verified is curated. Real PR environments — codebase scale, CI flakiness, dep conflicts — won't reproduce 75% cleanly." Andrej Karpathy has noted that agent jumps are uneven across task families: average scores can overstate the typical-user benefit. First two weeks of real-user data will tell.

Computer-use safety also remains an open question. Sandboxing and confirmation gates are mandated, but jailbreak research is already underway. Expect the first public incidents within 1-2 months.

Tomorrow morning

  • Engineers: switch your IDE's backend to 5.5 on a small task batch and measure debug-cycle time vs. 5.0 or Sonnet 4.5 (a minimal timing harness is sketched below).
  • Founders / PMs: audit user flows for "computer use can finish this" candidates. Mark the top 3 as automation experiments.
  • Investors: track UiPath / Automation Anywhere quarterly guidance language. Watch OpenAI's next-round price as a 5.5 reception barometer.
  • Users: try the same task on 5.0 vs. 5.5 (Plus/Pro users) for a week. Track your "finished without help" rate.
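A minimal timing harness for that first item, assuming you supply run_task(model, task): your own wrapper that drives one backend through the full test-fail → debug → retry loop and returns when tests pass. Nothing here is a real SDK call.

    import statistics
    import time

    def measure_cycles(run_task, tasks, model: str) -> float:
        # Median wall-clock seconds per task for one backend.
        # Median over mean: agent loops are heavy-tailed, and one
        # pathological retry chain shouldn't dominate the comparison.
        durations = []
        for task in tasks:
            start = time.perf_counter()
            run_task(model, task)  # your wrapper; blocks until tests pass
            durations.append(time.perf_counter() - start)
        return statistics.median(durations)

    # Same task batch against both backends, then compare:
    # baseline = measure_cycles(run_task, TASKS, "gpt-5.0")
    # candidate = measure_cycles(run_task, TASKS, "gpt-5.5")
    # print(f"{100 * (1 - candidate / baseline):.0f}% faster")

Run the same batch against both backends so the delta reflects the model, not the tasks.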
