
GPT-5.5 Ships: Agentic Coding and Computer Use Just Stepped Up a Level

OpenAI released GPT-5.5 with major upgrades to multi-step agentic coding and computer use. SWE-Bench Verified passes 75% and OSWorld leaps to 56% — the largest single-generation jump for OpenAI in agent benchmarks.

GPT-5.5 — agentic coding and computer use benchmark chart
Source: llm-stats.com

When GPT-5.0 shipped last summer, the loudest critique was that it didn't earn its name. SWE-Bench Verified came in around 65%, well below Claude Sonnet 4.5. Nine months later, OpenAI launched GPT-5.5, and the same benchmark broke 75%. Not just a score bump; a credible step change in agentic coding capability.

Two main upgrades. First, multi-step agentic coding. The model takes a PR-level task: write code, run tests, debug failures, retry, ship. Second, computer use. The model controls a browser and OS directly — Anthropic's Computer Use idea (October 2024), refined another notch in OpenAI's stack.
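A minimal sketch of that first loop, in Python. Everything here is illustrative, not OpenAI's API: generate_patch stands in for whatever model call produces a diff, and the pytest/git plumbing is one way to wire it. The shape is the point: generate a patch, run the tests, feed failures back, repeat until green or out of budget.

    import subprocess
    from typing import Callable

    def run_tests() -> tuple[bool, str]:
        # Run the test suite; return (passed, combined stdout/stderr).
        proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def apply_patch(patch: str) -> None:
        # Apply a unified diff to the working tree (stand-in for real tooling).
        subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)

    def agent_fix_loop(generate_patch: Callable[[str, str], str],
                       task: str, max_attempts: int = 5) -> bool:
        # generate_patch(task, test_output) wraps the model call; hypothetical.
        feedback = ""
        for _ in range(max_attempts):
            patch = generate_patch(task, feedback)  # write/debug step
            apply_patch(patch)                      # apply the model's diff
            passed, feedback = run_tests()          # run tests, collect failures
            if passed:
                return True                         # ship: all tests green
        return False                                # budget spent; hand to a human

What 5.5 reportedly changes is not this scaffold but how many iterations the model needs inside it.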

Sam Altman wrote in the launch post: "5.5 is the first model that finishes the task instead of describing it." Marketing voice — but the benchmarks and demo videos do back it up to a meaningful degree.

Why each side cares

For OpenAI, 5.5 is the redemption release after 5.0's lukewarm reception. While OpenAI worked on it, Anthropic took the coding lead with Sonnet 4.5 and Computer Use, and Google caught up on multimodal with Gemini 2.5 and 3.0. 5.5 fills the gap.

For Anthropic, the SWE-Bench lead is gone for now. Sonnet 4.5 sits around 73%; 5.5 reaches 75.2%. First time OpenAI is ahead of Anthropic on a flagship coding benchmark. Single-benchmark wins matter less than developer satisfaction over a quarter — but the flag has moved.

For Google, Gemini 3.1 Ultra (announced the same day, with a 2M-token context window) competes on a different axis: reasoning over very large codebases. Different battlefield from agentic per-PR coding.

For users, the bigger shift is that agent-shaped IDE workflows finally feel production-ready. Cursor, Codex, and Claude Code have been moving in this direction for a year; 5.5 is the model-side reinforcement.

Benchmark snapshot

Benchmark               | GPT-5.5 | GPT-5.0 (prev.) | Claude Sonnet 4.5 (rival) | Gemini 2.5 Pro (rival)
SWE-Bench Verified      | 75.2%   | 64.5%           | 72.8%                     | 65.0%
MMLU-Pro                | 87.5%   | 84.0%           | 86.2%                     | 85.5%
GPQA Diamond            | 81.0%   | 76.5%           | 79.0%                     | 78.0%
OSWorld (computer use)  | 56.0%   | n/a             | 42.5%                     | 38.0%
WebArena (browser)      | 68.2%   | 58.0%           | 64.5%                     | 60.5%
AIME 2025 (math)        | 92.5%   | 88.0%           | 90.5%                     | 89.0%

The biggest single jump is OSWorld: 5.0 couldn't really run this benchmark; 5.5 lands at 56%, a 13.5 point lead over Claude Sonnet 4.5. WebArena moves to 68.2% — first model above 65%. These two benchmarks measure whether an agent can actually replace a human inside a GUI environment. Six months ago, no model cleared 50% on either.

Pricing holds at GPT-5.0 levels: $2.50/M input tokens, $10/M output. Context window grows from 200K → 256K. Computer-use mode is metered separately by action.
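At those rates, a rough cost sketch for a single PR-level agent run. The token counts below are invented but plausible; agent loops resend code context plus test output each iteration, so input tokens dominate.

    # Published rates: $2.50 per 1M input tokens, $10 per 1M output tokens.
    INPUT_RATE = 2.50 / 1_000_000
    OUTPUT_RATE = 10.00 / 1_000_000

    # Hypothetical 5-iteration debug loop: each pass resends ~40K tokens of
    # context and emits ~4K tokens of patch/commentary.
    input_tokens = 5 * 40_000    # 200K input tokens across the loop
    output_tokens = 5 * 4_000    # 20K output tokens

    # Note: computer-use actions are metered separately and not included here.
    cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    print(f"${cost:.2f}")  # -> $0.70

Well under a dollar per PR attempt at list price, which is why the pricing hold matters as much as the benchmark numbers.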

Who wins what

OpenAI. Reclaims footing in coding. Cursor and similar IDEs make backend-model decisions partly off SWE-Bench scores; this matters. Computer-use leadership opens the door for OpenAI to become the backend default for agent-shaped SaaS.

Developers. Same task, shorter debug cycle. Early users report ~30% faster average iteration loops on test-fail → debug → retry chains.

SaaS companies. Computer use lets a single agent stitch workflows across SaaS surfaces. That accelerates the RPA market's collision with LLM agents; pressure on UiPath and Automation Anywhere.

OpenAI employees. Morale recovery after the 5.0 cycle. Late-2025 IPO rumors gain credibility if 5.5 lands well.

What history says about generational jumps

  • GPT-3 → GPT-3.5 (2022): a 0.5 step that enabled ChatGPT via RLHF. Not just bigger; a methodology shift.
  • Claude 3 → 3.5 (2024): Sonnet 3.5 outperformed the larger Claude 3 Opus on coding. Methodology won over size.
  • Llama 2 → 3 (2024): a major data scale-up (2T → 15T tokens).

5.5 looks like a methodology jump (synthetic agent trajectories, modified RLHF / process rewards) rather than a parameter scale-up. That pattern matters: parameter increases are predictable; methodology jumps are not, and they're harder to reproduce.

How rivals counter

  • Anthropic: Sonnet 5.0 expected in June. Recovering the coding lead is goal #1, and Computer Use v3 will need to close the OSWorld gap.
  • Google: Gemini 3.1 Ultra leans into its 2M-token context, a different bet (whole-codebase reasoning) rather than agent loops.
  • xAI / DeepSeek / Qwen: compete on price. That OpenAI isn't cutting prices yet signals it doesn't feel the pressure, but a pricing cycle is likely 6-12 months out.
  • Cursor / Codex / Claude Code: differentiation moves up the stack to context management, MCP, and multi-agent orchestration.

What this changes for you

  • Engineers: switch backend models in your IDE and measure your own debug-cycle deltas. Same price, possibly real time savings.
  • Founders: map any user workflow where the model could plausibly finish the job (not describe it). Computer use enables flows you couldn't ship six months ago.
  • Investors: UiPath et al. forward guidance is the immediate read; OpenAI's next-round price is the medium-term read.
  • Users: ChatGPT will more often "just do" the task instead of teaching you how. Tasks like "extract data from this PDF and put it in a spreadsheet" finish in one round more often.

Stakes

  • Wins: OpenAI (coding lead recovered), agent SaaS (better backend), developers (cycle time)
  • Loses: Anthropic (coding lead lost, for now), traditional RPA (UiPath etc., facing accelerated displacement)
  • Watching: Cursor / Claude Code default-model decisions, Gemini 3.1 Ultra's whole-codebase use cases

Skeptics, named

Simon Willison wrote on X right after launch: "The benchmark jump is real but SWE-Bench Verified is curated. Real PR environments — codebase scale, CI flakiness, dep conflicts — won't reproduce 75% cleanly." Andrej Karpathy has noted that agent jumps are uneven across task families: average scores can overstate the typical-user benefit. First two weeks of real-user data will tell.

Computer-use safety also remains an open question. Sandboxing and confirmation gates are mandated, but jailbreak research is already underway. Expect the first public incidents within 1-2 months.

Tomorrow morning

  • Engineers: switch your IDE's backend to 5.5 on a small task batch and measure debug-cycle time vs. 5.0 or Sonnet 4.5 (a minimal timing harness is sketched below).
  • Founders / PMs: audit user flows for "computer use can finish this" candidates. Mark the top 3 as automation experiments.
  • Investors: track UiPath / Automation Anywhere quarterly guidance language. Watch OpenAI's next-round price as a 5.5 reception barometer.
  • Users: try the same task on 5.0 vs. 5.5 (Plus/Pro users) for a week. Track your "finished without help" rate.
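A minimal timing harness for that first item, assuming you supply run_task(model, task): your own wrapper that drives one backend through the full test-fail → debug → retry loop and returns when tests pass. Nothing here is a real SDK call.

    import statistics
    import time

    def measure_cycles(run_task, tasks, model: str) -> float:
        # Median wall-clock seconds per task for one backend.
        # Median over mean: agent loops are heavy-tailed, and one
        # pathological retry chain shouldn't dominate the comparison.
        durations = []
        for task in tasks:
            start = time.perf_counter()
            run_task(model, task)  # your wrapper; blocks until tests pass
            durations.append(time.perf_counter() - start)
        return statistics.median(durations)

    # Same task batch against both backends, then compare:
    # baseline = measure_cycles(run_task, TASKS, "gpt-5.0")
    # candidate = measure_cycles(run_task, TASKS, "gpt-5.5")
    # print(f"{100 * (1 - candidate / baseline):.0f}% faster")

Run the same batch against both backends so the delta reflects the model, not the tasks.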
