OpenAI ships GPT-5.5: agentic coding takes the front seat
OpenAI shipped GPT-5.5 on April 23 to paid ChatGPT and Codex users — six weeks after 5.4. The headline is agentic behavior: hand it half-formed work, and it plans steps, reaches for tools, catches its own mistakes, and finishes.
TL;DR
- OpenAI shipped GPT-5.5 on April 23 to paid ChatGPT and Codex users — six weeks after 5.4. The headline is agentic behavior: hand it half-formed work, and it plans steps, reaches for tools, catches its own mistakes, and finishes. 1M context window, $5 input / $30 output per 1M tokens, no free tier. API live from April 24.
- Primary source: https://openai.com/index/introducing-gpt-5-5/
- Importance score: 10/10
The hook
Here's the deal: OpenAI shipped GPT-5.5 on April 23 to paid ChatGPT and Codex users — six weeks after 5.4. The headline is agentic behavior: hand it half-formed work, and it plans steps, reaches for tools, catches its own mistakes, and finishes. 1M context window, $5 input / $30 output per 1M tokens, no free tier. API live from April 24.
Importance lands at 10/10, which puts this in the top decile of releases this quarter — the kind of announcement that still shapes product roadmaps and industry metrics six months out, not a marketing pulse that fades in a week.
Below: what happened (anchored to the primary source), the headline numbers in two tables, the timeline, what this means for individuals / teams / industry, a deep-dive section on the technical and architectural implications, skeptical takes worth keeping in mind, and what to watch in the next week.
What happened
OpenAI released GPT-5.5 on April 23 to paid ChatGPT and Codex users, and pushed it to the API one day later. The interesting wrinkle is timing — only six weeks after GPT-5.4. The headline shift is the model behaving more like an agent: it takes a half-formed task, plans steps, reaches for tools, catches its own mistakes mid-execution, and hands back finished work. The context window holds at one million tokens. Pricing is $5 per 1M input / $30 per 1M output for the standard model and $30 / $180 for Pro. No free tier. While Anthropic's Claude Mythos Preview reportedly still tops some benchmarks, GPT-5.5 closes the gap fastest where most users live: day-to-day developer workflows. The HN discussion swung between 'finally a model that finishes the job' and 'pricing is creeping up faster than capability'.
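To make 'agentic behavior' concrete, here is a minimal plan-act-check loop of the shape the announcement describes. Everything in it is illustrative: `call_model` is a hypothetical stub standing in for your real API client, and the tool registry is invented for the example; none of this is OpenAI's published interface.

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stub; replace with your actual API client.
    # Returning a canned "done" action keeps the sketch runnable.
    return json.dumps({"action": "done", "result": "(stub result)"})

TOOLS = {
    "search": lambda query: f"(search results for {query!r})",  # stub tool
    "run_code": lambda source: "(execution output)",            # stub tool
}

def run_agent(task: str, max_steps: int = 8) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Ask the model to pick a tool or declare the task finished.
        step = json.loads(call_model("\n".join(history)))
        if step["action"] == "done":
            return step["result"]
        # Feed the tool's observation back so the model can catch its
        # own mistakes on the next pass: the "agentic" loop in brief.
        observation = TOOLS[step["tool"]](step["input"])
        history.append(f"{step['tool']} -> {observation}")
    return "(step budget exhausted)"

print(run_agent("triage the failing CI job"))
```

The step budget is the part worth noticing: without a hard cap, a loop like this is exactly where the multi-step cost blow-up discussed later comes from.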
Primary source is the announcement page itself — see OpenAI. Secondary corroboration is listed at the bottom; three secondary outlets are cross-referenced. Any quoted figure in this piece is linked inline to its origin. Treat unlinked claims as my own framing, not as facts from the source.
On the headline numbers: the context window lands at 1,000,000 tokens (unchanged from GPT-5.4). Input price per 1M tokens holds at $5 (same as GPT-5.4). Output price per 1M tokens rises to $30 (GPT-5.4: $25). GPT-5.5 Pro input/output lands at $30 / $180 (vs Opus 4.7 at roughly $40 / $200). These metrics are not all measured under identical methodology — pick whichever matters to your workload and verify against the primary source's appendix where possible.
Recent timeline: 2025-08 (GPT-5 released), 2026-02 (GPT-5.2), 2026-03-15 (GPT-5.4), 2026-04-23 (GPT-5.5 to Plus/Pro/Business/Enterprise), 2026-04-24 (GPT-5.5 API released at $5/$30). Read this not as a one-off but as the latest knot on a multi-month thread. The compression between the last two events is itself a signal that release cadence in this category is tightening.
Pricing posture: $5 per 1M input tokens, $30 per 1M output. Free tier unavailable, API live from 2026-04-24. Even when input pricing holds steady, agentic workloads burn many more output tokens per task, so all-in cost can rise even when the per-token price doesn't. Model that into your TCO estimate.
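A quick way to model that: per-task cost is input tokens times the input rate plus output tokens times the output rate, and agentic runs inflate the output side. A minimal sketch using the announced $5/$30 rates; the token counts are invented for illustration.

```python
# Back-of-envelope per-task cost at the announced GPT-5.5 rates.
# Token counts are invented for illustration; measure your own.
INPUT_PER_1M = 5.00    # USD per 1M input tokens (announced)
OUTPUT_PER_1M = 30.00  # USD per 1M output tokens (announced)

def task_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PER_1M + output_tokens * OUTPUT_PER_1M) / 1_000_000

# One-shot completion vs. an agentic run of the same task: the agentic
# run re-reads context and emits plans, tool calls, and retries, so the
# output-token count balloons even though per-token prices are identical.
one_shot = task_cost(input_tokens=8_000, output_tokens=1_500)
agentic = task_cost(input_tokens=60_000, output_tokens=25_000)

print(f"one-shot: ${one_shot:.3f}  agentic: ${agentic:.3f}  ratio: {agentic/one_shot:.1f}x")
# With these assumed counts the agentic run costs roughly 12x more per
# task; that is the all-in effect described above.
```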
Benchmarks / Key Numbers
| Metric | Value | Versus |
|---|---|---|
| Context window | 1,000,000 tokens | Same as GPT-5.4 |
| Input price per 1M tokens | $5 | GPT-5.4 $5 |
| Output price per 1M tokens | $30 | GPT-5.4 $25 |
| GPT-5.5 Pro input/output | $30 / $180 | Opus 4.7 ~$40 / $200 |
Pricing & Access
| Item | Value |
|---|---|
| Input (per 1M tokens) | $5 |
| Output (per 1M tokens) | $30 |
| Pro input (per 1M tokens) | $30 |
| Pro output (per 1M tokens) | $180 |
| Free tier | No |
| API available from | 2026-04-24 |
Timeline
| Date | Event |
|---|---|
| 2025-08 | GPT-5 released |
| 2026-02 | GPT-5.2 |
| 2026-03-15 | GPT-5.4 |
| 2026-04-23 | GPT-5.5 (Plus/Pro/Business/Enterprise) |
| 2026-04-24 | GPT-5.5 API released ($5/$30) |
Why it matters
Four lenses help here. First, the individual user lens: does this materially change a workflow you spend more than thirty minutes a day on (coding, writing, analysis, automation)? If yes, the second question is whether the same output now becomes faster, cheaper, or more reliable — separate the three to keep the adoption decision clean.
Second, the team / enterprise lens. POC teams should ask whether this shortens the path to validation. Production teams should isolate which variable shifts: unit cost, latency, or accuracy. Marketing claims and SLA reality routinely diverge in the days after a release; running your own benchmark on roughly thirty representative inputs is the only safe move (a minimal harness sketch follows the fourth lens below).
Third, the competitive lens. Is the gap structural or temporary? Data advantages erode in 6–12 months. Infrastructure advantages persist for 12–24 months. Team-composition advantages are nearly impossible to replicate. This piece tries to attribute the headline figure to one of these where the evidence allows.
Fourth, the regulatory / ecosystem lens — easy to miss but compounding. Releases of this scale typically attract policy guidance or industry standards within a quarter or two, particularly around safety, data governance, and copyright. If you're not under pressure to decide today, watching one more cycle of those discussions can save you from re-platforming later.
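As promised under the team lens, here is a minimal sketch of that thirty-input benchmark harness. `call_model`, the scoring rule, and the sample cases are all assumptions for illustration; the point is the shape: fixed representative inputs, one model call each, a score you can rerun after every release.

```python
import statistics

def call_model(prompt: str) -> str:
    # Hypothetical stub; replace with your actual API client.
    return "stub output"

def score(output: str, expected: str) -> float:
    # Crude containment check; swap in your domain's real metric.
    return 1.0 if expected.lower() in output.lower() else 0.0

# Two sample cases shown; in practice load your ~30 representative inputs.
CASES = [
    {"prompt": "Summarize this incident report in one line.", "expected": "timeout"},
    {"prompt": "Write a SQL query for daily active users.", "expected": "group by"},
]

def run_benchmark() -> float:
    scores = [score(call_model(c["prompt"]), c["expected"]) for c in CASES]
    mean = statistics.mean(scores)
    print(f"{len(scores)} cases, mean score {mean:.2f}")
    return mean

run_benchmark()
```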
Deep Dive
This section goes one layer deeper into technical detail. Light readers can skip; teams making procurement or research-direction decisions should not.
The headline spec is the context window at 1,000,000 tokens, unchanged from GPT-5.4. To know whether any announced number is meaningful, you need an apples-to-apples comparison with the prior generation under identical measurement methodology — and most release notes don't publish that methodology in full. Even on identical benchmarks, prompt format, few-shot count, and temperature settings routinely shift results by 5–15 percentage points; that's the noise floor any external reproduction has to clear.
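One way to see that noise floor on your own data: rerun the same eval across a small grid of few-shot counts and temperatures and look at the spread. A minimal sketch; `run_eval` is a hypothetical wrapper around a benchmark harness like the one above, and the grid values are arbitrary.

```python
from itertools import product

def run_eval(few_shot: int, temperature: float) -> float:
    # Hypothetical: rerun your fixed benchmark under these settings.
    # The fake formula below only keeps the sketch runnable.
    return 80.0 + 1.5 * few_shot - 4.0 * temperature

results = {
    (k, t): run_eval(few_shot=k, temperature=t)
    for k, t in product([0, 3, 5], [0.0, 0.7])
}
spread = max(results.values()) - min(results.values())
# If the spread on your own data is 5-15 points, treat any headline
# delta smaller than that as indistinguishable from prompt noise.
print(f"score spread across settings: {spread:.1f} points")
```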
Architecturally, three deltas are most likely to explain the jump. First, training data composition: at fixed parameter count, better curation alone produces meaningful gains in code and math domains. Second, post-training pipeline strength: most of the headline improvements over the last 18 months trace to here, not to base-model architecture. Third, inference-time tool-call frequency: part of why models look smarter is simply that they reach for search or computation more aggressively. The exact split is unstated, but post-training is the most-likely dominant lever.
Limitations worth keeping front of mind. Self-reported benchmarks, thin adversarial data, sparse out-of-distribution generalization studies. Pricing marked 'preview' or 'limited access' historically gets revised at least once within six months. Operational quotas — context window, tool-call frequency caps — also tend to tighten quietly post-launch as usage scales. Build any 12-month ROI model with sensitivity to all three.
Open problems remain. Multi-step agent cost blow-up. Long-horizon memory consistency. Graceful degradation when tool calls fail. Responsibility allocation when an autonomous system makes a costly call (especially in code, finance, or healthcare). None of these are sufficiently addressed in this release. Production deployments that ignore them will receive an expensive bill from operations roughly six months in.
Who can use this
Solo developers and small teams. Hand off well-scoped backlog tickets, reclaim review and architecture time. The trap: low-quality specs lead the model into hallucinated work-arounds, and net time can go up. Spend the first month explicitly mapping spec quality vs. output quality on your team's actual tickets.
Startups. Prototype-to-feedback loops shrink from a week to a day for many feature classes. Especially valuable for data ingestion, simple ML pipelines, and internal tooling — areas where humans can move into review-only mode. Production-grade code, license review, and security audit still need a human in the loop.
Cost-sensitive enterprises. If this shifts the price-performance curve, workloads of meaningful scale (call centers, document processing, search) can run 30–50% cheaper for equivalent quality. Compounded over a quarter, that's real OPEX impact.
On-prem / governance-sensitive teams. Where open-weight alternatives exist, they create paths for finance, healthcare, and public-sector workloads that have stalled on cloud-LLM adoption. The data-sovereignty story changes when you can serve the same quality on hardware you control.
Researchers and students. Releases at this importance level reset 6–12 months of research agenda. If your topic is adjacent, design follow-up experiments now — replications and extensions of fresh frontier results have an unusually short impact half-life.
Skeptical takes
Three reservations come up repeatedly. Read them alongside the body, not after.
Self-reported benchmarks lack methodology disclosure. Over half of major releases in the last six months saw external reproductions land below the headline numbers. Treat the announced figures as the upper bound; measure your own workloads independently before committing.
Demo-fit examples don't survive long-tail real workloads. Keywords like 'agentic', 'human-level', or 'frontier' tend to hold inside curated demo scenarios and degrade 30–50% in production with domain-specific vocabulary, non-standard inputs, or multilingual mixes. A two-week pilot on your own representative inputs is non-optional.
Post-launch pricing and quotas tighten. Within the same category, prices have routinely been revised upward and operational caps narrowed within months of launch. Any 12-month ROI estimate should include a 20–30% price-increase scenario in the sensitivity analysis. 'Preview' models do not carry production SLAs — don't pin business-critical workloads to them.
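To make that sensitivity concrete, here is a minimal 12-month ROI sketch with the 20–30% price-increase scenario baked in. Only the $5/$30 rates come from the announcement; every workload number is invented.

```python
# 12-month ROI sensitivity to a post-launch price increase.
# Workload numbers are invented; rates are the announced GPT-5.5 pricing.
BASE_INPUT, BASE_OUTPUT = 5.00, 30.00      # USD per 1M tokens (announced)
MONTHLY_TASKS = 50_000                     # assumed volume
TOKENS_IN, TOKENS_OUT = 20_000, 8_000      # per-task tokens (assumed)
MONTHLY_SAVINGS = 120_000.0                # labor saved per month (assumed)

def annual_net(price_bump: float) -> float:
    cost_in = BASE_INPUT * (1 + price_bump)
    cost_out = BASE_OUTPUT * (1 + price_bump)
    monthly_cost = MONTHLY_TASKS * (TOKENS_IN * cost_in + TOKENS_OUT * cost_out) / 1e6
    return 12 * (MONTHLY_SAVINGS - monthly_cost)

for bump in (0.0, 0.2, 0.3):               # the 20-30% scenarios above
    print(f"+{bump:.0%} price: net ${annual_net(bump):,.0f}/yr")
```

If the net flips sign anywhere in that loop on your real numbers, the adoption case depends on launch pricing holding, which history says it often does not.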
What to watch next week
Four signals to track. (1) Competitive responses or pricing moves in the same category — a response within a week signals strong market pressure. (2) Independent reproductions from academic or third-party benchmarkers — within ±5 points of the headline is 'as advertised'; beyond that, caution. (3) Long-tail user feedback in established communities (Reddit, HN, X) — the gap between marketing tone and on-the-ground tone shows up within a week. (4) Ecosystem integration announcements — when major IDEs or platforms merge integration PRs within a week, this release is becoming the industry default. If all four align, that points to a structural shift; if they diverge, this is a marketing cycle.
One-line takeaway
OpenAI shipped GPT-5.5 on April 23 to paid ChatGPT and Codex users — six weeks after 5.4. The headline is agentic behavior: hand it half-formed work, and it plans steps, reaches for tools, catches its own mistakes, and finishes.
Sources
- [Primary] Introducing GPT-5.5
- [TechCrunch] OpenAI releases GPT-5.5, bringing company one step closer to an AI 'super app'
- [HN] GPT-5.5 — HN discussion
- [OpenAI] GPT-5.5 System Card