Gemini 3.1 Ultra ships — 2M context, native text·image·audio·video multimodal
Google released Gemini 3.1 Ultra with a 2M-token window and reasoning trained jointly across text, image, audio, and video. Lands the same week as GPT-5.4.

2M
When Google shipped Gemini 3.0 in December, the loudest line was "still in OpenAI's shadow." Users didn't leave ChatGPT. Revenue gap didn't close.
This week Google played a card.
Gemini 3.1 Ultra is out. The headline number is 2M tokens of context — twice GPT-5.4's 1M — with a model architecture trained jointly on text, image, audio, and video from the start. Native multimodal, not bolted on.
A built-in code-execution sandbox now runs snippets and feeds results back into reasoning. Sundar Pichai (CEO, Google and Alphabet) opened the keynote with "multimodal was always our path."
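For developers, the sandbox reads as a tool flag on an ordinary generation call. A minimal sketch, assuming the current Gemini API SDK pattern carries over; the gemini-3.1-ultra model ID is an assumption, not a confirmed identifier:

```python
# pip install google-genai
# Sketch: enable the code-execution sandbox on a single generation call.
# The model ID is assumed; the tool config mirrors today's Gemini API SDK.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-3.1-ultra",  # hypothetical model ID, not confirmed
    contents="Sum the squares of the first 200 primes. Write and run code to check.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# The response interleaves reasoning text, the generated code, and the sandbox output.
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    if part.executable_code is not None:
        print(part.executable_code.code)
    if part.code_execution_result is not None:
        print(part.code_execution_result.output)
```

The run-and-feed-back loop happens server-side; the client only reads the interleaved text, code, and execution-output parts.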
The collision matters. Gemini 3.1 Ultra and GPT-5.4 dropped the same week. The last time two frontier models clashed this directly on the same headline was GPT-4o vs Gemini 1.5 in spring 2024.
Who's involved — Google, OpenAI, the multimodal market
For Google, 3.1 Ultra is a multimodal-identity recovery play.
The Gemini line has pitched multimodal since 1.0, but adoption stayed in ChatGPT's lane. 3.0 beat GPT-5.0 on multimodal benchmarks in December — users still didn't move.
3.1 Ultra's bet: own the categories text doesn't fit — long video analysis, audio, complex visuals — and create new market space rather than fight ChatGPT for the text seat.
OpenAI's risk this week is the launch getting eclipsed. GPT-5.4's 75% on OSWorld is a strong headline, but 2M context and native video play in a different lane. The two models may end up dividing the market rather than competing for the same buyer.
Buyers in the multimodal segment — video, audio, content creation — get a credible third option. Last year you picked OpenAI or Anthropic. Now Google is on the shortlist.
Demis Hassabis (CEO, Google DeepMind) framed it: "Real AGI doesn't feel modality boundaries." Marketing, but the architecture and training notes back the direction.
The numbers
3.1 Ultra is built around multimodal and long context. Pure-text reasoning is slightly behind GPT-5.4. Video, audio, and long-document understanding pull clearly ahead.
| Benchmark | Gemini 3.1 Ultra | Gemini 3.0 (previous) | GPT-5.4 | Claude Sonnet 4.5 |
|---|---|---|---|---|
| MMMU (multimodal understanding) | 78.5% | 71.0% | 70.5% | 68.0% |
| Video-MME (video QA) | 84.0% | 76.5% | 72.0% | 68.5% |
| AudioBench | 81.5% | 73.0% | 70.0% | 65.5% |
| LongBench-2M | 75.0% | 64.0% | 58.5% | 56.0% |
| MMLU-Pro | 87.5% | 85.5% | 89.0% | 86.5% |
| OSWorld-V | 52.0% | 45.0% | 75.0% | 56.5% |
| Context window (tokens) | 2M | 1M | 1M | 1M |
| Input price ($ / 1M tokens) | 1.25 | 1.25 | 2.50 | 3.00 |
An 11.5- to 12-point lead over GPT-5.4 on audio and video. A 16.5-point lead on long-document understanding. Input pricing at $1.25 per million tokens is half of GPT-5.4's $2.50.
Desktop automation lags. The two flagships are diverging on positioning.
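At multimodal token volumes the price gap compounds. A back-of-envelope sketch at the table's published input rates; the tokens-per-hour figure is an assumption for illustration, not a published number:

```python
# Back-of-envelope input cost at the published per-million-token rates (table above).
GEMINI_31_ULTRA = 1.25  # $ per 1M input tokens
GPT_54 = 2.50           # $ per 1M input tokens

# Assumption for illustration only: long video tokenizes to roughly 1M input tokens per hour.
TOKENS_PER_HOUR = 1_000_000
hours = 500  # e.g., a back catalog of lecture recordings

tokens_millions = hours * TOKENS_PER_HOUR / 1e6
print(f"Gemini 3.1 Ultra input: ${tokens_millions * GEMINI_31_ULTRA:,.0f}")  # $625
print(f"GPT-5.4 input:          ${tokens_millions * GPT_54:,.0f}")           # $1,250
```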
Wins and losses
Google could own the standard-model seat for video and audio content. YouTube and Drive supply the training corpus; YouTube Studio and Docs supply the distribution. Creators using Gemini 3.1 can pull captions, chapter markers, and Shorts cuts in one pass.
Creators — YouTubers, podcasters, course makers — get a meaningful workflow upgrade. Hour-long video to a 5-minute summary plus chapters and captions is now one model call. Outsourcing budget shrinks.
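A sketch of what that single call could look like, assuming the Files API flow from today's Gemini SDK carries over; the model ID, filename, and prompt are illustrative assumptions:

```python
# pip install google-genai
# Sketch of the one-call creator workflow: upload a long video, then ask for
# summary, chapters, and captions in a single request.
import time

from google import genai

client = genai.Client()

# Upload once via the Files API, then wait for server-side processing to finish.
video = client.files.upload(file="episode_41.mp4")  # illustrative filename
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-3.1-ultra",  # hypothetical model ID, not confirmed
    contents=[
        video,
        "Produce three sections: (1) a five-minute-read summary, "
        "(2) chapter markers with timestamps, "
        "(3) SRT captions for the full video.",
    ],
)
print(response.text)
```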
Media, education, and entertainment buyers get to turn long-tail video assets into searchable, summarizable, repurposable data.
OpenAI's text-workflow moat — Discord, Slack, enterprise messengers — won't move overnight. Gemini 3.1 adoption starts in multimodal-first use cases.
Past cycles — multimodal frontier swings
Four prior swings.
OpenAI GPT-4o, May 2024. First single-model text + image + voice. Splashy launch, video pushed to a follow-up.
Google Gemini 1.5 Pro, February 2024. 1M context broke ground on long-document handling. UX and pricing kept adoption modest.
Meta Llama 3.2 Vision, 2024. Open-source multimodal viability, limited audio/video.
Anthropic Claude 3 with vision, 2024. Strong on image, basically silent on video and audio. Claude's strengths sit in text and code.
Pattern: the launches are splashy, but real adoption stays in text. 3.1 Ultra is positioned to break that pattern because of the YouTube data and workflow distribution Google uniquely owns.
Counter-moves
OpenAI bets on the coding/agent lane. Sora 2 covers content creation; ChatGPT enterprise pull-through covers revenue.
Anthropic stays in the text and coding lane and answers with Sonnet 5.0. They go deeper on their strength rather than pivot into multimodal.
Meta uses Llama's open-source pricing to attack the low end of multimodal. Llama 4 Multimodal is the candidate.
xAI Grok bets on real-time X data integration. Real-time signal, not multimodal depth. Resource gap makes a direct comparison unfair.
Skeptics, by name
Yann LeCun (Meta AI Chief Scientist) on X: a single model spanning all modalities is inefficient — modality-specific models do better. Same line he's argued for two years.
Aravind Srinivas (Perplexity CEO) gives credit on the 2M number, then notes most users can't even fill 1M. Capability outruns demand.
The consensus read: 3.1 Ultra doesn't dent GPT-5.4 in coding, but it can claim the multimodal standard seat.
Stakes
- Wins: Google — multimodal identity recovered, video/audio standard seat in reach. YouTube and Drive — data assets re-rate. Creators — video post-production workflow flips automated.
- Loses: OpenAI — multimodal shootout intensifies with Sora 2. Anthropic — hard to plant a flag in this category. Adobe and Final Cut Pro — partial creator workflow erosion.
- Watching: Meta — when Llama Multimodal v2 ships. Apple — depth of Apple Intelligence × Gemini integration. EU regulators — automated video/audio analysis guidance.
What changes
Devs: a credible multimodal API alternative exists. Video and audio SaaS now considers Google alongside OpenAI and Anthropic. Half the input price helps.
Founders: video content analysis becomes a viable wedge. Meeting notes automation, lecture summarization, marketing-video analysis all get cheaper.
Investors: Google revenue visibility improves. Cloud + Workspace + YouTube cross-sell on multimodal lifts ARPU. Video editing and captioning outsourcing markets face short-term pressure.
Consumers: long video becomes 1-minute summaries. Free auto-captioning becomes default.
3-Line Summary
- Gemini 3.1 Ultra ships with 2M context and native multimodal.
- Video and audio benchmarks lead GPT-5.4 by 11.5–12 points.
- Multimodal standard-model race is officially open.