One Prompt, One Game — OpenGame Paper Sets Domain-Specialized Coding SOTA

CUHK MMLab's full-stack release: GameCoder-27B + a multi-phase Game Skill workflow + OpenGame-Bench 150-prompt eval suite. State of the art on agentic web-game generation, plus a template for domain-specialized coding agents.

OpenGame paper card — Phaser HTML5 game agent
Source: arXiv 2604.18394

SOTA Across 150 Prompts

CUHK MMLab dropped the OpenGame paper on April 21 with a clean message: a domain-specialized base model plus a multi-phase workflow plus an automated eval pipeline produces SOTA on 150 game-generation prompts. The unusual move is shipping code (leigest519/OpenGame), model (GameCoder-27B) and bench (OpenGame-Bench) at the same time — a full-stack open template for domain-specialized coding agents.

In Plain Terms

Type "a 2D side-scroller where a fox collects mushrooms," walk away, and a playable HTML5 game falls out. A five-stage agent pipeline (plan → scaffold → implement → debug → polish) does the work, backed by a learned Skill library (Templates and Debug protocols) that grows across runs and reduces redundant work. The pitch: the 100th run on a similar prompt should be faster and more reliable than the first.
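A minimal sketch of how a stage loop might reuse a stored skeleton across runs. Only the five stage names and the Template/Debug split come from the article; `SkillLibrary`, `run_pipeline`, and the genre key are hypothetical names for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field

# Stage names are from the article; everything else here is hypothetical.
STAGES = ("plan", "scaffold", "implement", "debug", "polish")


@dataclass
class SkillLibrary:
    """Hypothetical store that persists across runs."""
    templates: dict = field(default_factory=dict)    # genre -> project skeleton
    debug_fixes: dict = field(default_factory=dict)  # error signature -> verified fix

    def template_for(self, genre):
        return self.templates.get(genre)

    def remember_template(self, genre, skeleton):
        self.templates[genre] = skeleton


def run_pipeline(prompt, genre, skills):
    """Walk the five stages, reusing a stored skeleton when one exists."""
    log = []
    for stage in STAGES:
        if stage == "scaffold":
            skeleton = skills.template_for(genre)
            if skeleton is None:
                # Stand-in for a model call that would generate the skeleton.
                skeleton = f"skeleton for {genre} (first seen with: {prompt})"
                skills.remember_template(genre, skeleton)
                log.append("scaffold:generated")
            else:
                log.append("scaffold:reused")
        else:
            log.append(f"{stage}:done")
    return log


skills = SkillLibrary()
first = run_pipeline("a 2D side-scroller where a fox collects mushrooms",
                     "side-scroller", skills)
second = run_pipeline("a side-scroller where a cat collects fish",
                      "side-scroller", skills)
print(first[1], second[1])  # scaffold:generated scaffold:reused
```

The second run skips skeleton generation because the first run stored one. That is the sense in which reuse would reduce redundant work.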

Authors and Source

CUHK MMLab, with leigest519 listed as first author. MMLab is the Multimedia Laboratory at the Chinese University of Hong Kong, a computer vision and deep learning group. arXiv 2604.18394, CC-BY, posted April 21, 2026. No conference venue has been announced; ICML 2026 or NeurIPS 2026 is speculation at this point.

Limits Before This — Games Aren't Function Patches

Generic coding-agent benches like SWE-Bench score code patches by running them against reference tests, so there is a known pass/fail answer. Games don't have a single correct answer for "fun side-scroller." Prior evaluation depended on slow human judgment or imprecise pattern matching. OpenGame automates the evaluation and introduces a more precise three-axis score (build health, visual usability, intent alignment).

The domain corpus was the second gap. General-purpose models like Llama 3.1 70B don't deeply know Phaser, Pygame, or Godot scripting. GameCoder-27B fills that.

Method — Game Skill + GameCoder-27B + OpenGame-Bench

Three coupled components.

Game Skill memory: Template Skill grows a project-skeleton library; Debug Skill maintains a verified-fix protocol. Reuse compounds.

GameCoder-27B: a 27B model fine-tuned on game-code corpora that fits on a single A100.

OpenGame-Bench: 150 prompts scored on Build Health (builds and runs cleanly), Visual Usability (headless-browser screenshots judged by a VLM), and Intent Alignment (semantic match between prompt and result).
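A sketch of how per-prompt scores on the three axes could be recorded and rolled up. It assumes each axis is a number in [0, 1] and that the overall score is an unweighted mean of the three axes. The article does not state the aggregation rule, and `BenchScore` and `aggregate` are hypothetical names.

```python
from dataclasses import dataclass

AXES = ("build_health", "visual_usability", "intent_alignment")


@dataclass(frozen=True)
class BenchScore:
    """One score per axis, each assumed to lie in [0, 1]."""
    build_health: float
    visual_usability: float
    intent_alignment: float

    def __post_init__(self):
        for name in AXES:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def average(self):
        # Assumption: unweighted mean of the three axes.
        return (self.build_health + self.visual_usability
                + self.intent_alignment) / 3


def aggregate(scores):
    """Mean per-axis score over a list of per-prompt BenchScore records."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores to aggregate")
    return BenchScore(*(sum(getattr(s, axis) for s in scores) / n
                        for axis in AXES))


# Two made-up prompt results, purely to exercise the functions.
per_prompt = [BenchScore(1.0, 0.5, 0.75), BenchScore(0.5, 0.5, 0.25)]
total = aggregate(per_prompt)
print(total.build_health, total.visual_usability, total.intent_alignment)
# 0.75 0.5 0.5
```

Keeping the three axes separate until the last step makes it easy to see which one drags a model down, which a single blended number would hide.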

Results

Model                        Build Health  Visual Usability  Intent Alignment  Avg
GPT-5.5 (general)            0.74          0.62              0.71              0.69
Claude Opus 4.7 (general)    0.72          0.65              0.69              0.69
Llama 3.1 70B + workflow     0.61          0.54              0.62              0.59
GameCoder-27B + Game Skill   0.83          0.78              0.81              0.81

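The 27B domain model with the Skill workflow beats every general-purpose baseline on the bench, including Llama 3.1 70B with a workflow (0.81 vs 0.59 average). Notable: GameCoder-27B alone (no workflow) scores ~0.62, so the workflow adds roughly 0.19 to the average. Swapping the base from Llama 3.1 70B to GameCoder-27B with a workflow in place adds roughly 0.22. The workflow contributes about as much as the specialized base.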
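The arithmetic behind that comparison can be checked directly from the reported numbers. The sketch below assumes the Avg column is an unweighted mean of the three axes (the table is consistent with that) and treats the "~0.62" figure as exact. The two gains are rough readings, since the Llama row also differs in model size.

```python
# Per-axis scores copied from the results table in the article.
rows = {
    "GPT-5.5 (general)": (0.74, 0.62, 0.71),
    "Claude Opus 4.7 (general)": (0.72, 0.65, 0.69),
    "Llama 3.1 70B + workflow": (0.61, 0.54, 0.62),
    "GameCoder-27B + Game Skill": (0.83, 0.78, 0.81),
}

# Assumption: Avg is the unweighted mean of the three axes, rounded to 2 places.
averages = {name: round(sum(axes) / 3, 2) for name, axes in rows.items()}

gamecoder_alone = 0.62  # reported as "~0.62" in the text, so approximate

full_system = averages["GameCoder-27B + Game Skill"]
workflow_gain = round(full_system - gamecoder_alone, 2)
specialization_gain = round(
    full_system - averages["Llama 3.1 70B + workflow"], 2)

for name, avg in averages.items():
    print(f"{name}: {avg}")
print("workflow gain:", workflow_gain)              # 0.19
print("specialization gain:", specialization_gain)  # 0.22
```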

Why It Matters

Three reads. One: the first full-stack open release of a domain-specialized coding agent, and a template for analytics notebooks, medical imaging workflows, or industrial dashboards. Two: VLM-as-judge pitched as production-ready evaluation. Three: domain data beating model size on a scoped task, a signal that "specialize over scale" can win in vertical coding.

Limits and Skeptics

Four: 2D-only genre coverage; no multiplayer or networking; a closed GameCoder-27B training corpus, which makes the release only partially reproducible; and a VLM judge (GPT-4V or Qwen-VL) that injects judge bias into the score.

The skeptic's case: as frontier models reach 75% on SWE-Bench (the figure cited for GPT-5.5), they may absorb game coding by default, and GameCoder-27B's margin will compress.

One-Liner

Domain-specialized model + multi-phase workflow + automated evaluation, shipped as full-stack open. The reference design as coding agents move from function-patches to whole-project building.
