One Prompt, One Game — OpenGame Paper Sets Domain-Specialized Coding SOTA

SOTA Across 150 Prompts

CUHK MMLab released the OpenGame paper on April 21 with a clean message: a domain-specialized base model, a multi-phase workflow, and an automated evaluation pipeline together produce state-of-the-art results on 150 game-generation prompts. The unusual move is shipping the code (leigest519/OpenGame), the model (GameCoder-27B), and the benchmark (OpenGame-Bench) at the same time: a full-stack open template for domain-specialized coding agents.

In Plain Terms

Type "a 2D side-scroller where a fox collects mushrooms," walk away, and a playable HTML5 game falls out. A five-stage agent pipeline (plan → scaffold → implement → debug → polish) draws on a learned Skill library of Templates and Debug protocols that grows across runs and cuts redundant work, so the 100th run on a similar prompt should be faster and more reliable than the first.
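To make the stage ordering concrete, here is a minimal sketch of a five-stage pipeline that threads one state object through each step. It is an illustration only, not the OpenGame implementation: `RunState`, `make_stage`, and `run_pipeline` are hypothetical names, and each stage is a placeholder where a real agent would call a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage order as described in the article.
STAGES = ["plan", "scaffold", "implement", "debug", "polish"]


@dataclass
class RunState:
    """Hypothetical container for what one run accumulates."""
    prompt: str
    completed: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def make_stage(name: str) -> Callable[[RunState], RunState]:
    """Build a placeholder stage; a real agent would call a model here."""
    def stage(state: RunState) -> RunState:
        state.notes.append(f"{name}: handled {state.prompt!r}")
        state.completed.append(name)
        return state
    return stage


def run_pipeline(prompt: str) -> RunState:
    """Run every stage once, in order, threading the state through."""
    state = RunState(prompt=prompt)
    for name in STAGES:
        state = make_stage(name)(state)
    return state


if __name__ == "__main__":
    result = run_pipeline("a 2D side-scroller where a fox collects mushrooms")
    print(" -> ".join(result.completed))
```

Keeping the stages as an ordered list makes it easy to log, retry, or skip a single stage without touching the others.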

Authors and Source

CUHK MMLab, with leigest519 listed as first author. MMLab is the Multimedia Laboratory at the Chinese University of Hong Kong, a computer-vision group best known for the OpenMMLab toolkits such as MMDetection. arXiv 2604.18394, CC-BY, posted April 21, 2026. No conference venue has been announced; ICML 2026 or NeurIPS 2026 seem likely.

Limits Before This — Games Aren't Function Patches

Generic coding-agent benchmarks like SWE-Bench score function-level patches against a known correct answer. Games have no single correct answer for "fun side-scroller," so prior evaluation depended on slow human judgment or imprecise pattern matching. OpenGame automates the evaluation and scores each game on three axes: build health, visual usability, and intent alignment.

The domain corpus was the second gap. General-purpose models like Llama 3.1 70B don't deeply know Phaser, Pygame, or Godot scripting. GameCoder-27B is meant to fill that.

Method — Game Skill + GameCoder-27B + OpenGame-Bench

Three coupled components.

Game Skill memory: Template Skill grows a library of project skeletons, and Debug Skill maintains a protocol of verified fixes. Reuse compounds across runs.

GameCoder-27B: a 27B model fine-tuned on game-code corpora that fits on a single A100.

OpenGame-Bench: 150 prompts scored on Build Health (the project builds and runs cleanly), Visual Usability (headless-browser screenshots judged by a VLM), and Intent Alignment (semantic match between prompt and result).
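One way to picture the Game Skill memory is as two lookup tables that persist across runs. The sketch below is an assumption about the shape of such a store, not the paper's data structure: `SkillMemory`, its method names, and the idea of keying templates by genre and fixes by error signature are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SkillMemory:
    """Hypothetical store for reusable templates and verified fixes."""
    templates: Dict[str, str] = field(default_factory=dict)  # genre -> skeleton
    fixes: Dict[str, str] = field(default_factory=dict)      # error -> fix
    hits: int = 0
    misses: int = 0

    def get_template(self, genre: str) -> Optional[str]:
        """Return a stored skeleton for the genre and count hit or miss."""
        skeleton = self.templates.get(genre)
        if skeleton is None:
            self.misses += 1
        else:
            self.hits += 1
        return skeleton

    def add_template(self, genre: str, skeleton: str) -> None:
        """Save a skeleton so later runs on the same genre can reuse it."""
        self.templates[genre] = skeleton

    def record_fix(self, error_signature: str, fix: str, verified: bool) -> None:
        """Keep a fix only if it was verified to resolve the error."""
        if verified:
            self.fixes[error_signature] = fix

    def lookup_fix(self, error_signature: str) -> Optional[str]:
        """Return a previously verified fix, if one exists."""
        return self.fixes.get(error_signature)


if __name__ == "__main__":
    memory = SkillMemory()
    if memory.get_template("side-scroller") is None:       # first run: miss
        memory.add_template("side-scroller", "<project skeleton>")
    print(memory.get_template("side-scroller") is not None)  # second run: hit
    print(memory.hits, memory.misses)
```

Storing only verified fixes is the design choice that matters here: an unverified patch that enters the memory would be replayed on every later run that hits the same error.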

Results

Model	Build Health	Visual Usability	Intent Alignment	Avg
GPT-5.5 (general)	0.74	0.62	0.71	0.69
Claude Opus 4.7 (general)	0.72	0.65	0.69	0.69
Llama 3.1 70B + workflow	0.61	0.54	0.62	0.59
GameCoder-27B + Game Skill	0.83	0.78	0.81	0.81

The 27B domain model with the Skill workflow beats both general-purpose frontier models and the larger Llama 3.1 70B on the bench. Notable: GameCoder-27B alone (no workflow) scores ~0.62, about 0.19 below the full system's 0.81, so the workflow contributes roughly as much as the specialized base.
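The Avg column and the ablation claim can be checked with a few lines of arithmetic. The sketch below assumes Avg is the unweighted mean of the three axes (the article does not state the weighting) and treats the ~0.62 no-workflow figure as exact. Comparing against the Llama 3.1 70B + workflow row as a proxy for "same workflow, general base" is also an assumption, since that row differs in model size as well as training data.

```python
# Per-axis scores copied from the results table:
# (Build Health, Visual Usability, Intent Alignment).
ROWS = {
    "GPT-5.5 (general)": (0.74, 0.62, 0.71),
    "Claude Opus 4.7 (general)": (0.72, 0.65, 0.69),
    "Llama 3.1 70B + workflow": (0.61, 0.54, 0.62),
    "GameCoder-27B + Game Skill": (0.83, 0.78, 0.81),
}

# Approximate no-workflow score quoted in the article ("~0.62").
GAMECODER_ALONE = 0.62


def average(scores):
    """Unweighted mean of the three axes, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


averages = {name: average(scores) for name, scores in ROWS.items()}

full_system = averages["GameCoder-27B + Game Skill"]
# Gain attributed to the workflow: full system minus the model alone.
workflow_gain = round(full_system - GAMECODER_ALONE, 2)
# Gain attributed to the base: full system minus a general base with workflow.
base_gain = round(full_system - averages["Llama 3.1 70B + workflow"], 2)

if __name__ == "__main__":
    for name, avg in averages.items():
        print(f"{name}: {avg:.2f}")
    print(f"workflow gain: {workflow_gain:.2f}, base gain: {base_gain:.2f}")
```

Under these assumptions the two gains come out at 0.19 and 0.22, which is consistent with the article's reading that workflow and specialized base contribute in roughly equal measure.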

Why It Matters

Three reads. One: it is billed as the first full-stack open release of a domain-specialized coding agent, and it offers a template for analytics notebooks, medical imaging workflows, and industrial dashboards. Two: it puts VLM-as-judge forward as a practical, automated evaluation method. Three: domain data beats model size on a scoped task, a signal that "specialize over scale" can win in vertical coding.

Limits and Skeptics

Four limits: genre coverage is 2D-only; there is no multiplayer or networking; the GameCoder-27B training corpus is closed, so reproducibility is only partial; and the VLM judge (GPT-4V or Qwen-VL) injects judge bias into the score.

A skeptic case: as frontier models hit 75% on SWE-Bench (GPT-5.5), they may absorb game coding by default, and GameCoder-27B's advantage will compress.

One-Liner

Domain-specialized model + multi-phase workflow + automated evaluation, shipped as full-stack open. The reference design as coding agents move from function-patches to whole-project building.

References

Paper: https://arxiv.org/abs/2604.18394
Code: https://github.com/leigest519/OpenGame
Phaser: https://phaser.io/
CUHK MMLab: https://mmlab.ie.cuhk.edu.hk/
SWE-Bench: https://www.swebench.com/
