leigest519/OpenGame — Build a Whole Web Game From One Prompt
Open-source agentic framework that turns a single prompt into a playable HTML5 game. CUHK MMLab ships GameCoder-27B, a Game Skill memory and the OpenGame-Bench eval suite together — full-stack open.

A Whole Game From One Sentence
Type "a 2D side-scroller where a fox collects mushrooms" and walk away — what falls out is a real, playable HTML5 game you can drop on itch.io. That's the pitch of leigest519/OpenGame, and the differentiator is that the team also shipped a 150-prompt evaluation suite (OpenGame-Bench) that scores the output across build health, visual usability, and intent alignment. It's not a demo. It's a benchmarked, reproducible, full-stack open release.
The first commit landed April 21, the project crossed onto GitHub Trending Python the week of April 28, and it's running ~280 stars/day to ~4,200 total at our snapshot time (estimates from search snippets — direct GitHub counts vary by environment). The companion arXiv paper 2604.18394 makes this an unusually complete release: code, model, data, and evaluation in one drop.
Project Background — Who and Why
OpenGame is led by leigest519 at CUHK MMLab. The lab is best known for multimodal vision-language work (the Visual ChatGPT lineage), and OpenGame is its move into "agentic coding" with a deliberate domain choice. General coding agents (the SWE-Bench cohort) target function-level patches with crisp ground truth; "fun games" have no ground truth. OpenGame solves the scoring problem at the same time it solves the building problem.
Two macro trends meet here. First, the coding-agent market is moving from function-patch to whole-project building — Cursor 3, Claude Code, Codex all chase this. Second, evaluation is shifting from exact-match to LLM/VLM-as-judge. OpenGame stitches both into one OSS package and shows them working end to end.
The signature design choice is "Game Skill" memory. A scaffold that worked once gets stored in a Template Skill library; a verified bug fix gets logged into a Debug Skill protocol. The next prompt reuses both — the agent gets faster and more reliable across runs instead of starting over each time.
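The Template/Debug split can be pictured as a two-table store that later runs query before generating from scratch. A minimal sketch — all class and method names here are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of a Game Skill memory; names are illustrative,
# not OpenGame's actual API.
from dataclasses import dataclass, field


@dataclass
class SkillStore:
    templates: dict = field(default_factory=dict)  # genre -> working scaffold
    debug_log: list = field(default_factory=list)  # (symptom, verified fix)

    def save_template(self, genre: str, scaffold: str) -> None:
        self.templates[genre] = scaffold

    def reuse_template(self, genre: str):
        # A later prompt in the same genre starts from the stored scaffold
        return self.templates.get(genre)

    def log_fix(self, symptom: str, patch: str) -> None:
        self.debug_log.append((symptom, patch))

    def known_fixes(self, symptom: str) -> list:
        # Verified fixes are replayed before asking the model to re-debug
        return [p for s, p in self.debug_log if s == symptom]


store = SkillStore()
store.save_template("side-scroller", "// Phaser 3 scaffold ...")
store.log_fix("sprite not loading", "preload asset in Scene.preload()")
```

The point of the design is amortization: the second side-scroller prompt skips scaffolding entirely and starts from a known-good base.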
Core Capabilities — Game Skill + GameCoder-27B + OpenGame-Bench
Three coupled components: a multi-phase workflow (plan → scaffold → implement → debug → polish), the GameCoder-27B domain model, and the OpenGame-Bench evaluation pipeline. The workflow tree resembles SWE-Bench-style coding agents but adds a separate "polish" stage for fonts, sprites, and audio touch-ups that web games actually need.
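The multi-phase workflow amounts to a pipeline where each stage enriches a shared artifact. A toy sketch with stub phase functions (not OpenGame's implementation):

```python
# Five-phase pipeline sketch; phase functions here are placeholder stubs.
PHASES = ["plan", "scaffold", "implement", "debug", "polish"]


def run_pipeline(prompt: str, phase_fns: dict) -> dict:
    artifact = {"prompt": prompt}
    for phase in PHASES:
        # Each phase reads the artifact so far and adds its own output
        artifact = phase_fns[phase](artifact)
    return artifact


# Stubs that simply mark each phase as completed:
stubs = {p: (lambda a, p=p: {**a, p: "done"}) for p in PHASES}
result = run_pipeline("a fox collects mushrooms", stubs)
```

The dedicated "polish" stage is what distinguishes this from a generic SWE-agent loop: fonts, sprites, and audio get their own pass instead of being afterthoughts of the implement phase.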
GameCoder-27B is fine-tuned on game code (Phaser, Pygame, Godot scripting) corpora rather than general code. The 27B size is deliberate — it's the upper bound that runs on a single A100 with the included quantization recipe. The benchmarked gap to a Llama-3.1-70B base on the same workflow tells you the domain specialization is doing real work.
OpenGame-Bench scores 150 diverse prompts on three axes: build health (does it run, no console errors), visual usability (headless-browser screenshot judged by VLM), and intent alignment (semantic match between prompt and result). This goes beyond exact-match benches like the 2024 GameBench cohort.
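One way to picture the three axes: build health as a hard gate, with the two judge scores blended on top. The equal weights below are an assumption for illustration, not the bench's published formula:

```python
# Hypothetical aggregation of the three OpenGame-Bench axes;
# the 50/50 weighting is an assumption, not from the paper.
def bench_score(build_ok: bool, visual: float, intent: float) -> float:
    """build_ok gates everything; visual/intent are 0-1 judge scores."""
    if not build_ok:
        return 0.0  # a game that doesn't run scores zero regardless
    return 0.5 * visual + 0.5 * intent
```

The gate matters: a beautiful game that throws console errors on load scores zero, which is exactly the failure mode exact-match benches can't express.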
```shell
# Four lines to first game
git clone https://github.com/leigest519/OpenGame.git
cd OpenGame && make install
make run PROMPT='a 2D side-scroller where a fox collects mushrooms'
# Output: ./out/ contains an HTML/JS/asset bundle ready to play
```
Stack + Architecture
Python is the host, Phaser 3 the runtime — chosen so the eval step can run the artifact straight in headless Chrome with no separate build/deploy. Docker locks the environment and removes the "works on my machine" failure mode. The VLM judge is pluggable — you can swap GPT-4V for Qwen-VL and stay fully open-source if you want.
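"Pluggable" here plausibly means the judge sits behind a small interface so backends swap without touching the eval loop. A sketch under that assumption — the `Protocol` and class names are hypothetical:

```python
# Sketch of a pluggable VLM-judge interface; all names are hypothetical.
from typing import Protocol


class VisualJudge(Protocol):
    def score(self, screenshot_png: bytes, prompt: str) -> float: ...


class StubJudge:
    """Stand-in for a real VLM backend (GPT-4V, Qwen-VL, ...)."""

    def score(self, screenshot_png: bytes, prompt: str) -> float:
        return 0.9  # a real judge would rate the screenshot against the prompt


def evaluate(judge: VisualJudge, screenshot: bytes, prompt: str) -> float:
    # The eval loop only sees the interface, never the backend
    return judge.score(screenshot, prompt)


score = evaluate(StubJudge(), b"\x89PNG...", "fox side-scroller")
```

Keeping the judge behind an interface is what lets a fully open-source stack (Qwen-VL) and a proprietary one (GPT-4V) produce comparable bench runs.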
Compete Set
| Project | Role | Model | Bench in repo? | License |
|---|---|---|---|---|
| leigest519/OpenGame | Whole-game agent | GameCoder-27B (domain) | OpenGame-Bench 150 prompts | Apache-2.0 |
| microsoft/autogen | General multi-agent fw | BYO LLM | None | MIT |
| phaserjs/phaser-ai-templates | Phaser snippets | External LLM | None | MIT |
| cline/cline (prior incumbent) | VS Code IDE agent | BYO | None | Apache-2.0 |
OpenGame's wedge is "domain-specialized model + bench in the box." Autogen is generic and needs a lot of glue for game work; Phaser AI templates are LLM-friendly snippets, not agents.
Why Now — Ecosystem Context
Three currents converge. One, the same week brought reports that Cursor is in talks at a $50B valuation — a clear signal that agentic coding is now an industrial category, not a feature. Two, GPT-5.5's SWE-Bench 75% number means function-level accuracy is treated as solved, and attention is moving to "whole project" — exactly OpenGame's lane. Three, eval automation is the hottest infra topic of 2026; LLM-as-judge papers exploded this spring, and OpenGame goes a step further by shipping a working VLM-as-judge pipeline.
The most upvoted HN comment on the launch read: "An indie dev can prototype 30 games a month with this." Twelve OpenGame-built prototypes hit itch.io in the last week of April, which is the cleanest possible proof.
Getting Started + Pitfalls
Two pitfalls to know. The 27B weights are ~54GB; the 4-bit quantization recipe brings that down to ~14GB, and GitHub Issues reports clean runs on 32GB M2 Pro Macs. The eval pipeline launches headless Chrome plus VLM calls and spends about five minutes on the first-run container build — budget for it.
```shell
# 4-bit quantized weights
make install QUANT=4bit
# Run only the eval pipeline
make bench PROMPT_SET=opengame_bench_v1
```
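The quoted weight sizes line up with simple byte math — fp16 is 2 bytes per parameter, 4-bit is 0.5, ignoring quantization overhead such as scales and zero-points:

```python
# Back-of-envelope check of the weight sizes quoted above.
params = 27e9                  # 27B parameters
fp16_gb = params * 2 / 1e9     # 2 bytes/param -> 54.0 GB
int4_gb = params * 0.5 / 1e9   # 0.5 bytes/param -> 13.5 GB (~14GB quoted)
```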
Limits and Outlook
Genre coverage is 2D-only today (side-scrollers, puzzles, shooters), and multiplayer/networking is essentially out of scope. The roadmap (RFC PR landed April 30) covers Three.js 3D and a 2-agent collaboration mode for H2 2026. Long-term, OpenGame's contribution is the template — domain-specialized coding agents will follow this pattern (specialized model + domain bench + multi-phase workflow) into notebooks, medical imaging, industrial dashboards.
Tomorrow Morning
Developers: git clone and run the quickstart. Drop the result on itch.io and you'll have a concrete sense of the current quality bar. Game designers: read all 150 OpenGame-Bench prompts and build your own 50-prompt vertical bench in your domain — that's the highest-leverage OSS contribution category of the year. Investors/founders: watch for GameCoder-70B (mentioned in the paper appendix) and the H2 Three.js extension as the next inflection points; track leigest519's commit cohort and release notes.
References
- Repo: https://github.com/leigest519/OpenGame
- Paper: https://arxiv.org/abs/2604.18394
- Phaser engine: https://phaser.io/
- GitHub Trending Python (weekly): https://github.com/trending/python?since=weekly
- HuggingFace LightEval (related eval infra): https://github.com/huggingface/lighteval