leigest519/OpenGame — Build a Whole Web Game From One Prompt
Open-source agentic framework that turns a single prompt into a playable HTML5 game. CUHK MMLab ships GameCoder-27B, a Game Skill memory and the OpenGame-Bench eval suite together — full-stack open.

A Whole Game From One Sentence
Type "a 2D side-scroller where a fox collects mushrooms" and walk away — what falls out is a real, playable HTML5 game you can drop on itch.io. That's the pitch of leigest519/OpenGame, and the differentiator is that the team also shipped a 150-prompt evaluation suite (OpenGame-Bench) that scores the output across build health, visual usability, and intent alignment. It's not a demo. It's a benchmarked, reproducible, full-stack open release.
The first commit landed April 21, the project crossed onto GitHub Trending Python the week of April 28, and it's running ~280 stars/day to ~4,200 total at our snapshot time (estimates from search snippets — direct GitHub counts vary by environment). The companion arXiv paper 2604.18394 makes this an unusually complete release: code, model, data, and evaluation in one drop.
Project Background — Who and Why
OpenGame is led by leigest519 at CUHK MMLab. The lab is best known for multimodal vision-language work (the Visual ChatGPT lineage), and OpenGame is its move into "agentic coding" with a deliberate domain choice. General coding agents (the SWE-Bench cohort) target function-level patches with crisp ground truth; "fun games" have no ground truth. OpenGame solves the scoring problem at the same time it solves the building problem.
Two macro trends meet here. First, the coding-agent market is moving from function-patch to whole-project building — Cursor 3, Claude Code, Codex all chase this. Second, evaluation is shifting from exact-match to LLM/VLM-as-judge. OpenGame stitches both into one OSS package and shows them working end to end.
The signature design choice is "Game Skill" memory. A scaffold that worked once gets stored in a Template Skill library; a verified bug fix gets logged into a Debug Skill protocol. The next prompt reuses both — the agent gets faster and more reliable across runs instead of starting over each time.
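The Template/Debug split can be pictured as a two-table store that later runs query before generating from scratch. A minimal sketch — all class and method names here are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of a Game Skill memory; names are illustrative,
# not OpenGame's actual API.
from dataclasses import dataclass, field


@dataclass
class SkillStore:
    templates: dict = field(default_factory=dict)  # genre -> working scaffold
    debug_log: list = field(default_factory=list)  # (symptom, verified fix)

    def save_template(self, genre: str, scaffold: str) -> None:
        self.templates[genre] = scaffold

    def reuse_template(self, genre: str):
        # A later prompt in the same genre starts from the stored scaffold
        return self.templates.get(genre)

    def log_fix(self, symptom: str, patch: str) -> None:
        self.debug_log.append((symptom, patch))

    def known_fixes(self, symptom: str) -> list:
        # Verified fixes are replayed before asking the model to re-debug
        return [p for s, p in self.debug_log if s == symptom]


store = SkillStore()
store.save_template("side-scroller", "// Phaser 3 scaffold ...")
store.log_fix("sprite not loading", "preload asset in Scene.preload()")
```

The point of the design is amortization: the second side-scroller prompt skips scaffolding entirely and starts from a known-good base.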
Core Capabilities — Game Skill + GameCoder-27B + OpenGame-Bench
Three coupled components: a multi-phase workflow (plan → scaffold → implement → debug → polish), the GameCoder-27B domain model, and the OpenGame-Bench evaluation pipeline. The workflow tree resembles SWE-Bench-style coding agents but adds a separate "polish" stage for fonts, sprites, and audio touch-ups that web games actually need.
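The multi-phase workflow amounts to a pipeline where each stage enriches a shared artifact. A toy sketch with stub phase functions (not OpenGame's implementation):

```python
# Five-phase pipeline sketch; phase functions here are placeholder stubs.
PHASES = ["plan", "scaffold", "implement", "debug", "polish"]


def run_pipeline(prompt: str, phase_fns: dict) -> dict:
    artifact = {"prompt": prompt}
    for phase in PHASES:
        # Each phase reads the artifact so far and adds its own output
        artifact = phase_fns[phase](artifact)
    return artifact


# Stubs that simply mark each phase as completed:
stubs = {p: (lambda a, p=p: {**a, p: "done"}) for p in PHASES}
result = run_pipeline("a fox collects mushrooms", stubs)
```

The dedicated "polish" stage is what distinguishes this from a generic SWE-agent loop: fonts, sprites, and audio get their own pass instead of being afterthoughts of the implement phase.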
GameCoder-27B is fine-tuned on game code (Phaser, Pygame, Godot scripting) corpora rather than general code. The 27B size is deliberate — it's the upper bound that runs on a single A100 with the included quantization recipe. The benchmarked gap to a Llama-3.1-70B base on the same workflow tells you the domain specialization is doing real work.
OpenGame-Bench scores 150 diverse prompts on three axes: build health (does it run, no console errors), visual usability (headless-browser screenshot judged by VLM), and intent alignment (semantic match between prompt and result). This goes beyond exact-match benches like the 2024 GameBench cohort.
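One way to picture the three axes: build health as a hard gate, with the two judge scores blended on top. The equal weights below are an assumption for illustration, not the bench's published formula:

```python
# Hypothetical aggregation of the three OpenGame-Bench axes;
# the 50/50 weighting is an assumption, not from the paper.
def bench_score(build_ok: bool, visual: float, intent: float) -> float:
    """build_ok gates everything; visual/intent are 0-1 judge scores."""
    if not build_ok:
        return 0.0  # a game that doesn't run scores zero regardless
    return 0.5 * visual + 0.5 * intent
```

The gate matters: a beautiful game that throws console errors on load scores zero, which is exactly the failure mode exact-match benches can't express.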
```shell
# Four lines to first game
git clone https://github.com/leigest519/OpenGame.git
cd OpenGame && make install
make run PROMPT='a 2D side-scroller where a fox collects mushrooms'
# Output: ./out/ contains an HTML/JS/asset bundle ready to play
```
Stack + Architecture
Python is the host, Phaser 3 the runtime — chosen so the eval step can run the artifact straight in headless Chrome with no separate build/deploy. Docker locks the environment and removes the "works on my machine" failure mode. The VLM judge is pluggable — you can swap GPT-4V for Qwen-VL and stay fully open-source if you want.
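"Pluggable" here plausibly means the judge sits behind a small interface so backends swap without touching the eval loop. A sketch under that assumption — the `Protocol` and class names are hypothetical:

```python
# Sketch of a pluggable VLM-judge interface; all names are hypothetical.
from typing import Protocol


class VisualJudge(Protocol):
    def score(self, screenshot_png: bytes, prompt: str) -> float: ...


class StubJudge:
    """Stand-in for a real VLM backend (GPT-4V, Qwen-VL, ...)."""

    def score(self, screenshot_png: bytes, prompt: str) -> float:
        return 0.9  # a real judge would rate the screenshot against the prompt


def evaluate(judge: VisualJudge, screenshot: bytes, prompt: str) -> float:
    # The eval loop only sees the interface, never the backend
    return judge.score(screenshot, prompt)


score = evaluate(StubJudge(), b"\x89PNG...", "fox side-scroller")
```

Keeping the judge behind an interface is what lets a fully open-source stack (Qwen-VL) and a proprietary one (GPT-4V) produce comparable bench runs.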
Compete Set
| Project | Role | Model | Bench in repo? | License |
|---|---|---|---|---|
| leigest519/OpenGame | Whole-game agent | GameCoder-27B (domain) | OpenGame-Bench 150 prompts | Apache-2.0 |
| microsoft/autogen | General multi-agent fw | BYO LLM | None | MIT |
| phaserjs/phaser-ai-templates | Phaser snippets | External LLM | None | MIT |
| cline/cline (prior incumbent) | VS Code IDE agent | BYO | None | Apache-2.0 |
OpenGame's wedge is "domain-specialized model + bench in the box." Autogen is generic and needs a lot of glue for game work; Phaser AI templates are LLM-friendly snippets, not agents.
Why Now — Ecosystem Context
Three currents converge. One, the same week brought reports that Cursor is in talks at a $50B valuation — a clear signal that agentic coding is now an industrial category, not a feature. Two, GPT-5.5's SWE-Bench 75% number means function-level accuracy is treated as solved, and attention is moving to "whole project" — exactly OpenGame's lane. Three, eval automation is the hottest infra topic of 2026; LLM-as-judge papers exploded this spring, and OpenGame goes a step further by shipping a working VLM-as-judge pipeline.
The most upvoted HN comment on the launch read: "An indie dev can prototype 30 games a month with this." Twelve OpenGame-built prototypes hit itch.io in the last week of April, which is the cleanest possible proof.
Getting Started + Pitfalls
Two pitfalls to know. The 27B weights are ~54GB; the 4-bit quantization recipe brings that down to ~14GB, and GitHub Issues reports clean runs on 32GB M2 Pro Macs. The eval pipeline launches headless Chrome plus VLM calls and spends about five minutes on the first-run container build — budget for it.
```shell
# 4-bit quantized weights
make install QUANT=4bit
# Run only the eval pipeline
make bench PROMPT_SET=opengame_bench_v1
```
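The quoted weight sizes line up with simple byte math — fp16 is 2 bytes per parameter, 4-bit is 0.5, ignoring quantization overhead such as scales and zero-points:

```python
# Back-of-envelope check of the weight sizes quoted above.
params = 27e9                  # 27B parameters
fp16_gb = params * 2 / 1e9     # 2 bytes/param -> 54.0 GB
int4_gb = params * 0.5 / 1e9   # 0.5 bytes/param -> 13.5 GB (~14GB quoted)
```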
Limits and Outlook
Genre coverage is 2D-only today (side-scrollers, puzzles, shooters), and multiplayer/networking is essentially out of scope. The roadmap (RFC PR landed April 30) covers Three.js 3D and a 2-agent collaboration mode for H2 2026. Long-term, OpenGame's contribution is the template — domain-specialized coding agents will follow this pattern (specialized model + domain bench + multi-phase workflow) into notebooks, medical imaging, industrial dashboards.
Tomorrow Morning
Developers: git clone and run the quickstart. Drop the result on itch.io and you'll have a concrete sense of the current quality bar. Game designers: read all 150 OpenGame-Bench prompts and build your own 50-prompt vertical bench in your domain — that's the highest-leverage OSS contribution category of the year. Investors/founders: watch for GameCoder-70B (mentioned in the paper appendix) and the H2 Three.js extension as the next inflection points; track leigest519's commit cohort and release notes.
References
- Repo: https://github.com/leigest519/OpenGame
- Paper: https://arxiv.org/abs/2604.18394
- Phaser engine: https://phaser.io/
- GitHub Trending Python (weekly): https://github.com/trending/python?since=weekly
- HuggingFace LightEval (related eval infra): https://github.com/huggingface/lighteval