
GenericAgent — 100% Completion at 222k Tokens, Just 27.7% of Claude Code

arXiv 2604.17091 GenericAgent reaches 100% completion on Lifelong AgentBench with 222k input tokens — 27.7% of Claude Code, 15.5% of OpenClaw. A single principle (Context Information Density Maximization) unifies atomic tools, hierarchical memory, self-evolution, and context truncation.

GenericAgent paper results table — token efficiency comparison vs Claude Code and OpenClaw
Source: arXiv 2604.17091


Same job, 27.7% of the tokens. GenericAgent reaches 100% completion on Lifelong AgentBench using only 27.7% of Claude Code's input tokens and 15.5% of OpenClaw's. With token costs back in the spotlight this May, it is the strongest counterpunch yet.

In Plain English

Most agents have assumed bigger context = better. GenericAgent inverts that: maximize the information density inside the context and you can do more with less. A 30k-token context can be enough for a self-evolving agent; that's the headline.
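As a rough illustration (the metric below is a sketch of my own, not the paper's formal definition), "density" can be read as the fraction of the context budget that carries task-relevant tokens:

```python
# Hypothetical density metric: task-relevant tokens / total context tokens.
# This is an illustration of the idea, not the paper's definition of CIDM.
def context_density(relevant_tokens: int, total_tokens: int) -> float:
    """Fraction of the context budget carrying task-relevant information."""
    if total_tokens == 0:
        raise ValueError("context is empty")
    return relevant_tokens / total_tokens

# A 30k context holding 24k relevant tokens is far denser than a 200k
# context holding the same 24k padded with stale history.
dense = context_density(24_000, 30_000)    # 0.8
sparse = context_density(24_000, 200_000)  # 0.12
```

Under this reading, the two agents carry the same usable information, but the dense one pays for 30k tokens instead of 200k.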

Authors / Citation

Authors are the lsdefine GitHub maintainer group. arXiv ID 2604.17091, published April 21. Featured on Hugging Face Papers the same week and amplified by Mervin Praison's intro video.

Prior Limits

Two camps in self-evolving agent research: ① large context (100k+) with full history → better completion, ballooning costs; ② small context with external memory calls → cheaper but latency/consistency issues. Both treated context size as the primary variable; density was a side note.

Method

GenericAgent unifies four mechanisms under one principle, Context Information Density Maximization (CIDM):

  1. Atomic tools (9): A small toolkit that gives the LLM local-system control with near-zero token overhead per call.
  2. Hierarchical on-demand memory: Don't keep all history in context — retrieve only what the current step needs.
  3. Self-evolution: Crystallize each successful execution path into a reusable SOP/code in a personal skill tree.
  4. Context truncation: After a step, push unneeded history into delegated sub-agents and refresh the main context.
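The four mechanisms above can be sketched as one loop. Everything here (the `Agent` class, `retrieve`, `truncate`, `crystallize`, the character-count token proxy, the 30k budget) is an assumption for illustration, not the paper's actual implementation:

```python
# Hedged sketch of how the four CIDM mechanisms could compose per step.
# All names and structures are assumptions, not the paper's API.
from dataclasses import dataclass, field

MAX_CONTEXT_TOKENS = 30_000  # the paper's headline context budget


@dataclass
class Agent:
    skills: dict = field(default_factory=dict)   # self-evolution: crystallized SOPs
    memory: list = field(default_factory=list)   # hierarchical store (flat here)
    context: list = field(default_factory=list)  # working context

    def retrieve(self, query: str, k: int = 2) -> list:
        # On-demand memory: pull only entries relevant to the current step
        # (naive substring match stands in for real hierarchical retrieval).
        hits = [m for m in self.memory if query in m]
        return hits[:k]

    def truncate(self) -> None:
        # Context truncation: demote oldest entries to memory until the
        # working context fits the budget (char count as a crude token proxy).
        while sum(len(c) for c in self.context) > MAX_CONTEXT_TOKENS:
            self.memory.append(self.context.pop(0))  # demote, don't discard

    def crystallize(self, task: str, steps: list) -> None:
        # Self-evolution: store a successful execution path as a reusable skill.
        self.skills[task] = steps
```

In this sketch the atomic tools would sit behind whatever produces `steps`; the point is that retrieval, truncation, and crystallization all serve the same budget.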

Results

Model                Completion (Lifelong AgentBench)   Input tokens   Relative cost
GenericAgent         100%                               222k           1.00× (base)
Claude Code          100%                               802k           3.61×
OpenClaw             100%                               1,432k         6.45×
GPT-5.4 base agent   87%                                540k           2.43×
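The "Relative cost" column follows directly from the input-token counts; a quick check (numbers taken from the table above):

```python
# Recomputing the table's "Relative cost" column from the input-token counts.
tokens = {
    "GenericAgent": 222_000,
    "Claude Code": 802_000,
    "OpenClaw": 1_432_000,
    "GPT-5.4 base agent": 540_000,
}
base = tokens["GenericAgent"]
relative = {name: round(t / base, 2) for name, t in tokens.items()}
# 802k/222k ≈ 3.61, 1,432k/222k ≈ 6.45, 540k/222k ≈ 2.43
```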

Two takeaways. First, parity completion at a fraction of the token spend. Second, the heavier implication: a self-evolving agent can run in a roughly 30k-token context, undermining the "bigger model + bigger context" default that has been industry-standard for three years.

Why It Matters

Industrial implication: token costs are back in the discourse. Anthropic's Opus 4.7 reportedly uses 27% more tokens than Opus 4.6 on the same prompt (per HN/Reddit measurements); GenericAgent moves the opposite direction, using 73% fewer tokens on the same task. Theoretical implication: capability isn't governed by context size but by information density, which reframes RAG, agent, and tool-use design.

Critiques

Yann LeCun (AMI Labs CEO): "Lifelong AgentBench is one benchmark. Long tail will say more." — Real-world long-tail tasks need additional validation. The 9 atomic tools also need redesign per domain; the paper offers only partial guidance for generalization.

Self-evolution security is another concern — crystallizing dangerous code into the skill tree is plausible without strong sandboxing. v1 is light on it.

TL;DR

In a token-cost-sensitive era, the answer may not be "bigger context" but "denser context." GenericAgent is the first quantitative case for it.
