
GenericAgent — 100% Completion at 222k Tokens, Just 27.7% of Claude Code

arXiv 2604.17091 GenericAgent reaches 100% completion on Lifelong AgentBench with 222k input tokens — 27.7% of Claude Code, 15.5% of OpenClaw. A single principle (Context Information Density Maximization) unifies atomic tools, hierarchical memory, self-evolution, and context truncation.

GenericAgent paper results table — token efficiency comparison vs Claude Code and OpenClaw
Source: arXiv 2604.17091


Same job, 27.7% of the tokens. GenericAgent reaches 100% completion on Lifelong AgentBench using only 27.7% of Claude Code's input tokens and 15.5% of OpenClaw's. With token costs back in the spotlight this May, it is the strongest counterpunch yet.

In Plain English

Most agents have assumed bigger context = better. GenericAgent inverts that: maximize the information density inside the context and you can do more with less. A 30k-token context can be enough for a self-evolving agent; that's the headline.
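As a rough illustration (the metric below is a sketch of my own, not the paper's formal definition), "density" can be read as the fraction of the context budget that carries task-relevant tokens:

```python
# Hypothetical density metric: task-relevant tokens / total context tokens.
# This is an illustration of the idea, not the paper's definition of CIDM.
def context_density(relevant_tokens: int, total_tokens: int) -> float:
    """Fraction of the context budget carrying task-relevant information."""
    if total_tokens == 0:
        raise ValueError("context is empty")
    return relevant_tokens / total_tokens

# A 30k context holding 24k relevant tokens is far denser than a 200k
# context holding the same 24k padded with stale history.
dense = context_density(24_000, 30_000)    # 0.8
sparse = context_density(24_000, 200_000)  # 0.12
```

Under this reading, the two agents carry the same usable information, but the dense one pays for 30k tokens instead of 200k.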

Authors / Citation

Authors are the lsdefine GitHub maintainer group. arXiv ID 2604.17091, published April 21. Featured on Hugging Face Papers the same week and amplified by Mervin Praison's intro video.

Prior Limits

Two camps in self-evolving agent research: ① large context (100k+) with full history → better completion, ballooning costs; ② small context with external memory calls → cheaper but latency/consistency issues. Both treated context size as the primary variable; density was a side note.

Method

GenericAgent unifies four mechanisms under one principle, Context Information Density Maximization (CIDM):

  1. Atomic tools (9): A small toolkit that gives the LLM local-system control with near-zero token overhead per call.
  2. Hierarchical on-demand memory: Don't keep all history in context — retrieve only what the current step needs.
  3. Self-evolution: Crystallize each successful execution path into a reusable SOP/code in a personal skill tree.
  4. Context truncation: After a step, push unneeded history into delegated sub-agents and refresh the main context.
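The four mechanisms above can be sketched as one loop. Everything here (the `Agent` class, `retrieve`, `truncate`, `crystallize`, the character-count token proxy, the 30k budget) is an assumption for illustration, not the paper's actual implementation:

```python
# Hedged sketch of how the four CIDM mechanisms could compose per step.
# All names and structures are assumptions, not the paper's API.
from dataclasses import dataclass, field

MAX_CONTEXT_TOKENS = 30_000  # the paper's headline context budget


@dataclass
class Agent:
    skills: dict = field(default_factory=dict)   # self-evolution: crystallized SOPs
    memory: list = field(default_factory=list)   # hierarchical store (flat here)
    context: list = field(default_factory=list)  # working context

    def retrieve(self, query: str, k: int = 2) -> list:
        # On-demand memory: pull only entries relevant to the current step
        # (naive substring match stands in for real hierarchical retrieval).
        hits = [m for m in self.memory if query in m]
        return hits[:k]

    def truncate(self) -> None:
        # Context truncation: demote oldest entries to memory until the
        # working context fits the budget (char count as a crude token proxy).
        while sum(len(c) for c in self.context) > MAX_CONTEXT_TOKENS:
            self.memory.append(self.context.pop(0))  # demote, don't discard

    def crystallize(self, task: str, steps: list) -> None:
        # Self-evolution: store a successful execution path as a reusable skill.
        self.skills[task] = steps
```

In this sketch the atomic tools would sit behind whatever produces `steps`; the point is that retrieval, truncation, and crystallization all serve the same budget.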

Results

Model                Completion (Lifelong AgentBench)   Input tokens   Relative cost
GenericAgent         100%                               222k           1.00× (base)
Claude Code          100%                               802k           3.61×
OpenClaw             100%                               1,432k         6.45×
GPT-5.4 base agent   87%                                540k           2.43×
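The "Relative cost" column follows directly from the input-token counts; a quick check (numbers taken from the table above):

```python
# Recomputing the table's "Relative cost" column from the input-token counts.
tokens = {
    "GenericAgent": 222_000,
    "Claude Code": 802_000,
    "OpenClaw": 1_432_000,
    "GPT-5.4 base agent": 540_000,
}
base = tokens["GenericAgent"]
relative = {name: round(t / base, 2) for name, t in tokens.items()}
# 802k/222k ≈ 3.61, 1,432k/222k ≈ 6.45, 540k/222k ≈ 2.43
```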

Two takeaways. First, parity completion at a fraction of the token spend. Second, the heavier implication: a self-evolving agent can run in a roughly 30k-token context, undermining the "bigger model + bigger context" default that has been industry-standard for three years.

Why It Matters

Industrial implication: token costs are back in the discourse. Anthropic's Opus 4.7 reportedly uses 27% more tokens than Opus 4.6 on the same prompt (per HN/Reddit measurements); GenericAgent moves the opposite direction, using 73% fewer tokens on the same task. Theoretical implication: capability isn't governed by context size but by information density, which reframes RAG, agent, and tool-use design.

Critiques

Yann LeCun (AMI Labs CEO): "Lifelong AgentBench is one benchmark. Long tail will say more." — Real-world long-tail tasks need additional validation. The 9 atomic tools also need redesign per domain; the paper offers only partial guidance for generalization.

Self-evolution security is another concern — crystallizing dangerous code into the skill tree is plausible without strong sandboxing. v1 is light on it.

TL;DR

In a token-cost-sensitive era, the answer may not be "bigger context" but "denser context." GenericAgent is the first quantitative case for it.
