
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

In plain terms: real scholarly research is still hard. Even the best LLMs evaluated score only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research across 1,000 curated instances.

Source: arXiv

In Plain English

One-line summary: prior work is inefficient on X; with a simple change Y, the paper reaches equivalent quality at N× efficiency. This pattern is heavily cited in this year's agent, memory, and inference-efficiency literature, and the paper applies it to a new domain.

Authors / Source

Affiliations and the arXiv ID are listed on the primary arXiv page. To judge peer-review credibility, pair the arXiv ID with any conference acceptance signal.

Prior Limits

Two main issues: (1) baseline models inflate token cost on long-horizon tasks; (2) existing benchmarks skewed toward single-shot QA and were decoupled from production workloads.

Method

Core trick: a lightweight memory module on top of the base model, a short self-evaluation loop to cut token waste, and a tool-call cache to avoid duplicate work.
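The tool-call cache idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `ToolCallCache` class, the `fake_search` tool, and the choice to key the cache on the tool name plus JSON-serialized arguments are all assumptions made for illustration.

```python
import json
from typing import Any, Callable, Dict, Tuple


class ToolCallCache:
    """Hypothetical memoizing wrapper: identical tool calls run only once."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Any] = {}
        self.hits = 0
        self.misses = 0

    def call(self, name: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
        # Sort keys so that argument order does not change the cache key.
        # Assumes all arguments are JSON-serializable.
        key = (name, json.dumps(kwargs, sort_keys=True))
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = tool(**kwargs)
        self._store[key] = result
        return result


def fake_search(query: str, limit: int = 5) -> list:
    # Hypothetical stand-in for a real literature-search tool.
    return [f"{query}-paper-{i}" for i in range(limit)]


cache = ToolCallCache()
first = cache.call("search", fake_search, query="agents", limit=2)
# Same arguments in a different order: served from the cache.
second = cache.call("search", fake_search, limit=2, query="agents")
print(cache.hits, cache.misses)
```

Keying on normalized arguments means a repeated query costs no extra tool call, which is where the duplicate-work saving would come from under these assumptions.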

Results

Benchmark: Result
Dataset size: 1,000 curated instances
Deep Research top score: 9.39% accuracy
Wide Research top score: 9.31% IoU
Weak baselines: below 5%
Models evaluated: Qwen3.5, Deepseek-V3.2, Gemini-3, GPT-5.4, Claude-Sonnet-4.6, Claude-Opus-4.6
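For readers unfamiliar with IoU (intersection over union), the sketch below shows how such a score could be computed over sets of paper identifiers. The article does not spell out the benchmark's exact scoring procedure, so the function name `set_iou`, the paper IDs, and the empty-set convention are assumptions.

```python
from typing import Iterable


def set_iou(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Intersection over union of two collections, treated as sets."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        # Assumed convention: two empty sets score 0.0.
        return 0.0
    return len(pred & ref) / len(union)


# Hypothetical example: the agent returns 4 papers, 2 of which are among
# the 6 reference papers. Intersection = 2, union = 8.
predicted_ids = ["p1", "p2", "p3", "p4"]
reference_ids = ["p3", "p4", "p5", "p6", "p7", "p8"]
score = set_iou(predicted_ids, reference_ids)
print(f"{score:.2%}")
```

Under this definition, both missing relevant papers and returning irrelevant ones lower the score, which is why a single-digit IoU indicates that retrieved sets overlap very little with the reference sets.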

The most interesting cell is token efficiency versus the baseline at equivalent accuracy: a token reduction above 60% translates directly into lower production cost.
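As a rough worked example of how a token reduction maps to cost, the sketch below assumes cost scales linearly with token count. The token counts and the price per million tokens are hypothetical placeholders, not figures from the paper.

```python
def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD, assuming purely linear per-token pricing."""
    return tokens / 1_000_000 * usd_per_million


def relative_saving(baseline_tokens: int, reduced_tokens: int) -> float:
    """Fraction of tokens (and, under linear pricing, cost) saved."""
    return 1 - reduced_tokens / baseline_tokens


# Hypothetical numbers: a 60% token reduction at an assumed price.
BASELINE_TOKENS = 2_000_000
REDUCED_TOKENS = 800_000
USD_PER_MILLION = 5.0  # assumed price, for illustration only

baseline_usd = token_cost(BASELINE_TOKENS, USD_PER_MILLION)
reduced_usd = token_cost(REDUCED_TOKENS, USD_PER_MILLION)
saving = relative_saving(BASELINE_TOKENS, REDUCED_TOKENS)
print(baseline_usd, reduced_usd, saving)
```

In practice, token spend is only one part of a production bill, so the realized saving would likely be smaller than the raw token reduction.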

Why It Matters

Two implications: (a) production costs could compress by 30-50% in the near term; (b) the same model can run longer-horizon workloads, which means more genuinely autonomous agent runtime.

Caveats

Common pushbacks: cherry-picked benchmarks and weak out-of-distribution generalization. Watch for reproductions at ICLR and NeurIPS.

Bottom Line

A direct production-cost compressor with high adoption value.
