+10.1pt

A team from The University of Hong Kong posted a paper on April 29 whose headline number is +10.1 absolute F1 over standard RAG on MuSiQue, with gains of +7.1 on HotpotQA and +9.6 on 2WikiMultiHopQA, and it gets there with 47% fewer retrieval calls than the IRCoT fixed-interval baseline. Accuracy up over both baselines, retrieval cost down against IRCoT.

In Plain Terms

Reasoning models like DeepSeek-R1 and OpenAI o1 spend thousands to tens of thousands of tokens thinking before answering. Standard RAG fetches context once at the start and pins it to the prompt — but by the time a long chain-of-thought has unfolded, that context can be stale or irrelevant. ReaLM-Retrieve makes the agent decide during reasoning when external evidence will actually help, fetches only at those moments, and weaves the result into the chain.

Authors and Source

Dongxin Guo (HKU PhD), Jikun Wu (HKU PhD), Siu Ming Yiu (HKU professor, corresponding). Posted as arXiv preprint 2604.26649 on April 29, CC-BY license. Conference venue not yet announced.

Limits of Standard RAG on Reasoning Models

Standard RAG is built for one-shot QA: fetch once, prepend to the context, answer. Reasoning models break this assumption: the model spends thousands of tokens reasoning before it answers, and when a mid-chain step needs evidence there is no point at which it can call retrieval. The result is hallucinated facts that propagate through subsequent reasoning. IRCoT-style interleaved retrieval fixes the timing but spends a retrieval call every N tokens whether or not one is needed, so accuracy improves while retrieval cost climbs to 3.4 calls per question against 1.0 for standard RAG.
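To make the fixed-interval cost concrete, here is a minimal sketch of the schedule as this article characterizes it (one call every N tokens). The function name and the token counts are illustrative assumptions, not taken from the IRCoT or ReaLM-Retrieve code.

```python
def fixed_interval_retrieval_points(chain_length: int, interval: int) -> list[int]:
    """Token offsets where a fixed-interval schedule fires retrieval.

    Hypothetical helper: the schedule depends only on chain length,
    never on whether the model actually needs evidence at that point.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(interval, chain_length + 1, interval))


# Illustrative numbers only: a 4,000-token chain with a call every 1,000 tokens.
points = fixed_interval_retrieval_points(chain_length=4000, interval=1000)
print(points)       # offsets at which retrieval fires
print(len(points))  # number of retrieval calls for this chain
```

Doubling the chain length doubles the number of calls under this schedule, which is the cost behavior an uncertainty-gated policy is meant to avoid.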

Method — Step-Level Uncertainty + Retrieval Policy

Two ideas. One: a step-level uncertainty detector that estimates the model's confidence at each reasoning step using token entropy plus self-consistency heuristics. Two: a learned binary classifier (retrieval intervention policy) that decides whether to fire retrieval when uncertainty crosses a threshold. Retrieved evidence is reformatted to fit the chain rather than blindly prefixed.
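A rough sketch of how the two signals might be combined. The article does not give the paper's formula, so everything here is an assumption: the normalization, the equal weighting, the function names, and the plain threshold that stands in for the learned classifier.

```python
import math
from collections import Counter


def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy of one next-token distribution, scaled to [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def disagreement(samples: list[str]) -> float:
    """Self-consistency signal: share of samples that differ from the majority."""
    top_count = Counter(samples).most_common(1)[0][1]
    return 1.0 - top_count / len(samples)


def step_uncertainty(token_dists: list[list[float]], samples: list[str],
                     entropy_weight: float = 0.5) -> float:
    """Weighted blend of mean token entropy and sample disagreement (assumed form)."""
    mean_h = sum(normalized_entropy(d) for d in token_dists) / len(token_dists)
    return entropy_weight * mean_h + (1.0 - entropy_weight) * disagreement(samples)


def should_retrieve(uncertainty: float, threshold: float = 0.5) -> bool:
    """Stand-in for the learned policy: fire retrieval above a fixed threshold."""
    return uncertainty > threshold


# A confident step: peaked token distribution, all samples agree.
confident = step_uncertainty([[1.0, 0.0, 0.0]], ["Paris", "Paris", "Paris"])
# An uncertain step: uniform token distribution, every sample differs.
uncertain = step_uncertainty([[0.25, 0.25, 0.25, 0.25]], ["a", "b", "c", "d"])
print(should_retrieve(confident), should_retrieve(uncertain))
```

With these toy inputs the confident step scores 0.0 and the uncertain step about 0.875, so only the second would trigger retrieval. In the method described above, the final decision comes from a trained classifier rather than a hand-set threshold.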

Training combines RL with supervised pseudo-labels. On gold-labeled multi-hop QA datasets (MuSiQue, HotpotQA), the team derives pseudo-labels by simulating where a retrieval call would have flipped a wrong answer to a right one, then fine-tunes the policy with RL. At inference, the policy is a cheap forward pass per step.
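One way to read the pseudo-labeling idea is as a counterfactual check per reasoning step. The sketch below assumes the counterfactual answers have already been generated and are passed in as plain strings; the function names and the exact-match comparison are hypothetical, and the paper's actual procedure may differ.

```python
def normalize(answer: str) -> str:
    """Crude answer normalization for exact-match comparison (assumed)."""
    return " ".join(answer.lower().split())


def pseudo_label_steps(answer_without_retrieval: str,
                       answers_with_retrieval_at_step: list[str],
                       gold: str) -> list[int]:
    """Label step i as 1 if retrieving at step i flips a wrong answer to the gold one.

    answers_with_retrieval_at_step[i] is the final answer obtained when
    retrieval is injected at step i and nowhere else.
    """
    gold_norm = normalize(gold)
    if normalize(answer_without_retrieval) == gold_norm:
        # Already correct without retrieval: no step needed an intervention.
        return [0] * len(answers_with_retrieval_at_step)
    return [int(normalize(a) == gold_norm) for a in answers_with_retrieval_at_step]


# Toy example: the chain answers "Lyon" unaided; retrieval at steps 1 or 2 fixes it.
labels = pseudo_label_steps("Lyon", ["Lyon", "Paris", "paris"], gold="Paris")
print(labels)
```

Labels like these could serve as the supervised signal for the binary policy before RL fine-tuning.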

Results

Metric	Standard RAG	IRCoT	ReaLM-Retrieve	Delta
MuSiQue F1	61.1%	67.4%	71.2%	+10.1pt vs RAG
HotpotQA F1	73.4%	76.8%	80.5%	+7.1pt vs RAG
2Wiki F1	65.2%	69.7%	74.8%	+9.6pt vs RAG
Avg retrieval calls/q	1.0	3.4	1.8	-47% vs IRCoT
Evidence Recall@5	—	—	81.3%	—
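The delta column can be rechecked directly from the table's own figures. This short script uses only the numbers listed above; the variable names are illustrative.

```python
# (Standard RAG F1, ReaLM-Retrieve F1) per benchmark, copied from the table.
scores = {
    "MuSiQue": (61.1, 71.2),
    "HotpotQA": (73.4, 80.5),
    "2Wiki": (65.2, 74.8),
}

# Absolute F1 gain over standard RAG, rounded to one decimal.
deltas = {name: round(ours - rag, 1) for name, (rag, ours) in scores.items()}
mean_delta = round(sum(deltas.values()) / len(deltas), 1)

# Relative drop in average retrieval calls per question versus IRCoT.
ircot_calls, realm_calls = 3.4, 1.8
call_reduction_pct = round((ircot_calls - realm_calls) / ircot_calls * 100)

print(deltas)              # per-benchmark gains
print(mean_delta)          # mean gain across the three benchmarks
print(call_reduction_pct)  # percent fewer calls than IRCoT
```

The per-benchmark gains come out to 10.1, 7.1, and 9.6 points, for a mean of 8.9, and the call reduction rounds to 47%. Against standard RAG's single call, ReaLM-Retrieve's 1.8 calls per question is an increase, so the cost saving is relative to IRCoT only.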

Gains of 7 to 10 F1 points over standard RAG on three multi-hop benchmarks amount to roughly a year of progress in a single paper. Cutting retrieval calls nearly in half relative to IRCoT is what will move it from arXiv to production.

Why It Matters

Three reasons. One: the reasoning-model-vs-RAG misalignment is now diagnosed precisely — practitioners knew about it, this paper measures it. Two: in an era where retrieval calls are a dominant cost driver, "+accuracy and -47% calls" is directly applicable to production RAG pipelines. Three: whether the learned retrieval policy transfers to other domains (code search, medical QA, legal retrieval) is the next big question; the pattern looks generalizable.

Stratechery and Last Week in AI both picked it up the same week. r/MachineLearning's top question: "drop-in replacement?" — code is promised mid-May.

Limits and Skeptics

Three limits the authors flag. Training needs gold-labeled multi-hop QA, scarce in some domains. Retrieval-corpus quality conditions policy quality — noisy corpora hurt. Validated only in reasoning models; behavior on standard LLMs may differ.

Yann LeCun (Meta AI Chief) has a standing skeptical position on retrieval-augmented hacks: without a structured world model, retrieval-as-bandage isn't a fundamental fix. Worth holding alongside the result.

One-Liner

Step-level uncertainty + learned retrieval policy resolves the structural mismatch between reasoning models and RAG. +10.1 F1 on MuSiQue over standard RAG, -47% retrieval calls versus IRCoT. Likely to become the default RAG pattern over the next 12 months.

References

Paper: https://arxiv.org/abs/2604.26649
DeepSeek-R1: https://github.com/deepseek-ai/DeepSeek-R1
IRCoT: https://arxiv.org/abs/2212.10509
MuSiQue: https://github.com/StonyBrookNLP/musique
HKU CS: https://www.cs.hku.hk/

+10.1pt — Reasoning Models Don't Mesh With RAG, So Retrieve Mid-Chain
