FlowBot — Teaching LLMs to Design Their Own Workflows with 'Textual Gradients'

NAS for LLMs
If you've been building LLM pipelines, you know the drill. You chain a few LLM calls together, sprinkle in some tool use, tweak prompts for a week, and call it a day. But here's the thing nobody likes admitting: we rarely know whether the pipeline structure itself is any good. Maybe you should have added a verification step. Maybe that summarization node is actually losing critical info. The design space is enormous, and we're all just guessing.
This is eerily similar to where deep learning was before Neural Architecture Search (NAS). Researchers used to hand-design CNN architectures (choosing layer counts, kernel sizes, skip connections) until NAS automated much of that search and surfaced architectures that hand design had not produced.
FlowBot, a new paper from Naver Search US and MIT, applies that same idea to LLM systems. Instead of hand-crafting pipelines, let the system learn its own workflow structure and optimize the internals simultaneously.
The Problem -- Why Hand-Crafted Pipelines Hit a Wall
Building an LLM-based system today looks something like this: you decide that incoming queries should first be classified, then routed to a retrieval step or a direct generation step, then passed through a final response generator. You write the prompts, pick the tools, and wire everything together manually.
The problem is combinatorial. Every node you add or remove, every edge you rewire, and every prompt tweak interacts with the others. Changing the structure might require completely different prompts. Optimizing prompts for one structure might make them useless for another. A person can realistically try maybe a dozen configurations by hand, while the space of possible structure-and-prompt combinations is far larger than that.
Previous work tackled pieces of this. DSPy automated prompt optimization but kept the workflow structure fixed, so you still had to define the graph yourself. AFlow explored workflow structures through code generation but didn't tightly couple structure search with prompt optimization. What was missing was a unified framework for optimizing both at once.
The Team
The paper comes from Hongyeon Yu and Young-Bum Kim at Naver Search US, working with Yoon Kim at MIT. It was posted to arXiv on April 29, 2026. Yoon Kim has been a consistent contributor to NLP research at MIT, with previous work on structured text generation. Naver Search US brings the applied search pipeline perspective -- these are people who deal with multi-stage retrieval and generation systems every day.
Bilevel Optimization -- Two Loops, One Framework
The core idea is bilevel optimization: two nested loops, each handling a different level of the problem.
The outer loop manages the workflow "sketch" -- the high-level graph structure. Which nodes exist, what type they are (LLM call, tool call, branch, merge), and how they connect. The outer loop proposes structural changes based on performance: maybe remove a redundant summarization step, or add a fact-checking node after retrieval. This is the architecture search level, directly analogous to NAS.
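To make the idea of a "sketch" concrete, here is a minimal Python illustration of how such a graph and one structural edit might be represented. The class names (`Node`, `Sketch`), the node-type labels, and the `remove_node` edit are assumptions made for this example, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

# Node types named in the text; the string labels are illustrative assumptions.
NODE_TYPES = {"llm", "tool", "branch", "merge"}

@dataclass
class Node:
    name: str
    kind: str          # one of NODE_TYPES
    prompt: str = ""   # left empty here; the inner loop would tune it

@dataclass
class Sketch:
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: set = field(default_factory=set)    # (src, dst) pairs

    def add_node(self, node):
        if node.kind not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.kind}")
        self.nodes[node.name] = node

    def add_edge(self, src, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be in the sketch")
        self.edges.add((src, dst))

    def remove_node(self, name):
        """Structural edit: drop a node and wire its predecessors to its successors."""
        preds = [s for s, d in self.edges if d == name]
        succs = [d for s, d in self.edges if s == name]
        self.edges = {(s, d) for s, d in self.edges if name not in (s, d)}
        self.edges |= {(p, q) for p in preds for q in succs}
        del self.nodes[name]

# Build retrieve -> summarize -> answer, then remove the summarization step.
sketch = Sketch()
for node_name, kind in [("retrieve", "tool"), ("summarize", "llm"), ("answer", "llm")]:
    sketch.add_node(Node(node_name, kind))
sketch.add_edge("retrieve", "summarize")
sketch.add_edge("summarize", "answer")
sketch.remove_node("summarize")
print(sorted(sketch.edges))  # [('retrieve', 'answer')]
```

Keeping prompts as plain fields on each node is one way to let a structural edit and a prompt edit operate on the same object without knowing about each other.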
The inner loop takes whatever sketch the outer loop provides and optimizes everything inside it. That means tuning each node's prompt, selecting which tools to use, adjusting parameters. The mechanism for this optimization is what makes FlowBot interesting: textual gradients.
The relationship between the loops matters. The inner loop gives the outer loop a fair evaluation of each structure -- if you don't optimize the internals, you can't tell whether a structure is bad or just badly configured. And the outer loop gives the inner loop a good structure to work with, so prompt optimization isn't wasted on a fundamentally broken pipeline.
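For readers who prefer notation, the nesting described above matches the generic form of a bilevel problem. The symbols here are assumptions introduced for this note, not the paper's notation: s is a sketch drawn from a space of candidate sketches, θ is the set of node-level settings (prompts, tool choices, parameters) valid for that sketch, and J is a task score measured on held-out examples.

```latex
\begin{aligned}
s^{*} &= \arg\max_{s \in \mathcal{S}} \; J\bigl(s,\ \theta^{*}(s)\bigr) \\
\text{where}\quad \theta^{*}(s) &= \arg\max_{\theta \in \Theta(s)} \; J(s,\ \theta)
\end{aligned}
```

The outer problem only ever sees a sketch through its tuned settings θ*(s), which is the "fair evaluation" point in symbols. In practice neither arg max is solved exactly; both are approximated by search.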
Think of it this way: the outer loop asks "what should we build?" and the inner loop asks "how should we tune what we built?" They alternate, and both improve.
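The alternation can be shown with a toy that needs no LLM at all. Everything in the block below is an assumption made for illustration: sketches are reduced to sets of node names, each node has a single integer setting, `toy_score` stands in for a real evaluation metric, and both loops use simple greedy search rather than whatever FlowBot actually does.

```python
# Toy stand-in for an evaluation metric (an assumption for this demo):
# useful nodes score best when their setting reaches 2; every node costs 1.
USEFUL = {"retrieve", "verify", "answer"}

def toy_score(sketch, params):
    total = 0
    for node in sketch:
        if node in USEFUL:
            total += 4 - abs(params[node] - 2)
        total -= 1
    return total

def inner_optimize(sketch, score, max_passes=10):
    """Inner loop: tune per-node settings for a fixed sketch (coordinate ascent)."""
    params = {node: 0 for node in sketch}
    best = score(sketch, params)
    for _ in range(max_passes):
        changed = False
        for node in sorted(sketch):
            for delta in (1, -1):
                trial = {**params, node: params[node] + delta}
                trial_score = score(sketch, trial)
                if trial_score > best:
                    params, best, changed = trial, trial_score, True
        if not changed:
            break
    return params, best

def outer_search(initial, candidate_nodes, score, rounds=5):
    """Outer loop: try one-node edits, judging each candidate by its tuned score."""
    sketch = frozenset(initial)
    params, best = inner_optimize(sketch, score)
    for _ in range(rounds):
        neighbors = [sketch | {n} for n in candidate_nodes if n not in sketch]
        neighbors += [sketch - {n} for n in sorted(sketch) if len(sketch) > 1]
        improved = False
        for cand in neighbors:
            cand_params, cand_score = inner_optimize(cand, score)
            if cand_score > best:
                sketch, params, best, improved = cand, cand_params, cand_score, True
        if not improved:
            break
    return sketch, params, best

sketch, params, best = outer_search(
    {"summarize", "answer"},
    ["retrieve", "verify", "summarize", "answer"],
    toy_score,
)
print(sorted(sketch), best)  # ['answer', 'retrieve', 'verify'] 9
```

The point to notice is that `outer_search` never compares untuned candidates: every structure is scored only after `inner_optimize` has run on it. Here the search adds the retrieval and verification nodes, then drops the summarization node that adds cost without improving the score.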
Textual Gradients -- Backpropagation in Natural Language
This is the clever part. In a normal neural network, backpropagation computes mathematical gradients from the loss function back through every layer. But LLM workflows aren't differentiable -- they take text in, produce text out, and might call external tools in between. You can't compute gradients through that.
FlowBot's workaround: replace mathematical gradients with natural language feedback, propagated layer by layer through the workflow, from the output back to the input.
Here's how it works. First, run the workflow on a task and record each node's inputs and outputs. Second, compare the final output to the ground truth and generate a natural language "loss", something like "the response ignored the second constraint in the question." Third, propagate this feedback backward through the nodes. At each node, an LLM looks at (a) the node's input and output and (b) the feedback from the next node, then decides how to modify the current node's prompt to address the issue.
The key insight is that this happens layer by layer, just like chain-rule backpropagation. The last node gets direct feedback from the output comparison. The second-to-last node gets feedback that's been filtered through the last node's perspective. And so on, all the way back to the first node. Each node only needs to make a local adjustment based on local information, but the chain of adjustments addresses global issues.
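The forward-then-backward pattern can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the workflow is a linear chain rather than a general graph, and `Step`, `forward`, `backward`, `echo_run`, and `stub_critic` are hypothetical names. In a real system the `run` and `critic` callables would be LLM calls; here they are string-manipulating stubs so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    prompt: str
    run: Callable[[str, str], str]  # (prompt, input_text) -> output_text

def forward(steps, text):
    """Run the chain and record each node's (input, output) pair."""
    trace = []
    for step in steps:
        output = step.run(step.prompt, text)
        trace.append((text, output))
        text = output
    return text, trace

def backward(steps, trace, loss_text, critic):
    """Walk the nodes last-to-first, rewriting prompts and passing feedback upstream."""
    feedback = loss_text
    for step, (node_input, node_output) in zip(reversed(steps), reversed(trace)):
        step.prompt, feedback = critic(step.prompt, node_input, node_output, feedback)
    return feedback

# Stubs standing in for LLM calls (assumptions for the demo).
def echo_run(tag):
    return lambda prompt, text: f"{text} -> {tag}"

def stub_critic(prompt, node_input, node_output, downstream_feedback):
    new_prompt = f"{prompt} [address: {downstream_feedback}]"
    upstream_feedback = f"upstream of ({downstream_feedback})"
    return new_prompt, upstream_feedback

steps = [
    Step("retrieve", "Find evidence.", echo_run("retrieved")),
    Step("answer", "Answer the question.", echo_run("answered")),
]
output, trace = forward(steps, "q")
backward(steps, trace, "ignored the second constraint", stub_critic)
for step in steps:
    print(step.name, "|", step.prompt)
```

After the backward pass, the last node's prompt carries the direct loss text, while the first node's prompt carries feedback that has already been rewritten once on its way upstream. That is the locality property described above: each node sees only its own trace plus its successor's feedback.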
Why is this better than just saying "the output was wrong, fix all the prompts"? Because global feedback is vague. If your five-node pipeline produces a wrong answer, which node caused it? Textual gradients localize the problem the same way backpropagation localizes which weights to adjust.
Results
Here's how FlowBot stacks up against baselines across six benchmarks:
| Benchmark | FlowBot | AFlow | ReAct (single call) | CoT (single call) |
|---|---|---|---|---|
| HotpotQA | Best | 2nd | 3rd | 4th |
| DROP | Best | 2nd | 4th | 3rd |
| MATH | Best | 2nd | 3rd | 4th |
| GSM8K | 2nd | Best | 3rd | 4th |
| FEVER | Best | 2nd | 3rd | 4th |
| TriviaQA | 2nd | Best | 4th | 3rd |
FlowBot beats AFlow on 4 out of 6 benchmarks. The wins are particularly strong on multi-step reasoning tasks like HotpotQA and DROP, which makes sense -- those are exactly the tasks where workflow structure matters most. On GSM8K and TriviaQA, where AFlow wins, the margins are small.
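Since the table reports ranks rather than raw scores, the head-to-head count is easy to check mechanically. The snippet below just transcribes the ranks from the table above into a dictionary (the variable names are my own) and tallies them.

```python
# Ranks transcribed from the table above (1 = best).
ranks = {
    "HotpotQA": {"FlowBot": 1, "AFlow": 2, "ReAct": 3, "CoT": 4},
    "DROP":     {"FlowBot": 1, "AFlow": 2, "ReAct": 4, "CoT": 3},
    "MATH":     {"FlowBot": 1, "AFlow": 2, "ReAct": 3, "CoT": 4},
    "GSM8K":    {"FlowBot": 2, "AFlow": 1, "ReAct": 3, "CoT": 4},
    "FEVER":    {"FlowBot": 1, "AFlow": 2, "ReAct": 3, "CoT": 4},
    "TriviaQA": {"FlowBot": 2, "AFlow": 1, "ReAct": 4, "CoT": 3},
}

# Benchmarks where FlowBot ranks ahead of AFlow.
flowbot_wins = [b for b, r in ranks.items() if r["FlowBot"] < r["AFlow"]]
print(len(flowbot_wins), flowbot_wins)

# Mean rank per method across the six benchmarks.
methods = ["FlowBot", "AFlow", "ReAct", "CoT"]
mean_rank = {m: sum(r[m] for r in ranks.values()) / len(ranks) for m in methods}
for method in methods:
    print(f"{method}: {mean_rank[method]:.2f}")
```

Ranks alone say nothing about how large each gap is, so the mean rank is only a rough summary; the margins themselves have to come from the paper's score tables.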
Compared to single-call baselines (ReAct, CoT), the gap is larger across the board: both learned-workflow methods rank ahead of both single-call baselines on all six benchmarks. This suggests that learning the workflow itself reaches a level of performance that prompting a single call did not match on these tasks.
Why This Matters -- The Step After DSPy and AFlow
The progression here is clear. DSPy said: "don't hand-write prompts, optimize them automatically." AFlow said: "don't just optimize prompts, search over workflow structures too." FlowBot says: "do both at once, in a mathematically grounded framework where they reinforce each other."
The bilevel framing is what makes this more than incremental. Because the inner loop fairly evaluates each structure, the outer loop makes better structural decisions. Because the outer loop finds better structures, the inner loop's optimizations are more effective. It's a virtuous cycle that neither DSPy nor AFlow could achieve on their own.
Zooming out, this represents a shift in how LLM systems get built. We're moving from "engineer designs the pipeline" toward "system learns the pipeline." NAS didn't eliminate human architecture design entirely -- ResNets and Transformers are still human inventions. But it opened a new design methodology. FlowBot might do the same for LLM pipelines: start with automatic structure search, then have humans refine.
Limitations and Outlook
A few caveats. First, optimization cost -- running two nested loops with many LLM calls per iteration adds up. The paper doesn't deeply analyze token costs for real-world scenarios, and that matters for production use. Second, all six benchmarks are academic. How FlowBot performs on messy, real-world pipelines (customer support bots, code review systems, RAG for enterprise search) remains an open question. Third, textual gradient quality depends on the LLM generating the feedback. If the feedback model makes a wrong diagnosis, optimization goes sideways.
Still, the direction is compelling. As LLM systems grow more complex, manual pipeline design becomes an increasingly obvious bottleneck. FlowBot offers a principled starting point for automating that design process. If follow-up work addresses cost efficiency and validates on production workloads, this bilevel optimization approach could become a standard tool in the LLM engineering toolkit.
References
- Original paper: FlowBot: Bilevel Optimization for LLM Workflow Induction
- arXiv HTML full text: https://arxiv.org/html/2604.26258
- DSPy framework: https://github.com/stanfordnlp/dspy
- AFlow paper: AFlow: Automating Agentic Workflow Generation (2024)
