
RLM: Why Your Agent Should Program Against Documents, Not Read Them

Feb 14, 2026

10 min read


In my last post, I talked about the Bitter Lesson — how structure becomes obsolete as models improve, and why we should build for removal.

This post is about a new architecture that takes that principle seriously: Recursive Language Models (RLMs).

The idea is deceptively simple. Instead of stuffing documents into the context window and hoping the model doesn't lose track, RLMs treat documents as external environments that the model programs against.

The model never reads the document. It queries it.

And the results are striking: an 8B-parameter model using RLM comes close to GPT-5 on long-context tasks while being cheaper per query.


The Problem: Context Rot

We keep building bigger context windows. 128K. 1M. 10M tokens. But a larger window doesn't solve the core issue.

Anthropic defines context rot as the observation that as "the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases." But anyone who's used Claude Code for a long session or chatted with ChatGPT for hours knows the phenomenon is weirder than that. The model doesn't just forget things — it gets dumber. Reasoning quality degrades in ways that are hard to benchmark but easy to feel.

Current mitigations — RAG, summarization, chunking — all share the same assumption: retrieved or compressed text must eventually become tokens in the prompt. The model has to "see" everything it reasons about.

RLMs challenge that assumption entirely.

[Source: Alex Zhang's RLM blog post — alexzhang13.github.io/blog/2025/rlm]


The Model as Programmer, Not Reader

What if the model didn't have to read the document at all?

RLMs place the document inside a coding environment — a Python REPL — and let the model write programs to interact with it. The model never ingests the raw text. It writes code (grep for a keyword, slice out a section, iterate over chapters) and only the results of that code enter the context window.

Think of it as the difference between reading a database and querying a database.

Traditional LLMs read. RLMs query.

The recursive part: the model can spawn sub-agents (copies of itself) to process specific slices. Each sub-agent gets a manageable chunk, reasons over it within its own context window, and returns a result to the parent. The parent's context is never polluted with irrelevant information.

[Source: "Recursive Language Models" — arxiv.org/abs/2512.24601]


How It Actually Works

The system has three components:

  1. A context variable holding the document (potentially 10M+ tokens)
  2. An rlm_agent(query, context) function that delegates to sub-agents
  3. Standard Python libraries (json, re, numpy) pre-loaded in the REPL
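
A minimal sketch of how those pieces could be wired together (the names `context` and `rlm_agent` come from the paper; the stub behavior and the `make_repl_env` helper are illustrative assumptions, not the authors' implementation):

```python
import json
import re

def rlm_agent(query: str, context: str) -> str:
    """Stub for the recursive call. In a real RLM this spawns a fresh
    LLM whose own context window holds only `query` plus the (small)
    slice of `context` it was handed."""
    return f"processed {len(context)} chars for: {query!r}"

def make_repl_env(document: str) -> dict:
    """The namespace the root model's code executes in: the document as
    a variable, the recursive call, and pre-loaded stdlib modules."""
    return {"context": document, "rlm_agent": rlm_agent, "json": json, "re": re}

env = make_repl_env("x" * 10_000)
# The root model is told only that the variable exists, and how big it is:
print(f"context variable exists, size: {len(env['context'])} chars")
```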

The root LLM starts with just the query and an indication that the context exists. It doesn't see the document — only knows its size.

From there, the model writes code, executes it, observes results, and iterates. When confident it has an answer, it outputs FINAL(answer).
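
A toy version of that loop (assumed control flow, not the paper's implementation; `model_step` stands in for the LLM call):

```python
def run_root(model_step, env, max_turns=8):
    """Drive the root model: each turn it emits either a Python
    expression to run in the REPL or a FINAL(...) answer. Only the
    code and its result ever re-enter the model's context."""
    transcript = []
    for _ in range(max_turns):
        action = model_step(transcript)
        if action.startswith("FINAL(") and action.endswith(")"):
            return action[len("FINAL("):-1]
        try:
            result = repr(eval(action, env))   # execute inside the REPL
        except Exception as exc:
            result = f"error: {exc}"
        transcript.append((action, result))    # feed back the result, not the doc
    return None

# Scripted stand-in for the LLM: peek at the size, then answer.
env = {"context": "a" * 50}
steps = iter(["len(context)", "FINAL(50)"])
print(run_root(lambda transcript: next(steps), env))
```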

Here's what the context window actually contains:

┌─────────────────────────────────────────────────┐
│ System Prompt                                   │
│ User Query                                      │
│ "context variable exists, size: 10M tokens"     │
│                                                 │
│ [Model writes]: context[:2000]  # peek          │
│ [Result]: first 2000 chars                      │
│                                                 │
│ [Model writes]: grep for "user_id: 67144"       │
│ [Result]: 47 matching lines                     │
│                                                 │
│ [Model writes]: rlm_agent(sub_query, filtered)  │
│ [Sub-agent result]: "Count is 23"               │
│                                                 │
│ FINAL(23)                                       │
└─────────────────────────────────────────────────┘

The root LLM's context stays clean. It only sees query + code + execution results. The 10M token document never enters the context window.

[Source: Alex Zhang's RLM blog — Figure 3]


But Isn't This Just RAG?

Fair question. The distinction is architectural.

RAG: Retrieved chunks get injected into the prompt. The model reads them directly. The document (or parts of it) becomes tokens.

RLM: The document stays inside the REPL. The model writes code to extract only what it needs. Only those extracted results enter the context. The document is never read wholesale.

As co-author Alex Zhang puts it: "It's not the sub-agent having access to a grepper that matters, it's that the sub-agent is called from and communicates inside of the REPL."

RLM vs Standard Coding Agents: Both combine LLMs with code execution. But in typical agent frameworks, the model calls sub-agents as independent tools. The REPL and the sub-agent are separate. In an RLM, the sub-agent is a function inside the REPL. The parent writes an algorithm, calls rlm_agent() as part of that algorithm, and results flow back into program execution.

RLM vs Simple grep: Grep is one operation an RLM might write. The power is in composition — the model can write arbitrary programs that combine search, filtering, aggregation, and recursive delegation.
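
A hypothetical composed program over a synthetic log (the record format and field names are invented for illustration): search, filter, and aggregate chained together, which no single grep call could express:

```python
import re

# Synthetic 'context': order records, one per line (invented format).
context = "\n".join(
    f"order id={i} region={'eu' if i % 2 else 'us'} total={i * 10}"
    for i in range(1, 9)
)

# Search -> filter -> aggregate, composed in one program. Only the
# final number would enter the root model's context window.
eu_lines = [line for line in context.splitlines() if "region=eu" in line]
totals = [int(re.search(r"total=(\d+)", line).group(1)) for line in eu_lines]
print(sum(totals))
```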

[Source: Elvis Saravia's thread on RLM — @omarsar0]


Emergent Strategies

Here's what makes RLM interesting: the authors didn't hand-design decomposition strategies. They gave the model a REPL with a recursive call function and observed what emerged.

The model independently discovered:

Peeking — examining the first few thousand characters to understand document structure before doing anything else.

Grepping — writing regex to narrow down relevant lines from massive contexts.

Partition + Map — chunking the context into pieces and recursively processing each one with sub-agents.

Programmatic processing — for structured tasks like tracking git diffs, the model writes a complete program to solve the task in one shot rather than reasoning line by line.
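
The partition + map pattern can be sketched like this (the sub-agent is stubbed as a keyword counter; in a real RLM it would be a fresh LLM call over just its chunk):

```python
def rlm_agent(query: str, chunk: str) -> int:
    """Stub sub-agent. A real RLM would spawn a fresh model instance
    whose context holds only `query` plus this one chunk."""
    return sum(1 for line in chunk.splitlines() if "ERROR" in line)

# Synthetic long context: a log with scattered ERROR lines.
context = "\n".join(
    "ERROR disk full" if i % 7 == 0 else f"ok line {i}" for i in range(100)
)

# Partition by lines so no record straddles a chunk boundary,
# map each chunk to a sub-agent, then reduce in the parent.
lines = context.splitlines()
chunk_size = 25
chunks = ["\n".join(lines[i:i + chunk_size]) for i in range(0, len(lines), chunk_size)]
counts = [rlm_agent("How many ERROR lines?", chunk) for chunk in chunks]
print(sum(counts))
```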

This matters because the decomposition strategy is not prescribed. The model figures out how to interact with its context at inference time.

Most agent frameworks do task decomposition (breaking a complex problem into simpler sub-problems). RLMs additionally do context decomposition (breaking a large input into manageable pieces).

Standard agents decide what to do. RLMs also decide what to look at.

[Source: RLM paper — Section on emergent strategies]


The Results

Two benchmarks stand out.

OOLONG-Pairs — requires models to identify relationships across scattered statements in long documents. This is quadratic reasoning: connecting information from many different locations, not just finding a single needle.

Method             Score (132K context)
GPT-5              30.0
GPT-5-mini         22.2
RLM(GPT-5-mini)    64.0

RLM(GPT-5-mini) more than doubled GPT-5's performance — while being roughly the same cost per query.

BrowseComp-Plus — multi-hop reasoning across up to 1,000 documents, synthesizing information scattered across sources.

Method                 Accuracy (1,000 docs)
GPT-5                  0.0% (context limit)
GPT-5 (truncated)      15.0%
GPT-5 + BM25           70.5%
ReAct + GPT-5 + BM25   51.0%
RLM(GPT-5)             91.3%

At the 1,000-document scale, vanilla frontier models completely failed. RLM led by a wide margin.

The broader finding: RLM-Qwen3-8B (a post-trained 8B model) outperforms the base Qwen3-8B by 28.3% on average. That same 8B model approaches GPT-5 quality on three long-context benchmarks.

You can take a small open model, teach it to manage its own context recursively with minimal post-training, and get competitive with a frontier model on tasks where raw context window size usually dominates.

[Source: RLM paper — Figures 4a, 4b, 5]


How This Connects to Context Engineering

In my previous post, I discussed Tavily's approach to context engineering: don't propagate raw tool outputs forever. Distill them into reflections. Carry forward the insight, not the raw data.

RLM takes this further. The raw data never enters the context at all.

DeepAgents (LangChain's Claude Code-inspired framework) uses filesystem abstraction for context management — offloading large tool results to files, letting the agent read them back when needed.

RLM is the logical extreme: the document is the filesystem. The model interacts with it programmatically, never ingesting it directly.

The principle is the same: the bottleneck isn't the context window size. It's what you put in it.

[Source: Tavily blog — tavily.com/blog/research-en; LangChain DeepAgents — blog.langchain.com/context-management-for-deepagents]


The Bitter Lesson Connection

Rich Sutton's Bitter Lesson argues that general methods leveraging computation beat hand-crafted knowledge. RLMs embody this.

Traditional approaches to long-context require human decisions: how to chunk, what to retrieve, when to summarize. These decisions get baked into the architecture.

RLMs defer these decisions to the model. The decomposition strategy emerges at inference time based on the query and context. No prescribed workflow.

As Zhang notes: "The trajectory in which a language model chooses to interact with and recurse over its context is entirely learnable, and can be RL-ified in the same way that reasoning is currently trained for frontier models."

This is the Bitter Lesson applied to context management: don't hard-code the strategy. Let the model figure it out.

[Source: Rich Sutton's "The Bitter Lesson" — incompleteideas.net]


Current Limitations

The authors are transparent about what doesn't work yet.

The model must be a good coder. RLMs offload reasoning into code, which means the underlying model needs strong programming ability. Weaker models struggle to write effective REPL programs. Models with long internal reasoning traces sometimes burn through their output budget on "thinking" before producing any executable code.

Generalization is fragile. The recursive strategy doesn't transfer cleanly across model families. Prompts tuned for one model can behave unpredictably on another. The paper reports cases where a model attempted to spawn thousands of simultaneous sub-agents.

Sequential execution. Sub-agent calls block the parent; they don't run in parallel, so deep recursion gets slow. Parallelizing sub-agent calls is an obvious improvement but isn't implemented yet.

Only depth=1 tested. The root model can call sub-agents, but those sub-agents don't call further sub-agents. The architecture supports deeper recursion, but it hasn't been validated.

[Source: RLM blog — Limitations section]


When to Use What

Use Case                                                Best Approach
Quick Q&A over known document structure                 RAG
Semantic search across many documents                   RAG with good chunking
Complex reasoning requiring connecting scattered info   RLM
Tasks needing adaptive decomposition                    RLM
10M+ token contexts                                     RLM (only viable option)
Structured/programmatic tasks (diffs, counting)         RLM

RAG isn't dead. It's still the right choice for many use cases — especially when you have predictable query patterns and can optimize chunking offline.

But for tasks that require the model to figure out what to look at — not just what to do — RLM offers something RAG can't.


What This Means for Agent Builders

RLMs point toward a broader shift in how we think about context.

For years, the scaling story for long-context has been: make the window bigger, hope attention holds up, throw more compute at the problem.

RLMs suggest something more elegant. The right unit of scaling isn't the context window itself — it's the model's ability to decide what belongs in it.

Context management isn't a hardware constraint to engineer around. It's a capability the model can learn.

The question isn't "how do we fit more tokens in?" It's "can we train models to be selective about what they attend to?"

The early results suggest yes.

[Source: Alex Zhang's concluding remarks — "What We're Thinking Now & for the Future"]


The Takeaway

RLMs aren't just a clever inference trick. They're a bet on a different relationship between models and context.

Instead of asking models to read everything, we let them program against it. Instead of prescribing decomposition strategies, we let them emerge. Instead of fighting context rot with bigger windows, we avoid it entirely by keeping the window clean.

The model becomes the programmer. The document becomes the database.

And an 8B model starts competing with GPT-5 — not because it's smarter, but because it's selective about what it looks at.

That's a lesson worth learning.


Gopal Khadka