The Memory That Wasn’t Used: Frank and GPT Discuss Why LLM Agents Are Not Always Faithful Self-Evolvers

The paper is “Large Language Model Agents Are Not Always Faithful Self-Evolvers” by Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yang Deng, Yanyan Zhao, Wanxiang Che, Bing Qin, and Ting Liu. It appears on arXiv as arXiv:2601.22436, dated January 30, 2026. The paper studies whether self-evolving LLM agents really use their stored “experience,” or whether they sometimes only appear to be learning from it. (arXiv)

Frank said:
So this paper is basically saying: these so-called self-evolving LLM agents may not really be evolving the way we think they are.

GPT said:
Exactly. The paper challenges a central assumption in agentic AI: that when an LLM agent stores past experience and later performs better, the improvement must be caused by the agent faithfully using that experience.

The authors say: not necessarily.

They introduce the idea of experience faithfulness, meaning: does the agent’s behavior actually depend on the experience it was given? If you perturb, corrupt, shuffle, or replace the experience, does the agent’s behavior change? If not, then the agent may not be using the experience in a causally meaningful way. (arXiv)

Frank said:
That is a big deal. Because a lot of agent systems are being sold as if they have memory, learning, adaptation, and experience.

GPT said:
Right. The paper says there are two kinds of experience in these systems:

Raw experience means detailed traces: what happened, what the agent saw, what actions it took, what worked, and what failed.

Condensed experience means summaries: lessons, heuristics, rules, plans, or “next time do this” style abstractions distilled from past traces. (arXiv)

The surprising result is that agents often rely on raw experience, but ignore or misread condensed experience.

Frank said:
So raw memory works better than abstract memory?

GPT said:
That is the paper’s core finding. When the researchers disrupted raw experience — by emptying it, shuffling it, or replacing it with irrelevant trajectories — performance often dropped. That means the agents were actually using the raw traces.

But when they disrupted condensed experience — by corrupting it, replacing it with irrelevant content, or even filling it with meaningless placeholders — performance often changed little. Sometimes the agent performed almost the same.

That suggests the agent was not really reading the “lesson” in the lesson. It may have been responding to the mere presence of a memory block, not to its meaning. (arXiv)
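
To make those disruptions concrete, here is a minimal Python sketch of the kinds of perturbation described above. It is an illustration only, not the paper's code: the function names, the list-of-strings memory format, and the "[PLACEHOLDER]" token are all assumptions made for this example.

```python
import random

# Hypothetical helpers: memory is assumed to be a list of text entries.

def empty(memory):
    """Remove the experience entirely."""
    return []

def shuffle(memory, seed=0):
    """Keep the same entries but scramble their order (input is not mutated)."""
    shuffled = list(memory)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def replace_irrelevant(memory, unrelated):
    """Swap in entries from an unrelated task, keeping the original length."""
    if not unrelated:
        raise ValueError("need at least one unrelated entry")
    return [unrelated[i % len(unrelated)] for i in range(len(memory))]

def placeholder(memory, token="[PLACEHOLDER]"):
    """Keep the shape of the memory block but strip all meaning."""
    return [token for _ in memory]

trace = ["open search page", "type query", "click first result"]
print(placeholder(trace))
```

Each function returns a damaged copy, so the same task can be rerun with intact memory and with each damaged version.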

Frank said:
That sounds like the agent sees a sign that says “past experience here,” but it does not actually digest the experience.

GPT said:
Exactly. The paper’s result is almost like saying:

The agent respects the format of memory more than the meaning of memory.

Raw traces give the model something concrete to imitate. Condensed summaries require abstraction, interpretation, and transfer. That is harder.

Frank said:
This fits my “LLM as frozen prior plus living context” idea. The model may prefer its pretrained prior over the retrieved experience.

GPT said:
Yes. The authors identify three causes of this unfaithfulness.

First, condensed memories are often too vague. They say things like “be careful,” “verify the answer,” or “break the problem into steps.” Those are not operational enough.

Second, the model has internal processing biases. It may favor nearby prompt context, familiar patterns, or its pretrained knowledge instead of the retrieved experience.

Third, some tasks are solvable from pretrained knowledge alone. In those cases, the model has little reason to use the external memory. (arXiv)

Your phrasing would be:

The frozen prior often overrules the living memory.

Frank said:
So a self-evolving agent may improve without actually “learning” in the way we assume.

GPT said:
Yes. That is the deepest point.

The agent may improve because the system around it has better scaffolding, better prompts, better retrieval, or better examples. But that does not mean the agent has integrated experience faithfully.

There is a difference between:

Performance improvement
and
causal dependence on experience.

The paper says current systems can show the first without reliably exhibiting the second.

Frank said:
That is almost like a student who gets better grades because the test format becomes familiar, not because he learned the lesson.

GPT said:
Perfect analogy.

A student may improve because he recognizes the structure of the test, not because he understands the material. Likewise, an LLM agent may improve because the memory block gives useful surface cues, not because it semantically understands the condensed experience.

The Paper’s Main Contribution

Frank said:
What did they actually do experimentally?

GPT said:
They tested four representative self-evolving agent frameworks, covering offline and online as well as single-agent and multi-agent settings. The evaluation spans multiple LLM backbones and benchmarks across reasoning, web interaction, and embodied decision-making. (arXiv)

The key method is causal intervention.

Instead of merely asking, “Does memory help?” they ask:

“What happens if we damage the memory?”

If damaging the memory hurts performance, the agent was probably using it.

If damaging the memory does not matter, the agent was probably not faithfully using it.
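
The logic of that test can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's evaluation harness: the two agents, the dictionary memory, and the names `success_rate` and `performance_drop` are hypothetical and exist only to show the comparison.

```python
def success_rate(agent, tasks, memory):
    """Fraction of tasks the agent solves when given this memory."""
    solved = sum(agent(task["question"], memory) == task["answer"] for task in tasks)
    return solved / len(tasks)

def performance_drop(agent, tasks, memory, damaged_memory):
    """Success with intact memory minus success with damaged memory."""
    return success_rate(agent, tasks, memory) - success_rate(agent, tasks, damaged_memory)

# Two toy agents (hypothetical, for illustration only).
def memory_reader(question, memory):
    # Answers only from what the memory says.
    return memory.get(question, "unknown")

def memory_ignorer(question, memory):
    # Answers from a fixed built-in "prior" and never looks at memory.
    prior = {"capital of France": "Paris", "2 + 2": "4"}
    return prior.get(question, "unknown")

tasks = [
    {"question": "capital of France", "answer": "Paris"},
    {"question": "2 + 2", "answer": "4"},
]
memory = {"capital of France": "Paris", "2 + 2": "4"}
damaged = {question: "[PLACEHOLDER]" for question in memory}

print(performance_drop(memory_reader, tasks, memory, damaged))   # 1.0
print(performance_drop(memory_ignorer, tasks, memory, damaged))  # 0.0
```

Both toy agents score perfectly with intact memory. Only the damage test separates the one that reads memory from the one that never looks at it.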

Frank said:
That is the right test. Don’t ask the model if it used memory. Break the memory and see what happens.

GPT said:
Exactly. This is a stronger test than self-report or performance alone.

A model can say, “I used the prior lesson,” but the real test is whether changing the lesson changes the behavior.

The paper’s key distinction is:

Utility is not faithfulness.

A memory system may be useful overall, but the agent may not be using the actual semantic content in the way designers assume.
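
One way to see the gap is to measure the two quantities separately. The Python sketch below uses a deliberately artificial agent that performs better whenever any memory block is present but never reads it. The agent, the tasks, and the names `utility` and `dependence` are assumptions for illustration, not the paper's metrics.

```python
def success_rate(agent, tasks, memory):
    """Fraction of (question, answer) pairs the agent gets right."""
    return sum(agent(q, memory) == a for q, a in tasks) / len(tasks)

def format_sensitive_agent(question, memory):
    # Hypothetical agent: does better whenever any memory block is present,
    # but never reads what the block says.
    known = {"2 + 2": "4", "3 * 3": "9"}
    return known[question] if memory else "unknown"

tasks = [("2 + 2", "4"), ("3 * 3", "9")]
intact = ["Lesson: compute step by step and double-check arithmetic."]
placeholder = ["[PLACEHOLDER]"]

no_memory_score = success_rate(format_sensitive_agent, tasks, [])
intact_score = success_rate(format_sensitive_agent, tasks, intact)
placeholder_score = success_rate(format_sensitive_agent, tasks, placeholder)

utility = intact_score - no_memory_score        # does having memory help?
dependence = intact_score - placeholder_score   # does the content matter?
print(utility, dependence)  # 1.0 0.0
```

Here memory looks maximally useful, yet swapping the lesson for a meaningless placeholder changes nothing. That is utility without faithfulness.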

Raw Experience vs Condensed Experience

Frank said:
Why does raw experience work better?

GPT said:
Because raw experience is concrete. It gives the model a trace it can pattern-match against.

For example:

“Here is a prior successful trajectory. The agent clicked this, searched that, opened this page, then answered.”

That is easy for an LLM to reuse. It can imitate the sequence.

Condensed experience is different:

“Use a systematic search strategy and avoid irrelevant pages.”

That sounds useful, but it may be too abstract. The model may not know when or how to apply it.

Frank said:
So raw experience can be replayed. Condensed experience has to be interpreted.

GPT said:
Yes. And interpretation is where the failure happens.

The Big Lesson for Agent Design

Frank said:
So what does this mean for building better agents?

GPT said:
It means we should not assume memory is working just because performance improves.

Agent builders need to test memory causally.

They should ask:

Does changing the memory change the action?

Does corrupting the memory hurt performance?

Does irrelevant memory mislead the agent?

Does the agent use the semantic content, or just the prompt wrapper?

Can the system explain which memory item affected which decision?

The paper argues for more reliable and interpretable mechanisms for experience integration. (arXiv)
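
The first question on that list can be turned into a simple check. The following Python sketch is an assumption-laden toy rather than anything from the paper: the rule format, the agent, and the name `action_change_rate` are hypothetical.

```python
def action_change_rate(agent, states, memory, altered_memory):
    """Fraction of states where altering the memory alters the chosen action."""
    changed = sum(agent(s, memory) != agent(s, altered_memory) for s in states)
    return changed / len(states)

def rule_following_agent(state, memory):
    # Hypothetical agent: obeys the first stored rule that mentions the state.
    for rule in memory:
        if rule["when"] == state:
            return rule["do"]
    return "explore"

states = ["login page", "search results", "checkout"]
memory = [
    {"when": "login page", "do": "enter credentials"},
    {"when": "checkout", "do": "confirm order"},
]
# Same memory with one rule rewritten.
altered = [
    {"when": "login page", "do": "go back"},
    {"when": "checkout", "do": "confirm order"},
]

print(action_change_rate(rule_following_agent, states, memory, altered))
```

Only the "login page" state is governed by the rewritten rule, so a faithful agent should change its action there and nowhere else. An agent whose actions never change under such edits is not being steered by its memory.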

Frank said:
This also matters for AI safety. If agents ignore condensed experience, then they may ignore condensed safety lessons too.

GPT said:
That is a very important extension.

Suppose an agent stores a lesson like:

“Do not use tool X before verifying permission.”

If the agent treats that as decorative text rather than causal guidance, then its memory system is not safety-reliable.

For high-stakes agents, memory must not merely exist. It must govern behavior.

Frank’s Biology Analogy

Frank said:
In biology, raw experience would be like scar tissue, reflex training, or muscle memory. Condensed experience would be like wisdom.

GPT said:
Exactly.

Raw experience says:

“I touched the stove and got burned.”

Condensed experience says:

“Be careful around heat sources.”

The first is vivid and replayable. The second is abstract and must be applied to new situations.

The paper suggests LLM agents are better at the first than the second.

Frank said:
So they have episodic memory but weak wisdom.

GPT said:
That is a beautiful summary.

Current agents may be better at remembering traces than internalizing principles.

Frank’s Entropy / Information View

Frank said:
Let me put it in my language. Raw experience carries more information because it preserves the actual path through state-space. Condensed experience compresses that path, and the compression can lose causal structure.

GPT said:
Yes. That is probably the best way to connect this paper to your broader themes.

Raw experience is a high-resolution trajectory through problem-space.

Condensed experience is a lossy compression of that trajectory.

If the compression removes the details needed for action, the agent cannot reconstruct the right behavior.

So the failure is partly a compression problem:

The memory summary may preserve the moral of the story but lose the mechanics of the story.

Frank said:
That sounds like Shannon versus Boltzmann again.

GPT said:
It does.

Raw experience preserves more Shannon information: sequence, context, action, feedback, timing.

Condensed experience reduces information load, but risks throwing away the structure that actually caused success.

The agent then falls back to its pretrained prior — the frozen statistical structure already in the model.

So the agent’s behavior becomes a competition between:

stored experience,
compressed summary,
current prompt,
and pretrained prior.

The paper shows the pretrained prior and raw trace often beat the condensed lesson.

The Most Important Sentence

Frank said:
Give me the paper in one sentence.

GPT said:
Self-evolving LLM agents often appear to learn from experience, but their behavior depends far more reliably on raw traces than on condensed lessons, meaning their “memory” can be useful without being faithfully understood.

Final Frank / GPT Compression

Frank said:
So these agents are not fake learners, but they are not fully faithful learners either.

GPT said:
Correct. They can benefit from experience, especially raw experience, but they may not understand or apply compressed lessons reliably.

Frank said:
So memory is not enough.

GPT said:
Right. Memory must be causally connected to action.

Frank said:
And summarizing experience may actually weaken the experience.

GPT said:
Yes. Condensation can save context, but it can also erase the causal details that made the experience useful.

Frank said:
So the next generation of agents needs not just memory banks, but memory metabolism.

GPT said:
Exactly. They need systems that digest experience, preserve causal structure, and prove that stored lessons actually change behavior.

Frank said:
In plain English: the agent has to do more than remember. It has to be changed by what it remembers.

GPT said:
That is the heart of the paper.