AutoMem: Automated Learning of Memory as a Cognitive Skill

Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang, Serena Yeung-Levy

Stanford University

AutoMem progression across scaffold optimization and memory training
Memory skill optimization with Qwen2.5-32B-Instruct. Starting from a base agent equipped with file-system memory (v0), AutoMem progressively improves performance through memory scaffold optimization (v0–v5 for Crafter, v0–v4 for MiniHack, v0–v2 for NetHack), followed by memory proficiency training (+train) that yields further gains on top of the optimized scaffold. Colored y-ticks mark frontier references (Gemini-3.1-Pro-Thinking, Claude-Opus-4.5).
Headline numbers: ~2×–4× progression gain over the base agent; a 32B open model reaches frontier level; memory only, with no task-action weights trained.

Abstract

AutoMem turns memory management into a trainable skill for LLM agents. Memory operations (read, write, search, append) live in the same action space as the agent's task actions, so the model itself decides what to store, when to retrieve, and how to organize what it knows. AutoMem automates the learning of this skill end-to-end with two meta-LLM loops: loop #1 optimizes the agent scaffold (memory structure), and loop #2 trains a dedicated memory specialist from the agent's own traces (memory proficiency). Optimizing memory alone, without ever touching the model's task-action behavior, yields ~2×–4× progression gains over the base agent, lifting the open-weight Qwen2.5-32B-Instruct to frontier-level performance on three long-horizon tasks: Crafter, MiniHack, and NetHack.

§The AutoMem Framework

AutoMem overview: two automated outer loops
Two automated outer loops optimize a shared inner-loop agent that uses the file system as its memory. At each step the inner agent runs two routines in a unified action space: a LOG routine (“what is worth recording about what just happened?”) that appends, creates, or rewrites memory files, and a PLAN routine (“what do I need to recall to act now?”) that searches and reads files before committing the next world action. Memory operations (<|SEARCH|>, <|APPEND|>, <|UPSERT_MAP|>…) sit in the same action space as task actions, so every memory decision is a traceable action the outer loops can observe and optimize.
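A minimal sketch of what a unified action space over a file-system memory could look like. This is an illustration under stated assumptions, not the AutoMem implementation: the `FileMemory` class, the `dispatch` function, the tuple action format, and the `<|READ|>` token are all hypothetical; only `<|SEARCH|>` and `<|APPEND|>` are token names taken from the text.

```python
import tempfile
from pathlib import Path

# <|SEARCH|> and <|APPEND|> appear in the text; <|READ|> is an assumed token.
SEARCH, APPEND, READ = "<|SEARCH|>", "<|APPEND|>", "<|READ|>"
MEMORY_OPS = {SEARCH, APPEND, READ}


class FileMemory:
    """Hypothetical file-system memory: plain-text files in one directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def append(self, name, text):
        with open(self.root / name, "a", encoding="utf-8") as f:
            f.write(text.rstrip("\n") + "\n")

    def read(self, name):
        path = self.root / name
        return path.read_text(encoding="utf-8") if path.is_file() else ""

    def search(self, query):
        """Case-insensitive substring search; returns (file, line) pairs."""
        hits = []
        for path in sorted(self.root.iterdir()):
            if path.is_file():
                for line in path.read_text(encoding="utf-8").splitlines():
                    if query.lower() in line.lower():
                        hits.append((path.name, line))
        return hits


def dispatch(memory, action, trace):
    """Execute one action from the unified action space and record it."""
    kind, *args = action
    if kind == SEARCH:
        result = memory.search(*args)
    elif kind == APPEND:
        result = memory.append(*args)
    elif kind == READ:
        result = memory.read(*args)
    else:
        result = None  # a world action; a real agent would send it to the game
    # Every decision, memory or world, lands in the same trace.
    trace.append({"action": kind, "args": args, "memory_op": kind in MEMORY_OPS})
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mem, trace = FileMemory(tmp), []
        dispatch(mem, (APPEND, "crafting.txt", "Wood pickaxe needs a table"), trace)
        print(dispatch(mem, (SEARCH, "pickaxe"), trace))
        dispatch(mem, ("move_north",), trace)
        print([t["memory_op"] for t in trace])
```

Because memory operations and world actions pass through the same `dispatch` call, the trace records both kinds of decision in order, which is the property the outer loops rely on.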

Outer-loop #1 (top) optimizes the memory structure: a meta-LLM reviews full episode traces and iteratively revises the agent scaffold — its code, prompts, file schema, and action vocabulary. Outer-loop #2 (bottom) optimizes the model's proficiency: a meta-LLM training engine jointly curates SFT data from the agent's own traces and matches a LoRA configuration to train a dedicated memory specialist. At inference, the trained memory specialist handles LOG and the consultation part of PLAN, while the unmodified base gameplay model commits the world action, so memory proficiency is sharpened without ever touching the task competence. Loop #1 sets the structural ceiling; loop #2 pushes the model capacity towards it.
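The inference-time division of labor can be sketched as a routing function. This is a hedged illustration only: `agent_step`, the prompt strings, and the `calls` log are hypothetical, and each model is stood in for by a plain function from prompt to completion.

```python
from typing import Callable, Dict, List

# A "model" here is any function from a prompt string to a completion string.
Model = Callable[[str], str]


def agent_step(observation: str, memory_specialist: Model, base_model: Model,
               calls: List[Dict[str, str]]) -> str:
    """One inner-loop step with specialist/base routing (names are hypothetical)."""
    # LOG: the specialist decides what to record about what just happened.
    log_ops = memory_specialist("LOG\n" + observation)
    calls.append({"phase": "LOG", "model": "specialist", "output": log_ops})

    # PLAN, consultation part: the specialist decides what to recall.
    recalled = memory_specialist("PLAN\n" + observation)
    calls.append({"phase": "PLAN-consult", "model": "specialist", "output": recalled})

    # PLAN, commit part: the unmodified base model picks the world action.
    action = base_model(observation + "\nRecalled: " + recalled)
    calls.append({"phase": "PLAN-act", "model": "base", "output": action})
    return action


if __name__ == "__main__":
    # Stub models, for illustration only.
    specialist = lambda prompt: "memory-op for " + prompt.split("\n")[0]
    base = lambda prompt: "move_north"
    log: List[Dict[str, str]] = []
    print(agent_step("You see a tree.", specialist, base, log))
    print([(c["phase"], c["model"]) for c in log])
```

In this sketch the base model never receives a LOG prompt and the specialist never emits the world action, which mirrors the separation between memory proficiency and task competence described above.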

§Qualitative Demos: Watch the Behavior Improve

We study three stochastic, procedurally-generated worlds: Crafter, an open-world survival game with crafting, combat, and resource management (22 achievements, ~10³ steps); MiniHack, 8 focused puzzle/navigation/combat tasks on the NetHack engine (~10² steps); and NetHack, among the hardest games, with 10⁴–10⁵-turn episodes regenerated for every seed.

Each demo replays the same seed at each stage — v0 base scaffold, the optimized scaffold, and the trained specialist — so the world is identical and only the agent differs. The chip on each panel reports that run's progression rate (defined per environment in the note below). Press Play to advance all panels in lockstep; toggle between the Visual game view and the raw Text the model reads; memory operations the agent issued appear under every frame.


§Quantitative Results

Agent                           Crafter (%)    MiniHack (%)   NetHack (%)
Frontier proprietary (BALROG leaderboard)
  Gemini-3-Pro                  57.3 ±4.4      40.0 ±7.7      6.8 ±3.2
  Gemini-3.1-Pro-Thinking       55.0 ±6.4      27.5 ±7.1      2.6 ±0.3
  Claude-Opus-4.5               49.5 ±3.1      27.5 ±7.1      2.0 ±0.5
  Gemini-2.5-Pro                55.0 ±6.0      17.5 ±6.0      1.7 ±0.2
Open-weight (BALROG leaderboard)
  DeepSeek-R1 (671B)            36.4 ±3.8      25.0 ±6.8      1.4 ±0.5
  Qwen2.5-72B-Instruct          27.3 ±3.6      5.0 ±3.4       0.3 ±0.3
  Qwen2.5-7B-Instruct           16.4 ±3.0      0.0 ±0.0       0.0 ±0.0
Qwen2.5-32B-Instruct with basic context management
  sliding window                19.55 ±3.46    2.50 ±2.47     0.00 ±0.00
  + chain-of-thought            17.27 ±2.71    10.00 ±4.74    0.00 ±0.00
Qwen2.5-32B-Instruct with AutoMem (ours)
  memory-as-file-system, v0     25.00 ±5.50    7.50 ±4.16     0.42 ±0.37
  + scaffold opt. (loop #1)     47.27 ±2.05    27.50 ±7.06    1.57 ±0.35
  + memory training (loop #2)   51.36 ±3.81    30.00 ±7.25    1.85 ±0.44

Progression (%), mean ± standard error over the fixed seed set (10 episodes for Crafter, 5 episodes × 8 tasks for MiniHack, 5 episodes for NetHack). Optimizing memory alone, without modifying task-action weights, roughly doubles progression on Crafter and more than triples it on MiniHack and NetHack; the full framework yields ~2×–4× gains over the v0 agent, approaching the level of frontier proprietary systems. "+ scaffold opt." is loop #1 at convergence (v5 for Crafter, v4 for MiniHack, v2 for NetHack); "+ memory training" adds loop #2's memory specialist on top, lifting Crafter to 51.36%, MiniHack to 30.00%, and NetHack to 1.85%.
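The gain factors quoted in the caption can be checked directly against the table. The sketch below recomputes them from the reported means and also shows one common way to compute a mean ± standard error from per-episode scores. The use of the sample standard deviation (n − 1) in `mean_and_se` is an assumption, since the page does not say which estimator was used, and the function and variable names are illustrative.

```python
import math
import statistics


def mean_and_se(scores):
    """Mean and standard error, assuming the sample (n - 1) standard deviation."""
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se


# Reported mean progression (%) for Qwen2.5-32B-Instruct, copied from the table.
V0 = {"Crafter": 25.00, "MiniHack": 7.50, "NetHack": 0.42}
SCAFFOLD = {"Crafter": 47.27, "MiniHack": 27.50, "NetHack": 1.57}
TRAINED = {"Crafter": 51.36, "MiniHack": 30.00, "NetHack": 1.85}


def gain(after, before):
    """Multiplicative gain of `after` over `before`, per environment."""
    return {env: round(after[env] / before[env], 2) for env in before}


if __name__ == "__main__":
    print(gain(SCAFFOLD, V0))  # {'Crafter': 1.89, 'MiniHack': 3.67, 'NetHack': 3.74}
    print(gain(TRAINED, V0))   # {'Crafter': 2.05, 'MiniHack': 4.0, 'NetHack': 4.4}
```

These ratios compare means only; with 5 to 10 episodes per environment, the standard errors in the table are large relative to some of the differences, particularly on NetHack.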

§Why Memory Optimization Works

Behavioral effect of scaffold optimization
Loop #1: scaffold optimization improves gameplay and memory behavior simultaneously (initial agent v0 vs. the converged scaffold). The unproductive action rate (steps that are stuck or oscillating) drops by 32–65%; redundant memory writes drop sharply (−68 to −83%); the empty-search rate (the share of <|SEARCH|> calls that return nothing) falls (−13 to −50%); and the per-step input context shrinks (−3 to −30%) as leaner memory compresses what the model must attend to. Lower is better in every panel.
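As an illustration of how one of these behavioral metrics could be derived from a trace, the sketch below computes an empty-search rate and a relative change between two agents. The trace format (a list of dicts with `op` and `hits` keys) and both function names are assumptions made for this example; the page does not specify how the traces are stored.

```python
def empty_search_rate(trace):
    """Fraction of <|SEARCH|> steps in a trace that returned no hits.

    `trace` is a hypothetical list of dicts such as
    {"op": "<|SEARCH|>", "hits": ["crafting.txt: ..."]}.
    """
    searches = [step for step in trace if step["op"] == "<|SEARCH|>"]
    if not searches:
        return 0.0
    empty = sum(1 for step in searches if not step["hits"])
    return empty / len(searches)


def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a drop)."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # Made-up traces, for illustration only.
    v0_trace = [
        {"op": "<|SEARCH|>", "hits": []},
        {"op": "<|SEARCH|>", "hits": ["map.txt: stairs at (3, 4)"]},
        {"op": "<|APPEND|>", "hits": None},
    ]
    tuned_trace = [
        {"op": "<|SEARCH|>", "hits": ["map.txt: stairs at (3, 4)"]},
        {"op": "<|SEARCH|>", "hits": []},
        {"op": "<|SEARCH|>", "hits": ["notes.txt: door is locked"]},
        {"op": "<|SEARCH|>", "hits": ["notes.txt: key in chest"]},
    ]
    before = empty_search_rate(v0_trace)
    after = empty_search_rate(tuned_trace)
    print(before, after, relative_change(before, after))
```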
LOG-phase writes / SEARCH
Environment   Base    + Trained
Crafter       0.84    0.39 (−54%)
MiniHack      2.89    0.82 (−72%)
NetHack       4.66    1.31 (−72%)
Loop #2: training internalizes a consult-before-write discipline. In the LOG phase, the trained memory specialist's ratio of memory writes per <|SEARCH|> falls in every environment (a lower ratio means more retrieval before writing): it searches existing memory before appending new content rather than logging blindly. This is the retrieval-first pattern the optimized scaffold encourages, now internalized into the model's weights.
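The writes-per-search ratio and the percentage reductions in the table can be reproduced with a few lines. This is a sketch under assumptions: which operations count as "writes" (here `<|APPEND|>` and `<|UPSERT_MAP|>`) and the flat list-of-tokens trace format are guesses made for illustration; the Base and + Trained values are copied from the table.

```python
# Assumed set of write operations; the page does not enumerate them.
WRITE_OPS = {"<|APPEND|>", "<|UPSERT_MAP|>"}


def writes_per_search(log_ops):
    """Ratio of write operations to <|SEARCH|> operations in LOG-phase ops."""
    writes = sum(1 for op in log_ops if op in WRITE_OPS)
    searches = sum(1 for op in log_ops if op == "<|SEARCH|>")
    return writes / searches if searches else float("inf")


# (Base, + Trained) LOG-phase writes per SEARCH, copied from the table.
REPORTED = {
    "Crafter": (0.84, 0.39),
    "MiniHack": (2.89, 0.82),
    "NetHack": (4.66, 1.31),
}


def percent_change(base, trained):
    """Relative change in percent, rounded to the nearest integer."""
    return round(100 * (trained - base) / base)


if __name__ == "__main__":
    for env, (base, trained) in REPORTED.items():
        print(env, percent_change(base, trained))
    # Crafter -54
    # MiniHack -72
    # NetHack -72
```

A ratio below 1, as in Crafter and MiniHack after training, means the specialist issues more searches than writes during LOG.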

§Citation

@article{wu2026automem,
  title={AutoMem: Automated Learning of Memory as a Cognitive Skill},
  author={Wu, Shengguang and Zhu, Hao and Zhang, Yuhui and Wang, Xiaohan and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2607.01224},
  year={2026}
}