ViLoMem: Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

Weihao Bo1,2, Shan Zhang3, Yanpeng Sun4, Jingjing Wu2, Qunyi Xie2, Xiao Tan2,
Kunbin Chen2, Wei He2, Xiaofan Li2, Na Zhao4, Jingdong Wang2‡, Zechao Li1†
1Nanjing University of Science and Technology 2Baidu Inc 3Adelaide AIML 4Singapore University of Technology and Design
‡Project Leader, †Corresponding Author
Motivation: Multimodal Semantic Memory Enables Progressive Learning

When solving multimodal problems, early attempts may contain both logical and visual errors. Through feedback, the model refines its logical memory for theorem application and its visual memory to avoid perceptual traps, improving by integrating where to look with how to reason.

Abstract

Multimodal large language models (MLLMs) exhibit strong reasoning on isolated queries, yet they operate de novo, solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution.

This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences.

Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge—preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors.

Method

ViLoMem is a plug-in dual-stream memory framework for multimodal reasoning, featuring a closed-loop Memory Cycle that enables continuous learning from reasoning and perception errors.
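
The loop below is a minimal sketch of this Memory Cycle under assumed interfaces: solver, verifier, and attributor are hypothetical callables standing in for the MLLM solver, the verifier, and the error-attribution step, and MemoryBank uses placeholder retrieval and update (the embedding-based versions are sketched under Key Components). None of these names come from the released code.

```python
# Minimal sketch of the closed-loop Memory Cycle; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    kind: str     # "visual" or "logical"
    schema: str   # compact, structured description of an error pattern or strategy

@dataclass
class MemoryBank:
    items: list = field(default_factory=list)

    def retrieve(self, image, question, kind, k=3):
        # Placeholder; embedding-based retrieval is sketched under Key Components.
        return [m for m in self.items if m.kind == kind][:k]

    def update(self, new_items):
        # Placeholder; the real update merges similar schemas (grow-and-refine).
        self.items.extend(new_items)

def memory_cycle(solver, verifier, attributor, bank, image, question):
    """One retrieve -> solve -> verify -> update pass."""
    visual_mem = bank.retrieve(image, question, kind="visual")
    logical_mem = bank.retrieve(image, question, kind="logical")

    answer, trace = solver(image, question, visual_mem, logical_mem)
    # verdict: e.g. {"correct": bool, "redundant": bool}; the verifier also
    # filters redundant trajectories before they reach memory.
    verdict = verifier(image, question, answer, trace)

    if not verdict.get("redundant", False):
        # Error attribution distils the trajectory into compact visual/logical schemas.
        bank.update(attributor(image, question, trace, verdict))
    return answer
```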

ViLoMem Framework Overview

Key Components

  • (a) Memory Cycle: A closed-loop learning mechanism where both logical and visual memories are retrieved and utilized by the solver. The verifier evaluates actions to filter redundant trajectories and update both memory streams.
  • (b) Memory Generation: An error-attribution framework that uses an LLM for logical analysis and an MLLM for visual analysis, producing structured memory schemas through similarity-based merge and create operations.
  • (c) Memory Retrieval: Specialized retrieval for each stream; visual memories undergo image-embedding retrieval followed by question-specific filtering, while logical memories are retrieved through problem analysis and text-embedding similarity (see the sketch after this list).
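
A minimal sketch of components (b) and (c), with stated assumptions: embed_text and embed_image stand in for arbitrary text and image encoders, the merge threshold and top-k values are illustrative rather than the paper's settings, and for brevity the merge-or-create step is shown only for the logical stream.

```python
# Sketch of (b) grow-and-refine memory generation and (c) dual-stream retrieval.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class DualStreamMemory:
    def __init__(self, embed_text, embed_image, merge_threshold: float = 0.85):
        self.embed_text = embed_text
        self.embed_image = embed_image
        self.merge_threshold = merge_threshold
        self.visual = []   # entries: {"schema", "embedding", "image_embedding"}
        self.logical = []  # entries: {"schema", "embedding"}

    # --- (b) Memory Generation: similarity-based merge or create -------------
    def add_logical(self, schema: str):
        emb = self.embed_text(schema)
        for entry in self.logical:
            if cosine(emb, entry["embedding"]) >= self.merge_threshold:
                # Refine: fold the new schema into an existing, similar entry.
                entry["schema"] = f'{entry["schema"]}\n- {schema}'
                return
        # Grow: no sufficiently similar entry, so create a new one.
        self.logical.append({"schema": schema, "embedding": emb})

    def add_visual(self, schema: str, image):
        self.visual.append({
            "schema": schema,
            "embedding": self.embed_text(schema),
            "image_embedding": self.embed_image(image),
        })

    # --- (c) Memory Retrieval: specialized per stream -------------------------
    def retrieve_visual(self, image, question: str, k: int = 3):
        # Stage 1: image-embedding similarity; Stage 2: question-specific filtering.
        img_emb = self.embed_image(image)
        candidates = sorted(self.visual,
                            key=lambda e: cosine(img_emb, e["image_embedding"]),
                            reverse=True)[: 2 * k]
        q_emb = self.embed_text(question)
        candidates.sort(key=lambda e: cosine(q_emb, e["embedding"]), reverse=True)
        return [e["schema"] for e in candidates[:k]]

    def retrieve_logical(self, problem_analysis: str, k: int = 3):
        # Text-embedding similarity against the solver's problem analysis.
        q_emb = self.embed_text(problem_analysis)
        ranked = sorted(self.logical,
                        key=lambda e: cosine(q_emb, e["embedding"]),
                        reverse=True)
        return [e["schema"] for e in ranked[:k]]
```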

Main Results

We evaluate ViLoMem across six multimodal reasoning benchmarks covering mathematical reasoning, hallucination robustness, and visual knowledge understanding. All tables report pass@1 accuracy (%).

Method MMMU MathVista MathVision HallusionBench MMStar RealWorldQA
GPT-4.1 (baseline) 74.00 70.40 46.12 58.50 69.80 73.72
GPT-4.1 (step) 74.16 74.27 47.47 74.44 70.43 72.03
GPT-4.1 (+ ViLoMem) 77.26 76.88 53.95 75.29 72.43 74.38
Qwen3-VL-235B (baseline) 78.70 84.90 61.28 63.20 78.40 79.30
Qwen3-VL-235B (step) 75.97 83.66 62.17 74.58 76.16 78.66
Qwen3-VL-235B (+ ViLoMem) 79.40 84.98 62.83 75.21 78.31 77.22
Qwen3-VL-8B (baseline) 66.38 77.20 48.13 61.10 70.91 71.50
Qwen3-VL-8B (step) 65.52 77.80 48.35 73.08 70.22 70.85
Qwen3-VL-8B (+ ViLoMem) 69.90 77.87 49.34 73.19 72.13 73.59

Key Findings

  • Consistent Improvements: ViLoMem improves pass@1 accuracy for every model on nearly all benchmarks, with particularly notable gains on mathematical reasoning tasks.
  • GPT-4.1 Benefits Most: +6.48 on MathVision and +2.61 on MathVista over its step variant, likely owing to its stronger in-context learning ability.
  • Smaller Models Gain Significantly: Qwen3-VL-8B improves by +4.38 on MMMU and +2.74 on RealWorldQA over its step variant, indicating that structured memory supplies complementary knowledge beyond its limited parametric capacity.

Memory Analysis


(a) Memory generation and retrieval statistics show that visual errors dominate generation (59% to 93%), demonstrating that visual perception remains the primary bottleneck in multimodal reasoning. Despite this generation asymmetry, both streams contribute comparably during retrieval.

(b) Cross-task dependency analysis reveals balanced utilization of both memory streams during retrieval across diverse tasks and models, confirming effective dual-stream coordination.

Cross-Model Memory Transfer

We evaluate the reusability of dual-stream memory by conducting cross-model transfer experiments, where each solver retrieves memories generated by other models rather than its own.

Method MMMU MathVista
GPT-4.1 (step) 74.16 74.27
GPT-4.1 (+ ViLoMem) 77.26 76.88
GPT-4.1 (+ ViLoMem Cross) 78.21 76.58
Qwen3-VL-235B (step) 75.97 83.66
Qwen3-VL-235B (+ ViLoMem) 79.40 84.98
Qwen3-VL-235B (+ ViLoMem Cross) 79.26 84.21
Qwen3-VL-8B (step) 65.52 77.80
Qwen3-VL-8B (+ ViLoMem) 69.90 77.87
Qwen3-VL-8B (+ ViLoMem Cross) 71.26 79.20

Smaller models benefit most from cross-model memories: Qwen3-VL-8B achieves +1.36 on MMMU and +1.33 on MathVista compared to its self-generated memory, suggesting that memories from stronger models encode higher-quality error patterns. This demonstrates that dual-stream memory enables effective knowledge distillation from stronger to weaker models without fine-tuning.
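
For concreteness, here is a sketch of the transfer protocol, assuming the memory schemas are serialized as JSON and the solver accepts a memories argument; both details are our assumptions, not specified by the paper.

```python
# Sketch of cross-model memory transfer: the solver consumes a memory bank
# that a *different* model produced. File paths and the solver signature are
# illustrative.
import json

def load_memory_bank(path: str) -> list:
    """Schemas are stored as plain JSON, so the bank is model-agnostic."""
    with open(path) as f:
        return json.load(f)

def cross_model_eval(solver, benchmark, memory_path: str) -> float:
    bank = load_memory_bank(memory_path)        # e.g. schemas written by GPT-4.1 runs
    correct = 0
    for image, question, answer in benchmark:   # solver may be a smaller model
        prediction = solver(image, question, memories=bank)
        correct += int(prediction == answer)
    return correct / len(benchmark)
```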

Case Study

We present representative cases demonstrating ViLoMem's memory generation and retrieval process across different multimodal reasoning tasks. These examples illustrate how the dual-stream memory mechanism captures and transfers both visual and logical knowledge.

For vision-intensive questions (e.g., traffic-light color, object localization, optical-illusion setups), visual memory provides concrete viewing strategies such as checking the actual illuminated region or isolating targets from distracting backgrounds. Attention maps concentrate on task-relevant regions, steering the solver toward correct visual evidence.

For geometry and chart-reading tasks, visual and logical memories work in complementary roles: logical memory provides reusable rules for measurement and graph interpretation, while visual memory focuses on concrete inspection behaviors such as aligning with gridlines or checking true line orientation. This demonstrates a clear division of labor: visual memory governs "where to look" while logical memory refines "how to reason".

Ablation Study

We validate the necessity of dual-stream memory by selectively disabling each component on GPT-4.1.

Method MMMU MathVista
GPT-4.1 (baseline) 74.00 70.40
GPT-4.1 (step) 74.16 74.27
GPT-4.1 (w/o logic memory) 76.64 75.59
GPT-4.1 (w/o visual memory) 76.88 75.66
GPT-4.1 (+ ViLoMem) 77.26 76.88
GPT-4.1 (+ ViLoMem & attention) 78.21 76.87

Removing either stream consistently degrades performance, confirming that both memory types are essential. The gap between the single-stream variants and the full ViLoMem model demonstrates their complementarity: the visual and logical streams capture distinct error patterns rather than redundant information. Augmenting visual memory with question-aware attention maps yields a further gain on MMMU (+0.95) while leaving MathVista essentially unchanged.
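
Read as configuration, these variants correspond to toggling the two retrieval streams and the attention-map augmentation. The sketch below uses hypothetical switch names of our own (not the released config) and reuses the DualStreamMemory interface from the earlier sketch.

```python
# Illustrative ablation switches; each table row disables one stream or adds
# the attention-map augmentation on top of the full model.
from dataclasses import dataclass

@dataclass
class ViLoMemConfig:
    use_visual_memory: bool = True    # "w/o visual memory" sets this to False
    use_logic_memory: bool = True     # "w/o logic memory" sets this to False
    use_attention_maps: bool = False  # "+ ViLoMem & attention" sets this to True

def build_context(cfg, memory, image, question, attention_fn=None):
    """Assemble the memory context handed to the solver for one query."""
    context = []
    if cfg.use_logic_memory:
        context += memory.retrieve_logical(question)
    if cfg.use_visual_memory:
        context += memory.retrieve_visual(image, question)
        if cfg.use_attention_maps and attention_fn is not None:
            # Question-aware attention map highlighting where to look (assumed helper).
            context.append(attention_fn(image, question))
    return context
```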

BibTeX

@misc{bo2025agenticlearnergrowandrefinemultimodal,
      title={Agentic Learner with Grow-and-Refine Multimodal Semantic Memory}, 
      author={Weihao Bo and Shan Zhang and Yanpeng Sun and Jingjing Wu and Qunyi Xie and Xiao Tan and Kunbin Chen and Wei He and Xiaofan Li and Na Zhao and Jingdong Wang and Zechao Li},
      year={2025},
      eprint={2511.21678},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.21678}, 
}