LoCoMo Repository Architecture Explained

Evaluating Very Long-Term Conversational Memory of LLM Agents (ACL 2024)

0. Complete Pipeline Overview

Data production → intermediate artifacts → evaluation

📦 Data production

Published benchmark:
data/locomo10.json
Generating new conversations:
scripts/generate_conversations.sh
Persona → event graph → multi-session dialogue → image captions

🔄 Intermediate artifacts (RAG databases)

Observation DB:
generate_observations.sh
Session-summary DB:
generate_session_summaries.sh
Dialog DB:
gpt_utils.prepare_for_rag()

📊 Evaluation

Context-only QA:
evaluate_gpts/claude/gemini.sh
RAG-augmented QA:
evaluate_rag_gpts.sh
F1 score / recall / per-category statistics

💡 Key distinction
data/locomo10.json is the benchmark dataset released with the paper (10 very long conversations plus annotations); it is not generated at run time.
The code under generative_agents/ generates new synthetic conversation data.
1. Data Production Pipeline

⚙️ Conversation generation: the 5-step pipeline

Step 1: MSC persona (a few sentences of character traits)
  → GPT expansion (--persona) → full persona (agent_a/b.json)
Step 2: persona
  → GPT generation (--events) → event graph (with causal chains and a timeline)
Step 3: persona + events
  → agents converse in alternating turns (--session) → multi-session dialogue (session_1, session_2, ...)
Step 4: image search queries
  → icrawler + BLIP (--blip-caption) → images + captions (blip_caption field)
📝 Data Production: Concrete Examples

👤 Step 1: Persona generation

🔍 Example: from a short MSC persona to a full character description

Input: raw MSC persona (a few sentences)

[ "I like to paint in my free time", "I have two kids", "I enjoy camping with my family", "I work part-time at a gift shop" ]

⬇️ GPT + PERSONA_FROM_MSC_PROMPT

Output: full persona_summary

{
  "name": "Melanie",
  "persona_summary": "Melanie is a 35 year old mother of two who works part-time at a local gift shop. She is passionate about art and spends her free time painting landscapes and nature scenes. She loves outdoor activities, especially camping trips with her family to various national parks. She values quality time with her children and often takes them to museums and parks."
}

Code location: PERSONA_FROM_MSC_PROMPT in generative_agents/conversation_utils.py
Few-shot examples: prompt_examples/persona_generation_examples.json

📅 Step 2: Event-graph generation

🔍 Example: generating life events from a persona

Input: persona + time range

PROMPT: Given the following character description, generate 15 life events that could plausibly occur between January 2023 and October 2023. Each event must include: a date, an event description, a sub-event, and causal links (caused_by). Character: Melanie, 35, mother of two, enjoys painting and camping...

⬇️ GPT generation

Output: event graph (the graph field)

[
  { "id": 1, "date": "15 March, 2023", "event": "Melanie signs up for a local pottery class", "sub_event": "Melanie decides to explore new art forms beyond painting", "caused_by": [], "image": "pottery class registration" },
  { "id": 2, "date": "2 April, 2023", "event": "Melanie takes her family camping at the beach", "sub_event": "The family enjoys a bonfire and stargazing", "caused_by": [], "image": "family beach camping" },
  { "id": 3, "date": "20 May, 2023", "event": "Melanie finishes her first pottery project", "sub_event": "She makes a vase inspired by the beach camping trip", "caused_by": [1, 2], ← causal links! "image": "handmade pottery vase" }
]

Note the caused_by field: event 3 draws on event 1 (taking up pottery) and event 2 (the beach camping trip); this is the "causal chain".
When the conversations are generated, the prompt includes these causal links, which makes the dialogue more coherent.
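As a minimal sketch (with illustrative names, not the repo's actual functions), rendering an event together with its caused_by ancestors into the natural-language form that appears in the session prompt might look like this:

```python
# Illustrative sketch: render an event and its caused_by ancestors into the
# "Because previously, ..." phrasing used in the session prompt.

def render_event_with_causes(events: list, event_id: int) -> str:
    by_id = {e["id"]: e for e in events}
    ev = by_id[event_id]
    text = f"On {ev['date']}, {ev['event']}."
    causes = [by_id[i] for i in ev.get("caused_by", [])]
    if causes:
        # Mention each ancestor event with its date.
        parts = [f"{c['event']} ({c['date']})" for c in causes]
        text += " Because previously, " + ", ".join(parts) + "."
    return text

# Toy data mirroring the example above (the prompt text may mix event and
# sub_event wording in the real pipeline).
events = [
    {"id": 1, "date": "15 March, 2023", "event": "Melanie signs up for a local pottery class", "caused_by": []},
    {"id": 2, "date": "2 April, 2023", "event": "The family enjoys a bonfire and stargazing", "caused_by": []},
    {"id": 3, "date": "20 May, 2023", "event": "Melanie finishes her first pottery project", "caused_by": [1, 2]},
]
print(render_event_with_causes(events, 3))
```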

💬 Step 3: Turn-by-turn dialogue generation

🔍 Example: how Session 3 is generated

Prompt for Agent A's (Melanie's) opening line

AGENT_CONV_PROMPT_W_EVENTS
You are Melanie. Here is your character description:
"Melanie is a 35 year old mother of two who works part-time..."
You are chatting with Caroline. Your last chat was on 2 April, 2023. Today is 21 May, 2023.
[Summary of the last chat]
You discussed Melanie's beach camping trip, and Caroline shared her experience attending an LGBTQ support group.
[Your recent events]
On 20 May, 2023, Melanie finishes her first pottery project. Because previously, Melanie signs up for a local pottery class (15 March, 2023), The family enjoys a bonfire and stargazing (2 April, 2023).
Start the conversation as Melanie. If you want to share a photo, use [SHARES photo of ...].
Melanie:

⬇️ GPT generation

Generated dialogue (multiple turns)

📅 Session 3 - 3:30 pm on 21 May, 2023
Melanie (Agent A): Hey Caroline! Guess what? I finally finished my first pottery piece! 🎨
📷 [SHARES photo of handmade pottery vase] → triggers image retrieval + BLIP captioning
Caroline (Agent B): Wow that looks amazing! Is that the vase you mentioned you were working on? The colors remind me of the ocean!
Melanie: Yes! I was inspired by our beach camping trip. The blue glaze is meant to capture the sunset we saw that night.
Caroline: That's so thoughtful! Speaking of meaningful activities, I've been volunteering at the youth center lately...

Key points
  • Each turn's prompt includes: persona + history summary + relevant events + the dialogue so far
  • Event causality is woven naturally into the dialogue ("inspired by our beach camping trip")
  • The [SHARES photo of ...] marker triggers downstream image processing

🖼️ Step 4: Image retrieval and caption generation

🔍 Example: processing "[SHARES photo of handmade pottery vase]"

[SHARES photo of handmade pottery vase]
  → get_image_queries() (GPT rewrite) → search query: "ceramic vase blue glaze"
  → icrawler (Google image crawling) → image file: session_3/a/vase_001.jpg
  → BLIP model (get_blip_caption()) → caption text: "a photo of a blue ceramic..."

Final stored turn data

{
  "speaker": "Melanie",
  "text": "Hey Caroline! Guess what? I finally finished my first pottery piece!",
  "dia_id": "D3:1",
  "img_url": ["https://example.com/vase.jpg"],
  "img_file": ["vase_001.jpg"],
  "query": "ceramic vase blue glaze",
  "blip_caption": "a photo of a blue ceramic vase with ocean-inspired patterns"
}
2. Intermediate Artifacts: RAG Retrieval Databases

⚠️ Why are these intermediate artifacts needed?
The conversations can be very long (100K+ tokens), exceeding a model's context window. RAG works around this by retrieving only the relevant fragments.

🔄 The three RAG databases compared

Database    | Granularity      | Example content                                                            | Generated by
Dialog      | per turn         | "Melanie said, 'I finished my pottery!' and shared a photo of a blue vase" | built on the fly at run time
Observation | extracted facts  | "Melanie finished her first pottery project on May 20, 2023"               | generate_observations.sh
Summary     | per session      | "In session 3, Melanie shared her pottery project inspired by camping..."  | generate_session_summaries.sh
🔍 The same dialogue turn as represented in each of the three databases

Original dialogue (Session 3, Turn 1)

Melanie: Hey Caroline! I finally finished my first pottery piece! It's a vase inspired by our beach trip.
📷 shared: a photo of a blue ceramic vase

Dialog DB

"(3:30 pm on 21 May, 2023) Melanie said, 'Hey Caroline! I finally finished my first pottery piece!' and shared a photo of a blue ceramic vase"

Observation DB

"Melanie completed her first pottery project - a vase" [D3:1]

"Melanie's vase was inspired by a beach trip" [D3:1]

Summary DB

"Session 3: Melanie excitedly shared her first pottery creation with Caroline, a blue vase inspired by their earlier beach camping discussion..."
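Conceptually, the Dialog-DB conversion flattens each turn into one timestamped sentence, with shared images replaced by their BLIP captions. The real logic lives in gpt_utils.prepare_for_rag(); the function below is an illustrative sketch assuming the turn schema shown in step 4:

```python
# Illustrative sketch (not the repo's actual code): flatten a stored turn
# into a Dialog-DB string, folding the shared image in via its BLIP caption.

def turn_to_dialog_entry(turn: dict, timestamp: str) -> str:
    entry = f"({timestamp}) {turn['speaker']} said, '{turn['text']}'"
    if turn.get("blip_caption"):
        # Shared images become text through their caption.
        entry += f" and shared {turn['blip_caption']}"
    return entry

turn = {
    "speaker": "Melanie",
    "text": "Hey Caroline! I finally finished my first pottery piece!",
    "dia_id": "D3:1",
    "blip_caption": "a photo of a blue ceramic vase",
}
print(turn_to_dialog_entry(turn, "3:30 pm on 21 May, 2023"))
```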

3. Evaluation Pipeline

📊 The two evaluation modes compared

📋 Context-only QA

How it works: stuff as much of the conversation as possible into the prompt

Limitation: bounded by the model's context window

Script: evaluate_gpts.sh

Conversation (truncated from the tail to 4K/8K/16K tokens) + QA_PROMPT + question → model answer
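The truncation step can be sketched as below. Two assumptions are made for illustration: whitespace splitting stands in for real tokenization (the scripts count tokens with the model's actual tokenizer), and the most recent tokens are assumed to be the ones kept:

```python
# Sketch of truncating a conversation to a token budget.
# ASSUMPTIONS: whitespace "tokens" instead of a real tokenizer, and the
# most recent tokens are retained; not the repo's exact implementation.

def truncate_to_budget(conversation: str, max_tokens: int) -> str:
    tokens = conversation.split()
    # Keep only the last max_tokens tokens so the newest sessions survive.
    return " ".join(tokens[-max_tokens:])

history = "old old old recent1 recent2 recent3"
print(truncate_to_budget(history, 3))  # → recent1 recent2 recent3
```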

🔍 RAG-augmented QA

How it works: retrieve relevant fragments first, then answer

Advantage: handles conversations of any length

Script: evaluate_rag_gpts.sh

Question → embed → retrieve top-k → concatenate retrieved fragments + question → model answer
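The retrieve step can be sketched with toy 3-dimensional vectors standing in for the DRAGON embeddings (function names and data here are illustrative):

```python
# Sketch of top-k dense retrieval by cosine similarity. Toy vectors stand in
# for the 768-dim DRAGON embeddings used in the repo.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_top_k(query_vec, db, k=2):
    # db: list of (text, vector) pairs; rank by similarity, highest first.
    ranked = sorted(db, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

db = [
    ("Melanie finished her pottery project",  [0.9, 0.1, 0.0]),
    ("Caroline joined a support group",       [0.1, 0.9, 0.1]),
    ("Melanie signed up for a pottery class", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # toy embedding of a "pottery project" question
print(retrieve_top_k(query, db, k=2))
```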
📝 Evaluation: Concrete Examples

📋 Example 1: Context-only QA

🔍 Question: "What activities does Melanie partake in?"

Step 1: Build the input prompt

PROMPT sent to the model:
Below is a conversation between two people: Melanie and Caroline. The conversation takes place over multiple days.
DATE: 3:30 pm on 8 May, 2023
CONVERSATION:
Melanie said, "I love painting landscapes in my free time"
Caroline said, "That's wonderful! Do you have any other hobbies?"
Melanie said, "Yes! I also enjoy camping with my family. We went to the beach last month" and shared a photo of a sunset over the ocean.   ← the image has become text
...
DATE: 2:15 pm on 21 May, 2023
CONVERSATION:
Melanie said, "I just finished my first pottery project!"
...
DATE: 4:00 pm on 15 July, 2023
CONVERSATION:
Melanie said, "The kids and I went swimming at the lake yesterday"
...
(the conversation is truncated to fit the model's context window)
---
Based on the above context, write an answer in the form of a short phrase.
Question: What activities does Melanie partake in?
Short answer:

Step 2: Model output

GPT-4's answer:

"painting, camping, pottery, swimming"

Step 3: Compute the F1 score

1. Ground truth: "pottery, camping, painting, swimming"
   Model prediction: "painting, camping, pottery, swimming"
2. Normalize + stem:
   answer tokens: [potteri, camp, paint, swim]
   prediction tokens: [paint, camp, potteri, swim]
3. Count the overlap: all 4 tokens match
   Precision = 4/4 = 1.0, Recall = 4/4 = 1.0
   F1 = 1.0 ✓
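The token-overlap part of this scoring can be sketched as follows. The repo additionally strips punctuation/articles and Porter-stems each token (hence "potteri"); this sketch approximates normalization with lowercasing and comma removal:

```python
# SQuAD-style token-overlap F1 (core computation only; real normalization
# in the repo also includes Porter stemming).
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    pred = prediction.lower().replace(",", "").split()
    gold = ground_truth.lower().replace(",", "").split()
    common = Counter(pred) & Counter(gold)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("painting, camping, pottery, swimming",
               "pottery, camping, painting, swimming"))  # → 1.0
```

Because F1 compares token multisets, word order does not matter, which is why the reordered prediction above still scores 1.0.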

🔍 Example 2: RAG-augmented QA

🔍 Question: "When did Melanie finish her pottery project?"

Step 1: Embed the question + retrieve

Query: "When did Melanie finish her pottery project?"
→ DRAGON encoder → query_vector [768 dims]

Top-5 retrieved fragments (ranked by similarity):

1. "(21 May, 2023) Melanie said, 'I just finished my first pottery project!'" [D3:1]  0.89
2. "(15 March, 2023) Melanie said, 'I signed up for a pottery class today'" [D1:5]  0.76
3. "(21 May, 2023) Caroline said, 'The colors remind me of the ocean!'" [D3:2]  0.65
4. "(2 April, 2023) Melanie said, 'We went camping at the beach'" [D2:1]  0.52
5. "(15 July, 2023) Melanie said, 'The pottery class is going great'" [D5:3]  0.48

Step 2: Build the RAG prompt

PROMPT sent to the model (RAG mode):
Retrieved context:
21 May, 2023: Melanie said, "I just finished my first pottery project!"
15 March, 2023: Melanie said, "I signed up for a pottery class today"
21 May, 2023: Caroline said, "The colors remind me of the ocean!"
2 April, 2023: Melanie said, "We went camping at the beach"
15 July, 2023: Melanie said, "The pottery class is going great"
---
Based on the above context, write an answer in the form of a short phrase.
Question: When did Melanie finish her pottery project?
Short answer:

Step 3: Model answer + scoring

Model answer:

"21 May, 2023"

F1
Ground truth: "20 May 2023" vs. prediction: "21 May 2023"
F1 = 0.67 (the day token differs)
Recall
Evidence: [D3:1]
Retrieved: [D3:1, D1:5, D3:2, D2:1, D5:3]
Recall = 1/1 = 1.0 ✓

⚠️ Example 3: Adversarial questions (Category 5)

🔍 Purpose: test whether the model fabricates information that does not exist

Question design

Adversarial question:
Question: What is Melanie's favorite restaurant?
Select the correct answer:
(a) Not mentioned in the conversation
(b) The Italian Bistro on Main Street

Key point: the conversation never mentions any restaurant, so the correct answer is (a).

Scoring logic

If the model's answer contains "not mentioned" or "no information":
  Score = 1 (correctly identified as unanswerable)
If the model answers "The Italian Bistro" or fabricates anything else:
  Score = 0 (hallucination)

# task_eval/evaluation.py:217-221
if line['category'] == 5:
    if 'no information available' in output.lower() \
            or 'not mentioned' in output.lower():
        all_ems.append(1)  # correct
    else:
        all_ems.append(0)  # wrong (hallucination)

🔗 Example 4: Multi-hop questions (Category 1)

🔍 These questions require combining information from several places

Question

"Where has Melanie camped?"

Required evidence (spread across sessions)

📅 Session 2 (D2:1)
Melanie: We just got back from camping at the beach! The kids loved it.
📅 Session 4 (D4:6)
Melanie: Last summer we camped in the mountains. The views were incredible!
📅 Session 8 (D8:32)
Melanie: This time we tried camping in the forest. Very peaceful.

Ground truth vs. prediction

Ground truth: "beach, mountains, forest"

Model prediction: "beach, mountains"

1. Split both answers into sub-answers at commas.
2. For each ground-truth sub-answer, take the best-matching F1 against the predicted sub-answers:
   "beach" → max(F1(beach, beach), F1(beach, mountains)) = 1.0
   "mountains" → 1.0
   "forest" → 0.0 (absent from the prediction)
3. Final F1 = mean(1.0, 1.0, 0.0) = 0.67
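The per-part scoring just described can be sketched as below (illustrative code using plain token F1 without stemming):

```python
# Sketch of multi-hop (Category 1) scoring: split at commas, credit each
# ground-truth part with its best-matching predicted part, then average.
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(gold)
    return 2 * p * r / (p + r)

def multihop_f1(prediction: str, ground_truth: str) -> float:
    preds = [s.strip() for s in prediction.split(",")]
    golds = [s.strip() for s in ground_truth.split(",")]
    # Each ground-truth sub-answer gets the F1 of its best match.
    best = [max(token_f1(p, g) for p in preds) for g in golds]
    return sum(best) / len(best)

print(round(multihop_f1("beach, mountains", "beach, mountains, forest"), 2))  # → 0.67
```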
4. Metric Computation Summary

📊 How each of the 5 question categories is scored

Category | Type                  | Example question                           | Scoring
1        | single-/multi-hop     | "What activities does Melanie partake in?" | split at commas, average the per-part best F1
2        | temporal reasoning    | "When did Melanie finish her pottery?"     | token-stem F1
3        | open-domain reasoning | "Would Melanie enjoy a museum trip?"       | token-stem F1
4        | open-domain           | general questions                          | token-stem F1
5        | adversarial           | "What is Melanie's favorite restaurant?"   | output contains "not mentioned" → 1, else → 0

📁 Output Files

🔍 Files produced by an evaluation run

1. locomo10_qa.json - detailed per-question results

{
  "sample_id": "sample_001",
  "qa": [
    {
      "question": "What activities does Melanie partake in?",
      "answer": "pottery, camping, painting, swimming",
      "category": 1,
      "evidence": ["D5:4", "D9:1", "D1:12", "D1:18"],
      "gpt-4-turbo_prediction": "painting, camping, pottery, swimming",
      "gpt-4-turbo_f1": 1.0
    },
    {
      "question": "When did Melanie finish her pottery?",
      "gpt-3.5-turbo_dialog_top_5_prediction": "21 May, 2023",
      "gpt-3.5-turbo_dialog_top_5_f1": 0.67,
      "gpt-3.5-turbo_dialog_top_5_recall": 1.0
    }
  ]
}

2. locomo10_qa_stats.json - aggregate statistics

{
  "model": "gpt-4-turbo",
  "overall_f1": 0.72,
  "category_1_f1": 0.68,
  "category_2_f1": 0.75,
  "category_3_f1": 0.71,
  "category_4_f1": 0.69,
  "category_5_f1": 0.82,
  "total_questions": 420
}
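As a sketch of how such aggregate statistics could be derived from the per-question results (field names follow the examples above; this is not the repo's exact aggregation code):

```python
# Illustrative aggregation: group per-question F1 scores by category and
# compute overall and per-category means.
from collections import defaultdict

def aggregate_stats(qa_list, f1_key="gpt-4-turbo_f1"):
    by_cat, all_f1 = defaultdict(list), []
    for qa in qa_list:
        if f1_key in qa:
            by_cat[qa["category"]].append(qa[f1_key])
            all_f1.append(qa[f1_key])
    stats = {"overall_f1": sum(all_f1) / len(all_f1)}
    for cat, scores in sorted(by_cat.items()):
        stats[f"category_{cat}_f1"] = sum(scores) / len(scores)
    stats["total_questions"] = len(all_f1)
    return stats

qa = [
    {"category": 1, "gpt-4-turbo_f1": 1.0},
    {"category": 1, "gpt-4-turbo_f1": 0.5},
    {"category": 5, "gpt-4-turbo_f1": 1.0},
]
print(aggregate_stats(qa))
# {'overall_f1': 0.8333333333333334, 'category_1_f1': 0.75, 'category_5_f1': 1.0, 'total_questions': 3}
```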

Quick Reference

🚀 Common commands

Purpose                    | Command
Configure API keys         | vim scripts/env.sh
Generate new conversations | bash scripts/generate_conversations.sh
Evaluate GPT-4             | bash scripts/evaluate_gpts.sh
Evaluate Claude            | bash scripts/evaluate_claude.sh
RAG evaluation             | bash scripts/evaluate_rag_gpts.sh
Build the Observation DB   | bash scripts/generate_observations.sh

📁 Key file paths

locomo/
├── data/
│   ├── locomo10.json            ← benchmark dataset (used directly for evaluation)
│   └── msc_personas_all.json    ← persona sampling source (for generating new data)
├── generative_agents/           ← data-generation module
│   └── generate_conversations.py
├── task_eval/                   ← evaluation module
│   ├── evaluate_qa.py           ← evaluation entry point
│   ├── evaluation.py            ← metric computation (F1, etc.)
│   └── *_utils.py               ← per-model adapters
├── scripts/                     ← shell scripts
│   ├── env.sh                   ← environment-variable config
│   └── evaluate_*.sh            ← evaluation scripts
├── prompt_examples/             ← few-shot examples
└── global_methods.py            ← LLM API wrapper