🔍 Files generated after evaluation
1. locomo10_qa.json - detailed results for each QA pair
{
  "sample_id": "sample_001",
  "qa": [
    {
      "question": "What activities does Melanie partake in?",
      "answer": "pottery, camping, painting, swimming",
      "category": 1,
      "evidence": ["D5:4", "D9:1", "D1:12", "D1:18"],
      "gpt-4-turbo_prediction": "painting, camping, pottery, swimming",
      "gpt-4-turbo_f1": 1.0
    },
    {
      "question": "When did Melanie finish her pottery?",
      "gpt-3.5-turbo_dialog_top_5_prediction": "21 May, 2023",
      "gpt-3.5-turbo_dialog_top_5_f1": 0.67,
      "gpt-3.5-turbo_dialog_top_5_recall": 1.0
    }
  ]
}
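The result keys in the example above appear to follow a `<run name>_<metric>` pattern, for instance `gpt-4-turbo_f1` or `gpt-3.5-turbo_dialog_top_5_recall`. The following is a minimal sketch of grouping those fields by run. The helper names `split_result_key` and `collect_runs` are hypothetical, and the suffix list is assumed from the three suffixes visible in the example; the real file may use others.

```python
import json

# Metric suffixes seen in the example above (assumption: there may be more).
METRIC_SUFFIXES = ("prediction", "f1", "recall")


def split_result_key(key):
    """Split 'gpt-4-turbo_f1' into ('gpt-4-turbo', 'f1'); return None for other keys."""
    for suffix in METRIC_SUFFIXES:
        tail = "_" + suffix
        if key.endswith(tail) and len(key) > len(tail):
            return key[: -len(tail)], suffix
    return None


def collect_runs(sample):
    """Map each run name to a list of per-question dicts holding its metrics."""
    runs = {}
    for qa in sample.get("qa", []):
        per_run = {}
        for key, value in qa.items():
            parsed = split_result_key(key)
            if parsed is None:
                continue  # plain fields such as "question" or "evidence"
            run, metric = parsed
            row = per_run.setdefault(run, {"question": qa.get("question")})
            row[metric] = value
        for run, row in per_run.items():
            runs.setdefault(run, []).append(row)
    return runs


# Abbreviated copy of the sample shown above.
example = json.loads("""
{
  "sample_id": "sample_001",
  "qa": [
    {"question": "What activities does Melanie partake in?",
     "category": 1,
     "gpt-4-turbo_prediction": "painting, camping, pottery, swimming",
     "gpt-4-turbo_f1": 1.0},
    {"question": "When did Melanie finish her pottery?",
     "gpt-3.5-turbo_dialog_top_5_prediction": "21 May, 2023",
     "gpt-3.5-turbo_dialog_top_5_f1": 0.67,
     "gpt-3.5-turbo_dialog_top_5_recall": 1.0}
  ]
}
""")

runs = collect_runs(example)
for run_name, rows in runs.items():
    print(run_name, rows)
```

Matching on a fixed suffix list rather than splitting at the last underscore keeps run names that themselves contain underscores, such as `gpt-3.5-turbo_dialog_top_5`, in one piece.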
2. locomo10_qa_stats.json - aggregate statistics
{
  "model": "gpt-4-turbo",
  "overall_f1": 0.72,
  "category_1_f1": 0.68,
  "category_2_f1": 0.75,
  "category_3_f1": 0.71,
  "category_4_f1": 0.69,
  "category_5_f1": 0.82,
  "total_questions": 420
}
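To illustrate how per-question scores could roll up into a summary of this shape, here is a rough sketch that averages `<run>_f1` overall and per category. It assumes the statistics are plain unweighted means, which the evaluation script may or may not use, and the function name `aggregate_f1` is hypothetical. The input data below is made up for the demonstration and is not taken from a real run.

```python
from collections import defaultdict


def aggregate_f1(samples, run):
    """Average `<run>_f1` overall and per category.

    Questions without a score for this run, or without a "category"
    field, are skipped (assumption: the real script may treat them differently).
    """
    f1_key = f"{run}_f1"
    by_category = defaultdict(list)
    for sample in samples:
        for qa in sample.get("qa", []):
            if f1_key in qa and "category" in qa:
                by_category[qa["category"]].append(qa[f1_key])

    all_scores = [s for scores in by_category.values() for s in scores]
    stats = {"model": run, "total_questions": len(all_scores)}
    if all_scores:
        stats["overall_f1"] = round(sum(all_scores) / len(all_scores), 2)
    for category in sorted(by_category):
        scores = by_category[category]
        stats[f"category_{category}_f1"] = round(sum(scores) / len(scores), 2)
    return stats


# Illustrative input only; these scores are invented for the demo.
demo_samples = [
    {"sample_id": "demo_a", "qa": [
        {"category": 1, "gpt-4-turbo_f1": 1.0},
        {"category": 2, "gpt-4-turbo_f1": 0.5},
    ]},
    {"sample_id": "demo_b", "qa": [
        {"category": 1, "gpt-4-turbo_f1": 0.5},
        {"category": 2},  # no score for this run, so it is skipped
    ]},
]

demo_stats = aggregate_f1(demo_samples, "gpt-4-turbo")
print(demo_stats)
```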