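AI Agent Memory Systems in Production: Benchmarking Methods and Production Trade-offs 2026

Benchmarking methods for production memory systems, the LOCOMO framework, the four-layer scope model, procedural memory, the ACE self-improvement loop, and measurable trade-off analysis

May 1, 2026 · 9 min read · Intermediate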

Memory Security Orchestration Interface Infrastructure

This article is one route in OpenClaw's external narrative arc.

Core Signal: Memory systems have moved from optional add-on to production-grade infrastructure, and benchmarking methods and trade-off analysis have become deployment gates. Time: May 1, 2026

Preface: Why benchmarking methodology determines production feasibility

In 2026, evaluating an AI agent memory system along a single dimension is no longer sufficient. The LOCOMO benchmark framework evaluates along several dimensions at once, weighing accuracy, latency, and cost together. A production memory system has to balance the following trade-offs:

Accuracy vs latency: the full-context method is the most accurate but has a p95 latency of 17.12 seconds (median 9.87 seconds); the selective method gives up about 6 percentage points of accuracy for a 91% reduction in p95 latency
Cost vs effectiveness: the total cost of memory-update API calls, vector encoding, and index updates against retrieval accuracy
Memory types: combined use of episodic (what happened), semantic (knowledge), and procedural (how to do things) memory
Scope: how to combine the four scope levels user_id, agent_id, run_id, and app_id

Production threshold: memory systems are no longer optional extras but core infrastructure that must be measurable, traceable, and optimizable. This article draws on the LOCOMO benchmark and Mem0 practice to offer production-level guidance.

1. The LOCOMO benchmark: multi-dimensional evaluation

1.1 The evolution of benchmarking: from self-reporting to standardization

Before 2024, memory quality was assessed mainly through self-reports or non-standardized tasks, so results could not be reproduced across labs. The LOCOMO (Long-term Conversational Memory) benchmark, a key reference point in 2026, provides a standardized evaluation dataset of multi-session conversations spanning several difficulty levels and question types.

Evaluation dimensions: LOCOMO evaluation uses five dimensions, so that optimizing one cannot silently sacrifice the others:

BLEU Score - token-level similarity to the reference answer
F1 Score - harmonic mean of token precision and recall in the response
LLM Score - binary factual-correctness verdict from an LLM judge
Token Consumption - total tokens needed to generate the final answer
Latency - wall-clock time for search plus response generation
1.2 Why multi-dimensional evaluation is crucial to production

Optimizing a single dimension produces systems that are not viable in production. For example:

Accuracy first: 72.9% LLM Score, but ~26,000 tokens per conversation and a p95 latency of 17.12 seconds
Latency first: 0.70 seconds latency, but only 61.0% accuracy, and token consumption is not reported

Multi-dimensional evaluation forces honest accounting: a system has to find a balance among accuracy, latency, and cost.

2. Production trade-off analysis: Mem0 vs the full-context method

2.1 Benchmark results from the ten-method comparison

The Mem0 research paper (ECAI 2025, arXiv:2504.19413) benchmarks ten AI memory methods. Selected results:

Method	Accuracy (LLM Score)	Median latency	p95 latency	Token consumption
Full-context	72.9%	9.87 seconds	17.12 seconds	~26,000/conversation
Mem0g (graph-enhanced)	68.4%	1.09 seconds	2.59 seconds	~1,800/conversation
Mem0 (selective)	66.9%	0.71 seconds	1.44 seconds	~1,800/conversation
RAG	61.0%	0.70 seconds	-	-
OpenAI Memory	52.9%	-	-	-

The most critical number is not an accuracy figure but a latency one: the full-context method has a p95 latency of 17.12 seconds, meaning about 1 in 20 requests waits 17 seconds or longer.
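As a reminder of what a p95 figure measures, the following sketch computes a nearest-rank percentile over latency samples. The sample values are made up for illustration and are not data from the paper; other percentile definitions (for example, interpolated ones) give slightly different numbers.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in seconds: most requests are fast, a small tail is very slow.
latencies = [0.7] * 94 + [17.0] * 6
median = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

With these samples the median is 0.7 seconds while the p95 is 17.0 seconds, which is why a median alone can hide an unacceptable tail.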

2.2 Production feasibility analysis

Full-context method:

Accuracy: 72.9% (the most accurate of the methods compared)
p95 latency: 17.12 seconds (unacceptable for interactive use)
Token consumption: ~26,000/conversation (about 14 times that of Mem0)
Conclusion: the most accurate on paper, but unusable in a real-time production environment

Mem0 selective pipeline:

Accuracy: 66.9% (6 percentage points lower)
p95 latency: 1.44 seconds (91% lower)
Token consumption: ~1,800/conversation (over 90% lower)
Conclusion: accept a 6-point accuracy loss in exchange for 91% lower p95 latency and over 90% fewer tokens

Mem0g graph-enhanced:

Accuracy: 68.4% (closest to full-context)
p95 latency: 2.59 seconds
Conclusion: performs better on complex multi-hop questions that need relational reasoning, at an acceptable latency cost
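The percentages quoted in this section can be checked directly from the figures above. The short calculation below uses only numbers stated in this article; the helper name `reduction` is illustrative.

```python
def reduction(baseline: float, value: float) -> float:
    """Fractional reduction of value relative to baseline."""
    return 1 - value / baseline


full_p95, mem0_p95 = 17.12, 1.44            # seconds
full_tokens, mem0_tokens = 26_000, 1_800    # tokens per conversation
full_acc, mem0_acc = 72.9, 66.9             # LLM Score, percent

latency_cut = reduction(full_p95, mem0_p95)         # roughly 0.916
token_cut = reduction(full_tokens, mem0_tokens)     # roughly 0.931
token_ratio = full_tokens / mem0_tokens             # roughly 14.4
accuracy_gap = full_acc - mem0_acc                  # roughly 6.0 percentage points
```

So "91% lower latency", "over 90% fewer tokens", and "about 14 times the tokens" are all consistent with the table, and the accuracy gap is 6 percentage points.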

2.3 Production deployment recommendations

Personalization scenarios: use the Mem0 selective pipeline (accuracy 66.9%, median latency 0.71 seconds, p95 latency 1.44 seconds)
Complex relational reasoning: use Mem0g graph enhancement (accuracy 68.4%, p95 latency 2.59 seconds)
Full-context method: suitable only for non-real-time batch tasks, not for user-facing interaction
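One way to turn these recommendations into a repeatable decision is to pick the most accurate method whose p95 latency fits a budget. The sketch below does that with the figures quoted in this article; the `Method` class and `pick` function are hypothetical names, and a real decision would also weigh token cost and question type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Method:
    name: str
    accuracy: float      # LLM Score, percent
    p95_seconds: float   # p95 end-to-end latency


METHODS = [
    Method("full-context", 72.9, 17.12),
    Method("mem0g", 68.4, 2.59),
    Method("mem0", 66.9, 1.44),
]


def pick(methods: list[Method], p95_budget: float) -> Optional[Method]:
    """Return the most accurate method within the latency budget, or None if none fits."""
    eligible = [m for m in methods if m.p95_seconds <= p95_budget]
    return max(eligible, key=lambda m: m.accuracy) if eligible else None


interactive = pick(METHODS, p95_budget=2.0)   # selects mem0
batch = pick(METHODS, p95_budget=60.0)        # selects full-context
```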

3. Four-layer scope model: memory scope design

3.1 Design principles of the scope model

Mem0 introduces a four-layer scope model; every memory write is associated with at least one scope:

user_id - memory for a specific user, persisted across all sessions
agent_id - memory for a specific agent instance
run_id / session_id - memory of a single conversation or workflow run
app_id / org_id - shared organization context

Combination strategies: a query can be scoped to a specific user's memories within a specific run, or can retrieve that user's memories across all runs. The retrieval pipeline handles merging and ranking automatically.
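The scoping behaviour described above can be illustrated with a small in-memory store. This is a sketch under simplifying assumptions, not Mem0's implementation: `ScopedStore` is a hypothetical class, matching is exact equality on scope keys, and there is no semantic ranking.

```python
from dataclasses import dataclass, field

SCOPE_KEYS = ("user_id", "agent_id", "run_id", "app_id")


@dataclass
class ScopedStore:
    records: list[dict] = field(default_factory=list)

    def add(self, content: str, **scopes: str) -> None:
        """Store a memory; at least one known scope key is required."""
        unknown = set(scopes) - set(SCOPE_KEYS)
        if unknown or not scopes:
            raise ValueError("need at least one of: " + ", ".join(SCOPE_KEYS))
        self.records.append({"content": content, **scopes})

    def query(self, **scopes: str) -> list[str]:
        """Return contents of records matching every scope given in the query."""
        return [
            r["content"]
            for r in self.records
            if all(r.get(key) == value for key, value in scopes.items())
        ]


store = ScopedStore()
store.add("prefers email updates", user_id="u1")
store.add("asked about a refund", user_id="u1", run_id="r1")
store.add("asked about shipping", user_id="u1", run_id="r2")

one_run = store.query(user_id="u1", run_id="r1")   # only the refund memory
all_runs = store.query(user_id="u1")               # all three memories
```

Narrowing a query is just a matter of adding scope keys, which is what makes the user-in-a-run and user-across-runs lookups share one code path.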

3.2 Metadata filtering: structured attribute queries

v1.0.0 introduces metadata filtering:

Implementation:

# Illustrative record shape: free-text content plus structured metadata
memory = {
    "content": "User prefers email updates",
    "metadata": {"context": "healthcare"}
}

Search query:

# Retrieve only memories in the healthcare context
# (`search` stands in for the memory client's search call)
memories = search(
    query="user preferences",
    filter={"context": "healthcare"}
)

Production value:

Multi-tenant applications: one user memory store serves several application contexts
Privacy isolation: user IDs come from application-level authentication rather than being generated by the memory system
Context separation: memories from domains such as healthcare, finance, and education stay separate

3.3 Typed fields: type-safe attributes

Type-safe attributes were added in June 2025:

Implementation:

# Metadata values keep their types: a string and an integer here
memory = {
    "content": "User prefers dark mode",
    "metadata": {
        "theme_preference": "dark",
        "user_level": 3
    }
}

Query capability:

# Retrieve only memories of advanced users (user_level greater than 2)
# (`search` stands in for the memory client's search call)
memories = search(
    query="preferences",
    filter={
        "user_level": {"gt": 2}
    }
)

Production value:

Queries beyond semantics: structured attributes can be queried without relying on semantic similarity
Privacy protection: attribute filters limit the scope of data access
Permission control: fine-grained access control based on user level or role attributes
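To show how a filter such as `{"user_level": {"gt": 2}}` could be evaluated against a metadata dictionary, here is a minimal matcher. Only the `gt` operator appears in this article; the other operator names (`gte`, `lt`, `lte`, `eq`) and the function name `matches` are assumptions for illustration, not a description of any library's filter syntax.

```python
import operator

# Operator names other than "gt" are assumed for this sketch.
OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
}


def matches(metadata: dict, flt: dict) -> bool:
    """True if metadata satisfies every condition in the filter."""
    for key, condition in flt.items():
        if key not in metadata:
            return False
        value = metadata[key]
        if isinstance(condition, dict):
            # Operator form, e.g. {"gt": 2}: every listed operator must hold.
            if not all(OPS[op](value, operand) for op, operand in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. {"context": "healthcare"}: exact equality.
            return False
    return True


record = {"theme_preference": "dark", "user_level": 3}
is_advanced = matches(record, {"user_level": {"gt": 2}})
```

Because the comparison runs on typed values, `3 > 2` is evaluated numerically and no embedding or semantic similarity is involved.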

4. Procedural memory: the third memory type

4.1 Memory type taxonomy: comparison with human cognitive architecture

Traditional AI memory systems focus on two types:

Episodic memory (what happened)
Semantic memory (knowledge)

The v1.0.0 API adds explicit support for a third:

Procedural memory (how to do things)

Comparison with human cognitive architecture (CoALA framework):

Episodic: what happened (past experiences)
Semantic: what I know (facts and preferences)
Procedural: how to do it (skills and processes)

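4.2 Implementation: the extraction prompt for procedural memory

API call:

# `mem0` is the memory client object used throughout this article
response = mem0.add(
    content="Department PR review process: validate → classify → notify",
    memory_type="procedural_memory"
)

Extraction prompt:

Extract procedural knowledge from the following conversation, focusing on steps and dependencies:
[conversation history]
Output format:
- Step 1: [action]
- Step 2: [action] (depends on step 1)
...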
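Once the model replies in that output format, the steps still have to be parsed into something a workflow can use. The sketch below assumes an English rendering of the format, `- Step N: [action] (depends on step M)`, with ASCII parentheses; the regular expression and the `parse_steps` name are illustrative, and real model output would need more tolerant parsing.

```python
import re

# Matches "- Step 2: classify (depends on step 1)"; the dependency part is optional.
STEP = re.compile(
    r"^-\s*Step\s+(\d+):\s*(.*?)(?:\s*\(depends on step\s+(\d+)\))?\s*$",
    re.IGNORECASE,
)


def parse_steps(text: str) -> list[dict]:
    """Parse extraction output into ordered steps with optional dependencies."""
    steps = []
    for line in text.splitlines():
        m = STEP.match(line.strip())
        if m:
            steps.append({
                "step": int(m.group(1)),
                "action": m.group(2),
                "depends_on": int(m.group(3)) if m.group(3) else None,
            })
    return steps


output = "- Step 1: validate\n- Step 2: classify (depends on step 1)\nSome trailing commentary"
steps = parse_steps(output)
```

Lines that do not match the format, such as the trailing commentary here, are skipped rather than raising an error.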

4.3 Production usage scenarios

Coding Assistant:

The team's PR review process
Preferred testing approach (unit tests vs end-to-end tests)
Deployment workflow (CI/CD process)

Customer Service Agent:

Standard procedure for handling complaints
Steps to solve common problems
Decision tree for escalations

Knowledge Management Workflow:

Document review process
Knowledge base update strategy
Permission approval chain

Production Value:

Procedures separated from facts: procedural memories can be reused across contexts
Reproducibility: ensures consistent process execution
Knowledge transfer: new team members get up to speed quickly through procedural memory

5. The ACE self-improvement loop: agents that improve themselves

5.1 The problem: two major flaws of traditional agents

Brevity bias: LLMs tend to generate short answers and lose nuance

Context collapse: iterative summarization gradually erodes detail

5.2 The ACE solution: a three-agent loop

The Agentic Context Engineering (ACE) three-agent loop addresses both problems:

Generator → Reflector → Curator → Generator

Step details:

Generator: produces the initial response or trajectory
Reflector: evaluates and improves it (detects errors, adds missing context)
Curator: extracts lessons and updates the "context playbook" (skills.md or a memory store)
Generator: on the next run, the playbook is injected automatically

Benchmark results:

Agent benchmarks: +10.6%
Domain-specific tasks: +8.6%
No LLM fine-tuning required
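The control flow of the loop can be sketched in a few lines. Everything here is a stand-in: the three roles are passed in as plain functions, where a real system would call an LLM for each, and `ace_round` and `append_new` are hypothetical names. The curator shown appends unseen lessons instead of rewriting the playbook, which is one simple way to avoid the detail loss that repeated summarization causes.

```python
from typing import Callable


def ace_round(
    task: str,
    playbook: list[str],
    generate: Callable[[str, list[str]], str],
    reflect: Callable[[str, str], list[str]],
    curate: Callable[[list[str], list[str]], list[str]],
) -> tuple[str, list[str]]:
    """One Generator -> Reflector -> Curator pass; returns the answer and the updated playbook."""
    answer = generate(task, playbook)      # Generator sees the current playbook
    lessons = reflect(task, answer)        # Reflector turns the attempt into lessons
    return answer, curate(playbook, lessons)


def append_new(playbook: list[str], lessons: list[str]) -> list[str]:
    """Curator that adds unseen lessons and leaves existing entries untouched."""
    updated = list(playbook)
    for lesson in lessons:
        if lesson not in updated:
            updated.append(lesson)
    return updated


# Stub roles for demonstration only.
def stub_generate(task: str, playbook: list[str]) -> str:
    return f"{task} using {len(playbook)} lessons"


def stub_reflect(task: str, answer: str) -> list[str]:
    return ["run tests before deploy"]


first, playbook = ace_round("deploy", [], stub_generate, stub_reflect, append_new)
second, playbook = ace_round("deploy", playbook, stub_generate, stub_reflect, append_new)
```

The second round runs with the lesson learned in the first, and the playbook does not grow when the same lesson is reflected again.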

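5.3 Implementation example: Mem0's update mechanism

Memory write:

def curator_response(reflected_response):
    # Extract lessons and write each one to memory as a procedural memory.
    # `extract_learnings` is an application-defined helper, not shown here.
    learnings = extract_learnings(reflected_response)
    for learning in learnings:
        mem0.add(
            content=learning,
            memory_type="procedural_memory"
        )
    return reflected_response

Playbook injection:

# Injected automatically on the next run; `current_task` is set by the caller
playbook = mem0.retrieve(
    query="deployment process",
    filter={"context": "deployment"}
)
system_prompt = f"""
System prompt:
{playbook}

Current task: {current_task}
"""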

5.4 Production deployment considerations

Asynchronous by default: synchronous memory writes block the response pipeline and add latency
Priority tiers: high-importance memories are written first
Batch writes: multiple memory writes are combined into a single API call
Retry on failure: a failed memory write should be retried without blocking the response
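Three of these considerations (asynchronous writes, batching, and retry) can be combined in a background writer that drains a queue. This is a minimal sketch using `asyncio`: `memory_writer` and `write_batch` are hypothetical names, `write_batch` stands in for whatever call persists memories, priority tiers are left out, and a batch that still fails after the last retry is dropped silently here, where a real system would log it or send it to a dead-letter store.

```python
import asyncio


async def memory_writer(queue: asyncio.Queue, write_batch, batch_size: int = 10,
                        max_retries: int = 3) -> None:
    """Drain the queue in batches off the response path; a None item signals shutdown."""
    running = True
    while running:
        item = await queue.get()
        if item is None:
            break
        batch = [item]
        # Opportunistically group whatever else is already queued.
        while len(batch) < batch_size and not queue.empty():
            nxt = queue.get_nowait()
            if nxt is None:
                running = False
                break
            batch.append(nxt)
        for attempt in range(max_retries):
            try:
                await write_batch(batch)
                break
            except Exception:
                # Exponential backoff before the next attempt.
                await asyncio.sleep(0.01 * 2 ** attempt)


async def demo():
    written, calls = [], {"n": 0}

    async def flaky_write(batch):
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("transient failure")
        written.append(list(batch))

    queue: asyncio.Queue = asyncio.Queue()
    for memory in ["a", "b", "c"]:
        queue.put_nowait(memory)   # the response path only enqueues and returns
    queue.put_nowait(None)
    await memory_writer(queue, flaky_write, batch_size=2)
    return written, calls["n"]


written, call_count = asyncio.run(demo())
```

In the demo the first write attempt fails and is retried, so three memories end up persisted in two batches after three calls.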

Performance Optimization:

Rapid prototyping: Mem0 + LangGraph (3 lines of memory code)
Production grade: enterprise solutions from Mem0 or 47Billion.com
Advanced: graph memory plus the ACE loop for self-improving agents

6. Production implementation checklist

6.1 Memory system architecture checks

[ ] Scope design: decide where user_id, agent_id, run_id, and app_id each apply
[ ] Memory type classification: usage strategy for episodic, semantic, and procedural memory
[ ] Expiration policy: time (TTL), access frequency, relevance
[ ] Metadata filtering: structured attribute definitions and queries
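As an illustration of the expiration-policy item, the sketch below combines two of the listed signals, age against a TTL and access frequency. The `MemoryRecord` fields, the `should_expire` name, and the threshold of five accesses are assumptions chosen for the example; relevance scoring is left out.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    content: str
    created_at: float        # seconds since the epoch
    access_count: int = 0


def should_expire(record: MemoryRecord, now: float, ttl_seconds: float,
                  keep_if_accessed: int = 5) -> bool:
    """Expire records older than the TTL unless they are accessed often enough to keep."""
    too_old = now - record.created_at > ttl_seconds
    return too_old and record.access_count < keep_if_accessed


DAY = 86_400.0
now = 100 * DAY
stale = MemoryRecord("old, rarely used", created_at=now - 40 * DAY, access_count=1)
popular = MemoryRecord("old, used often", created_at=now - 40 * DAY, access_count=12)
fresh = MemoryRecord("recent", created_at=now - 2 * DAY)

expired = [r.content for r in (stale, popular, fresh) if should_expire(r, now, ttl_seconds=30 * DAY)]
```

Only the old, rarely used record is expired; frequent access keeps an old record alive.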

6.2 Performance and Cost Optimization

[ ] Asynchronous by default: memory writes do not block the response pipeline
[ ] Reranking layer: support for rerankers such as Cohere
[ ] Embedding pipeline: local embeddings with FastEmbed (lower cost and less data egress)
[ ] GPU support: production-grade semantic retrieval needs a GPU-backed embedder

6.3 Security and Privacy

[ ] User authentication: memory scopes are bound to application-level authentication
[ ] Data retention: privacy-by-design pattern (OpenMemory MCP)
[ ] Access control: attribute-based permission checks
[ ] Audit log: audit trail for memory writes and reads

6.4 Monitoring and Observability

[ ] Benchmarking: run the LOCOMO benchmark
[ ] Multi-dimensional evaluation: monitor BLEU, F1, LLM Score, tokens, and latency
[ ] Memory hit rate: Retrieval hit rate and accuracy
[ ] Cost Tracking: API calls, vector encoding, index update costs

6.5 Deployment scenario checks

Customer support scenario:

[ ] Short-term memory: current session
[ ] Episodic memory: summaries of past tickets
[ ] Semantic memory: user preferences (e.g. email updates)
[ ] Procedural memory: review workflow

Enterprise artifact management:

[ ] Graph memory: relationships between documents
[ ] Procedural memory: approval workflow
[ ] Long-term memory: full artifact history

7. Production deployment case: customer support automation

7.1 System Architecture

Short-term memory:

Current session conversation history
The 5-10 most recent turns

Episodic memory:

Summaries of past tickets
Recognition of recurring user issues

Semantic memory:

User preferences (e.g. prefers email updates)
Labels and tags

Procedural memory:

Complaint handling process
Escalation decision tree

7.2 Production results

User experience:

Returning users are greeted by name
Past issues are referenced instantly
Relevant suggestions are offered based on preferences

Measurable metrics:

Accuracy: 66.9% (Mem0 selective)
Latency: 0.71 seconds median, 1.44 seconds p95
Token consumption: ~1,800/conversation
User satisfaction: +15% (vs no memory)

7.3 Cost-benefit analysis

Cost:

API calls: ~100/conversation
Vector encoding: ~50/conversation
Index update: ~20/conversation
Total cost: ~170/conversation

Benefits:

Human support cost reduction: 40%
Average handling time: -30%
Customer satisfaction: +15%

Payback Period: Approximately 6 months

8. Summary of key points

8.1 Evolution of memory systems

2024: conversation history placed in the context window was called memory (in practice stateless)
2026: memory is a first-class architectural component with its own benchmarks, research literature, and measurable performance gaps

8.2 Production deployment recommendations

Benchmarking: multi-dimensional evaluation with the LOCOMO framework
Selective pipeline: accept about 6 percentage points less accuracy for 91% lower p95 latency
Four-layer scope: combine user_id, agent_id, run_id, and app_id
Procedural memory: store workflows and skills separately
ACE loop: three-agent context improvement yields +10.6% on agent benchmarks

8.3 Deployment threshold

Memory systems are now a required component of production-grade AI agents rather than an optional add-on. Before deployment you must:

Define the scope design: whose memory, valid when, shared where?
Choose memory types: how to combine facts, procedures, and skills
Set up benchmarking: the LOCOMO multi-dimensional evaluation framework
Optimize the trade-offs: find the balance point among accuracy, latency, and cost
Monitoring and observability: benchmarks, baseline metrics, cost tracking

9. References

Benchmarking:

LOCOMO benchmark - long-term conversational memory evaluation dataset
Mem0 research paper - ECAI 2025, arXiv:2504.19413

Technical Practice:

Mem0 documentation
47Billion AI Agent Memory
OpenMemory MCP - Privacy-first memory service

Date: May 1, 2026 | Category: Cheese Evolution | Reading time: 35 minutes