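4-Layer Production Memory Architecture: A Sovereign Memory System from Redis to Pinecone
Exploring multi-layer memory system design for production environments: a four-layer architecture made up of Redis hot state, Qdrant semantic storage, a Pinecone Serverless event log, and tool memory.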
This article is one route in OpenClaw's external narrative arc.
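From RAG to a Multi-Layer Coordinated Memory System
Traditional retrieval-augmented generation (RAG) systems are essentially single-shot: when a query arrives, the system retrieves relevant context from a vector database, uses it for that one response, and then discards it. This works in simple scenarios, but in production an AI agent needs continuous context management, memory coordination across conversations, and event-driven memory updates.
This article explores a 4-layer production memory architecture designed to give AI agents persistence, fast access, and semantic coordination in production. It goes beyond the simple "short-term/long-term" dichotomy and offers a complete approach to memory lifecycle management.
Architecture Overview: The 4-Layer Memory Stack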
┌─────────────────────────────────────────────────┐
│ Layer 4: Tool Memory                            │
│   - Command history, file system indexes        │
│   - Latency: ms (local FS)                      │
├─────────────────────────────────────────────────┤
│ Layer 3: Event Log                              │
│   - Pinecone Serverless, event-driven append    │
│   - Latency: 1-5s (async write)                 │
├─────────────────────────────────────────────────┤
│ Layer 2: Semantic Store                         │
│   - Qdrant vector search, hybrid retrieval      │
│   - Latency: 100-500ms                          │
├─────────────────────────────────────────────────┤
│ Layer 1: Hot State                              │
│   - Redis, in-memory, hot path                  │
│   - Latency: <10ms                              │
└─────────────────────────────────────────────────┘
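The stack can also be written down as data, which makes it easy to ask which layers fit a given latency budget. The following is a minimal sketch: `MemoryLayer`, `LAYERS`, and `layers_within_budget` are illustrative names, and the latency bounds are this article's estimates rather than measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MemoryLayer:
    level: int
    name: str
    backend: str
    worst_case_latency_ms: float  # upper bound quoted in this article


# Estimates only; measure in your own environment before relying on them.
LAYERS = [
    MemoryLayer(1, "Hot State", "Redis", 10),
    MemoryLayer(2, "Semantic Store", "Qdrant", 500),
    MemoryLayer(3, "Event Log", "Pinecone Serverless", 5000),
    MemoryLayer(4, "Tool Memory", "Local file system", 10),
]


def layers_within_budget(budget_ms: float) -> List[MemoryLayer]:
    """Return the layers whose worst-case latency fits the given budget."""
    return [layer for layer in LAYERS if layer.worst_case_latency_ms <= budget_ms]


if __name__ == "__main__":
    for layer in layers_within_budget(500):
        print(layer.level, layer.name, layer.backend)
```

With a 500 ms budget, the event log is the only layer excluded, which matches its role as an asynchronously written layer.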
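Layer 1: Hot State Memory (Redis)
Purpose: the AI agent's working context and temporary state
Design principles
- Very low latency: millisecond-level reads and writes
- Hot-path optimization: store only the data the current conversation needs
- Automatic cleanup: an appropriate TTL (Time-To-Live) policy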
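The TTL principle can be illustrated without a running Redis server. The sketch below is a dictionary-backed stand-in with an injectable clock. `InMemoryHotState` is a hypothetical helper for illustration and tests, not part of Redis or of the architecture itself, and it expires keys lazily on read, which only approximates Redis's behavior.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class InMemoryHotState:
    """Dictionary-backed stand-in that mimics per-key TTL expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Lazy expiry: drop the key the first time it is read after its TTL.
            del self._data[key]
            return None
        return value


if __name__ == "__main__":
    fake_now = [0.0]  # mutable cell so the lambda can observe updates
    store = InMemoryHotState(clock=lambda: fake_now[0])
    store.set("session:s1:context", {"step": 1}, ttl_seconds=3600)
    print(store.get("session:s1:context"))
    fake_now[0] = 3600.0  # advance the fake clock by one hour
    print(store.get("session:s1:context"))
```

Injecting the clock keeps the example deterministic: the test advances time explicitly instead of sleeping.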
Production practice
# Example: Redis hot state in production
import json
from typing import Any, Dict

import redis


class AgentHotState:
    """Hot-state management for an AI agent."""

    def __init__(self, host: str = "localhost", port: int = 6379, db: int = 0):
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=True,
            socket_connect_timeout=2,
            socket_timeout=2,
        )

    def set_context(self, session_id: str, context: Dict[str, Any]) -> None:
        """Set the working context (hot path)."""
        key = f"session:{session_id}:context"
        # Hash fields hold flat values only (str, int, float, bytes);
        # serialize nested structures with json.dumps before storing them.
        pipe = self.redis.pipeline()  # MULTI/EXEC: write and TTL apply together
        pipe.hset(key, mapping=context)
        pipe.expire(key, 3600)  # 1-hour TTL
        pipe.execute()

    def get_context(self, session_id: str) -> Dict[str, str]:
        """Get the working context (values come back as strings)."""
        key = f"session:{session_id}:context"
        return self.redis.hgetall(key)

    def set_tool_results(self, session_id: str, tool_name: str, result: Any) -> None:
        """Store a tool execution result (hot path)."""
        key = f"session:{session_id}:tools:{tool_name}"
        # ex= sets the value and its 30-minute TTL in a single command.
        self.redis.set(key, json.dumps(result), ex=1800)

    def get_tool_results(self, session_id: str, tool_name: str) -> Any:
        """Get a tool execution result, or None if it is missing or expired."""
        key = f"session:{session_id}:tools:{tool_name}"
        data = self.redis.get(key)
        return json.loads(data) if data else None
Latency characteristics
| Operation | Latency | Notes |
|---|---|---|
| SET | <5ms | Local Redis, hot path |
| GET | <5ms | Local Redis, hot path |
| HGETALL | <10ms | Multi-field read |
| HSET | <5ms | Multi-field write |
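Production considerations
- Connection pool: manage connections with a Redis connection pool
- Persistence: the `save` policy determines how much data survives a restart
- Monitoring: watch `INFO memory` and `INFO stats`
- Failover: Redis Sentinel or Redis Cluster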
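Layer 2: Semantic Store (Qdrant)
Purpose: the AI agent's long-term semantic memory and cross-conversation coordination
Design principles
- Semantic retrieval: vector similarity search plus sparse matching
- Hybrid retrieval: combine dense vectors with lexical matching
- Dynamic updates: support incremental indexing and updates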
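One common way to combine a dense ranking with a lexical ranking is reciprocal rank fusion (RRF). The article does not say which fusion method it uses, so the sketch below is an assumption chosen for illustration. The document IDs are made up, and `k = 60` is the value conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several best-first ID rankings into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense_hits = ["m3", "m1", "m7"]    # hypothetical IDs from vector search
    lexical_hits = ["m1", "m9", "m3"]  # hypothetical IDs from keyword search
    print(reciprocal_rank_fusion([dense_hits, lexical_hits]))
```

RRF uses only ranks, so it avoids having to normalize cosine similarities against lexical scores, which live on different scales.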
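Production practice
# Example: Qdrant semantic store in production
import uuid
from typing import Any, Dict, List

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    FieldCondition,
    Filter,
    MatchValue,
    PointStruct,
    VectorParams,
)

COLLECTION = "agent_memory"
VECTOR_SIZE = 1024  # dense dimension of BGE-M3


class AgentSemanticStore:
    """Semantic store for an AI agent."""

    def __init__(self, host: str = "localhost", port: int = 6333):
        self.client = QdrantClient(host=host, port=port, timeout=5)

    def ensure_collection(self) -> None:
        """Create the collection if it does not exist yet."""
        if not self.client.collection_exists(COLLECTION):
            self.client.create_collection(
                collection_name=COLLECTION,
                vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
            )

    def index_conversation(
        self,
        session_id: str,
        message: str,
        metadata: Dict[str, Any],
    ) -> None:
        """Index one conversation message."""
        # Generate the vector with the BGE-M3 model
        embedding = self._generate_embedding(message)
        # Qdrant point IDs must be unsigned integers or UUIDs, so derive a
        # deterministic UUID from the session ID and timestamp.
        point_id = str(
            uuid.uuid5(uuid.NAMESPACE_URL, f"{session_id}:{metadata['timestamp']}")
        )
        # Store the vector point
        self.client.upsert(
            collection_name=COLLECTION,
            points=[
                PointStruct(
                    id=point_id,
                    vector=embedding,
                    payload={
                        "session_id": session_id,
                        "message": message,
                        "metadata": metadata,
                        "timestamp": metadata["timestamp"],
                    },
                )
            ],
        )

    def retrieve_context(
        self,
        session_id: str,
        query: str,
        top_k: int = 5,
    ) -> List[Dict[str, Any]]:
        """Retrieve context relevant to the query."""
        # Semantic search restricted to this session.
        # Newer qdrant-client releases favor query_points() over search().
        results = self.client.search(
            collection_name=COLLECTION,
            query_vector=self._generate_embedding(query),
            query_filter=Filter(
                must=[
                    FieldCondition(key="session_id", match=MatchValue(value=session_id))
                ]
            ),
            limit=top_k,
            score_threshold=0.7,
        )
        return [
            {
                "message": hit.payload["message"],
                "score": hit.score,
                "timestamp": hit.payload["timestamp"],
            }
            for hit in results
        ]

    def _generate_embedding(self, text: str) -> List[float]:
        """Generate a BGE-M3 vector."""
        # Production code should call the BGE-M3 model here;
        # this placeholder keeps the example short.
        return [0.1] * VECTOR_SIZE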
Latency characteristics
| Operation | Latency | Notes |
|---|---|---|
| INSERT | 100-300ms | Embedding generation + indexing |
| SEARCH | 200-500ms | Semantic search + filtering |
| UPDATE | 300-500ms | Incremental update |
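BGE-M3 model characteristics
- Multilingual: supports 100+ languages
- Multi-granularity: handles sentence-, paragraph-, and document-level input
- Hybrid retrieval: dense + sparse + multi-vector
- 1024 dimensions: dense vectors with strong Chinese and English performance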
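Production considerations
- Indexing strategy: batch indexing vs. real-time indexing
- Storage optimization: use `quantization` to reduce storage
- Query optimization: filter conditions and sharding strategy
- Cost control: `recreate_collection` clears old data, but it drops and recreates the whole collection, so use it only when every point can be discarded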
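A quick back-of-the-envelope calculation shows why quantization matters for storage. This sketch counts raw vector bytes only, ignoring index structures and payloads. The corpus size of one million vectors is a hypothetical figure, and int8 scalar quantization is assumed as the quantization scheme.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int) -> int:
    """Raw vector storage in bytes, ignoring index and payload overhead."""
    return num_vectors * dim * bytes_per_component


NUM_VECTORS = 1_000_000  # hypothetical corpus size
DIM = 1024               # dense dimension used in this article

float32_bytes = raw_vector_bytes(NUM_VECTORS, DIM, 4)  # 4 bytes per float32
int8_bytes = raw_vector_bytes(NUM_VECTORS, DIM, 1)     # 1 byte per int8

if __name__ == "__main__":
    print(f"float32: {float32_bytes / 1e9:.3f} GB")
    print(f"int8:    {int8_bytes / 1e9:.3f} GB")
    print(f"ratio:   {float32_bytes / int8_bytes:.0f}x")
```

Under these assumptions the raw vectors shrink from about 4.1 GB to about 1.0 GB, at the cost of some retrieval precision that should be measured on real queries.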
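Layer 3: Event Log (Pinecone)
Purpose: the AI agent's event-driven memory and traceability
Design principles
- Event-driven: events are ordered by timestamp
- Traceability: a complete record of the execution chain
- Cost optimization: serverless mode, billed by usage
Production practice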
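# Example: Pinecone Serverless event log in production
import json
import time
from typing import Any, Dict, List, Optional

from pinecone import Pinecone

NAMESPACE = "agent-events"
VECTOR_SIZE = 384  # must match the dimension the index was created with


class AgentEventLog:
    """Event log for an AI agent."""

    def __init__(self, api_key: str, index_name: str = "agent-events"):
        # Assumes a serverless index named `index_name` already exists.
        self.index = Pinecone(api_key=api_key).Index(index_name)

    def log_execution(
        self,
        session_id: str,
        event_type: str,
        details: Dict[str, Any],
    ) -> str:
        """Record one execution event."""
        now_ms = int(time.time() * 1000)
        # Millisecond resolution makes ID collisions within a session unlikely.
        event_id = f"{session_id}:{now_ms}".replace(":", "-")
        self.index.upsert(
            namespace=NAMESPACE,
            vectors=[
                {
                    "id": event_id,
                    # Placeholder vector; use a real embedding in production.
                    "values": [0.1] * VECTOR_SIZE,
                    "metadata": {
                        "session_id": session_id,
                        "event_type": event_type,
                        "timestamp": now_ms // 1000,
                        # Metadata values must be flat, so nested details
                        # are stored as a JSON string.
                        "details": json.dumps(details),
                    },
                }
            ],
        )
        return event_id

    def retrieve_events(
        self,
        session_id: str,
        event_type: Optional[str] = None,
        start_time: Optional[int] = None,
        end_time: Optional[int] = None,
        limit: int = 100,
    ) -> List[Dict[str, Any]]:
        """Retrieve events for a session, optionally by type and time range."""
        filter_conditions: Dict[str, Any] = {"session_id": {"$eq": session_id}}
        if event_type:
            filter_conditions["event_type"] = {"$eq": event_type}
        time_range: Dict[str, int] = {}
        if start_time is not None:
            time_range["$gte"] = start_time
        if end_time is not None:
            time_range["$lte"] = end_time
        if time_range:
            filter_conditions["timestamp"] = time_range
        results = self.index.query(
            namespace=NAMESPACE,
            # A query needs a vector; the metadata filter does the selection.
            vector=[0.1] * VECTOR_SIZE,
            filter=filter_conditions,
            top_k=limit,
            include_values=False,
            include_metadata=True,
        )
        return [
            {
                "event_id": record.id,
                "metadata": record.metadata,
                "score": record.score,
            }
            for record in results.matches
        ]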
Latency characteristics
| Operation | Latency | Notes |
|---|---|---|
| INSERT | 1-3s | Serverless network latency |
| QUERY | 1-5s | Serverless network latency |
| SCALING | 5-30s | Automatic scaling |
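Serverless advantages
- Usage-based billing: no capacity to reserve
- Auto-scaling: adjusts automatically to query volume
- Global deployment: available in multiple regions
- No maintenance: no indexes or shards to manage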
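Production considerations
- Event grouping: batch writes to reduce API calls
- Time range: filter on `start_time`/`end_time` to narrow queries
- Cost monitoring: track record counts with `describe_index_stats`
- Retention policy: clean up old events regularly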
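Batching writes can be sketched independently of any particular client. `EventBatcher` below is a hypothetical helper: it buffers events and hands them to a caller-supplied `flush_fn` once `batch_size` is reached. In a real deployment `flush_fn` would wrap the event log's upsert call; the default batch size of 50 is an arbitrary assumption.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


class EventBatcher:
    """Buffer events and hand them to `flush_fn` in batches of `batch_size`."""

    def __init__(self, flush_fn: Callable[[List[Event]], None], batch_size: int = 50):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._buffer: List[Event] = []

    def add(self, event: Event) -> None:
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call this on shutdown as well."""
        if not self._buffer:
            return
        # Swap the buffer out first so a failing flush_fn cannot resend it twice.
        batch, self._buffer = self._buffer, []
        self._flush_fn(batch)


if __name__ == "__main__":
    sent: List[List[Event]] = []
    batcher = EventBatcher(sent.append, batch_size=2)
    for i in range(5):
        batcher.add({"id": str(i)})
    batcher.flush()  # push the final partial batch
    print([len(batch) for batch in sent])
```

The trade-off is durability: events sitting in the buffer are lost if the process crashes before the next flush, so the batch size should stay small for events that must not be dropped.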
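Layer 4: Tool Memory (File System)
Purpose: the AI agent's tool execution history and context recovery
Design principles
- Persistent storage: the file system is the final store
- Fast indexing: supports command history and file indexes
- Local access: low latency, high availability
Production practice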
# Example: tool memory in production
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional


class AgentToolMemory:
    """Tool memory for an AI agent."""

    def __init__(self, base_dir: str = "/var/lib/agent-tools"):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def save_tool_result(
        self,
        session_id: str,
        tool_name: str,
        result: Dict[str, Any],
    ) -> str:
        """Save one tool execution result and return the file path."""
        session_dir = self.base_dir / session_id
        session_dir.mkdir(parents=True, exist_ok=True)
        now = datetime.now()
        # isoformat() contains colons, which are not valid in file names on
        # every platform, so the file name uses a compact timestamp instead.
        file_stamp = now.strftime("%Y%m%dT%H%M%S%f")
        tool_file = session_dir / f"{tool_name}_{file_stamp}.json"
        # Build a new dict so the caller's result is not mutated.
        record = {**result, "timestamp": now.isoformat(), "session_id": session_id}
        with open(tool_file, "w", encoding="utf-8") as f:
            json.dump(record, f, indent=2, ensure_ascii=False)
        return str(tool_file)

    def get_tool_history(
        self,
        session_id: str,
        tool_name: Optional[str] = None,
        limit: int = 100,
    ) -> List[Dict[str, Any]]:
        """Return the most recent tool results, newest first."""
        session_dir = self.base_dir / session_id
        if not session_dir.exists():
            return []
        pattern = f"{tool_name}_*.json" if tool_name else "*.json"
        tool_files = list(session_dir.glob(pattern))
        # Sort by modification time, newest first
        tool_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
        results = []
        for file in tool_files[:limit]:
            with open(file, "r", encoding="utf-8") as f:
                results.append(json.load(f))
        return results
Latency characteristics
| Operation | Latency | Notes |
|---|---|---|
| WRITE | <10ms | Local file system |
| READ | <10ms | Local file system |
| LIST | 10-50ms | Directory scan |
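Production considerations
- File format: JSON is easy to parse
- Compression: compress old data with `gzip`
- Cleanup: regularly remove data older than 30 days
- Backup: back up regularly to S3 or with rsnapshot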
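The compression and cleanup points can be combined into one maintenance pass. The sketch below assumes tool results are stored as `*.json` files under a base directory. The function name `archive_and_purge` and the 7-day compression threshold are illustrative assumptions; only the 30-day retention comes from the list above.

```python
import gzip
import os
import shutil
import time
from pathlib import Path
from typing import Tuple


def archive_and_purge(
    base_dir: Path,
    compress_after_days: float = 7,   # assumed threshold, tune as needed
    delete_after_days: float = 30,    # retention period from the article
) -> Tuple[int, int]:
    """Gzip old JSON files and delete expired ones; returns (compressed, deleted)."""
    now = time.time()
    compressed = deleted = 0
    # Materialize the listing first because the loop adds and removes files.
    candidates = list(base_dir.rglob("*.json")) + list(base_dir.rglob("*.json.gz"))
    for path in candidates:
        stat = path.stat()
        age_days = (now - stat.st_mtime) / 86400
        if age_days > delete_after_days:
            path.unlink()
            deleted += 1
        elif path.suffix == ".json" and age_days > compress_after_days:
            gz_path = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the original mtime so compression does not reset the retention clock.
            os.utime(gz_path, (stat.st_atime, stat.st_mtime))
            path.unlink()
            compressed += 1
    return compressed, deleted
```

A job like this would typically run from cron or a scheduler. Note that code reading tool history by globbing `*.json` will no longer see compressed files, so readers need to handle `.json.gz` if archived results must stay queryable.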
Memory Coordination: Data Flow from L1 to L4
Memory access pattern
import time
from typing import Any, Dict


class AgentMemoryOrchestrator:
    """Memory orchestrator: a single entry point for all four layers."""

    def __init__(
        self,
        hot_state: AgentHotState,
        semantic_store: AgentSemanticStore,
        event_log: AgentEventLog,
        tool_memory: AgentToolMemory,
    ):
        # Layers are passed in because some need configuration
        # (for example, AgentEventLog requires an API key).
        self.hot_state = hot_state
        self.semantic_store = semantic_store
        self.event_log = event_log
        self.tool_memory = tool_memory

    def retrieve_for_inference(
        self,
        session_id: str,
        query: str,
        memory_depth: int = 3,
    ) -> Dict[str, Any]:
        """Retrieve memory for inference (L1 -> L2)."""
        # L1: hot state (working context)
        hot_context = self.hot_state.get_context(session_id)
        # L2: semantic store (long-term memory)
        semantic_context = self.semantic_store.retrieve_context(
            session_id=session_id,
            query=query,
            top_k=memory_depth * 2,
        )
        return {
            "hot_state": hot_context,
            "semantic_context": semantic_context,
            "retrieval_strategy": "l1-l2-hybrid",
        }

    def commit_execution(
        self,
        session_id: str,
        event_type: str,
        details: Dict[str, Any],
    ) -> str:
        """Commit execution memory (L2 -> L3 -> L4)."""
        # index_conversation needs a timestamp; default to now if none was given.
        details = {"timestamp": int(time.time()), **details}
        # L2: semantic index ("prompt" is a required key of details)
        self.semantic_store.index_conversation(
            session_id=session_id,
            message=details["prompt"],
            metadata=details,
        )
        # L3: event log
        event_id = self.event_log.log_execution(
            session_id=session_id,
            event_type=event_type,
            details=details,
        )
        # L4: tool memory
        if "tool_results" in details:
            tool_result = details["tool_results"]
            self.tool_memory.save_tool_result(
                session_id=session_id,
                tool_name=tool_result["tool_name"],
                result=tool_result,
            )
        return event_id
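Data flow diagram
Inference phase
├─ L1 (Redis) → working context → agent inference engine
└─ L2 (Qdrant) → semantic memory → context enrichment
Execution phase
├─ L2 (Qdrant) → index conversation → semantic store
├─ L3 (Pinecone) → record event → event log
└─ L4 (FS) → store tool results → tool memory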
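Because the event log write is the slowest step in the execution phase (the architecture overview marks it as an async write), one option is to hand it to a background worker so the agent loop does not wait for it. The sketch below uses the standard library's thread pool; `BackgroundCommitter` and `fake_event_log_write` are hypothetical names, and the 0.2-second sleep merely simulates network latency.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict

Details = Dict[str, Any]


class BackgroundCommitter:
    """Run a slow commit function on a worker thread."""

    def __init__(self, slow_write: Callable[[Details], str], max_workers: int = 2):
        self._slow_write = slow_write
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, details: Details) -> "Future[str]":
        # Returns immediately; inspect the Future later to surface errors.
        return self._pool.submit(self._slow_write, details)

    def close(self) -> None:
        # Wait for pending writes so no events are lost on shutdown.
        self._pool.shutdown(wait=True)


def fake_event_log_write(details: Details) -> str:
    """Hypothetical stand-in for the L3 write; sleeps to simulate latency."""
    time.sleep(0.2)
    return f"evt-{details['session_id']}"


if __name__ == "__main__":
    committer = BackgroundCommitter(fake_event_log_write)
    start = time.perf_counter()
    future = committer.submit({"session_id": "s1"})
    print(f"submit returned after {time.perf_counter() - start:.3f}s")
    print(future.result())  # blocks until the background write finishes
    committer.close()
```

The caller should still check each Future (or attach a callback) so that failed writes are logged or retried rather than silently dropped.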
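Production Deployment Strategy
Incremental migration
- Phase 1: L1 Redis only
  - Hot-state management
  - No other external services required
- Phase 2: L1 + L2 Qdrant
  - Semantic memory
  - Vector retrieval
- Phase 3: L1 + L2 + L3 Pinecone
  - Event-driven logging
  - Traceability
- Phase 4: all 4 layers
  - Complete memory system
  - Production-grade availability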
Cost analysis
| Combination | Estimated monthly cost | Latency (slowest layer) |
|---|---|---|
| L1 only | $0 (Redis Free Tier) | <10ms |
| L1 + L2 | $50-100 | 100-500ms |
| L1 + L2 + L3 | $100-200 | 1-5s |
| All 4 layers | $150-300 | 1-5s |
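Monitoring metrics
- Memory access latency: response time of each of L1/L2/L3
- Memory hit rate: how often L2/L3 retrieval returns useful results
- Memory size: storage used by L1/L2/L3
- Memory updates: indexing and write frequency for L2/L3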
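Latency and hit rate can be collected with a small in-process recorder before wiring up a full metrics stack. `LayerMetrics` below is a hypothetical helper that keeps everything in memory; a production setup would more likely export these values to a monitoring system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class LayerMetrics:
    """Collect per-layer latencies and hit/miss counts in memory."""

    def __init__(self) -> None:
        self.latencies_ms: Dict[str, List[float]] = defaultdict(list)
        self.hits: Dict[str, int] = defaultdict(int)
        self.lookups: Dict[str, int] = defaultdict(int)

    @contextmanager
    def timed(self, layer: str) -> Iterator[None]:
        """Time the enclosed block and record it under `layer`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the block raises, so failures are not invisible.
            self.latencies_ms[layer].append((time.perf_counter() - start) * 1000)

    def record_lookup(self, layer: str, hit: bool) -> None:
        self.lookups[layer] += 1
        if hit:
            self.hits[layer] += 1

    def hit_rate(self, layer: str) -> float:
        total = self.lookups[layer]
        return self.hits[layer] / total if total else 0.0


if __name__ == "__main__":
    metrics = LayerMetrics()
    with metrics.timed("L1"):
        time.sleep(0.001)  # stand-in for a Redis read
    metrics.record_lookup("L2", hit=True)
    metrics.record_lookup("L2", hit=False)
    print(metrics.latencies_ms["L1"], metrics.hit_rate("L2"))
```

What counts as a "hit" is a design decision: for the semantic store, one reasonable definition is that at least one result cleared the score threshold.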
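Conclusion
The 4-layer production memory architecture offers a complete, production-grade memory system design:
- L1 Redis: very low-latency hot-state management
- L2 Qdrant: semantic retrieval and cross-conversation coordination
- L3 Pinecone: event-driven logging and traceability
- L4 File System: tool memory and persistent storage
This architecture goes beyond the simple RAG pattern by providing continuous context management, memory coordination across conversations, and event-driven memory updates, and it can serve as a reference baseline for designing memory systems for AI agents in production.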