
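4-Layer Production Memory Architecture: A Sovereign Memory System from Redis to Pinecone

Exploring multi-layer memory design for production: a four-layer architecture of Redis hot state, Qdrant semantic storage, a Pinecone Serverless event log, and tool memory

April 3, 2026 · 4 min read · Beginner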

Memory Security Orchestration Infrastructure

This article is one route in OpenClaw's external narrative arc.

From RAG to a multi-layer coordinated memory system

Traditional retrieval-augmented generation (RAG) is essentially single-shot retrieval. When a query arrives, the system fetches relevant context from a vector database, uses it for that one response, and then discards it. This works for simple scenarios. AI agents in production, however, need continuous context management, memory coordination across conversations, and event-driven memory updates.

This article explores a 4-layer production memory architecture that gives AI agents persistence, fast access, and semantic coordination. It goes beyond the simple “short-term/long-term” dichotomy and covers the full memory life cycle.

Architecture overview: 4-layer memory stack

┌─────────────────────────────────────────────────┐
│  Layer 4: Tool Memory                            │
│  - Command history, file system indexes          │
│  - Latency: <10ms (local FS)                     │
├─────────────────────────────────────────────────┤
│  Layer 3: Event Log                              │
│  - Pinecone Serverless, event-driven append      │
│  - Latency: 1-5s (async write)                   │
├─────────────────────────────────────────────────┤
│  Layer 2: Semantic Store                         │
│  - Qdrant vector search, hybrid retrieval        │
│  - Latency: 100-500ms                            │
├─────────────────────────────────────────────────┤
│  Layer 1: Hot State                              │
│  - Redis, in-memory, hot path                    │
│  - Latency: <10ms                                │
└─────────────────────────────────────────────────┘
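The diagram can be made concrete with a small sketch. It encodes each layer's upper-bound latency from the figures above and picks the layers a read path can afford within a given budget. For the local file system it takes 10 ms, per the Layer 4 latency table. The `LAYER_LATENCY_MS` table and the `layers_within_budget` helper are hypothetical illustrations, not part of any library.

```python
from typing import Dict, List

# Upper-bound latencies (ms) taken from the diagram; L4 uses the <10 ms
# local-file-system figure from its own latency table (an assumption here).
LAYER_LATENCY_MS: Dict[str, int] = {
    "L1 hot state (Redis)": 10,
    "L2 semantic store (Qdrant)": 500,
    "L3 event log (Pinecone)": 5000,
    "L4 tool memory (FS)": 10,
}

def layers_within_budget(budget_ms: int) -> List[str]:
    """Return the layers whose worst-case latency fits the budget, fastest first."""
    fitting = [name for name, ms in LAYER_LATENCY_MS.items() if ms <= budget_ms]
    # sorted() is stable, so equally fast layers keep their table order.
    return sorted(fitting, key=lambda name: LAYER_LATENCY_MS[name])

print(layers_within_budget(100))  # → ['L1 hot state (Redis)', 'L4 tool memory (FS)']
```

This is consistent with the orchestrator later in the article, whose inference path reads only L1 and L2.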

Layer 1: Hot-state memory (Redis)

Purpose: the AI agent's working context and temporary state

Design principles

Extremely low latency: millisecond-level reads and writes
Hot-path optimization: store only the data the current conversation needs
Automatic cleanup: an appropriate TTL (time-to-live) on every key
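Redis implements automatic cleanup through per-key expiry (`EXPIRE`, or `SET ... EX`). The sketch below is a stdlib-only stand-in, a hypothetical `TTLStore`, that can be useful for unit tests without a Redis server. It shows the semantics the hot-state layer relies on: a key stays readable until its TTL elapses, and after that reads behave as if the key had never been written. The injectable clock is an assumption added to make tests deterministic.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

class TTLStore:
    """In-memory key/value store with per-key expiry, mimicking Redis SET ... EX."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Lazily evict on read; Redis also does this (plus background expiry).
            del self._data[key]
            return None
        return value

# Deterministic usage with a fake clock:
now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.set("session:42:context", {"task": "triage"}, ttl_seconds=3600)
now[0] = 3599.0
assert store.get("session:42:context") == {"task": "triage"}
now[0] = 3600.0
assert store.get("session:42:context") is None
```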

Production Practice

# Production example: Redis hot state
import json
from typing import Any, Dict, Optional

import redis

class AgentHotState:
    """Hot-state management for an AI agent."""

    def __init__(self, host='localhost', port=6379, db=0):
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=True,  # return str instead of bytes
            socket_connect_timeout=2,
            socket_timeout=2
        )

    def set_context(self, session_id: str, context: Dict[str, Any]) -> None:
        """Store the working context (hot path). Values must be flat str/int/float."""
        if not context:
            return  # HSET with an empty mapping raises an error
        key = f"session:{session_id}:context"
        # Write the hash and its TTL in one MULTI/EXEC round trip
        pipe = self.redis.pipeline()
        pipe.hset(key, mapping=context)
        pipe.expire(key, 3600)  # 1-hour TTL
        pipe.execute()

    def get_context(self, session_id: str) -> Dict[str, str]:
        """Fetch the working context; returns {} once the key has expired."""
        key = f"session:{session_id}:context"
        return self.redis.hgetall(key)

    def set_tool_results(self, session_id: str, tool_name: str, result: Any) -> None:
        """Cache a tool result as JSON (hot path)."""
        key = f"session:{session_id}:tools:{tool_name}"
        self.redis.set(key, json.dumps(result), ex=1800)  # value and 30-minute TTL set atomically

    def get_tool_results(self, session_id: str, tool_name: str) -> Optional[Any]:
        """Fetch a cached tool result, or None if it is absent or expired."""
        key = f"session:{session_id}:tools:{tool_name}"
        data = self.redis.get(key)
        return json.loads(data) if data is not None else None
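There is a caveat with `set_context` above. redis-py's `hset(mapping=...)` accepts only flat values (str, bytes, int, float). A nested dict, a list, a bool, or None raises `redis.exceptions.DataError`. The sketch below uses a pair of hypothetical helpers. The first JSON-encodes every field before writing. The second decodes what `hgetall` returns when `decode_responses=True`.

```python
import json
from typing import Any, Dict

def encode_hash_fields(context: Dict[str, Any]) -> Dict[str, str]:
    """JSON-encode every value so nested dicts, lists, bools and None survive HSET."""
    return {field: json.dumps(value) for field, value in context.items()}

def decode_hash_fields(raw: Dict[str, str]) -> Dict[str, Any]:
    """Inverse of encode_hash_fields, applied to the dict returned by HGETALL."""
    return {field: json.loads(value) for field, value in raw.items()}

context = {"task": "triage", "step": 3, "done": False, "files": ["a.py", "b.py"]}
encoded = encode_hash_fields(context)
assert all(isinstance(v, str) for v in encoded.values())
assert decode_hash_fields(encoded) == context
```

With these helpers, `set_context` would write `encode_hash_fields(context)` and `get_context` would return `decode_hash_fields(...)`. The trade-off is that string fields gain JSON quotes, so any other client that reads the hash must decode them as well.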

Latency characteristics

Operation	Latency	Notes
SET	<5ms	Local Redis, hot path
GET	<5ms	Local Redis, hot path
HGETALL	<10ms	Multi-field read
HSET	<5ms	Multi-field write

Production considerations

Connection pooling: share one redis.ConnectionPool across clients instead of opening a connection per request
Persistence: the RDB snapshot (save) and AOF settings determine how much data survives a restart
Monitoring: track INFO memory and INFO stats
Failover: Redis Sentinel or Redis Cluster

Layer 2: Semantic store (Qdrant)

Purpose: long-term semantic memory and cross-conversation coordination for AI agents

Design principles

Semantic retrieval: vector similarity search plus sparse matching
Hybrid search: combines dense vectors with lexical matching
Dynamic updates: supports incremental indexing and updates
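Hybrid retrieval needs a way to merge a dense (vector) ranking with a lexical or sparse one. One common choice is Reciprocal Rank Fusion (RRF). It is used here for illustration only and is not necessarily what this architecture uses. RRF scores each document as the sum of 1/(k + rank) over every list the document appears in. Recent Qdrant versions also offer RRF server-side through their Query API. The constant k = 60 is the value conventionally used with RRF, and `reciprocal_rank_fusion` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs with RRF (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["m3", "m1", "m7"]    # from vector similarity
lexical = ["m1", "m9", "m3"]  # from keyword / sparse matching
print(reciprocal_rank_fusion([dense, lexical]))  # → ['m1', 'm3', 'm9', 'm7']
```

In this example, `m1` edges out `m3` because it ranks well in both lists. RRF rewards agreement between retrievers and does not require their raw scores to be on comparable scales.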

Production Practice

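# Production example: Qdrant semantic store
# Assumes qdrant-client >= 1.10 (collection_exists, query_points).
import uuid
from typing import Any, Dict, List

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    FieldCondition,
    Filter,
    MatchValue,
    PointStruct,
    VectorParams,
)

COLLECTION = "agent_memory"
EMBEDDING_DIM = 1024  # BGE-M3 dense vectors are 1024-dimensional

class AgentSemanticStore:
    """Semantic store for an AI agent."""

    def __init__(self, host='localhost', port=6333):
        self.client = QdrantClient(
            host=host,
            port=port,
            timeout=5
        )
        self._ensure_collection()

    def _ensure_collection(self) -> None:
        """Create the collection on first use (cosine distance, so scores are similarities)."""
        if not self.client.collection_exists(COLLECTION):
            self.client.create_collection(
                collection_name=COLLECTION,
                vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
            )

    def index_conversation(
        self,
        session_id: str,
        message: str,
        metadata: Dict[str, Any]
    ) -> None:
        """Index a conversation message; metadata must contain 'timestamp'."""
        # Generate the vector (BGE-M3 in production)
        embedding = self._generate_embedding(message)

        # Qdrant point IDs must be unsigned integers or UUIDs, so derive a
        # deterministic UUID from session and timestamp (re-indexing overwrites).
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{session_id}:{metadata['timestamp']}"))

        self.client.upsert(
            collection_name=COLLECTION,
            points=[
                PointStruct(
                    id=point_id,
                    vector=embedding,
                    payload={
                        "session_id": session_id,
                        "message": message,
                        "metadata": metadata,
                        "timestamp": metadata['timestamp']
                    }
                )
            ]
        )

    def retrieve_context(
        self,
        session_id: str,
        query: str,
        top_k: int = 5
    ) -> List[Dict[str, Any]]:
        """Retrieve context relevant to `query`."""
        response = self.client.query_points(
            collection_name=COLLECTION,
            query=self._generate_embedding(query),
            # Restricts results to this session; drop the filter to search across conversations.
            query_filter=Filter(
                must=[FieldCondition(key="session_id", match=MatchValue(value=session_id))]
            ),
            limit=top_k,
            score_threshold=0.7  # minimum cosine similarity
        )

        return [
            {
                "message": hit.payload["message"],
                "score": hit.score,
                "timestamp": hit.payload["timestamp"]
            }
            for hit in response.points
        ]

    def _generate_embedding(self, text: str) -> List[float]:
        """Return a BGE-M3 dense vector for `text`."""
        # Placeholder only: a constant vector gives every point the same score.
        # Replace with a real BGE-M3 encoder in production.
        return [0.1] * EMBEDDING_DIM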
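To see what the `score_threshold=0.7` cut-off in `retrieve_context` does on a cosine-distance collection, here is a brute-force equivalent in plain Python. It scores every stored vector by cosine similarity, drops anything below the threshold, and keeps the top k. Qdrant performs this approximately, and much faster, with an HNSW index. The hypothetical `search_bruteforce` is only a reference for small tests, and it assumes the collection uses cosine distance. It also shows why the constant placeholder embedding must be replaced: identical vectors score 1.0 against everything, so the threshold filters nothing.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_bruteforce(
    points: Dict[str, Sequence[float]],
    query: Sequence[float],
    top_k: int = 5,
    score_threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Exact top-k cosine search with a minimum score, best match first."""
    scored = [(pid, cosine(vec, query)) for pid, vec in points.items()]
    kept = [(pid, s) for pid, s in scored if s >= score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]

points = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
print(search_bruteforce(points, [1.0, 0.0]))  # "c" (similarity 0) is filtered out
```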

Latency characteristics

Operation	Latency	Notes
INSERT	100-300ms	Embedding generation + indexing
SEARCH	200-500ms	Semantic search + filtering
UPDATE	300-500ms	Incremental update

BGE-M3 model features

Multilingual: supports 100+ languages
Multi-granularity: handles sentence-, paragraph-, and document-level inputs
Hybrid retrieval: dense + sparse + multi-vector
1024 dimensions: dense vectors with strong Chinese and English quality

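Production considerations

Indexing strategy: batch indexing vs. real-time indexing
Storage optimization: use quantization to reduce storage
Query optimization: filter conditions and sharding strategy
Cost control: delete expired points with a timestamp filter rather than calling recreate_collection, which drops the entire collection and is deprecated in recent qdrant-client releases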
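For the batch side of the indexing trade-off, points are usually grouped into fixed-size chunks so that each `upsert` call carries many points instead of one. Below is a minimal, library-agnostic chunking helper. The name `chunked` is hypothetical, and the default batch size of 64 is an arbitrary assumption to tune against your payload sizes. On Python 3.12+, `itertools.batched` does the same job, yielding tuples.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int = 64) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch

# Each batch would become one client.upsert(collection_name=..., points=batch) call.
sizes = [len(b) for b in chunked(range(150), size=64)]
print(sizes)  # → [64, 64, 22]
```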

Layer 3: Event log (Pinecone)

Purpose: event-driven memory and traceability for AI agents

Design principles

Event-driven: events are ordered by their timestamps
Traceability: a complete record of the execution chain
Cost optimization: serverless mode, billed by usage

Production Practice

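# Production example: Pinecone Serverless event log
# Assumes the pinecone Python SDK v3+ (Pinecone / ServerlessSpec API).
import json
import os
import time
import uuid
from typing import Any, Dict, List, Optional

from pinecone import Pinecone, ServerlessSpec

EVENT_DIM = 384  # must match the index dimension and your embedding model

class AgentEventLog:
    """Event log for an AI agent."""

    def __init__(
        self,
        api_key: Optional[str] = None,
        index_name: str = "agent-events",
        cloud: str = "aws",          # cloud/region are deployment-specific assumptions
        region: str = "us-east-1",
    ):
        pc = Pinecone(api_key=api_key or os.environ["PINECONE_API_KEY"])
        if index_name not in pc.list_indexes().names():
            pc.create_index(
                name=index_name,
                dimension=EVENT_DIM,
                metric="cosine",
                spec=ServerlessSpec(cloud=cloud, region=region),
            )
        self.index = pc.Index(index_name)

    def log_execution(
        self,
        session_id: str,
        event_type: str,
        details: Dict[str, Any]
    ) -> str:
        """Record an execution event and return its ID."""
        now = int(time.time())
        # A random suffix keeps IDs unique when several events share a second.
        event_id = f"{session_id}-{now}-{uuid.uuid4().hex[:8]}"

        self.index.upsert(
            vectors=[
                {
                    "id": event_id,
                    # Placeholder: embed a textual summary of the event in production.
                    "values": [0.1] * EVENT_DIM,
                    "metadata": {
                        "session_id": session_id,
                        "event_type": event_type,
                        "timestamp": now,
                        # Metadata values must be flat (strings, numbers, booleans,
                        # lists of strings), so nested details are stored as JSON.
                        "details": json.dumps(details)
                    }
                }
            ],
            namespace="agent-events"
        )

        return event_id

    def retrieve_events(
        self,
        session_id: str,
        event_type: Optional[str] = None,
        start_time: Optional[int] = None,
        end_time: Optional[int] = None,
        limit: int = 100
    ) -> List[Dict[str, Any]]:
        """Retrieve a session's events, optionally by type and time range."""
        filter_conditions: Dict[str, Any] = {"session_id": {"$eq": session_id}}
        if event_type:
            filter_conditions["event_type"] = {"$eq": event_type}
        time_range: Dict[str, int] = {}
        if start_time is not None:
            time_range["$gte"] = start_time
        if end_time is not None:
            time_range["$lte"] = end_time
        if time_range:
            filter_conditions["timestamp"] = time_range

        # Queries require a vector; the metadata filter does the actual selection.
        results = self.index.query(
            vector=[0.1] * EVENT_DIM,
            namespace="agent-events",
            filter=filter_conditions,
            top_k=limit,
            include_values=False,
            include_metadata=True
        )

        return [
            {
                "event_id": match.id,
                "metadata": match.metadata,
                "score": match.score
            }
            for match in results.matches
        ]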
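Pinecone ranks matches by vector similarity, not by time, so a query against the event log returns events in score order. To read the log chronologically, as the "ordered by timestamp" principle requires, you can sort and window the matches client-side by their `timestamp` metadata. The sketch below is a minimal version that works on plain dicts shaped like the `retrieve_events` output (`event_id`, `metadata`, `score`). The helper name `chronological` is hypothetical.

```python
from typing import Any, Dict, List, Optional

def chronological(
    events: List[Dict[str, Any]],
    start_time: Optional[int] = None,
    end_time: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Keep events inside [start_time, end_time] and order them oldest first."""
    def in_window(event: Dict[str, Any]) -> bool:
        ts = event["metadata"]["timestamp"]
        return (start_time is None or ts >= start_time) and (end_time is None or ts <= end_time)

    window = [e for e in events if in_window(e)]
    return sorted(window, key=lambda e: e["metadata"]["timestamp"])

events = [
    {"event_id": "s1-300", "metadata": {"timestamp": 300}, "score": 0.9},
    {"event_id": "s1-100", "metadata": {"timestamp": 100}, "score": 0.7},
    {"event_id": "s1-200", "metadata": {"timestamp": 200}, "score": 0.8},
]
print([e["event_id"] for e in chronological(events, start_time=150)])  # → ['s1-200', 's1-300']
```

Note that `top_k` caps how many events a single query can return, so a long session may need several windowed queries.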

Latency characteristics

Operation	Latency	Notes
INSERT	1-3s	Serverless network latency
QUERY	1-5s	Serverless network latency
SCALING	5-30s	Automatic scaling

Serverless advantages

Pay as you go: no capacity reservation required
Auto-scaling: capacity adjusts automatically with query volume
Multi-region: available in multiple cloud regions
No maintenance: no indexes or shards to manage

Production considerations

Event grouping: batch writes to reduce API calls
Time range: narrow queries with start_time/end_time metadata filters
Cost monitoring: index.describe_index_stats() reports record counts per namespace (describe_index returns configuration, not usage)
Retention policy: regularly clean up old events
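Event grouping can be implemented with a small write buffer. The buffer collects events and flushes them in one bulk call once it reaches a size limit. This sketch assumes a hypothetical `flush_fn` that performs the actual bulk write; for Pinecone, that would be a single `index.upsert(vectors=batch, namespace=...)`. The default limit of 100 is an arbitrary assumption.

```python
from typing import Any, Callable, Dict, List

class EventBuffer:
    """Collects events and hands them to flush_fn in batches of at most max_size."""

    def __init__(self, flush_fn: Callable[[List[Dict[str, Any]]], None], max_size: int = 100):
        self._flush_fn = flush_fn
        self._max_size = max_size
        self._pending: List[Dict[str, Any]] = []

    def add(self, event: Dict[str, Any]) -> None:
        self._pending.append(event)
        if len(self._pending) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        self._flush_fn(list(self._pending))
        # Clear only after a successful write so a failed flush can be retried.
        self._pending.clear()

batches: List[List[Dict[str, Any]]] = []
buffer = EventBuffer(batches.append, max_size=3)
for i in range(7):
    buffer.add({"id": f"evt-{i}"})
buffer.flush()  # flush the remainder, e.g. at shutdown
print([len(b) for b in batches])  # → [3, 3, 1]
```

In production you would also flush on a timer, so that a quiet session's events are not held back indefinitely.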

Layer 4: Tool memory (file system)

Purpose: tool execution history and context recovery for AI agents

Design principles

Persistent storage: the file system is the final store
Fast indexing: supports command history and file indexes
Local access: low latency, high availability

Production Practice

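# Production example: tool memory on the local file system
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional

_SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")

class AgentToolMemory:
    """Tool memory for an AI agent."""

    def __init__(self, base_dir: str = "/var/lib/agent-tools"):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def _check_name(name: str) -> str:
        # Session IDs and tool names become path components; reject "../" and similar.
        if not _SAFE_NAME.fullmatch(name) or name in (".", ".."):
            raise ValueError(f"unsafe path component: {name!r}")
        return name

    def save_tool_result(
        self,
        session_id: str,
        tool_name: str,
        result: Dict[str, Any]
    ) -> str:
        """Save a tool result as <base>/<session>/<tool>/<UTC timestamp>.json."""
        tool_dir = self.base_dir / self._check_name(session_id) / self._check_name(tool_name)
        tool_dir.mkdir(parents=True, exist_ok=True)

        # Filename-safe, lexicographically sortable UTC timestamp (no ':' characters)
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        tool_file = tool_dir / f"{timestamp}.json"

        # Build a new dict so the caller's result is not mutated
        record = {**result, "timestamp": timestamp, "session_id": session_id, "tool_name": tool_name}

        with open(tool_file, "w", encoding="utf-8") as f:
            json.dump(record, f, indent=2)

        return str(tool_file)

    def get_tool_history(
        self,
        session_id: str,
        tool_name: Optional[str] = None,
        limit: int = 100
    ) -> List[Dict[str, Any]]:
        """Return up to `limit` results, newest first, optionally for one tool."""
        session_dir = self.base_dir / self._check_name(session_id)

        if not session_dir.exists():
            return []

        # One subdirectory per tool avoids prefix clashes such as "search" vs "search_web"
        pattern = f"{self._check_name(tool_name)}/*.json" if tool_name else "*/*.json"
        tool_files = list(session_dir.glob(pattern))

        # Newest first (file names are sortable UTC timestamps)
        tool_files.sort(key=lambda f: f.name, reverse=True)

        results = []
        for file in tool_files[:limit]:
            with open(file, "r", encoding="utf-8") as f:
                results.append(json.load(f))

        return results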

Latency characteristics

Operation	Latency	Notes
WRITE	<10ms	Local file system
READ	<10ms	Local file system
LIST	10-50ms	Directory scan

Production considerations

File format: JSON is easy to parse
Compression: gzip older data
Purge: regularly delete data older than 30 days
Backup: back up regularly, e.g. to S3 or with rsnapshot
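The compression and purge policies can run together as one periodic job over the tool-memory directory. This is a minimal sketch built on several assumptions. It targets the `.json` files written by `AgentToolMemory`. It compresses after 7 days, which is an arbitrary threshold, and deletes after the 30-day window above. File ages come from modification time. The function name `compress_and_purge` is hypothetical.

```python
import gzip
import os
import shutil
import time
from pathlib import Path
from typing import Optional

DAY = 86_400  # seconds

def compress_and_purge(
    base_dir: str,
    compress_after_days: int = 7,
    delete_after_days: int = 30,
    now: Optional[float] = None,
) -> None:
    """Gzip .json files older than compress_after_days and delete any file
    (plain or gzipped) older than delete_after_days, based on mtime."""
    now = time.time() if now is None else now
    # Materialize the listing first: files are created and deleted while looping.
    for path in list(Path(base_dir).rglob("*")):
        if not path.is_file():
            continue
        st = path.stat()
        age = now - st.st_mtime
        if age > delete_after_days * DAY:
            path.unlink()
        elif path.suffix == ".json" and age > compress_after_days * DAY:
            gz_path = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Carry over the original mtime so the deletion clock keeps running.
            os.utime(gz_path, (st.st_atime, st.st_mtime))
            path.unlink()
```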

Memory coordination: data flow from L1 to L4

Memory access patterns

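# Assumes AgentHotState, AgentSemanticStore, AgentEventLog and AgentToolMemory
# from the sections above are importable in this module.
import os
import time
from typing import Any, Dict

class AgentMemoryOrchestrator:
    """Memory orchestrator: a single entry point for all four layers."""

    def __init__(self, hot_state=None, semantic_store=None, event_log=None, tool_memory=None):
        # Layers can be injected (e.g. fakes in tests); defaults connect to real services.
        self.hot_state = hot_state if hot_state is not None else AgentHotState()
        self.semantic_store = semantic_store if semantic_store is not None else AgentSemanticStore()
        # The event log needs credentials; here they come from the environment.
        self.event_log = (
            event_log if event_log is not None
            else AgentEventLog(api_key=os.environ["PINECONE_API_KEY"])
        )
        self.tool_memory = tool_memory if tool_memory is not None else AgentToolMemory()

    def retrieve_for_inference(
        self,
        session_id: str,
        query: str,
        memory_depth: int = 3
    ) -> Dict[str, Any]:
        """Retrieve memory for inference (L1 -> L2)."""
        # L1: hot state (working context)
        hot_context = self.hot_state.get_context(session_id)

        # L2: semantic store (long-term memory)
        semantic_context = self.semantic_store.retrieve_context(
            session_id=session_id,
            query=query,
            top_k=memory_depth * 2
        )

        return {
            "hot_state": hot_context,
            "semantic_context": semantic_context,
            "retrieval_strategy": "l1-l2-hybrid"
        }

    def commit_execution(
        self,
        session_id: str,
        event_type: str,
        details: Dict[str, Any]
    ) -> str:
        """Commit execution memory (L2 -> L3 -> L4)."""
        # index_conversation needs a timestamp; add one if the caller did not.
        details = {**details, "timestamp": details.get("timestamp", int(time.time()))}

        # L2: semantic index (only when there is a prompt to embed)
        if details.get("prompt"):
            self.semantic_store.index_conversation(
                session_id=session_id,
                message=details["prompt"],
                metadata=details
            )

        # L3: event log (the slowest step, 1-5 s)
        event_id = self.event_log.log_execution(
            session_id=session_id,
            event_type=event_type,
            details=details
        )

        # L4: tool memory
        tool_result = details.get("tool_results")
        if tool_result:
            self.tool_memory.save_tool_result(
                session_id=session_id,
                tool_name=tool_result["tool_name"],
                result=tool_result
            )

        return event_id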
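The architecture diagram labels the L3 write as asynchronous (1-5 s), but `commit_execution` above waits for the event-log write before returning. One way to keep that latency off the agent's critical path is to submit the slow write to a background thread pool and return a future. This is offered as an assumption, not as the article's design. In the sketch, `slow_event_write` is a hypothetical stand-in for `event_log.log_execution`.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Dict

executor = ThreadPoolExecutor(max_workers=4)

def slow_event_write(session_id: str, event_type: str, details: Dict[str, Any]) -> str:
    """Hypothetical stand-in for a 1-5 s Pinecone write."""
    time.sleep(0.2)
    return f"{session_id}-{event_type}"

def commit_async(session_id: str, event_type: str, details: Dict[str, Any]) -> Future:
    # Returns immediately; call future.result() later or attach a done-callback.
    return executor.submit(slow_event_write, session_id, event_type, details)

start = time.perf_counter()
future = commit_async("s1", "tool_call", {"prompt": "list files"})
elapsed = time.perf_counter() - start
print(f"returned after {elapsed:.3f}s")
print(future.result())  # → s1-tool_call
```

The trade-off is durability. Events still in flight are lost if the process crashes, so this approach pairs well with a local buffer or a retry queue.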

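Data flow diagram

Inference phase
├─ L1 (Redis) → working context → agent inference engine
└─ L2 (Qdrant) → semantic memory → supplementary context

Execution phase
├─ L2 (Qdrant) → index conversation → semantic store
├─ L3 (Pinecone) → record event → event log
└─ L4 (FS) → store tool results → tool memory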

Production deployment strategy

Progressive migration

Phase 1: L1 Redis only
- Hot-state management
- No dependencies beyond Redis
Phase 2: L1 + L2 Qdrant
- Semantic memory
- Vector retrieval
Phase 3: L1 + L2 + L3 Pinecone
- Event-driven logging
- Traceability
Phase 4: All 4 layers
- Complete memory system
- Production-grade availability
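The phased rollout can be expressed as a simple configuration that tells the orchestrator which layers are live. Later phases then only change configuration rather than call sites. The sketch below is hypothetical; `PHASES` and `layer_enabled` are illustrative names.

```python
from typing import Dict, FrozenSet

# Which layers are live in each rollout phase, mirroring the list above.
PHASES: Dict[int, FrozenSet[str]] = {
    1: frozenset({"L1"}),
    2: frozenset({"L1", "L2"}),
    3: frozenset({"L1", "L2", "L3"}),
    4: frozenset({"L1", "L2", "L3", "L4"}),
}

def layer_enabled(phase: int, layer: str) -> bool:
    """True if `layer` is part of the given rollout phase."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return layer in PHASES[phase]

# Inside the orchestrator, each layer call would be guarded, e.g.:
# if layer_enabled(current_phase, "L3"): event_id = self.event_log.log_execution(...)
assert layer_enabled(2, "L2") and not layer_enabled(2, "L3")
```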

Cost analysis

Configuration	Estimated monthly cost	Latency (slowest layer)
L1 only	$0 (Redis free tier)	<10ms
L1 + L2	$50-100	100-500ms
L1 + L2 + L3	$100-200	1-5s
All 4 layers	$150-300	1-5s

Monitoring metrics

Memory access latency: response time of each layer (L1/L2/L3)
Memory hit rate: how often L2/L3 retrievals return useful results
Memory size: storage used by L1/L2/L3
Memory update rate: L2/L3 indexing and write frequency
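The first two metrics can be collected in-process with very little code. The idea is to record per-layer latencies and hits, then report percentiles and hit rate. This stdlib-only sketch makes a few assumptions. `LayerMetrics` is hypothetical. A "hit" is defined as a retrieval that returned at least one result, which is only one possible definition. Percentiles use the nearest-rank method. A real deployment would export these numbers to a metrics system.

```python
import math
from collections import defaultdict
from typing import Dict, List

class LayerMetrics:
    """Per-layer latency samples and hit counts (in-process, for illustration)."""

    def __init__(self) -> None:
        self.latencies_ms: Dict[str, List[float]] = defaultdict(list)
        self.lookups: Dict[str, int] = defaultdict(int)
        self.hits: Dict[str, int] = defaultdict(int)

    def record(self, layer: str, latency_ms: float, hit: bool) -> None:
        self.latencies_ms[layer].append(latency_ms)
        self.lookups[layer] += 1
        self.hits[layer] += int(hit)

    def percentile(self, layer: str, p: float) -> float:
        """Nearest-rank percentile of recorded latencies (p in (0, 100])."""
        samples = sorted(self.latencies_ms.get(layer, []))
        if not samples:
            raise ValueError(f"no samples for {layer}")
        rank = max(1, math.ceil(p / 100 * len(samples)))
        return samples[rank - 1]

    def hit_rate(self, layer: str) -> float:
        lookups = self.lookups.get(layer, 0)
        return self.hits.get(layer, 0) / lookups if lookups else 0.0

m = LayerMetrics()
for ms, hit in [(120, True), (180, True), (240, False), (450, True)]:
    m.record("L2", ms, hit)
print(m.percentile("L2", 50), m.percentile("L2", 95), m.hit_rate("L2"))  # → 180 450 0.75
```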

Conclusion

The 4-layer production memory architecture provides a complete, production-grade memory system design:

L1 Redis: extremely low-latency hot-state management
L2 Qdrant: semantic retrieval and cross-conversation coordination
L3 Pinecone: event-driven logging and traceability
L4 File system: tool memory and persistent storage

This architecture goes beyond the simple RAG pattern. It adds continuous context management, memory coordination across conversations, and event-driven memory updates, and it serves as a baseline for designing memory systems for AI agents in production.
