AI Agent Production Architecture Patterns: Crash-Only Design, Idempotency, and Checkpoint-Based Recovery

May 8, 2026

Problem background

The core challenge for AI agent systems in production is not "how to make it work" but "how to recover reliably when it fails." Traditional error handling (logging, stack traces, manual debugging) becomes infeasible in autonomous agent systems: errors occur at unpredictable times, operators cannot intervene immediately, and the system must be able to heal itself.

Microsoft's April 2026 release of the Agent Governance Toolkit marks a turning point: agents are no longer just chatbots that answer questions, but autonomous entities that book tickets, execute transactions, write code, and manage infrastructure. At the same time, OWASP released its first Top 10 list of risks for agentic applications, including goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, and rogue agents.

Key question: who governs what they do?

Architecture pattern: Crash-Only design

Crash-Only design is a software engineering philosophy whose core principle is that the correct recovery procedure is to kill the process and restart it. This sounds simple, but it is extremely valuable in agent systems.

Pattern features

State persisted to external storage: nothing depends on process memory
All operations pass through the deduplication table: every action is deduplicated, with no exceptions, not even logging calls
Recovery sequence: kill the process → read the checkpoint → replay decisions → continue execution
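
The recovery sequence can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `CheckpointStore` is a hypothetical file-backed stand-in for an external store such as Redis, and `SimulatedCrash` and the `crash_at` parameter exist only to imitate a killed process.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for the process being killed (test hook, not part of the pattern)."""


class CheckpointStore:
    """Hypothetical file-backed stand-in for an external store such as Redis."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"next_step": 0, "done": []}
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def save(self, state):
        # Write to a temp file, then rename, so a crash never leaves a torn checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)


def run_agent(store, steps, crash_at=None):
    """Resume from the last checkpoint and run the remaining steps."""
    state = store.load()
    while state["next_step"] < len(steps):
        i = state["next_step"]
        if crash_at is not None and i == crash_at:
            raise SimulatedCrash(f"killed before step {i}")
        state["done"].append(steps[i])  # the "action" in this toy example
        state["next_step"] = i + 1
        store.save(state)  # checkpoint after every step
    return state["done"]


store = CheckpointStore(os.path.join(tempfile.mkdtemp(), "agent.json"))
steps = ["lookup_order", "issue_refund", "notify_user"]
try:
    run_agent(store, steps, crash_at=2)
except SimulatedCrash:
    pass  # a supervisor would restart the process here
print(run_agent(store, steps))  # resumes at step 2 instead of starting over
```

Note that a crash between performing an action and saving the checkpoint would still repeat that action on restart, which is why the pattern pairs checkpoints with a deduplication table.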

Practical Evidence

According to practices published by BuildMVPFast, the most reliable pattern is “Crash-Only Agents”. The core of this pattern is:

Deduplication table: every action passes through the deduplication table to avoid repeated execution
State lives in checkpoint storage, not in memory
No exception handling: on failure, kill the process and restart it

This design enables the system to reliably recover from any fault condition, typically within 30 seconds.

Idempotence: prevent repeated execution

Problem Scenario

The Redis blog's analysis of AI agent architecture points out a common failure pattern:

“This creates a 3 a.m. disaster. The function executes, partially succeeds, and then the network fails. The retry runs the same function again and repeats the work.”

This situation is particularly common when an agent makes a tool call: the API request partially succeeds, a network failure then triggers a retry, and the result is duplicated work and inconsistent state.

Solution: idempotent operations

The solution is to design idempotent operations so that they can be safely executed repeatedly without cumulative effects:

Idempotent API operations: use a unique ID or token so that retries produce no additional side effects
Celery's built-in retries: let the task queue handle retry logic automatically
State machine design: explicit state transitions, so the same transition is never applied twice
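
One way to picture the unique-ID approach is a deduplication table keyed by an idempotency key. The sketch below uses SQLite from the Python standard library; the table layout, the `run_once` helper, and the key format are assumptions made for illustration, not something the sources above prescribe.

```python
import sqlite3


def make_dedup_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dedup (key TEXT PRIMARY KEY, result TEXT)"
    )


def run_once(conn, key, action):
    """Run action at most once per idempotency key; retries get the stored result."""
    row = conn.execute("SELECT result FROM dedup WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]  # already executed: return the recorded result
    result = action()
    conn.execute("INSERT INTO dedup (key, result) VALUES (?, ?)", (key, result))
    conn.commit()
    return result


conn = sqlite3.connect(":memory:")
make_dedup_table(conn)

calls = []


def refund():
    calls.append("refund")  # side effect we must not repeat
    return "refund-ok"


first = run_once(conn, "refund:order-123", refund)
retry = run_once(conn, "refund:order-123", refund)
print(first, retry, len(calls))  # refund-ok refund-ok 1
```

A local table like this only covers retries that happen after the result was recorded. To cover a crash between the call and the insert, the same key would also need to be honored by the remote API.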

Practical techniques

Checkpoint recovery: decisions are cached, so replaying them costs nothing
Semantic caching: reduces cost further
Hung request handling: after T milliseconds, send the request to a fallback
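
The hung-request item can be illustrated with `asyncio.wait_for`. This is a sketch under one particular reading: the primary call is cancelled at the timeout and the fallback is tried next. The function names `slow_model` and `backup_model` are hypothetical.

```python
import asyncio


async def call_with_fallback(primary, fallback, timeout_s):
    """Await primary(); if it has not finished within timeout_s, cancel it and use fallback()."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return await fallback()


async def slow_model():
    await asyncio.sleep(1.0)  # simulates a hung request
    return "primary"


async def backup_model():
    return "fallback"


print(asyncio.run(call_with_fallback(slow_model, backup_model, timeout_s=0.05)))
```

Racing both calls and taking the first answer is another option. It trades extra cost for lower latency, and it makes idempotency of the underlying operation even more important.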

Checkpointing and recovery: traceable state

Architecture components

Checkpoint storage: Redis or other persistent storage
Decision replay mechanism: when recovering from a checkpoint, replay the decisions that were already made
Deduplication table: Prevent repeated execution

Operation process

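[Checkpoint store]
   ↓
[State machine] → decision → action → state update
   ↓
[Deduplication table] → check for duplicates
   ↓
[Recovery] → read checkpoint → replay decisions → continue execution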
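
The "replay decisions" step in the flow above can be sketched as a decision log: decisions recorded before the crash are returned from the log, and the model is called only once the log is exhausted. `DecisionLog` and `fake_model` are hypothetical names, and the in-memory list stands in for decisions persisted in the checkpoint store.

```python
class DecisionLog:
    """In-memory stand-in for decisions persisted in a checkpoint store (assumption)."""

    def __init__(self, recorded=None):
        self.recorded = list(recorded or [])
        self.cursor = 0

    def decide(self, prompt, model):
        if self.cursor < len(self.recorded):
            # Replay: reuse the decision made before the crash, no model call.
            decision = self.recorded[self.cursor]
        else:
            decision = model(prompt)
            self.recorded.append(decision)
        self.cursor += 1
        return decision


model_calls = []


def fake_model(prompt):
    model_calls.append(prompt)  # count how often the "model" is really invoked
    return f"plan:{prompt}"


before = DecisionLog()
before.decide("step-1", fake_model)
before.decide("step-2", fake_model)

# After a restart, the log is rebuilt from the checkpoint and replayed.
after = DecisionLog(recorded=before.recorded)
replayed = [after.decide("step-1", fake_model), after.decide("step-2", fake_model)]
fresh = after.decide("step-3", fake_model)
print(replayed, fresh, len(model_calls))
```

Only three model calls are made in total: two before the crash and one for the new step. This is the sense in which replay is "free".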

Timing control

Recovery time < 30 seconds: Key to user experience
Duplicate execution rate < 0.1%: avoid repeated execution
State consistency: the state after recovery must match the state before the failure

Trade-off analysis

Advantages

Simplified fault handling: No need for complex error handling logic
Automatic recovery: The system automatically recovers without manual intervention.
Predictable Behavior: Recovery timing is predictable

Disadvantages

Additional storage overhead: Checkpoint storage requires additional resources
Operational complexity: the deduplication table and checkpoint storage add moving parts to the system
Recovery Time: The recovery process takes time and users will perceive the delay

Applicable scenarios

Crash-Only design is suitable for:

Production environments that require high reliability
Scenarios where operators cannot intervene immediately
Autonomous agent systems that require automatic recovery

Not suitable for:

Interactive systems that can tolerate only very short recovery times
Scenarios that require real-time operation
Resource constrained environments

Performance indicators

Recovery Time: < 30 seconds
Duplicate execution rate: < 0.1%
State consistency: 100%
Failure recovery rate: > 99.9%

Practical Example: Customer Support Automation

Operation scenario

AI customer support agents need to handle a large number of user queries, including:

Check order status
Process refund requests
Arrange technical support

Idempotence verification

In the retry mechanism, you must ensure that:

API calls are idempotent
State updates are idempotent
Database operations are idempotent
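
For state updates, one common way to get idempotence is to make re-applying a completed transition a no-op. The refund workflow and its state names below are hypothetical and only illustrate the idea.

```python
# Hypothetical refund workflow; the state names are illustrative only.
ALLOWED_TRANSITIONS = {
    ("requested", "approved"),
    ("approved", "refunded"),
}


def apply_transition(current, target):
    """Return the new state; re-applying a completed transition is a no-op."""
    if current == target:
        return current  # retry of an already-applied update
    if (current, target) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = "approved"
state = apply_transition(state, "refunded")
state = apply_transition(state, "refunded")  # retried message, state unchanged
print(state)  # refunded
```

Skipping a step, such as moving straight from "requested" to "refunded", raises an error rather than silently succeeding, so a duplicated or reordered message cannot push the record into an invalid state.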

ROI Measurement

According to NextPhone, AI customer service returns $3.50 for every $1 invested. Recovery time drops from hours to minutes, which significantly improves user satisfaction.

Team Training: Reproducible Workflow

90-day implementation plan

Days 1-30: Selection and architectural design
- Select checkpoint storage (Redis)
- Design state machine
- Implement deduplication table
Days 31-60: Prototype Development
- Crash-Only design verification
- Idempotence test
- Recovery mechanism implementation
Days 61-90: Production Deployment
- CI/CD integration
- Monitoring and alerting
- Operations runbook

Operational Best Practices

Monitoring metrics: recovery time, duplicate execution rate, state consistency
Alerting rules: recovery time > 30 seconds, duplicate execution rate > 0.1%
Periodic audits: verify checkpoint storage integrity
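
The two alerting thresholds could be evaluated along these lines. This is a sketch only: `RecoveryStats` and `alerts` are hypothetical names, and it assumes the 0.1% rate means the share of actions that were executed more than once.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStats:
    recovery_seconds: float
    actions_total: int
    actions_duplicated: int


def alerts(stats, max_recovery_s=30.0, max_dup_rate=0.001):
    """Return the names of the alert rules that fire for the given stats."""
    fired = []
    if stats.recovery_seconds > max_recovery_s:
        fired.append("recovery_time")
    # Assumption: the rate is duplicated actions divided by all actions.
    if stats.actions_total:
        dup_rate = stats.actions_duplicated / stats.actions_total
    else:
        dup_rate = 0.0
    if dup_rate > max_dup_rate:
        fired.append("duplicate_rate")
    return fired


print(alerts(RecoveryStats(recovery_seconds=12.0, actions_total=10_000, actions_duplicated=5)))
print(alerts(RecoveryStats(recovery_seconds=45.0, actions_total=1_000, actions_duplicated=2)))
```

The first call stays below both thresholds and returns an empty list. The second exceeds both, so both rules fire.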

Comparative analysis

Crash-Only vs traditional error handling

Metric	Crash-Only	Traditional error handling
Recovery time	< 30 seconds	Manual intervention
Automation	High	Low
Error handling complexity	Low	High
Storage Overhead	Extra	None
Operator Dependence	Low	High

Crash-Only vs Complete State Machine

Metric	Crash-Only	Complete state machine
Implementation complexity	Simple	Complex
Operational overhead	Low	High
Recovery reliability	High	High
State tracking	Limited	Complete

Application scenarios

Customer Support Agent: Handles high volume of user inquiries
Trading Agent: Financial transactions require reliable recovery
Code Generation Agent: Avoid duplication when writing code
Infrastructure Management Agent: Managing cloud resources requires reliable recovery
Data Processing Agent: Big data processing requires reliable recovery

Future Trends

As agent systems mature, Crash-Only design will become the standard pattern in production environments:

More checkpoint storage options: Redis, PostgreSQL, and other databases
Intelligent retry strategy: Adaptive retry based on failure reasons
State Migration: Cross-environment state migration
Automated Verification: Automatic verification after recovery

Summary

The Crash-Only design achieves automatic recovery capabilities by simplifying fault handling logic. In AI agent systems, this design provides high reliability and predictable behavior. Through idempotent operations, checkpoint storage, and deduplication tables, the system can reliably recover from any failure state.

Key Takeaway: In a production environment, simplicity is often more reliable than complexity. Crash-Only design provides this simplicity while maintaining high reliability.