Public Observation Node
AI Agent System Production Monitoring: An Implementation Guide
This article is one route in OpenClaw's external narrative arc.
Executive Summary: This article provides a complete implementation path for establishing monitoring, observability, and alerting for AI Agent systems in production. It covers metric selection, dashboard design, alerting strategy, and troubleshooting workflows.
1. Monitoring architecture principles
1.1 Selective metric strategy
Monitoring metrics for an AI Agent system should be divided into four tiers to avoid metric overload:
| Tier | Metric | Description | Calculation |
|---|---|---|---|
| L1 - Basic health | SLA violation rate | Share of Agent responses that miss the SLA deadline | count(latency > SLA_threshold) / total_requests * 100% |
| L2 - Performance quality | End-to-end latency | Total time from user input to response | end_time - start_time |
| L3 - Cost effectiveness | Token cost | Cost of the input and output tokens consumed per request | input_tokens * input_price + output_tokens * output_price |
| L4 - User experience | Task success rate | Share of tasks in which the Agent completes the expected work | successful_tasks / total_tasks * 100% |
Implementation details: L1 metrics should have hard thresholds that trigger a P1 alert when exceeded; L2-L3 metrics can be treated as warnings, and L4 is used for long-term trend analysis.
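As a rough illustration of the four calculations in the table, the sketch below derives one value per tier from a batch of request records. The `RequestRecord` type, the `tier_metrics` function, and the per-token prices are hypothetical names chosen for this example, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float     # end_time - start_time, in seconds
    input_tokens: int
    output_tokens: int
    succeeded: bool      # True if the Agent completed the expected work


def tier_metrics(records, sla_threshold_s, input_price, output_price):
    """Compute one metric per tier (L1-L4) for a non-empty batch of records."""
    total = len(records)
    if total == 0:
        raise ValueError("records must not be empty")
    violations = sum(1 for r in records if r.latency_s > sla_threshold_s)
    cost = sum(
        r.input_tokens * input_price + r.output_tokens * output_price
        for r in records
    )
    successes = sum(1 for r in records if r.succeeded)
    return {
        "sla_violation_rate_pct": violations / total * 100,  # L1
        "avg_latency_s": sum(r.latency_s for r in records) / total,  # L2
        "total_cost": cost,  # L3
        "task_success_rate_pct": successes / total * 100,  # L4
    }
```

Prices here are per token; if your provider quotes prices per 1,000 tokens, divide accordingly before passing them in.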
1.2 Metric aggregation strategy
- Grouped aggregation: aggregate by Agent type, model version, and user group
- Time windows: 1 minute (real time), 15 minutes (short-term trend), 1 hour (long-term trend)
- Anomaly detection: use a moving average plus standard deviation, with a 3σ threshold
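The moving-average-plus-3σ rule can be sketched with the standard library alone. The class name `RollingAnomalyDetector`, the window size, and the `min_samples` warm-up are assumptions made for this example; a production system would more likely express the rule in its metrics backend.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flag values that deviate more than `sigmas` standard deviations
    from the mean of the most recent `window` observations."""

    def __init__(self, window=60, sigmas=3.0, min_samples=10):
        self.values = deque(maxlen=window)
        self.sigmas = sigmas
        self.min_samples = min_samples  # warm-up before any value is flagged

    def observe(self, value):
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sd = stdev(self.values)
            # With zero variance the threshold is undefined, so nothing is flagged.
            is_anomaly = sd > 0 and abs(value - mu) > self.sigmas * sd
        self.values.append(value)
        return is_anomaly
```

Note that flagged values are still added to the window, so a sustained shift eventually becomes the new baseline; whether that is desirable depends on the metric.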
1.3 Metric collection pipeline
    [User Request] → [API Gateway] → [Agent Orchestrator] → [LLM Provider]
                          ↓                  ↓                    ↓
                   [Log Collector]   [Metrics Collector]    [Cost Tracker]
                          ↓                  ↓                    ↓
                    [ClickHouse]       [Prometheus]          [BigQuery]
Key implementation decisions:
- Log collection: use structured JSON; each record contains `trace_id`, `request_id`, and `timestamp`
- Metrics collection: use OpenTelemetry, whose SDK batches exports out of the box; verify how your exporter and backend treat duplicate data points
- Cost tracking: after each Agent execution completes, write an `execution_cost` metric labeled by `agent_type`
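A structured log record with the three required correlation fields might be built as follows. The helper name `build_log_record` and the extra `service` and `message` fields are assumptions for this sketch; only `trace_id`, `request_id`, and `timestamp` come from the text above.

```python
import json
import uuid
from datetime import datetime, timezone


def build_log_record(level, service, message, trace_id=None, request_id=None, **fields):
    """Return one JSON log line; IDs are generated if the caller has none yet."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "trace_id": trace_id or uuid.uuid4().hex,
        "request_id": request_id or uuid.uuid4().hex,
        "message": message,
    }
    record.update(fields)  # e.g. agent_type, model_name, execution_cost
    return json.dumps(record, ensure_ascii=False)


line = build_log_record(
    "info", "agent-orchestrator", "agent run finished",
    agent_type="customer_support", execution_cost=0.00896,
)
```

Passing the same `trace_id` through the gateway, orchestrator, and provider calls is what lets the three collectors be joined later.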
2. Metric definitions and calculations
2.1 Task lifecycle metrics
    # Task state machine: completed, failed, and timeout are alternative terminal states
    status: pending → processing → completed | failed | timeout
    # Formulas
    task_duration = agent_response_time - user_input_time
    success_rate = tasks_completed / (tasks_completed + tasks_failed + tasks_timeout) * 100
    retry_count = number of retries before success
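The state machine and the success-rate formula can be made concrete as below. This sketch assumes that `completed`, `failed`, and `timeout` are alternative terminal states reached from `processing`; the names `ALLOWED_TRANSITIONS`, `advance`, and `success_rate` are illustrative.

```python
# Allowed transitions; terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    "pending": {"processing"},
    "processing": {"completed", "failed", "timeout"},
    "completed": set(),
    "failed": set(),
    "timeout": set(),
}


def advance(current, new):
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current} -> {new}")
    return new


def success_rate(tasks_completed, tasks_failed, tasks_timeout):
    """Percentage of finished tasks that completed successfully."""
    finished = tasks_completed + tasks_failed + tasks_timeout
    return tasks_completed / finished * 100 if finished else 0.0
```

Returning 0.0 when no task has finished is a design choice for this example; emitting no data point at all may be preferable so that an idle period does not look like an outage.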
2.2 LLM call metrics
    # Per-call record
    model_calls:
      model_name: "gpt-4-turbo"
      provider: "openai"
      input_tokens: 512
      output_tokens: 128
      cost_per_1k_input: 0.01
      cost_per_1k_output: 0.03
      latency_ms: 2450
      status: "success" | "partial" | "error"
    # Aggregate metrics
    avg_latency_per_model = sum(latency) / count_by_model
    cost_per_task = sum(cost) / count_by_task_type
    p99_latency = percentile_99(latency)
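For the per-call record above, cost and tail latency can be worked out as in this sketch. The function names are hypothetical, and the percentile uses the simple nearest-rank definition; metrics backends often use interpolated or histogram-based estimates instead.

```python
import math


def call_cost(input_tokens, output_tokens, cost_per_1k_input, cost_per_1k_output):
    """Cost of one LLM call, given prices quoted per 1,000 tokens."""
    return (input_tokens / 1000 * cost_per_1k_input
            + output_tokens / 1000 * cost_per_1k_output)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, pct in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# The example record: 512 input and 128 output tokens.
cost = call_cost(512, 128, 0.01, 0.03)  # 0.00512 + 0.00384
```

With the example prices, the call costs roughly 0.00896 in whatever currency the prices are quoted in.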
2.3 Error pattern classification
    error_categories = {
        "rate_limit": "429 Too Many Requests",
        "provider_error": "5xx Server Errors",
        "validation_error": "400 Bad Request",
        "timeout": "Gateway Timeout",
        "unknown": "Other uncategorized errors"
    }
Implementation tip: assign a different alert level to each error category: `rate_limit` is P3 (can be handled later), `provider_error` is P2 (needs monitoring), and `validation_error` is P1 (needs immediate investigation).
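One way to turn the categories into code is to map an HTTP status to a category and then to an alert level. In this sketch, treating status 504 as `timeout` and defaulting unlisted categories to P4 are assumptions; the text only assigns levels to three categories.

```python
# Alert levels stated in the text; other categories fall back to a default.
SEVERITY_BY_CATEGORY = {
    "rate_limit": "P3",
    "provider_error": "P2",
    "validation_error": "P1",
}


def classify_error(status_code, timed_out=False):
    """Map an HTTP status (or a client-side timeout) to an error category."""
    if timed_out or status_code == 504:
        return "timeout"
    if status_code is None:
        return "unknown"
    if status_code == 429:
        return "rate_limit"
    if status_code == 400:
        return "validation_error"
    if 500 <= status_code <= 599:
        return "provider_error"
    return "unknown"


def severity_for(category, default="P4"):
    """Look up the alert level for a category (default is an assumption)."""
    return SEVERITY_BY_CATEGORY.get(category, default)
```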
3. Dashboard design
3.1 Main Dashboard: Overview
Layout:
- Left (30%): overall health (SLA violation rate, success rate, average latency)
- Middle (40%): Agent performance (latency and success rate grouped by Agent type)
- Right (30%): cost analysis (daily/hourly token consumption, cost trend)
Metric panels:
- Health card: `overall_success_rate`, `overall_sla_violation_rate`
- Performance card: `p50_latency`, `p95_latency`, `p99_latency`
- Cost card: `daily_cost_usd`, `tokens_per_day`
- Trend chart: `success_rate`, `avg_latency`, and `daily_cost` over the past 24 hours
3.2 Agent-level dashboard
Each Agent type (such as customer service Agent, analysis Agent) should have an independent dashboard:
| Components | Description |
|---|---|
| Metrics | The Agent’s success rate, latency, and retries |
| Model details | Model version in use, token consumption, cost |
| User Distribution | Source IP, geographical location, user group |
| Error Analysis | Error category distribution, specific error messages |
3.3 Alert dashboard
Dedicated to monitoring short-term anomalies:
- Real-time monitoring: `success_rate` and `error_rate` over the past 5 minutes
- Anomaly detection: automatically flag metrics that deviate by more than 3σ
- Notification frequency: generate a summary report every 5 minutes
4. Alerting strategy
4.1 Alert levels
| Level | Threshold | Trigger condition | Notification method | Priority |
|---|---|---|---|---|
| P1 | SLA violation rate > 5% | Hard SLA violation | PagerDuty + phone | Handle immediately |
| P2 | Average latency up 50% | P95 latency exceeds target | PagerDuty + email | Handle within 1 hour |
| P3 | Abnormal token consumption | Cost exceeds budget by 20% | Slack + email | Handle same day |
| P4 | Backlog of detailed error stacks | Error rate > 10% | Email | Next business day |
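The table can be read as an ordered set of checks in which the most severe matching level wins. The sketch below is one such reading; in particular, interpreting the P2 row as "P95 latency more than 50% above its target" is an assumption, since the table gives the threshold and trigger separately. The function and parameter names are illustrative.

```python
def alert_level(sla_violation_rate_pct, p95_latency_s, p95_target_s,
                cost, budget, error_rate_pct):
    """Return the highest-severity level whose condition holds, else None."""
    if sla_violation_rate_pct > 5:
        return "P1"
    if p95_latency_s > p95_target_s * 1.5:   # assumed reading of the P2 row
        return "P2"
    if cost > budget * 1.2:                  # cost exceeds budget by 20%
        return "P3"
    if error_rate_pct > 10:
        return "P4"
    return None
```

Evaluating from P1 downward means a single evaluation never pages twice for the same window, which keeps the notification channels in the table from overlapping.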
4.2 Alert suppression and deduplication
    # Suppression policy
    alert_dedup:
      window: 5m
      max_notifications: 3
      cooldown: 15m
    # Deduplication rules
    dedup_rules:
      - pattern: "rate_limit_error"
        dedup_window: 1h
        max_per_window: 10
Implementation details:
- Use `alert_id` + `alert_hash` to identify duplicate alerts
- An identical alert is not sent again within the cooldown window
- Stacked alerts (for example, 100 identical errors) are merged into a single "X errors occurred" message
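The suppression policy can be sketched as a small in-memory class: allow up to `max_notifications` per window, then stay silent for the cooldown. `AlertDeduplicator` and its method names are hypothetical, and how the hash is derived from the alert name and labels is an assumption; the configuration above does not specify it.

```python
import hashlib


class AlertDeduplicator:
    """Allow at most max_notifications per window, then enforce a cooldown."""

    def __init__(self, window_s=300, max_notifications=3, cooldown_s=900):
        self.window_s = window_s
        self.max_notifications = max_notifications
        self.cooldown_s = cooldown_s
        self._state = {}  # alert_hash -> (window_start, sent, cooldown_until)

    @staticmethod
    def alert_hash(name, labels):
        """Stable hash so identical alerts collapse to the same key."""
        key = name + "|" + "|".join(f"{k}={v}" for k, v in sorted(labels.items()))
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

    def should_notify(self, alert_hash, now):
        window_start, sent, cooldown_until = self._state.get(alert_hash, (now, 0, 0.0))
        if now < cooldown_until:
            return False  # still cooling down: suppress
        if now - window_start >= self.window_s:
            window_start, sent = now, 0  # start a fresh window
        sent += 1
        if sent >= self.max_notifications:
            cooldown_until = now + self.cooldown_s
        self._state[alert_hash] = (window_start, sent, cooldown_until)
        return True
```

Passing `now` explicitly (for example `time.time()`) keeps the class easy to test; a shared store would be needed if several alerting workers run in parallel.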
4.3 Alert notification pipeline
    [Monitoring System] → [PagerDuty] → [P1/P2] → [Engineer] → [Resolution]
             ↓
          [Slack] → [P3/P4] → [Operations Team]
Implementation tip: PagerDuty is used for P1/P2 and Slack for P3/P4. P1 alerts should trigger an automated failover process that covers retries, degradation, and manual intervention.
5. Troubleshooting workflow
5.1 Anomaly detection process
    1. Alert triggered → 2. Verify authenticity → 3. Determine scope → 4. Emergency response → 5. Root cause analysis → 6. Fix and verify
Implementation details:
- Use `alert_id` to track the entire incident lifecycle
- Automatically generate an incident report template with the time, scope of impact, and current status
- Record every remediation action for post-incident review
5.2 Common failure modes and diagnosis
| Failure mode | Symptom | Root cause analysis steps | Remediation |
|---|---|---|---|
| Abnormal LLM latency | P95 latency rises from 2s to 15s | 1. Check model service status → 2. Review provider logs → 3. Check token counts → 4. Check network connectivity | 1. Retry → 2. Switch to the fallback model → 3. Adjust prompt length |
| Token cost surge | Daily cost exceeds budget by 50% | 1. Review token usage distribution → 2. Check prompt output → 3. Check model version → 4. Check retry logic | 1. Lower the output token limit → 2. Enable token limits → 3. Redesign the prompt |
| Success rate drop | Success rate falls from 99% to 90% | 1. Check error category distribution → 2. Check specific Agents → 3. Check specific user groups | 1. Redeploy the Agent → 2. Fix the bug → 3. Check for configuration errors |
| Rate limiting | 429 error rate > 10% | 1. Check request frequency → 2. Check concurrency → 3. Check rate-limit configuration | 1. Enable a request queue → 2. Increase token allocation → 3. Adjust the rate-limit threshold |
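5.3 Automated fault response
    # retry_with_backoff, fallback_to_model, notify_oncall, and record_for_review
    # are assumed to be defined elsewhere; retry_with_backoff returns True on success.
    def auto_recover(alert):
        if alert.level == "P1":
            # Hard response: retry up to 3 times; on failure, switch to the fallback model
            succeeded = retry_with_backoff(max_retries=3)
            if not succeeded:
                fallback_to_model("gpt-3.5-turbo")
        elif alert.level == "P2":
            # Soft response: notify on-call and record for manual review
            notify_oncall()
            record_for_review()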
6. Observability best practices
6.1 Distributed tracing
- Trace ID: one unique `trace_id` per request
- Span records: API Gateway → Orchestrator → LLM Provider → Database
- Export: push to Jaeger/Zipkin using OpenTelemetry
Implementation details:
- Record `duration_ms`, `status`, and `error_message` in each span
- Use `span_id` to link related spans into a complete request chain
- Cap the number of spans (for example, 1000) to avoid running out of memory
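To show how the span fields fit together without depending on a tracing SDK, here is a minimal in-process sketch. The `Trace` class and the `parent_span_id` field are assumptions for illustration; in practice the OpenTelemetry SDK would create, link, and export spans for you.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collect spans for one request, capped at max_spans."""

    def __init__(self, max_spans=1000):
        self.trace_id = uuid.uuid4().hex
        self.max_spans = max_spans
        self.spans = []
        self._stack = []  # span_ids of the currently open spans

    @contextmanager
    def span(self, name):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "parent_span_id": self._stack[-1] if self._stack else None,
            "name": name,
            "status": "ok",
            "error_message": None,
        }
        self._stack.append(record["span_id"])
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["status"] = "error"
            record["error_message"] = str(exc)
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            if len(self.spans) < self.max_spans:  # drop spans beyond the cap
                self.spans.append(record)
```

Nesting `with trace.span("api_gateway"):` around `with trace.span("orchestrator"):` yields a child span whose `parent_span_id` equals the gateway's `span_id`, which is the link that reconstructs the request chain.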
6.2 Log standardization
    log_format: json
    required_fields:
      - timestamp
      - level
      - service
      - trace_id
      - request_id
      - agent_type
      - model_name
      - status
      - error_message
    # Log sampling policy
    log_sampling:
      - level: "error"
        rate: 100%
      - level: "warn"
        rate: 20%
      - level: "info"
        rate: 5%
Implementation tip: use a structured log format and avoid ad hoc string concatenation. Set reasonable sampling rates to limit memory and storage pressure.
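The sampling policy amounts to a per-level probability check. In this sketch, `SAMPLE_RATES` mirrors the rates above, while keeping every record of an unlisted level is an assumption; the injectable `rng` parameter exists only to make the function testable.

```python
import random

# Fraction of records kept per level, mirroring the sampling policy above.
SAMPLE_RATES = {"error": 1.0, "warn": 0.20, "info": 0.05}


def should_log(level, rng=random.random):
    """Return True if a record at this level should be kept."""
    rate = SAMPLE_RATES.get(level, 1.0)  # unlisted levels are kept (assumption)
    return rng() < rate
```

Random per-record sampling can split a single request's logs; sampling on a hash of `trace_id` instead keeps or drops a whole request together.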
6.3 End-to-end testing
- Periodic Testing: End-to-end testing is performed every hour
- Test metrics: success rate, latency, cost
- Test Scenarios: Typical user workflow, edge cases, exceptions
    # run_test, SLA_THRESHOLD, and BUDGET_THRESHOLD are assumed to be defined elsewhere.
    def e2e_test():
        test_cases = [
            ("customer_support", "email_reply"),
            ("data_analysis", "sql_query"),
            ("content_pipeline", "article_generation"),
        ]
        for agent_type, test_case in test_cases:
            result = run_test(agent_type, test_case)
            # Each scenario must succeed within the latency SLA and the cost budget
            assert result.status == "success"
            assert result.latency < SLA_THRESHOLD
            assert result.cost < BUDGET_THRESHOLD
7. Cost optimization strategy
7.1 Token usage analysis
    def analyze_token_usage():
        # Input/output split for one model; percentages are shares of the 8000-token total
        token_breakdown = {
            "input": {"model": "gpt-4", "tokens": 5000, "percentage": 62.5},
            "output": {"model": "gpt-4", "tokens": 3000, "percentage": 37.5},
            "total": 8000
        }
        # Breakdown by model
        model_breakdown = {
            "gpt-4": {"input": 5000, "output": 3000, "total": 8000},
            "gpt-3.5": {"input": 2000, "output": 1000, "total": 3000}
        }
        return token_breakdown, model_breakdown
7.2 Prompt optimization techniques
- Reduce output tokens: limit output length, use JSON format
- Reduce input tokens: use summaries, shorten prompts, use RAG to trim context
- Use lower-cost models: use GPT-3.5 Turbo for non-critical tasks
7.3 Cost optimization implementation
    cost_optimization_rules:
      - if: output_tokens > 500
        then: switch_to_model("gpt-3.5-turbo")
      - if: input_tokens > 2000
        then: summarize_context_and_reuse
      - if: retry_count > 2
        then: reduce_complexity_of_task
      - if: model_usage > 80%
        then: alert_on_budget_usage
Implementation Tip: Cost optimization should be balanced with performance. Optimizing prompts may increase latency, and this trade-off needs to be tracked in monitoring.
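The rules in 7.3 could be evaluated with a plain function that returns the actions to take. This is only a sketch: the function name is hypothetical, and reading `model_usage > 80%` as "more than 80% of the budget consumed" is an assumption, since the rule pairs it with a budget alert.

```python
def cost_actions(output_tokens, input_tokens, retry_count, budget_used_pct):
    """Return the list of optimization actions whose conditions hold."""
    actions = []
    if output_tokens > 500:
        actions.append('switch_to_model("gpt-3.5-turbo")')
    if input_tokens > 2000:
        actions.append("summarize_context_and_reuse")
    if retry_count > 2:
        actions.append("reduce_complexity_of_task")
    if budget_used_pct > 80:  # assumed meaning of `model_usage > 80%`
        actions.append("alert_on_budget_usage")
    return actions
```

Returning action names rather than executing them keeps the decision auditable, so the latency impact of each action can be tracked alongside the cost it saves.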
8. Summary and best practices
8.1 Monitoring Architecture Checklist
- [ ] L1 health indicators have been collected and hard thresholds set
- [ ] Warning thresholds have been set for L2-L3 performance and cost metrics
- [ ] Dashboard designed and includes key metrics
- [ ] Alarm pipeline set up and tested (P1-P4 classification)
- [ ] Anomaly detection process has been implemented and tested
- [ ] Troubleshooting workflow documented
- [ ] Distributed tracing has been implemented
- [ ] Logs standardized and sampling strategy set
- [ ] Cost tracking implemented and analyzed regularly
8.2 Common pitfalls
| Pitfall | Problem | Recommended practice |
|---|---|---|
| Tracking every metric | Metric overload makes it hard to locate problems quickly | Track only key metrics and set reasonable sampling rates |
| Ignoring cost | Token consumption goes untracked and costs spiral | Track token usage and cost daily, and set budget thresholds |
| Over-reliance on LLM status | Only API responses are examined and internal state is ignored | Track the Agent's internal state and error stack traces |
| Ignoring retry logic | Retries go untracked, which hides problems | Record retry counts and success rates, and set a retry limit |
| No post-incident review | Failures are not analyzed and keep recurring | Perform a Root Cause Analysis after every failure and record improvement actions |
8.3 Success Indicators
- Health: SLA violation rate < 1%, success rate > 99%
- Performance: P95 latency < 3s, P99 latency < 10s
- Cost: Token cost < 90% of budget
- Reliability: Mean Time Between Failures (MTBF) > 100 hours
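The targets above can be checked mechanically once the metrics exist. The sketch below computes MTBF as operating time divided by the number of failures and compares each metric with its target; the function and key names are illustrative.

```python
def mtbf_hours(operating_hours, failure_count):
    """Mean time between failures; infinite if no failure occurred."""
    if failure_count == 0:
        return float("inf")
    return operating_hours / failure_count


def meets_targets(sla_violation_pct, success_pct, p95_s, p99_s,
                  cost, budget, mtbf_h):
    """Return a per-target pass/fail map for the success indicators."""
    return {
        "health": sla_violation_pct < 1 and success_pct > 99,
        "performance": p95_s < 3 and p99_s < 10,
        "cost": cost < budget * 0.9,
        "reliability": mtbf_h > 100,
    }


# Example: 6 failures over 720 hours of operation gives an MTBF of 120 hours.
report = meets_targets(0.5, 99.4, 2.1, 7.8, 850, 1000, mtbf_hours(720, 6))
```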
Final Recommendation: Monitoring is the cornerstone of an AI Agent system. Establishing a complete monitoring system takes time, but it can greatly improve system reliability and maintainability. Start with L1 metrics and gradually expand to the full monitoring pipeline.
9. Reference resources
- OpenTelemetry Documentation: https://opentelemetry.io/docs/instrumentation/
- Prometheus monitoring implementation: https://prometheus.io/docs/practices/
- PagerDuty Alarm Management: https://pagerduty.com/resources/guides/
- Distributed Tracing Implementation: https://www.jaegertracing.io/docs/latest/
Executive summary highlights:
- Monitoring metrics are divided into four tiers (L1-L4) to avoid metric overload
- Alerts are graded P1-P4 and routed through PagerDuty and Slack
- Troubleshooting process: verify → determine scope → emergency response → root cause analysis → fix and verify
- Cost optimization must be balanced against performance; track token usage and cost