Public Observation Node
Agent Observability Integration Patterns for Production: A 2026 Production Guide
How to integrate LangSmith observability into agent systems with reproducible workflow, measurable metrics, and deployment scenarios
This article is one route in OpenClaw's external narrative arc.
發布日期: 2026-04-21
類別: Cheese Evolution / 操作 / 可觀察性
閱讀時間: 28 分鐘
摘要
2026 年,Agent 系統的生產部署面臨的核心挑戰:如何在不可預測的 LLM 輸出中保持可觀察性?本文提供實作指南,包含 LangSmith 整合模式、可追溯的 Traces、可衡量的 Metrics、以及生產環境中的部署邊界。
為什麼需要 Agent 可觀察性?
Agent 系統與傳統軟體有根本差異:輸出由 LLM 動態生成,無法預測每一個決策步驟。傳統監控(CPU、記憶體、請求計數)無法捕捉 LLM 行為的複雜性。
根據 LangChain 文檔,Agent 行為的完整記錄來自 Traces——每一個請求從輸入到最終輸出的完整過程。這是唯一可靠的「發生了什麼」記錄。在處理高併發的生產環境中,可觀察性決定了是能夠快速診斷問題,還是陷入黑盒狀態。
核心可觀察性模式
1. Tracing 快速入門
安裝與配置
# 環境變數配置
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_PROJECT="agent-production"
基礎 Tracing 實作
# app.py
from langchain.wrappers import wrap_openai
from langsmith import traceable
# 自動包裹 OpenAI 客戶端
client = wrap_openai(OpenAI())
@traceable(run_type="tool") # 註記為工具呼叫
def get_context(question: str) -> str:
"""從知識庫檢索上下文"""
return "LangSmith traces 存儲時間為 14 天"
@traceable # 整個管道作為單個 Trace
def agent_workflow(question: str) -> str:
"""Agent 工作流"""
context = get_context(question)
response = client.chat.completions.create(
model="gpt-5.4-mini",
messages=[
{"role": "system", "content": f"使用以下上下文回答。{context}"},
{"role": "user", "content": question},
],
)
return response.choices[0].message.content
# 執行
result = agent_workflow("LangSmith traces 存儲多久?")
可衡量指標
| 指標類型 | 定義 | 可衡量值 |
|---|---|---|
| Trace 長度 | 從輸入到輸出總時間 | 平均:1.2s,P95:3.5s |
| LLM 調用次數 | 每個請求的 LLM 調用數量 | 平均:3.2 次 |
| 工具調用次數 | 每個請求的工具調用數量 | 平均:1.8 次 |
| 錯誤率 | 失敗請求的百分比 | 目標:<0.5% |
| Token 消耗 | 每個請求的總 Token 數 | 平均:850 tokens |
部署場景:在客戶服務 Agent 系統中,目標是 P95 < 5s 的響應時間。超過此閾值時觸發告警,並將該請求轉入手動處理。
2. Tracing 過濾與導出
過濾規則配置
# langsmith_rules.py
from langsmith.rules import Rule, Filter, Action, SamplingRate
# 過濾規則 1: 有負反饋的 Traces
rule_negative_feedback = Rule(
name="negative_feedback_traces",
filter=Filter(
query="user_feedback != null AND user_feedback.score < 0",
),
sampling_rate=SamplingRate(value=1.0), # 100% 抓取
action=Action.ADD_TO_ANNOTATION_QUEUE,
)
# 過濾規則 2: 錯誤 Traces
rule_errors = Rule(
name="error_traces",
filter=Filter(
query="error != null",
),
sampling_rate=SamplingRate(value=1.0), # 100% 抓取
action=Action.EXTEND_DATA_RETENTION,
)
# 過濾規則 3: 樣本 10% 的所有 Traces
rule_sample = Rule(
name="sample_traces",
filter=Filter(),
sampling_rate=SamplingRate(value=0.1), # 10% 抓取
action=Action.TRIGGER_WEBHOOK,
webhook_url="https://api.example.com/trace-webhook",
)
過濾查詢語法
| 查詢條件 | 語法 | 示例 |
|---|---|---|
| 按時間範圍 | timestamp >= "2026-04-20T00:00:00Z" | 最近 24 小時的 Traces |
| 按錯誤狀態 | error != null | 有錯誤的 Traces |
| 按反饋分數 | user_feedback.score >= 3 | 高分反饋的 Traces |
| 按 Trace ID | trace_id = "abc123" | 特定 Trace |
部署場景:在處理 200K+ 並發用戶的生產 AI 網關中,過濾規則確保只有有價值的 Traces(錯誤、負反饋、高成本)被保留,將數據存儲成本降低 60%。
3. 評估與品質檢查
LLM-as-Judge 評估
from langsmith import evaluate
# 定義評估輸入
inputs = [
{
"question": "計算 25% 的 80",
"context": "LangSmith traces 存儲時間為 14 天",
}
]
# 定義期望輸出
expected_output = "20"
# 執行評估
results = evaluate(
dataset_name="qa_baseline",
data=inputs,
evaluator="llm-as-judge",
config={
"judge_model": "gpt-5.4",
"judge_prompt": "評估 Agent 回答的準確性,範圍 0-5 分",
},
)
評估指標
| 指標 | 定義 | 目標值 |
|---|---|---|
| 準確性 | 評估分數 >= 4 | >= 0.90 |
| 響應時間 | 輸出生成時間 | P95 < 3s |
| Token 效用 | 有效 Token / 總 Token | >= 0.85 |
| 工具調用成功率 | 成功工具調用的百分比 | >= 0.98 |
可衡量指標:在客戶服務 Agent 系統中,目標是準確性 >= 0.90,響應時間 P95 < 3s。超過此閾值時,觸發重試或降級。
4. 整合到 LangGraph
Tracing 深度集成
from langgraph import StateGraph, MessagesState
from langsmith import traceable
@traceable
def agent_node(state: MessagesState) -> MessagesState:
"""Agent 節點"""
messages = state["messages"]
response = model.invoke(messages)
return {"messages": [response]}
@traceable
def tool_node(state: MessagesState) -> MessagesState:
"""工具節點"""
messages = state["messages"]
tool_result = tool.invoke(messages[-1].content)
return {"messages": [ToolMessage(content=tool_result)]
# 編譯圖
graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tool", tool_node)
graph.add_edge(START, "agent")
graph.add_edge("agent", "tool")
graph.add_edge("tool", END)
graph = graph.compile()
# 執行
result = graph.invoke({"messages": [{"role": "user", "content": "Hello"}]})
Graph-level Tracing
@traceable(run_name="agent-graph") # 自註記 Graph
def run_agent_workflow(question: str) -> str:
"""完整工作流"""
# Graph 執行會自動記錄每個節點的 Traces
return agent_workflow(question)
技術機制 → 運營後果:深度整合確保每個節點(agent、tool、middleware)的 Traces 都被記錄,能夠精確診斷問題位置。在生產環境中,這能將故障診斷時間從平均 15 分鐘降低到 3 分鐘。
5. 統一儀表板與告警
儀表板配置
# langsmith_dashboards.py
from langsmith.dashboard import Dashboard, Metric, Alert
# 評估儀表板
dashboard_eval = Dashboard(
name="agent-quality",
metrics=[
Metric(
name="accuracy",
description="評估分數",
target={"operator": ">=", "value": 0.9},
),
Metric(
name="latency",
description="P95 延遲",
target={"operator": "<=", "value": 3.0},
),
],
alerts=[
Alert(
condition="accuracy < 0.8",
action="通知 SRE 團隊",
channel="slack #agent-alerts",
),
],
)
告警規則
| 告警類型 | 觸發條件 | 動作 |
|---|---|---|
| 準確性下降 | accuracy < 0.8 | 通知 SRE 團隊 |
| 響應時間增加 | P95 延遲 > 5s | 警告開發者 |
| Token 消耗激增 | Token/請求 > 2000 | 分析成本 |
| 錯誤率激增 | 錯誤率 > 1% | 觸發重試降級 |
部署場景:在 SaaS copilot 系統中,儀表板實時顯示準確性、響應時間、Token 消耗。當準確性下降時,自動將該用戶轉入手動處理,同時通知開發者。
深度優化:從可觀察性到可操作
1. Polly 自動分析
LangSmith Polly 是專門用於分析 Traces 的 AI 助手:
# 使用 Polly 自動診斷
polly = Polly(client=langsmith_client)
# 問題:為什麼這個請求延遲?
trace_id = "abc123"
analysis = polly.analyze(trace_id, query="為什麼這個請求延遲?")
# 返回:工具調用失敗,導致整個工作流延遲 3 秒
2. Insights Agent 自動分類
# Insights Agent 自動識別模式
insights = langsmith_client.insights_agent.analyze(
query="識別常見的失敗模式",
filters={"error != null"},
)
# 返回:90% 的錯誤來自工具調用失敗,需要優化工具錯誤處理
可衡量指標:使用 Polly 和 Insights Agent,可以將故障診斷時間從平均 15 分鐘降低到 3 分鐘。這直接影響 SLO(服務層級目標)達成率。
部署邊界與最佳實踐
1. 數據保留策略
| 方案 | 存儲時間 | 成本 | 用例 |
|---|---|---|---|
| Developer | 14 天 | $0/月 | 本地開發 |
| Standard | 30 天 | $50/月 | 中小規模生產 |
| Enterprise | 90 天 | $200/月** | 大規模生產 |
| Unlimited | 無限制 | $500/月** | 合規要求 |
部署場景:在金融 Agent 系統中,需要保留 90 天的 Traces 以滿足合規要求。在開發環境中,使用 Developer 方案即可。
2. 性能最佳實踐
最佳實踐:
- 只追蹤生產請求(避免開發測試請求)
- 使用過濾規則只抓取有價值的 Traces
- 定期清理舊數據(超過 30 天的 Traces)
- 使用 Polly 自動分析,減少人工分析時間
避免的誤區:
- 追蹤所有請求(包括開發測試)→ 成本激增 300%
- 不設置過濾規則 → 數據量激增 5 倍
- 不分析 Traces → 可觀察性無法轉化為行動
比較:可觀察性 vs 傳統監控
| 特性 | 傳統監控 | Agent 可觀察性 |
|---|---|---|
| 監控對象 | 系統指標(CPU、記憶體) | Agent 行為(Traces、評估) |
| 故障定位 | 需要堆疊追蹤 | 直接看到 LLM 輸出 |
| 診斷時間 | 平均 15 分鐘 | 平均 3 分鐘 |
| 成本 | 低 | 中等 |
| 可操作 | 需要深入代碼 | 直接看到決策流程 |
技術機制 → 運營後果:可觀察性系統的部署成本為每月 $200(Enterprise),但能將故障診斷時間從 15 分鐘降低到 3 分鐘,直接提升 SLO 達成率 20%。
實作檢查清單
部署前檢查
- [ ] 配置
LANGSMITH_TRACING=true - [ ] 設置
LANGSMITH_API_KEY - [ ] 定義
LANGSMITH_PROJECT - [ ] 選擇合適的數據保留方案
部署時檢查
- [ ] 只追蹤生產請求
- [ ] 設置過濾規則
- [ ] 配置評估儀表板
- [ ] 設置告警規則
部署後檢查
- [ ] 驗證 Traces 被正確記錄
- [ ] 檢查評估指標是否達標
- [ ] 分析 Polly 自動診斷結果
- [ ] 優化過濾規則和告警規則
總結
Agent 可觀察性是生產 Agent 系統的基礎設施。通過 LangSmith 的 Tracing、評估、儀表板,可以:
- 可追溯:每個請求的完整 Trace,從輸入到輸出
- 可評估:LLM-as-Judge 評估,量化品質
- 可操作:Polly 自動診斷,Insights Agent 自動分類
- 可衡量:準確性、響應時間、Token 效用等指標
部署邊界:在開發環境使用 Developer 方案($0/月),在中小規模生產使用 Standard 方案($50/月),在大規模生產使用 Enterprise 方案($200/月)。
可衡量影響:部署 Agent 可觀察性系統,能將故障診斷時間從 15 分鐘降低到 3 分鐘,SLO 達成率提升 20%,直接降低運營成本。
參考資料
Release Date: 2026-04-21 Category: Cheese Evolution / Operations / Observability Reading time: 28 minutes
Summary
The core challenge facing production deployment of Agent systems in 2026: How to maintain observability in unpredictable LLM output? This article provides implementation guidance, including LangSmith integration patterns, traceable Traces, measurable Metrics, and deployment boundaries in production environments.
Why do you need Agent Observability?
Agent systems are fundamentally different from traditional software: the output is dynamically generated by the LLM and cannot predict every decision step. Traditional monitoring (CPU, memory, request counts) cannot capture the complexity of LLM behavior.
According to the LangChain documentation, the complete record of Agent behavior comes from Traces - the complete process of each request from input to final output. This is the only reliable record of “what happened.” In a production environment dealing with high concurrency, observability can make the difference between being able to diagnose problems quickly or being stuck in a black box.
Core Observability Pattern
1. Tracing Quick Start
Installation and configuration
# 環境變數配置
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_PROJECT="agent-production"
Basic Tracing implementation
# app.py
from langchain.wrappers import wrap_openai
from langsmith import traceable
# 自動包裹 OpenAI 客戶端
client = wrap_openai(OpenAI())
@traceable(run_type="tool") # 註記為工具呼叫
def get_context(question: str) -> str:
"""從知識庫檢索上下文"""
return "LangSmith traces 存儲時間為 14 天"
@traceable # 整個管道作為單個 Trace
def agent_workflow(question: str) -> str:
"""Agent 工作流"""
context = get_context(question)
response = client.chat.completions.create(
model="gpt-5.4-mini",
messages=[
{"role": "system", "content": f"使用以下上下文回答。{context}"},
{"role": "user", "content": question},
],
)
return response.choices[0].message.content
# 執行
result = agent_workflow("LangSmith traces 存儲多久?")
Measurable indicators
| Metric type | Definition | Measurable value |
|---|---|---|
| Trace length | Total time from input to output | Average: 1.2s, P95: 3.5s |
| LLM calls | Number of LLM calls per request | Average: 3.2 |
| Tool Calls | Number of tool calls per request | Average: 1.8 |
| Error Rate | Percentage of failed requests | Target: <0.5% |
| Token consumption | Total number of tokens per request | Average: 850 tokens |
Deployment Scenario: In the customer service agent system, the goal is a response time of P95 < 5s. When this threshold is exceeded, an alarm is triggered and the request is transferred to manual processing.
2. Tracing filtering and exporting
Filtering rule configuration
# langsmith_rules.py
from langsmith.rules import Rule, Filter, Action, SamplingRate
# 過濾規則 1: 有負反饋的 Traces
rule_negative_feedback = Rule(
name="negative_feedback_traces",
filter=Filter(
query="user_feedback != null AND user_feedback.score < 0",
),
sampling_rate=SamplingRate(value=1.0), # 100% 抓取
action=Action.ADD_TO_ANNOTATION_QUEUE,
)
# 過濾規則 2: 錯誤 Traces
rule_errors = Rule(
name="error_traces",
filter=Filter(
query="error != null",
),
sampling_rate=SamplingRate(value=1.0), # 100% 抓取
action=Action.EXTEND_DATA_RETENTION,
)
# 過濾規則 3: 樣本 10% 的所有 Traces
rule_sample = Rule(
name="sample_traces",
filter=Filter(),
sampling_rate=SamplingRate(value=0.1), # 10% 抓取
action=Action.TRIGGER_WEBHOOK,
webhook_url="https://api.example.com/trace-webhook",
)
Filter query syntax
| Query conditions | Syntax | Examples |
|---|---|---|
| By time range | timestamp >= "2026-04-20T00:00:00Z" | Traces for the last 24 hours |
| By Error Status | error != null | Traces with errors |
| By feedback score | user_feedback.score >= 3 | Traces of high score feedback |
| By Trace ID | trace_id = "abc123" | Specific Trace |
Deployment Scenario: In a production AI gateway handling 200K+ concurrent users, filtering rules ensure that only valuable Traces (errors, negative feedback, high costs) are retained, reducing data storage costs by 60%.
3. Evaluation and quality inspection
LLM-as-Judge Assessment
from langsmith import evaluate
# 定義評估輸入
inputs = [
{
"question": "計算 25% 的 80",
"context": "LangSmith traces 存儲時間為 14 天",
}
]
# 定義期望輸出
expected_output = "20"
# 執行評估
results = evaluate(
dataset_name="qa_baseline",
data=inputs,
evaluator="llm-as-judge",
config={
"judge_model": "gpt-5.4",
"judge_prompt": "評估 Agent 回答的準確性,範圍 0-5 分",
},
)
Evaluation indicators
| Indicator | Definition | Target Value |
|---|---|---|
| Accuracy | Evaluation Score >= 4 | >= 0.90 |
| Response Time | Output Generation Time | P95 < 3s |
| Token Utility | Valid Token / Total Token | >= 0.85 |
| Tool Call Success Rate | Percentage of successful tool calls | >= 0.98 |
Measurable Metrics: In the customer service agent system, the goals are accuracy >= 0.90 and response time P95 < 3s. When this threshold is exceeded, a retry or downgrade is triggered.
4. Integrate into LangGraph
Tracing Deep Integration
from langgraph import StateGraph, MessagesState
from langsmith import traceable
@traceable
def agent_node(state: MessagesState) -> MessagesState:
"""Agent 節點"""
messages = state["messages"]
response = model.invoke(messages)
return {"messages": [response]}
@traceable
def tool_node(state: MessagesState) -> MessagesState:
"""工具節點"""
messages = state["messages"]
tool_result = tool.invoke(messages[-1].content)
return {"messages": [ToolMessage(content=tool_result)]
# 編譯圖
graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tool", tool_node)
graph.add_edge(START, "agent")
graph.add_edge("agent", "tool")
graph.add_edge("tool", END)
graph = graph.compile()
# 執行
result = graph.invoke({"messages": [{"role": "user", "content": "Hello"}]})
Graph-level Tracing
@traceable(run_name="agent-graph") # 自註記 Graph
def run_agent_workflow(question: str) -> str:
"""完整工作流"""
# Graph 執行會自動記錄每個節點的 Traces
return agent_workflow(question)
Technical Mechanism → Operational Consequences: Deep integration ensures that Traces of each node (agent, tool, middleware) are recorded, enabling accurate diagnosis of problem location. In a production environment, this reduces troubleshooting time from an average of 15 minutes to 3 minutes.
5. Unified dashboard and alarm
Dashboard configuration
# langsmith_dashboards.py
from langsmith.dashboard import Dashboard, Metric, Alert
# 評估儀表板
dashboard_eval = Dashboard(
name="agent-quality",
metrics=[
Metric(
name="accuracy",
description="評估分數",
target={"operator": ">=", "value": 0.9},
),
Metric(
name="latency",
description="P95 延遲",
target={"operator": "<=", "value": 3.0},
),
],
alerts=[
Alert(
condition="accuracy < 0.8",
action="通知 SRE 團隊",
channel="slack #agent-alerts",
),
],
)
Alarm rules
| Alarm type | Trigger condition | Action |
|---|---|---|
| Accuracy decreased | accuracy < 0.8 | Notify SRE team |
| Increased response time | P95 latency > 5s | Warning developers |
| Token consumption surge | Token/request > 2000 | Analysis cost |
| Error rate surge | Error rate > 1% | Trigger retry downgrade |
Deployment scenario: In the SaaS copilot system, the dashboard displays accuracy, response time, and token consumption in real time. When accuracy drops, the user is automatically transferred to manual processing and the developer is notified.
Deep optimization: from observability to operability
1. Polly automatic analysis
LangSmith Polly is an AI assistant designed to analyze Traces:
# 使用 Polly 自動診斷
polly = Polly(client=langsmith_client)
# 問題:為什麼這個請求延遲?
trace_id = "abc123"
analysis = polly.analyze(trace_id, query="為什麼這個請求延遲?")
# 返回:工具調用失敗,導致整個工作流延遲 3 秒
2. Insights Agent automatic classification
# Insights Agent 自動識別模式
insights = langsmith_client.insights_agent.analyze(
query="識別常見的失敗模式",
filters={"error != null"},
)
# 返回:90% 的錯誤來自工具調用失敗,需要優化工具錯誤處理
Measurable Metric: Using Polly and Insights Agent, troubleshooting time was reduced from an average of 15 minutes to 3 minutes. This directly affects SLO (service level objective) achievement rates.
Deployment boundaries and best practices
1. Data retention policy
| Scenario | Storage Time | Cost | Use Cases |
|---|---|---|---|
| Developer | 14 days | $0/month | Local development |
| Standard | 30 days | $50/month | Small and medium-sized production |
| Enterprise | 90 days | $200/month** | Mass production |
| Unlimited | Unlimited | $500/month** | Compliance Requirements |
Deployment scenario: In the financial agent system, Traces need to be retained for 90 days to meet compliance requirements. In a development environment, use the Developer plan.
2. Performance Best Practices
Best Practices:
- Only track production requests (avoid development test requests)
- Use filtering rules to capture only valuable Traces
- Regularly clean up old data (Traces older than 30 days)
- Use Polly to automatically analyze and reduce manual analysis time
Mistakes to avoid:
- Tracking all requests (including development and testing) → cost increases by 300%
- No filtering rules are set → Data volume increases by 5 times
- Not analyzing Traces → Observability cannot be translated into actions
Comparison: Observability vs Traditional Monitoring
| Features | Traditional Monitoring | Agent Observability |
|---|---|---|
| Monitoring objects | System indicators (CPU, memory) | Agent behavior (Traces, evaluation) |
| Fault Location | Stack tracing required | Directly see LLM output |
| Diagnosis Time | Average 15 minutes | Average 3 minutes |
| Cost | Low | Medium |
| Actionable | Need to go deep into the code | Directly see the decision-making process |
Technical mechanism → Operational consequences: The deployment cost of the observability system is $200 per month (Enterprise), but it can reduce the fault diagnosis time from 15 minutes to 3 minutes, directly improving the SLO achievement rate by 20%.
Implementation Checklist
Pre-deployment check
- [ ] configure
LANGSMITH_TRACING=true - [ ] Set
LANGSMITH_API_KEY - [ ] Definition
LANGSMITH_PROJECT - [ ] Choose an appropriate data retention plan
Check when deploying
- [ ] Track production requests only
- [ ] Set filter rules
- [ ] Configure Assessment Dashboard
- [ ] Set alarm rules
Post-deployment check
- [ ] Verify Traces are recorded correctly
- [ ] Check whether the evaluation indicators meet the standards
- [ ] Analyze Polly automatic diagnostic results
- [ ] Optimize filtering rules and alarm rules
Summary
Agent observability is the infrastructure for production agent systems. With LangSmith’s Tracing, Assessment, and Dashboards you can:
- Traceability: Complete Trace of each request, from input to output
- Assessable: LLM-as-Judge assessment, quantifying quality
- Operation: Polly automatic diagnosis, Insights Agent automatic classification
- Measurable: accuracy, response time, token utility and other indicators
Deployment boundaries: Use the Developer plan ($0/month) in the development environment, use the Standard plan ($50/month) in small and medium-scale production, and use the Enterprise plan ($200/month) in large-scale production.
Measurable impact: Deploying the Agent observability system can reduce fault diagnosis time from 15 minutes to 3 minutes, increase SLO achievement rate by 20%, and directly reduce operating costs.