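Observability and Testability for AI Agent Product Systems: A Production-Grade Build Guide for 2026

Starting from the OpenAI Agents SDK, LangSmith, OpenTelemetry, and the Galileo evaluation framework, build observable, testable AI agent systems, covering implementation patterns, metrics, and deployment boundaries

May 7, 2026 · 7 min read · Beginner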

This article is one route in OpenClaw's external narrative arc.

Introduction: Why “it works on my machine” is not enough

In 2026, if you deploy an AI agent to production, “it works on my machine” is no longer enough. You need to be able to answer:

Which tool calls failed?
Where do the latency spikes occur?
Which guardrail or routing decision sent the run off course?
How do agent traces connect to the broader observability stack?

This guide starts from the OpenAI Agents SDK, LangSmith, OpenTelemetry, and the Galileo evaluation framework and shows how to build an observable, testable AI agent system in production, including concrete implementation patterns, metrics, and deployment boundaries.

1. Core observability pattern: default tracing in the OpenAI Agents SDK

1.1 The default traced workflow

The OpenAI Agents SDK enables tracing by default, which is critical because agent failures are rarely a single API error. They usually involve a sequence:

User input
Retrieval
Model generation
Tool call
Guardrail
Handoff
Retry
Final output

The SDK records this workflow as traces and spans. According to the SDK documentation, the default instrumentation covers:

The overall Runner.run() / run_sync() / run_streamed() workflow
Agent spans
Generation spans
Function/tool call spans
Guardrail spans
Handoff spans
Audio transcription and speech spans (where relevant)

This is the correct baseline because agent debugging requires step-level causality, not just the final output log.

1.2 Long-running background jobs

For long-running workers and background jobs, the SDK documentation recommends calling flush_traces() at the end of the unit of work.

from agents import Runner, flush_traces, trace

def run_job(agent, prompt: str):
    try:
        with trace("background_job"):
            result = Runner.run_sync(agent, prompt)
            return result.final_output
    finally:
        flush_traces()

Without an explicit flush, traces may not be exported until a few seconds later, in the background. That is acceptable for many applications, but not for every operational workflow.

1.3 Important production details

For long-running workers and background jobs, the SDK documentation recommends explicitly calling flush_traces() at the end of each unit of work. This is essential when agents run inside Celery workers, background tasks, queue consumers, and cron-style jobs.

2. Where LangSmith fits: framework-agnostic debugging and evaluation

OpenAI’s built-in tracing is helpful, but most teams also need:

Searchable traces across runs
Evaluation workflows
Dashboards and alerts
User feedback logs
Framework-agnostic observability

That is where LangSmith fits.

LangSmith’s documentation now explicitly covers:

OpenTelemetry-based tracing
OpenAI Agents SDK tracing
Tracing for both LangChain and non-LangChain applications

This means you can get agent-native debugging without rewriting your entire observability stack around a single framework.

A practical division of labor is:

OpenAI Agents SDK = emits detailed agent workflow traces
LangSmith = developer-friendly debugging, evaluation, alerting, and run inspection
OpenTelemetry = the standard transport layer for exporting to the broader observability stack

The value of this separation is that it avoids lock-in while still giving you agent-native debugging.

3. OpenTelemetry: standardized export

In production, agents need to export trace data to the broader observability stack. OpenTelemetry provides a standardized export protocol (OTLP), so agent telemetry can sit alongside systems such as Prometheus, Grafana, and ELK.

Key points:

Use the OTel trace export format
Set meaningful span attributes (tool_name, error_code, latency_ms)
Embed business context (user_id, order_id, department) in spans
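The attribute names in the list above could be normalized in one place before they reach a span. The following is a minimal sketch rather than anything provided by the SDK or OpenTelemetry: the helper `build_span_attributes` and the `agent.` / `biz.` key prefixes are assumptions made for illustration. It drops `None` values because OpenTelemetry attribute values must be strings, booleans, numbers, or sequences of those.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, bool, int, float]


def build_span_attributes(
    tool_name: str,
    latency_ms: float,
    error_code: Optional[str] = None,
    user_id: Optional[str] = None,
    order_id: Optional[str] = None,
    department: Optional[str] = None,
) -> Dict[str, AttrValue]:
    """Return a flat attribute dict holding technical and business context."""
    candidates = {
        # Hypothetical key names; pick a convention and keep it stable.
        "agent.tool_name": tool_name,
        "agent.error_code": error_code,
        "agent.latency_ms": float(latency_ms),
        "biz.user_id": user_id,
        "biz.order_id": order_id,
        "biz.department": department,
    }
    # Keep only attributes that actually have a value.
    return {key: value for key, value in candidates.items() if value is not None}


attrs = build_span_attributes("search", 182.44, user_id="u-42", department="support")
print(attrs)
```

With the OpenTelemetry Python API, a dict like this can be passed to a span through `span.set_attributes(attrs)`.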

4. Evaluation framework: Galileo’s three-tier rubric

In production, you need an evaluation system that can check quality continuously, both before and after deployment. Galileo provides a production-grade evaluation framework:

4.1 Trajectory metrics vs. outcome metrics

Trajectory metrics evaluate the complete execution path: every reasoning step, tool call, and decision. Outcome metrics measure final task completion: Did the agent resolve the dispute? Was the response accurate? Were latency requirements met?

Trajectory metrics tell you why the agent worked; outcome metrics tell you whether it worked. Production needs both perspectives.

The production-ready trajectory metrics defined by Google Cloud Vertex AI include trajectory_exact_match, trajectory_precision, and trajectory_recall. These are paired with outcome metrics (task success rate, response quality).
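To make the three trajectory metrics concrete, here is a simplified sketch. It is an assumption-laden illustration, not Vertex AI's implementation: a trajectory is reduced to a list of tool names, arguments are ignored, and the tool names in the example are made up.

```python
from typing import Sequence


def trajectory_exact_match(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """1.0 if the same tool calls occur in the same order, otherwise 0.0."""
    return 1.0 if list(predicted) == list(reference) else 0.0


def trajectory_precision(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of predicted tool calls that also appear in the reference."""
    if not predicted:
        return 0.0
    reference_set = set(reference)
    hits = sum(1 for step in predicted if step in reference_set)
    return hits / len(predicted)


def trajectory_recall(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of reference tool calls that the agent actually made."""
    if not reference:
        return 0.0
    predicted_set = set(predicted)
    hits = sum(1 for step in reference if step in predicted_set)
    return hits / len(reference)


# Hypothetical trajectories for a refund task.
reference = ["lookup_order", "check_policy", "issue_refund"]
predicted = ["lookup_order", "search_faq", "issue_refund", "send_email"]

print(trajectory_exact_match(predicted, reference))
print(trajectory_precision(predicted, reference))
print(trajectory_recall(predicted, reference))
```

In this example two of the four predicted calls are in the reference (precision 0.5) and two of the three reference calls were made (recall about 0.67), while the exact match is 0 because the sequences differ.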

4.2 Three-tier rubrics

Multi-step agent tasks require an evaluation framework that matches their complexity. Simple pass/fail rubrics cannot evaluate more complex agents.

Galileo recommends a three-tier rubric:

7 dimensions → 25 sub-dimensions → 130 items

This lets you cover complex tasks without losing detail.

4.3 LLM-as-judge with human validation

For specialized domains, you need domain-specific evaluation that combines automated judges with human validation. Galileo recommends using LLM-as-judge, aiming for a Spearman correlation of 0.80 or higher with human judgment.
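One way to check a judge against the 0.80 target is to score the same samples with both the LLM judge and human reviewers and compute the Spearman rank correlation. The sketch below uses only the standard library; the score lists are invented for illustration, and ties receive the average of their ranks.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical 1-5 quality scores for the same six responses.
human_scores = [5, 4, 4, 2, 1, 3]
judge_scores = [5, 5, 4, 2, 1, 2]

rho = spearman(judge_scores, human_scores)
print(f"rho={rho:.3f}, meets 0.80 target: {rho >= 0.80}")
```

A judge that falls below the target on a held-out, human-labeled sample is a signal to revise the judge prompt or rubric before trusting its scores in monitoring.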

5. Testability: unit tests and production monitoring

5.1 Unit tests: tools and offline test suites

You need to write unit tests for your tools and build an offline test suite for typical agent conversations.

# Example unit test for a tool.
# SearchTool is a placeholder for your own tool class; supply a test API key.
def test_search_tool():
    tool = SearchTool(api_key="...")
    result = tool.query("latest research")
    assert result.status == "success"
    assert len(result.snippets) > 0

The offline test suite should cover typical user scenarios: simple queries, complex queries, multi-turn conversations, and edge cases.
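An offline suite along these lines can be sketched as a list of scripted scenarios replayed against the agent. Everything here is illustrative: `Scenario`, `run_suite`, and `fake_agent` are hypothetical names, and `fake_agent` stands in for a call into your real agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    name: str
    turns: List[str]                    # user messages, in order
    check: Callable[[List[str]], bool]  # receives the agent's replies


def fake_agent(history: List[str]) -> str:
    """Hypothetical stand-in agent: replace with a call into your real agent."""
    last = history[-1].strip()
    if not last:
        return "Could you rephrase that?"
    return f"Answer to: {last}"


def run_suite(agent: Callable[[List[str]], str], scenarios: List[Scenario]) -> List[str]:
    """Replay every scenario offline and return the names of failing ones."""
    failures = []
    for scenario in scenarios:
        history: List[str] = []
        replies: List[str] = []
        for turn in scenario.turns:
            history.append(turn)
            reply = agent(history)
            history.append(reply)
            replies.append(reply)
        if not scenario.check(replies):
            failures.append(scenario.name)
    return failures


SCENARIOS = [
    Scenario("simple query", ["refund status"], lambda r: "refund status" in r[0]),
    Scenario("multi-turn", ["hi", "order 123"], lambda r: len(r) == 2 and "order 123" in r[1]),
    Scenario("edge case: empty input", ["   "], lambda r: "rephrase" in r[0]),
]

failures = run_suite(fake_agent, SCENARIOS)
print("failures:", failures)
```

Keeping the scenarios as data makes it cheap to add a regression case each time a production incident reveals a new failure.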

5.2 Production monitoring: two time horizons

Your framework needs two time horizons:

Pre-deployment validation: answers “Should this agent version ship?”
Continuous production monitoring: tracks performance drift

Pre-deployment validation includes a comprehensive test suite covering edge cases, stress scenarios, and adversarial inputs.

Continuous production monitoring tracks performance drift. Common failure modes in production AI systems include unreachable services, behavioral drift, and integration failures.

Modern evaluation platforms can run multiple metrics at the same time while keeping costs down, which makes production-grade monitoring practical.

6. Metrics: a production-ready metric set

In production, you need a set of metrics to evaluate your agent system:

6.1 Trajectory metrics

trajectory_exact_match: the full path matches the reference path
trajectory_precision: the proportion of executed steps that are correct
trajectory_recall: the proportion of correct steps that were covered

6.2 Outcome metrics

Task success rate
Response quality (LLM-as-judge)
Latency target attainment rate

6.3 Operational metrics

Tool call failure rate
Guardrail trigger rate
Handoff rate
Retry rate
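The four operational metrics can be derived from per-run records. The sketch below is one possible set of definitions, not a standard: `RunRecord` and its fields are hypothetical, the failure rate is computed per tool call, and the other three rates are computed per run.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    tool_calls: int
    tool_failures: int
    guardrail_triggered: bool
    handed_off: bool
    retries: int


def operational_metrics(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate per-run records into the four operational rates."""
    total_runs = len(runs)
    if total_runs == 0:
        return {}
    total_calls = sum(r.tool_calls for r in runs)
    total_failures = sum(r.tool_failures for r in runs)
    return {
        # Failures per tool call (0.0 when no tool was called at all).
        "tool_call_failure_rate": total_failures / total_calls if total_calls else 0.0,
        # The remaining rates are the share of runs where the event happened.
        "guardrail_trigger_rate": sum(r.guardrail_triggered for r in runs) / total_runs,
        "handoff_rate": sum(r.handed_off for r in runs) / total_runs,
        "retry_rate": sum(1 for r in runs if r.retries > 0) / total_runs,
    }


runs = [
    RunRecord(tool_calls=2, tool_failures=0, guardrail_triggered=False, handed_off=False, retries=0),
    RunRecord(tool_calls=3, tool_failures=1, guardrail_triggered=True, handed_off=False, retries=2),
    RunRecord(tool_calls=1, tool_failures=0, guardrail_triggered=False, handed_off=True, retries=0),
    RunRecord(tool_calls=4, tool_failures=0, guardrail_triggered=False, handed_off=False, retries=1),
]

metrics = operational_metrics(runs)
print(metrics)
```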

6.4 Business metrics

Cost per task
Latency per task
User satisfaction
Real-world deployment success rate

7. Deployment boundaries: progressive rollout

7.1 Shadow mode and partial automation

Do not jump straight from manual to fully automated. Recommended steps:

Shadow mode: record agent output without sending it to users
Partial automation: deploy the agent in non-critical flows first
Real measurement: collect real metrics while in shadow mode
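Shadow mode can be sketched as a wrapper that always serves the existing flow and only records what the agent would have said. The names `handle_request`, `shadow_log`, and the two lambda flows are hypothetical. For brevity the shadow call runs inline here; in a real service it would more likely run off the request path so that it adds no user-facing latency.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("shadow_mode")
shadow_log: List[Dict[str, object]] = []


def handle_request(
    request: str,
    existing_flow: Callable[[str], str],
    agent: Callable[[str], str],
) -> str:
    """Serve the existing flow; run the agent in shadow and only record it."""
    served = existing_flow(request)
    try:
        candidate = agent(request)
    except Exception as exc:  # a shadow failure must never affect the user
        logger.warning("shadow agent failed: %s", exc)
        candidate = None
    shadow_log.append(
        {
            "request": request,
            "served": served,
            "shadow": candidate,
            "agrees": candidate == served,
        }
    )
    return served  # the agent's output is never sent to the user


reply = handle_request(
    "reset password",
    existing_flow=lambda r: f"manual: {r}",
    agent=lambda r: f"agent: {r}",
)
print(reply, shadow_log[-1]["agrees"])
```

The agreement rate across `shadow_log` is one of the real metrics that can be collected before any traffic is handed to the agent.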

7.2 Progressive rollout

After validating the metrics in shadow mode, expand gradually to partial automation, and only then to full automation.

7.3 Safe fallback paths

If the agent fails, you need a safe path:

Route to a human operator
Fall back to the manual process
Emergency stop mechanism
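The three fallback options can be combined in a single guard around the agent call. This is a minimal sketch under stated assumptions: `KillSwitch`, `answer_with_fallback`, and the in-memory `human_queue` are hypothetical stand-ins for a feature flag and a real ticketing or routing system.

```python
from typing import Callable, List, Tuple


class KillSwitch:
    """Emergency stop: when engaged, no request reaches the agent."""

    def __init__(self) -> None:
        self.engaged = False


def answer_with_fallback(
    request: str,
    agent: Callable[[str], str],
    human_queue: List[str],
    kill_switch: KillSwitch,
) -> Tuple[str, str]:
    """Return (route, answer); route is 'agent' or 'human'."""
    if kill_switch.engaged:
        human_queue.append(request)
        return ("human", "A team member will follow up shortly.")
    try:
        return ("agent", agent(request))
    except Exception:
        # Any agent failure falls back to the manual process.
        human_queue.append(request)
        return ("human", "A team member will follow up shortly.")


def flaky_agent(request: str) -> str:
    if "refund" in request:
        raise TimeoutError("tool call timed out")
    return f"Handled: {request}"


queue: List[str] = []
switch = KillSwitch()

print(answer_with_fallback("track order", flaky_agent, queue, switch))
print(answer_with_fallback("refund please", flaky_agent, queue, switch))
switch.engaged = True
print(answer_with_fallback("track order", flaky_agent, queue, switch))
```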

8. Implementation pattern: an observable agent structure

8.1 Agent structure

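from contextlib import contextmanager


class GuardrailViolation(Exception):
    """Raised when the output fails a guardrail check."""


class Tracer:
    """Minimal stand-in tracer; swap in your real tracing client."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        self.spans.append((name, attributes))  # record the span when it opens
        yield


class ObservableAgent:
    def __init__(self, model, tools, guardrails, retriever):
        self.model = model
        self.tools = tools
        self.guardrails = guardrails
        self.retriever = retriever  # callable: input -> retrieved context
        self.trace = Tracer()

    def run(self, input):
        with self.trace.span("agent.run", input=input):
            # Retrieval
            with self.trace.span("retrieval"):
                context = self.retriever(input)

            # Generation
            with self.trace.span("generation", context=context):
                output = self.model.generate(context)

            # Tool calls: one span per tool, keeping each result
            tool_results = []
            for tool in self.tools:
                with self.trace.span("tool_call", tool=tool):
                    tool_results.append(tool.execute(output))

            # Guardrail
            with self.trace.span("guardrail"):
                if not self.guardrails.check(output):
                    raise GuardrailViolation(output)

            return output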

8.2 Event export

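Export spans using OpenTelemetry:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure the SDK provider once at startup.
# ConsoleSpanExporter is a stand-in; use your real exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("agent.run")
def run_agent(input, context):
    # Agent logic goes here
    pass

# Flush explicitly at an appropriate point, e.g. at the end of a unit of work.
# A Tracer has no flush(); force_flush() lives on the SDK TracerProvider.
provider.force_flush()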

9. Cost and latency optimization

9.1 Token counting and caching

Use the SDK’s token counting to monitor token usage at each step, and implement prompt caching to reduce cost.

9.2 Batching and concurrency

For non-real-time scenarios, use batching and concurrency to raise throughput and cut total processing time.
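For offline workloads, bounded concurrency is often the simplest win. The sketch below uses `asyncio` with a semaphore so that at most `max_concurrency` calls are in flight at once; `run_agent` is a hypothetical stand-in for an asynchronous agent call, and the limit of 2 is arbitrary.

```python
import asyncio
from typing import List


async def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for an async agent call."""
    await asyncio.sleep(0.01)  # simulates model and tool latency
    return prompt.upper()


async def run_batch(prompts: List[str], max_concurrency: int = 4) -> List[str]:
    """Process all prompts with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt: str) -> str:
        async with semaphore:
            return await run_agent(prompt)

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(bounded(p) for p in prompts)))


results = asyncio.run(run_batch(["a", "b", "c", "d", "e"], max_concurrency=2))
print(results)
```

The concurrency limit should stay below whatever rate limits apply to the model and tool APIs being called.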

9.3 Model selection

Choose an appropriate model based on task complexity:

Simple queries: small model
Complex reasoning: large model
Tool calls: mid-sized model

10. Tradeoffs and counterarguments

10.1 Tracing overhead

Enabling tracing adds overhead. You need to weigh:

The value of tracing for debugging
Impact on latency
Impact on costs

A common recommendation is to enable full tracing in development and staging, and to enable it selectively in production.
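One way to enable tracing selectively is deterministic sampling keyed on the trace or request ID, so the same run always gets the same decision across services. This is a sketch under assumptions: `should_trace`, the `force` flag, and the 10% rate are illustrative and not part of any SDK.

```python
import hashlib


def should_trace(trace_id: str, sample_rate: float, force: bool = False) -> bool:
    """Deterministic sampling: the same trace_id always gets the same answer."""
    if force or sample_rate >= 1.0:
        return True  # e.g. a debug flag on the request, or full tracing in staging
    if sample_rate <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


sampled = sum(should_trace(f"run-{i}", 0.10) for i in range(10_000))
print(f"sampled {sampled} of 10000 runs")
```

The OpenTelemetry SDK ships a ratio-based sampler that serves the same purpose, so a hand-rolled function like this is mainly useful where sampling must be decided outside the tracing library.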

10.2 LLM-as-judge misjudgments

An LLM-as-judge can get verdicts wrong. You need to:

Verify the judge’s accuracy regularly
Combine it with human validation
Set a misjudgment threshold

10.3 Monitoring complexity

An observability system is itself a complex system. You need to:

Choose the right combination of tools
Set reasonable alerting rules
Review the monitoring setup regularly

11. Example metric targets

11.1 Production thresholds

Trajectory exact match ≥ 0.80
Task success rate ≥ 0.90
Tool call failure rate ≤ 1%
Latency per task ≤ 2 seconds
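Thresholds like these can be turned into an automated release gate. The sketch below encodes the four example thresholds from the list above; the metric key names and the `check_gates` helper are hypothetical, and the measured values are made up.

```python
from typing import Dict, List, Tuple

# (metric name, comparison, threshold), mirroring the example thresholds above.
GATES: List[Tuple[str, str, float]] = [
    ("trajectory_exact_match", ">=", 0.80),
    ("task_success_rate", ">=", 0.90),
    ("tool_call_failure_rate", "<=", 0.01),
    ("latency_seconds", "<=", 2.0),
]


def check_gates(measured: Dict[str, float]) -> List[str]:
    """Return a list of violations; an empty list means the release may ship."""
    violations = []
    for name, op, threshold in GATES:
        value = measured.get(name)
        if value is None:
            violations.append(f"{name}: missing")
            continue
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            violations.append(f"{name}: {value} violates {op} {threshold}")
    return violations


candidate = {
    "trajectory_exact_match": 0.84,
    "task_success_rate": 0.88,
    "tool_call_failure_rate": 0.004,
    "latency_seconds": 1.7,
}

violations = check_gates(candidate)
print(violations or "all gates passed")
```

Treating a missing metric as a violation keeps a broken evaluation pipeline from silently approving a release.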

11.2 Business value

Cost per task reduced by ≥ 20%
User satisfaction improved by ≥ 15%
Operating cost reduced by ≥ 30%

12. Summary

In 2026, building an observable, testable AI agent system requires:

Using the OpenAI Agents SDK’s default tracing as the baseline
Extending it with LangSmith for framework-agnostic debugging and evaluation
Exporting to the standard observability stack with OpenTelemetry
Building production-ready evaluation with the Galileo evaluation framework
Two time horizons: pre-deployment testing and production monitoring
Two metric perspectives: trajectory and outcome
Progressive rollout and safe fallback paths
A complete system of implementation patterns and metrics

The key point: in production, an agent is not a special “observability island” but part of the observability system, deeply integrated with the rest of your DevOps and observability tooling.

References

OpenAI Agents SDK documentation: https://developers.openai.com/api/docs/guides/agents
OpenAI Agents SDK tracing documentation: https://developers.openai.com/api/docs/guides/agents/integrations-observability
DEV Community: AI Agent Observability in 2026
Galileo: How to Build an Agent Evaluation Framework for Production AI
Google Cloud Vertex AI: Production-Ready Trajectory Metrics
GitHub: OpenAI Agents Python SDK