AI Agent Observability: Building Production Tracing Workflows 2026

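From traditional APM to agent-specific tracing: a production deployment guide to replayable event stream design, session reconstruction, and measurable metrics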

May 7, 2026 · 12 min read · Intermediate

This article is one route in OpenClaw's external narrative arc.

Date: May 6, 2026 | Category: Cheese Evolution | Lane: Core Intelligence Systems (Engineering & Teaching) | Reading time: 25 minutes | Source: LangChain, Novatechflow, Atlan

Core signal

In 2026, AI agent systems need to move from “visibility” to “actionable evidence”.

Traditional monitoring tools (CloudWatch, Datadog, ELK) assume stateless services, millisecond-scale execution, and a linear request-response flow. Agent systems violate all three assumptions: they carry memory, branch into subtasks, wait on asynchronous tools, and make decisions at the session level. When an agent enters a recursive loop in production without crashing, traditional logs and metrics cannot explain why it happened.

The real challenge is not collecting more logs, but reconstructing the context, state changes, and decision boundaries the agent saw during execution.

This article is an implementation guide that runs from instrumentation to measurable metrics, covering:

The migration path from print statements to structured tracing
A trade-off analysis of baked-in SDKs vs OpenTelemetry
How to choose between session-level and step-level tracing
Design patterns for replayable event streams
Sampling strategies and retention policies based on real production data

Why traditional observability tools fail in agent systems

A typical production failure

LangChain’s State of Agent Engineering report shows that 89% of organizations have implemented some form of agent observability, and 62% have detailed step-level tracing. That does not mean success: it means “the infrastructure is in place,” not “the problem is solved.”

The most painful failure is not a crash but a “silent failure”:

An invoicing agent went into a recursive validation loop over the weekend and burned hundreds of dollars in API credits because it kept insisting that a validation error existed. The team had logs, metrics, and distributed tracing, but no tool could tell them at which step the agent saw what, what the tool returned, or why the retry logic kept reinforcing the wrong conclusion.

It took two days of manual session reconstruction to understand what had happened. That cost is unacceptable in production.

Key differences from traditional microservices

Dimension	Traditional microservices	Agent systems
Execution time	Milliseconds to seconds	Minutes to hours
State	Stateless (context cleared)	Accumulated memory and context
Flow	Linear request-response	Branching, recursion, asynchronous
Output	Disconnected logs and traces	Replayable session event streams

Key insight: at its core, agent observability is not a logging problem but a session reconstruction problem.

When is agent observability needed? From prototype to production scale

Prototype stage: print statements are enough

Single-step execution, visual debugging, manual re-runs
Volume is manageable and output is visible directly in the console

Threshold: single step, a handful of requests, debugging by the original developer

Pre-production stage: structured tracing becomes necessary

Edge cases: ambiguous queries, retrieval failures, tool timeouts
Iteration speed: every prompt change requires regression testing
Cross-environment consistency: agent behavior differs between test environments

Threshold: multi-tool sequences, dozens of requests per day, debugging by people other than the original developer
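A minimal sketch of what moving from print statements to structured tracing can look like, using only the Python standard library. The `traced_step` helper, the `TRACE_LOG` sink, and every field name are illustrative assumptions, not the API of LangSmith or any other tracing SDK.

```python
import json
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; a real setup would ship records to a tracing backend.
TRACE_LOG = []


@contextmanager
def traced_step(run_id, step_type, name, **attributes):
    """Record one agent step as a structured dict instead of a print line."""
    record = {
        "run_id": run_id,
        "step_id": uuid.uuid4().hex,
        "step_type": step_type,  # e.g. "llm_call", "tool_call", "retrieval"
        "name": name,
        "attributes": attributes,
        "status": "ok",
    }
    start = time.perf_counter()
    try:
        yield record  # the caller can attach outputs to the record
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        # Duration and the record itself are captured on success and on failure.
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        TRACE_LOG.append(record)


# Usage: wrap a stubbed tool call.
with traced_step("run_001", "tool_call", "lookup_invoice", invoice_id="inv_42") as step:
    step["output"] = {"status": "paid"}

print(json.dumps(TRACE_LOG[-1], indent=2))
```

Unlike a print line, each record carries a run identifier, a step type, and a duration, so someone who did not write the agent can filter and compare steps.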

Production stage: observability is non-negotiable

Scenario 1: a user reports a “wrong answer”

The issue cannot be reproduced locally; you need the complete execution context: conversation history, retrieval results, model reasoning
Tracing eliminates guesswork

Scenario 2: cost control

Track token usage and latency at each step to identify expensive patterns
Detect violations before SLA commitments are breached

Scenario 3: human review capacity saturates at scale

Above 1,000 runs per day, humans cannot manually review every trace
Automated pattern detection becomes necessary to identify systemic problems (which retrieval queries consistently return low-quality results, which tool sequences correlate with failures)

Threshold: multi-turn conversations, session-level metrics, automated pattern detection

When is full observability not needed?

Cases where it is not warranted:

Single-step chains: one LLM call, no tool use, a direct input-output relationship
Prototype iteration: notebook testing, a small number of examples, directly visible output
Predictable logic: traditional software whose input-output relationship can be traced

Threshold: single step, a handful of requests, no hidden reasoning

Instrumentation strategy: what to trace

Base level: must be traced

According to LangChain’s State of Agent Engineering report, these are the foundation:

Category	Details
LLM calls	Model and inputs, output completion, input/output tokens per trace, tool call latency
Tool calls	Which tool was selected, the arguments passed, the result returned, and the duration of each call
Retrieval steps	Queries sent to the vector store or knowledge base, documents returned, relevance signals (if applicable)
Reasoning transitions	How the agent decides to move from one step to the next, including intermediate chain-of-thought output
State changes	What memory was read, what memory was written, and how state affected subsequent decisions

Key: capture rich metadata and tags at every step (user segment, prompt version, deployment environment) so traces can be sliced for analysis.
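To illustrate why per-step tags matter, here is a small sketch that slices step records by one tag. `StepRecord`, `error_rate_by_tag`, and the tag keys are hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    step_type: str   # "llm_call", "tool_call", "retrieval", ...
    latency_ms: float
    error: bool = False
    # Hypothetical tag keys: user_segment, prompt_version, environment.
    tags: dict = field(default_factory=dict)


def error_rate_by_tag(records, tag_key):
    """Slice step records by one tag and return the error rate per tag value."""
    totals, errors = defaultdict(int), defaultdict(int)
    for rec in records:
        value = rec.tags.get(tag_key, "untagged")
        totals[value] += 1
        errors[value] += int(rec.error)
    return {value: errors[value] / totals[value] for value in totals}


records = [
    StepRecord("tool_call", 120.0, False, {"prompt_version": "v1"}),
    StepRecord("tool_call", 340.0, True, {"prompt_version": "v2"}),
    StepRecord("llm_call", 900.0, False, {"prompt_version": "v2"}),
]
rates = error_rate_by_tag(records, "prompt_version")
print(rates)
```

Without the `prompt_version` tag, the same three records would only yield one global error rate, which hides the fact that the errors cluster in one prompt version.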

Deep agents and multi-agent systems

Deep agents: may execute hundreds of intermediate steps before producing a final answer. Without visibility into each step, the point of failure cannot be located.

Multi-agent applications:

Failures can cascade across agent boundaries
Handoffs between agents need to be traced
The execution graph spanning the whole system needs to be captured

LangSmith example: a complete execution tree, including every LLM call, tool call, and retrieval step, plus the reasoning that connects them.
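An execution tree of this kind can be rebuilt from flat span records as long as each span stores its parent. The sketch below assumes hypothetical `span_id`, `parent_id`, `agent`, and `name` fields; it does not reproduce LangSmith's data model.

```python
from collections import defaultdict


def build_execution_tree(spans):
    """Group flat spans by parent_id; the None key holds the root spans."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent_id")].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return indented lines, one per span, in depth-first order."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + f'{span["agent"]}:{span["name"]}')
        lines.extend(render(children, span["span_id"], depth + 1))
    return lines


spans = [
    {"span_id": "s1", "parent_id": None, "agent": "planner", "name": "plan"},
    {"span_id": "s2", "parent_id": "s1", "agent": "planner", "name": "handoff"},
    {"span_id": "s3", "parent_id": "s2", "agent": "validator", "name": "tool_call"},
]
tree_lines = render(build_execution_tree(spans))
print("\n".join(tree_lines))
```

Because the handoff span is the parent of the validator's tool call, a failure in the sub-agent can be followed back across the agent boundary to the planner decision that caused it.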

Tracing approaches: baked-in SDK vs OpenTelemetry

Trade-offs between the two approaches

Dimension	Baked-in SDK (framework native)	OpenTelemetry (OTel)
Setup speed	Faster, often just one environment variable	Requires collector configuration
Trace depth	Deeper integration with the agent framework	Depends on the maturity of instrumentation libraries
Portability	Tied to a specific platform	Vendor neutral, can route to multiple backends
Existing infrastructure	Separate from the current APM	Unified with APM and distributed tracing

Selection logic

Prefer a baked-in SDK when:

The team is starting from scratch
You need to get running quickly (LANGSMITH_TRACING=true)
You want deep integration with a specific framework (LangChain, LangGraph)

Prefer OpenTelemetry when:

You already run a collector and want agent traces to flow through the same pipeline
You need vendor neutrality (the option to switch backends later)
You want to unify with an existing observability stack

Key limitation: the JavaScript instrumentation ecosystem is still evolving. Some auto-instrumentation libraries and framework integrations are still experimental and may behave inconsistently.

Session tracing vs step tracing: why threads matter more than single traces

Limitations of a single trace

LangChain points out that single-trace analysis cannot capture conversation-level patterns. Take a customer success agent:

Turn 1: correctly identifies the problem
Turn 2: retrieves the correct policy document
Turn 3: fails to apply the policy correctly to the customer’s specific situation

Viewed individually, each trace is a “success,” but the conversation as a whole fails. The failure mode is visible only when you see the complete conversation trajectory.

The shift to session-level metrics

Metric type	Single trace	Session level
Question	Did this request succeed?	Did the conversation achieve the user’s goal?
Measures	Tool call latency, error rate	Resolution rate, escalation frequency, goal completion rate
Unit	Trace	Conversation

LangSmith approach: use session_id to group related traces, and have automated scoring evaluate the whole conversation rather than individual turns.
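The customer success example can be sketched in a few lines: every turn passes its own check, yet the session-level verdict is a failure. The trace fields and the `judge` callable are assumptions for illustration; in practice the judge would be a human reviewer or an automated evaluator.

```python
from collections import defaultdict


def group_by_session(traces):
    """Group traces by session_id, ordered by turn within each session."""
    sessions = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    for turns in sessions.values():
        turns.sort(key=lambda t: t["turn"])
    return dict(sessions)


def evaluate_session(turns, judge):
    """Session-level verdict: every turn can succeed while the goal is still missed."""
    return {
        "all_turns_ok": all(t["ok"] for t in turns),
        "goal_met": judge(turns),
    }


traces = [
    {"session_id": "sess_a", "turn": 2, "ok": True, "note": "retrieved policy"},
    {"session_id": "sess_a", "turn": 1, "ok": True, "note": "identified issue"},
    {"session_id": "sess_a", "turn": 3, "ok": True, "note": "misapplied policy"},
]


def judge(turns):
    # Stand-in judge that only inspects the final turn's note.
    return "misapplied" not in turns[-1]["note"]


verdict = evaluate_session(group_by_session(traces)["sess_a"], judge)
print(verdict)
```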

Measurable metrics: closing the loop from observation to action

Metric design principles

1. Localize failures

See exactly which step in a multi-step workflow caused the failure
Examples: retrieval returns irrelevant documents, the model hallucinates tool arguments, a reasoning loop fails to converge

2. Systematic improvement

Capture traces that represent production behavior and convert them into regression tests
Build test datasets from real usage data

3. Cost and latency attribution

Identify, for example, a specific subtask that consumes 80% of input/output tokens or adds 3 seconds of tool call latency
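Token attribution reduces to a grouped sum once each step records which subtask it belongs to. The sketch below uses made-up step records and a hypothetical `subtask` field, with numbers chosen so that one subtask accounts for 80% of tokens.

```python
from collections import defaultdict


def token_share_by_subtask(steps):
    """Return each subtask's share of total tokens (input + output)."""
    totals = defaultdict(int)
    for step in steps:
        totals[step["subtask"]] += step["input_tokens"] + step["output_tokens"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: count / grand_total for name, count in totals.items()}


# Illustrative data, not measurements from a real system.
steps = [
    {"subtask": "retrieve", "input_tokens": 500, "output_tokens": 0},
    {"subtask": "validate", "input_tokens": 3000, "output_tokens": 1000},
    {"subtask": "summarize", "input_tokens": 400, "output_tokens": 100},
]
shares = token_share_by_subtask(steps)
most_expensive = max(shares, key=shares.get)
print(most_expensive, shares[most_expensive])
```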

Example production metrics

Metric	Definition	Why it matters
Trace sampling rate	Keep 5% of traces for in-depth analysis	Prevents data overload while retaining key cases
Failure localization rate	Share of failures successfully localized to a specific step	Tests the effectiveness of instrumentation
Session resolution rate	Share of conversations that achieve their goal	Session-level quality metric
Escalation frequency	Number of handoffs to a human	Signals a system bottleneck
Token cost / latency distribution	Token usage and latency at each step	Identifies expensive patterns for cost optimization
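Three of these metrics can be computed directly from per-session summaries. This is a sketch under assumed field names (`resolved`, `escalated`, `failed_step`); how a session gets marked as resolved is left to whatever evaluator the team uses.

```python
def session_metrics(sessions):
    """Compute conversation-level metrics from per-session summaries."""
    total = len(sessions)
    failed = [s for s in sessions if not s["resolved"]]
    # A failure counts as localized when a specific failing step was identified.
    localized = [s for s in failed if s["failed_step"] is not None]
    return {
        "session_resolution_rate": (total - len(failed)) / total if total else None,
        "escalation_count": sum(1 for s in sessions if s["escalated"]),
        "failure_localization_rate": len(localized) / len(failed) if failed else None,
    }


# Illustrative data only.
sessions = [
    {"resolved": True, "escalated": False, "failed_step": None},
    {"resolved": True, "escalated": False, "failed_step": None},
    {"resolved": False, "escalated": True, "failed_step": "payment_check_04"},
    {"resolved": False, "escalated": False, "failed_step": None},
]
metrics = session_metrics(sessions)
print(metrics)
```

A low failure localization rate says more about the instrumentation than about the agent: failures are happening that the traces cannot pin to a step.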

Threshold: from visibility to action

Successful teams turn observation into action:

Capture production traces
Analyze patterns and identify problems
Build test datasets
Run evaluations and measure quality
Use the results to drive improvements

This forms a closed loop: observe → analyze → evaluate → improve.
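One turn of this loop, sketched with stand-ins: reviewed production traces become regression cases, and a candidate agent is scored against them. The function names, the trace fields, and the exact-match comparison are simplifying assumptions; real evaluations usually need a more tolerant scorer.

```python
def to_regression_cases(traces, expected_by_trace_id):
    """Turn reviewed production traces into (input, expected) regression cases."""
    cases = []
    for trace in traces:
        expected = expected_by_trace_id.get(trace["trace_id"])
        if expected is None:
            continue  # skip traces nobody has labeled yet
        cases.append({"input": trace["input"], "expected": expected})
    return cases


def run_regression(agent_fn, cases):
    """Return the pass rate of agent_fn over the regression cases."""
    if not cases:
        return 0.0
    passed = sum(agent_fn(case["input"]) == case["expected"] for case in cases)
    return passed / len(cases)


traces = [
    {"trace_id": "t1", "input": "refund"},
    {"trace_id": "t2", "input": "cancel"},
    {"trace_id": "t3", "input": "upgrade"},  # not yet reviewed
]
labels = {"t1": "REFUND", "t2": "CANCELLED"}
cases = to_regression_cases(traces, labels)


def stub_agent(text):
    # Stand-in for a real agent call.
    return text.upper()


pass_rate = run_regression(stub_agent, cases)
print(len(cases), pass_rate)
```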

Implementation case: a replayable event stream for multi-agent workflows

Architecture decision: why Kafka

Novatechflow’s production experience shows:

Agent observability breaks down when teams try to force long-running, stateful workflows into dashboards designed for stateless microservices.

The real challenge is not collecting more logs, but reconstructing the context, state changes, and decision boundaries the agent saw during execution.

That is why they built their broader orchestration layer on Kafka rather than point-to-point coordination: observability was one of the strongest deciding factors.

Event model design

Events vs logs:

Logs tell you which component wrote what
A replayable event stream tells you what the workflow did

Core event types:

Event type	Example content
Session start/end	`session_id: sess_8921`
Planner decision	Task decomposition, delegating validation to a sub-agent
Tool invocation	`tool: "stripe_api.get_status"`, `caller_step: "payment_check_04"`
Drift detected	An ambiguous tool response triggers a retry loop
Sub-agent handoff	Which agent is delegated to and which parameters are passed
Policy intervention	A safety gate is triggered, a budget limit steps in

Example event:

{
  "event_type": "tool_invocation",
  "session_id": "sess_8921",
  "agent_id": "invoice_validator",
  "timestamp": 1709382002,
  "payload": {
    "tool_name": "stripe_api.get_status",
    "caller_step": "payment_check_04",
    "request_ref": "obj_1288",
    "state_snapshot_hash": "a1b2c3d4",
    "retry_count": 1,
    "budget_remaining": 14
  }
}

An event like this is more useful than a vague “called validation tool”: it tells you where in the workflow the call happened, which state version was active, and whether the agent was already in a retry loop.

Session reconstruction: from event stream to replay

Key properties:

Partitioned by session_id: preserves an ordered event history for each session
Structured events: every state transition is emitted as an event
Timestamps: accurate timestamps for analysis and replay
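The effect of keying by session_id can be shown with an in-memory stand-in for a partitioned log. This is not a Kafka client and `partition_for` is not Kafka's partitioner; the class, the CRC32 hash, and the partition count are assumptions made to keep the sketch self-contained.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # illustrative value, not taken from the source


def partition_for(session_id, num_partitions=NUM_PARTITIONS):
    """Stable choice: the same session_id always maps to the same partition."""
    return zlib.crc32(session_id.encode("utf-8")) % num_partitions


class PartitionedLog:
    """In-memory stand-in for a partitioned, append-only event log."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def append(self, event):
        key = partition_for(event["session_id"], self.num_partitions)
        self.partitions[key].append(event)

    def session_history(self, session_id):
        """Events for one session, in the order they were appended."""
        key = partition_for(session_id, self.num_partitions)
        return [e for e in self.partitions[key] if e["session_id"] == session_id]


log = PartitionedLog()
for session_id, event_type in [
    ("sess_8921", "session_start"),
    ("sess_0007", "session_start"),
    ("sess_8921", "tool_invocation"),
    ("sess_0007", "session_end"),
    ("sess_8921", "drift_detected"),
]:
    log.append({"session_id": session_id, "event_type": event_type})

history = [e["event_type"] for e in log.session_history("sess_8921")]
print(history)
```

Because all events of a session land in one partition, reading that partition in order yields the session's history without any cross-partition merge.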

Reconstruction flow:

A user request enters the system
The planner agent decomposes the task and delegates validation to a sub-agent
Tool invocation: { tool: "stripe_api.get_status" }
Drift detected: an ambiguous tool response triggers a retry loop
Engineers can inspect the exact state payload that caused the drift

This is the kind of visibility teams actually need.
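A replay over such events can flag the point where retries stop making progress. The sketch below treats repeated calls to the same tool with an unchanged `state_snapshot_hash` as a stall; that heuristic, the threshold of 3, and the helper names are assumptions, not the method described by the source.

```python
from collections import Counter


def find_stalled_retry(events, threshold=3):
    """Replay events in time order; return the tool_invocation at which the same
    tool has been called `threshold` times with an unchanged state hash."""
    seen = Counter()
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] != "tool_invocation":
            continue
        payload = event["payload"]
        key = (payload["tool_name"], payload["state_snapshot_hash"])
        seen[key] += 1
        if seen[key] >= threshold:
            return event
    return None


def tool_event(timestamp, retry_count, state_hash="a1b2c3d4"):
    # Helper that builds events shaped like the example event above.
    return {
        "event_type": "tool_invocation",
        "session_id": "sess_8921",
        "timestamp": timestamp,
        "payload": {
            "tool_name": "stripe_api.get_status",
            "state_snapshot_hash": state_hash,
            "retry_count": retry_count,
        },
    }


events = [
    tool_event(1709382002, 1),
    tool_event(1709382030, 2),
    {"event_type": "planner_decision", "session_id": "sess_8921",
     "timestamp": 1709382040, "payload": {}},
    tool_event(1709382061, 3),
    tool_event(1709382090, 4),
]
stalled = find_stalled_retry(events)
print(stalled["timestamp"], stalled["payload"]["retry_count"])
```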

Sampling strategy and retention policy: data management at scale

Sampling strategy

Once production volume exceeds 1,000 runs per day, human review capacity is saturated and automated pattern detection is required.

What to sample:

High-quality samples: completed successfully, involve complex decisions, include tool calls
Failure cases: obvious errors, escalations to a human, multiple retries
Edge cases: ambiguous queries, tool timeouts, retrieval failures

Example sampling rates:

Keep 5% of all traces for in-depth analysis (cost control)
Keep 100% of failure cases (diagnosis first)
Keep 10% of successful cases (pattern recognition)
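A sampling decision like this is often made deterministic by hashing the trace ID, so the same trace is always kept or always dropped. The sketch uses the 100% failure and 10% success rates listed above as defaults; `keep_trace` and the `outcome` values are hypothetical, and treating the rates as independent, tunable knobs is an assumption.

```python
import hashlib


def keep_trace(trace_id, outcome, success_rate=0.10):
    """Keep every failure; keep a deterministic fraction of successes."""
    if outcome == "failure":
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


trace_ids = [f"trace_{i}" for i in range(10_000)]
kept = sum(keep_trace(tid, "success") for tid in trace_ids)
print(f"kept {kept} of {len(trace_ids)} successful traces")
```

Hash-based sampling needs no shared state between workers, and re-running the decision for a given trace ID gives the same answer.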

Retention policy

Retention periods:

Successful runs: 7-30 days (long enough for root cause analysis)
Failure cases: 30-90 days (complex diagnosis takes longer)
Past the retention period: automatically converted to an anonymized dataset or deleted
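The retention rule can be expressed as a small function of outcome and age. The sketch picks the upper bound of each range above as the cutoff, which is an assumption, and `retention_action` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Upper bounds of the ranges above, chosen for illustration; tune per workload.
RETENTION_DAYS = {"success": 30, "failure": 90}


def retention_action(outcome, recorded_at, now):
    """Return "keep" inside the retention window, else "anonymize_or_delete"."""
    age = now - recorded_at
    if age <= timedelta(days=RETENTION_DAYS[outcome]):
        return "keep"
    return "anonymize_or_delete"


now = datetime(2026, 5, 6, tzinfo=timezone.utc)
recorded = now - timedelta(days=45)
actions = (
    retention_action("success", recorded, now),
    retention_action("failure", recorded, now),
)
print(actions)
```

A 45-day-old trace is past the window for a successful run but still inside it for a failure, which matches the longer diagnosis period given to failures.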

Cost control:

Trace volume: number of tokens, number of events
Storage costs: vector search, log aggregation
Monitoring costs: data access patterns

Threshold: the evolution of observability infrastructure

As scale grows, observability infrastructure goes from “optional” to a “critical feature”:

Detect degradation: identify performance regressions before users report them
Cost control: track token usage at each step and identify expensive patterns
Compliance evidence: the only reliable way to demonstrate SLA compliance or diagnose violations

Measurable trade-offs: data for implementation decisions

Trade-off 1: trace depth vs overhead

Deeper tracing:

Pros: more precise diagnosis, including intermediate reasoning and state transitions
Cons: higher overhead (tokens, storage, processing time)

Implementation advice:

Pre-production: medium depth (key steps + tool calls)
Production: base depth (LLM calls + tool calls) plus optional depth (retrieval steps)
Large scale: base depth + automated pattern detection

Trade-off 2: session level vs step level

Session level:

Pros: aligns with conversation-level goals and makes the user experience easier to understand
Cons: more complex data model; session context has to be tracked

Step level:

Pros: more precise fault localization
Cons: requires deeper instrumentation

Implementation advice:

Pre-production: step level (fast iteration)
Production: session level + step level (trace-level and conversation-level metrics)

Trade-off 3: native SDK vs OTel

Native SDK:

Pros: fast to set up, deep framework integration
Cons: vendor lock-in

OTel:

Pros: vendor neutral, unified observability stack
Cons: more complex setup, requires collector configuration

Implementation advice:

Start with the native SDK (fast to set up)
As scale grows, evaluate OTel integration (unified observability)

Implementation checklist

Phase 1: prototype (< 10 requests per day)

[ ] Use print statements for local debugging
[ ] Log key decision points (planner routing, tool selection)
[ ] Manually review a small number of traces each day

Phase 2: pre-production (10-1,000 requests per day)

[ ] Migrate to structured tracing (framework-native SDK)
[ ] Trace LLM calls, tool calls, and retrieval steps
[ ] Build a test dataset from real usage data
[ ] Implement basic metrics (error rate, token usage)

Phase 3: production (> 1,000 requests per day)

[ ] Migrate to session-level tracing (group by session_id)
[ ] Implement a sampling strategy (5% for in-depth analysis + keep all failure cases)
[ ] Implement a retention policy (successes 7-30 days, failures 30-90 days)
[ ] Implement automated pattern detection (failure patterns, cost patterns)
[ ] Migrate to OTel (if a unified observability stack is needed)
[ ] Establish conversation-level metrics (resolution rate, escalation frequency, goal completion rate)

Summary

The closed loop from observation to action:

1. Capture production traces → 2. Analyze patterns, identify problems → 3. Build test datasets → 4. Run evaluations, measure quality → 5. Use the results to drive improvements

Key insights:

Agent observability is at its core about session reconstruction, not logs
Replayable event streams matter more than disconnected logs
Threshold: multi-turn conversations, session-level metrics, automated pattern detection
Trade-offs: trace depth vs overhead, session level vs step level, native SDK vs OTel
Metrics: trace sampling rate, failure localization rate, session resolution rate, escalation frequency, token cost / latency distribution

Thresholds by stage:

Pre-production: structured tracing, basic metrics
Production: session-level tracing, sampling strategy, automated pattern detection
Large scale: session-level metrics, cost control, compliance evidence

Next steps:

Start with the framework-native SDK (fast to set up)
Implement basic tracing (LLM + tools)
Establish session-level metrics
Implement sampling and retention policies
As scale grows, evaluate OTel integration

References:

LangChain: AI Agent Observability: Tracing, Testing, and Improving Agents
Novatechflow: Agent Observability for Multi-Agent Systems: How to Trace Agent Workflows in Production
Atlan: AI Agent Observability: A Complete Guide for 2026 & Beyond
OpenTelemetry: Semantic conventions for generative AI systems
