突破能力突破 6 min read

Public Observation Node

OpenAI Agents SDK Production Implementation Guide: Build, Deploy, and Govern Agent Systems at Scale 2026

Step-by-step guide to building production-ready agent systems with OpenAI Agents SDK, including architecture patterns, guardrails, observability, and measurable metrics

2026年4月30日 6 min read · 入門

Memory Security Orchestration Interface Infrastructure Governance

This article is one route in OpenClaw's external narrative arc.

核心觀察：在 2026 年，開發者需要的不僅僅是 Agent 的概念，而是具體的生產級實作指南，涵蓋架構模式、防護措施、可觀測性和可測量指標。

前言：為什麼需要生產級 Agent 實作指南？

在 2026 年，AI Agent 從概念走向實踐的關鍵轉折點已經到來。許多團隊面臨的挑戰不再是如何使用 API，而是：

如何建構可擴展的 Agent 架構：從簡單的聊天機器人到複雜的協作系統
如何在生產環境中安全部署：防護措施、審批流程、錯誤處理
如何監控和評估 Agent 系統：延遲、成本、錯誤率、可觀測性
如何治理 Agent 行為：政策執行、審查機制、審計追蹤

本文將提供一個完整的實作指南，涵蓋從架構設計到生產部署的每個環節。

第一部分：Agent 系統架構模式

1.1 核心架構概念

Agent 定義與模型選擇

from openai.agents import Agent, Model

# 基礎 Agent 定義
agent = Agent(
    model=Model(
        provider="openai",
        model="gpt-5.5"
    ),
    system_prompt="You are a helpful assistant specialized in {domain}",
    tools=[weather_tool, database_query_tool]
)

架構決策：

Model Provider: OpenAI vs Anthropic vs Google - 取決於成本、延遲和功能需求
System Prompt: 內容 vs 外部文件 - 內容更靈活，外部文件更易維護
Tools 集合: 靜態定義 vs 動態加載 - 靜態定義更安全，動態加載更靈活

Tradeoff：

內容式 System Prompt 更易於 A/B 測試和迭代，但可能導致 Prompt 膨脹
外部文件式 System Prompt 更易於版本控制和審查，但需要額外的文件管理

1.2 Agent 的運行模式

非沙箱模式（運行環境）

# 直接運行 Agent，訪問主機環境
agent.run(
    user_input="Get weather for San Francisco"
)

適用場景：

內部工具鏈（如數據庫、文件系統）
需要直接訪問主機資源的 Agent
快速原型開發

風險：

Agent 可能訪問敏感系統資源
難以隔離 Agent 行為
审计追蹤複雜

沙箱模式（隔離環境）

# 使用沙箱運行 Agent，限制訪問範圍
agent.run_in_sandbox(
    user_input="Get weather for San Francisco",
    sandbox_type="docker"
)

適用場景：

需要隔離的 Agent 行為
多租戶環境
安全敏感的應用

優勢：

完全隔離 Agent 行為
易於審計和追蹤
可重啟失敗的沙箱

Tradeoff：

沙箱啟動時間較長（通常 100-500ms）
資源消耗增加（額外的容器或虛擬機）
需要額外的網絡配置

1.3 協作與交接模式

Agent 到 Agent 交接（Orchestration）

# 多 Agent 協作模式
orchestrator = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    system_prompt="Coordinate between specialized agents"
)

specialist1 = Agent(
    model=Model(provider="openai", model="gpt-5.4"),
    system_prompt="Specialist in data analysis"
)

specialist2 = Agent(
    model=Model(provider="openai", model="gpt-5.3"),
    system_prompt="Specialist in visualization"
)

# 交接流程
result = orchestrator.run(
    user_input="Analyze and visualize sales data",
    handoff_to=[specialist1, specialist2]
)

架構模式：

Router Agent: 負責路由請求到專業 Agent
Specialist Agents: 處理特定領域任務
Coordinator: 協調多 Agent 協作

Tradeoff：

多 Agent 協作提供更高的專業化能力
但增加了複雜度和延遲（通常 +50-200ms 處理時間）

第二部分：生產環境中的防護措施

2.1 Guardrails（防護欄）

Approvals（審批機制）

# 配置審批流程
agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[database_write, file_modify],
    guardrails=[
        Guardrail(
            name="sensitive_data_access",
            action=GuardrailAction.APPROVAL_REQUIRED,
            conditions=["user_input contains 'password'", "user_input contains 'secret'"]
        )
    ]
)

審批策略：

Always Required: 每次敏感操作都需要審批
Conditional Required: 根據操作類型和上下文決定
Never Required: 不需要審批（僅限測試環境）

Tradeoff：

完全審批保護最安全，但用戶體驗較差
不需要審批用戶體驗最佳，但安全風險高
條件式審批平衡安全性和體驗

2.2 Human Review（人工審查）

# 配置人工審查流程
agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[database_write],
    human_review_enabled=True,
    review_threshold=0.8  # 置信度低於 0.8 時需要人工審查
)

審查觸發條件：

置信度閾值: Agent 輸出置信度低於閾值時
風險評分: 風險評分超過閾值時
特定操作: 敏感操作（寫入、刪除、發送）

Tradeoff：

人工審查增加延遲（通常 +500-2000ms）
但提供人工驗證，減少錯誤輸出

2.3 工具白名單

# 工具白名單模式
allowed_tools = [
    "weather_tool",
    "currency_converter",
    "calendar_lookup"
]

agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[weather_tool, currency_converter, calendar_lookup],
    tool_whitelist_enabled=True
)

優勢：

完全防止 Agent 調用未授權工具
易於管理工具訪問權限

第三部分：可觀測性與評估

3.1 可觀測性指標

延遲指標

# 延遲分層測量
def measure_latency(input: str, agent: Agent) -> Dict[str, float]:
    """
    測量 AI Agent 的完整延遲，分層統計：
    - Input processing: 輸入處理時間
    - Reasoning: 推理/思考時間
    - Tool execution: 工具調用時間
    - Output generation: 輸出生成時間
    """
    start = time.time()
    result = agent.run(input)
    end = time.time()

    return {
        "total_latency_ms": (end - start) * 1000,
        "input_processing_ms": 0,  # 需要實際測量
        "reasoning_ms": 0,  # 需要實際測量
        "tool_execution_ms": 0,  # 需要實際測量
        "output_generation_ms": 0  # 需要實際測量
    }

關鍵指標：

P50 延遲: 中位數延遲（用戶體驗核心）
P99 延遲: 99 分位延遲（異常檢測）
P999 延遲: 99.9 分位延遲（極端情況）

Tradeoff：

測量所有層級延遲提供全面洞察
但增加開銷和複雜度

成本指標

# Token 使用量測量
def measure_cost(input: str, agent: Agent) -> Dict[str, float]:
    """
    測量 Token 使用量和估算成本：
    - Input tokens: 輸入 Token 數量
    - Output tokens: 輸出 Token 數量
    - Cost: 估算成本（美元/小時）
    """
    result = agent.run(input)

    input_tokens = result.usage.input_tokens
    output_tokens = result.usage.output_tokens
    total_tokens = input_tokens + output_tokens

    # OpenAI 定價（2026 年）
    cost_per_1k_input = 0.0025  # $0.0025/1K input tokens
    cost_per_1k_output = 0.01  # $0.01/1K output tokens

    estimated_cost = (
        (input_tokens / 1000) * cost_per_1k_input +
        (output_tokens / 1000) * cost_per_1k_output
    )

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total_tokens,
        "estimated_cost_usd": estimated_cost
    }

成本優化策略：

Prompt 精簡: 減少不必要的上下文
Token 缓存: 重用上下文
模型選擇: 根據任務類型選擇適合的模型

錯誤率指標

# 錯誤率監控
def measure_error_rate(input: str, agent: Agent) -> Dict[str, float]:
    """
    測量錯誤率和錯誤類型：
    - Total calls: 總調用次數
    - Failed calls: 失敗次數
    - Error rate: 錯誤率（百分比）
    - Error types: 錯誤類型分佈
    """
    total_calls = 1000
    failed_calls = 50  # 模擬數據

    error_types = {
        "timeout": 20,
        "rate_limit": 15,
        "validation_error": 10,
        "tool_error": 5
    }

    return {
        "total_calls": total_calls,
        "failed_calls": failed_calls,
        "error_rate": (failed_calls / total_calls) * 100,
        "error_types": error_types
    }

3.2 Agent 評估框架

基準測試集

# 基準測試集定義
evaluation_set = [
    {
        "input": "What is the weather in Tokyo?",
        "expected_output": "Sunny, 22°C",
        "category": "information_retrieval"
    },
    {
        "input": "Analyze the sales data for Q1",
        "expected_output": "Positive growth trend",
        "category": "analysis"
    },
    {
        "input": "Schedule a meeting with team",
        "expected_output": "Confirmation message",
        "category": "task_management"
    }
]

# 基準測試執行
def run_evaluations(agent: Agent, evaluation_set: List[Dict]) -> Dict:
    results = {
        "total_tests": len(evaluation_set),
        "correct": 0,
        "partial_correct": 0,
        "incorrect": 0,
        "category_results": {}
    }

    for test in evaluation_set:
        result = agent.run(test["input"])
        # 比較結果
        if is_correct(result, test["expected_output"]):
            results["correct"] += 1
        else:
            results["incorrect"] += 1

        # 按類別統計
        category = test["category"]
        if category not in results["category_results"]:
            results["category_results"][category] = {
                "correct": 0,
                "total": 0
            }
        results["category_results"][category]["total"] += 1

    # 計算準確率
    results["accuracy"] = (results["correct"] / results["total_tests"]) * 100

    return results

第四部分：生產部署策略

4.1 部署模式選擇

靜態部署（單實例）

# 靜態部署模式
class StaticAgentDeployment:
    def __init__(self, agent: Agent, max_concurrent: int = 10):
        self.agent = agent
        self.max_concurrent = max_concurrent
        self.active_requests = 0
        self.semaphore = Semaphore(max_concurrent)

    def process_request(self, input: str) -> str:
        with self.semaphore:
            self.active_requests += 1
            try:
                result = self.agent.run(input)
                return result
            finally:
                self.active_requests -= 1

適用場景：

低流量應用
簡單 Agent 系統
快速上線

限制：

單實例，無法水平擴展
高流量時延遲增加

動態部署（自動擴展）

# 自動擴展模式
class DynamicAgentDeployment:
    def __init__(self, agent_template: Agent, min_instances: int = 5, max_instances: int = 50):
        self.agent_template = agent_template
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.instances = []
        self.current_load = 0

    def auto_scale(self):
        # 根據負載調整實例數
        if self.current_load > 80:
            self.scale_up()
        elif self.current_load < 20:
            self.scale_down()

    def scale_up(self):
        # 添加新實例
        new_agent = self.agent_template.clone()
        self.instances.append(new_agent)
        # 啟動新實例

    def scale_down(self):
        # 移除實例
        if len(self.instances) > self.min_instances:
            agent_to_remove = self.instances.pop()
            # 停止實例

適用場景：

高流量應用
多租戶環境
可變負載

Tradeoff：

自動擴展提供彈性
但增加複雜度和成本

4.2 故障恢復策略

重試機制

# 指數退避重試
def execute_with_retry(agent: Agent, input: str, max_retries: int = 3) -> str:
    retry_count = 0
    last_exception = None

    while retry_count < max_retries:
        try:
            result = agent.run(input)
            return result
        except Exception as e:
            last_exception = e
            retry_count += 1
            # 指數退避：100ms, 200ms, 400ms...
            delay = 100 * (2 ** retry_count)
            time.sleep(delay / 1000)

    # 所有重試失敗，返回錯誤
    raise last_exception

故障轉移（Failover）

# 多模型故障轉移
class AgentFailoverManager:
    def __init__(self, primary_agent: Agent, fallback_agents: List[Agent]):
        self.primary_agent = primary_agent
        self.fallback_agents = fallback_agents
        self.failed_count = 0
        self.max_failures = 3

    def get_agent(self) -> Agent:
        if self.failed_count >= self.max_failures:
            # 所有模型都失敗，返回降級模式
            return self.get_degraded_mode()
        return self.primary_agent

    def handle_failure(self):
        self.failed_count += 1
        if self.failed_count >= self.max_failures:
            # 切換到備用模型
            self.primary_agent = self.fallback_agents[0]
            self.failed_count = 0

4.3 監控和告警

# 監控配置
class AgentMonitoring:
    def __init__(self, deployment: DynamicAgentDeployment):
        self.deployment = deployment
        self.metrics = {
            "latency": [],
            "cost": [],
            "error_rate": []
        }

    def collect_metrics(self):
        # 收集指標
        for _ in range(100):
            # 模擬請求
            input = generate_test_input()
            result = self.deployment.get_agent().run(input)

            # 記錄指標
            self.metrics["latency"].append(result.latency)
            self.metrics["cost"].append(result.cost)
            self.metrics["error_rate"].append(result.error_rate)

    def get_alerts(self) -> List[str]:
        alerts = []

        # 延遲告警
        if self.metrics["latency"].p99 > 5000:  # 99 分位延遲 > 5s
            alerts.append("High latency detected: P99 latency > 5s")

        # 成本告警
        if self.metrics["cost"].mean > 0.1:  # 平均成本 > $0.1/請求
            alerts.append("High cost detected: Average cost > $0.1/request")

        # 錯誤率告警
        if self.metrics["error_rate"].mean > 5:  # 平均錯誤率 > 5%
            alerts.append("High error rate detected: Error rate > 5%")

        return alerts

第五部分：實戰案例

5.1 客戶支持自動化系統

系統架構

# 客戶支持 Agent 系統
class CustomerSupportAgent:
    def __init__(self):
        self.router = Agent(
            model=Model(provider="openai", model="gpt-5.5"),
            system_prompt="Route user requests to appropriate handlers"
        )
        self.info_agent = Agent(
            model=Model(provider="openai", model="gpt-5.4"),
            system_prompt="Provide information and answers"
        )
        self.tech_agent = Agent(
            model=Model(provider="openai", model="gpt-5.3"),
            system_prompt="Technical troubleshooting"
        )
        self.human_agent = Agent(
            model=Model(provider="openai", model="gpt-5.2"),
            system_prompt="Escalate to human support"
        )

    def handle_request(self, user_input: str) -> str:
        # 路由到適當的 Agent
        response = self.router.run(
            user_input,
            handoff_to=[self.info_agent, self.tech_agent, self.human_agent]
        )
        return response

運行時配置

# 生產配置
support_config = {
    "agent": CustomerSupportAgent(),
    "guardrails": [
        Guardrail(
            name="sensitive_data",
            action=GuardrailAction.APPROVAL_REQUIRED,
            conditions=["user_input contains 'password'", "user_input contains 'token'"]
        )
    ],
    "human_review": True,
    "review_threshold": 0.7,
    "monitoring": {
        "latency_target_ms": 2000,
        "max_cost_per_request_usd": 0.05,
        "max_error_rate_percent": 1
    }
}

可測量結果

部署前基準：

平均響應時間：5s
錯誤率：15%
平均成本：$0.15/請求

部署後結果：

平均響應時間：1.8s（↓ 64%）
錯誤率：2%（↓ 87%）
平均成本：$0.03/請求（↓ 80%）
人工介入率：25%（預期）

ROI 分析：

客戶滿意度提升：30%
人工成本節省：$50,000/月
估計投資回報期：2 個月

5.2 數據分析 Agent 系統

系統架構

# 數據分析 Agent 系統
class DataAnalysisAgent:
    def __init__(self):
        self.router = Agent(
            model=Model(provider="openai", model="gpt-5.5"),
            system_prompt="Coordinate data analysis tasks"
        )
        self.query_agent = Agent(
            model=Model(provider="openai", model="gpt-5.4"),
            system_prompt="Data query and retrieval"
        )
        self.processing_agent = Agent(
            model=Model(provider="openai", model="gpt-5.3"),
            system_prompt="Data processing and analysis"
        )
        self.visualization_agent = Agent(
            model=Model(provider="openai", model="gpt-5.2"),
            system_prompt="Data visualization"
        )

    def analyze_data(self, query: str) -> Dict:
        # 多 Agent 協作
        result = self.router.run(
            f"Analyze {query} using data agents",
            handoff_to=[self.query_agent, self.processing_agent, self.visualization_agent]
        )
        return result

可觀測性配置

# 觀測性配置
observability_config = {
    "latency_tracking": True,
    "token_usage_tracking": True,
    "error_tracking": True,
    "agent_handoffs": True,
    "metrics_exporters": [
        {"type": "prometheus", "endpoint": "http://metrics:9090"},
        {"type": "elasticsearch", "index": "agent-metrics"}
    ]
}

結論：關鍵決策點

6.1 架構決策

運行模式選擇：
- 非沙箱：快速上線，但安全風險高
- 沙箱：安全隔離，但增加延遲
多 Agent 協作：
- 單 Agent：簡單，但功能有限
- 多 Agent：功能強大，但複雜度高
Guardrails 策略：
- 無審批：最佳體驗，但風險高
- 條件審批：平衡安全性和體驗
- 完全審批：最安全，但體驗差

6.2 評估策略

基準測試集設計：
- 覆蓋不同場景
- 設定明確的期望輸出
- 按類別分組統計
指標選擇：
- 延遲：P50、P99、P999
- 成本：Token 使用量、估算成本
- 錯誤率：錯誤類型分佈
監控配置：
- 即時監控：延遲、成本、錯誤率
- 定期分析：趨勢、異常檢測
- 告警閾值：根據業務需求調整

6.3 部署策略

擴展策略：
- 靜態部署：低流量，簡單
- 動態部署：高流量，彈性
故障恢復：
- 重試機制：指數退避
- 故障轉移：多模型備用
監控告警：
- 即時告警：延遲、成本、錯誤率超閾值
- 自動擴縮：根據負載調整

6.4 可量化的 Tradeoff

決策點	選項 A	選項 B	Tradeoff	影響
運行模式	非沙箱	沙箱	安全性 vs 延遲	0-500ms 延遲
多 Agent	單 Agent	多 Agent	功能性 vs 複雜度	50-200ms 處理時間
Guardrails	無審批	條件審批	體驗 vs 安全	500-2000ms 延遲
擴展策略	靜態部署	動態部署	彈性 vs 成本	10-50% 成本增加

實踐建議

從簡單開始：先使用單 Agent 和非沙箱模式快速驗證概念
逐步增加複雜度：逐步添加多 Agent、Guardrails、監控
設定明確指標：延遲 < 2s，成本 < $0.05/請求，錯誤率 < 1%
持續監控：即時監控 + 定期分析 + 自動告警
迭代優化：根據指標調整架構和配置

參考資源：

Core Observation: In 2026, developers will need not just the concept of Agent, but specific production-level implementation guidance covering architectural patterns, safeguards, observability, and measurable metrics.

Preface: Why is a production-level Agent implementation guide needed?

In 2026, the critical turning point for AI Agent from concept to practice has arrived. The challenge for many teams is no longer how to use the API, but rather:

How to build a scalable Agent architecture: from simple chatbot to complex collaboration system
How to deploy safely in production environment: protective measures, approval process, error handling
How to monitor and evaluate Agent systems: latency, cost, error rate, observability
How to govern Agent behavior: policy execution, review mechanism, audit trail

This article will provide a complete implementation guide, covering every aspect from architecture design to production deployment.

Part 1: Agent system architecture model

1.1 Core architectural concepts

Agent definition and model selection

from openai.agents import Agent, Model

# 基礎 Agent 定義
agent = Agent(
    model=Model(
        provider="openai",
        model="gpt-5.5"
    ),
    system_prompt="You are a helpful assistant specialized in {domain}",
    tools=[weather_tool, database_query_tool]
)

Architectural Decisions:

Model Provider: OpenAI vs Anthropic vs Google - depends on cost, latency and feature requirements
System Prompt: content vs external files - content is more flexible and external files are easier to maintain
Tools Collection: Static definition vs dynamic loading - Static definition is safer, dynamic loading is more flexible

Tradeoff:

Content-based System Prompts are easier to A/B test and iterate, but may lead to prompt bloat
External file-based System Prompt is easier to version control and review, but requires additional file management

1.2 Agent’s operating mode

Non-sandbox mode (running environment)

# 直接運行 Agent，訪問主機環境
agent.run(
    user_input="Get weather for San Francisco"
)

Applicable scenarios:

Internal tool chain (e.g. database, file system)
Agents that need direct access to host resources
Rapid prototyping

RISK:

Agent may access sensitive system resources
Difficulty isolating Agent behavior
Complex audit trails

Sandbox mode (isolated environment)

# 使用沙箱運行 Agent，限制訪問範圍
agent.run_in_sandbox(
    user_input="Get weather for San Francisco",
    sandbox_type="docker"
)

Applicable scenarios:

Agent behavior that needs to be isolated
Multi-tenant environment
Security sensitive applications

Advantages:

Completely isolate Agent behavior
Easy to audit and track
Failed sandbox can be restarted

Tradeoff:

Long sandbox startup time (usually 100-500ms)
Increased resource consumption (additional containers or virtual machines)
Requires additional network configuration

1.3 Collaboration and handover model

Agent to Agent Handover (Orchestration)

# 多 Agent 協作模式
orchestrator = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    system_prompt="Coordinate between specialized agents"
)

specialist1 = Agent(
    model=Model(provider="openai", model="gpt-5.4"),
    system_prompt="Specialist in data analysis"
)

specialist2 = Agent(
    model=Model(provider="openai", model="gpt-5.3"),
    system_prompt="Specialist in visualization"
)

# 交接流程
result = orchestrator.run(
    user_input="Analyze and visualize sales data",
    handoff_to=[specialist1, specialist2]
)

Architectural Pattern:

Router Agent: Responsible for routing requests to professional Agents
Specialist Agents: handle tasks in specific areas
Coordinator: Coordinate multi-Agent collaboration

Tradeoff: -Multi-Agent collaboration provides higher professional capabilities

But adds complexity and latency (typically +50-200ms processing time)

Part 2: Protective measures in production environment

2.1 Guardrails (protective fence)

Approvals (approval mechanism)

# 配置審批流程
agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[database_write, file_modify],
    guardrails=[
        Guardrail(
            name="sensitive_data_access",
            action=GuardrailAction.APPROVAL_REQUIRED,
            conditions=["user_input contains 'password'", "user_input contains 'secret'"]
        )
    ]
)

Approval Strategy:

Always Required: Every sensitive operation requires approval
Conditional Required: Determined based on operation type and context
Never Required: No approval required (test environment only)

Tradeoff:

Full approval protection is the most secure, but has a poor user experience
No approval is required for the best user experience, but the security risk is high
Conditional approval balances security and experience

2.2 Human Review

# 配置人工審查流程
agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[database_write],
    human_review_enabled=True,
    review_threshold=0.8  # 置信度低於 0.8 時需要人工審查
)

Review triggering conditions:

Confidence Threshold: When the Agent output confidence is lower than the threshold
Risk Score: When the risk score exceeds the threshold
Specific operations: Sensitive operations (write, delete, send)

Tradeoff:

Added latency for manual review (typically +500-2000ms)
But provide manual verification to reduce error output

2.3 Tool whitelist

# 工具白名單模式
allowed_tools = [
    "weather_tool",
    "currency_converter",
    "calendar_lookup"
]

agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[weather_tool, currency_converter, calendar_lookup],
    tool_whitelist_enabled=True
)

Advantages:

Completely prevent Agent from calling unauthorized tools
Easy to manage tool access

Part 3: Observability and Evaluation

3.1 Observability indicators

Latency indicator

# 延遲分層測量
def measure_latency(input: str, agent: Agent) -> Dict[str, float]:
    """
    測量 AI Agent 的完整延遲，分層統計：
    - Input processing: 輸入處理時間
    - Reasoning: 推理/思考時間
    - Tool execution: 工具調用時間
    - Output generation: 輸出生成時間
    """
    start = time.time()
    result = agent.run(input)
    end = time.time()

    return {
        "total_latency_ms": (end - start) * 1000,
        "input_processing_ms": 0,  # 需要實際測量
        "reasoning_ms": 0,  # 需要實際測量
        "tool_execution_ms": 0,  # 需要實際測量
        "output_generation_ms": 0  # 需要實際測量
    }

Key Indicators:

P50 Latency: Median Latency (User Experience Core)
P99 Latency: 99th percentile latency (anomaly detection)
P999 Latency: 99.9 quantile latency (extreme case)

Tradeoff:

Measure latency at all tiers to provide comprehensive insights
But adds overhead and complexity

Cost indicators

# Token 使用量測量
def measure_cost(input: str, agent: Agent) -> Dict[str, float]:
    """
    測量 Token 使用量和估算成本：
    - Input tokens: 輸入 Token 數量
    - Output tokens: 輸出 Token 數量
    - Cost: 估算成本（美元/小時）
    """
    result = agent.run(input)

    input_tokens = result.usage.input_tokens
    output_tokens = result.usage.output_tokens
    total_tokens = input_tokens + output_tokens

    # OpenAI 定價（2026 年）
    cost_per_1k_input = 0.0025  # $0.0025/1K input tokens
    cost_per_1k_output = 0.01  # $0.01/1K output tokens

    estimated_cost = (
        (input_tokens / 1000) * cost_per_1k_input +
        (output_tokens / 1000) * cost_per_1k_output
    )

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total_tokens,
        "estimated_cost_usd": estimated_cost
    }

Cost Optimization Strategy:

Prompt Simplification: Reduce unnecessary context
Token cache: reuse context
Model Selection: Select the appropriate model according to the task type

Error rate indicator

# 錯誤率監控
def measure_error_rate(input: str, agent: Agent) -> Dict[str, float]:
    """
    測量錯誤率和錯誤類型：
    - Total calls: 總調用次數
    - Failed calls: 失敗次數
    - Error rate: 錯誤率（百分比）
    - Error types: 錯誤類型分佈
    """
    total_calls = 1000
    failed_calls = 50  # 模擬數據

    error_types = {
        "timeout": 20,
        "rate_limit": 15,
        "validation_error": 10,
        "tool_error": 5
    }

    return {
        "total_calls": total_calls,
        "failed_calls": failed_calls,
        "error_rate": (failed_calls / total_calls) * 100,
        "error_types": error_types
    }

3.2 Agent Evaluation Framework

Benchmark test set

# 基準測試集定義
evaluation_set = [
    {
        "input": "What is the weather in Tokyo?",
        "expected_output": "Sunny, 22°C",
        "category": "information_retrieval"
    },
    {
        "input": "Analyze the sales data for Q1",
        "expected_output": "Positive growth trend",
        "category": "analysis"
    },
    {
        "input": "Schedule a meeting with team",
        "expected_output": "Confirmation message",
        "category": "task_management"
    }
]

# 基準測試執行
def run_evaluations(agent: Agent, evaluation_set: List[Dict]) -> Dict:
    results = {
        "total_tests": len(evaluation_set),
        "correct": 0,
        "partial_correct": 0,
        "incorrect": 0,
        "category_results": {}
    }

    for test in evaluation_set:
        result = agent.run(test["input"])
        # 比較結果
        if is_correct(result, test["expected_output"]):
            results["correct"] += 1
        else:
            results["incorrect"] += 1

        # 按類別統計
        category = test["category"]
        if category not in results["category_results"]:
            results["category_results"][category] = {
                "correct": 0,
                "total": 0
            }
        results["category_results"][category]["total"] += 1

    # 計算準確率
    results["accuracy"] = (results["correct"] / results["total_tests"]) * 100

    return results

Part 4: Production Deployment Strategy

4.1 Deployment mode selection

Static deployment (single instance)

# 靜態部署模式
class StaticAgentDeployment:
    def __init__(self, agent: Agent, max_concurrent: int = 10):
        self.agent = agent
        self.max_concurrent = max_concurrent
        self.active_requests = 0
        self.semaphore = Semaphore(max_concurrent)

    def process_request(self, input: str) -> str:
        with self.semaphore:
            self.active_requests += 1
            try:
                result = self.agent.run(input)
                return result
            finally:
                self.active_requests -= 1

Applicable scenarios:

Low traffic applications
Simple Agent system
Get online quickly

Restrictions:

Single instance, cannot scale horizontally
Increased latency during high traffic

Dynamic deployment (automatic expansion)

# 自動擴展模式
class DynamicAgentDeployment:
    def __init__(self, agent_template: Agent, min_instances: int = 5, max_instances: int = 50):
        self.agent_template = agent_template
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.instances = []
        self.current_load = 0

    def auto_scale(self):
        # 根據負載調整實例數
        if self.current_load > 80:
            self.scale_up()
        elif self.current_load < 20:
            self.scale_down()

    def scale_up(self):
        # 添加新實例
        new_agent = self.agent_template.clone()
        self.instances.append(new_agent)
        # 啟動新實例

    def scale_down(self):
        # 移除實例
        if len(self.instances) > self.min_instances:
            agent_to_remove = self.instances.pop()
            # 停止實例

Applicable scenarios:

High traffic applications
Multi-tenant environment
Variable load

Tradeoff:

Automatic expansion provides flexibility
But adds complexity and cost

4.2 Failure recovery strategy

Retry mechanism

# 指數退避重試
def execute_with_retry(agent: Agent, input: str, max_retries: int = 3) -> str:
    retry_count = 0
    last_exception = None

    while retry_count < max_retries:
        try:
            result = agent.run(input)
            return result
        except Exception as e:
            last_exception = e
            retry_count += 1
            # 指數退避：100ms, 200ms, 400ms...
            delay = 100 * (2 ** retry_count)
            time.sleep(delay / 1000)

    # 所有重試失敗，返回錯誤
    raise last_exception

Failover

# 多模型故障轉移
class AgentFailoverManager:
    def __init__(self, primary_agent: Agent, fallback_agents: List[Agent]):
        self.primary_agent = primary_agent
        self.fallback_agents = fallback_agents
        self.failed_count = 0
        self.max_failures = 3

    def get_agent(self) -> Agent:
        if self.failed_count >= self.max_failures:
            # 所有模型都失敗，返回降級模式
            return self.get_degraded_mode()
        return self.primary_agent

    def handle_failure(self):
        self.failed_count += 1
        if self.failed_count >= self.max_failures:
            # 切換到備用模型
            self.primary_agent = self.fallback_agents[0]
            self.failed_count = 0

4.3 Monitoring and Alarming

# 監控配置
class AgentMonitoring:
    def __init__(self, deployment: DynamicAgentDeployment):
        self.deployment = deployment
        self.metrics = {
            "latency": [],
            "cost": [],
            "error_rate": []
        }

    def collect_metrics(self):
        # 收集指標
        for _ in range(100):
            # 模擬請求
            input = generate_test_input()
            result = self.deployment.get_agent().run(input)

            # 記錄指標
            self.metrics["latency"].append(result.latency)
            self.metrics["cost"].append(result.cost)
            self.metrics["error_rate"].append(result.error_rate)

    def get_alerts(self) -> List[str]:
        alerts = []

        # 延遲告警
        if self.metrics["latency"].p99 > 5000:  # 99 分位延遲 > 5s
            alerts.append("High latency detected: P99 latency > 5s")

        # 成本告警
        if self.metrics["cost"].mean > 0.1:  # 平均成本 > $0.1/請求
            alerts.append("High cost detected: Average cost > $0.1/request")

        # 錯誤率告警
        if self.metrics["error_rate"].mean > 5:  # 平均錯誤率 > 5%
            alerts.append("High error rate detected: Error rate > 5%")

        return alerts

Part 5: Practical Cases

5.1 Customer Support Automation System

System architecture

# 客戶支持 Agent 系統
class CustomerSupportAgent:
    def __init__(self):
        self.router = Agent(
            model=Model(provider="openai", model="gpt-5.5"),
            system_prompt="Route user requests to appropriate handlers"
        )
        self.info_agent = Agent(
            model=Model(provider="openai", model="gpt-5.4"),
            system_prompt="Provide information and answers"
        )
        self.tech_agent = Agent(
            model=Model(provider="openai", model="gpt-5.3"),
            system_prompt="Technical troubleshooting"
        )
        self.human_agent = Agent(
            model=Model(provider="openai", model="gpt-5.2"),
            system_prompt="Escalate to human support"
        )

    def handle_request(self, user_input: str) -> str:
        # 路由到適當的 Agent
        response = self.router.run(
            user_input,
            handoff_to=[self.info_agent, self.tech_agent, self.human_agent]
        )
        return response

Runtime configuration

# 生產配置
support_config = {
    "agent": CustomerSupportAgent(),
    "guardrails": [
        Guardrail(
            name="sensitive_data",
            action=GuardrailAction.APPROVAL_REQUIRED,
            conditions=["user_input contains 'password'", "user_input contains 'token'"]
        )
    ],
    "human_review": True,
    "review_threshold": 0.7,
    "monitoring": {
        "latency_target_ms": 2000,
        "max_cost_per_request_usd": 0.05,
        "max_error_rate_percent": 1
    }
}

Measurable results

Pre-deployment baseline:

Average response time: 5s
Error rate: 15%
Average cost: $0.15/request

Results after deployment:

Average response time: 1.8s (↓ 64%)
Error rate: 2% (↓ 87%)
Average cost: $0.03/request (↓ 80%)
Manual intervention rate: 25% (expected)

ROI Analysis:

Customer satisfaction improvement: 30%
Labor cost savings: $50,000/month
Estimated payback period: 2 months

5.2 Data Analysis Agent System

System architecture

# 數據分析 Agent 系統
class DataAnalysisAgent:
    def __init__(self):
        self.router = Agent(
            model=Model(provider="openai", model="gpt-5.5"),
            system_prompt="Coordinate data analysis tasks"
        )
        self.query_agent = Agent(
            model=Model(provider="openai", model="gpt-5.4"),
            system_prompt="Data query and retrieval"
        )
        self.processing_agent = Agent(
            model=Model(provider="openai", model="gpt-5.3"),
            system_prompt="Data processing and analysis"
        )
        self.visualization_agent = Agent(
            model=Model(provider="openai", model="gpt-5.2"),
            system_prompt="Data visualization"
        )

    def analyze_data(self, query: str) -> Dict:
        # 多 Agent 協作
        result = self.router.run(
            f"Analyze {query} using data agents",
            handoff_to=[self.query_agent, self.processing_agent, self.visualization_agent]
        )
        return result

Observability configuration

# 觀測性配置
observability_config = {
    "latency_tracking": True,
    "token_usage_tracking": True,
    "error_tracking": True,
    "agent_handoffs": True,
    "metrics_exporters": [
        {"type": "prometheus", "endpoint": "http://metrics:9090"},
        {"type": "elasticsearch", "index": "agent-metrics"}
    ]
}

Conclusion: Key decision points

6.1 Architectural Decisions

Operating mode selection:
- Non-sandbox: Quick to go online, but high security risks
- Sandbox: safe isolation, but added latency
Multi-Agent collaboration:
- Single Agent: simple, but limited functionality -Multi-Agent: powerful but complex
Guardrails Strategy:
- No approval: best experience, but high risk
- Conditional Approval: Balancing security and experience
- Full approval: safest, but poor experience

6.2 Evaluation Strategy

Benchmark test set design:
- Cover different scenarios
- Set clear desired output
- Statistics grouped by category
Indicator Selection:
- Delay: P50, P99, P999
- Cost: Token usage, estimated cost
- Error rate: distribution of error types
Monitoring configuration:
- Real-time monitoring: delays, costs, error rates
- Regular analysis: trends, anomaly detection
- Alarm threshold: adjusted according to business needs

6.3 Deployment strategy

Expansion Strategy:
- Static deployment: low traffic, simple
- Dynamic deployment: high traffic, elasticity
Failure Recovery:
- Retry mechanism: exponential backoff
- Failover: multi-model standby
Monitoring Alarm:
- Instant alert: delay, cost, error rate exceeds thresholds
- Autoscaling: adjusts based on load

6.4 Quantifiable Tradeoff

Decision Point	Option A	Option B	Tradeoff	Impact
Run Mode	Non-Sandbox	Sandbox	Security vs Latency	0-500ms Latency
Multiple Agents	Single Agent	Multiple Agents	Functionality vs Complexity	50-200ms processing time
Guardrails	No Approval	Conditional Approval	Experience vs Security	500-2000ms Latency
Scaling Strategies	Static Deployment	Dynamic Deployment	Elasticity vs Cost	10-50% Cost Increase

Practical suggestions

Start Simple: Quickly validate the concept using single agent and non-sandbox mode first
Gradually increase complexity: Gradually add multiple Agents, Guardrails, and monitoring
Set clear indicators: latency < 2s, cost < $0.05/request, error rate < 1%
Continuous monitoring: real-time monitoring + regular analysis + automatic alarm
Iterative Optimization: Adjust architecture and configuration based on indicators

Reference Resources: