OpenAI Agents SDK Production Implementation Guide: Build, Deploy, and Govern Agent Systems at Scale 2026
Step-by-step guide to building production-ready agent systems with OpenAI Agents SDK, including architecture patterns, guardrails, observability, and measurable metrics
This article is one route in OpenClaw's external narrative arc.
Core observation: In 2026, developers need more than the concept of an agent. They need concrete, production-grade implementation guidance that covers architecture patterns, guardrails, observability, and measurable metrics.
Preface: Why a production-grade agent implementation guide?
In 2026, AI agents have reached the turning point from concept to practice. For many teams the challenge is no longer how to call the API, but rather:
- How to build a scalable agent architecture: from simple chatbots to complex collaborative systems
- How to deploy safely in production: guardrails, approval flows, error handling
- How to monitor and evaluate agent systems: latency, cost, error rate, observability
- How to govern agent behavior: policy enforcement, review mechanisms, audit trails
This article provides a complete implementation guide covering each stage from architecture design to production deployment.
Part 1: Agent system architecture patterns
1.1 Core architectural concepts
Agent definition and model selection
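from openai.agents import Agent, Model

# Basic agent definition.
# Note: this import path and constructor are illustrative. The published
# openai-agents package is imported as `agents` and builds agents with
# Agent(name=..., instructions=...); check the SDK docs before copying.
agent = Agent(
    model=Model(
        provider="openai",
        model="gpt-5.5",
    ),
    # {domain} is a placeholder to fill in before use
    system_prompt="You are a helpful assistant specialized in {domain}",
    tools=[weather_tool, database_query_tool],
)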
Architectural decisions:
- Model provider: OpenAI vs Anthropic vs Google, depending on cost, latency, and feature requirements
- System prompt: inline vs external file. Inline is more flexible; an external file is easier to maintain
- Tool set: static definition vs dynamic loading. Static definition is safer; dynamic loading is more flexible
Tradeoff:
- An inline system prompt is easier to A/B test and iterate on, but can lead to prompt bloat
- A system prompt kept in an external file is easier to version-control and review, but requires extra file management
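One way to get the review and version-control benefits of an external prompt file while keeping the `{domain}` placeholder is to load the file and fill it in at startup. The sketch below is only an illustration under stated assumptions: the helper name `load_system_prompt` and the file name are hypothetical, and it uses the Python standard library rather than any SDK feature.

```python
from pathlib import Path
import tempfile


def load_system_prompt(path: Path, **variables: str) -> str:
    """Read a prompt template from disk and fill in its {placeholders}."""
    template = path.read_text(encoding="utf-8")
    # str.format raises KeyError when a placeholder has no value, so a
    # misconfigured prompt fails at startup instead of at request time.
    return template.format(**variables).strip()


# Demo: a temporary file stands in for a version-controlled prompt file.
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "assistant.txt"  # hypothetical file name
    prompt_file.write_text(
        "You are a helpful assistant specialized in {domain}\n", encoding="utf-8"
    )
    system_prompt = load_system_prompt(prompt_file, domain="logistics")

print(system_prompt)
```

Because `str.format` treats every brace as a placeholder, a prompt that contains literal JSON would need its braces doubled or a different templating approach.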
1.2 Agent execution modes
Non-sandboxed mode (host environment)
# Run the agent directly, with access to the host environment
agent.run(
    user_input="Get weather for San Francisco"
)
Applicable scenarios:
- Internal toolchains (e.g. databases, file systems)
- Agents that need direct access to host resources
- Rapid prototyping
Risks:
- The agent may access sensitive system resources
- Agent behavior is hard to isolate
- Audit trails become complex
Sandbox mode (isolated environment)
# Run the agent in a sandbox to restrict what it can access
agent.run_in_sandbox(
    user_input="Get weather for San Francisco",
    sandbox_type="docker",
)
Applicable scenarios:
- Agent behavior that must be isolated
- Multi-tenant environments
- Security-sensitive applications
Advantages:
- Agent behavior is fully isolated
- Easier to audit and trace
- A failed sandbox can be restarted
Tradeoff:
- Longer startup time (typically 100-500 ms per sandbox)
- Higher resource consumption (additional containers or virtual machines)
- Requires additional network configuration
1.3 Collaboration and handoff patterns
Agent-to-agent handoff (orchestration)
# Multi-agent collaboration
orchestrator = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    system_prompt="Coordinate between specialized agents",
)
specialist1 = Agent(
    model=Model(provider="openai", model="gpt-5.4"),
    system_prompt="Specialist in data analysis",
)
specialist2 = Agent(
    model=Model(provider="openai", model="gpt-5.3"),
    system_prompt="Specialist in visualization",
)
# Handoff flow
result = orchestrator.run(
    user_input="Analyze and visualize sales data",
    handoff_to=[specialist1, specialist2],
)
Architectural patterns:
- Router agent: routes requests to the appropriate specialist agent
- Specialist agents: handle tasks in a specific domain
- Coordinator: coordinates collaboration among multiple agents
Tradeoff:
- Multi-agent collaboration allows greater specialization
- But it adds complexity and latency (typically +50-200 ms of processing time)
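The router and specialist roles can be illustrated without any SDK. In the sketch below each specialist is a plain function and the routing decision is a keyword match, which is only a stand-in for the model-driven decision a router agent would make. All names here are hypothetical.

```python
from typing import Callable, Dict

# A specialist is modeled as a function; in a real system it would wrap an agent call.
Specialist = Callable[[str], str]


def analysis_specialist(request: str) -> str:
    return f"[analysis] {request}"


def visualization_specialist(request: str) -> str:
    return f"[visualization] {request}"


def route(request: str, specialists: Dict[str, Specialist], default: str) -> str:
    """Send the request to the first specialist whose keyword appears in it."""
    lowered = request.lower()
    for keyword, handler in specialists.items():
        if keyword in lowered:
            return handler(request)
    # No keyword matched: fall back to the default specialist.
    return specialists[default](request)


specialists = {
    "chart": visualization_specialist,
    "analyze": analysis_specialist,
}
print(route("Analyze sales data", specialists, default="analyze"))
```

Keeping the routing table as data makes it easy to add a specialist, and the explicit default avoids dropping requests that match nothing.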
Part 2: Safeguards in production
2.1 Guardrails
Approvals
# Configure the approval flow
agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[database_write, file_modify],
    guardrails=[
        Guardrail(
            name="sensitive_data_access",
            action=GuardrailAction.APPROVAL_REQUIRED,
            conditions=["user_input contains 'password'", "user_input contains 'secret'"],
        )
    ],
)
Approval strategies:
- Always required: every sensitive operation needs approval
- Conditionally required: decided by operation type and context
- Never required: no approval needed (test environments only)
Tradeoff:
- Always requiring approval is the most secure, but gives the worst user experience
- Never requiring approval gives the best user experience, but carries high security risk
- Conditional approval balances security and user experience
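The three strategies can be expressed as one small decision function. This is a sketch under assumptions: the enum, the set of sensitive tool names, and the keyword list are illustrative choices borrowed from the example above, not an SDK interface.

```python
from enum import Enum


class ApprovalPolicy(Enum):
    ALWAYS = "always"
    CONDITIONAL = "conditional"
    NEVER = "never"


# Hypothetical classification of tools and keywords, taken from the example above.
SENSITIVE_TOOLS = {"database_write", "file_modify"}
SENSITIVE_KEYWORDS = ("password", "secret")


def requires_approval(policy: ApprovalPolicy, tool_name: str, user_input: str) -> bool:
    """Decide whether a tool call must wait for approval."""
    if policy is ApprovalPolicy.NEVER:
        return False
    if policy is ApprovalPolicy.ALWAYS:
        # Every call to a sensitive tool needs approval.
        return tool_name in SENSITIVE_TOOLS
    # CONDITIONAL: a sensitive tool AND input that mentions a sensitive keyword.
    text = user_input.lower()
    return tool_name in SENSITIVE_TOOLS and any(k in text for k in SENSITIVE_KEYWORDS)


print(requires_approval(ApprovalPolicy.CONDITIONAL, "database_write", "reset my Password"))
```

Keyword matching is deliberately crude here. A production policy would more likely key on the operation and its arguments than on the raw user text.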
2.2 Human Review
# Configure the human review flow
agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[database_write],
    human_review_enabled=True,
    review_threshold=0.8,  # human review is required when confidence is below 0.8
)
Review triggers:
- Confidence threshold: the agent's output confidence falls below the threshold
- Risk score: the risk score exceeds the threshold
- Specific operations: sensitive operations (write, delete, send)
Tradeoff:
- Human review adds latency (typically +500-2000 ms)
- But it provides human verification and reduces erroneous output
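The three triggers combine naturally into a function that returns the reasons a response should be held for review. This sketch assumes the confidence and risk scores are produced elsewhere (for example by a separate classifier), since a model response does not carry a calibrated confidence by default. The risk threshold is a made-up value.

```python
from dataclasses import dataclass
from typing import List

SENSITIVE_OPERATIONS = {"write", "delete", "send"}


@dataclass(frozen=True)
class ReviewThresholds:
    min_confidence: float = 0.8  # mirrors review_threshold in the example above
    max_risk_score: float = 0.5  # hypothetical value


def review_reasons(
    confidence: float,
    risk_score: float,
    operation: str,
    thresholds: ReviewThresholds = ReviewThresholds(),
) -> List[str]:
    """Return every trigger that fired; an empty list means no review is needed."""
    reasons = []
    if confidence < thresholds.min_confidence:
        reasons.append("low_confidence")
    if risk_score > thresholds.max_risk_score:
        reasons.append("high_risk")
    if operation in SENSITIVE_OPERATIONS:
        reasons.append("sensitive_operation")
    return reasons


print(review_reasons(confidence=0.7, risk_score=0.9, operation="delete"))
```

Returning the reasons rather than a bare boolean gives the reviewer and the audit trail the context for why a response was held.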
2.3 Tool whitelist
# Tool whitelist mode
allowed_tools = [
    "weather_tool",
    "currency_converter",
    "calendar_lookup",
]
agent = Agent(
    model=Model(provider="openai", model="gpt-5.5"),
    tools=[weather_tool, currency_converter, calendar_lookup],
    tool_whitelist_enabled=True,
)
Advantages:
- Prevents the agent from calling unauthorized tools
- Makes tool access permissions easy to manage
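A whitelist only helps if it is enforced at the point where a tool is actually invoked. The sketch below shows one way to do that in plain Python. `ToolRegistry` and `ToolNotAllowedError` are hypothetical names, and the lambdas stand in for real tools.

```python
from typing import Callable, Dict, Iterable


class ToolNotAllowedError(Exception):
    """Raised when a call names a tool outside the allowlist."""


class ToolRegistry:
    def __init__(self, tools: Dict[str, Callable[..., object]], allowed: Iterable[str]):
        self._tools = tools
        self._allowed = frozenset(allowed)

    def call(self, name: str, *args, **kwargs):
        # Check the allowlist before the lookup, so a registered but
        # unapproved tool is rejected the same way as an unknown one.
        if name not in self._allowed:
            raise ToolNotAllowedError(name)
        return self._tools[name](*args, **kwargs)


registry = ToolRegistry(
    tools={
        "currency_converter": lambda amount, rate: round(amount * rate, 2),
        "database_write": lambda row: "written",  # registered but not allowed
    },
    allowed=["weather_tool", "currency_converter", "calendar_lookup"],
)
print(registry.call("currency_converter", 10, 1.5))
```

Here `database_write` is registered yet rejected, which is the case a whitelist exists to catch.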
Part 3: Observability and Evaluation
3.1 Observability metrics
Latency metrics
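import time
from typing import Dict

# Layered latency measurement
def measure_latency(user_input: str, agent: Agent) -> Dict[str, float]:
    """
    Measure the end-to-end latency of an agent call, broken down by stage:
    - Input processing: time spent processing the input
    - Reasoning: time spent reasoning/thinking
    - Tool execution: time spent in tool calls
    - Output generation: time spent generating the output
    """
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    agent.run(user_input)
    end = time.perf_counter()
    return {
        "total_latency_ms": (end - start) * 1000,
        "input_processing_ms": 0,   # placeholder: needs real instrumentation
        "reasoning_ms": 0,          # placeholder: needs real instrumentation
        "tool_execution_ms": 0,     # placeholder: needs real instrumentation
        "output_generation_ms": 0,  # placeholder: needs real instrumentation
    }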
Key metrics:
- P50 latency: median latency (the core of user experience)
- P99 latency: 99th-percentile latency (anomaly detection)
- P999 latency: 99.9th-percentile latency (extreme cases)
Tradeoff:
- Measuring latency at every stage gives comprehensive insight
- But it adds overhead and complexity
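Given a list of recorded latencies, the three percentiles can be computed with the nearest-rank method. This is a minimal sketch with hypothetical function names. Production systems usually rely on histogram-based estimates from their metrics backend instead of sorting raw samples.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    return {
        "p50": percentile(samples_ms, 50),
        "p99": percentile(samples_ms, 99),
        "p999": percentile(samples_ms, 99.9),
    }


# Synthetic data: 1000 requests taking 1 ms, 2 ms, ..., 1000 ms.
latencies_ms = [float(i) for i in range(1, 1001)]
print(latency_summary(latencies_ms))
```

P999 is only meaningful with on the order of a thousand samples or more. With fewer, it collapses to the maximum.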
Cost metrics
from typing import Dict

# Token usage measurement
def measure_cost(user_input: str, agent: Agent) -> Dict[str, float]:
    """
    Measure token usage and estimate cost:
    - Input tokens: number of input tokens
    - Output tokens: number of output tokens
    - Cost: estimated cost in USD for this request
    """
    result = agent.run(user_input)
    input_tokens = result.usage.input_tokens
    output_tokens = result.usage.output_tokens
    total_tokens = input_tokens + output_tokens
    # OpenAI pricing (2026) as given here; verify against the current price list
    cost_per_1k_input = 0.0025  # $0.0025 per 1K input tokens
    cost_per_1k_output = 0.01   # $0.01 per 1K output tokens
    estimated_cost = (
        (input_tokens / 1000) * cost_per_1k_input
        + (output_tokens / 1000) * cost_per_1k_output
    )
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total_tokens,
        "estimated_cost_usd": estimated_cost,
    }
Cost optimization strategies:
- Prompt trimming: remove unnecessary context
- Token caching: reuse context across calls
- Model selection: choose a model suited to the task type
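As one concrete form of prompt trimming, conversation history can be cut to a token budget by keeping only the most recent messages. The sketch below assumes a very rough estimate of four characters per token. That figure is an approximation for illustration only, and a real tokenizer should replace `estimate_tokens` in practice.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: List[str], max_tokens: int) -> List[str]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["a" * 400, "b" * 200, "c" * 100]  # about 100, 50, and 25 tokens
print(len(trim_history(history, max_tokens=80)))
```

Dropping the oldest messages first is the simplest policy. Summarizing them is a common alternative when early context still matters.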
Error rate metrics
from typing import Any, Dict

# Error rate monitoring
def measure_error_rate(user_input: str, agent: Agent) -> Dict[str, Any]:
    """
    Measure the error rate and error types:
    - Total calls: total number of calls
    - Failed calls: number of failed calls
    - Error rate: error rate (percent)
    - Error types: distribution of error types
    """
    # Simulated data: replace with counters collected from real traffic
    total_calls = 1000
    failed_calls = 50
    error_types = {
        "timeout": 20,
        "rate_limit": 15,
        "validation_error": 10,
        "tool_error": 5,
    }
    return {
        "total_calls": total_calls,
        "failed_calls": failed_calls,
        "error_rate": (failed_calls / total_calls) * 100,
        "error_types": error_types,
    }
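The function above returns simulated numbers. A sketch of how the same summary could be built from real calls follows: a small tracker wraps each call, counts it, and tallies failures by exception class. The class name and the demo functions are hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Dict


class ErrorRateTracker:
    def __init__(self) -> None:
        self.total_calls = 0
        self.errors: Counter = Counter()

    def record(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Run fn, count the call, and tally any exception by its class name."""
        self.total_calls += 1
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            self.errors[type(exc).__name__] += 1
            raise  # the caller still sees the failure

    def summary(self) -> Dict[str, Any]:
        failed = sum(self.errors.values())
        rate = (failed / self.total_calls) * 100 if self.total_calls else 0.0
        return {
            "total_calls": self.total_calls,
            "failed_calls": failed,
            "error_rate": rate,
            "error_types": dict(self.errors),
        }


def ok() -> str:
    return "ok"


def slow() -> str:
    raise TimeoutError("simulated timeout")


tracker = ErrorRateTracker()
for _ in range(3):
    tracker.record(ok)
try:
    tracker.record(slow)
except TimeoutError:
    pass
print(tracker.summary())
```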
3.2 Agent Evaluation Framework
Benchmark test set
# Benchmark test set definition
evaluation_set = [
    {
        "input": "What is the weather in Tokyo?",
        "expected_output": "Sunny, 22°C",
        "category": "information_retrieval",
    },
    {
        "input": "Analyze the sales data for Q1",
        "expected_output": "Positive growth trend",
        "category": "analysis",
    },
    {
        "input": "Schedule a meeting with team",
        "expected_output": "Confirmation message",
        "category": "task_management",
    },
]
from typing import Dict, List

# Run the benchmark
def run_evaluations(agent: Agent, evaluation_set: List[Dict]) -> Dict:
    results = {
        "total_tests": len(evaluation_set),
        "correct": 0,
        "partial_correct": 0,  # reserved; this loop only scores correct/incorrect
        "incorrect": 0,
        "category_results": {},
    }
    for test in evaluation_set:
        result = agent.run(test["input"])
        # Tally per category
        category = test["category"]
        stats = results["category_results"].setdefault(category, {"correct": 0, "total": 0})
        stats["total"] += 1
        # Compare the result (is_correct is a user-supplied comparison function)
        if is_correct(result, test["expected_output"]):
            results["correct"] += 1
            stats["correct"] += 1  # the per-category count is now updated as well
        else:
            results["incorrect"] += 1
    # Compute accuracy, guarding against an empty evaluation set
    total = results["total_tests"]
    results["accuracy"] = (results["correct"] / total) * 100 if total else 0.0
    return results
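The evaluation loop relies on an `is_correct` function that the listing does not define. One simple possibility, shown here as an assumption and not as the author's method, is a normalized substring match on plain strings. The listing passes the run result object, so a real version would first extract its text.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def is_correct(actual: str, expected: str) -> bool:
    """Count an answer as correct if the normalized expected text appears in it."""
    return normalize(expected) in normalize(actual)


print(is_correct("It is  SUNNY, 22°C in Tokyo right now.", "Sunny, 22°C"))
```

String matching suits closed-form answers only. Expected outputs such as "Positive growth trend" or live weather data generally call for rubric-based or model-graded comparison.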
Part 4: Production Deployment Strategy
4.1 Deployment mode selection
Static deployment (single instance)
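from threading import Lock, Semaphore

# Static deployment mode
class StaticAgentDeployment:
    def __init__(self, agent: Agent, max_concurrent: int = 10):
        self.agent = agent
        self.max_concurrent = max_concurrent
        self.active_requests = 0
        self.semaphore = Semaphore(max_concurrent)
        self._counter_lock = Lock()  # protects active_requests across threads

    def process_request(self, user_input: str) -> str:
        with self.semaphore:  # blocks once max_concurrent requests are in flight
            with self._counter_lock:
                self.active_requests += 1
            try:
                return self.agent.run(user_input)
            finally:
                with self._counter_lock:
                    self.active_requests -= 1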
Applicable scenarios:
- Low-traffic applications
- Simple agent systems
- Fast time to launch
Limitations:
- Single instance; cannot scale horizontally
- Latency increases under high traffic
Dynamic deployment (autoscaling)
# Autoscaling mode
class DynamicAgentDeployment:
    def __init__(self, agent_template: Agent, min_instances: int = 5, max_instances: int = 50):
        self.agent_template = agent_template
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.instances = []
        self.current_load = 0  # load as a percentage (0-100)

    def auto_scale(self):
        # Adjust the number of instances based on load
        if self.current_load > 80:
            self.scale_up()
        elif self.current_load < 20:
            self.scale_down()

    def scale_up(self):
        # Add a new instance, but never exceed max_instances
        if len(self.instances) < self.max_instances:
            new_agent = self.agent_template.clone()
            self.instances.append(new_agent)
            # Start the new instance here

    def scale_down(self):
        # Remove an instance, but never drop below min_instances
        if len(self.instances) > self.min_instances:
            agent_to_remove = self.instances.pop()
            # Stop agent_to_remove here
Applicable scenarios:
- High-traffic applications
- Multi-tenant environments
- Variable load
Tradeoff:
- Autoscaling provides elasticity
- But it adds complexity and cost
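Adding or removing one instance per check reacts slowly to large load swings. A common alternative is proportional scaling, which sizes the pool so that average load moves toward a target. The sketch below is an illustration under assumptions: the function name and the 50% target are hypothetical, and the bounds reuse the defaults shown above.

```python
import math


def desired_instances(
    current: int,
    load_percent: float,
    min_instances: int = 5,
    max_instances: int = 50,
    target_percent: float = 50.0,  # hypothetical target utilization
) -> int:
    """Proportional scaling: ceil(current * load / target), clamped to the bounds."""
    if current <= 0:
        return min_instances
    wanted = math.ceil(current * load_percent / target_percent)
    return max(min_instances, min(max_instances, wanted))


print(desired_instances(current=10, load_percent=90))
```

In practice a cooldown period between scaling decisions helps avoid oscillation when load hovers near the target.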
4.2 Failure recovery strategy
Retry mechanism
import time

# Retry with exponential backoff
def execute_with_retry(agent: Agent, user_input: str, max_retries: int = 3) -> str:
    last_exception = None
    for attempt in range(max_retries):
        try:
            return agent.run(user_input)
        except Exception as e:
            last_exception = e
            if attempt < max_retries - 1:  # no need to wait after the final attempt
                # Exponential backoff: 100 ms, 200 ms, 400 ms, ...
                delay_ms = 100 * (2 ** attempt)
                time.sleep(delay_ms / 1000)
    # All attempts failed: re-raise the last error
    raise last_exception
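When many clients fail at the same moment, identical backoff schedules make them retry in lockstep. Capping the delay and adding random jitter spreads the retries out. The generator below is a sketch with hypothetical names and default values. Jitter is applied only when a random source is passed in, so the deterministic schedule stays easy to inspect.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    max_retries: int,
    base_ms: float = 100.0,
    cap_ms: float = 5000.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one delay per retry: exponential growth, capped at cap_ms."""
    for attempt in range(max_retries):
        ceiling = min(cap_ms, base_ms * (2 ** attempt))
        if rng is not None:
            # "Full jitter": pick uniformly between 0 and the ceiling.
            yield rng.uniform(0, ceiling)
        else:
            yield ceiling


print(list(backoff_delays(4)))
```

Passing `rng=random.Random()` enables jitter. A seeded instance keeps tests reproducible.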
Failover
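from typing import List

# Multi-model failover
class AgentFailoverManager:
    def __init__(self, primary_agent: Agent, fallback_agents: List[Agent]):
        self.primary_agent = primary_agent
        self.fallback_agents = list(fallback_agents)
        self.failed_count = 0
        self.max_failures = 3
        self.degraded = False

    def get_agent(self) -> Agent:
        if self.degraded:
            # Every model has failed: return the degraded-mode handler
            # (get_degraded_mode() is left to the application)
            return self.get_degraded_mode()
        return self.primary_agent

    def handle_failure(self):
        self.failed_count += 1
        if self.failed_count >= self.max_failures:
            self.failed_count = 0
            if self.fallback_agents:
                # Switch to the next fallback model
                self.primary_agent = self.fallback_agents.pop(0)
            else:
                # No fallbacks left, so get_agent() now returns degraded mode
                self.degraded = True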
4.3 Monitoring and alerting
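import statistics
from typing import List

# Monitoring configuration
class AgentMonitoring:
    def __init__(self, deployment):
        # deployment must expose get_agent(), e.g. AgentFailoverManager above
        self.deployment = deployment
        self.metrics = {"latency": [], "cost": [], "error_rate": []}

    def collect_metrics(self):
        # Collect metrics
        for _ in range(100):
            # Simulated request; generate_test_input() is a user-supplied helper
            test_input = generate_test_input()
            result = self.deployment.get_agent().run(test_input)
            # Record metrics (assumes the result exposes these fields)
            self.metrics["latency"].append(result.latency)
            self.metrics["cost"].append(result.cost)
            self.metrics["error_rate"].append(result.error_rate)

    def get_alerts(self) -> List[str]:
        alerts = []
        latency = self.metrics["latency"]
        cost = self.metrics["cost"]
        errors = self.metrics["error_rate"]
        # Latency alert: P99 > 5 s (quantiles needs at least two samples)
        if len(latency) >= 2 and statistics.quantiles(latency, n=100)[98] > 5000:
            alerts.append("High latency detected: P99 latency > 5s")
        # Cost alert: mean cost > $0.1 per request
        if cost and statistics.mean(cost) > 0.1:
            alerts.append("High cost detected: Average cost > $0.1/request")
        # Error rate alert: mean error rate > 5%
        if errors and statistics.mean(errors) > 5:
            alerts.append("High error rate detected: Error rate > 5%")
        return alerts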
Part 5: Case studies
5.1 Customer Support Automation System
System architecture
# Customer support agent system
class CustomerSupportAgent:
    def __init__(self):
        self.router = Agent(
            model=Model(provider="openai", model="gpt-5.5"),
            system_prompt="Route user requests to appropriate handlers",
        )
        self.info_agent = Agent(
            model=Model(provider="openai", model="gpt-5.4"),
            system_prompt="Provide information and answers",
        )
        self.tech_agent = Agent(
            model=Model(provider="openai", model="gpt-5.3"),
            system_prompt="Technical troubleshooting",
        )
        self.human_agent = Agent(
            model=Model(provider="openai", model="gpt-5.2"),
            system_prompt="Escalate to human support",
        )

    def handle_request(self, user_input: str) -> str:
        # Route to the appropriate agent
        response = self.router.run(
            user_input,
            handoff_to=[self.info_agent, self.tech_agent, self.human_agent],
        )
        return response
Runtime configuration
# Production configuration
support_config = {
    "agent": CustomerSupportAgent(),
    "guardrails": [
        Guardrail(
            name="sensitive_data",
            action=GuardrailAction.APPROVAL_REQUIRED,
            conditions=["user_input contains 'password'", "user_input contains 'token'"],
        )
    ],
    "human_review": True,
    "review_threshold": 0.7,
    "monitoring": {
        "latency_target_ms": 2000,
        "max_cost_per_request_usd": 0.05,
        "max_error_rate_percent": 1,
    },
}
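The `monitoring` targets in this configuration become useful once measured values are compared against them. The sketch below shows one way to do that. The mapping from target keys to metric names, and the metric names themselves, are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical mapping from each configured target to the metric it constrains.
TARGET_TO_METRIC = {
    "latency_target_ms": "latency_ms",
    "max_cost_per_request_usd": "cost_per_request_usd",
    "max_error_rate_percent": "error_rate_percent",
}


def violations(targets: Dict[str, float], measured: Dict[str, float]) -> List[str]:
    """Return one message for each measured value that exceeds its target."""
    messages = []
    for target_key, metric_key in TARGET_TO_METRIC.items():
        limit = targets[target_key]
        value = measured[metric_key]
        if value > limit:
            messages.append(f"{metric_key}={value} exceeds {target_key}={limit}")
    return messages


targets = {
    "latency_target_ms": 2000,
    "max_cost_per_request_usd": 0.05,
    "max_error_rate_percent": 1,
}
measured = {"latency_ms": 1800, "cost_per_request_usd": 0.03, "error_rate_percent": 2}
print(violations(targets, measured))
```

With these sample values only the error rate is flagged: 2% measured against a 1% target.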
Measurable results
Pre-deployment baseline:
- Average response time: 5s
- Error rate: 15%
- Average cost: $0.15/request
Results after deployment:
- Average response time: 1.8s (↓ 64%)
- Error rate: 2% (↓ 87%)
- Average cost: $0.03/request (↓ 80%)
- Human intervention rate: 25% (as expected)
ROI Analysis:
- Customer satisfaction improvement: 30%
- Labor cost savings: $50,000/month
- Estimated payback period: 2 months
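The percentage reductions quoted above follow from (before - after) / before. A short check, using only the before and after figures listed in this section, reproduces them after rounding to whole percentages:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


# (before, after) pairs copied from the baseline and post-deployment lists.
results = {
    "response_time_s": (5.0, 1.8),
    "error_rate_percent": (15.0, 2.0),
    "cost_per_request_usd": (0.15, 0.03),
}
for metric, (before, after) in results.items():
    print(f"{metric}: {percent_reduction(before, after):.0f}% lower")
```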
5.2 Data Analysis Agent System
System architecture
from typing import Dict

# Data analysis agent system
class DataAnalysisAgent:
    def __init__(self):
        self.router = Agent(
            model=Model(provider="openai", model="gpt-5.5"),
            system_prompt="Coordinate data analysis tasks",
        )
        self.query_agent = Agent(
            model=Model(provider="openai", model="gpt-5.4"),
            system_prompt="Data query and retrieval",
        )
        self.processing_agent = Agent(
            model=Model(provider="openai", model="gpt-5.3"),
            system_prompt="Data processing and analysis",
        )
        self.visualization_agent = Agent(
            model=Model(provider="openai", model="gpt-5.2"),
            system_prompt="Data visualization",
        )

    def analyze_data(self, query: str) -> Dict:
        # Multi-agent collaboration
        result = self.router.run(
            f"Analyze {query} using data agents",
            handoff_to=[self.query_agent, self.processing_agent, self.visualization_agent],
        )
        return result
Observability configuration
# Observability configuration
observability_config = {
    "latency_tracking": True,
    "token_usage_tracking": True,
    "error_tracking": True,
    "agent_handoffs": True,
    "metrics_exporters": [
        {"type": "prometheus", "endpoint": "http://metrics:9090"},
        {"type": "elasticsearch", "index": "agent-metrics"},
    ],
}
Conclusion: Key decision points
6.1 Architectural decisions
Execution mode:
- Non-sandboxed: fast to launch, but high security risk
- Sandboxed: secure isolation, but added latency
Multi-agent collaboration:
- Single agent: simple, but limited in functionality
- Multi-agent: powerful, but more complex
Guardrails strategy:
- No approval: best experience, but high risk
- Conditional approval: balances security and experience
- Always-required approval: most secure, but poor experience
6.2 Evaluation strategy
Benchmark test set design:
- Cover different scenarios
- Define clear expected outputs
- Group results by category
Metric selection:
- Latency: P50, P99, P999
- Cost: token usage, estimated cost
- Error rate: distribution of error types
Monitoring configuration:
- Real-time monitoring: latency, cost, error rate
- Periodic analysis: trends, anomaly detection
- Alert thresholds: tuned to business needs
6.3 Deployment strategy
Scaling strategy:
- Static deployment: low traffic, simple
- Dynamic deployment: high traffic, elastic
Failure recovery:
- Retry mechanism: exponential backoff
- Failover: multiple fallback models
Monitoring and alerting:
- Real-time alerts: latency, cost, or error rate exceeds its threshold
- Autoscaling: adjusts to load
6.4 Quantified tradeoffs
| Decision point | Option A | Option B | Tradeoff | Impact |
|---|---|---|---|---|
| Execution mode | Non-sandboxed | Sandboxed | Security vs latency | 0-500 ms latency |
| Agent topology | Single agent | Multi-agent | Functionality vs complexity | 50-200 ms processing time |
| Guardrails | No approval | Conditional approval | Experience vs security | 500-2000 ms latency |
| Scaling strategy | Static deployment | Dynamic deployment | Elasticity vs cost | 10-50% cost increase |
Practical recommendations
- Start simple: validate the concept quickly with a single agent in non-sandboxed mode
- Add complexity gradually: introduce multiple agents, guardrails, and monitoring step by step
- Set clear targets: latency < 2 s, cost < $0.05/request, error rate < 1%
- Monitor continuously: real-time monitoring, periodic analysis, and automatic alerts
- Iterate: adjust architecture and configuration based on the metrics