AI Agent SLO-Driven Operations: Implementation Guide with Measurable KPIs and ROI (2026) 🐯

Production-ready SLO-driven operations for AI agents: measurable KPIs, ROI calculations, and deployment scenarios with concrete tradeoffs.

April 30, 2026 · 6 min read · Beginner

Security Orchestration Interface Infrastructure

This article is one route in OpenClaw's external narrative arc.

Core insight: In 2026, working with an AI Agent is no longer a matter of running it but of operating it. SLO (service level objective) driven operations replace experience-driven decisions with data-driven ones, using measurable KPIs to guide operational decisions.

Preface: Why are SLOs the foundation of AI Agent operations?

In 2026, AI Agents are moving from the lab into production, but most deployments are still at the “it runs” stage. Real operational excellence requires:

Measurability: every key metric has a numerical definition
Traceability: KPI trends can be tracked continuously
Optimizability: optimization decisions are based on data
Investability: the ROI of each operational optimization can be calculated

This article provides a complete implementation guide to SLO-driven operations, covering:

Five-tier SLO architecture: business, functional, performance, availability, cost
KPI definition and calculation: a formula and threshold for each metric
ROI-driven optimization: methods for quantifying cost reduction and value gains
Deployment scenarios: progressive implementation from MVP to enterprise level

Tier 1: Business SLO - Business Value Verification

1.1 Core business metrics

Metric 1: Task Completion Rate (TCR)

Definition: Number of successfully completed Agent tasks / Total number of tasks
Calculation formula: TCR = (Successful Completions / Total Tasks) * 100%
Threshold: ≥ 95% (production environment)
Measurement frequency: real time
Tracking method: OpenTelemetry distributed tracing

Metric 2: User Satisfaction (US)

Definition: Users' subjective rating of the Agent service
Calculation formula: US = (Positive Feedbacks / Total Feedbacks) * 100% (positive feedback rate)
Threshold: ≥ 90%
Measurement Frequency: Weekly
Tracking method: User feedback API

Metric 3: Business ROI

Definition: Net business value generated by the Agent system relative to its operating cost
Calculation formula: ROI = ((Revenue + Cost Savings - Total Cost) / Total Cost) * 100%
Threshold: ≥ 150% (return on investment)
Measurement Frequency: Monthly
Tracking method: Financial system integration

1.2 Operational optimization strategies

Strategy 1: Task completion rate optimization

Issue: TCR drops below 90%
Root cause analysis:
- Tool call failure rate > 5%
- Timeout rate > 10%
- Error handling rate < 95%
Optimization measures:
- Add a retry mechanism (up to 3 retries)
- Set a timeout threshold (30 seconds per tool call)
- Extend error handling (target: at least 95% of errors handled)
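
The retry and timeout measures above could be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the helper name `call_with_retry` is hypothetical, "up to 3" is read as three retries after the first attempt, and the timeout only stops waiting for the call (the worker thread itself is not cancelled).

```python
import concurrent.futures
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(tool: Callable[[], T], max_retries: int = 3,
                    timeout_s: float = 30.0) -> T:
    """Call `tool`, retrying when it raises or exceeds `timeout_s` seconds."""
    last_error = None
    for _attempt in range(1 + max_retries):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        try:
            future = pool.submit(tool)
            # Raises a timeout error if the call takes longer than timeout_s.
            return future.result(timeout=timeout_s)
        except Exception as exc:  # tool failure or timeout
            last_error = exc
        finally:
            # Do not block on a call that may still be running.
            pool.shutdown(wait=False)
    raise RuntimeError(
        f"tool call failed after {1 + max_retries} attempts"
    ) from last_error


# Hypothetical tool that fails twice before succeeding.
calls = {"n": 0}


def flaky_tool() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_retry(flaky_tool, max_retries=3, timeout_s=5.0)
```

In a real system the retry would usually add backoff between attempts and only retry errors known to be transient.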

Strategy 2: User satisfaction improvement

Issue: US drops below 85%
Root cause analysis:
- Low output quality score (< 4/5)
- Long response time (> 30 seconds)
- High error rate (> 5%)
Optimization measures:
- Strengthen output validation (target: at least 95% correct)
- Optimize the response path (target: < 10 seconds)
- Extend error handling (target: error rate < 2%)

ROI optimization example:

Scenario: customer support Agent
- Cost: $10,000/month (model API cost + operations labor)
- Benefit: $25,000/month (reduced human support cost + gains from fewer errors)
- ROI = (($25,000 - $10,000) / $10,000) * 100% = 150%
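
The arithmetic in this scenario can be checked with a few lines of Python. The function name `roi_percent` is an illustrative assumption; the formula and figures are the ones given above, with the monthly benefit standing in for "Revenue + Cost Savings".

```python
def roi_percent(benefit: float, total_cost: float) -> float:
    """Return ROI as a percentage: ((benefit - cost) / cost) * 100."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (benefit - total_cost) / total_cost * 100.0


monthly_cost = 10_000.0     # model API cost + operations labor
monthly_benefit = 25_000.0  # reduced support cost + gains from fewer errors

roi = roi_percent(monthly_benefit, monthly_cost)
meets_threshold = roi >= 150.0  # business ROI threshold from section 1.1

print(f"ROI = {roi:.0f}%, meets threshold: {meets_threshold}")
```

With these numbers the scenario sits exactly on the 150% threshold, so any cost increase or benefit shortfall would put it out of SLO.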

Tier 2: Functional SLO - Functional Reliability and Quality

2.1 Functional metrics

Metric 1: Functional Availability (FA)

Definition: The percentage of the required time during which the Agent's functionality is available
Calculation formula: FA = (Uptime Hours / Total Hours) * 100%
Threshold: ≥ 99.9% (production environment)
Measurement frequency: real time
Tracking method: Prometheus Uptime monitoring

Metric 2: Error Rate (ER)

Definition: The percentage of requests for which the Agent returns an error
Calculation formula: ER = (Error Requests / Total Requests) * 100%
Threshold: ≤ 2% (production environment)
Measurement frequency: real time
Tracking method: OpenTelemetry metrics

Metric 3: Response Time (RT)

Definition: The time from request to response
Calculation formula: RT = Response Time (seconds)
Threshold:
- P50: ≤ 5 seconds
- P95: ≤ 10 seconds
- P99: ≤ 30 seconds
Measurement frequency: real time
Tracking method: Prometheus time series data
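
As a rough illustration of how the percentile thresholds above might be evaluated, the sketch below computes nearest-rank percentiles over a list of latency samples. The function names and the sample data are hypothetical; a production setup would more likely derive percentiles from Prometheus histograms.

```python
import math

# Response time thresholds in seconds, from the metric definition above.
THRESHOLDS_S = {50: 5.0, 95: 10.0, 99: 30.0}


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def check_response_time_slo(samples):
    """Map each percentile label to True when its threshold is met."""
    return {
        f"P{p}": percentile(samples, p) <= limit
        for p, limit in THRESHOLDS_S.items()
    }


# Hypothetical latency samples in seconds.
latencies = [1.2, 2.0, 2.5, 3.1, 3.3, 4.0, 4.8, 6.5, 9.0, 12.0]
slo_result = check_response_time_slo(latencies)
```

For this sample the P50 and P99 targets are met while P95 is not, because the single 12-second request is also the 95th-percentile value in a set of only ten samples.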

2.2 Functional SLO implementation patterns

Pattern 1: Tiered availability strategy

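Level 1: Core functions (Critical) - FA ≥ 99.99%
  - Use cases: payments, orders, security
  - Priority: P0

Level 2: Important functions (Important) - FA ≥ 99.9%
  - Use cases: customer support, reports, notifications
  - Priority: P1

Level 3: Secondary functions (Secondary) - FA ≥ 99.5%
  - Use cases: analytics, reports, statistics
  - Priority: P2

Level 4: Experimental functions (Experimental) - FA ≥ 99%
  - Use cases: new features, trial features
  - Priority: P3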
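
To make the four availability levels above more concrete, the following sketch converts each target into an allowed downtime budget. The 30-day window and the function name are assumptions chosen for illustration.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Availability targets (percent) per priority, from the levels above.
AVAILABILITY_TARGETS = {"P0": 99.99, "P1": 99.9, "P2": 99.5, "P3": 99.0}


def downtime_budget_minutes(target_percent, window_minutes=MINUTES_PER_30_DAYS):
    """Minutes of downtime allowed in the window while still meeting the target."""
    return (100.0 - target_percent) / 100.0 * window_minutes


budgets = {
    priority: downtime_budget_minutes(target)
    for priority, target in AVAILABILITY_TARGETS.items()
}

for priority, minutes in budgets.items():
    print(f"{priority}: {minutes:.2f} minutes of downtime per 30 days")
```

Under these assumptions a P0 function may be down for about 4.3 minutes per 30 days, while an experimental P3 function has roughly 7.2 hours.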

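Pattern 2: Response time optimization strategy

# Response time optimization example
class ResponseTimeOptimizer:
    def __init__(self):
        self.p50_target = 5.0   # seconds
        self.p95_target = 10.0  # seconds
        self.p99_target = 30.0  # seconds

    # Placeholders: each strategy method should return the estimated
    # end-to-end response time (seconds) if that strategy alone were applied.
    def route_requests_based_on_load(self):
        raise NotImplementedError

    def select_model_based_on_complexity(self):
        raise NotImplementedError

    def optimize_tool_calls(self):
        raise NotImplementedError

    def optimize(self):
        # Strategy 1: request routing optimization
        route_time = self.route_requests_based_on_load()

        # Strategy 2: model selection optimization
        model_time = self.select_model_based_on_complexity()

        # Strategy 3: tool call optimization
        tool_time = self.optimize_tool_calls()

        # Keep the best (lowest) of the three estimates.
        total_time = min(route_time, model_time, tool_time)
        return total_time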

Tier 3: Performance SLO - Performance Metrics and Optimization

3.1 Performance metrics

Metric 1: Throughput (T)

Definition: The number of successfully completed requests per unit time
Calculation formula: T = Requests Per Second (RPS)
Threshold: ≥ 100 RPS (Production environment)
Measurement frequency: real time
Tracking method: Prometheus Counter

Metric 2: Cost Efficiency (CE)

Definition: Cost per unit of business value
Calculation formula: CE = Total Cost / Total Business Value
Threshold: ≤ 0.5 (each $1 of cost yields at least $2 of value)
Measurement Frequency: Monthly
Tracking method: Financial system

3.2 Performance optimization strategies

Strategy 1: Throughput optimization

Techniques: model quantization, caching, load balancing
Implementation:
- Model quantization: FP16 → INT8 (50% cost saving)
- Request cache: hit rate ≥ 80%
- Load balancing: at least 3 model instances

Strategy 2: Cost Optimization

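# Cost optimization example
class CostOptimizer:
    def __init__(self):
        self.budget = 10000  # USD per month
        self.target_ce = 0.5

    # Placeholders: each strategy method should return the estimated total
    # monthly cost (USD) if that strategy alone were applied.
    def select_model_based_on_cost(self):
        raise NotImplementedError

    def classify_and_route_requests(self):
        raise NotImplementedError

    def batch_optimize_requests(self):
        raise NotImplementedError

    def adjust_parameters(self):
        # Placeholder: tune the strategies further and return the new cost.
        raise NotImplementedError

    def optimize(self, business_value):
        # Strategy 1: model selection
        model_cost = self.select_model_based_on_cost()

        # Strategy 2: request classification
        class_cost = self.classify_and_route_requests()

        # Strategy 3: batching
        batch_cost = self.batch_optimize_requests()

        # Keep the cheapest of the three estimates.
        total_cost = min(model_cost, class_cost, batch_cost)
        ce = total_cost / business_value

        # Above the CE target: adjust parameters; otherwise accept the cost.
        if ce > self.target_ce:
            return self.adjust_parameters()
        return total_cost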

Tier 4: Availability SLO - System Availability and Recovery

4.1 Availability metrics

Metric 1: System Availability (SA)

Definition: The percentage of time the system operates without faults
Calculation formula: SA = ((Total Hours - Downtime Hours) / Total Hours) * 100%
Threshold: ≥ 99.9% (production environment)
Measurement frequency: real time
Tracking method: Uptime monitoring

Metric 2: Recovery Time (TTR)

Definition: The time from failure to system recovery
Calculation formula: TTR = Time to Recovery (seconds)
Threshold: ≤ 5 minutes, i.e. 300 seconds (production environment)
Measurement frequency: real time
Tracking method: Fault tracking system
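
One way to derive both availability metrics from incident records is sketched below. It assumes incidents are non-overlapping (start, end) pairs that fall inside the measurement window; the function names and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def availability_percent(incidents, window_start, window_end):
    """System availability: time without faults as a percentage of the window."""
    total_s = (window_end - window_start).total_seconds()
    downtime_s = sum((end - start).total_seconds() for start, end in incidents)
    return (total_s - downtime_s) / total_s * 100.0


def worst_recovery_seconds(incidents):
    """Longest time from failure to recovery, in seconds (0.0 if no incidents)."""
    return max(
        ((end - start).total_seconds() for start, end in incidents),
        default=0.0,
    )


window_start = datetime(2026, 4, 1)
window_end = window_start + timedelta(days=30)

# Hypothetical incidents: one 4-minute outage and one 10-minute outage.
incidents = [
    (datetime(2026, 4, 3, 9, 0), datetime(2026, 4, 3, 9, 4)),
    (datetime(2026, 4, 17, 22, 30), datetime(2026, 4, 17, 22, 40)),
]

sa = availability_percent(incidents, window_start, window_end)
worst_recovery = worst_recovery_seconds(incidents)

availability_ok = sa >= 99.9          # availability threshold
recovery_ok = worst_recovery <= 300   # 5-minute recovery threshold
```

In this example the availability threshold is met, but the 10-minute outage breaks the recovery time threshold, which shows why the two metrics are tracked separately.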

4.2 Availability assurance strategies

Strategy 1: High-availability architecture

Architecture pattern:
┌───────────────┐
│ Load Balancer │
└───────┬───────┘
        │
┌───────┴───────┐
│   Service 1   │
│   Service 2   │
│   Service 3   │
└───────────────┘

Strategy 2: Failure Recovery Mechanism

# Failure recovery strategies
recovery-strategies:
  - strategy: auto-restart
    threshold: 3 failures in 5 minutes
    timeout: 30 seconds

  - strategy: failover
    threshold: 1 service unavailable
    target-service: backup-instance

  - strategy: manual-intervention
    threshold: critical-failure
    contact: on-call-engineer
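
The "3 failures in 5 minutes" threshold in the configuration above could be evaluated with a sliding window, as in this sketch. The class name is hypothetical, and it assumes one possible reading of the config: the strategy fires once three failures have been seen within any five-minute span.

```python
from collections import deque


class FailureWindow:
    """Counts failures inside a sliding time window."""

    def __init__(self, max_failures=3, window_s=300.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures = deque()  # failure timestamps in seconds, oldest first

    def record_failure(self, now_s):
        """Record a failure at `now_s`; return True when the threshold is reached."""
        self._failures.append(now_s)
        # Drop failures that have fallen out of the window.
        while self._failures and now_s - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures


window = FailureWindow(max_failures=3, window_s=300.0)
# Hypothetical failure timestamps, in seconds since monitoring started.
triggered = [window.record_failure(t) for t in (0.0, 100.0, 400.0, 450.0, 500.0)]
```

Only the last failure triggers here: by then three failures (at 400, 450, and 500 seconds) fall inside one five-minute window, whereas the earlier ones were spread too far apart.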

Tier 5: Cost SLO - Cost Control and ROI

5.1 Cost metrics

Metric 1: Total Cost (TC)

Definition: The total monthly cost of operating the Agent system
Calculation formula: TC = Model API Cost + Infrastructure Cost + Human Labor Cost
Threshold: within budget
Measurement Frequency: Monthly
Tracking method: Financial system

Metric 2: Cost Optimization Rate (COR)

Definition: The percentage saved through cost optimization
Calculation formula: COR = ((Cost Before - Cost After) / Cost Before) * 100%
Threshold: ≥ 20% (quarterly)
Measurement Frequency: Quarterly
Tracking method: Cost analysis system

5.2 Cost optimization implementation

Optimization 1: Model cost optimization

Techniques: model selection, quantization, pruning
Effects:
- Model selection: GPT-5-mini ($0.001 per 1k tokens) vs GPT-5-4 ($0.01 per 1k tokens)
- Quantization: FP16 → INT8 (50% cost saving)
- Pruning: remove unused layers (20% cost saving)

Optimization 2: Request cost optimization

Techniques: request classification, caching, batching
Effects:
- Request classification: high-priority requests use the expensive model, low-priority requests use the cheap model
- Caching: hit rate ≥ 80% (60% cost saving)
- Batching: 10 requests per batch (30% cost saving)
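
The effect of request classification can be estimated with the per-token prices quoted above. The sketch below is illustrative only: the routing rule, function names, and monthly token volumes are assumptions, not measured data.

```python
# USD per 1k tokens, as quoted in the model selection bullet above.
PRICE_PER_1K_TOKENS = {"GPT-5-mini": 0.001, "GPT-5-4": 0.01}


def route(priority):
    """High-priority requests go to the expensive model, the rest to the cheap one."""
    return "GPT-5-4" if priority == "high" else "GPT-5-mini"


def monthly_cost(tokens_by_priority):
    """Total monthly model cost (USD) for the given token volume per priority."""
    return sum(
        tokens / 1000 * PRICE_PER_1K_TOKENS[route(priority)]
        for priority, tokens in tokens_by_priority.items()
    )


# Hypothetical monthly token volumes.
workload = {"high": 2_000_000, "low": 8_000_000}

routed_cost = monthly_cost(workload)
all_expensive_cost = sum(workload.values()) / 1000 * PRICE_PER_1K_TOKENS["GPT-5-4"]
saving_percent = (all_expensive_cost - routed_cost) / all_expensive_cost * 100.0
```

With this workload, routing costs about $28 instead of $100 for sending everything to the expensive model, a saving of roughly 72%. The actual figure depends entirely on the share of high-priority traffic.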

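ROI optimization example:

Scenario: AI Agent customer support
- Initial cost: $15,000/month
- Optimization measures:
  - Model selection: GPT-5-mini (cost $5,000/month)
  - Cache hit rate: 80% (saves $4,000/month)
  - Request classification: high-priority requests use GPT-5-4 (cost $3,000/month)
- Cost after optimization: $5,000/month
- Benefit: $20,000/month (the value used in the ROI formula)
- ROI = (($20,000 - $5,000) / $5,000) * 100% = 300%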
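
The same scenario also gives a Cost Optimization Rate, which can be worked out from the before and after costs. The function name is an illustrative assumption; the formula is the COR definition from section 5.1.

```python
def cost_optimization_rate(cost_before: float, cost_after: float) -> float:
    """COR as a percentage: ((before - after) / before) * 100."""
    if cost_before <= 0:
        raise ValueError("cost_before must be positive")
    return (cost_before - cost_after) / cost_before * 100.0


cor = cost_optimization_rate(15_000.0, 5_000.0)
meets_quarterly_target = cor >= 20.0  # COR threshold from section 5.1

print(f"COR = {cor:.1f}%")
```

Going from $15,000 to $5,000 per month is a reduction of about 66.7%, well above the 20% quarterly target.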

Comparison: SLO-Driven vs Traditional Operations

Traditional operating model

Dimension	Traditional approach	Problem
Measurability	Experience-driven	Subjective, not quantifiable
Traceability	Reading logs	Data is scattered and hard to analyze
Optimizability	Guesswork	No supporting data
Investability	Justified by reports	ROI cannot be calculated

SLO-Driven operating model

Dimension	SLO-Driven approach	Advantage
Measurability	Numerical definitions	Objective, quantifiable
Traceability	Real-time KPIs	Data is centralized and easy to analyze
Optimizability	Data-driven	Goals are clear and verifiable
Investability	ROI can be calculated	Cost-effectiveness is quantifiable

Tradeoffs:

Advantages: measurable, traceable, optimizable, investable
Disadvantages: high initial setup cost; requires data infrastructure

Deployment scenarios: from MVP to enterprise level

Phase 1: MVP (Minimum Viable Product) - 1-2 weeks

SLO Target:

FA ≥ 95%
ER ≤ 5%
RT (P50) ≤ 10 seconds
TC ≤ $5,000/month

Implementation Focus:

Core functionality availability
Basic error handling
Basic monitoring

Phase 2: Production MVP - 1-2 months

SLO Target:

FA ≥ 99%
ER ≤ 2%
RT (P95) ≤ 10 seconds
TC ≤ $10,000/month
US ≥ 85%

Implementation Focus:

Highly available architecture
Complete monitoring
User feedback system

Phase 3: Enterprise - 3-6 months

SLO Target:

FA ≥ 99.9%
ER ≤ 1%
RT (P99) ≤ 30 seconds
TC ≤ $25,000/month
US ≥ 90%
ROI ≥ 150%

Implementation Focus:

Tiered availability
Cost optimization
Overall ROI operations
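
The three sets of phase targets above can be kept as data and checked against measured values, as in this sketch. The dictionary layout, key names such as `RT_P95`, and the measured numbers are illustrative assumptions.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# SLO targets per phase, transcribed from the lists above.
PHASE_TARGETS = {
    "MVP": {
        "FA": (">=", 95.0), "ER": ("<=", 5.0),
        "RT_P50": ("<=", 10.0), "TC": ("<=", 5_000),
    },
    "Production MVP": {
        "FA": (">=", 99.0), "ER": ("<=", 2.0), "RT_P95": ("<=", 10.0),
        "TC": ("<=", 10_000), "US": (">=", 85.0),
    },
    "Enterprise": {
        "FA": (">=", 99.9), "ER": ("<=", 1.0), "RT_P99": ("<=", 30.0),
        "TC": ("<=", 25_000), "US": (">=", 90.0), "ROI": (">=", 150.0),
    },
}


def unmet_targets(phase, measured):
    """Return the names of targets that are not measured or not met."""
    unmet = []
    for name, (op, limit) in PHASE_TARGETS[phase].items():
        value = measured.get(name)
        if value is None or not OPS[op](value, limit):
            unmet.append(name)
    return unmet


# Hypothetical measurements for one month.
measured = {"FA": 99.5, "ER": 1.5, "RT_P95": 8.0, "TC": 9_000, "US": 88.0}

production_gaps = unmet_targets("Production MVP", measured)
enterprise_gaps = unmet_targets("Enterprise", measured)
```

Here the system meets every Production MVP target but still has gaps for the Enterprise phase, including two metrics (P99 response time and ROI) that are not being measured yet.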

Measurement and Tracking Tools

Tool 1: Prometheus + Grafana

# Prometheus configuration example
scrape_configs:
  - job_name: 'ai-agent'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['agent-service:9090']
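
For context, the `/metrics` path in the scrape config is expected to return plain text in the Prometheus exposition format. The sketch below renders two counters in that format using only the standard library; the metric names and values are hypothetical, and a real service would normally use an official Prometheus client library instead.

```python
def render_counters(counters):
    """Render counters as Prometheus text exposition output."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counters an agent service might expose.
counters = {
    "agent_requests_total": ("Total agent requests.", 1000),
    "agent_request_errors_total": ("Agent requests that returned an error.", 15),
}

metrics_body = render_counters(counters)

# Error rate (ER) computed from the two counters.
error_rate_percent = 15 / 1000 * 100
```

From counters like these, the error rate SLO can be evaluated as errors divided by total requests, which comes to 1.5% in this example.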

Tool 2: OpenTelemetry

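// OpenTelemetry tracing example
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('ai-agent');

// `agent` is assumed to be the application's own agent client.
async function traceRequest(request) {
  // startActiveSpan makes the span current while the callback runs.
  return tracer.startActiveSpan('ai-agent-request', async (span) => {
    try {
      const response = await agent.process(request);
      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (err) {
      // Record the failure so it counts toward the error rate.
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      // Always end the span so it is exported.
      span.end();
    }
  });
}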

Summary: Core Principles of SLO-Driven Operations

Measurability: every SLO has a clear numerical definition
Traceability: every metric is tracked in real time
Optimizability: optimization decisions are based on data
Investability: the ROI of each operational optimization can be calculated
Verifiability: SLO attainment can be verified

Next phase: moving from SLO-driven operations to AI Agent self-optimization, where the Agent system uses SLO operational data to automatically optimize its own performance and cost.