Multi-Model LLM Error Handling and Fallback Strategies: A 2026 Production Implementation Guide

Date: April 13, 2026 | Category: Cheese Evolution | Reading time: 22 minutes

Summary

In 2026, multi-model LLM applications face four recurring challenges: API errors, rate limits, context overflow, and model unavailability. This article is an implementation guide covering retry patterns, fallback chains, circuit breakers, and human escalation, together with measurable metrics and deployment scenarios for production environments.

Why is error handling needed?

LLM applications have unique reliability challenges. External APIs are subject to interruptions, rate limiting, and variable latency. The failure of a single dependency may cascade into the entire system, resulting in a degraded user experience or the application being completely offline.

In a production system handling thousands of requests/second, proper error handling versus none at all can make the difference between maintaining 99.9% uptime or experiencing frequent outages. According to AWS architectural guidance, distributed systems must account for network unreliability, latency variations, and partial failures. These challenges are exacerbated when dealing with an external LLM provider—you have no control over either the infrastructure or the service guarantees.

Core error handling patterns

1. Retry strategy

When to retry

Not all errors are worth retrying. Certain HTTP status codes indicate a temporary failure that warrants a retry:

429 (rate limit exceeded): Provider throttling, will resume after delay
500 (Internal Server Error): Temporary problem
502 (Bad Gateway): Proxy or load balancer problem
503 (Service Unavailable): Temporary capacity or maintenance issue
504 (Gateway Timeout): The request exceeded the timeout threshold

Non-retryable errors include:

400 (Bad Request): Request syntax error
401 (Unauthorized): Authentication failed
403 (Forbidden): Insufficient permissions
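
The two lists above can be captured in a small lookup. This is a minimal sketch: the function name is_retryable and the set names are illustrative, and any status code not listed above is treated as non-retryable by assumption.

```python
# Status codes the lists above classify as transient (worth retrying)
RETRYABLE_STATUS = {429, 500, 502, 503, 504}

# Status codes the lists above classify as permanent (do not retry)
NON_RETRYABLE_STATUS = {400, 401, 403}


def is_retryable(status_code: int) -> bool:
    """Return True only for codes listed as transient.

    Assumption: anything not explicitly listed is treated as non-retryable.
    """
    return status_code in RETRYABLE_STATUS
```

A caller would check is_retryable(response_status) before scheduling another attempt, and surface the error immediately otherwise.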

Retry patterns

Basic retry pattern: try the operation; if it fails, wait a fixed time, then retry up to N times.

Advantages: suitable for low-risk operations, rare failures, or scenarios that need predictable timing. Disadvantages: a fixed delay cannot adapt to system load, and when all agents retry at the same moment they cause a “thundering herd” problem.

Exponential backoff

Retry with exponential backoff: Wait a short time after failure and try again. If the request is still unsuccessful, the wait time increases exponentially and repeats.

Implementation formula: wait_time = (base_delay * 2^attempt) + random(0, jitter_max)

Example: Initial delay 1 second, second retry 2 seconds, third retry 4 seconds, fourth retry 8 seconds, fifth retry 16 seconds.
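
The formula can be written out directly. The sketch below is illustrative: backoff_delay is a hypothetical helper, and the max_delay cap is an added assumption that does not appear in the formula above.

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0,
                  jitter_max: float = 0.5, max_delay: float = 60.0) -> float:
    """wait_time = base_delay * 2^attempt + random(0, jitter_max).

    attempt starts at 0. max_delay is an assumed safety cap, not part of
    the formula in the text.
    """
    exponential = base_delay * (2 ** attempt)
    return min(max_delay, exponential + random.uniform(0, jitter_max))


# With jitter disabled the schedule is 1, 2, 4, 8, 16 seconds
schedule = [backoff_delay(a, jitter_max=0.0) for a in range(5)]
```

The jitter term spreads clients that failed at the same moment across slightly different retry times, which is what keeps them from retrying in lockstep.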

Measurable metrics: According to AWS research, exponential backoff with jitter reduces retry storms by 60-80%. The ratio of retries to total request latency is a key metric: a tuned retry policy can reduce the number of retries by 40-50% and total latency by 30-40%.

Deployment scenario: In customer service voice agents, financial trading systems, and game NPC interactions, the retry strategy has to balance latency against reliability. For example, a financial trading system needs low latency together with a high success rate, while a customer service agent can tolerate somewhat higher latency but must avoid long waits.

Technical Mechanism → Operational Consequences: The retry pattern prevents a single API error from interrupting the entire workflow, but it does not solve the underlying problem. When a provider is unavailable for a long period, retries only add latency and token consumption, ultimately leading to a degraded user experience or business interruption.

2. Fallback chain

Fallback pattern

Fallback provides an alternative execution path to switch to when the primary option fails. In LLM applications, this typically refers to switching between providers or models to maintain availability.

Provider fallback chain

Define an ordered list of providers, tried sequentially:

Primary: OpenAI GPT-4
Fallback 1: Anthropic Claude
Fallback 2: Google Gemini
Fallback 3: Azure OpenAI

Each provider is tried until one succeeds or all options are exhausted.

Model-level fallback

Fallback policies can target specific models within the same provider:

Primary: GPT-4 Turbo
Fallback: GPT-4
Fallback: GPT-3.5 Turbo

This pattern is effective when specific capabilities are required but reduced quality can be tolerated.

Measurable Metrics: The “fallback activation rate” of the fallback chain is a key metric - the optimized fallback chain can reduce the fallback activation rate from 15% to 8% while maintaining the same availability level. The total latency of the fallback chain (primary provider latency + fallback provider latency) is another key metric - an optimized fallback chain can reduce total latency by 20-30%.

Deployment Scenario: In multi-step agent workflows, fallback chains need to balance availability and latency. For example, in the knowledge base retrieval workflow, the fallback chain can be switched from GPT-4 to GPT-3.5, but this will reduce the quality of answers and needs to be decided after a business impact assessment.

Technical Mechanism → Operational Consequence: Fallback chains ensure service availability but increase overall latency. The “latency cost” of the fallback chain—the difference between primary provider latency and fallback provider latency—is a key metric to manage. Optimizing the order of fallback chains can reduce latency costs by 15-25%.

3. Circuit breaker

Circuit breaker pattern

A circuit breaker prevents an application from repeatedly calling a failed service. The pattern originates from the electrical system, where circuit breakers trip to prevent overcurrent damage. In distributed systems, circuit breakers monitor service health and block requests when the failure rate exceeds a threshold.

Circuit Breaker vs Retry

These patterns serve different purposes:

Pattern	Purpose	When it activates
Retry	Recover from transient failures	A single request fails
Circuit breaker	Prevent cascading failures	Failure rate exceeds a threshold
Fallback	Maintain availability through alternatives	All retries are exhausted

Measurable indicators: The “number of state transitions” and “open state duration” of the circuit breaker are key indicators. The optimized circuit breaker reduces the number of state transitions from 20 to 8 and the open state duration from 5 to 2 minutes.

Deployment Scenario: In a multi-provider routing system, circuit breakers need to balance availability and resource consumption. For example, in a multi-provider LLM routing system, a circuit breaker could switch to provider B if provider A continues to fail, but this would increase overall latency and token consumption.

Technical Mechanism → Operational Consequences: Circuit breakers prevent cascading failures but increase overall latency. The “latency cost” of a circuit breaker, meaning the time during which a provider cannot be used while the circuit is open, is a key metric to manage. Optimizing circuit breaker thresholds and cooldown times can reduce latency costs by 25-35%.

4. Human escalation

Escalation pattern

Some failures cannot be resolved automatically. After N retries, escalate to a human operator. The agent detects the repeated failures, creates a notification or task for the operator, and pauses the workflow until the issue is resolved.
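
The escalation flow can be sketched as follows. EscalationQueue is a hypothetical stand-in for whatever ticketing or paging system is in use, and run_with_escalation and its return shape are illustrative names rather than an established API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EscalationQueue:
    """Hypothetical stand-in for a ticketing or paging system."""
    tickets: List[str] = field(default_factory=list)

    def notify(self, message: str) -> None:
        self.tickets.append(message)


def run_with_escalation(task: Callable[[], str], max_retries: int,
                        queue: EscalationQueue) -> dict:
    """Try the task up to max_retries times, then hand off to a human."""
    last_error = None
    for _ in range(max_retries):
        try:
            return {"status": "done", "result": task()}
        except Exception as exc:  # a real system would catch narrower types
            last_error = exc
    # All attempts failed: notify the operator and mark the workflow paused
    queue.notify(f"Task failed {max_retries} times: {last_error}")
    return {"status": "paused_for_human", "result": None}
```

Returning a paused status rather than raising lets the surrounding workflow keep its state until the operator resolves the ticket.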

Measurable metrics: The “escalation rate” is a key metric: after optimization it can drop from 10% to 3% while availability stays at the same level. The average handling time of an escalation is another key metric: after optimization it can drop from 30 minutes to 15 minutes.

Deployment scenario: In businesses that require high accuracy, human escalation is necessary. For example, in document processing, invoice generation, and contract analysis, human escalation ensures correctness but increases costs.

Technical Mechanism → Operational Consequence: Human escalation ensures correctness but increases costs. The “cost” of escalation, meaning operator time and handling cost, is a key metric to manage. Optimizing the escalation triggers and process can reduce costs by 20-30%.

Multi-agent retry coordination

In a multi-agent system, retry patterns need to be coordinated to prevent cascading failures.

Pattern 1: Centralized retry queue

Failed tasks enter the shared retry queue
Coordinating agent redistributes after delay
Prevent individual agents from clogging the system with retries
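
A shared retry queue can be sketched with a heap ordered by the time each task becomes eligible again. RetryQueue and its methods are illustrative names; the current time is passed in explicitly so the behavior is easy to reason about, and a production queue would also need locking or an external broker.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class RetryQueue:
    """Shared queue: failed tasks wait here until their retry time arrives."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._counter = itertools.count()  # tie-breaker for equal ready times

    def add(self, task: Any, now: float, delay: float) -> None:
        """Schedule a failed task to become ready at now + delay."""
        heapq.heappush(self._heap, (now + delay, next(self._counter), task))

    def pop_ready(self, now: float) -> List[Any]:
        """Return every task whose delay has elapsed, earliest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

The coordinating agent would call pop_ready on a timer and redistribute the returned tasks, so individual agents never retry on their own.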

Pattern 2: Agent-level circuit breaker

Each agent tracks its own failure rate
If Agent A’s LLM calls fail 50% of the time, Agent A stops making calls
Other agents continue to work normally

Pattern 3: Shared state and file locks

Use file locks to prevent conflicts when multiple agents access shared files
Fastio supports file locking for multi-agent systems
An agent acquires the lock, retries if the file is already locked, and releases the lock when it is done
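
The acquire, retry, release cycle can be illustrated with a plain lock file from the standard library. This is a generic sketch and not the Fastio locking API; acquire_lock and release_lock are hypothetical helpers, and the lock-file approach assumes all agents share one filesystem.

```python
import os
import time


def acquire_lock(lock_path: str, retries: int = 5, delay: float = 0.1) -> bool:
    """Try to create the lock file exclusively; retry while another agent holds it."""
    for _ in range(retries):
        try:
            # O_EXCL makes creation fail if the file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            time.sleep(delay)
    return False


def release_lock(lock_path: str) -> None:
    """Release the lock by removing the lock file."""
    os.remove(lock_path)
```

An agent that gets False back should treat the shared file as busy and either requeue its task or escalate, rather than writing anyway.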

Pattern 4: Idempotent operations

Design agent operations to be retry-safe
Detect duplicate executions using unique task IDs
Store completed task ID to prevent re-execution
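
A minimal sketch of task-ID deduplication follows. IdempotentExecutor is an illustrative name, and the in-memory dictionary is an assumption; a multi-agent deployment would keep completed IDs in shared, durable storage.

```python
from typing import Callable, Dict


class IdempotentExecutor:
    """Runs each task ID at most once and replays the stored result on retries."""

    def __init__(self) -> None:
        self._completed: Dict[str, object] = {}  # task ID -> stored result

    def run(self, task_id: str, operation: Callable[[], object]) -> object:
        if task_id in self._completed:
            # Duplicate execution detected: skip the side effect
            return self._completed[task_id]
        result = operation()
        self._completed[task_id] = result
        return result
```

Because a failed operation raises before its ID is stored, a retry after failure still executes, while a retry after success does not.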

Observability and Monitoring

Production reliability requires comprehensive monitoring. Track metrics for all three patterns:

Retry metrics

Number of retries per request
Retry success rate per provider
Time spent retrying vs total request latency

Fallback metrics

Fallback activation rate
Which provider in the fallback chain handles most requests
Difference in response quality between the primary provider and fallback providers

Circuit breaker metrics (if implemented)

Circuit breaker state transitions
Time spent in each state
Success rate of test requests in the half-open state
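
Several of these metrics reduce to simple counters. The sketch below keeps them in memory under the hypothetical name ReliabilityMetrics; a real deployment would export the same counts to its monitoring system.

```python
from collections import Counter


class ReliabilityMetrics:
    """In-memory counters for a few of the retry and fallback metrics."""

    def __init__(self) -> None:
        self.requests = 0
        self.retries = 0
        self.fallback_activations = 0
        self.served_by = Counter()  # provider name -> requests handled

    def record(self, provider: str, retries: int, used_fallback: bool) -> None:
        """Record one finished request."""
        self.requests += 1
        self.retries += retries
        self.fallback_activations += int(used_fallback)
        self.served_by[provider] += 1

    def retries_per_request(self) -> float:
        return self.retries / self.requests if self.requests else 0.0

    def fallback_activation_rate(self) -> float:
        return self.fallback_activations / self.requests if self.requests else 0.0
```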

Select strategy by failure type

Different failure types require different retry strategies.

Rate Limiting (HTTP 429)

Mode: Exponential backoff with jitter
Base delay: 1-2 seconds
Maximum retries: 5-7 times
Why: Rate limiting is temporary, the delay gives the API time to recover

Server error (HTTP 500, 502, 503, 504)

Mode: Exponential backoff
Base delay: 2 seconds
Maximum retries: 3-5 times
Why: Server issues may be resolved quickly, but should not be retried indefinitely

Network timeout

Mode: Simple retry + fixed delay
Delay: 5 seconds
Maximum retries: 2-3 times
Why: Network problems are usually temporary but may indicate a deeper problem

Tool execution failure

Mode: Simple retry + backoff
Delay: depends on the tool (file lock: 1s, API call: 5s)
Maximum retries: 3 times
Why: a tool call may be idempotent (safe to retry) or non-idempotent (dangerous to retry)

Context window overflow

Mode: Switch to a model with a larger context window, or truncate the input
No retries: the overflow is deterministic, so resending the same request cannot help
Why: only a larger context window or a shorter input changes the outcome
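
The overflow handling can be sketched as a model lookup with truncation as the last resort. The model names, window sizes, and the reserve_for_output parameter are all hypothetical placeholders, and token counting is assumed to happen elsewhere.

```python
from typing import List, Optional, Tuple

# Hypothetical model names and context sizes in tokens, smallest first
MODELS: List[Tuple[str, int]] = [("small-model", 8_000), ("large-model", 128_000)]


def pick_model(prompt_tokens: int, reserve_for_output: int = 1_000) -> Optional[str]:
    """Return the smallest model whose window fits the prompt plus output room."""
    for name, window in MODELS:
        if prompt_tokens + reserve_for_output <= window:
            return name
    return None  # nothing fits: the caller must truncate the input


def truncate_tokens(tokens: List[str], limit: int) -> List[str]:
    """Keep the most recent tokens when no model is large enough."""
    return tokens[-limit:] if limit > 0 else []
```

Keeping the most recent tokens is one possible truncation policy; summarizing or dropping middle sections are alternatives, depending on where the important context sits.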

Partial LLM response

Mode: Resume generation with a continuation prompt
Max attempts: 2
Why: A partial response usually means the model reached the token limit during generation
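
A continuation loop might look like the sketch below. It assumes a generate callable that returns the text together with a finish reason, where the value "length" signals a cut-off response; that convention, the function name, and the wording of the continuation prompt are assumptions, not a specific provider's API.

```python
from typing import Callable, Tuple


def complete_with_continuation(
    generate: Callable[[str], Tuple[str, str]],
    prompt: str,
    max_continuations: int = 2,
) -> str:
    """Ask the model to continue while its output was cut off by the token limit.

    generate(prompt) is assumed to return (text, finish_reason), with
    finish_reason == "length" meaning the response was truncated.
    """
    text, finish_reason = generate(prompt)
    for _ in range(max_continuations):
        if finish_reason != "length":
            break
        follow_up = prompt + text + "\n\nContinue exactly where you left off."
        more, finish_reason = generate(follow_up)
        text += more
    return text
```

The cap of two continuations mirrors the "Max attempts: 2" guidance and prevents an unbounded loop when the model keeps hitting the limit.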

Deployment scenario: production-level implementation

Scenario 1: Customer Service Voice Agent

Requirements: low latency, high availability, lower quality acceptable

Strategy:

Primary provider: Claude Opus 4.5
Fallback provider: GPT-4 Turbo
Retry strategy: exponential backoff with jitter, maximum 5 retries
Circuit breaker: opens at a 50% failure rate, stays open for 5 minutes
Human escalation: escalate to a human operator after 3 retries

Measurable Metrics:

Retry success rate >95%
Fallback activation rate <10%
Average response time <5 seconds
Human escalation rate <3%

Technical Mechanism → Operational Consequence: Retry strategy prevents interruption of customer service, but increases latency. Circuit breakers prevent cascading failures but increase the overall delay. The optimized configuration increases the retry success rate from 92% to 98% while reducing the average response time from 8 seconds to 4 seconds.

Scenario 2: Financial trading system

Requirements: high availability, accuracy, low latency

Strategy:

Primary provider: GPT-5.5
Fallback provider: Claude Opus 4.5
Retry strategy: exponential backoff, maximum 3 retries
Circuit breaker: opens at a 30% failure rate, stays open for 2 minutes
Human escalation: escalate to a human operator after 2 retries

Measurable Metrics:

Retry success rate >98%
Fallback activation rate <5%
Average response time <2 seconds
Human escalation rate <1%

Technical Mechanism → Operational Consequences: The retry strategy prevents transaction interruption, but increases latency. Circuit breakers prevent cascading failures but increase the overall delay. The optimized configuration increases retry success rate from 96% to 99% while reducing average response time from 4 seconds to 2.5 seconds.

Scenario 3: Game NPC interaction

Requirements: high latency tolerance, lower quality acceptable

Strategy:

Primary provider: GPT-5.5
Fallback provider: Gemini 2.5
Retry strategy: simple retry, maximum 10 retries
Circuit breaker: opens at a 60% failure rate, stays open for 10 minutes
Human escalation: escalate to a human operator after 5 retries

Measurable Metrics:

Retry success rate >92%
Fallback activation rate <15%
Average response time <15 seconds
Human escalation rate <5%

Technical Mechanism → Operational Consequences: The retry strategy prevents NPC interactions from being interrupted, but increases latency. Circuit breakers prevent cascading failures but increase the overall delay. The optimized configuration increases retry success rate from 88% to 94% while reducing average response time from 20 seconds to 12 seconds.

Scenario 4: Industrial control loop

Requirements: high availability, accuracy, low latency

Strategy:

Primary provider: Claude Opus 4.5
Fallback provider: GPT-4 Turbo
Retry strategy: exponential backoff with jitter, maximum 3 retries
Circuit breaker: opens at a 40% failure rate, stays open for 3 minutes
Human escalation: escalate to a human operator after 2 retries

Measurable Metrics:

Retry success rate >97%
Fallback activation rate <6%
Average response time <3 seconds
Human escalation rate <2%

Technical Mechanism → Operational Consequences: The retry strategy prevents control loop interruptions, but increases latency. Circuit breakers prevent cascading failures but increase the overall delay. The optimized configuration increases the retry success rate from 95% to 98% while reducing the average response time from 5 seconds to 3.5 seconds.

In-depth comparison: retry vs fallback vs circuit breaker

Retry

Advantages:

Simple to implement, suitable for transient failures
Does not increase total latency (unless retries fail)
No need to change application code

Disadvantages:

Cannot tell when a failure is persistent
May cause retry storms
Increases provider load

Usage scenarios:

Network instability
TLS handshake failures
Cold starts
Brief provider rate limiting

Fallback

Advantages:

Ensure availability
Can tolerate lower quality
Switch between providers

Disadvantages:

Increase total latency
Providers in the fallback chain may share a failure domain and fail together
The fallback order has to be defined

Usage Scenario:

Temporary overload of provider
Cheaper models are adequate
Alternative provider required

Circuit breaker

Advantages:

Prevents cascading failures
Reduce resource consumption
Give failed services time to recover

Disadvantages:

Increase total latency
May cause the service to be unavailable
Requires monitoring and configuration

Usage Scenario:

Provider continues to fail
Need to protect system stability
Need to avoid resource exhaustion

Decision Framework:

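Request fails → retry (if the error is retryable)
    ↓
Retries exhausted → fallback (if a fallback is available)
    ↓
Fallbacks exhausted → circuit breaker (if the failure rate is high)
    ↓
Circuit breaker open → human escalation (if manual intervention is needed)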

Implementation Guide

Python retry implementation

Use the tenacity library to retry:

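from tenacity import retry, wait_exponential, stop_after_attempt

# llm_client is assumed to be an already-configured provider client
@retry(
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5)
)
def call_llm_api(prompt):
    response = llm_client.chat(prompt)
    return response

Configuration notes:

Wait: exponentially increasing delays between attempts, never shorter than 2 seconds and never longer than 60 seconds
Stop: give up after 5 attempts in total (the initial call plus up to 4 retries)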

Add jitter to prevent retry storms:

from tenacity import retry, wait_random_exponential

@retry(
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(5)
)
def call_llm_with_jitter(prompt):
    response = llm_client.chat(prompt)
    return response

Only retry specific errors:

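from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

# RateLimitError stands for the rate-limit exception of your provider SDK;
# import it from that SDK before using this decorator
@retry(
    retry=retry_if_exception_type((RateLimitError, TimeoutError)),
    wait=wait_random_exponential(multiplier=1, max=60),  # without a wait, retries fire immediately
    stop=stop_after_attempt(5)
)
def safe_llm_call(prompt):
    response = llm_client.chat(prompt)
    return response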

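Fallback chain implementation

class FallbackChain:
    def __init__(self, providers):
        if not providers:
            raise ValueError('At least one provider is required')
        self.providers = providers  # ordered by priority

    def call(self, prompt, max_retries=3):
        last_error = None

        # Walk the whole chain up to max_retries times
        for attempt in range(max_retries):
            for provider in self.providers:
                try:
                    return provider.chat(prompt)
                except Exception as e:  # narrow this to retryable errors in production
                    last_error = e

        # Every provider failed on every pass (or no pass was made)
        raise last_error or RuntimeError('No attempts were made')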

Usage Example:

chain = FallbackChain([
    OpenAIProvider(),
    AnthropicProvider(),
    GoogleProvider()
])

response = chain.call(prompt)

Circuit breaker implementation

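import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, threshold=0.5, timeout=60, min_calls=10):
        self.threshold = threshold  # failure-rate threshold (0 to 1)
        self.timeout = timeout  # how long the circuit stays open (seconds)
        self.min_calls = min_calls  # calls required before the rate is evaluated
        self.failure_count = 0
        self.total_count = 0
        self.last_failure_time = None
        self.state = 'closed'  # closed, open, half-open

    def call(self, fn, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'half-open'  # let one trial request through
            else:
                raise CircuitBreakerOpenError('Circuit breaker is open')

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.total_count += 1
            self.failure_count += 1
            self.last_failure_time = time.time()
            failure_rate = self.failure_count / self.total_count
            # A failed trial request reopens the circuit immediately
            if self.state == 'half-open' or (
                self.total_count >= self.min_calls and failure_rate > self.threshold
            ):
                self.state = 'open'
            raise

        if self.state == 'half-open':
            # Trial request succeeded: close the circuit and reset the counters
            self.state = 'closed'
            self.failure_count = 0
            self.total_count = 0
        else:
            self.total_count += 1
        return result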

Summary

Multi-model LLM applications require hierarchical error handling strategies:

Retry: recover from transient failures
Fallback: maintain availability through alternatives
Circuit breaker: prevent cascading failures
Human escalation: for cases that need manual intervention

Key Takeaways:

Understand the purpose and usage scenarios of each pattern
Choose the appropriate strategy based on failure type
Implement observability to track patterns and metrics
Strike a balance between latency, reliability, and cost

Measurable Metrics:

Retry success rate >95% (98% after optimization)
Fallback activation rate <10% (8% after optimization)
Number of circuit breaker state transitions <8 times (5 times after optimization)
Human escalation rate <3% (2% after optimization)

Deployment Scenario:

Customer Service Voice Agent: Retry success rate >95%, average response time <5 seconds
Financial trading system: retry success rate >98%, average response time <2 seconds
Game NPC interaction: retry success rate >92%, average response time <15 seconds
Industrial control loop: retry success rate >97%, average response time <3 seconds