AI Agent Rate Limiting & Throttling in Production: Architecture Patterns and Tradeoffs

April 22, 2026 · 6 min read · Beginner

Memory Security Orchestration Interface Infrastructure

This article is one route in OpenClaw's external narrative arc.

Core Issue: When Autonomy Meets Control

In the AI agent arena of 2026, autonomy is the core value. But just as a faster car needs more reliable brakes, rapidly advancing AI agents need strict rate control.

Traditional API rate limiting, which caps the number of requests, no longer fits the complexity of AI agents. A single simple user request can trigger hundreds or thousands of internal and external calls.

Uncontrolled autonomy can lead to:

💸 Cost explosion (a recursive loop can generate tens of thousands of tokens)
🚨 Internal resource exhaustion (a flood of database queries)
🎭 Amplified prompt injection (a single malicious input becomes a multi-step attack)

This article examines rate limiting and throttling patterns for the AI agent era, and how to design an enterprise-grade rate control architecture.

⚡ Rate Limiting vs Throttling: Precise terminology distinction

Before implementation begins, precise terminology must be established:

Mechanism	Main goal	Agent use case	Key metrics
Rate Limiting	Security & abuse prevention	Block malicious attacks (DDoS, resource exhaustion)	Requests/sec, Tokens/min, Tool calls/hr
Throttling	Resource management & fairness	Relieve overall load and preserve QoS	Compute time, Queue depth, Latency target, Concurrent sessions

Core differences

Rate limiting is the "hard firewall": it sets a hard upper bound to stop attacks. For AI agents this is no longer about counting HTTP requests, but about measuring the true cost and impact of agent actions.

Throttling is "traffic control": it gently slows the request rate to manage overall system load and ensure that no single user or agent monopolizes resources.
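The difference between the two mechanisms can be made concrete with a minimal sketch. The class names `HardRateLimit` and `SoftThrottle`, and the injectable `clock` and `sleep` parameters, are illustrative assumptions rather than part of any particular library: the first rejects once a cap is hit, the second never rejects but spaces calls out.

```python
import time


class HardRateLimit:
    """Rate limiting: reject outright once the cap for the window is reached."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def allow(self):
        if self.calls >= self.max_calls:
            return False  # hard stop; the caller receives an error
        self.calls += 1
        return True


class SoftThrottle:
    """Throttling: never reject, but keep calls at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable so the behavior can be tested
        self.sleep = sleep
        self.next_slot = None

    def wait_turn(self):
        """Block until the next slot is free; return the seconds waited."""
        now = self.clock()
        if self.next_slot is None or now >= self.next_slot:
            waited = 0.0
        else:
            waited = self.next_slot - now
            self.sleep(waited)
        self.next_slot = now + waited + self.min_interval
        return waited
```

In practice the two are often layered: the throttle smooths normal traffic, and the hard limit is the backstop for abuse.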

🎯 Agent-specific indicators: beyond request counts

Indicator 1: Token Latency

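An AI agent's response cost is driven by token latency rather than by the number of HTTP requests:

# Token latency calculation example
def calculate_token_latency(prompt, model):
    """Return the average latency per token, in milliseconds.

    estimate_tokens() and measure_generation_time() are placeholders for
    your own tokenizer and timing helpers; the latter is assumed to return ms.
    """
    estimated_tokens = estimate_tokens(prompt)
    generation_time = measure_generation_time(prompt, model)

    # Guard against empty prompts to avoid division by zero
    if estimated_tokens == 0:
        return 0.0
    return generation_time / estimated_tokens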

Key Indicators:

First token latency (TTFT): target < 300ms
Average generation latency: target < 50ms/token
Long text latency: target < 100ms/token
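To check the targets above against real traffic, time-to-first-token and per-token latency can be measured directly on a streaming response. The following is a sketch under assumptions: `token_stream` is any iterator that yields tokens as they arrive, and `measure_stream_latency` is a hypothetical helper name.

```python
import time


def measure_stream_latency(token_stream, clock=time.perf_counter):
    """Consume a token iterator; return TTFT and mean inter-token latency in ms."""
    start = clock()
    first_token_at = None
    last_token_at = start
    count = 0
    for _ in token_stream:
        last_token_at = clock()
        if first_token_at is None:
            first_token_at = last_token_at
        count += 1
    if count == 0:
        return {"ttft_ms": None, "ms_per_token": None, "tokens": 0}
    ttft_ms = (first_token_at - start) * 1000
    # Mean gap between consecutive tokens after the first one
    gaps = count - 1
    ms_per_token = (last_token_at - first_token_at) * 1000 / gaps if gaps else 0.0
    return {"ttft_ms": ttft_ms, "ms_per_token": ms_per_token, "tokens": count}
```

Separating TTFT from the inter-token gap matters because the two respond to different causes: queueing and prompt processing dominate the former, decoding speed the latter.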

Indicator 2: Tool Call Latency

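Tool calls add latency on top of generation:

# Tool call latency calculation
def calculate_tool_call_latency(tool_calls):
    """Compute average latency and token usage per tool call."""
    if not tool_calls:
        return {"latency_per_tool": 0.0, "tokens_per_tool": 0.0, "latency_per_token": 0.0}

    total_io_time = sum(call.duration for call in tool_calls)
    total_tokens = sum(call.input_tokens + call.output_tokens for call in tool_calls)

    latency_per_tool = total_io_time / len(tool_calls)
    tokens_per_tool = total_tokens / len(tool_calls)

    return {
        "latency_per_tool": latency_per_tool,  # average latency per tool call
        "tokens_per_tool": tokens_per_tool,    # average tokens per tool call
        # Avoid dividing by zero when no tokens were exchanged
        "latency_per_token": latency_per_tool / tokens_per_tool if tokens_per_tool else 0.0,
    }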

Key Indicators:

Single tool call: Target < 500ms
Multi-tool serial call: target < 2s/tool
Concurrent tool calls: target < 1.5s/tool

Indicator 3: State Query Latency

AI Agent may need to query state (database, vector store, cache):

# State query latency calculation
def calculate_state_query_latency(state_queries):
    """Compute average latency per backend and the cache hit rate."""
    n = len(state_queries)
    if n == 0:
        return {}

    database_time = sum(q.db_query_time for q in state_queries)
    cache_time = sum(q.cache_hit_time for q in state_queries)
    vector_search_time = sum(q.vector_search_time for q in state_queries)

    return {
        "database_latency": database_time / n,
        "cache_latency": cache_time / n,
        "cache_hit_rate": sum(q.is_cache_hit for q in state_queries) / n,
        "vector_search_latency": vector_search_time / n,
    }

Key Indicators:

State query: target < 200ms
Cache hit rate: target > 80%
Vector search latency: target < 500ms

🏗️ Architecture patterns: three rate control strategies

Strategy 1: Token-Based Rate Limiting

Principle: Based on actual token usage rather than number of requests

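import time

class TokenRateLimiter:
    """Token-based rate limiter (fixed one-minute window)."""

    def __init__(self, max_tokens_per_minute=10000):
        self.max_tokens = max_tokens_per_minute
        self.current_tokens = 0
        self.last_reset = time.time()

    def check_request(self, prompt):
        """Return True if the request fits in the current window's budget."""
        # Start a fresh window once a minute has elapsed
        if time.time() - self.last_reset >= 60:
            self.current_tokens = 0
            self.last_reset = time.time()

        estimated_tokens = self.estimate_tokens(prompt)
        tokens_needed = estimated_tokens * 1.5  # rough allowance for generated tokens

        if self.current_tokens + tokens_needed > self.max_tokens:
            return False  # request rejected

        self.current_tokens += tokens_needed
        return True

    def estimate_tokens(self, prompt):
        """Roughly estimate the prompt's token count."""
        # get_tokenizer() is a placeholder; supply your model's tokenizer here
        return self.get_tokenizer().count_tokens(prompt)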

Advantages:

✅ Reflects true cost
✅ Avoids over-restricting
✅ Suits long-context tasks

Disadvantages:

❌ Estimates are inaccurate
❌ Adds latency to the first call
❌ Requires model access
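One common way to soften the "inaccurate estimates" drawback is to reserve the estimate up front and correct it once the real usage is known. The sketch below assumes the provider reports actual token usage after each call; `SettlingTokenBudget` and its method names are hypothetical.

```python
class SettlingTokenBudget:
    """Reserve an estimate up front, then correct it with the actual usage."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def reserve(self, estimated_tokens):
        """Return the reserved amount, or None if the budget cannot cover it."""
        if self.used + estimated_tokens > self.max_tokens:
            return None
        self.used += estimated_tokens
        return estimated_tokens

    def settle(self, reserved, actual_tokens):
        """Replace the reservation with what the call really consumed."""
        self.used += actual_tokens - reserved
        self.used = max(self.used, 0)
```

With this shape, an overestimate only blocks other requests until the call completes, and an underestimate is charged back to the budget instead of being silently lost.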

Strategy 2: Compute-Time Throttling (Time-Based Throttling)

Principle: Based on compute time rather than number of requests

import time

class TimeBasedThrottler:
    """Compute-time throttler over a sliding one-minute window."""

    def __init__(self, max_compute_seconds=30):
        self.max_seconds = max_compute_seconds
        self.current_requests = []

    def check_request(self):
        """Return True while the last minute's compute time is within budget."""
        now = time.time()
        # Keep only requests from the last minute so the list cannot grow unbounded
        self.current_requests = [r for r in self.current_requests
                                 if now - r["timestamp"] < 60]

        total_compute = sum(r["compute_time"] for r in self.current_requests)
        return total_compute <= self.max_seconds

    def record_request(self, compute_time):
        """Record the measured compute time of a finished request."""
        self.current_requests.append({
            "timestamp": time.time(),
            "compute_time": compute_time
        })

Advantages:

✅ Precise control over compute resources
✅ Prevents resource exhaustion
✅ Suits GPU/CPU-bound workloads

Disadvantages:

❌ Compute time cannot be predicted precisely
❌ Requires measuring actual compute time
❌ Adds latency to the first call
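Because compute time is only known after a request finishes, an admission decision needs some prediction. A simple option, shown here as a sketch and not as the method used in this article, is an exponentially weighted moving average of past compute times. `ComputeTimePredictor`, `admits`, and the default `alpha` are assumptions.

```python
class ComputeTimePredictor:
    """Exponentially weighted moving average of observed compute times."""

    def __init__(self, alpha=0.5, initial_seconds=1.0):
        self.alpha = alpha                # weight given to the newest observation
        self.estimate = initial_seconds   # starting guess before any data

    def observe(self, seconds):
        """Fold a measured compute time into the running estimate."""
        self.estimate = self.alpha * seconds + (1 - self.alpha) * self.estimate
        return self.estimate


def admits(predictor, used_seconds, max_seconds):
    """Admit a request only if its predicted cost fits the remaining budget."""
    return used_seconds + predictor.estimate <= max_seconds
```

A higher `alpha` reacts faster to workload shifts but is noisier; the right value depends on how variable the agent's tasks are.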

Strategy 3: Hybrid Rate Limiting

Principle: Combine token limits, compute-time throttling, and request limits

class HybridRateLimiter:
    """Hybrid rate limiter"""

    def __init__(self,
                 max_tokens_per_minute=10000,
                 max_compute_seconds=30,
                 max_requests_per_minute=100):
        self.token_limiter = TokenRateLimiter(max_tokens_per_minute)
        self.time_throttler = TimeBasedThrottler(max_compute_seconds)
        # RequestRateLimiter is assumed to be a requests-per-minute counter defined elsewhere
        self.request_limiter = RequestRateLimiter(max_requests_per_minute)

    def check_request(self, prompt):
        """Check every limit; all must pass."""
        # The read-only throttler check runs first so that a throttled
        # request does not consume token budget
        return (
            self.time_throttler.check_request() and
            self.request_limiter.check_request() and
            self.token_limiter.check_request(prompt)
        )

Advantages:

✅ Comprehensive protection
✅ Multiple layers of defense
✅ Flexible configuration

Disadvantages:

❌ High complexity
❌ Requires coordinating multiple limiters
❌ May be too conservative
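Coordinating several limiters raises one practical question: if one of them rejects, the others should not have consumed budget already. A possible approach, sketched here with a hypothetical `CountingLimit` class, is to split each limiter into a side-effect-free check and a separate commit.

```python
class CountingLimit:
    """Minimal limiter with separate 'can I?' and 'commit' steps."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def can_accept(self, amount):
        return self.used + amount <= self.capacity

    def commit(self, amount):
        self.used += amount


def check_all_then_commit(limits_and_costs):
    """limits_and_costs: list of (limit, cost). Commit only if every limit passes."""
    if not all(limit.can_accept(cost) for limit, cost in limits_and_costs):
        return False  # nothing was consumed
    for limit, cost in limits_and_costs:
        limit.commit(cost)
    return True
```

This sketch is single-threaded; under concurrency the check and commit steps would need to run under one lock or as an atomic operation in the backing store.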

🔍 In-depth analysis: trade-offs and counterarguments

Trade-off 1: Accuracy vs Latency

Precise rate control requires measuring actual cost, and that measurement adds latency:

Method	Accuracy	Added latency	Best suited for
Token limiting	High	Medium	Generation-intensive tasks
Time throttling	Medium	Low	Compute-intensive tasks
Request limiting	Low	Zero	Plain API quota scenarios

Case study: One AI agent service found that token-based limits cut costs by 35% compared with request-count limits, at the price of 200ms of extra first-call latency.

Trade-off 2: Limits vs User Experience

Limits that are too strict hurt the user experience; limits that are too loose let costs explode.

Case studies:

Too loose: a customer service bot with no limits generated 50,000 tokens in a single session, costing $50
Too strict: a code review agent with overly tight limits rejected 20% of valid requests, which drove customers away

Counterargument: does rate limiting hinder innovation?

Claim: Rate limiting may constrain an agent's autonomy and capacity to innovate.

Rebuttal:

Cost control underpins innovation: unlimited autonomy leads to runaway costs and cannot scale
Quality assurance: rate limits can stop an agent from going down a wrong path
Fairness: every user gets access to the service

Balance point: use adaptive rate limiting that adjusts dynamically to task complexity:

class AdaptiveRateLimiter:
    """Adaptive rate limiter"""

    def __init__(self):
        self.simple_tasks = TokenRateLimiter(max_tokens_per_minute=2000)
        self.complex_tasks = TokenRateLimiter(max_tokens_per_minute=5000)
        self.challenging_tasks = TokenRateLimiter(max_tokens_per_minute=10000)

    def get_rate_limiter(self, task_complexity):
        """Select a limiter based on task complexity."""
        if task_complexity == "simple":
            return self.simple_tasks
        elif task_complexity == "complex":
            return self.complex_tasks
        else:
            return self.challenging_tasks

📊 Observability and Monitoring

Monitoring metrics dashboard

# Agent rate limiting monitoring metrics
rate_limiting_metrics:
  # Token limits
  token_usage:
    current_minute: 8500 tokens
    max_allowed: 10000 tokens
    utilization_rate: 85%
    blocked_requests: 12

  # Time throttling
  compute_time:
    current_minute: 28s
    max_allowed: 30s
    utilization_rate: 93%
    blocked_requests: 5

  # Request limits
  request_rate:
    current_minute: 95 requests
    max_allowed: 100 requests
    utilization_rate: 95%
    blocked_requests: 3

  # User impact
  user_impact:
    blocked_users: 23
    total_blocked_requests: 23
    retry_rate: 15%

  # Cost
  cost:
    current_hour: $12.50
    max_allowed: $15.00
    utilization_rate: 83%

Alert rules

# Alert configuration
alert_rules:
  - name: "token_limit_exceeded"
    condition: token_usage.utilization_rate > 90%
    severity: warning
    action: notify_ops_team

  - name: "compute_time_exceeded"
    condition: compute_time.utilization_rate > 95%
    severity: warning
    action: notify_ops_team

  - name: "blocked_requests_high"
    condition: blocked_requests > 50
    severity: critical
    action: notify_ops_team

  - name: "cost_exceeded"
    condition: cost.utilization_rate > 95%
    severity: warning
    action: notify_ops_team
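Rules like these can be evaluated with very little code. The sketch below is one possible evaluator, not a real alerting product's API: the flat metric names (`token_utilization`, `compute_utilization`) and the `firing_alerts` function are assumptions chosen for illustration, with thresholds taken from the first three rules above.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt}

# Illustrative rules; metric names are assumptions, thresholds follow the config
RULES = [
    {"name": "token_limit_exceeded", "metric": "token_utilization",
     "op": ">", "threshold": 0.90, "severity": "warning"},
    {"name": "compute_time_exceeded", "metric": "compute_utilization",
     "op": ">", "threshold": 0.95, "severity": "warning"},
    {"name": "blocked_requests_high", "metric": "blocked_requests",
     "op": ">", "threshold": 50, "severity": "critical"},
]


def firing_alerts(metrics, rules=RULES):
    """Return (name, severity) for each rule whose condition holds."""
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        # Skip rules whose metric is missing instead of raising
        if value is not None and OPS[rule["op"]](value, rule["threshold"]):
            fired.append((rule["name"], rule["severity"]))
    return fired
```

Keeping rules as data makes thresholds easy to tune without redeploying code, which matters once limits are adjusted regularly.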

🛠️ Practical case: enterprise-level deployment

Case background

A financial technology company wants to integrate an AI Agent in its CI/CD pipeline for code review, and needs to prevent the Agent from excessively consuming API resources.

Implementation process

Phase 1: Baseline Establishment (Week 1)

Goal: Measure baseline metrics

# Baseline metrics collection
baseline_metrics:
  average_tokens_per_review: 1500 tokens
  average_compute_time: 8s
  average_requests_per_review: 5 requests

  # Limiter configuration
  token_limiter:
    max_tokens_per_minute: 10000 tokens
    enabled: true

  time_throttler:
    max_compute_seconds: 30s
    enabled: false

  request_limiter:
    max_requests_per_minute: 100 requests
    enabled: true

Finding: Code review sessions generated an average of 1,500 tokens, well below the limit, but some complex reviews exceeded 5,000 tokens.

Phase 2: Implementing Token Limits (Weeks 2-3)

Goal: Introduce token limits

# Token limit configuration
token_limiter_config:
  max_tokens_per_minute: 10000 tokens
  max_tokens_per_hour: 600000 tokens
  max_tokens_per_day: 14400000 tokens

  # Enable adaptive limits
  adaptive_mode: true
  complexity_thresholds:
    simple: 2000 tokens
    complex: 5000 tokens
    challenging: 10000 tokens
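A configuration with per-minute, per-hour, and per-day caps implies a limiter that checks several windows at once. The following is a minimal sliding-window sketch using the values from the configuration above; the class name `MultiWindowTokenLimiter` and the injectable `clock` are assumptions.

```python
import time
from collections import deque

# Window length in seconds -> maximum tokens (values from the config above)
WINDOWS = {60: 10_000, 3_600: 600_000, 86_400: 14_400_000}


class MultiWindowTokenLimiter:
    def __init__(self, windows=WINDOWS, clock=time.monotonic):
        self.windows = dict(windows)
        self.clock = clock
        self.events = deque()  # (timestamp, tokens), oldest first

    def _used(self, now, window_seconds):
        return sum(t for ts, t in self.events if now - ts < window_seconds)

    def try_consume(self, tokens):
        """Record the usage and return True only if every window has room."""
        now = self.clock()
        longest = max(self.windows)
        # Drop events that no window can still see
        while self.events and now - self.events[0][0] >= longest:
            self.events.popleft()
        for seconds, limit in self.windows.items():
            if self._used(now, seconds) + tokens > limit:
                return False
        self.events.append((now, tokens))
        return True
```

Note that 600,000 is exactly 60 × 10,000 and 14,400,000 is exactly 24 × 600,000, so with these values the hourly and daily caps can never trigger before the per-minute cap does. They only add protection if set below those products.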

Results:

✅ Code review costs reduced by 35%
⚠️ First-call latency increased by 200ms
✅ Effective review rate held at 95%

Phase 3: Full Deployment (Weeks 4-6)

Goal: Roll out the full hybrid limiter

# Hybrid limiter configuration
hybrid_limiter_config:
  # Token limits
  token_limiter:
    max_tokens_per_minute: 10000 tokens
    enabled: true

  # Time throttling
  time_throttler:
    max_compute_seconds: 30s
    enabled: true
    # Enable adaptive compute-time measurement
    measure_compute_time: true

  # Request limits
  request_limiter:
    max_requests_per_minute: 100 requests
    enabled: true

  # Alert rules
  alert_rules:
    - name: "token_limit_exceeded"
      severity: warning

    - name: "compute_time_exceeded"
      severity: warning

    - name: "blocked_requests_high"
      severity: critical

Results:

✅ Costs reduced by 40%
✅ User experience held at 98%
✅ No production incidents

Final metrics

Metric	Before	After	Change
Average tokens/review	1500	1200	-20%
Average cost/review	$0.15	$0.09	-40%
First-call latency	200ms	400ms	+100%
Effective review rate	95%	95%	0%
Rejected reviews	12%	5%	-58%
Production incidents	0	0	0%
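The last column of the table is the relative change, (after - before) / before. As a quick check of the arithmetic, using only the before and after values from the table:

```python
def percent_change(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100


rows = {
    "avg tokens/review": (1500, 1200),
    "avg cost/review ($)": (0.15, 0.09),
    "first-call latency (ms)": (200, 400),
    "rejected reviews (%)": (12, 5),
}
for name, (before, after) in rows.items():
    print(f"{name}: {percent_change(before, after):+.0f}%")
```

This prints -20%, -40%, +100%, and -58%, matching the table. The rejected-reviews figure is a relative change of a percentage (12% to 5%), which is 7 percentage points in absolute terms.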

💰 Commercial application scenarios

Scenario 1: Enterprise-level code review service

Business Model:

Subscription model: $299/month
Includes: token limiting, compute-time throttling, request limiting, alerting, monitoring
SLA: 99.9% service availability

Value Proposition:

Cuts costs by 40%
Improves review quality by 35%
Provides detailed rate usage reports

Scenario 2: AI Agent Platform

Business Model:

Usage-based billing: $0.005/1k tokens
Includes: automatic rate limiting, cost control, monitoring and alerting
Enterprise customization: private deployment

Features:

Automatic token estimation
Adaptive restriction strategy
Real-time cost monitoring
Rate usage analysis

Scenario 3: Education and Training

Business Model:

Course subscription: $199/year
Includes: complete rate control course, practical projects, certification

Content:

Token limiting principles
Time throttling patterns
Hybrid limiting strategies
Observability practices

🎓 Tutorial: Implementation Steps

Step 1: Measure the baseline

# Run the baseline collection script
./scripts/collect_baseline_metrics.py

# Generate the report
./scripts/generate_baseline_report.py

Key Indicators:

Average token/request
Average calculation time
Average requests/session
Cost/Request
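The contents of the baseline scripts are not shown in this article, so the following is only a sketch of what such a summary might compute. It assumes each record is a dict with `tokens`, `compute_s`, `requests`, and `cost_usd` fields; both the field names and `summarize_baseline` are hypothetical.

```python
import statistics


def summarize_baseline(records):
    """records: list of dicts with 'tokens', 'compute_s', 'requests', 'cost_usd'."""
    tokens = sorted(r["tokens"] for r in records)
    # Nearest-rank 95th percentile using integer arithmetic: ceil(0.95 * n)
    rank = (95 * len(tokens) + 99) // 100
    return {
        "avg_tokens": statistics.mean(tokens),
        "p95_tokens": tokens[rank - 1],
        "avg_compute_s": statistics.mean(r["compute_s"] for r in records),
        "avg_requests": statistics.mean(r["requests"] for r in records),
        "avg_cost_usd": statistics.mean(r["cost_usd"] for r in records),
    }
```

Reporting a high percentile next to the average matters here: the case study found an average of 1,500 tokens per review but outliers above 5,000, and limits sized from the average alone would reject those reviews.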

Step 2: Select a limiting strategy

Choose according to business needs:

Business need	Recommended strategy	Reason
Cost control first	Token limiting	Accurately reflects cost
Resource protection first	Time throttling	Prevents resource exhaustion
Comprehensive protection	Hybrid limiting	Multi-layered defense

Step 3: Configure the limiter

# Example configuration (rate_limiter_config.yaml)
# Token limits
token:
  max_tokens_per_minute: 10000
  max_tokens_per_hour: 600000
  max_tokens_per_day: 14400000

# Time throttling
time:
  max_seconds: 30
  measure_compute_time: true

# Request limits
request:
  max_per_minute: 100
  max_per_hour: 6000

# Alerts
alerts:
  - name: "token_limit_exceeded"
    severity: warning

  - name: "cost_exceeded"
    severity: critical

Step 4: Monitoring and Tuning

# Real-time monitoring
./scripts/monitor_rate_limiter.py --watch

# Alert testing
./scripts/test_alerts.py

# Tuning recommendations
./scripts/generate_tuning_recommendations.py

🚫 Anti-patterns and common pitfalls

Pitfall 1: Using fixed limits

Issue: A single fixed token, time, or request limit is applied regardless of task complexity.

Example:

# Don't do this
class SimpleRateLimiter:
    def __init__(self):
        self.max_tokens = 10000

    def check_request(self, prompt):
        # One fixed limit for every task (and here it is never even enforced)
        return True

Correct approach:

# Adaptive limits
class AdaptiveRateLimiter:
    def __init__(self):
        self.limiters = {
            "simple": TokenRateLimiter(2000),
            "complex": TokenRateLimiter(5000),
            "challenging": TokenRateLimiter(10000),
        }

    def get_limiter(self, task_complexity):
        # Select by task complexity; unknown labels fall back to the
        # "challenging" tier, as in the earlier version of this class
        return self.limiters.get(task_complexity, self.limiters["challenging"])

Pitfall 2: Not measuring actual usage

Issue: Actual token counts and compute time are not measured, so the limits are inaccurate.

Correct approach:

# Enable measurement
import time

class MeasuredRateLimiter:
    def __init__(self):
        self.measure_compute_time = True
        self.measure_tokens = True

    def check_request(self, prompt):
        # Measure the actual elapsed time
        # (self.process and self.validate_limits are placeholders for your own logic)
        start_time = time.perf_counter()
        result = self.process(prompt)
        compute_time = time.perf_counter() - start_time

        # Validate against the measured values rather than estimates
        return self.validate_limits(compute_time, result.tokens)

Pitfall 3: Over-restricting

Issue: Limits are set too low and valid requests are rejected.

Correct approach:

Set limits based on baseline data
Leave about 20% headroom
Monitor utilization and adjust dynamically
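The three points above can be turned into a small sizing rule. This sketch reads "20% headroom" as a limit set 20% above the observed peak, which is one possible interpretation; the function names and the 50% and 90% utilization thresholds are assumptions, not values from the case study.

```python
def limit_with_headroom(observed_peak, headroom=0.20):
    """Return a limit that sits `headroom` above the observed peak."""
    return int(round(observed_peak * (1 + headroom)))


def suggest_adjustment(utilization, low=0.50, high=0.90):
    """Suggest a direction based on sustained utilization of the limit."""
    if utilization > high:
        return "raise"   # close to the cap; valid requests may be rejected
    if utilization < low:
        return "lower"   # far below the cap; the limit offers little protection
    return "keep"
```

For example, a baseline peak of 8,000 tokens per minute gives a limit of 9,600 under this rule.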

📈 Best practices summary

Key takeaways

Measure before you limit: establish a baseline before enforcing any limit
Precision matters: token limits track cost more accurately than request counts
Hybrid strategies cover more: token + time + request limits provide layered protection
Adaptive beats fixed: adjust dynamically to task complexity
Don't neglect monitoring: real-time monitoring and alerting are essential

Quick-start checklist

[ ] Run the baseline collection script
[ ] Analyze baseline metrics
[ ] Select a limiting strategy
[ ] Configure the limiter
[ ] Implement monitoring
[ ] Test the limits
[ ] Tune and optimize

Next steps

Measure the baseline: run collect_baseline_metrics.py
Analyze the data: understand current token, compute, and request usage patterns
Select a strategy: choose a limiting strategy based on business needs
Configure and deploy: set up limiters and alert rules
Monitor and tune: monitor continuously and adjust dynamically
