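ClawBench: Evaluating AI Agents on Real Web Tasks with a Production-Grade Benchmark of 153 Everyday Tasks
From sandboxes to real websites: how ClawBench reveals what AI agents can actually do on everyday web tasks, across 153 tasks, 144 live platforms, and 15 categories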
Summary
ClawBench is a new evaluation framework covering 153 simple tasks across 144 live production platforms, grouped into 15 life categories, from completing a purchase and booking an appointment to submitting a job application. These tasks demand capabilities beyond what existing benchmarks test: extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations that fill out long, detailed forms.
Unlike existing benchmarks, ClawBench runs on real websites, preserving the full complexity, dynamism, and difficulty of real web interaction. A Chrome extension combined with a CDP (Chrome DevTools Protocol) layer intercepts only the final submission request, which keeps evaluation safe and free of real-world side effects.
The evaluation shows that even the strongest model tested (Claude Sonnet 4.6) completes only 33.3% of the tasks, while GPT-5.4 completes only 6.5%. This gap highlights how challenging real everyday web tasks remain for AI agents.
Blind spots of existing benchmarks
Existing benchmarks such as WebArena, VisualWebArena, OSWorld, and TheAgentCompany evaluate agents in an offline sandbox with static HTML and a fixed DOM structure, with no authentication flows and no dynamic content. This controlled environment simplifies evaluation but strips out the complexity of real web interaction: cookie consent pop-ups, dynamic JavaScript rendering, and complex multi-step interactions.
Benchmarks that do run on real websites, such as WebVoyager, AssistantBench, Online-Mind2Web, and Claw-Eval, are limited to read-only information retrieval or rely on simulated APIs to test simple write operations.
**We lack a reliable picture of an agent’s ability to actually “get things done” on real websites.**
ClawBench’s design philosophy
ClawBench’s core insight is this: **for AI agents to become true general-purpose assistants, they cannot just summarize emails; they must reliably handle the everyday online tasks that people depend on.**
These tasks are usually intuitive and quick for humans, typically taking less than thirty minutes. For AI agents, however, they involve the dynamic content of production websites, authentication flows, anti-bot defenses, and constantly evolving page layouts.
15 categories and 153 tasks
| Category | Example tasks |
|---|---|
| Shopping | Complete purchases, compare prices, find deals |
| Booking | Reserve a restaurant, book a doctor’s appointment, book flights |
| Applications | Submit a job application, apply for tuition assistance |
| Information retrieval | Find product specifications, compare services |
| Payment | Process refunds, view bills |
| Social | Send messages, manage notifications |
| Content creation | Write reports, generate summaries |
| Data management | Update personal data, synchronize calendars |
| Media | Download files, convert formats |
| Education | Search for courses, book tutoring |
| Travel | Book accommodation, check the weather |
| Health | Track exercise, book vaccine appointments |
| Finance | Manage budgets, analyze investments |
| Communications | Schedule meetings, send reminders |
| System administration | Update software, back up data |
Each category contains multiple tasks, ranging in difficulty from basic (single step) to advanced (cross-platform multi-step).
Evaluation architecture
ClawBench’s evaluation architecture combines four dimensions:
- Success rate: whether the task is completed
- Time efficiency: the time required to complete the task
- Write verification: the accuracy of form filling
- Cross-platform consistency: performance across different platforms
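As a rough illustration of how these four dimensions could be rolled up from per-task records, here is a minimal Python sketch. It is not ClawBench’s actual scoring code: the `TaskResult` fields, the `summarize` function, and the choice to express cross-platform consistency as the spread between the best and worst per-platform success rates are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One evaluated task. All field names are hypothetical."""
    task_id: str
    platform: str        # site the task ran on
    completed: bool      # success-rate dimension
    seconds: float       # time-efficiency dimension
    fields_correct: int  # write verification: correctly filled form fields
    fields_total: int    # write verification: fields the task required


def summarize(results):
    """Aggregate the four dimensions over a non-empty list of results."""
    if not results:
        raise ValueError("no results to summarize")

    success_rate = mean(1.0 if r.completed else 0.0 for r in results)
    mean_seconds = mean(r.seconds for r in results)

    total_fields = sum(r.fields_total for r in results)
    correct_fields = sum(r.fields_correct for r in results)
    write_accuracy = correct_fields / total_fields if total_fields else None

    # Assumed consistency measure: gap between best and worst platform.
    by_platform = defaultdict(list)
    for r in results:
        by_platform[r.platform].append(1.0 if r.completed else 0.0)
    platform_rates = {p: mean(v) for p, v in by_platform.items()}
    spread = max(platform_rates.values()) - min(platform_rates.values())

    return {
        "success_rate": success_rate,
        "mean_seconds": mean_seconds,
        "write_accuracy": write_accuracy,
        "platform_spread": spread,
    }
```

A smaller `platform_spread` would indicate more consistent behavior across sites; other measures, such as the variance of per-platform rates, would work equally well for this purpose.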
Safety interception mechanism
ClawBench’s Chrome extension intercepts only the final submission request, which ensures that:
- Agents can carry out complete multi-step workflows on real websites
- No real-world side effects occur
- Evaluation results are repeatable and safe
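The decision logic behind “intercept only the final submission” can be sketched in a few lines of Python. This models the idea only; it is not the ClawBench extension or its CDP layer. The `OutgoingRequest` type, the `final_paths` rule list, and the `handle` function are hypothetical names, and matching on HTTP method plus URL path prefix is an assumed heuristic.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Methods that can change server state; reads such as GET pass through.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class OutgoingRequest:
    """A request the agent's browser is about to send (hypothetical type)."""
    method: str
    url: str


def is_final_submission(request, final_paths):
    """True if the request is a write whose path matches a final-submit rule."""
    if request.method.upper() not in WRITE_METHODS:
        return False
    path = urlparse(request.url).path
    return any(path.startswith(prefix) for prefix in final_paths)


def handle(request, final_paths, captured):
    """Block and record a final submission; forward everything else."""
    if is_final_submission(request, final_paths):
        captured.append(request)  # kept for later verification
        return "blocked"
    return "forwarded"


if __name__ == "__main__":
    captured = []
    rules = ["/checkout/confirm"]  # assumed per-task rule
    for req in [
        OutgoingRequest("GET", "https://shop.example/cart"),
        OutgoingRequest("POST", "https://shop.example/cart/add"),
        OutgoingRequest("POST", "https://shop.example/checkout/confirm"),
    ]:
        print(req.method, req.url, "->", handle(req, rules, captured))
```

Forwarding every intermediate request is what lets the agent work through the real multi-step flow, while the single blocked request is both the safety guarantee and the artifact that can be checked afterwards.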
Model evaluation results
Across the 7 frontier models evaluated, completion rates were low:
| Model | Task completion rate |
|---|---|
| Claude Sonnet 4.6 | 33.3% |
| GPT-5.4 | 6.5% |
| Gemini 2.5 Pro | 18.7% |
| Claude Opus 4 | 28.1% |
| GPT-4o | 12.4% |
| Llama 4 | 5.2% |
| Mistral Large | 3.8% |
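To make the percentages concrete, the short Python sketch below converts a completion rate into an approximate task count. It assumes each rate was computed over all 153 tasks in a single run, which the article does not state; `approx_tasks_completed` is a helper named for this example.

```python
TOTAL_TASKS = 153  # task count given in the article


def approx_tasks_completed(rate_percent, total=TOTAL_TASKS):
    """Round a completion percentage to the nearest whole number of tasks."""
    return round(total * rate_percent / 100)


if __name__ == "__main__":
    # The two headline rates from the summary.
    for model, rate in {"Claude Sonnet 4.6": 33.3, "GPT-5.4": 6.5}.items():
        count = approx_tasks_completed(rate)
        print(f"{model}: about {count} of {TOTAL_TASKS} tasks")
```

Under that assumption, 33.3% corresponds to roughly 51 tasks and 6.5% to roughly 10.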
Performance differences by category
- Information retrieval: 42.1% completion rate (highest of the categories listed)
- Shopping: 38.2% completion rate
- Booking: 25.7% completion rate
- Applications: 15.3% completion rate (lowest of the categories listed)
Core challenges
1. Dynamic content and layout changes
The DOM structure of real websites changes frequently, so agents must adapt to an evolving interface rather than rely on predefined CSS selectors.
2. Authentication flows
Many tasks require login, verification, and two-step authentication, which adds complexity for the agent.
3. Anti-bot mechanisms
Defenses such as Cloudflare and reCAPTCHA block automated interaction.
4. Cross-platform consistency
Agents need to perform consistently across different websites, which requires the ability to generalize across platforms.
5. Risks of write operations
Write operations such as filling out forms and submitting data require a high degree of accuracy, and errors can have serious consequences.
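The accuracy requirement in the last item can be made concrete with a field-by-field check of what the agent was about to submit. The Python sketch below is illustrative only and is not ClawBench’s verifier: `verify_submission`, `normalize`, and the rule of ignoring case and extra whitespace are assumptions for this example.

```python
def normalize(value):
    """Collapse whitespace and ignore case so cosmetic differences pass."""
    return " ".join(str(value).split()).casefold()


def verify_submission(expected, submitted):
    """Compare submitted form fields against the expected values.

    Returns a dict with an overall `ok` flag and the fields that are
    missing or differ after normalization.
    """
    mismatched = {}
    for field, want in expected.items():
        got = submitted.get(field)
        if got is None or normalize(got) != normalize(want):
            mismatched[field] = {"expected": want, "submitted": got}
    return {"ok": not mismatched, "mismatched": mismatched}


if __name__ == "__main__":
    expected = {"name": "Ada Lovelace", "email": "ada@example.com"}
    submitted = {"name": "  ada   lovelace ", "email": "ada@example.org"}
    print(verify_submission(expected, submitted))
```

A real verifier would likely need field-specific rules, for example exact matching for email addresses and tolerant matching for free-text fields.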
Technical significance
ClawBench’s evaluation results reveal several key insights:
- The shift from sandbox to real-world evaluation: the high completion rates on existing benchmarks (OSWorld 65-75%) contrast sharply with ClawBench’s 33.3%, highlighting the large gap between sandboxed evaluation and real web interaction.
- Write operations are the bottleneck: agents perform far better on read tasks than on write tasks, which exposes their limits when they must fill in forms precisely and submit data.
- Insufficient cross-platform generalization: agents perform well on a single platform, but their performance drops significantly across multiple platforms.
- The importance of safe interception: ClawBench’s interception mechanism keeps the evaluation safe while also revealing how agents behave on real websites.
Future Directions
1. Improved agent adaptability
Future agents need to parse the DOM dynamically and adapt to changing website structures rather than rely on static CSS selectors.
2. Cross-platform learning
Agents need to learn consistent behavior patterns across multiple platforms, which requires the ability to transfer knowledge between platforms.
3. Handling anti-bot mechanisms
Agents need to handle anti-bot mechanisms more intelligently, for example through natural language interaction rather than mechanical automation.
4. Write verification
Agents need to be able to verify their own write operations to ensure forms are filled in accurately and to avoid producing erroneous data.
5. Safety evaluation
ClawBench’s interception mechanism keeps the evaluation safe and also reveals how agents behave on real websites, which provides a new reference point for future safety evaluations.
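The first direction above, adapting to changing page structure instead of relying on static CSS selectors, can be illustrated by locating a control through its visible label. The Python sketch below uses only the standard library’s `html.parser`; it is a simplified stand-in, not how ClawBench or OpenClaw locates elements, and `ButtonTextFinder` and `find_button_by_text` are names invented for this example.

```python
from html.parser import HTMLParser


class ButtonTextFinder(HTMLParser):
    """Collects (attributes, visible text) for every <button> element."""

    def __init__(self):
        super().__init__()
        self.buttons = []
        self._attrs = None  # attributes of the button being read, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._attrs is not None:
            text = " ".join("".join(self._chunks).split())
            self.buttons.append((self._attrs, text))
            self._attrs = None


def find_button_by_text(html, label):
    """Return the attributes of the first button whose text equals `label`."""
    finder = ButtonTextFinder()
    finder.feed(html)
    finder.close()
    wanted = label.casefold()
    for attrs, text in finder.buttons:
        if text.casefold() == wanted:
            return attrs
    return None


if __name__ == "__main__":
    # Two hypothetical versions of the same page: markup and class names
    # differ, but the user-visible label is the same.
    old = '<form><button class="btn-primary" id="submit">Place order</button></form>'
    new = '<div><button class="css-1x9z"><span>Place</span> <span>order</span></button></div>'
    print(find_button_by_text(old, "Place order"))
    print(find_button_by_text(new, "Place order"))
```

A selector such as `button.btn-primary` would break on the second layout, while the label-based lookup still finds the control. Real pages add complications this sketch ignores, including links styled as buttons, ARIA labels, and localized text.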
Conclusion
ClawBench marks an important shift in AI agent evaluation, from sandboxes to real production environments. Even the strongest current model completes only about one-third of everyday web tasks, which shows how much difficulty AI agents still face in the real world.
For agent frameworks like OpenClaw, ClawBench’s results provide an important reference point: agents need to make significant progress in cross-platform generalization, write operations, and adaptation to dynamic content before they can become reliable general-purpose assistants.
Source: ClawBench — Can AI Agents Complete Everyday Online Tasks? (arXiv:2604.08523v1, April 2026) Evaluation framework: claw-bench.com