VeriMAP: A Multi-Agent Collaboration System with Verification-Aware Planning
Summary
Large language model (LLM) agents increasingly rely on multi-agent collaboration to solve complex tasks. That collaboration introduces three new challenges: planning, coordination, and verification. This article introduces VeriMAP, a framework that integrates planning with verification and uses structured I/O and verification functions (VFs) to make collaboration reliable and interpretable.
Problem background
Traditional multi-agent systems often rely on ReAct-style reasoning or simple task decomposition and lack rigorous verification of subtask outputs. Execution failures typically take one of three forms:
- No output
- Wrong format (such as returning raw text instead of JSON)
- Output that looks reasonable but violates downstream expectations
These failures are context-dependent: an output that is correct in isolation is not enough; it must also match what the plan expects.
VeriMAP's four core modules
1. Verification-Aware Planner
Responsibilities: Decompose complex tasks into executable subtasks while generating verification functions (VFs) for each subtask.
Key Features:
- Directed acyclic graph (DAG) planning: each node represents a subtask, and each edge represents a dependency between subtasks
- Structured I/O: agent inputs and outputs must follow an explicit format such as JSON
- Named variables: each I/O object has a unique, consistent name throughout the DAG
- Python VFs: for functional tasks, the planner generates self-contained Python assertions that verify output type, format, and correctness
- Natural-language VFs: for semantic tasks (such as summarization), the planner generates natural-language instructions that guide the verification agent
Design concept: the planner turns global context into local checks, so the verifier only needs to evaluate the predetermined VFs and does not need to understand the overall task.
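The plan structure described above can be sketched as a small data model. This is a minimal illustration, not the framework's actual schema: the `Subtask` fields, the `execution_order` helper, and the example plan are all hypothetical names chosen here. Edges are derived from named variables, so a subtask depends on whichever subtask produces each of its inputs.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Subtask:
    """One DAG node: an instruction, named inputs, one named output, and its VFs."""
    name: str
    instruction: str
    inputs: list[str]   # named variables produced upstream (or given externally)
    output: str         # unique variable name this node produces
    python_vfs: list[str] = field(default_factory=list)  # assertion snippets
    nl_vfs: list[str] = field(default_factory=list)      # natural-language checks


def execution_order(subtasks: list[Subtask]) -> list[str]:
    """Return subtask names in topological order, derived from variable names."""
    producer = {t.output: t.name for t in subtasks}
    if len(producer) != len(subtasks):
        raise ValueError("each output variable must have exactly one producer")
    # Map each subtask to the set of subtasks that produce its inputs.
    graph = {
        t.name: {producer[v] for v in t.inputs if v in producer}
        for t in subtasks
    }
    # TopologicalSorter raises CycleError if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical two-node plan, deliberately listed out of order.
plan = [
    Subtask("summarize", "Summarize the extracted facts.", ["facts"], "summary",
            nl_vfs=["The summary covers at least 3 key points."]),
    Subtask("extract", "Extract facts from the document.", ["document"], "facts",
            python_vfs=["assert isinstance(output, list) and len(output) > 0"]),
]
print(execution_order(plan))  # ['extract', 'summarize']
```

Deriving edges from variable names rather than declaring them separately is one way to keep the naming constraint and the dependency graph from drifting apart.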
2. Executor
Responsibilities: Solve the assigned subtask using the planner's instructions and the upstream context.
Key Features:
- Never sees the original global task; receives only structured input
- Uses ReAct or tool-calling patterns
- Can run on smaller, cost-effective models (the experiments use gpt-4o-mini)
- Can be extended into special-purpose agents (such as information extraction or NL2SQL)
3. Verifier
Responsibilities: Evaluate the VFs against the executor's output.
Key Strategies:
- Strict logical-AND policy: a subtask passes only if every VF passes, so a single failing VF fails the subtask
- LLM VFs: for natural-language verification, an LLM evaluates each (executor output, VF) pair
- Python VFs: the Python code is executed directly
- Error collection: on failure, explanations or error traces are collected to guide the retry
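The strict AND policy and the error collection step can be illustrated with a short sketch for Python VFs. It assumes each VF is an assertion snippet that refers to a variable named `output`; that convention and the `run_python_vfs` helper are assumptions made for this example, not details taken from the paper.

```python
def run_python_vfs(output, vfs: list[str]) -> tuple[bool, list[str]]:
    """Run every assertion snippet against `output`; pass only if all of them pass."""
    errors = []
    for vf in vfs:
        try:
            # Each VF sees only the executor output, never the global task.
            exec(vf, {"output": output})
        except Exception as exc:
            # Keep a short message that can be fed back to the executor on retry.
            errors.append(f"{type(exc).__name__}: {exc}")
    return (not errors, errors)


vfs = [
    "assert isinstance(output, dict), 'output must be a JSON object'",
    "assert 'answer' in output, 'missing key: answer'",
]

ok, errors = run_python_vfs({"answer": 42}, vfs)
print(ok, errors)                    # True []

bad_ok, bad_errors = run_python_vfs("raw text", vfs)
print(bad_ok, len(bad_errors))       # False 2
```

Every VF is run even after the first failure, so the retry receives the full list of problems. Because `exec` runs arbitrary code, a real deployment would presumably execute VFs in a sandbox rather than in the coordinator's own process.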
4. Coordinator
Responsibilities: The central coordinator for multi-agent task execution.
Core Mechanism:
- Task sequencing and context management: executes subtasks in topological order and compiles the necessary context before each execution
- Execution and verification management: verifies the output after each execution and retries up to 3 times on failure
- Error handling and replanning: when retries are exhausted, collects the execution trace and triggers replanning, for at most 5 iterations
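How the two budgets nest can be sketched as follows. This is a simplified, runnable illustration under stated assumptions: `run_subtask`, `run_task`, `make_plan`, and the stub executor and verifier are hypothetical, a plan is reduced to a list of (execute, verify) pairs already in topological order, and "retry up to 3 times" is read as one initial attempt plus three retries.

```python
from typing import Callable

MAX_RETRIES = 3      # retries per subtask, as stated above
MAX_ITERATIONS = 5   # planning iterations, as stated above


def run_subtask(execute: Callable[[list[str]], object],
                verify: Callable[[object], list[str]]) -> tuple[bool, object, list[str]]:
    """Execute once, then retry up to MAX_RETRIES times while verification fails."""
    feedback: list[str] = []
    output = None
    for _attempt in range(1 + MAX_RETRIES):
        output = execute(feedback)   # errors from the previous attempt, if any
        feedback = verify(output)    # an empty list means every VF passed
        if not feedback:
            return True, output, []
    return False, output, feedback


def run_task(make_plan, max_iterations: int = MAX_ITERATIONS):
    """Replan until one plan passes end to end or the iteration budget runs out."""
    trace: list[str] = []
    for iteration in range(1, max_iterations + 1):
        plan = make_plan(trace)      # list of (execute, verify) pairs
        output = None
        for execute, verify in plan:
            ok, output, errors = run_subtask(execute, verify)
            if not ok:
                trace.extend(errors)  # the failure trace informs the next plan
                break
        else:
            return iteration, output  # every subtask passed
    return None, None


# Stub executor that returns well-formed output only on its third call.
attempts = {"n": 0}

def flaky_execute(feedback: list[str]):
    attempts["n"] += 1
    return {"answer": 4} if attempts["n"] >= 3 else "raw text"

def verify(output) -> list[str]:
    return [] if isinstance(output, dict) else ["output must be a JSON object"]

print(run_task(lambda trace: [(flaky_execute, verify)]))  # (1, {'answer': 4})
```

In this run the subtask recovers on its second retry, so the task finishes within the first planning iteration and no replanning is needed.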
Experimental results
Accuracy on five datasets spanning QA, programming, and mathematics:
| Method | MultiHopRAG | HumanEval | BigCodeBench-Hard | GSM8K | Olympiads |
|---|---|---|---|---|---|
| ReAct (gpt-4o-mini) | 61.20% | 81.10% | 27.03% | 90.00% | 25.00% |
| MAP | 67.00% | 78.88% | 28.38% | 57.20% | 21.40% |
| MAP-V | 77.60% | 88.96% | 28.38% | 87.00% | 29.00% |
| VeriMAP | 78.20% | 93.92% | 40.54% | 93.60% | 41.20% |
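Key Findings:
- The largest gains are on the harder datasets. Relative to the ReAct (gpt-4o-mini) row in the table, VeriMAP gains 13.51 points on BigCodeBench-Hard (27.03% to 40.54%) and 16.20 points on Olympiads (25.00% to 41.20%)
- Compared to MAP-V: +4.96 points on HumanEval, +6.60 points on GSM8K
- With retries but only a single planning iteration (1it), accuracy falls relative to the full iteration budget: -3.44 points on HumanEval, -4.40 points on GSM8K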
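The gains can be recomputed from the table as percentage-point differences. The snippet below only restates the table's numbers; the `gain` helper is a name introduced here for illustration.

```python
# Accuracy (%) copied from the results table above.
results = {
    "ReAct": {"MultiHopRAG": 61.20, "HumanEval": 81.10,
              "BigCodeBench-Hard": 27.03, "GSM8K": 90.00, "Olympiads": 25.00},
    "MAP": {"MultiHopRAG": 67.00, "HumanEval": 78.88,
            "BigCodeBench-Hard": 28.38, "GSM8K": 57.20, "Olympiads": 21.40},
    "MAP-V": {"MultiHopRAG": 77.60, "HumanEval": 88.96,
              "BigCodeBench-Hard": 28.38, "GSM8K": 87.00, "Olympiads": 29.00},
    "VeriMAP": {"MultiHopRAG": 78.20, "HumanEval": 93.92,
                "BigCodeBench-Hard": 40.54, "GSM8K": 93.60, "Olympiads": 41.20},
}


def gain(dataset: str, baseline: str, method: str = "VeriMAP") -> float:
    """Percentage-point difference between `method` and `baseline` on one dataset."""
    return round(results[method][dataset] - results[baseline][dataset], 2)


# Print VeriMAP's gain over each baseline, one dataset per line.
for dataset in results["VeriMAP"]:
    print(dataset, {b: gain(dataset, b) for b in ("ReAct", "MAP", "MAP-V")})
```

Stating the baseline next to each delta matters here because the three baselines rank differently across datasets.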
Technical Highlights
Design advantages
- The planner carries the verification burden: the planner generates the VFs and the verifier only evaluates them, which greatly reduces the verifier's reasoning load
- Structured constraints reduce careless errors: named variables and structured I/O cut down on mistakes when context is passed between agents
- Self-correction: the executor can correct its own output using the feedback from failed VFs
- Small-model participation: executors can run on smaller models, which lowers cost
Implementation details
Python VF example:
    def verify_json_output(output: dict, required_keys: list) -> bool:
        """Check that the JSON output contains every required key."""
        return all(key in output for key in required_keys)
Natural-language VF example:
The summary must: 1) contain at least 3 key points; 2) be 100 to 300 words long; 3) use a professional tone.
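A natural-language VF needs a judge model to evaluate it. The sketch below shows one possible shape for that step, assuming the judge is any callable that maps a prompt string to a reply string. The prompt wording, the PASS/FAIL protocol, `check_nl_vf`, and `fake_llm` are all assumptions made for illustration, not the paper's prompts.

```python
from typing import Callable


def check_nl_vf(output: str, vf: str, llm: Callable[[str], str]) -> tuple[bool, str]:
    """Ask a judge model whether `output` satisfies one natural-language VF."""
    prompt = (
        "You are a verifier. Decide whether the OUTPUT satisfies the CRITERION.\n"
        f"CRITERION: {vf}\n"
        f"OUTPUT: {output}\n"
        "Reply with PASS or FAIL on the first line, then a one-sentence reason."
    )
    reply = llm(prompt).strip()
    # First line is the verdict; the rest is the explanation used for retries.
    verdict, _, reason = reply.partition("\n")
    return verdict.strip().upper() == "PASS", reason.strip()


# Stand-in for a real model call, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "FAIL\nThe summary lists only two key points."


print(check_nl_vf("Two points only.", "Contain at least 3 key points", fake_llm))
# (False, 'The summary lists only two key points.')
```

Anything other than an explicit PASS is treated as a failure, which matches the strict policy of failing a subtask whenever any VF does not pass.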
Coordinator pseudocode:
    MAX_RETRIES = 3  # per-subtask retry budget

    for node in DAG.topological_order():
        context = compile_context(node.inputs)
        output = executor.execute(node, context)
        vf_results = verifier.evaluate(node.vfs, output)

        retry_count = 0
        while any(vf_results.failed) and retry_count < MAX_RETRIES:
            # Re-run the subtask and check the new output, not the stale one.
            output = executor.execute(node, context)
            vf_results = verifier.evaluate(node.vfs, output)
            retry_count += 1

        if any(vf_results.failed):
            # Retry budget exhausted: hand the failure back to the planner.
            trigger_replan()
Application scenarios
Applicable scenarios
- Document processing pipeline: extraction → structuring → summarization → analysis
- Multi-step reasoning: search → reasoning → verification → synthesis
- Code generation and testing: generation → unit testing → integration testing → verification
- Mathematical problem solving: understanding → planning → calculation → answer verification
Adaptation strategy
- Tool-driven tasks: provide a sandboxed execution environment for programming and mathematics tasks
- Search-driven tasks: provide dedicated search tools for QA tasks
- Specialized agents: extend executors into dedicated agents (e.g. NL2SQL, information extraction)
Comparison with baselines
MAP (Multi-Agent Planning)
- Features: multi-agent collaboration without verification
- Advantages: parallel processing and a flexible division of labor
- Disadvantages: no verification, so errors propagate easily
MAP-V (MAP with verification)
- Features: adds a general-purpose LLM verifier
- Advantages: adds a verification layer
- Disadvantages: the verifier must understand the overall task, which makes its reasoning burden heavy
VeriMAP
- Features: the planner generates VFs and the verifier only evaluates them
- Advantages: the verifier needs no global view, and the VFs are precise and executable
- Disadvantages: the planner must generate the VFs, which raises planning cost
Practical deployment considerations
Cost and performance
- Planner: uses gpt-4.1 (a stronger model)
- Executor: uses gpt-4o-mini (cost-efficient)
- Verifier: can share a model with the executor
Latency considerations
- DAG execution: each subtask adds network and I/O latency
- Retry mechanism: up to 3 retries per subtask, which can increase execution time roughly 2-3x
- Replanning: triggered once retries are exhausted, adding roughly 5-10 seconds
Scalability
- Single-node failure: retry up to 3 times, then replan
- Global failure: capped at 5 planning iterations to prevent infinite loops
- DAG size: keep the number of subtasks below 20 to avoid excessive complexity
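The retry and iteration limits together bound how many executor calls one task can consume. The function below is a loose upper bound and a sketch only: it assumes every replanned DAG has at most `num_subtasks` nodes and counts one initial attempt plus `max_retries` retries per subtask. `worst_case_executor_calls` is a name introduced here.

```python
def worst_case_executor_calls(num_subtasks: int,
                              max_retries: int = 3,
                              max_iterations: int = 5) -> int:
    """Loose upper bound on executor calls for one task."""
    if min(num_subtasks, max_retries, max_iterations) < 0:
        raise ValueError("all arguments must be non-negative")
    attempts_per_subtask = 1 + max_retries  # initial attempt plus retries
    return num_subtasks * attempts_per_subtask * max_iterations


for n in (5, 20):
    print(n, worst_case_executor_calls(n))
# 5 100
# 20 400
```

The bound grows linearly with DAG size, which is one way to read the advice to keep plans small; each executor call is normally paired with a verifier call as well.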
Comparison with other frameworks
LangGraph
- Coordination: state machine + workflow
- Verification: requires custom checkers
- VF support: no built-in VF concept
AutoGen
- Coordination: conversational agents
- Verification: relies on agent self-evaluation
- VF support: no structured VFs
VeriMAP
- Coordination: directed acyclic graph + coordinator
- Verification: the planner generates VFs and the verifier evaluates them
- VF support: built-in Python and natural-language VFs
Limitations
- Planner Burden: VF generation requires the planner to understand the overall task
- Format dependency: Structured I/O requires the cooperation of downstream consumers
- Retry overhead: Multiple retries increase execution time
- VF coverage: VF needs to be designed for each subtask, which is expensive
Conclusion
VeriMAP integrates planning and verification into the collaborative process by having the planner generate the verification functions, which addresses careless errors and error propagation in multi-agent systems. In the experiments it performs best on the harder datasets: relative to the ReAct (gpt-4o-mini) row in the results table, it gains 13.51 points on BigCodeBench-Hard and 16.20 points on Olympiads.
Core Value:
- Reliability: Ensure output meets expectations through VF
- Explainability: Each subtask has clear verification criteria
- Scalability: DAG pattern supports complex task decomposition
- Cost control: executors can run on smaller models
Key Indicators:
- 93.92% on HumanEval (+4.96 points vs MAP-V)
- 93.60% on GSM8K (+6.60 points vs MAP-V)
- 40.54% on BigCodeBench-Hard (+12.16 points vs MAP, per the results table)
Deployment Recommendations:
- Start with a simple DAG (3-5 subtasks)
- Prioritize VF design for key subtasks
- Monitor retry rate and replanning frequency
- Evaluate VF coverage and accuracy
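The monitoring recommendations can be turned into a few simple ratios. This is an illustrative sketch only: the per-task log fields (`executions`, `retries`, `replans`, `subtasks`, `subtasks_with_vf`) and the `deployment_metrics` helper are hypothetical, and the sample numbers are made up for the example.

```python
def deployment_metrics(runs: list[dict]) -> dict[str, float]:
    """Aggregate per-task logs into retry rate, replans per task, and VF coverage."""
    executions = sum(r["executions"] for r in runs)
    retries = sum(r["retries"] for r in runs)
    subtasks = sum(r["subtasks"] for r in runs)
    covered = sum(r["subtasks_with_vf"] for r in runs)
    return {
        # Share of executor calls that were retries.
        "retry_rate": retries / executions if executions else 0.0,
        # Average number of replanning events per task.
        "replans_per_task": sum(r["replans"] for r in runs) / len(runs) if runs else 0.0,
        # Share of subtasks that had at least one VF attached.
        "vf_coverage": covered / subtasks if subtasks else 0.0,
    }


# Made-up logs for two tasks.
runs = [
    {"executions": 6, "retries": 2, "replans": 0, "subtasks": 4, "subtasks_with_vf": 4},
    {"executions": 10, "retries": 6, "replans": 1, "subtasks": 4, "subtasks_with_vf": 3},
]
print(deployment_metrics(runs))
# {'retry_rate': 0.5, 'replans_per_task': 0.5, 'vf_coverage': 0.875}
```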
References
- Xu, T., Zhang, D., Mitra, K., & Hruschka, E. (2025). Verification-Aware Planning for Multi-Agent Systems. arXiv:2510.17109.
- OpenTelemetry. (2024). An Introduction to Observability for LLM-based applications using OpenTelemetry.
- Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
- Li, L., et al. (2025). MAP: Multi-Agent-Planning for LLM Agents.