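VeriMAP: A Verification-Aware Planning System for Multi-Agent Collaboration

LLM agents increasingly tackle complex tasks through multi-agent collaboration, which brings new challenges in planning, coordination, and verification. This article introduces the VeriMAP framework, which integrates planning with verification and relies on structured I/O and verification functions (VFs) to keep collaboration reliable and interpretable.

April 11, 2026 · 6 min read · Beginner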

Summary

Large language model (LLM) agents increasingly rely on multi-agent collaboration to solve complex tasks. Collaboration, however, introduces new challenges in planning, coordination, and verification. This article introduces VeriMAP, a framework that integrates planning with verification and uses structured I/O and verification functions (VFs) to make collaboration reliable and interpretable.

Problem background

Traditional multi-agent systems often rely on ReAct-style reasoning or simple task decomposition and do not rigorously verify subtask outputs. Execution failures typically come from:

No output
Wrong format (for example, returning raw text instead of JSON)
Output that looks reasonable but violates downstream expectations

These failures are context-dependent: correctness in isolation is not enough, because the output must also match what the plan expects.
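The three failure types can be told apart mechanically. The sketch below is only an illustration: `classify_output` and the `required_keys` parameter are hypothetical names, and "downstream expectations" is reduced to "the keys the next subtask needs are present".

```python
import json

def classify_output(raw, required_keys):
    """Label a raw executor reply with one of the failure types listed above."""
    if raw is None or not raw.strip():
        return "no_output"
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "wrong_format"          # raw text instead of JSON
    if not isinstance(parsed, dict):
        return "wrong_format"          # JSON, but not the expected object
    # Well-formed JSON can still break the plan if a key the next subtask
    # relies on is missing.
    if any(key not in parsed for key in required_keys):
        return "violates_expectation"
    return "ok"

examples = ["", "The answer is 42", '{"result": 42}', '{"answer": 42}']
labels = [classify_output(raw, ["answer"]) for raw in examples]
print(labels)
```

The third example is the interesting one: it is valid JSON and arguably "correct", yet it fails because the plan expects the key `answer`.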

VeriMAP's four core modules

1. Verification-Aware Planner

Responsibilities: Decompose complex tasks into executable subtasks while generating verification functions (VFs) for each subtask.

Key Features:

Directed acyclic graph (DAG) planning: each node represents a subtask, and each edge represents a dependency between subtasks
Structured I/O: agent inputs and outputs must follow an explicit format such as JSON
Named variables: each I/O object has a unique, consistent name throughout the DAG
Python VFs: for functional tasks, the planner generates self-contained Python assertions that verify output type, format, and correctness
Natural-language VFs: for semantic tasks (such as summarization), the planner generates natural-language instructions that guide the verifier agent

Design concept: the planner converts global context into local checks, so the verifier only has to evaluate the predetermined VFs and does not need to understand the overall task.
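A plan with these properties can be sketched as a small data structure. This is a minimal illustration, not the framework's actual schema: the `Subtask` fields, the three-step example plan, and `execution_order` are all assumptions. It derives the DAG edges from the named variables and orders the subtasks with the standard-library `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Subtask:
    name: str
    instruction: str
    inputs: list            # names of the variables this subtask consumes
    output: str             # name of the variable this subtask produces
    vfs: list = field(default_factory=list)  # Python or natural-language VFs

# Hypothetical plan: each subtask carries its own local checks.
plan = [
    Subtask("extract", "Extract all numbers from the question.",
            ["question"], "numbers", ["assert isinstance(numbers, list)"]),
    Subtask("compute", "Add the numbers.",
            ["numbers"], "total", ["assert isinstance(total, (int, float))"]),
    Subtask("answer", "State the total in one sentence.",
            ["question", "total"], "final_answer",
            ["The sentence must mention the total explicitly."]),
]

def execution_order(plan):
    """Derive edges from named variables and return a topological order."""
    producer = {}
    for task in plan:
        if task.output in producer:
            raise ValueError(f"variable {task.output!r} is produced twice")
        producer[task.output] = task.name
    # Map each subtask to the subtasks that produce its inputs; variables
    # with no producer (such as "question") are external inputs.
    graph = {
        task.name: {producer[v] for v in task.inputs if v in producer}
        for task in plan
    }
    return list(TopologicalSorter(graph).static_order())

order = execution_order(plan)
print(order)
```

Because variable names are unique across the DAG, the edges never have to be written out by hand; `TopologicalSorter` raises `CycleError` if the plan is not acyclic.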

2. Executor

Responsibilities: Solve the assigned subtask using the planner's instructions and the upstream context.

Key Features:

Never sees the original global task; it receives only structured input
Uses ReAct or tool-calling patterns
Can run on a smaller, cost-effective model (gpt-4o-mini in the experiments)
Can be extended into special-purpose agents (such as information extraction or NL2SQL)
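One way to picture "structured input only" is a request builder that copies just the named variables a subtask declares. The function name, the payload keys, and the example variables below are hypothetical; the article does not specify the real prompt format.

```python
import json

def build_executor_request(instruction, input_names, output_name, variables):
    """Build the executor's input from the subtask's declared variables only."""
    payload = {
        "instruction": instruction,
        # Only the named upstream variables are copied; everything else in
        # the variable store, including the global task, stays hidden.
        "inputs": {name: variables[name] for name in input_names},
        "respond_with": {output_name: "<value>"},
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)

variables = {
    "global_task": "Answer the user's word problem.",
    "question": "What is 3 plus 4?",
    "numbers": [3, 4],
}
request = build_executor_request("Add the numbers.", ["numbers"], "total", variables)
print(request)
```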

3. Verifier

Responsibility: Evaluate VF conditions on the executor output.

Key Strategies:

Strict logical-AND policy: a subtask fails if and only if at least one of its VFs fails
LLM VFs: for natural-language verification, an LLM evaluates each (executor output, VF) pair
Python VFs: the Python code is executed directly
Error collection: on failure, the verifier collects explanations or error traces to guide the retry
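The verifier's policy can be sketched in a few lines. This is an illustration under assumptions: VFs are represented as hypothetical `(kind, body)` pairs, `stub_judge` is a deterministic stand-in for the LLM call, and `exec` is used without a sandbox, which a real deployment would not do with generated code.

```python
def run_python_vf(vf_source, variables):
    """Run one assertion-style VF; return (passed, error_text)."""
    namespace = dict(variables)  # copy so the VF cannot mutate shared state
    try:
        exec(vf_source, namespace)  # assumption: trusted or sandboxed code
        return True, ""
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"

def stub_judge(variables, instruction):
    """Placeholder for the LLM judge: passes if the summary has >= 3 words."""
    ok = len(variables.get("summary", "").split()) >= 3
    return ok, "" if ok else "summary has fewer than 3 words"

def verify(vfs, variables, nl_judge):
    """AND policy: pass only if every VF passes; collect all failure messages."""
    errors = []
    for kind, body in vfs:
        if kind == "python":
            passed, detail = run_python_vf(body, variables)
        else:
            passed, detail = nl_judge(variables, body)
        if not passed:
            errors.append(f"[{kind}] {body!r}: {detail}")
    return len(errors) == 0, errors

vfs = [
    ("python", "assert isinstance(total, int)"),
    ("python", "assert total >= 0, 'total must be non-negative'"),
    ("nl", "The summary must mention the total."),
]
ok_good, errs_good = verify(vfs, {"total": 5, "summary": "The total is 5"}, stub_judge)
ok_bad, errs_bad = verify(vfs, {"total": -1, "summary": "Bad"}, stub_judge)
print(ok_good, errs_good)
print(ok_bad, errs_bad)
```

Evaluating every VF instead of stopping at the first failure gives the executor the full list of problems to address on its next attempt.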

4. Coordinator

Responsibilities: The central coordinator for multi-agent task execution.

Core Mechanism:

Task sequencing and context management: subtasks run in topological order, and the necessary context is compiled before each execution
Execution and verification management: the output is verified after every execution, with up to 3 retries on failure
Error handling and replanning: once retries are exhausted, the coordinator collects the execution trace and triggers replanning, with a limit of 5 iterations
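The two nested budgets can be sketched as an inner retry loop and an outer replanning loop. Only the limits (3 retries, 5 iterations) come from the text; `run_plan`, `coordinate`, the dictionary-based plan format, and the toy planner, executor, and verifier are hypothetical stand-ins.

```python
MAX_RETRIES = 3      # per-subtask retries, as stated above
MAX_ITERATIONS = 5   # planning iterations, as stated above

def run_plan(plan, inputs, execute, verify):
    """Run subtasks in order; return (True, variables) or (False, trace)."""
    variables = dict(inputs)
    for task in plan:  # the plan is assumed to be in topological order
        context = {name: variables[name] for name in task["inputs"]}
        feedback = []
        for _attempt in range(1 + MAX_RETRIES):  # first try plus retries
            output = execute(task, context, feedback)
            passed, feedback = verify(task, output)
            if passed:
                variables[task["output"]] = output
                break
        else:  # no attempt passed: give the trace back for replanning
            return False, {"failed_task": task["name"], "errors": feedback}
    return True, variables

def coordinate(make_plan, inputs, execute, verify):
    """Replan with the failure trace until success or the budget runs out."""
    trace = None
    for _iteration in range(MAX_ITERATIONS):
        ok, result = run_plan(make_plan(trace), inputs, execute, verify)
        if ok:
            return result
        trace = result
    raise RuntimeError("iteration budget exhausted")

# Toy stand-ins: the executor returns the wrong type until it sees feedback.
calls = {"n": 0}

def toy_plan(trace):
    return [{"name": "double", "inputs": ["x"], "output": "y"}]

def flaky_execute(task, context, feedback):
    calls["n"] += 1
    value = context["x"] * 2
    return value if feedback else str(value)

def toy_verify(task, output):
    if isinstance(output, int):
        return True, []
    return False, ["expected int, got " + type(output).__name__]

result = coordinate(toy_plan, {"x": 21}, flaky_execute, toy_verify)
print(result, calls["n"])
```

In this toy run the first attempt fails verification, the error message is passed back as feedback, and the second attempt succeeds without any replanning.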

Experimental results

Evaluated on five datasets (QA, programming, mathematics):

Method	MultiHopRAG	HumanEval	BigCodeBench-Hard	GSM8K	Olympiads
ReAct (gpt-4o-mini)	61.20%	81.10%	27.03%	90.00%	25.00%
MAP	67.00%	78.88%	28.38%	57.20%	21.40%
MAP-V	77.60%	88.96%	28.38%	87.00%	29.00%
VeriMAP	78.20%	93.92%	40.54%	93.60%	41.20%

Key Findings:

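The largest gains are on the harder datasets: BigCodeBench-Hard improves by 12.16 points over MAP and MAP-V (28.38% to 40.54%) and by 13.51 points over ReAct, and Olympiads improves by 16.20 points over ReAct (25.00% to 41.20%)
Compared with MAP-V: +4.96 points on HumanEval and +6.60 points on GSM8K
Retry-only setting (1it) compared with the full iteration budget: HumanEval -3.44 points, GSM8K -4.40 points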
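The differences between methods can be recomputed directly from the table. In the sketch below the scores are copied from the table above, and `gain` is a hypothetical helper that returns the difference in percentage points.

```python
# Accuracy (%) per method and dataset, copied from the results table.
scores = {
    "ReAct":   {"MultiHopRAG": 61.20, "HumanEval": 81.10,
                "BigCodeBench-Hard": 27.03, "GSM8K": 90.00, "Olympiads": 25.00},
    "MAP":     {"MultiHopRAG": 67.00, "HumanEval": 78.88,
                "BigCodeBench-Hard": 28.38, "GSM8K": 57.20, "Olympiads": 21.40},
    "MAP-V":   {"MultiHopRAG": 77.60, "HumanEval": 88.96,
                "BigCodeBench-Hard": 28.38, "GSM8K": 87.00, "Olympiads": 29.00},
    "VeriMAP": {"MultiHopRAG": 78.20, "HumanEval": 93.92,
                "BigCodeBench-Hard": 40.54, "GSM8K": 93.60, "Olympiads": 41.20},
}

def gain(dataset, baseline):
    """VeriMAP's lead over a baseline, in percentage points."""
    return round(scores["VeriMAP"][dataset] - scores[baseline][dataset], 2)

for dataset in scores["VeriMAP"]:
    row = {name: gain(dataset, name) for name in ("ReAct", "MAP", "MAP-V")}
    print(dataset, row)
```

Stating the baseline next to each gain matters here, because the same dataset yields different deltas depending on which row it is compared against.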

Technical Highlights

Design advantages

The planner carries the verification burden: VFs are generated by the planner, so the verifier only has to evaluate them, which reduces its reasoning load
Structured constraints prevent careless slips: named variables and structured I/O reduce errors when context is passed between subtasks
Self-correction: the executor can correct its own output based on VF feedback
Small-model participation: executors can run on smaller models to reduce cost

Implementation details

Python VFs example:

def verify_json_output(output: dict, required_keys: list) -> bool:
    """檢查 JSON 輸出包含所需鍵"""
    return all(key in output for key in required_keys)

Natural Language VFs Example:

The summary must: 1) contain at least 3 key points; 2) be between 100 and 300 words long; 3) use a professional tone.

Coordinator pseudocode logic:

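MAX_RETRIES = 3  # per-subtask retry limit described above

for node in DAG.topological_order():
    context = compile_context(node.inputs)
    output = executor.execute(node, context)
    vf_results = verifier.evaluate(node.vfs, output)
    retry_count = 0
    # Retry while any VF still fails and the retry budget is not used up.
    while any(vf_results.failed) and retry_count < MAX_RETRIES:
        retry_count += 1
        output = executor.execute(node, context)
        # Re-evaluate the new output (the check must use the retry's results).
        vf_results = verifier.evaluate(node.vfs, output)
    if any(vf_results.failed):
        # Retries exhausted: hand control back to the planner.
        trigger_replan()
        break  # the current DAG is abandoned once replanning starts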

Application scenarios

Applicable scenarios

Document processing pipelines: extraction → structuring → summarization → analysis
Multi-step reasoning tasks: search → reasoning → verification → synthesis
Code generation and testing: generation → unit testing → integration testing → verification
Mathematical problem solving: understanding → planning → calculation → answer verification

Adaptation strategy

Tool-driven tasks: provide a sandboxed execution environment for programming and mathematics tasks
Search-driven tasks: provide dedicated search tools for QA tasks
Specialized agents: extend the executor into dedicated agents (e.g. NL2SQL, information extraction)

Comparison with baselines

MAP (Multi-Agent Planning)

Features: multi-agent collaboration, no verification
Advantages: parallel processing, flexible division of labor
Disadvantages: no verification, so errors propagate easily

MAP-V (MAP with verification)

Features: adds a generic LLM verifier
Advantages: adds a verification layer
Disadvantages: the verifier has to understand the overall task, which makes its reasoning burden heavy

VeriMAP

Features: the planner generates VFs, and the verifier only evaluates them
Advantages: the verifier needs no global view, and the VFs are precise and executable
Disadvantages: generating VFs increases planning cost

Actual deployment considerations

Cost and performance

Planner: uses gpt-4.1 (stronger model)
Executor: uses gpt-4o-mini (cost-efficient)
Verifier: can share a model with the executor

Latency considerations

DAG execution: each subtask adds network and I/O latency
Retry mechanism: up to 3 retries per subtask, which can increase execution time roughly 2-3x
Replanning: triggered after retries fail, adding about 5-10 seconds

Scalability

Single-node failure: retry up to 3 times, then replan
Global failure: capped at 5 iterations to prevent infinite loops
DAG size: keeping the number of subtasks below 20 is recommended to avoid excessive complexity
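These limits put a ceiling on how many executor calls a single task can trigger. The estimate below is a loose upper bound under two assumptions that the text does not confirm: every iteration re-executes the whole DAG, and the first plan counts as one of the 5 iterations. The function name is hypothetical.

```python
def worst_case_executor_calls(num_subtasks, max_retries=3, max_iterations=5):
    """Upper bound on executor calls for one task under the stated limits."""
    attempts_per_subtask = 1 + max_retries  # first attempt plus retries
    return num_subtasks * attempts_per_subtask * max_iterations

small_dag = worst_case_executor_calls(5)    # a simple starter DAG
large_dag = worst_case_executor_calls(20)   # the recommended size ceiling
print(small_dag, large_dag)
```

In practice replanning is triggered at the first subtask that exhausts its retries, so real runs stay well below this bound; it is still useful for setting cost alerts.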

Comparison with other frameworks

LangGraph

Coordination: state machine + workflow
Verification: requires custom checkers
VF support: no built-in VF concept

AutoGen

Coordination: conversational agents
Verification: relies on agent self-evaluation
VF support: no structured VFs

VeriMAP

Coordination: directed acyclic graph + coordinator
Verification: the planner generates VFs, and the verifier evaluates them
VF support: built-in Python and natural-language VFs

Limitations

Planner burden: generating VFs requires the planner to understand the overall task
Format dependency: structured I/O requires downstream consumers to cooperate
Retry overhead: multiple retries increase execution time
VF coverage: a VF has to be designed for every subtask, which adds overhead

Conclusion

VeriMAP has the planner generate the verification functions, which folds verification into the collaborative workflow and addresses both careless slips and error propagation in multi-agent systems. The reported experiments show the largest gains on the harder datasets: 40.54% on BigCodeBench-Hard (+12.16 points over MAP) and 41.20% on Olympiads (+16.20 points over ReAct).

Core Value:

Reliability: Ensure output meets expectations through VF
Explainability: Each subtask has clear verification criteria
Scalability: DAG pattern supports complex task decomposition
Cost Control: smaller models available for executors

Key Indicators:

93.92% on HumanEval (+4.96 points vs MAP-V)
93.60% on GSM8K (+6.60 points vs MAP-V)
40.54% on BigCodeBench-Hard (+12.16 points vs MAP)

Deployment Recommendations:

Start with a simple DAG (3-5 subtasks)
Prioritize VF design for key subtasks
Monitor the retry rate and replanning frequency
Evaluate VF coverage and accuracy

References

Xu, T., Zhang, D., Mitra, K., & Hruschka, E. (2025). Verification-Aware Planning for Multi-Agent Systems. arXiv:2510.17109.
OpenTelemetry. (2024). An Introduction to Observability for LLM-based applications using OpenTelemetry.
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
Li, L., et al. (2025). MAP: Multi-Agent-Planning for LLM Agents.