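NVIDIA Nemotron 3 Nano Omni: An infrastructure revolution in the era of multimodal agents

With its 30B-A3B hybrid Mamba-Transformer-MoE architecture, NVIDIA Nemotron 3 Nano Omni delivers roughly 9x higher throughput together with multimodal agentic reasoning, marking a qualitative shift for open-source multimodal models from perception to reasoning.

May 17, 2026 · 7 min read · Beginner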

Security Orchestration Interface Infrastructure Governance

This article is one route in OpenClaw's external narrative arc.

1. Executive summary

On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni (30B-A3B), arguably the first truly “omni-modal” understanding model in open source: beyond text, it natively integrates image, video, and audio understanding, as well as agentic computer use. Compared with the earlier vision-language-only Nemotron Nano V2 VL, this is a paradigm shift: from “perception” to “reasoning”, and from a “single agent” to a “multimodal agent”.

This article analyzes Nemotron 3 Nano Omni along four dimensions (architectural innovation, performance benchmarks, agent applications, and market strategy) to show how it redefines the boundaries of open-source multimodal models and what it means for the AI agent ecosystem.

2. Architectural innovation: Mamba + Transformer + MoE in one model

The architecture of Nemotron 3 Nano Omni is NVIDIA’s complete answer to “how to efficiently handle multi-modal long contexts.” Its core consists of three modules:

2.1 Hybrid Mamba-Transformer-MoE backend

23 Mamba selective state-space layers: handle long contexts efficiently and avoid the O(n²) cost that attention incurs on long sequences
23 MoE layers: 128 experts with top-6 routing, plus shared experts, providing conditional capacity
6 grouped-query attention (GQA) layers: retain strong global interaction and expressiveness

The design gives each layer type a distinct role: Mamba handles local patterns in long contexts (such as OCR and speech decoding), MoE provides conditional reasoning capacity (such as cross-modal reasoning), and GQA ensures that global semantic associations are not lost. This is purposeful layering rather than simple stacking.
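The “128 experts, top-6 routing” figure can be made concrete with a small sketch. The layer counts and expert numbers come from the article; the routing rule itself (pick the six highest router scores, then softmax over just those six) is one common convention and is an assumption here, not NVIDIA's documented router. Shared experts are not modeled, and `route_token` is a hypothetical helper name.

```python
import math
import random

# Layer counts and expert settings as stated in the article.
LAYERS = {"mamba": 23, "moe": 23, "gqa": 6}
NUM_EXPERTS = 128
TOP_K = 6


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (index, weight) pairs.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1. Real routers may normalize differently.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Random logits stand in for a learned router's output for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)

print("total backbone layers:", sum(LAYERS.values()))
print("experts used for this token:", [i for i, _ in assignment])
print(f"routed experts active per token: {TOP_K / NUM_EXPERTS:.1%}")
```

Only 6 of 128 routed experts (about 4.7%) run for any given token, which is the sense in which MoE capacity is “conditional”.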

2.2 Visual Encoder: C-RADIOv4-H

C-RADIOv4-H is Nemotron’s visual encoder. It inherits the visual capabilities of Nemotron Nano V2 VL but is now optimized for dynamic resolution: for dense documents, charts, and screenshots, the model adjusts the visual resolution to the content instead of using a fixed resolution.

2.3 Audio encoder: Parakeet-TDT-0.6B-v2

Parakeet-TDT-0.6B-v2 is an audio encoder specially designed for automatic speech recognition (ASR), supporting complex speech conditions such as multiple speakers, accents, and background noise. Compared with models that only support a single modality, Nemotron 3 Nano Omni’s audio understanding capabilities enable it to handle mixed inputs such as “screen recording + voice commentary”.

3. Performance benchmarks: the qualitative shift from perception to reasoning

3.1 Document understanding: from OCR to reasoning

Benchmark	Nemotron 3 Nano Omni	Nemotron Nano V2 VL	Qwen3-Omni 30B-A3B
OCRBenchV2-En	65.8	61.2	-
MMLongBench-Doc	57.5	38.0	49.5
CharXiv reasoning	63.6	41.3	61.1
ScreenSpot-Pro (GUI)	57.8	5.5	59.7
OSWorld	47.4	11.0	29.0

Key observations:

Nemotron 3 Nano Omni leads on MMLongBench-Doc (57.5 vs. 38.0 for V2 VL and 49.5 for Qwen3-Omni) and on CharXiv reasoning (63.6 vs. 41.3 and 61.1). This suggests it can not only read text but also reason across modalities, which is the shift from “perception” to “reasoning”
GUI understanding (ScreenSpot-Pro, OSWorld) is the core capability for agentic computer use. Nemotron 3 Nano Omni’s OSWorld score (47.4) is 4.3x that of V2 VL (11.0), and its ScreenSpot-Pro score rises from 5.5 to 57.8
Compared with Qwen3-Omni, Nemotron is stronger on document understanding, especially long-context reasoning (MMLongBench-Doc: 57.5 vs. 49.5), although Qwen3-Omni is slightly ahead on ScreenSpot-Pro (59.7 vs. 57.8)
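The ratios and margins quoted in these observations can be recomputed from the table. The sketch below only does arithmetic on the published scores; `DOC_SCORES` and `compare` are illustrative names, and `None` marks the score the table leaves blank.

```python
# Scores copied from the table above, in the order:
# (Nemotron 3 Nano Omni, Nemotron Nano V2 VL, Qwen3-Omni 30B-A3B)
DOC_SCORES = {
    "OCRBenchV2-En": (65.8, 61.2, None),
    "MMLongBench-Doc": (57.5, 38.0, 49.5),
    "CharXiv reasoning": (63.6, 41.3, 61.1),
    "ScreenSpot-Pro (GUI)": (57.8, 5.5, 59.7),
    "OSWorld": (47.4, 11.0, 29.0),
}


def compare(scores):
    """Return {benchmark: (ratio vs V2 VL, point gap vs Qwen3-Omni or None)}."""
    rows = {}
    for name, (omni, v2vl, qwen) in scores.items():
        ratio = round(omni / v2vl, 1)
        gap = None if qwen is None else round(omni - qwen, 1)
        rows[name] = (ratio, gap)
    return rows


rows = compare(DOC_SCORES)
for name, (ratio, gap) in rows.items():
    gap_text = "n/a" if gap is None else f"{gap:+.1f}"
    print(f"{name:22s} {ratio:5.1f}x vs V2 VL   {gap_text} points vs Qwen3-Omni")
```

A negative gap (as on ScreenSpot-Pro) means Qwen3-Omni scores higher on that benchmark.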

3.2 Video and audio understanding

Benchmark	Nemotron 3 Nano Omni	Nemotron Nano V2 VL	Qwen3-Omni 30B-A3B
Video-MME	72.2	63.0	70.5
WorldSense (Video+Audio)	55.4	-	54.0
DailyOmni	74.1	-	73.6
VoiceBench	89.4	-	88.8
HF Open ASR (lower is better)	5.95	-	6.55

Nemotron 3 Nano Omni edges out Qwen3-Omni on DailyOmni (74.1 vs. 73.6) and VoiceBench (89.4 vs. 88.8), indicating a modest advantage in combined audio-video understanding
The HF Open ASR score of 5.95 (lower is better) improves on Qwen3-Omni’s 6.55, meaning more accurate speech-to-text transcription
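Because the ASR metric is lower-is-better while the other rows are higher-is-better, it helps to express the ASR gap as a relative reduction. This is a minimal sketch using only the numbers in the table; `relative_reduction` is an illustrative helper name.

```python
def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` lowers a lower-is-better score."""
    return (baseline - candidate) / baseline


# HF Open ASR: Qwen3-Omni 6.55 vs. Nemotron 3 Nano Omni 5.95 (from the table).
asr_gain = relative_reduction(6.55, 5.95)
print(f"ASR error reduced by {asr_gain:.1%} relative to Qwen3-Omni")

# Higher-is-better rows, as (Nemotron, Qwen3-Omni); gaps are in points.
margins = {
    "Video-MME": round(72.2 - 70.5, 1),
    "WorldSense": round(55.4 - 54.0, 1),
    "DailyOmni": round(74.1 - 73.6, 1),
    "VoiceBench": round(89.4 - 88.8, 1),
}
for name, gap in margins.items():
    print(f"{name:11s} +{gap} points")
```

The ASR difference works out to roughly a 9% relative reduction, while the other leads over Qwen3-Omni are between 0.5 and 1.7 points.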

3.3 Efficiency: throughput and inference speed

Metric	Nemotron 3 Nano Omni	Nemotron Nano V2 VL
Multi-document throughput	7.4x	1x
Video throughput	9.2x	1x
Single-stream inference speed	2.9x	1x

Relative to Nemotron Nano V2 VL, Nemotron 3 Nano Omni achieves up to 9.2x higher throughput (on video) while maintaining inference quality. The gain comes from architectural innovation (Mamba + MoE) rather than from simply increasing the parameter count.
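One way to read these multipliers is as the share of baseline wall-clock time needed for the same workload. The sketch below assumes a fixed workload and that the published multipliers carry over to it directly, which real deployments may not match; the 10-hour baseline is a made-up figure for illustration.

```python
# Multipliers from the table above (baseline: Nemotron Nano V2 VL = 1x).
SPEEDUPS = {
    "multi-document throughput": 7.4,
    "video throughput": 9.2,
    "single-stream inference speed": 2.9,
}

BASELINE_HOURS = 10.0  # hypothetical job length on the baseline model


def time_share(speedup):
    """Fraction of the baseline time needed at a given speedup."""
    return 1.0 / speedup


for metric, speedup in SPEEDUPS.items():
    share = time_share(speedup)
    print(f"{metric:30s} {share:.1%} of baseline time "
          f"(~{BASELINE_HOURS * share:.1f} h instead of {BASELINE_HOURS:.0f} h)")
```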

4. Agent applications: from tool to agent

Nemotron 3 Nano Omni is explicitly designed for five categories of workloads:

Real-world document analysis: contracts, technical papers, reports, manuals, compliance packages
Automatic speech recognition: multiple speakers, accents, background noise
Long audio-video understanding: screen recordings with voice commentary, training videos, meetings
Agentic computer use: graphical user interface (GUI) agents, screenshot understanding, status monitoring
General multimodal reasoning: cross-modal multi-step reasoning, calculation, and evidence synthesis

These five workloads map directly onto the core needs of AI agents:

Document analysis → agents can read contracts and technical documents and carry out legal reasoning
Speech recognition → agents can handle voice input and voice output
Video understanding → agents can analyze screen recordings and understand user actions
Agentic computer use → agents can operate the GUI directly
Multimodal reasoning → agents can synthesize evidence across modalities to make decisions

5. Market strategy: NVIDIA’s ecosystem play

The release of Nemotron 3 Nano Omni marks a strategic advance by NVIDIA into AI agent infrastructure:

5.1 Ecosystem expansion from hardware to software

NVIDIA has long been known for its GPU hardware and CUDA ecosystem; the Nemotron series extends that into an open-source model ecosystem. Unlike Cerebras with its silicon hardware, NVIDIA pursues a three-pronged strategy of “hardware + open-source models + inference frameworks”.

5.2 The strategic significance of open source vs. closed source

The BF16, FP8, and NVFP4 weights for Nemotron 3 Nano Omni are all publicly available, which means:

Enterprises can self-host, avoiding the cost and privacy risks of cloud APIs
Agent frameworks can integrate local models directly, reducing dependence on cloud APIs
Multimodal agents can run on edge devices, which suits IoT and on-device applications
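A back-of-the-envelope estimate shows why the choice of weight format matters for self-hosting. The sketch assumes a nominal 30 billion total and 3 billion active parameters (read off the “30B-A3B” name) and counts weight storage only: quantization scale factors, the vision and audio encoders, the KV cache and Mamba state, and activations are all ignored, so real memory use will be higher.

```python
# Nominal sizes implied by the "30B-A3B" name; both are rough assumptions.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

# Bits per weight for each published format (scale factors ignored).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(fmt, params=TOTAL_PARAMS):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:6s} ~{weight_gb(fmt):.0f} GB of weights")

print(f"active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

Under these assumptions the weights alone take about 60 GB in BF16, 30 GB in FP8, and 15 GB in NVFP4. All experts must still be resident in memory even though only about a tenth of the parameters are used per token.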

5.3 Strategic comparison with Anthropic/Claude

Notably, Nemotron 3 Nano Omni outperforms Claude 3.5 Sonnet on document understanding and GUI agent tasks, particularly on OSWorld (47.4 vs. roughly 20-30 for Claude) and MMLongBench-Doc (57.5 vs. roughly 40-50 for Claude). For scenarios that need long-context document reasoning and GUI agents, Nemotron 3 Nano Omni is therefore a cost-effective alternative.

6. Impact on the OpenClaw ecosystem

6.1 Local multimodal agent inference

Nemotron 3 Nano Omni’s 30B-A3B MoE architecture allows it to run efficiently on a single GPU, which gives OpenClaw’s local agents new inference capabilities:

Document agent: the agent can read PDF and Word documents directly and reason across pages
GUI agent: the agent can understand screenshots and perform automated operations
Voice agent: the agent can handle voice input and output

6.2 Complementary relationship with existing models

Nemotron 3 Nano Omni should be seen not as a replacement for existing models but as a specialized complement:

Text reasoning: Claude and GPT-5.4 remain the strongest choices
Document understanding: Nemotron 3 Nano Omni has a significant advantage in document reasoning and GUI agents
Speech understanding: Nemotron 3 Nano Omni leads in ASR and combined audio-video understanding
Multimodal agents: Nemotron 3 Nano Omni is the first truly multimodal agent model in open source
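The division of labor in this list could be expressed as a simple routing table in an agent framework. This is a hypothetical sketch, not an OpenClaw API: `Task`, `ROUTES`, and `pick_model` are invented names, `CLOUD_TEXT_MODEL` is a placeholder string, and the local model identifier is the FP8 repository name from the reference list.

```python
from enum import Enum


class Task(Enum):
    TEXT_REASONING = "text_reasoning"
    DOCUMENT_UNDERSTANDING = "document_understanding"
    GUI_AGENT = "gui_agent"
    SPEECH_UNDERSTANDING = "speech_understanding"


# Repository name taken from the reference list; the choice of FP8 is arbitrary.
LOCAL_MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8"
# Placeholder for whichever cloud text model a deployment uses.
CLOUD_TEXT_MODEL = "cloud-text-model"

# Mirrors the list above: text reasoning goes to a cloud model,
# document, GUI, and speech tasks go to the local multimodal model.
ROUTES = {
    Task.TEXT_REASONING: CLOUD_TEXT_MODEL,
    Task.DOCUMENT_UNDERSTANDING: LOCAL_MODEL,
    Task.GUI_AGENT: LOCAL_MODEL,
    Task.SPEECH_UNDERSTANDING: LOCAL_MODEL,
}


def pick_model(task):
    """Return the model identifier configured for a task type."""
    return ROUTES[task]


for task in Task:
    print(f"{task.value:24s} -> {pick_model(task)}")
```

Keeping the mapping in data rather than in branching code makes it easy to re-route a task type as benchmarks or costs change.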

7. Conclusion

The release of NVIDIA Nemotron 3 Nano Omni marks a qualitative shift in open-source multimodal models from “perception” to “reasoning”. Its hybrid Mamba-Transformer-MoE architecture achieves up to 9.2x higher throughput than Nemotron Nano V2 VL while maintaining inference quality. This is not only a technical achievement but also a revolution in AI agent infrastructure: agents are no longer just tools but true multimodal reasoning entities.

For the OpenClaw ecosystem, Nemotron 3 Nano Omni provides a local, open-source, multimodal option for agent inference that is particularly well suited to document analysis, GUI agents, and speech understanding. It complements Anthropic/Claude’s cloud API and gives enterprises a more cost-effective and privacy-preserving choice.

References

NVIDIA Nemotron 3 Nano Omni official report: https://arxiv.org/abs/2604.24954
Nemotron 3 Nano Omni BF16 weights: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Nemotron 3 Nano Omni FP8 weights: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8
Nemotron 3 Nano Omni NVFP4 weights: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
MMLongBench-Doc: https://huggingface.co/spaces/OpenIXCLab/mmlongbench-doc
OCRBenchV2: https://99franklin.github.io/ocrbench_v2/
WorldSense: https://jaaackhongggg.github.io/WorldSense/#leaderboard
DailyOmni: https://lliar-liar.github.io/Daily-Omni/#leaderboard
VoiceBench: https://matthewcym.github.io/VoiceBench/