The Ultimate Practice of Digital Sovereignty: A “Brute-Force” Deployment and Performance Test of GPT-OSS-120B on a Tesla V100


Date: 2026-02-07
Author: JK
Category: AI System Architecture, High Performance Computing (HPC), Digital Sovereignty

🌅 Introduction: When the roar of the server becomes the pulse of intelligence

On the road to a “relentless pursuit of understanding”, a cloud API’s 429 error code (Rate Limit Exceeded) is the noise creators least want to hear. It is a reminder that if your “brain” depends on someone else’s authorization, your thinking has boundaries.

Today, in the lab, I completed a full deployment of GPT-OSS-120B with CUDA acceleration. This was not only a successful technical test but also a declaration of “digital sovereignty.” When the Tesla V100’s fans spun up and part of the model’s 120-billion-parameter weights loaded into VRAM, what I heard was not machine noise but the pulse of private intelligence.

1. Core research: running a 120B-class brain locally

The core challenge of this experiment was getting a 120B-parameter model to run smoothly on a single Tesla V100 (16GB VRAM). Conventional wisdom treats this as nearly impossible, but with an optimized CUDA build of llama.cpp and a hybrid computing architecture, we made it work.

1.1 CUDA build and GPU offloading

We built with the nvcc 12.0.140 compiler and enabled CUDA graph optimization (USE_GRAPHS=1). The experiment log shows:

Model file: unsloth_gpt-oss-120b-GGUF_gpt-oss-120b-F16.gguf (F16 high-precision version)
Hardware environment: Tesla V100-SXM2-16GB
Optimization result: 8 of 37 layers (21.6%) successfully offloaded to the GPU.

Offloading these 8 layers is what makes the setup “fast, fierce and precise.” It places the output layer and the KV cache in VRAM, which greatly reduces data round-trip latency during token generation.
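The offload ratio above can be reproduced with a few lines of Python. The sketch below also assembles a llama.cpp command line that requests the same split. It is an illustration under assumptions, not the author's actual launch script: the use of `llama-server` as the entry point and the helper names `offload_fraction` and `build_server_args` are hypothetical, while the layer counts, model file name and 16384-token context come from the article.

```python
from typing import List


def offload_fraction(gpu_layers: int, total_layers: int) -> float:
    """Return the share of the model's layers that run on the GPU."""
    if total_layers <= 0 or not 0 <= gpu_layers <= total_layers:
        raise ValueError("need total_layers > 0 and 0 <= gpu_layers <= total_layers")
    return gpu_layers / total_layers


def build_server_args(model_path: str, gpu_layers: int, ctx_size: int) -> List[str]:
    """Build an argv list for llama.cpp's llama-server.

    Assumption: the article does not say which llama.cpp binary was used;
    llama-server is chosen here only for illustration.
    """
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(gpu_layers),  # layers offloaded to the GPU
        "--ctx-size", str(ctx_size),        # maximum context length in tokens
    ]


if __name__ == "__main__":
    # Figures reported in the article: 8 of 37 layers on the GPU.
    print(f"{offload_fraction(8, 37):.1%} of layers on the GPU")
    args = build_server_args(
        "unsloth_gpt-oss-120b-GGUF_gpt-oss-120b-F16.gguf", 8, 16384
    )
    print(" ".join(args))
```

Keeping the layer count as a single parameter makes it easy to retry with a different split if the model buffer no longer fits in VRAM.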

1.2 Performance comparison: from “waiting” to “real-time”

In CPU-only mode, the 120B model’s responses are often measured in minutes. After this CUDA optimization, the test results are encouraging:

Time to first token (TTFT): reduced to 1.5 to 2.0 seconds for short text requests.
Prompt processing speed: 13.31 tokens/second.
Generation speed: stable at 6.97 tokens/second. For a model with 120 billion parameters, this means it is already practical for productive work.
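To see what these throughput figures mean in practice, the rough sketch below estimates end-to-end response time from the two reported rates. It assumes prompt processing and generation run strictly one after the other at constant speed and ignores every other overhead, so treat the result as a ballpark figure. The function name `estimate_response_seconds` is hypothetical.

```python
PROMPT_TOKENS_PER_SECOND = 13.31  # prompt processing speed reported in the article
GEN_TOKENS_PER_SECOND = 6.97      # generation speed reported in the article


def estimate_response_seconds(
    prompt_tokens: int,
    output_tokens: int,
    prompt_tps: float = PROMPT_TOKENS_PER_SECOND,
    gen_tps: float = GEN_TOKENS_PER_SECOND,
) -> float:
    """Estimate total seconds to read a prompt and generate a reply.

    Simplifying assumption: constant throughput in both phases, no overlap.
    """
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    prefill_seconds = prompt_tokens / prompt_tps
    decode_seconds = output_tokens / gen_tps
    return prefill_seconds + decode_seconds


if __name__ == "__main__":
    # A short 20-token prompt: prefill alone takes about 1.5 s,
    # in line with the reported time to first token for short requests.
    print(f"prefill for 20 tokens: {estimate_response_seconds(20, 0):.2f} s")
    # A 100-token prompt followed by a 256-token answer.
    print(f"100 in / 256 out: {estimate_response_seconds(100, 256):.1f} s")
```

Under these assumptions a 256-token answer to a 100-token prompt takes roughly three quarters of a minute, which is slow next to hosted APIs but workable for interactive use.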

2. Cross-domain connections: the macro-level inevitability of local LLMs

Deploying GPT-OSS-120B locally is not just a hardware enthusiast’s hobby. Set against the current “war at the summit” in the AI world (Anthropic’s agentic collaboration and OpenAI’s low-latency push), a clear trend emerges: **the frontier of intelligence is shifting from “breadth” to “depth and privatization.”**

The market’s current fear of AI stems from data leaks and supply chain attacks. If a scientist’s research ideas are fully exposed in a cloud model provider’s logs, the moat around their originality is gone. The significance of this deployment is that it shows even a “heavyweight” 120B brain can make its home in a private data center through flexible compute partitioning (split computing).

3. Technical deep dive: why does offloading 8 layers transform the experience?

Why does offloading only 8 layers bring such a clear improvement with 16GB of VRAM? The answer lies in the underlying memory architecture:

KV cache prioritized in VRAM: We allocated 128 MiB of VRAM to the KV cache. As the conversation context grows, fast GPU access to the cache keeps long conversations from crashing the system.
CUDA graph optimization (USE_GRAPHS): Pre-recording the computation graph reduces the CPU’s kernel launch overhead. In the maximum-context test of 16384 tokens, the system still delivered a stable response in 38.1 seconds under heavy load, thanks to precise scheduling of compute resources.
Hybrid compute pipeline: The log shows CUDA0 model buffer size = 12794.45 MiB. We pushed VRAM usage to the 85% threshold, leaving a 2,154 MiB buffer for bursty loads during computation.
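The 85% figure can be checked against the two numbers from the log. The sketch below assumes the ratio is the model buffer divided by the sum of the model buffer and the free headroom. That is one plausible reading of the article, not something it states, and whether the 128 MiB KV cache sits inside or outside the headroom is left open. The helper name `vram_utilization` is hypothetical.

```python
MODEL_BUFFER_MIB = 12794.45  # "CUDA0 model buffer size" from the log
HEADROOM_MIB = 2154.0        # free buffer reported in the article


def vram_utilization(used_mib: float, free_mib: float) -> float:
    """Return used / (used + free) as a fraction between 0 and 1."""
    total_mib = used_mib + free_mib
    if used_mib < 0 or free_mib < 0 or total_mib == 0:
        raise ValueError("sizes must be non-negative and not both zero")
    return used_mib / total_mib


if __name__ == "__main__":
    ratio = vram_utilization(MODEL_BUFFER_MIB, HEADROOM_MIB)
    total = MODEL_BUFFER_MIB + HEADROOM_MIB
    print(f"usable VRAM assumed: {total:.2f} MiB, model buffer share: {ratio:.1%}")
```

Under this reading the model buffer takes a little under 86% of the usable memory, which matches the “85% threshold” described above.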

🏁 JK Reflection

Development is a by-product of curiosity, but real power comes from taking control.

When I saw gpt-oss-120b running stably around the clock in the active (running) state on the test-gpu node, I realized that future Creators should not just be people who can write prompts, but people who can command computing power.

Our pursuit of “understanding” means understanding not only the answers AI gives, but also the “physical mechanism” that produces them. When you own a 120B local brain, you are no longer a tenant of a subscription service; you are the creator of your own domain of thought.

What JK wants to ask you this time: **In an era when AI performance and privacy sovereignty cannot both be had, how much “convenience” are you willing to sacrifice for true “data freedom”?** **If everyone runs a 120B local brain at home, will that advance the “collective consensus” of human civilization, or mark the beginning of a slide into “digital islands”?**

Posted on jackykit.com
In-depth self-analysis by the “Cheese Legion” local brain (gpt-oss-120b), synced to GitHub