Google Research Simula: Reasoning-First Synthetic Data Generation Framework 🐯

Date: April 18, 2026 | Category: Frontier AI Applications | Reading time: 22 minutes

Introduction: From data generation to programmatic workflow

In 2026, AI models are shifting from “observers” to “builders”. Google Research's Simula framework marks that turning point: rather than simply generating more data, it treats data generation as a programmatic workflow to be designed, which changes how AI training data is built.

The release highlights three shifts: from “more data” to “better data”, from “random sampling” to “programmatic design”, and from “black-box generation” to an “interpretable framework”.

Simula: A reasoning-first synthetic data generation framework

Core concept: Data as code

1. Why is Simula needed?

Current AI models depend on massive amounts of internet data, but that approach breaks down in the following scenarios:

Specialized fields: data in medicine, law, finance and similar areas is scarce or sensitive
Novel applications: the model must be adapted to scenarios that have not yet occurred
Safety and compliance: real-world edge cases cannot be obtained

Traditional methods vs Simula:

Feature	Traditional methods	Simula
Data source	Internet, real world	Synthetic, programmatically generated
Generation method	Manual prompts, evolutionary algorithms	Reasoning-driven, programmatic design
Explainability	Black-box evolution steps	Explainable reasoning chain
Control granularity	Sample level	Dataset-level design
Reproducibility	Low	High (version control)

2. Simula architecture: four-axis control framework

# Illustrative pseudocode: the four component classes are schematic placeholders.
class SimulaDataset:
    def __init__(self):
        self.global_diversification = TaxonomySampling()
        self.local_diversification = MetaPromptGenerator()
        self.complexification = DifficultyScaler()
        self.quality_checks = DualCriticLoop()

    def generate(self, domain: str, target_model: Model):
        # Step 1: global diversification
        taxonomy = self.global_diversification.map_concept_space(domain)
        samples = self.global_diversification.sample_over_taxonomy(taxonomy)

        # Step 2: local diversification
        meta_prompts = self.local_diversification.generate_meta_prompts(samples)

        # Step 3: complexification
        enhanced_prompts = self.complexification.refine(meta_prompts)

        # Step 4: quality checks
        validated_data = self.quality_checks.dual_critic_loop(enhanced_prompts)

        return validated_data

Four-axis control mechanism

Axis 1: Global Diversification

Core idea: Map the concept space into a hierarchical taxonomy so that the long tail of the distribution is covered.

Implementation details:

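# Example taxonomy (cybersecurity)
taxonomy = {
    "SQL injection": {
        "subtypes": [
            "time-based blind",
            "boolean-based blind",
            "error-based"
        ],
        "complexity": ["basic", "intermediate", "advanced"]
    },
    "Cross-site scripting (XSS)": {
        "subtypes": [
            "reflected XSS",
            "stored XSS",
            "DOM-based XSS"
        ]
    }
}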

Effects:

Avoids mode collapse
Covers long-tail cases
Covers sparse but important scenarios
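One way to picture sampling over a taxonomy is the minimal sketch below. It assumes a flattened two-level taxonomy and a round-robin rule (every leaf is visited once before any leaf repeats); `TAXONOMY`, `leaf_paths` and `sample_over_taxonomy` are hypothetical names for illustration, not Simula's actual interface.

```python
import random
from collections import Counter

# Hypothetical two-level taxonomy in the spirit of the cybersecurity example.
TAXONOMY = {
    "SQL injection": ["time-based blind", "boolean-based blind", "error-based"],
    "Cross-site scripting (XSS)": ["reflected", "stored", "DOM-based"],
}

def leaf_paths(taxonomy):
    """Flatten the taxonomy into (category, subtype) leaf paths."""
    return [(cat, sub) for cat, subs in taxonomy.items() for sub in subs]

def sample_over_taxonomy(taxonomy, n_samples, seed=0):
    """Visit every leaf once (in shuffled order) before repeating any leaf."""
    rng = random.Random(seed)
    leaves = leaf_paths(taxonomy)
    picks = []
    while len(picks) < n_samples:
        batch = leaves[:]
        rng.shuffle(batch)
        picks.extend(batch)
    return picks[:n_samples]

picks = sample_over_taxonomy(TAXONOMY, n_samples=12)
counts = Counter(picks)
for leaf, n in sorted(counts.items()):
    print(leaf, n)
```

Because allocation happens per leaf rather than per raw sample, a rare subtype receives the same budget as a common one, which is the property the long-tail argument relies on.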

Axis 2: Local Diversification

Core idea: Generate multiple different instances for each concept to prevent duplication.

Implementation details:

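# Meta-prompt generation (illustrative; slot values are supplied by the caller)
def generate_meta_prompts(concept: str, scenario: str, perspective: str, context: str) -> list[str]:
    return [
        f"Create a {scenario} scenario for {concept}",
        f"Present {concept} from a {perspective} perspective",
        f"Describe the {context} background of {concept}",
    ]

# Instantiation: generate_instantiation is a placeholder for the LLM call
meta_prompts = generate_meta_prompts(concept, scenario, perspective, context)
instances = [generate_instantiation(prompt) for prompt in meta_prompts]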
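As a runnable illustration of the same idea, the sketch below expands a single concept into several distinct meta-prompts by taking the Cartesian product of slot values. The template wording, the slot lists and the function name `expand_meta_prompts` are assumptions made for this example.

```python
from itertools import product

# Hypothetical template; the slot names are illustrative.
TEMPLATE = "Write a {scenario} scenario about {concept} from the perspective of {perspective}."

def expand_meta_prompts(concept, scenarios, perspectives):
    """Return one meta-prompt per (scenario, perspective) combination."""
    return [
        TEMPLATE.format(concept=concept, scenario=s, perspective=p)
        for s, p in product(scenarios, perspectives)
    ]

meta_prompts = expand_meta_prompts(
    "SQL injection",
    scenarios=["training", "real-world"],
    perspectives=["an attacker", "a defender"],
)
for prompt in meta_prompts:
    print(prompt)
```

Two scenarios and two perspectives yield four different prompts for the same concept, so repeated sampling of that concept does not produce near-duplicates.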

Axis 3: Complexification

Core idea: Treat complexity as an orthogonal axis so that difficulty can be raised in a controlled way.

Implementation details:

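from random import random

# add_constraint, add_context and add_noise are placeholder helpers.
def complexify(prompt: Prompt, fraction: float = 0.3):
    # Only a fraction of the prompts is made harder
    if random() < fraction:
        # Add a constraint
        prompt = add_constraint(prompt, constraint_type="edge_case")

        # Add context
        prompt = add_context(prompt, context="ambiguous")

        # Add distractors
        prompt = add_noise(prompt, noise_level="high")
    return prompt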
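A self-contained variant of this step might look like the sketch below. It assumes prompts are plain strings and that "making a prompt harder" means appending one constraint drawn from a small pool; `CONSTRAINTS` and `complexify_batch` are hypothetical. A seeded generator keeps the selection reproducible.

```python
import random

# Hypothetical constraint pool; a real pipeline would derive these per domain.
CONSTRAINTS = [
    "the input contains an ambiguous field",
    "the log is truncated mid-record",
    "two indicators point to different causes",
]

def complexify(prompt: str, rng: random.Random, fraction: float = 0.3) -> str:
    """With probability `fraction`, append one extra constraint to the prompt."""
    if rng.random() < fraction:
        return f"{prompt} Additional constraint: {rng.choice(CONSTRAINTS)}."
    return prompt

def complexify_batch(prompts, fraction=0.3, seed=0):
    """Apply `complexify` to every prompt using one seeded generator."""
    rng = random.Random(seed)
    return [complexify(p, rng, fraction) for p in prompts]

prompts = [f"Classify alert {i}." for i in range(10)]
harder = complexify_batch(prompts, fraction=0.3, seed=42)
changed = sum(h != p for h, p in zip(harder, prompts))
print(f"{changed} of {len(prompts)} prompts were made harder")
```

Setting `fraction` to 0.0 leaves every prompt untouched and 1.0 modifies all of them, which makes the knob easy to sanity-check before tuning it.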

Axis 4: Quality Checks

Core idea: A dual-critic loop that independently verifies the correctness of each answer.

Implementation details:

# Illustrative pseudocode: _create_critic is a placeholder for building a critic model.
class DualCritic:
    def __init__(self, teacher_model: Model):
        self.critic1 = self._create_critic()
        self.critic2 = self._create_critic()
        self.teacher = teacher_model

    def evaluate(self, answer: str, rubric: Rubric):
        # Critic 1 scores the answer
        score1, evidence1 = self.critic1.score(answer, rubric)

        # Critic 2 scores the answer
        score2, evidence2 = self.critic2.score(answer, rubric)

        # The teacher model verifies the answer
        teacher_score, teacher_evidence = self.teacher.verify(answer, rubric)

        # Combined score: unweighted mean of the three scores
        final_score = (score1 + score2 + teacher_score) / 3

        return final_score, {
            "evidence1": evidence1,
            "evidence2": evidence2,
            "teacher_evidence": teacher_evidence
        }
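For comparison with score averaging, here is a minimal sketch of a stricter acceptance rule: keep a sample only when two independent critics both approve it. The critics are simple stand-in functions (an exact-match check against a tiny reference table and a format check), which is an assumption for the sake of a runnable example; in practice each critic would be a model call. This is not presented as Simula's actual rule.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A critic maps (question, answer) to an accept/reject decision.
Critic = Callable[[str, str], bool]

@dataclass
class Verdict:
    accepted: bool
    votes: Tuple[bool, bool]

# Hypothetical reference answers used by the stand-in critic below.
REFERENCE = {"2 + 2": "4"}

def exact_match_critic(question: str, answer: str) -> bool:
    """Accept only if the answer equals the stored reference answer."""
    return REFERENCE.get(question) == answer.strip()

def format_critic(question: str, answer: str) -> bool:
    """Accept only if the answer is a bare non-negative integer."""
    return answer.strip().isdigit()

def dual_critic_filter(question: str, answer: str,
                       critic_a: Critic, critic_b: Critic) -> Verdict:
    """Accept the sample only when both critics agree that it is acceptable."""
    votes = (critic_a(question, answer), critic_b(question, answer))
    return Verdict(accepted=all(votes), votes=votes)

good = dual_critic_filter("2 + 2", "4", exact_match_critic, format_critic)
wrong = dual_critic_filter("2 + 2", "5", exact_match_critic, format_critic)
malformed = dual_critic_filter("2 + 2", "four", exact_match_critic, format_critic)
print(good, wrong, malformed)
```

Requiring unanimity trades recall for precision: a well-formed but wrong answer is rejected even though one critic approved it, whereas an average could let it through.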

Reasoning-driven evaluation metrics

Traditional metrics vs reasoning-driven metrics:

Metric type	Traditional methods	Simula
Diversity	Embedding cosine distance	Taxonomic Coverage
Complexity	Score statistics	Calibrated Complexity Scoring
Quality	Label accuracy	Dual-critic verification

1. Taxonomic Coverage

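from statistics import mean

# count_covered is a placeholder helper that counts samples matching a concept.
def taxonomic_coverage(dataset: Dataset, taxonomy: Taxonomy) -> float:
    # Compute the coverage of each concept category
    coverage = {}
    for concept in taxonomy.concepts:
        covered = count_covered(dataset, concept)
        coverage[concept] = covered / taxonomy[concept].expected_samples

    # Overall coverage: mean over all concepts
    avg_coverage = mean(coverage.values())
    return avg_coverage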
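A runnable version of the coverage computation could look like the sketch below. It assumes each sample already carries a concept label and that the taxonomy specifies a target count per concept. It also caps each concept's coverage at 1.0, a design choice made here (not taken from the source) so that an over-represented concept cannot hide a missing one in the average.

```python
from collections import Counter

def coverage_report(labels, expected):
    """Return (mean coverage, per-concept coverage), each capped at 1.0."""
    counts = Counter(labels)
    per_concept = {
        concept: min(1.0, counts[concept] / target)
        for concept, target in expected.items()
    }
    return sum(per_concept.values()) / len(per_concept), per_concept

# Hypothetical dataset labels and per-concept targets.
labels = ["sqli"] * 8 + ["xss"] * 2
expected = {"sqli": 4, "xss": 4, "csrf": 4}

avg_coverage, per_concept = coverage_report(labels, expected)
print(avg_coverage, per_concept)
```

Here the dataset is heavy on one concept, half-covers a second and misses a third entirely, so the mean coverage is 0.5 even though the dataset holds ten samples against a total target of twelve.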

2. Calibrated Complexity Scoring

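from statistics import mean

def calibrated_complexity(dataset: Dataset, teacher_model: Model) -> float:
    # Ask the teacher model for pairwise difficulty comparisons on a batch
    comparisons = teacher_model.compare_pairs(dataset[:100])

    # Compute Elo ratings from the comparisons
    elo_ratings = compute_elo(comparisons)

    # Return the mean rating. With zero-sum Elo updates and equal starting
    # ratings the pool mean never moves, so per-sample ratings carry the signal.
    avg_rating = mean(elo_ratings)
    return avg_rating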
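The `compute_elo` step can be sketched with the standard Elo update rule. The example assumes each comparison is a `(harder, easier)` pair of sample identifiers, as a teacher model might return; the K-factor of 32 and the starting rating of 1000 are conventional defaults chosen for illustration.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the item rated r_a 'wins' against the item rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def compute_elo(comparisons, k: float = 32.0, initial: float = 1000.0):
    """Update ratings from (harder, easier) pairs; 'harder' counts as the winner."""
    ratings = {}
    for harder, easier in comparisons:
        r_h = ratings.setdefault(harder, initial)
        r_e = ratings.setdefault(easier, initial)
        delta = k * (1.0 - expected_score(r_h, r_e))
        ratings[harder] = r_h + delta  # the winner gains
        ratings[easier] = r_e - delta  # the loser gives up the same amount
    return ratings

# Hypothetical judgments: c is harder than b, b harder than a, c harder than a.
comparisons = [("c", "b"), ("b", "a"), ("c", "a")]
ratings = compute_elo(comparisons)
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

The resulting order places c above b above a, while the ratings still sum to three times the starting value. The per-sample ratings (or their spread) are therefore the informative output, not the pool mean.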

In production: from research to real-world use

Internal use cases at Google

1. Gemma Ecosystem Support

# ShieldGemma: safety classifier (illustrative pseudocode; call signatures are schematic)
shield_gemma = Gemma(
    model="shieldgemma-2",
    training_data=Simula.generate(
        domain="safety",
        target_model="ShieldGemma",
        use_cases=["hate_speech", "violence", "sexual_content"]
    )
)

# FunctionGemma: tool calling
function_gemma = Gemma(
    model="functiongemma",
    training_data=Simula.generate(
        domain="tool_use",
        target_model="FunctionGemma",
        use_cases=["api_calls", "data_extraction"]
    )
)

# MedGemma: medical domain
med_gemma = Gemma(
    model="medgemma",
    training_data=Simula.generate(
        domain="medical",
        target_model="MedGemma",
        use_cases=["diagnosis", "treatment_plan"]
    )
)

2. Synthetic data for safety classifiers

# Generate safety-classification data (illustrative pseudocode)
use_cases = [
    "malware_detection",
    "phishing_detection",
    "exploit_prevention",
    "data_leak_prevention",
]
safety_data = Simula.generate(
    domain="cybersecurity",
    target_model="Gemini safety classifier",
    use_cases=use_cases,
    taxonomies=["MITRE ATT&CK", "CVE", "OWASP"],
)

# Dataset size per use case
for category in use_cases:
    dataset = safety_data[category]
    print(f"{category}: {len(dataset)} samples")

Expected results:

Data size: up to 512K samples per domain
Quality: higher downstream performance with fewer samples
Coverage: 5 domains (cybersecurity, legal reasoning, mathematics, academic knowledge, multilingual)

3. User protection features

# Real-time safety classification (illustrative pseudocode)
class UserProtection:
    def __init__(self):
        self.classifier = GeminiSafetyClassifier(
            training_data=Simula.generate(
                domain="safety",
                target_model="Gemini safety classifier"
            )
        )

    def analyze_message(self, message: Message) -> dict:
        # Immediate classification
        label = self.classifier.classify(message)

        # Detailed analysis
        details = self.classifier.analyze(message)

        return {
            "label": label,
            "severity": details.severity,
            "category": details.category,
            "explanation": details.explanation
        }

Cross-domain generality check

Test domains and evaluation results:

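# Domain 1: cybersecurity (illustrative pseudocode)
cybersecurity_results = Simula.evaluate(
    benchmarks=["CTI-MCQ", "CTI-RCM"],
    target_model="Gemini 2.5 Flash"
)
# Result: no CTI-specific figure is given (the 10% accuracy gain is on GSM8k, see domain 3)

# Domain 2: legal reasoning
legal_results = Simula.evaluate(
    benchmarks=["LEXam"],
    target_model="Gemini 2.5 Flash"
)
# Result: performance dropped (the teacher model is weaker in this domain)

# Domain 3: mathematical reasoning
math_results = Simula.evaluate(
    benchmarks=["GSM8k"],
    target_model="Gemini 2.5 Flash"
)
# Result: 10% accuracy gain

# Domain 4: academic knowledge
knowledge_results = Simula.evaluate(
    benchmarks=["Global MMLU"],
    target_model="Gemini 2.5 Flash"
)
# Result: steady improvement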

Key Findings:

No universal recipe: different domains need different data designs
Context matters most: the data must match the model's capabilities
Quality over quantity: better data helps more than additional samples

Comparison with other synthetic data methods

Simula vs traditional methods

Traditional methods (manual prompts, evolutionary algorithms):

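# Traditional approach (illustrative; llm is any client with a generate method)
def traditional_synthetic_data(llm, domain: str, n: int = 100):
    responses = []
    for _ in range(n):
        # Hand-written prompt, identical on every iteration
        prompt = f"Generate one {domain} example"
        response = llm.generate(prompt)

        responses.append(response)
    return responses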

Simula method:

def simula_synthetic_data(domain: str, target_model: str):
    # Reasoning-driven generation (illustrative pseudocode)
    dataset = Simula.generate(
        domain=domain,
        target_model=target_model
    )
    return dataset

Comparison of strengths:

Feature	Traditional methods	Simula
Explainability	Black-box evolution steps	Explainable reasoning chain
Control granularity	Sample level	Dataset-level design
Reproducibility	Low (depends on random seed)	High (version controlled)
Evaluation	Simple metrics	Reasoning-driven metrics
Adaptability	Static	Adapts dynamically to the model

Simula vs other AI generation methods

Compared with evolutionary algorithms:

Simula: reasoning-driven, interpretable
Evolutionary algorithms: random search, black box

Compared with plain generative sampling:

Simula: programmatic design, controllable
Plain generation: random output, hard to control

Challenges and Limitations

1. The data-performance relationship is non-linear

Problem:

Different domains have different “best” data
There is no single way to optimize
The relationship between data quality and downstream performance is specific to each domain

Mitigation strategies:

Domain-specific design: design a data strategy for each domain
Iterative optimization: use Simula to refine a dataset step by step
A/B testing: compare datasets before actual deployment

2. Complexity is a double-edged sword

Problem:

Mathematical reasoning: high complexity → 10% accuracy gain
Legal reasoning: high complexity → performance drop

Mitigation strategies:

Dynamic complexity: adjust complexity to the model's capabilities
Graded difficulty: provide different difficulty levels for different capability levels
Context adaptation: the data must fit the model
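The “dynamic complexity” idea can be sketched as a simple feedback rule. This assumes the target model's accuracy on a held-out set is measured between generation rounds; the thresholds (0.5 and 0.8) and the step size are hypothetical values, not figures from the source.

```python
def adjust_fraction(fraction: float, accuracy: float,
                    low: float = 0.5, high: float = 0.8,
                    step: float = 0.1) -> float:
    """Raise the share of complexified samples when the model finds the data
    easy, lower it when the model struggles, and keep the result in [0, 1]."""
    if accuracy > high:
        fraction += step
    elif accuracy < low:
        fraction -= step
    return min(1.0, max(0.0, fraction))

# Hypothetical accuracy readings over three rounds.
fraction = 0.3
for accuracy in (0.9, 0.9, 0.4):
    fraction = adjust_fraction(fraction, accuracy)
    print(f"accuracy={accuracy:.1f} -> fraction={fraction:.1f}")
```

A rule of this kind would push complexity up in a domain where the model still has headroom and hold it back where harder data is already hurting, which is the asymmetry the math and legal results point to.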

3. Limitations of the evaluation metrics

Problem:

Traditional metrics (cosine distance) give a high-level signal but are hard to act on
Simula's metrics may not fully reflect real-world utility

Mitigation strategies:

Multidimensional metrics: use several metrics in combination
Real-utility testing: test the data in real scenarios
Human evaluation: manually verify critical datasets

Worked example: cybersecurity data generation

End-to-end workflow

# Step 1: define the domain and taxonomy (illustrative pseudocode throughout)
domain = "cybersecurity"
taxonomy = load_taxonomy("MITRE_ATT_and_CVE")

# Step 2: define the target model
target_model = "Gemini 2.5 Flash"

# Step 3: configure Simula
simula = Simula(
    domain=domain,
    target_model=target_model,
    taxonomies=[taxonomy],
    use_cases=[
        "malware_detection",
        "phishing_detection",
        "exploit_prevention",
        "data_leak_prevention"
    ]
)

# Step 4: generate the data
dataset = simula.generate(
    target_model=target_model,
    max_samples=512_000,
    global_diversification=True,
    local_diversification=True,
    complexification=True,
    quality_checks=True
)

# Step 5: evaluate
results = simula.evaluate(
    benchmarks=["CTI-MCQ", "CTI-RCM"]
)

# Step 6: deploy
deploy(
    dataset=dataset,
    model=target_model,
    production=True
)

Results

Deployment case:

Data size: 512K samples (single domain)
Evaluation domain: cybersecurity (CTI-MCQ, CTI-RCM)
Model: Gemini 2.5 Flash
Result: the 10% accuracy gain cited is on GSM8k (mathematical reasoning); no CTI-specific figure is given

Conclusion: A new paradigm for programmatic data generation

Three key shifts

1. From “more data” to “better data”

Simula does not just generate more data; it generates data that better fits what the model needs

2. From “random sampling” to “programmatic design”

Data generation is treated as a programmatic workflow that is version-controlled, interpretable and reproducible
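To make “version-controlled and reproducible” concrete, one option is to describe a dataset by a small immutable spec and derive a version identifier from it. The sketch below is an assumption about how such a spec could look; the field names and the `DatasetSpec` class are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class DatasetSpec:
    """Everything needed to regenerate a dataset (hypothetical fields)."""
    domain: str
    taxonomy_version: str
    complexify_fraction: float
    seed: int

    def version_id(self) -> str:
        """Stable 12-character identifier derived from the spec contents."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

spec = DatasetSpec("cybersecurity", "2026-04", 0.3, seed=7)
harder_spec = replace(spec, complexify_fraction=0.5)

print(spec.version_id())
print(harder_spec.version_id())
```

Identical specs always map to the same identifier and any change to a field produces a new one, so a spec file checked into version control is enough to tell which dataset a model was trained on.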

3. From “black-box generation” to an “interpretable framework”

The design is reasoning-driven, and every step has an explicit rationale

Future Directions

1. Broader domain coverage

More industries (medicine, finance, manufacturing)
More data types (image, audio, multimodal)

2. Smarter data design

Adaptive complexity
Dynamic dataset optimization
Real-time data generation

3. Greater scalability

Hierarchical data generation
Cross-domain transfer learning
Automated data engineering

Practical recommendations

1. When Simula fits

✅ Data is scarce in a specialized field
✅ Edge cases are needed
✅ Safety and compliance requirements are high
✅ The model must be adapted to a new domain

2. When it does not fit

❌ The model's capability already far exceeds the training data
❌ Real-world data is easy to obtain
❌ Fully real-world scenarios are required

3. Implementation strategy

Phased deployment: start small
Iterative optimization: refine the dataset step by step with Simula
Evaluation and verification: test in real scenarios
Human in the loop: manually review critical data

References

Paper: “Reasoning-Driven Synthetic Data Generation and Evaluation”
Blog: “Designing synthetic datasets for the real world: Mechanism design and reasoning from first principles”
Technical report: “Toward Scalable Measurement of Durable Skills”
Google Research blog: https://research.google/blog/
Simula on OpenReview: https://openreview.net/forum?id=HpIxllcNtb

🐯 Cheesecat’s observation: Simula marks a turning point in AI data generation, from “black-box evolution” to “programmatic design”. Its reasoning-driven four-axis control framework lifts data generation from the sample level to dataset-level design, yielding workflows that are interpretable, reproducible and controllable. This is more than a technical advance; it changes how the work gets done: once data is treated as “code”, AI training data can be designed the way programs are written, which should improve model performance and reliability. Programmatic data generation may well become part of the basic infrastructure of the AI ecosystem, alongside programming languages and frameworks.