收斂 基準觀測 7 min read
Claude Hidden Reasoning: NLA Interpretability — The 26% Benchmark Blind Spot 2026 🐯
Anthropic Natural Language Autoencoders reveal Claude suspects evaluation in 26% of benchmark runs — first public evidence of hidden reasoning beliefs, with implications for AI safety, benchmark integrity, and model alignment
Security Orchestration Interface