2026-06-03

GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning 85

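Tags: Model distillation, Reasoning/inference optimization, Small models, Large models
Source: arXiv Computation and Language | Read original

[Summary]
GateKD proposes a confidence-gated closed-loop distillation framework that improves the multi-step reasoning of small models through dynamic teacher supervision, hidden-state alignment, and attention filtering. It outperforms conventional distillation methods on logical and symbolic reasoning tasks.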

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind 85

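Tags: Large models, Theory of mind, Reasoning/inference optimization, Reinforcement learning
Source: arXiv Computation and Language | Read original

[Summary]
Proposes ToMAP, which uses a theory-of-mind module to strengthen an LLM persuader's opponent awareness and dynamic reasoning. With only 3B parameters it outperforms GPT-4o, a relative improvement of 39.4%, and its arguments are both more effective and more diverse.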

From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs 85

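Tags: Knowledge graphs, Natural language processing, Query generation, Schema alignment
Source: arXiv Computation and Language | Read original

[Summary]
Proposes the SchemaForge framework, which applies schema-slice alignment and counterfactual validation over heterogeneous knowledge graphs to improve Text-to-SPARQL query generation accuracy, with an average gain of 11.5 percentage points across multiple benchmarks.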

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses 85

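Tags: Search agents, Reinforcement learning, Retrieval augmentation, State management
Source: arXiv Computation and Language | Read original

[Summary]
Harness-1 uses stateful search tooling (a state-externalizing harness) to train a 20B-parameter search agent with reinforcement learning. It reaches an average curated recall of 0.730 across eight retrieval benchmarks, exceeding open-source subagents by 11.4 points and approaching frontier large models.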

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression 85

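Tags: Reasoning/inference optimization, Large models, Reinforcement learning
Source: arXiv Computation and Language | Read original

[Summary]
Proposes HMPO, a single-stage reinforcement learning framework that compresses CoT reasoning using an adaptive median budget and a cosine-decay reward. On 9B-122B models it achieves 19%-46% token compression with minimal accuracy loss.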
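
The summary does not give HMPO's reward formula, so the following is only one plausible reading of "adaptive median budget" plus "cosine-decay reward", not the paper's method. It assumes the budget is the median token length of the correct responses in a sampled group, and that correct responses longer than the budget have their reward decayed along a half cosine. The names `length_rewards` and `max_len` are hypothetical.

```python
import math
from statistics import median


def length_rewards(lengths, correct, max_len):
    """Illustrative length-aware reward for one group of sampled responses.

    Assumption: the budget is the median token length of the correct
    responses in the group. Correct responses at or under the budget get
    1.0; longer ones decay along a half cosine toward 0.0 at max_len.
    Incorrect responses always get 0.0.
    """
    correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
    if not correct_lengths:
        return [0.0] * len(lengths)
    budget = median(correct_lengths)
    rewards = []
    for n, ok in zip(lengths, correct):
        if not ok:
            rewards.append(0.0)
        elif n <= budget or max_len <= budget:
            rewards.append(1.0)
        else:
            # Fraction of the way from the budget to the hard length cap.
            frac = min(1.0, (n - budget) / (max_len - budget))
            rewards.append(0.5 * (1.0 + math.cos(math.pi * frac)))
    return rewards


rewards = length_rewards([100, 200, 300, 400], [True, True, True, False], max_len=400)
```

Because the budget is recomputed per group, it tightens as the policy learns to answer correctly with fewer tokens, which is one way a single-stage setup could avoid a hand-tuned length target.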

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution 85

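Tags: Large models, Evaluation benchmarks, Training data generation, Reinforcement learning
Source: arXiv Computation and Language | Read original

[Summary]
BenchEvolver proposes a framework that automatically generates programming problems by evolving solutions. It turns saturated benchmarks into high-difficulty evaluation sets, restores their ability to discriminate between models, and is also used in reinforcement learning to improve coding ability.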

On the Limits of Token Reduction for Efficient Unified Vision Language Training 85

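Tags: Large models, Multimodal, Training optimization, Reasoning/inference optimization
Source: arXiv Computation and Language | Read original

[Summary]
Studies whether token reduction can speed up training of unified vision-language models. It finds that understanding and generation tasks depend on image tokens asymmetrically, and argues that acceleration strategies must account for the interplay between the two.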

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models 85

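Tags: Large models, Reasoning evaluation, Chain of thought
Source: arXiv Computation and Language | Read original

[Summary]
Finds a significant production-evaluation gap in large reasoning models. The models produce correct answers almost perfectly, yet when asked to evaluate math solutions that contain trivial reasoning flaws but reach the correct answer, their accuracy falls as low as 48%. The study attributes this to answer-confirmation bias.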

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence 85

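Tags: Category theory, Scientific discovery, Agent systems, Materials science
Source: arXiv Computation and Language | Read original

[Summary]
Proposes a category-theoretic framework for self-revising scientific discovery that formalizes discovery as transformations between representational systems, and instantiates it in materials science to separate retrieval, search, and discovery.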

BraveGuard: From Open-World Threats to Safer Computer-Use Agents 85

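Tags: AI safety, Agent safety, Adaptive defense
Source: arXiv Computation and Language | Read original

[Summary]
BraveGuard proposes an adaptive safety defense framework for computer-use agents. It trains a guard model by mining open-world threats and real execution trajectories, raising accuracy on AgentHazard from 38.79% to 82.38%.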

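Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation 85

Tags: Vision-language navigation, Multimodal, Spatial reasoning, Zero-shot
Source: arXiv Computation and Language | Read original

[Summary]
Proposes the Hierarchical Semantic-Geometric Map (HSGM), which converts 3D geometric information into a VLM-compatible representation to enable zero-shot vision-language navigation, reaching state of the art on R2R-CE and RxR-CE.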

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages 85

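Tags: Speech recognition, Edge models, Multilingual, African languages
Source: arXiv Computation and Language | Read original

[Summary]
Shows that fine-tuned edge ASR models across 19 African languages reach a WER 26.9% lower than large multilingual baselines while being 3-40x smaller, and releases all weights and code.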

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs 85

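Tags: Knowledge graphs, Reasoning/inference optimization, Caching, Intelligent agents
Source: arXiv Computation and Language | Read original

[Summary]
Grokers proposes an architecture that invests intelligence at write time, inductively traversing dependency subgraphs from the bottom up, so that comprehension of the knowledge graph is persisted. Subsequent queries incur zero additional LM cost, and the paper proves several formal properties.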

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression 85

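Tags: Large models, Model compression, Reasoning/inference optimization
Source: arXiv Computation and Language | Read original

[Summary]
Proposes SubFit, which replaces LLM components non-contiguously at the submodule level and compresses them with lightweight residual bypasses. It outperforms baselines on 10 LLMs and retains 84.6% of downstream accuracy at 25% sparsity.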

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models 85

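Tags: AI safety, Large models, Tool use, Safety evaluation
Source: arXiv Computation and Language | Read original

[Summary]
Proposes the Safety Asymmetry Score (SAS), which measures how a large model's vulnerability to malicious instructions differs across input channels (user messages, tool metadata, outputs). It finds that agent models are more susceptible to attacks placed in tool descriptions, exposing a systematic safety blind spot.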

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning 85

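Tags: Reasoning/inference optimization, Large models, Test-time scaling, Early exit
Source: arXiv Computation and Language | Read original

[Summary]
Finds a two-phase structure in the entropy dynamics of chain-of-thought reasoning (exploratory uncertainty followed by convergent confidence). Based on CUSUM change-point detection, it proposes training-free early-exit and test-time scaling strategies that significantly improve reasoning efficiency and accuracy.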
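
The summary names CUSUM change-point detection over the entropy trace but gives no details. As a rough illustration only, a one-sided CUSUM in Python could flag the step where entropy falls from the exploratory level into the convergent phase. The function name and the `baseline_steps`, `drift`, and `threshold` parameters are assumptions made for this sketch and do not come from the paper.

```python
from statistics import mean


def cusum_entropy_drop(entropies, baseline_steps=5, drift=0.05, threshold=1.0):
    """Return the first index where a one-sided CUSUM detects a sustained
    drop in entropy below the early-phase baseline, or None if none is found.

    All parameter names and defaults are illustrative assumptions.
    """
    if len(entropies) <= baseline_steps:
        return None
    baseline = mean(entropies[:baseline_steps])  # exploratory-phase level
    stat = 0.0
    for i in range(baseline_steps, len(entropies)):
        # Accumulate evidence of a downward shift; drift absorbs small noise.
        stat = max(0.0, stat + (baseline - entropies[i] - drift))
        if stat > threshold:
            return i
    return None


trace = [2.0, 2.1, 1.9, 2.0, 2.0, 1.95, 2.05, 0.9, 0.8, 0.7, 0.6]
change_at = cusum_entropy_drop(trace)  # index 7 for this synthetic trace
```

An early-exit policy built on this would stop generating reasoning tokens shortly after the detected index, while a flat trace (no detection) could be a signal to spend more test-time compute.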

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills? 85

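Tags: Agents, Multimodal learning, Skill distillation, Self-evolution
Source: arXiv Computation and Language | Read original

[Summary]
Proposes the MMG2Skill framework and the MMG2Skill-Bench benchmark, which let agents distill executable skills from multimodal guides found on the web and self-evolve through trajectory-level root-cause feedback. Performance improves by 12.8 to 25.3 percentage points on tasks such as GUI control.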

How to Correctly Report LLM-as-a-Judge Evaluations 85

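Tags: Large models, Evaluation benchmarks, Statistical methods
Source: arXiv Computation and Language | Read original

[Summary]
Proposes a simple plug-in framework that corrects the bias of LLM-as-a-judge evaluations and provides statistically principled confidence intervals, improving evaluation reliability.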
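
The summary does not spell out the correction. One standard way to de-bias a binary judge, shown here purely as an illustration, is the Rogan-Gladen estimator from prevalence estimation, which adjusts the judge's raw pass rate using its sensitivity and specificity. Whether the paper uses exactly this estimator is not stated in the digest. The sketch assumes both rates are known exactly, so the interval reflects only sampling noise in the judged items; the function name and arguments are hypothetical.

```python
import math


def corrected_accuracy(judge_pass_rate, n, sensitivity, specificity, z=1.96):
    """Bias-correct a judge's pass rate with the Rogan-Gladen estimator.

    sensitivity = P(judge says correct | truly correct)
    specificity = P(judge says incorrect | truly incorrect)
    Assumption: both rates are known exactly, so the normal-approximation
    interval only covers sampling noise in the n judged items.
    """
    j = sensitivity + specificity - 1.0
    if j <= 0:
        raise ValueError("judge must be better than chance")
    theta = (judge_pass_rate + specificity - 1.0) / j
    se = math.sqrt(judge_pass_rate * (1.0 - judge_pass_rate) / n) / j

    def clip(x):
        return min(1.0, max(0.0, x))

    return clip(theta), (clip(theta - z * se), clip(theta + z * se))


estimate, interval = corrected_accuracy(
    0.70, n=500, sensitivity=0.90, specificity=0.80
)
```

In practice sensitivity and specificity are themselves estimated from a small human-labeled calibration set, which widens the interval beyond what this simplified version reports.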

Scaling Agentic Capabilities via Grounded Interaction Synthesis 85

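Tags: Agents, Data synthesis, MCP, Open-source models
Source: arXiv Computation and Language | Read original

[Summary]
Proposes the GAIS framework, which uses a two-stage mechanism to automatically synthesize diverse, high-fidelity environments and complex tasks. It significantly improves agent capabilities, outperforms baselines on multiple benchmarks, is data-efficient, and is open source.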

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding 85

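Tags: Reasoning/inference optimization, Large models, Speculative decoding, Open-source models
Source: arXiv Computation and Language | Read original

[Summary]
DFlare proposes a hierarchical fusion mechanism to scale up draft-model capacity, achieving up to 5.5x speedup in block diffusion speculative decoding, about 10% better than DFlash. The code is open source.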
