2026-06-07

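PixelDiT Named a CVPR 2026 Best Paper Finalist 82

Tags: Model Release, Research Progress, Image Generation
Source: AI HOT Picks

[Summary]
NVIDIA has proposed PixelDiT, a single-stage diffusion Transformer that learns end-to-end directly in pixel space and needs no pretrained autoencoder for compression. It has been selected as a CVPR 2026 Best Paper finalist and could improve image generation quality.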

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation 80

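Tags: Research, AI Evaluation, Academic Integrity, Evaluation Methods
Source: arXiv Computation and Language

[Summary]
An audit of more than 110,000 LLM evaluation papers finds that most of them evaluate models that lag the frontier by about 1.4 versions, and that they rarely disclose key configuration such as the reasoning mode. This leads to misrepresented capabilities, and the authors call for a mandatory disclosure framework.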

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions 80

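Tags: Large Models, Inference Optimization, Reinforcement Learning, Model Release
Source: arXiv Computation and Language

[Summary]
The SUPERNOVA framework trains on natural-instruction datasets with reinforcement learning with verifiable rewards (RLVR), substantially improving the general reasoning ability of LLMs. It achieves a 64.4% relative improvement on complex reasoning benchmarks, and the gains generalize to larger models.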

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition 80

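Tags: Large Models, Inference Optimization, Financial AI, Chips and Compute
Source: arXiv Computation and Language

[Summary]
YouZhi-LLM compresses the KV cache through a layer-adaptive GQA-to-MLA transition, more than doubling concurrency on financial tasks. Validated on Huawei Ascend hardware, it offers a new approach to high-throughput financial inference.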
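Why converting GQA layers to MLA frees KV-cache memory comes down to per-token arithmetic. The sketch below is a back-of-the-envelope calculator, not YouZhi's code. The layer count (32), GQA shape (8 KV heads × 128 dims), MLA shape (a 512-dim latent plus a 64-dim decoupled RoPE key, as in DeepSeek-V2), FP16 storage, 4,096-token sequences and a 16 GiB cache budget are all assumptions, and which layers get converted is fixed by hand here rather than chosen adaptively.

```python
from typing import Sequence


def gqa_elems_per_token(n_kv_heads: int, head_dim: int) -> int:
    # GQA caches a full K vector and a full V vector for every KV head.
    return 2 * n_kv_heads * head_dim


def mla_elems_per_token(latent_dim: int, rope_dim: int) -> int:
    # MLA caches one compressed latent shared by all heads, plus a small
    # decoupled RoPE key; K and V are reconstructed from the latent.
    return latent_dim + rope_dim


def kv_bytes_per_seq(layer_kinds: Sequence[str], seq_len: int,
                     bytes_per_elem: int = 2, n_kv_heads: int = 8,
                     head_dim: int = 128, latent_dim: int = 512,
                     rope_dim: int = 64) -> int:
    """KV-cache bytes for one sequence; all shape defaults are assumed."""
    per_token = 0
    for kind in layer_kinds:
        if kind == "gqa":
            per_token += gqa_elems_per_token(n_kv_heads, head_dim)
        elif kind == "mla":
            per_token += mla_elems_per_token(latent_dim, rope_dim)
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return per_token * seq_len * bytes_per_elem


def max_concurrent_seqs(budget_bytes: int, per_seq_bytes: int) -> int:
    # How many full-length sequences fit in the cache budget at once.
    return budget_bytes // per_seq_bytes


if __name__ == "__main__":
    budget = 16 * 2**30  # 16 GiB reserved for the KV cache (assumed)
    all_gqa = ["gqa"] * 32
    half_mla = ["gqa"] * 16 + ["mla"] * 16  # hand-picked split, illustrative only
    for name, layers in [("all GQA", all_gqa), ("half MLA", half_mla)]:
        per_seq = kv_bytes_per_seq(layers, seq_len=4096)
        print(name, per_seq, max_concurrent_seqs(budget, per_seq))
```

Under these assumptions, converting half of the layers cuts the per-sequence cache from 512 MiB to 328 MiB, so 49 rather than 32 sequences fit in the budget. The paper's reported concurrency gain depends on its own model shapes and layer selection.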

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding 80

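Tags: Inference Optimization, Model Deployment, Large Models
Source: arXiv Computation and Language

[Summary]
AdaPLD improves model-free speculative decoding through adaptive retrieval and reuse, achieving up to a 3.1x decoding speedup and improving LLM inference efficiency.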
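The summary does not describe AdaPLD's adaptive retrieval and reuse logic. As a point of reference, the sketch below shows the plain prompt-lookup style of model-free speculative decoding that such methods build on: draft tokens are copied from an earlier match of the trailing n-gram, and the target model keeps only the prefix it agrees with under greedy decoding. `next_token` and `toy_model` are hypothetical stand-ins for the target model; they are called once per position here for clarity, whereas a real implementation scores all draft positions in a single forward pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def lookup_draft(tokens: Sequence[int], max_ngram: int = 3,
                 max_draft: int = 5) -> List[int]:
    """Prompt-lookup drafting: find the most recent earlier occurrence of the
    trailing n-gram and propose the tokens that followed it. Backs off from
    max_ngram down to 1; returns [] when nothing matches."""
    for n in range(max_ngram, 0, -1):
        if len(tokens) <= n:
            continue
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent match wins. The last valid
        # start, len(tokens) - n - 1, stops the suffix from matching itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + max_draft])
    return []


def verify_greedy(tokens: Sequence[int], draft: Sequence[int],
                  next_token: NextToken) -> List[int]:
    """Accept the longest draft prefix that agrees with the target model's
    greedy choice, then append one token produced by the target model."""
    ctx = list(tokens)
    accepted: List[int] = []
    for t in draft:
        if next_token(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(next_token(ctx))  # correction token or bonus token
    return accepted


def generate(prompt: Sequence[int], next_token: NextToken,
             steps: int) -> Tuple[List[int], int]:
    out = list(prompt)
    rounds = 0  # each round ~ one target-model forward pass in practice
    while len(out) - len(prompt) < steps:
        out.extend(verify_greedy(out, lookup_draft(out), next_token))
        rounds += 1
    return out[:len(prompt) + steps], rounds


PATTERN = [1, 2, 3, 4]


def toy_model(ctx: Sequence[int]) -> int:
    # Hypothetical periodic "target model": the next token depends on position.
    return PATTERN[len(ctx) % len(PATTERN)]


tokens, rounds = generate(PATTERN * 2, toy_model, steps=10)
print(tokens, rounds)
```

Because the toy target model is perfectly periodic, every draft is accepted and the 10 new tokens take 2 verification rounds instead of 10. On real text, the speedup depends on how often the output repeats spans already in the context.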

EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading 80

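Tags: Large Models, Training Methods, Inference Optimization, Research Release
Source: arXiv Computation and Language

[Summary]
The EDIT framework diagnoses internal model states to locate reasoning errors and corrects only the affected local steps. Combined with belief-guided reinforcement learning, it substantially improves how faithfully LLMs follow the grading rules in real-world grading tasks.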

UNIVID: Unified Vision-Language Model for Video Moderation 80

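Tags: Model Release, Multimodal, AI Safety, Content Moderation
Source: arXiv Computation and Language

[Summary]
UNIVID is a unified vision-language model for video moderation that generates interpretable captions. It reduces violation leakage by 42.7% and over-blocking by 37%, and it replaces thousands of models, saving substantial compute.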

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges 80

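Tags: Large Models, Model Evaluation, AI Safety
Source: arXiv Computation and Language

[Summary]
The study shows that LLMs acting as judges can easily be steered into overturning their initial verdicts when users interact with them after the decision, which shifts evaluation results, reorders rankings, and causes disagreement with human preferences. It proposes the Evaluation Robustness Score (ERS) to quantify this risk.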

Self-Augmenting Retrieval for Diffusion Language Models 80

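Tags: Research, RAG, Diffusion Models, Inference Optimization
Source: arXiv Computation and Language

[Summary]
The proposed SARDI framework uses the low-confidence tokens that appear while a discrete diffusion language model is denoising to trigger lookahead retrieval. It is training-free and retriever-agnostic, and it outperforms existing training-free methods on multi-hop QA while delivering higher throughput.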

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation 80

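Tags: Training Methods, Reinforcement Learning, Large Models
Source: arXiv Computation and Language

[Summary]
OrderGrad proposes a policy gradient estimator for optimizing order-statistic objectives, which can capture requirements such as risk aversion and robustness. It shows promise for tasks including LLM post-training and provides a unified framework for reinforcement learning.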
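The summary does not describe OrderGrad's estimator. To illustrate what optimizing beyond the mean means, the sketch below instead uses a well-known special case of an order-statistic objective: the CVaR policy-gradient estimator, where CVaR is the average of the worst α fraction of returns. It weights each sample's score function by (R − VaR), but only when R falls in that lower tail. The two-armed bandit, its payoffs and all hyperparameters are hypothetical. The safe arm always pays 1, while the risky arm pays 3 with probability 0.9 and −10 otherwise, so the risky arm has the higher mean (1.7) but the worse tail.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def pull(arm: int, rng: random.Random) -> float:
    # Hypothetical payoffs. Arm 0 (safe): always 1.
    # Arm 1 (risky): +3 w.p. 0.9, -10 w.p. 0.1.
    if arm == 0:
        return 1.0
    return 3.0 if rng.random() < 0.9 else -10.0


def train(objective: str, steps: int = 500, batch: int = 200,
          alpha: float = 0.1, lr: float = 0.1, seed: int = 0) -> float:
    """Score-function policy gradient on a softmax policy over two arms.
    objective="mean": REINFORCE with a batch-mean baseline.
    objective="cvar": CVaR_alpha estimator, where only lower-tail samples count.
    Returns the final probability of choosing the safe arm."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(theta)
        arms = [0 if rng.random() < probs[0] else 1 for _ in range(batch)]
        rewards = [pull(a, rng) for a in arms]
        if objective == "mean":
            base = sum(rewards) / batch
            weights = [(r - base) / batch for r in rewards]
        else:
            # Empirical alpha-quantile of the batch returns (value at risk).
            var = sorted(rewards)[max(int(alpha * batch) - 1, 0)]
            weights = [(r - var) / (alpha * batch) if r <= var else 0.0
                       for r in rewards]
        grad = [0.0, 0.0]
        for a, w in zip(arms, weights):
            for k in range(2):
                score = (1.0 if k == a else 0.0) - probs[k]  # d log pi(a) / d theta_k
                grad[k] += w * score
        theta = [t + lr * g for t, g in zip(theta, grad)]  # gradient ascent
    return softmax(theta)[0]


if __name__ == "__main__":
    print("CVaR objective, P(safe) =", round(train("cvar"), 3))
    print("Mean objective, P(safe) =", round(train("mean"), 3))
```

With these settings, the mean objective drifts toward the risky arm, while the CVaR objective drives the policy to the safe arm. This is the kind of risk-averse behavior the summary attributes to order-statistic objectives.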

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding 80

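Tags: Audio Understanding, Model Release, Representation Distillation
Source: arXiv Computation and Language

[Summary]
Using domain-aware distillation and supervised distillation, USAD 2.0 extends a general-purpose audio encoder to the music domain and scales it to 1 billion parameters, achieving leading results on probing and LLM-based evaluations.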

DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections 80

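Tags: Dataset, Multimodal, Scientific QA, Benchmark
Source: arXiv Computation and Language

[Summary]
DocHop-QA is a multi-document, multimodal scientific question-answering benchmark with 11,379 instances. It requires models to reason across documents by combining text, tables, and layout information from multiple PubMed articles. Experiments show that current models struggle on this task, so the benchmark offers a rigorous testbed for complex scientific QA research.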

Seeing is Believing? Evaluating Vision-Language Model Susceptibility in Agent-to-Agent Multimodal Persuasion 80

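Tags: AI Safety, Multimodal, Agents, Model Evaluation
Source: arXiv Computation and Language

[Summary]
The study measures how persuasion between multimodal agents affects VLMs. It finds that multimodal inputs can bypass safety defenses and that susceptibility varies by domain and input format, laying groundwork for building robust, well-aligned VLMs.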

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents 80

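Tags: Agents, Proactive Retrieval, Lifelong Learning
Source: arXiv Computation and Language

[Summary]
ProactAgent is a proactive retrieval framework for lifelong-learning agents that treats when to retrieve and what to retrieve as policy actions. It raises success rates by up to 32% and cuts interaction turns by 33%.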

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery 80

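Tags: Research Progress, Multi-Agent, AutoML, LLM Agents
Source: arXiv Computation and Language

[Summary]
MLEvolve is a self-evolving multi-agent framework for end-to-end automated discovery of machine learning algorithms. It achieves state-of-the-art results on MLE-Bench and on mathematical algorithm optimization tasks, demonstrating cross-domain generalization.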

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts 80

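Tags: Agents, Inference Optimization, Model Optimization, AI Research
Source: arXiv Computation and Language

[Summary]
RHO lets AI agents automatically optimize their tools and workflows through self-preference over trajectory rollouts, with no human annotation. It raises performance on SWE-Bench Pro from 59% to 78% and substantially improves results on long-horizon tasks.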

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives 80

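Tags: Large Models, AI Safety, Model Evaluation, Value Alignment
Source: arXiv Computation and Language

[Summary]
The new CLASH benchmark evaluates the value judgments of large models in high-stakes dilemmas. It finds that even strong models such as GPT-5 achieve low accuracy on conflicting decisions and that strategies that work for math do not transfer to value reasoning, revealing new challenges for AI safety and value alignment.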

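Arena Launches Agent Arena, a Real-World AI Agent Leaderboard 78

Tags: Agents, Evaluation, Large Models
Source: AI HOT Picks

[Summary]
Arena has released Agent Arena, an agent leaderboard built on real user tasks that evaluates how models perform on real work such as programming and document analysis. Built from more than 300,000 tasks, it is an important reference for assessing AI agent capabilities.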

Diffusion Models Observe Only Gradients: A Geometric Perspective on Score Matching Errors 78

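Tags: Diffusion Models, Theoretical Research
Source: arXiv Statistics - Machine Learning

[Summary]
Taking a geometric perspective, this work shows that the L2 score-matching error is not an appropriate measure of the quality of the distribution a diffusion model learns. It proposes a new decomposition and upper bound, offering theoretical guidance for improving model training and evaluation.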

Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning 78

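Tags: Pretraining, Representation Learning, Model Research
Source: arXiv Computation and Language

[Summary]
This work proposes a hybrid pretraining objective that combines JEPA-style latent-space prediction with MLM. On the GLUE benchmark it produces more uniform, semantically richer embeddings; accuracy stays on par, but the geometric quality of the representations improves markedly.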
