AI Safety Paper Daily - December 6, 2025

cs.CR

Tipping the Dominos: Topology-Aware Multi-Hop Attacks on LLM-Based Multi-Agent Systems

Proposes a topology-aware multi-hop attack method targeting LLM-based multi-agent systems

📅 December 5, 2025

cs.CR

WildCode: An Empirical Analysis of Code Generated by ChatGPT

Analyzes the security of ChatGPT-generated code and its potential vulnerabilities

📅 December 5, 2025

cs.CR

One Detector Fits All: Robust and Adaptive Detection of Malicious Packages from PyPI to Enterprises

Proposes a robust and adaptive method for detecting malicious packages

📅 December 5, 2025

cs.CR

AutoGuard: A Self-Healing Proactive Security Layer for DevSecOps Pipelines Using Reinforcement Learning

Uses reinforcement learning to build a proactive defense layer for DevSecOps pipelines

📅 December 5, 2025

cs.CR

A Light-Weight Large Language Model File Format for Highly-Secure Model Distribution

Designs a highly secure file format for distributing large language models

📅 December 5, 2025

cs.CR

PBFuzz: Agentic Directed Fuzzing for PoV Generation

Proposes an agent-based method for generating PoVs for vulnerabilities

📅 December 5, 2025

cs.CR

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Quantifies the risk of memory leakage in multi-agent LLM systems

📅 December 5, 2025

cs.CR

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security

Builds a causality analysis framework for LLM security

📅 December 5, 2025

cs.CR

Personalizing Agent Privacy Decisions via Logical Entailment

Personalizes agents' privacy decisions through logical entailment

📅 December 5, 2025

cs.CR

Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

Compares retrieval-augmented prompting with fine-tuning for code vulnerability detection

📅 December 5, 2025

cs.CR

ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

Designs a security threat modeling platform for agentic AI applications

📅 December 5, 2025

cs.CR

UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks

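Proposes a training framework for neural networks that are robust to backdoor attacks, defending against injected malicious samples by purifying the data and strengthening model robustness

📅 December 5, 2025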

cs.CR

AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents

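Develops a visualization tool that monitors AI agents' data practices in real time, ensuring compliance with their privacy policies to guard against data leakage

📅 December 5, 2025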

cs.CR

AI Kill Switch for malicious web-based LLM agent

Designs an AI emergency-stop mechanism that can immediately block illegitimate actions by malicious web-based LLM agents

📅 December 5, 2025

cs.CR

Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models

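Builds a GAN-based model to detect deepfakes and payment fraud, improving online transaction systems' ability to recognize AI-generated content

📅 December 5, 2025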

cs.CR

In-Context Representation Hijacking

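Proposes an in-context representation hijacking attack in which replacing keywords leads the LLM to implicitly learn harmful semantics

📅 December 5, 2025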

cs.CL

EvoEdit: Lifelong Free-Text Knowledge Editing through Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion

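Develops a free-text knowledge editing method that updates model knowledge through latent-space perturbation and parameter fusion

📅 December 5, 2025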

cs.CL

EtCon: Edit-then-Consolidate for Reliable Knowledge Editing

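Proposes a reliable knowledge editing framework that addresses the stability of model knowledge updates through staged editing followed by knowledge consolidation

📅 December 5, 2025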

cs.CL

DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

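Proposes a framework for detecting AI-generated content in mixed-authorship text, verifying authenticity through stylistic feature analysis

📅 December 5, 2025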

cs.CL

Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking

Improves the factual accuracy of retrieval through contrastive learning and interpretable re-ranking, mitigating hallucination

📅 December 5, 2025

cs.CL

Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment

Proposes an iterative alignment framework for healthcare AI assistants that balances safety and helpfulness

📅 December 5, 2025

cs.CL

Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Mitigates visual object and action hallucinations in multimodal models through self-augmented contrastive alignment

📅 December 5, 2025

cs.CL

SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

Designs self-diagnostic contrastive decoding to address temporally inconsistent hallucinations in video LLMs

📅 December 5, 2025

cs.CL

Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective

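Proposes a research framework for the ethics and safety of multi-agent LLM systems from a mechanistic interpretability perspective

📅 December 5, 2025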

cs.CL

Multi-LLM Collaboration for Medication Recommendation

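Proposes a multi-LLM collaboration method to address hallucination and inconsistency in medication recommendation, improving the reliability of clinical decisions

📅 December 5, 2025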

cs.CL

Grounding LLM Reasoning with Knowledge Graphs

Combines knowledge graphs with LLM reasoning, using structured data to make reasoning more verifiable and reliable

📅 December 5, 2025

cs.CL

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

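Reveals the failure of instruction-hierarchy control in LLMs and proposes a framework for evaluating constraint prioritization

📅 December 5, 2025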

cs.CL

ChatGPT for President! Presupposed content in politicians versus GPT-generated texts

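Analyzes the manipulative potential of GPT-generated text in political discourse, revealing the risk of generating disinformation

📅 December 5, 2025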

cs.CL

QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Aligns LLMs with principles through a decomposed reward mechanism, improving the interpretability and safety of model outputs

📅 December 5, 2025

cs.CL

An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

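Builds a benchmark of mathematically equivalent transformations to test the robustness of LLM mathematical reasoning and assess sensitivity to non-mathematical perturbations

📅 December 5, 2025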

cs.CL

SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

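Proposes an uncertainty quantification framework based on semantic structural entropy for detecting hallucinations in large language models

📅 December 5, 2025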

cs.CL

Large language models can learn and generalize steganographic chain-of-thought under process supervision

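Studies how reward hacking can be concealed and proposes an extended method to keep chain-of-thought (CoT) monitoring from failing

📅 December 5, 2025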

cs.CL

Dual-branch Prompting for Multimodal Machine Translation

Designs a dual-branch prompting framework that makes multimodal translation more robust to visual noise

📅 December 5, 2025

cs.CL

TaoSR1: The Thinking Model for E-commerce Relevance Search

Addresses CoT error accumulation and discriminative hallucination when LLMs are used for e-commerce search

📅 December 5, 2025

cs.LG

Studying Various Activation Functions and Non-IID Data for Machine Learning Model Robustness

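Proposes improving model robustness through adversarial training and a range of activation functions, and studies security protection under non-IID data

📅 December 5, 2025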

cs.LG

Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models

Proposes the Radial Dispersion Score (RDS) for parameter-free uncertainty estimation in LLMs, improving system reliability

📅 December 5, 2025

cs.LG

Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Proposes SQDF to mitigate reward over-optimization in diffusion models, improving the diversity and safety of generated samples

📅 December 5, 2025

cs.LG

Reliable Statistical Guarantees for Conformal Predictors with Small Datasets

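Provides a reliable confidence quantification method for small datasets, supporting model trustworthiness in safety-critical settings

📅 December 5, 2025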

cs.LG

Temp-SCONE: A Novel Out-of-Distribution Detection and Domain Generalization Framework for Wild Data with Temporal Shift

Designs the Temp-SCONE framework for out-of-distribution detection in dynamic environments, improving model safety under temporal shift

📅 December 5, 2025

cs.LG

Value Gradient Guidance for Flow Matching Alignment

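Proposes VGG-Flow, which uses gradient matching to align flow matching models, improving the agreement between generative models and human preferences

📅 December 5, 2025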

cs.LG

Patient Safety Risks from AI Scribes: Signals from End-User Feedback

Reveals patient safety risks that AI scribes may introduce, noting that transcription errors can create clinical safety hazards

📅 December 5, 2025

cs.LG

Pick-to-Learn for Systems and Control: Data-driven Synthesis with State-of-the-art Safety Guarantees

Proposes a data-driven approach to systems and control that provides rigorous safety guarantees for complex environments

📅 December 5, 2025

cs.LG

Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

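Finds significant social bias in images generated by LVLM-based models and proposes a multi-level prompt benchmark for assessing the effect on bias

📅 December 5, 2025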

cs.LG

Bant: Byzantine Antidote via Trial Function and Trust Scores

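Proposes a Byzantine-tolerant method combining trust scores and a trial function, dynamically filtering anomalous updates to improve the security of distributed learning

📅 December 5, 2025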

cs.LG

Joint Discriminative-Generative Modeling via Dual Adversarial Training

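Designs a joint discriminative-generative model that uses adversarial training to improve the robustness and sample quality of classification and generative models

📅 December 5, 2025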

cs.LG

Bilevel Models for Adversarial Learning and A Case Study

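Builds a bilevel model to analyze the mechanics of adversarial attacks, quantifying the robustness of learning models and revealing the impact of attacks

📅 December 5, 2025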

cs.LG

Incoherent Beliefs & Inconsistent Actions in Large Language Models

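Reveals incoherent beliefs in large language models in dynamic environments and proposes a method for evaluating the robustness of their sequential decision-making

📅 December 5, 2025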

cs.LG

Multi-Modal Machine Learning for Early Trust Prediction in Human-AI Interaction Using Face Image and GSR Bio Signals

Develops a multimodal framework for predicting users' trust in AI, improving safety and reliability in human-AI interaction

📅 December 5, 2025

cs.LG

The Autonomy-Alignment Problem in Open-Ended Learning Robots: Formalising the Purpose Framework

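Proposes an autonomy-alignment framework for open-ended learning robots, ensuring that autonomous learning stays consistent with human values

📅 December 5, 2025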
