
Commit ed8b154

committed
chore: update confs
1 parent d088f7c commit ed8b154

File tree

1 file changed (+35, -0 lines)


arxiv.json

Lines changed: 35 additions & 0 deletions
@@ -45729,5 +45729,40 @@
 "pub_date": "2025-04-25",
 "summary": "High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs, however, there is a lack of a comprehensive benchmark for VLMs to evaluate HRI understanding. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\\times$ 1,024 to 35,503 $\\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic to radiology images, street views, long-range pictures, and telescope images. It includes HRIs of real-world objects, scanned documents, and composite multi-image. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and distracting images in different orders, assessing how well models utilize regions in HRI. We conduct extensive experiments involving 28 VLMs, including Gemini 2.0 Flash and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions, showing significant Regional Divergence and lost-in-middle, shedding light on future research.",
 "translated": "高分辨率图像(HRI)理解旨在处理具有大量像素的图像,例如病理图像和农业航拍图像,这些图像的像素量均可超过百万级。视觉大语言模型(VLMs)据称能够处理高分辨率图像,但目前缺乏全面的基准测试来评估其HRI理解能力。为填补这一空白,我们提出HRScene——一个包含丰富场景的新型统一HRI理解基准。HRScene整合了25个真实世界数据集和2个合成诊断数据集,分辨率覆盖1,024×1,024至35,503×26,627像素。该基准由10名研究生级标注员采集并重新标注,涵盖从显微图像到放射影像、街景视图、远景照片及望远镜图像等25类场景,包含真实物体扫描文档和复合多图像的高分辨率样本。两个诊断评估数据集通过将目标图像与标准答案及干扰图像以不同顺序组合生成,用于评估模型对HRI区域的利用能力。我们对28个VLM(包括Gemini 2.0 Flash和GPT-4o)进行了广泛实验。HRScene测试表明,当前VLMs在真实任务中的平均准确率仅约50%,暴露出HRI理解存在显著不足。合成数据集实验揭示VLMs存在区域发散(Regional Divergence)和中间信息丢失(lost-in-middle)现象,难以有效利用HRI区域,这为未来研究指明了方向。"
+},
+{
+"title": "LLM-Generated Fake News Induces Truth Decay in News Ecosystem: A Case\n Study on Neural News Recommendation",
+"url": "http://arxiv.org/abs/2504.20013v1",
+"pub_date": "2025-04-28",
+"summary": "Online fake news moderation now faces a new challenge brought by the malicious use of large language models (LLMs) in fake news production. Though existing works have shown LLM-generated fake news is hard to detect from an individual aspect, it remains underexplored how its large-scale release will impact the news ecosystem. In this study, we develop a simulation pipeline and a dataset with ~56k generated news of diverse types to investigate the effects of LLM-generated fake news within neural news recommendation systems. Our findings expose a truth decay phenomenon, where real news is gradually losing its advantageous position in news ranking against fake news as LLM-generated news is involved in news recommendation. We further provide an explanation about why truth decay occurs from a familiarity perspective and show the positive correlation between perplexity and news ranking. Finally, we discuss the threats of LLM-generated fake news and provide possible countermeasures. We urge stakeholders to address this emerging challenge to preserve the integrity of news ecosystems.",
+"translated": "当前,在线虚假新闻治理正面临大型语言模型(LLMs)被恶意用于虚假新闻生产所带来的新挑战。尽管现有研究表明从个体层面难以检测LLM生成的虚假新闻,但其大规模传播对新闻生态系统的影响仍缺乏深入探究。本研究通过构建仿真管道和包含约5.6万条多元类型生成新闻的数据集,系统考察了神经新闻推荐系统中LLM生成虚假新闻的影响。研究发现存在\"真相衰减\"现象:当LLM生成新闻参与推荐时,真实新闻在排名对抗虚假新闻时的优势地位会逐渐丧失。我们进一步从认知熟悉度视角解释了该现象成因,并证明困惑度与新闻排名呈正相关性。最后,本文探讨了LLM生成虚假新闻的威胁并提出可能的应对策略,敦促相关方重视这一新兴挑战以维护新闻生态系统的完整性。"
+},
+{
+"title": "Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the\n Evaluation of LLM Responses",
+"url": "http://arxiv.org/abs/2504.20006v1",
+"pub_date": "2025-04-28",
+"summary": "Battles, or side-by-side comparisons in so called arenas that elicit human preferences, have emerged as a popular approach to assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly representing an advance in evaluation, battles have at least two drawbacks, particularly in the context of complex information-seeking queries: they are neither explanatory nor diagnostic. Recently, the nugget evaluation methodology has emerged as a promising approach to evaluate the quality of RAG answers. Nuggets decompose long-form LLM-generated answers into atomic facts, highlighting important pieces of information necessary in a \"good\" response. In this work, we apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena in a fully automatic manner. Our results show a significant correlation between nugget scores and human preferences, showcasing promise in our approach to explainable and diagnostic system evaluations.",
+"translated": "在评估大语言模型(LLM)输出质量的方法中,\"竞技场对战\"(即通过并排对比引发人类偏好的比较方式)已成为流行手段。最近,这一理念被扩展到检索增强生成(RAG)系统评估领域。尽管这种方法无疑代表了评估技术的进步,但竞技场对战尤其面对复杂信息检索类查询时,至少存在两大缺陷:既缺乏解释性,也不具备诊断能力。近期兴起的\"信息粒\"(nugget)评估方法为RAG系统回答质量评估提供了新思路,该方法将LLM生成的长篇回答分解为原子事实,突显优质回答中应包含的关键信息单元。本研究运用AutoNuggetizer框架,对LMArena平台约7000场搜索竞技场对战数据进行全自动分析。结果表明,信息粒评分与人类偏好存在显著相关性,这验证了我们提出的可解释、可诊断系统评估方法的可行性。"
+},
+{
+"title": "Knowledge Distillation of Domain-adapted LLMs for Question-Answering in\n Telecom",
+"url": "http://arxiv.org/abs/2504.20000v1",
+"pub_date": "2025-04-28",
+"summary": "Knowledge Distillation (KD) is one of the approaches to reduce the size of Large Language Models (LLMs). An LLM with a smaller number of model parameters (the student) is trained to mimic the performance of an LLM of a larger size (the teacher model) on a specific task. For domain-specific tasks, it is not clear if the teacher or student model, or both, must be considered for domain adaptation. In this work, we study this problem from the perspective of the telecom-domain Question-Answering (QA) task. We systematically experiment with Supervised Fine-tuning (SFT) of the teacher only, SFT of the student only, and SFT of both prior to KD. We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model. Multi-faceted evaluation of the distillation using 14 different metrics (N-gram, embedding, and LLM-based metrics) is considered. Experimental results show that SFT of the teacher improves performance of the distilled model when both models have the same vocabulary, irrespective of algorithm and metrics. Overall, SFT of both teacher and student results in better performance across all metrics, although its statistical significance depends on the vocabulary of the teacher models.",
+"translated": "知识蒸馏(KD)是缩小大型语言模型(LLM)规模的方法之一。该方法通过训练参数规模较小的学生模型,使其在特定任务上模拟更大规模的教师模型性能。针对领域特定任务,目前尚不清楚领域适应阶段应仅调整教师模型、仅调整学生模型,还是需要同时调整两者。本研究从电信领域问答任务(QA)的视角探讨这一问题。我们系统性地实验了三种策略:仅对教师模型进行监督微调(SFT)、仅对学生模型进行SFT,以及在知识蒸馏前对两者同时进行SFT。实验设计涵盖词汇表(相同/不同)和蒸馏算法(传统KD与双空间KD/DSKD)对蒸馏模型的影响,并采用14种多维评估指标(包括N-gram、嵌入向量和基于LLM的指标)进行全面评测。实验结果表明:当师生模型共享相同词汇表时,无论采用何种算法或评估指标,对教师模型进行SFT都能提升蒸馏模型的性能。总体而言,虽然统计显著性受教师模型词汇表影响,但同时对师生模型进行SFT能在所有评估指标上取得更优效果。"
+},
+{
+"title": "Hierarchical Uncertainty-Aware Graph Neural Network",
+"url": "http://arxiv.org/abs/2504.19820v1",
+"pub_date": "2025-04-28",
+"summary": "Recent research on graph neural networks (GNNs) has explored mechanisms for capturing local uncertainty and exploiting graph hierarchies to mitigate data sparsity and leverage structural properties. However, the synergistic integration of these two approaches remains underexplored. In this work, we introduce a novel architecture, the Hierarchical Uncertainty-Aware Graph Neural Network (HU-GNN), which unifies multi-scale representation learning, principled uncertainty estimation, and self-supervised embedding diversity within a single end-to-end framework. Specifically, HU-GNN adaptively forms node clusters and estimates uncertainty at multiple structural scales from individual nodes to higher levels. These uncertainty estimates guide a robust message-passing mechanism and attention weighting, effectively mitigating noise and adversarial perturbations while preserving predictive accuracy on both node- and graph-level tasks. We also offer key theoretical contributions, including a probabilistic formulation, rigorous uncertainty-calibration guarantees, and formal robustness bounds. Finally, by incorporating recent advances in graph contrastive learning, HU-GNN maintains diverse, structurally faithful embeddings. Extensive experiments on standard benchmarks demonstrate that our model achieves state-of-the-art robustness and interpretability.",
+"translated": "最近关于图神经网络(GNNs)的研究探索了两种机制:通过捕捉局部不确定性来缓解数据稀疏性,以及利用图层次结构来挖掘拓扑特性。然而,这两种方法的协同整合仍未得到充分研究。本文提出了一种新颖的层次化不确定性感知图神经网络架构(HU-GNN),该架构将多尺度表征学习、理论驱动的概率估计与自监督嵌入多样性统一在端到端框架中。具体而言,HU-GNN能够在从单一节点到高层集群的多级结构尺度上自适应形成节点聚类并量化不确定性。这些不确定性估计可指导鲁棒的消息传递机制和注意力权重分配,在保持节点级和图级任务预测精度的同时,有效抑制噪声与对抗扰动。我们还提供了关键理论贡献,包括概率形式化建模、严格的不确定性校准保证以及形式化的鲁棒性边界证明。通过融入图对比学习的最新进展,HU-GNN能够保持具有结构保真性的多样化嵌入。在标准基准测试上的大量实验表明,我们的模型实现了最先进的鲁棒性和可解释性。"
+},
+{
+"title": "Reconstructing Context: Evaluating Advanced Chunking Strategies for\n Retrieval-Augmented Generation",
+"url": "http://arxiv.org/abs/2504.19754v1",
+"pub_date": "2025-04-28",
+"summary": "Retrieval-augmented generation (RAG) has become a transformative approach for enhancing large language models (LLMs) by grounding their outputs in external knowledge sources. Yet, a critical question persists: how can vast volumes of external knowledge be managed effectively within the input constraints of LLMs? Traditional methods address this by chunking external documents into smaller, fixed-size segments. While this approach alleviates input limitations, it often fragments context, resulting in incomplete retrieval and diminished coherence in generation. To overcome these shortcomings, two advanced techniques, late chunking and contextual retrieval, have been introduced, both aiming to preserve global context. Despite their potential, their comparative strengths and limitations remain unclear. This study presents a rigorous analysis of late chunking and contextual retrieval, evaluating their effectiveness and efficiency in optimizing RAG systems. Our results indicate that contextual retrieval preserves semantic coherence more effectively but requires greater computational resources. In contrast, late chunking offers higher efficiency but tends to sacrifice relevance and completeness.",
+"translated": "检索增强生成(RAG)已成为提升大语言模型(LLM)性能的变革性方法,其核心在于将模型输出与外部知识源相锚定。然而一个关键问题始终存在:如何在LLM的输入长度限制下有效管理海量外部知识?传统解决方案是将外部文档切分为固定尺寸的小片段。这种方法虽缓解了输入限制,却常导致上下文碎片化,引发检索不完整和生成内容连贯性下降的问题。\n\n为克服这些缺陷,学界提出了两种先进技术——延迟分块和上下文检索,二者均致力于保持全局上下文。尽管潜力显著,但它们的相对优势与局限性仍不明确。本研究对这两种技术展开严格分析,评估其在优化RAG系统时的效能与效率。实验结果表明:上下文检索能更有效保持语义连贯性,但需消耗更多计算资源;相比之下,延迟分块具有更高效率,但往往以牺牲相关性和完整性为代价。"
 }
 ]
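Every record added in this commit carries the same five string fields (`title`, `url`, `pub_date`, `summary`, `translated`). A minimal sketch of checking that invariant before committing further updates to `arxiv.json` — assuming the file is a flat JSON array of such objects; the `validate_entries` helper is illustrative, not part of this repository:

```python
import json

# The five fields every entry in this diff carries.
REQUIRED_KEYS = {"title", "url", "pub_date", "summary", "translated"}

def validate_entries(raw: str) -> list[dict]:
    """Parse a JSON array of paper records and verify each record
    contains all required keys. Returns the parsed entries, or raises
    ValueError naming the first malformed record."""
    entries = json.loads(raw)
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
    return entries

# A sample record shaped like the ones added in this commit.
sample = json.dumps([{
    "title": "Hierarchical Uncertainty-Aware Graph Neural Network",
    "url": "http://arxiv.org/abs/2504.19820v1",
    "pub_date": "2025-04-28",
    "summary": "...",
    "translated": "...",
}])
print(len(validate_entries(sample)))  # prints 1
```

Running this against the full file would catch entries where a translation step failed silently, before the malformed data reaches downstream consumers.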

0 commit comments
