
Commit 818e651

chore: update confs
1 parent 64e9713 commit 818e651

File tree

1 file changed: +35 -0 lines changed

arxiv.json

Lines changed: 35 additions & 0 deletions
@@ -42103,5 +42103,40 @@
"pub_date": "2025-03-05",
"summary": "Analogical reasoning relies on conceptual abstractions, but it is unclear whether Large Language Models (LLMs) harbor such internal representations. We explore distilled representations from LLM activations and find that function vectors (FVs; Todd et al., 2024) - compact representations for in-context learning (ICL) tasks - are not invariant to simple input changes (e.g., open-ended vs. multiple-choice), suggesting they capture more than pure concepts. Using representational similarity analysis (RSA), we localize a small set of attention heads that encode invariant concept vectors (CVs) for verbal concepts like \"antonym\". These CVs function as feature detectors that operate independently of the final output - meaning that a model may form a correct internal representation yet still produce an incorrect output. Furthermore, CVs can be used to causally guide model behaviour. However, for more abstract concepts like \"previous\" and \"next\", we do not observe invariant linear representations, a finding we link to generalizability issues LLMs display within these domains.",
"translated": "类比推理依赖于概念抽象,但目前尚不清楚大型语言模型(LLMs)是否具备此类内部表征。我们探索了从LLM激活中提取的蒸馏表征,发现用于上下文学习(ICL)任务的紧凑表征——功能向量(FVs;Todd等,2024)——对简单的输入变化(例如,开放式问题与多项选择题)并不具有不变性,这表明它们捕捉到的不仅仅是纯粹的概念。通过表征相似性分析(RSA),我们定位了一小部分注意力头,这些注意力头编码了诸如“反义词”等语言概念的不变概念向量(CVs)。这些CVs作为特征检测器独立于最终输出运行——这意味着模型可能形成正确的内部表征,但仍可能产生错误的输出。此外,CVs可用于因果引导模型行为。然而,对于更抽象的概念如“前一个”和“下一个”,我们并未观察到不变的线性表征,这一发现与LLMs在这些领域中表现出的泛化问题相关联。"
},
{
"title": "RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval\n via Radiology Report Mining",
"url": "http://arxiv.org/abs/2503.04653v1",
"pub_date": "2025-03-06",
"summary": "Developing advanced medical imaging retrieval systems is challenging due to the varying definitions of `similar images' across different medical contexts. This challenge is compounded by the lack of large-scale, high-quality medical imaging retrieval datasets and benchmarks. In this paper, we propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities in a scalable and fully automatic manner. Using this approach, we construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans, providing detailed image-image ranking annotations conditioned on diverse anatomical structures. Furthermore, we develop two retrieval systems, RadIR-CXR and model-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks. These systems also enable flexible, effective image retrieval conditioned on specific anatomical structures described in text, achieving state-of-the-art results on 77 out of 78 metrics.",
"translated": "由于不同医学场景中对“相似图像”的定义存在差异,开发先进的医学影像检索系统面临挑战。这一挑战因缺乏大规模、高质量的医学影像检索数据集和基准而进一步加剧。本文提出了一种新颖的方法,利用密集的放射学报告以可扩展且全自动的方式定义多粒度图像相似性排序。基于这一方法,我们构建了两个综合医学影像检索数据集:用于胸部X光的MIMIC-IR和用于CT扫描的CTRATE-IR,提供了基于不同解剖结构的详细图像-图像排序标注。此外,我们开发了两个检索系统:用于胸部X光的RadIR-CXR和用于胸部CT的model-ChestCT,这些系统在传统的图像-图像和图像-报告检索任务中表现出卓越性能。这些系统还支持基于文本描述的特定解剖结构进行灵活、有效的图像检索,在78个评估指标中的77个上达到了最先进的性能。"
},
{
"title": "IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in\n Expert-Domain Information Retrieval",
"url": "http://arxiv.org/abs/2503.04644v1",
"pub_date": "2025-03-06",
"summary": "We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.",
"translated": "我们介绍了IFIR,这是首个旨在评估专家领域内指令跟随信息检索(IR)的综合基准。IFIR包含了2,426个高质量示例,涵盖了金融、法律、医疗保健和科学文献四个专业领域的八个子集。每个子集都涉及一个或多个特定领域的检索任务,复制了现实世界中定制指令至关重要的场景。IFIR通过引入不同复杂程度的指令,实现了对指令跟随检索能力的详细分析。我们还提出了一种基于大型语言模型(LLM)的新颖评估方法,以更精确和可靠地评估模型在指令跟随方面的性能。通过对包括基于LLM的模型在内的15种前沿检索模型进行广泛实验,我们的结果表明,当前模型在有效跟随复杂、特定领域的指令方面面临重大挑战。我们进一步提供了深入分析,以突出这些局限性,为未来检索器的发展提供了宝贵的见解。"
},
{
"title": "Training-Free Graph Filtering via Multimodal Feature Refinement for\n Extremely Fast Multimodal Recommendation",
"url": "http://arxiv.org/abs/2503.04406v1",
"pub_date": "2025-03-06",
"summary": "Multimodal recommender systems improve the performance of canonical recommender systems with no item features by utilizing diverse content types such as text, images, and videos, while alleviating inherent sparsity of user-item interactions and accelerating user engagement. However, current neural network-based models often incur significant computational overhead due to the complex training process required to learn and integrate information from multiple modalities. To overcome this limitation, we propose MultiModal-Graph Filtering (MM-GF), a training-free method based on the notion of graph filtering (GF) for efficient and accurate multimodal recommendations. Specifically, MM-GF first constructs multiple similarity graphs through nontrivial multimodal feature refinement such as robust scaling and vector shifting by addressing the heterogeneous characteristics across modalities. Then, MM-GF optimally fuses multimodal information using linear low-pass filters across different modalities. Extensive experiments on real-world benchmark datasets demonstrate that MM-GF not only improves recommendation accuracy by up to 13.35% compared to the best competitor but also dramatically reduces computational costs by achieving the runtime of less than 10 seconds.",
"translated": "多模态推荐系统通过利用文本、图像和视频等多种内容类型,提升了传统推荐系统(无物品特征)的性能,同时缓解了用户-物品交互中固有的稀疏性问题,并加速了用户参与度。然而,当前基于神经网络的模型通常由于需要从多个模态中学习和整合信息的复杂训练过程,而产生了显著的计算开销。为了克服这一限制,我们提出了多模态图滤波(MultiModal-Graph Filtering, MM-GF),这是一种基于图滤波(Graph Filtering, GF)概念的无训练方法,旨在实现高效且准确的多模态推荐。具体而言,MM-GF首先通过解决跨模态的异构特性(如鲁棒缩放和向量平移)进行非平凡的多模态特征精炼,构建多个相似性图。然后,MM-GF利用线性低通滤波器在不同模态之间最优地融合多模态信息。在真实世界基准数据集上的大量实验表明,MM-GF不仅将推荐准确率提高了最多13.35%(与最佳竞争对手相比),还通过将运行时间缩短至不到10秒,显著降低了计算成本。"
},
{
"title": "In-depth Analysis of Graph-based RAG in a Unified Framework",
"url": "http://arxiv.org/abs/2503.04338v1",
"pub_date": "2025-03-06",
"summary": "Graph-based Retrieval-Augmented Generation (RAG) has proven effective in integrating external knowledge into large language models (LLMs), improving their factual accuracy, adaptability, interpretability, and trustworthiness. A number of graph-based RAG methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework to incorporate all graph-based RAG methods from a high-level perspective. We then extensively compare representative graph-based RAG methods over a range of questing-answering (QA) datasets -- from specific questions to abstract questions -- and examine the effectiveness of all methods, providing a thorough analysis of graph-based RAG approaches. As a byproduct of our experimental analysis, we are also able to identify new variants of the graph-based RAG methods over specific QA and abstract QA tasks respectively, by combining existing techniques, which outperform the state-of-the-art methods. Finally, based on these findings, we offer promising research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide new valuable insights for future research.",
"translated": "基于图的检索增强生成(Graph-based Retrieval-Augmented Generation, RAG)已被证明在将外部知识整合到大型语言模型(LLMs)中具有显著效果,能够提升模型的事实准确性、适应性、可解释性和可信度。文献中已提出了多种基于图的RAG方法。然而,这些方法尚未在相同的实验设置下进行系统且全面的比较。本文首先从高层次视角总结了一个统一的框架,将所有基于图的RAG方法纳入其中。随后,我们在一系列问答(QA)数据集上——从具体问题到抽象问题——广泛比较了代表性的基于图的RAG方法,并评估了所有方法的有效性,提供了对基于图的RAG方法的深入分析。作为实验分析的副产品,我们还通过结合现有技术,分别针对具体QA任务和抽象QA任务,识别出新的基于图的RAG方法变体,这些变体在性能上超越了当前最先进的方法。最后,基于这些发现,我们提出了未来研究的潜在方向。我们相信,对现有方法行为的更深入理解可以为未来的研究提供新的有价值的见解。"
},
{
"title": "Measuring temporal effects of agent knowledge by date-controlled tool\n use",
"url": "http://arxiv.org/abs/2503.04188v1",
"pub_date": "2025-03-06",
"summary": "Temporal progression is an integral part of knowledge accumulation and update. Web search is frequently adopted as grounding for agent knowledge, yet its inappropriate configuration affects the quality of agent responses. Here, we construct a tool-based out-of-sample testing framework to measure the knowledge variability of large language model (LLM) agents from distinct date-controlled tools (DCTs). We demonstrate the temporal effects of an LLM agent as a writing assistant, which can use web search to help complete scientific publication abstracts. We show that temporal effects of the search engine translates into tool-dependent agent performance but can be alleviated with base model choice and explicit reasoning instructions such as chain-of-thought prompting. Our results indicate that agent evaluation should take a dynamical view and account for the temporal influence of tools and the updates of external resources.",
"translated": "时间进展是知识积累和更新的重要组成部分。网络搜索常被用作智能体知识的基础,但其不当配置会影响智能体回答的质量。本文构建了一个基于工具的样本外测试框架,用于衡量来自不同日期控制工具(DCTs)的大型语言模型(LLM)智能体的知识变异性。我们展示了LLM智能体作为写作助手的时间效应,该智能体可以利用网络搜索帮助完成科学出版物摘要的撰写。研究表明,搜索引擎的时间效应会转化为依赖于工具的智能体性能,但可以通过基础模型的选择和显式推理指令(如思维链提示)来缓解。我们的结果表明,智能体评估应采取动态视角,并考虑工具的时间影响和外部资源的更新。"
}
]

0 commit comments
