
Commit 3db1678

chore: update confs
1 parent 6fd0131 commit 3db1678

File tree

1 file changed (+35, -0 lines)


arxiv.json

Lines changed: 35 additions & 0 deletions
@@ -49124,5 +49124,40 @@
     "pub_date": "2025-06-16",
     "summary": "As large language models (LLMs) continue to advance, reliable evaluation methods are essential, particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.",
     "translated": "随着大语言模型(LLMs)的持续发展,可靠的评估方法对于开放式指令遵循任务尤为重要。基于LLM的自动评估方法(LLM-as-a-Judge)通过将大语言模型作为评估器实现自动化评测,但其可靠性仍存在不确定性。本研究系统分析了影响该方法可信度的关键因素,重点关注其与人类判断的一致性及评估稳定性。通过BIGGENBench和EvalBiasBench两个基准平台,我们深入探究了评估方案设计、解码策略以及思维链(CoT)推理对评估效果的影响。实验结果表明:评估标准对可靠性具有决定性作用;相比确定性评估,非确定性采样能更好地对齐人类偏好;而当存在明确评估标准时,思维链推理带来的性能提升有限。"
+  },
+  {
+    "title": "A Systematic Replicability and Comparative Study of BSARec and SASRec\n for Sequential Recommendation",
+    "url": "http://arxiv.org/abs/2506.14692v1",
+    "pub_date": "2025-06-17",
+    "summary": "This study compares two sequential recommender systems, Self-Attention based Sequential Recommendation (SASRec) and Beyond Self-Attention based Sequential Recommendation (BSARec), to assess the improvement that frequency enhancement, the element added in BSARec, brings to recommendations. The models in the study have been re-implemented on a common base structure from EasyRec, with the aim of obtaining a fair and reproducible comparison. The results show that BSARec, by including bias terms for frequency enhancement, does indeed outperform SASRec, although the performance gains are not as high as those reported by the original authors. This work offers an overview of existing methods and, most importantly, underlines the importance of implementation details for performance comparison.",
+    "translated": "本研究旨在比较两种序列推荐系统:基于自注意力机制的序列推荐模型(SASRec)与融合频度增强机制的序列推荐模型(BSARec),以验证BSARec新增的频度增强模块对推荐效果的提升作用。为进行公平且可复现的对比,研究采用EasyRec框架统一重构了两种模型的基础架构。实验结果表明,尽管BSARec通过引入频度偏置项确实优于SASRec模型,但其性能提升幅度未达到原作者声称的水平。本工作的核心价值在于系统梳理现有方法体系,更重要的是揭示了模型实现细节对性能比较的关键影响。"
+  },
+  {
+    "title": "Refining music sample identification with a self-supervised graph neural\n network",
+    "url": "http://arxiv.org/abs/2506.14684v1",
+    "pub_date": "2025-06-17",
+    "summary": "Automatic sample identification (ASID), the detection and identification of portions of audio recordings that have been reused in new musical works, is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under \"real world\" (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Thus, a system robust to common music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music is an important open challenge. In this work, we propose a lightweight and scalable encoding architecture employing a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters compared to the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%. To enhance retrieval quality, we introduce a two-stage approach consisting of an initial coarse similarity search for candidate selection, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates - an essential capability absent in prior models. In addition, because queries in real-world applications are often short in duration, we benchmark our system for short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.",
+    "translated": "自动采样识别(ASID)——即检测并识别在新音乐作品中被重复使用的音频片段,是基于音频查询检索领域中一项关键但极具挑战性的任务。虽然相关技术音频指纹识别已在\"真实场景\"(含噪声、混响等条件)下的音乐内容检索方面取得显著进展,但现有ASID系统仍难以识别经过音乐化修改的采样片段。因此,开发能够抵抗时值拉伸、移调、效果器处理、背景音乐叠加等常见音乐制作变换的鲁棒系统,仍是该领域亟待攻克的重要难题。\n\n本研究提出一种轻量化且可扩展的编码架构,在对比学习框架中引入图神经网络。与当前最优系统相比,我们的模型仅需9%的可训练参数即可达到相当性能,平均精度均值(mAP)达44.2%。为提升检索质量,我们设计了两阶段处理流程:先通过粗粒度相似性搜索初筛候选样本,再经由交叉注意力分类器剔除无关匹配并优化候选排序——这一关键能力在先前模型中尚属缺失。此外,针对实际应用中查询片段通常较短的特点,我们采用自建的Sample100数据集(含本研究发布的新细粒度标注)对系统进行短查询性能基准测试。"
+  },
+  {
+    "title": "RMIT-ADM+S at the SIGIR 2025 LiveRAG Challenge",
+    "url": "http://arxiv.org/abs/2506.14516v1",
+    "pub_date": "2025-06-17",
+    "summary": "This paper presents the RMIT-ADM+S participation in the SIGIR 2025 LiveRAG Challenge. Our Generation-Retrieval-Augmented Generation (GRAG) approach relies on generating a hypothetical answer that is used in the retrieval phase, alongside the original question. GRAG also incorporates a pointwise large language model (LLM)-based re-ranking step prior to final answer generation. We describe the system architecture and the rationale behind our design choices. In particular, a systematic evaluation using the Grid of Points (GoP) framework and N-way ANOVA enabled comparison across multiple configurations, including query variant generation, question decomposition, rank fusion strategies, and prompting techniques for answer generation. Our system achieved a Relevance score of 1.199 and a Faithfulness score of 0.477 on the private leaderboard, placing among the top four finalists in the LiveRAG 2025 Challenge.",
+    "translated": "本文介绍了RMIT-ADM+S团队参与SIGIR 2025 LiveRAG挑战赛的研究成果。我们提出的生成-检索-增强生成(GRAG)方法通过在检索阶段同时使用原始问题与生成的假设答案来优化检索效果,并在最终答案生成前引入基于大语言模型(LLM)的点式重排序步骤。我们详细阐述了系统架构及其设计原理,特别采用格点评估框架(GoP)和N元方差分析对多组配置进行系统评估,包括查询变体生成、问题分解、排序融合策略以及答案生成的提示技术等维度。在挑战赛非公开排行榜上,我们的系统以1.199的相关性得分和0.477的忠实度得分位列前四强。"
+  },
+  {
+    "title": "Vela: Scalable Embeddings with Voice Large Language Models for\n Multimodal Retrieval",
+    "url": "http://arxiv.org/abs/2506.14445v1",
+    "pub_date": "2025-06-17",
+    "summary": "Multimodal large language models (MLLMs) have seen substantial progress in recent years. However, their ability to represent multimodal information in the acoustic domain remains underexplored. In this work, we introduce Vela, a novel framework designed to adapt MLLMs for the generation of universal multimodal embeddings. By leveraging MLLMs with specially crafted prompts and selected in-context learning examples, Vela effectively bridges the modality gap across various modalities. We then propose a single-modality training approach, where the model is trained exclusively on text pairs. Our experiments show that Vela outperforms traditional CLAP models in standard text-audio retrieval tasks. Furthermore, we introduce new benchmarks that expose CLAP models' limitations in handling long texts and complex retrieval tasks. In contrast, Vela, by harnessing the capabilities of MLLMs, demonstrates robust performance in these scenarios. Our code will soon be available.",
+    "translated": "近年来,多模态大语言模型(MLLMs)取得了显著进展,但其在声学领域表征多模态信息的能力仍待深入探索。本研究提出Vela框架,通过创新设计实现MLLMs生成通用多模态嵌入表示。该框架采用精心构建的提示模板和精选的上下文学习示例,有效弥合了不同模态间的语义鸿沟。我们进一步提出单模态训练策略,仅需文本配对数据即可完成模型训练。实验表明,Vela在标准文本-音频检索任务中表现优于传统CLAP模型。针对现有模型在长文本处理和复杂检索任务中的局限性,我们建立了新基准测试集。结果显示Vela凭借MLLMs的强大能力,在这些场景中展现出卓越的鲁棒性。相关代码即将开源。"
+  },
+  {
+    "title": "Similarity = Value? Consultation Value Assessment and Alignment for\n Personalized Search",
+    "url": "http://arxiv.org/abs/2506.14437v1",
+    "pub_date": "2025-06-17",
+    "summary": "Personalized search systems in e-commerce platforms increasingly involve user interactions with AI assistants, where users consult about products, usage scenarios, and more. Leveraging consultation to personalize search services is trending. Existing methods typically rely on semantic similarity to align historical consultations with current queries due to the absence of 'value' labels, but we observe that semantic similarity alone often fails to capture the true value of consultation for personalization. To address this, we propose a consultation value assessment framework that evaluates historical consultations from three novel perspectives: (1) Scenario Scope Value, (2) Posterior Action Value, and (3) Time Decay Value. Based on this, we introduce VAPS, a value-aware personalized search model that selectively incorporates high-value consultations through a consultation-user action interaction module and an explicit objective that aligns consultations with user actions. Experiments on both public and commercial datasets show that VAPS consistently outperforms baselines in both retrieval and ranking tasks.",
+    "translated": "电商平台中的个性化搜索系统日益依赖用户与AI助手的交互,用户通过咨询了解产品特性、使用场景等信息。如何利用咨询对话实现搜索服务个性化已成为研究趋势。现有方法因缺乏明确的\"价值\"标注,通常仅依赖语义相似度将历史咨询与当前查询进行匹配,但我们发现仅凭语义相似度往往难以准确衡量咨询对个性化的实际价值。为此,我们提出一个咨询价值评估框架,从三个创新维度评估历史咨询:(1)场景覆盖价值、(2)后续行为价值、(3)时效衰减价值。基于此,我们提出VAPS模型——一种价值敏感的个性化搜索框架,该模型通过咨询-用户行为交互模块和显式的咨询-行为对齐目标,选择性整合高价值咨询。在公开数据集和商业数据集上的实验表明,VAPS在检索和排序任务上均显著优于基线模型。"
   }
 ]

0 commit comments
