
Commit 21e681a

chore: update confs
1 parent 11eb865 commit 21e681a

1 file changed

+35 -0 lines changed

arxiv.json

Lines changed: 35 additions & 0 deletions
@@ -42509,5 +42509,40 @@
"pub_date": "2025-03-11",
"summary": "Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21\\% and 80.20\\% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset and demo have be released in http://ai-researcher.net.",
"translated": "大型语言模型(LLMs)在科学研究评估中的应用日益广泛,特别是在自动化论文评审方面。然而,现有的基于LLM的评审系统面临着重大挑战,包括有限的领域专业知识、幻觉推理以及缺乏结构化评估。为了解决这些局限性,我们引入了DeepReview,一个多阶段框架,旨在通过结合结构化分析、文献检索和基于证据的论证来模拟专家评审者。利用DeepReview-13K这一带有结构化注释的精选数据集,我们训练了DeepReviewer-14B,其在较少的token数量下优于CycleReviewer-70B。在其最佳模式下,DeepReviewer-14B在评估中对GPT-o1和DeepSeek-R1的胜率分别达到88.21%和80.20%。我们的工作为基于LLM的论文评审设立了新的基准,所有资源均已公开。代码、模型、数据集和演示已在http://ai-researcher.net发布。"
},
{
"title": "Search-R1: Training LLMs to Reason and Leverage Search Engines with\n Reinforcement Learning",
"url": "http://arxiv.org/abs/2503.09516v1",
"pub_date": "2025-03-12",
"summary": "Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval augmentation and tool-use training approaches where a search engine is treated as a tool lack complex multi-turn retrieval flexibility or require large-scale supervised data. Prompting advanced LLMs with reasoning capabilities during inference to use search engines is not optimal, since the LLM does not learn how to optimally interact with the search engine. This paper introduces Search-R1, an extension of the DeepSeek-R1 model where the LLM learns -- solely through reinforcement learning (RL) -- to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over SOTA baselines. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.",
"translated": "高效获取外部知识和最新信息对于大型语言模型(LLMs)的有效推理和文本生成至关重要。将搜索引擎视为工具的检索增强和工具使用训练方法缺乏复杂的多轮检索灵活性,或需要大规模的监督数据。在推理过程中提示具有推理能力的高级LLMs使用搜索引擎并不理想,因为LLM没有学会如何与搜索引擎进行最佳交互。本文介绍了Search-R1,它是DeepSeek-R1模型的扩展,其中LLM仅通过强化学习(RL)学习在逐步推理过程中自主生成(多个)搜索查询,并实时检索。Search-R1通过多轮搜索交互优化LLM的展开,利用检索到的令牌掩码进行稳定的RL训练,并采用基于结果的简单奖励函数。在七个问答数据集上的实验表明,Search-R1在SOTA基线基础上的性能提升了26%(Qwen2.5-7B)、21%(Qwen2.5-3B)和10%(LLaMA3.2-3B)。本文还进一步提供了关于RL优化方法、LLM选择和检索增强推理中响应长度动态的实证见解。代码和模型检查点可在https://github.com/PeterGriffinJin/Search-R1获取。"
},
{
"title": "Learning Cascade Ranking as One Network",
"url": "http://arxiv.org/abs/2503.09492v1",
"pub_date": "2025-03-12",
"summary": "Cascade Ranking is a prevalent architecture in large-scale top-k selection systems like recommendation and advertising platforms. Traditional training methods focus on single-stage optimization, neglecting interactions between stages. Recent advances such as RankFlow and FS-LTR have introduced interaction-aware training paradigms but still struggle to 1) align training objectives with the goal of the entire cascade ranking (i.e., end-to-end recall) and 2) learn effective collaboration patterns for different stages. To address these challenges, we propose LCRON, which introduces a novel surrogate loss function derived from the lower bound probability that ground truth items are selected by cascade ranking, ensuring alignment with the overall objective of the system. According to the properties of the derived bound, we further design an auxiliary loss for each stage to drive the reduction of this bound, leading to a more robust and effective top-k selection. LCRON enables end-to-end training of the entire cascade ranking system as a unified network. Experimental results demonstrate that LCRON achieves significant improvement over existing methods on public benchmarks and industrial applications, addressing key limitations in cascade ranking training and significantly enhancing system performance.",
"translated": "级联排序(Cascade Ranking)是推荐系统和广告平台等大规模top-k选择系统中的一种常见架构。传统的训练方法侧重于单阶段优化,忽略了各阶段之间的交互。最近的研究进展,如RankFlow和FS-LTR,引入了交互感知的训练范式,但仍然面临以下两个主要挑战:1)如何将训练目标与整个级联排序的目标(即端到端召回)对齐;2)如何为不同阶段学习有效的协作模式。为解决这些问题,我们提出了LCRON(Lower-bound Cascade Ranking Optimization Network),其引入了一种新的代理损失函数,该函数源自级联排序选择真实项的概率下界,从而确保与系统的整体目标保持一致。基于推导出的下界性质,我们进一步为每个阶段设计了辅助损失函数,以推动该下界的减小,从而实现更鲁棒和有效的top-k选择。LCRON使得整个级联排序系统能够作为一个统一的网络进行端到端训练。实验结果表明,LCRON在公开基准测试和工业应用中相比现有方法取得了显著改进,解决了级联排序训练中的关键限制,并显著提升了系统性能。"
},
{
"title": "Towards Next-Generation Recommender Systems: A Benchmark for\n Personalized Recommendation Assistant with LLMs",
"url": "http://arxiv.org/abs/2503.09382v1",
"pub_date": "2025-03-12",
"summary": "Recommender systems (RecSys) are widely used across various modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits the comprehensive assessments of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new dataset benchmark designed to access LLMs' ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered below findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants, 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.",
"translated": "推荐系统(RecSys)广泛应用于各种现代数字平台,并引起了广泛关注。传统的推荐系统通常只关注固定且简单的推荐场景,难以在交互式范式中推广到新的、未见过的推荐任务。最近,大语言模型(LLMs)的发展彻底改变了推荐系统的基础架构,推动其向更智能和交互式的个性化推荐助手方向发展。然而,大多数现有研究依赖于固定的任务特定提示模板来生成推荐并评估个性化助手的性能,这限制了对它们能力的全面评估。这是因为常用的数据集缺乏反映真实推荐场景的高质量文本用户查询,使得它们不适合评估基于LLM的个性化推荐助手。为了解决这一差距,我们引入了RecBench+,这是一个新的数据集基准,旨在评估LLMs在LLM时代处理复杂用户推荐需求的能力。RecBench+包含了一组多样化的查询,涵盖了硬性条件和软性偏好,且难度各异。我们在RecBench+上评估了常用的LLMs,并得出了以下发现:1)LLMs展示了作为推荐助手的初步能力,2)LLMs更擅长处理明确陈述条件的查询,而在需要推理或包含误导信息的查询方面面临挑战。我们的数据集已在https://github.com/jiani-huang/RecBench.git上发布。"
},
{
"title": "xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using\n Self-Knowledge Distillation",
"url": "http://arxiv.org/abs/2503.09313v1",
"pub_date": "2025-03-12",
"summary": "In the current literature, most embedding models are based on the encoder-only transformer architecture to extract a dense and meaningful representation of the given input, which can be a text, an image, and more. With the recent advances in language modeling thanks to the introduction of Large Language Models, the possibility of extracting embeddings from these large and extensively trained models has been explored. However, current studies focus on textual embeddings in English, which is also the main language on which these models have been trained. Furthermore, there are very few models that consider multimodal and multilingual input. In light of this, we propose an adaptation methodology for Large Vision-Language Models trained on English language data to improve their performance in extracting multilingual and multimodal embeddings. Finally, we design and introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.",
"translated": "在当前的研究文献中,大多数嵌入模型都基于仅编码器(encoder-only)的Transformer架构,用于提取给定输入的密集且有意义的表示,该输入可以是文本、图像等。随着大语言模型(Large Language Models)的引入,语言建模领域取得了显著进展,研究人员开始探索从这些经过大规模训练的模型中提取嵌入的可能性。然而,目前的研究主要集中在英语文本嵌入上,这也是这些模型主要训练的语言。此外,极少有模型能够处理多模态和多语言的输入。基于此,我们提出了一种适应方法,针对在英语数据上训练的大规模视觉-语言模型(Large Vision-Language Models),以提升其在提取多语言和多模态嵌入方面的性能。最后,我们设计并引入了一个基准测试,用于评估多语言和多模态嵌入模型的有效性。"
},
{
"title": "LREF: A Novel LLM-based Relevance Framework for E-commerce",
"url": "http://arxiv.org/abs/2503.09223v1",
"pub_date": "2025-03-12",
"summary": "Query and product relevance prediction is a critical component for ensuring a smooth user experience in e-commerce search. Traditional studies mainly focus on BERT-based models to assess the semantic relevance between queries and products. However, the discriminative paradigm and limited knowledge capacity of these approaches restrict their ability to comprehend the relevance between queries and products fully. With the rapid advancement of Large Language Models (LLMs), recent research has begun to explore their application to industrial search systems, as LLMs provide extensive world knowledge and flexible optimization for reasoning processes. Nonetheless, directly leveraging LLMs for relevance prediction tasks introduces new challenges, including a high demand for data quality, the necessity for meticulous optimization of reasoning processes, and an optimistic bias that can result in over-recall. To overcome the above problems, this paper proposes a novel framework called the LLM-based RElevance Framework (LREF) aimed at enhancing e-commerce search relevance. The framework comprises three main stages: supervised fine-tuning (SFT) with Data Selection, Multiple Chain of Thought (Multi-CoT) tuning, and Direct Preference Optimization (DPO) for de-biasing. We evaluate the performance of the framework through a series of offline experiments on large-scale real-world datasets, as well as online A/B testing. The results indicate significant improvements in both offline and online metrics. Ultimately, the model was deployed in a well-known e-commerce application, yielding substantial commercial benefits.",
"translated": "查询与商品相关性预测是确保电子商务搜索中用户体验流畅的关键组成部分。传统研究主要集中在基于BERT的模型上,以评估查询与商品之间的语义相关性。然而,这些方法的判别范式及其有限的知识容量限制了它们全面理解查询与商品之间相关性的能力。随着大型语言模型(LLMs)的快速发展,近期研究开始探索其在工业搜索系统中的应用,因为LLMs提供了广泛的世界知识以及对推理过程的灵活优化。尽管如此,直接利用LLMs进行相关性预测任务引入了新的挑战,包括对数据质量的高要求、对推理过程进行细致优化的必要性,以及可能导致过度召回的乐观偏差。\n\n为了克服上述问题,本文提出了一种名为基于LLM的相关性框架(LLM-based RElevance Framework, LREF)的新颖框架,旨在增强电子商务搜索的相关性。该框架包含三个主要阶段:带有数据选择的监督微调(Supervised Fine-Tuning, SFT)、多重思维链(Multiple Chain of Thought, Multi-CoT)调优以及用于去偏的直接偏好优化(Direct Preference Optimization, DPO)。我们通过对大规模真实世界数据集的一系列离线实验以及在线A/B测试来评估该框架的性能。结果表明,离线和在线指标均显著提升。最终,该模型被部署在一个知名的电子商务应用中,带来了显著的商业效益。\n\n总结来说,本文提出的LREF框架通过结合监督微调、多重思维链调优和直接偏好优化,有效提升了电子商务搜索中的查询与商品相关性预测能力,并在实际应用中取得了显著的商业成果。"
}
]
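
Each record appended in this commit follows the same schema as the existing entries: title, url, pub_date, summary, and a Chinese translated field. Below is a minimal sketch of how the updated file could be loaded and sanity-checked after a change like this one; the script name, file path, and checks are illustrative assumptions, not part of the commit.

# check_arxiv_json.py -- illustrative only, not part of this commit.
# Loads arxiv.json and verifies each record carries the fields seen in the diff above.
import json

EXPECTED_FIELDS = {"title", "url", "pub_date", "summary", "translated"}

with open("arxiv.json", encoding="utf-8") as f:
    papers = json.load(f)  # the file is one JSON array of paper records

for i, paper in enumerate(papers):
    missing = EXPECTED_FIELDS - paper.keys()
    if missing:
        print(f"entry {i} is missing: {sorted(missing)}")

print(f"{len(papers)} entries parsed")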
