chore: update confs

actions-user · actions-user · commit 8fd68a5973d3 · 2025-03-11T10:20:17.000Z
diff --git a/arxiv.json b/arxiv.json
@@ -42306,5 +42306,40 @@
         "pub_date": "2025-03-07",
         "summary": "Large language models (LLMs) are increasingly deployed in real-world question-answering (QA) applications. However, LLMs have been proven to generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal prediction (CP) is well-known to be model-agnostic and distribution-free, which creates statistically rigorous prediction sets in classification tasks. In this work, we for the first time adapt the CP framework to medical multiple-choice question-answering (MCQA) tasks, by correlating the nonconformity score with the frequency score of correct options grounded in self-consistency theory, assuming no access to internal model information. Considering that the adapted CP framework can only control the (mis)coverage rate, we employ a risk control framework, which can manage task-specific metrics by devising a monotonically decreasing loss function. We evaluate our framework on 3 popular medical MCQA datasets utilizing 4 ``off-the-shelf'' LLMs. Empirical results demonstrate that we achieve user-specified average (or marginal) error rates on the test set. Furthermore, we observe that the average prediction set size (APSS) on the test set decreases as the risk level increases, which concludes a promising evaluation metric for the uncertainty of LLMs.",
         "translated": "大型语言模型（LLMs）越来越多地被部署在实际的问答（QA）应用中。然而，LLMs已被证明会产生幻觉和非事实信息，这削弱了它们在高风险的医疗任务中的可信度。保形预测（CP）以其模型无关性和分布自由性而闻名，能够在分类任务中创建统计上严格的预测集。在本研究中，我们首次将CP框架应用于医疗多项选择题问答（MCQA）任务，通过将非一致性分数与基于自洽理论得出的正确答案选项的频率分数相关联，假设无法访问模型的内部信息。考虑到调整后的CP框架只能控制（错误）覆盖率，我们采用了一种风险控制框架，通过设计一个单调递减的损失函数来管理特定任务的指标。我们利用4种“现成的”LLMs在3个流行的医疗MCQA数据集上评估了我们的框架。实证结果表明，我们在测试集上达到了用户指定的平均（或边际）错误率。此外，我们观察到，随着风险水平的增加，测试集上的平均预测集大小（APSS）减少，这为LLMs的不确定性提供了一个有前景的评估指标。"
+    },
+    {
+        "title": "Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark",
+        "url": "http://arxiv.org/abs/2503.07470v1",
+        "pub_date": "2025-03-10",
+        "summary": "With the rapid development of natural language processing, many language models have been invented for multiple tasks. One important task is information retrieval (IR), which requires models to retrieve relevant documents. Despite its importance in many real-life applications, especially in retrieval augmented generation (RAG) systems, this task lacks Vietnamese benchmarks. This situation causes difficulty in assessing and comparing many existing Vietnamese embedding language models on the task and slows down the advancement of Vietnamese natural language processing (NLP) research. In this work, we aim to provide the Vietnamese research community with a new benchmark for information retrieval, which mainly focuses on retrieval and reranking tasks. Furthermore, we also present a new objective function based on the InfoNCE loss function, which is used to train our Vietnamese embedding model. Our function aims to be better than the origin in information retrieval tasks. Finally, we analyze the effect of temperature, a hyper-parameter in both objective functions, on the performance of text embedding models.",
+        "translated": "随着自然语言处理技术的快速发展，许多语言模型已被发明用于多种任务。其中一项重要任务是信息检索（IR），该任务要求模型检索相关文档。尽管信息检索在许多实际应用中，特别是在检索增强生成（RAG）系统中具有重要意义，但该任务在越南语领域缺乏基准测试。这种情况导致在评估和比较许多现有的越南语嵌入语言模型时存在困难，并阻碍了越南语自然语言处理（NLP）研究的进展。在本研究中，我们旨在为越南语研究社区提供一个用于信息检索的新基准，主要关注检索和重排序任务。此外，我们还提出了一种基于InfoNCE损失函数的新目标函数，用于训练我们的越南语嵌入模型。我们的目标函数旨在在信息检索任务中优于原始函数。最后，我们分析了温度这一超参数在两种目标函数中对文本嵌入模型性能的影响。"
+    },
+    {
+        "title": "Process-Supervised LLM Recommenders via Flow-guided Tuning",
+        "url": "http://arxiv.org/abs/2503.07377v1",
+        "pub_date": "2025-03-10",
+        "summary": "While large language models (LLMs) are increasingly adapted for recommendation systems via supervised fine-tuning (SFT), this approach amplifies popularity bias due to its likelihood maximization objective, compromising recommendation diversity and fairness. To address this, we present Flow-guided fine-tuning recommender (Flower), which replaces SFT with a Generative Flow Network (GFlowNet) framework that enacts process supervision through token-level reward propagation. Flower's key innovation lies in decomposing item-level rewards into constituent token rewards, enabling direct alignment between token generation probabilities and their reward signals. This mechanism achieves three critical advancements: (1) popularity bias mitigation and fairness enhancement through empirical distribution matching, (2) preservation of diversity through GFlowNet's proportional sampling, and (3) flexible integration of personalized preferences via adaptable token rewards. Experiments demonstrate Flower's superior distribution-fitting capability and its significant advantages over traditional SFT in terms of fairness, diversity, and accuracy, highlighting its potential to improve LLM-based recommendation systems. The implementation is available via https://github.com/Mr-Peach0301/Flower",
+        "translated": "尽管大型语言模型（LLMs）越来越多地通过监督微调（SFT）应用于推荐系统，但这种方法由于其似然最大化的目标会放大流行度偏差，从而损害推荐的多样性和公平性。为了解决这一问题，我们提出了基于生成流网络（GFlowNet）框架的Flow-guided微调推荐系统（Flower），该框架通过令牌级奖励传播实现过程监督。Flower的核心创新在于将项目级奖励分解为组成令牌的奖励，从而使令牌生成概率与其奖励信号直接对齐。这一机制实现了三个关键进展：（1）通过经验分布匹配缓解流行度偏差并增强公平性，（2）通过GFlowNet的比例采样保持多样性，（3）通过可调整的令牌奖励灵活整合个性化偏好。实验表明，Flower在分布拟合能力上表现出色，并在公平性、多样性和准确性方面显著优于传统的SFT方法，凸显了其在改进基于LLM的推荐系统中的潜力。实现代码可通过https://github.com/Mr-Peach0301/Flower获取。"
+    },
+    {
+        "title": "Zero-Shot Hashing Based on Reconstruction With Part Alignment",
+        "url": "http://arxiv.org/abs/2503.07037v1",
+        "pub_date": "2025-03-10",
+        "summary": "Hashing algorithms have been widely used in large-scale image retrieval tasks, especially for seen class data. Zero-shot hashing algorithms have been proposed to handle unseen class data. The key technique in these algorithms involves learning features from seen classes and transferring them to unseen classes, that is, aligning the feature embeddings between the seen and unseen classes. Most existing zero-shot hashing algorithms use the shared attributes between the two classes of interest to complete alignment tasks. However, the attributes are always described for a whole image, even though they represent specific parts of the image. Hence, these methods ignore the importance of aligning attributes with the corresponding image parts, which explicitly introduces noise and reduces the accuracy achieved when aligning the features of seen and unseen classes. To address this problem, we propose a new zero-shot hashing method called RAZH. We first use a clustering algorithm to group similar patches to image parts for attribute matching and then replace the image parts with the corresponding attribute vectors, gradually aligning each part with its nearest attribute. Extensive evaluation results demonstrate the superiority of the RAZH method over several state-of-the-art methods.",
+        "translated": "哈希算法已被广泛应用于大规模图像检索任务中，尤其是针对已知类别数据。为了处理未知类别数据，零样本哈希算法被提出。这些算法中的关键技术包括从已知类别中学习特征并将其迁移到未知类别中，即对齐已知类别和未知类别之间的特征嵌入。大多数现有的零样本哈希算法使用两个类别之间的共享属性来完成对齐任务。然而，尽管这些属性代表图像的特定部分，但它们通常是对整个图像进行描述的。因此，这些方法忽略了将属性与相应图像部分对齐的重要性，这显式地引入了噪声，并降低了在已知类别和未知类别特征对齐时所能达到的准确性。为了解决这个问题，我们提出了一种新的零样本哈希方法，称为RAZH。我们首先使用聚类算法将相似的图像块分组到图像部分以进行属性匹配，然后用相应的属性向量替换图像部分，逐步将每个部分与其最近的属性对齐。大量评估结果表明，RAZH方法在多个最先进的方法中表现出优越性。"
+    },
+    {
+        "title": "Weak Supervision for Improved Precision in Search Systems",
+        "url": "http://arxiv.org/abs/2503.07025v1",
+        "pub_date": "2025-03-10",
+        "summary": "Labeled datasets are essential for modern search engines, which increasingly rely on supervised learning methods like Learning to Rank and massive amounts of data to power deep learning models. However, creating these datasets is both time-consuming and costly, leading to the common use of user click and activity logs as proxies for relevance. In this paper, we present a weak supervision approach to infer the quality of query-document pairs and apply it within a Learning to Rank framework to enhance the precision of a large-scale search system.",
+        "translated": "标记数据集对于现代搜索引擎至关重要，这些搜索引擎越来越依赖于监督学习方法（如排序学习）和大量数据来支持深度学习模型。然而，创建这些数据集既耗时又昂贵，因此通常使用用户点击和活动日志作为相关性的替代指标。在本文中，我们提出了一种弱监督方法，用于推断查询-文档对的质量，并将其应用于排序学习框架中，以提高大规模搜索系统的精度。"
+    },
+    {
+        "title": "Multi-Behavior Recommender Systems: A Survey",
+        "url": "http://arxiv.org/abs/2503.06963v1",
+        "pub_date": "2025-03-10",
+        "summary": "Traditional recommender systems primarily rely on a single type of user-item interaction, such as item purchases or ratings, to predict user preferences. However, in real-world scenarios, users engage in a variety of behaviors, such as clicking on items or adding them to carts, offering richer insights into their interests. Multi-behavior recommender systems leverage these diverse interactions to enhance recommendation quality, and research on this topic has grown rapidly in recent years. This survey provides a timely review of multi-behavior recommender systems, focusing on three key steps: (1) Data Modeling: representing multi-behaviors at the input level, (2) Encoding: transforming these inputs into vector representations (i.e., embeddings), and (3) Training: optimizing machine-learning models. We systematically categorize existing multi-behavior recommender systems based on the commonalities and differences in their approaches across the above steps. Additionally, we discuss promising future directions for advancing multi-behavior recommender systems.",
+        "translated": "传统的推荐系统主要依赖于单一类型的用户-项目交互，例如项目购买或评分，来预测用户偏好。然而，在现实场景中，用户会进行多种行为，例如点击项目或将其加入购物车，这些行为为理解用户兴趣提供了更丰富的洞察。多行为推荐系统利用这些多样化的交互来提高推荐质量，近年来该领域的研究迅速增长。本文对多行为推荐系统进行了及时的综述，重点关注三个关键步骤：（1）数据建模：在输入层面表示多行为，（2）编码：将这些输入转换为向量表示（即嵌入），以及（3）训练：优化机器学习模型。我们根据现有多行为推荐系统在上述步骤中的方法共性和差异，对其进行了系统分类。此外，我们还讨论了推动多行为推荐系统发展的未来研究方向。"
     }
 ]