Paper | Authors | Organization | Abstract | Translation | Code | Citations |
---|---|---|---|---|---|---|
Large Language Models are Zero-Shot Rankers for Recommender Systems | Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian J. McAuley, Wayne Xin Zhao | | Recently, large language models (LLMs) (e.g. GPT-4) have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work aims to investigate the capacity of LLMs that act as the ranking model for recommender systems. To conduct our empirical study, we first formalize the recommendation problem as a conditional ranking task, considering sequential interaction histories as conditions and the items retrieved by the candidate generation model as candidates. We adopt a specific prompting approach to solving the ranking task by LLMs: we carefully design the prompting template by including the sequential interaction history, the candidate items, and the ranking instruction. We conduct extensive experiments on two widely-used datasets for recommender systems and derive several key findings for the use of LLMs in recommender systems. We show that LLMs have promising zero-shot ranking abilities, even competitive to or better than conventional recommendation models on candidates retrieved by multiple candidate generators. We also demonstrate that LLMs struggle to perceive the order of historical interactions and can be affected by biases like position bias, while these issues can be alleviated via specially designed prompting and bootstrapping strategies. The code to reproduce this work is available at https://github.com/RUCAIBox/LLMRank. | 最近,大型语言模型(LLM,例如 GPT-4)展示了令人印象深刻的通用任务解决能力,包括处理推荐任务的潜力。沿着这条研究路线,本工作旨在考察 LLM 作为推荐系统排序模型的能力。为了开展实证研究,我们首先将推荐问题形式化为条件排序任务:以序贯交互历史为条件,以候选生成模型检索到的物品为候选项。我们采用一种特定的提示方法让 LLM 完成排序任务:精心设计提示模板,其中包含序贯交互历史、候选物品和排序指令。我们在推荐系统领域两个广泛使用的数据集上进行了大量实验,得出了在推荐系统中使用 LLM 的若干关键发现。我们表明,LLM 具有良好的零样本排序能力,在多个候选生成器检索出的候选项上,甚至能与传统推荐模型相当或更优。我们还发现,LLM 难以感知历史交互的先后次序,且会受到位置偏差等偏见的影响,而这些问题可以通过专门设计的提示和自举策略得到缓解。复现这项工作的代码可在 https://github.com/RUCAIBox/LLMRank 获取。 | code | 3 |
Exploring Large Language Models and Hierarchical Frameworks for Classification of Large Unstructured Legal Documents | Nishchal Prasad, Mohand Boughanem, Taoufiq Dkaki | | Legal judgment prediction suffers from the problem of long case documents, generally exceeding tens of thousands of words and having a non-uniform structure. Predicting judgments from such documents becomes a challenging task, more so on documents with no structural annotation. We explore the classification of these large legal documents and their lack of structural information with a deep-learning-based hierarchical framework which we call MESc ("Multi-stage Encoder-based Supervised with-clustering") for judgment prediction. Specifically, we divide a document into parts to extract their embeddings from the last four layers of a custom fine-tuned Large Language Model, and try to approximate their structure through unsupervised clustering, which we use in another set of transformer encoder layers to learn the inter-chunk representations. We analyze the adaptability of Large Language Models (LLMs) with multi-billion parameters (GPT-Neo and GPT-J) within the hierarchical framework of MESc and compare them with their standalone performance on legal texts. We also study their intra-domain (legal) transfer learning capability and the impact of combining embeddings from their last layers in MESc. We test these methods and their effectiveness with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States with the ILDC dataset and a subset of the LexGLUE dataset. Our approach achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods. | 法律判决预测普遍面临案件文书篇幅过长(动辄数万词)且结构不统一的问题。从此类文书中预测判决是一项具有挑战性的任务,对于没有结构标注的文书更是如此。我们采用一种基于深度学习的层次化框架(称为 MESc,"基于多级编码器的带聚类监督"框架)来探讨这类大型法律文书的分类及其结构信息缺失的问题,用于判决预测。具体来说,我们将文书切分为若干部分,从定制微调的大语言模型最后四层提取它们的嵌入,并尝试通过无监督聚类近似其结构,再将其输入另一组 Transformer 编码器层以学习块间表示。我们分析了具有数十亿参数的大语言模型(GPT-Neo 和 GPT-J)与 MESc 层次框架的适配性,并与它们在法律文本上的独立性能进行了比较。我们还研究了它们的域内(法律)迁移学习能力,以及在 MESc 中融合其最后几层嵌入的影响。我们使用 ILDC 数据集和 LexGLUE 数据集的一个子集,在来自印度、欧盟和美国的法律文书上通过大量实验和消融研究检验了这些方法及其有效性。我们的方法相比以往最先进的方法至少取得约 2 个点的总性能提升。 | code | 1 |
Overview of PAN 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification - Extended Abstract | Janek Bevendorff, Xavier Bonet Casals, Berta Chulvi, Daryna Dementieva, Ashraf Elnagar, Dayne Freitag, Maik Fröbe, Damir Korencic, Maximilian Mayerl, Animesh Mukherjee, Alexander Panchenko, Martin Potthast, Francisco Rangel, Paolo Rosso, Alisa Smirnova, Efstathios Stamatatos, Benno Stein, Mariona Taulé, Dmitry Ustalov, Matti Wiegmann, Eva Zangerle | Bauhaus Univ Weimar, Weimar, Germany; Symanto Res, Valencia, Spain; Univ Politecn Valencia, Valencia, Spain; JetBrains, Belgrade, Serbia; Tech Univ Munich, Munich, Germany; Indian Inst Technol Kharagpur, Kharagpur, W Bengal, India; Univ Barcelona, Barcelona, Spain; Univ Aegean, Samos, Greece; Rudjer Boskovic Inst, Zagreb, Croatia; Univ Sharjah, Sharjah, U Arab Emirates; Toloka, Luzern, Switzerland; Friedrich Schiller Univ Jena, Jena, Germany; Univ Hamburg, Hamburg, Germany; Univ Appl Sci BFI Vienna, Vienna, Austria; SRI Int, 333 Ravenswood Ave, Menlo Pk, CA 94025 USA; Univ Innsbruck, Innsbruck, Austria; Univ Kassel, Kassel, Germany; Univ Santiago de Compostela, Santiago, Spain; Skoltech & AIRI, Skolkovo, Russia; Univ Leipzig, Leipzig, Germany | The paper gives a brief overview of the four shared tasks organized at the PAN 2024 lab on digital text forensics and stylometry to be hosted at CLEF 2024. The goal of the PAN lab is to advance the state-of-the-art in text forensics and stylometry through an objective evaluation of new and established methods on new benchmark datasets. Our four tasks are: (1) multi-author writing style analysis, which we continue from 2023 in a more difficult version, (2) multilingual text detoxification, a new task that aims to translate and re-formulate text in a non-toxic way, (3) oppositional thinking analysis, a new task that aims to discriminate critical thinking from conspiracy narratives and identify their core actors, and (4) generative AI authorship verification, which formulates the detection of AI-generated text as an authorship problem, one of PAN's core tasks. As with the previous editions, PAN invites software submissions as easy-to-reproduce docker containers; more than 400 pieces of software have been submitted from PAN'12 through PAN'23 combined, with all recent evaluations running on the TIRA experimentation platform [8]. | 本文简要介绍了将在CLEF 2024上举办的PAN 2024实验室关于数字文本取证和风格计量学的四项共享任务。PAN实验室的目标是通过在新的基准数据集上对新方法和已有方法进行客观评估,推动文本取证和风格计量学领域的前沿发展。我们的四项任务包括:(1)多作者写作风格分析,这是我们在2023年基础上推出的更具挑战性的版本;(2)多语言文本去毒化,这是一项新任务,旨在以非毒性的方式翻译和重述文本;(3)对立思维分析,这是一项新任务,旨在区分批判性思维与阴谋论叙述,并识别其核心行为者;(4)生成式AI作者身份验证,该任务将AI生成文本的检测问题表述为作者身份验证问题,这是PAN的核心任务之一。与以往版本一样,PAN邀请以易于复现的Docker容器形式提交软件;从PAN'12到PAN'23,已提交了超过400个软件,所有最近的评估均在TIRA实验平台上运行[8]。 | code | 1 |
Incorporating Query Recommendation for Improving In-Car Conversational Search | Md. Rashad Al Hasan Rony, Soumya Ranjan Sahoo, Abbas Goher Khan, Ken E. Friedl, Viju Sudhi, Christian Süß | Fraunhofer IAIS, Zwickauer Str 46, D-01069 Dresden, Germany; BMW Grp, Parkring 19-23, D-85748 Garching, Germany | Retrieval-augmented generation has become an effective mechanism for conversational systems in domain-specific settings. Retrieval of a wrong document due to the lack of context in the user utterance may lead to wrong answer generation. Such an issue may reduce user engagement and thereby the system's reliability. In this paper, we propose a context-guided follow-up question recommendation to internally improve the document retrieval in an iterative approach for developing an in-car conversational system. Specifically, a user utterance is first reformulated, given the context of the conversation, to facilitate improved understanding by the retriever. In cases where the documents retrieved by the retriever are not relevant enough to answer the user utterance, we employ a large language model (LLM) to generate a question recommendation, which is then utilized to perform a refined retrieval. An empirical evaluation confirms the effectiveness of our proposed approaches in in-car conversations, achieving 48% and 22% improvements in the retrieval and system-generated responses, respectively, against baseline approaches. | 检索增强生成已成为特定领域设置中对话系统的有效机制。由于用户话语中缺乏上下文信息,检索到错误的文档可能会导致生成错误的答案。这一问题可能会降低用户参与度,从而影响系统的可靠性。在本文中,我们提出了一种上下文引导的后续问题推荐方法,通过迭代方式内部改进文档检索,以开发车载对话系统。具体来说,首先根据对话的上下文对用户话语进行重新表述,以帮助检索器更好地理解。在检索器检索到的文档不足以回答用户话语的情况下,我们利用大型语言模型(LLM)生成问题推荐,然后利用该推荐进行更精确的检索。实证评估证实了我们所提出方法在车载对话中的有效性,与基线方法相比,检索和系统生成响应的性能分别提高了48%和22%。 | code | 0 |
ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search | Beatriz Soviero, Daniel Kuhn, Alexandre Salle, Viviane Pereira Moreira | VTEX, Porto Alegre, RS, Brazil; Univ Fed Rio Grande do Sul, Inst Informat, Porto Alegre, RS, Brazil; Inst Educ Sci & Technol Rio Grande do Sul IFRS, Ibiruba, Brazil | The dependence on human relevance judgments limits the development of information retrieval test collections that are vital for evaluating these systems. Since their launch, large language models (LLMs) have been applied to automate several human tasks. Recently, LLMs started being used to provide relevance judgments for document search. In this work, our goal is to assess whether LLMs can replace human annotators in a different setting - product search in eCommerce. We conducted experiments on open and proprietary industrial datasets to measure LLMs' ability to predict relevance judgments. Our results found that LLM-generated relevance assessments present a strong agreement (approximately 82%) with human annotations, indicating that LLMs have an innate ability to perform relevance judgments in an eCommerce setting. Then, we went further and tested whether LLMs can generate annotation guidelines. We found that relevance assessments obtained with LLM-generated guidelines are as accurate as those obtained from human instructions. (The source code for this work is available at https://github.com/danimtk/chatGPT-goes-shopping.) | 对人工相关性判断的依赖限制了信息检索测试集的发展,而这些测试集对于评估这些系统至关重要。自大型语言模型(LLMs)推出以来,已被应用于自动化多项人工任务。最近,LLMs开始被用于提供文档搜索的相关性判断。在这项工作中,我们的目标是评估LLMs在电子商务产品搜索这一不同场景下是否能替代人工标注者。我们在开放和专有的工业数据集上进行了实验,以衡量LLMs预测相关性判断的能力。我们的结果表明,LLMs生成的相关性评估与人工标注表现出高度一致性(约为82%),这表明LLMs在电子商务环境中具有执行相关性判断的先天能力。接着,我们进一步测试了LLMs是否能生成标注指南。我们的研究发现,使用LLMs生成的指南获得的相关性评估与使用人工指令获得的评估同样准确。(本工作的源代码可在 https://github.com/danimtk/chatGPT-goes-shopping 获取。) | code | 0 |
Lottery4CVR: Neuron-Connection Level Sharing for Multi-task Learning in Video Conversion Rate Prediction | Xuanji Xiao, Jimmy Chen, Yuzhen Liu, Xing Yao, Pei Liu, Chaosheng Fan | Tencent Inc, Beijing, Peoples R China | As a fundamental task of industrial ranking systems, conversion rate (CVR) prediction is suffering from data sparsity problems. Most conventional CVR modeling leverages Click-through rate (CTR)&CVR multitask learning because CTR involves far more samples than CVR. However, typical coarse-grained layer-level sharing methods may introduce conflicts and lead to performance degradation, since not every neuron or neuron connection in one layer should be shared between CVR and CTR tasks. This is because users may have different fine-grained content feature preferences between deep consumption and click behaviors, represented by CVR and CTR, respectively. To address this sharing&conflict problem, we propose a neuron-connection level knowledge sharing. We start with an over-parameterized base network from which CVR and CTR extract their own subnetworks. The subnetworks have partially overlapped neuron connections which correspond to the sharing knowledge, and the task-specific neuron connections are utilized to alleviate the conflict problem. As far as we know, this is the first time that a neuron-connection level sharing is proposed in CVR modeling. Experiments on the Tencent video platform demonstrate the superiority of the method, which has been deployed serving major traffic. (The source code is available at https://github.com/xuanjixiao/onerec/tree/main/lt4rec). | 作为工业排名系统的一项基础任务,转化率(CVR)预测一直受到数据稀疏问题的困扰。大多数传统的CVR建模方法利用点击率(CTR)和CVR的多任务学习,因为CTR涉及的样本数量远多于CVR。然而,典型的粗粒度层级共享方法可能会引入冲突并导致性能下降,因为并非每一层中的每个神经元或神经元连接都应在CVR和CTR任务之间共享。这是因为用户在深度消费和点击行为之间可能具有不同的细粒度内容特征偏好,分别由CVR和CTR表示。为了解决这种共享与冲突问题,我们提出了一种神经元连接级别的知识共享方法。我们从一个过参数化的基础网络开始,CVR和CTR从中提取各自的子网络。这些子网络的神经元连接部分重叠,对应共享的知识,而特定任务的神经元连接则用于缓解冲突问题。据我们所知,这是在CVR建模中首次提出神经元连接级别的共享方法。在腾讯视频平台上的实验证明了该方法的优越性,并且该方法已经部署服务于主要流量。(源代码可在https://github.com/xuanjixiao/onerec/tree/main/lt4rec获取)。 | code | 0 |
Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search | Kathryn E. Kirchoff, James Wellnitz, Joshua E. Hochuli, Travis Maxfield, Konstantin I. Popov, Shawn M. Gomez, Alexander Tropsha | Department of Computer Science, UNC Chapel Hill; Department of Pharmacology, UNC Chapel Hill; Eshelman School of Pharmacy, UNC Chapel Hill | Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally-aware embedding, SmallSA, for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks. | 基于最近邻的相似性搜索是化学中的一项常见任务,在药物发现中有重要应用。然而,这项任务中一些最常用的方法仍然采用蛮力搜索。在实践中,部分由于现代化学数据库的庞大规模,这会带来高昂的计算成本和过长的耗时。以往针对此任务的计算改进通常依赖硬件升级或缺乏通用性的数据集特定技巧,而利用低复杂度搜索算法的方法仍相对缺乏探索。然而,这类算法许多只是近似解,且/或难以应对典型的高维化学嵌入。在这里,我们评估低维化学嵌入与 k-d 树数据结构的组合能否在保持标准化学相似性搜索基准性能的同时实现快速最近邻查询。我们针对该任务考察了标准化学嵌入的多种降维方案,以及一种学习得到的、具备结构感知能力的嵌入 SmallSA。借助该框架,对超过十亿种化学物质的搜索在单个 CPU 核心上不到一秒即可完成,比蛮力方法快五个数量级。我们还证明 SmallSA 在化学相似性基准上取得了有竞争力的表现。 | code | 0 |
Evaluating the Impact of Content Deletion on Tabular Data Similarity and Retrieval Using Contextual Word Embeddings | Alberto Berenguer, David Tomás, Jose Norberto Mazón | | Table retrieval involves providing a ranked list of relevant tables in response to a search query. A critical aspect of this process is computing the similarity between tables. Recent Transformer-based language models have been effectively employed to generate word embedding representations of tables for assessing their semantic similarity. However, generating such representations for large tables comprising thousands or even millions of rows can be computationally intensive. This study presents the hypothesis that a significant portion of a table's content (i.e., rows) can be removed without substantially impacting its word embedding representation, thereby reducing computational costs while maintaining system performance. To test this hypothesis, two distinct evaluations were conducted. Firstly, an intrinsic evaluation was carried out using two different datasets and five state-of-the-art contextual and non-contextual language models. The findings indicate that, for large tables, retaining just 5% of the content results in a word embedding representation that is 90% similar to the original one. Secondly, an extrinsic evaluation was performed to assess how the three proposed reduction techniques affect the overall performance of the table-based query retrieval system, as measured by MAP, precision, and nDCG. The results demonstrate that these techniques can not only decrease data volume but also improve the performance of the table retrieval system. | 表检索涉及根据搜索查询提供相关表的排名列表。这一过程的一个关键方面是计算表之间的相似性。最近,基于Transformer的语言模型已被有效用于生成表的词嵌入表示,以评估它们的语义相似性。然而,为包含数千甚至数百万行的大型表生成此类表示,计算开销可能非常大。本研究提出了一个假设,即可以移除表中大部分内容(即行)而不会显著影响其词嵌入表示,从而在保持系统性能的同时降低计算成本。为了验证这一假设,进行了两项不同的评估。首先,使用两个不同的数据集和五种最先进的上下文和非上下文语言模型进行了内在评估。结果表明,对于大型表,仅保留5%的内容,得到的词嵌入表示与原始表示的相似度可达90%。其次,进行了外在评估,以考察所提出的三种缩减技术如何影响基于表的查询检索系统的整体性能,通过MAP、精度和nDCG来衡量。结果表明,这些技术不仅可以减少数据量,还可以提高表检索系统的性能。 | code | 0 |
RIGHT: Retrieval-Augmented Generation for Mainstream Hashtag Recommendation | RunZe Fan, Yixing Fan, Jiangui Chen, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng | | Automatic mainstream hashtag recommendation aims to accurately provide users with concise and popular topical hashtags before publication. Generally, mainstream hashtag recommendation faces challenges in the comprehensive difficulty of newly posted tweets in response to new topics, and the accurate identification of mainstream hashtags beyond semantic correctness. However, previous retrieval-based methods based on a fixed predefined mainstream hashtag list excel in producing mainstream hashtags, but fail to understand the constant flow of up-to-date information. Conversely, generation-based methods demonstrate a superior ability to comprehend newly posted tweets, but their capacity is constrained to identifying mainstream hashtags without additional features. Inspired by the recent success of the retrieval-augmented technique, in this work, we attempt to adopt this framework to combine the advantages of both approaches. Meanwhile, with the help of the generator component, we could rethink how to further improve the quality of the retriever component at a low cost. Therefore, we propose RetrIeval-augmented Generative Mainstream HashTag Recommender (RIGHT), which consists of three components: 1) a retriever seeks relevant hashtags from the entire tweet-hashtags set; 2) a selector enhances mainstream identification by introducing global signals; and 3) a generator incorporates input tweets and selected hashtags to directly generate the desired hashtags. The experimental results show that our method achieves significant improvements over state-of-the-art baselines. Moreover, RIGHT can be easily integrated into large language models, improving the performance of ChatGPT by more than 10%. | 自动主流话题标签推荐旨在在发布之前为用户准确提供简洁且流行的话题标签。一般而言,主流话题标签推荐面临两方面挑战:全面理解针对新话题的新发推文的困难,以及在语义正确之外准确识别主流话题标签。然而,以往基于固定预定义主流标签列表的检索式方法擅长产出主流标签,却无法理解源源不断的最新信息;相反,生成式方法在理解新发推文方面能力出色,但在缺乏额外特征的情况下难以识别主流标签。受近来检索增强技术成功的启发,本文尝试采用该框架来结合两种方法的优点。同时,借助生成器组件,我们可以重新思考如何以较低成本进一步提升检索器组件的质量。因此,我们提出了检索增强的生成式主流话题标签推荐器(RIGHT),它由三个组件构成:1)检索器从整个推文-标签集合中寻找相关标签;2)选择器通过引入全局信号增强主流标签识别;3)生成器结合输入推文和所选标签直接生成所需标签。实验结果表明,我们的方法相比最先进的基线取得了显著改进。此外,RIGHT 可以很容易地集成到大型语言模型中,使 ChatGPT 的性能提升 10% 以上。 | code | 0 |
Exploring the Nexus Between Retrievability and Query Generation Strategies | Aman Sinha, Priyanshu Raj Mall, Dwaipayan Roy | | Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable. | 通过文档可检索性得分量化检索函数中的偏差,对于评估面向召回的检索系统至关重要。然而,许多研究检索模型偏差的工作并未验证其查询生成方法能否准确反映真实用户及其查询的可检索性。这一局限源于可检索性评估中缺乏公认的查询生成标准。通常,在没有查询日志可用时,研究者会使用文档语料库中的高频搭配。在本研究中,我们关注可复现性问题,通过比较由人工生成查询得到的可检索性得分与由查询日志得出的得分,来验证查询生成方法。我们的结果表明,人工查询与查询日志的可检索性得分之间仅存在极小或可忽略的相关性,这表明人工生成的查询可能无法准确反映由查询日志得出的可检索性得分。我们进一步探索了替代的查询生成技术,发现了一种相关性最高的变体。当查询日志不可用时,这一替代方法有望提升可复现性。 | code | 0 |
GLAD: Graph-Based Long-Term Attentive Dynamic Memory for Sequential Recommendation | Deepanshu Pandey, Arindam Sarkar, Prakash Mandayam Comar | Amazon Dev Ctr, Bengaluru, India | Recommender systems play a crucial role in the e-commerce stores, enabling customers to explore products and facilitating the discovery of relevant items. Typical recommender systems are built using n most recent user interactions, where value of n is chosen based on trade-off between incremental gains in performance and compute/memory costs associated with processing long sequences. State-of-the-art recommendation models like Transformers, based on attention mechanism, have quadratic computation complexity with respect to sequence length, thus limiting the length of past customer interactions to be considered for recommendations. Even with the availability of compute resources, it is crucial to design an algorithm that strikes delicate balance between long term and short term information in identifying relevant products for personalised recommendation. Towards this, we propose a novel extension of Memory Networks, a neural network architecture that harnesses external memory to encapsulate information present in lengthy sequential data. The use of memory networks in recommendation use-cases remains limited in practice owing to their high memory cost, large compute requirements and relatively large inference latency, which makes them prohibitively expensive for online stores with millions of users and products. To address these limitations, we propose a novel transformer-based sequential recommendation model GLAD, with external graph-based memory that dynamically scales user memory by adjusting the memory size according to the user's history, while facilitating the flow of information between users with similar interactions. We establish the efficacy of the proposed model by benchmarking on multiple public datasets as well as an industry dataset against state-of-the-art sequential recommendation baselines. | 推荐系统在电子商务商店中扮演着至关重要的角色,它们不仅帮助顾客探索产品,还促进了相关商品的发现。典型的推荐系统基于用户最近的n次交互来构建,其中n的取值需要在性能的提升与处理长序列所带来的计算/内存成本之间进行权衡。基于注意力机制的最先进推荐模型,如Transformers,其计算复杂度与序列长度呈二次方关系,因此限制了用于推荐的过去客户交互的长度。即使计算资源充足,设计一个算法在长期和短期信息之间找到微妙的平衡,以识别个性化推荐中的相关产品,仍然是至关重要的。为此,我们提出了一种记忆网络(Memory Networks)的新扩展,这是一种利用外部记忆来封装长序列数据中信息的神经网络架构。由于记忆网络的高内存成本、大计算需求以及相对较大的推理延迟,它们在推荐用例中的应用在实践中受到限制,这使得它们对于拥有数百万用户和产品的在线商店来说成本过高。为了解决这些限制,我们提出了一种基于Transformer的新型序列推荐模型GLAD,它带有基于图的外部记忆,能够根据用户的历史动态调整记忆大小,同时在具有相似交互的用户之间促进信息流动。我们通过在多个公共数据集以及一个行业数据集上与最先进的序列推荐基准进行比较,验证了所提出模型的有效性。 | code | 0 |
BertPE: A BERT-Based Pre-retrieval Estimator for Query Performance Prediction | Maryam Khodabakhsh, Fattane Zarrinkalam, Negar Arabzadeh | Univ Waterloo, Waterloo, ON, Canada; Univ Guelph, Guelph, ON, Canada; Shahrood Univ Technol, Shahrood, Iran | Query Performance Prediction (QPP) aims to estimate the effectiveness of a query in addressing the underlying information need without any relevance judgments. More recent works in this area have employed the pre-trained neural embedding representations of the query to go beyond the corpus statistics of query terms and capture the semantics of the query. In this paper, we propose a supervised QPP method by adopting contextualized neural embeddings to directly learn the performance through fine-tuning. To address the challenges arising from disparities in the evaluation of retrieval models through sparse and comprehensive labels, we introduce an innovative strategy for creating synthetic relevance judgments to enable effective performance prediction for queries, irrespective of whether they are evaluated with sparse or more comprehensive labels. Through our experiments on four different query sets accompanied by MS MARCO V1 collection, we show that our approach shows significantly improved performance compared to the state-of-the-art Pre-retrieval QPP methods. | 查询性能预测(Query Performance Prediction, QPP)旨在无需相关判断的情况下,估计查询在满足信息需求方面的有效性。近年来,该领域的研究工作利用预训练的神经嵌入表示来超越查询词的语料库统计信息,捕捉查询的语义。本文提出了一种监督式QPP方法,通过采用上下文神经嵌入,通过微调直接学习查询性能。为了解决通过稀疏和全面标签评估检索模型时产生的差异带来的挑战,我们引入了一种创新的策略,用于创建合成的相关性判断,从而实现对查询的有效性能预测,无论这些查询是通过稀疏标签还是更全面的标签进行评估。通过在MS MARCO V1数据集上的四个不同查询集的实验,我们展示了该方法相较于当前最先进的预检索QPP方法,性能显著提升。 | code | 0 |
Estimating the Usefulness of Clarifying Questions and Answers for Conversational Search | Ivan Sekulic, Weronika Lajewska, Krisztian Balog, Fabio Crestani | | While the body of research directed towards constructing and generating clarifying questions in mixed-initiative conversational search systems is vast, research aimed at processing and comprehending users' answers to such questions is scarce. To this end, we present a simple yet effective method for processing answers to clarifying questions, moving away from previous work that simply appends answers to the original query and thus potentially degrades retrieval performance. Specifically, we propose a classifier for assessing usefulness of the prompted clarifying question and an answer given by the user. Useful questions or answers are further appended to the conversation history and passed to a transformer-based query rewriting module. Results demonstrate significant improvements over strong non-mixed-initiative baselines. Furthermore, the proposed approach mitigates the performance drops when non-useful questions and answers are utilized. | 尽管在混合主动会话搜索系统中,针对澄清问题的构建与生成已有大量研究,但针对处理和理解用户对这些问题的回答的研究却很少。为此,我们提出了一种简单而有效的处理澄清问题答案的方法,摒弃了以往只是将答案附加到原始查询、从而可能降低检索性能的做法。具体来说,我们提出了一个分类器,用于评估所提出的澄清问题及用户给出的答案的有用性;有用的问题或答案会进一步附加到会话历史中,并传递给基于 Transformer 的查询重写模块。结果显示,相比强大的非混合主动基线有显著改善。此外,当出现无用的问题和答案时,所提方法能够缓解性能下降。 | code | 0 |
Measuring Bias in Search Results Through Retrieval List Comparison | Linda Ratz, Markus Schedl, Simone Kopeinik, Navid Rekabsaz | Johannes Kepler Univ Linz, Linz, Austria; Know Ctr GmbH, Graz, Austria | Many IR systems project harmful societal biases, including gender bias, in their retrieved contents. Uncovering and addressing such biases requires grounded bias measurement principles. However, defining reliable bias metrics for search results is challenging, particularly due to the difficulties in capturing gender-related tendencies in the retrieved documents. In this work, we propose a new framework for search result bias measurement. Within this framework, we first revisit the current metrics for representative search result bias (RepSRB) that are based on the occurrence of gender-specific language in the search results. Addressing their limitations, we additionally propose a metric for comparative search result bias (ComSRB) measurement and integrate it into our framework. ComSRB defines bias as the skew in the set of retrieved documents in response to a non-gendered query toward those for male/female-specific variations of the same query. We evaluate ComSRB against RepSRB on a recent collection of bias-sensitive topics and documents from the MS MARCO collection, using pre-trained bi-encoder and cross-encoder IR models. Our analyses show that, while existing metrics are highly sensitive to the wordings and linguistic formulations, the proposed ComSRB metric mitigates this issue by focusing on the deviations of a retrieval list from its explicitly biased variants, avoiding the need for sub-optimal content analysis processes. | 许多信息检索(IR)系统在其检索内容中反映了有害的社会偏见,包括性别偏见。揭示和解决这些偏见需要基于可靠的偏见测量原则。然而,定义可靠的搜索结果偏见度量标准具有挑战性,特别是由于难以捕捉检索文档中与性别相关的倾向。在这项工作中,我们提出了一个新的搜索结果偏见测量框架。在该框架内,我们首先重新审视了当前基于搜索结果中性别特异性语言出现的代表性搜索结果偏见(RepSRB)度量标准。针对其局限性,我们还提出了用于比较搜索结果偏见(ComSRB)测量的度量标准,并将其集成到我们的框架中。ComSRB将偏见定义为在响应非性别化查询时,检索文档集合向同一查询的男性/女性特异性变体的倾斜。我们在MS MARCO集合中的一个最新的偏见敏感主题和文档集合上,使用预训练的双编码器和交叉编码器IR模型,对ComSRB与RepSRB进行了评估。我们的分析表明,尽管现有度量标准对措辞和语言表达高度敏感,但所提出的ComSRB度量标准通过关注检索列表与其明确偏见的变体之间的偏差,缓解了这一问题,避免了次优的内容分析过程的需求。 | code | 0 |
Cascading Ranking Pipelines for Sensitivity-Aware Search | Jack McKechnie | Univ Glasgow, Glasgow, Lanark, Scotland | Search engines are designed to make information accessible. However, some information should not be accessible, such as documents concerning citizenship applications or personal information. This sensitive information is often found interspersed with other potentially useful non-sensitive information. As such, collections containing sensitive information cannot be made searchable due to the risk of revealing sensitive information. The development of search engines capable of safely searching collections containing sensitive information to provide relevant and non-sensitive information would allow previously hidden collections to be made available. This work aims to develop sensitivity-aware search engines via two-stage cascading retrieval pipelines. | 搜索引擎的设计初衷是使信息易于获取。然而,某些信息不应被轻易访问,例如涉及公民身份申请的文件或个人隐私信息。这些敏感信息通常与其他潜在有用的非敏感信息混杂在一起。因此,包含敏感信息的集合无法被开放搜索,因为存在泄露敏感信息的风险。开发能够安全搜索包含敏感信息的集合并提供相关且非敏感信息的搜索引擎,可以使之前被隐藏的集合得以公开访问。本工作旨在通过两阶段级联检索管道开发具有敏感信息感知能力的搜索引擎。 | code | 0 |
Advancing Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications with ImageCLEF 2024 | Bogdan Ionescu, Henning Müller, Ana-Maria Claudia Dragulinescu, Ahmad Idrissi-Yaghir, Ahmedkhan Radzhabov, Alba Garcia Seco de Herrera, Alexandra Andrei, Alexandru Stan, Andrea M. Storås, Asma Ben Abacha, Benjamin Lecouteux, Benno Stein, Cécile Macaire, Christoph M. Friedrich, Cynthia S. Schmidt, Didier Schwab, Emmanuelle Esperança-Rodier, George Ioannidis, Griffin Adams, Henning Schäfer, Hugo Manguinhas, Ioan Coman, Johanna Schöler, Johannes Kiesel, Johannes Rückert, Louise Bloch, Martin Potthast, Maximilian Heinrich, Meliha Yetisgen, Michael A. Riegler, Neal Snider, Pål Halvorsen, Raphael Brüngel, Steven Alexander Hicks, Vajira Thambawita, Vassili Kovalev, Yuri Prokopchuk, Wenwai Yim | Univ Appl Sci Western Switzerland HES SO, Sierre, Switzerland; Natl Univ Sci & Technol Politehn Bucharest, Bucharest, Romania; CEA, LIST, Paris, France | The ImageCLEF evaluation campaign was integrated with CLEF (Conference and Labs of the Evaluation Forum) for more than 20 years and represents a Multimedia Retrieval challenge aimed at evaluating the technologies for annotation, indexing, and retrieval of multimodal data. Thus, it provides information access to large data collections in usage scenarios and domains such as medicine, argumentation and content recommendation. ImageCLEF 2024 has four main tasks: (i) a Medical task targeting automatic image captioning for radiology images, synthetic medical images created with Generative Adversarial Networks (GANs), Visual Question Answering and medical image generation based on text input, and multimodal dermatology response generation; (ii) a joint ImageCLEF-Touché task Image Retrieval/Generation for Arguments to convey the premise of an argument, (iii) a Recommending task addressing cultural heritage content-recommendation, and (iv) a joint ImageCLEF-ToPicto task aiming to provide a translation in pictograms from natural language. In 2023, participation increased by 67% with respect to 2022 which reveals its impact on the community. | ImageCLEF评估活动与CLEF(评估论坛会议与实验室)整合已有20多年,代表了一项多媒体检索挑战,旨在评估多模态数据的注释、索引和检索技术。因此,它在使用场景和领域(如医学、论证和内容推荐)中为大数据集合提供了信息访问。ImageCLEF 2024包含四项主要任务:(i) 医学任务,针对放射影像的自动图像描述、使用生成对抗网络(GANs)创建的合成医学图像、视觉问答以及基于文本输入的医学图像生成,以及多模态皮肤病学响应生成;(ii) 联合ImageCLEF-Touché任务,即图像检索/生成用于传递论证的前提;(iii) 推荐任务,涉及文化遗产内容的推荐;(iv) 联合ImageCLEF-ToPicto任务,旨在从自然语言中提供象形图翻译。2023年,参与人数较2022年增加了67%,显示了其对社区的广泛影响。 | code | 0 |
Ranking Heterogeneous Search Result Pages Using the Interactive Probability Ranking Principle | Kanaad Pathak, Leif Azzopardi, Martin Halvey | The Probability Ranking Principle (PRP) ranks search results based on their expected utility derived solely from document contents, often overlooking the nuances of presentation and user interaction. However, with the evolution of Search Engine Result Pages (SERPs), now comprising a variety of result cards, the manner in which these results are presented is pivotal in influencing user engagement and satisfaction. This shift prompts the question: How do the PRP and its user-centric counterpart, the Interactive Probability Ranking Principle (iPRP), compare in the context of these heterogeneous SERPs? Our study draws a comparison between the PRP and the iPRP, revealing significant differences in their output. The iPRP, accounting for item-specific costs and interaction probabilities to determine the “Expected Perceived Utility” (EPU), yields different result orderings compared to the PRP. We evaluate the effect of the EPU on the ordering of results by observing changes in the ranking within a heterogeneous SERP compared to the traditional “ten blue links”. We find that changing the presentation affects the ranking of items according to the iPRP by up to 48% (with respect to DCG, TBG and RBO) in ad-hoc search tasks on the TREC WaPo Collection. This work suggests that the iPRP should be employed when ranking heterogeneous SERPs to provide a user-centric ranking that adapts the ordering based on the presentation and user engagement.
| 概率排序原则(PRP)仅依据文档内容得出的预期效用对搜索结果进行排序,往往忽略了呈现方式和用户交互的细微差别。然而,随着搜索引擎结果页面(SERP)的发展,如今的 SERP 包含各式各样的结果卡片,这些结果的呈现方式对用户参与度和满意度有着关键影响。这种转变引出了一个问题:在这些异构 SERP 的背景下,PRP 与其以用户为中心的对应物,即交互式概率排序原则(iPRP),相比表现如何?我们的研究对 PRP 和 iPRP 进行了比较,发现两者的输出存在显著差异。iPRP 考虑项目特定的成本和交互概率来确定“预期感知效用”(EPU),因而产生与 PRP 不同的结果排序。我们通过观察异构 SERP 中相对于传统“十个蓝色链接”的排名变化,评估了 EPU 对结果排序的影响。我们发现,在 TREC WaPo 集合的临时搜索任务中,改变呈现方式会使项目按 iPRP 得到的排名变化高达 48%(以 DCG、TBG 和 RBO 衡量)。这项工作表明,在对异构 SERP 进行排序时应采用 iPRP,以提供一种根据呈现方式和用户参与度调整排序的以用户为中心的排名。 | code | 0 | |
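The EPU-based re-ranking described above can be sketched as follows. This is a minimal illustration of the idea only: the result cards, interaction probabilities, benefits, and costs below are hypothetical placeholders, not values or structures from the paper.

```python
def epu(card):
    # Expected Perceived Utility: sum of P(interaction) * benefit, minus cost.
    return sum(p * b for p, b in card["interactions"]) - card["cost"]

def rank_by_epu(cards):
    return sorted(cards, key=epu, reverse=True)

# Two result cards: a rich card (higher benefit, higher interaction cost)
# versus a plain "blue link" (lower benefit, lower cost).
cards = [
    {"id": "rich", "interactions": [(0.6, 2.0), (0.2, 1.0)], "cost": 0.5},
    {"id": "plain", "interactions": [(0.4, 1.5)], "cost": 0.1},
]
ranking = rank_by_epu(cards)
```

Under these toy numbers the rich card wins (EPU 0.9 vs. 0.5); changing the presentation (and hence the interaction probabilities and costs) can flip the ordering, which is the effect the paper measures.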
Query Exposure Prediction for Groups of Documents in Rankings | Thomas Jänich, Graham McDonald, Iadh Ounis | The main objective of an Information Retrieval system is to provide a user with the most relevant documents to the user's query. To do this, modern IR systems typically deploy a re-ranking pipeline in which a set of documents is retrieved by a lightweight first-stage retrieval process and then re-ranked by a more effective but expensive model. However, the success of a re-ranking pipeline is heavily dependent on the performance of the first stage retrieval, since new documents are not usually identified during the re-ranking stage. Moreover, this can impact the amount of exposure that a particular group of documents, such as documents from a particular demographic group, can receive in the final ranking. For example, the fair allocation of exposure becomes more challenging or impossible if the first stage retrieval returns too few documents from certain groups, since the number of group documents in the ranking affects the exposure more than the documents' positions. With this in mind, it is beneficial to predict the amount of exposure that a group of documents is likely to receive in the results of the first stage retrieval process, in order to ensure that there are a sufficient number of documents included from each of the groups. In this paper, we introduce the novel task of query exposure prediction (QEP). Specifically, we propose the first approach for predicting the distribution of exposure that groups of documents will receive for a given query. Our new approach, called GEP, uses lexical information from individual groups of documents to estimate the exposure the groups will receive in a ranking. Our experiments on the TREC 2021 and 2022 Fair Ranking Track test collections show that our proposed GEP approach results in exposure predictions that are up to 40% more accurate than those of adapted existing query performance prediction and resource allocation approaches.
| 信息检索系统的主要目的是向用户提供与其查询最相关的文件。为了做到这一点,现代 IR 系统通常部署一个重新排序的管道,其中一组文档通过轻量级的第一阶段检索过程检索,然后通过一个更有效但昂贵的模型重新排序。然而,重新排序管道的成功与否在很大程度上取决于第一阶段检索的性能,因为在重新排序阶段通常不会引入新文档。此外,这可能会影响特定文档组(如来自特定人口组的文档)在最终排名中可以获得的曝光量。例如,如果第一阶段检索从某些群组返回的文档太少,公平分配曝光就会变得更具挑战性甚至不可能,因为排名中群组文档的数量比文档的位置更能影响曝光。考虑到这一点,预测一组文档在第一阶段检索结果中可能获得的曝光量是有益的,以确保每个群组都有足够数量的文档被包含进来。本文引入了一项新任务:查询曝光预测(QEP)。具体来说,我们提出了第一种用于预测文档组在给定查询下将获得的曝光分布的方法。我们的新方法称为 GEP,它利用各文档组的词汇信息来估计这些组在排名中将获得的曝光量。我们在 TREC 2021 和 2022 公平排名赛道测试集合上的实验表明,我们提出的 GEP 方法得到的曝光预测,比经过调整的现有查询性能预测和资源分配方法的预测准确率高出多达 40%。 | code | 0 | |
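For context on the quantity being predicted: the exposure a group receives in a ranking is commonly the sum of a position-based discount over that group's documents. A minimal sketch, assuming a logarithmic discount (the groups and documents are toy data, not from the TREC collections):

```python
import math

def exposure(ranking, group_of):
    """Sum a logarithmic position discount per group (rank is 1-based)."""
    totals = {}
    for rank, doc in enumerate(ranking, start=1):
        g = group_of[doc]
        totals[g] = totals.get(g, 0.0) + 1.0 / math.log2(rank + 1)
    return totals

group_of = {"d1": "A", "d2": "B", "d3": "A", "d4": "B"}
totals = exposure(["d1", "d2", "d3", "d4"], group_of)
```

Even with equal counts per group, group A collects more exposure here because its documents sit higher, which is why the number and positions of group documents returned by the first stage matter.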
Investigating the Robustness of Sequential Recommender Systems Against Training Data Perturbations | Filippo Betello, Federico Siciliano, Pushkar Mishra, Fabrizio Silvestri | Sequential Recommender Systems (SRSs) have been widely used to model user behavior over time, but their robustness in the face of perturbations to training data is a critical issue. In this paper, we conduct an empirical study to investigate the effects of removing items at different positions within a temporally ordered sequence. We evaluate two different SRS models on multiple datasets, measuring their performance using Normalized Discounted Cumulative Gain (NDCG) and Rank Sensitivity List metrics. Our results demonstrate that removing items at the end of the sequence significantly impacts performance, with NDCG decreasing up to 60%, while removing items from the beginning or middle has no significant effect. These findings highlight the importance of considering the position of the perturbed items in the training data and shall inform the design of more robust SRSs. | 序贯推荐系统(SRS)已被广泛用于对用户随时间变化的行为进行建模,但其在训练数据受到扰动时的鲁棒性是一个关键问题。在本文中,我们开展了一项实证研究,考察在按时间排序的序列中不同位置删除项目所产生的影响。我们在多个数据集上评估了两种不同的 SRS 模型,使用归一化折损累积增益(NDCG)和排序敏感性列表(Rank Sensitivity List)指标衡量其性能。结果表明:删除序列末端的项目会显著影响性能,NDCG 下降幅度高达 60%,而删除序列开头或中间的项目则没有显著影响。这些发现强调了考虑受扰动项目在训练数据中位置的重要性,并将为设计更鲁棒的序贯推荐系统提供参考。 | code | 0 | |
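The NDCG metric used in this evaluation can be computed as below. A minimal sketch with illustrative relevance grades, showing how demoting a highly relevant item lowers the score:

```python
import math

def dcg(rels):
    # rels[i] is the graded relevance of the item at rank i+1
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

perfect = ndcg([3, 2, 1, 0])
swapped = ndcg([0, 2, 1, 3])  # most relevant item demoted to the bottom
```

A perfectly ordered list scores 1.0; the demoted variant scores strictly less, which is the kind of degradation the paper tracks as training sequences are perturbed.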
Conversational Search with Tail Entities | Hai Dang Tran, Andrew Yates, Gerhard Weikum | Max Planck Inst Informat, Saarbrucken, Germany | Conversational search faces incomplete and informal follow-up questions. Prior works address these by contextualizing user utterances with cues derived from the previous turns of the conversation. This approach works well when the conversation centers on prominent entities, for which knowledge bases (KBs) or language models (LMs) can provide rich background. This work addresses the unexplored direction where user questions are about tail entities, not featured in KBs and sparsely covered by LMs. We devise a new method, called CONSENT, for selectively contextualizing a user utterance with turns, KB-linkable entities, and mentions of tail and out-of-KB (OKB) entities. CONSENT derives relatedness weights from Sentence-BERT similarities and employs an integer linear program (ILP) for judiciously selecting the best context cues for a given set of candidate answers. This method couples the contextualization and answer-ranking stages, and jointly infers the best choices for both. | 对话式搜索面临着不完整和非正式的后续问题。先前的工作通过从对话的先前轮次中提取线索来上下文化用户话语,从而解决这些问题。当对话围绕突出的实体展开时,这种方法效果很好,因为知识库(KB)或语言模型(LM)可以提供丰富的背景信息。本研究解决了一个尚未探索的方向,即用户问题涉及尾部实体,这些实体不在知识库中,且语言模型的覆盖较少。我们设计了一种新方法,称为CONSENT,用于选择性地将用户话语与对话轮次、可链接到知识库的实体以及尾部实体和知识库外(OKB)实体的提及进行上下文化。CONSENT从Sentence-BERT的相似性中推导出相关性权重,并采用整数线性规划(ILP)来明智地为给定的一组候选答案选择最佳上下文线索。该方法将上下文化和答案排序阶段结合起来,并共同推断出两者的最佳选择。 | code | 0 |
Event-Specific Document Ranking Through Multi-stage Query Expansion Using an Event Knowledge Graph | Sara Abdollahi, Tin Kuculo, Simon Gottschalk | Leibniz Univ Hannover, Res Ctr L3S, Hannover, Germany | Event-specific document ranking is a crucial task in supporting users when searching for texts covering events such as Brexit or the Olympics. However, the complex nature of events involving multiple aspects like temporal information, location, participants and sub-events poses challenges in effectively modelling their representations for ranking. In this paper, we propose MusQuE (Multi-stage Query Expansion), a multi-stage ranking framework that jointly learns to rank query expansion terms and documents, and in this manner flexibly identifies the optimal combination and number of expansion terms extracted from an event knowledge graph. Experimental results show that MusQuE outperforms state-of-the-art baselines on MS-MARCO EVENT , a new dataset for event-specific document ranking, by 9.1 % and more. | 事件特定文档排序是支持用户搜索涉及诸如英国脱欧或奥运会等事件文本的关键任务。然而,事件的复杂性,包括时间信息、地点、参与者和子事件等多个方面,给有效建模这些表示以进行排序带来了挑战。在本文中,我们提出了MusQuE(多阶段查询扩展),这是一个多阶段排序框架,它联合学习排序查询扩展词和文档,从而灵活地识别从事件知识图谱中提取的扩展词的最佳组合和数量。实验结果表明,MusQuE在MS-MARCO EVENT(一个新的事件特定文档排序数据集)上比现有最先进的基线方法提升了9.1%甚至更多。 | code | 0 |
Simulating Follow-Up Questions in Conversational Search | Johannes Kiesel, Marcel Gohsen, Nailia Mirzakhmedova, Matthias Hagen, Benno Stein | Friedrich Schiller Univ Jena, Ernst Abbe Pl 2, D-07743 Jena, Germany; Bauhaus Univ Weimar, Bauhausstr 9a, D-99423 Weimar, Germany | Evaluating conversational search systems based on simulated user interactions is a potential approach to overcome one of the main problems of static conversational search test collections: the collections contain only very few of all the plausible conversations on a topic. Still, one of the challenges of user simulation is generating realistic follow-up questions on given outputs of a conversational system. We propose to address this challenge by using state-of-the-art language models and find that: (1) on two conversational search datasets, the tested models generate questions that are semantically similar to those in the datasets, especially when tuned for follow-up questions; (2) the generated questions are mostly valid, related, informative, and specific according to human assessment; and (3) for influencing the characteristics of the simulated questions, small changes to the prompt are insufficient. | 基于模拟用户交互来评估对话搜索系统是一种潜在的方法,旨在克服静态对话搜索测试集的一个主要问题:这些测试集仅包含关于某个主题的少数可能的对话。然而,用户模拟的挑战之一是在给定对话系统输出的情况下生成现实的后续问题。我们提出通过使用最先进的语言模型来解决这一挑战,并发现:(1)在两个对话搜索数据集上,测试的模型生成的问题与数据集中的问题在语义上相似,尤其是在针对后续问题进行微调时;(2)根据人工评估,生成的问题大多是有效的、相关的、信息丰富且具体的;(3)对于影响模拟问题特征的需求,仅对提示进行小幅修改是不够的。 | code | 0 |
MOReGIn: Multi-Objective Recommendation at the Global and Individual Levels | Elizabeth Gómez, David Contreras, Ludovico Boratto, Maria Salamó | Multi-Objective Recommender Systems (MORSs) emerged as a paradigm to guarantee multiple (often conflicting) goals. Besides accuracy, a MORS can operate at the global level, where additional beyond-accuracy goals are met for the system as a whole, or at the individual level, meaning that the recommendations are tailored to the needs of each user. The state-of-the-art MORSs either operate at the global or individual level, without assuming the co-existence of the two perspectives. In this study, we show that when global and individual objectives co-exist, MORSs are not able to meet both types of goals. To overcome this issue, we present an approach that regulates the recommendation lists so as to guarantee both global and individual perspectives, while preserving its effectiveness. Specifically, as individual perspective, we tackle genre calibration and, as global perspective, provider fairness. We validate our approach on two real-world datasets, publicly released with this paper. | 多目标推荐系统(MORS)作为一种保证多个(往往相互冲突)目标的范式而出现。除了准确性之外,MORS 可以在全局层面运作,即为整个系统实现额外的超越准确性的目标;也可以在个体层面运作,即针对每个用户的需求定制推荐。现有最先进的 MORS 要么在全局层面运作,要么在个体层面运作,而没有考虑两种视角并存的情况。在本研究中,我们表明,当全局目标和个体目标共存时,MORS 无法同时满足这两类目标。为了解决这一问题,我们提出了一种调节推荐列表的方法,在保持有效性的同时兼顾全局和个体两种视角。具体而言,在个体视角上我们处理类型(genre)校准,在全局视角上我们处理提供者公平性。我们在随本文公开发布的两个真实世界数据集上验证了我们的方法。 | code | 0 | |
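One way to read the individual-level objective above: genre calibration compares the genre distribution of a user's history with that of the recommendation list. A minimal sketch using a smoothed KL divergence as the miscalibration measure; this is an illustrative formulation, not necessarily the paper's exact one, and the genres and lists are toy data:

```python
import math

def genre_dist(items, genres, alpha=0.01):
    # Smoothed genre distribution (alpha avoids zero probabilities).
    counts = {g: alpha for g in genres}
    for it in items:
        counts[it] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def kl(p, q):
    return sum(p[g] * math.log(p[g] / q[g]) for g in p)

GENRES = ["drama", "comedy", "horror"]
history = ["drama", "drama", "comedy"]
calibrated = ["drama", "drama", "comedy"]
skewed = ["horror", "horror", "horror"]

miscalibration_good = kl(genre_dist(history, GENRES), genre_dist(calibrated, GENRES))
miscalibration_bad = kl(genre_dist(history, GENRES), genre_dist(skewed, GENRES))
```

A perfectly calibrated list has zero divergence from the history; the skewed list scores much higher, and a regulation approach like the one proposed would trade such miscalibration off against provider fairness.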
VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning | Nanyi Fei, Hao Jiang, Haoyu Lu, Jinqiang Long, Yanqi Dai, Tuo Fan, Zhao Cao, Zhiwu Lu | Huawei Poisson Lab, Hangzhou, Zhejiang, Peoples R China; Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China; Renmin Univ China, Sch Informat, Beijing, Peoples R China | Cross-modal search is one fundamental task in multi-modal learning, but there is hardly any work that aims to solve multiple cross-modal search tasks at once. In this work, we propose a novel Versatile Elastic Multi-mOdal (VEMO) model for search-oriented multi-task learning. VEMO is versatile because we integrate cross-modal semantic search, named entity recognition, and scene text spotting into a unified framework, where the latter two can be further adapted to entity- and character-based image search tasks. VEMO is also elastic because we can freely assemble sub-modules of our flexible network architecture for corresponding tasks. Moreover, to give more choices on the effect-efficiency trade-off when performing cross-modal semantic search, we place multiple encoder exits. Experimental results show the effectiveness of our VEMO with only 37.6% of the network parameters needed for uni-task training. Further evaluations on entity- and character-based image search tasks also validate the superiority of search-oriented multi-task learning. | 跨模态搜索是多模态学习中的一项基本任务,但几乎没有工作旨在一次性解决多个跨模态搜索任务。在本研究中,我们提出了一种新颖的通用弹性多模态(VEMO)模型,用于面向搜索的多任务学习。VEMO具有通用性,因为我们将跨模态语义搜索、命名实体识别和场景文本识别集成到一个统一的框架中,后两者可以进一步适应基于实体和字符的图像搜索任务。VEMO还具有弹性,因为我们可以自由组装我们灵活网络架构的子模块以适应相应任务。此外,为了在执行跨模态语义搜索时提供更多关于效果与效率权衡的选择,我们设置了多个编码器出口。实验结果表明,我们的VEMO仅需单任务训练所需网络参数的37.6%即可取得良好效果。在基于实体和字符的图像搜索任务上的进一步评估也验证了面向搜索的多任务学习的优越性。 | code | 0 |
Lightweight Modality Adaptation to Sequential Recommendation via Correlation Supervision | Hengchang Hu, Qijiong Liu, Chuang Li, Min-Yen Kan | In Sequential Recommenders (SR), encoding and utilizing modalities in an end-to-end manner is costly in terms of modality encoder sizes. Two-stage approaches can mitigate such concerns, but they suffer from poor performance due to modality forgetting, where the sequential objective overshadows modality representation. We propose a lightweight knowledge distillation solution that preserves both merits: retaining modality information and maintaining high efficiency. Specifically, we introduce a novel method that enhances the learning of embeddings in SR through the supervision of modality correlations. The supervision signals are distilled from the original modality representations, including both (1) holistic correlations, which quantify their overall associations, and (2) dissected correlation types, which refine their relationship facets (honing in on specific aspects like color or shape consistency). To further address the issue of modality forgetting, we propose an asynchronous learning step, allowing the original information to be retained longer for training the representation learning module. Our approach is compatible with various backbone architectures and outperforms the top baselines by 6.8%. We also observe that preserving the original feature associations from modality encoders significantly boosts task-specific recommendation adaptation. Additionally, we find that larger modality encoders (e.g., Large Language Models) contain richer feature sets which necessitate more fine-grained modeling to reach their full performance potential.
| 在序列推荐系统(SR)中,以端到端方式编码并利用模态信息,会因模态编码器规模庞大而代价高昂。两阶段方法可以缓解这一问题,但由于模态遗忘(即序列目标掩盖了模态表示),其性能较差。我们提出了一种轻量级的知识蒸馏方案,兼顾两方面的优点:既保留模态信息,又保持高效率。具体而言,我们提出了一种通过模态相关性监督来增强 SR 中嵌入学习的新方法。监督信号从原始模态表示中蒸馏而来,包括:(1)整体相关性,量化其总体关联;(2)细分的相关性类型,细化其关系的不同方面(聚焦于颜色或形状一致性等具体特征)。为进一步解决模态遗忘问题,我们提出了一个异步学习步骤,使原始信息得以更长时间地保留,用于训练表示学习模块。我们的方法与多种骨干架构兼容,并比最优基线高出 6.8%;我们还观察到,保留模态编码器的原始特征关联能显著提升特定任务的推荐适应能力。此外,我们发现更大的模态编码器(如大型语言模型)包含更丰富的特征集,需要更细粒度的建模才能充分发挥其性能潜力。 | code | 0 | |
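The "holistic correlation" supervision above can be sketched as matching pairwise similarity structure between the student's learned item embeddings and the frozen modality (teacher) embeddings. The toy embeddings and the MSE objective below are illustrative assumptions, not the paper's exact loss:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sim_matrix(embs):
    # Pairwise cosine similarities: the "correlation" structure to distill.
    return [[cosine(u, v) for v in embs] for u in embs]

def correlation_loss(student, teacher):
    s, t = sim_matrix(student), sim_matrix(teacher)
    n = len(student)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # frozen modality embeddings
student = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.7]]   # learned ID embeddings
loss = correlation_loss(student, teacher)
```

Minimizing this loss pushes the student's item-item relationships toward those encoded by the modality encoder, without keeping the large encoder in the training loop.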
DREQ: Document Re-ranking Using Entity-Based Query Understanding | Shubham Chatterjee, Iain Mackie, Jeff Dalton | While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present DREQ, an entity-oriented dense document re-ranking model. Uniquely, we emphasize the query-relevant entities within a document's representation while simultaneously attenuating the less relevant ones, thus obtaining a query-specific entity-centric document representation. We then combine this entity-centric document representation with the text-centric representation of the document to obtain a "hybrid" representation of the document. We learn a relevance score for the document using this hybrid representation. Using four large-scale benchmarks, we show that DREQ outperforms state-of-the-art neural and non-neural re-ranking methods, highlighting the effectiveness of our entity-oriented representation approach. | 尽管面向实体的神经 IR 模型已经取得了显著的进步,但它们往往忽略了一个关键的细微差别: 文档中各个实体对其总体相关性的不同程度的影响。针对这一差距,我们提出了面向实体的密集文档重排序模型 DREQ。独特的是,我们强调文档表示中的查询相关实体,同时减弱相关性较差的实体,从而获得一个特定于查询的以实体为中心的文档表示。然后,我们将这种以实体为中心的文档表示与以文本为中心的文档表示结合起来,以获得文档的“混合”表示。我们使用这种混合表示学习文档的相关性得分。使用四个大规模的基准测试,我们表明 DREQ 优于最先进的神经元和非神经元重新排序方法,突出了我们的面向实体的表示方法的有效性。 | code | 0 | |
Beyond Topicality: Including Multidimensional Relevance in Cross-encoder Re-ranking - The Health Misinformation Case Study | Rishabh Upadhyay, Arian Askari, Gabriella Pasi, Marco Viviani | Univ Milano Bicocca, Dept Informat Syst & Commun, Viale Sarca 336, I-20126 Milan, Italy; Leiden Univ, Leiden Inst Adv Comp Sci, Niels Bohrweg 1, NL-2333 CA Leiden, Netherlands | In this paper, we propose a novel approach to consider multiple dimensions of relevance in cross-encoder re-ranking. On the one hand, cross-encoders constitute an effective solution for re-ranking when considering a single relevance dimension such as topicality, but are not designed to straightforwardly account for additional relevance dimensions. On the other hand, the majority of re-ranking models accounting for multidimensional relevance are often based on the aggregation of multiple relevance scores at the re-ranking stage, leading to potential compensatory effects. To address these issues, in the proposed solution we enhance the candidate documents retrieved by a first-stage lexical retrieval model with suitable relevance statements related to distinct relevance dimensions, and then perform a re-ranking on them with cross-encoders. In this work we focus, in particular, on an extra dimension of relevance beyond topicality, namely, credibility, to address health misinformation in the Consumer Health Search task. Experimental evaluations are performed by considering publicly available datasets; our results show that the proposed approach statistically outperforms state-of-the-art aggregation-based and cross-encoder re-rankers. | 在本文中,我们提出了一种新颖的方法,用于在交叉编码器重排序中考虑多个相关性维度。一方面,交叉编码器在考虑单一相关性维度(如主题相关性)时,构成了重排序的有效解决方案,但其设计并不直接考虑额外的相关性维度。另一方面,大多数考虑多维相关性的重排序模型通常基于在重排序阶段对多个相关性分数的聚合,这可能导致潜在的补偿效应。为了解决这些问题,在所提出的解决方案中,我们通过增强第一阶段词汇检索模型检索到的候选文档,使用与不同相关性维度相关的适当相关性声明,然后使用交叉编码器对它们进行重排序。在这项工作中,我们特别关注除了主题相关性之外的另一个相关性维度,即可信度,以解决消费者健康搜索任务中的健康错误信息问题。实验评估通过考虑公开可用的数据集进行;我们的结果表明,所提出的方法在统计上优于最先进的基于聚合的重排序方法和交叉编码器重排序方法。 | code | 0 |
Query Obfuscation for Information Retrieval Through Differential Privacy | Guglielmo Faggioli, Nicola Ferro | Univ Padua, Padua, Italy | Protecting the privacy of a user querying an Information Retrieval (IR) system is of utmost importance. The problem is exacerbated when the IR system is not cooperative in satisfying the user's privacy requirements. To address this, obfuscation techniques split the user's sensitive query into multiple non-sensitive ones that can be safely transmitted to the IR system. To generate such queries, current approaches rely on lexical databases, such as WordNet, or heuristics of word co-occurrences. At the same time, advances in Natural Language Processing (NLP) have shown the power of Differential Privacy (DP) in releasing privacy-preserving text for completely different purposes, such as spam detection and sentiment analysis. We investigate for the first time whether DP mechanisms, originally designed for specific NLP tasks, can effectively be used in IR to obfuscate queries. We also assess their performance compared to state-of-the-art techniques in IR. Our empirical evaluation shows that the Vickrey DP mechanism based on the Mahalanobis norm with a privacy budget epsilon in [10, 12.5] achieves state-of-the-art privacy protection and improved effectiveness. Furthermore, differently from previous approaches that are substantially on/off, by changing the privacy budget epsilon, DP allows users to adjust their desired level of privacy protection, offering a trade-off between effectiveness and privacy.
| 保护用户在查询信息检索(IR)系统时的隐私至关重要。当IR系统不合作满足用户的隐私需求时,这一问题变得更加严重。为了解决这一问题,混淆技术将用户的敏感查询分割成多个非敏感查询,这些查询可以安全地传输到IR系统。为了生成此类查询,当前方法依赖于词汇数据库(如WordNet)或词汇共现的启发式方法。与此同时,自然语言处理(NLP)的进展展示了差分隐私(DP)在发布隐私保护文本方面的强大能力,尽管这些文本最初是为完全不同的目的(如垃圾邮件检测和情感分析)设计的。我们首次研究了原本为特定NLP任务设计的DP机制是否能够有效地用于IR中的查询混淆。我们还评估了它们与IR领域最先进技术相比的性能。我们的实证评估表明,基于马氏范数且隐私预算epsilon属于[10, 12.5]范围的Vickrey DP机制能够实现最先进的隐私保护并提高有效性。此外,与之前基本上是开关式的方法不同,通过改变隐私预算epsilon,DP允许用户调整所需的隐私保护级别,从而在有效性和隐私之间提供权衡。 | code | 0 |
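A toy sketch of embedding-based query obfuscation in the spirit of the mechanisms above: perturb a query word's embedding with noise scaled by 1/epsilon and release the nearest vocabulary word. The 2-d embeddings and Gaussian noise are deliberate simplifications of the Mahalanobis/Vickrey mechanisms studied in the paper:

```python
import math
import random

# Hypothetical 2-d word embeddings; the real mechanisms operate on full
# word-embedding spaces with calibrated (e.g. Mahalanobis) noise.
VOCAB = {"flu": (0.0, 1.0), "fever": (0.2, 0.9), "car": (5.0, 0.1)}

def nearest(point):
    return min(VOCAB, key=lambda w: math.dist(point, VOCAB[w]))

def obfuscate(word, epsilon, rng):
    x, y = VOCAB[word]
    scale = 1.0 / epsilon  # lower epsilon -> more noise -> stronger privacy
    noisy = (x + rng.gauss(0, scale), y + rng.gauss(0, scale))
    return nearest(noisy)

rng = random.Random(0)
released = [obfuscate("flu", epsilon=10.0, rng=rng) for _ in range(20)]
```

With a moderate budget the released words stay semantically close to the original ("flu" or "fever" here); shrinking epsilon increases the chance of distant substitutions, which is the adjustable trade-off the abstract describes.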
On-Device Query Auto-completion for Email Search | Yifan Qiao, Otto Godwin, Hua Ouyang | Traditional query auto-completion (QAC) relies heavily on search logs collected over many users. However, in on-device email search, the scarcity of logs and the governing privacy constraints make QAC a challenging task. In this work, we propose an on-device QAC method that runs directly on users’ devices, where users’ sensitive data and interaction logs are not collected, shared, or aggregated through web services. This method retrieves candidates using pseudo relevance feedback, and ranks them based on relevance signals that explore the textual and structural information from users’ emails. We also propose a private corpora based evaluation method, and empirically demonstrate the effectiveness of our proposed method. | 传统的查询自动补全(QAC)在很大程度上依赖从大量用户处收集的搜索日志。然而,在设备端电子邮件搜索中,日志的稀缺性和相应的隐私约束使 QAC 成为一项具有挑战性的任务。在这项工作中,我们提出了一种直接在用户设备上运行的设备端 QAC 方法,不会通过 Web 服务收集、共享或聚合用户的敏感数据和交互日志。该方法使用伪相关反馈检索候选项,并根据挖掘用户电子邮件文本与结构信息的相关性信号对其进行排序。我们还提出了一种基于私有语料库的评估方法,并通过实验证明了所提方法的有效性。 | code | 0 | |
Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query? | Juan Manuel Rodriguez, Nima Tavassoli, Eliezer Levy, Gil Lederman, Dima Sivov, Matteo Lissandrini, Davide Mottin | Tel Aviv Res Ctr Huawei Technol, Pnueli Lab, Tel Aviv, Israel; Aalborg Univ, Aalborg, Denmark; Aarhus Univ, Aarhus, Denmark | Text-image retrieval (T2I) refers to the task of recovering all images relevant to a keyword query. Popular datasets for text-image retrieval, such as Flickr30k, VG, or MS-COCO, utilize annotated image captions, e.g., "a man playing with a kid", as a surrogate for queries. With such surrogate queries, current multi-modal machine learning models, such as CLIP or BLIP, perform remarkably well. The main reason is the descriptive nature of captions, which detail the content of an image. Yet, T2I queries go beyond the mere descriptions in image-caption pairs. Thus, these datasets are ill-suited to test methods on more abstract or conceptual queries, e.g., "family vacations". In such queries, the image content is implied rather than explicitly described. In this paper, we replicate the T2I results on descriptive queries and generalize them to conceptual queries. To this end, we perform new experiments on a novel T2I benchmark for the task of conceptual query answering, called ConQA. ConQA comprises 30 descriptive and 50 conceptual queries on 43k images with more than 100 manually annotated images per query. Our results on established measures show that both large pretrained models (e.g., CLIP, BLIP, and BLIP2) and small models (e.g., SGRAF and NAAF), perform up to 4x better on descriptive rather than conceptual queries. We also find that the models perform better on queries with more than 6 keywords as in MS-COCO captions. 
| 文本-图像检索(T2I)是指根据关键词查询恢复所有相关图像的任务。流行的文本-图像检索数据集,如Flickr30k、VG或MS-COCO,使用带注释的图像描述作为查询的替代,例如"一个男人和一个孩子玩耍"。利用这些替代查询,当前的多模态机器学习模型(如CLIP或BLIP)表现非常出色。主要原因在于图像描述的描述性特质,它们详细说明了图像的内容。然而,T2I查询不仅仅是图像-描述对中的简单描述。因此,这些数据集不适合测试更抽象或概念性查询的方法,例如"家庭度假"。在这种查询中,图像内容是隐含的,而不是明确描述的。在本文中,我们复制了描述性查询的T2I结果,并将其推广到概念性查询。为此,我们在一个新的T2I基准上进行了新的实验,该基准用于概念性查询回答任务,称为ConQA。ConQA包含43k张图像上的30个描述性查询和50个概念性查询,每个查询有超过100张手动注释的图像。我们在已建立的度量标准上的结果显示,无论是大型预训练模型(如CLIP、BLIP和BLIP2)还是小型模型(如SGRAF和NAAF),在描述性查询上的表现都比在概念性查询上最多高出4倍。我们还发现,模型在包含6个以上关键词的查询(如同MS-COCO图像描述)上表现更好。 | code | 0 |
Query Generation Using Large Language Models - A Reproducibility Study of Unsupervised Passage Reranking | David Rau, Jaap Kamps | Univ Amsterdam, Amsterdam, Netherlands | Existing passage retrieval techniques predominantly emphasize classification or dense matching strategies. This is in contrast with classic language modeling approaches focusing on query or question generation. Recently, Sachan et al. introduced an Unsupervised Passage Retrieval (UPR) approach that resembles this by exploiting the inherent generative capabilities of large language models. In this replicability study, we revisit the concept of zero-shot question generation for re-ranking and focus our investigation on the ranking experiments, validating the UPR findings, particularly on the widely recognized BEIR benchmark. Furthermore, we extend the original work by evaluating the proposed method additionally on the TREC Deep Learning track benchmarks of 2019 and 2020. To enhance our understanding of the technique’s performance, we introduce novel experiments exploring the influence of different prompts on retrieval outcomes. Our comprehensive analysis provides valuable insights into the robustness and applicability of zero-shot question generation as a re-ranking strategy in passage retrieval. | 现有的段落检索技术主要集中在分类或密集匹配策略上,这与经典的语言建模方法形成对比,后者侧重于查询或问题生成。最近,Sachan等人提出了一种无监督段落检索(UPR)方法,该方法通过利用大型语言模型固有的生成能力,与此类方法类似。在这项可重复性研究中,我们重新审视了零样本问题生成用于重新排序的概念,并将研究重点放在排序实验上,验证了UPR的发现,特别是在广泛认可的BEIR基准测试上。此外,我们通过进一步在2019年和2020年的TREC深度学习赛道基准测试上评估所提出的方法,扩展了原始工作。为了加深对该技术性能的理解,我们引入了一系列新实验,探索不同提示对检索结果的影响。我们的全面分析为零样本问题生成作为段落检索中的重新排序策略的鲁棒性和适用性提供了宝贵的见解。 | code | 0 |
Ranking Distance Metric for Privacy Budget in Distributed Learning of Finite Embedding Data | Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia | JPMorgan Chase, Global Technol Appl Res, New York, NY 10017 USA | Federated Learning (FL) is a collaborative, distributed learning paradigm that aims to preserve data privacy. Recent studies have shown FL models to be vulnerable to reconstruction attacks that compromise data privacy by inverting gradients computed on confidential data. To address the challenge of defending against these attacks, it is common to employ methods that guarantee data confidentiality using the principles of Differential Privacy (DP). However, in many cases, especially for machine learning models trained on unstructured data such as text, evaluating privacy requires to consider also the finite space of embedding for client's private data. In this study, we show how privacy in a distributed FL setup is sensitive to the underlying finite embeddings of the confidential data. We show that privacy can be quantified for a client batch that uses either noise, or a mixture of finite embeddings, by introducing a normalised rank distance (d(rank)). This measure has the advantage of taking into account the size of a finite vocabulary embedding, and align the privacy budget to a partitioned space. We further explore the impact of noise and client batch size on the privacy budget and compare it to the standard epsilon derived from Local-DP. | 联邦学习(Federated Learning, FL)是一种协作式分布式学习范式,旨在保护数据隐私。最近的研究表明,FL模型容易受到重建攻击的威胁,这些攻击通过反转在机密数据上计算的梯度来破坏数据隐私。为了应对这些攻击的防御挑战,通常采用基于差分隐私(Differential Privacy, DP)原则的方法来保证数据的机密性。然而,在许多情况下,特别是对于在文本等非结构化数据上训练的机器学习模型,评估隐私还需要考虑客户端私有数据的有限嵌入空间。在本研究中,我们展示了分布式FL设置中的隐私如何对机密数据的底层有限嵌入敏感。我们通过引入归一化秩距离(d(rank))展示了如何为使用噪声或有限嵌入混合的客户端批次量化隐私。这一度量的优势在于考虑了有限词汇嵌入的大小,并将隐私预算与分区空间对齐。我们进一步探讨了噪声和客户端批次大小对隐私预算的影响,并将其与从本地差分隐私(Local-DP)得出的标准ε进行比较。 | code | 0 |
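The normalised rank distance can be read as: after perturbation, how far down the neighbour list of the finite embedding space the true token lands, divided by the vocabulary size. A minimal sketch under that reading; the 2-d vocabulary is hypothetical, and this is an illustrative interpretation of d(rank), not the paper's exact definition:

```python
import math

def rank_distance(true_word, perturbed_point, vocab):
    # Sort the finite vocabulary by distance to the perturbed embedding,
    # then normalise the true word's position by the vocabulary size.
    by_dist = sorted(vocab, key=lambda w: math.dist(perturbed_point, vocab[w]))
    return by_dist.index(true_word) / len(vocab)

vocab = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (3.0, 0.0)}
no_noise = rank_distance("a", (0.0, 0.0), vocab)     # true word is still nearest
heavy_noise = rank_distance("a", (2.9, 0.0), vocab)  # true word pushed far down
```

A distance of 0 means the true token is trivially recoverable; values near 1 mean the perturbation has pushed it behind most of the vocabulary, so the measure naturally scales with the size of the finite embedding space.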
Effective Adhoc Retrieval Through Traversal of a Query-Document Graph | Erlend Frayling, Sean MacAvaney, Craig Macdonald, Iadh Ounis | Univ Glasgow, Glasgow, Lanark, Scotland | Adhoc retrieval is the task of effectively retrieving information for an end-user's information need, usually expressed as a textual query. One of the most well-established retrieval frameworks is the two-stage retrieval pipeline, whereby an inexpensive retrieval algorithm retrieves a subset of candidate documents from a corpus, and a more sophisticated (but costly) model re-ranks these candidates. A notable limitation of this two-stage framework is that the second stage re-ranking model can only re-order documents, and any relevant documents not retrieved from the corpus in the first stage are entirely lost to the second stage. A recently-proposed Adaptive Re-Ranking technique has shown that extending the candidate pool by traversing a document similarity graph can overcome this recall problem. However, this traversal technique is agnostic of the user's query, which has the potential to waste compute resources by scoring documents that are not related to the query. In this work, we propose an alternative formulation of the document similarity graph. Rather than using document similarities, we propose a weighted bipartite graph that consists of both document nodes and query nodes. This overcomes the limitations of prior Adaptive Re-Ranking approaches because the bipartite graph can be navigated in a manner that explicitly acknowledges the original user query issued to the search pipeline. We evaluate the effectiveness of our proposed framework by experimenting with the TREC Deep Learning track in a standard adhoc retrieval setting. We find that our approach outperforms state-of-the-art two-stage re-ranking pipelines, improving the nDCG@10 metric by 5.8% on the DL19 test collection. 
| 特定信息检索(Adhoc retrieval)是一项针对终端用户信息需求进行有效检索的任务,通常以文本查询的形式表达。其中最为成熟的检索框架之一是两阶段检索流程,即先通过一种成本较低的检索算法从语料库中检索出一部分候选文档,再由一个更为复杂(但成本较高)的模型对这些候选文档进行重新排序。这种两阶段框架的一个显著局限性在于,第二阶段的重新排序模型只能对文档进行重新排序,而在第一阶段未能从语料库中检索到的任何相关文档将完全丢失在第二阶段中。最近提出的一种自适应重新排序技术(Adaptive Re-Ranking)表明,通过遍历文档相似图来扩展候选池可以克服这一召回问题。然而,这种遍历技术与用户的查询无关,可能会导致计算资源的浪费,因为可能会对与查询无关的文档进行评分。 |
在本研究中,我们提出了一种替代的文档相似图构建方法。我们不再使用文档相似性,而是提出了一种由文档节点和查询节点组成的加权二分图。这种方法克服了先前自适应重新排序技术的局限性,因为二分图可以在明确考虑用户原始查询的情况下进行导航。我们在标准的特定信息检索设置中,通过TREC深度学习(Deep Learning)赛道的实验评估了我们提出框架的有效性。实验结果表明,我们的方法优于当前最先进的两阶段重新排序流程,在DL19测试集上的nDCG@10指标上提升了5.8%。|code|0|
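The bipartite traversal above can be sketched as follows: starting from the first-stage candidates, hop from a document to its linked query nodes and back to other documents linked to those queries, growing the re-ranking pool up to a budget. The graph edges and document identifiers here are hypothetical:

```python
from collections import deque

# Toy bipartite graph: documents link to query nodes and vice versa.
doc_to_queries = {"d1": ["q_a"], "d2": ["q_a", "q_b"], "d3": ["q_b"], "d4": []}
query_to_docs = {"q_a": ["d1", "d2"], "q_b": ["d2", "d3"]}

def expand_candidates(initial, budget):
    pool, frontier = list(initial), deque(initial)
    while frontier and len(pool) < budget:
        doc = frontier.popleft()
        for q in doc_to_queries.get(doc, []):           # doc -> query hop
            for neighbour in query_to_docs.get(q, []):  # query -> doc hop
                if neighbour not in pool:
                    pool.append(neighbour)
                    frontier.append(neighbour)
    return pool

pool = expand_candidates(["d1"], budget=4)
```

Starting from d1 alone, the traversal pulls in d2 and d3 via shared query nodes, while d4, unconnected to any query the candidates touch, is never scored; this is how anchoring the graph on query nodes keeps the expansion query-aware.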
|MMCRec: Towards Multi-modal Generative AI in Conversational Recommendation|Tendai Mukande, Esraa Ali, Annalina Caputo, Ruihai Dong, Noel E. O'Connor|Dublin City Univ, Dublin 9, Ireland; Univ Coll Dublin, Dublin 4, Ireland|Personalized recommendation systems have become integral in this digital age by facilitating content discovery to users and products tailored to their preferences. Since the Generative Artificial Intelligence (GAI) boom, research into GAI-enhanced Conversational Recommender Systems (CRSs) has sparked great interest. Most existing methods, however, mainly rely on one mode of input such as text, thereby limiting their ability to capture content diversity. This is also inconsistent with real-world scenarios, which involve multi-modal input data and output data. To address these limitations, we propose the Multi-Modal Conversational Recommender System (MMCRec) model which harnesses multiple modalities, including text, images, voice and video to enhance the recommendation performance and experience. Our model is capable of not only accepting multi-mode input, but also generating multi-modal output in conversational recommendation. Experimental evaluations demonstrate the effectiveness of our model in real-world conversational recommendation scenarios.|在这个数字化时代,个性化推荐系统通过为用户提供符合其偏好的内容和产品,已成为不可或缺的一部分。自生成式人工智能(GAI)兴起以来,关于GAI增强的对话推荐系统(CRSs)的研究引起了极大兴趣。然而,大多数现有方法主要依赖于单一输入模式,如文本,这限制了它们捕捉内容多样性的能力。这与现实场景也不一致,现实场景中涉及多模态的输入和输出数据。为了解决这些限制,我们提出了多模态对话推荐系统(MMCRec)模型,该模型利用多种模态,包括文本、图像、语音和视频,以增强推荐性能和体验。我们的模型不仅能够接受多模态输入,还能在对话推荐中生成多模态输出。实验评估证明了我们的模型在实际对话推荐场景中的有效性。|code|0|
|Federated Conversational Recommender Systems|Allen Lin, Jianling Wang, Ziwei Zhu, James Caverlee|George Mason Univ, Fairfax, VA 22030 USA; Texas A&M Univ, College Stn, TX 77843 USA|Conversational Recommender Systems (CRSs) have become increasingly popular as a powerful tool for providing personalized recommendation experiences. By directly engaging with users in a conversational manner to learn their current and fine-grained preferences, a CRS can quickly derive recommendations that are relevant and justifiable. However, existing CRSs typically rely on a centralized training and deployment process, which involves collecting and storing explicitly-communicated user preferences in a centralized repository. These fine-grained user preferences are completely human-interpretable and can easily be used to infer sensitive information (e.g., financial status, political stands, and health information) about the user, if leaked or breached. To address the user privacy concerns in CRS, we first define a set of privacy protection guidelines for preserving user privacy then propose a novel federated CRS framework that effectively reduces the risk of exposing user privacy. Through extensive experiments, we show that the proposed framework not only satisfies these user privacy protection guidelines, but also achieves competitive recommendation performance comparing to the state-of-the-art non-private conversational recommendation approach.|会话推荐系统(Conversational Recommender Systems, CRSs)作为一种提供个性化推荐体验的强大工具,正变得越来越受欢迎。通过与用户以对话形式直接互动,CRS能够学习用户的当前和细粒度偏好,从而快速生成相关且合理的推荐。然而,现有的CRS通常依赖于集中式的训练和部署过程,这涉及将用户明确表达的偏好收集并存储在集中式存储库中。这些细粒度的用户偏好是完全可被人类理解的,如果泄露或被攻击,很容易被用于推断用户的敏感信息(例如财务状况、政治立场和健康信息)。为了解决CRS中的用户隐私问题,我们首先定义了一组保护用户隐私的隐私保护准则,然后提出了一种新颖的联邦CRS框架,该框架有效降低了用户隐私暴露的风险。通过大量实验,我们证明所提出的框架不仅满足了这些用户隐私保护准则,而且在推荐性能上与最先进的非隐私会话推荐方法相比也具备竞争力。|code|0|
|Improving Exposure Allocation in Rankings by Query Generation|Thomas Jänich, Graham McDonald, Iadh Ounis|Univ Glasgow, Glasgow, Lanark, Scotland|Deploying methods that incorporate generated queries in their retrieval process, such as Doc2Query, has been shown to be effective for retrieving the most relevant documents for a user's query. However, to the best of our knowledge, there has been no work yet on whether generated queries can also be used in the ranking process to achieve other objectives, such as ensuring a fair distribution of exposure in the ranking. Indeed, the amount of exposure that a document is likely to receive depends on the document's position in the ranking, with lower-ranked documents having a lower probability of being examined by the user. While the utility to users remains the main objective of an Information Retrieval (IR) system, an unfair exposure allocation can lead to lost opportunities and unfair economic impacts for particular societal groups. Therefore, in this work, we conduct a first investigation into whether generating relevant queries can help to fairly distribute the exposure over groups of documents in a ranking. In our work, we build on the effective Doc2Query methods to selectively generate relevant queries for underrepresented groups of documents and use their predicted relevance to the original query in order to re-rank the underexposed documents. Our experiments on the TREC 2022 Fair Ranking Track collection show that using generated queries consistently leads to a fairer allocation of exposure compared to a standard ranking while still maintaining utility.|在检索过程中采用生成查询的方法,如Doc2Query,已被证明能够有效检索与用户查询最相关的文档。然而,据我们所知,目前尚未有研究探讨生成查询是否也可用于排序过程中以实现其他目标,例如确保排序中曝光的公平分配。事实上,文档可能获得的曝光量取决于其在排序中的位置,排名较低的文档被用户查看的概率较低。尽管用户效用仍然是信息检索(IR)系统的主要目标,但不公平的曝光分配可能导致某些社会群体错失机会并遭受不公平的经济影响。因此,在本研究中,我们首次探讨了生成相关查询是否有助于在排序中公平分配文档组的曝光量。在我们的研究中,我们基于有效的Doc2Query方法,选择性地为代表性不足的文档组生成相关查询,并利用它们对原始查询的预测相关性来重新排序曝光不足的文档。我们在TREC 2022公平排序赛道数据集上的实验表明,与标准排序相比,使用生成查询能够持续实现更公平的曝光分配,同时仍保持用户效用。|code|0|
|KnowFIRES: A Knowledge-Graph Framework for Interpreting Retrieved Entities from Search|Negar Arabzadeh, Kiarash Golzadeh, Christopher Risi, Charles L. A. Clarke, Jian Zhao|Univ Waterloo, Waterloo, ON, Canada|Entity retrieval is essential in information access domains where people search for specific entities, such as individuals, organizations, and places. While entity retrieval is an active research topic in Information Retrieval, it is necessary to explore their explainability and interpretability more extensively. KnowFIRES addresses this by offering a knowledge graph-based visual representation of entity retrieval results, focusing on contrasting different retrieval methods. KnowFIRES allows users to better understand these differences through the juxtaposition and superposition of retrieved sub-graphs. As part of our demo, we make KnowFIRES (Demo: http://knowfires.live , Source: https://github.com/kiarashgl/KnowFIRES ) web interface and its source code publicly available (A demonstration of the tool: https://www.youtube.com/watch?v=9u-877ArNYE ).|实体检索在信息访问领域中至关重要,尤其是在人们搜索特定实体(如个人、组织、地点等)时。尽管实体检索是信息检索领域的一个活跃研究课题,但有必要更广泛地探索其可解释性和可理解性。KnowFIRES通过提供基于知识图谱的实体检索结果可视化表示来解决这一问题,重点在于对比不同的检索方法。KnowFIRES允许用户通过检索子图的并置和叠加来更好地理解这些差异。作为我们演示的一部分,我们公开了KnowFIRES的Web界面及其源代码(演示:http://knowfires.live,源代码:https://github.com/kiarashgl/KnowFIRES)(工具演示视频:https://www.youtube.com/watch?v=9u-877ArNYE)。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=KnowFIRES:+A+Knowledge-Graph+Framework+for+Interpreting+Retrieved+Entities+from+Search)|0|
|A Conversational Search Framework for Multimedia Archives|Anastasia Potyagalova, Gareth J. F. Jones|Dublin City Univ, Sch Comp, ADAPT Ctr, Dublin 9, Ireland|Conversational search systems seek to support users in their search activities to improve the effectiveness and efficiency of search while reducing their cognitive load. The challenges of multimedia search mean that search supports provided by conversational search have the potential to improve the user search experience. For example, by assisting users in constructing better queries and making more informed decisions in relevance feedback stages whilst searching. However, previous research on conversational search has been focused almost exclusively on text archives. This demonstration illustrates the potential for the application of conversational methods in multimedia search. We describe a framework to enable multimodal conversational search for use with multimedia archives. Our current prototype demonstrates the use of a conversational AI assistant during the multimedia information retrieval process for both image and video collections.|对话式搜索系统旨在支持用户的搜索活动,以提高搜索的效率和效果,同时减少用户的认知负担。多媒体搜索的挑战意味着对话式搜索提供的支持有潜力提升用户的搜索体验。例如,通过帮助用户构建更好的查询并在搜索过程中的相关性反馈阶段做出更明智的决策。然而,以往关于对话式搜索的研究几乎完全集中在文本档案上。本演示展示了在多媒体搜索中应用对话式方法的潜力。我们描述了一个支持多模态对话式搜索的框架,用于多媒体档案。我们当前的原型展示了在多媒体信息检索过程中使用对话式人工智能助手来处理图像和视频集合。|code|0|
|Effective and Efficient Transformer Models for Sequential Recommendation|Aleksandr V. Petrov|Univ Glasgow, Glasgow, Lanark, Scotland|Sequential Recommender Systems use the order of user-item interactions to predict the next item in the sequence. This task is similar to Language Modelling, where the goal is to predict the next token based on the sequence of past tokens. Therefore, adaptations of language models, and, in particular, Transformer-based models, achieved state-of-the-art results for a sequential recommendation. However, despite similarities, the sequential recommendation problem poses a number of specific challenges not present in Language Modelling. These challenges include the large catalogue size of real-world recommender systems, which increases GPU memory requirements and makes the training and the inference of recommender models slow. Another challenge is that a good recommender system should focus not only on the accuracy of recommendation but also on additional metrics, such as diversity and novelty, which makes the direct adaptation of language model training strategies problematic. Our research focuses on solving these challenges. In this doctoral consortium abstract, we briefly describe the motivation and background for our work and then pose research questions and discuss current progress towards solving the described problems.|序列推荐系统利用用户-物品交互的顺序来预测序列中的下一个物品。这一任务与语言建模类似,其目标是根据过去令牌的序列预测下一个令牌。因此,基于语言模型的改进,特别是基于Transformer的模型,在序列推荐中取得了最先进的结果。然而,尽管存在相似性,序列推荐问题仍带来了一些在语言建模中不存在的特定挑战。这些挑战包括现实世界推荐系统中庞大的物品目录规模,这增加了GPU内存需求,并使得推荐模型的训练和推理速度变慢。另一个挑战是,一个好的推荐系统不仅应关注推荐的准确性,还应关注其他指标,如多样性和新颖性,这使得直接采用语言模型的训练策略变得困难。我们的研究专注于解决这些挑战。在本博士联盟摘要中,我们简要描述了工作的动机和背景,随后提出了研究问题,并讨论了在解决所述问题方面的当前进展。|code|0|
|Quantum Computing for Information Retrieval and Recommender Systems|Maurizio Ferrari Dacrema, Andrea Pasin, Paolo Cremonesi, Nicola Ferro|Univ Padua, Padua, Italy; Politecn Milan, Milan, Italy|The field of Quantum Computing (QC) has gained significant popularity in recent years, due to its potential to provide benefits in terms of efficiency and effectiveness when employed to solve certain computationally intensive tasks. In both Information Retrieval (IR) and Recommender Systems (RS) we are required to build methods that apply complex processing on large and heterogeneous datasets, it is natural therefore to wonder whether QC could also be applied to boost their performance. The tutorial aims to provide first an introduction to QC for an audience that is not familiar with the technology, then to show how to apply the QC paradigm of Quantum Annealing (QA) to solve practical problems that are currently faced by IR and RS systems. During the tutorial, participants will be provided with the fundamentals required to understand QC and to apply it in practice by using a real D-Wave quantum annealer through APIs.|近年来,量子计算(Quantum Computing, QC)领域因其在解决某些计算密集型任务时可能带来的效率和效果上的优势而备受关注。在信息检索(Information Retrieval, IR)和推荐系统(Recommender Systems, RS)中,我们需要构建能够对大规模异构数据集进行复杂处理的方法,因此自然会产生疑问:量子计算是否也能应用于提升这些系统的性能。本教程旨在首先为不熟悉该技术的观众提供量子计算的入门介绍,随后展示如何应用量子退火(Quantum Annealing, QA)范式来解决当前IR和RS系统面临的实际问题。在教程过程中,参与者将通过API使用真实的D-Wave量子退火器,获得理解量子计算并将其应用于实践所需的基础知识。|code|0|
|Transformers for Sequential Recommendation|Aleksandr V. Petrov, Craig Macdonald|National University of Singapore, Singapore, Singapore; University of Hong Kong, Hong Kong, China; Wuhan University, Wuhan, China; Ocean University of China, Qingdao, China|Learning dynamic user preference has become an increasingly important component for many online platforms (e.g., video-sharing sites, e-commerce systems) to make sequential recommendations. Previous works have made many efforts to model item-item transitions over user interaction sequences, based on various architectures, e.g., recurrent neural networks and self-attention mechanism. Recently emerged graph neural networks also serve as useful backbone models to capture item dependencies in sequential recommendation scenarios. Despite their effectiveness, existing methods have thus far focused on item sequence representation with a singular type of interaction, and thus are limited to capture dynamic heterogeneous relational structures between users and items (e.g., page view, add-to-favorite, purchase). To tackle this challenge, we design a Multi-Behavior Hypergraph-enhanced Transformer framework (MBHT) to capture both short-term and long-term cross-type behavior dependencies. Specifically, a multi-scale Transformer is equipped with low-rank self-attention to jointly encode behavior-aware sequential patterns from fine-grained and coarse-grained levels. Additionally, we incorporate the global multi-behavior dependency into the hypergraph neural architecture to capture the hierarchical long-range item correlations in a customized manner. Experimental results demonstrate the superiority of our MBHT over various state-of-the-art recommendation solutions across different settings. Further ablation studies validate the effectiveness of our model design and benefits of the new MBHT framework. Our implementation code is released at: https://github.com/yuh-yang/MBHT-KDD22.|学习动态用户偏好已经成为许多在线平台(如视频分享网站、电子商务系统)进行序列推荐的一个越来越重要的组成部分。以往的研究基于多种体系结构(如循环神经网络和自注意力机制),对用户交互序列上的项目间转换进行了大量建模。最近出现的图神经网络也可以作为有用的骨干模型,在序列推荐场景中捕获项目间的依赖关系。尽管这些方法很有效,但现有方法都集中在单一交互类型的项目序列表示上,因此难以捕获用户和项目之间的动态异构关系结构(例如页面浏览、加入收藏、购买)。为了应对这一挑战,我们设计了一个多行为超图增强 Transformer 框架(MBHT)来捕获短期和长期的跨类型行为依赖。具体而言,多尺度 Transformer 配备低秩自注意力,以从细粒度和粗粒度层面联合编码行为感知的序列模式。此外,我们将全局多行为依赖引入超图神经结构中,以定制化的方式捕获层次化的长程项目相关性。实验结果表明,我们的 MBHT 在不同设置下均优于各种最先进的推荐解决方案。进一步的消融研究验证了我们模型设计的有效性以及新的 MBHT 框架的优势。我们的实现代码发布于:https://github.com/yuh-yang/MBHT-KDD22。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Transformers+for+Sequential+Recommendation)|0|
|Context-Aware Query Term Difficulty Estimation for Performance Prediction|Abbas Saleminezhad, Negar Arabzadeh, Soosan Beheshti, Ebrahim Bagheri|Univ Waterloo, Toronto, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Research has already found that many retrieval methods are sensitive to the choice and order of terms that appear in a query, which can significantly impact retrieval effectiveness. We capitalize on this finding in order to predict the performance of a query. More specifically, we propose to learn query term difficulty weights specifically within the context of each query, which could then be used as indicators of whether each query term has the likelihood of making the query more effective or not. We show how such difficulty weights can be learnt through the finetuning of a language model. In addition, we propose an approach to integrate the learnt weights into a cross-encoder architecture to predict query performance. We show that our proposed approach shows a consistently strong performance prediction on the MSMARCO collection and its associated widely used Trec Deep Learning tracks query sets. Our findings demonstrate that our method is able to show consistently strong performance prediction over different query sets (MSMARCO Dev, TREC DL'19, '20, Hard) and a range of evaluation metrics (Kendall, Spearman, sMARE).|研究发现,许多检索方法对查询中出现的术语选择和顺序非常敏感,这会显著影响检索效果。我们利用这一发现来预测查询的性能。更具体地说,我们提出在每个查询的上下文中学习查询术语的难度权重,这些权重可以作为指标,判断每个查询术语是否有可能使查询更有效。我们展示了如何通过微调语言模型来学习这些难度权重。此外,我们提出了一种方法,将学习到的权重集成到交叉编码器架构中,以预测查询性能。我们展示了所提出的方法在MSMARCO数据集及其相关的广泛使用的Trec深度学习赛道查询集上表现出持续强劲的性能预测能力。我们的研究结果表明,该方法能够在不同查询集(MSMARCO Dev、TREC DL'19、'20、Hard)和一系列评估指标(Kendall、Spearman、sMARE)上表现出持续强劲的性能预测能力。|code|0|
|Navigating the Thin Line: Examining User Behavior in Search to Detect Engagement and Backfire Effects|Federico Maria Cau, Nava Tintarev||Opinionated users often seek information that aligns with their preexisting beliefs while dismissing contradictory evidence due to confirmation bias. This conduct hinders their ability to consider alternative stances when searching the web. Despite this, few studies have analyzed how the diversification of search results on disputed topics influences the search behavior of highly opinionated users. To this end, we present a preregistered user study (n = 257) investigating whether different levels (low and high) of bias metrics and search results presentation (with or without AI-predicted stances labels) can affect the stance diversity consumption and search behavior of opinionated users on three debated topics (i.e., atheism, intellectual property rights, and school uniforms). Our results show that exposing participants to (counter-attitudinally) biased search results increases their consumption of attitude-opposing content, but we also found that bias was associated with a trend toward overall fewer interactions within the search page. We also found that 19% of participants did not click on any search results. When we removed these participants in a post-hoc analysis, we found that stance labels increased the diversity of stances consumed by users, particularly when the search results were biased. Our findings highlight the need for future research to explore distinct search scenario settings to gain insight into opinionated users' behavior.|固执己见的用户往往寻求与他们先前存在的信念相一致的信息,而由于确认偏见而排除相互矛盾的证据。这种行为妨碍了他们在搜索网页时考虑其他立场的能力。尽管如此,很少有研究分析有争议话题的搜索结果的多样化如何影响高度固执己见的用户的搜索行为。为此,我们开展了一项预注册的用户研究(n = 257),调查不同水平(低和高)的偏倚指标和搜索结果呈现方式(有或没有 AI 预测的立场标签)是否会影响固执己见的用户在三个有争议话题(即无神论、知识产权和校服)上的立场多样性消费和搜索行为。我们的研究结果显示,让参与者接触(与其态度相反的)有偏见的搜索结果会增加他们对与自身态度相反内容的消费,但我们也发现,这种偏见与搜索页面内整体交互减少的趋势相关。我们还发现19%的参与者没有点击任何搜索结果。当我们在事后分析中移除这些参与者时,我们发现立场标签增加了用户所消费立场的多样性,特别是当搜索结果有偏见时。我们的研究结果强调,未来的研究需要探索不同的搜索场景设置,以深入了解固执己见的用户的行为。|code|0|
|Measuring Bias in a Ranked List Using Term-Based Representations|Amin Abolghasemi, Leif Azzopardi, Arian Askari, Maarten de Rijke, Suzan Verberne||In most recent studies, gender bias in document ranking is evaluated with the NFaiRR metric, which measures bias in a ranked list based on an aggregation over the unbiasedness scores of each ranked document. This perspective in measuring the bias of a ranked list has a key limitation: individual documents of a ranked list might be biased while the ranked list as a whole balances the groups' representations. To address this issue, we propose a novel metric called TExFAIR (term exposure-based fairness), which is based on two new extensions to a generic fairness evaluation framework, attention-weighted ranking fairness (AWRF). TExFAIR assesses fairness based on the term-based representation of groups in a ranked list: (i) an explicit definition of associating documents to groups based on probabilistic term-level associations, and (ii) a rank-biased discounting factor (RBDF) for counting non-representative documents towards the measurement of the fairness of a ranked list. We assess TExFAIR on the task of measuring gender bias in passage ranking, and study the relationship between TExFAIR and NFaiRR. Our experiments show that there is no strong correlation between TExFAIR and NFaiRR, which indicates that TExFAIR measures a different dimension of fairness than NFaiRR. With TExFAIR, we extend the AWRF framework to allow for the evaluation of fairness in settings with term-based representations of groups in documents in a ranked list.|在最近的大多数研究中,文档排名中的性别偏见是通过 NFaiRR 度量来评估的,该度量基于对每个排名文档的无偏性评分的聚合来衡量排名列表中的偏见。这种测量排名列表偏差的视角有一个关键的局限性:排名列表中的个别文档可能有偏差,而排名列表作为一个整体却平衡了各群体的表示。为了解决这个问题,我们提出了一种名为 TExFAIR(基于术语暴露的公平性)的新度量,它基于对通用公平性评估框架——注意力加权排序公平性(AWRF)——的两个新扩展。TExFAIR 基于排名列表中群体的术语表示来评估公平性:(i)基于概率化术语级关联将文档与群体相关联的明确定义,以及(ii)一种排名偏置折扣因子(RBDF),用于在测量排名列表公平性时将非代表性文档计入。我们在测量段落排序中的性别偏见这一任务上评估了 TExFAIR,并研究了 TExFAIR 与 NFaiRR 之间的关系。我们的实验表明,TExFAIR 和 NFaiRR 之间没有很强的相关性,这表明 TExFAIR 测量的公平性维度不同于 NFaiRR。通过 TExFAIR,我们扩展了 AWRF 框架,使其能够在排名列表文档以基于术语的方式表示群体的场景下评估公平性。|code|0|
|Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation|Eugene Yang, Dawn J. Lawrie, James Mayfield, Douglas W. Oard, Scott Miller||Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging due to the lack of a sufficiently large training collection when the query and document languages differ. The state of the art for CLIR thus relies on translating queries, documents, or both from the large English MS MARCO training set, an approach called Translate-Train. This paper proposes an alternative, Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model. This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR. Trained models and artifacts are publicly available on Huggingface.|先前关于英语单语检索的工作已经表明,使用大量查询-文档对相关性判断训练的交叉编码器可以作为教师,来训练更高效但效果相当的双编码器学生模型。将类似的知识蒸馏方法应用于训练跨语言信息检索(CLIR,其查询和文档使用不同语言)的高效双编码器模型则具有挑战性,因为当查询和文档语言不同时,缺乏足够大的训练集合。因此,CLIR 的最先进方法依赖于翻译大型英文 MS MARCO 训练集中的查询、文档或两者,这种方法称为 Translate-Train。本文提出了一种替代方法 Translate-Distill,利用从单语交叉编码器或 CLIR 交叉编码器中蒸馏的知识来训练双编码器 CLIR 学生模型。这个更丰富的设计空间使教师模型能够在优化的设置中执行推理,同时直接针对 CLIR 训练学生模型。训练好的模型和产物已在 Huggingface 上公开。|code|0|
|DESIRE-ME: Domain-Enhanced Supervised Information Retrieval Using Mixture-of-Experts|Pranav Kasela, Gabriella Pasi, Raffaele Perego, Nicola Tonellotto||Open-domain question answering requires retrieval systems able to cope with the diverse and varied nature of questions, providing accurate answers across a broad spectrum of query types and topics. To deal with such topic heterogeneity through a unique model, we propose DESIRE-ME, a neural information retrieval model that leverages the Mixture-of-Experts framework to combine multiple specialized neural models. We rely on Wikipedia data to train an effective neural gating mechanism that classifies the incoming query and that weighs the predictions of the different domain-specific experts correspondingly. This allows DESIRE-ME to specialize adaptively in multiple domains. Through extensive experiments on publicly available datasets, we show that our proposal can effectively generalize domain-enhanced neural models. DESIRE-ME excels in handling open-domain questions adaptively, boosting retrieval effectiveness by up to 12% and 22% on the evaluation metrics considered.|开放领域的问题回答要求检索系统能够处理各种各样的问题,提供准确的答案跨广泛的查询类型和主题。为了通过一个独特的模型来处理这样的话题异质性,我们提出了 DESIRE-ME,一个神经信息检索模型,它利用专家混合框架来结合多个专门的神经模型。我们依靠 Wikipedia 数据来训练一种有效的神经门控机制,该机制对传入的查询进行分类,并相应地权衡不同领域专家的预测。这使得 DESIRE-ME 可以自适应地专门处理多个域。通过在公开数据集上的大量实验,我们表明我们的方案可以有效地推广领域增强的神经模型。DESIRE-ME 擅长于自适应地处理开放领域的问题,在所考察的评估指标上最多可分别提升12%和22%。|code|0|
|A Deep Learning Approach for Selective Relevance Feedback|Suchana Datta, Debasis Ganguly, Sean MacAvaney, Derek Greene||Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-based learning to determine whether a query should be expanded. In contrast, we revisit the problem of selective PRF from a deep learning perspective, presenting a model that is entirely data-driven and trained in an end-to-end manner. The proposed model leverages a transformer-based bi-encoder architecture. Additionally, to further improve retrieval effectiveness with this selective PRF approach, we make use of the model's confidence estimates to combine the information from the original and expanded queries. In our experiments, we apply this selective feedback on a number of different combinations of ranking and feedback models, and show that our proposed approach consistently improves retrieval effectiveness for both sparse and dense ranking models, with the feedback models being either sparse, dense or generative.|伪相关反馈(PRF)可以在足够多的查询上提高平均检索效果。然而,PRF 常常引入对原始信息需求的漂移,从而影响了多个查询的检索效率。尽管 PRF 的选择性应用有可能缓解这一问题,但以前的方法在很大程度上依赖于无监督或基于特征的学习来确定是否应该扩展查询。相比之下,我们从深度学习的角度重新审视选择性 PRF 的问题,提出了一个完全由数据驱动并以端到端方式进行训练的模型。该模型采用了基于 Transformer 的双编码器架构。此外,为了进一步提高这种选择性 PRF 方法的检索效率,我们利用模型的置信度估计来组合来自原始和扩展查询的信息。在我们的实验中,我们将这种选择性反馈应用于许多不同的排序和反馈模型组合,并且表明我们提出的方法始终如一地提高了稀疏和密集排序模型的检索效率,反馈模型要么是稀疏的,要么是密集的,要么是生成的。|code|0|
|Self Contrastive Learning for Session-Based Recommendation|Zhengxiang Shi, Xi Wang, Aldo Lipani||Session-based recommendation, which aims to predict the next item of users' interest as per an existing sequence interaction of items, has attracted growing applications of Contrastive Learning (CL) with improved user and item representations. However, these contrastive objectives: (1) serve a similar role as the cross-entropy loss while ignoring the item representation space optimisation; and (2) commonly require complicated modelling, including complex positive/negative sample constructions and extra data augmentation. In this work, we introduce Self-Contrastive Learning (SCL), which simplifies the application of CL and enhances the performance of state-of-the-art CL-based recommendation techniques. Specifically, SCL is formulated as an objective function that directly promotes a uniform distribution among item representations and efficiently replaces all the existing contrastive objective components of state-of-the-art models. Unlike previous works, SCL eliminates the need for any positive/negative sample construction or data augmentation, leading to enhanced interpretability of the item representation space and facilitating its extensibility to existing recommender systems. Through experiments on three benchmark datasets, we demonstrate that SCL consistently improves the performance of state-of-the-art models with statistical significance. Notably, our experiments show that SCL improves the performance of two best-performing models by 8.2% and 9.5% in P@10 (Precision) and 9.9% and 11.2% in MRR@10 (Mean Reciprocal Rank) on average across different benchmarks. Additionally, our analysis elucidates the improvement in terms of alignment and uniformity of representations, as well as the effectiveness of SCL with a low computational cost.|基于会话的推荐旨在根据已有的项目序列交互预测用户感兴趣的下一个项目,已吸引了越来越多对比学习(CL)的应用,以改进用户和项目表示。然而,这些对比目标:(1)起到与交叉熵损失类似的作用,却忽略了项目表示空间的优化;(2)通常需要复杂的建模,包括复杂的正/负样本构造和额外的数据增强。本文介绍了自对比学习(SCL),它简化了 CL 的应用,并提高了最先进的基于 CL 的推荐技术的性能。具体来说,SCL 被形式化为一个直接促进项目表示之间均匀分布的目标函数,能有效地替代最先进模型中所有现有的对比目标成分。与以前的工作不同,SCL 无需任何正/负样本构建或数据增强,从而增强了项目表示空间的可解释性,并便于其扩展到现有推荐系统。通过在三个基准数据集上的实验,我们证明了 SCL 能够持续且具有统计显著性地提高最先进模型的性能。值得注意的是,我们的实验表明,在不同基准上,SCL 将两个性能最好的模型的 P@10(精确率)平均分别提高了8.2%和9.5%,MRR@10(平均倒数排名)平均分别提高了9.9%和11.2%。此外,我们的分析阐明了表示在对齐性和均匀性方面的改进,以及 SCL 在低计算成本下的有效性。|code|0|
|Revealing the Hidden Impact of Top-N Metrics on Optimization in Recommender Systems|Lukas Wegmeth, Tobias Vente, Lennart Purucker||The hyperparameters of recommender systems for top-n predictions are typically optimized to enhance the predictive performance of algorithms. Thereby, the optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, like nDCG or Precision. In contrast, the optimized algorithm, internally optimizes a different loss function during training, like squared error or cross-entropy. To tackle this discrepancy, recent work focused on generating loss functions better suited for recommender systems. Yet, when evaluating an algorithm using a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric; ignoring that the top-n items are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for optimization-target metrics could be outside the top-n recommended items; hiddenly impacting the optimization performance. Therefore, we were motivated to analyze whether the top-n items are optimal for optimization-target top-n metrics. In pursuit of an answer, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We extensively evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top 43% of selection strategies is not significantly different. We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions.|面向 top-n 预测的推荐系统的超参数通常会被优化以提高算法的预测性能。其中,优化算法(例如网格搜索或随机搜索)根据优化目标度量(如 nDCG 或 Precision)搜索最佳超参数配置。相比之下,被优化的算法在训练期间内部优化的是另一种损失函数,如平方误差或交叉熵。为了解决这一差异,最近的工作集中在设计更适合推荐系统的损失函数。然而,当在优化过程中使用 top-n 度量评估算法时,优化目标度量与训练损失之间的另一个差异迄今被忽略了:在优化过程中,选择 top-n 项目来计算 top-n 度量,却忽略了这些 top-n 项目是从使用完全不同的损失函数训练的模型的推荐中选出的。适合优化目标度量的项目推荐可能位于 top-n 推荐项目之外,从而对优化性能产生隐性影响。因此,我们希望分析 top-n 项目对于以 top-n 度量为优化目标是否是最优的。为了寻找答案,除了选择 top-n 之外,我们还详尽评估了250种选择策略的预测性能。我们在十二个隐式反馈和八个显式反馈数据集上,使用十一种推荐系统算法对每种选择策略进行了广泛评估。结果表明,除 top-n 之外,确实存在其他选择策略可以在各种算法和推荐领域中提高预测性能。然而,排名前43%的选择策略的表现并没有显著差异。我们讨论了这些发现对推荐系统中优化和重排序的影响以及可行的解决方案。|code|0|
|TWOLAR: A TWO-Step LLM-Augmented Distillation Method for Passage Reranking|Davide Baldelli, Junfeng Jiang, Akiko Aizawa, Paolo Torroni||In this paper, we present TWOLAR: a two-stage pipeline for passage reranking based on the distillation of knowledge from Large Language Models (LLM). TWOLAR introduces a new scoring strategy and a distillation process consisting in the creation of a novel and diverse training dataset. The dataset consists of 20K queries, each associated with a set of documents retrieved via four distinct retrieval methods to ensure diversity, and then reranked by exploiting the zero-shot reranking capabilities of an LLM. Our ablation studies demonstrate the contribution of each new component we introduced. Our experimental results show that TWOLAR significantly enhances the document reranking ability of the underlying model, matching and in some cases even outperforming state-of-the-art models with three orders of magnitude more parameters on the TREC-DL test sets and the zero-shot evaluation benchmark BEIR. To facilitate future work we release our data set, finetuned models, and code.|在本文中,我们提出了 TWOLAR:一个基于从大语言模型(LLM)中蒸馏知识的两阶段段落重排序流水线。TWOLAR 引入了一种新的评分策略和一个蒸馏过程,后者包括创建一个新颖且多样化的训练数据集。该数据集由20K 个查询组成,每个查询与一组文档相关联;这些文档通过四种不同的检索方法检索以确保多样性,然后利用 LLM 的零样本重排序能力进行重排序。我们的消融研究证明了我们引入的每个新组件的贡献。实验结果显示,TWOLAR 显著提高了基础模型的文档重排序能力,在 TREC-DL 测试集和零样本评估基准 BEIR 上,匹配甚至在某些情况下超越了参数量多出三个数量级的最先进模型。为了方便未来的工作,我们发布了我们的数据集、微调模型和代码。|code|0|
|Estimating Query Performance Through Rich Contextualized Query Representations|Sajad Ebrahimi, Maryam Khodabakhsh, Negar Arabzadeh, Ebrahim Bagheri|Toronto Metropolitan Univ, Toronto, ON, Canada; Univ Waterloo, Waterloo, ON, Canada; Univ Guelph, Guelph, ON, Canada; Shahrood Univ Technol, Shahrood, Iran|The state-of-the-art query performance prediction methods rely on the fine-tuning of contextual language models to estimate retrieval effectiveness on a per-query basis. Our work in this paper builds on this strong foundation and proposes to learn rich query representations by learning the interactions between the query and two important contextual information, namely (1) the set of documents retrieved by that query, and (2) the set of similar historical queries with known retrieval effectiveness. We propose that such contextualized query representations can be more accurate estimators of query performance as they embed the performance of past similar queries and the semantics of the documents retrieved by the query. We perform extensive experiments on the MSMARCO collection and its accompanying query sets including MSMARCO Dev set and TREC Deep Learning tracks of 2019, 2020, 2021, and DL-Hard. Our experiments reveal that our proposed method shows robust and effective performance compared to state-of-the-art baselines.|当前最先进的查询性能预测方法依赖于对上下文语言模型进行微调,以逐条查询为基础估计检索效果。本文的研究工作在这一坚实基础之上,进一步提出通过学习查询与两种重要上下文信息之间的交互来学习丰富的查询表示,这两种上下文信息分别是:(1) 由该查询检索到的文档集,以及 (2) 已知检索效果的相似历史查询集。我们认为,这种上下文化的查询表示可以作为更准确的查询性能估计器,因为它们嵌入了过去相似查询的性能以及由查询检索到的文档的语义信息。我们在MSMARCO数据集及其伴随的查询集上进行了广泛的实验,这些查询集包括MSMARCO开发集以及2019、2020、2021年的TREC深度学习赛道和DL-Hard数据集。实验结果表明,与最先进的基线方法相比,我们提出的方法展现了稳健且高效的性能。|code|0|
|Performance Comparison of Session-Based Recommendation Algorithms Based on GNNs|Faisal Shehzad, Dietmar Jannach||In session-based recommendation settings, a recommender system has to base its suggestions on the user interactions that are observed in an ongoing session. Since such sessions can consist of only a small set of interactions, various approaches based on Graph Neural Networks (GNN) were recently proposed, as they allow us to integrate various types of side information about the items in a natural way. Unfortunately, a variety of evaluation settings are used in the literature, e.g., in terms of protocols, metrics and baselines, making it difficult to assess what represents the state of the art. In this work, we present the results of an evaluation of eight recent GNN-based approaches that were published in high-quality outlets. For a fair comparison, all models are systematically tuned and tested under identical conditions using three common datasets. We furthermore include k-nearest-neighbor and sequential rules-based models as baselines, as such models have previously exhibited competitive performance results for similar settings. To our surprise, the evaluation showed that the simple models outperform all recent GNN models in terms of the Mean Reciprocal Rank, which we used as an optimization criterion, and were only outperformed in three cases in terms of the Hit Rate. Additional analyses furthermore reveal that several other factors that are often not deeply discussed in papers, e.g., random seeds, can markedly impact the performance of GNN-based models. Our results therefore (a) point to continuing issues in the community in terms of research methodology and (b) indicate that there is ample room for improvement in session-based recommendation.|在基于会话的推荐设置中,推荐系统必须根据当前会话中观察到的用户交互来提出建议。由于这样的会话可能只包含少量交互,最近提出了各种基于图神经网络(GNN)的方法,因为它们允许我们以自然的方式整合项目的各类辅助信息。不幸的是,文献中使用了各种各样的评估设置(例如在协议、指标和基线方面),这使得很难判断什么才是最先进的技术。在这项工作中,我们介绍了对最近在高质量渠道发表的八种基于 GNN 的方法的评估结果。为了公平比较,所有模型都在相同条件下使用三个常用数据集进行了系统的调优和测试。我们还纳入了基于 k 近邻和序列规则的模型作为基线,因为这类模型此前在类似设置中表现出具有竞争力的结果。令我们惊讶的是,评估显示,在我们用作优化标准的平均倒数排名(MRR)指标上,简单模型优于所有最近的 GNN 模型,而在命中率方面仅在三种情况下被 GNN 模型超越。进一步的分析还表明,论文中通常不深入讨论的其他几个因素(例如随机种子)会显著影响基于 GNN 的模型的性能。因此,我们的结果(a)指出了社区在研究方法方面持续存在的问题,(b)表明基于会话的推荐还有很大的改进空间。|code|0|
|Weighted AUReC: Handling Skew in Shard Map Quality Estimation for Selective Search|Gijs Hendriksen, Djoerd Hiemstra, Arjen P. de Vries|Radboud Univ Nijmegen, Nijmegen, Netherlands|In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC measure was introduced. AUReC makes the assumption that shards are of similar sizes, one that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure, and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but captures better the differences in performance when shard sizes are skewed.|在选择性搜索中,文档集合被划分为多个主题索引分片。为了有效估计分片映射的主题一致性(或质量),引入了AUReC度量。AUReC假设分片大小相似,然而在实际应用中,这一假设往往不成立,即使对于无监督方法也是如此。如果使用具有倾斜类别分布的有监督标注方法,这一问题可能会进一步加剧。为了估计这种不平衡分片映射的质量,我们引入了AUReC度量的加权适应版本,并使用ClueWeb09B和Gov2数据集进行了实证评估。我们表明,当分片大小相似时,该度量与原始AUReC的评估结果高度一致,但在分片大小倾斜时,它能更好地捕捉性能差异。|code|0|
|Measuring Item Fairness in Next Basket Recommendation: A Reproducibility Study|Yuanna Liu, Ming Li, Mozhdeh Ariannezhad, Masoud Mansoury, Mohammad Aliannejadi, Maarten de Rijke|AIRLab, Amsterdam, Netherlands; Univ Amsterdam, Amsterdam, Netherlands; Booking com, Amsterdam, Netherlands|Item fairness of recommender systems aims to evaluate whether items receive a fair share of exposure according to different definitions of fairness. Raj and Ekstrand [26] study multiple fairness metrics under a common evaluation framework and test their sensitivity with respect to various configurations. They find that fairness metrics show varying degrees of sensitivity towards position weighting models and parameter settings under different information access systems. Although their study considers various domains and datasets, their findings do not necessarily generalize to next basket recommendation (NBR) where users exhibit a more repeat-oriented behavior compared to other recommendation domains. This paper investigates fairness metrics in the NBR domain under a unified experimental setup. Specifically, we directly evaluate the item fairness of various NBR methods. These fairness metrics rank NBR methods in different orders, while most of the metrics agree that repeat-biased methods are fairer than explore-biased ones. Furthermore, we study the effect of unique characteristics of the NBR task on the sensitivity of the metrics, including the basket size, position weighting models, and user repeat behavior. Unlike the findings in [26], Inequity of Amortized Attention (IAA) is the most sensitive metric, as observed in multiple experiments. Our experiments lead to novel findings in the field of NBR and fairness. We find that Expected Exposure Loss (EEL) and Expected Exposure Disparity (EED) are the most robust and adaptable fairness metrics to be used in the NBR domain.|推荐系统的项目公平性旨在评估项目是否根据不同的公平定义获得公平的曝光机会。Raj和Ekstrand[26]在一个共同的评估框架下研究了多种公平性指标,并测试了它们对各种配置的敏感性。他们发现,公平性指标在不同信息访问系统下对位置加权模型和参数设置表现出不同程度的敏感性。尽管他们的研究考虑了多个领域和数据集,但其发现并不一定适用于下一篮推荐(NBR)领域,因为与其他推荐领域相比,用户在NBR中表现出更多的重复导向行为。本文在统一的实验设置下研究了NBR领域中的公平性指标。具体而言,我们直接评估了各种NBR方法的项目公平性。这些公平性指标对NBR方法进行了不同顺序的排名,而大多数指标一致认为偏向重复的方法比偏向探索的方法更公平。此外,我们研究了NBR任务的独特特征对指标敏感性的影响,包括篮子大小、位置加权模型和用户重复行为。与[26]中的发现不同,在多个实验中观察到,摊销注意力不平等(IAA)是最敏感的指标。我们的实验在NBR和公平性领域得出了新的发现。我们发现,预期曝光损失(EEL)和预期曝光差异(EED)是在NBR领域中使用的最稳健和适应性最强的公平性指标。|code|0|
|Is Interpretable Machine Learning Effective at Feature Selection for Neural Learning-to-Rank?|Lijun Lyu, Nirmal Roy, Harrie Oosterhuis, Avishek Anand|Delft Univ Technol, Delft, Netherlands; Radboud Univ Nijmegen, Nijmegen, Netherlands|Neural ranking models have become increasingly popular for real-world search and recommendation systems in recent years. Unlike their tree-based counterparts, neural models are much less interpretable. That is, it is very difficult to understand their inner workings and answer questions like how do they make their ranking decisions? or what document features do they find important? This is particularly disadvantageous since interpretability is highly important for real-world systems. In this work, we explore feature selection for neural learning-to-rank (LTR). In particular, we investigate six widely-used methods from the field of interpretable machine learning (ML) and introduce our own modification, to select the input features that are most important to the ranking behavior. To understand whether these methods are useful for practitioners, we further study whether they contribute to efficiency enhancement. Our experimental results reveal a large feature redundancy in several LTR benchmarks: the local selection method TabNet can achieve optimal ranking performance with less than 10 features; the global methods, particularly our G-L2X, require slightly more selected features, but exhibit higher potential in improving efficiency. We hope that our analysis of these feature selection methods will bring the fields of interpretable ML and LTR closer together.|近年来,神经排序模型在现实世界的搜索和推荐系统中变得越来越流行。与基于树的模型不同,神经模型的解释性要差得多。也就是说,很难理解它们的内在机制并回答诸如“它们是如何做出排序决策的?”或“它们认为哪些文档特征重要?”这样的问题。这一点尤其不利,因为解释性对于现实世界的系统至关重要。在这项工作中,我们探索了神经排序学习(LTR)中的特征选择。特别是,我们研究了可解释机器学习(ML)领域中六种广泛使用的方法,并引入了我们自己的改进方法,以选择对排序行为最重要的输入特征。为了了解这些方法是否对实践者有用,我们进一步研究了它们是否有助于提高效率。我们的实验结果表明,在几个LTR基准测试中存在大量的特征冗余:局部选择方法TabNet可以在不到10个特征的情况下实现最佳的排序性能;全局方法,特别是我们的G-L2X,需要稍多的选择特征,但在提高效率方面表现出更高的潜力。我们希望我们对这些特征选择方法的分析能够将可解释ML和LTR领域更紧密地结合在一起。|code|0|
|The Impact of Differential Privacy on Recommendation Accuracy and Popularity Bias|Peter Müllner, Elisabeth Lex, Markus Schedl, Dominik Kowald||Collaborative filtering-based recommender systems leverage vast amounts of behavioral user data, which poses severe privacy risks. Thus, often, random noise is added to the data to ensure Differential Privacy (DP). However, to date, it is not well understood, in which ways this impacts personalized recommendations. In this work, we study how DP impacts recommendation accuracy and popularity bias, when applied to the training data of state-of-the-art recommendation models. Our findings are three-fold: First, we find that nearly all users' recommendations change when DP is applied. Second, recommendation accuracy drops substantially while recommended item popularity experiences a sharp increase, suggesting that popularity bias worsens. Third, we find that DP exacerbates popularity bias more severely for users who prefer unpopular items than for users that prefer popular items.|基于协同过滤的推荐系统利用了大量的用户行为数据,这带来了严重的隐私风险。因此,通常会向数据中添加随机噪声,以确保差分隐私(DP)。然而,到目前为止,人们尚未充分理解这在哪些方面影响个性化推荐。在本研究中,我们考察了将 DP 应用于最先进推荐模型的训练数据时,它如何影响推荐准确性和流行度偏差。我们的发现有三个方面:首先,应用 DP 后,几乎所有用户的推荐结果都会发生变化。其次,推荐准确性大幅下降,而被推荐项目的流行度急剧上升,这表明流行度偏差恶化。第三,对于偏好冷门项目的用户,DP 加剧流行度偏差的程度要比偏好热门项目的用户严重得多。|code|0|
|How to Forget Clients in Federated Online Learning to Rank?|Shuyi Wang, Bing Liu, Guido Zuccon||Data protection legislation like the European Union's General Data Protection Regulation (GDPR) establishes the right to be forgotten: a user (client) can request contributions made using their data to be removed from learned models. In this paper, we study how to remove the contributions made by a client participating in a Federated Online Learning to Rank (FOLTR) system. In a FOLTR system, a ranker is learned by aggregating local updates to the global ranking model. Local updates are learned in an online manner at a client-level using queries and implicit interactions that have occurred within that specific client. By doing so, each client's local data is not shared with other clients or with a centralised search service, while at the same time clients can benefit from an effective global ranking model learned from contributions of each client in the federation. In this paper, we study an effective and efficient unlearning method that can remove a client's contribution without compromising the overall ranker effectiveness and without needing to retrain the global ranker from scratch. A key challenge is how to measure whether the model has unlearned the contributions from the client c^* that has requested removal. For this, we instruct c^* to perform a poisoning attack (add noise to this client updates) and then we measure whether the impact of the attack is lessened when the unlearning process has taken place. Through experiments on four datasets, we demonstrate the effectiveness and efficiency of the unlearning strategy under different combinations of parameter settings.|数据保护立法(如欧盟的《通用数据保护条例》(GDPR))确立了被遗忘权:用户(客户端)可以要求从已学习的模型中删除基于其数据做出的贡献。在本文中,我们研究如何删除参与联邦在线排序学习(FOLTR)系统的某个客户端所做的贡献。在 FOLTR 系统中,通过将本地更新聚合到全局排序模型中来学习排序器。本地更新在客户端层面以在线方式学习,使用的是该客户端内发生的查询和隐式交互。这样,每个客户端的本地数据既不会与其他客户端共享,也不会与中心化搜索服务共享,同时客户端又能受益于由联邦中各客户端的贡献学习得到的有效全局排序模型。在本文中,我们研究一种有效且高效的遗忘(unlearning)方法,它可以移除某个客户端的贡献,而不损害整体排序器的有效性,也无需从头重新训练全局排序器。一个关键挑战是如何衡量模型是否已经遗忘了请求删除的客户端 c^* 的贡献。为此,我们让 c^* 执行投毒攻击(向该客户端的更新中添加噪声),然后衡量在执行遗忘过程后攻击的影响是否减弱。通过在四个数据集上的实验,我们验证了该遗忘策略在不同参数设置组合下的有效性和效率。|code|0|
|InDi: Informative and Diverse Sampling for Dense Retrieval|Nachshon Cohen, Hedda Cohen Indelman, Yaron Fairstein, Guy Kushilevitz|Technion, Haifa, Israel; Amazon, Haifa, Israel|Negative sample selection has been shown to have a crucial effect on the training procedure of dense retrieval systems. Nevertheless, most existing negative selection methods end by randomly choosing from some pool of samples. This calls for a better sampling solution. We define desired requirements for negative sample selection; the samples chosen should be informative, to advance the learning process, and diverse, to help the model generalize. We compose a sampling method designed to meet these requirements, and show that using our sampling method to enhance the training procedure of a recent significant dense retrieval solution (coCondenser) improves the obtained model's performance. Specifically, we see a ∼2% improvement in MRR@10 on the MS MARCO dataset (from 38.2 to 38.8) and a ∼1.5% improvement in Recall@5 on the Natural Questions dataset (from 71% to 72.1%), both statistically significant. Our solution, as opposed to other methods, does not require training or inferencing a large model, and adds only a small overhead (∼1% added time) to the training procedure. Finally, we report ablation studies showing that the objectives defined are indeed important when selecting negative samples for dense retrieval.|负样本选择已被证明对密集检索系统的训练过程具有至关重要的影响。然而,大多数现有的负样本选择方法最终都是从某个样本池中随机选择。这促使我们需要一种更好的采样解决方案。我们定义了负样本选择的理想要求:所选样本应具有信息量,以推进学习过程,并且应具有多样性,以帮助模型泛化。我们设计了一种采样方法,旨在满足这些要求,并展示了使用我们的采样方法来增强最近一种重要的密集检索解决方案(coCondenser)的训练过程,从而提高了所获得模型的性能。具体而言,我们在MS MARCO数据集上观察到MRR@10提高了约2%(从38.2提高到38.8),在Natural Questions数据集上观察到Recall@5提高了约1.5%(从71%提高到72.1%),两者均具有统计学意义。与其他方法不同,我们的解决方案不需要训练或推断大型模型,并且仅增加了训练过程的少量开销(约增加1%的时间)。最后,我们报告了消融研究,表明在为密集检索选择负样本时,定义的目标确实非常重要。|code|0|
|Learning-to-Rank with Nested Feedback|Hitesh Sagtani, Olivier Jeunen, Aleksei Ustimenko||Many platforms on the web present ranked lists of content to users, typically optimized for engagement-, satisfaction- or retention- driven metrics. Advances in the Learning-to-Rank (LTR) research literature have enabled rapid growth in this application area. Several popular interfaces now include nested lists, where users can enter a 2nd-level feed via any given 1st-level item. Naturally, this has implications for evaluation metrics, objective functions, and the ranking policies we wish to learn. We propose a theoretically grounded method to incorporate 2nd-level feedback into any 1st-level ranking model. Online experiments on a large-scale recommendation system confirm our theoretical findings.|网络上的许多平台向用户呈现经过排序的内容列表,通常针对参与度、满意度或留存驱动的指标进行优化。排序学习(LTR)研究文献的进展推动了这一应用领域的快速增长。一些流行的界面现在包含嵌套列表,用户可以通过任意给定的第一级项目进入第二级信息流。自然地,这对评估指标、目标函数以及我们希望学习的排序策略都有影响。我们提出了一种有理论依据的方法,将第二级反馈纳入任意第一级排序模型。在一个大规模推荐系统上的在线实验证实了我们的理论发现。|code|0|
|Simple Domain Adaptation for Sparse Retrievers|Mathias Vast, Yuxuan Zong, Benjamin Piwowarski, Laure Soulier||In Information Retrieval, and more generally in Natural Language Processing, adapting models to specific domains is conducted through fine-tuning. Despite the successes achieved by this method and its versatility, the need for human-curated and labeled data makes it impractical to transfer to new tasks, domains, and/or languages when training data doesn't exist. Using the model without training (zero-shot) is another option that however suffers an effectiveness cost, especially in the case of first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address this issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. Despite their relatively good generalization ability, we show that even sparse retrievers can benefit from our simple domain adaptation method.|在信息检索中,以及更广泛的自然语言处理领域,使模型适应特定领域是通过微调来实现的。尽管这种方法取得了成功且通用性强,但其对人工整理和标注数据的需求,使得在不存在训练数据时难以迁移到新的任务、领域和/或语言。不经训练直接使用模型(零样本)是另一种选择,但会带来有效性损失,尤其是对于第一阶段检索器。为了解决这些问题,出现了许多研究方向,其中大多数是在适应某项任务或某种语言的背景下。然而,关于领域(或主题)适应的文献较少。在本文中,我们通过移植一种最初为语言适应设计的方法,来解决稀疏第一阶段检索器的跨主题差异问题。通过利用在目标数据上的预训练来学习领域特定知识,该技术减轻了对标注数据的需求,并扩大了领域适应的范围。尽管稀疏检索器具有相对较好的泛化能力,我们表明它们同样可以从我们简单的领域适应方法中受益。|code|0|
|Selma: A Semantic Local Code Search Platform|Anja Reusch, Guilherme C. Lopes, Wilhelm Pertsch, Hannes Ueck, Julius Gonsior, Wolfgang Lehner|Tech Univ Dresden, Dresden Database Syst Grp, Dresden, Germany|Searching for the right code snippet is cumbersome and not a trivial task. Online platforms such as Github.com or searchcode.com provide tools to search, but they are limited to publicly available and internet-hosted code. However, during the development of research prototypes or confidential tools, it is preferable to store source code locally. Consequently, the use of external code search tools becomes impractical. Here, we present Selma (Code and Videos: https://anreu.github.io/selma ): a local code search platform that enables term-based and semantic retrieval of source code. Selma searches code and comments, annotates undocumented code to enable term-based search in natural language, and trains neural models for code retrieval.|寻找合适的代码片段是一项繁琐且非易事。在线平台如Github.com或searchcode.com提供了搜索工具,但这些工具仅限于搜索公开且托管在互联网上的代码。然而,在研究原型或机密工具的开发过程中,更倾向于将源代码存储在本地。因此,使用外部代码搜索工具变得不切实际。在此,我们介绍Selma(代码和视频:https://anreu.github.io/selma):一个本地代码搜索平台,支持基于术语和语义的源代码检索。Selma能够搜索代码和注释,对未记录代码进行注释以支持自然语言的基于术语的搜索,并训练用于代码检索的神经网络模型。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Selma:+A+Semantic+Local+Code+Search+Platform)|0|
|FAR-AI: A Modular Platform for Investment Recommendation in the Financial Domain|Javier SanzCruzado, Edward Richards, Richard McCreadie|Univ Glasgow, Glasgow, Lanark, Scotland|Financial asset recommendation (FAR) is an emerging sub-domain of the wider recommendation field that is concerned with recommending suitable financial assets to customers, with the expectation that those customers will invest capital into a subset of those assets. FAR is a particularly interesting sub-domain to explore, as unlike traditional movie or product recommendation, FAR solutions need to analyse and learn from a combination of time-series pricing data, company fundamentals, social signals and world events, relating the patterns observed to multi-faceted customer representations comprising profiling information, expectations and past investments. In this demo we will present a modular FAR platform; referred to as FAR-AI, with the goal of raising awareness and building a community around this emerging domain, as well as illustrate the challenges, design considerations and new research directions that FAR offers. The demo will comprise two components: 1) we will present the architecture of FAR-AI to attendees, to enable them to understand the how's and the why's of developing a FAR system; and 2) a live demonstration of FAR-AI as a customer-facing product, highlighting the differences in functionality between FAR solutions and traditional recommendation scenarios. The demo is supplemented by online-tutorial materials, to enable attendees new to this space to get practical experience with training FAR models. VIDEO URL.|金融资产推荐(Financial Asset Recommendation, FAR)是更广泛的推荐领域中的一个新兴子领域,其核心任务是为客户推荐合适的金融资产,并期望这些客户将资金投资于这些资产的一部分。FAR是一个特别值得探索的子领域,因为与传统的电影或商品推荐不同,FAR解决方案需要分析和学习时间序列定价数据、公司基本面、社交信号和全球事件的组合,并将观察到的模式与包含客户画像信息、预期和过去投资的多维度客户表征相关联。在本次演示中,我们将展示一个模块化的FAR平台,称为FAR-AI,旨在提高对这一新兴领域的认识并围绕其构建社区,同时展示FAR所面临的挑战、设计考虑因素以及新的研究方向。演示将包括两个部分:1)我们将向与会者介绍FAR-AI的架构,帮助他们理解开发FAR系统的“如何”与“为何”;2)FAR-AI作为面向客户的产品的现场演示,突出FAR解决方案与传统推荐场景在功能上的差异。演示还辅以在线教程材料,以便让初次接触该领域的与会者获得训练FAR模型的实践经验。视频链接:[VIDEO URL]。|code|0|
|Semantic Content Search on IKEA.com|Mateusz Slominski, Ezgi Yildirim, Martin Tegner|Ingka Grp, IKEA Retail, Leiden, Netherlands|In this paper, we present an approach to content search. The aim is to increase customer engagement with content recommendations on IKEA.com. As an alternative to Boolean search, we introduce a method based on semantic textual similarity between content pages and search queries. Our approach improves the relevance of search results by a 2.95% increase in click-through rate in an online A/B test.|本文提出了一种内容搜索方法,旨在通过内容推荐提升用户在宜家官网(IKEA.com)上的参与度。作为布尔搜索的替代方案,我们引入了一种基于内容页面与搜索查询之间语义文本相似度的方法。通过在线A/B测试,我们的方法使搜索结果的点击率提高了2.95%,从而显著提升了搜索结果的相关性。|code|0|
|Semantic Search in Archive Collections Through Interpretable and Adaptable Relation Extraction About Person and Places|Nicolas Gutehrlé|Univ Franche Comte, CRIT, F-25000 Besancon, France|In recent years, libraries and archives have undertaken numerous campaigns to digitise their collections. While these campaigns have increased ease of access to archival documents for a wider audience, ensuring discoverability and promoting their content remain significant challenges. Digitised documents are often unstructured, making them difficult to navigate. Accessing archive materials through search engines restricts users to keyword-based queries, leading to being overwhelmed by irrelevant documents. To enhance the exploration and exploitation of the "Big Data of the Past" [15], it is imperative to structure textual content.|近年来,图书馆和档案馆开展了大量的数字化活动,将其收藏品数字化。尽管这些活动使得更广泛的受众能够更方便地访问档案文件,但确保其可发现性并推广其内容仍然是重大挑战。数字化文件通常是非结构化的,这使得它们难以浏览。通过搜索引擎访问档案材料限制了用户只能进行基于关键词的查询,从而导致用户被大量不相关的文件所淹没。为了增强对“过去的大数据”[15]的探索和利用,对文本内容进行结构化处理是至关重要的。|code|0|
|Reproduction and Simulation of Interactive Retrieval Experiments|Jana Isabelle Friese|Univ Duisburg Essen, Duisburg, Germany|The reproducibility crisis, spanning across various scientific fields, substantially affects information retrieval research [1].|跨越多学科领域的可重复性危机对信息检索研究产生了重大影响[1]。|code|0|
|Efficient Multi-vector Dense Retrieval with Bit Vectors|Franco Maria Nardini, Cosimo Rulli, Rossano Venturini|CNR, ISTI, Pisa, Italy; Univ Pisa, Pisa, Italy|Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations compute the relevance of a passage w.r.t. a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes “Efficient Multi-Vector dense retrieval with Bit vectors” (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.|密集检索技术利用预训练的大型语言模型来构建查询和段落的高维表示。这些表示通过高效的相似度度量来计算段落与查询的相关性。在这一领域,多向量表示通过在每词元级别上编码查询和文档,提高了检索效果,但代价是内存占用和查询延迟增加了一个数量级。最近,PLAID通过引入基于质心的词项表示来解决这些问题,从而减少了多向量系统的内存影响。通过利用质心交互机制,PLAID过滤掉了不相关的文档,从而降低了后续排序阶段的成本。本文提出了“基于位向量的高效多向量密集检索”(EMVB),这是一种用于多向量密集检索中高效查询处理的新框架。首先,EMVB使用优化的位向量对段落进行高效的预过滤。其次,质心交互的计算按列进行,利用SIMD指令,从而减少了延迟。第三,EMVB利用乘积量化(PQ)来减少存储向量表示的内存占用,同时允许快速的后期交互。第四,我们引入了一种每文档词项过滤方法,进一步提高了最后一步的效率。在MS MARCO和LoTTE上的实验表明,与PLAID相比,EMVB的速度提高了2.8倍,同时减少了1.8倍的内存占用,且检索精度没有损失。|code|0|
|Prompt-Based Generative News Recommendation (PGNR): Accuracy and Controllability|Xinyi Li, Yongfeng Zhang, Edward C. Malthouse|Rutgers State Univ, Piscataway, NJ USA; Northwestern Univ, Evanston, IL 60208 USA|Online news platforms often use personalized news recommendation methods to help users discover articles that align with their interests. These methods typically predict a matching score between a user and a candidate article to reflect the user's preference for the article. Given that articles contain rich textual information, current news recommendation systems (RS) leverage natural language processing (NLP) techniques, including the attention mechanism, to capture users' interests based on their historical behaviors and comprehend article content. However, these existing model architectures are usually task-specific and require redesign to adapt to additional features or new tasks. Motivated by the substantial progress in pre-trained large language models for semantic understanding and prompt learning, which involves guiding output generation using pre-trained language models, this paper proposes Prompt-based Generative News Recommendation (PGNR). This approach treats personalized news recommendation as a text-to-text generation task and designs personalized prompts to adapt to the pre-trained language model, taking the generative training and inference paradigm that directly generates the answer for recommendation. Experimental studies using the Microsoft News dataset show that PGNR is capable of making accurate recommendations by taking into account various lengths of past behaviors of different users. It can also easily integrate new features without changing the model architecture and the training loss function. Additionally, PGNR can make recommendations based on users' specific requirements, allowing more straightforward human-computer interaction for news recommendation.|在线新闻平台通常采用个性化新闻推荐方法,以帮助用户发现与其兴趣相符的文章。这些方法通常预测用户与候选文章之间的匹配分数,以反映用户对文章的偏好。鉴于文章包含丰富的文本信息,当前的新闻推荐系统(RS)利用自然语言处理(NLP)技术,包括注意力机制,基于用户的历史行为捕捉用户兴趣并理解文章内容。然而,这些现有的模型架构通常是任务特定的,需要重新设计以适应额外的特征或新任务。受预训练大语言模型在语义理解和提示学习(即使用预训练语言模型引导输出生成)方面取得的显著进展的启发,本文提出了基于提示的生成式新闻推荐(Prompt-based Generative News Recommendation, PGNR)。该方法将个性化新闻推荐视为文本到文本的生成任务,并设计个性化提示以适应预训练语言模型,采用生成式训练和推理范式,直接生成推荐答案。使用微软新闻数据集进行的实验研究表明,PGNR能够通过考虑不同用户的各种历史行为长度来做出准确的推荐。它还可以轻松集成新特征,而无需改变模型架构和训练损失函数。此外,PGNR能够根据用户的特定需求进行推荐,从而实现更直接的人机交互以进行新闻推荐。|code|0|
|CaseGNN: Graph Neural Networks for Legal Case Retrieval with Text-Attributed Graphs|Yanran Tang, Ruihong Qiu, Yilun Liu, Xue Li, Zi Huang||Legal case retrieval is an information retrieval task in the legal domain, which aims to retrieve relevant cases with a given query case. Recent research of legal case retrieval mainly relies on traditional bag-of-words models and language models. Although these methods have achieved significant improvement in retrieval accuracy, there are still two challenges: (1) Legal structural information neglect. Previous neural legal case retrieval models mostly encode the unstructured raw text of case into a case representation, which causes the lack of important legal structural information in a case and leads to poor case representation; (2) Lengthy legal text limitation. When using the powerful BERT-based models, there is a limit of input text lengths, which inevitably requires to shorten the input via truncation or division with a loss of legal context information. In this paper, a graph neural networks-based legal case retrieval model, CaseGNN, is developed to tackle these challenges. To effectively utilise the legal structural information during encoding, a case is firstly converted into a Text-Attributed Case Graph (TACG), followed by a designed Edge Graph Attention Layer and a readout function to obtain the case graph representation. The CaseGNN model is optimised with a carefully designed contrastive loss with easy and hard negative sampling. Since the text attributes in the case graph come from individual sentences, the restriction of using language models is further avoided without losing the legal context. Extensive experiments have been conducted on two benchmarks from COLIEE 2022 and COLIEE 2023, which demonstrate that CaseGNN outperforms other state-of-the-art legal case retrieval methods. The code has been released on https://github.com/yanran-tang/CaseGNN.|法律案例检索是法律领域的一项信息检索任务,其目的是针对给定的查询案例检索相关案例。目前法律案例检索的研究主要依赖传统的词袋模型和语言模型。虽然这些方法在检索精度上取得了显著进步,但仍存在两个挑战:(1)法律结构信息被忽视。以往的神经法律案例检索模型大多将非结构化的原始案例文本编码为案例表示,导致案例中重要的法律结构信息缺失,案例表示效果不佳;(2)法律文本过长的限制。在使用强大的基于 BERT 的模型时,输入文本长度受到限制,这不可避免地需要通过截断或分割来缩短输入,从而丢失法律上下文信息。本文提出了一种基于图神经网络的法律案例检索模型 CaseGNN 来应对这些挑战。为了在编码过程中有效利用法律结构信息,首先将案例转换为文本属性案例图(TACG),然后通过设计的边图注意力层和读出函数得到案例图表示。CaseGNN 模型通过精心设计的对比损失以及简单负样本和困难负样本采样进行优化。由于案例图中的文本属性来自单个句子,因此在不丢失法律上下文的前提下,进一步规避了语言模型的使用限制。在 COLIEE 2022 和 COLIEE 2023 的两个基准上进行了大量实验,结果表明 CaseGNN 优于其他最先进的法律案例检索方法。代码已发布于 https://github.com/yanran-tang/CaseGNN。|code|0|
|Context-Driven Interactive Query Simulations Based on Generative Large Language Models|Björn Engelmann, Timo Breuer, Jana Isabelle Friese, Philipp Schaer, Norbert Fuhr||Simulating user interactions enables a more user-oriented evaluation of information retrieval (IR) systems. While user simulations are cost-efficient and reproducible, many approaches often lack fidelity regarding real user behavior. Most notably, current user models neglect the user's context, which is the primary driver of perceived relevance and the interactions with the search results. To this end, this work introduces the simulation of context-driven query reformulations. The proposed query generation methods build upon recent Large Language Model (LLM) approaches and consider the user's context throughout the simulation of a search session. Compared to simple context-free query generation approaches, these methods show better effectiveness and allow the simulation of more efficient IR sessions. Similarly, our evaluations consider more interaction context than current session-based measures and reveal interesting complementary insights in addition to the established evaluation protocols. We conclude with directions for future work and provide an entirely open experimental setup.|通过模拟用户交互,可以对信息检索(IR)系统进行更加面向用户的评估。虽然用户模拟具有成本效益和可重复性,但许多方法在真实用户行为的保真度方面有所欠缺。最值得注意的是,当前的用户模型忽视了用户的上下文,而上下文是感知相关性以及与搜索结果交互的主要驱动因素。为此,本文引入了上下文驱动的查询改写模拟。所提出的查询生成方法建立在最新的大型语言模型(LLM)方法之上,并在整个搜索会话的模拟过程中考虑用户的上下文。与简单的上下文无关的查询生成方法相比,这些方法显示出更好的有效性,并允许模拟更高效的 IR 会话。同样,我们的评估比当前基于会话的度量考虑了更多的交互上下文,并在既定评估协议之外揭示了有趣的互补见解。我们以未来工作的方向作结,并提供了一个完全开放的实验设置。|code|0|
|Emotional Insights for Food Recommendations|Mehrdad Rostami, Ali Vardasbi, Mohammad Aliannejadi, Mourad Oussalah|Univ Amsterdam, Informat Retrieval Lab, Amsterdam, Netherlands; Univ Oulu, Ctr Machine Vis & Signal Anal, Oulu, Finland|Food recommendation systems have become pivotal in offering personalized suggestions, enabling users to discover recipes in line with their tastes. However, despite the existence of numerous such systems, there are still unresolved challenges. Much of the previous research predominantly lies on users' past preferences, neglecting the significant aspect of discerning users' emotional insights. Our framework aims to bridge this gap by pioneering emotion-aware food recommendation. The study strives for enhanced accuracy by delivering recommendations tailored to a broad spectrum of emotional and dietary behaviors. Uniquely, we introduce five novel scores for Influencer-Followers, Visual Motivation, Adventurous, Health and Niche to gauge a user's inclination toward specific emotional insights. Subsequently, these indices are used to re-rank the preliminary recommendation, placing a heightened focus on the user's emotional disposition. Experimental results on a real-world food social network dataset reveal that our system outperforms alternative emotion-unaware recommender systems, yielding an average performance boost of roughly 6%. Furthermore, the results reveal a rise of over 30% in accuracy metrics for some users exhibiting particular emotional insights.|食品推荐系统在提供个性化建议方面变得至关重要,使用户能够根据自己的口味发现食谱。然而,尽管存在许多这样的系统,仍有一些未解决的挑战。以往的研究大多主要依赖于用户过去的偏好,忽视了识别用户情感洞察的重要方面。我们的框架旨在通过开创情感感知的食品推荐来弥合这一差距。该研究通过提供针对广泛情感和饮食行为的推荐,力求提高准确性。我们独特地引入了五个新的评分指标:影响力-追随者、视觉动机、冒险性、健康和利基,以衡量用户对特定情感洞察的倾向。随后,这些指标被用于重新排序初步推荐,更加关注用户的情感倾向。在实际食品社交网络数据集上的实验结果表明,我们的系统优于其他不考虑情感的推荐系统,平均性能提升了约6%。此外,结果显示,对于一些表现出特定情感洞察的用户,准确率指标提升了超过30%。|code|0|
|LaQuE: Enabling Entity Search at Scale|Negar Arabzadeh, Amin Bigdeli, Ebrahim Bagheri|Univ Waterloo, Waterloo, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Entity search plays a crucial role in various information access domains, where users seek information about specific entities. Despite significant research efforts to improve entity search methods, the availability of large-scale resources and extensible frameworks has been limiting progress. In this work, we present LaQuE (Large-scale Queries for Entity search), a curated framework for entity search, which includes a reproducible and extensible code base as well as a large relevance judgment collection consisting of real-user queries based on the ORCAS collection. LaQuE is industry-scale and suitable for training complex neural models for entity search. We develop methods for curating and judging entity collections, as well as training entity search methods based on LaQuE. We additionally establish strong baselines within LaQuE based on various retrievers, including traditional bag-of-words-based methods and neural-based models. We show that training neural entity search models on LaQuE enhances retrieval effectiveness compared to the state-of-the-art. Additionally, we categorize the released queries in LaQuE based on their popularity and difficulty, encouraging research on more challenging queries for the entity search task. We publicly release LaQuE at https://github.com/Narabzad/LaQuE .|实体搜索在各种信息获取领域中扮演着至关重要的角色,用户通过它来寻找特定实体的信息。尽管已有大量研究致力于改进实体搜索方法,但大规模资源和可扩展框架的缺乏一直限制着这一领域的进展。在本研究中,我们提出了LaQuE(大规模实体搜索查询),这是一个精心策划的实体搜索框架,它包括一个可复现和可扩展的代码库,以及一个基于ORCAS集合的真实用户查询的大规模相关性判断集合。LaQuE具有工业规模,适合训练复杂的神经模型以进行实体搜索。我们开发了策划和判断实体集合的方法,以及基于LaQuE训练实体搜索方法的技术。此外,我们在LaQuE中建立了强大的基线,包括传统的基于词袋的方法和基于神经网络的模型。我们展示了在LaQuE上训练神经实体搜索模型相较于现有技术提高了检索效果。此外,我们根据查询的流行度和难度对LaQuE中发布的查询进行了分类,鼓励对更具挑战性的实体搜索任务查询进行研究。我们在https://github.com/Narabzad/LaQuE 上公开了LaQuE。|code|0|
|Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models|Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen||Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding target words such as true. Since such possibilities have not yet been considered in retrieval evaluation, we analyze the impact of query-independent prompt injection via manually constructed templates and LLM-based rewriting of documents on several existing relevance models. Our experiments on the TREC Deep Learning track show that adversarial documents can easily manipulate different sequence-to-sequence relevance models, while BM25 (as a typical lexical model) is not affected. Remarkably, the attacks also affect encoder-only relevance models (which do not rely on natural language prompt tokens), albeit to a lesser extent.|现代的序列到序列相关性模型(如 monoT5)可以通过交叉编码有效捕获查询和文档之间复杂的文本交互。然而,在提示中使用自然语言词元(例如 monoT5 的 Query、Document 和 Relevant),为恶意文档打开了一个攻击向量:它们可以通过提示注入(例如添加 true 等目标词)操纵自己的相关性得分。由于检索评估中尚未考虑这种可能性,我们通过手工构建的模板和基于 LLM 的文档重写,分析了与查询无关的提示注入对几种现有相关性模型的影响。我们在 TREC Deep Learning 赛道上的实验表明,对抗性文档可以轻易操纵不同的序列到序列相关性模型,而 BM25(作为典型的词汇模型)不受影响。值得注意的是,这些攻击也会影响仅编码器的相关性模型(它们并不依赖自然语言提示词元),尽管影响程度较小。|code|0|
|Two-Step SPLADE: Simple, Efficient and Effective Approximation of SPLADE|Carlos Lassance, Hervé Déjean, Stéphane Clinchant, Nicola Tonellotto||Learned sparse models such as SPLADE have successfully shown how to incorporate the benefits of state-of-the-art neural information retrieval models into the classical inverted index data structure. Despite their improvements in effectiveness, learned sparse models are not as efficient as classical sparse model such as BM25. The problem has been investigated and addressed by recently developed strategies, such as guided traversal query processing and static pruning, with different degrees of success on in-domain and out-of-domain datasets. In this work, we propose a new query processing strategy for SPLADE based on a two-step cascade. The first step uses a pruned and reweighted version of the SPLADE sparse vectors, and the second step uses the original SPLADE vectors to re-score a sample of documents retrieved in the first stage. Our extensive experiments, performed on 30 different in-domain and out-of-domain datasets, show that our proposed strategy is able to improve mean and tail response times over the original single-stage SPLADE processing by up to 30× and 40×, respectively, for in-domain datasets, and by 12x to 25x, for mean response on out-of-domain datasets, while not incurring in statistical significant difference in 60% of datasets.|像 SPLADE 这样的学习稀疏模型已经成功展示了如何将最先进的神经信息检索模型的优点融入经典的倒排索引数据结构。尽管有效性有所提高,学习稀疏模型的效率不如 BM25 等经典稀疏模型。最近开发的策略(如引导遍历查询处理和静态剪枝)已经研究并处理了该问题,在域内和域外数据集上取得了不同程度的成功。本文提出了一种新的基于两步级联的 SPLADE 查询处理策略:第一步使用经过剪枝和重新加权的 SPLADE 稀疏向量,第二步使用原始 SPLADE 向量对第一阶段检索到的文档样本进行重新评分。我们在 30 个不同的域内和域外数据集上进行的大量实验表明,与原始的单阶段 SPLADE 处理相比,所提策略在域内数据集上可将平均响应时间和尾部响应时间分别改善至多 30 倍和 40 倍,在域外数据集上可将平均响应时间改善 12 至 25 倍,同时在 60% 的数据集上未产生统计显著的效果差异。|code|0|
|Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers|Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke||Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely directly compared to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments. We empirically run experiments on how information retrieval relevance judgments can be utilized as an anchor to evaluating the generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.|大型语言模型现在可以在不引用外部来源的情况下直接生成许多事实性问题的答案。遗憾的是,对于评估这些答案的质量和正确性、比较不同模型的表现或比较不同提示的效果,人们的关注相对较少。此外,生成答案的质量也很少与检索到的答案的质量直接比较。随着模型的演进和提示的修改,如果不依赖昂贵的人工判断,我们没有系统的方法来衡量改进。为了解决这个问题,我们将标准检索基准调整用于评估大型语言模型生成的答案。受用于摘要评估的 BERTScore 指标的启发,我们探索了两种方法。在第一种方法中,我们以基准的相关性判断为基础进行评估,并通过实验研究如何将信息检索相关性判断用作评估生成答案的锚点。在第二种方法中,我们将生成的答案与一组多样化的检索模型(从传统方法到先进方法)检索到的最优结果进行比较,使我们无需人工判断即可衡量改进。在这两种情况下,我们都度量生成答案的嵌入表示与检索基准中已知(或假定)相关段落的嵌入表示之间的相似度。|code|0|
|Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control|Thong Nguyen, Mariya Hendriksen, Andrew Yates, Maarten de Rijke||Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application in multimodal retrieval remains underexplored. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors. We address issues of high dimension co-activation and semantic deviation through a new training algorithm, using Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models with a shorter training time and lower GPU memory requirements. Our approach offers an effective solution for training LSR retrieval models in multimodal settings. Our code and model checkpoints are available at github.com/thongnt99/lsr-multimodal|学习型稀疏检索(LSR)是一类将查询和文档编码为稀疏词汇向量的神经方法,这些向量可以通过倒排索引高效地进行索引和检索。我们探讨了 LSR 在多模态领域的应用,重点研究文本-图像检索。虽然 LSR 在文本检索方面取得了成功,但其在多模态检索中的应用仍有待探索。目前的方法(如 LexLIP 和 STAIR)需要在海量数据集上进行复杂的多步训练。我们提出的方法能够高效地将冻结的稠密模型产生的稠密向量转换为稀疏词汇向量。我们通过一种新的训练算法,利用伯努利随机变量控制查询扩展,解决了高维共激活和语义偏移的问题。在两个稠密模型(BLIP、ALBEF)和两个数据集(MSCOCO、Flickr30k)上的实验表明,该算法有效地减少了共激活和语义偏移。我们性能最好的稀疏化模型以更短的训练时间和更低的 GPU 内存需求超越了最先进的文本-图像 LSR 模型。该方法为在多模态环境下训练 LSR 检索模型提供了有效的解决方案。我们的代码和模型检查点可在 github.com/thongnt99/lsr-multimodal 获取。|code|0|
|Alleviating Confounding Effects with Contrastive Learning in Recommendation|Di You, Kyumin Lee|Worcester Polytech Inst, Worcester, MA 01609 USA|Recently, there has been a growing interest in mitigating the bias effects in recommendations using causal inference. However, Rubin's potential outcome framework may produce inaccurate estimates in real-world scenarios due to the presence of hidden confounders. In addition, existing works adopting the Pearl causal graph framework tend to focus on specific types of bias (e.g., selection bias, popularity bias, exposure bias) instead of directly mitigating the impact of hidden confounders. Motivated by the aforementioned limitations, in this paper, we formulate the recommendation task as a causal graph with unobserved/unmeasurable confounders. We present a novel causality-based architecture called Multi-behavior Debiased Contrastive Collaborative Filtering (MDCCL) and apply the front-door adjustment for intervention. We leverage a pre-like behavior such as clicking an item (i.e., a behavior occurred before the target behavior such as purchasing) to mitigate the bias effects. Additionally, we design a contrastive loss that also provides a debiasing effect benefiting the recommendation. An empirical study on three real-world datasets validates that our proposed method successfully outperforms nine state-of-the-art baselines. Code and the datasets will be available at https://github.com/queenjocey/MDCCL .|近年来,利用因果推理来减轻推荐系统中的偏见效应引起了越来越多的关注。然而,Rubin的潜在结果框架在实际场景中可能会由于存在隐藏的混杂因素而产生不准确的估计。此外,现有采用Pearl因果图框架的研究往往侧重于特定类型的偏见(例如选择偏差、流行度偏差、曝光偏差),而不是直接减轻隐藏混杂因素的影响。基于上述局限性,本文提出将推荐任务建模为包含未观测/未测量混杂因素的因果图。我们提出了一种新颖的基于因果关系的架构,称为多行为去偏对比协同过滤(MDCCL),并应用前门调整进行干预。我们利用诸如点击商品(即在目标行为如购买之前发生的行为)等预喜欢行为来减轻偏见效应。此外,我们设计了一种对比损失函数,该函数也提供了有助于推荐系统的去偏效果。在三个真实世界数据集上的实证研究表明,我们提出的方法成功超越了九种最先进的基线方法。代码和数据集将在https://github.com/queenjocey/MDCCL 上提供。|code|0|
|Align MacridVAE: Multimodal Alignment for Disentangled Recommendations|Ignacio Avas, Liesbeth Allein, Katrien Laenen, Marie-Francine Moens|Katholieke Univ Leuven, Dept Comp Sci, Leuven, Belgium|Explaining why items are recommended to users is challenging, especially when these items are described by multimodal data. Most recommendation systems fail to leverage more than one modality, preferring textual or tabular data. In this work, a new model, Align MacridVAE, that considers the complementarity of visual and textual item descriptions for item recommendation is proposed. This model projects both modalities onto a shared latent space, and a dedicated loss function aligns the text and image of the same item. The aspects of the item are then jointly disentangled for both modalities at a macro level to learn interpretable categorical information about items and at a micro level to model user preferences on each of those categories. Experiments are conducted on six item recommendation datasets, and recommendation performance is compared against multiple baseline methods. The results demonstrate that our model increases recommendation accuracy by 18% in terms of NDCG on average in the studied datasets and allows us to visualise user preference by item aspect across modalities and the learned concept allocation (The code implementation is available at https://github.com/igui/Align-MacridVAE ).|解释为何向用户推荐某些项目是具有挑战性的,尤其是当这些项目由多模态数据描述时。大多数推荐系统未能利用超过一种模态,而是偏好文本或表格数据。在这项工作中,我们提出了一种新模型Align MacridVAE,该模型考虑了视觉和文本项目描述的互补性来进行项目推荐。该模型将两种模态投影到一个共享的潜在空间中,并通过专门的损失函数对齐同一项目的文本和图像。然后,在宏观层面上对项目的各个方面进行联合解缠,以学习关于项目的可解释的类别信息;在微观层面上,对用户在每个类别上的偏好进行建模。我们在六个项目推荐数据集上进行了实验,并将推荐性能与多种基线方法进行了比较。结果表明,我们的模型在研究的数据集上平均将推荐准确率提高了18%(以NDCG衡量),并允许我们通过跨模态的项目方面和学习的概念分配来可视化用户偏好(代码实现可在https://github.com/igui/Align-MacridVAE获取)。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Align+MacridVAE:+Multimodal+Alignment+for+Disentangled+Recommendations)|0|
|Learning Action Embeddings for Off-Policy Evaluation|Matej Cief, Jacek Golebiowski, Philipp Schmidt, Ziawasch Abedjan, Artur Bekasov||Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. OPE is a viable alternative to running expensive online A/B tests: it can speed up the development of new policies, and reduces the risk of exposing customers to suboptimal treatments. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims (arXiv:2202.06317v2 [cs.LG]) propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. 
The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.|非策略评估(OPE)方法允许我们使用由另一个策略收集的日志数据来计算某个策略的预期回报。相对于运行昂贵的在线 A/B 测试,OPE 是一种可行的替代方案:它可以加快新策略的开发,并降低将客户暴露于次优处理方案的风险。然而,当动作数量很大,或者某些动作未被日志策略充分探索时,现有的基于逆倾向评分(IPS)的估计量可能具有很高甚至无限的方差。Saito 和 Joachims(arXiv:2202.06317v2 [cs.LG])提出了使用动作嵌入的边缘化 IPS(MIPS),以降低大动作空间中 IPS 的方差。MIPS 假设从业者能够定义良好的动作嵌入,而这在许多实际应用中难以做到。在这项工作中,我们探索从日志数据中学习动作嵌入。特别地,我们使用训练好的奖励模型的中间输出来定义 MIPS 的动作嵌入。这种方法将 MIPS 扩展到更多应用场景,并且在我们的实验中,无论在合成数据还是真实数据上,其表现都优于使用预定义嵌入的 MIPS 以及标准基线。我们的方法不对奖励模型的类别做任何假设,并支持使用额外的动作信息来进一步改进估计。所提出的方法为结合 DM 的低方差与 IPS 的低偏差提供了一个有吸引力的 DR 替代方案。|code|0|
|Simulated Task Oriented Dialogues for Developing Versatile Conversational Agents|Xi Wang, Procheta Sen, Ruizhe Li, Emine Yilmaz|Univ Liverpool, Liverpool, Merseyside, England; Univ Aberdeen, Aberdeen, Scotland; Univ Coll London, London, England|Task-Oriented Dialogue (TOD) Systems are increasingly important for managing a variety of daily tasks, yet often underperform in unfamiliar scenarios due to limitations in existing training datasets. This study addresses the challenge of generating robust and versatile TOD systems by transforming instructional task descriptions into natural user-system dialogues to serve as enhanced pre-training data. We explore three strategies for synthetic dialogue generation: crowdsourcing, encoder-decoder models, and in-context learning with large language models. The evaluation of these approaches, based on a comprehensive user study employing 10 different metrics, reveals the top quality of the dialogues generated by learning an encoder-decoder model as per human evaluation. Notably, employing this synthetic dialogue further improves the performance of advanced TOD models, especially in unfamiliar domains, with improvements spanning 5.5% to as much as 20.9% in combined evaluation scores. Our findings advocate for the use of specialised, task-oriented knowledge bases and step-wise dialogue generation techniques to advance the capabilities and generalizability of TOD systems.|面向任务的对话系统(Task-Oriented Dialogue, TOD)在管理各种日常任务中变得越来越重要,但在不熟悉的场景中往往表现不佳,原因是现有训练数据集的局限性。本研究通过将任务指令描述转化为自然的用户-系统对话,以作为增强的预训练数据,来解决生成鲁棒且多功能的TOD系统的挑战。我们探索了三种合成对话生成策略:众包、编码器-解码器模型以及利用大语言模型进行上下文学习。基于采用10种不同指标的综合用户研究对这些方法进行评估,结果表明,根据人类评估,学习编码器-解码器模型生成的对话质量最高。值得注意的是,使用这种合成对话进一步提升了先进TOD模型的性能,尤其是在不熟悉的领域中,综合评估分数提升了5.5%至20.9%。我们的研究结果支持使用专门的、面向任务的知识库和分步对话生成技术,以提升TOD系统的能力和泛化性。|code|0|
|Hypergraphs with Attention on Reviews for Explainable Recommendation|Theis E. Jendal, TrungHoang Le, Hady W. Lauw, Matteo Lissandrini, Peter Dolog, Katja Hose|Singapore Management Univ, Singapore, Singapore; Aalborg Univ, Aalborg, Denmark|Given a recommender system based on reviews, the challenges are how to effectively represent the review data and how to explain the produced recommendations. We propose a novel review-specific Hypergraph (HG) model, and further introduce a model-agnostic explainability module. The HG model captures high-order connections between users, items, aspects, and opinions while maintaining information about the review. The explainability module can use the HG model to explain a prediction generated by any model. We propose a path-restricted review-selection method biased by the user preference for item reviews and propose a novel explanation method based on a review graph. Experiments on real-world datasets confirm the ability of the HG model to capture appropriate explanations.|在一个基于评论的推荐系统中,面临的挑战是如何有效地表示评论数据以及如何解释生成的推荐结果。我们提出了一种新颖的针对评论的超图(HG)模型,并进一步引入了一个与模型无关的可解释性模块。HG模型能够捕捉用户、物品、方面和观点之间的高阶关系,同时保留评论的信息。可解释性模块可以利用HG模型来解释任何模型生成的预测结果。我们提出了一种基于用户对物品评论偏好的路径受限评论选择方法,并提出了一种基于评论图的新型解释方法。在真实数据集上的实验验证了HG模型在捕捉适当解释方面的能力。|code|0|
|Investigating the Usage of Formulae in Mathematical Answer Retrieval|Anja Reusch, Julius Gonsior, Claudio Hartmann, Wolfgang Lehner|Tech Univ Dresden, Dresden Database Res Grp, Dresden, Germany|This work focuses on the task of Mathematical Answer Retrieval and studies the factors a recent Transformer-Encoder-based Language Model (LM) uses to assess the relevance of an answer for a given mathematical question. Mainly, we investigate three factors: (1) the general influence of mathematical formulae, (2) the usage of structural information of those formulae, (3) the overlap of variable names in answers and questions. The findings of the investigation indicate that the LM for Mathematical Answer Retrieval mainly relies on shallow features such as the overlap of variables between question and answers. Furthermore, we identified a malicious shortcut in the training data that hinders the usage of structural information and by removing this shortcut improved the overall accuracy. We want to foster future research on how LMs are trained for Mathematical Answer Retrieval and provide a basic evaluation set up (Link to repository: https://github.com/AnReu/math_analysis ) for existing models.|本研究聚焦于数学答案检索任务,探讨了基于Transformer-Encoder架构的最新语言模型(LM)在评估给定数学问题答案相关性时所使用的因素。我们主要研究了三个因素:(1) 数学公式的总体影响,(2) 这些公式结构信息的使用,(3) 答案与问题中变量名的重叠。研究结果表明,用于数学答案检索的语言模型主要依赖于浅层特征,如问题与答案之间变量的重叠。此外,我们发现训练数据中存在一个阻碍结构信息使用的恶意捷径,通过消除这一捷径,整体准确性得到了提升。我们希望推动未来关于如何训练语言模型进行数学答案检索的研究,并为现有模型提供一个基础的评估设置(仓库链接:https://github.com/AnReu/math_analysis)。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Investigating+the+Usage+of+Formulae+in+Mathematical+Answer+Retrieval)|0|
|Empowering Legal Citation Recommendation via Efficient Instruction-Tuning of Pre-trained Language Models|Jie Wang, Kanha Bansal, Ioannis Arapakis, Xuri Ge, Joemon M. Jose|Univ Glasgow, Glasgow, Lanark, Scotland; Telefon Res, Barcelona, Spain|The escalating volume of cases in legal adjudication has amplified the complexity of citing relevant regulations and authoritative cases, posing an increasing challenge for legal professionals. Current legal citation prediction methods, which are predominantly reliant on keyword or interest-based retrieval, are proving insufficient. In particular, Collaborative Filtering (CF) based legal recommendation methods exhibited low accuracy. In response to these challenges, we propose the Instruction GPT with Low-Rank Adaptation architecture (IGPT-LoRA), aiming to enhance the performance of legal citation recommendations and reduce computational demands by tuning Pre-trained Language Models (PLMs). IGPT-LoRA leverages prompting and efficient tuning strategies, thus offering a significant improvement over previous context-aware legal citation prediction methods. We design effective domain-specific instruction templates to guide the adaptation of PLMs for recommendation purposes, shedding light on the potential of prompt-based learning in the legal domain. Furthermore, we optimize the learning process with an efficient tuning layer - the Low-Rank Adaptation (LoRA) architecture - to bolster applicability. 
Experimental results on a real-world legal data set (BVA) demonstrate that IGPT-LoRA outperforms state-of-the-art methods, delivering substantial improvements in accuracy and also in training time and computational efficiency.|随着法律裁决案件数量的不断增加,引用相关法规和权威案例的复杂性也随之增加,这对法律专业人员提出了越来越大的挑战。当前的法律引用预测方法主要依赖于基于关键词或兴趣的检索,但这些方法已显示出不足。特别是基于协同过滤(CF)的法律推荐方法准确率较低。针对这些挑战,我们提出了基于低秩适应的指令GPT架构(IGPT-LoRA),旨在通过微调预训练语言模型(PLMs)来提高法律引用推荐的性能并减少计算需求。IGPT-LoRA利用提示和高效调优策略,从而显著改进了以往基于上下文的法律引用预测方法。我们设计了有效的领域特定指令模板,以指导PLMs的适应,用于推荐目的,揭示了基于提示的学习在法律领域的潜力。此外,我们通过一个高效的调优层——低秩适应(LoRA)架构——优化了学习过程,以增强适用性。在真实世界法律数据集(BVA)上的实验结果表明,IGPT-LoRA优于最先进的方法,在准确性、训练时间和计算效率方面都提供了显著的改进。|code|0|
|Fine-Tuning CLIP via Explainability Map Propagation for Boosting Image and Video Retrieval|Yoav Shalev, Lior Wolf|Tel Aviv Univ, Tel Aviv, Israel|Recent studies have highlighted the remarkable performance of CLIP for diverse downstream tasks. To understand how CLIP performs these tasks, various explainability methods have been formulated. In this paper, we reveal that the explainability maps associated with CLIP are often focused on a limited portion of the image and overlook objects that are explicitly mentioned in the text. This phenomenon may result in a high similarity score for incongruent image-text pairs, thereby potentially introducing a bias. To address this issue, we introduce a novel fine-tuning technique for CLIP that leverages a transformer explainability method. Unlike traditional approaches that generate a single heatmap using an image-text pair, our method produces multiple heatmaps directly from the image itself. We use these heatmaps both during the fine-tuning process and at inference time to highlight key visual elements, applying them to the features during the image encoding process, steering the visual encoder's attention toward these key elements. This process guides the image encoder across different spatial regions and generates a set of visual embeddings, thereby allowing the model to consider various aspects of the image, ensuring a detailed and comprehensive understanding that surpasses the limited scope of the original CLIP model. Our method leads to a notable improvement in text, image, and video retrieval across multiple benchmarks. 
It also results in reduced gender bias, making our model more equitable.|最近的研究强调了CLIP在各种下游任务中的卓越表现。为了理解CLIP如何执行这些任务,研究者提出了各种可解释性方法。在本文中,我们揭示了与CLIP相关的可解释性图通常集中在图像的有限部分,并忽略了文本中明确提到的对象。这种现象可能导致不协调的图像-文本对获得较高的相似度分数,从而可能引入偏差。为了解决这个问题,我们引入了一种新的CLIP微调技术,该技术利用了一种Transformer可解释性方法。与传统的使用图像-文本对生成单一热图的方法不同,我们的方法直接从图像本身生成多个热图。我们在微调过程和推理时使用这些热图来突出关键视觉元素,在图像编码过程中将它们应用于特征,引导视觉编码器的注意力转向这些关键元素。这一过程引导图像编码器关注不同的空间区域,并生成一组视觉嵌入,从而使模型能够考虑图像的各个方面,确保详细和全面的理解,超越原始CLIP模型的有限范围。我们的方法在多个基准测试中显著提高了文本、图像和视频检索的性能。它还减少了性别偏见,使我们的模型更加公平。|code|0|
|Cross-Modal Retrieval for Knowledge-Based Visual Question Answering|Paul Lerner, Olivier Ferret, Camille Guinaudeau||Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.|关于命名实体的基于知识的视觉问答是一项具有挑战性的任务,需要从多模态知识库中检索信息。命名实体具有多样的视觉表示,因此难以识别。我们认为,跨模态检索有助于弥合实体与其视觉描绘之间的语义鸿沟,并且首先与单模态检索形成互补。我们通过在最近的 ViQuAE、InfoSeek 和 Encyclopedic-VQA 数据集上使用多模态双编码器(即 CLIP)进行实验,提供了经验证据。此外,我们还研究了三种不同的模型微调策略:单模态、跨模态或联合训练。我们的方法结合了单模态和跨模态检索,在三个数据集上与十亿参数规模的模型相比具有竞争力,同时在概念上更简单、计算成本更低。|code|0|
|Learning to Jointly Transform and Rank Difficult Queries|Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri|Univ Waterloo, Waterloo, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Recent empirical studies have shown that while neural rankers exhibit increasingly higher retrieval effectiveness on tasks such as ad hoc retrieval, these improved performances are not experienced uniformly across the range of all queries. There are typically a large subset of queries that are not satisfied by neural rankers. These queries are often referred to as difficult queries. Given the fact that neural rankers operate based on the similarity between the embedding representations of queries and their relevant documents, the poor performance of difficult queries can be due to the sub-optimal representations learnt for difficult queries. As such, the objective of our work in this paper is to learn to rank documents and also transform query representations in tandem such that the representation of queries are transformed into one that shows higher resemblance to their relevant document. This way, our method will provide the opportunity to satisfy a large number of difficult queries that would otherwise not be addressed. In order to learn to jointly rank documents and transform queries, we propose to integrate two forms of triplet loss functions into neural rankers such that they ensure that each query is moved along the embedding space, through the transformation of its embedding representation, in order to be placed close to its relevant document(s). We perform experiments based on the MS MARCO passage ranking task and show that our proposed method has been able to show noticeable performance improvement for queries that were extremely difficult for existing neural rankers. 
On average, our approach has been able to satisfy 277 queries with an MRR@10 of 0.21 for queries that had a reciprocal rank of zero on the initial neural ranker.|最近的实证研究表明,尽管神经排序器在诸如即席检索等任务上表现出越来越高的检索效果,但这些性能提升并非在所有查询范围内均匀体现。通常存在一个较大的查询子集,这些查询无法被神经排序器有效满足。这些查询通常被称为困难查询。鉴于神经排序器基于查询及其相关文档的嵌入表示之间的相似性进行操作,困难查询表现不佳的原因可能是为这些查询学习到的表示不够理想。因此,本文工作的目标是学习文档排序的同时,同步转换查询表示,使得查询的表示能够转化为与其相关文档更相似的表示。通过这种方式,我们的方法将有机会满足大量原本无法处理的困难查询。为了学习联合排序文档和转换查询,我们提出将两种形式的三重态损失函数集成到神经排序器中,以确保每个查询通过其嵌入表示的转换在嵌入空间中移动,从而更接近其相关文档。我们在MS MARCO段落排序任务上进行了实验,结果表明,所提出的方法对于现有神经排序器极为困难的查询显示出显著的性能提升。平均而言,我们的方法能够满足277个查询,这些查询在初始神经排序器上的倒数排名为零,而我们的方法在MRR@10指标上达到了0.21。|code|0|
|Instant Answering in E-Commerce Buyer-Seller Messaging Using Message-to-Question Reformulation|Besnik Fetahu, Tejas Mehta, Qun Song, Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi||E-commerce customers frequently seek detailed product information for purchase decisions, commonly contacting sellers directly with extended queries. This manual response requirement imposes additional costs and disrupts buyers' shopping experience, with response time fluctuations ranging from hours to days. We seek to automate buyer inquiries to sellers in a leading e-commerce store using a domain-specific federated Question Answering (QA) system. The main challenge is adapting current QA systems, designed for single questions, to address detailed customer queries. We address this with a low-latency, sequence-to-sequence approach, MESSAGE-TO-QUESTION (M2Q). It reformulates buyer messages into succinct questions by identifying and extracting the most salient information from a message. Evaluation against baselines shows that M2Q yields a relative increase of 757% in answering rate from the federated QA system. Live deployment shows that automatic answering saves sellers from manually responding to millions of messages per year, and also accelerates customer purchase decisions by eliminating the need for buyers to wait for a reply.|电子商务客户经常为购买决策寻求详细的产品信息,通常会直接向卖家发送较长的询问消息。这种人工回复的要求增加了额外成本,且回复时间从数小时到数天不等,干扰了买家的购物体验。我们试图在一个领先的电子商务平台中,使用领域特定的联邦问答(QA)系统来自动回答买家向卖家提出的询问。主要挑战在于将为单个问题设计的现有 QA 系统调整为能够处理详细的客户询问。我们采用一种低延迟的序列到序列方法 MESSAGE-TO-QUESTION(M2Q)来解决这个问题:它通过识别并提取消息中最突出的信息,将买家消息改写为简洁的问题。针对基线的评估表明,M2Q 使联邦 QA 系统的应答率相对提高了757%。实际部署显示,自动回答使卖家免于每年人工回复数百万条消息,并通过消除买家等待回复的需要加快了客户的购买决策。|code|0|
|Towards Automated End-to-End Health Misinformation Free Search with a Large Language Model|Ronak Pradeep, Jimmy Lin|Univ Waterloo, David R Cheriton Sch Comp Sci, Waterloo, ON, Canada|In the information age, health misinformation remains a notable challenge to public welfare. Integral to addressing this issue is the development of search systems adept at identifying and filtering out misleading content. This paper presents the automation of Vera, a state-of-the-art consumer health search system. While Vera can discern articles containing misinformation, it requires expert ground truth answers and rule-based reformulations. We introduce an answer prediction module that integrates GPT x with Vera and a GPT-based query reformulator to yield high-quality stance reformulations and boost downstream retrieval effectiveness. Further, we find that chain-of-thought reasoning is paramount to higher effectiveness. When assessed in the TREC Health Misinformation Track of 2022, our systems surpassed all competitors, including human-in-the-loop configurations, underscoring their pivotal role in the evolution towards a health misinformation-free search landscape. We provide all code necessary to reproduce our results at https://github.com/castorini/pygaggle .|在信息时代,健康错误信息仍然是公共福利面临的一个显著挑战。解决这一问题的关键在于开发能够识别并过滤误导性内容的搜索系统。本文介绍了Vera的自动化实现,Vera是一种先进的消费者健康搜索系统。尽管Vera能够识别包含错误信息的文章,但它需要专家提供的真实答案和基于规则的重述。我们引入了一个答案预测模块,该模块将GPT x与Vera集成,并使用基于GPT的查询重述器来生成高质量的立场重述,从而提升下游检索效果。此外,我们发现思维链推理对于提高效果至关重要。在2022年TREC健康错误信息追踪评估中,我们的系统超越了所有竞争对手,包括人在环配置,突出了它们在向无健康错误信息搜索环境演进过程中的关键作用。我们在https://github.com/castorini/pygaggle 提供了重现我们结果所需的所有代码。|code|0|
|Reproducibility Analysis and Enhancements for Multi-aspect Dense Retriever with Aspect Learning|Keping Bi, Xiaojie Sun, Jiafeng Guo, Xueqi Cheng||Multi-aspect dense retrieval aims to incorporate aspect information (e.g., brand and category) into dual encoders to facilitate relevance matching. As an early and representative multi-aspect dense retriever, MADRAL learns several extra aspect embeddings and fuses the explicit aspects with an implicit aspect "OTHER" for final representation. MADRAL was evaluated on proprietary data and its code was not released, making it challenging to validate its effectiveness on other datasets. We failed to reproduce its effectiveness on the public MA-Amazon data, motivating us to probe the reasons and re-examine its components. We propose several component alternatives for comparisons, including replacing "OTHER" with "CLS" and representing aspects with the first several content tokens. Through extensive experiments, we confirm that learning "OTHER" from scratch in aspect fusion is harmful. In contrast, our proposed variants can greatly enhance the retrieval performance. Our research not only sheds light on the limitations of MADRAL but also provides valuable insights for future studies on more powerful multi-aspect dense retrieval models. Code will be released at: https://github.com/sunxiaojie99/Reproducibility-for-MADRAL.|多方面密集检索旨在将方面信息(例如品牌和类别)融入双编码器中,以促进相关性匹配。作为早期且有代表性的多方面密集检索器,MADRAL 学习若干额外的方面嵌入,并将显式方面与隐式方面“OTHER”融合以得到最终表示。MADRAL 是在专有数据上评估的,且其代码未发布,这使得在其他数据集上验证其有效性具有挑战性。我们未能在公开的 MA-Amazon 数据上复现其有效性,这促使我们探究原因并重新审视其组件。我们提出了几种可供比较的组件替代方案,包括用“CLS”替换“OTHER”,以及用前几个内容标记表示方面。通过大量实验,我们证实了在方面融合中从头学习“OTHER”是有害的;相比之下,我们提出的变体可以大大提高检索性能。我们的研究不仅揭示了 MADRAL 的局限性,也为未来研究更强大的多方面密集检索模型提供了有价值的见解。代码将发布于:https://github.com/sunxiaojie99/Reproducibility-for-MADRAL。|code|0|
|An Empirical Analysis of Intervention Strategies' Effectiveness for Countering Misinformation Amplification by Recommendation Algorithms|Royal Pathak, Francesca Spezzano|Boise State Univ, Comp Sci Dept, Boise, ID 83725 USA|Social network platforms connect people worldwide, facilitating communication, information sharing, and personal/professional networking. They use recommendation algorithms to personalize content and enhance user experiences. However, these algorithms can unintentionally amplify misinformation by prioritizing engagement over accuracy. For instance, recent works suggest that popularity-based and network-based recommendation algorithms contribute the most to misinformation diffusion. In our study, we present an exploration on two Twitter datasets to understand the impact of intervention techniques on combating misinformation amplification initiated by recommendation algorithms. We simulate various scenarios and evaluate the effectiveness of intervention strategies in social sciences such as Virality Circuit Breakers and accuracy nudges. Our findings highlight that these intervention strategies are generally successful when applied on top of collaborative filtering and content-based recommendation algorithms, while having different levels of effectiveness depending on the number of users keen to spread fake news present in the dataset.|社交网络平台将全球各地的人们连接起来,促进了沟通、信息共享以及个人和专业网络的建立。这些平台利用推荐算法来个性化内容并提升用户体验。然而,这些算法可能会无意中放大虚假信息,因为它们优先考虑用户参与度而非信息准确性。例如,最近的研究表明,基于流行度和基于网络的推荐算法对虚假信息的传播贡献最大。在我们的研究中,我们基于两个Twitter数据集展开探索,旨在理解干预技术对遏制推荐算法引发的虚假信息放大的影响。我们模拟了多种场景,并评估了社会科学中干预策略的有效性,例如“病毒传播断路器”和准确性提示。我们的研究结果表明,当这些干预策略应用于协同过滤和基于内容的推荐算法时,通常能够取得成功,但其有效性水平会因数据集中热衷于传播虚假新闻的用户数量而有所不同。|code|0|
|Not Just Algorithms: Strategically Addressing Consumer Impacts in Information Retrieval|Michael D. Ekstrand, Lex Beattie, Maria Soledad Pera, Henriette Cramer|Drexel Univ, Philadelphia, PA 19104 USA; Delft Univ Technol, Delft, Netherlands; Spotify, Seattle, WA USA|Information Retrieval (IR) systems have a wide range of impacts on consumers. We offer maps to help identify goals IR systems could—or should—strive for, and guide the process of scoping how to gauge a wide range of consumer-side impacts and the possible interventions needed to address these effects. Grounded in prior work on scoping algorithmic impact efforts, our goal is to promote and facilitate research that (1) is grounded in impacts on information consumers, contextualizing these impacts in the broader landscape of positive and negative consumer experience; (2) takes a broad view of the possible means of changing or improving that impact, including non-technical interventions; and (3) uses operationalizations and strategies that are well-matched to the technical, social, ethical, legal, and other dimensions of the specific problem in question.|信息检索(IR)系统对消费者有着广泛的影响。我们提供了一些地图,以帮助识别IR系统可以——或应该——追求的目标,并指导如何衡量广泛的消费者端影响以及解决这些影响所需的可能干预措施的范围界定过程。基于先前关于范围界定算法影响的工作,我们的目标是促进和推动以下研究:(1)以信息消费者的影响为基础,将这些影响置于更广泛的积极和消极消费者体验的背景下;(2)采取广泛的视角来看待改变或改进这些影响的可能手段,包括非技术干预;(3)使用与特定问题的技术、社会、伦理、法律和其他维度相匹配的操作化和策略。|code|0|
|A Study of Pre-processing Fairness Intervention Methods for Ranking People|Clara Rus, Andrew Yates, Maarten de Rijke|Univ Amsterdam, Amsterdam, Netherlands|Fairness interventions are hard to use in practice when ranking people due to legal constraints that limit access to sensitive information. Pre-processing fairness interventions, however, can be used in practice to create more fair training data that encourage the model to generate fair predictions without having access to sensitive information during inference. Little is known about the performance of pre-processing fairness interventions in a recruitment setting. To simulate a real scenario, we train a ranking model on pre-processed representations, while access to sensitive information is limited during inference. We evaluate pre-processing fairness intervention methods in terms of individual fairness and group fairness. On two real-world datasets, the pre-processing methods are found to improve the diversity of rankings with respect to gender, while individual fairness is not affected. Moreover, we discuss advantages and disadvantages of using pre-processing fairness interventions in practice for ranking people.|由于法律限制了对敏感信息的访问,公平性干预措施在排名人员时难以实际应用。然而,预处理公平性干预措施可以在实践中用于创建更公平的训练数据,从而鼓励模型在推理过程中无需访问敏感信息的情况下生成公平的预测。在招聘环境中,预处理公平性干预措施的性能尚不为人所知。为了模拟真实场景,我们在预处理的表示上训练了一个排名模型,同时在推理过程中对敏感信息的访问受到限制。我们从个体公平性和群体公平性的角度评估了预处理公平性干预方法。在两个真实世界的数据集上,预处理方法被发现可以提高与性别相关的排名多样性,而个体公平性不受影响。此外,我们讨论了在排名人员时使用预处理公平性干预措施的实际优缺点。|code|0|
|Evaluating the Explainability of Neural Rankers|Saran Pandian, Debasis Ganguly, Sean MacAvaney||Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of the search models have been able to demonstrate improvements in effectiveness (measured in terms of relevance of top-retrieved results), a question worthy of a thorough inspection is - "how explainable are these models?", which is what this paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model (the explanation algorithm being identical for all the models that are to be evaluated). In our proposed framework, each model, in addition to returning a ranked list of documents, also requires to return a list of explanation units or rationales for each document. This meta-information from each document is then used to measure how locally consistent these rationales are as an intrinsic measure of interpretability - one that does not require manual relevance assessments. Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document level relevance assessments. 
Our findings reveal a number of interesting observations: sentence-level rationales are more consistent, an increase in model complexity mostly leads to less consistent explanations, and interpretability measures offer a complementary dimension for evaluating IR systems, because consistency is not well correlated with nDCG at top ranks.|信息检索模型已经见证了从无监督统计方法,到基于特征的监督方法,再到完全数据驱动、利用大型语言模型预训练的方法的范式转变。虽然搜索模型日益增加的复杂性已经能够带来有效性的提升(以检索结果靠前部分的相关性衡量),但一个值得深入考察的问题是——"这些模型的可解释性如何?"这正是本文旨在评估的。特别地,我们提出了一个通用的评估平台,系统地评估任何排序模型的可解释性(对所有待评估模型使用相同的解释算法)。在我们提出的框架中,每个模型除了返回排序的文档列表之外,还需要为每个文档返回一个解释单元(rationale)列表。然后利用每份文档的这些元信息来衡量这些解释理由的局部一致性,作为可解释性的内在度量——它不需要人工相关性评估。此外,作为外在度量,我们利用子文档级别的相关性评估来计算这些解释理由的相关性。我们的研究结果显示了许多有趣的观察,例如句子级别的解释理由更加一致,模型复杂性的增加大多导致一致性更差的解释,并且可解释性度量为IR系统评估提供了一个补充维度,因为一致性与排名靠前位置的nDCG相关性不高。|code|0|
|Knowledge Graph Cross-View Contrastive Learning for Recommendation|Zeyuan Meng, Iadh Ounis, Craig Macdonald, Zixuan Yi|Univ Glasgow, Glasgow, Lanark, Scotland|Knowledge Graphs (KGs) are useful side information that help recommendation systems improve recommendation quality by providing rich semantic information about entities and items. Recently, models based on graph neural networks (GNNs) have adopted knowledge graphs to capture further high-order structural information, such as shared preferences between users and similarities between items. However, existing GNN-based methods suffer from two challenges: (1) Sparse supervisory signal, where a large amount of information in the knowledge graph is non-relevant to recommendation, and the training labels are insufficient, thereby limiting the recommendation performance of the trained model; (2) Valuable information is discarded whereby the use by the existing models of edge or node dropout strategies to obtain augmented views during self-supervised learning could lead to valuable information being discarded in recommendation. These two challenges limit the effective representation of users and items by existing methods. Inspired by self-supervised learning to mine supervision signals from data, in this paper, we focus on exploring contrastive learning based on knowledge graph enhancement, and propose a new model named Knowledge Graph Cross-view Contrastive Learning for Recommendation (KGCCL) to address the two challenges. Specifically, to address supervision sparseness, we perform contrastive learning between graph views at different levels and mine graph feature information in a self-supervised learning manner. In addition, we use noise augmentation to enhance the representation of users and items, while retaining all triplet information in the knowledge graph to address the challenge of valuable information being discarded. 
Experimental results on three public datasets show that our proposed KGCCL model outperforms existing state-of-the-art methods. In particular, our model outperforms the best baseline performance by 10.65% on the MIND dataset.|知识图谱(KGs)是一种有用的辅助信息,通过提供关于实体和项目的丰富语义信息,帮助推荐系统提高推荐质量。最近,基于图神经网络(GNNs)的模型采用知识图谱来捕捉更多的高阶结构信息,例如用户之间的共享偏好和项目之间的相似性。然而,现有的基于GNN的方法面临两个挑战:(1)稀疏的监督信号,即知识图谱中的大量信息与推荐无关,且训练标签不足,从而限制了训练模型的推荐性能;(2)有价值的信息被丢弃,即现有模型在自监督学习过程中使用边或节点丢弃策略来获得增强视图,可能导致推荐中有价值的信息被丢弃。这两个挑战限制了现有方法对用户和项目的有效表示。受自监督学习从数据中挖掘监督信号的启发,本文重点探索基于知识图谱增强的对比学习,并提出了一种名为知识图谱跨视图对比学习推荐模型(KGCCL)的新模型来解决这两个挑战。具体来说,为了解决监督稀疏性问题,我们在不同层次的图视图之间进行对比学习,并以自监督学习的方式挖掘图特征信息。此外,我们使用噪声增强来增强用户和项目的表示,同时保留知识图谱中的所有三元组信息,以解决有价值信息被丢弃的挑战。在三个公开数据集上的实验结果表明,我们提出的KGCCL模型优于现有的最先进方法。特别是在MIND数据集上,我们的模型比最佳基线性能提高了10.65%。|code|0|
|Recommendation Fairness in eParticipation: Listening to Minority, Vulnerable and NIMBY Citizens|Marina AlonsoCortés, Iván Cantador, Alejandro Bellogín|Univ Autonoma Madrid, Escuela Politecn Super, Madrid 28049, Spain|E-participation refers to the use of digital technologies and online platforms to engage citizens and other stakeholders in democratic and government decision-making processes. Recent research work has explored the application of recommender systems to e-participation, focusing on the development of algorithmic solutions to be effective in terms of personalized content retrieval accuracy, but ignoring underlying societal issues, such as biases, fairness, privacy and transparency. Motivated by this research gap, on a public e-participatory budgeting dataset, we measure and analyze recommendation fairness metrics oriented to several minority, vulnerable and NIMBY (Not In My Back Yard) groups of citizens. Our empirical results show that there is a strong popularity bias (especially for the minority groups) due to how content is presented and accessed in a reference e-participation platform; and that hybrid algorithms exploiting user geolocation information in a collaborative filtering fashion are good candidates to satisfy the proposed fairness conceptualization for the above underrepresented citizen collectives.|电子参与(E-participation)是指利用数字技术和在线平台,使公民及其他利益相关者参与到民主和政府决策过程中。近期的研究工作探索了推荐系统在电子参与中的应用,重点开发了在个性化内容检索准确性方面有效的算法解决方案,但忽略了潜在的社会问题,如偏见、公平性、隐私和透明度。受这一研究空白的启发,我们在一个公共电子参与预算数据集上,针对多个少数群体、弱势群体和“不要在我家后院”(NIMBY)的公民群体,测量并分析了面向推荐公平性的指标。我们的实证结果表明,由于内容在参考电子参与平台上的呈现和访问方式,存在强烈的流行度偏见(尤其是对少数群体);而利用用户地理位置信息以协同过滤方式工作的混合算法,是满足上述代表性不足公民群体所提出的公平性概念的良好候选方案。|code|0|
|Responsible Opinion Formation on Debated Topics in Web Search|Alisa Rieger, Tim Draws, Nicolas Mattis, David Maxwell, David Elsweiler, Ujwal Gadiraju, Dana McKay, Alessandro Bozzon, Maria Soledad Pera|Univ Regensburg, Regensburg, Germany; OTTO, Hamburg, Germany; Vrije Univ Amsterdam, Amsterdam, Netherlands; RMIT Univ Melbourne, Melbourne, Australia; Booking com, Amsterdam, Netherlands; Delft Univ Technol, Delft, Netherlands|Web search has evolved into a platform people rely on for opinion formation on debated topics. Yet, pursuing this search intent can carry serious consequences for individuals and society and involves a high risk of biases. We argue that web search can and should empower users to form opinions responsibly and that the information retrieval community is uniquely positioned to lead interdisciplinary efforts to this end. Building on digital humanism—a perspective focused on shaping technology to align with human values and needs—and through an extensive interdisciplinary literature review, we identify challenges and research opportunities that focus on the searcher, search engine, and their complex interplay. We outline a research agenda that provides a foundation for research efforts toward addressing these challenges.|网络搜索已经发展成为一个人们依赖的平台,用于就争议话题形成意见。然而,追求这种搜索意图可能会对个人和社会带来严重后果,并涉及高度的偏见风险。我们认为,网络搜索能够并且应该使用户负责任地形成意见,信息检索社区在这方面具有独特的优势,可以领导跨学科的努力。基于数字人文主义——一种专注于塑造技术以符合人类价值观和需求的视角——并通过广泛的跨学科文献综述,我们确定了关注搜索者、搜索引擎及其复杂互动的挑战和研究机会。我们概述了一个研究议程,为应对这些挑战的研究工作提供了基础。|code|0|
|Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines|Janek Bevendorff, Matti Wiegmann, Martin Potthast, Benno Stein|Bauhaus Univ Weimar, Weimar, Germany; Univ Leipzig, Leipzig, Germany|Many users of web search engines have been complaining in recent years about the supposedly decreasing quality of search results. This is often attributed to an increasing amount of search-engine-optimized but low-quality content. Evidence for this has always been anecdotal, yet it's not unreasonable to think that popular online marketing strategies such as affiliate marketing incentivize the mass production of such content to maximize clicks. Since neither this complaint nor affiliate marketing as such have received much attention from the IR community, we hereby lay the groundwork by conducting an in-depth exploratory study of how affiliate content affects today's search engines. We monitored Google, Bing and DuckDuckGo for a year on 7,392 product review queries. Our findings suggest that all search engines have significant problems with highly optimized (affiliate) content—more than is representative for the entire web according to a baseline retrieval system on the ClueWeb22. Focussing on the product review genre, we find that only a small portion of product reviews on the web uses affiliate marketing, but the majority of all search results do. Of all affiliate networks, Amazon Associates is by far the most popular. We further observe an inverse relationship between affiliate marketing use and content complexity, and that all search engines fall victim to large-scale affiliate link spam campaigns. However, we also notice that the line between benign content and spam in the form of content and link farms becomes increasingly blurry—a situation that will surely worsen in the wake of generative AI. We conclude that dynamic adversarial spam in the form of low-quality, mass-produced commercial content deserves more attention. 
(Code and data: https://github.com/webis-de/ECIR-24 ).|近年来,许多网络搜索引擎的用户抱怨搜索结果的质量似乎在下降。这通常归因于搜索引擎优化但低质量内容的增加。虽然这些证据大多是轶事性的,但认为诸如联盟营销等流行的在线营销策略激励了此类内容的大规模生产以最大化点击量并非不合理。由于这一投诉和联盟营销本身都未受到信息检索(IR)社区的太多关注,我们在此通过深入探索性研究来奠定基础,研究联盟内容如何影响当今的搜索引擎。我们对Google、Bing和DuckDuckGo进行了为期一年的监控,涉及7,392个产品评论查询。我们的研究结果表明,所有搜索引擎在高度优化的(联盟)内容方面都存在显著问题——根据ClueWeb22上的基线检索系统,这种情况比整个网络上的情况更为严重。聚焦于产品评论这一类别,我们发现网络上只有一小部分产品评论使用联盟营销,但大多数搜索结果都使用了。在所有联盟网络中,Amazon Associates是最受欢迎的。我们进一步观察到联盟营销使用与内容复杂性之间的反向关系,并且所有搜索引擎都成为大规模联盟链接垃圾邮件活动的受害者。然而,我们也注意到良性内容与垃圾邮件(如内容和链接农场)之间的界限越来越模糊——这种情况在生成式AI的推动下肯定会进一步恶化。我们得出结论,低质量、大规模生产的商业内容形式的动态对抗性垃圾邮件值得更多关注。(代码和数据:https://github.com/webis-de/ECIR-24)|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Is+Google+Getting+Worse?+A+Longitudinal+Investigation+of+SEO+Spam+in+Search+Engines)|0|
|Robustness in Fairness Against Edge-Level Perturbations in GNN-Based Recommendation|Ludovico Boratto, Francesco Fabbri, Gianni Fenu, Mirko Marras, Giacomo Medda||Efforts in the recommendation community are shifting from the sole emphasis on utility to considering beyond-utility factors, such as fairness and robustness. Robustness of recommendation models is typically linked to their ability to maintain the original utility when subjected to attacks. Limited research has explored the robustness of a recommendation model in terms of fairness, e.g., the parity in performance across groups, under attack scenarios. In this paper, we aim to assess the robustness of graph-based recommender systems concerning fairness, when exposed to attacks based on edge-level perturbations. To this end, we considered four different fairness operationalizations, including both consumer and provider perspectives. Experiments on three datasets shed light on the impact of perturbations on the targeted fairness notion, uncovering key shortcomings in existing evaluation protocols for robustness. As an example, we observed that perturbations affect consumer fairness to a greater extent than provider fairness, with alarming unfairness for the former. Source code: https://github.com/jackmedda/CPFairRobust|推荐领域的研究重点正在从单纯强调效用转向考虑效用之外的因素,如公平性和鲁棒性。推荐模型的鲁棒性通常与其在受到攻击时保持原有效用的能力相关。目前只有有限的研究从公平性角度(例如攻击场景下各群体之间性能的均等程度)探索推荐模型的鲁棒性。本文旨在评估基于图的推荐系统在遭受边级扰动攻击时在公平性方面的鲁棒性。为此,我们考虑了四种不同的公平性操作化定义,涵盖消费者和提供者两个视角。在三个数据集上的实验揭示了扰动对目标公平性概念的影响,并发现了现有鲁棒性评估协议的关键缺陷。例如,我们观察到扰动对消费者公平性的影响程度高于提供者公平性,且前者的不公平程度令人担忧。源代码: https://github.com/jackmedda/CPFairRobust|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Robustness+in+Fairness+Against+Edge-Level+Perturbations+in+GNN-Based+Recommendation)|0|
|Shallow Cross-Encoders for Low-Latency Retrieval|Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald||Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. However, keeping search latencies low is important for user satisfaction and energy usage. In this paper, we show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings since they can estimate the relevance of more documents in the same time budget. We further show that shallow transformers may benefit from the generalized Binary Cross-Entropy (gBCE) training scheme, which has recently demonstrated success for recommendation tasks. Our experiments with TREC Deep Learning passage ranking query sets demonstrate significant improvements in shallow and full-scale models in low-latency scenarios. 
For example, when the latency limit is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches NDCG@10 of 0.652, a +51% gain. Shallow Cross-Encoders are effective even when used without a GPU (e.g., with CPU inference, NDCG@10 decreases only by 3% at the same latency), which makes Cross-Encoders practical to run even without specialized hardware acceleration.|基于Transformer的交叉编码器在文本检索中取得了最先进的效果。然而,基于大型Transformer模型(如BERT或T5)的交叉编码器计算开销高昂,在一个相当小的延迟窗口内只能对少量文档进行评分。而保持较低的搜索延迟对于用户满意度和能耗都非常重要。本文表明,在这些实际的低延迟约束下,较弱的浅层Transformer模型(即层数有限的Transformer)实际上比全尺寸模型表现更好,因为它们可以在相同的时间预算内估计更多文档的相关性。我们进一步表明,浅层Transformer可以受益于广义二元交叉熵(gBCE)训练方案,该方案最近在推荐任务上已被证明有效。我们在TREC深度学习段落排序查询集上的实验表明,在低延迟场景中浅层和全尺寸模型都有显著改进。例如,当每个查询的延迟限制为25毫秒时,MonoBERT-Large(基于全尺寸BERT模型的交叉编码器)在TREC DL 2019上只能达到0.431的NDCG@10,而TinyBERT-gBCE(基于TinyBERT、用gBCE训练的交叉编码器)达到0.652的NDCG@10,提升了51%。浅层交叉编码器即使在没有GPU的情况下也是有效的(例如,使用CPU推理时,NDCG@10在相同延迟下仅下降3%),这使得交叉编码器即使没有专门的硬件加速也能实际运行。|code|0|
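The gBCE training scheme mentioned above has a simple closed form; the sketch below is a minimal illustrative version drawn from the gBCE literature on recommendation, not this paper's exact configuration (the function name and the choice of β are assumptions):

```python
import math

def gbce_loss(pos_score, neg_scores, beta=1.0):
    """Generalized Binary Cross-Entropy (gBCE): the predicted probability of
    the positive item is raised to the power beta, which counteracts the
    overconfidence induced by negative sampling; beta = 1 recovers plain BCE."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # -log(sigmoid(s+)^beta) = -beta * log(sigmoid(s+)) for the positive item
    loss = -beta * math.log(sigmoid(pos_score))
    # standard -log(1 - sigmoid(s-)) term for each sampled negative
    for s in neg_scores:
        loss -= math.log(1.0 - sigmoid(s))
    return loss
```

With β = 1 and no negatives this is ordinary binary cross-entropy on the positive; smaller β softens the positive term, which the gBCE work ties to the negative sampling rate.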
|Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies|Puxuan Yu, Antonio Mallia, Matthias Petri||We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting corpus-specific vocabulary and increasing vocabulary size decreases average postings list length, which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.|我们探索利用语料库特定的词表来同时提高学习型稀疏检索系统的效率和有效性。我们发现,在目标语料库上预训练底层BERT模型,特别是针对文档扩展过程中采用的不同词表大小,可将检索质量提升最高12%,在某些场景下还能将延迟降低最高50%。我们的实验表明,采用语料库特定词表并增大词表规模会减少平均倒排记录表长度,从而降低延迟。消融研究显示了自定义词表、文档扩展技术和稀疏模型的稀疏化目标之间有趣的交互作用。这些有效性和效率上的改进可以迁移到uniCOIL和SPLADE等不同的检索方法,为学习型稀疏检索系统提供了一种简单而有效的新的效率-效果权衡方案。|code|0|
|An Adaptive Framework of Geographical Group-Specific Network on O2O Recommendation|Luo Ji, Jiayu Mao, Hailong Shi, Qian Li, Yunfei Chu, Hongxia Yang||Online to offline recommendation strongly correlates with the user and service's spatiotemporal information, therefore calling for a higher degree of model personalization. The traditional methodology is based on a uniform model structure trained by collected centralized data, which is unlikely to capture all user patterns over different geographical areas or time periods. To tackle this challenge, we propose a geographical group-specific modeling method called GeoGrouse, which simultaneously studies the common knowledge as well as group-specific knowledge of user preferences. An automatic grouping paradigm is employed and verified based on users' geographical grouping indicators. Offline and online experiments are conducted to verify the effectiveness of our approach, and substantial business improvement is achieved.|线上到线下(O2O)推荐与用户和服务的时空信息密切相关,因此需要更高程度的模型个性化。传统方法基于由集中收集的数据训练的统一模型结构,难以捕获不同地理区域或时间段的所有用户模式。为应对这一挑战,我们提出了一种名为GeoGrouse的地理群组特定建模方法,同时学习用户偏好的共性知识和群组特定知识。基于用户的地理分组指标,采用并验证了一种自动分组范式。离线和在线实验验证了该方法的有效性,并取得了实质性的业务改进。|code|0|
|GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation|Kaustubh D. Dhole, Eugene Agichtein||Query Reformulation (QR) is a set of techniques used to transform a user's original search query into a text that better aligns with the user's intent and improves their search experience. Recently, zero-shot QR has been shown to be a promising approach due to its ability to exploit knowledge inherent in large language models. Taking inspiration from the success of ensemble prompting strategies, which have benefited many tasks, we investigate whether they can help improve query reformulation. In this context, we propose an ensemble-based prompting technique, GenQREnsemble, which leverages paraphrases of a zero-shot instruction to generate multiple sets of keywords, ultimately improving retrieval performance. We further introduce its post-retrieval variant, GenQREnsembleRF, to incorporate pseudo-relevance feedback. On evaluations over four IR benchmarks, we find that GenQREnsemble generates better reformulations, with relative nDCG@10 improvements of up to 18% over the previous zero-shot state-of-the-art. On the MSMarco Passage Ranking task, GenQREnsembleRF shows relative gains of 5% and 9%.|查询重构(Query Reformulation, QR)是一组技术,用于将用户的原始搜索查询转换为更符合用户意图、改善其搜索体验的文本。最近,零样本QR因能够利用大型语言模型中固有的知识而被证明是一种有前景的方法。受集成提示策略在许多任务上取得成功的启发,我们探讨了它们是否有助于改进查询重构。在此背景下,我们提出了一种基于集成的提示技术GenQREnsemble,它利用零样本指令的多个释义来生成多组关键词,最终提高检索性能。我们进一步引入其检索后变体GenQREnsembleRF,以纳入伪相关反馈。在四个IR基准上的评估中,我们发现GenQREnsemble产生了更好的重构,相对nDCG@10比此前的零样本最佳水平提升最高达18%。在MSMarco段落排序任务中,GenQREnsembleRF显示出5%和9%的相对增益。|code|0|
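As a rough illustration of the ensemble idea described in this abstract, the keyword sets produced under different instruction paraphrases can be merged into one reformulated query. The dedup-and-append aggregation below is an assumed simplification for illustration, not necessarily the paper's exact method:

```python
def ensemble_reformulate(query, keyword_sets):
    """Merge keyword sets produced by an LLM under several paraphrased
    zero-shot instructions into one expanded query: keep the original
    query text and append the case-insensitively deduplicated keywords
    in first-seen order."""
    seen, merged = set(), []
    for keywords in keyword_sets:
        for kw in keywords:
            if kw.lower() not in seen:
                seen.add(kw.lower())
                merged.append(kw)
    return query + " " + " ".join(merged)
```

The expanded query then goes to an ordinary retriever; the post-retrieval variant would re-run the same aggregation with feedback documents added to the prompts.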
|Improving the Robustness of Dense Retrievers Against Typos via Multi-Positive Contrastive Learning|Georgios Sidiropoulos, Evangelos Kanoulas||Dense retrieval has become the new paradigm in passage retrieval. Despite its effectiveness on typo-free queries, it is not robust when dealing with queries that contain typos. Current works on improving the typo-robustness of dense retrievers combine (i) data augmentation to obtain the typoed queries during training time with (ii) additional robustifying subtasks that aim to align the original, typo-free queries with their typoed variants. Even though multiple typoed variants are available as positive samples per query, some methods assume a single positive sample and a set of negative ones per anchor and tackle the robustifying subtask with contrastive learning, therefore making insufficient use of the multiple positives (typoed queries). In contrast, in this work, we argue that all available positives can be used at the same time and employ contrastive learning that supports multiple positives (multi-positive). Experimental results on two datasets show that our proposed approach of leveraging all positives simultaneously and employing multi-positive contrastive learning on the robustifying subtask yields improvements in robustness over using contrastive learning with a single positive.|密集检索已经成为段落检索的新范式。尽管它对不含拼写错误的查询很有效,但在处理包含拼写错误的查询时并不鲁棒。目前改进密集检索器拼写错误鲁棒性的工作结合了:(i)数据增强,在训练时获得含拼写错误的查询;(ii)额外的鲁棒化子任务,旨在将原始的无拼写错误查询与其含拼写错误的变体对齐。尽管每个查询有多个含拼写错误的变体可作为正样本,一些方法仍假设每个锚点只有单个正样本和一组负样本,并用对比学习处理鲁棒化子任务,因而没有充分利用多个正样本(含拼写错误的查询)。与此相反,本文认为所有可用的正样本可以同时使用,并采用支持多个正样本的对比学习(多正样本对比学习)。在两个数据集上的实验结果表明,与使用单正样本的对比学习相比,我们提出的同时利用所有正样本并在鲁棒化子任务上采用多正样本对比学习的方法提高了鲁棒性。|code|0|
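One common way to support multiple positives in a contrastive objective is to average the InfoNCE term over every positive; the sketch below illustrates that general idea (the temperature value and function interface are illustrative, not the paper's exact loss):

```python
import math

def multi_positive_infonce(pos_sims, neg_sims, tau=0.1):
    """InfoNCE generalized to several positives per anchor: average the
    contrastive term over all positives (e.g. every typoed variant of one
    query) instead of arbitrarily picking a single positive.
    pos_sims / neg_sims are similarity scores of the anchor to positives
    and negatives; tau is the softmax temperature."""
    denom = sum(math.exp(s / tau) for s in pos_sims + neg_sims)
    return -sum(math.log(math.exp(s / tau) / denom) for s in pos_sims) / len(pos_sims)
```

With a single positive this reduces to standard InfoNCE, so the multi-positive form is a strict generalization.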
|Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-Seeking Conversations|Weronika Lajewska, Krisztian Balog||Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems. We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response. This way we can automatically assess if the answer to the user's question is present in the corpus. Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate. For training and evaluation, we develop a dataset based on the TREC CAsT benchmark that includes answerability labels on the sentence, passage, and ranking levels. We demonstrate that our proposed method represents a strong baseline and outperforms a state-of-the-art LLM on the answerability prediction task.|生成式人工智能模型面临幻觉问题的挑战,这可能削弱用户对此类系统的信任。我们将对话式信息搜寻视为一个两步过程:首先识别语料库中的相关段落,然后将其总结为最终的系统回复。这样我们就可以自动评估用户问题的答案是否存在于语料库中。具体来说,我们提出的方法使用一个句子级分类器来检测答案是否存在,然后在段落级聚合这些预测,最终在排名靠前的段落间聚合,得到最终的可回答性估计。为了训练和评估,我们基于TREC CAsT基准构建了一个数据集,包含句子、段落和排序级别的可回答性标签。我们证明,我们提出的方法是一个强大的基线,并且在可回答性预测任务上优于最先进的LLM。|code|0|
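A minimal sketch of the two-level aggregation this abstract describes, with an assumed any-sentence rule and threshold (the paper's actual aggregators may differ):

```python
def answerability(sentence_probs_per_passage, sent_thresh=0.5):
    """Aggregate sentence-level answer-presence probabilities upward:
    a passage is flagged answerable if any of its sentences crosses the
    threshold, and the ranking-level estimate is the fraction of
    top-ranked passages flagged answerable."""
    passage_flags = [any(p >= sent_thresh for p in probs)
                     for probs in sentence_probs_per_passage]
    return sum(passage_flags) / len(passage_flags)
```

A downstream generator could then refuse to answer when this ranking-level estimate falls below some cutoff.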
|On the Influence of Reading Sequences on Knowledge Gain During Web Search|Wolfgang Gritz, Anett Hoppe, Ralph Ewerth||Nowadays, learning increasingly involves the usage of search engines and web resources. The related interdisciplinary research field search as learning aims to understand how people learn on the web. Previous work has investigated several feature classes to predict, for instance, the expected knowledge gain during web search. Therein, eye-tracking features have not been extensively studied so far. In this paper, we extend a previously used reading model from a line-based one to one that can detect reading sequences across multiple lines. We use publicly available study data from a web-based learning task to examine the relationship between our feature set and the participants' test scores. Our findings demonstrate that learners with higher knowledge gain spent significantly more time reading, and processing more words in total. We also find evidence that faster reading at the expense of more backward regressions may be an indicator of better web-based learning. We make our code publicly available at https://github.com/TIBHannover/reading_web_search.|如今,学习越来越多地涉及搜索引擎和网络资源的使用。相关的跨学科研究领域"搜索即学习"(search as learning)旨在理解人们如何在网络上学习。先前的工作已经研究了若干特征类别来预测诸如网络搜索中的预期知识增益。其中,眼动追踪特征迄今尚未得到广泛研究。在本文中,我们将先前使用的基于单行的阅读模型扩展为能够检测跨多行阅读序列的模型。我们使用来自一项基于网络的学习任务的公开研究数据,考察我们的特征集与参与者测试分数之间的关系。我们的研究结果表明,知识增益更高的学习者花费显著更多的时间阅读,并且总共处理了更多的单词。我们还发现,以更多回读(backward regression)为代价的更快阅读可能是更好的网络学习的一个指标。我们的代码公开于 https://github.com/TIBHannover/reading_web_search。|code|0|
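The notion of backward regressions in a reading sequence can be illustrated with a toy computation over fixated word indices; this is a deliberate simplification of the paper's multi-line reading model, with hypothetical names:

```python
def reading_stats(word_indices):
    """Given the sequence of word positions a reader fixated in order,
    count forward steps (reading on) and backward regressions (jumping
    back to re-read earlier text)."""
    forward = sum(1 for a, b in zip(word_indices, word_indices[1:]) if b > a)
    regressions = sum(1 for a, b in zip(word_indices, word_indices[1:]) if b < a)
    return forward, regressions
```

Feature sets like the paper's would compute such counts (plus durations) per reading sequence rather than per whole session.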
|SPARe: Supercharged Lexical Retrievers on GPU with Sparse Kernels|Tiago Almeida, Sérgio Matos|Univ Aveiro, IEETA, DETI, LASI, P-3810193 Aveiro, Portugal|Lexical sparse retrievers rely on efficient searching algorithms that operate over inverted index structures tailored specifically for CPUs. This CPU-centric design poses a challenge when adapting these algorithms for highly parallel accelerators, such as GPUs, thus deterring potential performance gains. To address this, we propose to leverage the recent advances in sparse computations offered by deep learning frameworks to directly implement sparse retrieval on these accelerators. This paper presents the SPARe (SPArse Retrievers) Python package, which provides a high-level API to deal with sparse retrievers on (single or multi-)accelerators by leveraging deep learning frameworks at its core. Experimental results show that SPARe, running on an accessible GPU (RTX 2070), can calculate the BM25 scores for close to 9 million MSMARCO documents at a rate of 800 questions per second with our specialized algorithm. Notably, SPARe proves highly effective for denser LSR indexes, significantly surpassing the performance of established systems such as PISA, Pyserini and PyTerrier. SPARe is publicly available at https://github.com/ieeta-pt/SPARe .|词汇稀疏检索器依赖于专门为CPU设计的倒排索引结构的高效搜索算法。这种以CPU为中心的设计在将这些算法适配到高度并行的加速器(如GPU)时带来了挑战,从而阻碍了潜在的性能提升。为了解决这一问题,我们提出利用深度学习框架在稀疏计算方面的最新进展,直接在加速器上实现稀疏检索。本文介绍了SPARe(SPArse Retrievers)Python包,它提供了一个高级API,通过在核心中利用深度学习框架来处理(单或多)加速器上的稀疏检索器。实验结果表明,SPARe在可访问的GPU(RTX 2070)上运行时,使用我们的专用算法可以以每秒800个问题的速度计算近900万篇MSMARCO文档的BM25分数。值得注意的是,SPARe在较密集的LSR索引上表现出极高的效率,显著超越了PISA、Pyserini和PyTerrier等现有系统的性能。SPARe已在https://github.com/ieeta-pt/SPARe 公开提供。|code|0|
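The sparse-matrix view of BM25 that packages like SPARe exploit can be sketched in plain Python: precompute per-term document weights once, so that scoring a query reduces to summing rows of the index, which maps onto a sparse matrix-vector product on an accelerator. The function names are illustrative, and k1/b are the usual BM25 defaults rather than SPARe's settings:

```python
import math
from collections import Counter

def build_bm25_index(docs, k1=1.2, b=0.75):
    """Precompute BM25 term weights into a term -> [(doc_id, weight)] map,
    i.e. the rows of a sparse document-term weight matrix."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    index = {}
    for doc_id, doc in enumerate(docs):
        for t, f in Counter(doc).items():
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            w = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
            index.setdefault(t, []).append((doc_id, w))
    return index

def bm25_scores(index, query, num_docs):
    """Score every document for a query by summing the precomputed rows:
    the scatter-add that a sparse GPU kernel performs in bulk."""
    scores = [0.0] * num_docs
    for t in query:
        for doc_id, w in index.get(t, []):
            scores[doc_id] += w
    return scores
```

Because all collection statistics are baked into the stored weights, query time involves no per-document recomputation, which is what makes the batched accelerator formulation attractive.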
|Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT|Ben Giacalone, Greg Paiement, Quinn Tucker, Richard Zanibbi|Rochester Inst Technol, Rochester, NY 14623 USA|ColBERT is a highly effective and interpretable retrieval model based on token embeddings. For scoring, the model adds cosine similarities between the most similar pairs of query and document token embeddings. Previous work on interpreting how tokens affect scoring pay little attention to non-text tokens used in ColBERT such as [MASK]. Using MS MARCO and the TREC 2019-2020 deep passage retrieval task, we show that [MASK] embeddings may be replaced by other query and structural token embeddings to obtain similar effectiveness, and that [Q] and [MASK] are sensitive to token order, while [CLS] and [SEP] are not.|ColBERT是一种基于词元嵌入的高效且可解释的检索模型。在评分过程中,该模型通过计算查询和文档词元嵌入之间最相似对的余弦相似度来进行累加。先前关于解释词元如何影响评分的研究很少关注ColBERT中使用的非文本词元,例如[MASK]。通过使用MS MARCO数据集和TREC 2019-2020深度段落检索任务,我们发现[MASK]嵌入可以被其他查询和结构词元嵌入替代,从而获得相似的检索效果。此外,[Q]和[MASK]对词元顺序敏感,而[CLS]和[SEP]则不受词元顺序影响。|code|0|
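ColBERT's late-interaction (MaxSim) scoring, which the analysis above probes, sums each query token's maximum similarity over the document's token embeddings; the [MASK] tokens appended to the query simply contribute extra embeddings to this sum. A minimal pure-Python sketch with cosine similarity:

```python
def maxsim_score(query_embs, doc_embs):
    """ColBERT late interaction: for each query token embedding, take the
    maximum cosine similarity over all document token embeddings, then sum
    these maxima to obtain the query-document relevance score."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)
    return sum(max(cos(q, d) for d in doc_embs) for q in query_embs)
```

Replacing one query embedding with another (as the paper does when swapping [MASK] for other structural tokens) only changes that embedding's max-similarity term, which is why such substitutions can leave effectiveness largely intact.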
|A Cost-Sensitive Meta-learning Strategy for Fair Provider Exposure in Recommendation|Ludovico Boratto, Giulia Cerniglia, Mirko Marras, Alessandra Perniciano, Barbara Pes||When devising recommendation services, it is important to account for the interests of all content providers, encompassing not only newcomers but also minority demographic groups. In various instances, certain provider groups find themselves underrepresented in the item catalog, a situation that can influence recommendation results. Hence, platform owners often seek to regulate the exposure of these provider groups in the recommended lists. In this paper, we propose a novel cost-sensitive approach designed to guarantee these target exposure levels in pairwise recommendation models. This approach quantifies, and consequently mitigates, the discrepancies between the volume of recommendations allocated to groups and their contribution to the item catalog, under the principle of equity. Our results show that this approach, while aligning groups' exposure with their assigned levels, does not compromise the original recommendation utility. Source code and pre-processed data can be retrieved at https://github.com/alessandraperniciano/meta-learning-strategy-fair-provider-exposure.|在设计推荐服务时,重要的是兼顾所有内容提供者的利益,不仅包括新晋提供者,也包括少数人口群体。在许多情况下,某些提供者群体在物品目录中代表性不足,这种情况会影响推荐结果。因此,平台所有者往往希望调控这些提供者群体在推荐列表中的曝光度。本文提出了一种新颖的代价敏感方法,用于在成对(pairwise)推荐模型中保证这些目标曝光水平。该方法在公平原则下量化并进而缓解分配给各群体的推荐量与它们在物品目录中占比之间的差异。我们的结果表明,这种方法在使群体曝光度与其指定水平对齐的同时,并不损害原有的推荐效用。源代码和预处理数据可在 https://github.com/alessandraperniciano/meta-learning-strategy-fair-provider-exposure 获取。|code|0|
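One simple way to make a pairwise objective cost-sensitive is to scale each pair's loss by a per-provider weight; the sketch below applies this to a BPR-style term and is purely illustrative, not the paper's exact meta-learning strategy:

```python
import math

def cost_sensitive_bpr(pos_score, neg_score, provider_weight=1.0):
    """Pairwise BPR term -log(sigmoid(s_pos - s_neg)), scaled by a
    per-provider cost: pairs whose positive item comes from an
    under-exposed provider group get a larger weight, so they contribute
    more gradient during training."""
    return -provider_weight * math.log(1.0 / (1.0 + math.exp(-(pos_score - neg_score))))
```

In such a scheme the weights would be derived from the gap between a group's current exposure and its target level, and updated as training proceeds.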
|Multiple Testing for IR and Recommendation System Experiments|Ngozi Ihemelandu, Michael D. Ekstrand|Drexel Univ, Dept Informat Sci, Philadelphia, PA 19104 USA; Boise State Univ, Boise, ID 83725 USA|While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that controls for False Discovery Rate (FDR).|尽管已有大量研究关注于比较两个信息检索(IR)系统的统计技术,但许多IR实验往往会测试超过两个系统。这可能导致由于多重比较问题(MCP)而增加的误发现。一些IR研究已经探讨了多重比较程序;这些研究主要使用TREC数据并控制族系错误率。在本研究中,我们扩展了他们的研究范围,包括推荐系统评估数据以及控制误发现率(FDR)的多重比较程序。|code|0|
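A standard FDR-controlling procedure of the kind this study evaluates is Benjamini-Hochberg; a minimal sketch of the step-up rule over the p-values of pairwise system comparisons:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the False
    Discovery Rate: find the largest k such that the k-th smallest
    p-value satisfies p_(k) <= (k / n) * alpha, and reject the k
    hypotheses with the smallest p-values. Returns one reject flag
    per input p-value, in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / n * alpha:
            k_max = rank
    reject = [False] * n
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

Unlike familywise-error procedures such as Bonferroni, which divide α by the number of comparisons uniformly, this controls the expected fraction of false discoveries and is typically less conservative when many systems are compared.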
|An In-Depth Comparison of Neural and Probabilistic Tree Models for Learning-to-rank|Haonan Tan, Kaiyu Yang, Haitao Yu|Univ Tsukuba, Inst Lilbray Informat & Media Sci, 1-2 Kasuga, Tsukuba, Ibaraki 3050821, Japan; Univ Tsukuba, Grad Sch Comprehens Human Sci, 1-2 Kasuga, Tsukuba, Ibaraki 3050821, Japan|Learning-to-rank has been intensively studied and has demonstrated significant value in several fields, such as web search and recommender systems. Over the learning-to-rank datasets given as vectors of feature values, LambdaMART proposed more than a decade ago, and its subsequent descendants based on gradient-boosted decision trees (GBDT), have demonstrated leading performance. Recently, different novel tree models have been developed, such as neural tree ensembles that utilize neural networks to emulate decision tree models and probabilistic gradient boosting machines (PGBM). However, the effectiveness of these tree models for learning-to-rank has not been comprehensively explored. Hence, this study bridges the gap by systematically comparing several representative neural tree ensembles (e.g., TabNet, NODE, and GANDALF), PGBM, and traditional learning-to-rank models on two benchmark datasets. The experimental results reveal that benefiting from end-to-end gradient-based optimization and the power of feature representation and adaptive feature selection, the neural tree ensemble does have its advantage for learning-to-rank over the conventional tree-based ranking model, such as LambdaMART. 
This finding is important, as LambdaMART has maintained leading performance over a long period.|学习排序(learning-to-rank)技术已被深入研究,并在多个领域展示了显著的价值,如网络搜索和推荐系统。在给定特征值向量的学习排序数据集上,十多年前提出的LambdaMART及其基于梯度提升决策树(GBDT)的后续改进模型,已经展示了领先的性能。近年来,不同的新型树模型被开发出来,例如利用神经网络模拟决策树模型的神经树集成模型(neural tree ensembles)和概率梯度提升机(PGBM)。然而,这些树模型在学习排序任务中的有效性尚未得到全面探索。因此,本研究通过系统比较几种代表性的神经树集成模型(如TabNet、NODE和GANDALF)、PGBM以及传统的学习排序模型,填补了这一空白。实验结果表明,得益于端到端的基于梯度的优化以及特征表示和自适应特征选择的能力,神经树集成模型在学习排序任务中确实比传统的基于树的排序模型(如LambdaMART)更具优势。这一发现具有重要意义,因为LambdaMART在很长一段时间内一直保持着领先的性能。|code|0|
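For context on the LambdaMART baseline discussed above: its core ingredient is the lambda gradient for a document pair, i.e., the RankNet gradient scaled by the NDCG change obtained from swapping the pair in the ranking. A minimal sketch (σ is the usual sigmoid scale parameter):

```python
import math

def lambda_gradient(s_i, s_j, delta_ndcg, sigma=1.0):
    """LambdaMART-style lambda for a pair where document i is more
    relevant than document j: the RankNet pairwise gradient
    -sigma / (1 + exp(sigma * (s_i - s_j))), scaled by |deltaNDCG|,
    the change in NDCG from swapping the two documents."""
    return -sigma * abs(delta_ndcg) / (1.0 + math.exp(sigma * (s_i - s_j)))
```

These lambdas are accumulated per document and fed to the GBDT learner as pseudo-gradients, which is the link between the listwise metric and the tree-boosting machinery.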
|GenRec: Large Language Model for Generative Recommendation|Jianchao Ji, Zelong Li, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Juntao Tan, Yongfeng Zhang||In recent years, large language models (LLM) have emerged as powerful tools for diverse natural language processing tasks. However, their potential for recommender systems under the generative recommendation paradigm remains relatively unexplored. This paper presents an innovative approach to recommendation systems using large language models (LLMs) based on text data. In this paper, we present a novel LLM for generative recommendation (GenRec) that utilizes the expressive power of LLM to directly generate the target item to recommend, rather than calculating ranking scores for each candidate item one by one as in traditional discriminative recommendation. GenRec uses LLM's understanding ability to interpret context, learn user preferences, and generate relevant recommendations. Our proposed approach leverages the vast knowledge encoded in large language models to accomplish recommendation tasks. We first formulate specialized prompts to enhance the ability of LLM to comprehend recommendation tasks. Subsequently, we use these prompts to fine-tune the LLaMA backbone LLM on a dataset of user-item interactions, represented by textual data, to capture user preferences and item characteristics. Our research underscores the potential of LLM-based generative recommendation in revolutionizing the domain of recommendation systems and offers a foundational framework for future explorations in this field. 
We conduct extensive experiments on benchmark datasets, and the results show that GenRec achieves significantly better performance on large datasets.|近年来,大型语言模型(LLM)已经成为处理各种自然语言处理任务的强大工具。然而,它们在生成式推荐范式下应用于推荐系统的潜力仍相对未被探索。本文提出了一种基于文本数据、利用大语言模型(LLM)构建推荐系统的创新方法。我们提出了一种新的生成式推荐大语言模型(GenRec),它利用 LLM 的表达能力直接生成要推荐的目标物品,而不是像传统判别式推荐那样逐个计算每个候选物品的排名得分。GenRec 利用 LLM 的理解能力来解释上下文、学习用户偏好并生成相关推荐。我们提出的方法利用大型语言模型中编码的海量知识来完成推荐任务。我们首先设计专门的提示来增强 LLM 理解推荐任务的能力。随后,我们使用这些提示,在以文本表示的用户-物品交互数据集上对 LLaMA 主干模型进行微调,以捕获用户偏好和物品特征。我们的研究强调了基于 LLM 的生成式推荐在革新推荐系统领域方面的潜力,并为该领域未来的探索提供了一个基础框架。我们在基准数据集上进行了广泛的实验,结果表明 GenRec 在大规模数据集上取得了明显更好的效果。|code|0|
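The prompt-then-fine-tune recipe described in the GenRec abstract can be illustrated with a minimal sketch. The template wording, numbering scheme, and function name below are our own assumptions for illustration, not the paper's actual prompts:

```python
def build_genrec_example(history, target):
    """Turn one user's interaction history into a (prompt, completion)
    pair for fine-tuning a generative recommender: the model is trained
    to emit the target item directly, instead of scoring candidates one
    by one as a discriminative ranker would.

    `history` is a list of item titles in chronological order.
    """
    lines = [f"{i + 1}. {title}" for i, title in enumerate(history)]
    prompt = (
        "A user interacted with the following items in order:\n"
        + "\n".join(lines)
        + "\nRecommend the next item:"
    )
    # Leading space on the completion is a common tokenizer convention.
    return prompt, " " + target


prompt, completion = build_genrec_example(
    ["The Matrix", "Blade Runner"], "Ghost in the Shell"
)
```

Pairs like these would then feed a standard causal-LM fine-tuning loop over the LLaMA backbone.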
|News Gathering: Leveraging Transformers to Rank News|Carlos Muñoz, María José Apolo, Maximiliano Ojeda, Hans Lobel, Marcelo Mendoza|Pontificia Univ Catolica Chile, Vicuna Mackenna 6840, Santiago, Chile; Univ Tecn Federico Santa Maria, Vicuna Mackenna 3939, Santiago, Chile|News media outlets disseminate information across various platforms. Often, these posts present complementary content and perspectives on the same news story. However, to compile a set of related news articles, users must thoroughly scour multiple sources and platforms, manually identifying which publications pertain to the same story. This tedious process hinders the speed at which journalists can perform essential tasks, notably fact-checking. To tackle this problem, we created a dataset containing both related and unrelated news pairs. This dataset allows us to develop information retrieval models grounded in the principle of binary relevance. Recognizing that many Transformer-based models might be suited for this task but could overemphasize relationships based on lexical connections, we tailored a dataset to fine-tune these models to focus on semantically relevant connections in the news domain. To craft this dataset, we introduced a methodology to identify pairs of news stories that are lexically similar yet refer to different events and pairs that discuss the same event but have distinct lexical structures. This design compels Transformers to recognize semantic connections between stories, even when their lexical similarities might be absent. Following a human-annotation assessment, we reveal that BERT outperformed other techniques, excelling even in challenging test cases. 
To ensure the reproducibility of our approach, we have made the dataset and top-performing models publicly available.|新闻媒体机构通过各种平台传播信息。通常情况下,这些帖子会针对同一新闻事件提供互补的内容和视角。然而,要汇编一组相关的新闻报道,用户必须彻底搜索多个来源和平台,手动识别哪些出版物涉及同一事件。这一繁琐的过程阻碍了记者执行关键任务(如事实核查)的速度。为了解决这一问题,我们创建了一个包含相关和不相关新闻对的数据集。该数据集使我们能够开发基于二元相关性原则的信息检索模型。考虑到许多基于Transformer的模型可能适合此任务,但可能会过度强调基于词汇连接的关系,我们定制了一个数据集,以微调这些模型,使其专注于新闻领域中的语义相关连接。为了构建该数据集,我们引入了一种方法来识别词汇相似但涉及不同事件的新闻对,以及讨论同一事件但具有不同词汇结构的新闻对。这种设计迫使Transformer模型识别故事之间的语义连接,即使它们的词汇相似性可能不存在。经过人工注释评估后,我们发现BERT优于其他技术,即使在具有挑战性的测试案例中也表现出色。为确保我们方法的可重复性,我们已将数据集和表现最佳的模型公开发布。|code|0|
|Answer Retrieval in Legal Community Question Answering|Arian Askari, Zihui Yang, Zhaochun Ren, Suzan Verberne||The task of answer retrieval in the legal domain aims to help users to seek relevant legal advice from massive amounts of professional responses. Two main challenges hinder applying existing answer retrieval approaches in other domains to the legal domain: (1) a huge knowledge gap between lawyers and non-professionals; and (2) a mix of informal and formal content on legal QA websites. To tackle these challenges, we propose CE_FS, a novel cross-encoder (CE) re-ranker based on the fine-grained structured inputs. CE_FS uses additional structured information in the CQA data to improve the effectiveness of cross-encoder re-rankers. Furthermore, we propose LegalQA: a real-world benchmark dataset for evaluating answer retrieval in the legal domain. Experiments conducted on LegalQA show that our proposed method significantly outperforms strong cross-encoder re-rankers fine-tuned on MS MARCO. Our novel finding is that adding the question tags of each question besides the question description and title into the input of cross-encoder re-rankers structurally boosts the rankers' effectiveness. While we study our proposed method in the legal domain, we believe that our method can be applied in similar applications in other domains.|法律领域的答案检索任务旨在帮助用户从海量的专业答复中获取相关的法律建议。将其他领域现有的答案检索方法应用于法律领域面临两大挑战:(1)律师与非专业人士之间巨大的知识鸿沟;(2)法律问答网站上正式与非正式内容的混杂。为了应对这些挑战,我们提出了 CE_FS,一种基于细粒度结构化输入的新型交叉编码器(CE)重排序器。CE_FS 利用 CQA 数据中额外的结构化信息来提高交叉编码器重排序器的有效性。此外,我们提出了 LegalQA:一个用于评估法律领域答案检索的真实世界基准数据集。在 LegalQA 上进行的实验表明,我们提出的方法明显优于在 MS MARCO 上微调的强交叉编码器重排序器。我们的新发现是,在交叉编码器重排序器的输入中,除问题描述和标题外再加入每个问题的问题标签,能从结构上提升排序器的有效性。虽然我们在法律领域研究了所提出的方法,但我们相信该方法同样适用于其他领域的类似应用。|code|0|
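The structured-input idea reported for CE_FS — feeding question tags alongside title and description into the cross-encoder — can be sketched as simple input assembly. The field order, field labels, and separator token below are illustrative assumptions, not the paper's exact format:

```python
def build_structured_input(title, description, tags, answer, sep="[SEP]"):
    """Assemble a structured question-answer input for a cross-encoder
    re-ranker. The reported finding is that including the question's
    tags, in addition to its title and description, boosts re-ranking
    effectiveness; this function just lays those fields out as one
    text pair ready for a cross-encoder's tokenizer.
    """
    question_side = f" {sep} ".join(
        [
            "tags: " + ", ".join(tags),
            "title: " + title,
            "description: " + description,
        ]
    )
    return f"{question_side} {sep} {answer}"


pair = build_structured_input(
    title="Security deposit not returned",
    description="My landlord kept my deposit without cause.",
    tags=["tenancy", "deposit"],
    answer="You may file a claim in small claims court.",
)
```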
|Towards Optimizing Ranking in Grid-Layout for Provider-Side Fairness|Amifa Raj, Michael D. Ekstrand|Microsoft, Redmond, WA 98052 USA; Drexel Univ, Dept Informat Sci, Philadelphia, PA 19104 USA|Information access systems, such as search engines and recommender systems, order and position results based on their estimated relevance. These results are then evaluated for a range of concerns, including provider-side fairness: whether exposure to users is fairly distributed among items and the people who created them. Several fairness-aware ranking and re-ranking techniques have been proposed to ensure fair exposure for providers, but this work focuses almost exclusively on linear layouts in which items are displayed in single ranked list. Many widely-used systems use other layouts, such as the grid views common in streaming platforms, image search, and other applications. Providing fair exposure to providers in such layouts is not well-studied. We seek to fill this gap by providing a grid-aware re-ranking algorithm to optimize layouts for provider-side fairness by adapting existing re-ranking techniques to grid-aware browsing models, and an analysis of the effect of grid-specific factors such as device size on the resulting fairness optimization. Our work provides a starting point and identifies open gaps in ensuring provider-side fairness in grid-based layouts.|信息访问系统,如搜索引擎和推荐系统,根据其估计的相关性对结果进行排序和定位。这些结果随后会被评估以考虑一系列问题,包括提供方公平性:即用户对项目和项目创建者的曝光是否公平分配。已有多种公平性感知的排序和重排序技术被提出,以确保提供方的公平曝光,但这些工作几乎完全集中在线性布局上,即项目以单一排序列表的形式展示。许多广泛使用的系统采用其他布局,如流媒体平台、图像搜索及其他应用中常见的网格视图。在这些布局中为提供方提供公平曝光的研究尚不充分。我们旨在填补这一空白,通过提供一种网格感知的重排序算法,将现有的重排序技术适应于网格感知的浏览模型,以优化提供方公平性的布局,并分析设备大小等网格特定因素对最终公平性优化的影响。我们的工作为基于网格布局中确保提供方公平性提供了一个起点,并指出了尚未解决的问题。|code|0|
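Grid-aware fairness optimization presupposes an exposure model over grid cells rather than a single ranked list. The geometric-decay, row-major browsing model below is a deliberately simple stand-in for the grid browsing models the paper adapts, purely to make the exposure bookkeeping concrete:

```python
def grid_exposure(n_rows, n_cols, gamma=0.85):
    """Exposure weight for each cell of a grid under a toy row-major
    browsing model: users scan left-to-right, top-to-bottom, and
    attention decays geometrically with each inspected cell.
    Returns a dict {(row, col): weight}.
    """
    weights = {}
    rank = 0
    for r in range(n_rows):
        for c in range(n_cols):
            weights[(r, c)] = gamma ** rank
            rank += 1
    return weights


def provider_exposure(layout, weights):
    """Sum cell weights per provider for a grid `layout` of provider
    ids; a fairness-aware re-ranker would rearrange the layout to even
    out these totals."""
    totals = {}
    for r, row in enumerate(layout):
        for c, provider in enumerate(row):
            totals[provider] = totals.get(provider, 0.0) + weights[(r, c)]
    return totals


w = grid_exposure(2, 2)
totals = provider_exposure([["a", "b"], ["a", "c"]], w)
```

Note how device size enters naturally: changing `n_cols` reshuffles which ranks land in the high-attention first row, which is exactly the grid-specific factor the paper analyzes.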
|A Conversational Robot for Children's Access to a Cultural Heritage Multimedia Archive|Thomas Beelen, Roeland Ordelman, Khiet P. Truong, Vanessa Evers, Theo Huibers|Univ Twente, Enschede, Netherlands|In this paper we introduce a conversational robot designed to assist children in searching a museum's cultural heritage video archive. The robot employs a form of Spoken Conversational Search to facilitate the clarification of children's interest (their information need) in specific videos from the archive. Children are typically insufficiently supported in this process by common search technologies such as search-bar and keyboard, or one-shot voice interfaces. We present our approach, which leverages a knowledge-graph representation of the museum's video archive to facilitate conversational search interactions and suggest content based on the interaction, in order to study information-seeking conversations with children. We plan to use the robot test-bed to investigate the effectiveness of conversational designs over one-shot voice interactions for clarifying children's information needs in a museum context.|本文介绍了一种对话机器人,旨在帮助儿童搜索博物馆文化遗产视频档案。该机器人采用一种口语对话搜索形式,以帮助澄清儿童对档案中特定视频的兴趣(即他们的信息需求)。常见的搜索技术,如搜索栏和键盘,或一次性语音界面,通常无法充分支持儿童在这一过程中的需求。我们提出了一种方法,利用博物馆视频档案的知识图谱表示来促进对话搜索交互,并根据交互建议内容,以研究与儿童的信息寻求对话。我们计划使用该机器人测试平台,研究在博物馆环境中,对话设计相对于一次性语音交互在澄清儿童信息需求方面的有效性。|code|0|
|MathMex: Search Engine for Math Definitions|Shea Durgin, James Gore, Behrooz Mansouri|Univ Southern Maine, Portland, ME 04103 USA|This paper introduces MathMex, an open-source search engine for math definitions. With MathMex, users can search for definitions of mathematical concepts extracted from a variety of data sources and types including text, images, and videos. Definitions are extracted using a fine-tuned SciBERT classifier, and the search is done with a fine-tuned Sentence-BERT model. MathMex interface provides means of issuing a text, formula, and combined queries and logging features.|本文介绍了MathMex,一个用于数学定义的开源搜索引擎。通过MathMex,用户可以搜索从多种数据源和类型(包括文本、图像和视频)中提取的数学概念定义。定义提取使用了一个经过微调的SciBERT分类器,而搜索则通过一个经过微调的Sentence-BERT模型完成。MathMex界面提供了文本查询、公式查询以及组合查询的功能,并具备日志记录功能。|code|0|
|XSearchKG: A Platform for Explainable Keyword Search over Knowledge Graphs|Leila Feddoul, Martin Birke, Sirko Schindler|Friedrich Schiller Univ Jena, Heinz Nixdorf Chair Distributed Informat Syst, Jena, Germany; German Aerosp Ctr DLR, Inst Data Sci, Jena, Germany|One of the most user-friendly methods to search over knowledge graphs is the usage of keyword queries. They offer a simple text input that requires no technical or domain knowledge. Most existing approaches for keyword search over graph-shaped data rely on graph traversal algorithms to find connections between keywords. They mostly concentrate on achieving efficiency and effectiveness (accurate ranking), but ignore usability, visualization, and interactive result presentation. All of which offer better support to non-experienced users. Moreover, it is not sufficient to just show a raw list of results, but it is also important to explain why a specific result is proposed. This not only provides an abstract view of the capabilities and limitations of the search system, but also increases confidence and helps discover new interesting facts. We propose XSearchKG, a platform for explainable keyword search over knowledge graphs that extends our previously proposed graph traversal-based approach and complements it with an interactive user interface for results explanation and browsing.|在知识图谱上进行搜索的最用户友好方法之一是使用关键字查询。这种方法提供了一个简单的文本输入,不需要任何技术或领域知识。大多数现有的基于图形数据的关键字搜索方法依赖于图遍历算法来找到关键字之间的连接。这些方法主要集中在实现效率和有效性(准确排名)上,但忽略了可用性、可视化和交互式结果展示,而这些都为非专业用户提供了更好的支持。此外,仅仅显示一个原始的结果列表是不够的,解释为什么提出某个特定结果也很重要。这不仅提供了搜索系统能力和局限性的抽象视图,还增加了信心,并有助于发现新的有趣事实。我们提出了XSearchKG,这是一个用于知识图谱上可解释关键字搜索的平台,它扩展了我们之前提出的基于图遍历的方法,并通过一个交互式用户界面补充了结果解释和浏览功能。|code|0|
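The graph-traversal core that keyword search systems like XSearchKG build on — finding connections between keyword-matched nodes — reduces to shortest-path search over the knowledge graph. A breadth-first sketch (toy graph and names are ours, not the system's data model):

```python
from collections import deque


def connect_keywords(graph, start, goal):
    """Breadth-first search for a shortest path linking two
    keyword-matched nodes in a knowledge graph, the basic operation
    behind keyword search over graph-shaped data. `graph` maps each
    node to its neighbours; returns the connecting path as a list of
    nodes, or None if the keywords are not connected. The returned
    path is also the raw material for an explanation of *why* a
    result was proposed.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


g = {
    "Jena": ["Germany", "DLR"],
    "Germany": ["Europe"],
    "DLR": ["Data Science"],
}
path = connect_keywords(g, "Jena", "Data Science")
```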
|Result Assessment Tool: Software to Support Studies Based on Data from Search Engines|Sebastian Sünkler, Nurce Yagci, Sebastian Schultheiß, Sonja von Mach, Dirk Lewandowski|Hamburg Univ Appl Sci, Dept Informat Media & Commun, Finkenau 35, D-22081 Hamburg, Germany|The Result Assessment Tool (RAT) is a software toolkit for conducting research with results from commercial search engines and other information retrieval (IR) systems. The software integrates modules for study design and management, automatic collection of search results via web scraping, and evaluation of search results in an assessment interface using different question types. RAT can be used for conducting a wide range of studies, including retrieval effectiveness studies, classification studies, and content analyses.|结果评估工具(RAT)是一款用于基于商业搜索引擎和其他信息检索(IR)系统结果进行研究的软件工具包。该软件集成了研究设计与管理模块、通过网页抓取自动收集搜索结果的模块,以及在使用不同问题类型的评估界面中对搜索结果进行评估的模块。RAT可用于开展多种研究,包括检索效果研究、分类研究和内容分析。|code|0|
|Translating Justice: A Cross-Lingual Information Retrieval System for Maltese Case Law Documents|Joel Azzopardi|Univ Malta, Fac ICT, Dept Artificial Intelligence, Msida, Malta|In jurisdictions adhering to the Common Law system, previous court judgements inform future rulings based on the Stare Decisis principle. For enhanced accessibility and retrieval of such judgements, we introduced a cross-lingual Legal Information Retrieval system prototype focused on Malta's small claims tribunal. This system utilises Neural Machine Translation (NMT) to automatically translate Maltese judgement documents into English, enabling dual-language querying. Additionally, it employs Rhetorical Role Labelling (RRL) on sentences within the judgements, allowing for targeted searches based on specific rhetorical roles. Developed without depending on high-end resources or commercial systems, this prototype showcases the potential of AI in advancing legal research tools and making legal documents more accessible, especially for non-native speakers.|在遵循普通法体系的司法管辖区中,先前的法院判决根据遵循先例原则(Stare Decisis)为未来的裁决提供参考。为了提高此类判决的可访问性和检索效率,我们引入了一个跨语言的法律信息检索系统原型,重点关注马耳他的小额索赔法庭。该系统利用神经机器翻译(Neural Machine Translation, NMT)技术,自动将马耳他语的判决文件翻译成英语,从而实现双语查询。此外,该系统还对判决中的句子进行修辞角色标注(Rhetorical Role Labelling, RRL),从而支持基于特定修辞角色的定向搜索。该原型系统在开发过程中未依赖高端资源或商业系统,展示了人工智能在推进法律研究工具和使法律文件更易于访问方面的潜力,尤其对非母语使用者而言。|code|0|
|Displaying Evolving Events Via Hierarchical Information Threads for Sensitivity Review|Hitarth Narvala, Graham McDonald, Iadh Ounis|Univ Glasgow, Glasgow, Lanark, Scotland|Many government documents contain sensitive (e.g. personal or confidential) information that must be protected before the documents can be released to the public. However, reviewing documents to identify sensitive information is a complex task, which often requires analysing multiple related documents that mention a particular context of sensitivity. For example, coherent information about evolving events, such as legal proceedings, is often dispersed across documents produced at different times. In this paper, we present a novel system for sensitivity review, which automatically identifies hierarchical information threads to capture diverse aspects of an event. In particular, our system aims to assist sensitivity reviewers in making accurate sensitivity judgements efficiently by presenting hierarchical information threads that provide coherent and chronological information about an event's evolution. Through a user study, we demonstrate our system's effectiveness in improving the sensitivity reviewers' reviewing speed and accuracy compared to the traditional document-by-document review process.|许多政府文件包含敏感(例如个人或机密)信息,这些信息在文件公开之前必须得到保护。然而,审查文件以识别敏感信息是一项复杂的任务,通常需要分析多个相关文件,这些文件可能涉及特定的敏感背景。例如,关于不断演变的事件(如法律诉讼)的连贯信息通常分散在不同时间生成的文件中。本文提出了一种新颖的敏感性审查系统,该系统能够自动识别层次化的信息线索,以捕捉事件的不同方面。特别是,我们的系统旨在通过提供关于事件演变的连贯且按时间顺序排列的层次化信息线索,帮助敏感性审查员高效地做出准确的敏感性判断。通过一项用户研究,我们证明了与传统逐文件审查流程相比,我们的系统在提高审查员的审查速度和准确性方面的有效性。|code|0|
|Analyzing Mathematical Content for Plagiarism and Recommendations|Ankit Satpute|Georg August Univ Gottingen, Gottingen, Germany|Defined as "the use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected" [6], plagiarism poses a severe concern in the rapidly increasing number of scientific publications.|抄袭被定义为“在预期原创性的环境中,未经适当承认来源而使用思想、概念、词语或结构以获取利益”[6],在迅速增加的科学出版物中,抄袭成为一个严重的问题。|code|0|
|Explainable Recommender Systems with Knowledge Graphs and Language Models|Giacomo Balloccu, Ludovico Boratto, Gianni Fenu, Francesca Maridina Malloci, Mirko Marras|Meta Platforms Inc, Menlo Pk, CA USA; Univ Potsdam, HPI, Potsdam, Germany; Rutgers State Univ, New Brunswick, NJ 08901 USA|To facilitate human decisions with credible suggestions, personalized recommender systems should have the ability to generate corresponding explanations while making recommendations. Knowledge graphs (KG), which contain comprehensive information about users and products, are widely used to enable this. By reasoning over a KG in a node-by-node manner, existing explainable models provide a KG-grounded path for each user-recommended item. Such paths serve as an explanation and reflect the historical behavior pattern of the user. However, not all items can be reached following the connections within the constructed KG under finite hops. Hence, previous approaches are constrained by a recall bias in terms of existing connectivity of KG structures. To overcome this, we propose a novel Path Language Modeling Recommendation (PLM-Rec) framework, learning a language model over KG paths consisting of entities and edges. Through path sequence decoding, PLM-Rec unifies recommendation and explanation in a single step and fulfills them simultaneously. As a result, PLM-Rec not only captures the user behaviors but also eliminates the restriction to pre-existing KG connections, thereby alleviating the aforementioned recall bias. Moreover, the proposed technique makes it possible to conduct explainable recommendation even when the KG is sparse or possesses a large number of relations. 
Experiments and extensive ablation studies on three Amazon e-commerce datasets demonstrate the effectiveness and explainability of the PLM-Rec framework.|为了通过可信的建议来辅助人类决策,个性化推荐系统应具备在生成推荐的同时提供相应解释的能力。知识图谱(KG)包含关于用户和产品的全面信息,因此被广泛用于实现这一目标。通过在知识图谱上以逐节点的方式进行推理,现有的可解释模型为每个用户推荐的项目提供了一条基于知识图谱的路径。这些路径作为解释,反映了用户的历史行为模式。然而,并非所有项目都能在有限跳数内通过构建的知识图谱中的连接到达。因此,先前的方法受到知识图谱结构现有连接性的召回偏差的限制。为了克服这一问题,我们提出了一种新颖的路径语言建模推荐(PLM-Rec)框架,该框架在由实体和边组成的知识图谱路径上学习语言模型。通过路径序列解码,PLM-Rec将推荐和解释统一在一个步骤中,并同时完成这两项任务。因此,PLM-Rec不仅能够捕捉用户行为,还能消除对预先存在知识图谱连接的限制,从而缓解上述的召回偏差。此外,所提出的技术使得即使在知识图谱稀疏或具有大量关系的情况下,也能进行可解释的推荐。在三个亚马逊电子商务数据集上的实验和广泛的消融研究证明了PLM-Rec框架的有效性和可解释性。|code|0|
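PLM-Rec trains a language model over KG paths consisting of entities and edges; the first step of any such approach is linearizing a path into a token sequence. A minimal sketch — the special tokens `[U]`/`[R]`/`[E]` and the function name are our illustrative choices, not the paper's vocabulary:

```python
def linearize_path(path):
    """Flatten an alternating (entity, relation, entity, ...) KG path
    into a single token sequence a language model can be trained on.
    Decoding such a sequence yields the recommended item (the final
    entity) and its explanation (the path) in one step, and is not
    limited to edges that pre-exist in the constructed KG.
    """
    tokens = ["[U]", path[0]]
    rest = path[1:]
    for i in range(0, len(rest), 2):
        relation, entity = rest[i], rest[i + 1]
        tokens += ["[R]", relation, "[E]", entity]
    return " ".join(tokens)


seq = linearize_path(["user_42", "purchased", "camera", "also_bought", "tripod"])
```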
|Recent Advances in Generative Information Retrieval|Yubao Tang, Ruqing Zhang, Zhaochun Ren, Jiafeng Guo, Maarten de Rijke|Leiden Univ, Leiden, Netherlands; Univ Chinese Acad Sci, CAS Key Lab Network Data Sci & Technol, ICT, CAS, Beijing, Peoples R China; Univ Amsterdam, Amsterdam, Netherlands|Generative retrieval (GR) has become a highly active area of information retrieval (IR) that has witnessed significant growth recently. Compared to the traditional “index-retrieve-then-rank” pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to directly map a query to its relevant document identifiers (i.e., docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications. We start by providing preliminary information covering foundational aspects and problem formulations of GR. Then, our focus shifts towards recent progress in docid design, training approaches, inference strategies, and the applications of GR. We end by outlining remaining challenges and issuing a call for future GR research. This tutorial is intended to be beneficial to both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.|生成式检索(Generative Retrieval, GR)已成为信息检索(IR)领域中一个高度活跃的研究方向,近年来取得了显著的发展。与传统的“索引-检索-排序”流程相比,GR范式旨在将所有语料库中的信息整合到一个单一模型中。通常,训练一个序列到序列(sequence-to-sequence)模型,以直接将查询映射到其相关的文档标识符(即docids)。本教程介绍了GR范式的核心概念,并全面概述了其基础和应用方面的最新进展。我们首先提供了涵盖GR基础知识和问题表述的初步信息。然后,重点转向了docid设计、训练方法、推理策略以及GR应用方面的最新进展。最后,我们概述了当前面临的挑战,并呼吁未来对GR研究的进一步探索。本教程旨在为有兴趣开发新型GR解决方案或将其应用于实际场景的研究人员和行业从业者提供有益的参考。|code|0|
|Affective Computing for Social Good Applications: Current Advances, Gaps and Opportunities in Conversational Setting|Priyanshu Priya, Mauajama Firdaus, Gopendra Vikram Singh, Asif Ekbal|Indian Inst Technol Patna, Dayalpur Daulatpur, India; Univ Alberta, Edmonton, AB, Canada|Affective computing involves examining and advancing systems and devices capable of identifying, comprehending, processing, and emulating human emotions, sentiment, politeness and personality characteristics. This is an ever-expanding multidisciplinary domain that investigates how technology can contribute to the comprehension of human affect, how affect can influence interactions between humans and machines, how systems can be engineered to harness affect for enhanced capabilities, and how integrating affective strategies can revolutionize interactions between humans and machines. Recognizing the fact that affective computing encompasses disciplines such as computer science, psychology, and cognitive science, this tutorial aims to delve into the historical underpinnings and overarching objectives of affective computing, explore various approaches for affect detection and generation, its practical applications across diverse areas, including but not limited to social good (like persuasion, therapy and support, etc.), address ethical concerns, and outline potential future directions.|情感计算涉及研究和开发能够识别、理解、处理和模拟人类情感、情绪、礼貌及个性特征的系统与设备。这是一个不断扩展的多学科领域,研究技术如何有助于理解人类情感、情感如何影响人机交互、如何设计系统以利用情感来增强能力,以及如何通过整合情感策略来革新人机交互。鉴于情感计算涵盖了计算机科学、心理学和认知科学等学科,本教程旨在深入探讨情感计算的历史基础和总体目标,探索情感检测和生成的各种方法,其在多个领域的实际应用(包括但不限于社会公益领域,如说服、治疗和支持等),讨论伦理问题,并概述未来可能的发展方向。|code|0|
|Query Performance Prediction: From Fundamentals to Advanced Techniques|Negar Arabzadeh, Chuan Meng, Mohammad Aliannejadi, Ebrahim Bagheri|Univ Amsterdam, Amsterdam, Netherlands; Univ Waterloo, Waterloo, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Query performance prediction (QPP) is a core task in information retrieval (IR) that aims at predicting the retrieval quality for a given query without relevance judgments. QPP has been investigated for decades and has witnessed a surge in research activity in recent years; QPP has been shown to benefit various aspects, e.g., improving retrieval effectiveness by selecting the most effective ranking function per query [5, 7]. Despite its importance, there is no recent tutorial to provide a comprehensive overview of QPP techniques in the era of pre-trained/large language models or in the scenario of emerging conversational search (CS); In this tutorial, we have three main objectives. First, we aim to disseminate the latest advancements in QPP to the IR community. Second, we go beyond investigating QPP in ad-hoc search and cover QPP for CS. Third, the tutorial offers a unique opportunity to bridge the gap between theory and practice; we aim to equip participants with the essential skills and insights needed to navigate the evolving landscape of QPP, ultimately benefiting both researchers and practitioners in the field of IR and encouraging them to work around the future avenues on QPP.|查询性能预测(Query Performance Prediction, QPP)是信息检索(Information Retrieval, IR)中的一项核心任务,旨在在没有相关性判断的情况下预测给定查询的检索质量。QPP已经被研究了数十年,并且近年来研究活动显著增加;QPP已被证明在多个方面具有重要价值,例如通过为每个查询选择最有效的排序函数来提高检索效果[5, 7]。尽管QPP的重要性不言而喻,但在预训练/大语言模型时代或新兴的对话式搜索(Conversational Search, CS)场景中,目前还没有最新的教程对QPP技术进行全面概述。在本教程中,我们有三个主要目标。首先,我们旨在向IR社区传播QPP的最新进展。其次,我们不仅研究QPP在临时搜索中的应用,还涵盖了QPP在CS中的应用。第三,本教程为弥合理论与实践之间的差距提供了一个独特的机会;我们旨在为参与者提供应对QPP不断演变的技术格局所需的基本技能和洞察力,最终使IR领域的研究人员和从业人员受益,并鼓励他们在QPP的未来发展方向上进行探索。|code|0|
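To make the QPP task concrete: a classic post-retrieval predictor is Normalized Query Commitment (NQC), which estimates query difficulty from the dispersion of the top-k retrieval scores, with no relevance judgments needed. The sketch below simplifies Shtok et al.'s formulation; the example scores are invented:

```python
import math


def nqc(topk_scores, corpus_score):
    """Normalized Query Commitment: the standard deviation of the
    top-k retrieval scores, normalized by a corpus-level score.
    High dispersion among top scores (a few documents standing out)
    tends to correlate with better retrieval effectiveness; a flat
    score distribution signals a harder query.
    """
    k = len(topk_scores)
    mean = sum(topk_scores) / k
    var = sum((s - mean) ** 2 for s in topk_scores) / k
    return math.sqrt(var) / abs(corpus_score)


# A "peaked" score list vs. a flat one, same corpus score.
easy = nqc([9.1, 8.7, 4.2, 3.9, 3.5], corpus_score=2.0)
hard = nqc([5.1, 5.0, 4.9, 4.9, 4.8], corpus_score=2.0)
```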
|Fairness Through Domain Awareness: Mitigating Popularity Bias for Music Discovery|Rebecca Salganik, Fernando Diaz, Golnoosh Farnadi||As online music platforms grow, music recommender systems play a vital role in helping users navigate and discover content within their vast musical databases. At odds with this larger goal, is the presence of popularity bias, which causes algorithmic systems to favor mainstream content over, potentially more relevant, but niche items. In this work we explore the intrinsic relationship between music discovery and popularity bias. To mitigate this issue we propose a domain-aware, individual fairness-based approach which addresses popularity bias in graph neural network (GNNs) based recommender systems. Our approach uses individual fairness to reflect a ground truth listening experience, i.e., if two songs sound similar, this similarity should be reflected in their representations. In doing so, we facilitate meaningful music discovery that is robust to popularity bias and grounded in the music domain. We apply our BOOST methodology to two discovery based tasks, performing recommendations at both the playlist level and user level. Then, we ground our evaluation in the cold start setting, showing that our approach outperforms existing fairness benchmarks in both performance and recommendation of lesser-known content. 
Finally, our analysis explains why our proposed methodology is a novel and promising approach to mitigating popularity bias and improving the discovery of new and niche content in music recommender systems.|随着在线音乐平台的发展,音乐推荐系统在帮助用户浏览和发现其庞大音乐数据库中的内容方面发挥着至关重要的作用。与这一目标相悖的是流行度偏差的存在,它使算法系统偏向主流内容,而非可能更相关但较为小众的条目。在这项工作中,我们探讨了音乐发现与流行度偏差之间的内在关系。为了缓解这一问题,我们提出了一种领域感知的、基于个体公平性的方法,用于解决基于图神经网络(GNN)的推荐系统中的流行度偏差问题。我们的方法用个体公平性来反映真实的聆听体验:如果两首歌听起来相似,这种相似性就应当体现在它们的表示中。借此,我们促成了对流行度偏差具有鲁棒性、且立足于音乐领域本身的有意义的音乐发现。我们将 BOOST 方法应用于两个基于发现的任务,分别在歌单级别和用户级别进行推荐。随后,我们在冷启动场景下进行评估,结果表明我们的方法在性能和对较不知名内容的推荐上均优于现有的公平性基准。最后,我们的分析解释了为什么所提出的方法是缓解流行度偏差、改进音乐推荐系统中新内容和小众内容发现的一种新颖且有前景的途径。|code|0|
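The individual-fairness principle stated above — "if two songs sound similar, this similarity should be reflected in their representations" — suggests a pairwise regularizer. The concrete penalty below is our simplified stand-in for BOOST's objective, using plain lists and cosine similarity for illustration:

```python
import math


def individual_fairness_penalty(embeddings, similarity):
    """For every pair of songs, penalize the squared gap between their
    ground-truth (audio-based) similarity and the cosine similarity of
    their learned embeddings, so songs that sound alike stay close in
    representation space regardless of popularity.

    `similarity[i][j]` is a score in [0, 1]; `embeddings` are plain
    Python lists standing in for learned GNN representations.
    """

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    penalty, n = 0.0, len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            gap = similarity[i][j] - cosine(embeddings[i], embeddings[j])
            penalty += gap ** 2
    return penalty


# Embeddings that agree with the audio similarity incur no penalty ...
aligned = individual_fairness_penalty([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# ... identical embeddings for dissimilar-sounding songs are penalized.
misaligned = individual_fairness_penalty([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```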
|Countering Mainstream Bias via End-to-End Adaptive Local Learning|Jinhao Pan, Ziwei Zhu, Jianling Wang, Allen Lin, James Caverlee||Collaborative filtering (CF) based recommendations suffer from mainstream bias – where mainstream users are favored over niche users, leading to poor recommendation quality for many long-tail users. In this paper, we identify two root causes of this mainstream bias: (i) discrepancy modeling, whereby CF algorithms focus on modeling mainstream users while neglecting niche users with unique preferences; and (ii) unsynchronized learning, where niche users require more training epochs than mainstream users to reach peak performance. Targeting these causes, we propose a novel end-To-end Adaptive Local Learning (TALL) framework to provide high-quality recommendations to both mainstream and niche users. TALL uses a loss-driven Mixture-of-Experts module to adaptively ensemble experts to provide customized local models for different users. Further, it contains an adaptive weight module to synchronize the learning paces of different users by dynamically adjusting weights in the loss. Extensive experiments demonstrate the state-of-the-art performance of the proposed model. Code and data are provided at https://github.com/JP-25/end-To-end-Adaptive-Local-Leanring-TALL-|基于协同过滤(CF)的推荐受到主流偏差的影响:主流用户比小众用户更受青睐,导致许多长尾用户的推荐质量较差。在本文中,我们找出了这种主流偏差的两个根本原因:(i)差异建模,即 CF 算法侧重于为主流用户建模,而忽视具有独特偏好的小众用户;(ii)学习不同步,即小众用户需要比主流用户更多的训练轮次才能达到最佳性能。针对这些原因,我们提出了一个新颖的端到端自适应局部学习(TALL)框架,为主流用户和小众用户同时提供高质量的推荐。TALL 使用一个由损失驱动的专家混合(Mixture-of-Experts)模块来自适应地集成专家,为不同用户提供定制的局部模型。此外,它还包含一个自适应权重模块,通过动态调整损失中的权重来同步不同用户的学习步伐。大量实验证明了该模型达到了最先进的性能。代码和数据见 https://github.com/JP-25/end-To-end-Adaptive-Local-Leanring-TALL-|code|0|
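The adaptive-weight idea — letting slower-converging niche users count for more in the loss — can be sketched as loss-proportional reweighting. The softmax-style rule below is our simplification for illustration, not TALL's actual adaptive weight module:

```python
import math


def adaptive_loss(user_losses, temperature=1.0):
    """Combine per-user losses with weights that grow with the loss
    itself, so users whose local models lag behind (typically niche
    users) receive proportionally larger gradient signal, nudging all
    users toward peak performance on a synchronized schedule.
    Returns (weighted total loss, per-user weights).
    """
    exps = [math.exp(l / temperature) for l in user_losses]
    z = sum(exps)
    weights = [e / z for e in exps]
    total = sum(w * l for w, l in zip(weights, user_losses))
    return total, weights


# A well-fit mainstream user (loss 0.2) vs. a lagging niche user (loss 1.5).
total, weights = adaptive_loss([0.2, 1.5])
```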
|BioASQ at CLEF2024: The Twelfth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge|Anastasios Nentidis, Anastasia Krithara, Georgios Paliouras, Martin Krallinger, Luis Gascó Sánchez, Salvador LimaLópez, Eulàlia Farré, Natalia V. Loukachevitch, Vera Davydova, Elena Tutubalina|Sber AI, Moscow, Russia; Moscow MV Lomonosov State Univ, Moscow, Russia; Natl Ctr Sci Res Demokritos, Athens, Greece; Barcelona Supercomp Ctr, Barcelona, Spain|The large-scale biomedical semantic indexing and question-answering challenge (BioASQ) aims at the continuous advancement of methods and tools to meet the needs of biomedical researchers and practitioners for efficient and precise access to the ever-increasing resources of their domain. With this purpose, during the last eleven years, a series of annual challenges have been organized with specific shared tasks on large-scale biomedical semantic indexing and question answering. Benchmark datasets have been concomitantly provided in alignment with the real needs of biomedical experts, providing a unique common testbed where different teams around the world can investigate and compare new approaches for accessing biomedical knowledge. The twelfth version of the BioASQ Challenge will be held as an evaluation Lab within CLEF2024 providing four shared tasks: (i) Task b on the information retrieval for biomedical questions, and the generation of comprehensible answers. (ii) Task Synergy the information retrieval and generation of answers for open biomedical questions on developing topics, in collaboration with the experts posing the questions. (iii) Task MultiCardioNER on the automated annotation of clinical entities in medical documents in the field of cardiology, primarily in Spanish, English, Italian and Dutch. (iv) Task BioNNE on the automated annotation of biomedical documents in Russian and English with nested named entity annotations. 
As BioASQ rewards the methods that outperform the state of the art in these shared tasks, it pushes the research frontier towards approaches that accelerate access to biomedical knowledge.|大规模生物医学语义索引与问答挑战赛(BioASQ)旨在持续推动方法和工具的发展,以满足生物医学研究人员和实践者对高效、精确获取其领域日益增长资源的需求。为此,在过去十一年中,已组织了一系列年度挑战赛,专注于大规模生物医学语义索引和问答的特定共享任务。与此同时,根据生物医学专家的实际需求提供了基准数据集,为全球不同团队提供了一个独特的共同测试平台,用于研究和比较获取生物医学知识的新方法。第十二届BioASQ挑战赛将作为CLEF2024评估实验室的一部分举办,提供四个共享任务:(i) 任务b,针对生物医学问题的信息检索及生成可理解的答案。(ii) 任务Synergy,与提出问题的专家合作,针对发展中的开放生物医学问题进行信息检索和答案生成。(iii) 任务MultiCardioNER,专注于心脏病学领域医疗文档中临床实体的自动标注,主要涉及西班牙语、英语、意大利语和荷兰语。(iv) 任务BioNNE,针对俄语和英语生物医学文档的自动标注,包含嵌套命名实体标注。由于BioASQ奖励在这些共享任务中超越现有技术水平的方法,它推动了研究前沿向着加速获取生物医学知识的方向发展。|code|0|
|ProMap: Product Mapping Datasets|Katerina Macková, Martin Pilát|Charles Univ Prague, Fac Math & Phys, Malostranske Namesti 25, Prague 11800 1, Czech Republic|The goal of product mapping is to decide, whether two listings from two different e-shops describe the same products. Existing datasets of matching and non-matching pairs of products, however, often suffer from incomplete product information or contain only very distant non-matching products. In this paper, we introduce two new datasets for product mapping: ProMapCz consisting of 1,495 Czech product pairs and ProMapEn consisting of 1,555 English product pairs of matching and non-matching products manually scraped from two pairs of e-shops. The datasets contain both images and textual descriptions of the products, including their specifications, making them one of the most complete datasets for product mapping. Additionally, we divide the non-matching products into two different categories – close non-matches and medium non-matches, based on how similar the products are to each other. Even the medium non-matches are, however, pairs of products that are much more similar than non-matches in other datasets – for example, they still need to have the same brand and similar name and price. Finally, we train a number of product matching models on these datasets to demonstrate the advantages of having these two types of non-matches for the analysis of these models.|产品映射的目标是判断来自两个不同电商平台的商品列表是否描述的是同一产品。然而,现有的匹配和不匹配产品对数据集往往存在产品信息不完整或仅包含非常不相似的不匹配产品的问题。在本文中,我们引入了两个新的产品映射数据集:ProMapCz 包含 1,495 对捷克产品对,ProMapEn 包含 1,555 对英语产品对,这些产品对是从两对电商平台手动抓取的匹配和不匹配产品。这些数据集包含产品的图像和文本描述,包括其规格,使其成为最完整的产品映射数据集之一。此外,我们根据产品之间的相似程度,将不匹配产品分为两个不同的类别——接近不匹配和中等不匹配。然而,即使是中等不匹配的产品对,也比其他数据集中的不匹配产品对更为相似——例如,它们仍然需要具有相同的品牌以及相似的名称和价格。最后,我们在这些数据集上训练了多个产品匹配模型,以展示这两种不匹配类型对这些模型分析的优势。|code|0|
|Eliminating Contextual Bias in Aspect-Based Sentiment Analysis|Ruize An, Chen Zhang, Dawei Song|Beijing Inst Technol, Beijing, Peoples R China|Pretrained language models (LMs) have made remarkable achievements in aspect-based sentiment analysis (ABSA). However, it is discovered that these models may struggle in some particular cases (e.g., to detect sentiments expressed towards targeted aspects with only implicit or adversarial expressions). Since it is hard for models to align implicit or adversarial expressions with their corresponding aspects, the sentiments of the targeted aspects would largely be impacted by the expressions towards other aspects in the sentence. We name this phenomenon as contextual bias. To tackle the problem, we propose a flexible aspect-oriented debiasing method (Arde) to eliminate the harmful contextual bias without the need of adjusting the underlying LMs. Intuitively, Arde calibrates the prediction towards the targeted aspect by subtracting the bias towards the context. Favorably, Arde can get theoretical support from counterfactual reasoning theory. Experiments are conducted on SemEval benchmark, and the results show that Arde can empirically improve the accuracy on contextually biased aspect sentiments without degrading the accuracy on unbiased ones. Driven by recent success of large language models (LLMs, e.g., ChatGPT), we further uncover that even LLMs can fail to address certain contextual bias, which yet can be effectively tackled by Arde.|预训练语言模型(LMs)在基于方面的情感分析(ABSA)中取得了显著成就。然而,研究发现这些模型在某些特定情况下可能会遇到困难(例如,检测仅通过隐含或对抗性表达针对目标方面的情感)。由于模型难以将隐含或对抗性表达与其对应的方面对齐,因此目标方面的情感很大程度上会受到句子中其他方面表达的影响。我们将这种现象称为上下文偏差。为了解决这一问题,我们提出了一种灵活的面向方面的去偏方法(Arde),以消除有害的上下文偏差,而无需调整底层LMs。直观上,Arde通过减去对上下文的偏差来校准对目标方面的预测。令人欣慰的是,Arde可以从反事实推理理论中获得理论支持。我们在SemEval基准上进行了实验,结果表明,Arde能够在经验上提高对具有上下文偏差的方面情感的准确性,而不会降低对无偏差方面情感的准确性。受大型语言模型(LLMs,例如ChatGPT)近期成功的推动,我们进一步发现,即使是LLMs也可能无法解决某些上下文偏差,而Arde却能有效应对这一问题。|code|0|
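Arde's calibration — "subtracting the bias towards the context" from the prediction for the targeted aspect — admits a compact sketch. The logit-space subtraction, the strength `lam`, and the idea of obtaining context-only logits from an aspect-masked pass are our assumptions about the general counterfactual recipe, not the paper's exact procedure:

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def debiased_prediction(full_logits, context_logits, lam=1.0):
    """Counterfactual calibration: subtract the model's context-only
    logits (e.g. from the sentence with the target aspect masked) from
    its full logits, removing sentiment leaked from other aspects,
    then renormalize. Requires no adjustment of the underlying LM.
    """
    adjusted = [f - lam * c for f, c in zip(full_logits, context_logits)]
    return softmax(adjusted)


# Index 0 = positive, 1 = negative. The full input looks negative only
# because a *different* aspect is panned; the context-only pass
# captures (and removes) that leaked negativity.
probs = debiased_prediction(full_logits=[1.0, 2.0], context_logits=[0.0, 1.8])
```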
|A Streaming Approach to Neural Team Formation Training|Hossein Fani, Reza Barzegar, Arman Dashti, Mahdis Saeedi|Univ Windsor, Windsor, ON, Canada|Predicting future successful teams of experts who can effectively collaborate is challenging due to the experts' temporality of skill sets, levels of expertise, and collaboration ties, which is overlooked by prior work. Specifically, state-of-the-art neural-based methods learn vector representations of experts and skills in a static latent space, falling short of incorporating the possible drift and variability of experts' skills and collaboration ties in time. In this paper, we propose (1) a streaming-based training strategy for neural models to capture the evolution of experts' skills and collaboration ties over time and (2) to consume time information as an additional signal to the model for predicting future successful teams. We empirically benchmark our proposed method against state-of-the-art neural team formation methods and a strong temporal recommender system on datasets from varying domains with distinct distributions of skills and experts in teams. The results demonstrate neural models that utilize our proposed training strategy excel at efficacy in terms of classification and information retrieval metrics. The codebase is available at https://github.com/fani-lab/OpeNTF/tree/ecir24 .|预测未来能够有效协作的成功专家团队具有挑战性,因为专家的技能集、专业水平和协作关系的时效性往往被先前的研究所忽视。具体而言,现有的基于神经网络的先进方法在静态潜在空间中学习专家和技能的向量表示,未能充分考虑专家技能和协作关系随时间可能发生的漂移和变化。在本文中,我们提出了(1)一种基于流式训练的神经网络模型策略,以捕捉专家技能和协作关系随时间的演变;(2)将时间信息作为模型的额外输入信号,用于预测未来的成功团队。我们在多个领域的团队数据集上,针对不同的技能和专家分布,将我们提出的方法与最先进的神经团队形成方法和一个强大的时序推荐系统进行了实证对比。结果表明,采用我们提出的训练策略的神经网络模型在分类和信息检索指标上表现出色。代码库可在以下网址获取:https://github.com/fani-lab/OpeNTF/tree/ecir24。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=A+Streaming+Approach+to+Neural+Team+Formation+Training)|0|
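The streaming training strategy in the abstract has two parts: train on time-ordered buckets instead of one shuffled static set, and feed the time index to the model as an extra signal. A schematic sketch (hypothetical interface; bucket granularity and the form of `train_step` are assumptions):

```python
def streaming_train(model, time_buckets, train_step):
    """Sketch of streaming-based training: iterate buckets chronologically
    so later updates see drifted skills/collaboration ties, and pass the
    time index t to each step as an additional input signal."""
    for t, teams in enumerate(time_buckets):        # chronological order
        for skills, experts in teams:
            train_step(model, skills, experts, t)   # t = temporal signal
    return model

# Toy run that only records the temporal signal each step receives.
log = []
buckets = [[(["py"], ["a"])], [(["ml"], ["b"]), (["ir"], ["c"])]]
streaming_train(None, buckets, lambda m, s, e, t: log.append(t))
```

The point of the design is that the optimiser never mixes examples across time, so the latest parameters reflect the most recent distribution when predicting future teams.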
|A Second Look on BASS - Boosting Abstractive Summarization with Unified Semantic Graphs - A Replication Study|Osman Alperen Koras, Jörg Schlötterer, Christin Seifert||We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.|我们对BASS框架进行了详细的复现研究,BASS是一个基于统一语义图(Unified Semantic Graphs)概念的抽象式摘要系统。我们的研究涵盖了复现关键组件时遇到的挑战,并通过一项消融研究系统地分离源于复现新组件的误差来源。我们的发现揭示了与原始工作相比的性能差异。我们强调,在复现BASS这类高级框架时,即使是被合理省略的细节也需要仔细关注,并总结了撰写可复现论文的关键实践。|code|0|
|Absolute Variation Distance: An Inversion Attack Evaluation Metric for Federated Learning|Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia|JPMorgan Chase, Global Technol Appl Res, New York, NY 10017 USA|Federated Learning (FL) has emerged as a pivotal approach for training models on decentralized data sources by sharing only model gradients. However, the shared gradients in FL are susceptible to inversion attacks which can expose sensitive information. While several defense and attack strategies have been proposed, their effectiveness is often evaluated using metrics that may not necessarily reflect the success rate of an attack or information retrieval, especially in the context of multidimensional data such as images. Traditional metrics like the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE) are typically used as lightweight metrics, assume only pixel-wise comparison, but fail to consider the semantic context of the recovered data. This paper introduces the Absolute Variation Distance (AVD), a lightweight metric derived from total variation, to assess data recovery and information leakage in FL. Unlike traditional metrics, AVD offers a continuous measure for extracting information in noisy images and aligns closely with human perception. Our results combined with a user experience survey demonstrate that AVD provides a more accurate and consistent measure of data recovery. It also matches the accuracy of the more costly and complex Neural Network based metric, the Learned Perceptual Image Patch Similarity (LPIPS). Hence it offers an effective tool for automatic evaluation of data security in FL and a reliable way of studying defence and inversion attacks strategies in FL.|联邦学习(Federated Learning, FL)作为一种关键方法,通过仅共享模型梯度来在分散的数据源上训练模型。然而,FL中共享的梯度容易受到反演攻击,从而可能暴露敏感信息。尽管已经提出了多种防御和攻击策略,但其有效性通常使用可能无法准确反映攻击成功率或信息检索效果的指标进行评估,尤其是在处理如图像等多维数据时。传统的指标如结构相似性指数(SSIM)、峰值信噪比(PSNR)和均方误差(MSE)通常被用作轻量级指标,仅假设像素级别的比较,但未能考虑恢复数据的语义上下文。本文引入了绝对变差距离(Absolute Variation Distance, AVD),这是一种基于总变差的轻量级指标,用于评估FL中的数据恢复和信息泄露。与传统指标不同,AVD为在噪声图像中提取信息提供了连续的度量,并且与人类感知高度一致。我们的研究结果结合用户体验调查表明,AVD提供了更准确和一致的数据恢复度量。同时,它与更昂贵且复杂的基于神经网络的度量——学习感知图像块相似性(LPIPS)的准确性相当。因此,AVD为FL中的数据安全自动评估提供了有效工具,并为研究FL中的防御和反演攻击策略提供了可靠的方法。|code|0|
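The abstract says AVD is "derived from total variation". The paper's exact definition is not reproduced here, but a plausible total-variation-style sketch (hypothetical formula: compare the spatial gradients of the recovered and original images rather than raw pixels) looks like:

```python
import numpy as np

def spatial_grads(img: np.ndarray):
    # Forward differences along each spatial axis (total-variation style).
    return np.diff(img, axis=0), np.diff(img, axis=1)

def avd(recovered: np.ndarray, original: np.ndarray) -> float:
    """Hypothetical Absolute Variation Distance: mean absolute difference
    between the two images' spatial gradients. Gradient-space comparison
    is what distinguishes this family of metrics from pixel-wise MSE."""
    rx, ry = spatial_grads(recovered)
    ox, oy = spatial_grads(original)
    return float(np.abs(rx - ox).mean() + np.abs(ry - oy).mean())
```

Under this sketch an exact reconstruction scores 0, and a noise-corrupted reconstruction scores strictly higher, giving the continuous recovery measure the abstract describes.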
|Experiments in News Bias Detection with Pre-trained Neural Transformers|Tim Menzner, Jochen L. Leidner|Coburg Univ Appl Sci, Informat Access Res Grp, Friedrich Streib Str 2, D-96459 Coburg, Germany|The World Wide Web provides unrivalled access to information globally, including factual news reporting and commentary. However, state actors and commercial players increasingly spread biased (distorted) or fake (non-factual) information to promote their agendas. We compare several large, pre-trained language models on the task of sentence-level news bias detection and sub-type classification, providing quantitative and qualitative results. Our findings are to be seen as part of a wider effort towards realizing the conceptual vision, articulated by Fuhr et al. [10], of a "nutrition label" for online content for the social good.|万维网为全球信息获取提供了无与伦比的便利,包括事实新闻报道和评论。然而,国家行为体和商业参与者越来越多地传播带有偏见(扭曲)或虚假(非事实)的信息,以推动其议程。我们比较了几种大型预训练语言模型在句子级新闻偏见检测及其子类型分类任务上的表现,并提供了定量和定性的结果。我们的研究结果应被视为实现Fuhr等人[10]所阐述的“营养标签”概念愿景的一部分,该愿景旨在为社会公益提供在线内容的透明度和可信度评估。|code|0|
|A Transformer-Based Object-Centric Approach for Date Estimation of Historical Photographs|Francesc Net, Núria Hernández, Adrià Molina, Lluís Gómez|Univ Autonoma Barcelona, Comp Vis Ctr, Catalunya, Spain|The accurate estimation of the creation date of cultural heritage photographic assets is a challenging and complex task, typically requiring the expertise of qualified archivists, with significant implications for archival and preservation purposes. This paper introduces a new dataset for image date estimation, which complements existing datasets, thus creating a more balanced and realistic training set for deep learning models. On this dataset, we present a set of modern strong baselines that outperform previous state-of-the-art methods for this task. Additionally, we propose a novel approach that leverages “dating indicators” or “dating clues” through object detection and a self-attention based Transformer encoder. Our experiments demonstrate that the proposed approach has promising applicability in real scenarios and that incorporating “dating indicators” through object detection can improve the performance of image date estimation models. The dataset and code of our models are publicly available at https://github.com/cesc47/DEXPERT .|文化遗产摄影资料的准确创建日期估计是一项具有挑战性且复杂的任务,通常需要合格档案管理员的专业知识,对于档案保存具有重要意义。本文引入了一个新的图像日期估计数据集,该数据集补充了现有数据集,从而为深度学习模型创建了一个更加平衡和现实的训练集。在该数据集上,我们提出了一组现代强基线模型,这些模型在此任务上优于以往的最先进方法。此外,我们提出了一种新颖的方法,通过物体检测和基于自注意力机制的Transformer编码器来利用“年代指示器”或“年代线索”。我们的实验表明,所提出的方法在实际场景中具有良好的适用性,并且通过物体检测引入“年代指示器”可以提高图像日期估计模型的性能。我们的模型的数据集和代码公开在https://github.com/cesc47/DEXPERT。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=A+Transformer-Based+Object-Centric+Approach+for+Date+Estimation+of+Historical+Photographs)|0|
|Bias Detection and Mitigation in Textual Data: A Study on Fake News and Hate Speech Detection|Apostolos Kasampalis, Despoina Chatzakou, Theodora Tsikrika, Stefanos Vrochidis, Ioannis Kompatsiaris|Ctr Res & Technol Hellas, Informat Technol Inst, Thessaloniki, Greece|Addressing bias in NLP-based solutions is crucial to promoting fairness, avoiding discrimination, building trust, upholding ethical standards, and ultimately improving their performance and reliability. On the topic of bias detection and mitigation in textual data, this work examines the effect of different bias detection models along with standard debiasing methods on the effectiveness of fake news and hate speech detection tasks. Extensive discussion of the results draws useful conclusions, highlighting the inherent difficulties in effectively managing bias.|在基于自然语言处理(NLP)的解决方案中,解决偏见问题对于促进公平性、避免歧视、建立信任、维护道德标准以及最终提高其性能和可靠性至关重要。本文围绕文本数据中的偏见检测与缓解这一主题,探讨了不同偏见检测模型以及标准去偏方法对虚假新闻和仇恨言论检测任务效果的影响。通过对结果的广泛讨论,得出了有益的结论,突显了在有效管理偏见方面所固有的困难。|code|0|
|DQNC2S: DQN-Based Cross-Stream Crisis Event Summarizer|Daniele Rege Cambrin, Luca Cagliero, Paolo Garza||Summarizing multiple disaster-relevant data streams simultaneously is particularly challenging as existing Retrieve&Re-ranking strategies suffer from the inherent redundancy of multi-stream data and limited scalability in a multi-query setting. This work proposes an online approach to crisis timeline generation based on weak annotation with Deep Q-Networks. It selects on-the-fly the relevant pieces of text without requiring neither human annotations nor content re-ranking. This makes the inference time independent of the number of input queries. The proposed approach also incorporates a redundancy filter into the reward function to effectively handle cross-stream content overlaps. The achieved ROUGE and BERTScore results are superior to those of best-performing models on the CrisisFACTS 2022 benchmark.|同时对多个与灾难相关的数据流进行摘要尤其具有挑战性,因为现有的"检索-重排序"策略既受多流数据固有冗余的影响,又在多查询场景下可扩展性有限。本文提出了一种基于深度Q网络弱标注的危机时间线在线生成方法。该方法动态选择相关文本片段,既不需要人工标注,也不需要内容重排序,从而使推理时间与输入查询数量无关。该方法还在奖励函数中引入了冗余过滤器,以有效处理跨流内容重叠。其ROUGE和BERTScore结果优于在CrisisFACTS 2022基准上表现最佳的模型。|code|0|
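The abstract's "redundancy filter in the reward function" can be sketched as follows (hypothetical: the similarity measure, threshold, and penalty value are assumptions; the paper's actual reward shaping may differ):

```python
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def redundancy_aware_reward(cand, selected, relevance: float,
                            thr: float = 0.8, penalty: float = 1.0) -> float:
    """Reward sketch for a DQN snippet-selection agent: a relevant snippet
    earns its relevance score, but is penalised if it overlaps too strongly
    with anything already selected (cross-stream redundancy)."""
    if any(cosine(cand, s) >= thr for s in selected):
        return relevance - penalty
    return relevance
```

Folding the filter into the reward (rather than post-hoc re-ranking) is what keeps selection online: each snippet is accepted or rejected as it arrives.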
|QuantPlorer: Exploration of Quantities in Text|Satya Almasian, Alexander Kosnac, Michael Gertz|Heidelberg Univ, Heidelberg, Germany|Quantities play an important role in documents of various domains such as finance, business, and medicine. Despite the role of quantities, only a limited number of works focus on their extraction from text and even less on creating respective user-friendly document exploration frameworks. In this work, we introduce QuantPlorer, an online quantity extractor and explorer. Through an intuitive web interface, QuantExplorer extracts quantities from unstructured text, enables users to interactively investigate and visualize quantities in text, and it supports filtering based on diverse features, i.e., value ranges, units, trends, and concepts. Furthermore, users can explore and visualize distributions of values for specific units and concepts. Our demonstration is available at https://quantplorer.ifi.uni-heidelberg.de/ .|在各种领域的文档中,如金融、商业和医学,数量扮演着重要角色。尽管数量在文档中具有重要作用,但只有有限的研究工作专注于从文本中提取数量,而创建相应的用户友好文档探索框架的研究则更少。在本研究中,我们介绍了QuantPlorer,一个在线数量提取和探索工具。通过直观的网页界面,QuantPlorer能够从非结构化文本中提取数量,使用户能够交互式地研究和可视化文本中的数量,并支持基于多种特征的过滤,如数值范围、单位、趋势和概念。此外,用户还可以探索和可视化特定单位和概念下的数值分布。我们的演示可在https://quantplorer.ifi.uni-heidelberg.de/ 访问。|code|0|
|ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction|Nicolay Rusnachenko, Huizhi Liang, Maksim Kalameyets, Lei Shi|Newcastle Univ, Sch Comp, Newcastle Upon Tyne, Tyne & Wear, England|The escalating volume of textual data necessitates adept and scalable Information Extraction (IE) systems in the field of Natural Language Processing (NLP) to analyse massive text collections in a detailed manner. While most deep learning systems are designed to handle textual information as it is, the gap in the existence of the interface between a document and the annotation of its parts is still poorly covered. Concurrently, one of the major limitations of most deep-learning models is a constrained input size caused by architectural and computational specifics. To address this, we introduce ARElight
- 评估QA方法与传统方法的对比;
- 识别新的问题表述,以发现利用QA能力改进解决方案的新方法;
- 促进不同领域研究者之间的合作,利用他们的知识和技能来解决所面临的挑战,并推动QA的应用。
该实验室将使用由CINECA提供的QC资源,CINECA是全球最重要的计算中心之一。我们还描述了基础设施的设计,该设计使用Docker和Kubernetes来确保可扩展性、容错性和可复制性。|code|0|
|EXIST 2024: sEXism Identification in Social neTworks and Memes|Laura Plaza, Jorge CarrillodeAlbornoz, Enrique Amigó, Julio Gonzalo, Roser Morante, Paolo Rosso, Damiano Spina, Berta Chulvi, Alba Maeso, Víctor Ruiz|RMIT Univ, Melbourne, Vic 3000, Australia; Univ Nacl Educ Distancia UNED, Madrid 28040, Spain; Univ Politecn Valencia UPV, Valencia 46022, Spain|The paper describes the EXIST 2024 lab on Sexism identification in social networks, that is expected to take place at the CLEF 2024 conference and represents the fourth edition of the EXIST challenge. The lab comprises five tasks in two languages, English and Spanish, with the initial three tasks building upon those from EXIST 2023 (sexism identification in tweets, source intention detection in tweets, and sexism categorization in tweets). In this edition, two new tasks have been introduced: sexism detection in memes and sexism categorization in memes. Similar to the prior edition, this one will adopt the Learning With Disagreement paradigm. The dataset for the various tasks will provide all annotations from multiple annotators, enabling models to learn from a range of training data, which may sometimes present contradictory opinions or labels. This approach facilitates the model's ability to handle and navigate diverse perspectives. Data bias will be handled both in the sampling and in the labeling processes: seed, topic, temporal and user bias will be taken into account when gathering data; in the annotation process, bias will be reduced by involving annotators from different social and demographic backgrounds.|本文介绍了将在CLEF 2024会议上举行的EXIST 2024实验室,该实验室专注于社交媒体中的性别歧视识别,这是EXIST挑战赛的第四届。实验室包含五种任务,涉及英语和西班牙语两种语言,其中前三个任务延续了EXIST 2023的内容(推文中的性别歧视识别、推文中的意图检测以及推文中的性别歧视分类)。在本届实验室中,新增了两项任务:迷因中的性别歧视检测和迷因中的性别歧视分类。与往届类似,本届实验室将采用“学习分歧”(Learning With Disagreement)范式。各项任务的数据集将提供来自多位标注者的所有标注,使模型能够从多样化的训练数据中学习,这些数据有时可能包含相互矛盾的观点或标签。这种方法有助于模型处理和应对不同的观点。数据偏差将在采样和标注过程中得到处理:在数据收集时,将考虑种子、主题、时间以及用户偏差;在标注过程中,将通过引入来自不同社会背景和人口统计背景的标注者来减少偏差。|code|0|