
ECIR2024 Paper List

Paper | Authors | Organization | Abstract | Translation | Code | Citations
Large Language Models are Zero-Shot Rankers for Recommender Systems Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian J. McAuley, Wayne Xin Zhao Recently, large language models (LLMs) (e.g. GPT-4) have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work aims to investigate the capacity of LLMs that act as the ranking model for recommender systems. To conduct our empirical study, we first formalize the recommendation problem as a conditional ranking task, considering sequential interaction histories as conditions and the items retrieved by the candidate generation model as candidates. We adopt a specific prompting approach to solving the ranking task by LLMs: we carefully design the prompting template by including the sequential interaction history, the candidate items, and the ranking instruction. We conduct extensive experiments on two widely-used datasets for recommender systems and derive several key findings for the use of LLMs in recommender systems. We show that LLMs have promising zero-shot ranking abilities, even competitive to or better than conventional recommendation models on candidates retrieved by multiple candidate generators. We also demonstrate that LLMs struggle to perceive the order of historical interactions and can be affected by biases like position bias, while these issues can be alleviated via specially designed prompting and bootstrapping strategies. The code to reproduce this work is available at https://github.com/RUCAIBox/LLMRank. 最近,大型语言模型(LLM)(例如 GPT-4)展示了令人印象深刻的通用任务解决能力,包括接近推荐任务的潜力。沿着这条研究路线,这项工作旨在调查作为推荐系统排名模型的 LLM 的能力。为了进行实证研究,我们首先将推荐问题形式化为一个条件排序任务,将序贯交互历史作为条件,并将候选生成模型检索到的项目作为候选项。我们采用了一种特定的提示方法来解决 LLM 的排序问题: 我们仔细设计了提示模板,包括顺序交互历史,候选项,和排序指令。我们对推荐系统中两个广泛使用的数据集进行了广泛的实验,得出了在推荐系统中使用 LLM 的几个关键发现。我们证明了 LLM 具有良好的零拍排序能力,甚至比传统的推荐模型更有竞争力或更好的候选人由多个候选生成器检索。我们还证明 LLM 很难感知历史交互的次序,并且可能受到位置偏差等偏见的影响,而这些问题可以通过特别设计的激励和自举策略得到缓解。复制这项工作的代码可在 https://github.com/rucaibox/llmrank 找到。 code 3
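
To make this paper's prompting setup concrete, below is a minimal Python sketch of a zero-shot ranking prompt built from the interaction history, the candidate items, and a ranking instruction, plus the bootstrapping idea of shuffling candidate order to counter position bias. The wording and item names are illustrative assumptions, not the authors' exact template (see the LLMRank repository for that).

```python
import random

def build_ranking_prompt(history, candidates):
    """Assemble a zero-shot ranking prompt from the interaction history,
    the candidate items, and a ranking instruction (illustrative wording)."""
    lines = ["I've watched the following movies in order:"]
    lines += [f"{i + 1}. {title}" for i, title in enumerate(history)]
    lines.append("Now there are these candidate movies:")
    lines += [f"[{chr(65 + i)}] {title}" for i, title in enumerate(candidates)]
    lines.append("Rank all candidates by how likely I am to watch them next. "
                 "Answer with the bracketed letters only, most likely first.")
    return "\n".join(lines)

def bootstrapped_prompts(history, candidates, rounds=3, seed=0):
    """Mitigate position bias by issuing several prompts with the candidates
    shuffled, then averaging the resulting ranks across rounds."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(rounds):
        shuffled = candidates[:]
        rng.shuffle(shuffled)
        prompts.append((shuffled, build_ranking_prompt(history, shuffled)))
    return prompts

if __name__ == "__main__":
    history = ["The Matrix", "Inception", "Interstellar"]
    candidates = ["Tenet", "Titanic", "Dune", "Up"]
    for order, prompt in bootstrapped_prompts(history, candidates):
        print(prompt, end="\n---\n")
```
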
Exploring Large Language Models and Hierarchical Frameworks for Classification of Large Unstructured Legal Documents Nishchal Prasad, Mohand Boughanem, Taoufiq Dkaki Legal judgment prediction suffers from the problem of long case documents exceeding tens of thousands of words, in general, and having a non-uniform structure. Predicting judgments from such documents becomes a challenging task, more so on documents with no structural annotation. We explore the classification of these large legal documents and their lack of structural information with a deep-learning-based hierarchical framework which we call MESc ("Multi-stage Encoder-based Supervised with-clustering") for judgment prediction. Specifically, we divide a document into parts to extract their embeddings from the last four layers of a custom fine-tuned Large Language Model, and try to approximate their structure through unsupervised clustering, which we use in another set of transformer encoder layers to learn the inter-chunk representations. We analyze the adaptability of Large Language Models (LLMs) with multi-billion parameters (GPT-Neo and GPT-J) within the hierarchical framework of MESc and compare them with their standalone performance on legal texts. We also study their intra-domain (legal) transfer learning capability and the impact of combining embeddings from their last layers in MESc. We test these methods and their effectiveness with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States with the ILDC dataset and a subset of the LexGLUE dataset. Our approach achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods. 法律判决预测存在案件文书篇幅过长、文字过长、结构不统一等问题。从这些文档中预测判断是一项具有挑战性的任务,对于没有结构注释的文档更是如此。我们探讨了这些大型法律文件的分类以及它们缺乏结构信息的问题,采用了一种基于深度学习的层次结构框架,我们称之为 MESc,“基于多级编码器的聚类监督”,用于判断预测。具体来说,我们将一个文档分成几个部分,从定制的微调大语言模型的最后四层中提取它们的嵌入,并尝试通过无监督聚类来近似它们的结构。我们在另一组转换器编码器层中使用它来学习块间表示。本文采用 MESc 层次结构分析了具有数十亿参数(GPT-Neo 和 GPT-J)的大语言模型(LLM)的适应性,并与它们在法律文本中的独立性进行了比较。我们还研究了它们的域内(法律)迁移学习能力以及在 MESc 中结合它们最后一层的嵌入所产生的影响。我们使用 ILDC 数据集和 LexGLUE 数据集的子集对来自印度,欧盟和美国的法律文件进行广泛的实验和消融研究,以测试这些方法及其有效性。我们的方法比以前的最先进的方法获得了大约2点的最小总性能增益。 code 1
Overview of PAN 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification - Extended Abstract Janek Bevendorff, Xavier Bonet Casals, Berta Chulvi, Daryna Dementieva, Ashaf Elnagar, Dayne Freitag, Maik Fröbe, Damir Korencic, Maximilian Mayerl, Animesh Mukherjee, Alexander Panchenko, Martin Potthast, Francisco Rangel, Paolo Rosso, Alisa Smirnova, Efstathios Stamatatos, Benno Stein, Mariona Taulé, Dmitry Ustalov, Matti Wiegmann, Eva Zangerle Bauhaus Univ Weimar, Weimar, Germany; Symanto Res, Valencia, Spain; Univ Politecn Valencia, Valencia, Spain; JetBrains, Belgrade, Serbia; Tech Univ Munich, Munich, Germany; Indian Inst Technol Kharagpur, Kharagpur, W Bengal, India; Univ Barcelona, Barcelona, Spain; Univ Aegean, Samos, Greece; Rudjer Boskovic Inst, Zagreb, Croatia; Univ Sharjah, Sharjah, U Arab Emirates; Toloka, Luzern, Switzerland; Friedrich Schiller Univ Jena, Jena, Germany; Univ Hamburg, Hamburg, Germany; Univ Appl Sci BFI Vienna, Vienna, Austria; SRI Int, 333 Ravenswood Ave, Menlo Pk, CA 94025 USA; Univ Innsbruck, Innsbruck, Austria; Univ Kassel, Kassel, Germany; Univ Santiago de Compostela, Santiago, Spain; Skoltech & AIRI, Skolkovo, Russia; Univ Leipzig, Leipzig, Germany The paper gives a brief overview of the four shared tasks organized at the PAN 2024 lab on digital text forensics and stylometry to be hosted at CLEF 2024. The goal of the PAN lab is to advance the state-of-the-art in text forensics and stylometry through an objective evaluation of new and established methods on new benchmark datasets. Our four tasks are: (1) multi-author writing style analysis, which we continue from 2023 in a more difficult version, (2) multilingual text detoxification, a new task that aims to translate and re-formulate text in a non-toxic way, (3) oppositional thinking analysis, a new task that aims to discriminate critical thinking from conspiracy narratives and identify their core actors, and (4) generative AI authorship verification, which formulates the detection of AI-generated text as an authorship problem, one of PAN's core tasks. As with the previous editions, PAN invites software submissions as easy-to-reproduce docker containers; more than 400 pieces of software have been submitted from PAN'12 through PAN'23 combined, with all recent evaluations running on the TIRA experimentation platform [8]. 本文简要介绍了将在CLEF 2024上举办的PAN 2024实验室关于数字文本取证和风格计量学的四项共享任务。PAN实验室的目标是通过在新的基准数据集上对新方法和已有方法进行客观评估,推动文本取证和风格计量学领域的前沿发展。我们的四项任务包括:(1)多作者写作风格分析,这是我们在2023年基础上推出的更具挑战性的版本;(2)多语言文本去毒化,这是一项新任务,旨在以非毒性的方式翻译和重述文本;(3)对立思维分析,这是一项新任务,旨在区分批判性思维与阴谋论叙述,并识别其核心行为者;(4)生成式AI作者身份验证,该任务将AI生成文本的检测问题表述为作者身份验证问题,这是PAN的核心任务之一。与以往版本一样,PAN邀请以易于复现的Docker容器形式提交软件;从PAN'12到PAN'23,已提交了超过400个软件,所有最近的评估均在TIRA实验平台上运行[8]。 code 1
Incorporating Query Recommendation for Improving In-Car Conversational Search Md. Rashad Al Hasan Rony, Soumya Ranjan Sahoo, Abbas Goher Khan, Ken E. Friedl, Viju Sudhi, Christian Süß Fraunhofer IAIS, Zwickauer Str 46, D-01069 Dresden, Germany; BMW Grp, Parkring 19-23, D-85748 Garching, Germany Retrieval-augmented generation has become an effective mechanism for conversational systems in domain-specific settings. Retrieval of a wrong document due to the lack of context from the user utterance may lead to wrong answer generation. Such an issue may reduce the user engagement and thereby the system reliability. In this paper, we propose a context-guided follow-up question recommendation to internally improve the document retrieval in an iterative approach for developing an in-car conversational system. Specifically, a user utterance is first reformulated, given the context of the conversation to facilitate improved understanding to the retriever. In the cases, where the documents retrieved by the retriever are not relevant enough for answering the user utterance, we employ a large language model (LLM) to generate question recommendation which is then utilized to perform a refined retrieval. An empirical evaluation confirms the effectiveness of our proposed approaches in in-car conversations, achieving 48% and 22% improvement in the retrieval and system generated responses, respectively, against baseline approaches. 检索增强生成已成为特定领域设置中对话系统的有效机制。由于用户话语中缺乏上下文信息,检索到错误的文档可能会导致生成错误的答案。这一问题可能会降低用户参与度,从而影响系统的可靠性。在本文中,我们提出了一种上下文引导的后续问题推荐方法,通过迭代方式内部改进文档检索,以开发车载对话系统。具体来说,首先根据对话的上下文对用户话语进行重新表述,以帮助检索器更好地理解。在检索器检索到的文档不足以回答用户话语的情况下,我们利用大型语言模型(LLM)生成问题推荐,然后利用该推荐进行更精确的检索。实证评估证实了我们所提出方法在车载对话中的有效性,与基线方法相比,检索和系统生成响应的性能分别提高了48%和22%。 code 0
ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search Beatriz Soviero, Daniel Kuhn, Alexandre Salle, Viviane Pereira Moreira VTEX, Porto Alegre, RS, Brazil; Univ Fed Rio Grande do Sul, Inst Informat, Porto Alegre, RS, Brazil; Inst Educ Sci & Technol Rio Grande do Sul IFRS, Ibiruba, Brazil The dependence on human relevance judgments limits the development of information retrieval test collections that are vital for evaluating these systems. Since their launch, large language models (LLMs) have been applied to automate several human tasks. Recently, LLMs started being used to provide relevance judgments for document search. In this work, our goal is to assess whether LLMs can replace human annotators in a different setting - product search in eCommerce. We conducted experiments on open and proprietary industrial datasets to measure LLMs' ability to predict relevance judgments. Our results found that LLM-generated relevance assessments present a strong agreement (~82%) with human annotations, indicating that LLMs have an innate ability to perform relevance judgments in an eCommerce setting. Then, we went further and tested whether LLMs can generate annotation guidelines. Our results found that relevance assessments obtained with LLM-generated guidelines are as accurate as the ones obtained from human instructions. (The source code for this work is available at https://github.com/danimtk/chatGPT-goes-shopping) 对人工相关性判断的依赖限制了信息检索测试集的发展,而这些测试集对于评估这些系统至关重要。自大型语言模型(LLMs)推出以来,已被应用于自动化多项人工任务。最近,LLMs开始被用于提供文档搜索的相关性判断。在这项工作中,我们的目标是评估LLMs在电子商务产品搜索这一不同场景下是否能替代人工标注者。我们在开放和专有的工业数据集上进行了实验,以衡量LLMs预测相关性判断的能力。我们的结果表明,LLMs生成的相关性评估与人工标注表现出高度一致性(约为82%),这表明LLMs在电子商务环境中具有执行相关性判断的先天能力。接着,我们进一步测试了LLMs是否能生成标注指南。我们的研究发现,使用LLMs生成的指南获得的相关性评估与使用人工指令获得的评估同样准确。(本工作的源代码可在https://github.com/danimtk/chatGPT-goes-shopping获取) code 0
Lottery4CVR: Neuron-Connection Level Sharing for Multi-task Learning in Video Conversion Rate Prediction Xuanji Xiao, Jimmy Chen, Yuzhen Liu, Xing Yao, Pei Liu, Chaosheng Fan Tencent Inc, Beijing, Peoples R China As a fundamental task of industrial ranking systems, conversion rate (CVR) prediction is suffering from data sparsity problems. Most conventional CVR modeling leverages Click-through rate (CTR)&CVR multitask learning because CTR involves far more samples than CVR. However, typical coarse-grained layer-level sharing methods may introduce conflicts and lead to performance degradation, since not every neuron or neuron connection in one layer should be shared between CVR and CTR tasks. This is because users may have different fine-grained content feature preferences between deep consumption and click behaviors, represented by CVR and CTR, respectively. To address this sharing&conflict problem, we propose a neuron-connection level knowledge sharing. We start with an over-parameterized base network from which CVR and CTR extract their own subnetworks. The subnetworks have partially overlapped neuron connections which correspond to the sharing knowledge, and the task-specific neuron connections are utilized to alleviate the conflict problem. As far as we know, this is the first time that a neuron-connection level sharing is proposed in CVR modeling. Experiments on the Tencent video platform demonstrate the superiority of the method, which has been deployed serving major traffic. (The source code is available at https://github.com/xuanjixiao/onerec/tree/main/lt4rec). 作为工业排名系统的一项基础任务,转化率(CVR)预测一直受到数据稀疏问题的困扰。大多数传统的CVR建模方法利用点击率(CTR)和CVR的多任务学习,因为CTR涉及的样本数量远多于CVR。然而,典型的粗粒度层级共享方法可能会引入冲突并导致性能下降,因为并非每一层中的每个神经元或神经元连接都应在CVR和CTR任务之间共享。这是因为用户在深度消费和点击行为之间可能具有不同的细粒度内容特征偏好,分别由CVR和CTR表示。为了解决这种共享与冲突问题,我们提出了一种神经元连接级别的知识共享方法。我们从一个过参数化的基础网络开始,CVR和CTR从中提取各自的子网络。这些子网络的神经元连接部分重叠,对应共享的知识,而特定任务的神经元连接则用于缓解冲突问题。据我们所知,这是在CVR建模中首次提出神经元连接级别的共享方法。在腾讯视频平台上的实验证明了该方法的优越性,并且该方法已经部署服务于主要流量。(源代码可在https://github.com/xuanjixiao/onerec/tree/main/lt4rec获取)。 code 0
Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search Kathryn E. Kirchoff, James Wellnitz, Joshua E. Hochuli, Travis Maxfield, Konstantin I. Popov, Shawn M. Gomez, Alexander Tropsha Department of Computer Science, UNC Chapel Hill.; Department of Pharmacology, UNC Chapel Hill.; Eshelman School of Pharmacy, UNC Chapel Hill. Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally-aware embedding-SmallSA-for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks. 基于最近邻的相似性搜索是化学中的一个常见任务,在药物发现中有着显著的应用案例。然而,这项任务中一些最常用的方法仍然使用蛮力方法。在实践中,由于现代化学品数据库的庞大规模,这可能会造成计算成本高昂和过度耗时。此任务之前的计算改进通常依赖于对缺乏普遍性的硬件或数据集特定技巧的改进。利用低复杂度搜索算法的方法仍然相对缺乏探索。然而,许多这些算法是近似解决方案和/或与典型的高维化学嵌入斗争。在这里,我们评估是否结合低维化学嵌入和 k-d 树数据结构可以实现快速最近邻查询,同时保持标准化学相似性搜索基准的性能。我们考察了不同维度的标准化学嵌入降低以及一个学习,结构意识的嵌入-SmallSA-为这项任务。有了这个框架,超过十亿种化学物质的搜索在不到一秒钟的时间内在一个 CPU 核心上执行,比蛮力搜索数量级快5倍。我们亦证明 SmallSA 在化学相似性基准方面取得具竞争力的表现。 code 0
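
The core speed-up in this paper pairs low-dimensional embeddings with a k-d tree, which supports fast exact nearest-neighbor queries in low dimensions. A minimal sketch with SciPy, using random vectors in place of real chemical embeddings (the 8-D size and dataset are toy assumptions for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy stand-in for low-dimensional chemical embeddings (e.g., a learned
# 8-D space like SmallSA); real embeddings would come from the model.
rng = np.random.default_rng(42)
database = rng.normal(size=(100_000, 8)).astype(np.float32)
queries = rng.normal(size=(5, 8)).astype(np.float32)

tree = cKDTree(database)                 # built once per database
dists, idx = tree.query(queries, k=10)   # fast exact k-NN per query
print(idx[0], dists[0])
```
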
Evaluating the Impact of Content Deletion on Tabular Data Similarity and Retrieval Using Contextual Word Embeddings Alberto Berenguer, David Tomás, JoseNorberto Mazón Table retrieval involves providing a ranked list of relevant tables in response to a search query. A critical aspect of this process is computing the similarity between tables. Recent Transformer-based language models have been effectively employed to generate word embedding representations of tables for assessing their semantic similarity. However, generating such representations for large tables comprising thousands or even millions of rows can be computationally intensive. This study presents the hypothesis that a significant portion of a table's content (i.e., rows) can be removed without substantially impacting its word embedding representation, thereby reducing computational costs while maintaining system performance. To test this hypothesis, two distinct evaluations were conducted. Firstly, an intrinsic evaluation was carried out using two different datasets and five state-of-the-art contextual and not-contextual language models. The findings indicate that, for large tables, retaining just 5% of the content results in a word embedding representation that is 90% similar to the original one. Secondly, an extrinsic evaluation was performed to assess how three different reduction techniques proposed affects the overall performance of the table-based query retrieval system, as measured by MAP, precision, and nDCG. The results demonstrate that these techniques can not only decrease data volume but also improve the performance of the table retrieval system. 表检索涉及根据搜索查询提供相关表的排名列表。这一过程的一个关键方面是计算表之间的相似性。最近,基于Transformer的语言模型已被有效用于生成表的词嵌入表示,以评估它们的语义相似性。然而,为包含数千甚至数百万行的大型表生成此类表示可能在计算上是密集的。本研究提出了一个假设,即可以移除表中大部分内容(即行)而不会显著影响其词嵌入表示,从而在保持系统性能的同时降低计算成本。为了验证这一假设,进行了两项不同的评估。首先,使用两个不同的数据集和五种最先进的上下文和非上下文语言模型进行了内在评估。结果表明,对于大型表,仅保留5%的内容即可生成与原始词嵌入表示90%相似的表示。其次,进行了外在评估,以评估所提出的三种不同缩减技术如何影响基于表的查询检索系统的整体性能,通过MAP、精度和nDCG来衡量。结果表明,这些技术不仅可以减少数据量,还可以提高表检索系统的性能。 code 0
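
The paper's hypothesis is easy to test in miniature: embed a full table and a heavily reduced version, then compare the two vectors. The sketch below uses a cheap hashing vectorizer as a stand-in for the contextual language models the paper evaluates, and a naive every-20th-row sample (5% of rows) as one of many possible reduction strategies:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def table_to_text(rows):
    # Flatten a table into one string, as is common before embedding.
    return " ".join(" ".join(map(str, r)) for r in rows)

rows = [[f"city{i}", i, i * 3.14] for i in range(10_000)]
kept = rows[::20]  # keep every 20th row = 5% of the content

# Cheap stand-in for a (contextual) embedding model of the table text.
vec = HashingVectorizer(n_features=2**12)
full_emb = vec.transform([table_to_text(rows)])
kept_emb = vec.transform([table_to_text(kept)])
print("similarity:", cosine_similarity(full_emb, kept_emb)[0, 0])
```
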
RIGHT: Retrieval-Augmented Generation for Mainstream Hashtag Recommendation RunZe Fan, Yixing Fan, Jiangui Chen, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng Automatic mainstream hashtag recommendation aims to accurately provide users with concise and popular topical hashtags before publication. Generally, mainstream hashtag recommendation faces challenges in the comprehensive difficulty of newly posted tweets in response to new topics, and the accurate identification of mainstream hashtags beyond semantic correctness. However, previous retrieval-based methods based on a fixed predefined mainstream hashtag list excel in producing mainstream hashtags, but fail to understand the constant flow of up-to-date information. Conversely, generation-based methods demonstrate a superior ability to comprehend newly posted tweets, but their capacity is constrained to identifying mainstream hashtags without additional features. Inspired by the recent success of the retrieval-augmented technique, in this work, we attempt to adopt this framework to combine the advantages of both approaches. Meantime, with the help of the generator component, we could rethink how to further improve the quality of the retriever component at a low cost. Therefore, we propose RetrIeval-augmented Generative Mainstream HashTag Recommender (RIGHT), which consists of three components: 1) a retriever seeks relevant hashtags from the entire tweet-hashtags set; 2) a selector enhances mainstream identification by introducing global signals; and 3) a generator incorporates input tweets and selected hashtags to directly generate the desired hashtags. The experimental results show that our method achieves significant improvements over state-of-the-art baselines. Moreover, RIGHT can be easily integrated into large language models, improving the performance of ChatGPT by more than 10%. 自动主流话题标签推荐的目的是准确地为用户提供简洁和流行的话题标签发布前。一般来说,主流话题标签推荐面临的挑战包括: 新发布的推文在回应新话题方面的综合难度,以及在语义正确性之外对主流话题标签的准确识别。然而,以往基于固定预定义主流标签列表的检索方法在生成主流标签方面表现出色,但不能理解不断更新的信息流。相反,基于生成的方法展示了理解新发布的 tweet 的优越能力,但它们的能力仅限于识别主流标签,而没有其他特性。受近年来检索增强技术的成功启发,本文尝试采用这一框架将两种方法的优点结合起来。同时,借助于发生器组件,我们可以重新思考如何以较低的成本进一步提高检索器组件的质量。因此,我们提出了 RetrIeval 增强的生成主流 HashTag 推荐器(RIGHT) ,它由三个组成部分组成: 1)检索器从整个 tweet-HashTag 集中寻找相关的 HashTag; 2)选择器通过引入全局信号增强主流识别; 3)生成器结合输入 tweet 和选定的 HashTag 直接生成所需的 HashTag。实验结果表明,我们的方法取得了显着的改进,在最先进的基线。此外,可以很容易地将权限集成到大型语言模型中,使 ChatGPT 的性能提高10% 以上。 code 0
Exploring the Nexus Between Retrievability and Query Generation Strategies Aman Sinha, Priyanshu Raj Mall, Dwaipayan Roy Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable. 通过文档可检索性评分量化检索功能中的偏差对于评估面向回忆的检索系统至关重要。然而,许多研究检索模型偏倚的研究缺乏验证其查询生成方法作为准确表示的可检索性的真实用户和他们的查询。这种局限性是由于在可检索性评估中缺乏确定的查询生成标准造成的。通常,当没有查询日志可用时,研究人员会使用文档语料库中的频繁搭配。在这项研究中,我们解决了重复性的问题,并寻求验证查询生成方法,通过比较从人工生成的查询和从查询日志得到的查询可检索性得分。我们的研究结果表明,人工查询和查询日志的可检索性得分之间的相关性很小,甚至可以忽略不计。这表明人工生成的查询可能不能准确地反映从查询日志中获得的可检索性得分。我们进一步探索替代的查询生成技术,发现具有最高相关性的变体。这种替代方法有望在查询日志不可用时提高可重复性。 code 0
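
For reference, a document's retrievability score is typically accumulated over a query set with a rank cutoff: a document is "retrievable" to the extent that queries surface it in the top c results. A minimal sketch of that cumulative scoring (simplified from the standard retrievability formulation):

```python
from collections import Counter

def retrievability(run, c=10):
    """r(d) = number of queries for which document d appears in the
    top-c results (cumulative scoring with a rank cutoff c)."""
    r = Counter()
    for ranking in run.values():      # run: query -> ranked doc ids
        for doc in ranking[:c]:
            r[doc] += 1
    return r

run = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d4"], "q3": ["d2", "d1"]}
print(retrievability(run, c=2))  # d2 dominates -> highest retrievability
```
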
GLAD: Graph-Based Long-Term Attentive Dynamic Memory for Sequential Recommendation Deepanshu Pandey, Arindam Sarkar, Prakash Mandayam Comar Amazon Dev Ctr, Bengaluru, India Recommender systems play a crucial role in the e-commerce stores, enabling customers to explore products and facilitating the discovery of relevant items. Typical recommender systems are built using n most recent user interactions, where value of n is chosen based on trade-off between incremental gains in performance and compute/memory costs associated with processing long sequences. State-of-the-art recommendation models like Transformers, based on attention mechanism, have quadratic computation complexity with respect to sequence length, thus limiting the length of past customer interactions to be considered for recommendations. Even with the availability of compute resources, it is crucial to design an algorithm that strikes delicate balance between long term and short term information in identifying relevant products for personalised recommendation. Towards this, we propose a novel extension of Memory Networks, a neural network architecture that harnesses external memory to encapsulate information present in lengthy sequential data. The use of memory networks in recommendation use-cases remains limited in practice owing to their high memory cost, large compute requirements and relatively large inference latency, which makes them prohibitively expensive for online stores with millions of users and products. To address these limitations, we propose a novel transformer-based sequential recommendation model GLAD, with external graph-based memory that dynamically scales user memory by adjusting the memory size according to the user's history, while facilitating the flow of information between users with similar interactions. We establish the efficacy of the proposed model by benchmarking on multiple public datasets as well as an industry dataset against state-of-the-art sequential recommendation baselines. 推荐系统在电子商务商店中扮演着至关重要的角色,它们不仅帮助顾客探索产品,还促进了相关商品的发现。典型的推荐系统基于用户最近的n次交互来构建,其中n的取值需要在性能的提升与处理长序列所带来的计算/内存成本之间进行权衡。基于注意力机制的最先进推荐模型,如Transformers,其计算复杂度与序列长度呈二次方关系,因此限制了用于推荐的过去客户交互的长度。即使计算资源充足,设计一个算法在长期和短期信息之间找到微妙的平衡,以识别个性化推荐中的相关产品,仍然是至关重要的。为此,我们提出了一种记忆网络(Memory Networks)的新扩展,这是一种利用外部记忆来封装长序列数据中信息的神经网络架构。由于记忆网络的高内存成本、大计算需求以及相对较大的推理延迟,它们在推荐用例中的应用在实践中受到限制,这使得它们对于拥有数百万用户和产品的在线商店来说成本过高。为了解决这些限制,我们提出了一种基于Transformer的新型序列推荐模型GLAD,它带有基于图的外部记忆,能够根据用户的历史动态调整记忆大小,同时在具有相似交互的用户之间促进信息流动。我们通过在多个公共数据集以及一个行业数据集上与最先进的序列推荐基准进行比较,验证了所提出模型的有效性。 code 0
BertPE: A BERT-Based Pre-retrieval Estimator for Query Performance Prediction Maryam Khodabakhsh, Fattane Zarrinkalam, Negar Arabzadeh Univ Waterloo, Waterloo, ON, Canada; Univ Guelph, Guelph, ON, Canada; Shahrood Univ Technol, Shahrood, Iran Query Performance Prediction (QPP) aims to estimate the effectiveness of a query in addressing the underlying information need without any relevance judgments. More recent works in this area have employed the pre-trained neural embedding representations of the query to go beyond the corpus statistics of query terms and capture the semantics of the query. In this paper, we propose a supervised QPP method by adopting contextualized neural embeddings to directly learn the performance through fine-tuning. To address the challenges arising from disparities in the evaluation of retrieval models through sparse and comprehensive labels, we introduce an innovative strategy for creating synthetic relevance judgments to enable effective performance prediction for queries, irrespective of whether they are evaluated with sparse or more comprehensive labels. Through our experiments on four different query sets accompanied by MS MARCO V1 collection, we show that our approach shows significantly improved performance compared to the state-of-the-art Pre-retrieval QPP methods. 查询性能预测(Query Performance Prediction, QPP)旨在无需相关判断的情况下,估计查询在满足信息需求方面的有效性。近年来,该领域的研究工作利用预训练的神经嵌入表示来超越查询词的语料库统计信息,捕捉查询的语义。本文提出了一种监督式QPP方法,通过采用上下文神经嵌入,通过微调直接学习查询性能。为了解决通过稀疏和全面标签评估检索模型时产生的差异带来的挑战,我们引入了一种创新的策略,用于创建合成的相关性判断,从而实现对查询的有效性能预测,无论这些查询是通过稀疏标签还是更全面的标签进行评估。通过在MS MARCO V1数据集上的四个不同查询集的实验,我们展示了该方法相较于当前最先进的预检索QPP方法,性能显著提升。 code 0
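
A minimal sketch of the supervised idea, assuming the standard Hugging Face API: fine-tune a BERT model with a single regression output on the query text alone (pre-retrieval, so no documents are input), with target labels taken from each query's effectiveness under some retrieval run. The model name and label values are placeholders, not the paper's exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # one regression output = predicted performance

queries = ["what is a corporate bond", "treating osteoporosis"]
targets = torch.tensor([[0.61], [0.12]])  # placeholder AP of each query under a run

batch = tok(queries, padding=True, return_tensors="pt")
out = model(**batch, labels=targets)  # HF applies MSE loss when num_labels == 1
out.loss.backward()  # an optimiser step would follow in a real training loop
print(out.logits.squeeze(-1))  # predicted query-performance scores
```
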
Estimating the Usefulness of Clarifying Questions and Answers for Conversational Search Ivan Sekulic, Weronika Lajewska, Krisztian Balog, Fabio Crestani While the body of research directed towards constructing and generating clarifying questions in mixed-initiative conversational search systems is vast, research aimed at processing and comprehending users' answers to such questions is scarce. To this end, we present a simple yet effective method for processing answers to clarifying questions, moving away from previous work that simply appends answers to the original query and thus potentially degrades retrieval performance. Specifically, we propose a classifier for assessing usefulness of the prompted clarifying question and an answer given by the user. Useful questions or answers are further appended to the conversation history and passed to a transformer-based query rewriting module. Results demonstrate significant improvements over strong non-mixed-initiative baselines. Furthermore, the proposed approach mitigates the performance drops when non useful questions and answers are utilized. 尽管在混合主动会话搜索系统中,针对构建和生成澄清问题的研究机构非常庞大,但是针对处理和理解用户对这些问题的回答的研究却很少。为此,我们提出了一种简单而有效的方法,用于处理澄清问题的答案,避免了以前的工作,即只是将答案附加到原始查询,从而可能降低检索性能。具体来说,我们提出了一个分类器,用于评估提示的澄清问题和用户给出的答案的有用性。有用的问题或答案将进一步附加到会话历史中,并传递给基于转换器的查询重写模块。结果显示,与强大的非混合倡议基线相比,有了显著改善。此外,当使用非有用的问题和答案时,提出的方法可以减少性能下降。 code 0
Measuring Bias in Search Results Through Retrieval List Comparison Linda Ratz, Markus Schedl, Simone Kopeinik, Navid Rekabsaz Johannes Kepler Univ Linz, Linz, Austria; Know Ctr GmbH, Graz, Austria Many IR systems project harmful societal biases, including gender bias, in their retrieved contents. Uncovering and addressing such biases requires grounded bias measurement principles. However, defining reliable bias metrics for search results is challenging, particularly due to the difficulties in capturing gender-related tendencies in the retrieved documents. In this work, we propose a new framework for search result bias measurement. Within this framework, we first revisit the current metrics for representative search result bias (RepSRB) that are based on the occurrence of gender-specific language in the search results. Addressing their limitations, we additionally propose a metric for comparative search result bias (ComSRB) measurement and integrate it into our framework. ComSRB defines bias as the skew in the set of retrieved documents in response to a non-gendered query toward those for male/female-specific variations of the same query. We evaluate ComSRB against RepSRB on a recent collection of bias-sensitive topics and documents from the MS MARCO collection, using pre-trained bi-encoder and cross-encoder IR models. Our analyses show that, while existing metrics are highly sensitive to the wordings and linguistic formulations, the proposed ComSRB metric mitigates this issue by focusing on the deviations of a retrieval list from its explicitly biased variants, avoiding the need for sub-optimal content analysis processes. 许多信息检索(IR)系统在其检索内容中反映了有害的社会偏见,包括性别偏见。揭示和解决这些偏见需要基于可靠的偏见测量原则。然而,定义可靠的搜索结果偏见度量标准具有挑战性,特别是由于难以捕捉检索文档中与性别相关的倾向。在这项工作中,我们提出了一个新的搜索结果偏见测量框架。在该框架内,我们首先重新审视了当前基于搜索结果中性别特异性语言出现的代表性搜索结果偏见(RepSRB)度量标准。针对其局限性,我们还提出了用于比较搜索结果偏见(ComSRB)测量的度量标准,并将其集成到我们的框架中。ComSRB将偏见定义为在响应非性别化查询时,检索文档集合向同一查询的男性/女性特异性变体的倾斜。我们在MS MARCO集合中的一个最新的偏见敏感主题和文档集合上,使用预训练的双编码器和交叉编码器IR模型,对ComSRB与RepSRB进行了评估。我们的分析表明,尽管现有度量标准对措辞和语言表达高度敏感,但所提出的ComSRB度量标准通过关注检索列表与其明确偏见的变体之间的偏差,缓解了这一问题,避免了次优的内容分析过程的需求。 code 0
Cascading Ranking Pipelines for Sensitivity-Aware Search Jack McKechnie Univ Glasgow, Glasgow, Lanark, Scotland Search engines are designed to make information accessible. However, some information should not be accessible, such as documents concerning citizenship applications or personal information. This sensitive information is often found interspersed with other potentially useful non-sensitive information. As such, collections containing sensitive information cannot be made searchable due to the risk of revealing sensitive information. The development of search engines capable of safely searching collections containing sensitive information to provide relevant and non-sensitive information would allow previously hidden collections to be made available. This work aims to develop sensitivity-aware search engines via two-stage cascading retrieval pipelines. 搜索引擎的设计初衷是使信息易于获取。然而,某些信息不应被轻易访问,例如涉及公民身份申请的文件或个人隐私信息。这些敏感信息通常与其他潜在有用的非敏感信息混杂在一起。因此,包含敏感信息的集合无法被开放搜索,因为存在泄露敏感信息的风险。开发能够安全搜索包含敏感信息的集合并提供相关且非敏感信息的搜索引擎,可以使之前被隐藏的集合得以公开访问。本工作旨在通过两阶段级联检索管道开发具有敏感信息感知能力的搜索引擎。 code 0
Advancing Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications with ImageCLEF 2024 Bogdan Ionescu, Henning Müller, AnaMaria Claudia Dragulinescu, Ahmad IdrissiYaghir, Ahmedkhan Radzhabov, Alba Garcia Seco de Herrera, Alexandra Andrei, Alexandru Stan, Andrea M. Storås, Asma Ben Abacha, Benjamin Lecouteux, Benno Stein, Cécile Macaire, Christoph M. Friedrich, Cynthia S. Schmidt, Didier Schwab, Emmanuelle EsperançaRodier, George Ioannidis, Griffin Adams, Henning Schäfer, Hugo Manguinhas, Ioan Coman, Johanna Schöler, Johannes Kiesel, Johannes Rückert, Louise Bloch, Martin Potthast, Maximilian Heinrich, Meliha Yetisgen, Michael A. Riegler, Neal Snider, Pål Halvorsen, Raphael Brüngel, Steven Alexander Hicks, Vajira Thambawita, Vassili Kovalev, Yuri Prokopchuk, Wenwai Yim Univ Appl Sci Western Switzerland HES SO, Sierre, Switzerland; Natl Univ Sci & Technol Politehn Bucharest, Bucharest, Romania; CEA, LIST, Paris, France The ImageCLEF evaluation campaign was integrated with CLEF (Conference and Labs of the Evaluation Forum) for more than 20 years and represents a Multimedia Retrieval challenge aimed at evaluating the technologies for annotation, indexing, and retrieval of multimodal data. Thus, it provides information access to large data collections in usage scenarios and domains such as medicine, argumentation and content recommendation. ImageCLEF 2024 has four main tasks: (i) a Medical task targeting automatic image captioning for radiology images, synthetic medical images created with Generative Adversarial Networks (GANs), Visual Question Answering and medical image generation based on text input, and multimodal dermatology response generation; (ii) a joint ImageCLEF-Touché task Image Retrieval/Generation for Arguments to convey the premise of an argument, (iii) a Recommending task addressing cultural heritage content-recommendation, and (iv) a joint ImageCLEF-ToPicto task aiming to provide a translation in pictograms from natural language. In 2023, participation increased by 67% with respect to 2022 which reveals its impact on the community. ImageCLEF评估活动与CLEF(评估论坛会议与实验室)整合已有20多年,代表了一项多媒体检索挑战,旨在评估多模态数据的注释、索引和检索技术。因此,它在使用场景和领域(如医学、论证和内容推荐)中为大数据集合提供了信息访问。ImageCLEF 2024包含四项主要任务:(i) 医学任务,针对放射影像的自动图像描述、使用生成对抗网络(GANs)创建的合成医学图像、视觉问答以及基于文本输入的医学图像生成,以及多模态皮肤病学响应生成;(ii) 联合ImageCLEF-Touché任务,即图像检索/生成用于传递论证的前提;(iii) 推荐任务,涉及文化遗产内容的推荐;(iv) 联合ImageCLEF-ToPicto任务,旨在从自然语言中提供象形图翻译。2023年,参与人数较2022年增加了67%,显示了其对社区的广泛影响。 code 0
Ranking Heterogeneous Search Result Pages Using the Interactive Probability Ranking Principle Kanaad Pathak, Leif Azzopardi, Martin Halvey The Probability Ranking Principle (PRP) ranks search results based on their expected utility derived solely from document contents, often overlooking the nuances of presentation and user interaction. However, with the evolution of Search Engine Result Pages (SERPs), now comprising a variety of result cards, the manner in which these results are presented is pivotal in influencing user engagement and satisfaction. This shift prompts the question: How does the PRP and its user-centric counterpart, the Interactive Probability Ranking Principle (iPRP), compare in the context of these heterogeneous SERPs? Our study draws a comparison between the PRP and the iPRP, revealing significant differences in their output. The iPRP, accounting for item-specific costs and interaction probabilities to determine the “Expected Perceived Utility" (EPU), yields different result orderings compared to the PRP. We evaluate the effect of the EPU on the ordering of results by observing changes in the ranking within a heterogeneous SERP compared to the traditional “ten blue links”. We find that changing the presentation affects the ranking of items according to the (iPRP) by up to 48% (with respect to DCG, TBG and RBO) in ad-hoc search tasks on the TREC WaPo Collection. This work suggests that the iPRP should be employed when ranking heterogeneous SERPs to provide a user-centric ranking that adapts the ordering based on the presentation and user engagement. 概率排序原则(PRP)根据搜索结果的预期效用来排序,这些效用完全来自文档内容,往往忽略了表示和用户交互的细微差别。然而,随着搜索引擎结果页面(SERP)的发展,现在包含了各种各样的结果卡,这些结果的呈现方式对于影响用户的参与度和满意度是至关重要的。这种转变提出了一个问题: PRP 和它的以用户为中心的对应物,交互式概率排序原则(iPRP) ,如何在这些异构 SERP 的上下文中进行比较?我们的研究对 PRP 和 iPRP 进行了比较,发现它们的输出存在显著差异。IPRP 考虑了项目特定成本和交互概率,以确定“预期感知效用”(EPU) ,与 PRP 相比产生了不同的结果排序。我们通过观察一个异构 SERP 中的排序变化来评估 EPU 对结果排序的影响,与传统的“十个蓝色链接”相比。我们发现,在 TREC WaPo 集合的特别搜索任务中,根据(iPRP)改变表示影响项目的排名高达48% (相对于 DCG,TBG 和 RBO)。这项工作表明,iPRP 应该被用来排名异构的 SERP 时,提供一个以用户为中心的排名,适应排序的基础上的表示和用户参与。 code 0
Query Exposure Prediction for Groups of Documents in Rankings Thomas Jänich, Graham McDonald, Iadh Ounis The main objective of an Information Retrieval system is to provide a user with the most relevant documents to the user's query. To do this, modern IR systems typically deploy a re-ranking pipeline in which a set of documents is retrieved by a lightweight first-stage retrieval process and then re-ranked by a more effective but expensive model. However, the success of a re-ranking pipeline is heavily dependent on the performance of the first stage retrieval, since new documents are not usually identified during the re-ranking stage. Moreover, this can impact the amount of exposure that a particular group of documents, such as documents from a particular demographic group, can receive in the final ranking. For example, the fair allocation of exposure becomes more challenging or impossible if the first stage retrieval returns too few documents from certain groups, since the number of group documents in the ranking affects the exposure more than the documents' positions. With this in mind, it is beneficial to predict the amount of exposure that a group of documents is likely to receive in the results of the first stage retrieval process, in order to ensure that there are a sufficient number of documents included from each of the groups. In this paper, we introduce the novel task of query exposure prediction (QEP). Specifically, we propose the first approach for predicting the distribution of exposure that groups of documents will receive for a given query. Our new approach, called GEP, uses lexical information from individual groups of documents to estimate the exposure the groups will receive in a ranking. Our experiments on the TREC 2021 and 2022 Fair Ranking Track test collections show that our proposed GEP approach results in exposure predictions that are up to 40% more accurate than the predictions of adapted existing query performance prediction and resource allocation approaches. 信息检索系统的主要目的是向用户提供与其查询最相关的文件。为了做到这一点,现代 IR 系统通常部署一个重新排序的管道,其中一组文档通过轻量级的第一阶段检索过程检索,然后通过一个更有效但昂贵的模型重新排序。然而,重新排序管道的成功与否在很大程度上取决于第一阶段检索的性能,因为在重新排序阶段通常不能确定新文档。此外,这可能会影响特定文档组(如来自特定人口组的文档)在最终排名中可以接受的曝光量。例如,如果第一阶段检索从某些群组返回的文档太少,则公平分配曝光变得更具挑战性或不可能,因为排名中群组文档的数量比文档的位置更能影响曝光。考虑到这一点,有益的做法是预测一组文件在第一阶段检索过程的结果中可能接触的数量,以确保每组文件都有足够的数量。本文介绍了一种新的查询暴露预测(QEP)任务。具体来说,我们提出了第一种方法,用于预测给定查询将接收到的文档组的曝光分布。我们的新方法被称为 GEP,它使用来自单个文档组的词汇信息来估计这些组在一个排名中将接收到的信息。我们在 TREC 2021和2022公平排名跟踪测试集合上的实验表明,我们提出的 GEP 方法导致暴露预测,这是多达40种适应现有查询性能预测和资源分配方法。 code 0
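
For context, the "exposure" of a group is commonly operationalised as the sum of a position-based discount over the group's documents, which is why a few extra documents usually matter more than their exact positions, as the abstract argues. A small sketch with a DCG-style log discount (the Fair Ranking Track's definition builds on a related browsing model):

```python
import math
from collections import defaultdict

def group_exposure(ranking, groups):
    """Sum a DCG-style position discount per group: documents ranked
    higher contribute more exposure. `groups` maps doc id -> group."""
    exposure = defaultdict(float)
    for pos, doc in enumerate(ranking, start=1):
        exposure[groups[doc]] += 1.0 / math.log2(pos + 1)
    return dict(exposure)

ranking = ["d1", "d2", "d3", "d4", "d5"]
groups = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "B"}
print(group_exposure(ranking, groups))
```
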
Investigating the Robustness of Sequential Recommender Systems Against Training Data Perturbations Filippo Betello, Federico Siciliano, Pushkar Mishra, Fabrizio Silvestri Sequential Recommender Systems (SRSs) have been widely used to model user behavior over time, but their robustness in the face of perturbations to training data is a critical issue. In this paper, we conduct an empirical study to investigate the effects of removing items at different positions within a temporally ordered sequence. We evaluate two different SRS models on multiple datasets, measuring their performance using Normalized Discounted Cumulative Gain (NDCG) and Rank Sensitivity List metrics. Our results demonstrate that removing items at the end of the sequence significantly impacts performance, with NDCG decreasing up to 60%, while removing items from the beginning or middle has no significant effect. These findings highlight the importance of considering the position of the perturbed items in the training data and shall inform the design of more robust SRSs. 随着时间的推移,序贯推荐系统(SRS)已被广泛用于模拟用户行为,但是它们在训练数据受到干扰时的鲁棒性是一个关键问题。在本文中,我们进行了一个实证研究,以探讨删除项目在不同位置的时间顺序的影响。我们在多个数据集上评估两种不同的 SRS 模型,使用归一化贴现累积增益(NDCG)和秩敏感性列表度量衡量它们的性能。结果表明: 去除序列末端的项目对性能有显著影响,NDCG 下降幅度达60% ,而去除序列开头或中间的项目对性能无显著影响。这些发现强调了考虑受干扰项目在训练数据中的位置的重要性,并将为设计更强健的战略参考系提供信息。 code 0
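
The perturbation protocol itself is simple to reproduce: drop items from the beginning, middle, or end of each temporally ordered training sequence and retrain. A minimal sketch of the three conditions:

```python
def perturb(sequence, where, n=1):
    """Drop n items from the start, middle, or end of a temporally
    ordered interaction sequence (the three conditions of the study)."""
    if where == "beginning":
        return sequence[n:]
    if where == "middle":
        mid = len(sequence) // 2
        return sequence[:mid] + sequence[mid + n:]
    if where == "end":
        return sequence[:-n]
    raise ValueError(where)

seq = [101, 54, 23, 87, 9, 42]
for where in ("beginning", "middle", "end"):
    print(where, perturb(seq, where, n=2))
```
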
Conversational Search with Tail Entities Hai Dang Tran, Andrew Yates, Gerhard Weikum Max Planck Inst Informat, Saarbrucken, Germany Conversational search faces incomplete and informal follow-up questions. Prior works address these by contextualizing user utterances with cues derived from the previous turns of the conversation. This approach works well when the conversation centers on prominent entities, for which knowledge bases (KBs) or language models (LMs) can provide rich background. This work addresses the unexplored direction where user questions are about tail entities, not featured in KBs and sparsely covered by LMs. We devise a new method, called CONSENT, for selectively contextualizing a user utterance with turns, KB-linkable entities, and mentions of tail and out-of-KB (OKB) entities. CONSENT derives relatedness weights from Sentence-BERT similarities and employs an integer linear program (ILP) for judiciously selecting the best context cues for a given set of candidate answers. This method couples the contextualization and answer-ranking stages, and jointly infers the best choices for both. 对话式搜索面临着不完整和非正式的后续问题。先前的工作通过从对话的先前轮次中提取线索来上下文化用户话语,从而解决这些问题。当对话围绕突出的实体展开时,这种方法效果很好,因为知识库(KB)或语言模型(LM)可以提供丰富的背景信息。本研究解决了一个尚未探索的方向,即用户问题涉及尾部实体,这些实体不在知识库中,且语言模型的覆盖较少。我们设计了一种新方法,称为CONSENT,用于选择性地将用户话语与对话轮次、可链接到知识库的实体以及尾部实体和知识库外(OKB)实体的提及进行上下文化。CONSENT从Sentence-BERT的相似性中推导出相关性权重,并采用整数线性规划(ILP)来明智地为给定的一组候选答案选择最佳上下文线索。该方法将上下文化和答案排序阶段结合起来,并共同推断出两者的最佳选择。 code 0
Event-Specific Document Ranking Through Multi-stage Query Expansion Using an Event Knowledge Graph Sara Abdollahi, Tin Kuculo, Simon Gottschalk Leibniz Univ Hannover, Res Ctr L3S, Hannover, Germany Event-specific document ranking is a crucial task in supporting users when searching for texts covering events such as Brexit or the Olympics. However, the complex nature of events involving multiple aspects like temporal information, location, participants and sub-events poses challenges in effectively modelling their representations for ranking. In this paper, we propose MusQuE (Multi-stage Query Expansion), a multi-stage ranking framework that jointly learns to rank query expansion terms and documents, and in this manner flexibly identifies the optimal combination and number of expansion terms extracted from an event knowledge graph. Experimental results show that MusQuE outperforms state-of-the-art baselines on MS-MARCO EVENT , a new dataset for event-specific document ranking, by 9.1 % and more. 事件特定文档排序是支持用户搜索涉及诸如英国脱欧或奥运会等事件文本的关键任务。然而,事件的复杂性,包括时间信息、地点、参与者和子事件等多个方面,给有效建模这些表示以进行排序带来了挑战。在本文中,我们提出了MusQuE(多阶段查询扩展),这是一个多阶段排序框架,它联合学习排序查询扩展词和文档,从而灵活地识别从事件知识图谱中提取的扩展词的最佳组合和数量。实验结果表明,MusQuE在MS-MARCO EVENT(一个新的事件特定文档排序数据集)上比现有最先进的基线方法提升了9.1%甚至更多。 code 0
Simulating Follow-Up Questions in Conversational Search Johannes Kiesel, Marcel Gohsen, Nailia Mirzakhmedova, Matthias Hagen, Benno Stein Friedrich Schiller Univ Jena, Ernst Abbe Pl 2, D-07743 Jena, Germany; Bauhaus Univ Weimar, Bauhausstr 9a, D-99423 Weimar, Germany Evaluating conversational search systems based on simulated user interactions is a potential approach to overcome one of the main problems of static conversational search test collections: the collections contain only very few of all the plausible conversations on a topic. Still, one of the challenges of user simulation is generating realistic follow-up questions on given outputs of a conversational system. We propose to address this challenge by using state-of-the-art language models and find that: (1) on two conversational search datasets, the tested models generate questions that are semantically similar to those in the datasets, especially when tuned for follow-up questions; (2) the generated questions are mostly valid, related, informative, and specific according to human assessment; and (3) for influencing the characteristics of the simulated questions, small changes to the prompt are insufficient. 基于模拟用户交互来评估对话搜索系统是一种潜在的方法,旨在克服静态对话搜索测试集的一个主要问题:这些测试集仅包含关于某个主题的少数可能的对话。然而,用户模拟的挑战之一是在给定对话系统输出的情况下生成现实的后续问题。我们提出通过使用最先进的语言模型来解决这一挑战,并发现:(1)在两个对话搜索数据集上,测试的模型生成的问题与数据集中的问题在语义上相似,尤其是在针对后续问题进行微调时;(2)根据人工评估,生成的问题大多是有效的、相关的、信息丰富且具体的;(3)对于影响模拟问题特征的需求,仅对提示进行小幅修改是不够的。 code 0
MOReGIn: Multi-Objective Recommendation at the Global and Individual Levels Elizabeth Gómez, David Contreras, Ludovico Boratto, Maria Salamó Multi-Objective Recommender Systems (MORSs) emerged as a paradigm to guarantee multiple (often conflicting) goals. Besides accuracy, a MORS can operate at the global level, where additional beyond-accuracy goals are met for the system as a whole, or at the individual level, meaning that the recommendations are tailored to the needs of each user. The state-of-the-art MORSs either operate at the global or individual level, without assuming the co-existence of the two perspectives. In this study, we show that when global and individual objectives co-exist, MORSs are not able to meet both types of goals. To overcome this issue, we present an approach that regulates the recommendation lists so as to guarantee both global and individual perspectives, while preserving its effectiveness. Specifically, as individual perspective, we tackle genre calibration and, as global perspective, provider fairness. We validate our approach on two real-world datasets, publicly released with this paper. 多目标推荐系统(MORS)作为一种范式出现,以保证多个(经常相互冲突)的目标。除了准确性之外,一个 MORS 还可以在全球一级运作,在这一级可以为整个系统或在个人一级实现额外的超准确性目标,这意味着建议是根据每个用户的需要量身定制的。最先进的监测系统既可以在全球一级运作,也可以在个人一级运作,而不必假设这两种观点并存。在这项研究中,我们发现当全球目标和个人目标共存时,MORS 不能同时满足这两种目标。为了解决这一问题,我们提出了一种管理建议清单的办法,以保证全球和个人的观点,同时保持其有效性。具体来说,作为个人的角度,我们处理体裁校准和作为全球的角度,提供者的公平性。我们验证了我们的方法在两个真实世界的数据集,公开发布与本文。 code 0
VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning Nanyi Fei, Hao Jiang, Haoyu Lu, Jinqiang Long, Yanqi Dai, Tuo Fan, Zhao Cao, Zhiwu Lu Huawei Poisson Lab, Hangzhou, Zhejiang, Peoples R China; Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China; Renmin Univ China, Sch Informat, Beijing, Peoples R China Cross-modal search is one fundamental task in multi-modal learning, but there is hardly any work that aims to solve multiple cross-modal search tasks at once. In this work, we propose a novel Versatile Elastic Multi-mOdal (VEMO) model for search-oriented multi-task learning. VEMO is versatile because we integrate cross-modal semantic search, named entity recognition, and scene text spotting into a unified framework, where the latter two can be further adapted to entity- and character-based image search tasks. VEMO is also elastic because we can freely assemble sub-modules of our flexible network architecture for corresponding tasks. Moreover, to give more choices on the effect-efficiency trade-off when performing cross-modal semantic search, we place multiple encoder exits. Experimental results show the effectiveness of our VEMO with only 37.6% network parameters compared to those needed for uni-task training. Further evaluations on entity- and character-based image search tasks also validate the superiority of search-oriented multi-task learning. 跨模态搜索是多模态学习中的一项基本任务,但几乎没有工作旨在一次性解决多个跨模态搜索任务。在本研究中,我们提出了一种新颖的通用弹性多模态(VEMO)模型,用于面向搜索的多任务学习。VEMO具有通用性,因为我们集成了跨模态语义搜索、命名实体识别和场景文本识别到一个统一的框架中,后两者可以进一步适应基于实体和字符的图像搜索任务。VEMO还具有弹性,因为我们可以自由组装我们灵活网络架构的子模块以适应相应任务。此外,为了在执行跨模态语义搜索时提供更多关于效果与效率权衡的选择,我们设置了多个编码器出口。实验结果表明,我们的VEMO模型仅需37.6%的网络参数即可达到与单任务训练相当的效果。在基于实体和字符的图像搜索任务上的进一步评估也验证了面向搜索的多任务学习的优越性。 code 0
Lightweight Modality Adaptation to Sequential Recommendation via Correlation Supervision Hengchang Hu, Qijiong Liu, Chuang Li, Min-Yen Kan In Sequential Recommenders (SR), encoding and utilizing modalities in an end-to-end manner is costly in terms of modality encoder sizes. Two-stage approaches can mitigate such concerns, but they suffer from poor performance due to modality forgetting, where the sequential objective overshadows modality representation. We propose a lightweight knowledge distillation solution that preserves both merits: retaining modality information and maintaining high efficiency. Specifically, we introduce a novel method that enhances the learning of embeddings in SR through the supervision of modality correlations. The supervision signals are distilled from the original modality representations, including both (1) holistic correlations, which quantify their overall associations, and (2) dissected correlation types, which refine their relationship facets (honing in on specific aspects like color or shape consistency). To further address the issue of modality forgetting, we propose an asynchronous learning step, allowing the original information to be retained longer for training the representation learning module. Our approach is compatible with various backbone architectures and outperforms the top baselines by 6.8%. We find that preserving the original feature associations from modality encoders significantly boosts task-specific recommendation adaptation. Additionally, we find that larger modality encoders (e.g., Large Language Models) contain richer feature sets which necessitate more fine-grained modeling to reach their full performance potential. 在序列推荐器(SR)中,以端到端的方式编码和利用模式在模式编码器大小方面是昂贵的。两阶段方法可以减轻这种担忧,但是由于情态遗忘,它们的表现很差,其中连续的目标掩盖了情态表示。提出了一种轻量级的知识提取方法,该方法既保留了模态信息,又保持了高效率。具体来说,我们提出了一种新的方法,通过监督情态相关性来提高嵌入的学习效果。监督信号是从原始的模态表示中提取出来的,包括(1)量化其总体关联的整体相关性和(2)剖析的相关类型,这些相关类型细化了它们的关系方面(在特定方面如颜色或形状一致性上打磨)。为了进一步解决模态遗忘问题,我们提出了一个异步学习步骤,允许原始信息保留更长的时间来训练表征学习模块。我们的方法与各种骨干架构兼容,并优于最高基线6.8原始特征关联的形式编码器显着提高任务特定的推荐适应性。此外,我们发现较大的模态编码器(例如,大型语言模型)包含更丰富的特征集,这需要更细粒度的建模来达到其全部性能潜力。 code 0
DREQ: Document Re-ranking Using Entity-Based Query Understanding Shubham Chatterjee, Iain Mackie, Jeff Dalton While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present DREQ, an entity-oriented dense document re-ranking model. Uniquely, we emphasize the query-relevant entities within a document's representation while simultaneously attenuating the less relevant ones, thus obtaining a query-specific entity-centric document representation. We then combine this entity-centric document representation with the text-centric representation of the document to obtain a "hybrid" representation of the document. We learn a relevance score for the document using this hybrid representation. Using four large-scale benchmarks, we show that DREQ outperforms state-of-the-art neural and non-neural re-ranking methods, highlighting the effectiveness of our entity-oriented representation approach. 尽管面向实体的神经 IR 模型已经取得了显著的进步,但它们往往忽略了一个关键的细微差别: 文档中各个实体对其总体相关性的不同程度的影响。针对这一差距,我们提出了面向实体的密集文档重排序模型 DREQ。独特的是,我们强调文档表示中的查询相关实体,同时减弱相关性较差的实体,从而获得一个特定于查询的以实体为中心的文档表示。然后,我们将这种以实体为中心的文档表示与以文本为中心的文档表示结合起来,以获得文档的“混合”表示。我们使用这种混合表示学习文档的相关性得分。使用四个大规模的基准测试,我们表明 DREQ 优于最先进的神经元和非神经元重新排序方法,突出了我们的面向实体的表示方法的有效性。 code 0
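
A toy numpy sketch of the general recipe in this abstract: attend over a document's entity embeddings with query similarity so that relevant entities dominate, then concatenate with a text embedding into a "hybrid" representation. The dimensions, softmax weighting, and dot-product scorer are illustrative assumptions, not DREQ's learned components:

```python
import numpy as np

def entity_centric_repr(entity_vecs, query_vec):
    """Weight each entity embedding by its similarity to the query
    (softmax attention), so query-relevant entities dominate and the
    less relevant ones are attenuated."""
    sims = entity_vecs @ query_vec
    w = np.exp(sims - sims.max())
    w /= w.sum()
    return (w[:, None] * entity_vecs).sum(axis=0)

rng = np.random.default_rng(3)
q = rng.normal(size=32)            # query embedding (toy)
ents = rng.normal(size=(5, 32))    # embeddings of the document's entities
text = rng.normal(size=32)         # text-centric document embedding

hybrid = np.concatenate([entity_centric_repr(ents, q), text])  # "hybrid" representation
score = float(hybrid @ np.concatenate([q, q]))  # toy relevance score; the paper learns this
print(hybrid.shape, score)
```
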
Beyond Topicality: Including Multidimensional Relevance in Cross-encoder Re-ranking - The Health Misinformation Case Study Rishabh Upadhyay, Arian Askari, Gabriella Pasi, Marco Viviani Univ Milano Bicocca, Dept Informat Syst & Commun, Viale Sarca 336, I-20126 Milan, Italy; Leiden Univ, Leiden Inst Adv Comp Sci, Niels Bohrweg 1, NL-2333 CA Leiden, Netherlands In this paper, we propose a novel approach to consider multiple dimensions of relevance in cross-encoder re-ranking. On the one hand, cross-encoders constitute an effective solution for re-ranking when considering a single relevance dimension such as topicality, but are not designed to straightforwardly account for additional relevance dimensions. On the other hand, the majority of re-ranking models accounting for multdimensional relevance are often based on the aggregation of multiple relevance scores at the re-ranking stage, leading to potential compensatory effects. To address these issues, in the proposed solution we enhance the candidate documents retrieved by a first-stage lexical retrieval model with suitable relevance statements related to distinct relevance dimensions, and then perform a re-ranking on them with cross-encoders. In this work we focus, in particular, on an extra dimension of relevance beyond topicality, namely, credibility, to address health misinformation in the Consumer Health Search task. Experimental evaluations are performed by considering publicly available datasets; our results show that the proposed approach statistically outperforms state-of-the-art aggregation-based and cross-encoder re-rankers. 在本文中,我们提出了一种新颖的方法,用于在交叉编码器重排序中考虑多个相关性维度。一方面,交叉编码器在考虑单一相关性维度(如主题相关性)时,构成了重排序的有效解决方案,但其设计并不直接考虑额外的相关性维度。另一方面,大多数考虑多维相关性的重排序模型通常基于在重排序阶段对多个相关性分数的聚合,这可能导致潜在的补偿效应。为了解决这些问题,在所提出的解决方案中,我们通过增强第一阶段词汇检索模型检索到的候选文档,使用与不同相关性维度相关的适当相关性声明,然后使用交叉编码器对它们进行重排序。在这项工作中,我们特别关注除了主题相关性之外的另一个相关性维度,即可信度,以解决消费者健康搜索任务中的健康错误信息问题。实验评估通过考虑公开可用的数据集进行;我们的结果表明,所提出的方法在统计上优于最先进的基于聚合的交叉编码器重排序方法。 code 0
Query Obfuscation for Information Retrieval Through Differential Privacy Guglielmo Faggioli, Nicola Ferro Univ Padua, Padua, Italy Protecting the privacy of a user querying an Information Retrieval (IR) system is of utmost importance. The problem is exacerbated when the IR system is not cooperative in satisfying the user's privacy requirements. To address this, obfuscation techniques split the user's sensitive query into multiple non-sensitive ones that can be safely transmitted to the IR system. To generate such queries, current approaches rely on lexical databases, such as WordNet, or heuristics of word co-occurrences. At the same time, advances in Natural Language Processing (NLP) have shown the power of Differential Privacy (DP) in releasing privacy-preserving text for completely different purposes, such as spam detection and sentiment analysis. We investigate for the first time whether DP mechanisms, originally designed for specific NLP tasks, can effectively be used in IR to obfuscate queries. We also assess their performance compared to state-of-the-art techniques in IR. Our empirical evaluation shows that the Vickrey DP mechanism based on the Mahalanobis norm with privacy budget ε ∈ [10, 12.5] achieves state-of-the-art privacy protection and improved effectiveness. Furthermore, differently from previous approaches that are substantially on/off, by changing the privacy budget ε, DP allows users to adjust their desired level of privacy protection, offering a trade-off between effectiveness and privacy. 保护用户在查询信息检索(IR)系统时的隐私至关重要。当IR系统不合作满足用户的隐私需求时,这一问题变得更加严重。为了解决这一问题,混淆技术将用户的敏感查询分割成多个非敏感查询,这些查询可以安全地传输到IR系统。为了生成此类查询,当前方法依赖于词汇数据库(如WordNet)或词汇共现的启发式方法。与此同时,自然语言处理(NLP)的进展展示了差分隐私(DP)在发布隐私保护文本方面的强大能力,尽管这些文本最初是为完全不同的目的(如垃圾邮件检测和情感分析)设计的。我们首次研究了原本为特定NLP任务设计的DP机制是否能够有效地用于IR中的查询混淆。我们还评估了它们与IR领域最先进技术相比的性能。我们的实证评估表明,基于马氏范数且隐私预算epsilon属于[10, 12.5]范围的Vickrey DP机制能够实现最先进的隐私保护并提高有效性。此外,与之前基本上是开关式的方法不同,通过改变隐私预算epsilon,DP允许用户调整所需的隐私保护级别,从而在有效性和隐私之间提供权衡。 code 0
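
As background, metric-DP obfuscation mechanisms of this family typically perturb a word's embedding with ε-calibrated noise and release the nearest vocabulary word. The sketch below implements a generic such mechanism (uniform noise direction, Gamma-distributed magnitude) on a toy vocabulary; it is not the paper's Vickrey/Mahalanobis variant:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
vocab = [f"word{i}" for i in range(1000)]
emb = rng.normal(size=(1000, d))  # toy embedding table

def noisy_word(word, epsilon):
    """Metric-DP word replacement: perturb the word's embedding with
    epsilon-calibrated noise, then snap back to the nearest vocabulary
    word (generic mechanism, not the paper's Vickrey variant)."""
    v = emb[vocab.index(word)]
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=d, scale=1.0 / epsilon)
    noisy = v + magnitude * direction
    return vocab[int(np.argmin(np.linalg.norm(emb - noisy, axis=1)))]

query = ["word3", "word42", "word7"]
print([noisy_word(w, epsilon=10.0) for w in query])
```
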
On-Device Query Auto-completion for Email Search Yifan Qiao, Otto Godwin, Hua Ouyang Traditional query auto-completion (QAC) relies heavily on search logs collected over many users. However, in on-device email search, the scarcity of logs and the governing privacy constraints make QAC a challenging task. In this work, we propose an on-device QAC method that runs directly on users’ devices, where users’ sensitive data and interaction logs are not collected, shared, or aggregated through web services. This method retrieves candidates using pseudo relevance feedback, and ranks them based on relevance signals that explore the textual and structural information from users’ emails. We also propose a private corpora based evaluation method, and empirically demonstrate the effectiveness of our proposed method. 传统的查询自动完成(QAC)在很大程度上依赖于多个用户收集的搜索日志。然而,在设备上的电子邮件搜索,日志的稀缺性和管理隐私的约束使 QAC 一个具有挑战性的任务。在这项工作中,我们提出了一个在设备上的 QAC 方法,直接运行在用户的设备上,其中用户的敏感数据和交互日志不收集,共享,或通过 Web 服务聚合。这种方法使用伪关联反馈检索候选人,并根据相关信号对他们进行排名,这些相关信号探索用户电子邮件的文本和结构信息。我们还提出了一种基于私人语料库的评价方法,并通过实例验证了该方法的有效性。 code 0
Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query? Juan Manuel Rodriguez, Nima Tavassoli, Eliezer Levy, Gil Lederman, Dima Sivov, Matteo Lissandrini, Davide Mottin Tel Aviv Res Ctr Huawei Technol, Pnueli Lab, Tel Aviv, Israel; Aalborg Univ, Aalborg, Denmark; Aarhus Univ, Aarhus, Denmark Text-image retrieval (T2I) refers to the task of recovering all images relevant to a keyword query. Popular datasets for text-image retrieval, such as Flickr30k, VG, or MS-COCO, utilize annotated image captions, e.g., "a man playing with a kid", as a surrogate for queries. With such surrogate queries, current multi-modal machine learning models, such as CLIP or BLIP, perform remarkably well. The main reason is the descriptive nature of captions, which detail the content of an image. Yet, T2I queries go beyond the mere descriptions in image-caption pairs. Thus, these datasets are ill-suited to test methods on more abstract or conceptual queries, e.g., "family vacations". In such queries, the image content is implied rather than explicitly described. In this paper, we replicate the T2I results on descriptive queries and generalize them to conceptual queries. To this end, we perform new experiments on a novel T2I benchmark for the task of conceptual query answering, called ConQA. ConQA comprises 30 descriptive and 50 conceptual queries on 43k images with more than 100 manually annotated images per query. Our results on established measures show that both large pretrained models (e.g., CLIP, BLIP, and BLIP2) and small models (e.g., SGRAF and NAAF), perform up to 4x better on descriptive rather than conceptual queries. We also find that the models perform better on queries with more than 6 keywords as in MS-COCO captions. 文本-图像检索(T2I)是指根据关键词查询恢复所有相关图像的任务。流行的文本-图像检索数据集,如Flickr30k、VG或MS-COCO,使用带注释的图像描述作为查询的替代,例如“一个男人和一个孩子玩耍”。利用这些替代查询,当前的多模态机器学习模型(如CLIP或BLIP)表现非常出色。主要原因是描述的详细性,这些描述详细说明了图像的内容。然而,T2I查询不仅仅是图像-描述对中的简单描述。因此,这些数据集不适合测试更抽象或概念性查询的方法,例如“家庭度假”。在这种查询中,图像内容是隐含的,而不是明确描述的。在本文中,我们复制了描述性查询的T2I结果,并将其推广到概念性查询。为此,我们在一个新的T2I基准上进行了新的实验,该基准用于概念性查询回答任务,称为ConQA。ConQA包含43k张图像上的30个描述性查询和50个概念性查询,每个查询有超过100张手动注释的图像。我们在已建立的度量标准上的结果显示,无论是大型预训练模型(如CLIP、BLIP和BLIP2)还是小型模型(如SGRAF和NAAF),在描述性查询上的表现都比概念性查询好4倍。我们还发现,模型在超过6个关键词的查询上表现更好,这与MS-COCO描述中的情况一致。 code 0
Query Generation Using Large Language Models - A Reproducibility Study of Unsupervised Passage Reranking David Rau, Jaap Kamps Univ Amsterdam, Amsterdam, Netherlands Existing passage retrieval techniques predominantly emphasize classification or dense matching strategies. This is in contrast with classic language modeling approaches focusing on query or question generation. Recently, Sachan et al. introduced an Unsupervised Passage Retrieval (UPR) approach that resembles this by exploiting the inherent generative capabilities of large language models. In this replicability study, we revisit the concept of zero-shot question generation for re-ranking and focus our investigation on the ranking experiments, validating the UPR findings, particularly on the widely recognized BEIR benchmark. Furthermore, we extend the original work by evaluating the proposed method additionally on the TREC Deep Learning track benchmarks of 2019 and 2020. To enhance our understanding of the technique’s performance, we introduce novel experiments exploring the influence of different prompts on retrieval outcomes. Our comprehensive analysis provides valuable insights into the robustness and applicability of zero-shot question generation as a re-ranking strategy in passage retrieval. 现有的段落检索技术主要集中在分类或密集匹配策略上,这与经典的语言建模方法形成对比,后者侧重于查询或问题生成。最近,Sachan等人提出了一种无监督段落检索(UPR)方法,该方法通过利用大型语言模型固有的生成能力,与此类方法类似。在这项可重复性研究中,我们重新审视了零样本问题生成用于重新排序的概念,并将研究重点放在排序实验上,验证了UPR的发现,特别是在广泛认可的BEIR基准测试上。此外,我们通过进一步在2019年和2020年的TREC深度学习赛道基准测试上评估所提出的方法,扩展了原始工作。为了加深对该技术性能的理解,我们引入了一系列新实验,探索不同提示对检索结果的影响。我们的全面分析为零样本问题生成作为段落检索中的重新排序策略的鲁棒性和适用性提供了宝贵的见解。 code 0
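
The UPR scoring rule is compact enough to sketch: re-rank passages by the likelihood that a seq2seq model generates the query from the passage. The sketch below follows that recipe with a small Flan-T5 stand-in (UPR used much larger instruction-tuned models, and the exact prompt wording here is approximate):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small").eval()

def upr_score(passage, query):
    """Score = likelihood of generating the query from the passage.
    The seq2seq loss is the mean token NLL of the query, so we negate it."""
    prompt = f"Passage: {passage} Please write a question based on this passage."
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    labels = tok(query, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss
    return -loss.item()  # higher = more likely to generate the query

passages = ["The Eiffel Tower is in Paris.", "Python is a programming language."]
query = "where is the eiffel tower"
print(sorted(passages, key=lambda p: upr_score(p, query), reverse=True)[0])
```
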
|Ranking Distance Metric for Privacy Budget in Distributed Learning of Finite Embedding Data|Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia|JPMorgan Chase, Global Technol Appl Res, New York, NY 10017 USA|Federated Learning (FL) is a distributed learning paradigm that aims to preserve privacy in data. Recent studies have shown FL models to be vulnerable to reconstruction attacks that compromise data privacy by inverting gradients computed on confidential data. To address the challenge of defending against these attacks, it is common to employ methods that guarantee data confidentiality using the principles of Differential Privacy (DP). However, in many cases, especially for machine learning models trained on unstructured data such as text, evaluating privacy also requires considering the finite embedding space of a client's private data. In this study, we show how privacy in a distributed FL setup is sensitive to the underlying finite embeddings of the confidential data. We show that privacy can be quantified for a client batch that uses either noise, or a mixture of finite embeddings, by introducing a normalised rank distance (d_rank). This measure has the advantage of taking into account the size of a finite vocabulary embedding, and aligns the privacy budget to a partitioned space. We further explore the impact of noise and client batch size on the privacy budget and compare it to the standard epsilon derived from Local-DP.|联邦学习(Federated Learning, FL)是一种分布式学习范式,旨在保护数据隐私。最近的研究表明,FL模型容易受到重建攻击的威胁,这些攻击通过反转在机密数据上计算的梯度来破坏数据隐私。为了应对这些攻击的防御挑战,通常采用基于差分隐私(Differential Privacy, DP)原则的方法来保证数据的机密性。然而,在许多情况下,特别是对于在文本等非结构化数据上训练的机器学习模型,评估隐私还需要考虑客户端私有数据的有限嵌入空间。在本研究中,我们展示了分布式FL设置中的隐私如何对机密数据的底层有限嵌入敏感。我们通过引入归一化秩距离(d_rank)展示了如何为使用噪声或有限嵌入混合的客户端批次量化隐私。这一度量的优势在于考虑了有限词汇嵌入的大小,并将隐私预算与分区空间对齐。我们进一步探讨了噪声和客户端批次大小对隐私预算的影响,并将其与从本地差分隐私(Local-DP)得出的标准ε进行比较。|code|0|
Effective Adhoc Retrieval Through Traversal of a Query-Document Graph Erlend Frayling, Sean MacAvaney, Craig Macdonald, Iadh Ounis Univ Glasgow, Glasgow, Lanark, Scotland Adhoc retrieval is the task of effectively retrieving information for an end-user's information need, usually expressed as a textual query. One of the most well-established retrieval frameworks is the two-stage retrieval pipeline, whereby an inexpensive retrieval algorithm retrieves a subset of candidate documents from a corpus, and a more sophisticated (but costly) model re-ranks these candidates. A notable limitation of this two-stage framework is that the second stage re-ranking model can only re-order documents, and any relevant documents not retrieved from the corpus in the first stage are entirely lost to the second stage. A recently-proposed Adaptive Re-Ranking technique has shown that extending the candidate pool by traversing a document similarity graph can overcome this recall problem. However, this traversal technique is agnostic of the user's query, which has the potential to waste compute resources by scoring documents that are not related to the query. In this work, we propose an alternative formulation of the document similarity graph. Rather than using document similarities, we propose a weighted bipartite graph that consists of both document nodes and query nodes. This overcomes the limitations of prior Adaptive Re-Ranking approaches because the bipartite graph can be navigated in a manner that explicitly acknowledges the original user query issued to the search pipeline. We evaluate the effectiveness of our proposed framework by experimenting with the TREC Deep Learning track in a standard adhoc retrieval setting. We find that our approach outperforms state-of-the-art two-stage re-ranking pipelines, improving the nDCG@10 metric by 5.8% on the DL19 test collection. 特定信息检索(Adhoc retrieval)是一项针对终端用户信息需求进行有效检索的任务,通常以文本查询的形式表达。其中最为成熟的检索框架之一是两阶段检索流程,即先通过一种成本较低的检索算法从语料库中检索出一部分候选文档,再由一个更为复杂(但成本较高)的模型对这些候选文档进行重新排序。这种两阶段框架的一个显著局限性在于,第二阶段的重新排序模型只能对文档进行重新排序,而在第一阶段未能从语料库中检索到的任何相关文档将完全丢失在第二阶段中。最近提出的一种自适应重新排序技术(Adaptive Re-Ranking)表明,通过遍历文档相似图来扩展候选池可以克服这一召回问题。然而,这种遍历技术与用户的查询无关,可能会导致计算资源的浪费,因为可能会对与查询无关的文档进行评分。

在本研究中,我们提出了一种替代的文档相似图构建方法。我们不再使用文档相似性,而是提出了一种由文档节点和查询节点组成的加权二分图。这种方法克服了先前自适应重新排序技术的局限性,因为二分图可以在明确考虑用户原始查询的情况下进行导航。我们在标准的特定信息检索设置中,通过TREC深度学习(Deep Learning)赛道的实验评估了我们提出框架的有效性。实验结果表明,我们的方法优于当前最先进的两阶段重新排序流程,在DL19测试集上的nDCG@10指标上提升了5.8%。|code|0| |MMCRec: Towards Multi-modal Generative AI in Conversational Recommendation|Tendai Mukande, Esraa Ali, Annalina Caputo, Ruihai Dong, Noel E. O'Connor|Dublin City Univ, Dublin 9, Ireland; Univ Coll Dublin, Dublin 4, Ireland|Personalized recommendation systems have become integral in this digital age by facilitating content discovery to users and products tailored to their preferences. Since the Generative Artificial Intelligence (GAI) boom, research into GAI-enhanced Conversational Recommender Systems (CRSs) has sparked great interest. Most existing methods, however, mainly rely on one mode of input such as text, thereby limiting their ability to capture content diversity. This is also inconsistent with real-world scenarios, which involve multi-modal input data and output data. To address these limitations, we propose the Multi-Modal Conversational Recommender System (MMCRec) model which harnesses multiple modalities, including text, images, voice and video to enhance the recommendation performance and experience. Our model is capable of not only accepting multi-mode input, but also generating multi-modal output in conversational recommendation. Experimental evaluations demonstrate the effectiveness of our model in real-world conversational recommendation scenarios.|在这个数字化时代,个性化推荐系统通过为用户提供符合其偏好的内容和产品,已成为不可或缺的一部分。自生成式人工智能(GAI)兴起以来,关于GAI增强的对话推荐系统(CRSs)的研究引起了极大兴趣。然而,大多数现有方法主要依赖于单一输入模式,如文本,这限制了它们捕捉内容多样性的能力。这与现实场景也不一致,现实场景中涉及多模态的输入和输出数据。为了解决这些限制,我们提出了多模态对话推荐系统(MMCRec)模型,该模型利用多种模态,包括文本、图像、语音和视频,以增强推荐性能和体验。我们的模型不仅能够接受多模态输入,还能在对话推荐中生成多模态输出。实验评估证明了我们的模型在实际对话推荐场景中的有效性。|code|0| |Federated Conversational Recommender Systems|Allen Lin, Jianling Wang, Ziwei Zhu, James Caverlee|George Mason Univ, Fairfax, VA 22030 USA; Texas A&M Univ, College Stn, TX 77843 USA|Conversational Recommender Systems (CRSs) have become increasingly popular as a powerful tool for providing personalized recommendation experiences. By directly engaging with users in a conversational manner to learn their current and fine-grained preferences, a CRS can quickly derive recommendations that are relevant and justifiable. However, existing CRSs typically rely on a centralized training and deployment process, which involves collecting and storing explicitly-communicated user preferences in a centralized repository. These fine-grained user preferences are completely human-interpretable and can easily be used to infer sensitive information (e.g., financial status, political stands, and health information) about the user, if leaked or breached. To address the user privacy concerns in CRS, we first define a set of privacy protection guidelines for preserving user privacy then propose a novel federated CRS framework that effectively reduces the risk of exposing user privacy. 
Through extensive experiments, we show that the proposed framework not only satisfies these user privacy protection guidelines, but also achieves competitive recommendation performance comparing to the state-of-the-art non-private conversational recommendation approach.|会话推荐系统(Conversational Recommender Systems, CRSs)作为一种提供个性化推荐体验的强大工具,正变得越来越受欢迎。通过与用户以对话形式直接互动,CRS能够学习用户的当前和细粒度偏好,从而快速生成相关且合理的推荐。然而,现有的CRS通常依赖于集中式的训练和部署过程,这涉及将用户明确表达的偏好收集并存储在集中式存储库中。这些细粒度的用户偏好是完全可被人类理解的,如果泄露或被攻击,很容易被用于推断用户的敏感信息(例如财务状况、政治立场和健康信息)。为了解决CRS中的用户隐私问题,我们首先定义了一组保护用户隐私的隐私保护准则,然后提出了一种新颖的联邦CRS框架,该框架有效降低了用户隐私暴露的风险。通过大量实验,我们证明所提出的框架不仅满足了这些用户隐私保护准则,而且在推荐性能上与最先进的非隐私会话推荐方法相比也具备竞争力。|code|0| |Improving Exposure Allocation in Rankings by Query Generation|Thomas Jänich, Graham McDonald, Iadh Ounis|Univ Glasgow, Glasgow, Lanark, Scotland|Deploying methods that incorporate generated queries in their retrieval process, such as Doc2Query, has been shown to be effective for retrieving the most relevant documents for a user's query. However, to the best of our knowledge, there has been no work yet on whether generated queries can also be used in the ranking process to achieve other objectives, such as ensuring a fair distribution of exposure in the ranking. Indeed, the amount of exposure that a document is likely to receive depends on the document's position in the ranking, with lower-ranked documents having a lower probability of being examined by the user. While the utility to users remains the main objective of an Information Retrieval (IR) system, an unfair exposure allocation can lead to lost opportunities and unfair economic impacts for particular societal groups. Therefore, in this work, we conduct a first investigation into whether generating relevant queries can help to fairly distribute the exposure over groups of documents in a ranking. In our work, we build on the effective Doc2Query methods to selectively generate relevant queries for underrepresented groups of documents and use their predicted relevance to the original query in order to re-rank the underexposed documents. Our experiments on the TREC 2022 Fair Ranking Track collection show that using generated queries consistently leads to a fairer allocation of exposure compared to a standard ranking while still maintaining utility.|在检索过程中采用生成查询的方法,如Doc2Query,已被证明能够有效检索与用户查询最相关的文档。然而,据我们所知,目前尚未有研究探讨生成查询是否也可用于排序过程中以实现其他目标,例如确保排序中曝光的公平分配。事实上,文档可能获得的曝光量取决于其在排序中的位置,排名较低的文档被用户查看的概率较低。尽管用户效用仍然是信息检索(IR)系统的主要目标,但不公平的曝光分配可能导致某些社会群体错失机会并遭受不公平的经济影响。因此,在本研究中,我们首次探讨了生成相关查询是否有助于在排序中公平分配文档组的曝光量。在我们的研究中,我们基于有效的Doc2Query方法,选择性地为代表性不足的文档组生成相关查询,并利用它们对原始查询的预测相关性来重新排序曝光不足的文档。我们在TREC 2022公平排序赛道数据集上的实验表明,与标准排序相比,使用生成查询能够持续实现更公平的曝光分配,同时仍保持用户效用。|code|0| |KnowFIRES: A Knowledge-Graph Framework for Interpreting Retrieved Entities from Search|Negar Arabzadeh, Kiarash Golzadeh, Christopher Risi, Charles L. A. Clarke, Jian Zhao|Univ Waterloo, Waterloo, ON, Canada|Entity retrieval is essential in information access domains where people search for specific entities, such as individuals, organizations, and places. While entity retrieval is an active research topic in Information Retrieval, it is necessary to explore the explainability and interpretability of them more extensively. KnowFIRES addresses this by offering a knowledge graph-based visual representation of entity retrieval results, focusing on contrasting different retrieval methods. KnowFIRES allows users to better understand these differences through the juxtaposition and superposition of retrieved sub-graphs. 
As part of our demo, we make KnowFIRES (Demo: http://knowfires.live , Source: https://github.com/kiarashgl/KnowFIRES ) web interface and its source code publicly available (A demonstration of the tool: https://www.youtube.com/watch?v=9u-877ArNYE ).|实体检索在信息访问领域中至关重要,尤其是在人们搜索特定实体(如个人、组织、地点等)时。尽管实体检索是信息检索领域的一个活跃研究课题,但有必要更广泛地探索其可解释性和可理解性。KnowFIRES通过提供基于知识图谱的实体检索结果可视化表示来解决这一问题,重点在于对比不同的检索方法。KnowFIRES允许用户通过检索子图的并置和叠加来更好地理解这些差异。作为我们演示的一部分,我们公开了KnowFIRES的Web界面及其源代码(演示:http://knowfires.live,源代码:https://github.com/kiarashgl/KnowFIRES)(工具演示视频:https://www.youtube.com/watch?v=9u-877ArNYE)。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=KnowFIRES:+A+Knowledge-Graph+Framework+for+Interpreting+Retrieved+Entities+from+Search)|0| |A Conversational Search Framework for Multimedia Archives|Anastasia Potyagalova, Gareth J. F. Jones|Dublin City Univ, Sch Comp, ADAPT Ctr, Dublin 9, Ireland|Conversational search system seek to support users in their search activities to improve the effectiveness and efficiency of search while reducing their cognitive load. The challenges of multimedia search mean that search supports provided by conversational search have the potential to improve the user search experience. For example, by assisting users in constructing better queries and making more informed decisions in relevance feedback stages whilst searching. However, previous research on conversational search has been focused almost exclusively on text archives. This demonstration illustrates the potential for the application of conversational methods in multimedia search. We describe a framework to enable multimodal conversational search for use with multimedia archives. Our current prototype demonstrates the use of an conversational AI assistant during the multimedia information retrieval process for both image and video collections.|对话式搜索系统旨在支持用户的搜索活动,以提高搜索的效率和效果,同时减少用户的认知负担。多媒体搜索的挑战意味着对话式搜索提供的支持有潜力提升用户的搜索体验。例如,通过帮助用户构建更好的查询并在搜索过程中的相关性反馈阶段做出更明智的决策。然而,以往关于对话式搜索的研究几乎完全集中在文本档案上。本演示展示了在多媒体搜索中应用对话式方法的潜力。我们描述了一个支持多模态对话式搜索的框架,用于多媒体档案。我们当前的原型展示了在多媒体信息检索过程中使用对话式人工智能助手来处理图像和视频集合。|code|0| |Effective and Efficient Transformer Models for Sequential Recommendation|Aleksandr V. Petrov|Univ Glasgow, Glasgow, Lanark, Scotland|Sequential Recommender Systems use the order of user-item interactions to predict the next item in the sequence. This task is similar to Language Modelling, where the goal is to predict the next token based on the sequence of past tokens. Therefore, adaptations of language models, and, in particular, Transformer-based models, achieved state-of-the-art results for a sequential recommendation. However, despite similarities, the sequential recommendation problem poses a number of specific challenges not present in Language Modelling. These challenges include the large catalogue size of real-world recommender systems, which increases GPU memory requirements and makes the training and the inference of recommender models slow. Another challenge is that a good recommender system should focus not only on the accuracy of recommendation but also on additional metrics, such as diversity and novelty, which makes the direct adaptation of language model training strategies problematic. Our research focuses on solving these challenges. 
In this doctoral consortium abstract, we briefly describe the motivation and background for our work and then pose research questions and discuss current progress towards solving the described problems.|序列推荐系统利用用户-物品交互的顺序来预测序列中的下一个物品。这一任务与语言建模类似,其目标是根据过去令牌的序列预测下一个令牌。因此,基于语言模型的改进,特别是基于Transformer的模型,在序列推荐中取得了最先进的结果。然而,尽管存在相似性,序列推荐问题仍带来了一些在语言建模中不存在的特定挑战。这些挑战包括现实世界推荐系统中庞大的物品目录规模,这增加了GPU内存需求,并使得推荐模型的训练和推理速度变慢。另一个挑战是,一个好的推荐系统不仅应关注推荐的准确性,还应关注其他指标,如多样性和新颖性,这使得直接采用语言模型的训练策略变得困难。我们的研究专注于解决这些挑战。在本博士联盟摘要中,我们简要描述了工作的动机和背景,随后提出了研究问题,并讨论了在解决所述问题方面的当前进展。|code|0| |Quantum Computing for Information Retrieval and Recommender Systems|Maurizio Ferrari Dacrema, Andrea Pasin, Paolo Cremonesi, Nicola Ferro|Univ Padua, Padua, Italy; Politecn Milan, Milan, Italy|The field of Quantum Computing (QC) has gained significant popularity in recent years, due to its potential to provide benefits in terms of efficiency and effectiveness when employed to solve certain computationally intensive tasks. In both Information Retrieval (IR) and Recommender Systems (RS) we are required to build methods that apply complex processing on large and heterogeneous datasets, it is natural therefore to wonder whether QC could also be applied to boost their performance. The tutorial aims to provide first an introduction to QC for an audience that is not familiar with the technology, then to show how to apply the QC paradigm of Quantum Annealing (QA) to solve practical problems that are currently faced by IR and RS systems. During the tutorial, participants will be provided with the fundamentals required to understand QC and to apply it in practice by using a real D-Wave quantum annealer through APIs.|近年来,量子计算(Quantum Computing, QC)领域因其在解决某些计算密集型任务时可能带来的效率和效果上的优势而备受关注。在信息检索(Information Retrieval, IR)和推荐系统(Recommender Systems, RS)中,我们需要构建能够对大规模异构数据集进行复杂处理的方法,因此自然会产生疑问:量子计算是否也能应用于提升这些系统的性能。本教程旨在首先为不熟悉该技术的观众提供量子计算的入门介绍,随后展示如何应用量子退火(Quantum Annealing, QA)范式来解决当前IR和RS系统面临的实际问题。在教程过程中,参与者将通过API使用真实的D-Wave量子退火器,获得理解量子计算并将其应用于实践所需的基础知识。|code|0| |Transformers for Sequential Recommendation|Aleksandr V. Petrov, Craig Macdonald|National University of Singapore, Singapore, Singapore; University of Hong Kong, Hong Kong, China; Wuhan University, Wuhan, China; Ocean University of China, Qingdao, China|Learning dynamic user preference has become an increasingly important component for many online platforms (e.g., video-sharing sites, e-commerce systems) to make sequential recommendations. Previous works have made many efforts to model item-item transitions over user interaction sequences, based on various architectures, e.g., recurrent neural networks and self-attention mechanism. Recently emerged graph neural networks also serve as useful backbone models to capture item dependencies in sequential recommendation scenarios. Despite their effectiveness, existing methods have far focused on item sequence representation with singular type of interactions, and thus are limited to capture dynamic heterogeneous relational structures between users and items (e.g., page view, add-to-favorite, purchase). To tackle this challenge, we design a Multi-Behavior Hypergraph-enhanced T ransformer framework (MBHT) to capture both short-term and long-term cross-type behavior dependencies. Specifically, a multi-scale Transformer is equipped with low-rank self-attention to jointly encode behavior-aware sequential patterns from fine-grained and coarse-grained levels. 
Additionally,we incorporate the global multi-behavior dependency into the hypergraph neural architecture to capture the hierarchical long-range item correlations in a customized manner. Experimental results demonstrate the superiority of our MBHT over various state-of- the-art recommendation solutions across different settings. Further ablation studies validate the effectiveness of our model design and benefits of the new MBHT framework. Our implementation code is released at: https://github.com/yuh-yang/MBHT-KDD22.|学习动态用户偏好已经成为许多在线平台(如视频分享网站、电子商务系统)提供连续推荐的一个越来越重要的组成部分。以往的研究基于多种体系结构,如递归神经网络和自我注意机制,对用户交互序列上的项目-项目转换进行了大量的研究。最近出现的图形神经网络也可以作为有用的骨干模型,以捕获项目依赖的顺序推荐场景。尽管现有的方法很有效,但是现有的方法都集中在单一交互类型的项目序列表示上,因此仅限于捕获用户和项目之间的动态异构关系结构(例如,页面查看、添加到收藏夹、购买)。为了应对这一挑战,我们设计了一个多行为超图增强型 T 变换器框架(MBHT)来捕获短期和长期的跨类型行为依赖。具体而言,多尺度变压器配备低级自注意,以从细粒度和粗粒度级别联合编码行为感知的序列模式。此外,我们将全局多行为依赖引入到超图神经结构中,以自定义的方式获取层次化的长期项目相关性。实验结果表明,我们的 MBHT 优于不同设置的各种最先进的推荐解决方案。进一步的消融研究验证了我们的模型设计的有效性和新的 MBHT 框架的好处。我们的实施代码在以下 https://github.com/yuh-yang/mbht-kdd22发布:。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Transformers+for+Sequential+Recommendation)|0| |Context-Aware Query Term Difficulty Estimation for Performance Prediction|Abbas Saleminezhad, Negar Arabzadeh, Soosan Beheshti, Ebrahim Bagheri|Univ Waterloo, Toronto, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Research has already found that many retrieval methods are sensitive to the choice and order of terms that appear in a query, which can significantly impact retrieval effectiveness. We capitalize on this finding in order to predict the performance of a query. More specifically, we propose to learn query term difficulty weights specifically within the context of each query, which could then be used as indicators of whether each query term has the likelihood of making the query more effective or not. We show how such difficulty weights can be learnt through the finetuning of a language model. In addition, we propose an approach to integrate the learnt weights into a cross-encoder architecture to predict query performance. We show that our proposed approach shows a consistently strong performance prediction on the MSMARCO collection and its associated widely used Trec Deep Learning tracks query sets. Our findings demonstrate that our method is able to show consistently strong performance prediction over different query sets (MSMARCO Dev, TREC DL'19, '20, Hard) and a range of evaluation metrics (Kendall, Spearman, sMARE).|研究发现,许多检索方法对查询中出现的术语选择和顺序非常敏感,这会显著影响检索效果。我们利用这一发现来预测查询的性能。更具体地说,我们提出在每个查询的上下文中学习查询术语的难度权重,这些权重可以作为指标,判断每个查询术语是否有可能使查询更有效。我们展示了如何通过微调语言模型来学习这些难度权重。此外,我们提出了一种方法,将学习到的权重集成到交叉编码器架构中,以预测查询性能。我们展示了所提出的方法在MSMARCO数据集及其相关的广泛使用的Trec深度学习赛道查询集上表现出持续强劲的性能预测能力。我们的研究结果表明,该方法能够在不同查询集(MSMARCO Dev、TREC DL'19、'20、Hard)和一系列评估指标(Kendall、Spearman、sMARE)上表现出持续强劲的性能预测能力。|code|0| |Navigating the Thin Line: Examining User Behavior in Search to Detect Engagement and Backfire Effects|Federico Maria Cau, Nava Tintarev||Opinionated users often seek information that aligns with their preexisting beliefs while dismissing contradictory evidence due to confirmation bias. This conduct hinders their ability to consider alternative stances when searching the web. Despite this, few studies have analyzed how the diversification of search results on disputed topics influences the search behavior of highly opinionated users. 
To this end, we present a preregistered user study (n = 257) investigating whether different levels (low and high) of bias metrics and search results presentation (with or without AI-predicted stances labels) can affect the stance diversity consumption and search behavior of opinionated users on three debated topics (i.e., atheism, intellectual property rights, and school uniforms). Our results show that exposing participants to (counter-attitudinally) biased search results increases their consumption of attitude-opposing content, but we also found that bias was associated with a trend toward overall fewer interactions within the search page. We also found that 19 any search results. When we removed these participants in a post-hoc analysis, we found that stance labels increased the diversity of stances consumed by users, particularly when the search results were biased. Our findings highlight the need for future research to explore distinct search scenario settings to gain insight into opinionated users' behavior.|固执己见的用户往往寻求与他们先前存在的信念相一致的信息,而由于确认偏见而排除相互矛盾的证据。这种行为妨碍了他们在搜索网页时考虑其他立场的能力。尽管如此,很少有研究分析有争议话题的搜索结果的多样化如何影响高度固执己见的用户的搜索行为。为此,我们提出了一项预先注册的用户研究(n = 257) ,调查不同水平(低和高)的偏倚指标和搜索结果表示(有或没有 AI 预测的立场标签)是否会影响立场多样性消费和搜索行为有意见的用户在三个有争议的话题(即无神论,知识产权和校服)。我们的研究结果显示,将参与者暴露在(反态度的)有偏见的搜索结果中,会增加他们对与态度相反的内容的消费,但是我们也发现,偏见与搜索页面内的整体互动减少的趋势有关。我们还发现19任何搜索结果。当我们在一个事后比较中移除这些参与者时,我们发现立场标签增加了用户使用的立场的多样性,特别是当搜索结果有偏见时。我们的研究结果强调了未来研究探索不同搜索场景设置的必要性,以深入了解固执己见的用户的行为。|code|0| |Measuring Bias in a Ranked List Using Term-Based Representations|Amin Abolghasemi, Leif Azzopardi, Arian Askari, Maarten de Rijke, Suzan Verberne||In most recent studies, gender bias in document ranking is evaluated with the NFaiRR metric, which measures bias in a ranked list based on an aggregation over the unbiasedness scores of each ranked document. This perspective in measuring the bias of a ranked list has a key limitation: individual documents of a ranked list might be biased while the ranked list as a whole balances the groups' representations. To address this issue, we propose a novel metric called TExFAIR (term exposure-based fairness), which is based on two new extensions to a generic fairness evaluation framework, attention-weighted ranking fairness (AWRF). TExFAIR assesses fairness based on the term-based representation of groups in a ranked list: (i) an explicit definition of associating documents to groups based on probabilistic term-level associations, and (ii) a rank-biased discounting factor (RBDF) for counting non-representative documents towards the measurement of the fairness of a ranked list. We assess TExFAIR on the task of measuring gender bias in passage ranking, and study the relationship between TExFAIR and NFaiRR. Our experiments show that there is no strong correlation between TExFAIR and NFaiRR, which indicates that TExFAIR measures a different dimension of fairness than NFaiRR. 
With TExFAIR, we extend the AWRF framework to allow for the evaluation of fairness in settings with term-based representations of groups in documents in a ranked list.|在最近的大多数研究中,文档排名中的性别偏见是通过 NFaiRR 度量来评估的,该度量基于每个排名文档的无偏评分的聚合来衡量排名列表中的偏见。这种测量排名表偏差的视角有一个关键的局限性: 排名表的个别文档可能有偏差,而排名表作为一个整体平衡各组的表示。为了解决这个问题,我们提出了一种新的度量方法 TExFAIR (术语暴露公平性) ,它基于通用公平性评估框架的两个新的扩展,即注意力加权排序公平性(AWRF)。TExFAIR 基于排名列表中基于术语的群体表示来评估公平性: (i)基于概率术语水平关联的关联文档与群体的明确定义,以及(ii)用于计数非代表性文档的排名折扣因子(RBDF)对排名列表的公平性进行测量。我们通过测量文章排序中的性别偏见来评估 TExFAIR,并研究 TExFAIR 和 NFaiRR 之间的关系。我们的实验表明,TExFAIR 和 NFaiRR 之间没有很强的相关性,这表明 TExFAIR 测量的公平性维度不同于 NFaiRR。通过 TExFAIR,我们扩展了 AWRF 框架,允许在排名列表中的文档中使用基于术语的群组表示来评估设置中的公平性。|code|0| |Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation|Eugene Yang, Dawn J. Lawrie, James Mayfield, Douglas W. Oard, Scott Miller||Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging due to the lack of a sufficiently large training collection when the query and document languages differ. The state of the art for CLIR thus relies on translating queries, documents, or both from the large English MS MARCO training set, an approach called Translate-Train. This paper proposes an alternative, Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model. This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR. Trained models and artifacts are publicly available on Huggingface.|先前关于英语单语检索的工作已经表明,使用大量的查询文档对相关性判断训练的交叉编码器可以作为教师来训练更有效但同样有效的双编码器学生模型。应用类似的知识提取方法来训练一个有效的双跨语检索编码器模型(CLIR) ,其中查询和文档使用不同的语言,这是一个挑战,因为当查询和文档语言不同时,缺乏足够大训练集合。因此,CLIR 的技术状态依赖于翻译查询、文档,或者两者都来自大型英文 MS MARCO 训练集,这种方法称为 Translate-Train。本文提出了一种翻译-提取的方法,利用从单语交叉编码器或 CLIR 交叉编码器中提取的知识来训练双语交叉编码器的学生模型。这个更丰富的设计空间使得教师模型能够在一个优化的设置中执行推理,同时直接为 CLIR 培训学生模型。受过训练的模型和工件可以在 Huggingface 上公开获得。|code|0| |DESIRE-ME: Domain-Enhanced Supervised Information Retrieval Using Mixture-of-Experts|Pranav Kasela, Gabriella Pasi, Raffaele Perego, Nicola Tonellotto||Open-domain question answering requires retrieval systems able to cope with the diverse and varied nature of questions, providing accurate answers across a broad spectrum of query types and topics. To deal with such topic heterogeneity through a unique model, we propose DESIRE-ME, a neural information retrieval model that leverages the Mixture-of-Experts framework to combine multiple specialized neural models. We rely on Wikipedia data to train an effective neural gating mechanism that classifies the incoming query and that weighs the predictions of the different domain-specific experts correspondingly. This allows DESIRE-ME to specialize adaptively in multiple domains. Through extensive experiments on publicly available datasets, we show that our proposal can effectively generalize domain-enhanced neural models. 
DESIRE-ME excels in handling open-domain questions adaptively, boosting by up to 12 22|开放领域的问题回答要求检索系统能够处理各种各样的问题,提供准确的答案跨广泛的查询类型和主题。为了通过一个独特的模型来处理这样的话题异质性,我们提出了 DESIRE-ME,一个神经信息检索模型,它利用专家混合框架来结合多个专门的神经模型。我们依靠 Wikipedia 数据来训练一种有效的神经门控机制,该机制对传入的查询进行分类,并相应地权衡不同领域专家的预测。这使得 DESIRE-ME 可以自适应地专门处理多个域。通过在公开数据集上的大量实验,我们表明我们的方案可以有效地推广领域增强的神经模型。DESIRE-ME 擅长于自适应地处理开放领域的问题,最多可提高12|code|0| |A Deep Learning Approach for Selective Relevance Feedback|Suchana Datta, Debasis Ganguly, Sean MacAvaney, Derek Greene||Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-based learning to determine whether a query should be expanded. In contrast, we revisit the problem of selective PRF from a deep learning perspective, presenting a model that is entirely data-driven and trained in an end-to-end manner. The proposed model leverages a transformer-based bi-encoder architecture. Additionally, to further improve retrieval effectiveness with this selective PRF approach, we make use of the model's confidence estimates to combine the information from the original and expanded queries. In our experiments, we apply this selective feedback on a number of different combinations of ranking and feedback models, and show that our proposed approach consistently improves retrieval effectiveness for both sparse and dense ranking models, with the feedback models being either sparse, dense or generative.|伪相关反馈(PRF)可以提高对足够大数量查询的平均检索效率。然而,PRF 常常引入对原始信息需求的漂移,从而影响了多个查询的检索效率。尽管 PRF 的选择性应用有可能缓解这一问题,但以前的方法在很大程度上依赖于无监督或基于特征的学习来确定是否应该扩展查询。相比之下,我们从深度学习的角度重新审视选择性 PRF 的问题,提出了一个完全由数据驱动并以端到端方式进行训练的模型。该模型利用了基于变压器的双编码器结构。此外,为了进一步提高这种选择性 PRF 方法的检索效率,我们利用模型的置信度估计来组合来自原始和扩展查询的信息。在我们的实验中,我们将这种选择性反馈应用于许多不同的排序和反馈模型组合,并且表明我们提出的方法始终如一地提高了稀疏和密集排序模型的检索效率,反馈模型要么是稀疏的,要么是密集的,要么是生成的。|code|0| |Self Contrastive Learning for Session-Based Recommendation|Zhengxiang Shi, Xi Wang, Aldo Lipani||Session-based recommendation, which aims to predict the next item of users' interest as per an existing sequence interaction of items, has attracted growing applications of Contrastive Learning (CL) with improved user and item representations. However, these contrastive objectives: (1) serve a similar role as the cross-entropy loss while ignoring the item representation space optimisation; and (2) commonly require complicated modelling, including complex positive/negative sample constructions and extra data augmentation. In this work, we introduce Self-Contrastive Learning (SCL), which simplifies the application of CL and enhances the performance of state-of-the-art CL-based recommendation techniques. Specifically, SCL is formulated as an objective function that directly promotes a uniform distribution among item representations and efficiently replaces all the existing contrastive objective components of state-of-the-art models. Unlike previous works, SCL eliminates the need for any positive/negative sample construction or data augmentation, leading to enhanced interpretability of the item representation space and facilitating its extensibility to existing recommender systems. 
Through experiments on three benchmark datasets, we demonstrate that SCL consistently improves the performance of state-of-the-art models with statistical significance. Notably, our experiments show that SCL improves the performance of two best-performing models by 8.2% and 9.5% in P@10 (Precision) and 9.9% and 11.2% in MRR@10 (Mean Reciprocal Rank) on average across different benchmarks. Additionally, our analysis elucidates the improvement in terms of alignment and uniformity of representations, as well as the effectiveness of SCL with a low computational cost.|基于会话的推荐,旨在根据已有的项目序列交互预测用户的下一个兴趣项目,已经吸引了越来越多的应用对比学习(CL)与改进的用户和项目表示。然而,这些对比的目标: (1)服务于类似的作用作为交叉熵损失,而忽略项目表示空间优化; (2)通常需要复杂的建模,包括复杂的正/负样本结构和额外的数据增强。本文介绍了自对比学习(SCL) ,简化了 CL 的应用,提高了基于 CL 的推荐技术的性能。具体来说,SCL 是一个直接促进项目表征之间均匀分布的目标函数,它有效地替代了现有最先进模型的所有对比性目标成分。与以前的工作不同,SCL 消除了任何正/负样本构建或数据增强的需要,从而增强了项目表示空间的可解释性,并促进了其对现有推荐系统的可扩展性。通过对三个基准数据集的实验,我们证明了 SCL 能够持续地提高具有统计学意义的最先进模型的性能。值得注意的是,我们的实验表明,在不同的基准测试中,SCL 提高了两个性能最好的模型的性能,P@10(精度)平均提高了8.2% 和9.5% ,MRR@10(平均倒数排名)平均提高了9.9% 和11.2% 。此外,我们的分析阐明了改进方面的对齐和一致性的表示,以及有效的 SCL 与低计算成本。|code|0| |Revealing the Hidden Impact of Top-N Metrics on Optimization in Recommender Systems|Lukas Wegmeth, Tobias Vente, Lennart Purucker||The hyperparameters of recommender systems for top-n predictions are typically optimized to enhance the predictive performance of algorithms. Thereby, the optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, like nDCG or Precision. In contrast, the optimized algorithm, internally optimizes a different loss function during training, like squared error or cross-entropy. To tackle this discrepancy, recent work focused on generating loss functions better suited for recommender systems. Yet, when evaluating an algorithm using a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric; ignoring that the top-n items are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for optimization-target metrics could be outside the top-n recommended items; hiddenly impacting the optimization performance. Therefore, we were motivated to analyze whether the top-n items are optimal for optimization-target top-n metrics. In pursuit of an answer, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We extensively evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top 43 of selection strategies is not significantly different. 
We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions.|为了提高算法的预测性能,对推荐系统的超参数进行了典型的优化。因此,优化算法,例如网格搜索或随机搜索,根据优化目标度量(如 nDCG 或 Precision)搜索最佳超参数配置。相比之下,优化后的算法,在训练期间内部优化了不同的损失函数,如平方误差或交叉熵。为了解决这个差异,最近的工作集中在产生更适合推荐系统的损失函数。然而,当在优化过程中使用 top-n 度量对算法进行评估时,优化目标度量与训练损失之间的另一个差异被忽略了。在优化过程中,选择 top-n 项目来计算 top-n 度量; 忽略 top-n 项目是从使用完全不同的损失函数训练的模型的建议中选择的。适合于优化的项目推荐——目标指标可能不在推荐项目的前列; 这会对优化性能产生隐性影响。因此,我们被激励去分析是否前 n 个项目对于优化目标的前 n 个度量是最佳的。在寻找答案的过程中,我们除了选择前 n 个选择策略外,还对250个选择策略的预测性能进行了详尽的评估。我们使用十二个隐式反馈和8个显式反馈数据集和十一个推荐系统算法对每个选择策略进行了广泛的评估。我们的研究结果表明,除了 top-n 之外,还存在其他的选择策略可以提高各种算法和推荐域的预测性能。然而,前43名选择策略的表现并没有显著差异。我们讨论了我们的研究结果对优化和重新排序的推荐系统和可行的解决方案的影响。|code|0| |TWOLAR: A TWO-Step LLM-Augmented Distillation Method for Passage Reranking|Davide Baldelli, Junfeng Jiang, Akiko Aizawa, Paolo Torroni||In this paper, we present TWOLAR: a two-stage pipeline for passage reranking based on the distillation of knowledge from Large Language Models (LLM). TWOLAR introduces a new scoring strategy and a distillation process consisting in the creation of a novel and diverse training dataset. The dataset consists of 20K queries, each associated with a set of documents retrieved via four distinct retrieval methods to ensure diversity, and then reranked by exploiting the zero-shot reranking capabilities of an LLM. Our ablation studies demonstrate the contribution of each new component we introduced. Our experimental results show that TWOLAR significantly enhances the document reranking ability of the underlying model, matching and in some cases even outperforming state-of-the-art models with three orders of magnitude more parameters on the TREC-DL test sets and the zero-shot evaluation benchmark BEIR. To facilitate future work we release our data set, finetuned models, and code.|在本文中,我们提出了 TWOLAR: 一个基于从大语言模型(LLM)中提取知识的两阶段通道重新排序流水线。TWOLAR 引入了一个新的评分策略和一个精馏过程,包括创建一个新的和多样化的训练数据集。该数据集由20K 个查询组成,每个查询与一组文档相关联,这些文档通过四种不同的检索方法检索以确保多样性,然后通过利用 LLM 的零拍重新排序功能进行重新排序。我们的消融研究证明了我们引入的每个新组件的贡献。我们的实验结果显示,TWOLAR 显著提高了基础模型的文档重新排序能力,在 TREC-dL 测试集和零拍评估基准 BEIR 上,通过三个以上的参数,匹配甚至在某些情况下超越了最先进的模型,从而提高了文档重新排序的数量级。为了方便未来的工作,我们发布了我们的数据集、微调模型和代码。|code|0| |Estimating Query Performance Through Rich Contextualized Query Representations|Sajad Ebrahimi, Maryam Khodabakhsh, Negar Arabzadeh, Ebrahim Bagheri|Toronto Metropolitan Univ, Toronto, ON, Canada; Univ Waterloo, Waterloo, ON, Canada; Univ Guelph, Guelph, ON, Canada; Shahrood Univ Technol, Shahrood, Iran|The state-of-the-art query performance prediction methods rely on the fine-tuning of contextual language models to estimate retrieval effectiveness on a per-query basis. Our work in this paper builds on this strong foundation and proposes to learn rich query representations by learning the interactions between the query and two important contextual information, namely (1) the set of documents retrieved by that query, and (2) the set of similar historical queries with known retrieval effectiveness. We propose that such contextualized query representations can be more accurate estimators of query performance as they embed the performance of past similar queries and the semantics of the documents retrieved by the query. We perform extensive experiments on the MSMARCO collection and its accompanying query sets including MSMARCO Dev set and TREC Deep Learning tracks of 2019, 2020, 2021, and DL-Hard. 
Our experiments reveal that our proposed method shows robust and effective performance compared to state-of-the-art baselines.|当前最先进的查询性能预测方法依赖于对上下文语言模型进行微调,以逐条查询为基础估计检索效果。本文的研究工作在这一坚实基础之上,进一步提出通过学习查询与两种重要上下文信息之间的交互来学习丰富的查询表示,这两种上下文信息分别是:(1) 由该查询检索到的文档集,以及 (2) 已知检索效果的相似历史查询集。我们认为,这种上下文化的查询表示可以作为更准确的查询性能估计器,因为它们嵌入了过去相似查询的性能以及由查询检索到的文档的语义信息。我们在MSMARCO数据集及其伴随的查询集上进行了广泛的实验,这些查询集包括MSMARCO开发集以及2019、2020、2021年的TREC深度学习赛道和DL-Hard数据集。实验结果表明,与最先进的基线方法相比,我们提出的方法展现了稳健且高效的性能。|code|0| |Performance Comparison of Session-Based Recommendation Algorithms Based on GNNs|Faisal Shehzad, Dietmar Jannach||In session-based recommendation settings, a recommender system has to base its suggestions on the user interactions that are ob served in an ongoing session. Since such sessions can consist of only a small set of interactions, various approaches based on Graph Neural Networks (GNN) were recently proposed, as they allow us to integrate various types of side information about the items in a natural way. Unfortunately, a variety of evaluation settings are used in the literature, e.g., in terms of protocols, metrics and baselines, making it difficult to assess what represents the state of the art. In this work, we present the results of an evaluation of eight recent GNN-based approaches that were published in high-quality outlets. For a fair comparison, all models are systematically tuned and tested under identical conditions using three common datasets. We furthermore include k-nearest-neighbor and sequential rules-based models as baselines, as such models have previously exhibited competitive performance results for similar settings. To our surprise, the evaluation showed that the simple models outperform all recent GNN models in terms of the Mean Reciprocal Rank, which we used as an optimization criterion, and were only outperformed in three cases in terms of the Hit Rate. Additional analyses furthermore reveal that several other factors that are often not deeply discussed in papers, e.g., random seeds, can markedly impact the performance of GNN-based models. Our results therefore (a) point to continuing issues in the community in terms of research methodology and (b) indicate that there is ample room for improvement in session-based recommendation.|在基于会话的推荐设置中,推荐系统必须根据当前会话中的用户交互情况提出建议。由于这样的会议可以只包括一小组交互,最近提出了各种基于图神经网络(GNN)的方法,因为它们允许我们以一种自然的方式整合关于项目的各种类型的副信息。不幸的是,文献中使用了各种各样的评估设置,例如,在协议、指标和基线方面,这使得评估什么代表了最先进的技术变得困难。在这项工作中,我们介绍了最近在高质量网点发表的八种基于 GNN 的方法的评价结果。为了进行公平的比较,使用三个共同的数据集,在相同的条件下系统地调整和测试所有模型。我们进一步包括 k 最近邻和顺序规则为基线的模型,因为这样的模型已经表现出竞争性能结果在类似的设置。令我们惊讶的是,评估显示,简单的模型在平均倒数排名方面表现优于所有最近的 GNN 模型,我们将其作为优化标准,在命中率方面只有三种情况表现优于 GNN 模型。进一步的分析表明,论文中通常不深入讨论的其他几个因素,例如随机种子,可以显著影响基于 GNN 的模型的性能。因此,我们的研究结果(a)指出了社区在研究方法方面仍然存在的问题,(b)表明在基于会话的推荐方面还有很大的改进空间。|code|0| |Weighted AUReC: Handling Skew in Shard Map Quality Estimation for Selective Search|Gijs Hendriksen, Djoerd Hiemstra, Arjen P. de Vries|Radboud Univ Nijmegen, Nijmegen, Netherlands|In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC measure was introduced. AUReC makes the assumption that shards are of similar sizes, one that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. 
To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure, and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but captures better the differences in performance when shard sizes are skewed.|在选择性搜索中,文档集合被划分为多个主题索引分片。为了有效估计分片映射的主题一致性(或质量),引入了AUReC度量。AUReC假设分片大小相似,然而在实际应用中,这一假设往往不成立,即使对于无监督方法也是如此。如果使用具有倾斜类别分布的有监督标注方法,这一问题可能会进一步加剧。为了估计这种不平衡分片映射的质量,我们引入了AUReC度量的加权适应版本,并使用ClueWeb09B和Gov2数据集进行了实证评估。我们表明,当分片大小相似时,该度量与原始AUReC的评估结果高度一致,但在分片大小倾斜时,它能更好地捕捉性能差异。|code|0| |Measuring Item Fairness in Next Basket Recommendation: A Reproducibility Study|Yuanna Liu, Ming Li, Mozhdeh Ariannezhad, Masoud Mansoury, Mohammad Aliannejadi, Maarten de Rijke|AIRLab, Amsterdam, Netherlands; Univ Amsterdam, Amsterdam, Netherlands; Booking com, Amsterdam, Netherlands|Item fairness of recommender systems aims to evaluate whether items receive a fair share of exposure according to different definitions of fairness. Raj and Ekstrand [26] study multiple fairness metrics under a common evaluation framework and test their sensitivity with respect to various configurations. They find that fairness metrics show varying degrees of sensitivity towards position weighting models and parameter settings under different information access systems. Although their study considers various domains and datasets, their findings do not necessarily generalize to next basket recommendation (NBR) where users exhibit a more repeat-oriented behavior compared to other recommendation domains. This paper investigates fairness metrics in the NBR domain under a unified experimental setup. Specifically, we directly evaluate the item fairness of various NBR methods. These fairness metrics rank NBR methods in different orders, while most of the metrics agree that repeat-biased methods are fairer than explore-biased ones. Furthermore, we study the effect of unique characteristics of the NBR task on the sensitivity of the metrics, including the basket size, position weighting models, and user repeat behavior. Unlike the findings in [26], Inequity of Amortized Attention (IAA) is the most sensitive metric, as observed in multiple experiments. Our experiments lead to novel findings in the field of NBR and fairness. We find that Expected Exposure Loss (EEL) and Expected Exposure Disparity (EED) are the most robust and adaptable fairness metrics to be used in the NBR domain.|推荐系统的项目公平性旨在评估项目是否根据不同的公平定义获得公平的曝光机会。Raj和Ekstrand[26]在一个共同的评估框架下研究了多种公平性指标,并测试了它们对各种配置的敏感性。他们发现,公平性指标在不同信息访问系统下对位置加权模型和参数设置表现出不同程度的敏感性。尽管他们的研究考虑了多个领域和数据集,但其发现并不一定适用于下一篮推荐(NBR)领域,因为与其他推荐领域相比,用户在NBR中表现出更多的重复导向行为。本文在统一的实验设置下研究了NBR领域中的公平性指标。具体而言,我们直接评估了各种NBR方法的项目公平性。这些公平性指标对NBR方法进行了不同顺序的排名,而大多数指标一致认为偏向重复的方法比偏向探索的方法更公平。此外,我们研究了NBR任务的独特特征对指标敏感性的影响,包括篮子大小、位置加权模型和用户重复行为。与[26]中的发现不同,在多个实验中观察到,摊销注意力不平等(IAA)是最敏感的指标。我们的实验在NBR和公平性领域得出了新的发现。我们发现,预期曝光损失(EEL)和预期曝光差异(EED)是在NBR领域中使用的最稳健和适应性最强的公平性指标。|code|0| |Is Interpretable Machine Learning Effective at Feature Selection for Neural Learning-to-Rank?|Lijun Lyu, Nirmal Roy, Harrie Oosterhuis, Avishek Anand|Delft Univ Technol, Delft, Netherlands; Radboud Univ Nijmegen, Nijmegen, Netherlands|Neural ranking models have become increasingly popular for real-world searchand recommendation systems in recent years. Unlike their tree-basedcounterparts, neural models are much less interpretable. 
That is, it is verydifficult to understand their inner workings and answer questions like how dothey make their ranking decisions? or what document features do they findimportant? This is particularly disadvantageous since interpretability ishighly important for real-world systems. In this work, we explore featureselection for neural learning-to-rank (LTR). In particular, we investigate sixwidely-used methods from the field of interpretable machine learning (ML) andintroduce our own modification, to select the input features that are mostimportant to the ranking behavior. To understand whether these methods areuseful for practitioners, we further study whether they contribute toefficiency enhancement. Our experimental results reveal a large featureredundancy in several LTR benchmarks: the local selection method TabNet canachieve optimal ranking performance with less than 10 features; the globalmethods, particularly our G-L2X, require slightly more selected features, butexhibit higher potential in improving efficiency. We hope that our analysis ofthese feature selection methods will bring the fields of interpretable ML andLTR closer together.|近年来,神经排序模型在现实世界的搜索和推荐系统中变得越来越流行。与基于树的模型不同,神经模型的解释性要差得多。也就是说,很难理解它们的内在机制并回答诸如“它们是如何做出排序决策的?”或“它们认为哪些文档特征重要?”这样的问题。这一点尤其不利,因为解释性对于现实世界的系统至关重要。在这项工作中,我们探索了神经排序学习(LTR)中的特征选择。特别是,我们研究了可解释机器学习(ML)领域中六种广泛使用的方法,并引入了我们自己的改进方法,以选择对排序行为最重要的输入特征。为了了解这些方法是否对实践者有用,我们进一步研究了它们是否有助于提高效率。我们的实验结果表明,在几个LTR基准测试中存在大量的特征冗余:局部选择方法TabNet可以在不到10个特征的情况下实现最佳的排序性能;全局方法,特别是我们的G-L2X,需要稍多的选择特征,但在提高效率方面表现出更高的潜力。我们希望我们对这些特征选择方法的分析能够将可解释ML和LTR领域更紧密地结合在一起。|code|0| |The Impact of Differential Privacy on Recommendation Accuracy and Popularity Bias|Peter Müllner, Elisabeth Lex, Markus Schedl, Dominik Kowald||Collaborative filtering-based recommender systems leverage vast amounts of behavioral user data, which poses severe privacy risks. Thus, often, random noise is added to the data to ensure Differential Privacy (DP). However, to date, it is not well understood, in which ways this impacts personalized recommendations. In this work, we study how DP impacts recommendation accuracy and popularity bias, when applied to the training data of state-of-the-art recommendation models. Our findings are three-fold: First, we find that nearly all users' recommendations change when DP is applied. Second, recommendation accuracy drops substantially while recommended item popularity experiences a sharp increase, suggesting that popularity bias worsens. Third, we find that DP exacerbates popularity bias more severely for users who prefer unpopular items than for users that prefer popular items.|基于协同过滤的推荐系统利用了大量的行为用户数据,这带来了严重的隐私风险。因此,随机噪音往往被添加到数据中,以确保差分隐私(DP)。然而,到目前为止,人们还没有很好地理解这对个性化推荐的影响。在本研究中,我们研究了当应用于最先进的推荐模型的训练数据时,DP 如何影响推荐的准确性和受欢迎程度偏差。我们的发现有三个方面: 首先,我们发现几乎所有用户的建议在应用 DP 时都会发生变化。其次,推荐的准确性大幅下降,而推荐项目的流行经历了急剧增加,这表明流行偏差恶化。第三,我们发现对于喜欢不受欢迎项目的用户而言,DP 加剧流行偏见的程度要比喜欢受欢迎项目的用户严重得多。|code|0| |How to Forget Clients in Federated Online Learning to Rank?|Shuyi Wang, Bing Liu, Guido Zuccon||Data protection legislation like the European Union's General Data Protection Regulation (GDPR) establishes the right to be forgotten: a user (client) can request contributions made using their data to be removed from learned models. In this paper, we study how to remove the contributions made by a client participating in a Federated Online Learning to Rank (FOLTR) system. In a FOLTR system, a ranker is learned by aggregating local updates to the global ranking model. 
Local updates are learned in an online manner at a client-level using queries and implicit interactions that have occurred within that specific client. By doing so, each client's local data is not shared with other clients or with a centralised search service, while at the same time clients can benefit from an effective global ranking model learned from contributions of each client in the federation. In this paper, we study an effective and efficient unlearning method that can remove a client's contribution without compromising the overall ranker effectiveness and without needing to retrain the global ranker from scratch. A key challenge is how to measure whether the model has unlearned the contributions from the client c^* that has requested removal. For this, we instruct c^* to perform a poisoning attack (add noise to this client updates) and then we measure whether the impact of the attack is lessened when the unlearning process has taken place. Through experiments on four datasets, we demonstrate the effectiveness and efficiency of the unlearning strategy under different combinations of parameter settings.|数据保护立法,如欧盟的一般数据保护条例(GDPR)规定了被遗忘的权利: 用户(客户)可以要求使用他们的数据作出贡献,从学习的模型中删除。在本文中,我们研究了如何删除参与联邦在线学习排名(FOLTR)系统的客户所做的贡献。在 FOLTR 系统中,通过将本地更新聚合到全局排名模型中来学习排名器。使用在特定客户端中发生的查询和隐式交互,在客户端级别以联机方式学习本地更新。通过这样做,每个客户的本地数据不会与其他客户共享,也不会与中央搜索服务共享,同时客户可以从联合会中每个客户贡献的有效全球排名模型中受益。在本文中,我们研究了一个有效和高效的去除方法,可以消除客户的贡献,而不损害整体排名有效性,不需要从头再培训全球排名。一个关键的挑战是如何衡量模型是否已经从请求删除的客户机 c ^ * 那里忘记了贡献。为此,我们指示 c ^ * 执行中毒攻击(为客户端更新添加噪声) ,然后在发生忘记过程时测量攻击的影响是否减轻。通过对四个数据集的实验,验证了在不同的参数设置组合下,忘却策略的有效性和效率。|code|0| |InDi: Informative and Diverse Sampling for Dense Retrieval|Nachshon Cohen, Hedda Cohen Indelman, Yaron Fairstein, Guy Kushilevitz|Technion, Haifa, Israel; Amazon, Haifa, Israel|Negative sample selection has been shown to have a crucial effect on the training procedure of dense retrieval systems. Nevertheless, most existing negative selection methods end by randomly choosing from some pool of samples. This calls for a better sampling solution. We define desired requirements for negative sample selection; the samples chosen should be informative, to advance the learning process, and diverse, to help the model generalize. We compose a sampling method designed to meet these requirements, and show that using our sampling method to enhance the training procedure of a recent significant dense retrieval solution (coCondenser) improves the obtained model's performance. Specifically, we see a similar to 2% improvement in MRR@10 on the MS MARCO dataset (from 38.2 to 38.8) and a similar to 1.5% improvement in Recall@5 on the Natural Questions dataset (from 71% to 72.1%), both statistically significant. Our solution, as opposed to other methods, does not require training or inferencing a large model, and adds only a small overhead (similar to 1% added time) to the training procedure. 
Finally, we report ablation studies showing that the objectives defined are indeed important when selecting negative samples for dense retrieval.|负样本选择已被证明对密集检索系统的训练过程具有至关重要的影响。然而,大多数现有的负样本选择方法最终都是从某个样本池中随机选择。这促使我们需要一种更好的采样解决方案。我们定义了负样本选择的理想要求:所选样本应具有信息量,以推进学习过程,并且应具有多样性,以帮助模型泛化。我们设计了一种采样方法,旨在满足这些要求,并展示了使用我们的采样方法来增强最近一种重要的密集检索解决方案(coCondenser)的训练过程,从而提高了所获得模型的性能。具体而言,我们在MS MARCO数据集上观察到MRR@10提高了约2%(从38.2提高到38.8),在Natural Questions数据集上观察到Recall@5提高了约1.5%(从71%提高到72.1%),两者均具有统计学意义。与其他方法不同,我们的解决方案不需要训练或推断大型模型,并且仅增加了训练过程的少量开销(约增加1%的时间)。最后,我们报告了消融研究,表明在为密集检索选择负样本时,定义的目标确实非常重要。|code|0| |Learning-to-Rank with Nested Feedback|Hitesh Sagtani, Olivier Jeunen, Aleksei Ustimenko||Many platforms on the web present ranked lists of content to users, typically optimized for engagement-, satisfaction- or retention- driven metrics. Advances in the Learning-to-Rank (LTR) research literature have enabled rapid growth in this application area. Several popular interfaces now include nested lists, where users can enter a 2nd-level feed via any given 1st-level item. Naturally, this has implications for evaluation metrics, objective functions, and the ranking policies we wish to learn. We propose a theoretically grounded method to incorporate 2nd-level feedback into any 1st-level ranking model. Online experiments on a large-scale recommendation system confirm our theoretical findings.|网络上的许多平台对用户的内容列表进行排序,通常针对参与度、满意度或保留驱动的指标进行优化。学习到等级(LTR)研究文献的进步使得这一应用领域的快速增长成为可能。一些流行的界面现在包括嵌套列表,用户可以通过任何给定的第一级项目输入第二级提要。当然,这对评估指标、目标函数和我们希望学习的排名策略都有影响。我们提出了一个理论基础的方法,将二级反馈纳入任何一级排名模型。在一个大规模推荐系统上的在线实验证实了我们的理论发现。|code|0| |Simple Domain Adaptation for Sparse Retrievers|Mathias Vast, Yuxuan Zong, Benjamin Piwowarski, Laure Soulier||In Information Retrieval, and more generally in Natural Language Processing, adapting models to specific domains is conducted through fine-tuning. Despite the successes achieved by this method and its versatility, the need for human-curated and labeled data makes it impractical to transfer to new tasks, domains, and/or languages when training data doesn't exist. Using the model without training (zero-shot) is another option that however suffers an effectiveness cost, especially in the case of first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address this issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. Despite their relatively good generalization ability, we show that even sparse retrievers can benefit from our simple domain adaptation method.|在自然语言处理信息检索,以及更广泛的自然语言处理领域,通过微调来调整模型以适应特定的领域。尽管这种方法取得了成功,而且通用性强,但是对人工管理和标记数据的需求使得在培训数据不存在的情况下将数据转移到新的任务、领域和/或语言是不切实际的。不经训练就使用该模型(零射击)是另一种选择,但是这种方法会带来有效性损失,特别是对于第一阶段的检索器。为了解决这些问题,出现了许多研究方向,其中大多数是在适应一项任务或一种语言的背景下。然而,文献对领域(或主题)的适应性较少。在本文中,我们解决这个问题的跨主题差异的稀疏第一阶段的检索,移位的方法最初设计的语言适应。通过利用对目标数据的预训练来学习特定领域的知识,该技术减轻了对带注释数据的需求,并扩大了领域适应的范围。尽管它们具有相对较好的泛化能力,但是我们表明即使是稀疏的检索器也可以从我们简单的领域自适应方法中受益。|code|0| |Selma: A Semantic Local Code Search Platform|Anja Reusch, Guilherme C. 
Lopes, Wilhelm Pertsch, Hannes Ueck, Julius Gonsior, Wolfgang Lehner|Tech Univ Dresden, Dresden Database Syst Grp, Dresden, Germany|Searching for the right code snippet is cumbersome and not a trivial task. Online platforms such as Github.com or searchcode.com provide tools to search, but they are limited to publicly available and internet-hosted code. However, during the development of research prototypes or confidential tools, it is preferable to store source code locally. Consequently, the use of external code search tools becomes impractical. Here, we present Selma (Code and Videos: https://anreu.github.io/selma ): a local code search platform that enables term-based and semantic retrieval of source code. Selma searches code and comments, annotates undocumented code to enable term-based search in natural language, and trains neural models for code retrieval.|寻找合适的代码片段是一项繁琐且非易事。在线平台如Github.com或searchcode.com提供了搜索工具,但这些工具仅限于搜索公开且托管在互联网上的代码。然而,在研究原型或机密工具的开发过程中,更倾向于将源代码存储在本地。因此,使用外部代码搜索工具变得不切实际。在此,我们介绍Selma(代码和视频:https://anreu.github.io/selma):一个本地代码搜索平台,支持基于术语和语义的源代码检索。Selma能够搜索代码和注释,对未记录代码进行注释以支持自然语言的基于术语的搜索,并训练用于代码检索的神经网络模型。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Selma:+A+Semantic+Local+Code+Search+Platform)|0| |FAR-AI: A Modular Platform for Investment Recommendation in the Financial Domain|Javier SanzCruzado, Edward Richards, Richard McCreadie|Univ Glasgow, Glasgow, Lanark, Scotland|Financial asset recommendation (FAR) is an emerging sub-domain of the wider recommendation field that is concerned with recommending suitable financial assets to customers, with the expectation that those customers will invest capital into a subset of those assets. FAR is a particularly interesting sub-domain to explore, as unlike traditional movie or product recommendation, FAR solutions need to analyse and learn from a combination of time-series pricing data, company fundamentals, social signals and world events, relating the patterns observed to multi-faceted customer representations comprising profiling information, expectations and past investments. In this demo we will present a modular FAR platform; referred to as FAR-AI, with the goal of raising awareness and building a community around this emerging domain, as well as illustrate the challenges, design considerations and new research directions that FAR offers. The demo will comprise two components: 1) we will present the architecture of FAR-AI to attendees, to enable them to understand the how's and the why's of developing a FAR system; and 2) a live demonstration of FAR-AI as a customer-facing product, highlighting the differences in functionality between FAR solutions and traditional recommendation scenarios. The demo is supplemented by online-tutorial materials, to enable attendees new to this space to get practical experience with training FAR models. VIDEO URL.|金融资产推荐(Financial Asset Recommendation, FAR)是更广泛的推荐领域中的一个新兴子领域,其核心任务是为客户推荐合适的金融资产,并期望这些客户将资金投资于这些资产的一部分。FAR是一个特别值得探索的子领域,因为与传统的电影或商品推荐不同,FAR解决方案需要分析和学习时间序列定价数据、公司基本面、社交信号和全球事件的组合,并将观察到的模式与包含客户画像信息、预期和过去投资的多维度客户表征相关联。在本次演示中,我们将展示一个模块化的FAR平台,称为FAR-AI,旨在提高对这一新兴领域的认识并围绕其构建社区,同时展示FAR所面临的挑战、设计考虑因素以及新的研究方向。演示将包括两个部分:1)我们将向与会者介绍FAR-AI的架构,帮助他们理解开发FAR系统的“如何”与“为何”;2)FAR-AI作为面向客户的产品的现场演示,突出FAR解决方案与传统推荐场景在功能上的差异。演示还辅以在线教程材料,以便让初次接触该领域的与会者获得训练FAR模型的实践经验。视频链接:[VIDEO URL]。|code|0| |Semantic Content Search on IKEA.com|Mateusz Slominski, Ezgi Yildirim, Martin Tegner|Ingka Grp, IKEA Retail, Leiden, Netherlands|In this paper, we present an approach to content search. 
The aim is to increase customer engagement with content recommendations on IKEA.com. As an alternative to Boolean search, we introduce a method based on semantic textual similarity between content pages and search queries. Our approach improves the relevance of search results by a 2.95% increase in click-through rate in an online A/B test.|本文提出了一种内容搜索方法,旨在通过内容推荐提升用户在宜家官网(IKEA.com)上的参与度。作为布尔搜索的替代方案,我们引入了一种基于内容页面与搜索查询之间语义文本相似度的方法。通过在线A/B测试,我们的方法使搜索结果的点击率提高了2.95%,从而显著提升了搜索结果的相关性。|code|0| |Semantic Search in Archive Collections Through Interpretable and Adaptable Relation Extraction About Person and Places|Nicolas Gutehrlé|Univ Franche Comte, CRIT, F-25000 Besancon, France|In recent years, libraries and archives have undertaken numerous campaigns to digitise their collections. While these campaigns have increased ease of access to archival documents for a wider audience, ensuring discoverability and promoting their content remain significant challenges. Digitised documents are often unstructured, making them difficult to navigate. Accessing archive materials through search engines restricts users to keyword-based queries, leading to being overwhelmed by irrelevant documents. To enhance the exploration and exploitation of the "Big Data of the Past" [15], it is imperative to structure textual content.|近年来,图书馆和档案馆开展了大量的数字化活动,将其收藏品数字化。尽管这些活动使得更广泛的受众能够更方便地访问档案文件,但确保其可发现性并推广其内容仍然是重大挑战。数字化文件通常是非结构化的,这使得它们难以浏览。通过搜索引擎访问档案材料限制了用户只能进行基于关键词的查询,从而导致用户被大量不相关的文件所淹没。为了增强对"过去的大数据"[15]的探索和利用,对文本内容进行结构化处理是至关重要的。|code|0| |Reproduction and Simulation of Interactive Retrieval Experiments|Jana Isabelle Friese|Univ Duisburg Essen, Duisburg, Germany|The reproducibility crisis, spanning across various scientific fields, substantially affects information retrieval research [1].|跨越多学科领域的可重复性危机对信息检索研究产生了重大影响[1]。|code|0| |Efficient Multi-vector Dense Retrieval with Bit Vectors|Franco Maria Nardini, Cosimo Rulli, Rossano Venturini|CNR, ISTI, Pisa, Italy; Univ Pisa, Pisa, Italy|Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations compute the relevance of a passage w.r.t. a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes “Efficient Multi-Vector dense retrieval with Bit vectors” (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step.
Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.|密集检索技术利用预训练的大型语言模型来构建查询和段落的高维表示。这些表示通过高效的相似度度量来计算段落与查询的相关性。在这一领域,多向量表示通过在每词元级别上编码查询和文档,提高了检索效果,但代价是内存占用和查询延迟增加了一个数量级。最近,PLAID通过引入基于质心的词项表示来解决这些问题,从而减少了多向量系统的内存影响。通过利用质心交互机制,PLAID过滤掉了不相关的文档,从而降低了后续排序阶段的成本。本文提出了“基于位向量的高效多向量密集检索”(EMVB),这是一种用于多向量密集检索中高效查询处理的新框架。首先,EMVB使用优化的位向量对段落进行高效的预过滤。其次,质心交互的计算按列进行,利用SIMD指令,从而减少了延迟。第三,EMVB利用乘积量化(PQ)来减少存储向量表示的内存占用,同时允许快速的后期交互。第四,我们引入了一种每文档词项过滤方法,进一步提高了最后一步的效率。在MS MARCO和LoTTE上的实验表明,与PLAID相比,EMVB的速度提高了2.8倍,同时减少了1.8倍的内存占用,且检索精度没有损失。|code|0| |Prompt-Based Generative News Recommendation (PGNR): Accuracy and Controllability|Xinyi Li, Yongfeng Zhang, Edward C. Malthouse|Rutgers State Univ, Piscataway, NJ USA; Northwestern Univ, Evanston, IL 60208 USA|Online news platforms often use personalized news recommendation methods to help users discover articles that align with their interests. These methods typically predict a matching score between a user and a candidate article to reflect the user's preference for the article. Given that articles contain rich textual information, current news recommendation systems (RS) leverage natural language processing (NLP) techniques, including the attention mechanism, to capture users' interests based on their historical behaviors and comprehend article content. However, these existing model architectures are usually task-specific and require redesign to adapt to additional features or new tasks. Motivated by the substantial progress in pre-trained large language models for semantic understanding and prompt learning, which involves guiding output generation using pre-trained language models, this paper proposes Prompt-based Generative News Recommendation (PGNR). This approach treats personalized news recommendation as a text-to-text generation task and designs personalized prompts to adapt to the pre-trained language model, taking the generative training and inference paradigm that directly generates the answer for recommendation. Experimental studies using the Microsoft News dataset show that PGNR is capable of making accurate recommendations by taking into account various lengths of past behaviors of different users. It can also easily integrate new features without changing the model architecture and the training loss function. Additionally, PGNR can make recommendations based on users' specific requirements, allowing more straightforward human-computer interaction for news recommendation.|在线新闻平台通常采用个性化新闻推荐方法,以帮助用户发现与其兴趣相符的文章。这些方法通常预测用户与候选文章之间的匹配分数,以反映用户对文章的偏好。鉴于文章包含丰富的文本信息,当前的新闻推荐系统(RS)利用自然语言处理(NLP)技术,包括注意力机制,基于用户的历史行为捕捉用户兴趣并理解文章内容。然而,这些现有的模型架构通常是任务特定的,需要重新设计以适应额外的特征或新任务。受预训练大语言模型在语义理解和提示学习(即使用预训练语言模型引导输出生成)方面取得的显著进展的启发,本文提出了基于提示的生成式新闻推荐(Prompt-based Generative News Recommendation, PGNR)。该方法将个性化新闻推荐视为文本到文本的生成任务,并设计个性化提示以适应预训练语言模型,采用生成式训练和推理范式,直接生成推荐答案。使用微软新闻数据集进行的实验研究表明,PGNR能够通过考虑不同用户的各种历史行为长度来做出准确的推荐。它还可以轻松集成新特征,而无需改变模型架构和训练损失函数。此外,PGNR能够根据用户的特定需求进行推荐,从而实现更直接的人机交互以进行新闻推荐。|code|0| |CaseGNN: Graph Neural Networks for Legal Case Retrieval with Text-Attributed Graphs|Yanran Tang, Ruihong Qiu, Yilun Liu, Xue Li, Zi Huang||Legal case retrieval is an information retrieval task in the legal domain, which aims to retrieve relevant cases with a given query case. Recent research on legal case retrieval mainly relies on traditional bag-of-words models and language models.
Although these methods have achieved significant improvement in retrieval accuracy, there are still two challenges: (1) Legal structural information neglect. Previous neural legal case retrieval models mostly encode the unstructured raw text of a case into a case representation, which causes the lack of important legal structural information in a case and leads to poor case representation; (2) Lengthy legal text limitation. When using the powerful BERT-based models, there is a limit on input text length, which inevitably requires shortening the input via truncation or division, with a loss of legal context information. In this paper, a graph neural networks-based legal case retrieval model, CaseGNN, is developed to tackle these challenges. To effectively utilise the legal structural information during encoding, a case is firstly converted into a Text-Attributed Case Graph (TACG), followed by a designed Edge Graph Attention Layer and a readout function to obtain the case graph representation. The CaseGNN model is optimised with a carefully designed contrastive loss with easy and hard negative sampling. Since the text attributes in the case graph come from individual sentences, the restriction of using language models is further avoided without losing the legal context. Extensive experiments have been conducted on two benchmarks from COLIEE 2022 and COLIEE 2023, which demonstrate that CaseGNN outperforms other state-of-the-art legal case retrieval methods. The code has been released on https://github.com/yanran-tang/CaseGNN.|法律案例检索是法律领域的一项信息检索工作,其目的是检索具有给定查询案例的相关案例。目前法律案例检索的研究主要依赖于传统的词袋模型和语言模型。虽然这些方法在检索精度方面取得了显著的进步,但仍然存在两个挑战: (1)法律结构信息的忽视。以往的神经网络法律案例检索模型大多将非结构化的原始案例文本编码为案例表示,导致案例缺乏重要的法律结构信息,案例表示效果不佳;(2)冗长的法律文本限制。在使用基于 BERT 的强大模型时,存在输入文本长度的限制,这就不可避免地要求通过截断或分割来缩短输入,同时丢失法律上下文信息。本文提出了一种基于图神经网络的法律案例检索模型 CaseGNN,以解决这些问题。为了在编码过程中有效地利用法律结构信息,首先将案例转换为文本属性案例图(TACG) ,然后设计边缘图注意层和读出功能,得到案例图表示。CaseGNN 模型通过精心设计的对比损失和简单和硬负采样进行优化。由于案例图中的文本属性来自于单个句子,因此在不失去法律上下文的前提下,进一步避免了语言模型的使用限制。对 COLIEE 2022和 COLIEE 2023的两个基准进行了广泛的实验,证明 CaseGNN 优于其他最先进的法律案例检索方法。代码已经在 https://github.com/yanran-tang/CaseGNN 上发布了。|code|0| |Context-Driven Interactive Query Simulations Based on Generative Large Language Models|Björn Engelmann, Timo Breuer, Jana Isabelle Friese, Philipp Schaer, Norbert Fuhr||Simulating user interactions enables a more user-oriented evaluation of information retrieval (IR) systems. While user simulations are cost-efficient and reproducible, many approaches often lack fidelity regarding real user behavior. Most notably, current user models neglect the user's context, which is the primary driver of perceived relevance and the interactions with the search results. To this end, this work introduces the simulation of context-driven query reformulations. The proposed query generation methods build upon recent Large Language Model (LLM) approaches and consider the user's context throughout the simulation of a search session. Compared to simple context-free query generation approaches, these methods show better effectiveness and allow the simulation of more efficient IR sessions. Similarly, our evaluations consider more interaction context than current session-based measures and reveal interesting complementary insights in addition to the established evaluation protocols.
We conclude with directions for future work and provide an entirely open experimental setup.|通过模拟用户交互,可以对信息检索系统进行更加面向用户的评估。虽然用户模拟具有成本效益和可重复性,但许多方法通常缺乏真实用户行为的保真度。最值得注意的是,当前的用户模型忽视了用户的上下文,而上下文是感知相关性和与搜索结果交互的主要驱动因素。为此,本文介绍了上下文驱动的查询重构的仿真。提出的查询生成方法建立在最新的大型语言模型(LLM)方法的基础上,并在搜索会话的整个仿真过程中考虑用户的上下文。与简单的上下文无关的查询生成方法相比,这些方法显示出更好的效率,并允许模拟更有效的 IR 会话。同样,我们的评价考虑了比目前基于会话的度量更多的互动背景,除了既定的评价方案之外,还揭示了有趣的互补见解。我们总结了未来工作的方向,并提供了一个完全开放的实验设置。|code|0| |Emotional Insights for Food Recommendations|Mehrdad Rostami, Ali Vardasbi, Mohammad Aliannejadi, Mourad Oussalah|Univ Amsterdam, Informat Retrieval Lab, Amsterdam, Netherlands; Univ Oulu, Ctr Machine Vis & Signal Anal, Oulu, Finland|Food recommendation systems have become pivotal in offering personalized suggestions, enabling users to discover recipes in line with their tastes. However, despite the existence of numerous such systems, there are still unresolved challenges. Much of the previous research predominantly relies on users' past preferences, neglecting the significant aspect of discerning users' emotional insights. Our framework aims to bridge this gap by pioneering emotion-aware food recommendation. The study strives for enhanced accuracy by delivering recommendations tailored to a broad spectrum of emotional and dietary behaviors. Uniquely, we introduce five novel scores for Influencer-Followers, Visual Motivation, Adventurous, Health and Niche to gauge a user's inclination toward specific emotional insights. Subsequently, these indices are used to re-rank the preliminary recommendation, placing a heightened focus on the user's emotional disposition. Experimental results on a real-world food social network dataset reveal that our system outperforms alternative emotion-unaware recommender systems, yielding an average performance boost of roughly 6%. Furthermore, the results reveal a rise of over 30% in accuracy metrics for some users exhibiting particular emotional insights.|食品推荐系统在提供个性化建议方面变得至关重要,使用户能够根据自己的口味发现食谱。然而,尽管存在许多这样的系统,仍有一些未解决的挑战。以往的研究大多主要依赖于用户过去的偏好,忽视了识别用户情感洞察的重要方面。我们的框架旨在通过开创情感感知的食品推荐来弥合这一差距。该研究通过提供针对广泛情感和饮食行为的推荐,力求提高准确性。我们独特地引入了五个新的评分指标:影响力-追随者、视觉动机、冒险性、健康和利基,以衡量用户对特定情感洞察的倾向。随后,这些指标被用于重新排序初步推荐,更加关注用户的情感倾向。在实际食品社交网络数据集上的实验结果表明,我们的系统优于其他不考虑情感的推荐系统,平均性能提升了约6%。此外,结果显示,对于一些表现出特定情感洞察的用户,准确率指标提升了超过30%。|code|0| |LaQuE: Enabling Entity Search at Scale|Negar Arabzadeh, Amin Bigdeli, Ebrahim Bagheri|Univ Waterloo, Waterloo, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Entity search plays a crucial role in various information access domains, where users seek information about specific entities. Despite significant research efforts to improve entity search methods, the lack of large-scale resources and extensible frameworks has been limiting progress. In this work, we present LaQuE (Large-scale Queries for Entity search), a curated framework for entity search, which includes a reproducible and extensible code base as well as a large relevance judgment collection consisting of real-user queries based on the ORCAS collection. LaQuE is industry-scale and suitable for training complex neural models for entity search. We develop methods for curating and judging entity collections, as well as training entity search methods based on LaQuE. We additionally establish strong baselines within LaQuE based on various retrievers, including traditional bag-of-words-based methods and neural-based models. We show that training neural entity search models on LaQuE enhances retrieval effectiveness compared to the state-of-the-art.
Additionally, we categorize the released queries in LaQuE based on their popularity and difficulty, encouraging research on more challenging queries for the entity search task. We publicly release LaQuE at https://github.com/Narabzad/LaQuE .|实体搜索在各种信息获取领域中扮演着至关重要的角色,用户通过它来寻找特定实体的信息。尽管已有大量研究致力于改进实体搜索方法,但大规模资源和可扩展框架的缺乏一直限制着这一领域的进展。在本研究中,我们提出了LaQuE(大规模实体搜索查询),这是一个精心策划的实体搜索框架,它包括一个可复现和可扩展的代码库,以及一个基于ORCAS集合的真实用户查询的大规模相关性判断集合。LaQuE具有工业规模,适合训练复杂的神经模型以进行实体搜索。我们开发了策划和判断实体集合的方法,以及基于LaQuE训练实体搜索方法的技术。此外,我们在LaQuE中建立了强大的基线,包括传统的基于词袋的方法和基于神经网络的模型。我们展示了在LaQuE上训练神经实体搜索模型相较于现有技术提高了检索效果。此外,我们根据查询的流行度和难度对LaQuE中发布的查询进行了分类,鼓励对更具挑战性的实体搜索任务查询进行研究。我们在https://github.com/Narabzad/LaQuE 上公开了LaQuE。|code|0| |Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models|Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen||Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding target words such as true. Since such possibilities have not yet been considered in retrieval evaluation, we analyze the impact of query-independent prompt injection via manually constructed templates and LLM-based rewriting of documents on several existing relevance models. Our experiments on the TREC Deep Learning track show that adversarial documents can easily manipulate different sequence-to-sequence relevance models, while BM25 (as a typical lexical model) is not affected. Remarkably, the attacks also affect encoder-only relevance models (which do not rely on natural language prompt tokens), albeit to a lesser extent.|现代的序列到序列相关模型,如 monoT5,可以通过交叉编码有效地捕获查询和文档之间复杂的文本交互。然而,在提示符中使用自然语言标记,比如 monoT5 的 Query、 Document 和 Relevant,为恶意文档打开了一个攻击向量,通过提示注入操纵它们的相关性得分,例如,通过添加目标词,比如 true。由于在检索评估中还没有考虑到这种可能性,我们通过手工构建模板和基于 LLM 的文档重写来分析与查询无关的提示注入对几种现有相关性模型的影响。我们在 TREC Deep Learning 进行的实验表明,对抗性文档可以轻易地操纵不同的序列到序列相关模型,而 BM25(作为一个典型的词汇模型)不受影响。值得注意的是,这些攻击还会影响仅编码器的相关性模型(不依赖于自然语言提示符) ,尽管影响程度较小。|code|0| |Two-Step SPLADE: Simple, Efficient and Effective Approximation of SPLADE|Carlos Lassance, Hervé Déjean, Stéphane Clinchant, Nicola Tonellotto||Learned sparse models such as SPLADE have successfully shown how to incorporate the benefits of state-of-the-art neural information retrieval models into the classical inverted index data structure. Despite their improvements in effectiveness, learned sparse models are not as efficient as classical sparse models such as BM25. The problem has been investigated and addressed by recently developed strategies, such as guided traversal query processing and static pruning, with different degrees of success on in-domain and out-of-domain datasets. In this work, we propose a new query processing strategy for SPLADE based on a two-step cascade. The first step uses a pruned and reweighted version of the SPLADE sparse vectors, and the second step uses the original SPLADE vectors to re-score a sample of documents retrieved in the first stage.
Our extensive experiments, performed on 30 different in-domain and out-of-domain datasets, show that our proposed strategy is able to improve mean and tail response times over the original single-stage SPLADE processing by up to 30× and 40×, respectively, for in-domain datasets, and by 12× to 25× for mean response time on out-of-domain datasets, while not incurring statistically significant differences in 60% of the datasets.|像 SPLADE 这样的稀疏学习模型已经成功地展示了如何将最先进的神经信息检索模型的优点融入到经典的倒排索引数据结构中。尽管学习稀疏模型的有效性有所提高,但其效率不如经典稀疏模型如 BM25。该问题已经通过最近开发的策略得到了研究和解决,如引导遍历查询处理和静态剪枝,在域内和域外数据集上取得了不同程度的成功。本文提出了一种新的基于两步级联的 SPLADE 查询处理策略。第一步使用 SPLADE 稀疏向量的修剪和重新加权版本,第二步使用原始 SPLADE 向量对在第一阶段检索到的文档样本进行重新评分。我们在30个不同的域内和域外数据集上进行的广泛实验表明,我们提出的策略能够将原始单阶段 SPLADE 处理的平均和尾部响应时间分别提高30倍和40倍,对于域内数据集,提高12倍至25倍,对于域外数据集的平均响应,同时在60% 的数据集中不引起统计学显着差异。|code|0| |Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers|Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke||Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely directly compared to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments. We empirically run experiments on how information retrieval relevance judgments can be utilized as an anchor to evaluating the generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.|大型语言模型现在可以直接生成许多实际问题的答案,而无需引用外部资源。遗憾的是,对于评价这些答案的质量和正确性、比较一个模型与另一个模型的表现或比较一个提示与另一个提示的方法,人们的关注相对较少。此外,生成的答案的质量很少直接比较检索的答案的质量。随着模型的发展和提示的修改,我们没有系统的方法来衡量改进而不诉诸昂贵的人类判断。为了解决这个问题,我们采用标准的检索基准来评估由大型语言模型生成的答案。受到用于总结的 BERTScore 度量的启发,我们探索了两种方法。首先,我们以基准相关性判断为基础进行评价。我们通过实验来研究信息检索相关性判断是如何被用来作为评估生成的答案的锚的。在第二个实验中,我们将生成的答案与不同检索模型(从传统方法到高级方法)检索到的最高结果进行比较,使我们能够在没有人为判断的情况下衡量改进情况。在这两种情况下,我们测量生成的答案的嵌入表示和检索基准中已知或假定的相关段落的嵌入表示之间的相似性。|code|0| |Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control|Thong Nguyen, Mariya Hendriksen, Andrew Yates, Maarten de Rijke||Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application in multimodal retrieval remains underexplored. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
We address issues of high dimension co-activation and semantic deviation through a new training algorithm, using Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models with a shorter training time and lower GPU memory requirements. Our approach offers an effective solution for training LSR retrieval models in multimodal settings. Our code and model checkpoints are available at github.com/thongnt99/lsr-multimodal|学习稀疏检索(LSR)是一类将查询和文档编码成稀疏词汇向量的神经元方法,可以通过反向索引有效地进行索引和检索。我们探讨了 LSR 在多模态领域的应用,重点研究了文本图像检索。虽然 LSR 在文本检索方面取得了成功,但它在多模态检索中的应用仍然有待探索。目前的方法如 LexLIP 和 STAIR 需要对大量数据集进行复杂的多步训练。我们提出的方法有效地将密集向量从一个冻结的密集模型转换成稀疏的词汇向量。通过一种新的训练算法,利用贝努利随机变量控制查询扩展,解决了高维共激活和语义偏差的问题。对两个密集模型(BLIP,ALBEF)和两个数据集(MSCOCO,Flickr30k)的实验表明,该算法有效地减少了协同激活和语义偏差。我们性能最好的稀疏模型优于最先进的文本图像 LSR 模型,具有更短的训练时间和更低的 GPU 内存需求。该方法为在多模态环境下训练 LSR 检索模型提供了一种有效的解决方案。我们的代码和模型检查点在 github.com/thongnt99/lsr-multimodal 都有|code|0| |Alleviating Confounding Effects with Contrastive Learning in Recommendation|Di You, Kyumin Lee|Worcester Polytech Inst, Worcester, MA 01609 USA|Recently, there has been a growing interest in mitigating the bias effects in recommendations using causal inference. However, Rubin's potential outcome framework may produce inaccurate estimates in real-world scenarios due to the presence of hidden confounders. In addition, existing works adopting the Pearl causal graph framework tend to focus on specific types of bias (e.g., selection bias, popularity bias, exposure bias) instead of directly mitigating the impact of hidden confounders. Motivated by the aforementioned limitations, in this paper, we formulate the recommendation task as a causal graph with unobserved/unmeasurable confounders. We present a novel causality-based architecture called Multi-behavior Debiased Contrastive Collaborative Filtering (MDCCL) and apply the front-door adjustment for intervention. We leverage a pre-like behavior such as clicking an item (i.e., a behavior occurred before the target behavior such as purchasing) to mitigate the bias effects. Additionally, we design a contrastive loss that also provides a debiasing effect benefiting the recommendation. An empirical study on three real-world datasets validates that our proposed method successfully outperforms nine state-of-the-art baselines. Code and the datasets will be available at https://github.com/queenjocey/MDCCL .|近年来,利用因果推理来减轻推荐系统中的偏见效应引起了越来越多的关注。然而,Rubin的潜在结果框架在实际场景中可能会由于存在隐藏的混杂因素而产生不准确的估计。此外,现有采用Pearl因果图框架的研究往往侧重于特定类型的偏见(例如选择偏差、流行度偏差、曝光偏差),而不是直接减轻隐藏混杂因素的影响。基于上述局限性,本文提出将推荐任务建模为包含未观测/未测量混杂因素的因果图。我们提出了一种新颖的基于因果关系的架构,称为多行为去偏对比协同过滤(MDCCL),并应用前门调整进行干预。我们利用诸如点击商品(即在目标行为如购买之前发生的行为)等预喜欢行为来减轻偏见效应。此外,我们设计了一种对比损失函数,该函数也提供了有助于推荐系统的去偏效果。在三个真实世界数据集上的实证研究表明,我们提出的方法成功超越了九种最先进的基线方法。代码和数据集将在https://github.com/queenjocey/MDCCL 上提供。|code|0| |Align MacridVAE: Multimodal Alignment for Disentangled Recommendations|Ignacio Avas, Liesbeth Allein, Katrien Laenen, MarieFrancine Moens|Katholieke Univ Leuven, Dept Comp Sci, Leuven, Belgium|Explaining why items are recommended to users is challenging, especially when these items are described by multimodal data. Most recommendation systems fail to leverage more than one modality, preferring textual or tabular data. 
In this work, a new model, Align MacridVAE, is proposed that considers the complementarity of visual and textual item descriptions for item recommendation. This model projects both modalities onto a shared latent space, and a dedicated loss function aligns the text and image of the same item. The aspects of the item are then jointly disentangled for both modalities at a macro level to learn interpretable categorical information about items and at a micro level to model user preferences on each of those categories. Experiments are conducted on six item recommendation datasets, and recommendation performance is compared against multiple baseline methods. The results demonstrate that our model increases recommendation accuracy by 18% in terms of NDCG on average in the studied datasets and allows us to visualise user preference by item aspect across modalities and the learned concept allocation (The code implementation is available at https://github.com/igui/Align-MacridVAE ).|解释为何向用户推荐某些项目是具有挑战性的,尤其是当这些项目由多模态数据描述时。大多数推荐系统未能利用超过一种模态,而是偏好文本或表格数据。在这项工作中,我们提出了一种新模型Align MacridVAE,该模型考虑了视觉和文本项目描述的互补性来进行项目推荐。该模型将两种模态投影到一个共享的潜在空间中,并通过专门的损失函数对齐同一项目的文本和图像。然后,在宏观层面上对项目的各个方面进行联合解缠,以学习关于项目的可解释的类别信息;在微观层面上,对用户在每个类别上的偏好进行建模。我们在六个项目推荐数据集上进行了实验,并将推荐性能与多种基线方法进行了比较。结果表明,我们的模型在研究的数据集上平均将推荐准确率提高了18%(以NDCG衡量),并允许我们通过跨模态的项目方面和学习的概念分配来可视化用户偏好(代码实现可在https://github.com/igui/Align-MacridVAE获取)。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Align+MacridVAE:+Multimodal+Alignment+for+Disentangled+Recommendations)|0| |Learning Action Embeddings for Off-Policy Evaluation|Matej Cief, Jacek Golebiowski, Philipp Schmidt, Ziawasch Abedjan, Artur Bekasov||Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. OPE is a viable alternative to running expensive online A/B tests: it can speed up the development of new policies, and reduces the risk of exposing customers to suboptimal treatments. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims (arXiv:2202.06317v2 [cs.LG]) propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates.
The proposed approach presents an appealing alternative to doubly robust (DR) estimation for combining the low variance of the direct method (DM) with the low bias of IPS.|非策略评估(OPE)方法允许我们通过使用不同策略收集的日志数据来计算策略的预期回报。相对于运行昂贵的在线 A/B 测试,OPE 是一种可行的替代方案: 它可以加快新政策的制定,并降低客户接触次优治疗的风险。然而,当操作的数量很大,或者某些操作被日志策略低估时,基于逆倾向评分(IPS)的现有估计量可能会有很高甚至无限的方差。Saito 和 Joachims (arXiv: 2202.06317 v2[ cs.LG ])提出使用动作嵌入的边缘化 IPS (MIPS) ,这减少了大动作空间中 IPS 的方差。MIPS 假设好的操作嵌入可以由从业人员定义,这在许多实际应用程序中是很难做到的。在这项工作中,我们探讨了从日志数据学习动作嵌入。特别地,我们使用训练过的奖励模型的中间输出来定义 MIPS 的行动嵌入。这种方法将 MIPS 扩展到更多的应用程序,并且在我们的实验中通过预定义的嵌入以及在合成和真实世界数据上的标准基线改进了 MIPS。我们的方法不对奖励模型类做假设,并支持使用额外的行动信息,以进一步改善估计。所提出的方法将 DM 的低方差与 IPS 的低偏差相结合,为双重稳健(DR)估计提供了一个有吸引力的替代方案。|code|0| |Simulated Task Oriented Dialogues for Developing Versatile Conversational Agents|Xi Wang, Procheta Sen, Ruizhe Li, Emine Yilmaz|Univ Liverpool, Liverpool, Merseyside, England; Univ Aberdeen, Aberdeen, Scotland; Univ Coll London, London, England|Task-Oriented Dialogue (TOD) Systems are increasingly important for managing a variety of daily tasks, yet often underperform in unfamiliar scenarios due to limitations in existing training datasets. This study addresses the challenge of generating robust and versatile TOD systems by transforming instructional task descriptions into natural user-system dialogues to serve as enhanced pre-training data. We explore three strategies for synthetic dialogue generation: crowdsourcing, encoder-decoder models, and in-context learning with large language models. The evaluation of these approaches, based on a comprehensive user study employing 10 different metrics, reveals the top quality of the dialogues generated by learning an encoder-decoder model as per human evaluation. Notably, employing this synthetic dialogue further improves the performance of advanced TOD models, especially in unfamiliar domains, with improvements spanning 5.5% to as much as 20.9% in combined evaluation scores. Our findings advocate for the use of specialised, task-oriented knowledge bases and step-wise dialogue generation techniques to advance the capabilities and generalizability of TOD systems.|面向任务的对话系统(Task-Oriented Dialogue, TOD)在管理各种日常任务中变得越来越重要,但在不熟悉的场景中往往表现不佳,原因是现有训练数据集的局限性。本研究通过将任务指令描述转化为自然的用户-系统对话,以作为增强的预训练数据,来解决生成鲁棒且多功能的TOD系统的挑战。我们探索了三种合成对话生成策略:众包、编码器-解码器模型以及利用大语言模型进行上下文学习。基于采用10种不同指标的综合用户研究对这些方法进行评估,结果表明,根据人类评估,学习编码器-解码器模型生成的对话质量最高。值得注意的是,使用这种合成对话进一步提升了先进TOD模型的性能,尤其是在不熟悉的领域中,综合评估分数提升了5.5%至20.9%。我们的研究结果支持使用专门的、面向任务的知识库和分步对话生成技术,以提升TOD系统的能力和泛化性。|code|0| |Hypergraphs with Attention on Reviews for Explainable Recommendation|Theis E. Jendal, TrungHoang Le, Hady W. Lauw, Matteo Lissandrini, Peter Dolog, Katja Hose|Singapore Management Univ, Singapore, Singapore; Aalborg Univ, Aalborg, Denmark|Given a recommender system based on reviews, the challenges are how to effectively represent the review data and how to explain the produced recommendations. We propose a novel review-specific Hypergraph (HG) model, and further introduce a model-agnostic explainability module. The HG model captures high-order connections between users, items, aspects, and opinions while maintaining information about the review. The explainability module can use the HG model to explain a prediction generated by any model. We propose a path-restricted review-selection method biased by the user preference for item reviews and propose a novel explanation method based on a review graph.
Experiments on real-world datasets confirm the ability of the HG model to capture appropriate explanations.|在一个基于评论的推荐系统中,面临的挑战是如何有效地表示评论数据以及如何解释生成的推荐结果。我们提出了一种新颖的针对评论的超图(HG)模型,并进一步引入了一个与模型无关的可解释性模块。HG模型能够捕捉用户、物品、方面和观点之间的高阶关系,同时保留评论的信息。可解释性模块可以利用HG模型来解释任何模型生成的预测结果。我们提出了一种基于用户对物品评论偏好的路径受限评论选择方法,并提出了一种基于评论图的新型解释方法。在真实数据集上的实验验证了HG模型在捕捉适当解释方面的能力。|code|0| |Investigating the Usage of Formulae in Mathematical Answer Retrieval|Anja Reusch, Julius Gonsior, Claudio Hartmann, Wolfgang Lehner|Tech Univ Dresden, Dresden Database Res Grp, Dresden, Germany|This work focuses on the task of Mathematical Answer Retrieval and studies the factors a recent Transformer-Encoder-based Language Model (LM) uses to assess the relevance of an answer for a given mathematical question. Mainly, we investigate three factors: (1) the general influence of mathematical formulae, (2) the usage of structural information of those formulae, (3) the overlap of variable names in answers and questions. The findings of the investigation indicate that the LM for Mathematical Answer Retrieval mainly relies on shallow features such as the overlap of variables between question and answers. Furthermore, we identified a malicious shortcut in the training data that hinders the usage of structural information, and improved the overall accuracy by removing this shortcut. We want to foster future research on how LMs are trained for Mathematical Answer Retrieval and provide a basic evaluation setup (Link to repository: https://github.com/AnReu/math_analysis ) for existing models.|本研究聚焦于数学答案检索任务,探讨了基于Transformer-Encoder架构的最新语言模型(LM)在评估给定数学问题答案相关性时所使用的因素。我们主要研究了三个因素:(1) 数学公式的总体影响,(2) 这些公式结构信息的使用,(3) 答案与问题中变量名的重叠。研究结果表明,用于数学答案检索的语言模型主要依赖于浅层特征,如问题与答案之间变量的重叠。此外,我们发现训练数据中存在一个阻碍结构信息使用的恶意捷径,通过消除这一捷径,整体准确性得到了提升。我们希望推动未来关于如何训练语言模型进行数学答案检索的研究,并为现有模型提供一个基础的评估设置(仓库链接:https://github.com/AnReu/math_analysis)。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Investigating+the+Usage+of+Formulae+in+Mathematical+Answer+Retrieval)|0| |Empowering Legal Citation Recommendation via Efficient Instruction-Tuning of Pre-trained Language Models|Jie Wang, Kanha Bansal, Ioannis Arapakis, Xuri Ge, Joemon M. Jose|Univ Glasgow, Glasgow, Lanark, Scotland; Telefon Res, Barcelona, Spain|The escalating volume of cases in legal adjudication has amplified the complexity of citing relevant regulations and authoritative cases, posing an increasing challenge for legal professionals. Current legal citation prediction methods, which are predominantly reliant on keyword or interest-based retrieval, are proving insufficient. In particular, Collaborative Filtering (CF) based legal recommendation methods exhibited low accuracy. In response to these challenges, we propose the Instruction GPT with Low-Rank Adaptation architecture (IGPT-LoRA), aiming to enhance the performance of legal citation recommendations and reduce computational demands by tuning Pre-trained Language Models (PLMs). IGPT-LoRA leverages prompting and efficient tuning strategies, thus offering a significant improvement over previous context-aware legal citation prediction methods. We design effective domain-specific instruction templates to guide the adaptation of PLMs for recommendation purposes, shedding light on the potential of prompt-based learning in the legal domain. Furthermore, we optimize the learning process with an efficient tuning layer - the Low-Rank Adaptation (LoRA) architecture - to bolster applicability.
Experimental results on a real-world legal data set (BVA) demonstrate that IGPT-LoRA outperforms state-of-the-art methods, delivering substantial improvements in accuracy and also in training time and computational efficiency.|随着法律裁决案件数量的不断增加,引用相关法规和权威案例的复杂性也随之增加,这对法律专业人员提出了越来越大的挑战。当前的法律引用预测方法主要依赖于基于关键词或兴趣的检索,但这些方法已显示出不足。特别是基于协同过滤(CF)的法律推荐方法准确率较低。针对这些挑战,我们提出了基于低秩适应的指令GPT架构(IGPT-LoRA),旨在通过微调预训练语言模型(PLMs)来提高法律引用推荐的性能并减少计算需求。IGPT-LoRA利用提示和高效调优策略,从而显著改进了以往基于上下文的法律引用预测方法。我们设计了有效的领域特定指令模板,以指导PLMs的适应,用于推荐目的,揭示了基于提示的学习在法律领域的潜力。此外,我们通过一个高效的调优层——低秩适应(LoRA)架构——优化了学习过程,以增强适用性。在真实世界法律数据集(BVA)上的实验结果表明,IGPT-LoRA优于最先进的方法,在准确性、训练时间和计算效率方面都提供了显著的改进。|code|0| |Fine-Tuning CLIP via Explainability Map Propagation for Boosting Image and Video Retrieval|Yoav Shalev, Lior Wolf|Tel Aviv Univ, Tel Aviv, Israel|Recent studies have highlighted the remarkable performance of CLIP for diverse downstream tasks. To understand how CLIP performs these tasks, various explainability methods have been formulated. In this paper, we reveal that the explainability maps associated with CLIP are often focused on a limited portion of the image and overlook objects that are explicitly mentioned in the text. This phenomenon may result in a high similarity score for incongruent image-text pairs, thereby potentially introducing a bias. To address this issue, we introduce a novel fine-tuning technique for CLIP that leverages a transformer explainability method. Unlike traditional approaches that generate a single heatmap using an image-text pair, our method produces multiple heatmaps directly from the image itself. We use these heatmaps both during the fine-tuning process and at inference time to highlight key visual elements, applying them to the features during the image encoding process, steering the visual encoder's attention toward these key elements. This process guides the image encoder across different spatial regions and generates a set of visual embeddings, thereby allowing the model to consider various aspects of the image, ensuring a detailed and comprehensive understanding that surpasses the limited scope of the original CLIP model. Our method leads to a notable improvement in text, image, and video retrieval across multiple benchmarks. It also results in reduced gender bias, making our model more equitable.|最近的研究强调了CLIP在各种下游任务中的卓越表现。为了理解CLIP如何执行这些任务,已经制定了各种可解释性方法。在本文中,我们揭示了与CLIP相关的可解释性图通常集中在图像的有限部分,并忽略了文本中明确提到的对象。这种现象可能导致不协调的图像-文本对获得较高的相似度分数,从而可能引入偏差。为了解决这个问题,我们引入了一种新的CLIP微调技术,该技术利用了变压器可解释性方法。与传统的使用图像-文本对生成单一热图的方法不同,我们的方法直接从图像本身生成多个热图。我们在微调过程和推理时使用这些热图来突出关键视觉元素,在图像编码过程中将这些热图应用于特征,引导视觉编码器的注意力转向这些关键元素。这一过程引导图像编码器跨越不同的空间区域,并生成一组视觉嵌入,从而使模型能够考虑图像的各个方面,确保详细和全面的理解,超越了原始CLIP模型的有限范围。我们的方法在多个基准测试中显著提高了文本、图像和视频检索的性能。它还减少了性别偏见,使我们的模型更加公平。|code|0| |Cross-Modal Retrieval for Knowledge-Based Visual Question Answering|Paul Lerner, Olivier Ferret, Camille Guinaudeau||Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. 
Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.|基于知识的命名实体视觉问答是一项具有挑战性的任务,需要从多模态知识库中检索信息。命名实体具有不同的视觉表示,因此难以识别。我们认为,跨模态检索有助于弥合实体与其描述之间的语义鸿沟,并与单模态检索相辅相成。我们通过在最近的 ViQuAE、InfoSeek 和 Encyclopedic-VQA 数据集上对多模态双编码器(即 CLIP)进行实验,提供了经验证据。此外,我们还研究了三种不同的策略来微调这种模型: 单模态、跨模态或联合训练。我们的方法结合了单模态检索和跨模态检索,与三个数据集上的十亿参数模型相比具有竞争力,同时在概念上更简单,计算成本更低。|code|0| |Learning to Jointly Transform and Rank Difficult Queries|Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri|Univ Waterloo, Waterloo, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Recent empirical studies have shown that while neural rankers exhibit increasingly higher retrieval effectiveness on tasks such as ad hoc retrieval, these improved performances are not experienced uniformly across the range of all queries. There is typically a large subset of queries that are not satisfied by neural rankers. These queries are often referred to as difficult queries. Given the fact that neural rankers operate based on the similarity between the embedding representations of queries and their relevant documents, the poor performance of difficult queries can be due to the sub-optimal representations learnt for difficult queries. As such, the objective of our work in this paper is to learn to rank documents and also transform query representations in tandem such that the representations of queries are transformed into ones that show higher resemblance to their relevant documents. This way, our method will provide the opportunity to satisfy a large number of difficult queries that would otherwise not be addressed. In order to learn to jointly rank documents and transform queries, we propose to integrate two forms of triplet loss functions into neural rankers such that they ensure that each query is moved along the embedding space, through the transformation of its embedding representation, in order to be placed close to its relevant document(s). We perform experiments based on the MS MARCO passage ranking task and show that our proposed method has been able to show noticeable performance improvement for queries that were extremely difficult for existing neural rankers. On average, our approach has been able to satisfy 277 queries with an MRR@10 of 0.21 for queries that had a reciprocal rank of zero on the initial neural ranker.|最近的实证研究表明,尽管神经排序器在诸如即席检索等任务上表现出越来越高的检索效果,但这些性能提升并非在所有查询范围内均匀体现。通常存在一个较大的查询子集,这些查询无法被神经排序器有效满足。这些查询通常被称为困难查询。鉴于神经排序器基于查询及其相关文档的嵌入表示之间的相似性进行操作,困难查询表现不佳的原因可能是为这些查询学习到的表示不够理想。因此,本文工作的目标是学习文档排序的同时,同步转换查询表示,使得查询的表示能够转化为与其相关文档更相似的表示。通过这种方式,我们的方法将有机会满足大量原本无法处理的困难查询。为了学习联合排序文档和转换查询,我们提出将两种形式的三重态损失函数集成到神经排序器中,以确保每个查询通过其嵌入表示的转换在嵌入空间中移动,从而更接近其相关文档。我们在MS MARCO段落排序任务上进行了实验,结果表明,所提出的方法对于现有神经排序器极为困难的查询显示出显著的性能提升。平均而言,我们的方法能够满足277个查询,这些查询在初始神经排序器上的倒数排名为零,而我们的方法在MRR@10指标上达到了0.21。|code|0| |Instant Answering in E-Commerce Buyer-Seller Messaging Using Message-to-Question Reformulation|Besnik Fetahu, Tejas Mehta, Qun Song, Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi||E-commerce customers frequently seek detailed product information for purchase decisions, commonly contacting sellers directly with extended queries. This manual response requirement imposes additional costs and disrupts the buyer's shopping experience with response time fluctuations ranging from hours to days.
We seek to automate buyer inquiries to sellers in a leading e-commerce store using a domain-specific federated Question Answering (QA) system. The main challenge is adapting current QA systems, designed for single questions, to address detailed customer queries. We address this with a low-latency, sequence-to-sequence approach, MESSAGE-TO-QUESTION (M2Q). It reformulates buyer messages into succinct questions by identifying and extracting the most salient information from a message. Evaluation against baselines shows that M2Q yields relative increases of 757% in answering rate from the federated QA system. Live deployment shows that automatic answering saves sellers from manually responding to millions of messages per year, and also accelerates customer purchase decisions by eliminating the need for buyers to wait for a reply.|电子商务客户经常为购买决策寻找详细的产品信息,通常直接与销售商进行扩展查询。这种手动响应要求增加了额外的成本,并且由于响应时间从几小时到几天的波动而扰乱了买家的购物体验。我们寻求在一个领先的电子商务商店使用领域特定的联邦问题回答(QA)系统自动化的买方询问卖方。主要的挑战是适应当前的 QA 系统,为单个问题设计,以解决详细的客户查询。我们使用低延迟、序列到序列的方法 MESSAGE-TO-QUESTION (M2Q)来解决这个问题。它通过从消息中识别和提取最突出的信息,将买方消息重新表述为简洁的问题。对基线的评估表明,M2Q 在联邦 QA 系统中的应答率相对提高了757%。实时部署显示,自动回复可以节省卖家每年手动回复数百万条消息的时间,还可以消除买家等待回复的需要,从而加快客户的购买决策|code|0| |Towards Automated End-to-End Health Misinformation Free Search with a Large Language Model|Ronak Pradeep, Jimmy Lin|Univ Waterloo, David R Cheriton Sch Comp Sci, Waterloo, ON, Canada|In the information age, health misinformation remains a notable challenge to public welfare. Integral to addressing this issue is the development of search systems adept at identifying and filtering out misleading content. This paper presents the automation of Vera, a state-of-the-art consumer health search system. While Vera can discern articles containing misinformation, it requires expert ground truth answers and rule-based reformulations. We introduce an answer prediction module that integrates GPT x with Vera and a GPT-based query reformulator to yield high-quality stance reformulations and boost downstream retrieval effectiveness. Further, we find that chain-of-thought reasoning is paramount to higher effectiveness. When assessed in the TREC Health Misinformation Track of 2022, our systems surpassed all competitors, including human-in-the-loop configurations, underscoring their pivotal role in the evolution towards a health misinformation-free search landscape. We provide all code necessary to reproduce our results at https://github.com/castorini/pygaggle .|在信息时代,健康错误信息仍然是公共福利面临的一个显著挑战。解决这一问题的关键在于开发能够识别并过滤误导性内容的搜索系统。本文介绍了Vera的自动化实现,Vera是一种先进的消费者健康搜索系统。尽管Vera能够识别包含错误信息的文章,但它需要专家提供的真实答案和基于规则的重述。我们引入了一个答案预测模块,该模块将GPT x与Vera集成,并使用基于GPT的查询重述器来生成高质量的立场重述,从而提升下游检索效果。此外,我们发现思维链推理对于提高效果至关重要。在2022年TREC健康错误信息追踪评估中,我们的系统超越了所有竞争对手,包括人在环配置,突出了它们在向无健康错误信息搜索环境演进过程中的关键作用。我们在https://github.com/castorini/pygaggle 提供了重现我们结果所需的所有代码。|code|0| |Reproducibility Analysis and Enhancements for Multi-aspect Dense Retriever with Aspect Learning|Keping Bi, Xiaojie Sun, Jiafeng Guo, Xueqi Cheng||Multi-aspect dense retrieval aims to incorporate aspect information (e.g., brand and category) into dual encoders to facilitate relevance matching. As an early and representative multi-aspect dense retriever, MADRAL learns several extra aspect embeddings and fuses the explicit aspects with an implicit aspect "OTHER" for final representation. MADRAL was evaluated on proprietary data and its code was not released, making it challenging to validate its effectiveness on other datasets.
We failed to reproduce its effectiveness on the public MA-Amazon data, motivating us to probe the reasons and re-examine its components. We propose several component alternatives for comparisons, including replacing "OTHER" with "CLS" and representing aspects with the first several content tokens. Through extensive experiments, we confirm that learning "OTHER" from scratch in aspect fusion is harmful. In contrast, our proposed variants can greatly enhance the retrieval performance. Our research not only sheds light on the limitations of MADRAL but also provides valuable insights for future studies on more powerful multi-aspect dense retrieval models. Code will be released at: https://github.com/sunxiaojie99/Reproducibility-for-MADRAL.|多方面密集检索旨在将方面信息(例如,品牌和类别)合并到双编码器中,以促进相关性匹配。作为一个早期的、有代表性的多方面密集检索器,MADRAL 学习了一些额外的方面嵌入,并将显式方面与隐式方面“OTHER”融合以得到最终的表示。MADRAL 是根据专有数据进行评估的,其代码没有发布,这使得在其他数据集上验证其有效性具有挑战性。我们未能在公开的 MA-Amazon 数据上再现其有效性,这促使我们探究其原因并重新检查其组成部分。我们提出了几种可供比较的组件替代方案,包括用“CLS”替换“OTHER”,以及用前几个内容标记表示方面。通过大量的实验,我们证实了在方面融合中从头学习“OTHER”是有害的。相比之下,我们提出的变体可以大大提高检索性能。我们的研究不仅揭示了 MADRAL 的局限性,而且为未来更强大的多方面密集检索模型的研究提供了有价值的见解。代码将在 https://github.com/sunxiaojie99/Reproducibility-for-MADRAL 公布。|code|0| |An Empirical Analysis of Intervention Strategies' Effectiveness for Countering Misinformation Amplification by Recommendation Algorithms|Royal Pathak, Francesca Spezzano|Boise State Univ, Comp Sci Dept, Boise, ID 83725 USA|Social network platforms connect people worldwide, facilitating communication, information sharing, and personal/professional networking. They use recommendation algorithms to personalize content and enhance user experiences. However, these algorithms can unintentionally amplify misinformation by prioritizing engagement over accuracy. For instance, recent works suggest that popularity-based and network-based recommendation algorithms contribute the most to misinformation diffusion. In our study, we present an exploration on two Twitter datasets to understand the impact of intervention techniques on combating misinformation amplification initiated by recommendation algorithms. We simulate various scenarios and evaluate the effectiveness of intervention strategies in social sciences such as Virality Circuit Breakers and accuracy nudges. Our findings highlight that these intervention strategies are generally successful when applied on top of collaborative filtering and content-based recommendation algorithms, while having different levels of effectiveness depending on the number of users keen to spread fake news present in the dataset.|社交网络平台将全球各地的人们连接起来,促进了沟通、信息共享以及个人和专业网络的建立。这些平台利用推荐算法来个性化内容并提升用户体验。然而,这些算法可能会无意中放大虚假信息,因为它们优先考虑用户参与度而非信息准确性。例如,最近的研究表明,基于流行度和基于网络的推荐算法对虚假信息的传播贡献最大。在我们的研究中,我们基于两个Twitter数据集展开探索,旨在理解干预技术对遏制推荐算法引发的虚假信息放大的影响。我们模拟了多种场景,并评估了社会科学中干预策略的有效性,例如“病毒传播断路器”和准确性提示。我们的研究结果表明,当这些干预策略应用于协同过滤和基于内容的推荐算法时,通常能够取得成功,但其有效性水平会因数据集中热衷于传播虚假新闻的用户数量而有所不同。|code|0| |Not Just Algorithms: Strategically Addressing Consumer Impacts in Information Retrieval|Michael D. Ekstrand, Lex Beattie, Maria Soledad Pera, Henriette Cramer|Drexel Univ, Philadelphia, PA 19104 USA; Delft Univ Technol, Delft, Netherlands; Spotify, Seattle, WA USA|Information Retrieval (IR) systems have a wide range of impacts on consumers. We offer maps to help identify goals IR systems could—or should—strive for, and guide the process of scoping how to gauge a wide range of consumer-side impacts and the possible interventions needed to address these effects.
Grounded in prior work on scoping algorithmic impact efforts, our goal is to promote and facilitate research that (1) is grounded in impacts on information consumers, contextualizing these impacts in the broader landscape of positive and negative consumer experience; (2) takes a broad view of the possible means of changing or improving that impact, including non-technical interventions; and (3) uses operationalizations and strategies that are well-matched to the technical, social, ethical, legal, and other dimensions of the specific problem in question.|信息检索(IR)系统对消费者有着广泛的影响。我们提供了一些地图,以帮助识别IR系统可以——或应该——追求的目标,并指导如何衡量广泛的消费者端影响以及解决这些影响所需的可能干预措施的范围界定过程。基于先前关于范围界定算法影响的工作,我们的目标是促进和推动以下研究:(1)以信息消费者的影响为基础,将这些影响置于更广泛的积极和消极消费者体验的背景下;(2)采取广泛的视角来看待改变或改进这些影响的可能手段,包括非技术干预;(3)使用与特定问题的技术、社会、伦理、法律和其他维度相匹配的操作化和策略。|code|0| |A Study of Pre-processing Fairness Intervention Methods for Ranking People|Clara Rus, Andrew Yates, Maarten de Rijke|Univ Amsterdam, Amsterdam, Netherlands|Fairness interventions are hard to use in practice when ranking people due to legal constraints that limit access to sensitive information. Pre-processing fairness interventions, however, can be used in practice to create more fair training data that encourage the model to generate fair predictions without having access to sensitive information during inference. Little is known about the performance of pre-processing fairness interventions in a recruitment setting. To simulate a real scenario, we train a ranking model on pre-processed representations, while access to sensitive information is limited during inference. We evaluate pre-processing fairness intervention methods in terms of individual fairness and group fairness. On two real-world datasets, the pre-processing methods are found to improve the diversity of rankings with respect to gender, while individual fairness is not affected. Moreover, we discuss advantages and disadvantages of using pre-processing fairness interventions in practice for ranking people.|由于法律限制了对敏感信息的访问,公平性干预措施在排名人员时难以实际应用。然而,预处理公平性干预措施可以在实践中用于创建更公平的训练数据,从而鼓励模型在推理过程中无需访问敏感信息的情况下生成公平的预测。在招聘环境中,预处理公平性干预措施的性能尚不为人所知。为了模拟真实场景,我们在预处理的表示上训练了一个排名模型,同时在推理过程中对敏感信息的访问受到限制。我们从个体公平性和群体公平性的角度评估了预处理公平性干预方法。在两个真实世界的数据集上,预处理方法被发现可以提高与性别相关的排名多样性,而个体公平性不受影响。此外,我们讨论了在排名人员时使用预处理公平性干预措施的实际优缺点。|code|0| |Evaluating the Explainability of Neural Rankers|Saran Pandian, Debasis Ganguly, Sean MacAvaney||Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of the search models have been able to demonstrate improvements in effectiveness (measured in terms of relevance of top-retrieved results), a question worthy of a thorough inspection is - "how explainable are these models?", which is what this paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model (the explanation algorithm being identical for all the models that are to be evaluated). In our proposed framework, each model, in addition to returning a ranked list of documents, also requires to return a list of explanation units or rationales for each document. This meta-information from each document is then used to measure how locally consistent these rationales are as an intrinsic measure of interpretability - one that does not require manual relevance assessments. 
Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document level relevance assessments. Our findings show a number of interesting observations: sentence-level rationales are more consistent, an increase in complexity mostly leads to less consistent explanations, and interpretability measures offer a complementary dimension for evaluating IR systems, because consistency is not well-correlated with nDCG at top ranks.|信息检索模型已经见证了从无监督统计方法到基于特征的监督方法到完全数据驱动方法的范式转变,这种方法利用了大型语言模型的预训练。虽然搜索模型日益增加的复杂性已经能够证明有效性的改善(根据检索结果的相关性衡量) ,但是一个值得彻底检查的问题是——“这些模型如何解释?”这就是本文的目的。特别是,我们提出了一个通用的评估平台,系统地评估任何排名模型的可解释性(解释算法对于所有待评估的模型都是相同的)。在我们提出的框架中,每个模型除了返回排序的文档列表之外,还需要返回每个文档的解释单元或基本原理的列表。然后,利用每份文件中的元信息来衡量这些理由在当地的一致程度,作为衡量可解释性的内在尺度,而不需要人工进行相关性评估。此外,作为一个外在的测量,我们通过利用子文档级别的相关性评估来计算这些基本原理的相关性。我们的研究结果显示,许多有趣的观察结果,例如句子水平的基本原理更加一致,复杂性的增加主要导致不一致的解释,并且可解释性测量提供了 IR 系统评估的补充维度,因为一致性与顶级的 nDCG 不相关。|code|0| |Knowledge Graph Cross-View Contrastive Learning for Recommendation|Zeyuan Meng, Iadh Ounis, Craig Macdonald, Zixuan Yi|Univ Glasgow, Glasgow, Lanark, Scotland|Knowledge Graphs (KGs) are useful side information that help recommendation systems improve recommendation quality by providing rich semantic information about entities and items. Recently, models based on graph neural networks (GNNs) have adopted knowledge graphs to capture further high-order structural information, such as shared preferences between users and similarities between items. However, existing GNN-based methods suffer from two challenges: (1) Sparse supervisory signal, where a large amount of information in the knowledge graph is non-relevant to recommendation, and the training labels are insufficient, thereby limiting the recommendation performance of the trained model; (2) Valuable information is discarded, whereby the existing models' use of edge or node dropout strategies to obtain augmented views during self-supervised learning can lead to valuable information being discarded in recommendation. These two challenges limit the effective representation of users and items by existing methods. Inspired by self-supervised learning to mine supervision signals from data, in this paper, we focus on exploring contrastive learning based on knowledge graph enhancement, and propose a new model named Knowledge Graph Cross-view Contrastive Learning for Recommendation (KGCCL) to address the two challenges. Specifically, to address supervision sparseness, we perform contrastive learning between graph views at different levels and mine graph feature information in a self-supervised learning manner. In addition, we use noise augmentation to enhance the representation of users and items, while retaining all triplet information in the knowledge graph to address the challenge of valuable information being discarded. Experimental results on three public datasets show that our proposed KGCCL model outperforms existing state-of-the-art methods.
In particular, our model outperforms the best baseline performance by 10.65% on the MIND dataset.|知识图谱(KGs)是一种有用的辅助信息,通过提供关于实体和项目的丰富语义信息,帮助推荐系统提高推荐质量。最近,基于图神经网络(GNNs)的模型采用知识图谱来捕捉更多的高阶结构信息,例如用户之间的共享偏好和项目之间的相似性。然而,现有的基于GNN的方法面临两个挑战:(1)稀疏的监督信号,即知识图谱中的大量信息与推荐无关,且训练标签不足,从而限制了训练模型的推荐性能;(2)有价值的信息被丢弃,即现有模型在自监督学习过程中使用边或节点丢弃策略来获得增强视图,可能导致推荐中有价值的信息被丢弃。这两个挑战限制了现有方法对用户和项目的有效表示。受自监督学习从数据中挖掘监督信号的启发,本文重点探索基于知识图谱增强的对比学习,并提出了一种名为知识图谱跨视图对比学习推荐模型(KGCCL)的新模型来解决这两个挑战。具体来说,为了解决监督稀疏性问题,我们在不同层次的图视图之间进行对比学习,并以自监督学习的方式挖掘图特征信息。此外,我们使用噪声增强来增强用户和项目的表示,同时保留知识图谱中的所有三元组信息,以解决有价值信息被丢弃的挑战。在三个公开数据集上的实验结果表明,我们提出的KGCCL模型优于现有的最先进方法。特别是在MIND数据集上,我们的模型比最佳基线性能提高了10.65%。|code|0| |Recommendation Fairness in eParticipation: Listening to Minority, Vulnerable and NIMBY Citizens|Marina AlonsoCortés, Iván Cantador, Alejandro Bellogín|Univ Autonoma Madrid, Escuela Politecn Super, Madrid 28049, Spain|E-participation refers to the use of digital technologies and online platforms to engage citizens and other stakeholders in democratic and government decision-making processes. Recent research work has explored the application of recommender systems to e-participation, focusing on the development of algorithmic solutions to be effective in terms of personalized content retrieval accuracy, but ignoring underlying societal issues, such as biases, fairness, privacy and transparency. Motivated by this research gap, on a public e-participatory budgeting dataset, we measure and analyze recommendation fairness metrics oriented to several minority, vulnerable and NIMBY (Not In My Back Yard) groups of citizens. Our empirical results show that there is a strong popularity bias (especially for the minority groups) due to how content is presented and accessed in a reference e-participation platform; and that hybrid algorithms exploiting user geolocation information in a collaborative filtering fashion are good candidates to satisfy the proposed fairness conceptualization for the above underrepresented citizen collectives.|电子参与(E-participation)是指利用数字技术和在线平台,使公民及其他利益相关者参与到民主和政府决策过程中。近期的研究工作探索了推荐系统在电子参与中的应用,重点开发了在个性化内容检索准确性方面有效的算法解决方案,但忽略了潜在的社会问题,如偏见、公平性、隐私和透明度。受这一研究空白的启发,我们在一个公共电子参与预算数据集上,针对多个少数群体、弱势群体和“不要在我家后院”(NIMBY)的公民群体,测量并分析了面向推荐公平性的指标。我们的实证结果表明,由于内容在参考电子参与平台上的呈现和访问方式,存在强烈的流行度偏见(尤其是对少数群体);而利用用户地理位置信息以协同过滤方式工作的混合算法,是满足上述代表性不足公民群体所提出的公平性概念的良好候选方案。|code|0| |Responsible Opinion Formation on Debated Topics in Web Search|Alisa Rieger, Tim Draws, Nicolas Mattis, David Maxwell, David Elsweiler, Ujwal Gadiraju, Dana McKay, Alessandro Bozzon, Maria Soledad Pera|Univ Regensburg, Regensburg, Germany; OTTO, Hamburg, Germany; Vrije Univ Amsterdam, Amsterdam, Netherlands; RMIT Univ Melbourne, Melbourne, Australia; Booking com, Amsterdam, Netherlands; Delft Univ Technol, Delft, Netherlands|Web search has evolved into a platform people rely on for opinion formation on debated topics. Yet, pursuing this search intent can carry serious consequences for individuals and society and involves a high risk of biases. We argue that web search can and should empower users to form opinions responsibly and that the information retrieval community is uniquely positioned to lead interdisciplinary efforts to this end. Building on digital humanism—a perspective focused on shaping technology to align with human values and needs—and through an extensive interdisciplinary literature review, we identify challenges and research opportunities that focus on the searcher, search engine, and their complex interplay. 
We outline a research agenda that provides a foundation for research efforts toward addressing these challenges.|网络搜索已经发展成为一个人们依赖的平台,用于就争议话题形成意见。然而,追求这种搜索意图可能会对个人和社会带来严重后果,并涉及高度的偏见风险。我们认为,网络搜索能够并且应该使用户负责任地形成意见,信息检索社区在这方面具有独特的优势,可以领导跨学科的努力。基于数字人文主义——一种专注于塑造技术以符合人类价值观和需求的视角——并通过广泛的跨学科文献综述,我们确定了关注搜索者、搜索引擎及其复杂互动的挑战和研究机会。我们概述了一个研究议程,为应对这些挑战的研究工作提供了基础。|code|0| |Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines|Janek Bevendorff, Matti Wiegmann, Martin Potthast, Benno Stein|Bauhaus Univ Weimar, Weimar, Germany; Univ Leipzig, Leipzig, Germany|Many users of web search engines have been complaining in recent years about the supposedly decreasing quality of search results. This is often attributed to an increasing amount of search-engine-optimized but low-quality content. Evidence for this has always been anecdotal, yet it's not unreasonable to think that popular online marketing strategies such as affiliate marketing incentivize the mass production of such content to maximize clicks. Since neither this complaint nor affiliate marketing as such have received much attention from the IR community, we hereby lay the groundwork by conducting an in-depth exploratory study of how affiliate content affects today's search engines. We monitored Google, Bing and DuckDuckGo for a year on 7,392 product review queries. Our findings suggest that all search engines have significant problems with highly optimized (affiliate) content—more than is representative for the entire web according to a baseline retrieval system on the ClueWeb22. Focussing on the product review genre, we find that only a small portion of product reviews on the web uses affiliate marketing, but the majority of all search results do. Of all affiliate networks, Amazon Associates is by far the most popular. We further observe an inverse relationship between affiliate marketing use and content complexity, and that all search engines fall victim to large-scale affiliate link spam campaigns. However, we also notice that the line between benign content and spam in the form of content and link farms becomes increasingly blurry—a situation that will surely worsen in the wake of generative AI. We conclude that dynamic adversarial spam in the form of low-quality, mass-produced commercial content deserves more attention. (Code and data: https://github.com/webis-de/ECIR-24 ).|近年来,许多网络搜索引擎的用户抱怨搜索结果的质量似乎在下降。这通常归因于搜索引擎优化但低质量内容的增加。虽然这些证据大多是轶事性的,但认为诸如联盟营销等流行的在线营销策略激励了此类内容的大规模生产以最大化点击量并非不合理。由于这一投诉和联盟营销本身都未受到信息检索(IR)社区的太多关注,我们在此通过深入探索性研究来奠定基础,研究联盟内容如何影响当今的搜索引擎。我们对Google、Bing和DuckDuckGo进行了为期一年的监控,涉及7,392个产品评论查询。我们的研究结果表明,所有搜索引擎在高度优化的(联盟)内容方面都存在显著问题——根据ClueWeb22上的基线检索系统,这种情况比整个网络上的情况更为严重。聚焦于产品评论这一类别,我们发现网络上只有一小部分产品评论使用联盟营销,但大多数搜索结果都使用了。在所有联盟网络中,Amazon Associates是最受欢迎的。我们进一步观察到联盟营销使用与内容复杂性之间的反向关系,并且所有搜索引擎都成为大规模联盟链接垃圾邮件活动的受害者。然而,我们也注意到良性内容与垃圾邮件(如内容和链接农场)之间的界限越来越模糊——这种情况在生成式AI的推动下肯定会进一步恶化。我们得出结论,低质量、大规模生产的商业内容形式的动态对抗性垃圾邮件值得更多关注。(代码和数据:https://github.com/webis-de/ECIR-24)|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Is+Google+Getting+Worse?+A+Longitudinal+Investigation+of+SEO+Spam+in+Search+Engines)|0| |Robustness in Fairness Against Edge-Level Perturbations in GNN-Based Recommendation|Ludovico Boratto, Francesco Fabbri, Gianni Fenu, Mirko Marras, Giacomo Medda||Efforts in the recommendation community are shifting from the sole emphasis on utility to considering beyond-utility factors, such as fairness and robustness. 
Robustness of recommendation models is typically linked to their ability to maintain the original utility when subjected to attacks. Limited research has explored the robustness of a recommendation model in terms of fairness, e.g., the parity in performance across groups, under attack scenarios. In this paper, we aim to assess the robustness of graph-based recommender systems concerning fairness, when exposed to attacks based on edge-level perturbations. To this end, we considered four different fairness operationalizations, including both consumer and provider perspectives. Experiments on three datasets shed light on the impact of perturbations on the targeted fairness notion, uncovering key shortcomings in existing evaluation protocols for robustness. As an example, we observed that perturbations affect consumer fairness to a higher extent than provider fairness, with alarming unfairness for the former. Source code: https://github.com/jackmedda/CPFairRobust|推荐社区的工作正在从单纯强调效用转向考虑效用以外的因素,如公平性和稳健性。推荐模型的稳健性通常与其在受到攻击时保持原有效用的能力相关。目前仅有有限的研究探索了推荐模型在公平性方面的稳健性,例如攻击场景下组间性能的均等性。本文旨在评估基于图的推荐系统在受到边级扰动攻击时在公平性方面的稳健性。为此,我们考虑了四种不同的公平性操作化定义,涵盖消费者和提供者两种视角。在三个数据集上的实验揭示了扰动对目标公平性概念的影响,发现了现有稳健性评估协议的关键缺陷。例如,我们观察到扰动对消费者公平性的影响程度高于提供者公平性,前者的不公平程度令人担忧。源代码: https://github.com/jackmedda/CPFairRobust|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Robustness+in+Fairness+Against+Edge-Level+Perturbations+in+GNN-Based+Recommendation)|0| |Shallow Cross-Encoders for Low-Latency Retrieval|Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald||Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. At the same time, keeping search latencies low is important for user satisfaction and energy usage. In this paper, we show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings, since they can estimate the relevance of more documents in the same time budget. We further show that shallow transformers may benefit from the generalized Binary Cross-Entropy (gBCE) training scheme, which has recently demonstrated success for recommendation tasks. Our experiments with TREC Deep Learning passage ranking query sets demonstrate significant improvements in shallow and full-scale models in low-latency scenarios. 
For example, when the latency limit is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches NDCG@10 of 0.652, a +51% gain over MonoBERT-Large. We also show that shallow Cross-Encoders are effective even when used without a GPU (e.g., with CPU inference, NDCG@10 decreases only by 3% compared to GPU inference with 50ms latency), which makes Cross-Encoders practical to run even without specialized hardware acceleration.|基于Transformer的交叉编码器在文本检索中取得了最先进的效果。然而,基于大型Transformer模型(如BERT或T5)的交叉编码器计算代价高昂,在一个相当小的延迟窗口内只能对少量文档进行评分。与此同时,保持较低的搜索延迟对用户满意度和能耗都很重要。在本文中,我们表明,较弱的浅层Transformer模型(即层数有限的Transformer)在这些实际的低延迟约束下反而比全尺寸模型表现更好,因为它们可以在相同的时间预算内估计更多文档的相关性。我们进一步表明,浅层Transformer可以受益于广义二元交叉熵(gBCE)训练方案,该方案最近在推荐任务上取得了成功。我们在TREC深度学习段落排序查询集上的实验表明,浅层和全尺寸模型在低延迟场景中均有显著改进。例如,当每个查询的延迟限制为25毫秒时,MonoBERT-Large(一种基于全尺寸BERT模型的交叉编码器)在TREC DL 2019上只能达到0.431的NDCG@10,而TinyBERT-gBCE(一种基于TinyBERT、经gBCE训练的交叉编码器)达到了0.652的NDCG@10,相比MonoBERT-Large提升了51%。我们还表明,即使不使用GPU,浅层交叉编码器也同样有效(例如,使用CPU推理时,相比延迟为50毫秒的GPU推理,NDCG@10仅下降3%),这使得交叉编码器即使没有专门的硬件加速也能实际运行。|code|0| |Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies|Puxuan Yu, Antonio Mallia, Matthias Petri||We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting corpus-specific vocabulary and increasing vocabulary size decreases average postings list length which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.|我们探索利用特定于语料库的词表来提高学习型稀疏检索系统的效率和有效性。我们发现,在目标语料库上对底层BERT模型进行预训练,特别是针对文档扩展过程中采用的不同词表规模,可以将检索质量提高多达12%,在某些场景下还能将延迟降低多达50%。我们的实验表明,采用特定语料库词表并增大词表规模会缩短平均倒排列表(postings list)长度,从而降低延迟。消融研究显示了自定义词表、文档扩展技术和稀疏模型的稀疏化目标之间有趣的交互作用。有效性和效率的提升均可迁移到不同的检索方法(如 uniCOIL 和 SPLADE),为学习型稀疏检索系统提供了一种简单而有效的、新的效率-效果权衡方案。|code|0| |An Adaptive Framework of Geographical Group-Specific Network on O2O Recommendation|Luo Ji, Jiayu Mao, Hailong Shi, Qian Li, Yunfei Chu, Hongxia Yang||Online to offline recommendation strongly correlates with the user and service's spatiotemporal information, therefore calling for a higher degree of model personalization. The traditional methodology is based on a uniform model structure trained by collected centralized data, which is unlikely to capture all user patterns over different geographical areas or time periods. To tackle this challenge, we propose a geographical group-specific modeling method called GeoGrouse, which simultaneously studies the common knowledge as well as group-specific knowledge of user preferences. An automatic grouping paradigm is employed and verified based on users' geographical grouping indicators. 
Offline and online experiments are conducted to verify the effectiveness of our approach, and substantial business improvement is achieved.|线上到线下(O2O)推荐与用户和服务的时空信息密切相关,因此需要更高程度的模型个性化。传统方法基于由集中收集的数据训练的统一模型结构,这种结构不太可能捕获不同地理区域或不同时间段的所有用户模式。为了应对这一挑战,我们提出了一种名为GeoGrouse的地理组群特定建模方法,该方法同时学习用户偏好的共性知识和组群特定知识。基于用户的地理分组指标,采用并验证了一种自动分组范式。通过离线和在线实验验证了该方法的有效性,并取得了实质性的业务改进。|code|0| |GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation|Kaustubh D. Dhole, Eugene Agichtein||Query Reformulation (QR) is a set of techniques used to transform a user's original search query to a text that better aligns with the user's intent and improves their search experience. Recently, zero-shot QR has been shown to be a promising approach due to its ability to exploit knowledge inherent in large language models. By taking inspiration from the success of ensemble prompting strategies which have benefited many tasks, we investigate if they can help improve query reformulation. In this context, we propose an ensemble based prompting technique, GenQREnsemble, which leverages paraphrases of a zero-shot instruction to generate multiple sets of keywords, ultimately improving retrieval performance. We further introduce its post-retrieval variant, GenQREnsembleRF, to incorporate pseudo relevant feedback. On evaluations over four IR benchmarks, we find that GenQREnsemble generates better reformulations with relative nDCG@10 improvements of up to 18% over the previous zero-shot state-of-the-art. On the MSMarco Passage Ranking task, GenQREnsembleRF shows relative gains of 5% MRR using pseudo-relevance feedback, and 9% nDCG@10 using relevant feedback documents.|查询重构(Query Reformulation,QR)是一类技术,用于将用户的原始搜索查询转换为更符合用户意图并改善其搜索体验的文本。最近,零样本QR已被证明是一种有前途的方法,因为它能够利用大型语言模型中固有的知识。受集成提示策略在许多任务中取得成功的启发,我们探讨了这类策略是否有助于改进查询重构。在此背景下,我们提出了一种基于集成的提示技术GenQREnsemble,它利用零样本指令的多个释义来生成多组关键词,最终提高检索性能。我们进一步引入其检索后变体GenQREnsembleRF,以纳入伪相关反馈。在四个IR基准上的评估中,我们发现GenQREnsemble能生成更好的查询重构,相对nDCG@10提升最高达18%,超过了此前的零样本最优方法。在MSMarco段落排序任务中,GenQREnsembleRF使用伪相关反馈取得了5%的MRR相对增益,使用相关反馈文档取得了9%的nDCG@10相对增益。|code|0| |Improving the Robustness of Dense Retrievers Against Typos via Multi-Positive Contrastive Learning|Georgios Sidiropoulos, Evangelos Kanoulas||Dense retrieval has become the new paradigm in passage retrieval. Despite its effectiveness on typo-free queries, it is not robust when dealing with queries that contain typos. Current works on improving the typo-robustness of dense retrievers combine (i) data augmentation to obtain the typoed queries during training time with (ii) additional robustifying subtasks that aim to align the original, typo-free queries with their typoed variants. Even though multiple typoed variants are available as positive samples per query, some methods assume a single positive sample and a set of negative ones per anchor and tackle the robustifying subtask with contrastive learning; therefore, making insufficient use of the multiple positives (typoed queries). In contrast, in this work, we argue that all available positives can be used at the same time and employ contrastive learning that supports multiple positives (multi-positive). 
Experimental results on two datasets show that our proposed approach of leveraging all positives simultaneously and employing multi-positive contrastive learning on the robustifying subtask yields improvements in robustness over using contrastive learning with a single positive.|密集检索已经成为段落检索的新范式。尽管它对无拼写错误的查询很有效,但在处理包含拼写错误的查询时并不鲁棒。目前改善密集检索器拼写错误鲁棒性的工作结合了(i)数据增强,以在训练期间获得含拼写错误的查询,以及(ii)额外的鲁棒化子任务,旨在将原始的无拼写错误查询与其含拼写错误的变体对齐。尽管每个查询都有多个含拼写错误的变体可用作正样本,但一些方法假设每个锚点只有单个正样本和一组负样本,并通过对比学习处理鲁棒化子任务,因此未能充分利用多个正样本(即含拼写错误的查询)。相比之下,在这项工作中,我们认为所有可用的正样本可以同时使用,并采用支持多正样本(multi-positive)的对比学习。在两个数据集上的实验结果表明,与使用单正样本的对比学习相比,我们提出的同时利用所有正样本并在鲁棒化子任务上采用多正样本对比学习的方法能够提升鲁棒性。|code|0| |Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-Seeking Conversations|Weronika Lajewska, Krisztian Balog||Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems. We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response. This way we can automatically assess if the answer to the user's question is present in the corpus. Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate. For training and evaluation, we develop a dataset based on the TREC CAsT benchmark that includes answerability labels on the sentence, passage, and ranking levels. We demonstrate that our proposed method represents a strong baseline and outperforms a state-of-the-art LLM on the answerability prediction task.|生成式人工智能模型面临着幻觉的挑战,幻觉可能会破坏用户对这类系统的信任。我们将会话信息搜寻问题视为一个两步过程:首先识别语料库中的相关段落,然后将其总结为最终的系统回复。这样我们就可以自动评估用户问题的答案是否存在于语料库中。具体来说,我们提出的方法使用一个句子级别的分类器来检测答案是否存在,然后在段落级别聚合这些预测,并最终跨排名靠前的段落得出最终的可回答性估计。为了训练和评估,我们构建了一个基于TREC CAsT基准的数据集,其中包括句子、段落和排名层面的可回答性标签。我们证明,我们提出的方法是一个强有力的基线,并在可回答性预测任务上优于最先进的LLM。|code|0| |On the Influence of Reading Sequences on Knowledge Gain During Web Search|Wolfgang Gritz, Anett Hoppe, Ralph Ewerth||Nowadays, learning increasingly involves the usage of search engines and web resources. The related interdisciplinary research field search as learning aims to understand how people learn on the web. Previous work has investigated several feature classes to predict, for instance, the expected knowledge gain during web search. Therein, eye-tracking features have not been extensively studied so far. In this paper, we extend a previously used reading model from a line-based one to one that can detect reading sequences across multiple lines. We use publicly available study data from a web-based learning task to examine the relationship between our feature set and the participants' test scores. Our findings demonstrate that learners with higher knowledge gain spent significantly more time reading and processed more words in total. We also find evidence that faster reading at the expense of more backward regressions may be an indicator of better web-based learning. 
We make our code publicly available at https://github.com/TIBHannover/reading_web_search.|如今,学习越来越多地涉及搜索引擎和网络资源的使用。相关的跨学科研究领域“搜索即学习”(search as learning)旨在理解人们如何在网络上学习。以往的工作研究了若干特征类别来预测(例如)网络搜索期间的预期知识增益。其中,眼动追踪特征迄今尚未得到广泛研究。在本文中,我们将以往使用的阅读模型从基于单行的模型扩展为能够跨多行检测阅读序列的模型。我们使用来自一项网络学习任务的公开研究数据,来检验我们的特征集与参与者测试分数之间的关系。我们的研究结果表明,知识增益更高的学习者花费了显著更多的时间阅读,并且总共处理了更多的单词。我们还发现,以更多回视为代价的更快阅读可能是更好的网络学习的一个指标。我们的代码公开于 https://github.com/TIBHannover/reading_web_search。|code|0| |SPARe: Supercharged Lexical Retrievers on GPU with Sparse Kernels|Tiago Almeida, Sérgio Matos|Univ Aveiro, IEETA, DETI, LASI, P-3810193 Aveiro, Portugal|Lexical sparse retrievers rely on efficient searching algorithms that operate over inverted index structures, tailored specifically for CPU. This CPU-centric design poses a challenge when adapting these algorithms for highly parallel accelerators, such as GPUs, thus deterring potential performance gains. To address this, we propose to leverage the recent advances in sparse computations offered by deep learning frameworks to directly implement sparse retrieval on these accelerators. This paper presents the SPARe (SPArse Retrievers) Python package, which provides a high-level API to deal with sparse retrievers on (single or multi)-accelerators, by leveraging deep learning frameworks at its core. Experimental results show that SPARe, running on an accessible GPU (RTX 2070), can calculate the BM25 scores for close to 9 million MSMARCO documents at a rate of 800 questions per second with our specialized algorithm. Notably, SPARe proves highly effective for denser LSR indexes, significantly surpassing the performance of established systems such as PISA, Pyserini and PyTerrier. SPARe is publicly available at https://github.com/ieeta-pt/SPARe .|词汇稀疏检索器依赖于专门为CPU设计的倒排索引结构的高效搜索算法。这种以CPU为中心的设计在将这些算法适配到高度并行的加速器(如GPU)时带来了挑战,从而阻碍了潜在的性能提升。为了解决这一问题,我们提出利用深度学习框架在稀疏计算方面的最新进展,直接在加速器上实现稀疏检索。本文介绍了SPARe(SPArse Retrievers)Python包,它提供了一个高级API,通过在核心中利用深度学习框架来处理(单或多)加速器上的稀疏检索器。实验结果表明,SPARe在可访问的GPU(RTX 2070)上运行时,使用我们的专用算法可以以每秒800个问题的速度计算近900万篇MSMARCO文档的BM25分数。值得注意的是,SPARe在较密集的LSR索引上表现出极高的效率,显著超越了PISA、Pyserini和PyTerrier等现有系统的性能。SPARe已在https://github.com/ieeta-pt/SPARe 公开提供。|code|0| |Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT|Ben Giacalone, Greg Paiement, Quinn Tucker, Richard Zanibbi|Rochester Inst Technol, Rochester, NY 14623 USA|ColBERT is a highly effective and interpretable retrieval model based on token embeddings. For scoring, the model adds cosine similarities between the most similar pairs of query and document token embeddings. Previous work on interpreting how tokens affect scoring pays little attention to non-text tokens used in ColBERT such as [MASK]. 
Using MS MARCO and the TREC 2019-2020 deep passage retrieval task, we show that [MASK] embeddings may be replaced by other query and structural token embeddings to obtain similar effectiveness, and that [Q] and [MASK] are sensitive to token order, while [CLS] and [SEP] are not.|ColBERT是一种基于词元嵌入的高效且可解释的检索模型。在评分过程中,该模型通过计算查询和文档词元嵌入之间最相似对的余弦相似度来进行累加。先前关于解释词元如何影响评分的研究很少关注ColBERT中使用的非文本词元,例如[MASK]。通过使用MS MARCO数据集和TREC 2019-2020深度段落检索任务,我们发现[MASK]嵌入可以被其他查询和结构词元嵌入替代,从而获得相似的检索效果。此外,[Q]和[MASK]对词元顺序敏感,而[CLS]和[SEP]则不受词元顺序影响。|code|0| |A Cost-Sensitive Meta-learning Strategy for Fair Provider Exposure in Recommendation|Ludovico Boratto, Giulia Cerniglia, Mirko Marras, Alessandra Perniciano, Barbara Pes||When devising recommendation services, it is important to account for the interests of all content providers, encompassing not only newcomers but also minority demographic groups. In various instances, certain provider groups find themselves underrepresented in the item catalog, a situation that can influence recommendation results. Hence, platform owners often seek to regulate the exposure of these provider groups in the recommended lists. In this paper, we propose a novel cost-sensitive approach designed to guarantee these target exposure levels in pairwise recommendation models. This approach quantifies, and consequently mitigates, the discrepancies between the volume of recommendations allocated to groups and their contribution in the item catalog, under the principle of equity. Our results show that this approach, while aligning groups' exposure with their assigned levels, does not compromise the original recommendation utility. Source code and pre-processed data can be retrieved at https://github.com/alessandraperniciano/meta-learning-strategy-fair-provider-exposure.|在设计推荐服务时,必须考虑到所有内容提供者的利益,不仅包括新加入者,还包括少数人口群体。在各种情况下,某些提供者群体在项目目录中的代表性不足,这种情况可能会影响推荐结果。因此,平台所有者往往寻求调控这些提供者群体在推荐列表中的曝光度。在本文中,我们提出了一种新颖的代价敏感方法,旨在在成对推荐模型中保证这些目标曝光水平。基于公平原则,该方法量化并进而缓解分配给各群体的推荐量与其在项目目录中贡献之间的差异。我们的结果表明,这种方法在使各群体的曝光度与其指定水平保持一致的同时,并不损害原有的推荐效用。源代码和预处理数据可在 https://github.com/alessandraperniciano/meta-learning-strategy-fair-provider-exposure 获取。|code|0| |Multiple Testing for IR and Recommendation System Experiments|Ngozi Ihemelandu, Michael D. Ekstrand|Drexel Univ, Dept Informat Sci, Philadelphia, PA 19104 USA; Boise State Univ, Boise, ID 83725 USA|While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. 
In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that control for False Discovery Rate (FDR).|尽管已有大量研究关注于比较两个信息检索(IR)系统的统计技术,但许多IR实验往往会测试超过两个系统。这可能导致由于多重比较问题(MCP)而增加的误发现。一些IR研究已经探讨了多重比较程序;这些研究主要使用TREC数据并控制族系错误率。在本研究中,我们扩展了他们的研究范围,包括推荐系统评估数据以及控制误发现率(FDR)的多重比较程序。|code|0| |An In-Depth Comparison of Neural and Probabilistic Tree Models for Learning-to-rank|Haonan Tan, Kaiyu Yang, Haitao Yu|Univ Tsukuba, Inst Lilbray Informat & Media Sci, 1-2 Kasuga, Tsukuba, Ibaraki 3050821, Japan; Univ Tsukuba, Grad Sch Comprehens Human Sci, 1-2 Kasuga, Tsukuba, Ibaraki 3050821, Japan|Learning-to-rank has been intensively studied and has demonstrated significant value in several fields, such as web search and recommender systems. Over learning-to-rank datasets given as vectors of feature values, LambdaMART, proposed more than a decade ago, and its subsequent descendants based on gradient-boosted decision trees (GBDT) have demonstrated leading performance. Recently, different novel tree models have been developed, such as neural tree ensembles that utilize neural networks to emulate decision tree models and probabilistic gradient boosting machines (PGBM). However, the effectiveness of these tree models for learning-to-rank has not been comprehensively explored. Hence, this study bridges the gap by systematically comparing several representative neural tree ensembles (e.g., TabNet, NODE, and GANDALF), PGBM, and traditional learning-to-rank models on two benchmark datasets. The experimental results reveal that, benefiting from end-to-end gradient-based optimization and the power of feature representation and adaptive feature selection, the neural tree ensemble does have an advantage for learning-to-rank over conventional tree-based ranking models such as LambdaMART. This finding is important as LambdaMART has achieved leading performance over a long period.|学习排序(learning-to-rank)技术已被深入研究,并在多个领域展示了显著的价值,如网络搜索和推荐系统。在给定特征值向量的学习排序数据集上,十多年前提出的LambdaMART及其基于梯度提升决策树(GBDT)的后续改进模型,已经展示了领先的性能。近年来,不同的新型树模型被开发出来,例如利用神经网络模拟决策树模型的神经树集成模型(neural tree ensembles)和概率梯度提升机(PGBM)。然而,这些树模型在学习排序任务中的有效性尚未得到全面探索。因此,本研究通过系统比较几种代表性的神经树集成模型(如TabNet、NODE和GANDALF)、PGBM以及传统的学习排序模型,填补了这一空白。实验结果表明,得益于端到端的基于梯度的优化以及特征表示和自适应特征选择的能力,神经树集成模型在学习排序任务中确实比传统的基于树的排序模型(如LambdaMART)更具优势。这一发现具有重要意义,因为LambdaMART在很长一段时间内一直保持着领先的性能。|code|0| |GenRec: Large Language Model for Generative Recommendation|Jianchao Ji, Zelong Li, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Juntao Tan, Yongfeng Zhang||In recent years, large language models (LLMs) have emerged as powerful tools for diverse natural language processing tasks. However, their potential for recommender systems under the generative recommendation paradigm remains relatively unexplored. This paper presents an innovative approach to recommendation systems using large language models (LLMs) based on text data. In this paper, we present a novel LLM for generative recommendation (GenRec) that utilizes the expressive power of LLMs to directly generate the target item to recommend, rather than calculating a ranking score for each candidate item one by one as in traditional discriminative recommendation. GenRec uses the LLM's understanding ability to interpret context, learn user preferences, and generate relevant recommendations. Our proposed approach leverages the vast knowledge encoded in large language models to accomplish recommendation tasks. 
We first formulate specialized prompts to enhance the ability of the LLM to comprehend recommendation tasks. Subsequently, we use these prompts to fine-tune the LLaMA backbone LLM on a dataset of user-item interactions, represented by textual data, to capture user preferences and item characteristics. Our research underscores the potential of LLM-based generative recommendation in revolutionizing the domain of recommendation systems and offers a foundational framework for future explorations in this field. We conduct extensive experiments on benchmark datasets, and the experiments show that GenRec achieves significantly better results on large datasets.|近年来,大型语言模型(LLM)已经成为处理各种自然语言处理任务的强大工具。然而,它们在生成式推荐范式下应用于推荐系统的潜力仍相对未被探索。本文提出了一种基于文本数据、利用大语言模型(LLM)构建推荐系统的创新方法。我们提出了一种新的生成式推荐大语言模型(GenRec),它利用LLM的表达能力直接生成要推荐的目标项目,而不是像传统的判别式推荐那样逐个计算每个候选项目的排名得分。GenRec利用LLM的理解能力来解释上下文、学习用户偏好并生成相关推荐。我们提出的方法利用大型语言模型中编码的海量知识来完成推荐任务。我们首先设计专门的提示来增强LLM理解推荐任务的能力。随后,我们使用这些提示,在以文本数据表示的用户-项目交互数据集上对LLaMA主干LLM进行微调,以捕获用户偏好和项目特征。我们的研究强调了基于LLM的生成式推荐在革新推荐系统领域方面的潜力,并为该领域未来的探索提供了一个基础框架。我们在基准数据集上进行了广泛的实验,实验结果表明我们的GenRec在大型数据集上取得了显著更好的结果。|code|0| |News Gathering: Leveraging Transformers to Rank News|Carlos Muñoz, María José Apolo, Maximiliano Ojeda, Hans Lobel, Marcelo Mendoza|Pontificia Univ Catolica Chile, Vicuna Mackenna 6840, Santiago, Chile; Univ Tecn Federico Santa Maria, Vicuna Mackenna 3939, Santiago, Chile|News media outlets disseminate information across various platforms. Often, these posts present complementary content and perspectives on the same news story. However, to compile a set of related news articles, users must thoroughly scour multiple sources and platforms, manually identifying which publications pertain to the same story. This tedious process hinders the speed at which journalists can perform essential tasks, notably fact-checking. To tackle this problem, we created a dataset containing both related and unrelated news pairs. This dataset allows us to develop information retrieval models grounded in the principle of binary relevance. Recognizing that many Transformer-based models might be suited for this task but could overemphasize relationships based on lexical connections, we tailored a dataset to fine-tune these models to focus on semantically relevant connections in the news domain. To craft this dataset, we introduced a methodology to identify pairs of news stories that are lexically similar yet refer to different events and pairs that discuss the same event but have distinct lexical structures. This design compels Transformers to recognize semantic connections between stories, even when their lexical similarities might be absent. Following a human-annotation assessment, we reveal that BERT outperformed other techniques, excelling even in challenging test cases. 
To ensure the reproducibility of our approach, we have made the dataset and top-performing models publicly available.|新闻媒体机构通过各种平台传播信息。通常情况下,这些帖子会针对同一新闻事件提供互补的内容和视角。然而,要汇编一组相关的新闻报道,用户必须彻底搜索多个来源和平台,手动识别哪些出版物涉及同一事件。这一繁琐的过程阻碍了记者执行关键任务(如事实核查)的速度。为了解决这一问题,我们创建了一个包含相关和不相关新闻对的数据集。该数据集使我们能够开发基于二元相关性原则的信息检索模型。考虑到许多基于Transformer的模型可能适合此任务,但可能会过度强调基于词汇连接的关系,我们定制了一个数据集,以微调这些模型,使其专注于新闻领域中的语义相关连接。为了构建该数据集,我们引入了一种方法来识别词汇相似但涉及不同事件的新闻对,以及讨论同一事件但具有不同词汇结构的新闻对。这种设计迫使Transformer模型识别故事之间的语义连接,即使它们的词汇相似性可能不存在。经过人工注释评估后,我们发现BERT优于其他技术,即使在具有挑战性的测试案例中也表现出色。为确保我们方法的可重复性,我们已将数据集和表现最佳的模型公开发布。|code|0| |Answer Retrieval in Legal Community Question Answering|Arian Askari, Zihui Yang, Zhaochun Ren, Suzan Verberne||The task of answer retrieval in the legal domain aims to help users to seek relevant legal advice from massive amounts of professional responses. Two main challenges hinder applying existing answer retrieval approaches in other domains to the legal domain: (1) a huge knowledge gap between lawyers and non-professionals; and (2) a mix of informal and formal content on legal QA websites. To tackle these challenges, we propose CE_FS, a novel cross-encoder (CE) re-ranker based on fine-grained structured inputs. CE_FS uses additional structured information in the CQA data to improve the effectiveness of cross-encoder re-rankers. Furthermore, we propose LegalQA: a real-world benchmark dataset for evaluating answer retrieval in the legal domain. Experiments conducted on LegalQA show that our proposed method significantly outperforms strong cross-encoder re-rankers fine-tuned on MS MARCO. Our novel finding is that adding the question tags of each question besides the question description and title into the input of cross-encoder re-rankers structurally boosts the rankers' effectiveness. While we study our proposed method in the legal domain, we believe that our method can be applied in similar applications in other domains.|法律领域的答案检索任务旨在帮助用户从海量的专业答复中获取相关的法律建议。将其他领域中现有的答案检索方法应用于法律领域面临两个主要挑战:(1)律师与非专业人士之间巨大的知识鸿沟;(2)法律问答网站上正式与非正式内容的混杂。为了应对这些挑战,我们提出了CE_FS,一种基于细粒度结构化输入的新型交叉编码器(CE)重排序器。CE_FS利用CQA数据中额外的结构化信息来提高交叉编码器重排序器的有效性。此外,我们提出了LegalQA:一个用于评估法律领域答案检索的真实世界基准数据集。在LegalQA上进行的实验表明,我们提出的方法显著优于在MS MARCO上微调的强交叉编码器重排序器。我们的新发现是:在交叉编码器重排序器的输入中,除问题描述和标题之外再加入每个问题的问题标签,可以从结构上提升排序器的有效性。虽然我们是在法律领域研究所提出的方法,但我们相信该方法也可以应用于其他领域的类似场景。|code|0| |Towards Optimizing Ranking in Grid-Layout for Provider-Side Fairness|Amifa Raj, Michael D. Ekstrand|Microsoft, Redmond, WA 98052 USA; Drexel Univ, Dept Informat Sci, Philadelphia, PA 19104 USA|Information access systems, such as search engines and recommender systems, order and position results based on their estimated relevance. These results are then evaluated for a range of concerns, including provider-side fairness: whether exposure to users is fairly distributed among items and the people who created them. Several fairness-aware ranking and re-ranking techniques have been proposed to ensure fair exposure for providers, but this work focuses almost exclusively on linear layouts in which items are displayed in a single ranked list. Many widely-used systems use other layouts, such as the grid views common in streaming platforms, image search, and other applications. Providing fair exposure to providers in such layouts is not well-studied. 
We seek to fill this gap by providing a grid-aware re-ranking algorithm that optimizes layouts for provider-side fairness, adapting existing re-ranking techniques to grid-aware browsing models, together with an analysis of the effect of grid-specific factors such as device size on the resulting fairness optimization. Our work provides a starting point and identifies open gaps in ensuring provider-side fairness in grid-based layouts.|信息访问系统,如搜索引擎和推荐系统,根据其估计的相关性对结果进行排序和定位。这些结果随后会被评估以考虑一系列问题,包括提供方公平性:即用户对项目和项目创建者的曝光是否公平分配。已有多种公平性感知的排序和重排序技术被提出,以确保提供方的公平曝光,但这些工作几乎完全集中在线性布局上,即项目以单一排序列表的形式展示。许多广泛使用的系统采用其他布局,如流媒体平台、图像搜索及其他应用中常见的网格视图。在这些布局中为提供方提供公平曝光的研究尚不充分。我们旨在填补这一空白,通过提供一种网格感知的重排序算法,将现有的重排序技术适应于网格感知的浏览模型,以优化提供方公平性的布局,并分析设备大小等网格特定因素对最终公平性优化的影响。我们的工作为基于网格布局中确保提供方公平性提供了一个起点,并指出了尚未解决的问题。|code|0| |A Conversational Robot for Children's Access to a Cultural Heritage Multimedia Archive|Thomas Beelen, Roeland Ordelman, Khiet P. Truong, Vanessa Evers, Theo Huibers|Univ Twente, Enschede, Netherlands|In this paper we introduce a conversational robot designed to assist children in searching a museum's cultural heritage video archive. The robot employs a form of Spoken Conversational Search to facilitate the clarification of children's interest (their information need) in specific videos from the archive. Children are typically insufficiently supported in this process by common search technologies such as search-bar and keyboard, or one-shot voice interfaces. We present our approach, which leverages a knowledge-graph representation of the museum's video archive to facilitate conversational search interactions and suggest content based on the interaction, in order to study information-seeking conversations with children. We plan to use the robot test-bed to investigate the effectiveness of conversational designs over one-shot voice interactions for clarifying children's information needs in a museum context.|本文介绍了一种对话机器人,旨在帮助儿童搜索博物馆文化遗产视频档案。该机器人采用一种口语对话搜索形式,以帮助澄清儿童对档案中特定视频的兴趣(即他们的信息需求)。常见的搜索技术,如搜索栏和键盘,或一次性语音界面,通常无法充分支持儿童在这一过程中的需求。我们提出了一种方法,利用博物馆视频档案的知识图谱表示来促进对话搜索交互,并根据交互建议内容,以研究与儿童的信息寻求对话。我们计划使用该机器人测试平台,研究在博物馆环境中,对话设计相对于一次性语音交互在澄清儿童信息需求方面的有效性。|code|0| |MathMex: Search Engine for Math Definitions|Shea Durgin, James Gore, Behrooz Mansouri|Univ Southern Maine, Portland, ME 04103 USA|This paper introduces MathMex, an open-source search engine for math definitions. With MathMex, users can search for definitions of mathematical concepts extracted from a variety of data sources and types including text, images, and videos. Definitions are extracted using a fine-tuned SciBERT classifier, and the search is done with a fine-tuned Sentence-BERT model. The MathMex interface provides means of issuing text, formula, and combined queries, along with logging features.|本文介绍了MathMex,一个用于数学定义的开源搜索引擎。通过MathMex,用户可以搜索从多种数据源和类型(包括文本、图像和视频)中提取的数学概念定义。定义提取使用了一个经过微调的SciBERT分类器,而搜索则通过一个经过微调的Sentence-BERT模型完成。MathMex界面提供了文本查询、公式查询以及组合查询的功能,并具备日志记录功能。|code|0| |XSearchKG: A Platform for Explainable Keyword Search over Knowledge Graphs|Leila Feddoul, Martin Birke, Sirko Schindler|Friedrich Schiller Univ Jena, Heinz Nixdorf Chair Distributed Informat Syst, Jena, Germany; German Aerosp Ctr DLR, Inst Data Sci, Jena, Germany|One of the most user-friendly methods to search over knowledge graphs is the usage of keyword queries. They offer a simple text input that requires no technical or domain knowledge. Most existing approaches for keyword search over graph-shaped data rely on graph traversal algorithms to find connections between keywords. 
They mostly concentrate on achieving efficiency and effectiveness (accurate ranking), but ignore usability, visualization, and interactive result presentation, all of which offer better support to non-experienced users. Moreover, it is not sufficient to just show a raw list of results, but it is also important to explain why a specific result is proposed. This not only provides an abstract view of the capabilities and limitations of the search system, but also increases confidence and helps discover new interesting facts. We propose XSearchKG, a platform for explainable keyword search over knowledge graphs that extends our previously proposed graph traversal-based approach and complements it with an interactive user interface for results explanation and browsing.|在知识图谱上进行搜索的最用户友好方法之一是使用关键字查询。这种方法提供了一个简单的文本输入,不需要任何技术或领域知识。大多数现有的基于图形数据的关键字搜索方法依赖于图遍历算法来找到关键字之间的连接。这些方法主要集中在实现效率和有效性(准确排名)上,但忽略了可用性、可视化和交互式结果展示,而这些都为非专业用户提供了更好的支持。此外,仅仅显示一个原始的结果列表是不够的,解释为什么提出某个特定结果也很重要。这不仅提供了搜索系统能力和局限性的抽象视图,还增加了信心,并有助于发现新的有趣事实。我们提出了XSearchKG,这是一个用于知识图谱上可解释关键字搜索的平台,它扩展了我们之前提出的基于图遍历的方法,并通过一个交互式用户界面补充了结果解释和浏览功能。|code|0| |Result Assessment Tool: Software to Support Studies Based on Data from Search Engines|Sebastian Sünkler, Nurce Yagci, Sebastian Schultheiß, Sonja von Mach, Dirk Lewandowski|Hamburg Univ Appl Sci, Dept Informat Media & Commun, Finkenau 35, D-22081 Hamburg, Germany|The Result Assessment Tool (RAT) is a software toolkit for conducting research with results from commercial search engines and other information retrieval (IR) systems. The software integrates modules for study design and management, automatic collection of search results via web scraping, and evaluation of search results in an assessment interface using different question types. RAT can be used for conducting a wide range of studies, including retrieval effectiveness studies, classification studies, and content analyses.|结果评估工具(RAT)是一款用于基于商业搜索引擎和其他信息检索(IR)系统结果进行研究的软件工具包。该软件集成了研究设计与管理模块、通过网页抓取自动收集搜索结果的模块,以及在使用不同问题类型的评估界面中对搜索结果进行评估的模块。RAT可用于开展多种研究,包括检索效果研究、分类研究和内容分析。|code|0| |Translating Justice: A Cross-Lingual Information Retrieval System for Maltese Case Law Documents|Joel Azzopardi|Univ Malta, Fac ICT, Dept Artificial Intelligence, Msida, Malta|In jurisdictions adhering to the Common Law system, previous court judgements inform future rulings based on the Stare Decisis principle. For enhanced accessibility and retrieval of such judgements, we introduced a cross-lingual Legal Information Retrieval system prototype focused on Malta's small claims tribunal. This system utilises Neural Machine Translation (NMT) to automatically translate Maltese judgement documents into English, enabling dual-language querying. Additionally, it employs Rhetorical Role Labelling (RRL) on sentences within the judgements, allowing for targeted searches based on specific rhetorical roles. 
Developed without depending on high-end resources or commercial systems, this prototype showcases the potential of AI in advancing legal research tools and making legal documents more accessible, especially for non-native speakers.|在遵循普通法体系的司法管辖区中,先前的法院判决根据遵循先例原则(Stare Decisis)为未来的裁决提供参考。为了提高此类判决的可访问性和检索效率,我们引入了一个跨语言的法律信息检索系统原型,重点关注马耳他的小额索赔法庭。该系统利用神经机器翻译(Neural Machine Translation, NMT)技术,自动将马耳他语的判决文件翻译成英语,从而实现双语查询。此外,该系统还对判决中的句子进行修辞角色标注(Rhetorical Role Labelling, RRL),从而支持基于特定修辞角色的定向搜索。该原型系统在开发过程中未依赖高端资源或商业系统,展示了人工智能在推进法律研究工具和使法律文件更易于访问方面的潜力,尤其对非母语使用者而言。|code|0| |Displaying Evolving Events Via Hierarchical Information Threads for Sensitivity Review|Hitarth Narvala, Graham McDonald, Iadh Ounis|Univ Glasgow, Glasgow, Lanark, Scotland|Many government documents contain sensitive (e.g. personal or confidential) information that must be protected before the documents can be released to the public. However, reviewing documents to identify sensitive information is a complex task, which often requires analysing multiple related documents that mention a particular context of sensitivity. For example, coherent information about evolving events, such as legal proceedings, is often dispersed across documents produced at different times. In this paper, we present a novel system for sensitivity review, which automatically identifies hierarchical information threads to capture diverse aspects of an event. In particular, our system aims to assist sensitivity reviewers in making accurate sensitivity judgements efficiently by presenting hierarchical information threads that provide coherent and chronological information about an event's evolution. Through a user study, we demonstrate our system's effectiveness in improving the sensitivity reviewers' reviewing speed and accuracy compared to the traditional document-by-document review process.|许多政府文件包含敏感(例如个人或机密)信息,这些信息在文件公开之前必须得到保护。然而,审查文件以识别敏感信息是一项复杂的任务,通常需要分析多个相关文件,这些文件可能涉及特定的敏感背景。例如,关于不断演变的事件(如法律诉讼)的连贯信息通常分散在不同时间生成的文件中。本文提出了一种新颖的敏感性审查系统,该系统能够自动识别层次化的信息线索,以捕捉事件的不同方面。特别是,我们的系统旨在通过提供关于事件演变的连贯且按时间顺序排列的层次化信息线索,帮助敏感性审查员高效地做出准确的敏感性判断。通过一项用户研究,我们证明了与传统逐文件审查流程相比,我们的系统在提高审查员的审查速度和准确性方面的有效性。|code|0| |Analyzing Mathematical Content for Plagiarism and Recommendations|Ankit Satpute|Georg August Univ Gottingen, Gottingen, Germany|Defined as "the use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected" [6], plagiarism poses a severe concern in the rapidly increasing number of scientific publications.|抄袭被定义为“在预期原创性的环境中,未经适当承认来源而使用思想、概念、词语或结构以获取利益”[6],在迅速增加的科学出版物中,抄袭成为一个严重的问题。|code|0| |Explainable Recommender Systems with Knowledge Graphs and Language Models|Giacomo Balloccu, Ludovico Boratto, Gianni Fenu, Francesca Maridina Malloci, Mirko Marras|Meta Platforms Inc, Menlo Pk, CA USA; Univ Potsdam, HPI, Potsdam, Germany; Rutgers State Univ, New Brunswick, NJ 08901 USA|To facilitate human decisions with credible suggestions, personalized recommender systems should have the ability to generate corresponding explanations while making recommendations. Knowledge graphs (KG), which contain comprehensive information about users and products, are widely used to enable this. By reasoning over a KG in a node-by-node manner, existing explainable models provide a KG-grounded path for each user-recommended item. Such paths serve as an explanation and reflect the historical behavior pattern of the user. 
However, not all items can be reached following the connections within the constructed KG under finite hops. Hence, previous approaches are constrained by a recall bias in terms of existing connectivity of KG structures. To overcome this, we propose a novel Path Language Modeling Recommendation (PLM-Rec) framework, learning a language model over KG paths consisting of entities and edges. Through path sequence decoding, PLM-Rec unifies recommendation and explanation in a single step and fulfills them simultaneously. As a result, PLM-Rec not only captures the user behaviors but also eliminates the restriction to pre-existing KG connections, thereby alleviating the aforementioned recall bias. Moreover, the proposed technique makes it possible to conduct explainable recommendation even when the KG is sparse or possesses a large number of relations. Experiments and extensive ablation studies on three Amazon e-commerce datasets demonstrate the effectiveness and explainability of the PLM-Rec framework.|为了通过可信的建议来辅助人类决策,个性化推荐系统应具备在生成推荐的同时提供相应解释的能力。知识图谱(KG)包含关于用户和产品的全面信息,因此被广泛用于实现这一目标。通过在知识图谱上以逐节点的方式进行推理,现有的可解释模型为每个用户推荐的项目提供了一条基于知识图谱的路径。这些路径作为解释,反映了用户的历史行为模式。然而,并非所有项目都能在有限跳数内通过构建的知识图谱中的连接到达。因此,先前的方法受到知识图谱结构现有连接性的召回偏差的限制。为了克服这一问题,我们提出了一种新颖的路径语言建模推荐(PLM-Rec)框架,该框架在由实体和边组成的知识图谱路径上学习语言模型。通过路径序列解码,PLM-Rec将推荐和解释统一在一个步骤中,并同时完成这两项任务。因此,PLM-Rec不仅能够捕捉用户行为,还能消除对预先存在知识图谱连接的限制,从而缓解上述的召回偏差。此外,所提出的技术使得即使在知识图谱稀疏或具有大量关系的情况下,也能进行可解释的推荐。在三个亚马逊电子商务数据集上的实验和广泛的消融研究证明了PLM-Rec框架的有效性和可解释性。|code|0| |Recent Advances in Generative Information Retrieval|Yubao Tang, Ruqing Zhang, Zhaochun Ren, Jiafeng Guo, Maarten de Rijke|Leiden Univ, Leiden, Netherlands; Univ Chinese Acad Sci, CAS Key Lab Network Data Sci & Technol, ICT, CAS, Beijing, Peoples R China; Univ Amsterdam, Amsterdam, Netherlands|Generative retrieval (GR) has become a highly active area of information retrieval (IR) that has witnessed significant growth recently. Compared to the traditional “index-retrieve-then-rank” pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to directly map a query to its relevant document identifiers (i.e., docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications. We start by providing preliminary information covering foundational aspects and problem formulations of GR. Then, our focus shifts towards recent progress in docid design, training approaches, inference strategies, and the applications of GR. We end by outlining remaining challenges and issuing a call for future GR research. 
This tutorial is intended to be beneficial to both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.|生成式检索(Generative Retrieval, GR)已成为信息检索(IR)领域中一个高度活跃的研究方向,近年来取得了显著的发展。与传统的“索引-检索-排序”流程相比,GR范式旨在将所有语料库中的信息整合到一个单一模型中。通常,训练一个序列到序列(sequence-to-sequence)模型,以直接将查询映射到其相关的文档标识符(即docids)。本教程介绍了GR范式的核心概念,并全面概述了其基础和应用方面的最新进展。我们首先提供了涵盖GR基础知识和问题表述的初步信息。然后,重点转向了docid设计、训练方法、推理策略以及GR应用方面的最新进展。最后,我们概述了当前面临的挑战,并呼吁未来对GR研究的进一步探索。本教程旨在为有兴趣开发新型GR解决方案或将其应用于实际场景的研究人员和行业从业者提供有益的参考。|code|0| |Affective Computing for Social Good Applications: Current Advances, Gaps and Opportunities in Conversational Setting|Priyanshu Priya, Mauajama Firdaus, Gopendra Vikram Singh, Asif Ekbal|Indian Inst Technol Patna, Dayalpur Daulatpur, India; Univ Alberta, Edmonton, AB, Canada|Affective computing involves examining and advancing systems and devices capable of identifying, comprehending, processing, and emulating human emotions, sentiment, politeness and personality characteristics. This is an ever-expanding multidisciplinary domain that investigates how technology can contribute to the comprehension of human affect, how affect can influence interactions between humans and machines, how systems can be engineered to harness affect for enhanced capabilities, and how integrating affective strategies can revolutionize interactions between humans and machines. Recognizing the fact that affective computing encompasses disciplines such as computer science, psychology, and cognitive science, this tutorial aims to delve into the historical underpinnings and overarching objectives of affective computing, explore various approaches for affect detection and generation, its practical applications across diverse areas, including but not limited to social good (like persuasion, therapy and support, etc.), address ethical concerns, and outline potential future directions.|情感计算涉及研究和开发能够识别、理解、处理和模拟人类情感、情绪、礼貌及个性特征的系统与设备。这是一个不断扩展的多学科领域,研究技术如何有助于理解人类情感、情感如何影响人机交互、如何设计系统以利用情感来增强能力,以及如何通过整合情感策略来革新人机交互。鉴于情感计算涵盖了计算机科学、心理学和认知科学等学科,本教程旨在深入探讨情感计算的历史基础和总体目标,探索情感检测和生成的各种方法,其在多个领域的实际应用(包括但不限于社会公益领域,如说服、治疗和支持等),讨论伦理问题,并概述未来可能的发展方向。|code|0| |Query Performance Prediction: From Fundamentals to Advanced Techniques|Negar Arabzadeh, Chuan Meng, Mohammad Aliannejadi, Ebrahim Bagheri|Univ Amsterdam, Amsterdam, Netherlands; Univ Waterloo, Waterloo, ON, Canada; Toronto Metropolitan Univ, Toronto, ON, Canada|Query performance prediction (QPP) is a core task in information retrieval (IR) that aims at predicting the retrieval quality for a given query without relevance judgments. QPP has been investigated for decades and has witnessed a surge in research activity in recent years; QPP has been shown to benefit various aspects, e.g., improving retrieval effectiveness by selecting the most effective ranking function per query [5, 7]. Despite its importance, there is no recent tutorial to provide a comprehensive overview of QPP techniques in the era of pre-trained/large language models or in the scenario of emerging conversational search (CS); In this tutorial, we have three main objectives. First, we aim to disseminate the latest advancements in QPP to the IR community. Second, we go beyond investigating QPP in ad-hoc search and cover QPP for CS. 
Third, the tutorial offers a unique opportunity to bridge the gap between theory and practice; we aim to equip participants with the essential skills and insights needed to navigate the evolving landscape of QPP, ultimately benefiting both researchers and practitioners in the field of IR and encouraging them to explore future avenues of QPP research.|查询性能预测(Query Performance Prediction, QPP)是信息检索(Information Retrieval, IR)中的一项核心任务,旨在在没有相关性判断的情况下预测给定查询的检索质量。QPP已经被研究了数十年,并且近年来研究活动显著增加;QPP已被证明在多个方面具有重要价值,例如通过为每个查询选择最有效的排序函数来提高检索效果[5, 7]。尽管QPP的重要性不言而喻,但在预训练/大语言模型时代或新兴的对话式搜索(Conversational Search, CS)场景中,目前还没有最新的教程对QPP技术进行全面概述。在本教程中,我们有三个主要目标。首先,我们旨在向IR社区传播QPP的最新进展。其次,我们不仅研究QPP在临时搜索中的应用,还涵盖了QPP在CS中的应用。第三,本教程为弥合理论与实践之间的差距提供了一个独特的机会;我们旨在为参与者提供应对QPP不断演变的技术格局所需的基本技能和洞察力,最终使IR领域的研究人员和从业人员受益,并鼓励他们在QPP的未来发展方向上进行探索。|code|0| |Fairness Through Domain Awareness: Mitigating Popularity Bias for Music Discovery|Rebecca Salganik, Fernando Diaz, Golnoosh Farnadi||As online music platforms grow, music recommender systems play a vital role in helping users navigate and discover content within their vast musical databases. At odds with this larger goal is the presence of popularity bias, which causes algorithmic systems to favor mainstream content over potentially more relevant but niche items. In this work we explore the intrinsic relationship between music discovery and popularity bias. To mitigate this issue we propose a domain-aware, individual fairness-based approach which addresses popularity bias in graph neural network (GNN) based recommender systems. Our approach uses individual fairness to reflect a ground truth listening experience, i.e., if two songs sound similar, this similarity should be reflected in their representations. In doing so, we facilitate meaningful music discovery that is robust to popularity bias and grounded in the music domain. We apply our BOOST methodology to two discovery based tasks, performing recommendations at both the playlist level and user level. Then, we ground our evaluation in the cold start setting, showing that our approach outperforms existing fairness benchmarks in both performance and recommendation of lesser-known content. Finally, our analysis explains why our proposed methodology is a novel and promising approach to mitigating popularity bias and improving the discovery of new and niche content in music recommender systems.|随着在线音乐平台的发展,音乐推荐系统在帮助用户浏览和发现其庞大音乐数据库中的内容方面发挥着至关重要的作用。与这一更大目标相悖的是流行度偏差的存在,它导致算法系统偏爱主流内容,而非那些可能更相关但更小众的项目。在这项工作中,我们探讨了音乐发现与流行度偏差之间的内在关系。为了缓解这一问题,我们提出了一种领域感知的、基于个体公平性的方法,用于解决基于图神经网络(GNN)的推荐系统中的流行度偏差问题。我们的方法使用个体公平性来反映真实的聆听体验,即如果两首歌听起来相似,这种相似性应该反映在它们的表示中。通过这样做,我们促进了对流行度偏差具有鲁棒性且立足于音乐领域的有意义的音乐发现。我们将BOOST方法应用于两个基于发现的任务,在播放列表级别和用户级别执行推荐。然后,我们在冷启动设置下进行评估,结果表明我们的方法在性能和小众内容推荐方面都优于现有的公平性基准。最后,我们的分析解释了为什么我们提出的方法是缓解流行度偏差、改善音乐推荐系统中新内容和小众内容发现的一种新颖且有前途的方法。|code|0| |Countering Mainstream Bias via End-to-End Adaptive Local Learning|Jinhao Pan, Ziwei Zhu, Jianling Wang, Allen Lin, James Caverlee||Collaborative filtering (CF) based recommendations suffer from mainstream bias – where mainstream users are favored over niche users, leading to poor recommendation quality for many long-tail users. In this paper, we identify two root causes of this mainstream bias: (i) discrepancy modeling, whereby CF algorithms focus on modeling mainstream users while neglecting niche users with unique preferences; and (ii) unsynchronized learning, where niche users require more training epochs than mainstream users to reach peak performance. 
Targeting these causes, we propose a novel end-To-end Adaptive Local Learning (TALL) framework to provide high-quality recommendations to both mainstream and niche users. TALL uses a loss-driven Mixture-of-Experts module to adaptively ensemble experts to provide customized local models for different users. Further, it contains an adaptive weight module to synchronize the learning paces of different users by dynamically adjusting weights in the loss. Extensive experiments demonstrate the state-of-the-art performance of the proposed model. Code and data are provided at https://github.com/JP-25/end-To-end-Adaptive-Local-Leanring-TALL-|基于协同过滤(CF)的推荐受到主流偏见的影响——主流用户比小众用户更受青睐,导致许多长尾用户的推荐质量较差。在本文中,我们确定了这种主流偏见的两个根本原因: (i)差异建模,即 CF 算法侧重于建模主流用户,而忽视具有独特偏好的小众用户; (ii)非同步学习,即小众用户需要比主流用户更多的训练周期才能达到峰值性能。针对这些原因,我们提出了一个新颖的端到端自适应本地学习(TALL)框架,为主流和小众用户提供高质量的推荐。TALL 使用一个损失驱动的专家混合(Mixture-of-Experts)模块来自适应地集成专家,为不同的用户提供定制的本地模型。此外,它还包含一个自适应权重模块,通过动态调整损失中的权重来同步不同用户的学习步伐。大量实验证明了该模型达到了最先进的性能。代码和数据见 https://github.com/JP-25/end-To-end-Adaptive-Local-Leanring-TALL-|code|0| |BioASQ at CLEF2024: The Twelfth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge|Anastasios Nentidis, Anastasia Krithara, Georgios Paliouras, Martin Krallinger, Luis Gascó Sánchez, Salvador LimaLópez, Eulàlia Farré, Natalia V. Loukachevitch, Vera Davydova, Elena Tutubalina|Sber AI, Moscow, Russia; Moscow MV Lomonosov State Univ, Moscow, Russia; Natl Ctr Sci Res Demokritos, Athens, Greece; Barcelona Supercomp Ctr, Barcelona, Spain|The large-scale biomedical semantic indexing and question-answering challenge (BioASQ) aims at the continuous advancement of methods and tools to meet the needs of biomedical researchers and practitioners for efficient and precise access to the ever-increasing resources of their domain. With this purpose, during the last eleven years, a series of annual challenges have been organized with specific shared tasks on large-scale biomedical semantic indexing and question answering. Benchmark datasets have been concomitantly provided in alignment with the real needs of biomedical experts, providing a unique common testbed where different teams around the world can investigate and compare new approaches for accessing biomedical knowledge. The twelfth version of the BioASQ Challenge will be held as an evaluation Lab within CLEF2024 providing four shared tasks: (i) Task b on the information retrieval for biomedical questions, and the generation of comprehensible answers. (ii) Task Synergy on the information retrieval and generation of answers for open biomedical questions on developing topics, in collaboration with the experts posing the questions. (iii) Task MultiCardioNER on the automated annotation of clinical entities in medical documents in the field of cardiology, primarily in Spanish, English, Italian and Dutch. (iv) Task BioNNE on the automated annotation of biomedical documents in Russian and English with nested named entity annotations. 
As BioASQ rewards the methods that outperform the state of the art in these shared tasks, it pushes the research frontier towards approaches that accelerate access to biomedical knowledge.|大规模生物医学语义索引与问答挑战赛(BioASQ)旨在持续推动方法和工具的发展,以满足生物医学研究人员和实践者对高效、精确获取其领域日益增长资源的需求。为此,在过去十一年中,已组织了一系列年度挑战赛,专注于大规模生物医学语义索引和问答的特定共享任务。与此同时,根据生物医学专家的实际需求提供了基准数据集,为全球不同团队提供了一个独特的共同测试平台,用于研究和比较获取生物医学知识的新方法。第十二届BioASQ挑战赛将作为CLEF2024评估实验室的一部分举办,提供四个共享任务:(i) 任务b,针对生物医学问题的信息检索及生成可理解的答案。(ii) 任务Synergy,与提出问题的专家合作,针对发展中的开放生物医学问题进行信息检索和答案生成。(iii) 任务MultiCardioNER,专注于心脏病学领域医疗文档中临床实体的自动标注,主要涉及西班牙语、英语、意大利语和荷兰语。(iv) 任务BioNNE,针对俄语和英语生物医学文档的自动标注,包含嵌套命名实体标注。由于BioASQ奖励在这些共享任务中超越现有技术水平的方法,它推动了研究前沿向着加速获取生物医学知识的方向发展。|code|0| |ProMap: Product Mapping Datasets|Katerina Macková, Martin Pilát|Charles Univ Prague, Fac Math & Phys, Malostranske Namesti 25, Prague 11800 1, Czech Republic|The goal of product mapping is to decide whether two listings from two different e-shops describe the same products. Existing datasets of matching and non-matching pairs of products, however, often suffer from incomplete product information or contain only very distant non-matching products. In this paper, we introduce two new datasets for product mapping: ProMapCz consisting of 1,495 Czech product pairs and ProMapEn consisting of 1,555 English product pairs of matching and non-matching products manually scraped from two pairs of e-shops. The datasets contain both images and textual descriptions of the products, including their specifications, making them one of the most complete datasets for product mapping. Additionally, we divide the non-matching products into two different categories – close non-matches and medium non-matches, based on how similar the products are to each other. Even the medium non-matches are, however, pairs of products that are much more similar than non-matches in other datasets – for example, they still need to have the same brand and similar name and price. Finally, we train a number of product matching models on these datasets to demonstrate the advantages of having these two types of non-matches for the analysis of these models.|产品映射的目标是判断来自两个不同电商平台的商品列表是否描述的是同一产品。然而,现有的匹配和不匹配产品对数据集往往存在产品信息不完整或仅包含非常不相似的不匹配产品的问题。在本文中,我们引入了两个新的产品映射数据集:ProMapCz 包含 1,495 对捷克产品对,ProMapEn 包含 1,555 对英语产品对,这些产品对是从两对电商平台手动抓取的匹配和不匹配产品。这些数据集包含产品的图像和文本描述,包括其规格,使其成为最完整的产品映射数据集之一。此外,我们根据产品之间的相似程度,将不匹配产品分为两个不同的类别——接近不匹配和中等不匹配。然而,即使是中等不匹配的产品对,也比其他数据集中的不匹配产品对更为相似——例如,它们仍然需要具有相同的品牌以及相似的名称和价格。最后,我们在这些数据集上训练了多个产品匹配模型,以展示这两种不匹配类型对这些模型分析的优势。|code|0| |Eliminating Contextual Bias in Aspect-Based Sentiment Analysis|Ruize An, Chen Zhang, Dawei Song|Beijing Inst Technol, Beijing, Peoples R China|Pretrained language models (LMs) have made remarkable achievements in aspect-based sentiment analysis (ABSA). However, it is discovered that these models may struggle in some particular cases (e.g., to detect sentiments expressed towards targeted aspects with only implicit or adversarial expressions). Since it is hard for models to align implicit or adversarial expressions with their corresponding aspects, the sentiments of the targeted aspects would largely be impacted by the expressions towards other aspects in the sentence. We name this phenomenon as contextual bias. To tackle the problem, we propose a flexible aspect-oriented debiasing method (Arde) to eliminate the harmful contextual bias without the need of adjusting the underlying LMs. Intuitively, Arde calibrates the prediction towards the targeted aspect by subtracting the bias towards the context. 
Favorably, Arde can get theoretical support from counterfactual reasoning theory. Experiments are conducted on the SemEval benchmark, and the results show that Arde can empirically improve the accuracy on contextually biased aspect sentiments without degrading the accuracy on unbiased ones. Driven by the recent success of large language models (LLMs, e.g., ChatGPT), we further uncover that even LLMs can fail to address certain contextual bias, which can nevertheless be effectively tackled by Arde.|预训练语言模型(LMs)在基于方面的情感分析(ABSA)中取得了显著成就。然而,研究发现这些模型在某些特定情况下可能会遇到困难(例如,检测仅通过隐含或对抗性表达针对目标方面的情感)。由于模型难以将隐含或对抗性表达与其对应的方面对齐,因此目标方面的情感很大程度上会受到句子中其他方面表达的影响。我们将这种现象称为上下文偏差。为了解决这一问题,我们提出了一种灵活的面向方面的去偏方法(Arde),以消除有害的上下文偏差,而无需调整底层LMs。直观上,Arde通过减去对上下文的偏差来校准对目标方面的预测。令人欣慰的是,Arde可以从反事实推理理论中获得理论支持。我们在SemEval基准上进行了实验,结果表明,Arde能够在经验上提高对具有上下文偏差的方面情感的准确性,而不会降低对无偏差方面情感的准确性。受大型语言模型(LLMs,例如ChatGPT)近期成功的推动,我们进一步发现,即使是LLMs也可能无法解决某些上下文偏差,而Arde却能有效应对这一问题。|code|0| |A Streaming Approach to Neural Team Formation Training|Hossein Fani, Reza Barzegar, Arman Dashti, Mahdis Saeedi|Univ Windsor, Windsor, ON, Canada|Predicting future successful teams of experts who can effectively collaborate is challenging due to the experts' temporality of skill sets, levels of expertise, and collaboration ties, which is overlooked by prior work. Specifically, state-of-the-art neural-based methods learn vector representations of experts and skills in a static latent space, falling short of incorporating the possible drift and variability of experts' skills and collaboration ties in time. In this paper, we propose (1) a streaming-based training strategy for neural models to capture the evolution of experts' skills and collaboration ties over time and (2) to consume time information as an additional signal to the model for predicting future successful teams. We empirically benchmark our proposed method against state-of-the-art neural team formation methods and a strong temporal recommender system on datasets from varying domains with distinct distributions of skills and experts in teams. The results demonstrate that neural models utilizing our proposed training strategy excel in efficacy in terms of classification and information retrieval metrics. The codebase is available at https://github.com/fani-lab/OpeNTF/tree/ecir24 .|预测未来能够有效协作的成功专家团队具有挑战性,因为专家的技能集、专业水平和协作关系的时效性往往被先前的研究所忽视。具体而言,现有的基于神经网络的先进方法在静态潜在空间中学习专家和技能的向量表示,未能充分考虑专家技能和协作关系随时间可能发生的漂移和变化。在本文中,我们提出了(1)一种基于流式训练的神经网络模型策略,以捕捉专家技能和协作关系随时间的演变;(2)将时间信息作为模型的额外输入信号,用于预测未来的成功团队。我们在多个领域的团队数据集上,针对不同的技能和专家分布,将我们提出的方法与最先进的神经团队形成方法和一个强大的时序推荐系统进行了实证对比。结果表明,采用我们提出的训练策略的神经网络模型在分类和信息检索指标上表现出色。代码库可在以下网址获取:https://github.com/fani-lab/OpeNTF/tree/ecir24。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=A+Streaming+Approach+to+Neural+Team+Formation+Training)|0| |A Second Look on BASS - Boosting Abstractive Summarization with Unified Semantic Graphs - A Replication Study|Osman Alperen Koras, Jörg Schlötterer, Christin Seifert||We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. 
We highlight the significance of paying careful attention, even to details that might reasonably be omitted, when replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.|我们提出了一个详细的复制研究的 BASS 框架,一个抽象的摘要系统的概念为基础的统一语义图。我们的研究包括复制关键组件的挑战和一项消融研究,以系统地隔离根植于复制新组件的错误源。我们的发现揭示了与原始工作相比在性能上的差异。我们强调认真注意甚至合理忽略复制高级框架(如 BASS)的细节的重要性,并强调编写可复制论文的关键实践。|code|0| |Absolute Variation Distance: An Inversion Attack Evaluation Metric for Federated Learning|Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia|JPMorgan Chase, Global Technol Appl Res, New York, NY 10017 USA|Federated Learning (FL) has emerged as a pivotal approach for training models on decentralized data sources by sharing only model gradients. However, the shared gradients in FL are susceptible to inversion attacks which can expose sensitive information. While several defense and attack strategies have been proposed, their effectiveness is often evaluated using metrics that may not necessarily reflect the success rate of an attack or information retrieval, especially in the context of multidimensional data such as images. Traditional metrics like the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE) are typically used as lightweight metrics that assume only pixel-wise comparison but fail to consider the semantic context of the recovered data. This paper introduces the Absolute Variation Distance (AVD), a lightweight metric derived from total variation, to assess data recovery and information leakage in FL. Unlike traditional metrics, AVD offers a continuous measure for extracting information in noisy images and aligns closely with human perception. Our results, combined with a user experience survey, demonstrate that AVD provides a more accurate and consistent measure of data recovery. It also matches the accuracy of the more costly and complex Neural Network based metric, the Learned Perceptual Image Patch Similarity (LPIPS). Hence it offers an effective tool for automatic evaluation of data security in FL and a reliable way of studying defence and inversion attack strategies in FL.|联邦学习(Federated Learning, FL)作为一种关键方法,通过仅共享模型梯度来在分散的数据源上训练模型。然而,FL中共享的梯度容易受到反演攻击,从而可能暴露敏感信息。尽管已经提出了多种防御和攻击策略,但其有效性通常使用可能无法准确反映攻击成功率或信息检索效果的指标进行评估,尤其是在处理如图像等多维数据时。传统的指标如结构相似性指数(SSIM)、峰值信噪比(PSNR)和均方误差(MSE)通常被用作轻量级指标,仅假设像素级别的比较,但未能考虑恢复数据的语义上下文。本文引入了绝对变差距离(Absolute Variation Distance, AVD),这是一种基于总变差的轻量级指标,用于评估FL中的数据恢复和信息泄露。与传统指标不同,AVD为在噪声图像中提取信息提供了连续的度量,并且与人类感知高度一致。我们的研究结果结合用户体验调查表明,AVD提供了更准确和一致的数据恢复度量。同时,它与更昂贵且复杂的基于神经网络的度量——学习感知图像块相似性(LPIPS)的准确性相当。因此,AVD为FL中的数据安全自动评估提供了有效工具,并为研究FL中的防御和反演攻击策略提供了可靠的方法。|code|0| |Experiments in News Bias Detection with Pre-trained Neural Transformers|Tim Menzner, Jochen L. Leidner|Coburg Univ Appl Sci, Informat Access Res Grp, Friedrich Streib Str 2, D-96459 Coburg, Germany|The World Wide Web provides unrivalled access to information globally, including factual news reporting and commentary. However, state actors and commercial players increasingly spread biased (distorted) or fake (non-factual) information to promote their agendas. We compare several large, pre-trained language models on the task of sentence-level news bias detection and sub-type classification, providing quantitative and qualitative results. Our findings are to be seen as part of a wider effort towards realizing the conceptual vision, articulated by Fuhr et al.
[10], of a "nutrition label" for online content for the social good.|万维网为全球信息获取提供了无与伦比的便利,包括事实新闻报道和评论。然而,国家行为体和商业参与者越来越多地传播带有偏见(扭曲)或虚假(非事实)的信息,以推动其议程。我们比较了几种大型预训练语言模型在句子级新闻偏见检测及其子类型分类任务上的表现,并提供了定量和定性的结果。我们的研究结果应被视为实现Fuhr等人[10]所阐述的“营养标签”概念愿景的一部分,该愿景旨在为社会公益提供在线内容的透明度和可信度评估。|code|0| |A Transformer-Based Object-Centric Approach for Date Estimation of Historical Photographs|Francesc Net, Núria Hernández, Adrià Molina, Lluís Gómez|Univ Autonoma Barcelona, Comp Vis Ctr, Catalunya, Spain|The accurate estimation of the creation date of cultural heritage photographic assets is a challenging and complex task, typically requiring the expertise of qualified archivists, with significant implications for archival and preservation purposes. This paper introduces a new dataset for image date estimation, which complements existing datasets, thus creating a more balanced and realistic training set for deep learning models. On this dataset, we present a set of modern strong baselines that outperform previous state-of-the-art methods for this task. Additionally, we propose a novel approach that leverages “dating indicators” or “dating clues” through object detection and a self-attention based Transformer encoder. Our experiments demonstrate that the proposed approach has promising applicability in real scenarios and that incorporating “dating indicators” through object detection can improve the performance of image date estimation models. The dataset and code of our models are publicly available at https://github.com/cesc47/DEXPERT .|文化遗产摄影资料的准确创建日期估计是一项具有挑战性且复杂的任务,通常需要合格档案管理员的专业知识,对于档案保存具有重要意义。本文引入了一个新的图像日期估计数据集,该数据集补充了现有数据集,从而为深度学习模型创建了一个更加平衡和现实的训练集。在该数据集上,我们提出了一组现代强基线模型,这些模型在此任务上优于以往的最先进方法。此外,我们提出了一种新颖的方法,通过物体检测和基于自注意力机制的Transformer编码器来利用“年代指示器”或“年代线索”。我们的实验表明,所提出的方法在实际场景中具有良好的适用性,并且通过物体检测引入“年代指示器”可以提高图像日期估计模型的性能。我们的模型的数据集和代码公开在https://github.com/cesc47/DEXPERT。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=A+Transformer-Based+Object-Centric+Approach+for+Date+Estimation+of+Historical+Photographs)|0| |Bias Detection and Mitigation in Textual Data: A Study on Fake News and Hate Speech Detection|Apostolos Kasampalis, Despoina Chatzakou, Theodora Tsikrika, Stefanos Vrochidis, Ioannis Kompatsiaris|Ctr Res & Technol Hellas, Informat Technol Inst, Thessaloniki, Greece|Addressing bias in NLP-based solutions is crucial to promoting fairness, avoiding discrimination, building trust, upholding ethical standards, and ultimately improving their performance and reliability. On the topic of bias detection and mitigation in textual data, this work examines the effect of different bias detection models along with standard debiasing methods on the effectiveness of fake news and hate speech detection tasks. Extensive discussion of the results draws useful conclusions, highlighting the inherent difficulties in effectively managing bias.|在基于自然语言处理(NLP)的解决方案中,解决偏见问题对于促进公平性、避免歧视、建立信任、维护道德标准以及最终提高其性能和可靠性至关重要。本文围绕文本数据中的偏见检测与缓解这一主题,探讨了不同偏见检测模型以及标准去偏方法对虚假新闻和仇恨言论检测任务效果的影响。通过对结果的广泛讨论,得出了有益的结论,突显了在有效管理偏见方面所固有的困难。|code|0| |DQNC2S: DQN-Based Cross-Stream Crisis Event Summarizer|Daniele Rege Cambrin, Luca Cagliero, Paolo Garza||Summarizing multiple disaster-relevant data streams simultaneously is particularly challenging as existing Retrieve&Re-ranking strategies suffer from the inherent redundancy of multi-stream data and limited scalability in a multi-query setting. This work proposes an online approach to crisis timeline generation based on weak annotation with Deep Q-Networks. 
It selects on-the-fly the relevant pieces of text, requiring neither human annotations nor content re-ranking. This makes the inference time independent of the number of input queries. The proposed approach also incorporates a redundancy filter into the reward function to effectively handle cross-stream content overlaps. The achieved ROUGE and BERTScore results are superior to those of best-performing models on the CrisisFACTS 2022 benchmark.|同时汇总多个与灾难相关的数据流尤其具有挑战性,因为现有的检索和重新排序策略受到多流数据的固有冗余和多查询设置中有限的可伸缩性的影响。提出了一种基于深度 Q 网络弱注释的危机时间表在线生成方法。它动态地选择相关的文本片段,而不需要人工注释或内容重新排序。这使得推理时间与输入查询的数量无关。该方法还在奖励函数中引入了冗余过滤器,以有效地处理跨流内容重叠。所获得的 ROUGE 和 BERTScore 结果优于那些在 CrisisFACTS 2022基准上表现最好的模型。|code|0| |QuantPlorer: Exploration of Quantities in Text|Satya Almasian, Alexander Kosnac, Michael Gertz|Heidelberg Univ, Heidelberg, Germany|Quantities play an important role in documents of various domains such as finance, business, and medicine. Despite the role of quantities, only a limited number of works focus on their extraction from text and even less on creating respective user-friendly document exploration frameworks. In this work, we introduce QuantPlorer, an online quantity extractor and explorer. Through an intuitive web interface, QuantPlorer extracts quantities from unstructured text, enables users to interactively investigate and visualize quantities in text, and it supports filtering based on diverse features, i.e., value ranges, units, trends, and concepts. Furthermore, users can explore and visualize distributions of values for specific units and concepts. Our demonstration is available at https://quantplorer.ifi.uni-heidelberg.de/ .|在各种领域的文档中,如金融、商业和医学,数量扮演着重要角色。尽管数量在文档中具有重要作用,但只有有限的研究工作专注于从文本中提取数量,而创建相应的用户友好文档探索框架的研究则更少。在本研究中,我们介绍了QuantPlorer,一个在线数量提取和探索工具。通过直观的网页界面,QuantPlorer能够从非结构化文本中提取数量,使用户能够交互式地研究和可视化文本中的数量,并支持基于多种特征的过滤,如数值范围、单位、趋势和概念。此外,用户还可以探索和可视化特定单位和概念下的数值分布。我们的演示可在https://quantplorer.ifi.uni-heidelberg.de/ 访问。|code|0| |ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction|Nicolay Rusnachenko, Huizhi Liang, Maksim Kalameyets, Lei Shi|Newcastle Univ, Sch Comp, Newcastle Upon Tyne, Tyne & Wear, England|The escalating volume of textual data necessitates adept and scalable Information Extraction (IE) systems in the field of Natural Language Processing (NLP) to analyse massive text collections in a detailed manner. While most deep learning systems are designed to handle textual information as it is, the interface between a document and the annotation of its parts remains poorly covered. Concurrently, one of the major limitations of most deep-learning models is a constrained input size caused by architectural and computational specifics. To address this, we introduce ARElight¹, a system designed to efficiently manage and extract information from sequences of large documents by dividing them into segments with mentioned object pairs. Through a pipeline comprising modules for text sampling, inference, optional graph operations, and visualisation, the proposed system transforms large volumes of text in a structured manner.
Practical applications of ARElight are demonstrated across diverse use cases, including literature processing and social network analysis. (¹ https://github.com/nicolay-r/ARElight )|随着文本数据量的不断增加,自然语言处理(NLP)领域需要高效且可扩展的信息抽取(IE)系统,以便对大量文本集合进行详细分析。尽管大多数深度学习系统旨在直接处理文本信息,但文档与其部分内容标注之间的接口仍然存在较大空白。同时,大多数深度学习模型的一个主要限制是由于架构和计算特性导致的输入大小受限。为了解决这一问题,我们提出了ARElight¹,该系统通过将大型文档划分为包含提及对象对的片段,从而高效地管理和从这些文档序列中提取信息。通过一个包含文本采样、推理、可选图操作和可视化模块的流程,所提出的系统以结构化的方式处理大量文本。ARElight的实际应用在多个用例中得到展示,包括文献处理和社会网络分析。(¹ https://github.com/nicolay-r/ARElight )|code|0| |Variance Reduction in Ratio Metrics for Efficient Online Experiments|Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko, Olivier Jeunen||Online controlled experiments, such as A/B-tests, are commonly used by modern tech companies to enable continuous system improvements. Despite their paramount importance, A/B-tests are expensive: by their very definition, a percentage of traffic is assigned an inferior system variant. To ensure statistical significance on top-level metrics, online experiments typically run for several weeks. Even then, a considerable number of experiments will lead to inconclusive results (i.e. false negatives, or type-II error). The main culprit for this inefficiency is the variance of the online metrics. Variance reduction techniques have been proposed in the literature, but their direct applicability to commonly used ratio metrics (e.g. click-through rate or user retention) is limited. In this work, we successfully apply variance reduction techniques to ratio metrics on a large-scale short-video platform: ShareChat. Our empirical results show that we can either improve A/B-test confidence in 77% of experiments, or retain the same level of confidence with 30% fewer samples. Importantly, we show that the common approach of including as many covariates as possible in regression is counter-productive, highlighting that control variates based on Gradient-Boosted Decision Tree predictors are most effective. We discuss the practicalities of implementing these methods at scale and showcase the cost reduction they beget.|在线控制实验,如 A/B 测试,通常被现代科技公司用来实现持续的系统改进。尽管 A/B 测试非常重要,但它们的成本很高: 根据它们的定义,一定比例的流量被分配给一个劣质的系统变体。为了确保顶级指标的统计显著性,在线实验通常要运行数周。即使这样,大量的实验也会导致不确定的结果(例如,假阴性,或 II 型错误)。这种低效率的罪魁祸首是在线指标的变化。文献中已经提出了减少方差的技术,但它们对常用比率指标(如点进率或用户保留)的直接适用性是有限的。在这项工作中,我们成功地将方差减少技术应用到一个大规模的短视频平台上: ShareChat。我们的实证结果显示,我们可以在77%的实验中提高 A/B 检验置信度,或者以减少30%的样本量保持相同的置信水平。重要的是,我们表明在回归中包含尽可能多的协变量的常见方法是适得其反的,突出显示基于梯度增强决策树预测器的控制变量是最有效的。我们讨论了在规模上实施这些方法的实用性,并展示了它们带来的成本降低。|code|0| |CLEF 2024 SimpleText Track - Improving Access to Scientific Texts for Everyone|Liana Ermakova, Eric SanJuan, Stéphane Huet, Hosein Azarbonyad, Giorgio Maria Di Nunzio, Federica Vezzani, Jennifer D'Souza, Salomon Kabongo, Hamed Babaei Giglou, Yue Zhang, Sören Auer, Jaap Kamps|TIB Leibniz Informat Ctr Sci & Technol, Hannover, Germany; Univ Amsterdam, Amsterdam, Netherlands; Avignon Univ, LIA, Avignon, France; Elsevier, Amsterdam, Netherlands; Univ Padua, Padua, Italy; Univ Bretagne Occidentale, HCTI, Brest, France|Everyone acknowledges the importance of objective scientific information. However, finding and understanding relevant scientific documents is often challenging due to complex terminology and readers' lack of prior knowledge. The question is: can we improve accessibility for everyone? This paper presents an overview of the SimpleText Track at CLEF 2024 addressing the technical and evaluation challenges associated with making scientific information accessible to a wide audience, including students and non-experts.
It describes the data and benchmarks provided for scientific text summarization and simplification, along with the participants' results. The CLEF 2024 SimpleText track is based on four interrelated tasks: Task 1 on Content Selection: Retrieving Passages to Include in a Simplified Summary. Task 2 on Complexity Spotting: Identifying and Explaining Difficult Concepts. Task 3 on Text Simplification: Simplify Scientific Text. Task 4 on SOTA?: Tracking the State-of-the-Art in Scholarly Publications.|每个人都认识到客观科学信息的重要性。然而,由于复杂的术语和读者缺乏先验知识,查找和理解相关的科学文献往往具有挑战性。问题是,我们能否提高信息的可访问性,使每个人都能轻松获取?本文概述了CLEF 2024中的SimpleText Track,该赛道旨在解决将科学信息普及给学生和非专家等广泛受众所面临的技术和评估挑战。文章详细介绍了为科学文本摘要和简化提供的数据和基准,以及参与者的结果。CLEF 2024 SimpleText赛道基于四个相互关联的任务:任务1关于内容选择:检索要包含在简化摘要中的段落。任务2关于复杂性识别:识别并解释困难概念。任务3关于文本简化:简化科学文本。任务4关于SOTA?:追踪学术出版物中的最新进展。|code|0| |LifeCLEF 2024 Teaser: Challenges on Species Distribution Prediction and Identification|Alexis Joly, Lukás Picek, Stefan Kahl, Hervé Goëau, Vincent Espitalier, Christophe Botella, Benjamin Deneu, Diego Marcos, Joaquim Estopinan, César Leblanc, Théo Larcher, Milan Sulc, Marek Hrúz, Maximilien Servajean, Jirí Matas, Hervé Glotin, Robert Planqué, WillemPier Vellinga, Holger Klinck, Tom Denton, Andrew M. Durso, Ivan Eggel, Pierre Bonnet, Henning Müller|Google Res, San Francisco, CA USA; Florida Gulf Coast Univ, Dept Biol Sci, Ft Myers, FL USA; Univ Montpellier, CNRS, LIRMM, INRIA, Montpellier, France; Univ Montpellier, Univ Paul Valery Montpellier, AMIS, LIRMM,CNRS, Montpellier, France; Univ West Bohemia, Dept Cybernet, FAV, Plzen, Czech Republic; Second Fdn, Prague, Czech Republic; Tech Univ Chemnitz, Chemnitz, Germany; Czech Tech Univ, Prague, Czech Republic; HES SO Valais, Inst Informat, Sierre, Switzerland; Cornell Univ, Cornell Lab Ornithol, K Lisa Yang Ctr Conservat Bioacoust, Ithaca, NY USA; Aix Marseille Univ, Univ Toulon, CNRS, LIS,DYNI Team, Marseille, France; CIRAD, UMR AMAP, Montpellier, Occitanie, France; Xenocanto Fdn, Amersfoort, Netherlands|Building accurate knowledge of the identity, the geographic distribution and the evolution of species is essential for the sustainable development of humanity, as well as for biodiversity conservation. However, species identification and inventory is a difficult and costly task, requiring large-scale automated approaches. The LifeCLEF lab has been promoting and evaluating advances in this domain since 2011 through the organization of multi-year challenges. 
The 2024 edition presented in this article proposes five data-driven challenges as a continuation of this effort: (i) BirdCLEF: bird species recognition in audio soundscapes, (ii) FungiCLEF: fungi recognition beyond 0-1 cost, (iii) GeoLifeCLEF: remote sensing based prediction of species, (iv) PlantCLEF: Multi-species identification in vegetation plot images, and (v) SnakeCLEF: snake recognition in medically important scenarios.|构建关于物种身份、地理分布和演变的准确知识对于人类的可持续发展以及生物多样性保护至关重要。然而,物种识别和编目是一项困难且成本高昂的任务,需要大规模自动化方法。自2011年以来,LifeCLEF实验室通过组织多年挑战赛,推动并评估了该领域的进展。本文介绍的2024年版本提出了五项数据驱动的挑战,作为这一努力的延续:(i) BirdCLEF:音频声景中的鸟类物种识别,(ii) FungiCLEF:超越0-1成本的真菌识别,(iii) GeoLifeCLEF:基于遥感的物种预测,(iv) PlantCLEF:植被样地图像中的多物种识别,以及(v) SnakeCLEF:在医学重要场景中的蛇类识别。|code|0| |The CLEF 2024 Monster Track: One Lab to Rule Them All|Nicola Ferro, Julio Gonzalo, Jussi Karlgren, Henning Müller|SiloGen, Helsinki, Finland; Univ Padua, Padua, Italy; HES SO Valais, Valais, Switzerland; UNED, Madrid, Spain|Generative Artificial Intelligence (AI) and Large Language Models (LLMs) are revolutionizing technology and society thanks to their versatility and applicability to a wide array of tasks and use cases, in multiple media and modalities. As a new and relatively untested technology, LLMs raise several challenges for research and application alike, including questions about their quality, reliability, predictability, veracity, as well as on how to develop proper evaluation methodologies to assess their various capacities. This evaluation lab will focus on a specific aspect of LLMs, namely their versatility. The CLEF Monster Track is organized as a meta-challenge across a selection of tasks chosen from other evaluation labs running in CLEF 2024, and participants will be asked to develop or adapt a generative AI or LLM-based system that will be run on all the tasks with no or minimal task adaptation. This will allow us to systematically evaluate the performance of the same LLM-based system across a wide range of very different tasks and to provide feedback to each targeted task about the performance of a general-purpose LLM system compared to systems specifically developed for the task.
Since the datasets for CLEF 2024 have not yet been released publicly, we will be able to experiment with previously unseen data, thus reducing the risk of contamination, which is one of the most serious problems faced by LLM evaluation datasets.|生成式人工智能(Generative Artificial Intelligence, AI)和大型语言模型(Large Language Models, LLMs)凭借其多功能性以及在多种媒体和模态下广泛任务和应用场景中的适用性,正在彻底改变技术和社会。作为一种新兴且相对未经充分验证的技术,LLMs 在研究和应用方面提出了诸多挑战,包括关于其质量、可靠性、可预测性、真实性等问题,以及如何开发适当的评估方法来衡量其各项能力。本次评估实验室将聚焦于 LLMs 的一个特定方面,即其多功能性。CLEF Monster Track 被组织为一项跨任务元挑战,任务选自 CLEF 2024 中运行的其他评估实验室,参与者将被要求开发或调整一个基于生成式 AI 或 LLM 的系统,该系统将在所有任务上运行,且无需或仅需极少的任务适配。这将使我们能够系统性地评估同一基于 LLM 的系统在广泛且差异巨大的任务中的表现,并为每个目标任务提供关于通用 LLM 系统与专门为该任务开发的系统相比的性能反馈。由于 CLEF 2024 的数据集尚未公开,我们将能够在未见过的数据上进行实验,从而降低数据污染的风险,这是 LLM 评估数据集面临的最严重问题之一。|code|0| |CLEF 2024 JOKER Lab: Automatic Humour Analysis|Liana Ermakova, AnneGwenn Bosser, Tristan Miller, Tremaine Thomas, Victor Manuel PalmaPreciado, Grigori Sidorov, Adam Jatowt|Inst Politecn Nacl IPN, Ctr Invest Computac CIC, Mexico City, DF, Mexico; Ecole Natl Ingnenieurs Brest, Lab STICC, CNRS, UMR 6285, Brest, France; Univ Bretagne Occidentale, HCTI, Brest, France; Univ Manitoba, Dept Comp Sci, Winnipeg, MB, Canada; Univ Innsbruck, Innsbruck, Austria|The JOKER Lab at the Conference and Labs of the Evaluation Forum (CLEF) aims to foster research on automated processing of verbal humour, including tasks such as retrieval, classification, interpretation, generation, and translation. Despite the heady success of large language models, the automatic processing of humour and wordplay is far from being a solved problem. JOKER brings together experts from the social and computational sciences and encourages them to collaborate on shared tasks with quality-controlled annotated datasets. In 2024, we will offer entirely new shared tasks on humour-aware information retrieval, as well as fine-grained sentiment analysis and classification of humour for conversational agents. As in the past JOKER Labs, we will also make our data available for an unshared task that solicits novel use cases. In this paper, we provide a brief retrospective on the JOKER Labs, with a focus on the results and lessons learnt from last year's iteration, and we preview the tasks to be held at JOKER 2024.|JOKER实验室是评估论坛会议及实验室(CLEF)的一部分,旨在推动关于语言幽默自动处理的研究,包括检索、分类、解释、生成和翻译等任务。尽管大型语言模型取得了令人瞩目的成功,但幽默和文字游戏的自动处理远未成为一个已解决的问题。JOKER汇聚了来自社会科学和计算科学领域的专家,并鼓励他们利用经过质量控制的标注数据集在共享任务上进行合作。2024年,我们将提供全新的共享任务,包括幽默感知信息检索、细粒度情感分析以及对话代理的幽默分类。与以往的JOKER实验室一样,我们还将提供数据支持非共享任务,以征集新颖的应用案例。本文简要回顾了JOKER实验室的历史,重点介绍了去年迭代的成果和经验教训,并预览了将在2024年JOKER中举行的任务。|code|0| |iDPP@CLEF 2024: The Intelligent Disease Progression Prediction Challenge|Helena Aidos, Roberto Bergamaschi, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Barbara Di Camillo, Mamede de Carvalho, Nicola Ferro, Piero Fariselli, Jose Manuel García Dominguez, Sara C. Madeira, Eleonora Tavazzi|Gregorio Maranon Hosp Madrid, Madrid, Spain; Univ Pavia, Pavia, Italy; IRCCS Fdn C Mondino Pavia, Pavia, Italy; Univ Turin, Turin, Italy; Univ Lisbon, Lisbon, Portugal; Univ Padua, Padua, Italy; Citta Salute & Sci, Turin, Italy|Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases characterized by progressive or alternate impairment of neurological functions (motor, sensory, visual, cognitive).
Patients have to manage alternating periods in hospital with care at home, experiencing constant uncertainty regarding the timing of the disease's acute phases and facing a considerable psychological and economic burden that also involves their caregivers. Clinicians, on the other hand, need tools able to support them in all the phases of patient treatment, suggest personalized therapeutic decisions, and indicate urgently needed interventions. iDPP@CLEF ran in CLEF 2022 and 2023, offering tasks on the prediction of ALS and MS progression, using retrospective patient clinical data complemented with environmental data. iDPP@CLEF 2024 will focus on prospective patient data for ALS collected via a dedicated app developed by the BRAINTEASER project and sensor data in the context of clinical trials in Turin, Pavia, Lisbon, and Madrid. For MS, iDPP@CLEF 2024 will rely on retrospective patient data complemented with environmental and pollution data from clinical institutions in Pavia and Turin.|肌萎缩侧索硬化症(ALS)和多发性硬化症(MS)是两种慢性疾病,其特征是神经功能(运动、感觉、视觉、认知)的逐渐或交替受损。患者需要在医院和家庭护理之间交替管理,面对疾病急性期时间的不确定性,并承受巨大的心理和经济负担,这些负担也波及到他们的护理者。另一方面,临床医生需要能够在患者治疗的所有阶段支持他们的工具,提供个性化的治疗决策建议,并指出急需的干预措施。iDPP@CLEF在2022年和2023年运行,提供了基于回顾性患者临床数据并辅以环境数据的ALS和MS进展预测任务。iDPP@CLEF 2024将专注于通过BRAINTEASER项目开发的专用应用程序收集的ALS前瞻性患者数据,以及在都灵、帕维亚、里斯本和马德里进行的临床试验中的传感器数据。对于MS,iDPP@CLEF 2024将依赖于回顾性患者数据,并辅以来自帕维亚和都灵临床机构的环境和污染数据。|code|0| |LongEval: Longitudinal Evaluation of Model Performance at CLEF 2024|Rabab Alkhalifa, Hsuvas Borkakoty, Romain Deveaud, Alaa ElEbshihy, Luis Espinosa Anke, Tobias Fink, Gabriela González Sáez, Petra Galuscáková, Lorraine Goeuriot, David Iommi, Maria Liakata, Harish Tayyar Madabushi, Pablo MedinaAlias, Philippe Mulhem, Florina Piroi, Martin Popel, Christophe Servan, Arkaitz Zubiaga|Cardiff Univ, Cardiff, Wales; Qwant, Paris, France; Queen Mary Univ London, London, England; Univ Stavanger, Stavanger, Norway; Charles Univ Prague, Prague, Czech Republic; Univ Bath, Bath, Avon, England; Res Studios Austria, Data Sci Studio, Vienna, Austria; Univ Grenoble Alpes, Grenoble INP, CNRS, Inst Engn,LIG, Grenoble, France|This paper introduces the planned second LongEval Lab, part of the CLEF 2024 conference. The aim of the lab's two tasks is to give researchers test data for addressing temporal effectiveness persistence challenges in both information retrieval and text classification, motivated by the fact that model performance degrades as the test data becomes temporally distant from the training data. LongEval distinguishes itself from traditional IR and classification tasks by emphasizing the evaluation of models designed to mitigate performance drop over time using evolving data. The second LongEval edition will further engage the IR community and NLP researchers in addressing the crucial challenge of temporal persistence in models, exploring the factors that enable or hinder it, and identifying potential solutions along with their limitations.|本文介绍了计划中的第二届LongEval实验室,该实验室是CLEF 2024会议的一部分。该实验室的两个任务旨在为研究人员提供测试数据,以解决信息检索和文本分类中的时间有效性持续性挑战,其动机在于随着测试数据与训练数据在时间上的距离增加,模型性能会下降。LongEval通过强调评估旨在利用演化数据来缓解性能随时间下降的模型,从而与传统的信息检索和分类任务区分开来。第二届LongEval将进一步吸引信息检索社区和自然语言处理研究人员,共同应对模型时间持续性的关键挑战,探索促进或阻碍这一持续性的因素,并识别潜在的解决方案及其局限性。|code|0| |CrisisKAN: Knowledge-Infused and Explainable Multimodal Attention Network for Crisis Event Classification|Shubham Gupta, Nandini Saini, Suman Kundu, Debasis Das||Pervasive use of social media has become an emerging source of real-time information (like images, text, or both) for identifying various events.
Despite the rapid growth of image and text-based event classification, the state-of-the-art (SOTA) models find it challenging to bridge the semantic gap between features of image and text modalities due to inconsistent encoding. Also, the black-box nature of models fails to explain the models' outcomes for building trust in high-stakes situations such as disasters and pandemics. Additionally, the word limit imposed on social media posts can potentially introduce bias towards specific events. To address these issues, we propose CrisisKAN, a novel Knowledge-infused and Explainable Multimodal Attention Network that entails images and texts in conjunction with external knowledge from Wikipedia to classify crisis events. To enrich the context-specific understanding of textual information, we integrated Wikipedia knowledge using the proposed wiki extraction algorithm. Along with this, a guided cross-attention module is implemented to fill the semantic gap in integrating visual and textual data. In order to ensure reliability, we employ a model-specific approach called Gradient-weighted Class Activation Mapping (Grad-CAM) that provides a robust explanation of the predictions of the proposed model. The comprehensive experiments conducted on the CrisisMMD dataset yield in-depth analysis across various crisis-specific tasks and settings. As a result, CrisisKAN outperforms existing SOTA methodologies and provides a novel view in the domain of explainable multimodal event classification.|社交媒体的广泛使用已经成为实时信息(如图像、文本或两者)的新兴来源,用于识别各种事件。尽管图像和基于文本的事件分类发展迅速,但是由于编码不一致,最新的 SOTA 模型在消除图像特征和文本模式之间的语义鸿沟方面遇到了挑战。此外,模型的黑盒子性质也无法解释模型在灾难、流行病等高风险情况下建立信任的结果。此外,对社交媒体帖子的字数限制可能会引起对特定事件的偏见。为了解决这些问题,我们提出了 CrisisKAN,一个新颖的知识注入和可解释的多模式注意力网络,将图像和文本与来自维基百科的外部知识结合起来,对危机事件进行分类。为了丰富文本信息的上下文特定理解,我们使用提出的 wiki 抽取算法集成 Wikipedia 知识。与此同时,引导交叉注意模块的实现,以填补在整合视觉和文本数据的语义差距。为了确保可靠性,我们采用了一种特定于模型的方法,称为梯度加权类激活映射(Grad-CAM),它为所提出的模型的预测提供了一个稳健的解释。在 CrisisMMD 数据集上进行的综合实验产生了对各种危机特定任务和设置的深入分析。因此,CrisisKAN 优于现有的 SOTA 方法,在可解释多模态事件分类领域提供了一种新的视角。|code|0| |Probing Pretrained Language Models with Hierarchy Properties|Jesús LovónMelgarejo, José G. Moreno, Romaric Besançon, Olivier Ferret, Lynda Tamine||Since Pretrained Language Models (PLMs) are the cornerstone of the most recent Information Retrieval (IR) models, the way they encode semantic knowledge is particularly important. However, little attention has been given to studying the PLMs' capability to capture hierarchical semantic knowledge. Traditionally, evaluating such knowledge encoded in PLMs relies on their performance on a task-dependent evaluation approach based on proxy tasks, such as hypernymy detection. Unfortunately, this approach potentially ignores other implicit and complex taxonomic relations. In this work, we propose a task-agnostic evaluation method able to evaluate to what extent PLMs can capture complex taxonomy relations, such as ancestors and siblings. The evaluation is based on intrinsic properties that capture the hierarchical nature of taxonomies. Our experimental evaluation shows that the lexico-semantic knowledge implicitly encoded in PLMs does not always capture hierarchical relations. We further demonstrate that the proposed properties can be injected into PLMs to improve their understanding of hierarchy.
Through evaluations on taxonomy reconstruction, hypernym discovery and reading comprehension tasks, we show that the knowledge about hierarchy is moderately but not systematically transferable across tasks.|由于预训练语言模型是最新的信息检索模型的基石,因此它们编码语义知识的方式尤为重要。然而,对于 PLM 获取层次化语义知识的能力的研究却很少被关注。传统上,评估编码在 PLM 中的此类知识依赖于基于代理任务的任务相关评估方法,如上位词检测。不幸的是,这种方法可能忽略了其他隐式和复杂的分类关系。在这项工作中,我们提出了一个任务无关的评估方法,能够评估 PLM 在多大程度上可以捕获复杂的分类关系,如祖先和兄弟姐妹。评估基于捕获分类法的层次性质的内在属性。实验结果表明,PLM 中隐含的词汇语义知识并不总是能够捕获层次关系。我们进一步证明了所提议的属性可以被注入到 PLM 中,以提高它们对层次结构的理解。通过对分类学重建、上位词发现和阅读理解任务的评估,我们发现关于等级的知识适度但不能系统地跨任务转移。|code|0| |HyperPIE: Hyperparameter Information Extraction from Scientific Publications|Tarek Saier, Mayumi Ohta, Takuto Asakura, Michael Färber||Automatic extraction of information from publications is key to making scientific knowledge machine readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this paper, we formalize and tackle hyperparameter information extraction (HyperPIE) as an entity recognition and relation extraction task. We create a labeled data set covering publications from a variety of computer science disciplines. Using this data set, we train and evaluate BERT-based fine-tuned models as well as five large language models: GPT-3.5, GALACTICA, Falcon, Vicuna, and WizardLM. For fine-tuned models, we develop a relation extraction approach that achieves an improvement of 29% F1 over a state-of-the-art baseline. For the LLMs, we develop an approach leveraging YAML output for structured data extraction, which achieves an average improvement of 5.5% F1 score over using JSON. With our best performing model we extract hyperparameter information from a large number of unannotated papers, and analyze patterns across disciplines. All our data and source code is publicly available at https://github.com/IllDepence/hyperpie|从出版物中自动提取信息是使科学知识机器具有大规模可读性的关键。提取的信息可以方便学术搜索、决策制定和知识图的构建。现有方法未涵盖的一种重要类型的信息是超参数。在本文中,我们将超参数信息抽取(HyperPIE)形式化并处理为一个实体识别和关系提取任务。我们创建了一个标签数据集,涵盖了来自各种计算机科学学科的出版物。使用这个数据集,我们训练和评估基于 BERT 的微调模型以及五种大型语言模型: GPT-3.5、 GALACTICA、 Falcon、 Vicuna 和 WizardLM。对于微调模型,我们开发了一种关系提取方法,其 F1 值比最先进的基线方法提高了29%。对于大语言模型,我们开发了一种利用 YAML 输出进行结构化数据提取的方法,其 F1 值比使用 JSON 平均提高了5.5%。使用性能最好的模型,我们从大量未注释的论文中提取超参数信息,并分析跨学科的模式。我们所有的数据和源代码都可以在 https://github.com/illdepence/hyperpie 上公开|code|0| |An EcoSage Assistant: Towards Building A Multimodal Plant Care Dialogue Assistant|Mohit Tomar, Abhisek Tiwari, Tulika Saha, Prince Jha, Sriparna Saha||In recent times, there has been an increasing awareness about imminent environmental challenges, resulting in people showing a stronger dedication to taking care of the environment and nurturing green life. The current $19.6 billion indoor gardening industry, reflective of this growing sentiment, not only signifies a monetary value but also speaks of a profound human desire to reconnect with the natural world. However, several recent surveys cast a revealing light on the fate of plants within our care, with more than half succumbing primarily due to the silent menace of improper care. Thus, the need for accessible expertise capable of assisting and guiding individuals through the intricacies of plant care has become paramount more than ever. In this work, we make the very first attempt at building a plant care assistant, which aims to assist people with plant(-ing) concerns through conversations. We propose a plant care conversational dataset named Plantational, which contains around 1K dialogues between users and plant care experts.
Our end-to-end proposed approach is two-fold: (i) We first benchmark the dataset with the help of various large language models (LLMs) and a visual language model (VLM) by studying the impact of instruction tuning (zero-shot and few-shot prompting) and fine-tuning techniques on this task; (ii) finally, we build EcoSage, a multi-modal plant care assisting dialogue generation framework, incorporating an adapter-based modality infusion using a gated mechanism. We performed an extensive examination (both automated and manual evaluation) of the performance exhibited by various LLMs and the VLM in the generation of domain-specific dialogue responses to underscore the respective strengths and weaknesses of these diverse models.|近年来,人们越来越认识到迫在眉睫的环境挑战,因此人们更加致力于保护环境和培育绿色生活。目前196亿美元的室内园艺产业,反映了这种日益增长的情绪,不仅意味着货币价值,而且表明了人类与自然世界重新建立联系的强烈愿望。然而,最近的一些调查揭示了我们所照料的植物的命运,超过一半的植物死亡主要是由于不当照料的无声威胁。因此,现在比以往任何时候都更需要能够帮助和指导个人通过复杂的植物护理的可获得的专业知识。在这项工作中,我们首次尝试构建一个植物护理助手,其目的是通过对话帮助人们处理植物问题。我们提出了一个名为 Plantational 的植物护理会话数据集,它包含用户和植物护理专家之间大约1K 的对话。我们提出的端到端的方法是双重的: (i)我们首先在各种大型语言模型(LLM)和可视化语言模型(VLM)的帮助下,通过研究指令调优(零拍摄和少拍摄提示)和微调技术对这项任务的影响来测试数据集; (ii)最后,我们构建 EcoSage,一个多模态植物护理辅助对话生成框架,使用门控机制结合基于适配器的模式输入。我们对各种 LLM 和 VLM 在生成特定领域的对话响应时所展示的性能进行了广泛的检查(包括自动和手动评估),以强调这些不同模型各自的优缺点。|code|0| |Controllable Decontextualization of Yes/No Question and Answers into Factual Statements|Lingbo Mo, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi|Ohio State Univ, Columbus, OH 43210 USA; Amazon Com Inc, Seattle, WA USA|Yes/No or polar questions represent one of the main linguistic question categories. They consist of a main interrogative clause, for which the answer is binary (assertion or negation). Polar questions and answers (PQA) represent a valuable knowledge resource present in many community and other curated QA sources, such as forums or e-commerce applications. Using answers to polar questions alone in other contexts is not trivial. Answers are contextualized, and presume that the interrogative question clause and any shared knowledge between the asker and answerer are provided. We address the problem of controllable rewriting of answers to polar questions into decontextualized and succinct factual statements. We propose a Transformer sequence to sequence model that utilizes soft-constraints to ensure controllable rewriting, such that the output statement is semantically equivalent to its PQA input. Evaluation on three separate PQA datasets as measured through automated and human evaluation metrics shows that our proposed approach achieves the best performance when compared to existing baselines.|是/否或极性问题是主要的语言问题类别之一。它们由一个主要的疑问句组成,其答案是二元的(肯定或否定)。极性问题和答案(PQA)是存在于许多社区和其他精心策划的问答来源(如论坛或电子商务应用)中的宝贵知识资源。在其他上下文中单独使用极性问题的答案并不简单。答案是上下文化的,并假设提供了疑问句以及提问者和回答者之间的任何共享知识。我们解决了将极性问题的答案重写为去上下文化且简洁的事实陈述的可控重写问题。我们提出了一种Transformer序列到序列模型,该模型利用软约束来确保可控重写,使得输出语句在语义上与其PQA输入等价。通过对三个独立的PQA数据集的评估,通过自动和人工评估指标测量,显示我们提出的方法与现有基线相比具有最佳性能。|code|0| |Reading Between the Frames: Multi-modal Depression Detection in Videos from Non-verbal Cues|David GimenoGómez, AnaMaria Bucur, Adrian Cosma, Carlos David MartínezHinarejos, Paolo Rosso||Depression, a prominent contributor to global disability, affects a substantial portion of the population. Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content.
In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance, and we extracted and processed audio speech embeddings, face emotion embeddings, face, body and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results on three key benchmark datasets for depression detection from video by a substantial margin. Our code is publicly available on GitHub.|抑郁症是导致全球性残疾的一个重要因素,影响着相当一部分人口。从社交媒体文本中检测抑郁症的努力已经很普遍,然而只有少数作品探索了从用户生成的视频内容中检测抑郁症。在这项工作中,我们通过提出一个简单而灵活的多模态时间模型来解决这一研究差距,该模型能够从嘈杂的现实世界视频中的不同模式中辨别出非语言性抑郁的线索。我们表明,对于野外视频,使用额外的高水平非语言线索对获得良好的表现至关重要,我们提取和处理语音嵌入,面部情感嵌入,面部,身体和手的地标,以及凝视和眨眼信息。通过广泛的实验,我们表明,我们的模型实现了国家的最先进的结果,三个关键的基准数据集抑郁症检测从视频相当大的幅度。我们的代码在 GitHub 上公开可用。|code|0| |Investigating the Effects of Sparse Attention on Cross-Encoders|Ferdinand Schlatt, Maik Fröbe, Matthias Hagen||Cross-encoders are effective passage and document re-rankers but less efficient than other neural or classic retrieval models. A few previous studies have applied windowed self-attention to make cross-encoders more efficient. However, these studies did not investigate the potential and limits of different attention patterns or window sizes. We close this gap and systematically analyze how token interactions can be reduced without harming the re-ranking effectiveness. Experimenting with asymmetric attention and different window sizes, we find that the query tokens do not need to attend to the passage or document tokens for effective re-ranking and that very small window sizes suffice. In our experiments, even windows of 4 tokens still yield effectiveness on par with previous cross-encoders while reducing the memory requirements to at most 78% for passages / documents.|交叉编码器是有效的通道和文档重新排序,但效率低于其他神经或经典检索模型。以前的一些研究已经应用窗口自我注意,使交叉编码器更有效率。然而,这些研究并没有调查不同注意模式或窗口大小的潜力和局限性。我们缩小了这一差距,并系统地分析了如何在不损害重新排序效率的情况下减少令牌交互。通过不对称注意和不同窗口大小的实验,我们发现查询标记不需要注意文章或文档标记来进行有效的重新排序,非常小的窗口大小就足够了。在我们的实验中,即使是4个令牌的窗口也能产生与以前的交叉编码器相同的效率,同时将文章/文档的内存需求降低到最多78%。|code|0| |SumBlogger: Abstractive Summarization of Large Collections of Scientific Articles|Pavlos Zakkas, Suzan Verberne, Jakub Zavrel|Zeta Alpha, Amsterdam, Netherlands; Leiden Univ, Leiden, Netherlands|We propose a prompt-based pipeline for extreme summarization of large collections of scientific articles, which facilitates the consumption of scientific knowledge in high-volume fast-paced fields like AI. Although prompting of generative large language models (LLMs) has been applied to news summarization, its effectiveness in the scientific domain and in multi-document summarization is underexplored. We propose a three-step approach for summarizing a large collection of documents (e.g. hundreds or thousands of papers published in a conference). First, selecting representative papers per document cluster, second, performing single-document summarization (SDS) of the selected papers, and third, aggregating these in a multi-document summarization (MDS) step. Both the single-document summaries and the multi-document summaries are generated with an instruction-tuned LLM. The cluster summaries are used to generate a blog post summarizing a conference. We show that our SDS model achieves better results than strong fine-tuned models on the SciTLDR benchmark.
Our two-step approach reaches the performance of state-of-the-art fine-tuned MDS models on the Multi-XScience benchmark. Through a small-scale user study, we find that, although a human-written blog post is clearly preferred over an automatically generated one, the users appreciate the good informativeness and factuality of our pipeline. Our findings demonstrate the potential use of generative LLMs as a way to digest large amounts of scientific papers and help researchers to stay up-to-date with rapidly evolving fields.|我们提出了一种基于提示的流水线方法,用于对大量科学文章进行极致摘要,以促进在人工智能等高容量快节奏领域中科学知识的消费。尽管生成式大型语言模型(LLM)的提示技术已被应用于新闻摘要,但其在科学领域和多文档摘要中的有效性尚未得到充分探索。我们提出了一种三步法来总结大量文档(例如,会议上发表的数百或数千篇论文)。首先,为每个文档集群选择代表性论文;其次,对选定的论文进行单文档摘要(SDS);最后,将这些摘要聚合在多文档摘要(MDS)步骤中。单文档摘要和多文档摘要均由指令调优的LLM生成。集群摘要用于生成总结会议的博客文章。我们展示了我们的SDS模型在SciTLDR基准测试中取得了比强微调模型更好的结果。我们的两步法在Multi-XScience基准测试中达到了最先进的微调MDS模型的性能。通过小规模用户研究,我们发现,尽管人类撰写的博客文章明显优于自动生成的文章,但用户对我们流水线的高信息性和事实性表示赞赏。我们的研究结果表明,生成式LLM有潜力作为消化大量科学论文并帮助研究人员跟上快速发展的领域的一种方式。|code|0| |Role-Guided Contrastive Learning for Event Argument Extraction|Chunyu Yao, Yi Guo, Xue Chen, Zhenzhen Duan, Jiaojiao Fu|East China Univ Sci & Technol, Shanghai, Peoples R China|Event argument extraction is a subtask of information extraction. Recent efforts have predominantly focused on mitigating the issue of error propagation associated with pipeline methods for extracting event arguments, such as machine reading comprehension and generative approaches. However, these aforementioned methods necessitate the careful design of various templates, and the choice of templates can significantly impact the model's performance. Therefore, we propose a novel approach to extract event arguments using contrastive learning. Our approach aims to maximize the semantic similarity between role name semantics and actual argument semantics while minimizing the similarity between role name semantics and the semantics of other non-argument words, thereby enabling more precise extraction of argument boundaries. We investigate the impact of different templates on event argument extraction, and experimental results demonstrate that template adjustments have limited effects on our model. To attain more precise argument boundaries, we also introduce entity type boundary embeddings, which substantially enhance the effectiveness of event argument extraction.|事件论元抽取是信息抽取的一个子任务。近期研究主要集中在缓解与流水线方法相关的事件论元抽取中的错误传播问题,例如机器阅读理解方法和生成式方法。然而,上述方法需要精心设计各种模板,且模板的选择会显著影响模型性能。因此,我们提出了一种利用对比学习进行事件论元抽取的新方法。我们的方法旨在最大化角色名称语义与实际论元语义之间的相似性,同时最小化角色名称语义与其他非论元词语义之间的相似性,从而实现更精确的论元边界抽取。我们研究了不同模板对事件论元抽取的影响,实验结果表明模板调整对我们的模型影响有限。为了获得更精确的论元边界,我们还引入了实体类型边界嵌入,这显著提升了事件论元抽取的效果。|code|0| |Attend All Options at Once: Full Context Input for Multi-choice Reading Comprehension|Runda Wang, Suzan Verberne, Marco Spruit|Leiden Univ, Leiden, Netherlands|This paper proposes a method to capture the relations between options in Multiple-choice Machine Reading Comprehension (MMRC) tasks. MMRC is a form of question answering (QA) in which the question is about a given text, and multiple answers are provided as options. Capturing the relations between options is especially important for options with information references between them that cannot stand alone as responses to the questions, such as "None of the above". Our method 1) takes the whole sample including identification of the passage, question, and all options as input for pre-trained language models, and 2) adds a fuser network to emphasize the information interaction between options.
Experimental results show that our method improves over the common encoding approaches on COSMOS-QA, an MMRC dataset with between-option references, while having a relatively small impact on other MMRC datasets without references between the options. We conclude that our method actually helps to capture the necessary relationships between options. In addition, our method can reduce the memory usage required for training, and the model can be easily transferred to other domains and models.|本文提出了一种在多项选择机器阅读理解(MMRC)任务中捕捉选项之间关系的方法。MMRC是一种问答(QA)形式,其中问题基于给定文本,并提供多个答案作为选项。捕捉选项之间的关系对于那些选项之间存在信息引用且无法单独作为问题回答的选项尤为重要,例如“以上都不是”。我们的方法 1) 将整个样本(包括段落、问题及所有选项的识别)作为预训练语言模型的输入,2) 添加一个融合网络以强调选项之间的信息交互。实验结果表明,我们的方法在COSMOS-QA(一个具有选项间引用的MMRC数据集)上优于常见的编码方法,而在没有选项间引用的其他MMRC数据集上影响较小。我们得出结论,我们的方法确实有助于捕捉选项之间的必要关系。此外,我们的方法可以减少训练所需的内存使用,并且该模型可以轻松迁移到其他领域和模型中。|code|0| |Zero-Shot Generative Large Language Models for Systematic Review Screening Automation|Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman, Guido Zuccon||Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation using five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.|系统评价对于循证医学来说至关重要,因为它们全面分析了已发表的关于具体问题的研究成果。进行这种审查往往需要大量资源和时间,特别是在筛选阶段,对出版物摘要进行评估以纳入审查。本研究旨在探讨使用大语言模型(LLM)进行自动筛选的有效性。我们评估了8种不同 LLM 的有效性,并研究了一种校准技术,该技术使用预定义的召回阈值来确定是否应该将出版物纳入系统综述。我们使用五个标准测试集进行的综合评估表明,指令微调在筛选中起着重要作用,校准使 LLM 实用于实现有针对性的召回,并且与一系列零拍模型相结合,与最先进的方法相比节省了显着的筛选时间。|code|0| |WebSAM-Adapter: Adapting Segment Anything Model for Web Page Segmentation|Bowen Ren, Zefeng Qian, Yuchen Sun, Chao Gao, Chongyang Zhang|Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai 200240, Peoples R China; China Pacific Insurance Grp Co Ltd, Shanghai 200010, Peoples R China|With the advancement of internet technology, web page segmentation, which aims to divide web pages into semantically coherent units, has become increasingly crucial for web-related applications. Conventional purely visual web page segmentation approaches, which depend on traditional edge detection, face challenges in generalizing across complex web pages. Recently, the Segment Anything Model (SAM) has demonstrated remarkable visual understanding and segmentation abilities. This suggests that SAM can also show great potential in web page segmentation. However, due to the lack of web-specific training data, its direct adaptation to the web page segmentation domain has been hindered. To address this challenge, we propose WebSAM-Adapter, an effective adaptation of SAM, featuring a three-module architecture specifically tailored for web page segmentation with minimal additional trainable parameters.
First, we propose a patch embedding tune module for adjusting the frozen patch embedding features, which is crucial for modifying the distribution of the original model. Second, an edge components tune module is designed to learn significant structural features within each web page. Finally, the outputs of these specialized modules are sent into our key Adapter module, which employs a lightweight multi-layer perceptron (MLP) to amalgamate these enriched features and generate webpage-specific knowledge. To the best of our knowledge, our method is the first successful adaptation of a large visual model like SAM to web page segmentation. Empirical evaluations on the comprehensive Webis-WebSeg-20 dataset demonstrate our model's state-of-the-art performance.|随着互联网技术的进步,网页分割旨在将网页划分为语义连贯的单元,对于与网页相关的应用变得越来越重要。传统的纯视觉网页分割方法依赖于传统的边缘检测,在复杂网页上的泛化能力面临挑战。最近,Segment Anything Model(SAM)展现了卓越的视觉理解和分割能力。这启发我们,SAM在网页分割领域也可能展现出巨大的潜力。然而,由于缺乏特定于网页的训练数据,其直接应用于网页分割领域受到了阻碍。为了应对这一挑战,我们提出了WebSAM-Adapter,这是一种有效的SAM适配方法,具有专为网页分割设计的三模块架构,且仅需极少的额外可训练参数。首先,我们提出了一个补丁嵌入调整模块,用于调整冻结的补丁嵌入特征,这对于修改原始模型的分布至关重要。其次,设计了一个边缘组件调整模块,用于学习每个网页中的重要结构特征。最后,这些专门模块的输出被送入我们的关键适配器模块,该模块采用轻量级多层感知器(MLP)来融合这些丰富的特征并生成特定于网页的知识。据我们所知,我们的方法是首次成功将像SAM这样的大型视觉模型适配到网页分割领域。在全面的Webis-WebSeg-20数据集上的实证评估表明,我们的模型达到了最先进的性能。|code|0| |A Phrase-Level Attention Enhanced CRF for Keyphrase Extraction|Shinian Li, Tao Jiang, Yuxiang Zhang|Civil Aviat Univ China, Sch Comp Sci & Technol, Tianjin, Peoples R China|Since sequence labeling-based methods take into account the dependencies between neighbouring labels, they have been widely used for keyphrase prediction. Existing methods mainly focus on the word-level sequence labeling over the word-level features, and fail to capture the phrase-level information (i.e., inner properties of multi-word keyphrases). In this paper, we concentrate on how to effectively capture the phrase-level features and then integrate them with the word-level features to improve the performance of keyphrase extraction in the sequence labeling-based method. Specifically, we propose a phrase-level attention enhanced conditional random field (PAE-CRF) model for keyphrase extraction, which consists of two major modules: a phrase-level attention module that captures phrase-level features, and a phrase-level attention enhanced CRF module that integrates the phrase-level attention information with the word-level features into CRF to extract keyphrases. Finally, these two modules are jointly trained to help them learn complementary information from each other. Compared with the recent state-of-the-art methods, our model can achieve better results through experiments on four benchmark datasets. The code and keyphrase prediction results of our model are available in public at https://github.com/pae-crf/PAE-CRF .|由于基于序列标注的方法考虑了相邻标签之间的依赖关系,它们已被广泛用于关键词预测。现有的方法主要集中在词级特征上的词级序列标注,未能捕捉到短语级信息(即多词关键词的内部属性)。本文重点研究如何有效捕捉短语级特征,并将其与词级特征相结合,以提高基于序列标注方法的关键词抽取性能。具体而言,我们提出了一种短语级注意力增强的条件随机场(PAE-CRF)模型用于关键词抽取,该模型包含两个主要模块:一个用于捕捉短语级特征的短语级注意力模块,以及一个将短语级注意力信息与词级特征整合到CRF中以抽取关键词的短语级注意力增强CRF模块。最后,这两个模块被联合训练,以帮助它们从彼此中学习互补信息。与最近的最先进方法相比,通过在四个基准数据集上的实验,我们的模型能够取得更好的结果。我们的模型代码和关键词预测结果可在https://github.com/pae-crf/PAE-CRF 公开获取。|code|0| |Taxonomy of Mathematical Plagiarism|Ankit Satpute, André GreinerPetter, Noah Gießing, Isabel Beckenbach, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp||Plagiarism is a pressing concern, even more so with the availability of large language models. 
Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating 122 potentially plagiarised scientific document pairs. Second, we analyze the best-performing approaches to detect plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism and math content similarity achieve an overall detection score (PlagDet) of 0.06 and 0.16, respectively. The best-performing methods failed to detect most cases from all seven newly established math similarity types. The outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment's code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism|剽窃是一个迫切需要关注的问题,在大型语言模型的可用性方面更是如此。现有的剽窃检测系统可靠地发现抄袭和适度重写的文本,但无法发现思想剽窃,尤其是在大量使用正式数学符号的数学科学领域。我们有两个贡献。首先,我们通过对可能抄袭的122个科学文档对进行注释,建立了数学内容重用的分类。其次,我们分析了在新建立的分类法中检测剽窃和数学内容相似性的最佳方法。我们发现表现最好的剽窃和数学内容相似性方法的总体检测得分(PlagDet)分别为0.06和0.16。表现最好的方法无法从所有七个新建立的数学相似性类型中检测出大多数案例。概述的贡献将有助于剽窃检测系统、推荐系统、问答系统和搜索引擎的研究。我们将我们实验的代码和注释数据集提供给社区: https://github.com/gipplab/taxonomy-of-mathematical-plagiarism|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Taxonomy+of+Mathematical+Plagiarism)|0| |Unraveling Disagreement Constituents in Hateful Speech|Giulia Rizzi, Alessandro Astorino, Paolo Rosso, Elisabetta Fersini|Univ Milano Bicocca, Milan, Italy; Univ Politecn Valencia, Valencia, Spain|This paper presents a probabilistic semantic approach to identifying disagreement-related textual constituents in hateful content. Several methodologies to exploit the selected constituents to determine if a message could lead to disagreement have been defined. The proposed approach is evaluated on 4 datasets made available for the SemEval 2023 Task 11 shared task, highlighting that a few constituents can be used as a proxy to identify if a sentence could be perceived differently by multiple readers. The source code of our approaches is publicly available ( https://github.com/MIND-Lab/Unrevealing-Disagreement-Constituents-in-Hateful-Speech ).|本文提出了一种概率语义方法,用于识别仇恨内容中与分歧相关的文本成分。我们定义了几种利用所选成分来确定消息是否可能引发分歧的方法。所提出的方法在SemEval 2023任务11共享任务提供的4个数据集上进行了评估,结果表明,少数成分可以作为代理,用于识别句子是否可能被多位读者感知为不同。我们的方法的源代码已公开提供(https://github.com/MIND-Lab/Unrevealing-Disagreement-Constituents-in-Hateful-Speech)。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Unraveling+Disagreement+Constituents+in+Hateful+Speech)|0| |SoftQE: Learned Representations of Queries Expanded by LLMs|Varad Pimpalkhute, John Heyer, Xusen Yin, Sameer Gupta||We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries.
While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.|我们研究了如何将大语言模型(LLM)集成到查询编码器中,通过在推理时避免对 LLM 的依赖,在不增加延迟和成本的情况下提高密集检索。SoftQE 通过将输入查询的嵌入映射到 LLM 扩展查询的嵌入来整合来自 LLM 的知识。虽然在域内 MS-MARCO 指标的各种强基线上的改进是微乎其微的,但是在5个域外 BEIR 任务上,SoftQE 平均提高了2.83个绝对百分点的性能。|code|0| |Optimizing BERTopic: Analysis and Reproducibility Study of Parameter Influences on Topic Modeling|Martin Borcin, Joemon M. Jose|Univ Glasgow, Glasgow G12 8QQ, Scotland|This paper reproduces key experiments and results from the BERTopic neural topic modeling framework. We validate prior findings regarding the role of text preprocessing, embedding models and term weighting strategies in optimizing BERTopic's modular pipeline. Specifically, we show that advanced embedding models like MPNet benefit from raw input while simpler models like GloVe perform better with pre-processed text. We also demonstrate that excluding outlier documents from the topic model provides minimal gains. Additionally, we highlight that appropriate term weighting schemes, such as root TF-BM25(IDF), are critical for topic quality. We manage to reproduce prior results and our rigorous reproductions affirm the effectiveness of BERTopic's flexible framework while providing novel insights into tuning its components for enhanced topic modeling performance. The findings offer guidance and provide insightful refinements and clarifications, serving as a valuable reference for both researchers and practitioners applying clustering-based neural topic modeling.|本文复现了BERTopic神经主题建模框架的关键实验和结果。我们验证了先前关于文本预处理、嵌入模型和词项加权策略在优化BERTopic模块化流程中作用的发现。具体而言,我们展示了像MPNet这样的高级嵌入模型在原始输入上表现更佳,而像GloVe这样的简单模型则在预处理后的文本上表现更好。我们还证明了从主题模型中排除离群文档带来的收益微乎其微。此外,我们强调适当的词项加权方案,如根号TF-BM25(IDF),对主题质量至关重要。我们成功复现了先前的结果,并且我们严格的复现证实了BERTopic灵活框架的有效性,同时为调整其组件以提升主题建模性能提供了新的见解。这些发现为研究人员和实践者在应用基于聚类的神经主题建模时提供了指导,并提供了有价值的参考和深入的改进与澄清。|code|0| |A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR|Xinyu Mao, Bevan Koopman, Guido Zuccon||Screening documents is a tedious and time-consuming aspect of high-recall retrieval tasks, such as compiling a systematic literature review, where the goal is to identify all relevant documents for a topic. To help streamline this process, many Technology-Assisted Review (TAR) methods leverage active learning techniques to reduce the number of documents requiring review. BERT-based models have shown high effectiveness in text classification, leading to interest in their potential use in TAR workflows. In this paper, we investigate recent work that examined the impact of further pre-training epochs on the effectiveness and efficiency of a BERT-based active learning pipeline. We first report that we could replicate the original experiments on two specific TAR datasets, confirming some of the findings: importantly, that further pre-training is critical to high effectiveness, but requires attention in terms of selecting the correct training epoch. We then investigate the generalisability of the pipeline on a different TAR task, that of medical systematic reviews. In this context, we show that there is no need for further pre-training if a domain-specific BERT backbone is used within the active learning pipeline. 
This finding provides practical implications for using the studied active learning pipeline within domain-specific TAR tasks.|筛选文档是高召回率检索任务的一个乏味和耗时的方面,例如编写系统的文献综述,其目标是确定某个主题的所有相关文档。为了帮助简化这个过程,许多技术辅助评审(TAR)方法利用主动学习技术来减少需要评审的文档数量。基于 BERT 的模型在文本分类方面表现出了很高的效率,这使得人们对它们在 TAR 工作流中的潜在用途产生了兴趣。在本文中,我们调查了最近的工作,检查进一步的预训练时代对基于 BERT 的主动学习流水线的有效性和效率的影响。我们首先报告说,我们可以在两个特定的 TAR 数据集上复制原始实验,证实了一些发现: 重要的是,进一步的预训练对于高效性至关重要,但是在选择正确的训练时代方面需要注意。然后,我们调查管道的普遍性在一个不同的 TAR 任务,即医疗系统评价。在这种情况下,我们表明,没有必要进一步的预训练,如果领域特定的 BERT 骨干是在主动学习流水线使用。这一发现为在特定领域的 TAR 任务中使用所研究的主动学习流水线提供了实际意义。|code|0| |Good for Children, Good for All?|Monica Landoni, Theo Huibers, Emiliana Murgia, Maria Soledad Pera|Univ Genoa, Genoa, Italy; Univ Svizzera Italiana, Lugano, Switzerland; Delft Univ Technol, Web Informat Syst, Delft, Netherlands; Univ Twente, Enschede, Netherlands|In this work, we reason how focusing on Information Retrieval (IR) for children and involving them in participatory studies would benefit the IR community. The Child Computer Interaction (CCI) community has embraced the child as a protagonist as their main philosophy, regarding children as informants, co-designers, and evaluators, not just users. Leveraging prior literature, we posit that putting children in the centre of the IR world and giving them an active role could enable the IR community to break free from the preexisting bias derived from interpretations inferred from past use by adult users and the still dominant system-oriented approach. This shift would allow researchers to revisit complex foundational concepts that greatly influence the use of IR tools as part of socio-technical systems in different domains. In doing so, IR practitioners could provide more inclusive and supportive information access experiences to children and other understudied user groups alike in different contexts.|在本研究中,我们探讨了将信息检索(IR)研究的重点放在儿童身上,并让他们参与参与式研究如何使IR领域受益。儿童计算机交互(CCI)领域已将儿童作为主角作为其主要理念,将儿童视为信息提供者、共同设计者和评估者,而不仅仅是用户。基于先前的文献,我们提出,将儿童置于IR世界的中心并赋予他们积极的角色,可以使IR领域摆脱由成人用户过去使用经验推断出的既有偏见以及仍然占主导地位的系统导向方法。这种转变将使研究人员能够重新审视复杂的基础概念,这些概念极大地影响了IR工具作为社会技术系统在不同领域中的使用。通过这样做,IR从业者可以在不同情境下为儿童和其他未被充分研究的用户群体提供更具包容性和支持性的信息获取体验。|code|0| |Mu2STS: A Multitask Multimodal Sarcasm-Humor-Differential Teacher-Student Model for Sarcastic Meme Detection|Gitanjali Kumari, Chandranath Adak, Asif Ekbal|Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta, India|Memes, a prevalent form of online communication, often express opinions, emotions, and creativity concisely and entertainingly. Amidst the diverse landscape of memes, the realm of sarcastic memes holds a unique position with its foundation in irony, mockery, satire, and messages that diverge from literal meanings. Detecting sarcasm in memes is challenging due to the intricate interplay between sarcasm and humor. While prior research has primarily concentrated on leveraging the relationship between sarcasm and humor for identifying sarcastic memes, our goal in this paper extends beyond establishing a fundamental connection between the two; instead, we aspire to unravel their distinct characteristics and nuances that differentiate sarcasm from humor. To accomplish this, we introduce a novel deep learning model, i.e., Mu2STS (Multitask Multimodal Sarcasm-Humor-Differential Teacher-Student), for sarcasm detection in memes, with a special focus on humor. To bolster Mu2STS, we have developed the SHMH (WARNING: This paper contains meme samples that are offensive in nature.)
|Mu2STS: A Multitask Multimodal Sarcasm-Humor-Differential Teacher-Student Model for Sarcastic Meme Detection|Gitanjali Kumari, Chandranath Adak, Asif Ekbal|Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta, India|Memes, a prevalent form of online communication, often express opinions, emotions, and creativity concisely and entertainingly. Amidst the diverse landscape of memes, the realm of sarcastic memes holds a unique position with its foundation in irony, mockery, satire, and messages that diverge from literal meanings. Detecting sarcasm in memes is challenging due to the intricate interplay between sarcasm and humor. While prior research has primarily concentrated on leveraging the relationship between sarcasm and humor for identifying sarcastic memes, our goal in this paper extends beyond establishing a fundamental connection between the two; instead, we aspire to unravel their distinct characteristics and nuances that differentiate sarcasm from humor. To accomplish this, we introduce a novel deep learning model, i.e., Mu2STS (Multitask Multimodal Sarcasm-Humor-Differential Teacher-Student), for sarcasm detection in memes, with a special focus on humor. To bolster Mu2STS, we have developed the SHMH (Sarcasm-with-Humorous-Meme-in-Hindi) dataset (WARNING: this paper contains meme samples that are offensive in nature), designed for detecting sarcasm and humor in memes written in the Hindi language, which is the first of its kind to the best of our knowledge. Our empirical evaluation, which includes both qualitative and quantitative analyses conducted on the SHMH dataset and some benchmark meme datasets, clearly illustrates the effectiveness of Mu2STS, which outperformed major state-of-the-art models. (The dataset and codes are available at https://www.iitp.ac.in/~ai-nlp-ml/resources.html .)|表情包作为一种流行的在线交流形式,通常以简洁且有趣的方式表达观点、情感和创意。在多样化的表情包领域中,讽刺表情包以其基于反讽、嘲弄、讽刺和与字面意义相悖的信息而占据独特地位。由于讽刺与幽默之间错综复杂的相互作用,检测表情包中的讽刺具有挑战性。尽管先前的研究主要集中在利用讽刺与幽默的关系来识别讽刺表情包,但本文的目标不仅仅是建立两者之间的基本联系;相反,我们致力于揭示区分讽刺与幽默的独特特征和细微差别。为了实现这一目标,我们提出了一种新颖的深度学习模型,即Mu2STS(多任务多模态讽刺-幽默差异教师-学生模型),用于检测表情包中的讽刺,特别关注幽默。为了支持Mu2STS,我们开发了SHMH(印地语幽默讽刺表情包)数据集(警告:本文包含具有冒犯性质的表情包样本),该数据集专为检测印地语表情包中的讽刺和幽默而设计,据我们所知,这是首个此类数据集。我们的实证评估包括在SHMH数据集和一些基准表情包数据集上进行的定性和定量分析,结果清楚地展示了Mu2STS的有效性,其表现优于主要的最先进模型。(数据集和代码可在https://www.iitp.ac.in/~ai-nlp-ml/resources.html获取。)|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Mu2STS:+A+Multitask+Multimodal+Sarcasm-Humor-Differential+Teacher-Student+Model+for+Sarcastic+Meme+Detection)|0|
|An Adaptive Feature Selection Method for Learning-to-Enumerate Problem|Satoshi Horikawa, Chiyonosuke Nemoto, Keishi Tajima, Masaki Matsubara, Atsuyuki Morishima|Kyoto Univ, Kyoto 6068501, Japan; Univ Tsukuba, 1-2 Kasuga, Tsukuba, Ibaraki 3058550, Japan|In this paper, we propose a method for quickly finding a given number of instances of a target class from a fixed data set. We assume that we have a noisy query consisting of both useful and useless features (e.g., keywords). Our method finds target instances and trains a classifier simultaneously in a greedy strategy: it selects an instance most likely to be of the target class, manually labels it, and adds it to the training set to retrain the classifier, which is used for selecting the next item. In order to quickly inactivate useless query features, our method compares the discriminative power of features, and if a feature is inferior to any other feature, the weight 0 is assigned to the inferior one. The weight is 1 otherwise. The greedy strategy explained above has a problem of bias: the classifier is biased toward target instances found earlier, and deteriorates after running out of similar target instances. To avoid it, when we run out of items that have the superior features, we re-activate the inactivated inferior features. By this mechanism, our method adaptively shifts to new regions in the data space. Our experiment shows that our binary and adaptive feature weighting method outperforms existing methods.|本文提出了一种方法,用于从固定数据集中快速找到给定数量的目标类实例。我们假设有一个包含有用和无用特征(例如关键词)的噪声查询。我们的方法以贪婪策略同时寻找目标实例并训练分类器:它选择最有可能属于目标类的实例,手动标注它,并将其添加到训练集中以重新训练分类器,该分类器用于选择下一个实例。为了快速禁用无用的查询特征,我们的方法比较了特征的判别能力,如果某个特征劣于任何其他特征,则将该特征的权重设为0,否则设为1。上述贪婪策略存在偏差问题:分类器偏向于较早找到的目标实例,并且在用完相似的目标实例后性能下降。为了避免这种情况,当我们用完具有优势特征的实例时,我们重新激活被禁用的劣势特征。通过这种机制,我们的方法能够自适应地转移到数据空间中的新区域。实验表明,我们的二值自适应特征加权方法优于现有方法。|code|0|
|Asking Questions Framework for Oral History Archives|Jan Svec, Martin Bulín, Adam Frémund, Filip Polák|Univ West Bohemia, Fac Sci Appl, Dept Cybernet, Plzen, Czech Republic|The importance of oral history archives in preserving and understanding past experiences is counterbalanced by the challenges encountered in accessing and searching through them, primarily due to their extensive size and the diverse demographics of the speakers. This paper presents an approach combining ASR technology and Transformer-based neural networks into the Asking questions framework. Its primary function is to generate questions accompanied by concise answers that relate to the topics discussed in each interview segment. Additionally, we introduce a semantic continuity model that filters the generated questions, ensuring that only the most relevant ones are retained. This enables a real-time semantic search through thousands of hours of recordings, with the crucial benefit that the speakers' original words remain unaltered and still semantically align with the query. While the method is exemplified using a specific publicly available archive, its applicability extends universally to datasets of a similar nature.|口述历史档案在保存和理解过去经验方面的重要性,与其在访问和搜索过程中遇到的挑战形成了鲜明对比,这主要是由于其庞大的规模和发言者多样的人口统计特征。本文提出了一种方法,将自动语音识别(ASR)技术与基于Transformer的神经网络结合到提问框架中。其主要功能是生成与每段访谈内容相关的问题,并附上简洁的答案。此外,我们引入了一种语义连续性模型,用于过滤生成的问题,确保只保留最相关的问题。这使得能够实时对数千小时的录音进行语义搜索,其关键优势在于发言者的原话保持不变,并且仍然与查询语义对齐。虽然该方法以特定的公开档案为例进行说明,但其适用性可普遍扩展到类似性质的数据集。|code|0|
|Yes, This Is What I Was Looking For! Towards Multi-modal Medical Consultation Concern Summary Generation|Abhisek Tiwari, Shreyangshu Bera, Sriparna Saha, Pushpak Bhattacharyya, Samrat Ghosh||Over the past few years, the use of the Internet for healthcare-related tasks has grown by leaps and bounds, posing a challenge in effectively managing and processing information to ensure its efficient utilization. During moments of emotional turmoil and psychological challenges, we frequently turn to the internet as our initial source of support, choosing this over discussing our feelings with others due to the associated social stigma. In this paper, we propose a new task of multi-modal medical concern summary (MMCS) generation, which provides a short and precise summary of patients' major concerns brought up during the consultation. Nonverbal cues, such as patients' gestures and facial expressions, aid in accurately identifying patients' concerns. Doctors also consider patients' personal information, such as age and gender, in order to describe the medical condition appropriately. Motivated by the potential efficacy of patients' personal context and visual gestures, we propose a transformer-based multi-task, multi-modal intent-recognition, and medical concern summary generation (IR-MMCSG) system. Furthermore, we propose a multitasking framework for intent recognition and medical concern summary generation for doctor-patient consultations. We construct the first multi-modal medical concern summary generation (MM-MediConSummation) corpus, which includes patient-doctor consultations annotated with medical concern summaries, intents, patient personal information, doctor's recommendations, and keywords. Our experiments and analysis demonstrate (a) the significant role of patients' expressions/gestures and their personal information in intent identification and medical concern summary generation, and (b) the strong correlation between intent recognition and patients' medical concern summary generation. The dataset and source code are available at https://github.com/NLP-RL/MMCSG.|过去几年中,互联网在医疗健康相关任务中的使用突飞猛进,这给有效管理和处理信息以确保其高效利用带来了挑战。在情绪动荡和面临心理挑战的时刻,我们常常首先求助于互联网寻求支持,由于相关的社会污名,我们宁愿如此也不愿与他人倾诉感受。在本文中,我们提出了一项新任务:多模态医疗关注摘要(MMCS)生成,它对患者在问诊过程中提出的主要关注点给出简短而精确的总结。非语言线索(如患者的手势和面部表情)有助于准确识别患者的关注点。医生还会考虑患者的个人信息(如年龄和性别),以便恰当地描述病情。基于患者个人背景和视觉手势的潜在功效,我们提出了一个基于Transformer的多任务、多模态意图识别与医疗关注摘要生成(IR-MMCSG)系统。此外,我们为医患问诊提出了一个意图识别和医疗关注摘要生成的多任务框架。我们构建了首个多模态医疗关注摘要生成(MM-MediConSummation)语料库,其中包含标注了医疗关注摘要、意图、患者个人信息、医生建议和关键词的医患问诊记录。我们的实验和分析表明:(a)患者的表情/手势及其个人信息在意图识别和医疗关注摘要生成中起着重要作用;(b)意图识别与患者医疗关注摘要生成之间存在很强的相关性。数据集和源代码可在 https://github.com/NLP-RL/MMCSG 获取。|code|0|
|Interactive Topic Tagging in Community Question Answering Platforms|Radin Hamidi Rad, Silviu Cucerzan, Nirupama Chandrasekaran, Michael Gamon|Microsoft Res, Redmond, WA USA; Toronto Metropolitan Univ, Toronto, ON, Canada|Community question-answering platforms offer new opportunities for users to share knowledge online. Such platforms allow building communities around areas of interest, and enable community members to post questions and have other members answer them. In this paper, we investigate a novel, interactive approach for tagging input questions with relevant topics, which are needed by community question-answering platforms for various tasks such as indexing and routing. Iteratively, we employ explicit feedback from the users who post questions to fine-tune further the tag suggestions for those questions. We show that our proposed method is able to suggest tags efficiently, and outperforms state-of-the-art methods applied to the tag suggestion task.|社区问答平台为用户提供了在线分享知识的新机会。此类平台允许围绕兴趣领域构建社区,并让社区成员发布问题并由其他成员进行回答。在本文中,我们研究了一种新颖的交互式方法,用于为输入问题标注相关主题,这是社区问答平台在索引和路由等任务中所必需的。我们迭代地利用发布问题的用户的显式反馈,进一步优化这些问题的标签建议。我们展示了所提出的方法能够高效地建议标签,并且在标签建议任务中优于现有的最先进方法。|code|0|
|Mitigating Data Sparsity via Neuro-Symbolic Knowledge Transfer|Tommaso Carraro, Alessandro Daniele, Fabio Aiolli, Luciano Serafini|Fdn Bruno Kessler, Data & Knowledge Management Unit, Via Sommarive 18, I-38123 Povo, Italy; Univ Padua, Dept Math, Via Trieste 63, I-35121 Padua, Italy|Data sparsity is a well-known historical limitation of recommender systems that still impacts the performance of state-of-the-art approaches. The literature proposed various ways to mitigate the problem by providing additional information to the model (e.g., hybrid recommendation, knowledge graph-based systems). In particular, one promising technique involves transferring information from other domains or tasks to compensate for sparsity in the target domain, where the recommendations must be performed. Following this idea, we propose a novel approach based on Neuro-Symbolic computing designed for the knowledge transfer task in recommender systems. In particular, we use a Logic Tensor Network (LTN) to train vanilla Latent Factor Models (LFMs) for rating prediction. We show how the LTN can be used to regularize the LFMs using axiomatic knowledge that permits injecting pre-trained information learned by Collaborative Filtering on a different task or domain. Extensive experiments comparing our models with different baselines on two versions of a novel real-world dataset prove our proposal's potential in the knowledge transfer task. In particular, our models outperform the baselines, including those that can encode additional information, suggesting that the knowledge is effectively transferred to the target domain via logical reasoning. Moreover, an experiment that drastically decreases the density of user-item ratings shows that the benefits of the acquired knowledge increase with the sparsity of the dataset, showing the importance of exploiting knowledge from a denser source of information when training data is scarce in the target domain.|数据稀疏性是推荐系统中一个众所周知的历史性限制,至今仍影响着最先进方法的性能。文献中提出了多种方法来缓解这一问题,主要通过向模型提供额外信息(例如,混合推荐、基于知识图谱的系统)。特别是,一种有前景的技术涉及从其他领域或任务中转移信息,以补偿目标领域中推荐必须执行的稀疏性。遵循这一思路,我们提出了一种基于神经符号计算的新方法,专门设计用于推荐系统中的知识转移任务。具体而言,我们使用逻辑张量网络(LTN)来训练用于评分预测的普通潜在因子模型(LFM)。我们展示了如何利用LTN通过公理知识来正则化LFM,这些公理知识允许注入通过协同过滤在不同任务或领域上学习的预训练信息。通过在一个新颖的真实世界数据集的两个版本上,将我们的模型与不同基线模型进行广泛实验,证明了我们提出的方法在知识转移任务中的潜力。特别是,我们的模型优于包括那些能够编码额外信息的基线模型,这表明知识通过逻辑推理有效地转移到了目标领域。此外,一个大幅降低用户-项目评分密度的实验表明,所获得知识的好处随着数据集稀疏性的增加而增加,这表明当目标领域的训练数据稀缺时,利用来自更密集信息源的知识的重要性。|code|0|
|Enhancing Legal Named Entity Recognition Using RoBERTa-GCN with CRF: A Nuanced Approach for Fine-Grained Entity Recognition|Arihant Jain, Raksha Sharma|Indian Inst Technol Roorkee, Roorkee, India|Accurate identification of named entities is pivotal for the advancement of sophisticated legal Artificial Intelligence (AI) applications. However, the legal domain presents distinct challenges due to the presence of fine-grained, domain-specific entities, including lawyers, judges, courts, and precedents. This necessitates a nuanced approach to Named Entity Recognition (NER). In this paper, we introduce a novel NER approach tailored to the legal domain. Our system combines Robustly Optimized BERT (RoBERTa) with a Graph Convolutional Network (GCN) to harness two distinct types of complementary information related to words in the data. Furthermore, the application of a Conditional Random Field (CRF) at the output layer ensures global consistency in data labeling by considering the entire sequence when predicting a named entity. RoBERTa captures contextual information about individual words, while GCN allows us to exploit the mutual relationships between words, resulting in more precise named entity identification. Our results indicate that RoBERTa-GCN (CRF) outperforms other standard settings, such as, RoBERTa, textGCN, and BiLSTM, including state-of-the-art for NER in the legal domain.|精确识别命名实体对于推动复杂的法律人工智能(AI)应用的发展至关重要。然而,由于法律领域中存在细粒度、特定领域的实体,如律师、法官、法院和判例,这给命名实体识别(NER)带来了独特的挑战,因此需要一种细致入微的方法。本文提出了一种专门针对法律领域的新型NER方法。我们的系统结合了鲁棒优化BERT(RoBERTa)和图卷积网络(GCN),以利用数据中与词语相关的两种不同类型的互补信息。此外,在输出层应用条件随机场(CRF)通过考虑整个序列来预测命名实体,从而确保数据标签的全局一致性。RoBERTa捕捉单个词语的上下文信息,而GCN使我们能够利用词语之间的相互关系,从而实现更精确的命名实体识别。我们的结果表明,RoBERTa-GCN(CRF)优于其他标准设置,如RoBERTa、textGCN和BiLSTM,包括法律领域NER的最先进技术。|code|0|
|A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation Using GPT|Subhankar Maity, Aniket Deroy, Sudeshna Sarkar||We introduce a multi-stage prompting approach (MSP) for the generation of multiple choice questions (MCQs), harnessing the capabilities of GPT models such as text-davinci-003 and GPT-4, renowned for their excellence across various NLP tasks. Our approach incorporates the innovative concept of chain-of-thought prompting, a progressive technique in which the GPT model is provided with a series of interconnected cues to guide the MCQ generation process. Automated evaluations consistently demonstrate the superiority of our proposed MSP method over the traditional single-stage prompting (SSP) baseline, resulting in the production of high-quality distractors. Furthermore, the one-shot MSP technique enhances automatic evaluation results, contributing to improved distractor generation in multiple languages, including English, German, Bengali, and Hindi. In human evaluations, questions generated using our approach exhibit superior levels of grammaticality, answerability, and difficulty, highlighting its efficacy in various languages. (An illustrative sketch of such a prompt chain appears after the table.)|我们提出了一种用于生成多项选择题(MCQ)的多阶段提示方法(MSP),利用了以在各类 NLP 任务中表现卓越而著称的 GPT 模型(如 text-davinci-003 和 GPT-4)。我们的方法融入了思维链提示这一创新概念,即逐步向 GPT 模型提供一系列相互关联的线索来引导 MCQ 的生成过程。自动评估一致表明,我们提出的 MSP 方法优于传统的单阶段提示(SSP)基线,能够生成高质量的干扰项。此外,单样本(one-shot)MSP 技术进一步提升了自动评估结果,有助于改进包括英语、德语、孟加拉语和印地语在内的多种语言的干扰项生成。在人工评估中,使用我们的方法生成的问题在语法性、可回答性和难度方面表现更优,凸显了该方法在多种语言中的有效性。|code|0|
|A Study on Hierarchical Text Classification as a Seq2seq Task|Fatos Torba, Christophe Gravier, Charlotte Laclau, Abderrhammen Kammoun, Julien Subercaze|Inst Polytech Paris, Telecom Paris, Paris, France; CNRS, Lab Hubert Curien, UMR 5516, St Etienne, France; AItenders, St Etienne, France|With the progress of generative neural models, Hierarchical Text Classification (HTC) can be cast as a generative task. In this case, given an input text, the model generates the sequence of predicted class labels taken from a label tree of arbitrary width and depth. Treating HTC as a generative task introduces multiple modeling choices. These choices vary from choosing the order for visiting the class tree and therefore defining the order of generating tokens, choosing either to constrain the decoding to labels that respect the previous level predictions, up to choosing the pre-trained Language Model itself. Each HTC model therefore differs from the others from an architectural standpoint, but also from the modeling choices that were made. Prior contributions lack transparent modeling choices and open implementations, hindering the assessment of whether model performance stems from architectural or modeling decisions. For these reasons, we propose with this paper an analysis of the impact of different modeling choices along with common model errors and successes for this task. This analysis is based on an open framework coming along this paper that can facilitate the development of future contributions in the field by providing datasets, metrics, error analysis toolkit and the capability to readily test various modeling choices for one given model.|随着生成式神经模型的进步,层次文本分类(Hierarchical Text Classification, HTC)可以被视为一种生成任务。在这种情况下,给定一个输入文本,模型会生成从任意宽度和深度的标签树中提取的预测类别标签序列。将HTC视为生成任务引入了多种建模选择。这些选择包括选择访问类别树的顺序,从而定义生成标记的顺序,选择是否限制解码以尊重前一级预测的标签,以及选择预训练的语言模型本身。因此,每个HTC模型不仅在架构上有所不同,还在建模选择上有所差异。以往的研究缺乏透明的建模选择和开放的实现,这阻碍了评估模型性能是否源于架构或建模决策。基于这些原因,我们通过本文提出了对不同建模选择的影响以及该任务中常见模型错误和成功案例的分析。该分析基于本文附带的一个开放框架,该框架通过提供数据集、指标、错误分析工具包以及能够轻松测试给定模型的各种建模选择的能力,促进了该领域未来研究的发展。|code|0|
|MFVIEW: Multi-modal Fake News Detection with View-Specific Information Extraction|Marium Malik, Jiaojiao Jiang, Yang Song, Sanjay Jha|Univ New South Wales, Sch Comp Sci & Engn, Sydney, NSW, Australia|The spread of fake news on social media is a rapidly growing problem that is impacting both the general public and the government. Current methods for detecting false news often fail to take full advantage of the multi-modal information that is available, which can lead to inconsistent decisions due to modality ambiguity. Moreover, existing methods often overlook the unique information pertaining to view-specific details that could significantly boost their discriminative power and overall performance. To this end, we introduce a novel model, MFVIEW (Multi-Modal Fake News Detection with View-Specific Information Extraction), that unifies the modeling of multi-modal and view-specific information within a single framework. Specifically, the proposed model consists of a View-Specific Information Extractor that incorporates an orthogonal constraint within the shared subspace, enabling the utilization of discriminative information unique to each modality, and an Ambiguity Cross-Training Module that detects inherent ambiguity across different modalities by capturing their correlation. Extensive experiments on two publicly available datasets show that MFVIEW outperforms state-of-the-art fake news detection approaches with an accuracy of 91.0% on the Twitter dataset and 93.3% on the Weibo dataset.|社交媒体上虚假新闻的传播是一个迅速增长的问题,对公众和政府都产生了影响。现有的虚假新闻检测方法往往未能充分利用可用的多模态信息,这可能导致因模态模糊性而产生不一致的决策。此外,现有方法通常忽略了与特定视角相关的独特信息,这些信息可以显著提升其判别能力和整体性能。为此,我们提出了一种新颖的模型,MFVIEW(基于多模态和视角特定信息提取的虚假新闻检测),该模型将多模态和视角特定信息的建模统一在一个框架中。具体而言,所提出的模型包括一个视角特定信息提取器,该提取器在共享子空间中引入了正交约束,从而能够利用每个模态独有的判别信息,以及一个模糊性交叉训练模块,该模块通过捕捉不同模态之间的相关性来检测它们之间的固有模糊性。在两个公开可用的数据集上进行的大量实验表明,MFVIEW在虚假新闻检测方面的表现优于现有的最先进方法,在Twitter数据集上的准确率达到91.0%,在微博数据集上的准确率达到93.3%。|code|0|
|Navigating Uncertainty: Optimizing API Dependency for Hallucination Reduction in Closed-Book QA|Pierre Erbacher, Louis Falissard, Vincent Guigue, Laure Soulier|Sorbonne Univ, Paris, France; Sorbonne Ctr Artificial Intelligence, Paris, France; Paris Saclay, AgroParisTech, Gif Sur Yvette, France|While Large Language Models (LLM) are able to accumulate and restore knowledge, they are still prone to hallucination. Especially when faced with factual questions, LLM cannot only rely on knowledge stored in parameters to guarantee truthful and correct answers. Augmenting these models with the ability to search on external information sources, such as the web, is a promising approach to ground knowledge to retrieve information. However, searching in a large collection of documents introduces additional computational/time costs. An optimal behavior would be to query external resources only when the LLM is not confident about answers. In this paper, we propose a new LLM able to self-estimate if it is able to answer directly or needs to request an external tool. We investigate a supervised approach by introducing a hallucination masking mechanism in which labels are generated using a close book question-answering task. In addition, we propose to leverage parameter-efficient fine-tuning techniques to train our model on a small amount of data. Our model directly provides answers for 78.2% of the known queries and opts to search for 77.2% of the unknown ones. This results in the API being utilized only 62% of the time.|尽管大型语言模型(LLM)能够积累和恢复知识,但它们仍然容易出现幻觉问题。特别是在面对事实性问题时,LLM不能仅依赖存储在参数中的知识来保证答案的真实性和正确性。通过增强这些模型的能力,使其能够搜索外部信息源(如网络),是一种有前景的方法,可以将知识落地以检索信息。然而,在大量文档集合中搜索会引入额外的计算/时间成本。最优的行为应是在LLM对答案不自信时才查询外部资源。在本文中,我们提出了一种新的LLM,它能够自我评估是否能够直接回答问题,或者需要请求外部工具。我们研究了一种监督方法,通过引入幻觉掩码机制来生成标签,该机制使用闭卷问答任务生成标签。此外,我们提出利用参数高效微调技术,在少量数据上训练我们的模型。我们的模型直接为78.2%的已知查询提供了答案,并选择为77.2%的未知查询进行搜索。这使得API的使用率仅为62%。|code|0|
|Can We Predict QPP? An Approach Based on Multivariate Outliers|Adrian-Gabriel Chifu, Sébastien Déjean, Moncef Garouani, Josiane Mothe, Diégo Ortiz, Md Zia Ullah||Query performance prediction (QPP) aims to forecast the effectiveness of a search engine across a range of queries and documents. While state-of-the-art predictors offer a certain level of precision, their accuracy is not flawless. Prior research has recognized the challenges inherent in QPP but often lacks a thorough qualitative analysis. In this paper, we delve into QPP by examining the factors that influence the predictability of query performance accuracy. We propose the working hypothesis that while some queries are readily predictable, others present significant challenges. By focusing on outliers, we aim to identify the queries that are particularly challenging to predict. To this end, we employ a multivariate outlier detection method. Our results demonstrate the effectiveness of this approach in identifying queries on which QPP does not perform well, yielding less reliable predictions. Moreover, we provide evidence that excluding these hard-to-predict queries from the analysis significantly enhances the overall accuracy of QPP. (An illustrative outlier-detection sketch appears after the table.)|查询性能预测(QPP)旨在预测一个搜索引擎在一系列查询和文档中的有效性。虽然最先进的预测器提供了一定程度的精确性,但它们的准确性并非完美无缺。先前的研究已经认识到 QPP 固有的挑战,但往往缺乏一个彻底的定性分析。在本文中,我们通过研究影响查询性能准确性可预测性的因素来深入研究 QPP。我们提出这样一个工作假设:尽管一些查询是容易预测的,但其他查询会带来重大挑战。通过关注异常值,我们的目标是识别特别难以预测的查询。为此,我们采用了多元异常检测方法。我们的研究结果证明了这种方法在识别 QPP 表现不佳(即预测结果不太可靠)的查询方面的有效性。此外,我们提供的证据表明,将这些难以预测的查询从分析中排除能显著提高 QPP 的整体准确性。|code|0|
|SALSA: Salience-Based Switching Attack for Adversarial Perturbations in Fake News Detection Models|Chahat Raj, Anjishnu Mukherjee, Hemant Purohit, Antonios Anastasopoulos, Ziwei Zhu|George Mason Univ, Fairfax, VA 22030 USA|Despite advances in fake news detection algorithms, recent research reveals that machine learning-based fake news detection models are still vulnerable to carefully crafted adversarial attacks. In this landscape, traditional methods, often relying on text perturbations or heuristic-based approaches, have proven insufficient, revealing a critical need for more nuanced and context-aware strategies to enhance the robustness of fake news detection. Our research identifies and addresses three critical areas: creating subtle perturbations, preserving core information while modifying sentence structure, and incorporating inherent interpretability. We propose SALSA, an adversarial Salience-based Switching Attack strategy that harnesses salient words, using similarity-based switching to address the shortcomings of traditional adversarial attack methods. Using SALSA, we perform a two-way attack: misclassifying real news as fake and fake news as real. Due to the absence of standardized metrics to evaluate adversarial attacks in fake news detection, we further propose three new evaluation metrics to gauge the attack's success. Finally, we validate the transferability of our proposed attack strategy across attacker and victim models, demonstrating our approach's broad applicability and potency. Code and data are available here at https://github.com/iamshnoo/salsa .|尽管虚假新闻检测算法取得了进展,但最近的研究表明,基于机器学习的虚假新闻检测模型仍然容易受到精心设计的对抗攻击的影响。在这一背景下,传统方法通常依赖于文本扰动或基于启发式的方法,已被证明是不够的,这揭示了对更细致和上下文感知策略的迫切需求,以增强虚假新闻检测的鲁棒性。我们的研究识别并解决了三个关键领域:创建微妙的扰动、在修改句子结构的同时保留核心信息,以及融入固有的可解释性。我们提出了SALSA,一种基于显著性的对抗性切换攻击策略,该策略利用显著词,采用基于相似性的切换来应对传统对抗攻击方法的不足。使用SALSA,我们执行了双向攻击:将真实新闻误分类为虚假新闻,将虚假新闻误分类为真实新闻。由于缺乏标准化指标来评估虚假新闻检测中的对抗攻击,我们进一步提出了三个新的评估指标来衡量攻击的成功。最后,我们验证了我们提出的攻击策略在攻击者和受害者模型之间的可转移性,展示了我们方法的广泛适用性和效力。代码和数据可在https://github.com/iamshnoo/salsa 获取。|code|0|
|FakeClaim: A Multiple Platform-Driven Dataset for Identification of Fake News on 2023 Israel-Hamas War|Gautam Kishore Shahi, Amit Kumar Jaiswal, Thomas Mandl||We contribute the first publicly available dataset of factual claims from different platforms and fake YouTube videos on the 2023 Israel-Hamas war for automatic fake YouTube video classification. The FakeClaim data is collected from 60 fact-checking organizations in 30 languages and enriched with metadata from the fact-checking organizations curated by trained journalists specialized in fact-checking. Further, we classify fake videos within the subset of YouTube videos using textual information and user comments. We used a pre-trained model to classify each video with different feature combinations. Our best-performing fine-tuned language model, Universal Sentence Encoder (USE), achieves a Macro F1 of 87%, which shows that the trained model can be helpful for debunking fake videos using the comments from the user discussion. The dataset is available on GitHub [https://github.com/Gautamshahi/FakeClaim]|我们贡献了首个公开可用的数据集,其中包含来自不同平台的事实核查声明以及关于2023年以色列-哈马斯战争的虚假 YouTube 视频,用于虚假 YouTube 视频的自动分类。FakeClaim 数据收集自60个事实核查组织,涵盖30种语言,并利用由受过事实核查专业训练的记者整理的事实核查组织元数据加以丰富。此外,我们利用文本信息和用户评论对 YouTube 视频子集中的虚假视频进行分类。我们使用预训练模型结合不同的特征组合对每个视频进行分类。表现最好的微调语言模型,即通用句子编码器(USE),取得了87%的宏 F1 值,这表明训练好的模型可以借助用户讨论中的评论帮助揭穿虚假视频。数据集可在 GitHub (https://github.com/Gautamshahi/FakeClaim) 获取。|code|0|
|MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries|Akash Ghosh, Arkadeep Acharya, Prince Jha, Sriparna Saha, Aniket Gaudgaul, Rajdeep Majumdar, Aman Chadha, Raghav Jain, Setu Sinha, Shivani Agarwal||In the healthcare domain, summarizing medical questions posed by patients is critical for improving doctor-patient interactions and medical decision-making. Although medical data has grown in complexity and quantity, the current body of research in this domain has primarily concentrated on text-based methods, overlooking the integration of visual cues. Also prior works in the area of medical question summarisation have been limited to the English language. This work introduces the task of multimodal medical question summarization for codemixed input in a low-resource setting. To address this gap, we introduce the Multimodal Medical Codemixed Question Summarization (MMCQS) dataset, which combines Hindi-English codemixed medical queries with visual aids. This integration enriches the representation of a patient's medical condition, providing a more comprehensive perspective. We also propose a framework named MedSumm that leverages the power of LLMs and VLMs for this task. By utilizing our MMCQS dataset, we demonstrate the value of integrating visual information from images to improve the creation of medically detailed summaries. This multimodal strategy not only improves healthcare decision-making but also promotes a deeper comprehension of patient queries, paving the way for future exploration in personalized and responsive medical care. Our dataset, code, and pre-trained models will be made publicly available.|在医疗保健领域,对患者提出的医疗问题进行摘要对于改善医患互动和医疗决策至关重要。尽管医疗数据在复杂性和数量上都在增长,但该领域现有研究主要集中于基于文本的方法,忽视了视觉线索的整合。此外,以往医疗问题摘要方面的工作仅限于英语。本文提出了在低资源环境下针对代码混合输入的多模态医疗问题摘要任务。为弥补这一空白,我们引入了多模态医疗代码混合问题摘要(MMCQS)数据集,它将印地语-英语代码混合的医疗询问与视觉辅助信息相结合。这种整合丰富了对患者病情的表示,提供了更全面的视角。我们还提出了一个名为 MedSumm 的框架,利用 LLM 和 VLM 的能力来完成此任务。通过使用 MMCQS 数据集,我们展示了整合图像视觉信息以改进医学细节摘要生成的价值。这种多模态策略不仅改善了医疗决策,还促进了对患者询问的更深入理解,为个性化、响应式医疗的未来探索铺平了道路。我们的数据集、代码和预训练模型将公开发布。|code|0|
|The Open Web Index - Crawling and Indexing the Web for Public Use|Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, Benno Stein|German Aerosp Ctr DLR, Cologne, Germany; CERN, Geneva, Switzerland; Bauhaus Univ Weimar, Weimar, Germany; Univ Passau, Passau, Germany; Radboud Univ Nijmegen, Nijmegen, Netherlands; Friedrich Schiller Univ Jena, Jena, Germany; Univ Leipzig, Leipzig, Germany|Only few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search fully depend on the terms and conditions of the few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index. The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index—for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.|目前只有少数搜索引擎能够大规模地索引整个网络。希望基于网络搜索开发下游应用的第三方完全依赖于这些少数供应商的条款和条件。尽管大规模公共爬取数据(Common Crawl)的公开可用性并未改善这一局面,因为针对特定下游应用场景爬取和索引较小规模的数据集通常比构建和维护一个与Common Crawl规模相当的通用索引更加经济。我们的目标是通过开发开放网络索引(Open Web Index)来改善这一现状。开放网络索引是一个由公共资助的基础设施,下游应用将能够以简单透明的方式从中选择和编译自定义索引。我们的目标是建立开放网络索引及其相关数据产品,使其成为一种新的开放网络信息中介。在本文中,我们展示了开放网络索引的第一个原型以及未来的发展规划。除了概念和技术背景外,我们还讨论了信息检索社区如何从开放网络索引中受益并为其做出贡献——例如,通过提供资源、预处理组件和流程,或创建新型垂直搜索引擎和测试集。|code|0|
|Towards Robust Expert Finding in Community Question Answering Platforms|Maddalena Amendola, Andrea Passarella, Raffaele Perego|ISTI CNR, Pisa, Italy; Univ Pisa, Pisa, Italy; IIT CNR, Pisa, Italy|This paper introduces TUEF, a topic-oriented user-interaction model for fair Expert Finding in Community Question Answering (CQA) platforms. The Expert Finding task in CQA platforms involves identifying proficient users capable of providing accurate answers to questions from the community. To this aim, TUEF improves the robustness and credibility of the CQA platform through a more precise Expert Finding component. The key idea of TUEF is to exploit diverse types of information, specifically, content and social information, to identify experts more precisely, thus improving the robustness of the task. We assess TUEF through reproducible experiments conducted on a large-scale dataset from StackOverflow. The results consistently demonstrate that TUEF outperforms state-of-the-art competitors while promoting transparent expert identification.|本文介绍了TUEF,一种面向主题的用户交互模型,用于在社区问答(CQA)平台中实现公平的专家发现。CQA平台中的专家发现任务涉及识别能够为社区问题提供准确答案的熟练用户。为此,TUEF通过更精确的专家发现组件提高了CQA平台的鲁棒性和可信度。TUEF的核心思想是利用多种类型的信息,特别是内容信息和社会信息,以更精确地识别专家,从而提高任务的鲁棒性。我们通过在StackOverflow的大规模数据集上进行的可重复实验对TUEF进行了评估。结果一致表明,TUEF在促进透明专家识别的同时,优于现有的最先进竞争对手。|code|0|
|Interactive Document Summarization|Raoufdine Said, Adrien Guille|Univ Lyon, ERIC UR 3083, Lyon, France|With the advent of modern chatbots, automatic summarization is becoming common practice to quicken access to information. However the summaries they generate can be biased, unhelpful or untruthful. Hence, in sensitive scenarios, extractive summarization remains a more reliable approach. In this paper we present an original extractive method combining a GNN-based encoder and a RNN-based decoder, coupled with a user-friendly interface that allows for interactive summarization.|随着现代聊天机器人的出现,自动摘要已成为加快信息获取的常见做法。然而,它们生成的摘要可能存在偏见、无用或不真实的情况。因此,在敏感场景中,抽取式摘要仍然是一种更为可靠的方法。本文提出了一种创新的抽取式方法,结合了基于图神经网络(GNN)的编码器和基于循环神经网络(RNN)的解码器,并配备了一个用户友好的界面,支持交互式摘要生成。|code|0|
|Physio: An LLM-Based Physiotherapy Advisor|Rúben Almeida, Hugo O. Sousa, Luís Filipe Cunha, Nuno Guimarães, Ricardo Campos, Alípio Jorge||The capabilities of the most recent language models have increased the interest in integrating them into real-world applications. However, the fact that these models generate plausible, yet incorrect text poses a constraint when considering their use in several domains. Healthcare is a prime example of a domain where text-generative trustworthiness is a hard requirement to safeguard patient well-being. In this paper, we present Physio, a chat-based application for physical rehabilitation. Physio is capable of making an initial diagnosis while citing reliable health sources to support the information provided. Furthermore, drawing upon external knowledge databases, Physio can recommend rehabilitation exercises and over-the-counter medication for symptom relief. By combining these features, Physio can leverage the power of generative models for language processing while also conditioning its response on dependable and verifiable sources. A live demo of Physio is available at https://physio.inesctec.pt.|最新语言模型的能力提升增加了将其集成到现实应用中的兴趣。然而,这些模型会生成看似合理却不正确的文本,这在若干领域限制了它们的使用。医疗保健就是一个典型例子:在该领域,文本生成的可信度是保障患者健康的硬性要求。在本文中,我们介绍了 Physio,一个面向身体康复的聊天式应用。Physio 能够做出初步诊断,并引用可靠的健康信息来源来支持所提供的信息。此外,借助外部知识库,Physio 可以推荐康复训练和用于缓解症状的非处方药物。通过结合这些功能,Physio 既能利用生成式语言模型的能力,又能使其回答以可靠、可验证的来源为依据。Physio 的在线演示可访问 https://physio.inesctec.pt。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=Physio:+An+LLM-Based+Physiotherapy+Advisor)|0|
|eval-rationales: An End-to-End Toolkit to Explain and Evaluate Transformers-Based Models|Khalil Maachou, Jesús Lovón-Melgarejo, José G. Moreno, Lynda Tamine|Univ Paul Sabatier, CNRS, IRIT, UMR 5505, Toulouse, France|State-of-the-art (SOTA) transformer-based models in the domains of Natural Language Processing (NLP) and Information Retrieval (IR) are often characterized by their opacity in terms of decision-making processes. This limitation has given rise to various techniques for enhancing model interpretability and the emergence of evaluation benchmarks aimed at designing more transparent models. These techniques are primarily focused on developing interpretable models with the explicit aim of shedding light on the rationales behind their predictions. Concurrently, evaluation benchmarks seek to assess the quality of these rationales provided by the models. Despite the availability of numerous resources for using these techniques and benchmarks independently, their seamless integration remains a non-trivial task. In response to this challenge, this work introduces an end-to-end toolkit that integrates the most common techniques and evaluation approaches for interpretability. Our toolkit offers user-friendly resources facilitating fast and robust evaluations.|在自然语言处理(NLP)和信息检索(IR)领域,基于Transformer的最先进(SOTA)模型通常以其决策过程的不透明性为特征。这一局限性催生了各种增强模型可解释性的技术,以及旨在设计更透明模型的评估基准的出现。这些技术主要集中于开发可解释模型,明确目标是揭示其预测背后的推理过程。同时,评估基准则旨在评估模型提供的这些推理过程的质量。尽管有许多资源可以独立使用这些技术和基准,但它们的无缝集成仍然是一项非平凡的任务。针对这一挑战,本研究引入了一个端到端的工具包,该工具包集成了最常见的可解释性技术和评估方法。我们的工具包提供了用户友好的资源,以促进快速且稳健的评估。|code|0|
|VADIS - A Variable Detection, Interlinking and Summarization System|Yavuz Selim Kartal, Muhammad Ahsan Shahid, Sotaro Takeshita, Tornike Tsereteli, Andrea Zielinski, Benjamin Zapilko, Philipp Mayr||The VADIS system addresses the demand of providing enhanced information access in the domain of the social sciences. This is achieved by allowing users to search and use survey variables in context of their underlying research data and scholarly publications which have been interlinked with each other.|VADIS 系统满足了增强社会科学领域信息获取的需求。这是通过允许用户在其相互关联的基础研究数据和学术出版物的背景下搜索和使用调查变量来实现的。|code|0|
|Building and Evaluating a WebApp for Effortless Deep Learning Model Deployment|Ruikun Wu, Jiaxuan Han, Jerome Ramos, Aldo Lipani|UCL, London, England|In the field of deep learning, particularly Natural Language Processing (NLP), model deployment is a key process for public testing and analysis. However, developing a deployment pipeline is often difficult and time-consuming. To address this challenge, we developed SUD.DL, a web application to simplify the model deployment process for NLP researchers. Our application provides significant improvements in deployment efficiency, functionality discoverability, and deployment functionality, allowing NLP researchers to quickly deploy and test models on the web.|在深度学习领域,尤其是自然语言处理(NLP)中,模型部署是进行公开测试和分析的关键过程。然而,开发一个部署管道通常既困难又耗时。为了解决这一挑战,我们开发了SUD.DL,这是一个旨在简化NLP研究人员模型部署过程的网络应用程序。我们的应用程序在部署效率、功能可发现性以及部署功能方面提供了显著的改进,使得NLP研究人员能够快速地在网络上部署和测试模型。|code|0|
|indxr: A Python Library for Indexing File Lines|Elias Bassani, Nicola Tonellotto|Univ Pisa, Pisa, Italy|indxr is a Python utility for indexing file lines that allows users to dynamically access specific ones, avoiding loading the entire file in the computer's main memory. indxr addresses two main issues related to working with textual data. First, users who do not have plenty of RAM at their disposal may struggle to work with large datasets. Since indxr allows accessing specific lines without loading entire files, users can work with datasets that do not fit into their computer's main memory. For example, it enables users to perform complex tasks with limited RAM without noticeable slowdowns, such as pre-processing texts and training Neural models for Information Retrieval or other tasks. Second, indxr reduces the burden of working with datasets split among multiple files by allowing users to load specific data by providing the related line numbers or the identifiers of the information they describe, thus providing convenient access to such data. This paper overviews indxr's main features. ( https://github.com/AmenRa/indxr ) (An illustrative sketch of offset-based line indexing appears after the table.)|indxr 是一个用于索引文件行的 Python 工具,它允许用户动态访问特定的行,从而避免将整个文件加载到计算机的主内存中。indxr 解决了与处理文本数据相关的两个主要问题。首先,对于那些没有足够 RAM 资源的用户来说,处理大型数据集可能会变得困难。由于 indxr 允许在不加载整个文件的情况下访问特定行,用户能够处理那些无法完全放入计算机主内存的数据集。例如,它使得用户可以在有限的 RAM 资源下执行复杂的任务而不会出现明显的性能下降,例如文本预处理以及为信息检索或其他任务训练神经网络模型。其次,indxr 通过允许用户通过提供相关行号或信息描述的标识符来加载特定数据,从而减轻了处理分布在多个文件中的数据集的负担,提供了便捷的数据访问方式。本文概述了 indxr 的主要功能。(https://github.com/AmenRa/indxr)|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=indxr:+A+Python+Library+for+Indexing+File+Lines)|0|
|SciSpace Literature Review: Harnessing AI for Effortless Scientific Discovery|Siddhant Jain, Asheesh Kumar, Trinita Roy, Kartik Shinde, Goutham Vignesh, Rohan Tondulkar|SciSpace, Bengaluru, India|In the rapidly evolving landscape of academia, the scientific research community barely copes with the challenges posed by a surging volume of scientific literature. Nevertheless, discovering research remains an important step in the research workflow which is also proven to be a challenging one to automate. We present Scispace Literature Review, a sophisticated, multi-faceted tool that serves as a comprehensive solution to streamline the literature review process. By leveraging the state-of-the-art methods in vector-based search, reranking, and large language models, the tool delivers features like customizable search results, data integration with an AI assistant, multi-language support, top papers insights, and customizable results columns to cater a researcher's requirements, and accelerate literature exploration. Resources for simplified sharing and documentation further enhance the scope and depth and breadth of research. We demonstrate the extensive use and popularity of the tool among researchers with various metrics, highlighting its value as a resource to elevate scientific literature review. This tool can be tried using this link: https://typeset.io/search .|在快速变化的学术领域,科研界几乎难以应对激增的科学文献所带来的挑战。然而,发现研究仍然是研究流程中的一个重要步骤,而这一步骤也被证明是自动化中的难点。我们推出了Scispace文献综述工具,这是一个复杂且多功能的工具,旨在简化文献综述过程,提供全面的解决方案。通过利用基于向量的搜索、重排序和大语言模型等最先进的方法,该工具提供了可定制的搜索结果、与AI助手的数据集成、多语言支持、顶级论文洞察以及可定制的结果列等功能,以满足研究者的需求,并加速文献探索。简化的共享和文档资源进一步增强了研究的深度和广度。我们通过各种指标展示了该工具在研究者中的广泛使用和受欢迎程度,突显了其作为提升科学文献综述价值的资源。该工具可通过此链接试用:https://typeset.io/search。|[code](https://paperswithcode.com/search?q_meta=&q_type=&q=SciSpace+Literature+Review:+Harnessing+AI+for+Effortless+Scientific+Discovery)|0|
|Let's Get It Started: Fostering the Discoverability of New Releases on Deezer|Léa Briand, Théo Bontempelli, Walid Bendada, Mathieu Morlon, François Rigaud, Benjamin Chapus, Thomas Bouabça, Guillaume Salha-Galvan|Deezer Res, Paris, France|This paper presents our recent initiatives to foster the discoverability of new releases on the music streaming service Deezer. After introducing our search and recommendation features dedicated to new releases, we outline our shift from editorial to personalized release suggestions using cold start embeddings and contextual bandits. Backed by online experiments, we discuss the advantages of this shift in terms of recommendation quality and exposure of new releases on the service.|本文介绍了我们在音乐流媒体服务Deezer上为提升新发布内容的可发现性所采取的最新举措。在介绍了我们专门为新发布内容设计的搜索和推荐功能后,我们概述了如何通过冷启动嵌入和上下文老虎机算法,从编辑推荐转向个性化发布推荐。通过在线实验的支持,我们讨论了这一转变在推荐质量和平台上新发布内容的曝光度方面的优势。|code|0|
|Augmenting KG Hierarchies Using Neural Transformers|Sanat Sharma, Mayank Poddar, Jayant Kumar, Kosta Blank, Tracy Holloway King|Adobe Inc, San Jose, CA 95110 USA|This work leverages neural transformers to generate hierarchies in an existing knowledge graph. For small (<10,000 node) domain-specific KGs, we find that a combination of few-shot prompting with one-shot generation works well, while larger KG may require cyclical generation. Hierarchy coverage increased by 98% for intents and 95% for colors.|本研究利用神经变换器在现有知识图谱中生成层次结构。针对小型(节点数小于10,000)的特定领域知识图谱,我们发现结合少量样本提示与单样本生成的方法效果显著,而对于更大规模的知识图谱则可能需要循环生成。实验结果表明,意图的层次结构覆盖率提高了98%,颜色的层次结构覆盖率提高了95%。|code|0|
|Document Level Event Extraction from Narratives|Luís Filipe Cunha|Univ Porto, FCUP, Porto, Portugal|One of the fundamental tasks in Information Extraction (IE) is Event Extraction (EE), an extensively studied and challenging task [13,15], which aims to identify and classify events from the text. This involves identifying the event's central word (trigger) and its participants (arguments) [1]. These elements capture the event semantics and structure, which have applications in various fields, including biomedical texts [42], cybersecurity [24], economics [12], literature [32], and history [33]. Structured knowledge derived from EE can also benefit other downstream tasks such as Question Answering [20,30], Natural Language Understanding [21], Knowledge Base Graphs [3,37], summarization [8,10,41] and recommendation systems [9,18]. Despite the existence of several English EE systems [2,22,25,26], they face limited portability to other languages [4] and most of them are designed for closed domains, posing difficulties in generalising. Furthermore, most current EE systems restrict their scope to the sentence level, assuming that all arguments are contained within the same sentence as their corresponding trigger. However, real-world scenarios often involve event arguments spanning multiple sentences, highlighting the need for document-level EE.|信息抽取(Information Extraction, IE)中的一项基本任务是事件抽取(Event Extraction, EE),这是一个被广泛研究且具有挑战性的任务 [13,15],其目标是从文本中识别并分类事件。这包括识别事件的核心词(触发词)及其参与者(论元)[1]。这些元素捕捉了事件的语义和结构,并在多个领域中得到了应用,包括生物医学文本 [42]、网络安全 [24]、经济学 [12]、文学 [32] 和历史 [33]。从事件抽取中获取的结构化知识还可以为其他下游任务提供帮助,例如问答系统 [20,30]、自然语言理解 [21]、知识图谱 [3,37]、摘要生成 [8,10,41] 和推荐系统 [9,18]。尽管已经存在多个英语事件抽取系统 [2,22,25,26],但它们在其他语言中的可移植性有限 [4],并且大多数系统是为封闭领域设计的,难以推广到更广泛的应用场景。此外,当前大多数事件抽取系统将其范围限制在句子级别,假设所有论元都与相应的触发词位于同一句子中。然而,现实场景中的事件论元往往跨越多个句子,这凸显了文档级别事件抽取的必要性。|code|0|
|Shuffling a Few Stalls in a Crowded Bazaar: Potential Impact of Document-Side Fairness on Unprivileged Info-Seekers|Sean Healy|Dublin City Univ, ADAPT Ctr, Dublin, Ireland|Information systems rely on algorithmic ranking to ascertain expected relevance. Concerns about this strategy have resulted in the emergence of a field of inquiry referred to as fair ranking. Within this field, the aim varies between one-sided and two-sided fairness across automatically generated rankings. But research has focused primarily on fairness among document providers as opposed to fairness among searchers. Concerns have already been raised about the present framing of fairness. In the following line of research, a novel framing concern is introduced, whereby researchers may fail to consider the broader context of search engine usage among protected groups of searchers.|信息系统依赖于算法排名来确定预期的相关性。关于这一策略的担忧催生了一个被称为公平排名的研究领域。在该领域中,目标在自动生成的排名中分为单边公平和双边公平。但研究主要集中在文档提供者之间的公平性,而非搜索者之间的公平性。目前关于公平性的框架已经引发了一些担忧。在接下来的研究中,引入了一个新的框架问题,即研究人员可能未能考虑到受保护搜索者群体在使用搜索引擎时的更广泛背景。|code|0|
|Knowledge Transfer from Resource-Rich to Resource-Scarce Environments|Negin Ghasemi||||code|0|
|PhD Candidacy: A Tutorial on Overcoming Challenges and Achieving Success|Johanne R. Trippas, David Maxwell|RMIT Univ, Melbourne, Vic, Australia; Booking Com, Delft, Netherlands|Undertaking a PhD is a demanding yet rewarding experience. PhD candidates develop a deep understanding of their research topic and acquire a wide range of skills, including (i) formulating research questions; (ii) conducting research ethically and rigorously; (iii) communicating research findings effectively to both academic and non-academic audiences alike; (iv) forging a profile as an independent researcher; and (v) developing a teaching portfolio. PhD candidates inevitably experience challenges during their candidature. These challenges can be overcome by applying various techniques to adapt and learn from these experiences. This tutorial introduces strategies to help them advance in the PhD process. It is presented by two early career researchers in information retrieval, who have the unique perspective of being close enough to their time as PhD candidates to remember the highs and lows of PhD life yet far enough removed from the process to reflect on their experiences and provide insights. The tutorial will empower attendees to share, review, and refine productivity methods for their PhD journey. It provides a non-judgemental platform for open discussions led by the presenters.|攻读博士学位是一段充满挑战却又收获颇丰的经历。博士研究生不仅会深入研究自己的课题,还会掌握一系列技能,包括:(1)提出研究问题;(2)以严谨和合乎伦理的方式开展研究;(3)有效地向学术界和非学术界受众传达研究成果;(4)塑造独立研究者的形象;(5)积累教学经验。在攻读博士学位期间,研究生难免会遇到各种挑战。通过运用多种技巧来适应并从这些经历中学习,这些挑战是可以克服的。本教程旨在介绍一些策略,帮助博士研究生在攻读过程中取得进展。教程由两位信息检索领域的早期职业研究者主讲,他们既对博士生活的起伏记忆犹新,又能以过来人的视角反思自己的经历并提供见解。教程将为参与者提供一个无评判的平台,鼓励开放讨论,分享、审视并优化他们在博士旅程中的效率方法。|code|0|
|The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness|Alberto Barrón-Cedeño, Firoj Alam, Tanmoy Chakraborty, Tamer Elsayed, Preslav Nakov, Piotr Przybyła, Julia Maria Struß, Fatima Haouari, Maram Hasanain, Federico Ruggeri, Xingyi Song, Reem Suwaileh|HBKU, Doha, Qatar; Mohamed bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates; Univ Appl Sci Potsdam, Potsdam, Germany; Indian Inst Technol Delhi, New Delhi, India; Univ Bologna, DISI, Bologna, Italy; Univ Sheffield, Sheffield, S Yorkshire, England; HBKU, Qatar Comp Res Inst, Doha, Qatar; Univ Pompeu Fabra, Barcelona, Spain; Univ Bologna, DIT, Forli, Italy; Qatar Univ, Doha, Qatar|The first five editions of the CheckThat! lab focused on the main tasks of the information verification pipeline: check-worthiness, evidence retrieval and pairing, and verification. Since the 2023 edition, it has been focusing on new problems that can support the research and decision making during the verification process. In this new edition, we focus on new problems and —for the first time— we propose six tasks in fifteen languages (Arabic, Bulgarian, English, Dutch, French, Georgian, German, Greek, Italian, Polish, Portuguese, Russian, Slovene, Spanish, and code-mixed Hindi-English): Task 1 estimation of check-worthiness (the only task that has been present in all CheckThat! editions), Task 2 identification of subjectivity (a follow up of CheckThat! 2023 edition), Task 3 identification of persuasion (a follow up of SemEval 2023), Task 4 detection of hero, villain, and victim from memes (a follow up of CONSTRAINT 2022), Task 5 Rumor Verification using Evidence from Authorities (a first), and Task 6 robustness of credibility assessment with adversarial examples (a first). These tasks represent challenging classification and retrieval problems at the document and at the span level, including multilingual and multimodal settings.|前五届 CheckThat! 实验室主要关注信息验证流程中的核心任务:检查价值、证据检索与配对以及验证。自 2023 年版以来,该实验室开始关注能够支持验证过程中研究和决策的新问题。在本届新版本中,我们首次提出了六个任务,涵盖十五种语言(阿拉伯语、保加利亚语、英语、荷兰语、法语、格鲁吉亚语、德语、希腊语、意大利语、波兰语、葡萄牙语、俄语、斯洛文尼亚语、西班牙语以及印地语-英语混合语):任务 1 检查价值估计(这是所有 CheckThat! 版本中唯一始终存在的任务),任务 2 主观性识别(延续自 CheckThat! 2023 版),任务 3 说服力识别(延续自 SemEval 2023),任务 4 从表情包中检测英雄、反派和受害者(延续自 CONSTRAINT 2022),任务 5 基于权威证据的谣言验证(首次提出),以及任务 6 使用对抗样本进行可信度评估的鲁棒性(首次提出)。这些任务代表了文档和片段层面的具有挑战性的分类和检索问题,包括多语言和多模态场景。|code|0|
|ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality|Jussi Karlgren, Luise Dürlich, Evangelia Gogoulou, Liane Guillou, Joakim Nivre, Magnus Sahlgren, Aarne Talman|Silo AI, Helsinki, Finland; RISE Res Inst Sweden, Stockholm, Sweden|ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to bring together some high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. The selected tasks for this first year of ELOQUENT are (1) probing a language model for topical competence; (2) assessing the ability of models to generate and detect hallucinations; (3) assessing the robustness of a model output given variation in the input prompts; and (4) establishing the possibility to distinguish human-generated text from machine-generated text.|ELOQUENT 是一组用于评估生成式语言模型质量和实用性的共享任务。ELOQUENT 旨在整合一些基于实际任务中部署模型经验的高层次质量标准,并为这些标准制定测试方案,尽可能减少人工评估的工作量,并在多语言环境中实施。在ELOQUENT的首年,选定的任务包括:(1)探究语言模型在主题能力方面的表现;(2)评估模型生成和检测幻觉的能力;(3)评估模型输出在面对输入提示变化时的鲁棒性;(4)确立区分人类生成文本与机器生成文本的可能性。|code|0|
|Overview of Touché 2024: Argumentation Systems|Johannes Kiesel, Çagri Çöltekin, Maximilian Heinrich, Maik Fröbe, Milad Alshomary, Bertrand De Longueville, Tomaz Erjavec, Nicolas Handke, Matyás Kopp, Nikola Ljubesic, Katja Meden, Nailia Mirzakhmedova, Vaidas Morkevicius, Theresa Reitis-Münstermann, Mario Scharfbillig, Nicolas Stefanovitch, Henning Wachsmuth, Martin Potthast, Benno Stein|Leibniz Univ Hannover, Hannover, Germany; Arcadia Sistemi Informativi Terr, Milan, Italy; Univ Tubingen, Tubingen, Germany; Univ Kassel, Kassel, Germany; Charles Univ Prague, Prague, Czech Republic; Bauhaus Univ Weimar, Weimar, Germany; Jozef Stefan Inst, Ljubljana, Slovenia; Joint Res Ctr JRC, European Commiss, Brussels, Belgium; Kaunas Univ Technol, Kaunas, Lithuania; Univ Leipzig, Leipzig, Germany; Friedrich Schiller Univ, Jena, Germany|This paper is a condensed overview of Touché: the fifth edition of the lab on argumentation systems that was held at CLEF 2024. With the goal to foster the development of support-technologies for decision-making and opinion-forming, we organized three shared tasks: (1) Human value detection (ValueEval), where participants detect (implicit) references to human values and their attainment in text; (2) Multilingual Ideology and Power Identification in Parliamentary Debates, where participants identify from a speech the political leaning of the speaker's party and whether it was governing at the time of the speech (new task); and (3) Image retrieval or generation in order to convey the premise of an argument visually. In this paper, we describe these tasks, their setup, and participating approaches in detail.|本文是对Touché的简要概述:2024年CLEF会议上举办的第五届论证系统实验室。我们的目标是促进支持决策和意见形成技术的发展,为此我们组织了三个共享任务:(1)人类价值观检测(ValueEval),参与者需要检测文本中对人类价值观及其实现的(隐含)引用;(2)多语言议会辩论中的意识形态和权力识别,参与者从演讲中识别发言者所属政党的政治倾向以及该政党在演讲时是否执政(新任务);(3)图像检索或生成,以便通过视觉传达论证的前提。在本文中,我们详细描述了这些任务、其设置以及参与方法。|code|0|
|eRisk 2024: Depression, Anorexia, and Eating Disorder Challenges|Javier Parapar, Patricia Martín-Rodilla, David E. Losada, Fabio Crestani|Univ Svizzera Italiana USI, Fac Informat, Lugano, Switzerland; Univ Santiago de Compostela, Ctr Singular Invest Tecnol Intelixentes CiTIUS, Santiago, Spain|In 2017, we launched eRisk as a CLEF Lab to encourage research on early risk detection on the Internet. Since then, thanks to the participants' work, we have developed detection models and datasets for depression, anorexia, pathological gambling and self-harm. In 2024, it will be the eighth edition of the lab, where we will present a revision of the sentence ranking for depression symptoms, and the third edition of tasks on early alert of anorexia and eating disorder severity estimation. This paper outlines the work that we have done to date, discusses key lessons learned in previous editions, and presents our plans for eRisk 2024.|2017年,我们启动了eRisk作为CLEF实验室,旨在鼓励互联网早期风险检测的研究。自那时起,凭借参与者的努力,我们已经开发了针对抑郁症、厌食症、病态赌博和自我伤害的检测模型和数据集。2024年,实验室将迎来第八届,届时我们将展示对抑郁症症状句子排名的修订,以及厌食症早期预警和饮食障碍严重程度估计任务的第三版。本文概述了我们迄今为止的工作,讨论了前几届中的关键经验教训,并介绍了我们为eRisk 2024制定的计划。|code|0|
|QuantumCLEF - Quantum Computing at CLEF|Andrea Pasin, Maurizio Ferrari Dacrema, Paolo Cremonesi, Nicola Ferro|Univ Padua, Padua, Italy; Politecn Milan, Milan, Italy|Over the last few years, Quantum Computing (QC) has captured the attention of numerous researchers pertaining to different fields since, due to technological advancements, QC resources have become more available and also applicable in solving practical problems. In the current landscape, Information Retrieval (IR) and Recommender Systems (RS) need to perform computationally intensive operations on massive and heterogeneous datasets. Therefore, it could be possible to use QC and especially Quantum Annealing (QA) technologies to boost systems' performance both in terms of efficiency and effectiveness. The objective of this work is to present the first edition of the QuantumCLEF lab, which is composed of two tasks that aim at: (1) evaluating QA approaches compared to their traditional counterpart; (2) identifying new problem formulations to discover novel methods that leverage the capabilities of QA for improved solutions; (3) establishing collaborations among researchers from different fields to harness their knowledge and skills to solve the considered challenges and promote the usage of QA. This lab will employ the QC resources provided by CINECA, one of the most important computing centers worldwide. We also describe the design of our infrastructure which uses Docker and Kubernetes to ensure scalability, fault tolerance and replicability.|在过去的几年里,量子计算(QC)吸引了来自不同领域的众多研究者的关注,因为随着技术的进步,QC资源变得更加可用,并且能够应用于解决实际问题。在当前背景下,信息检索(IR)和推荐系统(RS)需要对大规模且异构的数据集执行计算密集型操作。因此,利用QC,特别是量子退火(QA)技术来提升系统在效率和效果方面的性能是可行的。本工作的目标是推出QuantumCLEF实验室的首个版本,该实验室包含两个任务,旨在:(1)评估QA方法与传统方法的对比;(2)识别新的问题表述,以发现利用QA能力改进解决方案的新方法;(3)促进不同领域研究者之间的合作,利用他们的知识和技能来解决所面临的挑战,并推动QA的应用。该实验室将使用由CINECA提供的QC资源,CINECA是全球最重要的计算中心之一。我们还描述了基础设施的设计,该设计使用Docker和Kubernetes来确保可扩展性、容错性和可复制性。|code|0|
|EXIST 2024: sEXism Identification in Social neTworks and Memes|Laura Plaza, Jorge Carrillo-de-Albornoz, Enrique Amigó, Julio Gonzalo, Roser Morante, Paolo Rosso, Damiano Spina, Berta Chulvi, Alba Maeso, Víctor Ruiz|RMIT Univ, Melbourne, Vic 3000, Australia; Univ Nacl Educ Distancia UNED, Madrid 28040, Spain; Univ Politecn Valencia UPV, Valencia 46022, Spain|The paper describes the EXIST 2024 lab on Sexism identification in social networks, that is expected to take place at the CLEF 2024 conference and represents the fourth edition of the EXIST challenge. The lab comprises five tasks in two languages, English and Spanish, with the initial three tasks building upon those from EXIST 2023 (sexism identification in tweets, source intention detection in tweets, and sexism categorization in tweets). In this edition, two new tasks have been introduced: sexism detection in memes and sexism categorization in memes. Similar to the prior edition, this one will adopt the Learning With Disagreement paradigm. The dataset for the various tasks will provide all annotations from multiple annotators, enabling models to learn from a range of training data, which may sometimes present contradictory opinions or labels. This approach facilitates the model's ability to handle and navigate diverse perspectives. Data bias will be handled both in the sampling and in the labeling processes: seed, topic, temporal and user bias will be taken into account when gathering data; in the annotation process, bias will be reduced by involving annotators from different social and demographic backgrounds.|本文介绍了将在CLEF 2024会议上举行的EXIST 2024实验室,该实验室专注于社交媒体中的性别歧视识别,这是EXIST挑战赛的第四届。实验室包含五种任务,涉及英语和西班牙语两种语言,其中前三个任务延续了EXIST 2023的内容(推文中的性别歧视识别、推文中的意图检测以及推文中的性别歧视分类)。在本届实验室中,新增了两项任务:迷因中的性别歧视检测和迷因中的性别歧视分类。与往届类似,本届实验室将采用"学习分歧"(Learning With Disagreement)范式。各项任务的数据集将提供来自多位标注者的所有标注,使模型能够从多样化的训练数据中学习,这些数据有时可能包含相互矛盾的观点或标签。这种方法有助于模型处理和应对不同的观点。数据偏差将在采样和标注过程中得到处理:在数据收集时,将考虑种子、主题、时间以及用户偏差;在标注过程中,将通过引入来自不同社会背景和人口统计背景的标注者来减少偏差。|code|0|
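A few of the approaches catalogued above lend themselves to short, self-contained code sketches. The sketches below are informal readings of the corresponding abstracts, with placeholder names and synthetic data; they are not the authors' implementations. First, the SoftQE idea: train a query encoder so that its output for a raw query approximates the (precomputed, frozen) embedding of the LLM-expanded query, so no LLM is needed at inference time. The single-layer encoder head, dimensions, and random tensors are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

# Placeholder query encoder head; in practice this would be a dense
# retriever's query tower (e.g. a BERT-style encoder), not one linear layer.
dim = 768
query_encoder = torch.nn.Linear(dim, dim)

# Toy batch: raw-query representations and the frozen "teacher" embeddings
# of their LLM-expanded versions, precomputed offline (random stand-ins here).
query_vecs = torch.randn(32, dim)
expanded_vecs = torch.randn(32, dim)

optimizer = torch.optim.AdamW(query_encoder.parameters(), lr=1e-4)
optimizer.zero_grad()
pred = query_encoder(query_vecs)
loss = F.mse_loss(pred, expanded_vecs)  # pull query embeddings toward expansions
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```

At inference time only the query encoder runs, which is how such a scheme avoids LLM latency and cost while retaining some of the expansion signal.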
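For the multi-stage prompting (MSP) approach to MCQ generation, the sketch below shows the general shape of a chain of interconnected prompts (extract an answer, then write a question, then produce distractors). The prompts and the `call_llm` client are hypothetical stand-ins, not the paper's prompt set.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub: plug in any GPT-style completion client here.
    raise NotImplementedError

def generate_mcq(passage: str) -> dict:
    # Stage 1: extract an answer-worthy fact from the passage.
    answer = call_llm(
        f"Passage:\n{passage}\n\nState one key fact from the passage that "
        "would make a good quiz answer. Reply with the fact only."
    )
    # Stage 2: write a question conditioned on the passage and the answer.
    question = call_llm(
        f"Passage:\n{passage}\nAnswer: {answer}\n\n"
        "Write one question whose correct answer is the text above."
    )
    # Stage 3: generate distractors conditioned on both previous outputs.
    distractors = call_llm(
        f"Question: {question}\nCorrect answer: {answer}\n\n"
        "Write three plausible but incorrect options, one per line."
    )
    return {
        "question": question.strip(),
        "answer": answer.strip(),
        "distractors": [d.strip() for d in distractors.splitlines() if d.strip()],
    }
```

Each stage sees the outputs of the previous ones, which is the essential difference from single-stage prompting, where one prompt must produce the whole MCQ at once.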
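For "Can We Predict QPP? An Approach Based on Multivariate Outliers", one plausible reading is to represent each query as a vector of scores from several QPP predictors and flag multivariate outliers as the hard-to-predict queries. The synthetic data and the choice of scikit-learn's `EllipticEnvelope` (a robust covariance estimator) are assumptions for illustration; the paper does not prescribe this particular detector.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Rows: queries; columns: scores from several QPP predictors.
# Random data stands in for real predictor output.
rng = np.random.default_rng(0)
qpp_scores = rng.normal(size=(500, 3))

# Fit a robust covariance estimate and flag the most atypical queries.
detector = EllipticEnvelope(contamination=0.05, random_state=0)
labels = detector.fit_predict(qpp_scores)  # -1 = outlier, 1 = inlier

hard_queries = np.flatnonzero(labels == -1)
print(f"{hard_queries.size} queries flagged as hard to predict")
```

Excluding the flagged queries before computing QPP correlation metrics is then a one-line filter over the remaining rows.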
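Finally, the core trick behind line-index utilities such as indxr: record the byte offset of every line in one pass, then `seek()` to fetch individual lines without loading the file into memory. This is a from-scratch sketch of the idea, not indxr's actual API (see https://github.com/AmenRa/indxr for the real library); the optional `key` argument for JSONL identifiers is an assumption.

```python
import json

class LineIndex:
    """Scan a file once, remember the byte offset of every line, then
    read single lines on demand with seek() instead of loading the file."""

    def __init__(self, path, key=None):
        self.path = path
        self.offsets = []  # offsets[i] = byte offset of line i
        self.by_id = {}    # optional: record identifier -> line number
        with open(path, "rb") as f:
            offset = f.tell()
            for i, line in enumerate(iter(f.readline, b"")):
                self.offsets.append(offset)
                if key is not None:  # e.g. index JSONL records by an id field
                    self.by_id[json.loads(line)[key]] = i
                offset = f.tell()

    def get_line(self, i):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[i])
            return f.readline().decode("utf-8").rstrip("\n")

    def get_by_id(self, identifier):
        return self.get_line(self.by_id[identifier])
```

Usage, under the same assumptions: `idx = LineIndex("collection.jsonl", key="id")` builds the index once, and `idx.get_by_id("doc42")` later fetches one record in O(1) file reads regardless of collection size.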