arxiv.json (35 additions & 0 deletions)
@@ -45624,5 +45624,40 @@
     "pub_date": "2025-04-24",
     "summary": "As false information continues to proliferate across social media platforms, effective rumor detection has emerged as a pressing challenge in natural language processing. This paper proposes RAGAT-Mind, a multi-granular modeling approach for Chinese rumor detection, built upon the MindSpore deep learning framework. The model integrates TextCNN for local semantic extraction, bidirectional GRU for sequential context learning, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs. Experiments on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior classification performance, attaining 99.2% accuracy and a macro-F1 score of 0.9919. The results validate the effectiveness of combining hierarchical linguistic features with graph-based semantic structures. Furthermore, the model exhibits strong generalization and interpretability, highlighting its practical value for real-world rumor detection applications.",
   },
   {
+    "title": "Music Tempo Estimation on Solo Instrumental Performance",
+    "url": "http://arxiv.org/abs/2504.18502v1",
+    "pub_date": "2025-04-25",
+    "summary": "Recently, automatic music transcription has made it possible to convert musical audio into accurate MIDI. However, the resulting MIDI lacks music notations such as tempo, which hinders its conversion into sheet music. In this paper, we investigate state-of-the-art tempo estimation techniques and evaluate their performance on solo instrumental music. These include temporal convolutional network (TCN) and recurrent neural network (RNN) models that are pretrained on massive of mixed vocals and instrumental music, as well as TCN models trained specifically with solo instrumental performances. Through evaluations on drum, guitar, and classical piano datasets, our TCN models with the new training scheme achieved the best performance. Our newly trained TCN model increases the Acc1 metric by 38.6% for guitar tempo estimation, compared to the pretrained TCN model with an Acc1 of 61.1%. Although our trained TCN model is twice as accurate as the pretrained TCN model in estimating classical piano tempo, its Acc1 is only 50.9%. To improve the performance of deep learning models, we investigate their combinations with various post-processing methods. These post-processing techniques effectively enhance the performance of deep learning models when they struggle to estimate the tempo of specific instruments.",
+  },
+  {
+    "title": "An Empirical Study of Evaluating Long-form Question Answering",
+    "url": "http://arxiv.org/abs/2504.18413v1",
+    "pub_date": "2025-04-25",
+    "summary": "\\Ac{LFQA} aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram matching, while the reliability of large language model-based evaluations for long-form answers remains relatively unexplored. We address this gap by conducting an in-depth study of long-form answer evaluation with the following research questions: (i) To what extent do existing automatic evaluation metrics serve as a substitute for human evaluations? (ii) What are the limitations of existing evaluation metrics compared to human evaluations? (iii) How can the effectiveness and robustness of existing evaluation methods be improved? We collect 5,236 factoid and non-factoid long-form answers generated by different large language models and conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. Subsequently, we investigated the performance of automatic evaluation metrics by evaluating these answers, analyzing the consistency between these metrics and human evaluations. We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics. However, fine-grained evaluation helps mitigate this issue on some metrics. Our findings have important implications for the use of large language models for evaluating long-form question answering. All code and datasets are available at https://github.com/bugtig6351/lfqa_evaluation.",
+    "translated": "\\Ac{LFQA}(长形式问答)旨在针对复杂问题生成详尽的答案。这一场景为评估工作提供了极大的灵活性,同时也带来了重大挑战。当前大多数评估依赖于基于字符串或n-元组匹配的确定性指标,而基于大语言模型对长形式答案进行评估的可靠性仍缺乏充分研究。我们通过开展长形式答案评估的深度研究来填补这一空白,重点关注以下研究问题:(i) 现有自动评估指标能在多大程度上替代人工评估?(ii) 相较于人工评估,现有评估指标存在哪些局限性?(iii) 如何提升现有评估方法的有效性和鲁棒性?我们收集了5,236个由不同大语言模型生成的事实型与非事实型长形式答案,并对其中2,079个答案进行了以正确性和信息量为核心的人工评估。随后,我们通过评估这些答案来考察自动评估指标的表现,分析这些指标与人工评估的一致性。研究发现,答案的文体风格、长度以及问题类别都会导致自动评估指标产生偏差。但精细化评估有助于缓解部分指标的偏差问题。我们的研究结论对于使用大语言模型评估长形式问答具有重要指导意义。所有代码与数据集已开源:https://github.com/bugtig6351/lfqa_evaluation。\n\n(注:\\Ac{LFQA}作为首现术语,采用\"长形式问答(LFQA)\"的规范译法;\"fine-grained evaluation\"译为\"精细化评估\"以准确体现技术内涵;通过拆分英文长句为中文短句链式结构,如将\"analyzing the consistency...\"独立成句处理;专业表述如\"鲁棒性\"、\"n-元组\"等严格遵循计算机领域术语规范)"
+  },
+  {
+    "title": "Bridge the Domains: Large Language Models Enhanced Cross-domain\n Sequential Recommendation",
+    "url": "http://arxiv.org/abs/2504.18383v1",
+    "pub_date": "2025-04-25",
+    "summary": "Cross-domain Sequential Recommendation (CDSR) aims to extract the preference from the user's historical interactions across various domains. Despite some progress in CDSR, two problems set the barrier for further advancements, i.e., overlap dilemma and transition complexity. The former means existing CDSR methods severely rely on users who own interactions on all domains to learn cross-domain item relationships, compromising the practicability. The latter refers to the difficulties in learning the complex transition patterns from the mixed behavior sequences. With powerful representation and reasoning abilities, Large Language Models (LLMs) are promising to address these two problems by bridging the items and capturing the user's preferences from a semantic view. Therefore, we propose an LLMs Enhanced Cross-domain Sequential Recommendation model (LLM4CDSR). To obtain the semantic item relationships, we first propose an LLM-based unified representation module to represent items. Then, a trainable adapter with contrastive regularization is designed to adapt the CDSR task. Besides, a hierarchical LLMs profiling module is designed to summarize user cross-domain preferences. Finally, these two modules are integrated into the proposed tri-thread framework to derive recommendations. We have conducted extensive experiments on three public cross-domain datasets, validating the effectiveness of LLM4CDSR. We have released the code online.",
+  },
+  {
+    "title": "Leveraging Decoder Architectures for Learned Sparse Retrieval",
+    "url": "http://arxiv.org/abs/2504.18151v1",
+    "pub_date": "2025-04-25",
+    "summary": "Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. With the advent of large-scale pre-trained language models, their capability to generate sparse representations for retrieval tasks across different transformer-based architectures, including encoder-only, decoder-only, and encoder-decoder models, remains largely unexplored. This study investigates the effectiveness of LSR across these architectures, exploring various sparse representation heads and model scales. Our results highlight the limitations of using large language models to create effective sparse representations in zero-shot settings, identifying challenges such as inappropriate term expansions and reduced performance due to the lack of expansion. We find that the encoder-decoder architecture with multi-tokens decoding approach achieves the best performance among the three backbones. While the decoder-only model performs worse than the encoder-only model, it demonstrates the potential to outperform when scaled to a high number of parameters.",
+  },
+  {
+    "title": "Revisiting Algorithmic Audits of TikTok: Poor Reproducibility and\n Short-term Validity of Findings",
+    "url": "http://arxiv.org/abs/2504.18140v1",
+    "pub_date": "2025-04-25",
+    "summary": "Social media platforms are constantly shifting towards algorithmically curated content based on implicit or explicit user feedback. Regulators, as well as researchers, are calling for systematic social media algorithmic audits as this shift leads to enclosing users in filter bubbles and leading them to more problematic content. An important aspect of such audits is the reproducibility and generalisability of their findings, as it allows to draw verifiable conclusions and audit potential changes in algorithms over time. In this work, we study the reproducibility of the existing sockpuppeting audits of TikTok recommender systems, and the generalizability of their findings. In our efforts to reproduce the previous works, we find multiple challenges stemming from social media platform changes and content evolution, but also the research works themselves. These drawbacks limit the audit reproducibility and require an extensive effort altogether with inevitable adjustments to the auditing methodology. Our experiments also reveal that these one-shot audit findings often hold only in the short term, implying that the reproducibility and generalizability of the audits heavily depend on the methodological choices and the state of algorithms and content on the platform. This highlights the importance of reproducible audits that allow us to determine how the situation changes in time.",