arxiv.json (35 additions & 0 deletions)
@@ -45624,5 +45624,40 @@
     "pub_date": "2025-04-24",
     "summary": "As false information continues to proliferate across social media platforms, effective rumor detection has emerged as a pressing challenge in natural language processing. This paper proposes RAGAT-Mind, a multi-granular modeling approach for Chinese rumor detection, built upon the MindSpore deep learning framework. The model integrates TextCNN for local semantic extraction, bidirectional GRU for sequential context learning, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs. Experiments on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior classification performance, attaining 99.2% accuracy and a macro-F1 score of 0.9919. The results validate the effectiveness of combining hierarchical linguistic features with graph-based semantic structures. Furthermore, the model exhibits strong generalization and interpretability, highlighting its practical value for real-world rumor detection applications.",
   },
   {
+    "title": "Music Tempo Estimation on Solo Instrumental Performance",
+    "url": "http://arxiv.org/abs/2504.18502v1",
+    "pub_date": "2025-04-25",
+    "summary": "Recently, automatic music transcription has made it possible to convert musical audio into accurate MIDI. However, the resulting MIDI lacks music notations such as tempo, which hinders its conversion into sheet music. In this paper, we investigate state-of-the-art tempo estimation techniques and evaluate their performance on solo instrumental music. These include temporal convolutional network (TCN) and recurrent neural network (RNN) models that are pretrained on massive of mixed vocals and instrumental music, as well as TCN models trained specifically with solo instrumental performances. Through evaluations on drum, guitar, and classical piano datasets, our TCN models with the new training scheme achieved the best performance. Our newly trained TCN model increases the Acc1 metric by 38.6% for guitar tempo estimation, compared to the pretrained TCN model with an Acc1 of 61.1%. Although our trained TCN model is twice as accurate as the pretrained TCN model in estimating classical piano tempo, its Acc1 is only 50.9%. To improve the performance of deep learning models, we investigate their combinations with various post-processing methods. These post-processing techniques effectively enhance the performance of deep learning models when they struggle to estimate the tempo of specific instruments.",
+  },
+  {
+    "title": "An Empirical Study of Evaluating Long-form Question Answering",
+    "url": "http://arxiv.org/abs/2504.18413v1",
+    "pub_date": "2025-04-25",
+    "summary": "\\Ac{LFQA} aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram matching, while the reliability of large language model-based evaluations for long-form answers remains relatively unexplored. We address this gap by conducting an in-depth study of long-form answer evaluation with the following research questions: (i) To what extent do existing automatic evaluation metrics serve as a substitute for human evaluations? (ii) What are the limitations of existing evaluation metrics compared to human evaluations? (iii) How can the effectiveness and robustness of existing evaluation methods be improved? We collect 5,236 factoid and non-factoid long-form answers generated by different large language models and conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. Subsequently, we investigated the performance of automatic evaluation metrics by evaluating these answers, analyzing the consistency between these metrics and human evaluations. We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics. However, fine-grained evaluation helps mitigate this issue on some metrics. Our findings have important implications for the use of large language models for evaluating long-form question answering. All code and datasets are available at https://github.com/bugtig6351/lfqa_evaluation.",
+    "translated": "\\Ac{LFQA}(长形式问答)旨在针对复杂问题生成详尽的答案。这一场景为评估工作提供了极大的灵活性,同时也带来了重大挑战。当前大多数评估依赖于基于字符串或n-元组匹配的确定性指标,而基于大语言模型对长形式答案进行评估的可靠性仍缺乏充分研究。我们通过开展长形式答案评估的深度研究来填补这一空白,重点关注以下研究问题:(i) 现有自动评估指标能在多大程度上替代人工评估?(ii) 相较于人工评估,现有评估指标存在哪些局限性?(iii) 如何提升现有评估方法的有效性和鲁棒性?我们收集了5,236个由不同大语言模型生成的事实型与非事实型长形式答案,并对其中2,079个答案进行了以正确性和信息量为核心的人工评估。随后,我们通过评估这些答案来考察自动评估指标的表现,分析这些指标与人工评估的一致性。研究发现,答案的文体风格、长度以及问题类别都会导致自动评估指标产生偏差。但精细化评估有助于缓解部分指标的偏差问题。我们的研究结论对于使用大语言模型评估长形式问答具有重要指导意义。所有代码与数据集已开源:https://github.com/bugtig6351/lfqa_evaluation。\n\n(注:\\Ac{LFQA}作为首现术语,采用\"长形式问答(LFQA)\"的规范译法;\"fine-grained evaluation\"译为\"精细化评估\"以准确体现技术内涵;通过拆分英文长句为中文短句链式结构,如将\"analyzing the consistency...\"独立成句处理;专业表述如\"鲁棒性\"、\"n-元组\"等严格遵循计算机领域术语规范)"
+  },
+  {
+    "title": "Bridge the Domains: Large Language Models Enhanced Cross-domain\n Sequential Recommendation",
+    "url": "http://arxiv.org/abs/2504.18383v1",
+    "pub_date": "2025-04-25",
+    "summary": "Cross-domain Sequential Recommendation (CDSR) aims to extract the preference from the user's historical interactions across various domains. Despite some progress in CDSR, two problems set the barrier for further advancements, i.e., overlap dilemma and transition complexity. The former means existing CDSR methods severely rely on users who own interactions on all domains to learn cross-domain item relationships, compromising the practicability. The latter refers to the difficulties in learning the complex transition patterns from the mixed behavior sequences. With powerful representation and reasoning abilities, Large Language Models (LLMs) are promising to address these two problems by bridging the items and capturing the user's preferences from a semantic view. Therefore, we propose an LLMs Enhanced Cross-domain Sequential Recommendation model (LLM4CDSR). To obtain the semantic item relationships, we first propose an LLM-based unified representation module to represent items. Then, a trainable adapter with contrastive regularization is designed to adapt the CDSR task. Besides, a hierarchical LLMs profiling module is designed to summarize user cross-domain preferences. Finally, these two modules are integrated into the proposed tri-thread framework to derive recommendations. We have conducted extensive experiments on three public cross-domain datasets, validating the effectiveness of LLM4CDSR. We have released the code online.",
+  },
+  {
+    "title": "Leveraging Decoder Architectures for Learned Sparse Retrieval",
+    "url": "http://arxiv.org/abs/2504.18151v1",
+    "pub_date": "2025-04-25",
+    "summary": "Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. With the advent of large-scale pre-trained language models, their capability to generate sparse representations for retrieval tasks across different transformer-based architectures, including encoder-only, decoder-only, and encoder-decoder models, remains largely unexplored. This study investigates the effectiveness of LSR across these architectures, exploring various sparse representation heads and model scales. Our results highlight the limitations of using large language models to create effective sparse representations in zero-shot settings, identifying challenges such as inappropriate term expansions and reduced performance due to the lack of expansion. We find that the encoder-decoder architecture with multi-tokens decoding approach achieves the best performance among the three backbones. While the decoder-only model performs worse than the encoder-only model, it demonstrates the potential to outperform when scaled to a high number of parameters.",
+  },
+  {
+    "title": "Revisiting Algorithmic Audits of TikTok: Poor Reproducibility and\n Short-term Validity of Findings",
+    "url": "http://arxiv.org/abs/2504.18140v1",
+    "pub_date": "2025-04-25",
+    "summary": "Social media platforms are constantly shifting towards algorithmically curated content based on implicit or explicit user feedback. Regulators, as well as researchers, are calling for systematic social media algorithmic audits as this shift leads to enclosing users in filter bubbles and leading them to more problematic content. An important aspect of such audits is the reproducibility and generalisability of their findings, as it allows to draw verifiable conclusions and audit potential changes in algorithms over time. In this work, we study the reproducibility of the existing sockpuppeting audits of TikTok recommender systems, and the generalizability of their findings. In our efforts to reproduce the previous works, we find multiple challenges stemming from social media platform changes and content evolution, but also the research works themselves. These drawbacks limit the audit reproducibility and require an extensive effort altogether with inevitable adjustments to the auditing methodology. Our experiments also reveal that these one-shot audit findings often hold only in the short term, implying that the reproducibility and generalizability of the audits heavily depend on the methodological choices and the state of algorithms and content on the platform. This highlights the importance of reproducible audits that allow us to determine how the situation changes in time.",