Article ID: 1672-6987(2025)02-0140-10 DOI: 10.16351/j.1672-6987.2025.02.019
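CUI Yingying, CHEN Zhuo* (College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, Shandong, China)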
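Abstract: Existing keyword extraction methods ignore the semantic diversity of words, so the keywords they extract tend to be semantically close to one another. To address this problem, an unsupervised keyword extraction method oriented toward semantic diversity is proposed. The method first uses a word node centrality score, which fuses word position information with surface informativeness features, to measure how important each word is among the other words of the full text. Next, the words are clustered into multiple local groups, and within each local group the weights of the redundant words among several similar words are reduced according to the words' subject scores. Finally, to allow for the case in which all the words in a local group are important, the whole document is taken as the global scope and the weights of non-central words in that scope are reduced according to the node centrality scores, so that more correct keywords have a chance of being extracted and the quality of keyword extraction improves. Experimental results on three public datasets show that the F1 value of the method is about 5% higher than that of advanced baseline methods.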
Keywords: keyword extraction; semantic similarity; unsupervised method; clustering
CLC number: TP 391 Document code: A
Citation: CUI Yingying, CHEN Zhuo. Unsupervised keyword extraction for semantic diversity[J]. Journal of Qingdao University of Science and Technology (Natural Science Edition), 2025, 46(2): 140-149.
Unsupervised Keyword Extraction for Semantic Diversity
CUI Yingying, CHEN Zhuo(College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China)
Abstract: To address the problem that existing keyword extraction methods ignore the semantic diversity of words and therefore extract keywords that are semantically similar to one another, an unsupervised keyword extraction method for semantic diversity is proposed. First, a word node centrality score that combines word position and surface informativeness is used to measure the importance of each word relative to the other words in the document. Then, the words are clustered into multiple local groups, and the weights of redundant words in each group are reduced according to the word subject score. Finally, because all the words in a local group may be important, the whole document is taken as the global scope, and the weights of non-central words are reduced according to the word node centrality score, so that more correct keywords can be extracted and keyword quality improves. Experiments on three public datasets show that the F1 value of the proposed method is about 5% higher than that of advanced baseline methods.
Key words: keyword extraction; semantic similarity; unsupervised methods; clustering
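The local redundancy step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `diversify` and `top_k`, the geometric `local_penalty` factor, the example scores, and the cluster assignments are all assumptions made for illustration, and the global centrality-based step is omitted because the abstract does not specify its formula.

```python
from collections import defaultdict

def diversify(scores, clusters, local_penalty=0.5):
    """Down-weight redundant words inside each cluster (hypothetical rule).

    scores:   word -> importance score (assumed to be precomputed)
    clusters: word -> cluster id (assumed to come from any clustering step)
    The best word of a cluster keeps its score; the word at rank r within
    the cluster is multiplied by local_penalty ** r.
    """
    by_cluster = defaultdict(list)
    for word, score in scores.items():
        by_cluster[clusters[word]].append((score, word))

    adjusted = {}
    for members in by_cluster.values():
        # Highest score first; ties broken alphabetically for determinism.
        members.sort(key=lambda m: (-m[0], m[1]))
        for rank, (score, word) in enumerate(members):
            adjusted[word] = score * (local_penalty ** rank)
    return adjusted

def top_k(scores, k):
    """Return the k highest-scoring words."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

# Made-up scores and clusters, for illustration only.
scores = {"network": 0.9, "networks": 0.85, "neural": 0.8,
          "clustering": 0.6, "cluster": 0.55}
clusters = {"network": 0, "networks": 0, "neural": 0,
            "clustering": 1, "cluster": 1}

before = top_k(scores, 2)                       # both words from cluster 0
after = top_k(diversify(scores, clusters), 2)   # one word per cluster
```

With these made-up inputs, the unadjusted top two words come from the same cluster, while the adjusted ranking picks one word from each cluster, which is the kind of semantic diversity the method aims for.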
Received: 2024-10-03
Funding: National Natural Science Foundation of China (6217072142); Natural Science Foundation of Shandong Province (ZR2021MF092).
About the author: CUI Yingying (b. 1998), female, master's student. * Corresponding author.