Research Hotspot Mining Under a Contextual Autoencoding Framework (上下文自编码框架下的科研热点挖掘方法)

王睿, 吕心诚, 陆家豪, 周永权

Journal of Chinese Computer Systems (小型微型计算机系统), 2026, Vol. 47, Issue 5: 1089-1098.

DOI: 10.20009/j.cnki.21-1106/TP.2025-0227
Algorithm Theory and Artificial Intelligence


Abstract

Efficiently mining research hotspots and the authors associated with them is an important task in academic research. Traditional author topic models overlook contextual semantics, struggle to incorporate external knowledge, and do not model background topics. To address these limitations, this paper proposes a contextualized neural author topic model. The model uses a Transformer to capture the contextual semantics of text and improve the accuracy of topic inference, incorporates pre-trained word and author embeddings into the decoding process, and models topics with the von Mises-Fisher (vMF) distribution to improve topic quality. It also adopts a Dirichlet tree distribution as the prior to distinguish background topics from hotspot topics. In addition, the paper introduces two metrics that quantify the degree of association between research hotspots and authors. Experiments on three constructed datasets (computational linguistics, computer vision, and data mining) show that the model outperforms the compared methods in topic coherence, topic diversity, and author-topic relevance, validating its advantage in mining research hotspots.

Keywords

research hotspot mining / author topic model / von Mises-Fisher distribution / Dirichlet tree distribution

Cite this article

王睿, 吕心诚, 陆家豪, 周永权. Research Hotspot Mining Under a Contextual Autoencoding Framework[J]. Journal of Chinese Computer Systems (小型微型计算机系统), 2026, 47(5): 1089-1098. DOI: 10.20009/j.cnki.21-1106/TP.2025-0227



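Funding

National Natural Science Foundation of China, Young Scientists Fund (62102192)

China Postdoctoral Science Foundation, General Program (2022M710071)

Jiangsu Province Innovation and Entrepreneurship ("Shuangchuang") Doctoral Talent Program (JSSCBS20210530)

Fundamental Research Funds for the Central Universities (aiia-24-01)

Fundamental Research Funds for the Central Universities (PA2025IISL0107)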
