Research Hotspot Mining Under a Contextual Autoencoding Framework

王睿; 吕心诚; 陆家豪; 周永权

doi:10.20009/j.cnki.21-1106/TP.2025-0227

Journal of Chinese Computer Systems, 2026, Vol. 47, Issue 5: 1089-1098.

Algorithm Theory and Artificial Intelligence

Abstract

Efficiently mining research hotspots and the authors associated with them is an important task in academic research. Traditional author topic models overlook contextual semantics, struggle to incorporate external knowledge, and do not model background topics. To address these limitations, this paper proposes a contextualized neural author topic model. The model uses a Transformer to capture the contextual semantics of text and thereby improve the accuracy of topic inference. It incorporates pre-trained word and author embeddings into the decoding process and models topics with the von Mises-Fisher (vMF) distribution to improve topic quality. It also adopts a Dirichlet tree distribution as the prior to distinguish background topics from hotspot topics. In addition, the paper introduces two metrics that quantify the degree of association between research hotspots and authors. Experiments on three constructed datasets (Computational Linguistics, Computer Vision, and Data Mining) show that the model outperforms the compared methods in topic coherence, topic diversity, and author-topic relevance, supporting its effectiveness for research hotspot mining.
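The abstract names topic diversity as an evaluation criterion but does not define it. As a rough illustration only, the sketch below assumes one commonly used definition: the fraction of unique words among the top-k words of all topics. The function name `topic_diversity`, the parameter `top_k`, and the toy topics are hypothetical and are not taken from the paper.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumption: each topic is a list of words sorted by descending weight.
    A value of 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical toy topics; "image" appears in two of them.
toy_topics = [
    ["parsing", "syntax", "grammar"],
    ["detection", "image", "segmentation"],
    ["clustering", "graph", "image"],
]
score = topic_diversity(toy_topics, top_k=3)
print(f"topic diversity: {score:.3f}")
```

Under this assumed definition, a score near 1 suggests that topics rarely share top words, while a low score suggests redundant topics. The paper may use a different formulation.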

Key words

research hotspot mining / author topic model / von Mises-Fisher distribution / Dirichlet tree distribution

Cite this article

王睿, 吕心诚, 陆家豪, 周永权. Research Hotspot Mining Under a Contextual Autoencoding Framework[J]. Journal of Chinese Computer Systems, 2026, 47(5): 1089-1098. DOI: 10.20009/j.cnki.21-1106/TP.2025-0227

