In imbalanced data classification, majority class samples dominate in quantity, and their distribution exerts a significant "pulling" effect on clustering results. Minority class samples, being few, have relatively indistinct features within the dataset as a whole. This leads to drift in the data stream and degrades its classification performance. To address this issue, an imbalanced drift big data stream classification algorithm based on spectral clustering undersampling is studied. Undersampling is used to reduce redundant majority class data in the imbalanced drift big data stream, balancing the amounts of majority class and minority class data and alleviating the data drift caused by the clustering "pulling" effect. Core points of the balanced big data stream are then selected to form a core point set, which is clustered with a spectral clustering algorithm. Classification of the imbalanced drift big data stream is carried out based on the clustering structure obtained from spectral clustering and the selected core points. The experimental results show that the algorithm can balance imbalanced drift big data streams, reducing the average imbalance degree after processing to 1.024, which is close to the equilibrium state. It can also select and effectively group core points for big data streams with different attributes, supporting the subsequent effective application of such big data streams.
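The balancing step can be illustrated with a minimal sketch. It assumes that the imbalance degree is the ratio of the majority class count to the minority class count (the text does not give a definition), and it substitutes plain random undersampling for the spectral clustering based selection described above. The names `imbalance_ratio` and `undersample_majority` and the toy data are hypothetical and used only for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: majority count / minority count (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def undersample_majority(samples, labels, seed=0):
    """Randomly keep, per class, as many samples as the smallest class has.

    This is random undersampling, a simple stand-in for the clustering
    guided undersampling that the text describes.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y


# Toy chunk of a stream: 90 majority samples (label 0), 10 minority (label 1).
samples = [(float(i), float(i % 7)) for i in range(100)]
labels = [0] * 90 + [1] * 10

balanced_x, balanced_y = undersample_majority(samples, labels)
ratio_before = imbalance_ratio(labels)
ratio_after = imbalance_ratio(balanced_y)
print(ratio_before, ratio_after)
```

Under this assumed definition, a ratio near 1 after processing corresponds to the near-equilibrium state reported in the experiments. In a streaming setting the same step would be applied to each incoming chunk rather than to the whole dataset at once.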