Abstract
In Chinese text data classification under big data environments, fuzzy class boundaries are hard to define and static membership functions cannot adapt, and both limit classification accuracy and efficiency. To address this, a Chinese text data classification method based on improved fuzzy mean clustering is proposed. The method preprocesses the raw Chinese text data and normalizes its features to construct a fuzzy feature matrix. An improved fuzzy C-means algorithm then iteratively optimizes the membership function, and triangular fuzzy sets are introduced to generate classification rules that characterize the fuzzy transition zones between categories. On this basis, the membership function is updated dynamically to follow changes in the data distribution, a fuzzy covariance matrix is computed, and a discriminant function is established to make the classification decision. Experiments on the IBM multi-attribute population dataset show a maximum classification accuracy of 99.86%, a maximum data condensation rate of 97.62%, and a misclassification rate that stays below 0.6%. Performance declines only gradually as the data volume grows, indicating good stability and noise resistance, so the method meets the practical requirements of Chinese text data classification.
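As a rough illustration of two building blocks named in the abstract, the sketch below implements the textbook fuzzy C-means membership update and a triangular membership function in NumPy. It is not the improved algorithm of the paper: the abstract does not specify the modifications, the dynamic membership update, or the fuzzy-covariance discriminant, so none of those are reproduced here. The function names, the fuzzifier `m = 2.0`, and the stopping tolerance are assumptions made for this example.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard (textbook) fuzzy C-means, not the paper's improved variant.

    X: array of shape (n_samples, n_features), assumed already normalized.
    Returns (centers, U), where U[i, k] is the membership of sample i in cluster k.
    """
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row is scaled to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def triangular_membership(x, a, b, c):
    """Triangular fuzzy set with feet at a and c and peak at b (a < b < c)."""
    x = np.asarray(x, dtype=float)
    rising = (x - a) / (b - a)
    falling = (c - x) / (c - b)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)
```

A sample whose largest membership is well below 1 lies in a transition zone between clusters. Triangular sets over a normalized feature are one simple way to express such graded boundaries as rules.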
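Huo Liang.
Chinese text data classification method based on improved fuzzy mean clustering[J].
Techniques of Automation and Applications (自动化技术与应用), 2026, 45(6): 154-158. DOI: 10.20033/j.1003-7241.(2026)06-0154-06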