In order to solve the problem that the classical K-means clustering algorithm reguired users to know the number of clusters in advance and the clustering results were sensitive to initialization of the algorithm, a comprehensive scheme was proposed to improve the random initial partitioning of K-means algorithm and visually determine the number of clusters. Firstly, the data was standardized to make it obey normal distribution, and the most important features were extracted by principal component analysis to achieve dimensionality reduction of high-dimensional data. Then, the farthest centroid selection and min-max distance rule were used to modify the random initialization of K-means algorithm to avoid empty clusters and ensure data separability. Based on these, the statistical empirical rule was used to estimate the range of the number of clusters, and the optimal number of clusters was assessed by searching the elbow of sum-of-squared-error curve within this range. Finally, by calculating and comparing the silhouette coefficients of each cluster, the clustering quality of the algorithm was evaluated, thereby ultimately determining the inherent number of clusters in the dataset. The simulation results show that the proposed scheme can not only visually determine the potential number of clusters in the dataset, but also provide an effective method for high-dimensional data analysis in the era of big data.
为解决K-均值聚类算法的问题,很多学者提出了相应的改进方案。针对随机初始化问题,依据变异系数最大值和相关绝对值的最小值,通过选择两个主变量来计算初始聚类中心[11];采用初始种子选择算法以确定数据的初始划分[12];利用鲁棒密度初始化将种子点定位在数据集的密集区域以实现聚类的初始化[13];在算法每次运行中,使用数据的一个属性创建一个初始聚类,并将多次运行的聚类结果组合起来形成初始划分[14]。在确定聚类数量方面,将最大稳定集与连续Hopfield网络相结合估计聚类数量[15];为提高差分进化自动聚类(automatic clustering using differential evolution,ACDE)的性能,使用U-控制图法,结合ACDE以确定聚类数量[16];基于查找聚类均值有效性的聚类指数,结合核心函数指数及均值性质确定新的聚类中心和聚类数量[17];通过计算基于中心点的偏差比指数(deviation ratio index,DRI)来确定聚类的数量[18];在数据分区之前,利用相似性矩阵的易感性确定最佳的聚类数量[19]。
由于数据 X 的聚类数量K是有限的正整数,尽管具体K值无法事先得知,但可通过数据统计特性估计出其范围。当所有数据点都属于一个聚类时, X 的最小聚类数量为Kmin=1;而聚类数量最大的情况是每个样本点形成一个单独聚类,即Kmax=n。然而,实际数据本身的聚类数量一般满足K<<n,因此只要估计出K的可能最大值Kmax,就可以构造出聚类数量范围[1, Kmax]。
(1)数据 X 的标准化预处理;(2) If p≥8 go to 3, else go to 4;(3)对标准化数据执行 PCA 降维的预处理,使数据维数压缩为q维;(4)利用经验法则估计聚类数量范围[Kmin,Kmax];(5)用肘部法在[Kmin,Kmax]中搜索最佳 SSE 对应的聚类数量 K;(6)采用最远质心选择和最小-最大距离规则改进经典K-均值算法的随机初始化;(7)将K值作为输入参数运行改进的K-均值算法获得聚类结果;(8)计算聚类的轮廓系数,利用轮廓分析评价聚类质量;(9)确定与数据本身相适应最优聚类数量K;(10)返回聚类数量K以及对应的质心向量 M。
GrolemundG, WickhamH.A cognitive interpretation of data analysis[J].International Statistical Review,2014,82(2):184-204.
[2]
WangR L, JiW T, LiuM Z,et al.Review on mi-ning data from multiple data sources[J].Pattern Recognition Letters,2018,109(7):120-128.
[3]
ClancyT C, KhawarA, NewmanT R.Robust signal classification using unsupervised learning[J].IEEE Transactions on Wireless Communications,2011,10(4):1289-1299.
[4]
XiaoQ G, LiC B, TangY,et al.Energy efficiency modeling for configuration-dependent machining via machine learning:a comparative study[J].IEEE Transactions on Automation Science and Engineering,2021,18(2):717-730.
[5]
CarusoG, GattoneS A, FortunaF,et al.Cluster analysis for mixed data:an application to credit risk evaluation[J].Socio-Economic Planning Sciences,2021,73:100850.
[6]
HuangD, WangC D, PengH X,et al.Enhanced ensemble clustering via fast propagation of cluster-wise similarities[J].IEEE Transactions on Systems,Man,and Cybernetics:Systems,2021,51(1):508-520.
[7]
IkotunA M, EzugwuA E, AbualigahL,et al.K-means clustering algorithms:a comprehensive review variants analysis,and advances in the era of big data[J].Information Sciences:an International Journal,2023,622(1):178-210.
[8]
BiswasT K, GiriK, RoyS.ECKM:an improved k-means clustering based on computational geo-metry[J].Expert Systems with Applications,2023,212:118862.
[9]
SteinleyD.K-means clustering:a half-century synthesis[J].The British Journal of Mathematical and Statistical Psychology,2006,59(1):1-34.
[10]
SteinleyD.Stability analysis in K-means clustering[J].The British Journal of Mathematical and Statistical Psychology,2008,61(2):255-273.
[11]
ErisogluM, CalisN, SakalliogluS.A new algorithm for initial cluster centers in k-means algorithm[J].Pattern Recognition Letters,2011,32(14):1701-1705.
[12]
KhanF.An initial seed selection algorithm for k-means clustering of georeferenced data to improve replicability of cluster assignments for mapping application[J].Applied Soft Computing,2012,12(11):3698-3700.
[13]
KumarK M, ReddyA M.An efficient k-means clustering filtering algorithm using density based initial cluster centers[J].Information Sciences:an International Journal,2017,418(1):286-301.
[14]
AhmadA, KhanS S.InitKmix-a novel initial partition generation algorithm for clustering mixed data using k-means-based clustering[J].Expert Systems with Applications,2021,167:114149.
[15]
KarimA, LoqmanC, BoumhidiJ.Determining the number of clusters using neural network and max stable set problem[J].Procedia Computer Science,2018,127(1):16-25.
[16]
ViloriaA, LezamaO B P.Improvements for determining the number of clusters in k-means for innovation databases in SMEs[J].Procedia Computer Science,2019,151(1):1201-1206.
[17]
AbdalameerA K, AlswaittiM, AlsudaniA A,et al.A new validity clustering index-based on finding new centroid positions using the mean of clustered data to determine the optimum number of clusters[J].Expert Systems with Applications,2022,191:116329.
[18]
Kariyam,Abdurakhman, EffendieA R.A medoid-based deviation ratio index to determine the number of clusters in a dataset[J].MethodsX,2023,10:102084.
[19]
LippielloE, BaccariS, BountzisP.Determining the number of clusters,before finding clusters,from the susceptibility of the similarity matrix[J].Physica a Statistical Mechanics and Its Applications,2023,616:128592.
[20]
LeeG, SimE, YoonY,et al.Probabilistic orthogonal-signal-corrected principal component analysis[J].Knowledge-Based Systems,2023,268:110473.
[21]
KwakN.Principal component analysis based on L1-norm maximization[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2008,30(9):1672-1680.
[22]
SelimS Z, IsmailM A.K-means-type algorithms:a generalized convergence theorem and characte-rization of local optimality[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,1984,6(1):81-87.
[23]
BagirovA M, AliguliyevR M, SultanovaN.Fin-ding compact and well-separated clusters:clustering using silhouette coefficients[J].Pattern Recognition,2023,135:109144.
[24]
BatoolF, HennigC.Clustering with the average silhouette width[J].Computational Statistics & Data Analysis,2021,158:107190.
[25]
HubertL, ArabieP.Comparing partitions[J].Journal of Classification,1985,2(1):193-218.
[26]
SteinleyD.Properties of the hubert-arabie adjusted rand index[J].Psychological Methods,2004,9(3):386-396.
[27]
StrehlA, GhoshJ.Cluster ensembles-a knowledge reuse framework for combining multiple partitions[J].Journal of Machine Learning Research,2003,3(3):583-617.
[28]
YangZ, AlgesheimerR, TessoneC J.A comparative analysis of community detection algorithms on artificial networks[J].Scientific Reports,2016,6:30750.
[29]
RosenbergA, HirschbergJ.V-measure:a conditional entropy-based external cluster evaluation measure[J].EMNLP-CoNLL 2007-Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning,2007,10:410-420.
[30]
FowlkesE B, MallowsC L.A method for compa-ring two hierarchical clusterings[J].Journal of the American Statistical Association,1983,78(383):553-569.
[31]
DaviesD L, BouldinD W.A cluster separation measure[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,1979,1(2):224-227.
[32]
HalkidiM, BatistakisY, VazirgiannisM.On clustering validation techniques[J].Journal of Intelligent Information Systems,2001,17(2):107-145.
[33]
CalinskiT, HarabaszJ.A dendrite method for cluster analysis[J].Communications in Statistics- Theory and Methods,1974,3(1):780133830.