1.School of Computer Science, Xi'an Shiyou University, Xi'an 710065, China
2.School of Science, Xi'an Shiyou University, Xi'an 710065, China
Article History
Published: 2025-08-20
Issue Date: 2026-01-28
Abstract
Existing feature selection methods based on relative fuzzy rough sets attempt to characterize the outlier distribution of instances in similarity computation to enhance robustness, yet they still fail to effectively suppress the interference of potential noise and do not further compress the data scale. To address this, this paper proposes a bi-selection method based on denoise-weighted fuzzy granules (BS-RFRS). First, a relative distance measure with a denoise discretization factor is designed, which adapts to the local density distribution of instances. Second, denoise-weighted fuzzy granules are constructed to granulate the model within the bi-selection framework. Finally, the BS-RFRS algorithm is built on this granular structure to maximize data reduction while improving classification performance. Experiments on 12 benchmark datasets show that BS-RFRS significantly outperforms five other bi-selection algorithms in classification accuracy and effectiveness, with particularly notable accuracy gains on medical diagnosis and industrial control datasets and improved effectiveness over traditional bi-selection models. Under label noise, the classification accuracy of BS-RFRS is on average 19.9% and 42.7% higher than that of the bi-selection approach based on neighborhood importance degree (BSNID) and the bi-selection method based on fuzzy rough sets (BSFRS), respectively.
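The core operations sketched above, namely adjusting pairwise distances by a local-density factor and then building weighted fuzzy similarity granules, can be illustrated with the minimal Python sketch below. The function names, the k-nearest-neighbour density estimate, and the Gaussian similarity kernel are illustrative assumptions, not the paper's exact formulation.

    import numpy as np
    from scipy.spatial.distance import cdist

    def antinoise_relative_distance(X, k=5):
        # Pairwise Euclidean distances between instances.
        d = cdist(X, X)
        # Mean distance to the k nearest neighbours as a simple density proxy
        # (small mean distance = dense region, large = sparse, outlier-prone region).
        knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
        density = 1.0 / (knn_mean + 1e-12)
        # Anti-noise factor in (0, 1]: pairs involving low-density instances have
        # their distances inflated, so suspected outliers contribute less similarity.
        w = density / density.max()
        factor = np.minimum.outer(w, w)
        return d / (factor + 1e-12)

    def denoise_weighted_fuzzy_granules(X, k=5, sigma=1.0):
        # Fuzzy similarity granule of each instance (one row per instance), built
        # from the density-adjusted distances with a Gaussian kernel; values in (0, 1].
        rd = antinoise_relative_distance(X, k)
        return np.exp(-(rd ** 2) / (2 * sigma ** 2))

    # Usage: granules = denoise_weighted_fuzzy_granules(np.random.rand(100, 8))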
To further reduce the data, bi-selection techniques that perform instance selection and feature selection simultaneously have also attracted attention from researchers in different fields. Reducing the feature set on top of removing redundant instances shrinks the data scale to a greater extent. Ślęzak et al. [28] first proposed the concept of bireducts, combining instance selection with feature selection. Zhang et al. [29] proposed BSFRS (bi-selection method based on fuzzy rough sets), a bi-selection method built on the fuzzy rough set model, which defines a new significance degree function as the evaluation criterion for selecting representative instances and reduced attribute sets. Noting that bi-selection had received little study in the neighborhood rough set setting, Zhang et al. [30] then proposed a bi-selection method based on the neighborhood rough set model. Quan et al. [31] introduced a β parameter and proposed a fuzzy rough set bi-selection model with a β-consistency granulation mechanism to cope with large-scale data. Notably, Zhang et al. [29] proposed a procedure that removes potential noise, but it cannot effectively identify potential noise located near the noise boundary. To address this limitation, the denoise-weighted fuzzy granule model proposed in this paper effectively corrects potential noise and avoids overlooking the influence of noise close to the noise boundary.
(2) Comparison algorithms: to examine the performance of BS-RFRS, five bi-selection algorithms are selected for comparison. The first is BSFRS [29], a bi-selection algorithm based on fuzzy rough sets; the second is BSNID (bi-selection approach based on neighborhood importance degree) [30], a bi-selection algorithm based on neighborhood rough sets. To obtain additional baselines, the representative instance selection algorithms RIS [9] and NID [30] are combined with the feature selection algorithms RFRS [20], FSI [25], and NRBO [26], yielding three further bi-selection algorithms: RIS-FSI, RIS-RFRS, and NID-NRBO.
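The three combined baselines are obtained by chaining an instance selector with a feature selector. The following minimal Python sketch shows this composition; the selector interfaces, the placeholder selectors, and the instances-first order are assumptions for illustration, not the actual implementations of RIS, NID, RFRS, FSI, or NRBO.

    import numpy as np

    def bi_select(X, y, instance_selector, feature_selector):
        # Generic bi-selection pipeline: first keep representative instances,
        # then reduce the feature set on the retained instances only.
        inst_idx = instance_selector(X, y)                     # indices of kept instances
        feat_idx = feature_selector(X[inst_idx], y[inst_idx])  # indices of kept features
        return inst_idx, feat_idx

    # Usage with trivial stand-in selectors (hypothetical placeholders):
    # inst_idx, feat_idx = bi_select(
    #     X, y,
    #     instance_selector=lambda X, y: np.arange(len(X)),          # keep all instances
    #     feature_selector=lambda X, y: np.arange(X.shape[1] // 2),  # keep first half of features
    # )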
(1) Classification accuracy: Table 3 shows that the BS-RFRS algorithm has a clear advantage in classification accuracy, achieving the best result on 10 of the 12 datasets and the second-best result on the remaining 2. It is the best algorithm on small datasets such as Wine, WDBC (Wisconsin Diagnostic Breast Cancer), and Sonar, as well as on the high-dimensional Darwin dataset. On large datasets such as CTG (Cardiotocography), WDG2 (Waveform Database Generator Version 2), Thyroid, and Gamma it also achieves the best performance, and it is second-best on the Robot and WDG1 (Waveform Database Generator Version 1) datasets. Table 3 therefore indicates that BS-RFRS, built on denoise-weighted fuzzy granules, performs well across datasets of different scales.
[1] CHEN M S, HAN J W, YU P S. Data Mining: An Overview from a Database Perspective[J]. IEEE Trans Knowl Data Eng, 1996, 8(6): 866-883. DOI: 10.1109/69.553155.
[2] LELEWER D A, HIRSCHBERG D S. Data Compression[J]. ACM Comput Surv, 1987, 19(3): 261-296. DOI: 10.1145/45072.45074.
[3] OLVERA-LÓPEZ J A, CARRASCO-OCHOA J A, MARTÍNEZ-TRINIDAD J F, et al. A Review of Instance Selection Methods[J]. Artif Intell Rev, 2010, 34(2): 133-143. DOI: 10.1007/s10462-010-9165-y.
[4] GARCÍA S, CANO J R, HERRERA F. A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling up Approach[J]. Pattern Recognit, 2008, 41(8): 2693-2709. DOI: 10.1016/j.patcog.2008.02.006.
[5] TSAI C F, CHEN Z Y. Towards High Dimensional Instance Selection: An Evolutionary Approach[J]. Decis Support Syst, 2014, 61: 79-92. DOI: 10.1016/j.dss.2014.01.012.
[6] ZHANG X, YANG Q, QIAN T. Learning to Select Representative Instances Based on Neighborhood Distribution[J]. Neurocomputing, 2025, 654: 131320. DOI: 10.1016/j.neucom.2025.131320.
[7] BRIGHTON H, MELLISH C. Advances in Instance Selection for Instance-based Learning Algorithms[J]. Data Min Knowl Discov, 2002, 6(2): 153-172. DOI: 10.1023/A:1014043630878.
[8] ZHANG Q, ZHU Y, CORDEIRO F R, et al. PSSCL: A Progressive Sample Selection Framework with Contrastive Loss Designed for Noisy Labels[J]. Pattern Recognit, 2025, 161: 111284. DOI: 10.1016/j.patcog.2024.111284.
[9] ZHANG X, MEI C L, CHEN D G, et al. A Fuzzy Rough Set-based Feature Selection Method Using Representative Instances[J]. Knowl Based Syst, 2018, 151: 216-229. DOI: 10.1016/j.knosys.2018.03.031.
[10] VENKATESH B, ANURADHA J. A Review of Feature Selection and Its Methods[J]. Cybern Inf Technol, 2019, 19(1): 3-26. DOI: 10.2478/cait-2019-0001.
[11] ALTMAN N, KRZYWINSKI M. The Curse(s) of Dimensionality[J]. Nat Methods, 2018, 15(6): 399-400. DOI: 10.1038/s41592-018-0019-x.
ZADEH L A. Granular Computing and Rough Set Theory[M]//Rough Sets and Intelligent Systems Paradigms. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007: 1-4. DOI: 10.1007/978-3-540-73451-2_1.
[14] DUBOIS D, PRADE H. Rough Fuzzy Sets and Fuzzy Rough Sets[J]. Int J Gen Syst, 1990, 17(2/3): 191-209. DOI: 10.1080/03081079008935107.
[15] 陈德刚, 徐伟华, 李金海, 等. 粒计算基础教程[M]. 北京: 科学出版社, 2019.
[16] CHEN D G, XU W H, LI J H, et al. Basic Course of Granular Computing[M]. Beijing: Science Press, 2019.
CHEN D G, LIU M, WU C, et al. Algebraic Structure and Reduction of Fuzzy Information Systems[J]. J Tsinghua Univ Sci Technol, 2003, 43(9): 1233-1235. DOI: 10.16511/j.cnki.qhdxxb.2003.09.022.
[19] HU Q H, YU D R, LIU J F, et al. Neighborhood Rough Set Based Heterogeneous Feature Subset Selection[J]. Inf Sci, 2008, 178(18): 3577-3594. DOI: 10.1016/j.ins.2008.05.024.
[20] SHANNON C E. A Mathematical Theory of Communication[J]. Bell Syst Tech J, 1948, 27(3): 379-423. DOI: 10.1002/j.1538-7305.1948.tb01338.x.
[21] HU Q H, YU D R. Neighborhood Entropy[C]//2009 International Conference on Machine Learning and Cybernetics. New York: IEEE, 2009: 1776-1782. DOI: 10.1109/ICMLC.2009.5212245.
[22] AN S, ZHAO E H, WANG C Z, et al. Relative Fuzzy Rough Approximations for Feature Selection and Classification[J]. IEEE Trans Cybern, 2021, 53(4): 2200-2210. DOI: 10.1109/TCYB.2021.3112674.
[23] HUANG W L, SHE Y H, HE X L, et al. Fuzzy Rough Sets-based Incremental Feature Selection for Hierarchical Classification[J]. IEEE Trans Fuzzy Syst, 2023, 31(10): 3721-3733. DOI: 10.1109/TFUZZ.2023.3300913.
[24] ZHANG X Y, YAO Y Y. Tri-level Attribute Reduction in Rough Set Theory[J]. Expert Syst Appl, 2022, 190: 116187. DOI: 10.1016/j.eswa.2021.116187.
[25] HU M, TSANG E C C, GUO Y T, et al. Attribute Reduction Based on Overlap Degree and K-nearest-neighbor Rough Sets in Decision Information Systems[J]. Inf Sci, 2022, 584: 301-324. DOI: 10.1016/j.ins.2021.10.063.
[26] DAI J H, ZHU Z L, ZOU X T. Fuzzy Rough Attribute Reduction Based on Fuzzy Implication Granularity Information[J]. IEEE Trans Fuzzy Syst, 2024, 32(6): 3741-3752. DOI: 10.1109/TFUZZ.2024.3381993.
SOWMYA R, PREMKUMAR M, JANGIR P. Newton-Raphson-based Optimizer: A New Population-based Metaheuristic Algorithm for Continuous Optimization Problems[J]. Eng Appl Artif Intell, 2024, 128: 107532. DOI: 10.1016/j.engappai.2023.107532.
[29] FRÉNAY B, VERLEYSEN M. Classification in the Presence of Label Noise: A Survey[J]. IEEE Trans Neural Netw Learn Syst, 2014, 25(5): 845-869. DOI: 10.1109/TNNLS.2013.2292894.
[30] ŚLĘZAK D, JANUSZ A. Ensembles of Bireducts: Towards Robust Classification and Simple Representation[M]//Future Generation Information Technology. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011: 64-77. DOI: 10.1007/978-3-642-27142-7_9.
[31] ZHANG X, MEI C L, LI J H, et al. Instance and Feature Selection Using Fuzzy Rough Sets: A Bi-selection Approach for Data Reduction[J]. IEEE Trans Fuzzy Syst, 2023, 31(6): 1981-1994. DOI: 10.1109/TFUZZ.2022.3216990.
[32] ZHANG X, HE Z Q, LI J H, et al. Bi-selection of Instances and Features Based on Neighborhood Importance Degree[J]. IEEE Trans Big Data, 2024, 10(4): 415-428. DOI: 10.1109/TBDATA.2023.3342643.
[33] QUAN J S, QIAO F C, YANG T, et al. A Biselection Method Based on Consistent Matrix for Large-scale Datasets[J]. IEEE Trans Fuzzy Syst, 2025, 33(6): 1992-2005. DOI: 10.1109/TFUZZ.2025.3543893.