Multiple hypothesis testing in big data mining produces false positives, and computing the theoretical results needed to control the false discovery rate (FDR) is extremely time-consuming. To improve the computational efficiency of theoretical FDR values, a distributed false-positive control algorithm, DPFDR (distributed permutation testing-based false discovery rate), is proposed. The algorithm first mines representative patterns using the conditional frequent pattern tree (CFP) method and uses them to compress the pattern space. It then estimates the workload of each task from the representative patterns, partitions the data according to that workload, and assigns the tasks to compute nodes through a load-balancing policy. Finally, the effective FDR false-positive control threshold is obtained by merging and sorting the results computed on each node. Experiments on real data sets show that the proposed DPFDR algorithm greatly improves the efficiency of computing the FDR false-positive control threshold.
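The last two stages described above (workload-based task allocation, then merging and sorting per-node results to obtain a threshold) can be illustrated with a minimal sketch. This is not the DPFDR implementation. The function names `balance`, `merge_sorted`, and `bh_threshold` are hypothetical, the greedy heaviest-task-first assignment is only an assumed load-balancing policy, and the Benjamini-Hochberg step-up rule stands in for the permutation-based threshold computation.

```python
import heapq

def balance(workloads, n_nodes):
    """Assumed policy: give the heaviest remaining task to the least-loaded node."""
    heap = [(0.0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    # Visit tasks from largest to smallest estimated workload.
    for task, cost in sorted(workloads.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[node].append(task)
        heapq.heappush(heap, (load + cost, node))
    return assignment

def merge_sorted(per_node_pvalues):
    """Merge per-node p-value lists, each already sorted ascending."""
    return list(heapq.merge(*per_node_pvalues))

def bh_threshold(sorted_pvalues, alpha):
    """Benjamini-Hochberg step-up: largest p(k) with p(k) <= (k / m) * alpha.

    Stand-in for the permutation-based threshold; returns None if no
    p-value passes.
    """
    m = len(sorted_pvalues)
    threshold = None
    for k, p in enumerate(sorted_pvalues, start=1):
        if p <= k / m * alpha:
            threshold = p
    return threshold

# Hypothetical workload estimates for four tasks, spread over two nodes.
assignment = balance({"a": 5, "b": 4, "c": 3, "d": 2}, n_nodes=2)

# Hypothetical p-values returned by the two nodes.
merged = merge_sorted([[0.001, 0.045], [0.01, 0.02, 0.5]])
threshold = bh_threshold(merged, alpha=0.05)
```

In this toy run both nodes end up with an estimated load of 7, and every pattern whose p-value is at or below `threshold` would be reported as significant at the chosen FDR level.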