The approximate functional dependency discovery method allows a certain proportion of violations by relaxing the conditions for the existence of functional dependency to ensure that the original functional dependency can still be mined in noise data.However,existing discovery algorithms were easy to mine a large number of false functional dependency with a large number of left attributes after relaxing the conditions for functional dependency,resulting in a significant reduction in the accuracy of discovery results.In order to solve this problem,an approximate functional dependency discovery algorithm based on Markov blanket,which uses Markov blanket to prune the left attribute search space and reduce the candidate set of decision items was proposed.And through the downward generalization algorithm,the number of error calculations was reduced and the complexity was reduced.In this way,on the premise of not losing the real functional dependency,overfitting of approximate functional dependency was avoided,thus improving the accuracy of discovery results.Experimental results show that the accuracy of the proposed method in real data sets and synthetic data sets is better than the existing approximate functional dependency discovery methods.
现有的近似函数依赖挖掘方法虽然可以降低数据错误对函数依赖挖掘结果的影响,但仍然存在以下两方面的问题:(1)现有方法大多需要系统地枚举所有可能的FD,并逐一验证其共现频率是否满足阈值要求,导致候选FD规模巨大,挖掘效率较低。例如,假设数据中有m个属性,文献[3]的搜索空间为O(2 m ),搜索代价是指数级的;(2)现有的方法容易出现过拟合问题,挖掘出大量左部属性数量较多的虚假FD,挖掘结果准确率较低。例如,对于只有几十个属性的数据集,基于共现频率的方法会挖掘出成百上千条FD,其中大部分是过拟合的。
真实数据集使用Rayyan、Beers、Flights及Hospital 4个数据集,Rayyan,Beers[Ouzzani et al.,2016]和Flights[Li et al.,2012]是由用户手动收集并清理的数据集,Hospital数据集是常用于数据清洗的基准数据集,其中4个数据集分别包含3、3、4和15条真实的FD,如表2所示。
ZiawaschA, LukaszG, FelixN. Profiling relational data: a survey[J]. The VLDB Journal, 2015, 24(4):557-581.
[3]
SebastianK, FelixN. Efficient discovery of approximate dependencies[J]. Proceedings of the VLDB Endowment,2018,11(7):759-772.
[4]
HuhtalaY, KärkkäinenJ, PorkkaP. Tane : an efficient algorithm for discovering functional and approximate dependencies[J]. The Computer Journal,1999,42(2):100-111.
[5]
NovelliN, CicchettiR.FUN: An efficient algorithm for discovery functional and embedded dependencies[M]. Database Theory-ICDT 2001,Berlin:Heidelberg,2001.
[6]
YaoH, HamiltonH J, ButzC J.FD/spl I.bar/Mine:discovering functional dependencies in a database using equivalences[C]//2002 IEEE International Conference on Data Mining,2002.Proceedings.Maebashi City,Japan:IEEE,2003:729-732.
[7]
AbedjanZ, SchulzeP, NaumannF.DFD:efficient functional dependency discovery[C]//Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management.New York,USA:ACM,2014.
[8]
FlachP A, SavnikI.Database dependency discovery: a machine learning approach[J]. AI Communications, 1999, 12(3):139-160.
[9]
LopesS, PetitJ M, LakhalL.Efficient discovery of functional dependencies and Armstrong relations[M].Advances in Database Technology-EDBT 2000.New York:Springer Berlin Heidelberg,2000.
[10]
WyssC, GiannellaC, RobertsonE.FastFDs:a heuristic-driven,depth-first algorithm for mining functional dependencies from relation instances extended abstract[M].Data Warehousing and Knowledge Discovery.New York:Springer Berlin Heidelberg,2001.
[11]
PapenbrockT, NaumannF.A hybrid approach to functional dependency discovery[C]//Proceedings of the 2016 International Conference on Management of Data.New York,USA:ACM,2016:821-833.
[12]
LingZ L, YuK, ZhangY W,et al. Jiuyong. Causal learner: A toolbox for causal structure and Markov blanket learning[J]. Pattern Recognition Letters,2022,163(2):92-95.
[13]
RamsteadM J D. One person's modus ponens…: Comment on “The Markov blanket trick: on the scope of the free energy principle and active inference” by Raja and colleagues (2021)[J]. Physics of Life Reviews,2022,43(3):305-307.