Due to the fact that ordinary spatial keyword queries often result in too many query results, and people tend to search for locations with high text matching in the result set, a spatial text query problem based on density peak is proposed to obtain typical spatial objects with concentrated spatial object density and hightext similarity. TF-IDF similarity and Cosine similarity evaluation methods are used to calculate the correlation etween query conditions and other spatial keywords. Then, based on DPC (Clustering by Fast Search and Find of Density Peaks) algorithm, the TS-DPC algorithm is designed to divide the intermediate result set into several clusters according to density requirements among objects that meet spatial text conditions. On the one hand, it can obtain spatial object clusters that meet density requirements within a given range. On the other hand, it can obtain the centers of different spatial object clusters for research purposes. Then, the algorithm is optimized and the TS-DPC-IMP algorithm is proposed which greatly reduces the running time of the algorithm by using a grid algorithm while keeping other parameters unchanged.
随着Internet的普遍应用以及智能手机和移动终端的普及,对于兴趣点(Point of Interests,POIs,比如:商场、饭店、旅游景点等)的空间关键词查询已成为当前空间数据库的研究热点[1].而以空间地理位置-文本数据为背景的空间关键词查询技术在多个领域得到了广泛的应用,比如:给定一个空间查询对象,包括待查询的空间地理位置(一般用经度和纬度表示)和待查询的一组关键词,返回距离查询位置以及空间文本相关性最高的前k个结果.但是由于现在每天会产生海量的空间数据,任意一个空间关键词的搜索可能会产生大量的查询结果,而且这样返回的空间对象均零散地分布在空间地理位置中,从而忽略了空间对象之间的相关性.
聚类算法可以根据数据之间的相似特征对数据集进行归类与划分,已经广泛应用于数据分析、地理位置分析等领域[2-3].根据给定的规则按照对象的属性将其划分成不同的簇集,相同簇集下的对象具有属性相关性,而不同的簇集中的对象则存在较弱的相关性.在聚类方法发展的进程中,出现了很多经典的算法,比如基于密度聚类[4-6](Density-Based)的典型代表算法有DBSCAN[7](Density-Based Spatial Clustering of Applications with Noise)和OPTICS[8](Ordering Points to Identify the Clustering Structure),基于划分聚类的k-means[9]均值聚类算法以及基于模型聚类[10-11]的HMM算法,这些算法因为计算简单而被广泛使用,但是存在一定的局限性.
本文采用IR-Tree[16]来构建空间文本索引,它是一种混合索引模型,以倒排索引和R-Tree索引为基础.IR-Tree混合索引将两种索引结构有机结合在一起实现文本相似度和位置相关度的融合.因此,设计TS-DPC(Spatial Textual Similar-Based on Clustering by Fast Search and Find of Density Peaks)算法,该算法在经典的密度峰值聚类算法DPC(Clustering by Fast Search and Find of Density Peaks)的基础上,不仅考虑结果对象之间的距离相似度,而且考虑结果对象与其他对象的关键词集合与查询对象的关键词的相似度,这样在整个空间中,就会得到多个以密度峰值对象为中心的簇集.最后,再计算查询对象与得到的多个簇集中包含的对象的空间文本相似度,选择前k个结果.
上述的算法时间复杂度较高的地方主要在DPC计算密度峰值时,DPC算法的步骤1中每次查询都需要计算其中的任一数据点和其他所有数据点的距离,其时间复杂度为O(N2 ),对于空间大量的数据来说,这里的代价是比较大的.另外,算法在运行时,所有的数据都在内存中进行计算,当数据量增大时,需要比较大的内存才能支持该算法,而且磁盘I/O消耗也大.另外通过对该算法步骤2的研究可以发现数据点的截断距离领域之外的点,并不加入其密度ρ的计算.于是,提出了改进的TS-DPC-IMP算法(Spatial Textual Similar-Based on Clustering by Fast Search and Find of Density Peaks-Improvment),通过网格划分的方法来对上述计算进行剪枝,优化区域查询的操作,以减少时间的开销.
4.1 相关概念
定义5(网格单元) 用户需要确定一个空间网格划分的参数e,根据这个参数,将数据空间在每一行上划分为N个相同长度的区间,每一列上做相同的划分,从而将整个数据空间划分为边长为e的网格.设网格单元Grid={g1,g2,…,gn },gi ={1≤i, j≤n | [col ij,row ij ]},其中i代表行,j代表列.在网格单元划分的过程中,对于每行或者每列的末尾的网格单元,可以称之为边缘网格单元.可能长度不足e,但是仍可以将该边长看作e,因为对于边缘网格单元来说,不会存在其相同方向延伸的点,这样既可以保证计算的准确性,又可以保证算法的一致性.
LIF, ZHOUM, LIS, et al. A new density peak clustering algorithm based on cluster fusion strategy[J]. IEEE Access,2022, 10: 98034-98047.
[2]
DUANB, WEIP, MAW. An improved density peak clustering algorithm[C]//International Conference on Computer Graphics, Artificial Intelligence, and Data Processing. Guangzhou: SPIE, 2023: 1044-1052.
[3]
CHENH, ZHOUY, MEIK, et al. A new density peak clustering algorithm with adaptive clustering center based on differential privacy[J]. IEEE Access, 2023, 11: 1418-1431.
[4]
JORDANM I, MITCHELLT M. Machine learning: Trends, perspectives, and prospects[J]. Science, 2015, 349(6245):255-260.
[5]
YUJ, HONGR, WANGM, et al. Image clustering based on sparse patch alignment framework[J]. Pattern Recognition, 2014, 47(11):3512-3519.
[6]
BUCZAKA L, GUVENE. A survey of data mining and machine learning methods for cyber security intrusion detection[J].IEEE Communications Surveys & Tutorials, 2016, 18(2):1153-1176.
[7]
ESTERM, KRIEGELH, SANDERJ, et al. A density-based algorithm for discovering clusters in large spatial databaseswith noise[C]//National Conferences on Aritificial Intelligence. Munich: AAAI, 1999: 836-841.
[8]
ANKERSTM, BREUNIGM M, KRIEGELH P, et al. OPTICS: Ordering points to identify the clusteringstructure[C]// Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. Philadelphia: ACM, 1999: 49-60.
[9]
LIKASA, VLASSISN, VERBEEKJ J. The global k-means clustering algorithm[J]. Pattern Recognition, 2003, 36(2): 451-461.
[10]
MCPARLANDD, GORMLEYI C.Model based clustering for mixed data: ClustMD[J]. Advances in Data Analysis and Classification,2016,10(2):155-169.
[11]
ROKACHL. A survey of clustering algorithms[M]//MAIMON O, ROKACH L, eds. Data Mining and Knowledge Discovery Handbook. Boston: Springer US, 2009: 269-298.
[12]
CHENZ, ZHAOT, LIUW. Time-aware collective spatial keyword query[J]. Computer Science and Information Systems, 2021,18(3):1077-1100.
[13]
RODRIGUEZA, LAIOA. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496.
GUTTMANA. R-trees: A dynamic index structure for spatial searching[C]//Proceedings of ACM SIGMOD Conference on Management of Data. Boston: ACM,1984:47-57.
[16]
CONGG, JENSENC S, WUD. Efficient retrieval of the top-k most relevant spatial web objects[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 337-348.