A Two-Step Clustering Approach to Extract Locations from Individual GPS Trajectory Data

Zhongliang Fu, Zongshun Tian, Yanqing Xu, Changjian Qiao
2016 ISPRS International Journal of Geo-Information  
High-accuracy location identification is the basis of location awareness and location services. However, GPS signal loss, data drift and repeated access in individual trajectory data limit the efficiency and accuracy of existing algorithms. Therefore, we propose a two-step clustering approach to extract individuals' locations from their GPS trajectory data. Firstly, we defined three different types of stop points; secondly, we extracted these points from the trajectory data by using a spatio-temporal clustering algorithm based on time and distance. The experimental results show that the spatio-temporal clustering algorithm outperformed traditional extraction algorithms. It can avoid the problems caused by repeated access and can substantially reduce the effects of GPS signal loss and data drift. Finally, an improved clustering algorithm based on a fast search and identification of density peaks was applied to discover the trajectory locations. Compared to the existing algorithms, our method shows better performance and accuracy.

One characteristic of individual trajectory data is that the same reference point is visited repeatedly. Meanwhile, owing to the limits of existing GPS technology, trajectory data suffers from two drawbacks: first, the discontinuity caused by GPS signal loss; second, the data drift caused by other environmental factors. These problems have a significant influence on location extraction algorithms.

Existing algorithms are mostly based on the temporal and spatial characteristics of the trajectory data, or combine them with thematic data in order to find locations. These algorithms have their advantages, but they fall short when dealing with the above-mentioned problems. The most direct way is to cluster GPS points based on point density. For example, Schuessler & Axhausen [11] used a density threshold to identify the locations. Ashbrook & Starner [12] used a variation of the K-Means clustering algorithm to learn locations from trajectory data. Zhou et al. [13] proposed the DJ-Cluster algorithm, based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [14], to recognize locations. These algorithms cannot handle points affected by signal loss and data drift, especially the signal loss caused by building obstruction.
They only consider the spatial position and ignore the time attribute. Therefore, these algorithms are only suitable for points with a strong signal. To solve the problems caused by data loss, Ashbrook & Starner [15] proposed to first identify the lost points and then cluster them with an improved K-Means algorithm to form the locations. Since the K-Means algorithm is sensitive to noise, and the approach only recognizes the lost points, its performance is poor. Therefore, more algorithms select locations based on both time and space characteristics. For example, Du & Aultman-Hall [16] considered sequential GPS points with a lower speed as a location. Kang et al. [17] proposed a time-based clustering algorithm, which treats a group of points that are close to each other and whose time interval is greater than a threshold as a location. The SMoT algorithm [18] analyzed the intersection of trajectory data and candidate stops, in which overlapped and sustained points were considered as locations. Palma et al. [19] proposed the CB-SMoT algorithm based on the previous studies. The algorithm clustered data with the parameters Eps (the radius of the cluster) and the minimum time, based on the DBSCAN algorithm. Then, it analyzed the intersection of those clusters and the candidate stops. To some extent, these algorithms can avoid the problems caused by data loss. However, due to the impact of data drift, these algorithms are prone to false positives. These algorithms also do not consider the problem of repeated access to the same place. Consequently, each location extracted by these algorithms is treated as a new location, even when the place has been visited before. To address this, hierarchical clustering algorithms have been applied to location extraction. Yuan et al. [20] proposed an approach which first divided the trajectory data into segments, and then clustered the endpoints of the segments into locations.
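The time-and-distance idea shared by several of the approaches surveyed above can be illustrated with a minimal sketch. It is a generic stay-point detector, not the implementation of any cited algorithm or of the method proposed in this paper. The names `Fix`, `haversine_m` and `extract_stops`, and the default thresholds of 50 m and 300 s, are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Fix:
    """One GPS fix (hypothetical record layout)."""
    t: float    # timestamp in seconds
    lat: float  # latitude in degrees
    lon: float  # longitude in degrees


def haversine_m(a: Fix, b: Fix) -> float:
    """Great-circle distance between two fixes in metres."""
    earth_radius_m = 6371000.0
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(min(1.0, h)))


def extract_stops(fixes, max_dist_m=50.0, min_duration_s=300.0):
    """Return (mean_lat, mean_lon, arrive_t, leave_t) for each detected stop.

    A stop is a run of consecutive fixes that all lie within max_dist_m
    of the first fix in the run and that spans at least min_duration_s.
    Both thresholds are illustrative assumptions.
    """
    stops = []
    i, n = 0, len(fixes)
    while i < n:
        # Extend the run while fixes stay close to the anchor fix i.
        j = i + 1
        while j < n and haversine_m(fixes[i], fixes[j]) <= max_dist_m:
            j += 1
        run = fixes[i:j]
        if run[-1].t - run[0].t >= min_duration_s:
            mean_lat = sum(f.lat for f in run) / len(run)
            mean_lon = sum(f.lon for f in run) / len(run)
            stops.append((mean_lat, mean_lon, run[0].t, run[-1].t))
            i = j       # continue after the stop
        else:
            i += 1      # no stop anchored here; slide the anchor forward
    return stops
```

Because the duration is measured between the first and last fix of a run, a gap in the fixes (for example, from signal loss indoors) still counts toward the stay as long as the fixes on either side of the gap remain close together. The sketch does nothing about drift or repeated visits: each revisit of the same place yields a separate stop, which is why a second clustering step over the stops is needed.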
Although the approach of Yuan et al. [20] overcomes the problem of repeated access, the result is not ideal due to the impact of signal loss and data drift. More importantly, the endpoints of the segments only represent the main turning points of the trajectory. To mitigate these drawbacks, Lv et al. [3] used a time-based algorithm to obtain visit points, which were then clustered to form physical places by an improved DBSCAN algorithm. Their experiments achieved good results, but the algorithm also has some problems. First, under the influence of data drift, a visit point is easily divided into multiple small clusters. Consequently, the precision and recall of the algorithm are affected by the segmentation. Second, the DBSCAN algorithm is very sensitive to its parameters [21] and has a relatively high time complexity, so it is not suitable for processing the large amounts of data involved in trajectory data mining. With these concerns about previously used algorithms in mind, we propose a two-step clustering approach to extract personal locations from individual GPS trajectory data. First, we extracted the possible stop points by using the spatio-temporal clustering algorithm. Considering the characteristics of individual trajectory data, this paper presents three different types of stop points, among which the first type is not taken into account in existing algorithms. The spatio-temporal clustering algorithm can largely avoid the problems of signal loss and data drift. It can also reduce the impact of the segmentation found in the literature [3]. Next, an improved fast clustering algorithm is adopted
doi:10.3390/ijgi5100166 fatcat:tku4763zlvboppym4nq5fpytcy