Optimizing star-coordinate visualization models for effective interactive cluster exploration on big data
Intelligent Data Analysis
Interactive visual cluster analysis is the most intuitive way to find clustering patterns, validate algorithmic clustering results, understand data clusters with domain knowledge, and refine cluster definitions. The most challenging step is visualizing multidimensional data and allowing a user to interactively explore the data to identify clustering structures. In this paper, we systematically study the star-coordinate based visualization models and propose an optimized design that
presents the best visualization results and supports the most efficient interaction methods. We explain the intuition behind the models and their link with random projection, and then optimize the visual design in terms of the efficiency of visual presentation and interactive operations. We also discuss a randomized visualization generation method, which can be used to generate batches of meaningful visualization results in parallel for big data. Finally, we present an experimental evaluation of the optimized model design. This study is critical to generating effective visualizations and minimizing the computational cost of visualizing data clusters for big data in the cloud.

Keywords: Interactive Multidimensional Data Visualization, Visual Cluster Analysis, Star-coordinate Models, Big Data

1 Introduction

With widely deployed Internet applications, sensor networks, and data collection tools, the size of datasets is growing fast. By incorporating users into the analysis loop, interactive visual data analysis has become the most effective exploratory data analysis method for discovering useful information and patterns from large datasets. In particular, interactive cluster visualization techniques have shown unique advantages in visually validating algorithmic clustering results, understanding clusters with domain knowledge, and interactively refining cluster definitions for multidimensional data [18, 5, 19, 22]. Compared to other multidimensional visualization methods, such as the scatter-plot matrix and parallel coordinates, star-coordinate models [19, 5] are probably the most scalable technique for visualizing large datasets. Scalability here refers to both visual representation and data processing.

1. In spite of different underlying mapping models for star coordinates, such as the Kandogan model and the VISTA model, the star-coordinates visual designs share a unique feature.
The design extends the traditional two- or three-dimensional coordinate systems to k-dimensional coordinate systems, as Figure 1 shows. This visual design can therefore present more dimensions than the scatter-plot matrix and parallel coordinates. In fact, the scatter-plot matrix and parallel coordinates are often used to explore fewer than ten dimensions, while star coordinates can handle tens of dimensions. In particular, star coordinates give a better view of clusters formed by multiple dimensions. In contrast, each scatter-plot shows only two dimensions, and clusters in parallel coordinates are often visualized with heavy overlapping.

2. The star-coordinate visualization can scale up to many points with the help of a density-based representation. Because of the density-preserving properties of the underlying mapping models, points can be aggregated so that only the density information needs to be shown. This is important when the number of records becomes very large, e.g., tens of millions to billions of records.

3. Most importantly, star-coordinate based cluster visualization methods avoid calculating pairwise distances between the records; rather, they use the property of the underlying mapping model to partially preserve the distance relationship. This is very meaningful for processing big data, where the cost of computing pairwise distances becomes unacceptable.
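
The three scalability points above can be illustrated with a minimal sketch. It assumes the basic star-coordinate mapping, in which k axes are spread evenly around the origin and each record is placed at the weighted sum of its normalized values along those axes. It is not the optimized design proposed in this paper. The function names, the min-max normalization, the per-axis `weights`, and the grid cell size are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def star_axes(k):
    """Unit vectors for k axes spaced evenly around the origin."""
    return [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
            for i in range(k)]


def normalize(records):
    """Min-max scale each dimension to [0, 1]; constant columns map to 0."""
    k = len(records[0])
    lows = [min(r[i] for r in records) for i in range(k)]
    highs = [max(r[i] for r in records) for i in range(k)]
    return [[(r[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
             for i in range(k)] for r in records]


def project(records, weights=None):
    """Map each k-dimensional record to 2D as a weighted sum of axis vectors.

    The cost is O(n * k) for n records: each record is mapped on its own,
    so no pairwise distances are computed.
    """
    k = len(records[0])
    axes = star_axes(k)
    weights = weights or [1.0] * k  # hypothetical per-axis scaling factors
    points = []
    for row in normalize(records):
        x = sum(w * v * ax for w, v, (ax, _) in zip(weights, row, axes))
        y = sum(w * v * ay for w, v, (_, ay) in zip(weights, row, axes))
        points.append((x, y))
    return points


def density_grid(points, cell=0.25):
    """Aggregate projected points into per-cell counts for a density display."""
    return Counter((math.floor(x / cell), math.floor(y / cell)) for x, y in points)


if __name__ == "__main__":
    data = [[0.1, 5.0, 2.0], [0.2, 5.1, 2.1], [0.9, 1.0, 7.0], [1.0, 1.2, 7.2]]
    for cell, count in sorted(density_grid(project(data)).items()):
        print(cell, count)
```

Because the mapping is applied to each record independently, the projection step can be split across data partitions, and only the per-cell counts need to reach the display. Adjusting `weights` stands in for the kind of interactive axis scaling a user would perform while exploring clusters.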