Interactive data analysis: the Control project

J.M. Hellerstein, R. Avnur, A. Chou, C. Hidber, C. Olston, V. Raman, T. Roth, P.J. Haas
1999 Computer  
Data analysis is fundamentally an iterative process in which you issue a query, receive a response, formulate the next query based on the response, and repeat. You usually don't issue a single, perfectly chosen query and get the information you want from a database; indeed, the purpose of data analysis is to extract unknown information, and in most situations there is no one perfect query [1]. People naturally start by asking broad, big-picture questions and then continually refine their questions based on feedback and domain knowledge [2].

Consider repeating this process several times over, sifting through many more results, and you have an idea of why using advanced data analysis tools is so complex. Composing Structured Query Language (SQL) queries for decision-support database management systems (DBMSs) isn't easy, and even users of graphical query tools find it difficult to generate insightful queries. Although data-mining systems typically don't provide complicated query languages, to use these systems you need to choose a suitable mining algorithm and carefully tune various algorithm-specific parameters such as support and confidence for association rule mining, thresholds for clustering, training sets for classification, and so on. These usability problems increase the number of iterations in the analysis process; you have to try algorithms with different parameters until you find one that produces useful results. In addition, many of these tools require complicated, time-consuming setup phases before they can be used at all.

Most research in the areas of decision support, data visualization, statistics, data mining and knowledge discovery has concentrated on improving a single iteration of the analysis process. Some work has focused on improving the quality of a particular analysis result or on reducing the time it takes for each analysis step or algorithm to provide a complete response. These fields have progressed greatly, but this research focus ignores a basic invariant in computing: Full-scale data analysis will always be slow. As Greg Papadopoulos, chief technology officer at Sun, points out, the appetite for data collection, storage, and analysis is outstripping Moore's law, meaning that the time required to analyze massive data sets is steadily growing.
To date, the result is a worst-case mode of human-computer interaction: Data analysis is a complex process involving multiple, time-consuming steps, and a poor or erroneous choice of inputs is not noticeable until results return at the end of a given step. The long delay and absolute lack of control during individual analysis steps disrupt the user's concentration and hamper the data analysis process. This situation is reminiscent of Herodotus' lament: "Of all men's miseries, the bitterest is this: to know so much and have control over nothing."

In the Control (Continuous Output and Navigation Technology with Refinement Online) project at Berkeley, we are working with collaborators at IBM, Informix, and elsewhere to explore ways to improve human-computer interaction during data analysis. The Control project's goal is to develop interactive, intuitive techniques for analyzing massive data sets. We focus on systems that iteratively refine answers to queries and give users online control of processing, thereby tightening the data analysis process loop. You can use our techniques in diverse software contexts including decision support database systems, data visualization, data mining, and user interface toolkits.

BATCH VERSUS ONLINE PROCESSING

Traditional analysis tools have a black-box interface: The user issues queries, the system processes silently for a significant period, and then the system returns an exact answer. Because of the long processing times, this interaction is reminiscent of the batch processing of the 1960s
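
The contrast between a black-box batch answer and an answer that is refined while the user watches can be illustrated with a small Python sketch. It is not the Control project's implementation; the function names `batch_mean` and `online_mean`, the `report_every` parameter, and the normal-approximation confidence interval are all assumptions chosen for illustration. The sketch also assumes the rows arrive in random order, so that the rows seen so far behave like a random sample of the whole data set.

```python
import math
import random
from typing import Iterable, Iterator, Tuple


def batch_mean(values: Iterable[float]) -> float:
    """Black-box style: nothing is returned until every value has been read."""
    rows = list(values)
    return sum(rows) / len(rows)


def online_mean(
    values: Iterable[float], report_every: int = 1000, z: float = 1.96
) -> Iterator[Tuple[int, float, float]]:
    """Yield (rows_seen, running_mean, half_width) while the scan is running.

    half_width is an approximate 95% interval (normal approximation) and is
    only meaningful if the input order is random (an assumption of this sketch).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            variance = m2 / (n - 1) if n > 1 else 0.0
            yield n, mean, z * math.sqrt(variance / n)
    # Report once more if the scan ended between two reporting points.
    if n > 0 and n % report_every != 0:
        variance = m2 / (n - 1) if n > 1 else 0.0
        yield n, mean, z * math.sqrt(variance / n)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(100.0, 15.0) for _ in range(50_000)]  # synthetic data
    rng.shuffle(data)
    for rows, estimate, half_width in online_mean(data):
        print(f"{rows:>6} rows: {estimate:.2f} +/- {half_width:.2f}")
        if half_width < 0.5:
            break  # the user decides the estimate is good enough and stops
```

Because `online_mean` is a generator, the caller controls the scan: breaking out of the loop stops processing as soon as the estimate is precise enough, whereas `batch_mean` gives no feedback until the full pass has finished.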
doi:10.1109/2.781635 fatcat:w2e7t3wlbzguzm43edxni7ccoq