Geographic Data Mining and Knowledge Discovery: An Overview
Geographic Data Mining and Knowledge Discovery, Second Edition
digital geographic data easily overwhelm mainstream spatial analysis techniques that are oriented towards teasing scarce information from small and homogeneous datasets. Traditional statistical methods, particularly spatial statistics, have high computational burdens. These techniques are confirmatory and require the researcher to have a priori hypotheses. Therefore, traditional spatial analytical techniques cannot easily discover new and unexpected patterns, trends and relationships that can be
... hidden deep within very large and diverse geographic datasets.

In March 1999, the National Center for Geographic Information and Analysis (NCGIA) Project Varenius held a workshop on "Discovering geographic knowledge in data-rich environments" in Kirkland, Washington. The workshop brought together a diverse group of stakeholders with interests in developing and applying computational techniques for exploring large, heterogeneous digital geographic datasets. These stakeholders include geographers, geographic information scientists, computer scientists and statisticians. This book is a result of that workshop. This volume brings together some of the cutting-edge research from the diverse stakeholders working in the area of geographic data mining and geographic knowledge discovery in a data-rich environment.

This chapter provides an introduction to geographic data mining and geographic knowledge discovery (GKD). We first provide an overview of knowledge discovery from databases (KDD) and data mining. We then turn to the special case of geographic knowledge discovery and geographic data mining, and identify why geographic data is a non-trivial special case that requires special consideration and techniques. We also review the current state of the art in GKD, including the existing literature and the contributions of the chapters in this volume.

Page 3 of 48 Filename: GKD Chapter 1 v8. Last save: 9-21-2000 7:41 AM

KNOWLEDGE DISCOVERY AND DATA MINING

In this section of the chapter, we provide a general overview of knowledge discovery and data mining. We begin with an overview of knowledge discovery from databases (KDD), highlighting its general objectives and its relationship to the field of statistics and the general scientific process. We then identify the major stages of the KDD process, including data mining. We classify major data mining tasks and discuss some techniques available for each task.
We conclude this section by discussing the relationships between scientific visualization and KDD.

Knowledge discovery from databases

Knowledge discovery from databases (KDD) is a response to the enormous volumes of data being collected and stored in operational and scientific databases. Continuing improvements in information technology (IT) and its widespread adoption for process monitoring and control in many domains are creating a wealth of new data. There is often much more information in these databases than the "shallow" information being extracted by traditional analytical and query techniques. KDD leverages investments in IT by searching for deeply hidden information that can be turned into knowledge for strategic decision-making and answering fundamental research questions.

KDD is better known through the more popular term "data mining." However, data mining is only one component (albeit a central component) of the larger KDD process. Data mining involves distilling data into information or facts about the mini-world described by the database. KDD is the higher-level process of obtaining information through data mining and distilling this information into knowledge (ideas and beliefs about the mini-world) through interpretation of information and integration with existing knowledge.

KDD is based on a belief that information is hidden in very large databases in the form of interesting patterns. These are non-random properties and relationships that are valid, novel, useful and ultimately understandable. Valid means that the pattern is general enough to apply to new data; it is not just an anomaly of the current data. Novel means that the pattern is non-trivial and unexpected. Useful implies that the pattern should lead to some effective action: rather than searching for any valid and novel pattern, KDD should inform decision making and scientific investigation.
Ultimately understandable means that the pattern should be simple and interpretable by humans (Fayyad, Piatetsky-Shapiro and Smyth 1996).

KDD is also based on the belief that traditional database queries and statistical methods cannot reveal interesting patterns in very large databases. One reason is the type of data that increasingly makes up enterprise databases. Another reason is the novelty of the patterns sought in KDD.

KDD goes beyond the traditional domain of statistics to accommodate data not normally amenable to statistical analysis. Statistics usually involves a small and clean (noiseless) numeric database scientifically sampled from a large population with specific questions in mind. Many statistical models require strict assumptions (such as independence, stationarity of underlying processes and normality). In contrast, the data being collected and stored in many enterprise databases are noisy, non-numeric and possibly incomplete. These data are also collected in an open-ended manner without specific questions in mind (Hand 1998). KDD encompasses principles and techniques from statistics, machine learning, pattern recognition, numeric search and scientific visualization to accommodate the new data types and data volumes being generated through information technologies.

KDD is more strongly inductive than traditional statistical analysis. The generalization process of statistics is embedded within the broader deductive process of science. Statistical models are confirmatory, requiring the analyst to specify a model a priori based on some theory, test the resulting hypotheses and perhaps revise the theory depending on the results. In contrast, the deeply hidden, interesting patterns being sought in a KDD process are (by definition) difficult or impossible to specify a priori, at least with any reasonable degree of completeness.
KDD is more concerned with prompting investigators to formulate new predictions and hypotheses from data than with testing deductions from theories through a sub-process of induction from a scientific database (Elder and Pregibon 1996; Hand 1998). A rule-of-thumb is that if the information being sought can only be vaguely described in advance, KDD is more appropriate than statistics (Adriaans and Zantinge 1996).

KDD more naturally fits in the initial stage of the deductive process when the researcher forms or modifies theory based on ordered facts and observations from the "real world." In this sense, KDD is to information space as microscopes, remote sensing and telescopes are to atomic, geographic and astronomical spaces, respectively: KDD is a tool for exploring domains that are too difficult to perceive with unaided human abilities. For searching through a large information wilderness, the powerful but focused laser beams of statistics cannot compete with the broad but diffuse floodlights of KDD. However, floodlights can cast shadows and KDD cannot compete with statistics in confirmatory power once the pattern is discovered.

Data warehousing

An infrastructure that often underlies the KDD process is the data warehouse (DW). A DW is a repository that integrates data from one or more source databases. The data-warehousing phenomenon results from several technological and economic trends, including the decreasing cost of data storage and data processing, and the increasing value of information in business, governmental and scientific environments. A DW usually exists to support strategic and scientific decision-making based on integrated, shared information, although DWs are also used to save legacy data for liability and other purposes (see Jarke et al. 2000).
The data in a DW are usually read-only historical copies of the operational databases in an enterprise, sometimes in summary form. Consequently, a DW is often several orders of magnitude larger than an operational database (Chaudhuri and Dayal 1997). A DW is not just a very large database management system; it embodies database design principles very different from those of operational databases. Operational database management systems are designed to support transactional data processing, that is, data entry, retrieval and updating. Design principles for transactional database systems attempt to create a database that is internally consistent and recoverable (i.e., can be "rolled-back" to the last known internally consistent state in the event of an error or disruption).
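
The "roll-back" behavior of a transactional system can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for an operational database; the account table, its CHECK constraint and the transfer function are hypothetical names introduced only for this illustration and do not come from the chapter. A two-step update either completes in full or is undone, so the database always returns to its last internally consistent state.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Hypothetical two-step update; returns True if committed, False if rolled back."""
    try:
        # Used as a context manager, the connection commits when the block
        # succeeds and rolls back when an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the earlier credit
        # was undone together with it.
        return False


# Illustrative in-memory database (assumption: not from the source text).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # valid: both updates are committed
failed = transfer(conn, 1, 2, 500)  # overdraws account 1: both updates are undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to account 2 is applied before the debit fails, yet the final balances reflect only the first transfer. This is the consistency and recoverability guarantee that transactional design principles aim for, in contrast with the read-only, historical character of data in a DW.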