Report on the DB/IR panel at SIGMOD 2005
MOTIVATION

This paper summarizes the salient aspects of the SIGMOD 2005 panel on "Databases and Information Retrieval: Rethinking the Great Divide". The goal of the panel was to discuss whether we should rethink data management system architectures to truly merge Database (DB) and Information Retrieval (IR) technologies. The panel had very high attendance and generated lively discussions. (Panel slides are available at: www.research.att.com/sihem/SIGMOD-PANEL/.)

Until now, the DB and IR communities, while each very successful, have evolved largely independently of each other. The DB community has mostly focused on highly structured data, and has developed sophisticated techniques for efficiently processing complex and precise queries over this data. In contrast, the IR community has focused on searching unstructured data, and has developed various techniques for ranking query results and evaluating their effectiveness. Consequently, there has been no single unified system model for managing both structured and unstructured data, and for processing both precise and ranked queries. Most prior integration attempts have "glued" together DB and IR engines without making fundamental changes to either engine.

However, emerging applications such as content management and XML data management, which have an abundant mix of structured and unstructured data, require us to rethink data management assumptions such as the strict dichotomy between accessing content in DB and IR systems. In fact, recent trends in DB and IR research demonstrate a growing interest in adopting IR techniques in DBs and vice versa. The goal of this report is to issue new challenges to both communities, in particular from application, end-user, querying, and system architecture perspectives.

PANEL OVERVIEW

The panel included established DB and IR experts. We first list the questions posed to the panelists. We then present the viewpoint of each panelist and a summary of the discussion.

Panel Questions

1) Which real-world applications require a tight DB-IR integration? Can most applications be addressed by storing unstructured data as uninterpreted columns in a relational DB system, and invoking an IR engine over unstructured data?
2) XML is being touted as the dominant and pervasive standard that integrates structured and unstructured data, and XML query languages such as XQuery Full-Text attempt to support this.
Can we still cobble together a solution using traditional DB and IR systems? Or do we need to rethink the fundamental data management system architecture?
3) Does it make sense to evaluate "imprecise" queries over structured data and produce ranked results? Conversely, does it make sense to evaluate "precise and complex" queries over unstructured or semi-structured data? If so, do any of the IR techniques carry over to the structured domain, and vice versa? Does this then argue for or against a unified query model?
4) DB and IR systems are already complex pieces of software with decades of research and strong commercial backing. Is it possible to design a clean underlying formal model (akin to the relational model and IR ranking models) that captures the whole gamut of issues that both classes of systems deal with? Is it feasible to build a system based on what could be exceedingly complex data and query models? Would this gain acceptance in the marketplace and displace loosely coupled DB and IR systems?
5) Are there any "cultural" issues that would prevent a true DB-IR unification?