INDREX: In-database relation extraction

Torsten Kilias, Alexander Löser, Periklis Andritsos
2015 Information Systems  
The management of text data has a long-standing place in human history. A particularly common task is extracting relations from text. Typically, the user performs this task with two separate systems: a relation extraction system and an SQL-based query engine for analytical tasks. During this iterative analytical workflow, the user must frequently ship data between these systems. Worse, the user must learn to manage both systems. Therefore, end users often desire a single system for both analytical and relation extraction tasks. We propose INDREX, a system that provides a single, comprehensive view of the whole process, combining relation extraction with later exploitation through SQL. The system permits a data-warehouse-style extract-transform-load of generic relations extracted from text documents and can support additional text mining analysis libraries or systems. Once generic relations are loaded, the user can define SQL queries on the extracted relations to discover higher-level semantics or to join them with other relational data. To execute this task, our system extends the SQL-based analytical capabilities of a columnar, massively parallel query processing engine with a broad set of user-defined functions and a data model that supports them. Our white-box approach permits INDREX to benefit from the built-in query optimization and indexing techniques of the underlying query execution engine. Applications that support both text mining and analytical workflows typically leverage new analytical platforms based on the MapReduce framework and its open-source Hadoop implementation. We compare our system against this baseline. We measure execution times for common workflows and demonstrate orders-of-magnitude improvements in execution time using INDREX.
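As a rough illustration of the workflow the abstract describes (load generic relations once, then query and join them with SQL), the following minimal sketch uses Python's built-in sqlite3 module. It is not INDREX's data model, its user-defined functions, or its query engine: the table names `extracted_relation` and `company`, their columns, and the sample rows are all hypothetical and chosen only for this example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical table of generic relations already extracted from text.
        CREATE TABLE extracted_relation (
            doc_id    INTEGER,
            subject   TEXT,
            predicate TEXT,
            object    TEXT
        );
        -- Hypothetical table of ordinary relational data to join against.
        CREATE TABLE company (
            name   TEXT PRIMARY KEY,
            sector TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO extracted_relation VALUES (?, ?, ?, ?)",
        [
            (1, "Acme", "acquired", "Widgets Inc"),
            (2, "Acme", "hired", "Jane Doe"),
            (3, "Globex", "acquired", "Initech"),
        ],
    )
    conn.executemany(
        "INSERT INTO company VALUES (?, ?)",
        [("Acme", "manufacturing"), ("Globex", "energy")],
    )
    return conn


def acquisitions_by_sector(conn):
    """Join extracted relations with relational data in a single SQL query."""
    return conn.execute(
        """
        SELECT c.sector, r.subject, r.object
        FROM extracted_relation AS r
        JOIN company AS c ON c.name = r.subject
        WHERE r.predicate = 'acquired'
        ORDER BY c.sector, r.subject
        """
    ).fetchall()


if __name__ == "__main__":
    for row in acquisitions_by_sector(build_demo_db()):
        print(row)
```

Because the extracted relations and the reference data sit in the same database, the filter and the join run in one query with no data shipped between systems, which is the convenience the abstract argues for.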
doi:10.1016/ fatcat:jrweeola2ffanclmmfw5bpdff4