UniStore: Querying a DHT-based Universal Storage

Marcel Karnstedt, Kai-Uwe Sattler, Martin Richtarsky, Jessica Muller, Manfred Hauswirth, Roman Schmidt, Renault John
2007 IEEE 23rd International Conference on Data Engineering
Recently, the idea of collecting and combining large public data sets and services has become increasingly popular. The special characteristics of such systems and the requirements of their participants demand strictly decentralized solutions. However, this comes with several ambitious challenges that a corresponding system has to overcome. In this demonstration paper, we present a lightweight distributed universal storage capable of dealing with those challenges and providing a powerful and flexible way of building Internet-scale public data management systems. We introduce our approach, a triple storage on top of a DHT overlay system that builds on the ideas of a universal relation model and RDF, outline solved challenges and open issues, and present usage as well as demonstration aspects of the platform.

A Universal Storage based on DHTs

An increasing number of applications on the Web are based on the idea of collecting and combining large public data sets and services. In such public data management scenarios, the information, its structure, and its semantics are controlled by a large number of participants. Despite being distributed or decentralized with respect to data from a conceptual point of view, the supporting infrastructures of these systems are still based on inherently centralized concepts. The downsides of such centralized systems at the physical layer, such as bottlenecks, single points of failure, and enormous costs for providing the needed resources, are compounded by problems on a more logical level, e.g., the problem of integrating data/services and the need for database processing functionality. Examples of such applications include (specialized) Web search engines, scientific database applications, naming or directory services, and "social" applications such as file/picture sharing, encyclopedias, friend-of-a-friend networks, or recommender systems.

In this paper, we argue for a decentralization of data management by creating a universal distributed storage for such public data/metadata, which exploits the gigantic storage and processing capacity of the worldwide available Internet nodes in the same way as the network layer exploits the worldwide communication devices for routing messages between nodes.
Information sources are highly distributed, data is described according to heterogeneous schemas, no participant has a global view of all information, and data and service quality can only be guaranteed on a best-effort basis. In this context, the global challenge is to develop a lightweight, generic data management component playing the same role as the TCP/IP stack, together with a highly scalable infrastructure that enforces a fair distribution of storage and processing load in a highly dynamic world without any central control. For this type of public information management, DHT-based overlay systems offer an interesting alternative to existing information system architectures. While problems like scalability, robustness, and fair balance of load and work are covered by modern DHTs, new research problems have to be addressed, the most prominent being: data may exist in a large number of different schema organizations, and the expressiveness of queries and the possible guarantees (existence, completeness, etc.) are currently limited.

For a distributed universal storage as we propose, the key issues can be classified along three questions: (1) How to structure and organize data in massively distributed settings? (2) How to query data, and how to query it efficiently? (3) What is needed to get a robust and practical solution? The first question raises two main problems: we need a generic and flexible schema for structuring data, and we have to deal with heterogeneities on both the schema and the data level. The second question highlights challenges of query processing: the system has to support a combination of classical DB-like queries that restrict and combine data (selection, projection, join, set operations) and IR-style queries (e.g., keyword search over all attributes, similarity). Moreover, querying schema data (attributes, correspondences) has to be supported as well.
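To make the first two questions more concrete, the following is a minimal sketch of how tuples with differing schemas could be decomposed into (OID, attribute, value) triples and placed by hashing, so that an exact-match selection and a keyword search each become a single key lookup. It is an illustration under stated assumptions and not the UniStore implementation: the class `ToyTripleDHT`, the three key prefixes (`oid:`, `av:`, `v:`), the node count, and the use of SHA-1 modulo the node count in place of real overlay routing are all hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 8  # hypothetical size of the toy overlay


def node_for(key: str) -> int:
    """Map a key to a node id by hashing (a stand-in for DHT routing)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES


class ToyTripleDHT:
    """In-memory toy: each 'node' is a dict from key to a set of triples."""

    def __init__(self):
        self.nodes = [defaultdict(set) for _ in range(NUM_NODES)]

    def _put(self, key, triple):
        self.nodes[node_for(key)][key].add(triple)

    def _get(self, key):
        return self.nodes[node_for(key)].get(key, set())

    def insert_tuple(self, oid, record):
        """Split one tuple into triples; no fixed schema is required."""
        for attr, value in record.items():
            triple = (oid, attr, str(value))
            # Assumed indexing scheme: store each triple under three keys.
            self._put(f"oid:{oid}", triple)           # reassemble a tuple
            self._put(f"av:{attr}#{value}", triple)   # selection attr = value
            self._put(f"v:{value}", triple)           # keyword over all attributes

    def select(self, attr, value):
        """DB-style exact-match selection; returns matching OIDs."""
        return {oid for (oid, _, _) in self._get(f"av:{attr}#{value}")}

    def keyword(self, value):
        """IR-style lookup of a value regardless of its attribute."""
        return {oid for (oid, _, _) in self._get(f"v:{value}")}

    def fetch(self, oid):
        """Rebuild the tuple for one OID from its triples."""
        return {attr: val for (_, attr, val) in self._get(f"oid:{oid}")}


# Three records with heterogeneous attribute sets (illustrative data only).
store = ToyTripleDHT()
store.insert_tuple("p1", {"title": "UniStore", "year": 2007})
store.insert_tuple("p2", {"name": "P-Grid", "year": 2007})
store.insert_tuple("p3", {"title": "2007", "venue": "ICDE"})

by_year = store.select("year", 2007)   # only records with attribute 'year'
by_keyword = store.keyword("2007")     # any attribute holding '2007'
record = store.fetch("p1")
```

In this sketch the selection touches only the node responsible for the `av:` key, while the keyword lookup also finds the record whose `title` happens to be "2007". Joins, range and similarity queries, and schema correspondences are deliberately left out, since exact-match hashing alone does not support them.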
Physical query processing should exploit the features of the underlying infrastructure (e.g., hash-based placement, topology-aware routing and multicasting), come with worst-case guarantees, and involve cost-based and adaptive query opti-
doi:10.1109/icde.2007.369054 dblp:conf/icde/KarnstedtSRMHSJ07 fatcat:yx5fjbhnq5a5jhxv34xyroevue