Enabling information integration and workflows in a grid environment with automatic wrapper generation
The 6th IEEE/ACM International Workshop on Grid Computing, 2005.
With a growing trend towards grid-based data repositories and data analysis services, scientific data analysis often involves accessing multiple data sources and analyzing the data using a variety of analysis programs. One critical challenge, however, is that data sources often hold the same type of data in a number of different formats, and the formats expected and generated by various data analysis services are often distinct. We believe that the traditional approach for dealing with this problem, which is using hand-written wrappers, is not an effective and scalable solution for a grid environment. This paper presents a new approach, which involves generating wrappers automatically to enable grid-based information integration and workflows. In this approach, a layout descriptor is used to describe the data format of each data source, as well as the input and output formats of each tool or service. Efficient wrappers are then generated automatically for translation between any two data formats. Our design separates the wrapper generation service from wrapper execution. The wrapper generation service analyzes the layout descriptors and generates a WRAPINFO data structure. The wrapper comprises a set of application-independent modules which take the WRAPINFO data structure as input. We demonstrate our wrapper generation tool with two real case studies. Besides showing the effectiveness of our system, the experimental results from these two case studies show that the wrapper generation overhead is very small, that automatically generated wrappers scale well to large datasets, and that, for the one case where this comparison was possible, the execution time of our wrapper was within 30% of that of a hand-written one.
• To achieve interoperability between N data formats, O(N^2) wrappers have to be written. A single update in a data format will involve rewriting O(N) wrappers. Thus, hand-written wrappers are not scalable with respect to the number of available resources, because of the high programming and maintenance effort involved.
• [...]lished, or existing ones move to new formats, only their layout descriptors need to be written or rewritten.
• Resources can be discovered on-the-fly, and as long as they contain the layout descriptors as part of their metadata, they can be integrated with other resources automatically.
• Unnecessary transformation of data is avoided. In comparison, some approaches for integration require that all data be converted to a single format (such as XML), which can be very expensive if the datasets are large.
Please see http://forge.gridforum.org/projects/dfdl-wg
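The separation between wrapper generation and wrapper execution described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the dictionary-based descriptor schema, the format names `csv`, `tsv_swapped` and `binary`, and the functions `generate_wrapinfo` and `run_wrapper` are hypothetical and do not reproduce the paper's layout descriptor language or its actual WRAPINFO structure. The sketch only shows the general idea that one descriptor per format is enough to translate directly between any pair of formats, without converting through a common intermediate format.

```python
import struct

# Hypothetical layout descriptors (toy schema, not the paper's descriptor language).
# Each descriptor lists its fields in storage order as (name, type) pairs.
DESCRIPTORS = {
    "csv": {"encoding": "text", "delimiter": ",",
            "fields": [("id", "int"), ("temp", "float")]},
    "tsv_swapped": {"encoding": "text", "delimiter": "\t",
                    "fields": [("temp", "float"), ("id", "int")]},
    "binary": {"encoding": "binary",
               "fields": [("id", "int"), ("temp", "float")]},
}

STRUCT_CODES = {"int": "i", "float": "d"}   # codes for the struct module
TEXT_PARSERS = {"int": int, "float": float}


def generate_wrapinfo(src, dst):
    """Generation step: analyze two descriptors once and return a plain-data
    plan (a stand-in for WRAPINFO) that maps source fields to target order."""
    src_names = [name for name, _ in src["fields"]]
    src_types = dict(src["fields"])
    mapping = []
    for name, typ in dst["fields"]:
        if src_types.get(name) != typ:
            raise ValueError(f"field {name!r} missing or mistyped in source")
        mapping.append(src_names.index(name))
    return {"src": src, "dst": dst, "mapping": mapping}


def _record_struct(desc):
    # Little-endian, unpadded fixed-size record built from the field types.
    return struct.Struct("<" + "".join(STRUCT_CODES[t] for _, t in desc["fields"]))


def read_records(desc, data):
    """Generic reader: yields one tuple per record, driven only by the descriptor."""
    if desc["encoding"] == "binary":
        yield from _record_struct(desc).iter_unpack(data)
    else:
        for line in data.decode("ascii").splitlines():
            parts = line.split(desc["delimiter"])
            yield tuple(TEXT_PARSERS[t](p) for (_, t), p in zip(desc["fields"], parts))


def write_records(desc, records):
    """Generic writer: serializes tuples, driven only by the descriptor."""
    if desc["encoding"] == "binary":
        packer = _record_struct(desc)
        return b"".join(packer.pack(*rec) for rec in records)
    text = "".join(desc["delimiter"].join(str(v) for v in rec) + "\n" for rec in records)
    return text.encode("ascii")


def run_wrapper(wrapinfo, data):
    """Execution step: application-independent code that only interprets the plan."""
    records = read_records(wrapinfo["src"], data)
    reordered = (tuple(rec[i] for i in wrapinfo["mapping"]) for rec in records)
    return write_records(wrapinfo["dst"], reordered)
```

For example, `run_wrapper(generate_wrapinfo(DESCRIPTORS["csv"], DESCRIPTORS["tsv_swapped"]), b"1,21.5\n2,19.0\n")` reorders the two fields and switches the delimiter. Under these assumptions, adding a fourth format means writing one more descriptor rather than six more hand-written translators, and updating a format means editing only its descriptor.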