BioKleisli: Integrating Biomedical Data and Analysis Packages [chapter]

Susan B. Davidson, O. Peter Buneman, Jonathan Crabtree, Val Tannen, G. Christian Overton, Limsoon Wong
Bioinformatics: Databases and Systems  
Data of interest to biomedical researchers associated with the Human Genome Project (HGP) is stored all over the world in a variety of electronic data formats and is accessible through a variety of interfaces and retrieval languages. These data sources include conventional relational databases with SQL interfaces, formatted text files on top of which indexing is provided for efficient retrieval (ASN.1), and binary files that can be interpreted textually or graphically via special-purpose interfaces (...); there are also image databases of molecular and chemical structures. Researchers within the HGP want to combine data from these different data sources, add value through sophisticated data analysis techniques (such as the biosequence comparison software BLAST and FASTA), and view it using special-purpose scientific visualization tools. However, there are currently no commercial tools for enabling such an integrated digital library, and a fundamental barrier to developing such tools appears to be one of language design and optimization. For example, while tools exist for interoperating between heterogeneous relational databases, the data formats and software packages found throughout the HGP contain a number of data types not easily available in conventional databases, such as lists, variants, and arrays; furthermore, these types may be deeply nested. We present in this paper a language for querying and transforming data from heterogeneous sources, discuss its implementation in a system called BioKleisli, and illustrate its use in accessing data sources critical to the HGP.
doi:10.1007/0-306-46903-0_18 fatcat:coirnqnpyrhabhqd3ozkhv4afe