Personalized Biomedical Data Integration [chapter]

Xiaoming Wang, Olufunmilayo Olopade, Ian Foster
2011 Biomedical Engineering, Trends in Electronics, Communications and Software  
2. Background

Current status of biomedical data

To illustrate the need to integrate individualized biomedical data, we start with an example. To assess the association between genetic background and the occurrence of a particular cancer and its treatment outcomes, a cancer translational researcher likely needs to: 1) screen family history through medical surveys on a selected cohort; 2) read the pathology report on each individual's histological diagnosis; 3) check surgical, chemotherapy, and radiation records in the clinics; 4) follow the outcomes and adverse events of the treatments; 5) record dates and evidence of cancer recurrence and metastasis; 6) find DNA samples in the specimen bank; and 7) conduct genotyping experiments and link the genotype results back to the phenotypic records. To extract meaningful information from these data, the researcher needs them to be unambiguously aligned to individual persons, but linking these data together, even for a modest number of subjects, often fails because of data heterogeneity and discontinuity.

Combining biomedical data with integrity at the individual level frequently encounters four distinct challenges.

The first challenge is caused by source heterogeneity. Data elements and/or schemas for the same domain data that are designed by independent parties will normally be semantically different. Such heterogeneity may also exist in different (or the same) versions of software developed by the same party. To further complicate matters, many data sources are subject to dynamic change in all aspects, including data structures, ontology standards, and instance data coding methods. These sources customarily do not provide metadata or mapping information between datasets from previous and newer versions.

The second challenge stems from data descriptor inconsistencies. Many biomedical domains do not have established ontologies, and others have more than one set of standard taxonomies. For example, one can find official taxonomies for describing cancers in SNOMED (Cote and Robboy 1980), the International Classification of Diseases (ICD) (Cimino 1996), and the NCI Thesaurus (Sioutos, de Coronado et al. 2007).

The third challenge comes from data source management styles. Most data sources are isolated and autonomously operated.
These sources typically neither map nor retain the primary identifiers (of a person or of the specimens that originated from the person) created in other sources. The siloed setup of the data sources not only generates segregated datasets but often requires the same records (e.g., patient demographic data) to be re-entered by hand into different sources. This practice increases the risk of human error.

The fourth challenge is due to low data source interoperability. Most clinical data sources neither are programmatically accessible (syntactic interoperability) nor provide metadata for the source data (semantic interoperability).

Many of these problems persist today and will linger for the foreseeable future. As a consequence, biomedical source data are typically heterogeneous, inconsistent, fragmented, dirty, and difficult to process. Valuable information embedded within the data cannot be consumed until the data are cleansed, unified, standardized, and integrated.

[…] various information integration approaches, data warehousing, view integration, and information mashup are popular regimes that are actively discussed in IT and informatics publications (Halevy 2001; Jhingran 2006; …). Each regime has its own distinct design principle and system architecture.

Hardcover, 736 pages
Publisher: InTech
doi:10.5772/13017 fatcat:2f5aywboufcfveou6qbf3y2oam