Exploring large document repositories with RDF technology: the DOPE project
IEEE Intelligent Systems
Innovative research institutes rely on the availability of complete and accurate information about new research and development. Information providers such as Elsevier make it their business to provide the required information in a cost-effective way. The Semantic Web will likely contribute significantly to this effort because it facilitates access to an unprecedented quantity of data. The DOPE project (Drug Ontology Project for Elsevier) explores ways to provide access to multiple life science information sources through a single interface.

With the unremitting growth of scientific information, integrating access to all this information remains an important problem, primarily because the information sources involved are so heterogeneous. Sources might use different syntactic standards (syntactic heterogeneity), organize information in different ways (structural heterogeneity), and even use different terminologies to refer to the same information (semantic heterogeneity). Integrated access hinges on the ability to address these different kinds of heterogeneity. Also, mental models and keywords for accessing data generally diverge between subject areas and communities; hence, many different ontologies have emerged. An ideal architecture must therefore support the disclosure of distributed and heterogeneous data sources through different ontologies.

To serve this need, we've developed a thesaurus-based search system that uses automatic indexing, RDF-based querying, and concept-based visualization. We describe here the conversion of an existing proprietary thesaurus to an open standard format, a generic architecture for thesaurus-based information access, an innovative user interface, and results of initial user studies with the resulting DOPE system.

Thesaurus-based information access

Thesauri have proven to be essential for effective information access. They provide controlled vocabularies for indexing information and thereby help to overcome many free-text search problems by relating and grouping relevant terms in a specific domain. Thesauri in the life sciences include MeSH, produced by the US National Library of Medicine (www.nlm.nih.gov/mesh/meshhome.html), and EMTREE, Elsevier's life science thesaurus (www.elsevier.com/homepage/sah/spd/site).
These thesauri provide access to information sources (in particular document repositories) such as PubMed (http://pubmed.org) and EMBASE.com (http://embase.com), but currently no open architecture exists to support using these thesauri for querying other data sources. For example, when we move from centralized, controlled use of EMTREE within EMBASE.com to a distributed setting, we must improve access to the thesaurus with a standardized representation using open data standards that allow for semantic qualifications. RDF (Resource Description Framework) is such a standard. Elsevier maintains the EMTREE thesaurus as a termi-

This thesaurus-based search system uses automatic indexing, RDF-based querying, and concept-based visualization of results to support exploration of large online document repositories.
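
To make the idea of a standardized RDF representation of a thesaurus more concrete, the following is a minimal sketch in plain Python, not the DOPE project's actual conversion. It assumes a toy thesaurus with preferred labels, synonyms, and broader-term links, and it borrows SKOS property names purely for illustration; the text does not say which RDF vocabulary was used. The `http://example.org/thesaurus/` namespace, the concept identifiers, the sample terms, and all function names are hypothetical.

```python
from typing import NamedTuple

# Assumption: SKOS property names are used for illustration only.
SKOS = "http://www.w3.org/2004/02/skos/core#"
# Hypothetical namespace; these are not real EMTREE identifiers.
EX = "http://example.org/thesaurus/"


class Triple(NamedTuple):
    subject: str      # URI of the concept
    predicate: str    # URI of the property
    obj: str          # URI or literal value
    is_literal: bool  # True if obj is a plain string literal


# Invented toy data, not taken from EMTREE or MeSH.
SAMPLE = {
    "c1": {"pref": "analgesic agent"},
    "c2": {"pref": "acetylsalicylic acid", "alt": ["aspirin"], "broader": "c1"},
}


def build_triples(concepts):
    """Convert a dict-based toy thesaurus into a list of RDF-style triples."""
    triples = []
    for cid, entry in concepts.items():
        subj = EX + cid
        triples.append(Triple(subj, SKOS + "prefLabel", entry["pref"], True))
        for alt in entry.get("alt", []):
            triples.append(Triple(subj, SKOS + "altLabel", alt, True))
        if entry.get("broader"):
            triples.append(Triple(subj, SKOS + "broader", EX + entry["broader"], False))
    return triples


def to_ntriples(t):
    """Serialize one triple as an N-Triples line (plain literals only)."""
    if t.is_literal:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


def find_concepts(triples, term):
    """Return URIs of concepts whose preferred or alternative label matches term."""
    labels = {SKOS + "prefLabel", SKOS + "altLabel"}
    wanted = term.strip().lower()
    return {
        t.subject
        for t in triples
        if t.predicate in labels and t.is_literal and t.obj.lower() == wanted
    }


def narrower(triples, concept_uri):
    """Return URIs of concepts that name concept_uri as their broader term."""
    return {
        t.subject
        for t in triples
        if t.predicate == SKOS + "broader" and t.obj == concept_uri
    }


if __name__ == "__main__":
    for triple in build_triples(SAMPLE):
        print(to_ntriples(triple))
```

In this sketch, a free-text term such as a synonym resolves to a single concept URI through `find_concepts`, which is the kind of grouping of related terms that a controlled vocabulary provides. Because each statement is a self-contained triple, other data sources could in principle be indexed against the same concept URIs.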