Breaking Through the Syntax Barrier: Searching with Entities and Relations
Lecture Notes in Computer Science
The next wave in search technology will be driven by the identification, extraction, and exploitation of real-world entities represented in unstructured textual sources. Search systems will either let users express information needs naturally and analyze them more intelligently, or allow simple enhancements that give users more control over the search process. The data model will exploit graph structure where available, but not impose structure by fiat. First-generation Web search, which uses graph information at the macroscopic level of inter-page hyperlinks, will be enhanced to use fine-grained graph models involving page regions, tables, sentences, phrases, and real-world entities. New algorithms will combine probabilistic evidence from diverse features to produce responses that are not URLs or pages, but entities and their relationships, or explanations of how multiple entities are related.

Toward More Expressive Search

Search systems for unstructured textual data have improved enormously since the days of Boolean queries over title and abstract catalogs in libraries. Web search engines index much of the full text from billions of Web pages and serve hundreds of millions of users per day. They use rich features extracted from the graph structure and markup in hypertext corpora. Despite these advances, even the most popular search engines make us feel that we are searching with mere strings: we do not find direct expression of the entities involved in our information need, let alone relations that must hold between those entities in a proper response. In a plenary talk at the 2004 World Wide Web Conference, Udi Manber commented:

    If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music).

He was referring to the one-line text boxes in which users type one or two keywords and expect perfect gratification from the responses.

Apart from classical Information Retrieval (IR), several communities are converging on the quest for expressive search, though they come from very different origins.

Databases and XML: To be sure, the large gap between the user's information need and the expressed query is well known. The database community has traditionally been uncomfortable with the imprecise nature of queries inherent in IR.