Natural-language retrieval of images based on descriptive captions
ACM Transactions on Information Systems
We describe a prototype intelligent information-retrieval system that uses natural-language understanding to efficiently locate captioned data. Multimedia data generally require captions to explain their features and significance. Such descriptive captions often rely on long nominal compounds (strings of consecutive nouns), which create word-sense disambiguation problems. In our system, captions and user queries are parsed and interpreted to produce a logical form, using a detailed theory of the meaning of nominal compounds. A fine-grain match can then compare the logical form of the query to the logical forms for each caption. To improve system efficiency, we first perform a coarse-grain match with index files, using nouns and verbs extracted from the query. Our experiments with randomly selected queries and captions from an existing image library show an increase of 30% in precision and 50% in recall over the keyphrase approach currently used. Our processing times have a median of seven seconds as compared to eight minutes for the existing system, and our system is much easier to use.

Natural-language processing addresses shortcomings presented by keyword-based retrieval. In keyword approaches, the user is often required to remember the valid words (i.e., keywords), how these keywords correlate with the concepts that he or she wishes to find, and how the keywords may be combined to formulate queries. By removing such limitations and allowing unrestricted English phrases both for queries and for describing the data, the goal is an information-retrieval system that is easier to use and provides more relevant responses to user queries. Our work applies these artificial-intelligence techniques to the general problem of retrieving multimedia data identified by caption descriptions; this article describes the techniques applied specifically to images. We have developed and tested our ideas using the image database from the Naval Air Warfare Center Weapons Division China Lake Photo Lab. This database contains over 100,000 historical photographs, slides, and selected video frames of aircraft and weapon projects from the last 50 years. The images show mostly aircraft and missiles in flight, weapon systems configured on various aircraft, targets and drones being hit, aerial views of the scenery surrounding the Center, and test equipment. The images are used for reference, project documentation, and publications.
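The two-stage matching strategy described in the abstract, a coarse-grain index pass followed by a fine-grain logical-form comparison, can be sketched as follows. This is a hypothetical simplification, not the paper's implementation: the toy database, the image identifiers, and the representation of logical forms as sets of predicate-argument tuples are all assumptions made for illustration.

```python
# Hypothetical sketch of two-stage caption retrieval: a coarse-grain
# pass over an inverted index of nouns and verbs narrows the candidate
# set, then a fine-grain pass compares simplified "logical forms"
# (here modeled as sets of predicate-argument tuples).

from collections import defaultdict

# Toy captioned-image database: id -> (content words, logical form).
CAPTIONS = {
    "img1": ({"sidewinder", "missile", "launch"},
             {("launch", "missile"), ("type", "missile", "sidewinder")}),
    "img2": ({"aircraft", "flight"},
             {("fly", "aircraft")}),
}

def build_index(captions):
    """Coarse-grain inverted index from content words to image ids."""
    index = defaultdict(set)
    for image_id, (words, _) in captions.items():
        for word in words:
            index[word].add(image_id)
    return index

def retrieve(query_words, query_form, captions, index):
    # Coarse-grain: collect captions sharing any noun/verb with the query.
    candidates = set()
    for word in query_words:
        candidates |= index.get(word, set())
    # Fine-grain: require the query's logical form to be subsumed
    # by the candidate caption's logical form.
    return sorted(c for c in candidates if query_form <= captions[c][1])

index = build_index(CAPTIONS)
print(retrieve({"missile", "launch"}, {("launch", "missile")},
               CAPTIONS, index))  # -> ['img1']
```

The coarse-grain pass keeps the expensive logical-form comparison off captions that share no content words with the query, which is the efficiency point the abstract makes.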
Registration (bookkeeping) information captures customer information about a particular image shoot, including customer name and code; the date, time, and location of the shoot; the photographer and film used; cross-references to related images; and an English caption description. Image identifiers are used to uniquely index the images. The caption provides free-form information about the image, describing either an event occurring in the image or unique characteristics and features of weapon systems and the Center.

The linguistic structure of the captions can be characterized as a sublanguage [Sager 1986] with many nominal compounds but few verbs, adverbs, determiners, and conjunctions. The nominal compounds cause difficulties for a keyword-matching approach because word order in the compound is crucial to meaning. For instance, "target tank" refers to an actual tank, but "tank target" means a construction that looks like a tank to radar, so there is no tank in a "tank target." Krovetz and Croft show that these and similar ambiguities cause significant misunderstandings in information retrieval. Anick pointed out the user preference for natural-language querying over Boolean-expression queries when examining enhancements for the STARS retrieval system.

Our prototype system to support natural-language querying is called MARIE (Epistemological Information Retrieval Applied to Multimedia). The user interface for MARIE is shown in Figure 1 and consists of three types of windows: a window type for entering an English query and listing the search results ("Query Statement (In English)"), a window type for viewing the captions and registration information (of which there are four instances in Figure 1), and a window type for displaying images (of which there are four instances).
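The word-order problem with nominal compounds can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's parser: it contrasts an order-blind keyword comparison, which conflates "tank target" with "target tank", against a simple head-noun reading of English compounds, under which the final noun names what the phrase denotes.

```python
# Hypothetical illustration of why keyword matching mishandles nominal
# compounds: a bag-of-words comparison treats "tank target" and
# "target tank" as identical, while reading off the compound's head
# noun (its final word) keeps the two referents distinct.

def keyword_match(query, caption):
    """Order-blind keyword comparison over word sets."""
    return set(query.split()) == set(caption.split())

def compound_head(phrase):
    """In English nominal compounds the final noun is the head:
    a "tank target" is a kind of target, not a kind of tank."""
    return phrase.split()[-1]

print(keyword_match("tank target", "target tank"))  # True: conflated
print(compound_head("tank target"))                 # target
print(compound_head("target tank"))                 # tank
```

A real system must go further than the head noun, as MARIE's logical-form interpretation does, but even this one-line distinction captures why there is no tank in a "tank target."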