Video retrieval and summarization

Nicu Sebe, Michael S. Lew, Arnold W.M. Smeulders
2003 Computer Vision and Image Understanding  
This year, it is anticipated that 25% of the population of the wealthy countries will have a digital television camera at their disposal. The combined capacity to generate bits from these devices is astronomical. In addition, the growth in computer speed, disc capacity, and, most of all, the rapid growth of the Internet and WWW will make this information accessible worldwide. The immediate question is what to do with all the information. One could store the digital video information on tapes, CD-ROMs, DVDs, or any such device, but the level of access would be less than that of the well-known shoe boxes filled with tapes, old photographs, and letters. We need to ensure that the techniques for organizing video stay in tune with the tremendous amounts of information. So, with video on demand about to arrive, there is an urgent need for effective video retrieval and summarization methods.
Creating access to still images has proved to be a hard problem. It requires hard work, precise modeling, the inclusion of considerable amounts of a priori knowledge, and solid experimentation to analyze the contents of a photograph. Even though video tends to be much larger than images, it can be argued that access to video is a simpler problem than access to still images. First, video comes in color, and color provides easy clues to object geometry, position of the light, and identification of objects by pixel patterns, only at the expense of having to handle three times as much data as black and white. Second, video comes as a sequence, so what moves together most likely forms an entity in real life; segmentation of video is therefore intrinsically simpler than that of a still image, again at the expense only of more data to handle.
That does not mean progress will come for free. Moving from images to video adds several orders of complexity to the retrieval problem due to indexing, analysis, and browsing over the inherently temporal aspect of video. For example, the user can pose a similarity-based query such as "Find a video scene similar to this one." Responding to such a query requires representations of the image and of the temporal aspects of the video scene. Furthermore, higher-level representations that reflect the structure of the constituent video shots, or semantic temporal information such as gestures, could also aid in retrieving the right video scene.
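As a rough illustration of the similarity-based query described above, the following sketch compares scenes using one image descriptor and one temporal descriptor. It is not the authors' method: the intensity histogram, the frame-to-frame change measure, the weighting, and all function names (`histogram`, `describe`, `scene_distance`, `most_similar`) are assumptions chosen only to keep the example small. Frames are assumed to be non-empty lists of grayscale values in 0..255.

```python
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[int]  # flattened grayscale pixel values in 0..255 (assumed)


def histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Normalized intensity histogram of one frame (the image aspect)."""
    counts = [0] * bins
    for p in frame:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def describe(scene: Sequence[Frame], bins: int = 4) -> Tuple[List[float], float]:
    """Return (mean histogram, mean frame-to-frame histogram change)."""
    hists = [histogram(f, bins) for f in scene]
    mean_hist = [sum(h[i] for h in hists) / len(hists) for i in range(bins)]
    # Temporal aspect: how much the histogram changes between consecutive frames.
    changes = [l1(hists[i], hists[i + 1]) for i in range(len(hists) - 1)]
    activity = sum(changes) / len(changes) if changes else 0.0
    return mean_hist, activity


def scene_distance(a: Sequence[Frame], b: Sequence[Frame],
                   temporal_weight: float = 0.5) -> float:
    """Weighted sum of image distance and temporal distance (hypothetical)."""
    hist_a, act_a = describe(a)
    hist_b, act_b = describe(b)
    return ((1 - temporal_weight) * l1(hist_a, hist_b)
            + temporal_weight * abs(act_a - act_b))


def most_similar(query: Sequence[Frame],
                 library: Dict[str, Sequence[Frame]]) -> str:
    """Name of the library scene closest to the query scene."""
    return min(library, key=lambda name: scene_distance(query, library[name]))


# Toy data: a static dark scene and a scene that flickers between dark and bright.
dark_static = [[10] * 8, [10] * 8, [10] * 8]
flicker = [[10] * 8, [250] * 8, [10] * 8]
query = [[20] * 8, [20] * 8, [20] * 8]
library = {"dark_static": dark_static, "flicker": flicker}
print(most_similar(query, library))
```

In this toy setup the query is dark and unchanging, so it matches the static scene on both descriptors; a practical system would presumably replace both descriptors with richer ones, such as color, motion, or shot structure.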
A consequence of the growing consumer demand for visual information is that sophisticated technology is needed for representing, modeling, indexing, and retrieving multimedia data. In particular, we need robust techniques to index/retrieve and compress visual information, new scalable browsing algorithms allowing access to
doi:10.1016/j.cviu.2003.08.003 fatcat:6azywnsaurerfbtvftnscl6jbm