The 2013 SESAME Multimedia Event Detection and Recounting System
TREC Video Retrieval Evaluation
The SESAME team submitted runs as a full participant in the MED13 evaluation, using video, motion, and audio features; high-level semantic concepts for visual objects, scenes, persons, and actions; automatic speech recognition (ASR); and video optical character recognition (OCR). The individual types of features and concepts produced a total of eight event classifiers. We combined the event detection results of these classifiers using arithmetic-mean and log-likelihood ratio fusion methods, and developed and applied a method for selecting the detection threshold. The SESAME system generated event recountings by selecting intervals based on the semantic concepts and on concepts recognized by ASR and OCR. Our major findings are:

- Our strategy of first selecting the most informative interval of a video, and then determining the most appropriate event-related semantic concepts within that interval to display for multimedia event recounting (MER), produced the best ObsTextScore in the evaluation. (The ObsTextScore measures the judges' responses to the question "How well does the text of this observation describe the snippet(s)?")
- The multimedia event detection (MED) performance for 100Ex and 10Ex was dominated by the classifiers that exploited visual content.
- The ASR and OCR classifiers for 0Ex performed better than those trained with 10Ex.
- The log-likelihood ratio late-fusion method improved performance over simple averaging of event detection scores for 100Ex, but not for 10Ex.
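To make the two late-fusion variants concrete, the following is a minimal sketch, not the SESAME implementation: arithmetic-mean fusion simply averages per-classifier scores, while log-likelihood ratio fusion scores each video by how much more likely its classifier outputs are under an event model than under a non-event model. The Gaussian score models, training data, and all names here are illustrative assumptions.

```python
import numpy as np


class Gaussian:
    """Simple 1-D Gaussian fit to a set of classifier scores (assumed model)."""

    def __init__(self, scores):
        self.mean = float(np.mean(scores))
        self.std = float(np.std(scores)) + 1e-6  # avoid zero variance

    def logpdf(self, x):
        return (-0.5 * np.log(2 * np.pi * self.std ** 2)
                - (x - self.mean) ** 2 / (2 * self.std ** 2))


def mean_fusion(score_matrix):
    """Arithmetic-mean late fusion: average scores across classifiers.

    score_matrix has shape (n_videos, n_classifiers)."""
    return score_matrix.mean(axis=1)


def llr_fusion(score_matrix, pos_models, neg_models):
    """Log-likelihood ratio late fusion.

    For each video, sum log p(s_j | event) - log p(s_j | non-event)
    over the eight (here: n_classifiers) classifier scores s_j."""
    fused = np.zeros(score_matrix.shape[0])
    for j, (pos, neg) in enumerate(zip(pos_models, neg_models)):
        fused += pos.logpdf(score_matrix[:, j]) - neg.logpdf(score_matrix[:, j])
    return fused


# Illustrative usage with synthetic training scores (two classifiers):
pos_train = np.array([[0.8, 0.9], [0.7, 0.85], [0.9, 0.8]])   # event videos
neg_train = np.array([[0.2, 0.1], [0.3, 0.15], [0.1, 0.2]])   # non-event videos
pos_models = [Gaussian(pos_train[:, j]) for j in range(2)]
neg_models = [Gaussian(neg_train[:, j]) for j in range(2)]

test_scores = np.array([[0.85, 0.9],   # looks like an event
                        [0.15, 0.1]])  # looks like a non-event
fused = llr_fusion(test_scores, pos_models, neg_models)
```

Under this sketch, the event-like video receives a higher fused score than the non-event video with either method; the LLR variant additionally weights each classifier by how well its score separates the two training populations.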