41 Hits in 6.1 sec

Web-scale Multimedia Search for Internet Video Content

Lu Jiang
2016 Proceedings of the 25th International Conference Companion on World Wide Web - WWW '16 Companion  
See Fig. 1(a). Text-to-video queries are flexible and can be further refined by Boolean and temporal operators.  ...  In a text-to-video query, however, we might look for visual clues in the video content such as "cake", "gift" and "kids", audio clues like "birthday song" and "cheering sound", or visible text like "happy  ...  Based on the proposed methods, we implement E-Lamp Lite, the first of its kind large-scale semantic search engine for Internet videos.  ... 
doi:10.1145/2872518.2888599 dblp:conf/www/Jiang16 fatcat:gxjwoz4ijbf3jnxs6ibfiw6dw4

Web-scale Multimedia Search for Internet Video Content

Lu Jiang
2016 Proceedings of the Ninth ACM International Conference on Web Search and Data Mining - WSDM '16  
See Fig. 1(a). Text-to-video queries are flexible and can be further refined by Boolean and temporal operators.  ...  In a text-to-video query, however, we might look for visual clues in the video content such as "cake", "gift" and "kids", audio clues like "birthday song" and "cheering sound", or visible text like "happy  ...  Based on the proposed methods, we implement E-Lamp Lite, the first of its kind large-scale semantic search engine for Internet videos.  ... 
doi:10.1145/2835776.2855081 dblp:conf/wsdm/Jiang16 fatcat:imyuikdto5dflaxeykm3olsg3a

Special issue on visual information retrieval

Michael S. Lew
2016 International Journal of Multimedia Information Retrieval  
One of the current frontier areas is searching for video over the Internet. In the paper, "Text-to-video: a semantic search engine for internet videos" by Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura and Alexander G.  ...  With the flood of images and video from diverse sources (e.g., smartphones, NetFlix, FLICKR, Amazon, Instagram, Twitter, etc.), there is a great need to be able to browse, index and search through the  ... 
doi:10.1007/s13735-016-0094-7 fatcat:x437dbgd35fupnhjprfjlei5mu

GEM: A General Evaluation Benchmark for Multimodal Tasks [article]

Lin Su and Nan Duan and Edward Cui and Lei Ji and Chenfei Wu and Huaishao Luo and Yongfei Liu and Ming Zhong and Taroon Bharti and Arun Sacheti
2021 arXiv   pre-print
tasks and GEM-V for video-language tasks.  ...  In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks.  ...  We finetune m-UniVL on text-to-video retrieval and video captioning tasks. For retrieval, a learning rate of 1e-4 and a batch size of 128 are used to finetune m-UniVL for 50 epochs.  ... 
arXiv:2106.09889v1 fatcat:zwuq4lnufnblhcdxwepeekwvru

Strategies for Searching Video Content with Text Queries or Video Examples [article]

Shoou-I Yu, Yi Yang, Zhongwen Xu, Shicheng Xu, Deyu Meng, Zexi Mao, Zhigang Ma, Ming Lin, Xuanchong Li, Huan Li, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann, Chuang Gan (+2 others)
2016 arXiv   pre-print
The large number of user-generated videos uploaded onto the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search.  ...  However, metadata is often lacking for user-generated videos, thus these videos are unsearchable by current search engines.  ...  The U.S. government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.  ... 
arXiv:1606.05705v1 fatcat:rsbamqhrzjam7cgxwn43w2dqmy

[Invited Paper] Strategies for Searching Video Content with Text Queries or Video Examples

Shoou-I Yu, Yi Yang, Zhongwen Xu, Shicheng Xu, Deyu Meng, Zexi Mao, Zhigang Ma, Ming Lin, Xuanchong Li, Huan Li, Zhenzhong Lan, Lu Jiang (+4 others)
2016 ITE Transactions on Media Technology and Applications  
every day has led to many commercial video search engines, which mainly rely on text metadata for search.  ...  However, metadata is often lacking for user-generated videos, thus these videos are unsearchable by current search engines.  ...  The U.S. government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.  ... 
doi:10.3169/mta.4.227 fatcat:pfld3uehdzeynlfirdaji2pl54

Symbiosis between the TRECVid benchmark and video libraries at the Netherlands Institute for Sound and Vision

Johan Oomen, Paul Over, Wessel Kraaij, Alan F. Smeaton
2013 International Journal on Digital Libraries  
Query-log analyses show the shortcomings of manual annotation, therefore archives are complementing these annotations by developing novel search engines that automatically extract information from both  ...  Prototype and demonstrator systems developed as part of TRECVid are set to become a key driver in improving the qual-  ...  Results from the knownitem search task in 2010 suggest the best results could largely be attributed to matching query text to video metadata.  ... 
doi:10.1007/s00799-012-0102-3 fatcat:bl6e6qavvncejeydygngsqq3qy

Multi-modal Transformer for Video Retrieval [article]

Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
2020 arXiv   pre-print
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets.  ...  Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video.  ...  We thank the authors of [14] for sharing their codebase and features, and Samuel Albanie, in particular, for his help with implementation details.  ... 
arXiv:2007.10639v1 fatcat:r2duzkctwvbffetignqkmxjoka

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [article]

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke (+1 others)
2022 arXiv   pre-print
about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception  ...  For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT  ...  (a) video search {recipe video} with "A medium bowl is a bowl that is..." Bob: ok assume im done whats next. Alice: Step: Keep beating and slowly add the sugar until stiff peaks form.  ... 
arXiv:2204.00598v2 fatcat:sdzx7e6h2rfcbpegrnojeu37ra

Political Legitimacy in the Context of the Transformation of Digital Communications

Н. О. Наталіна, Vasyl' Stus Donetsk National University
2021 Political life  
As a result, users seek more privacy and switch to secure channels of communication, which does not promote search for public consensus and legitimization of political institutions.  ...  The search for the legitimization ways of political institutions within the above trends is the subject of the further scientific research of the author.  ...  Launched in 2018, the Tik Tok vertical video network continued this "text-to-video" revolution.  ... 
doi:10.31558/2519-2949.2021.3.5 fatcat:2p7sjumvyrfnxbczntgwoz35yu

Self-Supervised MultiModal Versatile Networks [article]

Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
2020 arXiv   pre-print
Videos are a rich source of multi-modal supervision.  ...  Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image.  ...  Acknowledgement The authors would like to thank Antoine Miech, Yusuf Aytar and Karen Simonyan for fruitful discussions as well as Luyu Wang and Elena Buchatskaya for help on the evaluation benchmarks.  ... 
arXiv:2006.16228v2 fatcat:62lngdeirbgy7acx435o4j427e

Framework for development of cognitive technology for children with hearing impairments

Amal Dandashi, Abdelghani Karkar, Jihad AlJaam, Samir Abou El-Seoud, Osman Ibrahim
2015 2015 International Conference on Interactive Collaborative Learning (ICL)  
The main aim of this study is to investigate the needs of people with HI in the Arab world, and propose a system design that would help alleviate the challenges they face.  ...  The system design is centered on Arabic-based Natural Language Processing, with the objectives focused on presenting a multiple component educational system that utilizes multimedia-based learning, to  ...  The lack of emergency services may also be compensated by the use of the emergency phone text-to-video and vice versa component. A.  ... 
doi:10.1109/icl.2015.7318082 fatcat:gf43h7ic2vhivn757ep2ee3i34

Linking People in Videos with "Their" Names Using Coreference Resolution [chapter]

Vignesh Ramanathan, Armand Joulin, Percy Liang, Li Fei-Fei
2014 Lecture Notes in Computer Science  
We develop a joint model for person naming and coreference resolution, and in the process, infer a latent alignment between tracks and mentions.  ...  Natural language descriptions of videos provide a potentially rich and vast source of supervision. However, the highly-varied nature of language presents a major barrier to its effective use.  ...  We thank A. Fathi, O. Russakovsky and S. Yeung for helpful comments and feedback. This research is partially supported by Intel, the NFS grant IIS-1115493 and DARPA-Mind's Eye grant.  ... 
doi:10.1007/978-3-319-10590-1_7 fatcat:zmx42zoh5vbctptpy54jvdfpla

A Metaverse: taxonomy, components, applications, and open challenges

Sang-Min Park, Young-Gab Kim
2022 IEEE Access  
The integration of enhanced social activities and neural-net methods requires a new definition of Metaverse suitable for the present, different from the previous Metaverse.  ...  , implementation, and application) rather than marketing or hardware approach to conduct a comprehensive analysis.  ...  In the process of scenario graph population, modal conversion (e.g., text-to-video and video-to-text conversion) is used for multimodal integration.  ... 
doi:10.1109/access.2021.3140175 fatcat:fnraeaz74vh33knfvhzrynesli

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow
2021 The Journal of Artificial Intelligence Research  
Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video.  ...  Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks.  ...  We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript.  ... 
doi:10.1613/jair.1.11688 fatcat:kvfdrg3bwrh35fns4z67adqp6i