2,688 Hits in 2.9 sec

Efficient Scene Text Detection with Textual Attention Tower [article]

Liang Zhang, Yufei Liu, Hang Xiao, Lu Yang, Guangming Zhu, Syed Afaq Shah, Mohammed Bennamoun, Peiyi Shen
2020 arXiv   pre-print
In this work, we propose an efficient and accurate approach to detect multi-oriented text in scene images.  ...  Scene text detection has received attention for years and has achieved impressive performance across various benchmarks.  ...  Textual Attention Tower: The Textual Attention Tower (TAT) is designed to fuse the feature maps from different stages.  ...
arXiv:2002.03741v1 fatcat:4tvwqt76vzex7a2rdgjgkt7lda
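The snippet describes fusing feature maps from several backbone stages under an attention weighting. A minimal PyTorch sketch of that idea follows; the channel counts and the per-pixel softmax gating are assumptions, not the paper's exact TAT design.

```python
# Minimal sketch of attention-weighted fusion of multi-stage feature maps.
# Illustrative only: layer names, channels, and gating are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, in_channels=256):
        super().__init__()
        # 1x1 conv predicts a per-pixel attention logit for each stage
        self.attn = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features):
        # features: list of maps [B, C, Hi, Wi] from different backbone stages
        target = features[0].shape[-2:]
        ups = [F.interpolate(f, size=target, mode="bilinear",
                             align_corners=False) for f in features]
        # per-stage spatial attention, softmax-normalized across stages
        logits = torch.stack([self.attn(f) for f in ups], dim=0)  # [S,B,1,H,W]
        weights = torch.softmax(logits, dim=0)
        return (weights * torch.stack(ups, dim=0)).sum(dim=0)     # [B,C,H,W]

fused = AttentionFusion()([torch.randn(1, 256, 64, 64),
                           torch.randn(1, 256, 32, 32)])
```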

A Case Study of NLG from Multimedia Data Sources: Generating Architectural Landmark Descriptions

Simon Mille, Leo Wanner, Spyridon Symeonidis, Maria Rousi, Montserrat Marimon Felipe, Klearchos Stavrothanasopoulos, Petros Alvanitopoulos, Roberto Carlini
2020 Zenodo  
In this paper, we present a pipeline system that generates architectural landmark descriptions using textual, visual and structured data.  ...  The pipeline comprises five main components: (i) a textual analysis component, which extracts information from Wikipedia pages; (ii) a visual analysis component, which extracts information from copyright-free  ...  First, an object detection module classifies indoor and outdoor scenes and detects landmark (in this case, building) elements and objects.  ...
doi:10.5281/zenodo.4529236 fatcat:k36enq2vtfavnk3uyuwhb2oz4q
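The abstract outlines a five-component pipeline. A minimal sketch of how such a pipeline can be composed follows; the record type and the stand-in component functions (analyze_text, analyze_images, generate) are hypothetical, not the authors' modules.

```python
# Sketch of a staged NLG pipeline over a shared record; only three of the
# five described components are stubbed here, and all names are assumptions.
from dataclasses import dataclass, field

@dataclass
class LandmarkRecord:
    text_facts: dict = field(default_factory=dict)
    visual_facts: dict = field(default_factory=dict)
    description: str = ""

def analyze_text(record, wiki_page):      # (i) textual analysis
    record.text_facts["summary"] = wiki_page.split(".")[0]
    return record

def analyze_images(record, detections):  # (ii) visual analysis
    record.visual_facts["objects"] = detections
    return record

def generate(record):                     # final text generation
    objs = ", ".join(record.visual_facts.get("objects", []))
    summary = record.text_facts.get("summary", "")
    record.description = f"{summary}. Visible elements: {objs}."
    return record

record = analyze_text(LandmarkRecord(),
                      "The Eiffel Tower is a wrought-iron lattice tower in Paris.")
record = generate(analyze_images(record, ["arch", "observation deck"]))
print(record.description)
```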

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval [article]

Mengjun Cheng, Yipeng Sun, Longchao Wang, Xiongwei Zhu, Kun Yao, Jie Chen, Guoli Song, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
2022 arXiv   pre-print
Compared to existing methods, ViSTA can aggregate relevant scene text semantics with visual appearance, and hence improves results under both scene-text-free and scene-text-aware scenarios.  ...  Compared with state-of-the-art scene-text-free retrieval methods, ViSTA can achieve better accuracy on Flickr30K and MSCOCO while running at least three times faster during the inference stage, which  ...  textual content and the image's visual features V together with its scene text features O.  ...
arXiv:2203.16778v1 fatcat:hldin76ql5hqtiq7ppvwmf6pmy
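One common way to aggregate scene-text semantics with visual appearance, as the snippet describes, is to feed both token sets through a shared transformer encoder and read out a fusion token. The sketch below illustrates that pattern; the dimensions and the fusion-token design are assumptions, not necessarily ViSTA's architecture.

```python
# Sketch: joint encoding of image patch tokens and scene-text (OCR) tokens,
# summarized by a learnable fusion token. Sizes are illustrative.
import torch
import torch.nn as nn

d = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=2)
fusion_token = nn.Parameter(torch.zeros(1, 1, d))

patch_tokens = torch.randn(4, 49, d)   # image patch embeddings
ocr_tokens = torch.randn(4, 10, d)     # embedded scene-text tokens

tokens = torch.cat([fusion_token.expand(4, -1, -1),
                    patch_tokens, ocr_tokens], dim=1)
out = encoder(tokens)
image_embedding = out[:, 0]            # fusion token summarizes both modalities
```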

Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [article]

Hengcan Shi, Munawar Hayat, Jianfei Cai
2022 arXiv   pre-print
The few existing solutions to unpaired referring grounding are still preliminary, due to the challenges of learning image-text matching and the lack of top-down guidance with unpaired data.  ...  Particularly, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps.  ...  [Figure: image, visual attention map, and query-aware visual attention map for example queries such as "tower building to the right of the tower" and "guy in yellow jacket"; precision of pseudo labels.]  ...
arXiv:2201.06686v2 fatcat:rn2ug5qoy5f3xfhjlk2bmh3qdy
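A query-aware attention map can be formed by correlating a pooled query embedding with every spatial location of the visual feature map. The sketch below shows that generic top-down formulation; it is not claimed to be the paper's exact QAM module.

```python
# Sketch of a query-conditioned spatial attention map over visual features.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 256, 32, 32)      # visual features [B, C, H, W]
query = torch.randn(1, 256)             # pooled embedding of the expression

B, C, H, W = feat.shape
sim = torch.einsum("bchw,bc->bhw", feat, query) / C ** 0.5
attn = F.softmax(sim.view(B, -1), dim=-1).view(B, H, W)   # sums to 1 over H*W
attended = (feat * attn.unsqueeze(1)).sum(dim=(2, 3))     # [B, C] grounded feature
```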

StacMR: Scene-Text Aware Cross-Modal Retrieval [article]

Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas
2020 arXiv   pre-print
Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text  ...  from the captions and text from the visual scene, and reconciles them in a common embedding space.  ...  Related Work: Scene-Text Detection and Recognition. Due to the large variance in text instances found in the wild [10, 64], scene text detection and recognition is still an active research field.  ...
arXiv:2012.04329v1 fatcat:ceuyotjoqbhd5cpau236ivsbqu
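Reconciling caption text and scene text in a common embedding space typically means projecting each source separately and comparing the results with a shared similarity. A minimal sketch, assuming simple linear projections and cosine scoring (not StacMR's actual model):

```python
# Sketch: image-side embedding combines visual and scene-text (OCR) features,
# then is scored against caption embeddings. Projections are assumptions.
import torch
import torch.nn.functional as F

def embed_image(visual_feat, scene_text_feat, W_v, W_o):
    # specialized projection for each source, then a shared space
    v = F.normalize(visual_feat @ W_v, dim=-1)
    o = F.normalize(scene_text_feat @ W_o, dim=-1)
    return F.normalize(v + o, dim=-1)

W_v, W_o, W_c = (torch.randn(512, 256) for _ in range(3))
img = embed_image(torch.randn(8, 512), torch.randn(8, 512), W_v, W_o)
cap = F.normalize(torch.randn(8, 512) @ W_c, dim=-1)
scores = img @ cap.t()                  # cosine similarities for retrieval
```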

Towards precise POI localization with social media

Adrian Popescu, Aymen Shabou
2013 Proceedings of the 21st ACM international conference on Multimedia - MM '13  
With the availability of large geotagged multimedia datasets on the Web, a sustained research effort has been dedicated to automatic POI discovery and characterization.  ...  Text-based POI localization / Text-based close-up ranking: Here we exploit textual cues to determine if a photo is a close-up.  ...  To test these hypotheses, we perform close-far image classification and introduce a simple but efficient spatial clustering algorithm seeded with POI close-up photos.  ...
doi:10.1145/2502081.2502151 dblp:conf/mm/PopescuS13 fatcat:ky5cywpydjdshcdbflkauxuebu
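A spatial clustering algorithm seeded with close-up photos can be sketched as nearest-seed assignment within a radius. The haversine distance, the 100 m radius, and the assignment rule below are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch of seeded spatial clustering: each geotagged photo joins the
# nearest POI seed if it lies within a fixed radius.
import math

def haversine_m(p, q):
    # great-circle distance in meters between (lat, lon) pairs
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371000 * 2 * math.asin(math.sqrt(a))

def seeded_clusters(seeds, photos, radius_m=100.0):
    clusters = {i: [s] for i, s in enumerate(seeds)}
    for p in photos:
        d, i = min((haversine_m(p, s), i) for i, s in enumerate(seeds))
        if d <= radius_m:
            clusters[i].append(p)
    return clusters

clusters = seeded_clusters(seeds=[(48.8584, 2.2945)],
                           photos=[(48.8581, 2.2950), (48.8600, 2.3500)])
```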

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [article]

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang
2022 arXiv   pre-print
textual and visual information into a unified semantic space over a corpus of image-text pairs.  ...  Large-scale free-text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the  ...  Introduction: Large-scale pre-training has drawn much attention in both the Computer Vision (CV) and Natural Language Processing (NLP) communities due to its strong generalization capability and efficient  ...
arXiv:2012.15409v4 fatcat:woa3moustzc6nexs3ggg3acsdm
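Cross-modal contrastive learning of the kind the snippet mentions is commonly implemented as a symmetric InfoNCE loss over a batch of image-text pairs. A minimal sketch, with an illustrative temperature, follows; it is not claimed to be UNIMO's exact CMCL recipe.

```python
# Sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss that
# aligns image and text embeddings in one space.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(len(img))              # matched pairs on diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```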

CPGAN: Full-Spectrum Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis [article]

Jiadong Liang and Wenjie Pei and Feng Lu
2020 arXiv   pre-print
text encoding.  ...  Particularly, we design a memory structure to parse the textual content by exploring the semantic correspondence between each word in the vocabulary and its various visual contexts across relevant images during  ...  Note that we replace Faster R-CNN with YOLOv3 for object detection, for computational efficiency.  ...
arXiv:1912.08562v2 fatcat:xmc5jqrkuvbp3hfviqbqvrz57q
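A memory structure that links each vocabulary word to its visual contexts can be sketched as a per-word bank of region features that the text encoder attends over. The slot count, dimensions, and residual update below are assumptions, not CPGAN's actual module.

```python
# Sketch of a word-to-visual-context memory: each word stores features of
# image regions it co-occurred with, and encoding attends over that bank.
import torch
import torch.nn.functional as F

vocab_size, slots, d = 1000, 5, 128
# memory[w] holds `slots` visual context vectors gathered for word w offline
memory = torch.randn(vocab_size, slots, d)

def encode_word(word_id, word_emb):
    ctx = memory[word_id]                          # [slots, d]
    attn = F.softmax(ctx @ word_emb / d ** 0.5, dim=0)
    return word_emb + attn @ ctx                   # word enriched by visual memory

enriched = encode_word(42, torch.randn(d))
```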

Information Extraction: The Power of Words and Pictures

Marie-Francine Moens
2007 Journal of Computing and Information Technology  
A number of challenging and emerging research directions are enumerated and illustrated with results obtained by the research group of the author.  ...  The paper stresses the importance of automatically analyzing and semantically annotating creative forms of human expression, among which are textual sources.  ...  Acknowledgements We are very grateful to the organizations that sponsored the research projects mentioned: ACILA (Automatic Detection and Classification of Arguments in a Legal Case), K.  ... 
doi:10.2498/cit.1001136 fatcat:tfpcm22xdranzmo6uo2sdlk7ya

Information Extraction: The Power of Words and Pictures

Marie-Francine Moens
2007 Information Technology Interfaces  
A number of challenging and emerging research directions are enumerated and illustrated with results obtained by the research group of the author.  ...  The paper stresses the importance of automatically analyzing and semantically annotating creative forms of human expression, among which are textual sources.  ...  Acknowledgements We are very grateful to the organizations that sponsored the research projects mentioned: ACILA (Automatic Detection and Classification of Arguments in a Legal Case), K.  ... 
doi:10.1109/iti.2007.4283737 fatcat:2ajmmbxndfe5vlm6ppgbeinkqi

Multilayer Network Model of Movie Script [article]

Youssef Mourchid, Benjamin Renoust, Hocine Cherifi, Mohammed El Hassouni
2018 arXiv   pre-print
- Script: the text source of the movie, containing descriptions of scenes, with settings and dialogues. - Scene: a chunk of the script, the temporal unit of the movie.  ...  These are markers we detect to chunk the script into scenes. Scene structure: sets are attached to locations, which are always included in the scene header, which we can easily parse.  ...
arXiv:1812.05718v1 fatcat:m6l3x7byg5cvvbxgdf74w7ibuy
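Chunking a script into scenes on detectable markers, as described, is straightforward when scene headers (sluglines such as "INT."/"EXT.") carry the set and location. A small sketch:

```python
# Sketch: split a screenplay into scenes at slugline headers and keep the
# location from each header.
import re

HEADER = re.compile(r"^(INT\.|EXT\.)\s+(.*)$", re.MULTILINE)

def split_scenes(script):
    scenes, matches = [], list(HEADER.finditer(script))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(script)
        scenes.append({"location": m.group(2).strip(),
                       "body": script[m.end():end].strip()})
    return scenes

sample = "INT. CASTLE - NIGHT\nARTHUR enters.\nEXT. FOREST - DAY\nBirds sing."
for scene in split_scenes(sample):
    print(scene["location"])
```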

Visual Entailment: A Novel Task for Fine-Grained Image Understanding [article]

Ning Xie, Farley Lai, Derek Doran, Asim Kadav
2019 arXiv   pre-print
The goal of a trained VE model is to predict whether the image semantically entails the text.  ...  Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/necla-ml/SNLI-VE.  ...  While the performance of image classification and object detection has significantly improved in recent years [42, 63, 65, 26], progress in higher-level scene reasoning tasks such as scene  ...
arXiv:1901.06706v1 fatcat:hj5zwsyakfgizbv2mkoydpi3uu
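The VE task as stated reduces to a three-way classification (entailment, neutral, contradiction) over a fused image-text representation. The concatenation-plus-MLP classifier below is a baseline-style sketch, not the paper's EVE model.

```python
# Sketch: fuse precomputed image and sentence features by concatenation and
# classify into entailment / neutral / contradiction.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(2048 + 768, 512), nn.ReLU(),
    nn.Linear(512, 3))                       # entailment, neutral, contradiction

img_feat = torch.randn(4, 2048)              # e.g. CNN image features
txt_feat = torch.randn(4, 768)               # e.g. sentence-encoder features
logits = classifier(torch.cat([img_feat, txt_feat], dim=-1))
pred = logits.argmax(dim=-1)
```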

Chronology and statistics: Objective understanding of authorial meaning

Tatsuhiro Ohno
2006 English Studies: A Journal of English Language  
One of the most useful tools for the objective detection of authorial meaning is the Sanger-Kroeber method: Sanger's chronological study of the structure of fiction and Kroeber's statistical quantification  ...  To obtain objective information about the three key structural elements (time, place, and characters), I first divided the story into scenes by time indicators in the text; then examined scene by scene  ...  The detection of authorial meaning may be too tricky to be done with definite conviction, but it can be, or rather should be, achieved with relative probability.  ...
doi:10.1080/00138380600610035 fatcat:xcombzvlhjaapk7edhxiopcylq
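The chronological bookkeeping described, dividing the story into scenes at time indicators and then examining each scene, can be sketched as follows; the marker list and the per-scene word counts are illustrative assumptions.

```python
# Sketch: split narrative text at time-indicator phrases and tabulate a
# simple statistic per resulting scene.
import re

TIME_MARKERS = r"(next morning|that evening|a week later|the following day)"

def scenes_by_time(text):
    parts = re.split(TIME_MARKERS, text, flags=re.IGNORECASE)
    # re.split keeps the captured markers; pair each with the text after it
    scenes = [{"marker": "start", "text": parts[0]}]
    for marker, body in zip(parts[1::2], parts[2::2]):
        scenes.append({"marker": marker, "text": body})
    return scenes

story = "Emma arrived. Next morning she left. A week later she returned."
for s in scenes_by_time(story):
    print(s["marker"], "->", len(s["text"].split()), "words")
```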

Enhancing cultural tourism by a mixed reality application for outdoor navigation and information browsing using immersive devices

Federico Debandi, Roberto Iacoviello, Alberto Messina, Maurizio Montagnuolo, Federico Manuri, Andrea Sanna, Davide Zappia
2018 IOP Conference Series: Materials Science and Engineering  
Moreover, if the object of interest is detected and tracked by the mixed reality application, 3D content can also be overlaid and aligned with the real world.  ...  The user can select the object (monument/building/artwork) for which augmented content is to be displayed (video, text, audio), and can interact with this content through a set of defined gestures.  ...  Introduction: Augmented Reality (AR) provides an efficient and intuitive way to visualize computer-generated information overlaid and aligned with objects in the real environment.  ...
doi:10.1088/1757-899x/364/1/012048 fatcat:7caeq3fpdffwfay6uvawctxbrq
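Overlaying and aligning content with a detected, tracked object is commonly done by estimating a homography from the overlay's corners to the object's position in the frame. A generic OpenCV sketch (not the paper's actual stack):

```python
# Sketch: warp an overlay onto a tracked planar region via homography.
import cv2
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)     # camera frame (stub)
overlay = np.full((100, 200, 3), 255, dtype=np.uint8)

# corners of the overlay and where the tracked object sits in the frame
src = np.float32([[0, 0], [200, 0], [200, 100], [0, 100]])
dst = np.float32([[220, 140], [420, 150], [410, 260], [210, 250]])

H, _ = cv2.findHomography(src, dst)
warped = cv2.warpPerspective(overlay, H, (frame.shape[1], frame.shape[0]))
composited = np.where(warped > 0, warped, frame)    # naive mask-free blend
```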

Geotagging in multimedia and computer vision—a survey

Jiebo Luo, Dhiraj Joshi, Jie Yu, Andrew Gallagher
2010 Multimedia tools and applications  
The presence of geographically relevant metadata with images and videos has opened up interesting research avenues within the multimedia and computer vision domains.  ...  We will discuss the nature of different modalities and lay out factors that are expected to govern the choices with respect to multimedia and vision applications.  ...  This is followed by a hierarchical organization of scenes or views for efficient browsing.  ... 
doi:10.1007/s11042-010-0623-y fatcat:esd7subpbjhntpes6quvngtwti
Showing results 1–15 of 2,688 results