Efficient Scene Text Detection with Textual Attention Tower
[article]
2020
arXiv
pre-print
In this work, we propose an efficient and accurate approach to detect multi-oriented text in scene images. ...
Scene text detection has received attention for years and achieved impressive performance across various benchmarks. ...
Textual Attention Tower: The Textual Attention Tower (TAT) is designed to fuse the feature maps from different stages. ...
arXiv:2002.03741v1
fatcat:4tvwqt76vzex7a2rdgjgkt7lda
A Case Study of NLG from Multimedia Data Sources: Generating Architectural Landmark Descriptions
2020
Zenodo
In this paper, we present a pipeline system that generates architectural landmark descriptions using textual, visual and structured data. ...
The pipeline comprises five main components: (i) a textual analysis component, which extracts information from Wikipedia pages; (ii) a visual analysis component, which extracts information from copyright-free ...
First, an object detection module classifies indoor and outdoor scenes and detects landmark (in this case, building) elements and objects. ...
doi:10.5281/zenodo.4529236
fatcat:k36enq2vtfavnk3uyuwhb2oz4q
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
[article]
2022
arXiv
pre-print
Compared to existing methods, ViSTA can aggregate relevant scene text semantics with visual appearance, and hence improves results under both scene-text-free and scene-text-aware scenarios. ...
Compared with state-of-the-art scene-text-free retrieval methods, ViSTA achieves better accuracy on Flickr30K and MSCOCO while running at least three times faster during the inference stage, which ...
textual content and the image's visual features V together with its scene text features O. ...
arXiv:2203.16778v1
fatcat:hldin76ql5hqtiq7ppvwmf6pmy
Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching
[article]
2022
arXiv
pre-print
The few existing solutions to unpaired referring grounding are still preliminary, due to the challenges of learning image-text matching and the lack of top-down guidance with unpaired data. ...
Particularly, we design a query-aware attention map (QAM) module that introduces top-down perspective via generating query-specific visual attention maps. ...
[Figure: image, visual attention map, and query-aware visual attention map for sample queries such as "building to the right of the tower", "left bottom sand mountain", "white short person", and "guy in yellow jacket"; precision of pseudo labels.] ...
arXiv:2201.06686v2
fatcat:rn2ug5qoy5f3xfhjlk2bmh3qdy
StacMR: Scene-Text Aware Cross-Modal Retrieval
[article]
2020
arXiv
pre-print
Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. ...
Related Work: Scene-Text Detection and Recognition. Due to the large variance in text instances found in the wild [10, 64], scene text detection and recognition is still an active research field. ...
arXiv:2012.04329v1
fatcat:ceuyotjoqbhd5cpau236ivsbqu
Towards precise POI localization with social media
2013
Proceedings of the 21st ACM international conference on Multimedia - MM '13
With the availability of large geotagged multimedia datasets on the Web, a sustained research effort has been dedicated to automatic POI discovery and characterization. ...
Text-based POI localization
Text-based close-up ranking: Here we exploit textual cues to determine whether a photo is a close-up. ...
To test these hypotheses, we perform close-far image classification and introduce a simple but efficient spatial clustering algorithm seeded with POI close-up photos. ...
doi:10.1145/2502081.2502151
dblp:conf/mm/PopescuS13
fatcat:ky5cywpydjdshcdbflkauxuebu
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
[article]
2022
arXiv
pre-print
textual and visual information into a unified semantic space over a corpus of image-text pairs. ...
Large-scale free text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the ...
Introduction: Large-scale pre-training has drawn much attention in both the Computer Vision (CV) and Natural Language Processing (NLP) communities due to its strong capability of generalization and efficient ...
arXiv:2012.15409v4
fatcat:woa3moustzc6nexs3ggg3acsdm
CPGAN: Full-Spectrum Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis
[article]
2020
arXiv
pre-print
text encoding. ...
Particularly, we design a memory structure to parse the textual content by exploring the semantic correspondence between each word in the vocabulary and its various visual contexts across relevant images during ...
Note that we replace Faster R-CNN with YOLOv3 for object detection, for computational efficiency. ...
arXiv:1912.08562v2
fatcat:xmc5jqrkuvbp3hfviqbqvrz57q
Information Extraction: The Power of Words and Pictures
2007
Journal of Computing and Information Technology
A number of challenging and emerging research directions are enumerated and illustrated with results obtained by the research group of the author. ...
The paper stresses the importance of automatically analyzing and semantically annotating creative forms of human expression, among which are textual sources. ...
Acknowledgements: We are very grateful to the organizations that sponsored the research projects mentioned: ACILA (Automatic Detection and Classification of Arguments in a Legal Case), K. ...
doi:10.2498/cit.1001136
fatcat:tfpcm22xdranzmo6uo2sdlk7ya
Information Extraction: The Power of Words and Pictures
2007
Information Technology Interfaces
A number of challenging and emerging research directions are enumerated and illustrated with results obtained by the research group of the author. ...
The paper stresses the importance of automatically analyzing and semantically annotating creative forms of human expression, among which are textual sources. ...
Acknowledgements: We are very grateful to the organizations that sponsored the research projects mentioned: ACILA (Automatic Detection and Classification of Arguments in a Legal Case), K. ...
doi:10.1109/iti.2007.4283737
fatcat:2ajmmbxndfe5vlm6ppgbeinkqi
Multilayer Network Model of Movie Script
[article]
2018
arXiv
pre-print
- Script: a text source of the movie, which has descriptions of scenes, with settings and dialogues.
- Scene: a chunk of a script; the temporal unit of the movie. ...
These are the markers we detect to chunk the script into scenes. Scene structure: sets are attached to locations, which are always included in the scene header and which we can easily parse. ...
arXiv:1812.05718v1
fatcat:m6l3x7byg5cvvbxgdf74w7ibuy
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
[article]
2019
arXiv
pre-print
The goal of a trained VE model is to predict whether the image semantically entails the text. ...
Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/necla-ml/SNLI-VE. ...
While the performance of image classification and object detection has significantly improved in recent years [42, 63, 65, 26], progress in higher-level scene reasoning tasks such as scene ...
arXiv:1901.06706v1
fatcat:hj5zwsyakfgizbv2mkoydpi3uu
Chronology and statistics: Objective understanding of authorial meaning
2006
English Studies: A Journal of English Language
One of the most useful tools for the objective detection of authorial meaning is the Sanger-Kroeber method: Sanger's chronological study of the structure of fiction and Kroeber's statistical quantification ...
To obtain objective information about the three key structural elements (time, place, and characters), I first divided the story into scenes by time indicators in the text; then examined scene by scene ...
The detection of authorial meaning may be too tricky to be done with definite conviction, but it can be, or rather should be, achieved with relative probability. ...
doi:10.1080/00138380600610035
fatcat:xcombzvlhjaapk7edhxiopcylq
Enhancing cultural tourism by a mixed reality application for outdoor navigation and information browsing using immersive devices
2018
IOP Conference Series: Materials Science and Engineering
Moreover, if the object of interest is detected and tracked by the mixed reality application, 3D content can also be overlaid and aligned with the real world. ...
The user can select the object (monument/building/artwork) for which augmented content is to be displayed (video, text, audio); the user can interact with this content through a set of defined gestures. ...
Introduction: Augmented Reality (AR) provides an efficient and intuitive way to visualize computer-generated information overlaid and aligned with objects in the real environment. ...
doi:10.1088/1757-899x/364/1/012048
fatcat:7caeq3fpdffwfay6uvawctxbrq
Geotagging in multimedia and computer vision—a survey
2010
Multimedia tools and applications
The presence of geographically relevant metadata with images and videos has opened up interesting research avenues within the multimedia and computer vision domains. ...
We will discuss the nature of different modalities and lay out factors that are expected to govern the choices with respect to multimedia and vision applications. ...
This is followed by a hierarchical organization of scenes or views for efficient browsing. ...
doi:10.1007/s11042-010-0623-y
fatcat:esd7subpbjhntpes6quvngtwti
Showing results 1–15 of 2,688.