Text-to-picture tools, systems, and approaches: a survey

Jezia Zakraoui, Moutaz Saleh, Jihad Al Ja'am
2019, Multimedia Tools and Applications
Text-to-picture systems attempt to facilitate high-level, user-friendly communication between humans and computers while promoting understanding of natural language. These systems interpret a natural language text and transform it into a visual format, as pictures or images that are either static or dynamic. In this paper, we aim to identify current difficulties and the main problems faced by prior systems, and in particular, we seek to investigate the feasibility of automatic visualization of Arabic story text through multimedia. Hence, we analyzed a number of well-known text-to-picture systems, tools, and approaches. We showed their constituent steps, such as knowledge extraction, mapping, and image layout, as well as their performance and limitations. We also compared these systems based on a set of criteria, mainly natural language processing, natural language understanding, and input/output modalities. Our survey showed that currently emerging techniques in natural language processing and computer vision have made promising advances in analyzing general text and understanding images and videos. Furthermore, important remarks and findings have been deduced from these prior works, which should help in developing an effective text-to-picture system for learning and educational purposes.

Early text-to-picture systems concentrated on pictorially representing nouns and some spatial prepositions, as in maps and charts. For instance, the authors in [58] built the SPRINT system, which generates 3D geometric models from natural language descriptions of a scene using spatial constraints extracted from the text. Throughout the last decade, many working text-to-picture systems have been developed; however, more efficient approaches and algorithms are still needed. Joshi et al. [31] proposed a story picture engine that depicts the events and ideas conveyed by a piece of text in the form of a few representative pictures. Rada et al. [42] proposed a system for the automatic generation of pictorial representations of simple sentences, using WordNet as a lexical resource for the automatic translation of an input text into pictures. Ustalov [12] developed a text-to-picture system called Utkus for the Russian language. Utkus was later enhanced to operate with an ontology, allowing loose coupling of the system's components, unifying the interacting objects' representation and behavior, and making verification of the system's information resources possible [13].
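The constituent steps surveyed above (knowledge extraction, mapping, and image layout) can be illustrated with a minimal keyword-to-picture pipeline. This is a sketch only: the toy lexicon, file names, and function names below are illustrative assumptions standing in for a real lexical resource such as WordNet and an image database, not part of any surveyed system.

```python
import re

# Toy lexicon standing in for a lexical resource plus an image database:
# maps a keyword to an image file name. Purely illustrative.
LEXICON = {
    "dog": "dog.png",
    "ball": "ball.png",
    "park": "park.png",
}

def extract_keywords(text):
    """Knowledge extraction: keep only the words found in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in LEXICON]

def map_to_images(keywords):
    """Mapping: look up one image per extracted keyword."""
    return [LEXICON[k] for k in keywords]

def layout(images, per_row=2):
    """Image layout: arrange the selected images into rows."""
    return [images[i:i + per_row] for i in range(0, len(images), per_row)]

rows = layout(map_to_images(extract_keywords("The dog plays with a ball in the park.")))
```

Real systems replace each stage with far richer components (parsers, sense disambiguation, spatial reasoning), but the three-stage structure is the common skeleton the survey compares.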
A system that creates pictures to illustrate instructions for medical patients was developed by Duy et al. [15]. It has a pipeline of five processing phases: pre-processing, medication annotation, post-processing, image construction, and image rendering. More recently, a medical record summary system was developed by Ruan et al. [54]. It enables users to quickly grasp a patient's medical data, which is visualized spatially and temporally based on a categorization into multiple classes consisting of event categories and six physiological systems. A novel assisted instant messaging program that searches for images in an offline database based on keywords was proposed by Jiang et al. [29]; the final representation of the picture is constructed from a set of the most representative images. Jain et al. [28] proposed a Hindi natural language processor called Vishit, which aims to help with communication between cultures that use different languages at universities. The authors prepared an offline image repository module consisting of semantic feature tags that serve in the selection and representation of appropriate images, and the system eventually displays illustrations linked with the textual messages. Other important approaches [11, 23, 53] in the domain of news streaming have been proposed to usefully represent emotions and news stories. The latter approach introduced a new deep neural network architecture to combine text and image representations and address several tasks in the domain of news articles, including story illustration, source detection, geolocation and popularity prediction, and automatic captioning; all of these technical contributions were evaluated on a newly prepared dataset. According to [24], despite all these features of text-to-picture systems, they still have many limitations in performing their assigned tasks. The authors pointed out that the visualization could be made more dynamic.
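Selecting "a set of the most representative images" for a set of keywords, as in the offline-database approach of [29], can be sketched as ranking candidate images by keyword-tag overlap. The database, tagging scheme, and scoring below are illustrative assumptions, not the actual method of that system.

```python
# Toy offline database: image name -> set of semantic tags. Illustrative only.
DATABASE = {
    "beach_sunset.png": {"beach", "sunset", "sea"},
    "city_night.png": {"city", "night", "lights"},
    "sea_waves.png": {"sea", "waves", "beach"},
}

def representative_images(keywords, db=DATABASE, top_k=2):
    """Rank images by keyword-tag overlap; return the top matches.

    Ties are broken alphabetically so the result is deterministic;
    images with no overlapping tag are never returned.
    """
    scored = sorted(
        db.items(),
        key=lambda item: (-len(item[1] & keywords), item[0]),
    )
    return [name for name, tags in scored[:top_k] if tags & keywords]

picks = representative_images({"beach", "sea"})
```

A production system would of course use learned text and image embeddings rather than exact tag matching, but the retrieve-and-rank shape is the same.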
They suggested directly creating the scene rather than showing representative pictures; this can be done via text-to-scene systems such as NALIG [1] and WordsEye [9], or text-to-animation systems such as Text-to-Picture Synthesis [21] (animated pictures) and Carsim [14] (animations), the latter of which converts narrative text about car accidents into 3D scenes using information-extraction techniques coupled with a planning and a visualization module. The CONFUCIUS system is also capable of converting single sentences into corresponding 3D animations [38]. Its successor, SceneMaker [22], extends CONFUCIUS by adding common-sense knowledge for genre specification and emotional expression, and by capturing emotions from the scripts. A common text-to-picture application is children's stories, in which the pictures tell more of the story than the simple text does [2, 5, 19, 57]. Huang et al. proposed VizStory in [25] as a way to visualize fairy tales by transforming the text into suitable pictures, with consideration for the narrative structure and semantic contents of the stories. Interactive storytelling systems
doi:10.1007/s11042-019-7541-4