Deep Learning for Mobile Multimedia

Kaoru Ota, Minh Son Dao, Vasileios Mezaris, Francesco G. B. De Natale
2017 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
Deep Learning (DL) has become a crucial technology for multimedia computing. It offers a powerful instrument to automatically produce high-level abstractions of complex multimedia data, which can be exploited in a number of applications including object detection and recognition, speech-to-text, media retrieval, multimodal data analysis, and so on. The availability of affordable large-scale parallel processing architectures, and the sharing of effective open-source codes implementing the basic learning algorithms, have caused a rapid diffusion of DL methodologies, bringing a number of new technologies and applications that in most cases outperform traditional machine learning technologies. In recent years, the possibility of implementing DL technologies on mobile devices has attracted significant attention. Thanks to this technology, portable devices may become smart objects capable of learning and acting. The path towards these exciting future scenarios, however, entails a number of important research challenges. DL architectures and algorithms are not easily adapted to the storage and computation resources of a mobile device. Therefore, there is a need for new generations of mobile processors and chipsets, small-footprint learning and inference algorithms, new models of collaborative and distributed processing, and a number of other fundamental building blocks. This survey reports the state of the art in this exciting research area, looking back to the evolution of neural networks, and arriving at the most recent results in terms of methodologies, technologies and applications for mobile environments.
... in Facebook. DNN technologies are rapidly spreading thanks to the convergence of two key factors: the unprecedented interest in Big Data analytics, and the advances in hardware technologies. The former is creating the need for powerful tools capable of processing enormous volumes of low-structured data and extracting significant information. In fact, traditional machine learning techniques are in most cases unsuitable for this purpose, for they require a lot of work to define and calculate suitable data descriptions (feature extraction), and are usually unable to generalize and scale to large, unstructured, and heterogeneous data. DNNs overcome this issue by allowing computers to easily and automatically extract features from unstructured data without human intervention. The latter is enabling the deployment of DNN technologies at a large scale.
At the same time, advances in parallel computing architectures make DNNs increasingly feasible. Specifically, in recent years powerful and compact GPUs have been released at affordable prices, which accelerate the computation of the weights of DNNs. Such units provide massive parallel processing, speeding up the execution of both learning and inference algorithms by several orders of magnitude as compared to conventional CPUs. In addition, the availability of effective open-source libraries and frameworks for implementing basic learning algorithms (see, e.g., Chainer [3], Caffe [53], Torch [10], Theano [14], TensorFlow [12]) made DL methodologies easily available to anyone, causing their rapid diffusion not only within the research community, but also for commercial purposes. Among the many potential application areas, DL brings great opportunities in multimedia computing, where it can greatly enhance the performance of various components, such as object detection and recognition, speech-to-text translation, media information retrieval, multi-modal data analysis, and so on. DL is particularly suited to this domain, as multimedia data are intrinsically associated with severe computational and storage problems. Furthermore, the possibility of introducing smart multimedia applications in mobile environments is gaining more and more attention, due to the rapid spreading of smart portable devices. As a consequence, there is an increasing interest in the possibility of applying DNNs to mobile environments [61]. DL can not only boost the performance of the mobile multimedia applications available nowadays, but could also pave the way towards more sophisticated uses of mobile devices. Many such devices, including smartphones, smart cameras, pads, etc., hold some sensing and processing capability that can potentially make them smart objects capable of learning and acting, either stand-alone or interconnected with other intelligent objects.
As an example, a remote health-care system may use wearable devices to produce a huge amount of sensor data such as pulse rate, blood pressure, images of face and body, and even audio and video of the patient and the environment. All those data may be used to monitor the patient's condition, but require efficient on-the-fly processing to produce a compact stream of significant information. Despite the great potential of mobile DNNs, it is not always straightforward to match the requirements of the neural architectures with the constraints imposed by mobile and wireless environments. There are a number of important problems to be solved, starting from the fundamental DNN technologies and extending to network architectures, training and inference algorithms, footprint reduction, etc. For example, because the network connections among mobile devices can be unstable and unreliable, existing DNN algorithms need more efficient fault tolerance and security technologies. Meanwhile, unlike traditional high-performance servers, mobile multimedia devices such as wireless sensors and smartphones usually have limited resources in terms of energy, computing power, memory, network bandwidth, and so on. This creates a need for more efficient DNN technologies, which can cope with the constraints of mobile multimedia. Furthermore, because of the dynamic scale and network topologies, mobile multimedia computing needs more scalable DNN support than traditional data center environments. A possible solution to these problems is to use distributed processing facilities such as cloud computing, where powerful servers can be used to handle heavy DNN processes with plenty of
doi:10.1145/3092831 fatcat:ez2fcgckhjawlfywyecest4jqy