Self-Organization of Spatio-Temporal Hierarchy via Learning of Dynamic Visual Image Patterns on Action Sequences

Minju Jung, Jungsik Hwang, Jun Tani, Xuchu Weng
2015 PLoS ONE  
It is well known that the visual cortex efficiently processes high-dimensional spatial information by using a hierarchical structure. Recently, computational models that were inspired by the spatial hierarchy of the visual cortex have shown remarkable performance in image recognition. Up to now, however, most biological and computational modeling studies have mainly focused on the spatial domain and do not discuss temporal domain processing of the visual cortex. Several studies on the visual
more » ... tex and other brain areas associated with motor control support that the brain also uses its hierarchical structure as a processing mechanism for temporal information. Based on the success of previous computational models using spatial hierarchy and temporal hierarchy observed in the brain, the current report introduces a novel neural network model for the recognition of dynamic visual image patterns based solely on the learning of exemplars. This model is characterized by the application of both spatial and temporal constraints on local neural activities, resulting in the selforganization of a spatio-temporal hierarchy necessary for the recognition of complex dynamic visual image patterns. The evaluation with the Weizmann dataset in recognition of a set of prototypical human movement patterns showed that the proposed model is significantly robust in recognizing dynamically occluded visual patterns compared to other baseline models. Furthermore, an evaluation test for the recognition of concatenated sequences of those prototypical movement patterns indicated that the model is endowed with a remarkable capability for the contextual recognition of long-range dynamic visual image patterns. consists of two main pathways: (1) a ventral, or what, pathway for object recognition and (2) a dorsal, or where, pathway for information processing related to position and movement [3, 4] . Both organizations of the ventral and dorsal pathway show a hierarchical structure of progressively increasing selectivity, invariance, and receptive field sizes [5] . Due to the aforementioned benefits and biological plausibility of a hierarchical structure, a machine vision scheme, the so called deep learning, has been introduced to extract high-level abstractions through multiple non-linear transformations imposed on the hierarchy [6-10]. The core idea of deep-learning-based machine vision is that all necessary information processing structures for recognizing visual image patterns are self-organized in hierarchical neural network models through iterative learning of exemplar visual image patterns. Among such models, a convolutional neural network (CNN) [6], developed using inspiration from the mammalian visual cortex for its spatial hierarchical processing of visual features, has shown remarkably superior recognition performance for static natural visual images compared to conventional vision recognition schemes which used elaborately hand-coded visual features [10]. However, in practical applications, these approaches are limited to the recognition of static visual image patterns. In other words, they cannot recognize dynamic visual image patterns extended in time because they do not comprise of any temporal processing mechanism. To overcome this limitation, it has been proposed that dynamic visual image patterns can be recognized by simply transforming a sequence of 2-dimensional visual spatial patterns within a fixed temporal window into a large 3-dimensional pattern [11] [12] [13] , such as a 3D convolutional neural network (3D CNN) [11] . Indeed, these models performed well on many challenging video recognition datasets by simply extracting short-range temporal correlations. However, it cannot be extended to contextual recognition of dynamic visual image patterns which requires the extraction of relatively long-range temporal correlations. Baccouche et al. [14] has proposed a two-stage model to maintain temporal information in the entire sequence by adding a long short-term memory (LSTM) network [15] as a second stage of the 3D CNN. However, the spatial information process and the temporal one are not fully reconciled with a single principle in their model. The current study introduces a method that can solve the problem of contextual recognition of dynamic visual image patterns by using the capability of a newly proposed neuro-dynamic model in self-organizing an adequate spatio-temporal hierarchy through only iterative learning of exemplars. The proposed model, termed a multiple spatio-temporal scales neural network (MSTNN), was constructed from two essential ideas adopted from two different existing neural network models. One is the self-organization of a spatial hierarchy through the learning of static visual image patterns seen in the CNN [6], and the other is the self-organization of a temporal hierarchy through the learning of dynamic patterns such as sensory-motor sequences seen in a multiple timescales recurrent neural network (MTRNN) [16] . The MSTNN consists of several layers, and each layer is characterized by spatial constraints: local receptive field and weight sharing of the CNN, and a temporal constraint: time constant of a leaky integrator model, which was adapted for the MTRNN. The size of the receptive fields and the timescale of dynamics gradually increase along the hierarchy. The spatial hierarchy of the receptive fields for the MSTNN is analogous to the organization observed in the mammalian visual cortex [17] , and there is evidence for the assumed temporal hierarchy of progressively slower timescale dynamics of the MSTNN in the human visual cortex [18] . The premise for this assumption is that functional hierarchy could be self-organized by using both spatial and temporal constraints incorporated in the learning processes of massively high-dimensional spatio-temporal patterns present in dynamic visual image patterns. To evaluate the performance of the MSTNN, we conducted two classes for human action recognition-by-learning tasks. Human action recognition tasks should involve sophisticated Self-Organization of Spatio-Temporal Hierarchy for Dynamic Vision PLOS ONE |
doi:10.1371/journal.pone.0131214 pmid:26147887 pmcid:PMC4492609 fatcat:5sxff7m7bfbwdijcjegyydshcm