Learning invariant and variant components of time-varying natural images
Journal of Vision
A remarkable property of biological visual systems is their ability to infer structure within the visual world. To infer structure, a useful representation should separate the invariant from the variant information [1, 2]. Invariant information is important for determining 'what' we are seeing, that is, for recognizing objects and interpreting scenes, while variant information captures the 'where' or 'how' information, the transformations of objects. It has been hypothesized that biological visual systems represent 'what' and 'where' visual information in two separate cortical processing streams. How do biological systems decompose visual information into separate invariant and variant representations? To explore such a decomposition, we present a model that learns to separate the invariant from the variant part of time-varying natural movies. We first reformulate the sparse coding model, in which images are generated as a linear combination of over-complete basis functions weighted by sparse causal variables, so that images are instead represented in terms of a multiplicative interaction between two sets of causal variables. One set of variables is constrained to change slowly over time (the invariant representation), and the other set is allowed to change quickly over time and is encoded as a phase angle (the variant representation). Together, these variables decompose the original sparse coding variables into invariant and variant representations. After training on natural image sequences, the learned basis functions are similar to those produced by the original sparse coding model: Gabor-like functions that are spatially localized, oriented, and bandpass. However, the multiplicative decomposition produces both invariant components with slowly changing responses, which indicate the presence of a visual shape, and variant components in the form of precessing phase angles over time, which indicate the shape's transformations. The model thus predicts two classes of cells in primary visual cortex that form the beginnings of the 'what' and 'where' cortical streams. Moreover, the decomposition provides a starting point for constructing hierarchical models that capture the global structure and interaction of the 'what' and 'where' representations.
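The multiplicative decomposition described above can be illustrated with a toy example. The abstract does not give the model's equations, so the following is only a minimal sketch under stated assumptions: a single hand-built quadrature pair of 1-D Gabor functions stands in for one learned basis function, a frame is synthesized as amplitude times a phase-rotated mixture of the pair, and the amplitude and phase are read back by projection. No learning or sparsity is involved, and all function names (`gabor_pair`, `synthesize`, `decompose`) are hypothetical.

```python
import math

def gabor_pair(n=65, freq=0.1, sigma=8.0):
    """Return a unit-norm even/odd (quadrature) pair of 1-D Gabor functions.

    Assumption: this hand-built pair stands in for a learned basis function.
    """
    xs = [i - n // 2 for i in range(n)]  # symmetric grid, so even and odd are orthogonal
    env = [math.exp(-x * x / (2 * sigma ** 2)) for x in xs]
    even = [e * math.cos(2 * math.pi * freq * x) for e, x in zip(env, xs)]
    odd = [e * math.sin(2 * math.pi * freq * x) for e, x in zip(env, xs)]

    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    return unit(even), unit(odd)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def synthesize(amplitude, phase, even, odd):
    """Multiplicative interaction: amplitude scales a phase-rotated mix of the pair."""
    return [amplitude * (math.cos(phase) * e + math.sin(phase) * o)
            for e, o in zip(even, odd)]

def decompose(frame, even, odd):
    """Recover (amplitude, phase) by projecting the frame onto the pair."""
    c, s = dot(frame, even), dot(frame, odd)
    return math.hypot(c, s), math.atan2(s, c)

even, odd = gabor_pair()

# A toy "movie": the amplitude stays fixed while the phase advances each frame,
# which makes the Gabor carrier drift under its envelope.
true_amplitude = 1.5
true_phases = [0.3 * t for t in range(10)]
frames = [synthesize(true_amplitude, p, even, odd) for p in true_phases]

recovered = [decompose(f, even, odd) for f in frames]
for t, (a, p) in enumerate(recovered):
    print(f"frame {t}: amplitude={a:.3f}, phase={p:.3f}")
```

In this sketch the recovered amplitude is constant across frames while the recovered phase advances steadily, mirroring the slowly changing invariant component and the precessing phase angle of the variant component.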