While a general theory of vision was already the main objective at the dawn of the discipline (1), computer vision has evolved without a systematic exploration of its foundations in the framework of machine learning. In most cases, computer vision is regarded simply as an application of machine learning. When the target is moved to unrestricted visual environments and the emphasis shifts from huge labelled databases to a human-like protocol of interaction, we need to go beyond the current peaceful interlude that we are experiencing in vision and machine learning. So far, the semantic labeling of the pixels of a given video stream has mostly been carried out at frame level. This seems to be the natural outcome of well-established pattern recognition methods working on images, which have given rise to today's emphasis on collecting big labelled image databases (e.g. (2)) with the purpose of devising and testing challenging machine learning algorithms. While this is the framework in which most state-of-the-art object recognition approaches have been developed, we argue that there are strong reasons to start exploring the more natural visual interaction that animals experience in their own environment. This leads to processing videos instead of image collections, which is closely related to the growing interest in learning in the wild that has been explored in the last few years (see e.g. https://sites.google.com/site/wildml2017icml/).
A crucial problem that has been recognized by Poggio and Anselmi (3) is the need to incorporate into deep nets visual invariances that go beyond the simple translation invariance that currently characterizes convolutional networks. They propose an elegant mathematical framework on visual invariance and highlight some intriguing neurobiological connections. Overall, the ambition of extracting distinctive features from vision poses a challenging task. While we are typically concerned with feature extraction methods that are independent of classic geometric transformations, it looks like we are still missing the fantastic human skill of capturing distinctive features to recognize “ironed and rumpled shirts”, for example. Humans have no apparent difficulty in recognizing a shirt coherently when its sleeves are rolled up, or when it is simply curled up into a ball for the laundry basket. Of course, there are neither rigid transformations, like translations and rotations, nor scale maps that transform an ironed shirt into the same shirt thrown into the laundry basket. In this paper, we claim that motion invariance can in fact capture all we need. Translation and scale invariance, which have been the subject of many studies (4; 5), are in fact examples of invariances that can be fully gained whenever we develop the ability to detect features that are invariant under motion. For instance, when infants move a finger toward their face, they are led to enforce a natural invariance: the finger becomes bigger and bigger as it approaches, but it is still the same finger, which requires a consistent decision. Clearly, translation, rotation, and complex deformation invariances derive from motion invariance. Human life always involves motion, so the visual invariances that are gained naturally arise from motion invariance.
Animals with foveal eyes also quickly move the focus of attention when looking at fixed objects, which means that they continually experience motion. Hence, also in the case of fixed images, conjugate, vergence, saccadic, smooth pursuit, and vestibulo-ocular movements lead to the acquisition of visual information from relative motion. We claim that the production of such a continuous visual stream naturally drives feature extraction, since the corresponding convolutional features are expected not to change during motion. The enforcement of this consistency condition creates a mine of visual data during animal life. Of course, we need to compute the optical flow at pixel level so as to enforce the consistency of all the extracted features. Early studies on this problem (6), along with recent related improvements (see e.g. (7)), suggest computing the velocity field first, by enforcing brightness invariance. Once the optical flow is available, it is used to enforce motion consistency on the visual features. Interestingly, the theory we propose is closely related to the variational approach that is used to determine the optical flow in (6), but the joint development of the features can also be used to reinforce motion estimation. It is worth mentioning that an effective visual system must also develop features that do not follow motion invariance. These kinds of features can be conveniently combined with those that are discussed in this paper with the purpose of carrying out high level visual tasks.
The visual features are derived in the framework of the principle of least cognitive action (8), which gives rise to a time-variant differential equation, where the Lagrangian coordinates correspond to the values of the convolutional filters. The learning process can be interpreted as the minimization of the cognitive action, which offers a self-consistent framework.
We consider the mechanisms that give rise to the construction of local features for any pixel of the retina, at any time t. These features, along with the video itself, can be regarded as visual fields that are defined on the retina and on a given time horizon [0 . . T]. A set of symbols is extracted at every layer of a deep architecture, so that each pixel, along with its context, turns out to be represented by the list of symbols extracted at each layer. The computational process that we define involves the video as well as appropriate vector fields that are defined on the domain
. In what follows, points on the retina will be represented with two dimensional vectors
on a defined coordinate system. The temporal coordinate is usually denoted by t and, therefore, the video signal on the pair (x, t) is C(x, t). The color field can be thought of as a special field that is characterized by m components for each single pixel (m = 3 for RGB). We are concerned with the problem of extracting visual features that, unlike the components of the video, express the information associated with the pair (x, t) and its spatial context. Basically, one would like to extract visual features that characterize the information in the neighborhood of pixel x. A possible way of constructing features of this kind is to define
Here we assume that n symbols are generated from the m components of the video. Notice that the kernel is responsible for expressing the spatial dependencies. It is worth mentioning that whenever
the above definition reduces to an ordinary spatial convolution. The computation of
yields a field with n features, and Eq. (1) can be used for carrying out a piping scheme where a new set of features
is computed from
. Of course, this process can be continued according to a deep computational structure with a homogeneous convolutional-based computation, which yields the features
-th convolutional layer. The theory proposed in this paper focuses on the construction of any of these convolutional layers which are expected to provide higher and higher abstraction as we increase the number of layers. The filters
are what completely determines the features
. In this paper we formulate a theory for the discovery of
that is based on three driving principles, which are described below.
Optimization of information-based indices. Beginning from the color field C, we attach a symbol of a discrete vocabulary to pixel (x, t) with probability
. This is obtained by estimating the random variable F(X) where F is the map that the agent is expected to learn, that is defined on the basis of Eq. 1. The conditional entropy S(Y | X, T, F) is given by S(Y | X, T, F) =
is the conditional probability of Y conditioned to the values of X, T and
is the joint measure of the variable X, T, F, and
is a Borel set in the (X, T, F) space. We assume that
is subject to the probabilistic constraints
(normalization) and
(positivity). We can rewrite the conditional entropy as
, where
is a spacetime measure. Clearly, we want to keep the conditional entropy as small as possible so as to develop dominating features. At the same time we must ensure that the entropy of variable Y ,
is as high as possible, since this ensures the development of all the features associated with the alphabet of symbols. If we use the law of total probability to express in terms of the conditional probability
and use the above assumptions we get
To sum up (see the Supplementary Material for further details on the computation of S(Y )), the index , which is somewhat related to the classic Shannon mutual information, must be maximized (9; 10).
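The interplay between the two entropy terms above can be illustrated with a minimal sketch. It assumes a uniform measure over pixels and equal weights on the entropy and conditional entropy terms; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the released code.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def information_index(probs):
    """probs: list of per-pixel probability vectors over the n symbols.

    Returns (S_Y, S_Y_given_X, index), assuming a uniform measure over
    pixels and equal weights on the two entropy terms (both assumptions).
    """
    n_pixels = len(probs)
    n_symbols = len(probs[0])
    # Conditional entropy: average of the per-pixel entropies.
    s_cond = sum(entropy(p) for p in probs) / n_pixels
    # Marginal symbol distribution, via the law of total probability.
    marginal = [sum(p[i] for p in probs) / n_pixels for i in range(n_symbols)]
    s_y = entropy(marginal)
    return s_y, s_cond, s_y - s_cond

# Two pixels, two symbols: confident and diverse vs. confident but collapsed.
confident = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
```

In this toy setting the index is largest when every pixel is assigned a symbol with certainty and all symbols are used equally often, while a solution that always emits the same symbol scores zero despite its null conditional entropy.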
Motion invariance. If we focus attention on the pixel x at time t, which moves along a given trajectory, the features extracted along that trajectory are required to remain constant. This “adiabatic” condition is thus expressed by the condition
, which yields
where is the velocity field that we assume to be given, and
is the partial derivative with respect to
. When replacing
as stated by Eq. (1) we get
which holds for any . Notice that this constraint is linear in the field
. This can be interpreted by stating that learning under motion invariance consists of determining elements of the kernel of the function
. Clearly, the learning process is expected to keep the value of
as small as possible.
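The motion invariance condition can be checked numerically on a toy case. The sketch below assumes a one-dimensional retina and a hypothetical feature field that is rigidly transported at constant speed; it estimates the total time derivative (temporal partial derivative plus velocity times spatial derivative) by central differences. All names and values are illustrative.

```python
import math

def feature(x, t, v=2.0):
    """A hypothetical 1-D feature field rigidly transported at speed v."""
    return math.sin(x - v * t)

def motion_residual(phi, x, t, v, h=1e-5):
    """Central-difference estimate of d(phi)/dt + v * d(phi)/dx at (x, t)."""
    d_t = (phi(x, t + h) - phi(x, t - h)) / (2 * h)
    d_x = (phi(x + h, t) - phi(x - h, t)) / (2 * h)
    return d_t + v * d_x

# The residual vanishes (up to discretization error) only when the velocity
# used in the constraint matches the true motion of the pattern.
r_true = motion_residual(feature, 0.3, 0.1, v=2.0)
r_wrong = motion_residual(feature, 0.3, 0.1, v=0.5)
```

This is why the velocity field has to be available before the constraint can be enforced: with a wrong velocity, even a perfectly transported feature violates the condition.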
Parsimony principle. As in any principled formulation of learning, we require the filters to obey the parsimony principle. Beyond its philosophical implications, it also favors the development of a unique solution. Given the filters , there are two parsimony terms, one
, that penalizes abrupt spatial changes, and another one,
that penalizes quick temporal transitions. Ordinary regularization issues suggest to discover functions
is “small”, where are spatial and temporal differential operators, and
are non-negative reals. We assumed an ergodic translation of
, that, in this case, only involves the temporal factor h(t).
Overall, the process of learning is regarded as the minimization of the cognitive action
where are positive multipliers. While the first and third principles are typically adopted in classic unsupervised learning, motion invariance is what characterizes the approach followed in this paper. Of course, there are visual features that do not obey the motion invariance principle. Animals easily estimate the distance to the objects in the environment, a property that clearly indicates the need for features whose values do depend on motion. The perception of vertical visual cues, as well as a reasonable estimate of the angle with respect to the vertical line, also suggests the need for features that are motion dependent. Basically, the process of learning consists of solving the variational problem
(see the Supplementary Material for details). As will be shown in the following, in our multi-layer implementation the minimization of
takes place at each layer of the architecture, involving the filters of the considered layer only, relying on a piping scheme that is inspired by developmental learning.
The field theory of the previous section can be approximated over the discrete (and bounded) retina , where the video frames are represented. Instead of the fields
, we have a set of functions of time
, indexed by the point x on the retina as well as by the filter/feature index i and the input channel index j. Similarly, the color field will be replaced by
. Using Einstein notation we have that the discretized form of the feature fields is
where the sum in y is performed over
. The two pieces of the motion invariance term (4) become
where the spatial gradient operator. Such term of motion invariance becomes a quadratic form in
and
. The other relevant terms of the theory (entropy, relative entropy) are trivially functions of
and t.
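To make the discretized feature computation concrete, here is a minimal sketch for a single grayscale channel. It assumes zero padding outside the frame, uses cross-correlation (no kernel flip, as is customary in convolutional networks), and applies the softmax that the experimental section mentions for obtaining probabilistic activations. The function names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def pixel_features(frame, filters, x0, x1):
    """Symbol probabilities at pixel (x0, x1) of a grayscale frame.

    frame:   2-D list of intensities (a single input channel, m = 1).
    filters: list of n square 2-D lists with odd side length.
    Pixels outside the frame are treated as zero (an assumption).
    """
    k = len(filters[0]) // 2
    rows, cols = len(frame), len(frame[0])
    scores = []
    for f in filters:
        acc = 0.0
        # Sum over the neighborhood of the pixel covered by the filter.
        for d0 in range(-k, k + 1):
            for d1 in range(-k, k + 1):
                y0, y1 = x0 + d0, x1 + d1
                if 0 <= y0 < rows and 0 <= y1 < cols:
                    acc += f[d0 + k][d1 + k] * frame[y0][y1]
        scores.append(acc)
    return softmax(scores)

frame = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
center = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
blank = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
p = pixel_features(frame, [center, blank], 1, 1)
```

Each pixel is thus mapped to a probability vector over the n symbols, which is the quantity the entropy terms operate on.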
We assume filters to have a unique and finite size for all the features. As a consequence, for each feature i, we can flatten the filter into a vector, and concatenate the n filter-vectors into q. We selected a second order term to implement the parsimony principle,
being
positive constants. If we make the entropy term local in time, and evaluate the first variation of the discretized cognitive action, the differential Euler-Lagrange (EL) equations are (for the sake of simplicity, we skip the derivations, see Supplementary Material for all the details):
where are the fourth and third derivatives of q over time,
is a positive constant, and we have used the notation
(so that for example
). In order to define the other terms, we introduce the notation
to indicate the area (volume if m > 1) of the input signal centred around x, of the same size of the filters, flattened into a vector. We have
, and the notation
indicates a block-diagonal matrix whose blocks are A. The matrix M is composed of
, and
is a distribution over the retina. Analogously, we define
, being
. We have
is a square matrix composed of
repetitions of
is a positive constant,
, and
. The matrix O is composed of
. Finally,
returns the n-length vector with the result of the convolutions of the n filters with the input,
if the condition in brackets is true, otherwise it is 0, and it operates element-wise when a vector of conditions is provided.
In deriving the equations, some conditions arise naturally at t = T (see the Supplementary Material for more details):
An interesting special case of these equations is that obtained with a null signal . With this assumption our equations (6) become
Now assume that positive, then
In order to see whether this equation can be stable we need to apply the Routh-Hurwitz criterion. For a fourth order ODE this criterion reduces to checking whether a > 0, b > 0,
that in our case means that
So for example if we choose we obtain a stable equation. This being said it is also crucial to notice that we have control over the important parameter of the theory
as long as the regularization parameters are chosen carefully.
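For reference, the textbook Routh-Hurwitz test for a generic monic quartic characteristic polynomial can be sketched as follows. The coefficient names a3, a2, a1, a0 are chosen here for illustration and do not follow the paper's notation; mapping them to the regularization parameters depends on the specific equation above.

```python
def is_stable_quartic(a3, a2, a1, a0):
    """Routh-Hurwitz test for s^4 + a3 s^3 + a2 s^2 + a1 s + a0.

    Returns True when all roots have strictly negative real part.
    """
    positive = a3 > 0 and a2 > 0 and a1 > 0 and a0 > 0
    # Conditions from the first column of the Routh array.
    return positive and a3 * a2 > a1 and a3 * a2 * a1 > a1 ** 2 + a3 ** 2 * a0

# (s + 1)^4 = s^4 + 4 s^3 + 6 s^2 + 4 s + 1 has all roots at -1.
stable = is_stable_quartic(4.0, 6.0, 4.0, 1.0)
# s^4 + s^3 + s^2 + s + 1 has roots on the unit circle, two with positive real part.
unstable = is_stable_quartic(1.0, 1.0, 1.0, 1.0)
```

A check of this kind makes it easy to screen candidate parameter settings before running the solver.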
We implemented a solver for the differential equation of (6) that is based on the Euler method with step size . After having reduced the equation to the first order, the variables that were updated at each t are
, and
. The code and data we used to run the following experiments can be downloaded at http://see.supplementary.material, together with the full list of model parameters. We randomly selected two real-world video sequences from the Hollywood Dataset HOHA2 (11), which we will refer to as “skater” and “car”, and a clip from the movie “The Matrix” (© Warner Bros. Pictures). The frame rate of all the videos is
25 fps (we set
), each frame was rescaled to
and converted to grayscale. Videos have different lengths, ranging from
seconds, and they were repeated in a loop until 45,000 frames were generated, thus covering a significantly longer time span. We randomly initialized q(0), while the derivatives at time t = 0 were set to 0. We used the softmax function to force a probabilistic activation of the features, and computed the optical flow v using an implementation from the OpenCV library. Convolutional filters cover square areas of the input frame, and we set
to be the uniform distribution. All the results that we report are averaged over 10 different runs of the algorithms.
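The reduction to first order followed by Euler integration, as described at the beginning of this section, can be sketched on a scalar linear fourth-order ODE with constant coefficients. This is only a stand-in for the actual Euler-Lagrange equations, whose coefficients are time-varying and matrix-valued; the coefficient values and the step size below are illustrative assumptions.

```python
def euler_step(state, coeffs, tau):
    """One explicit Euler step for q'''' + a3 q''' + a2 q'' + a1 q' + a0 q = 0.

    state  = (q, q1, q2, q3): q and its first three time derivatives.
    coeffs = (a3, a2, a1, a0): illustrative constants, not the paper's.
    """
    q, q1, q2, q3 = state
    a3, a2, a1, a0 = coeffs
    # The highest derivative is obtained from the ODE itself.
    q4 = -(a3 * q3 + a2 * q2 + a1 * q1 + a0 * q)
    return (q + tau * q1, q1 + tau * q2, q2 + tau * q3, q3 + tau * q4)

def integrate(state, coeffs, tau, steps):
    """Apply euler_step repeatedly and return the final state."""
    for _ in range(steps):
        state = euler_step(state, coeffs, tau)
    return state

# Non-zero initial value with zero initial derivatives, as in the experiments.
first = euler_step((1.0, 0.0, 0.0, 0.0), (4.0, 6.0, 4.0, 1.0), 0.1)
final = integrate((1.0, 0.0, 0.0, 0.0), (4.0, 6.0, 4.0, 1.0), tau=0.01, steps=5000)
```

With coefficients that satisfy the stability conditions the state decays toward zero, whereas unstable settings make the derivatives grow, which is the situation the reset plan described later is meant to handle.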
The video is presented gradually to the agent so as to favour the acquisition of small chunks of information. We start from a completely null signal (all pixel intensities are zero), and we slowly increase the level of detail and the pixel intensities, as a function of . In detail,
is the spatial convolution operator,
the source video signal,
is a Gaussian filter of variance
is a customizable scaling factor. We start with
, and then
is progressively increased as time passes,
). We refer to the quantity
as the “blurring factor”. In order to be able to (approximately) satisfy the conditions in Eq. (7) we need to keep the derivatives small, so we implement a “reset plan” according to which the video signal undergoes a reset whenever the derivatives become too large. Formally, if
, or
, or
then we forced
to
, for all j), and then we set to 0 all the derivatives.
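The reset plan can be sketched as a simple guard around the solver state. The sketch assumes that a reset restarts the blurring factor from its initial value, which is our reading of “the video signal undergoes a reset”, and it uses hypothetical thresholds; neither is taken from the released code.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def apply_reset(q_derivs, blur, thresholds, blur_start=0.0):
    """Reset-plan sketch (thresholds and blur_start are assumptions).

    q_derivs:   list of derivative vectors (e.g. first, second, third order).
    blur:       current blurring factor.
    thresholds: one hypothetical bound per derivative vector.
    Returns (new_derivs, new_blur, was_reset).
    """
    if any(norm(d) > th for d, th in zip(q_derivs, thresholds)):
        # Zero every derivative and restart the gradual video presentation.
        zeroed = [[0.0] * len(d) for d in q_derivs]
        return zeroed, blur_start, True
    return q_derivs, blur, False

calm = apply_reset([[0.1, 0.0], [0.0, 0.0]], 0.5, [1.0, 1.0])
wild = apply_reset([[3.0, 4.0], [0.0, 0.0]], 0.5, [1.0, 1.0])
```

Note that the filters themselves are left untouched by the guard; only the derivatives and the presentation schedule are affected.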
Our experiments are designed (i) to evaluate the dynamics of the cognitive action as a function of different temporal regularities imposed on the model weights (parsimony), and then (ii) to evaluate the effects of motion, which introduces a spatio-temporal regularization, on single and multi-layer architectures. When evaluating the temporal regularities, the cognitive action is composed of the entropy-based and parsimony terms only, and we experiment with four instances of the set of parameters . Each instance is characterized by roots of the characteristic polynomial that lead to stable or unstable configurations, and that are either purely real or have imaginary parts, keeping the roots close to zero and fulfilling the conditions of Eq. (9) when stability and reality are needed. These configurations are all based on values of
, while
. We performed experiments on the “skater” video clip, setting n = 5 features, and chose filters of size
are reported in Fig. 1. The plots indicate that there is an initial oscillation that is due to the effects of the blurring factor, and that vanishes after about 10k frames. The Mutual Information (MI) (I) portion of the cognitive action correctly increases over time, and it is pushed toward larger values in the two extreme cases of “no-stability, reality” and “no-stability, no-reality”. The latter shows more evident oscillations in the frame-by-frame MI value, due to the roots with imaginary part. In all the configurations the norm of q increases over time (with different speeds), due to the small values of k, while the frequency of reset operations is larger in the “no-stability, no-reality” case, as expected.
We evaluated the quality of the developed features by freezing the final q of Fig. 1 and computing the MI index over a single repetition of the whole video clip, reporting the results in Tab. 1 (a). This is the procedure we will follow in the rest of the paper when reporting numerical results in all the tables. We notice that, while in Fig. 1 we compute the MI on a frame-by-frame basis, here we compute it over all the frames of the video at once, thus in a batch-mode setting. The result confirms that the two extreme configurations “no-stability, reality” and “no-stability, no-reality” show better results, on average. These performances are obtained thanks to the effect of the reset mechanism, which allows even such unstable configurations to develop good solutions. When the reset operations are disabled, we easily ran into numerical errors due to strong oscillations, while, for example, the “stability” cases were less affected by this phenomenon.
We also compared the dynamics of the system on multiple video clips and using different filter sizes () and numbers of features (n = 5 and n = 11) in Fig. 2. We selected the “stability, reality” configuration of Fig. 1, which fulfils the conditions of Eq. (9). Changing the video clip does not change the considerations we made so far, while increasing the filter size and the number of features can lead to smaller MI index values, mostly due to the need to better balance the two entropy terms in order to cope with the larger number of features. The MI of Tab. 1 (b) confirms this point. Interestingly, the best results are obtained on the longer video clip (“The Matrix”), which requires fewer repetitions of the video and is thus closer to the real online setting.
Fig. 3 and Tab. 1 (c) show the results we obtain when using different blurring plans (“skater” clip), that is, different values of that lead to the blurring factors reported in the first graph of Fig. 3. These results suggest that a gradual introduction of the video signal helps the system to find better solutions than in the case in which no plan is used, but also that a plan that is too slow is not beneficial. The cognitive action has a big bump when no plan is used, while this effect is more controlled and reduced in the case of both the slow and fast plans.
In order to study the effect of motion in multi-layer architectures (up to 3 layers), we kept the most stable configuration (“stability, reality”, filters, 5 features), and introduced the motion-related term in the cognitive action. Our multi-layer architecture is composed of a stack of computational models developed according to (5). A new layer
is activated whenever layer
has processed a large number of frames (
), and the parameters of layer
are not updated anymore. We initially considered the case in which all the layers
share the same value
that weighs the motion-based term. Tab. 2 shows the MI we get for different weighting schemes. Introducing motion helps in almost all the cases (for appropriate
- the smallest values of
are a good choice on average), and, as expected, a too strong enforcement of the motion-related term leads to degenerate solutions with small MI. We repeated these experiments also in a different setting. In detail, after having evaluated layer
for all the values of
, we selected the model with the largest MI and started evaluating layer
on top of it. Tab. 3 reports the outcome of this experiment. We clearly see that motion plays an important role in increasing the average MI. In the case of “car”, we also obtained two (uncommon) positive results when strongly weighing
. They are due to very frequent reset operations, which prevented the system from altering the filters when the motion-based term was leading to very large derivatives. This is an interesting behaviour that, however, was not common in the other cases we reported.
Figure 1: Comparing 4 configurations of the parameters, characterized by different properties in terms of stability and reality of the roots of the characteristic polynomial. The input video is reproduced (in loop) for 45k frames (x-axis). From left-to-right, top-to-bottom we report the Cognitive Action (CA), the portion of the cognitive action that is about the Mutual Information (MI) (that we maximize), the portion that is about the Conditional Entropy, the MI per-frame, the norm of q(t), and the fraction of “reset” operations performed every 1000 frames.
Figure 2: Different number of features and filter sizes (1st row: ; 2nd row:
) in 3 videos. See Fig. 1 for a description of the plots.
Figure 3: Three different blurring plans (n = 11 and filters of size
Table 1: MI on (a) the “skater” video, given the models of Fig. 1 (S=stability, R=reality); (b) different videos, number of features, filter sizes (SR); (c) different blurring plans (SR).
Table 2: MI in different videos, up to 3 layers (), and for multiple weighting factors
the motion-based term. All layers share the same
In this paper we have introduced a new approach to learning visual features according to the principle of least cognitive action. The experiments indicate the remarkable difference coming from the incorporation of motion invariance, with respect to the features only driven by information-based principles, which also results in the improvement of the mutual information from the video to the features.
The theory is coherent with the different roles of the ventral stream and dorsal stream (12) that have been observed in humans and other primates. The enforcement of motion invariance is clearly conceived for extracting features that are useful for object recognition, so as to solve the “what” task (ventral stream), whereas “dorsal neurons”, which are involved in where/how environmental interactions, are expected not to use motion invariance. The model behind the learning of the filters indicates the need for access to velocity estimation, which is consistent with neuroanatomical evidence.
Although the experimental results reported in the paper assume a uniform probability distribution in the spatiotemporal domain, the given formulation in the framework of the principle of least cognitive action suggests that the optimization must take place in areas of high saliency. In this case, the reformulation of the Euler-Lagrange equations given in this paper leads to identifying the crucial role of eye movements in animals with foveal eyes.
Table 3: Same structure of Tab. 2. Here the model with the best is selected and used as basis to activate a new layer (layer
is the same as Tab. 2).
[1] D. Marr. Vision. Freeman, San Francisco, 1982.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
[3] Tomaso A. Poggio and Fabio Anselmi. Visual Cortex and Deep Networks: Learning Invariant Representations. The MIT Press, 1st edition, 2016.
[4] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[5] Marco Gori, Marco Lippi, Marco Maggini, and Stefano Melacci. Semantic video labeling by developmental visual agents. Computer Vision and Image Understanding, 146:9–26, 2016.
[6] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[7] Simon Baker, Daniel Scharstein, J. P. Lewis, Stefan Roth, Michael J. Black, and Richard Szeliski. A database and evaluation methodology for optical flow. Int. J. Comput. Vision, 92(1):1–31, March 2011.
[8] Alessandro Betti and Marco Gori. The principle of least cognitive action. Theor. Comput. Sci., 633:83–99, 2016.
[9] Marco Gori, Stefano Melacci, Marco Lippi, and Marco Maggini. Information theoretic learning for pixel-based visual agents. In European Conference on Computer Vision, pages 864–875. Springer, 2012.
[10] Stefano Melacci and Marco Gori. Unsupervised learning by minimal entropy encoding. IEEE Trans. Neural Netw. Learning Syst., 23(12):1849–1861, 2012.
[11] Marcin Marszałek, Ivan Laptev, and Cordelia Schmid. Actions in context. In IEEE Conference on Computer Vision & Pattern Recognition, 2009.
[12] Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992.