Critical Learning Periods in Deep Neural Networks

2017·arXiv

ABSTRACT

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their thoughtful feedback, and for suggesting new experiments and relevant literature. Supported by ONR N00014-17-1-2072, ARO W911NF-17-1-0304, AFOSR FA9550-15-1-0229 and FA8650-11-1-7156.

REFERENCES

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(1):1947–1980, 2018.

Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society and Oxford University Press, 2000.

Martin S Banks, Richard N Aslin, and Robert D Letson. Sensitive period for the development of human binocular vision. Science, 190(4215):675–677, 1975.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In Proceedings of the International Conference on Learning Representations, 2017.

Nigel W Daw. Visual Development. Springer, New York, NY, 3rd edition, 2014.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

Ronald Aylmer Fisher. Theory of statistical estimation. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 22, pp. 700–725. Cambridge University Press, 1925.

Fred Giffin and Donald E Mitchell. The rate of recovery of vision after early monocular deprivation in kittens. The Journal of Physiology, 274(1):511–537, 1978.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Anita E Hendrickson, JA Movshon, Howard M Eggers, Martin S Gizzi, RG Boothe, and Lynne Kiorpes. Effects of early unilateral blur on the macaque’s visual system. II. Anatomical observations. Journal of Neuroscience, 7(5):1327–1339, 1987.

Takao K Hensch. Critical period regulation. Annual Review of Neuroscience, 27:549–579, 2004.

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

Eric R Kandel, James H Schwartz, Thomas M Jessell, Steven A Siegelbaum, and A James Hudspeth. Principles of Neural Science. McGraw-Hill, New York, NY, 5th edition, 2013.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of the International Conference on Learning Representations, 2017.

Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Eric I Knudsen. Sensitive periods in the development of the brain and behavior. Journal of cognitive neuroscience, 16(8):1412–1425, 2004.

Ivo Kohler. The formation and transformation of the perceptual world. Psychological Issues Monographs. International Universities Press, Inc., New York, NY, 1964.

Masakazu Konishi. Birdsong: from behavior to neuron. Annual Review of Neuroscience, 8(1):125–170, 1985.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6391–6401, 2018.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. Proceedings of the International Conference on Machine Learning, 37:2408–2417, 2015.

Donald E Mitchell. The extent of visual recovery from early monocular or binocular visual deprivation in kittens. The Journal of Physiology, 395(1):639–660, 1988.

Grégoire Montavon, Mikio L Braun, and Klaus-Robert Müller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(Sep):2563–2581, 2011.

George D Mower. The effect of dark rearing on the time course of the critical period in cat visual cortex. Developmental Brain Research, 58(2):151–158, 1991.

Carl R Olson and Ralph D Freeman. Profile of the sensitive period for monocular deprivation in kittens. Experimental Brain Research, 39(1):17–21, 1980.

Pasko Rakic, Jean-Pierre Bourgeois, Maryellen F Eckenhoff, Nada Zecevic, and Patricia S Goldman-Rakic. Concurrent overproduction of synapses in diverse regions of the primate cerebral cortex. Science, 232(4747):232–235, 1986.

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Jost T Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

George M Stratton. Some preliminary experiments on vision without inversion of the retinal image. Psychological Review, 3(6):611–617, 1896.

David Taylor et al. Critical period for deprivation amblyopia in children. Transactions of the ophthalmological societies of the United Kingdom, 99(3):432–439, 1979.

Gunter K von Noorden. New clinical aspects of stimulus deprivation amblyopia. American journal of ophthalmology, 92(3):416–421, 1981.

Torsten N Wiesel. Postnatal development of the visual cortex and the influence of environment. Nature, 299(5884):583, 1982.

Torsten N Wiesel and David H Hubel. Single-cell responses in striate cortex of kittens deprived of vision in one eye. Journal of neurophysiology, 26(6):1003–1017, 1963a.

Torsten N Wiesel and David H Hubel. Effects of visual deprivation on morphology and physiology of cells in the cat’s lateral geniculate body. Journal of neurophysiology, 26(6):978–993, 1963b.

A DETAILS OF THE EXPERIMENTS

A.1 ARCHITECTURES AND TRAINING

In all of the experiments, unless otherwise stated, we use the following All-CNN architecture, adapted from Springenberg et al. (2014):

conv 96 - conv 96 - conv 192 s2 - conv 192 - conv 192 - conv 192 s2 - conv 192 - conv1 192 - conv1 10 - avg. pooling - softmax

where each conv block consists of a $3 \times 3$ convolution, batch normalization and ReLU activations. conv1 denotes a $1 \times 1$ convolution. The network is trained with SGD, with a batch size of 128 and a learning rate starting from 0.05 and decaying smoothly by a factor of 0.97 at each epoch. We also use weight decay with coefficient 0.001. In the experiments with a fixed learning rate, we fix the learning rate to 0.001, which we find allows convergence without excessive overfitting. For the ResNet experiments, we use the ResNet-18 architecture from He et al. (2016) with initial learning rate 0.1, learning rate decay 0.97 per epoch, and weight decay 0.0005. When training with Adam, we use a learning rate of 0.001 and weight decay 0.0001.
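As an illustration of the annealing scheme described above, the following minimal sketch computes the learning rate at each epoch. It assumes the decay is applied once per epoch as a multiplicative step; the text says only that the rate decays "smoothly", so a per-iteration decay is equally possible. The function names and the dictionary layout are hypothetical; only the numeric values come from the text.

```python
from typing import Dict, List

def exponential_lr(initial_lr: float, decay: float, epoch: int) -> float:
    """Learning rate after `epoch` epochs of multiplicative per-epoch decay."""
    return initial_lr * decay ** epoch

# Hyperparameters quoted in the text; the layout of this table is illustrative.
CONFIGS: Dict[str, Dict[str, float]] = {
    "all_cnn_sgd": {"initial_lr": 0.05, "decay": 0.97, "weight_decay": 0.001},
    "resnet18_sgd": {"initial_lr": 0.1, "decay": 0.97, "weight_decay": 0.0005},
}

def schedule(name: str, num_epochs: int) -> List[float]:
    """Per-epoch learning rates for one of the configurations above."""
    cfg = CONFIGS[name]
    return [exponential_lr(cfg["initial_lr"], cfg["decay"], e)
            for e in range(num_epochs)]

if __name__ == "__main__":
    for epoch, lr in enumerate(schedule("all_cnn_sgd", 5)):
        print(f"epoch {epoch}: lr = {lr:.5f}")
```

Under this assumption the learning rate halves roughly every 23 epochs, since 0.97 raised to the 23rd power is close to 0.5.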

When experimenting with varying network depths, we use the following architecture:

conv 96 - [conv - conv s2]- conv - conv1 - conv1 10

In order to avoid interference between the annealing scheme and the architecture, in these experiments we fix the learning rate to 0.001.

The Fully Connected network used for the MNIST experiments has hidden layers of size [2500, 2000, 1500, 1000, 500]. All hidden layers use batch normalization followed by ReLU activations. We fix the learning rate to 0.005. Weight decay is not used. We use data augmentation with random translations up to 4 pixels and random horizontal flipping. For MNIST, we pad the images with zeros to bring them to size $32 \times 32$.
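The padding and augmentation steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the padding is centered, pixels uncovered by a translation are filled with zeros, and the flip probability is 0.5, none of which is specified in the text. The target side length is a parameter, and all function names are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]

def zero_pad(image: Image, target: int) -> Image:
    """Center `image` on a target x target canvas of zeros (assumed centering)."""
    h, w = len(image), len(image[0])
    top, left = (target - h) // 2, (target - w) // 2
    out = [[0.0] * target for _ in range(target)]
    for r in range(h):
        for c in range(w):
            out[top + r][left + c] = image[r][c]
    return out

def translate(image: Image, dy: int, dx: int) -> Image:
    """Shift the image by (dy, dx) pixels, filling uncovered pixels with zeros."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dy, c + dx
            if 0 <= rr < h and 0 <= cc < w:
                out[rr][cc] = image[r][c]
    return out

def augment(image: Image, rng: random.Random, max_shift: int = 4) -> Image:
    """Random translation of up to `max_shift` pixels, then a random horizontal flip."""
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    shifted = translate(image, dy, dx)
    if rng.random() < 0.5:  # assumed flip probability
        shifted = [row[::-1] for row in shifted]
    return shifted

if __name__ == "__main__":
    digit = [[1.0] * 28 for _ in range(28)]   # stand-in for a 28 x 28 MNIST image
    padded = zero_pad(digit, 32)
    sample = augment(padded, random.Random(0))
    print(len(sample), len(sample[0]))
```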

A.2 APPROXIMATIONS OF THE FISHER INFORMATION MATRIX

To compute the trace of the Fisher Information Matrix, we use the following expression derived directly from the definition:

$$\operatorname{tr}(F) = \mathbb{E}_{x}\, \mathbb{E}_{y \sim p_w(y|x)} \left[ \left\| \nabla_w \log p_w(y|x) \right\|^2 \right],$$

where the input image x is sampled from the dataset, while the label y is sampled from the output posterior. Expectations are approximated by Monte-Carlo sampling. Notice, however, that this expression depends only on the local gradients of the loss with respect to the weights at a point $w = w_0$, so it can be noisy when the loss landscape is highly irregular. This is not a problem for ResNets (Li et al., 2018), but for other architectures we use a different technique, proposed in Achille & Soatto (2018). In more detail, let L(w) be the standard cross-entropy loss. Given the current weights $w_0$ of the network, we find the diagonal matrix $\Sigma$ that minimizes:

$$L'(\Sigma) = \mathbb{E}_{w \sim N(w_0, \Sigma)}\left[ L(w) \right] - \beta \log |\Sigma|,$$

where $\beta$ is a parameter that controls the smoothness of the approximation. Notice that $L'$ can be minimized efficiently using the method in Kingma et al. (2015). To see how this relates to the Fisher Information Matrix, assume that L(w) can be approximated locally in $w_0$ as $L(w) = L_0 + a \cdot (w - w_0) + (w - w_0) \cdot H (w - w_0)$. Since the linear term has zero mean under $N(w_0, \Sigma)$, we can then rewrite $L'$ as:

$$L'(\Sigma) = L_0 + \operatorname{tr}(\Sigma H) - \beta \log |\Sigma|.$$

Taking the derivative with respect to $\Sigma$, and setting it to zero, we obtain $\Sigma_{ii} = \beta / H_{ii}$. We can then use $\Sigma$ to estimate the trace of the Hessian, and hence of the Fisher information.
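The Monte-Carlo estimate of the Fisher trace (the expected squared norm of the gradient of the log-likelihood, with x drawn from the data and y sampled from the model's own output distribution) can be illustrated on a toy model. The sketch below is not the networks used in the experiments: it assumes a linear softmax classifier with logits W x, chosen because the gradient has a closed form that allows the estimate to be checked exactly. All names are hypothetical.

```python
import math
import random
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_logits(W: List[List[float]], x: List[float]) -> List[float]:
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def grad_sq_norm(p: List[float], y: int, x: List[float]) -> float:
    """||grad_W log p(y|x)||^2 for logits W x, where grad_W = (e_y - p) x^T."""
    err_sq = sum(((1.0 if k == y else 0.0) - pk) ** 2 for k, pk in enumerate(p))
    return err_sq * sum(xj * xj for xj in x)

def fisher_trace_mc(W, data, samples_per_input: int, rng: random.Random) -> float:
    """Monte-Carlo estimate: x from the data, y sampled from the model output."""
    total, count = 0.0, 0
    for x in data:
        p = softmax(linear_logits(W, x))
        for _ in range(samples_per_input):
            y = rng.choices(range(len(p)), weights=p)[0]
            total += grad_sq_norm(p, y, x)
            count += 1
    return total / count

def fisher_trace_exact(W, data) -> float:
    """Closed form for this toy model: E_y ||e_y - p||^2 = 1 - ||p||^2."""
    total = 0.0
    for x in data:
        p = softmax(linear_logits(W, x))
        total += (1.0 - sum(pk * pk for pk in p)) * sum(xj * xj for xj in x)
    return total / len(data)

if __name__ == "__main__":
    rng = random.Random(0)
    W = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]
    print(fisher_trace_mc(W, data, 500, rng), fisher_trace_exact(W, data))
```

Note that the label is sampled from the model's prediction rather than taken from the dataset; using the dataset labels instead would give the so-called empirical Fisher, which is a different quantity.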

A.3 CURVE FITTING

Fitting of sensitivity curves and synaptic density profiles from the literature was performed using:

$$f(t) = e^{-(t - d)/\tau_1} - k\, e^{-(t - d)/\tau_2}$$

as the fitting equation, where t is the age at the time of sampling and $\tau_1$, $\tau_2$, $k$ and $d$ are unconstrained parameters (Banks et al., 1975).

The exponential fit of the sensitivity to the Fisher Information trace uses the expression

$$F(t) = a\, e^{c S_k(t)} + b,$$

where a, b and c are unconstrained parameters, F(t) is the Fisher Information trace at epoch t of the training of a network without deficits and $S_k$ is the sensitivity computed using a window of size k. That is, $S_k(t)$ is the increase in the final test error over a baseline when the network is trained in the presence of a deficit between epochs t and t + k.
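One simple way to carry out an exponential fit of this kind is sketched below. It assumes the model has the form F = a exp(c S) + b, which is an assumption of this sketch. The fitting procedure is likewise not taken from the text, which does not specify one: for each candidate c on a grid, a and b are obtained by ordinary least squares, and the c with the smallest residual is kept. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def fit_linear(u: Sequence[float], f: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares a, b for f ~ a * u + b; returns (a, b, sum of squared errors)."""
    n = len(u)
    mean_u, mean_f = sum(u) / n, sum(f) / n
    var_u = sum((ui - mean_u) ** 2 for ui in u)
    if var_u == 0.0:
        a = 0.0  # degenerate case: u is constant, so only the offset is identifiable
    else:
        a = sum((ui - mean_u) * (fi - mean_f) for ui, fi in zip(u, f)) / var_u
    b = mean_f - a * mean_u
    sse = sum((a * ui + b - fi) ** 2 for ui, fi in zip(u, f))
    return a, b, sse

def fit_exponential(S: Sequence[float], F: Sequence[float],
                    c_grid: Sequence[float]) -> Tuple[float, float, float, float]:
    """Fit F ~ a * exp(c * S) + b by grid search over c; returns (a, b, c, sse)."""
    best = None
    for c in c_grid:
        u = [math.exp(c * s) for s in S]
        a, b, sse = fit_linear(u, F)
        if best is None or sse < best[3]:
            best = (a, b, c, sse)
    return best

if __name__ == "__main__":
    # Synthetic data generated from known parameters, used only to check the fit.
    S: List[float] = [0.1 * i for i in range(10)]
    F = [2.0 * math.exp(3.0 * s) + 1.0 for s in S]
    print(fit_exponential(S, F, [0.5 * i for i in range(1, 13)]))
```

The grid search keeps the sketch dependency-free; in practice a general nonlinear least-squares routine would serve the same purpose and would not restrict c to a grid.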

B ADDITIONAL PLOTS

Figure 6: Log of the norm of the gradient means (solid line) and standard deviation (dashed line) during training when: (Left) No deficit is present, (Center) a blur deficit is present until epoch 70, and (Right) a deficit is present until the last epoch. Notice that the presence of a deficit does not decrease the magnitude of the gradients propagated to the first layers during the last epochs, rather it seems to increase it, suggesting that vanishing gradients are not the cause of the critical period for the blurring deficit.

Figure 7: Same plot as in Figure 5, but for a noise deficit. Unlike with blur, many more resources are allocated to the lower layers than to the higher layers. This may explain why it is easier for the network to reconfigure itself in order to solve the task after the deficit is removed.

Figure 8: Visualization of the filters of the first layer of the network used for the experiment in Figure 1. In the absence of a deficit, the network learns high-frequency filters, as seen by the fact that many filters are not smooth (first picture). However, when a blurring deficit is present, the network learns only smooth filters corresponding to the lower frequencies of the input (third picture). If the deficit is removed after the end of the critical period, the network does not manage to learn high-frequency filters (second picture).

C EXPERIMENTAL DESIGN AND COMPARISON WITH ANIMAL MODELS

Critical periods are task- and deficit-specific. The specific task we address is visual acuity, but the performance is necessarily measured through different mechanisms in animals and Artificial Neural Networks. In animals, visual acuity is traditionally tested by the ability to discriminate between black-and-white contrast gratings (with varying spatial frequency) and a uniform gray field. The outcome of such tests generally correlates well with the ability of the animal to use the eye to solve other visual tasks relying on acuity. Convolutional Neural Networks, on the other hand, have a very different sensory processing mechanism (based on heavily quantized data), which may trivialize such a test. Rather, we directly measure the performance of the network on a high-level task, specifically image classification, for which CNNs are optimized.

We chose to simulate cataracts in our DNN experiments, a deficit which allows us to explore its complex interactions with the structure of the data and the architecture of the network. Unfortunately, while the overall trends of cataract-induced critical periods have been studied and understood in animal models, there is not enough data to confidently regress sensitivity curves comparable to those obtained in DNNs. For this reason, in Figure 1 we compare the performance loss in a DNN trained in the presence of a cataract-like deficit with the results obtained from monocularly deprived kittens, which exhibit similar trends and are one of the most common experimental paradigms in the visual neurosciences.

Simulating complete visual deprivation in a neural network is not as simple as feeding a constant stimulus: a network presented with a constant blank input will rapidly become trivial and thus unable to train on new data. This is to be expected, since a blank input is a perfectly predictable stimulus and thus the network can quickly learn the (trivial) solution to the task. We instead wanted to model an uninformative stimulus, akin to noise. Moreover, even when the eyes are sutured or kept in darkness, there will be background excitation of photoreceptors that is best modeled as noise. To account for this, we simulate sensory deprivation by replacing the input images with a dataset composed of (uninformative) random Gaussian noise. This way the network is trained to solve the highly non-trivial task of memorizing the association between the many (but finite) noise patterns and their corresponding labels.
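The noise deficit can be sketched as follows: each training image is replaced by a fixed Gaussian noise pattern while its label is kept. This is a minimal illustration under stated assumptions. The noise is taken to be i.i.d. per pixel with zero mean and unit variance, and the patterns are generated once from a seed so that the dataset stays finite; neither detail is specified in the text, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple

def make_noise_deficit(labels: Sequence[int], num_pixels: int,
                       seed: int) -> Tuple[List[List[float]], List[int]]:
    """Pair every original label with one fixed Gaussian noise 'image'.

    The patterns carry no information about the labels, so the only way to fit
    this dataset is to memorize the pattern-to-label association.
    """
    rng = random.Random(seed)  # fixed seed: the same finite set of patterns every epoch
    images = [[rng.gauss(0.0, 1.0) for _ in range(num_pixels)]  # assumed N(0, 1) pixels
              for _ in labels]
    return images, list(labels)

if __name__ == "__main__":
    original_labels = [3, 1, 4, 1, 5]
    images, labels = make_noise_deficit(original_labels, num_pixels=32 * 32 * 3, seed=0)
    print(len(images), len(images[0]), labels)
```

Generating the patterns once, rather than resampling fresh noise at every step, is what makes the task one of memorizing a finite set of patterns, as the paragraph above describes.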
