DiracNets: Training Very Deep Neural Networks Without Skip-Connections by Sergey Zagoruyko and Nikos Komodakis
Deep neural networks with skip-connections, such as ResNet, show excellent performance in various image classification benchmarks. It has been observed, though, that the initial motivation behind them - training deeper networks - does not actually hold true, and the benefits come from increased capacity rather than from depth. Motivated by this, and inspired by ResNet, we propose a simple Dirac weight parameterization, which allows us to train very deep plain networks without skip-connections and achieve nearly the same performance. This parameterization has a minor computational cost at training time and no cost at all at inference. We are able to achieve 95.5% accuracy on CIFAR-10 with a 34-layer deep plain network, surpassing a 1001-layer deep ResNet and approaching Wide ResNet. Our parameterization also mostly eliminates the need for careful initialization in residual and non-residual networks. The code and models for our experiments are available at this https URL
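To make the idea concrete, here is a minimal PyTorch-style sketch of a Dirac-parameterized convolution. It only illustrates the general form (a Dirac-delta "identity" component added to a learned weight, foldable into a single tensor at inference); the class and parameter names are illustrative, and the authors' released implementation may differ in its normalization and scaling details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiracConv2d(nn.Module):
    """Plain convolution whose weight is parameterized as alpha * delta + W,
    where delta is the Dirac delta tensor (identity convolution).
    Illustrative sketch, not the authors' exact code."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.01)
        # per-output-channel scale on the Dirac (identity) part
        self.alpha = nn.Parameter(torch.ones(out_channels))
        # delta is fixed: convolving x with delta returns x (for matching channels)
        delta = torch.zeros(out_channels, in_channels, kernel_size, kernel_size)
        nn.init.dirac_(delta)
        self.register_buffer('delta', delta)
        self.padding = padding

    def folded_weight(self):
        # alpha * delta + W folds into a single tensor, so inference
        # costs the same as an ordinary convolution
        return self.alpha.view(-1, 1, 1, 1) * self.delta + self.weight

    def forward(self, x):
        return F.conv2d(x, self.folded_weight(), padding=self.padding)
```

Because `W` starts small, each such layer begins close to the identity mapping, which is what a residual block provides, yet the architecture itself contains no skip-connections, and at inference the folded weight is just a plain convolution.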
1 comment:
There has been a lack of discussion about binarization in neural networks. Multiplying those +1/-1 values by weights and summing allows you to store values with a high degree of independence. For a given binary input and target value you get an error. You divide the error by the number of binary values and then correct each weight by that reduced error, taking account of the binary sign. That gives a full correction toward the correct target output. In higher-dimensional space most vectors are nearly orthogonal, so for a different binary input the adjustments you made to the weights will hardly align at all; in fact they sum to Gaussian noise by the central limit theorem. The value you previously stored for the second binary input will now be contaminated by a small amount of Gaussian noise, which you can correct for. That correction in turn introduces an even smaller amount of Gaussian noise on the value for the first binary input. Iterating back and forth gets rid of the noise entirely for both binary inputs.
This is highly useful in random projection, reservoir, and extreme learning machine computing.
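A small numpy sketch of the iterative storage scheme the commenter describes; the dimension, target values, and variable names are illustrative assumptions, not part of the original comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096                                      # dimension; higher -> keys more nearly orthogonal
w = np.zeros(n)                               # single weight vector used as the store
keys = rng.choice([-1.0, 1.0], size=(2, n))   # two random +/-1 binary inputs
targets = np.array([3.7, -1.2])               # values to associate with them

def recall(key, w):
    # multiply the +/-1 values by the weights and sum
    return key @ w

# Repeatedly visit each stored pair, spreading its remaining error equally
# across the n weights (taking the binary sign into account). Each pass
# re-contaminates the other pair with a smaller amount of noise, so the
# recalls converge to the targets.
for _ in range(20):
    for key, target in zip(keys, targets):
        err = target - recall(key, w)
        w += (err / n) * key                  # full correction for this pair

print([recall(k, w) for k in keys])           # approaches [3.7, -1.2]
```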