Do frameworks (e.g. Keras) use the wrong distribution for Glorot/Xavier weight initialization?

by Plankalkül   Last Updated October 19, 2018 11:19 AM

In the paper describing Glorot/Xavier uniform weight initialization, the weights are sampled from a uniform distribution given by equation 16:

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}},\ \frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}\right]$$

If I interpret the paper correctly, $n_j$ is the number of neurons in the current layer, and $n_{j+1}$ is the number of neurons in the next layer.
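
As a quick arithmetic check of the formula, take hypothetical layer sizes $n_j = 256$ and $n_{j+1} = 128$ (values chosen only for illustration, not taken from the paper):

$$\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}} = \frac{\sqrt{6}}{\sqrt{256+128}} = \sqrt{\frac{6}{384}} = \sqrt{\frac{1}{64}} = 0.125$$

so the weights would be drawn from $U[-0.125, 0.125]$.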

Looking at the implementation in Keras, the bounds of the uniform distribution are calculated using a square root over fan_in + fan_out, where fan_in is the number of neurons in the previous layer and fan_out is the number of neurons in the current layer. So the implementation seems to be going in the "opposite" direction.
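
To make the comparison concrete, here is a minimal NumPy sketch of the bound computation described above. It is not the Keras source: the function names `glorot_uniform_limit` and `glorot_uniform` are hypothetical, and the sketch assumes a 2-D dense kernel whose shape is `(fan_in, fan_out)`.

```python
import numpy as np


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the uniform distribution: sqrt(6) / sqrt(fan_in + fan_out)."""
    return np.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform(shape, rng):
    """Sample a 2-D weight matrix, assuming shape == (fan_in, fan_out)."""
    fan_in, fan_out = shape
    limit = glorot_uniform_limit(fan_in, fan_out)
    return rng.uniform(-limit, limit, size=shape)


rng = np.random.default_rng(0)

# Hypothetical layer sizes: 256 inputs feeding 128 units.
W = glorot_uniform((256, 128), rng)

print(glorot_uniform_limit(256, 128))
print(glorot_uniform_limit(128, 256))
print(W.shape, np.abs(W).max() <= glorot_uniform_limit(256, 128))
```

Note that the limit depends only on the sum of the two sizes, so swapping the arguments gives the same value. What matters for the question is therefore which two layer sizes enter the sum, not the order in which they are written.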

Can someone explain why you are allowed to use the opposite direction? I assume this is done because it is difficult to get the number of neurons of the next layer when the initialization is done locally in the layer itself (since it is not known what the next layer will be).
