Weight Initialization in FCN
In an FCN, we randomly initialize the weights of each layer separately.
- We don’t use all 0s for the weights. Remember that during backpropagation we need dL/dX at the intermediate layers in order to compute dL/dW further back. Setting all weights to 0 makes dL/dX zero, so no gradient flows backward and the network never gets updated.
- More generally, we don’t initialize all weights of an FCN to the same constant. If every weight has the same value, all neurons in a layer compute the same output and receive the same gradient, so they stay identical throughout training. The weights need to break this symmetry so the network can learn the complex distribution of the input data, as the small sketch below demonstrates.
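A minimal NumPy sketch of this symmetry problem (the layer sizes and values here are made up for illustration):

```python
# With constant weights, every neuron in the layer computes the same output,
# so they also receive identical gradients and can never become different.
import numpy as np

np.random.seed(0)
x = np.random.randn(8, 4)            # a toy batch: 8 samples, 4 features

W_const = np.full((4, 3), 0.5)       # all weights equal -> 3 identical neurons
h_const = np.tanh(x @ W_const)
print(np.allclose(h_const[:, 0], h_const[:, 1]))   # True: the columns are identical

W_rand = 0.1 * np.random.randn(4, 3) # random init breaks the symmetry
h_rand = np.tanh(x @ W_rand)
print(np.allclose(h_rand[:, 0], h_rand[:, 1]))     # False: the neurons differ
```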
Remember the shape of tanh. When we use tanh as the activation function, if we initialize the weights too large, the outputs of successive layers saturate toward +1/-1 as we go deeper into the network. If we initialize the weights too small, the outputs of successive layers shrink toward 0.
Both situations lead to vanishing gradients: saturated tanh units have a near-zero local gradient, and near-zero activations make the weight gradients (which scale with the layer inputs) tiny. The small experiment below illustrates both regimes.
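A small NumPy experiment with 10 tanh layers of width 500 (hypothetical sizes), forwarding the same batch with weights that are either too small or too large:

```python
import numpy as np

np.random.seed(0)
D = 500
x = np.random.randn(1000, D)

for scale in (0.01, 1.0):                 # "too small" vs "too large" init
    h = x
    for _ in range(10):                   # 10 fully connected tanh layers
        W = scale * np.random.randn(D, D)
        h = np.tanh(h @ W)
    print(scale, round(h.std(), 4), round(np.abs(h).mean(), 4))
# scale=0.01: the activation std collapses toward 0 layer by layer
# scale=1.0 : almost every unit saturates, |activation| is close to 1
```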
To deal with this, we want a weight initialization scheme for each layer that maintains the variance of the signal. To be more clear: for each layer, we want the variance of the output to equal the variance of the input.
The math deduction below leads to the conclusion that we need Var(w_i) = 1/D_in to maintain the variance, where D_in is the number of inputs to the layer.
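A sketch of that deduction for a single pre-activation y = Σ w_i x_i, under the standard assumptions that the w_i and x_i are independent, identically distributed, and zero-mean:

```latex
% Pre-activation of one neuron with D_in inputs: y = \sum_i w_i x_i.
% Assume the w_i and x_i are independent, identically distributed, and zero-mean.
\operatorname{Var}(y)
  = \operatorname{Var}\Big(\sum_{i=1}^{D_{\text{in}}} w_i x_i\Big)
  = \sum_{i=1}^{D_{\text{in}}} \operatorname{Var}(w_i x_i)
  = \sum_{i=1}^{D_{\text{in}}} \operatorname{Var}(w_i)\,\operatorname{Var}(x_i)
  = D_{\text{in}}\,\operatorname{Var}(w)\,\operatorname{Var}(x)
% Requiring Var(y) = Var(x) therefore forces Var(w) = 1 / D_in.
```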
For tanh, the Xavier initializer solves the problem: it draws weights with Var(w_i) = 1/D_in.
As for ReLU, the initialization proposed by Kaiming He solves the problem: since ReLU zeroes out roughly half of the activations, the required variance becomes Var(w_i) = 2/D_in. A sketch of both initializers is given below.
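A minimal NumPy sketch of the two initializers (the function names and the 500-unit width are illustrative choices, not from any particular library):

```python
import numpy as np

def xavier_init(d_in, d_out):
    # Var(w) = 1 / d_in, keeps the variance stable through tanh layers
    return np.random.randn(d_in, d_out) / np.sqrt(d_in)

def kaiming_init(d_in, d_out):
    # Var(w) = 2 / d_in, compensating for ReLU zeroing out roughly half the units
    return np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)

# Sanity check: with Xavier init, tanh activations keep a healthy spread with depth.
np.random.seed(0)
D = 500
h = np.random.randn(1000, D)
for _ in range(10):
    h = np.tanh(h @ xavier_init(D, D))
print(round(h.std(), 3))   # stays roughly in the 0.4-0.6 range instead of collapsing to 0
```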
Batch Normalization
But weight initialization only helps at the very first forward pass. After the first backpropagation step the weights get updated, we have no control over how they change, and the activations of each layer are most likely not zero-mean and unit-variance anymore.
If the inputs to each network layer have drifting distributions that are no longer zero-mean and unit-variance, we run into the following troubles:
- the activations can saturate, which leads to vanishing gradients through the activation function
- each layer keeps receiving a different input distribution over time, which can make the network converge more slowly
Therefore, in a neural network, we could add a layer of batch normalization right before the activation layer.
The workflow of batch normalization is as follows:
Assume we have a mini-batch of shape NxD, where N is the number of samples in the batch and D is the size of each sample. For each of the D feature positions, we compute the mean and variance over the N samples of the current mini-batch and normalize that feature to zero mean and unit variance across “N” (running averages of these statistics are also kept during training for use at test time).
Additionally, we don’t always want the data to be strictly zero-mean and unit-variance, so after normalizing we apply a learnable linear transformation, which gives the network the flexibility to choose a better scale and shift for the distribution. The parameters of this transformation, gamma and beta, are learned during training. A minimal sketch of this forward pass is shown below.
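A minimal NumPy sketch of the training-time forward pass (the shapes, eps, and momentum values are typical illustrative choices, not taken from the note):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var,
                    eps=1e-5, momentum=0.1):
    # x has shape (N, D); statistics are computed per feature, across the batch
    mu = x.mean(axis=0)                       # (D,)
    var = x.var(axis=0)                       # (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)     # zero mean, unit variance per feature
    out = gamma * x_hat + beta                # learnable scale and shift

    # keep running averages of the batch statistics for use at test time
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return out, running_mean, running_var

np.random.seed(0)
N, D = 32, 10
x = 5 + 3 * np.random.randn(N, D)             # inputs far from zero-mean / unit-variance
gamma, beta = np.ones(D), np.zeros(D)
out, rm, rv = batchnorm_train(x, gamma, beta, np.zeros(D), np.ones(D))
print(out.mean(axis=0).round(2), out.std(axis=0).round(2))   # ~0 and ~1 for every feature
```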
The image below gives a numerical illustration of the shapes of the data before and after batch normalization.
There are also many other kinds of normalization techniques. The following image is a graphical representation of how each technique works.
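As a rough illustration, the variants usually compared in this kind of figure, typically batch, layer, instance, and group normalization (my assumption about what the image shows), differ mainly in which axes of an NxCxHxW tensor the mean and variance are computed over:

```python
# Rough NumPy sketch of the axes used by the common normalization variants,
# on a hypothetical N x C x H x W tensor. (The specific variant names are
# standard in the literature, not listed in the note itself.)
import numpy as np

x = np.random.randn(8, 16, 4, 4)   # N=8 samples, C=16 channels, 4x4 spatial map

batch_mean    = x.mean(axis=(0, 2, 3), keepdims=True)  # per channel, across the whole batch
layer_mean    = x.mean(axis=(1, 2, 3), keepdims=True)  # per sample, across all channels
instance_mean = x.mean(axis=(2, 3), keepdims=True)     # per sample and per channel
g = 4                                                  # group norm: split C into g groups
group_mean = x.reshape(8, g, 16 // g, 4, 4).mean(axis=(2, 3, 4), keepdims=True)
```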
Batch Normalization also has some downsides:
- it is not yet completely understood theoretically
- it behaves differently during training and testing: training normalizes with mini-batch statistics, while testing uses the running averages accumulated during training, and the two can disagree (see the sketch below)
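A sketch of that train/test mismatch, continuing the hypothetical batchnorm_train example above (it reuses x, gamma, beta, rm, rv, and out from that sketch):

```python
# Test-time forward pass: normalize with the running statistics accumulated
# during training, NOT with the statistics of the current batch.
def batchnorm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

out_test = batchnorm_test(x, gamma, beta, rm, rv)
print(np.allclose(out, out_test))   # typically False: the two modes give different outputs
```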