Sparse auto-encoder with KL divergence
This document provides the related derivations for the sparse auto-encoder network model with a KL divergence constraint. The stochastic gradient descent algorithm is used for training this model. A three-layer auto-encoder network is used as an example.
Loss function:

$$
L(W,b,W',b') \;=\; \frac{1}{2m}\sum_{j=1}^{m}\bigl\|\hat{x}^{(j)}-x^{(j)}\bigr\|^{2} \;+\; \beta\sum_{i=1}^{s}\mathrm{KL}\bigl(\rho\,\big\|\,\hat{\rho}_{i}\bigr),
$$

where $(W,b)$ and $(W',b')$ refer to the coefficients in the encoding and decoding phases, $x^{(j)}$ is the $j$-th of the $m$ training samples and $\hat{x}^{(j)}$ its reconstruction, $\rho$ is the target sparsity level, $\beta$ weighs the sparsity penalty, and $s$ refers to the number of units in the hidden layer. The KL divergence can be expanded as

$$
\mathrm{KL}\bigl(\rho\,\big\|\,\hat{\rho}_{i}\bigr) \;=\; \rho\log\frac{\rho}{\hat{\rho}_{i}} \;+\; (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_{i}},
$$
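For concreteness, here is a minimal NumPy sketch of this expansion; the function name kl_penalty and the example values are our own illustration, not part of the manuscript:

```python
import numpy as np

# Expanded KL penalty summed over the s hidden units, for a target
# sparsity rho (scalar) and measured average activations rho_hat (s,).
def kl_penalty(rho, rho_hat):
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

# Example: the first unit meets the target exactly and contributes 0.
print(kl_penalty(0.05, np.array([0.05, 0.2, 0.5])))
```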
where $\hat{\rho}_{i}$ denotes the average activation of the $i$-th hidden unit (averaged over the training set). Assume $a_{i}^{(j)}$ is the activation of the $i$-th hidden unit for the $j$-th sample; thus $\hat{\rho}_{i}$ can be expressed as

$$
\hat{\rho}_{i} \;=\; \frac{1}{m}\sum_{j=1}^{m} a_{i}^{(j)}.
$$
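Putting the pieces together, the following is a minimal NumPy sketch of the full loss for the three-layer auto-encoder. The function names, the assumed shapes (X is m by n, the encoder W is s by n, the decoder Wd is n by s), and the default values of rho and beta are illustrative assumptions, not taken from the manuscript:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reconstruction loss plus KL sparsity penalty for a 3-layer auto-encoder.
def sparse_ae_loss(X, W, b, Wd, bd, rho=0.05, beta=3.0):
    m = X.shape[0]
    A = sigmoid(X @ W.T + b)          # hidden activations a_i^(j), shape (m, s)
    X_hat = sigmoid(A @ Wd.T + bd)    # reconstructions, shape (m, n)
    rho_hat = A.mean(axis=0)          # average activation rho_hat_i, shape (s,)
    recon = np.sum((X - X_hat) ** 2) / (2.0 * m)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return recon + beta * kl
```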
Assume $f(\cdot)$ refers to the activation function (the sigmoid is used in the manuscript), applied element-wise, so that the forward pass is $z^{(2,j)} = W x^{(j)} + b$, $a^{(j)} = f\bigl(z^{(2,j)}\bigr)$, $z^{(3,j)} = W' a^{(j)} + b'$, and $\hat{x}^{(j)} = f\bigl(z^{(3,j)}\bigr)$. For the sigmoid, $f'(z) = f(z)\bigl(1-f(z)\bigr)$. We provide the derivatives of ∂L/∂W', ∂L/∂b', ∂L/∂W and ∂L/∂b below.
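The derivatives below repeatedly use $f'(z)$. For the sigmoid it can be evaluated directly from the activation itself, as in this small sketch (function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For the sigmoid, f'(z) = f(z) * (1 - f(z)); given an activation
# a = f(z), the derivative needs no extra call to sigmoid.
def sigmoid_prime_from_activation(a):
    return a * (1.0 - a)
```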
Note that the coefficients W' and b' do not appear in the KL divergence term $\beta\sum_{i=1}^{s}\mathrm{KL}(\rho\,\|\,\hat{\rho}_{i})$; thus, for the coefficients in the last (decoding) layer, the derivatives can be written as

$$
\frac{\partial L}{\partial W'_{ik}} \;=\; -\frac{1}{m}\sum_{j=1}^{m}\bigl(x_{i}^{(j)}-\hat{x}_{i}^{(j)}\bigr)\,f'\bigl(z_{i}^{(3,j)}\bigr)\,a_{k}^{(j)},
\qquad
\frac{\partial L}{\partial b'_{i}} \;=\; -\frac{1}{m}\sum_{j=1}^{m}\bigl(x_{i}^{(j)}-\hat{x}_{i}^{(j)}\bigr)\,f'\bigl(z_{i}^{(3,j)}\bigr).
$$
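In vectorized form over the whole training set, these two derivatives can be sketched as follows (shapes and names follow the earlier assumptions; this is an illustration, not the manuscript's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Decoder gradients: A is the (m, s) matrix of hidden activations,
# X the (m, n) inputs, Wd/bd the decoder parameters.
def decoder_grads(X, A, Wd, bd):
    m = X.shape[0]
    X_hat = sigmoid(A @ Wd.T + bd)                 # reconstructions, (m, n)
    # -(x - x_hat) * f'(z3), with f'(z3) = x_hat * (1 - x_hat).
    delta3 = -(X - X_hat) * X_hat * (1.0 - X_hat)  # (m, n)
    dWd = delta3.T @ A / m                         # dL/dW', shape (n, s)
    dbd = delta3.mean(axis=0)                      # dL/db', shape (n,)
    return dWd, dbd
```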
Assume $\delta_{i}^{(l,j)}$ refers to the residual of the $i$-th neuron in layer $l$ for the $j$-th sample; thus

$$
\delta_{i}^{(3,j)} \;=\; -\bigl(x_{i}^{(j)}-\hat{x}_{i}^{(j)}\bigr)\,f'\bigl(z_{i}^{(3,j)}\bigr),
\qquad
\delta_{i}^{(2,j)} \;=\; \biggl(\sum_{k} W'_{ki}\,\delta_{k}^{(3,j)} \;+\; \beta\Bigl(-\frac{\rho}{\hat{\rho}_{i}}+\frac{1-\rho}{1-\hat{\rho}_{i}}\Bigr)\biggr)\, f'\bigl(z_{i}^{(2,j)}\bigr),
$$

where the $\beta$ term arises because $\hat{\rho}_{i}$ depends on the hidden activations. The partial derivatives of W, b can be expressed as

$$
\frac{\partial L}{\partial W_{ik}} \;=\; \frac{1}{m}\sum_{j=1}^{m}\delta_{i}^{(2,j)}\,x_{k}^{(j)},
\qquad
\frac{\partial L}{\partial b_{i}} \;=\; \frac{1}{m}\sum_{j=1}^{m}\delta_{i}^{(2,j)}.
$$
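Finally, here is a sketch of one full-batch gradient-descent step that combines all four derivatives above; the learning rate, rho, beta, and all names are illustrative assumptions (a stochastic variant would apply the same update to mini-batches, e.g. with $\hat{\rho}$ estimated per batch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One full-batch gradient step for the three-layer sparse auto-encoder.
# Shapes: X (m, n), W (s, n), b (s,), Wd (n, s), bd (n,).
def gradient_step(X, W, b, Wd, bd, rho=0.05, beta=3.0, lr=0.1):
    m = X.shape[0]
    A = sigmoid(X @ W.T + b)                  # hidden activations, (m, s)
    X_hat = sigmoid(A @ Wd.T + bd)            # reconstructions, (m, n)
    rho_hat = A.mean(axis=0)                  # average activations, (s,)

    # Output-layer residual and decoder gradients.
    delta3 = -(X - X_hat) * X_hat * (1.0 - X_hat)              # (m, n)
    dWd = delta3.T @ A / m
    dbd = delta3.mean(axis=0)

    # Hidden-layer residual: back-propagated error plus the KL term.
    kl_grad = -rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat)   # (s,)
    delta2 = (delta3 @ Wd + beta * kl_grad) * A * (1.0 - A)    # (m, s)
    dW = delta2.T @ X / m
    db = delta2.mean(axis=0)

    # Gradient-descent updates.
    W, b = W - lr * dW, b - lr * db
    Wd, bd = Wd - lr * dWd, bd - lr * dbd
    return W, b, Wd, bd
```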