Sparse auto-encoder with KL divergence


A three-layer auto-encoder network is used as an example.

Loss function:

$$
L(W,b,W',b') = \frac{1}{2m}\sum_{j=1}^{m}\left\|\hat{x}^{(j)} - x^{(j)}\right\|^{2} + \beta\sum_{i=1}^{s}\mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_i\right)
$$
where $(W,b)$ and $(W',b')$ refer to the coefficients in the encoding and decoding phases, $x^{(j)}$ is the $j$-th training sample and $\hat{x}^{(j)}$ its reconstruction, $m$ is the number of training samples, $s$ refers to the number of units in the hidden layer, $\rho$ is the target sparsity, and $\beta$ weights the sparsity penalty. The KL divergence can be expanded as

$$
\mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_i\right) = \rho\log\frac{\rho}{\hat{\rho}_i} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_i}
$$
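
As a numerical sanity check, the expanded penalty is easy to evaluate directly. The sketch below is ours, not from the manuscript; `rho` stands for the target sparsity $\rho$ and `rho_hat` for a NumPy array of the $\hat{\rho}_i$.

```python
import numpy as np

def kl_penalty(rho, rho_hat):
    """Sum of KL(rho || rho_hat_i) over all hidden units."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
```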
where $\hat{\rho}_i$ denotes the average activation of the $i$-th hidden unit (averaged over the training set). Assume $a_i^{(j)}$ is the activation of the $i$-th hidden unit for the $j$-th sample; thus $\hat{\rho}_i$ can be expressed as

$$
\hat{\rho}_i = \frac{1}{m}\sum_{j=1}^{m} a_i^{(j)}
$$
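
Continuing the NumPy sketch above, a forward pass that produces $\hat{\rho}$ and the total loss could look as follows. The shape convention (one sample per column of `X`, encoder `W` of shape `(s, n)`, decoder coefficients $W', b'$ named `W2`, `b2`) is our own assumption, not the manuscript's.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, b, W2, b2):
    """One encode/decode pass; X is (n, m) with one sample per column."""
    A = sigmoid(W @ X + b)                   # hidden activations, shape (s, m)
    X_hat = sigmoid(W2 @ A + b2)             # reconstructions, shape (n, m)
    rho_hat = A.mean(axis=1, keepdims=True)  # average activation per hidden unit, (s, 1)
    return A, X_hat, rho_hat

def loss(X, X_hat, rho, rho_hat, beta):
    """Reconstruction error plus the weighted KL sparsity penalty."""
    m = X.shape[1]
    reconstruction = 0.5 / m * np.sum((X_hat - X) ** 2)
    return reconstruction + beta * kl_penalty(rho, rho_hat)
```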
Assume $f$ refers to the activation function (we used the sigmoid in the manuscript); we will provide the derivatives $\partial L/\partial W'$, $\partial L/\partial b'$, $\partial L/\partial W$, and $\partial L/\partial b$ below.
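
For the sigmoid, the derivative has the convenient closed form $f'(z) = f(z)\left(1 - f(z)\right)$, so it can be computed from the activations already stored during the forward pass:

```python
def f_prime(a):
    """Sigmoid derivative expressed through the activation a = f(z)."""
    return a * (1.0 - a)
```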

Note that the coefficients $W'$ and $b'$ have nothing to do with the KL term; thus, for the coefficients in the last layer, the derivatives can be written (for a single training sample, with $z^{(3)} = W'a^{(2)} + b'$ the pre-activation of the output layer and $a^{(2)}$ the hidden activation vector) as

$$
\frac{\partial L}{\partial W'_{ji}} = \left(\hat{x}_j - x_j\right) f'\!\left(z_j^{(3)}\right) a_i^{(2)},
\qquad
\frac{\partial L}{\partial b'_j} = \left(\hat{x}_j - x_j\right) f'\!\left(z_j^{(3)}\right)
$$

The full gradients average these expressions over the $m$ training samples.
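
Continuing the running NumPy sketch (our shapes and names), the last-layer gradients average the per-sample expressions above over the m columns:

```python
def decoder_gradients(X, A, X_hat):
    """Gradients w.r.t. W' and b'; only the reconstruction term matters here."""
    m = X.shape[1]
    delta3 = (X_hat - X) * f_prime(X_hat)     # output-layer residual, (n, m)
    dW2 = delta3 @ A.T / m                    # dL/dW', shape (n, s)
    db2 = delta3.mean(axis=1, keepdims=True)  # dL/db', shape (n, 1)
    return dW2, db2, delta3
```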
Assume $\delta_j^{(l)}$ refers to the residual of the $j$-th neuron in layer $l$; thus, for the output layer,

$$
\delta_j^{(3)} = \left(\hat{x}_j - x_j\right) f'\!\left(z_j^{(3)}\right),
$$

so the last-layer derivatives above are simply $\delta_j^{(3)} a_i^{(2)}$ and $\delta_j^{(3)}$. For the hidden layer (with $z^{(2)} = Wx + b$), the KL penalty contributes an extra term to the backpropagated residual:

$$
\delta_i^{(2)} = \left(\sum_{j} W'_{ji}\,\delta_j^{(3)} + \beta\left(-\frac{\rho}{\hat{\rho}_i} + \frac{1-\rho}{1-\hat{\rho}_i}\right)\right) f'\!\left(z_i^{(2)}\right)
$$
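
In the same sketch, the hidden-layer residual backpropagates $\delta^{(3)}$ through $W'$ and adds the sparsity term, broadcast across the sample columns:

```python
def hidden_residual(delta3, A, W2, rho, rho_hat, beta):
    """delta^(2) for all samples at once; the beta term comes from the KL penalty."""
    sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))  # (s, 1)
    return (W2.T @ delta3 + sparsity) * f_prime(A)                  # (s, m)
```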
The partial derivatives of $W, b$ can be expressed as

$$
\frac{\partial L}{\partial W_{ik}} = \delta_i^{(2)}\, x_k,
\qquad
\frac{\partial L}{\partial b_i} = \delta_i^{(2)},
$$

again per sample, with the full gradients obtained by averaging over the training set.
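
A closing sketch ties the pieces together on random data; all sizes and names here are hypothetical, chosen only to make the example runnable.

```python
def encoder_gradients(X, delta2):
    """Gradients w.r.t. W and b, averaged over the m samples."""
    m = X.shape[1]
    return delta2 @ X.T / m, delta2.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, s, m = 64, 25, 100                       # visible units, hidden units, samples
X = rng.random((n, m))
W, b = 0.1 * rng.standard_normal((s, n)), np.zeros((s, 1))
W2, b2 = 0.1 * rng.standard_normal((n, s)), np.zeros((n, 1))
rho, beta = 0.05, 3.0

A, X_hat, rho_hat = forward(X, W, b, W2, b2)
dW2, db2, delta3 = decoder_gradients(X, A, X_hat)
delta2 = hidden_residual(delta3, A, W2, rho, rho_hat, beta)
dW, db = encoder_gradients(X, delta2)
```

These analytic gradients can be verified by comparing them against finite differences of `loss`, which is a useful check on the sign conventions above.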