A singular Riemannian geometry approach to Deep Neural Networks.
The Jacobian matrix containing the derivatives with respect to the weights and biases of a fully connected layer is a sparse matrix. For example, let us consider a fully connected layer from \( \mathbb{R}^{2}\) to \( \mathbb{R}^{2} \) with sigmoid activation function. We can write the map realizing the layer as:
\( \Lambda(\underline{X},\underline{W}_1,b_1,\underline{W}_2,b_2) = \begin{pmatrix} \sigma (\underline{W}_1 \cdot \underline{X} + b_1) \\ \sigma (\underline{W}_2 \cdot \underline{X} + b_2) \end{pmatrix} \)
where \(\underline{W}_1 = (w_{11},w_{12})\) and \(\underline{W}_2 = (w_{21},w_{22})\) are the weights of the layer, \(b_1,b_2\) the biases and \(\underline{X} = (x_1,x_2)\) the input data. Calling \(\Lambda_1(\underline{X},\underline{W}_1,b_1) = \sigma (\underline{W}_1 \cdot \underline{X} + b_1)\) and \(\Lambda_2(\underline{X},\underline{W}_2,b_2) = \sigma (\underline{W}_2 \cdot \underline{X} + b_2)\) the two components of the map \( \Lambda\), the Jacobian matrix containing the derivatives with respect to the weights and the biases, for a fixed input, is given by
\[ J \Lambda = \begin{pmatrix} \frac{\partial \Lambda_1}{\partial w_{11}} & \frac{\partial \Lambda_1}{\partial w_{12}} & \frac{\partial \Lambda_1}{\partial b_1} & \frac{\partial \Lambda_1}{\partial w_{21}} & \frac{\partial \Lambda_1}{\partial w_{22}} & \frac{\partial \Lambda_1}{\partial b_2}\\ \frac{\partial \Lambda_2}{\partial w_{11}} & \frac{\partial \Lambda_2}{\partial w_{12}} & \frac{\partial \Lambda_2}{\partial b_1} & \frac{\partial \Lambda_2}{\partial w_{21}} & \frac{\partial \Lambda_2}{\partial w_{22}} & \frac{\partial \Lambda_2}{\partial b_2} \end{pmatrix} \]
Considering that \(\Lambda_1\) does not depend on \(w_{21},w_{22},b_2\) and that \(\Lambda_2\) is not a function of \(w_{11},w_{12},b_1\), we find that the Jacobian of \( \Lambda \) assumes the form
\[ J \Lambda = \begin{pmatrix} \frac{\partial \Lambda_1}{\partial w_{11}} & \frac{\partial \Lambda_1}{\partial w_{12}} & \frac{\partial \Lambda_1}{\partial b_1} & 0 & 0 & 0\\ 0 & 0 & 0 & \frac{\partial \Lambda_2}{\partial w_{21}} & \frac{\partial \Lambda_2}{\partial w_{22}} & \frac{\partial \Lambda_2}{\partial b_2} \end{pmatrix} \]
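Explicitly, the nonzero entries follow from the chain rule: for \( j = 1, 2 \),
\[ \frac{\partial \Lambda_1}{\partial w_{1j}} = \sigma'(\underline{W}_1 \cdot \underline{X} + b_1)\, x_j, \qquad \frac{\partial \Lambda_1}{\partial b_1} = \sigma'(\underline{W}_1 \cdot \underline{X} + b_1), \]
and analogously for \( \Lambda_2 \), with \( \underline{W}_2 \) and \( b_2 \) in place of \( \underline{W}_1 \) and \( b_1 \).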
Now let us consider a generic fully connected layer with n nodes, whose input is a vector space of dimension m. Then, proceeding as in the example above, the map realizing the layer, seen as a function of the weights and biases, is a vector-valued function from \( \mathbb{R}^{n \cdot (m+1)} \) to \( \mathbb{R}^{n} \). The Jacobian matrix of this layer is an \( n \times (n \cdot (m+1))\) matrix with the following block-diagonal structure:
\[ J \Lambda = \begin{pmatrix} \frac{\partial \Lambda_1}{\partial \underline{W}_1} & \frac{\partial \Lambda_1}{\partial b_1} & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & \frac{\partial \Lambda_n}{\partial \underline{W}_n} & \frac{\partial \Lambda_n}{\partial b_n} \end{pmatrix} \]
where \( \frac{\partial \Lambda_i}{\partial \underline{W}_i} \) denotes the row of the m derivatives of \( \Lambda_i \) with respect to the entries of \( \underline{W}_i \). Each row therefore contains at most \( m+1 \) nonzero entries, namely the derivatives with respect to \( \underline{W}_i \) and \( b_i \), so it is enough to store and manipulate an \( n \times (m+1) \) matrix of nonzero entries (the reduced form of the Jacobian).
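As a concrete illustration of this structure, the following minimal C++ sketch, independent of the library code (the function name reduced_jacobian and the plain std::vector-based types are placeholders chosen only for this example), computes for a sigmoid layer the \( n \times (m+1) \) matrix of potentially nonzero derivatives, i.e. the Jacobian stripped of its structural zeros:

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Reduced n x (m+1) Jacobian of a sigmoid layer for a fixed input X:
// row i holds d Lambda_i/d w_{i1}, ..., d Lambda_i/d w_{im}, d Lambda_i/d b_i,
// i.e. exactly the entries that are not structurally zero.
Matrix reduced_jacobian(const std::vector<double>& X,   // input, size m
                        const Matrix& W,                // weights, n x m
                        const std::vector<double>& b)   // biases, size n
{
    const std::size_t n = W.size();
    const std::size_t m = X.size();
    Matrix J(n, std::vector<double>(m + 1, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        double z = b[i];
        for (std::size_t j = 0; j < m; ++j) z += W[i][j] * X[j];
        const double s  = 1.0 / (1.0 + std::exp(-z));             // sigma(z)
        const double sp = s * (1.0 - s);                          // sigma'(z)
        for (std::size_t j = 0; j < m; ++j) J[i][j] = sp * X[j];  // d Lambda_i / d w_{ij}
        J[i][m] = sp;                                             // d Lambda_i / d b_i
    }
    return J;
}

int main() {
    // The 2 -> 2 layer of the example above, with arbitrary numerical values.
    Matrix W = {{0.1, 0.2}, {-0.3, 0.4}};
    std::vector<double> b = {0.0, 0.1}, X = {0.5, -1.0};
    for (const auto& row : reduced_jacobian(X, W, b)) {
        for (double v : row) std::cout << v << ' ';
        std::cout << '\n';
    }
}
```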
In order to compute the pullback of the metric, however, we need to compute some products involving these Jacobian matrices. To this end, we implement the functions reduced_standard_mul and standard_reduced_mul in matrix_utils.h, computing the product between a reduced matrix and a standard one (reduced_standard_mul) and the product between a standard matrix and a reduced one (standard_reduced_mul). The conversions between the standard and the reduced forms of the Jacobian matrix are handled by the reduced_to_standard and standard_to_reduced functions.
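As an illustration of how such products can exploit the block structure, here is a sketch of the multiplication between a standard matrix and a Jacobian stored in reduced form. The name standard_reduced_mul_sketch and the std::vector-based Matrix type are placeholders; the actual functions in matrix_utils.h operate on the project's own matrix types and their signatures may differ.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Product A * J between a dense p x n matrix A and the full n x n(m+1) Jacobian J,
// with J given in reduced form R (n rows, m+1 columns; row j holds the only nonzero
// entries of column block j of J).
Matrix standard_reduced_mul_sketch(const Matrix& A, const Matrix& R) {
    const std::size_t p    = A.size();      // rows of A
    const std::size_t n    = R.size();      // output nodes (= columns of A)
    const std::size_t cols = R[0].size();   // m + 1 entries per block
    Matrix result(p, std::vector<double>(n * cols, 0.0));
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < cols; ++k)
                // Column block j of J is nonzero only in row j, so the inner sum
                // of an ordinary matrix product collapses to the single term
                // A(i,j) * R(j,k).
                result[i][j * cols + k] = A[i][j] * R[j][k];
    return result;
}
```

Because the inner sum collapses to a single term, the cost of the product drops from \( O(p \cdot n \cdot n(m+1)) \) to \( O(p \cdot n \cdot (m+1)) \), which is precisely what makes the reduced representation worthwhile.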