Learning Machine Learning 3 - Some matrix calculus, least squares and Lagrange multipliers

 

This is the last of the background material posts... Next up we'll (finally!) get started working through some of the CS229 lectures!

The gradient and Hessian

From vector calculus we're familiar with the notion of the gradient of a function \(f:\RR^n\to\RR\), \(\nabla f\). Assuming Cartesian coordinates it is the vector of partial derivatives, \((\partial f/\partial x_1,\partial f/\partial x_2,\dots,\partial f/\partial x_n)^\top\), and points in the direction of the greatest rate of change of \(f\). If we think of \(f\) as a real-valued function of real \(m\times n\) matrices, \(f:\text{Mat}_{m,n}(\RR)\to\RR\), then \(\nabla f\) is an \(m\times n\) matrix,\begin{equation*}\nabla f=\begin{pmatrix}\frac{\partial f}{\partial x_{11}}&\frac{\partial f}{\partial x_{12}}&\cdots&\frac{\partial f}{\partial x_{1n}}\\\frac{\partial f}{\partial x_{21}}&\frac{\partial f}{\partial x_{22}}&\cdots&\frac{\partial f}{\partial x_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial f}{\partial x_{m1}}&\frac{\partial f}{\partial x_{m2}}&\cdots&\frac{\partial f}{\partial x_{mn}}\end{pmatrix}.\end{equation*}The gradient of a function \(f:\RR^n\to\RR\) is, up to transposition, the special case of the derivative or Jacobian matrix, \(\mathbf{J}\), of a vector-valued function \(f:\RR^n\to\RR^m\), \begin{equation*}\mathbf{J}=\begin{pmatrix}\frac{\partial f_1}{\partial x_1}&\frac{\partial f_1}{\partial x_2}&\cdots&\frac{\partial f_1}{\partial x_n}\\\frac{\partial f_2}{\partial x_1}&\frac{\partial f_2}{\partial x_2}&\cdots&\frac{\partial f_2}{\partial x_n}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial f_m}{\partial x_1}&\frac{\partial f_m}{\partial x_2}&\cdots&\frac{\partial f_m}{\partial x_n}\end{pmatrix},\end{equation*}when \(m=1\). The Jacobian matrix of the gradient of a function \(f:\RR^n\to\RR\) is the Hessian matrix, \(\mathbf{H}\). This is the matrix of second order partials, \begin{equation*}\mathbf{H}=\begin{pmatrix}\frac{\partial^2f}{\partial x_1^2}&\frac{\partial^2f}{\partial x_1\partial x_2}&\cdots&\frac{\partial^2f}{\partial x_1\partial x_n}\\\frac{\partial^2f}{\partial x_2\partial x_1}&\frac{\partial^2f}{\partial x_2^2}&\cdots&\frac{\partial^2f}{\partial x_2\partial x_n}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial^2f}{\partial x_n\partial x_1}&\frac{\partial^2f}{\partial x_n\partial x_2}&\cdots&\frac{\partial^2f}{\partial x_n^2}\end{pmatrix},\end{equation*} and appears, along with the gradient, in the Taylor expansion of a function \(f:\RR^n\to\RR\) about a point \(\mathbf{a}\in\RR^n\),\begin{equation*}f(\mathbf{x})=f(\mathbf{a})+\nabla f(\mathbf{a})^\top(\mathbf{x}-\mathbf{a})+\frac{1}{2}(\mathbf{x}-\mathbf{a})^\top\mathbf{H}(\mathbf{a})(\mathbf{x}-\mathbf{a})+\dots\end{equation*}Provided the second partial derivatives are continuous, the Hessian matrix is real and symmetric, so it can be diagonalised through a similarity transformation by an orthogonal matrix.
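
As a quick numerical illustration of the expansion above, the following sketch (assuming NumPy is available; the test function \(f(x)=e^{x_1}\sin x_2\) and all names are my own illustrative choices, not anything from CS229) compares \(f\) against its second-order Taylor model built from the gradient and Hessian.

```python
import numpy as np

# Test function f(x) = exp(x1) * sin(x2), with its gradient and Hessian worked out by hand.
def f(x):
    return np.exp(x[0]) * np.sin(x[1])

def grad_f(x):
    return np.array([np.exp(x[0]) * np.sin(x[1]),
                     np.exp(x[0]) * np.cos(x[1])])

def hess_f(x):
    return np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                     [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

a = np.array([0.3, 0.7])                 # expansion point
d = 1e-3 * np.array([1.0, -2.0])         # small displacement x - a

# Second-order Taylor model f(a) + grad(a)^T d + (1/2) d^T H(a) d
taylor = f(a) + grad_f(a) @ d + 0.5 * d @ hess_f(a) @ d
print(abs(f(a + d) - taylor))            # error is cubic in |d|, roughly 1e-9 here
```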

Let's consider some typical gradient calculations. Suppose \(f:\RR^n\to\RR\) is given by \(f(\mathbf{x})=\mathbf{b}^\top\mathbf{x}\). Then \begin{equation*}\frac{\partial f(\mathbf{x})}{\partial x_k}=\frac{\partial}{\partial x_k}\sum_{i=1}^nb_ix_i=b_k\end{equation*} so that \(\nabla f=\mathbf{b}\). Now consider \(f(\mathbf{x})=\mathbf{x}^\top\mathbf{A}\mathbf{x}\) for some \(n\times n\) matrix \(\mathbf{A}\). We have \begin{align*}\frac{\partial f(\mathbf{x})}{\partial x_k}&=\frac{\partial}{\partial x_k}\sum_{i,j=1}^nA_{ij}x_ix_j\\&=\frac{\partial}{\partial x_k}\left(A_{kk}x_k^2+\sum_{j\neq k}A_{kj}x_kx_j+\sum_{i\neq k}A_{ik}x_ix_k+\sum_{i,j\neq k}A_{ij}x_ix_j\right)\\&=2A_{kk}x_k+\sum_{i\neq k}A_{ik}x_i+\sum_{i\neq k}A_{ki}x_i\end{align*} so that \(\nabla f=(\mathbf{A}+\mathbf{A}^\top)\mathbf{x}\). In particular, if \(\mathbf{A}\) is symmetric, then \(\nabla f=2\mathbf{A}\mathbf{x}\). The Hessian matrix of \(f\), that is, the derivative of the gradient, is easily seen to be \(\mathbf{A}+\mathbf{A}^\top\), which reduces to \(2\mathbf{A}\) in the symmetric case.
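
These identities are easy to spot-check numerically. Here is a minimal sketch, assuming NumPy; the finite-difference helper `num_grad` is my own and not anything from the course material.

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central finite-difference approximation to the gradient of f at x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

print(np.allclose(num_grad(lambda v: b @ v, x), b))                  # grad(b^T x) = b
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))  # grad(x^T A x) = (A + A^T) x
```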

A note on notation. Rather confusingly, some machine learning materials, including the supplementary notes accompanying CS229, use a notation for the Hessian usually reserved for the Laplacian, \(\nabla^2\), which of course it is not. Also, in the machine learning literature it is common to append a subscript to the gradient symbol indicating the variable being differentiated with respect to, \(\nabla_\mathbf{x}\) or \(\nabla_\mathbf{X}\), for example. So we might write \begin{align*}\nabla_{\mathbf{x}}(\mathbf{b}^\top\mathbf{x})&=\mathbf{b}\\\nabla_{\mathbf{x}}(\mathbf{x}^\top\mathbf{A}\mathbf{x})&=(\mathbf{A}+\mathbf{A}^\top)\mathbf{x}.\end{align*}

Derivatives of traces

Consider the function \(f:\text{Mat}_n(\RR)\to\RR\) given by \(f(\mathbf{X})=\tr\mathbf{X}\). Clearly, \(\nabla f=\mathbf{I}\), or using the more explicit notation, \begin{equation*}\nabla_{\mathbf{X}}\tr\mathbf{X}=\mathbf{I}.\end{equation*} Other functions involving the trace also turn up frequently. For example, if \(\mathbf{X}\in\text{Mat}_{m,n}(\RR)\) and \(\mathbf{B}\in\text{Mat}_{n,m}(\RR)\) then we could define a function \(f:\text{Mat}_{m,n}(\RR)\to\RR\) by \(f(\mathbf{X})=\tr\mathbf{X}\mathbf{B}\) and it is straightforward to verify that \(\nabla f=\mathbf{B}^\top\),\begin{equation*}\nabla_{\mathbf{X}}\tr\mathbf{X}\mathbf{B}=\mathbf{B}^\top.\end{equation*}Just a little more work is required to compute the gradient of the function \(f:\text{Mat}_{m,n}(\RR)\to\RR\) defined by \(f(\mathbf{X})=\tr\mathbf{X}\mathbf{B}\mathbf{X}^\top\mathbf{C}\), in which \(\mathbf{B}\in\text{Mat}_n(\RR)\) and \(\mathbf{C}\in\text{Mat}_m(\RR)\). We have \begin{equation*}(\mathbf{X}\mathbf{B}\mathbf{X}^\top\mathbf{C})_{ij}=\sum_{\alpha,\beta,\gamma}X_{i\alpha}B_{\alpha\beta}X_{\gamma\beta}C_{\gamma j},\end{equation*}so that\begin{equation*}\tr\mathbf{X}\mathbf{B}\mathbf{X}^\top\mathbf{C}=\sum_{i,\alpha,\beta,\gamma}X_{i\alpha}B_{\alpha\beta}X_{\gamma\beta}C_{\gamma i}\end{equation*} and, considering \(\partial/\partial X_{ab}\), we pick up two terms,\begin{equation*}\sum_{\beta,\gamma}B_{b\beta}X_{\gamma\beta}C_{\gamma a}=(\mathbf{C}^\top\mathbf{X}\mathbf{B}^\top)_{ab}\end{equation*}and\begin{equation*}\sum_{i,\alpha}X_{i\alpha}B_{\alpha b}C_{ai}=(\mathbf{C}\mathbf{X}\mathbf{B})_{ab},\end{equation*} so that \(\nabla f=\mathbf{C}\mathbf{X}\mathbf{B}+\mathbf{C}^\top\mathbf{X}\mathbf{B}^\top\) or\begin{equation*}\nabla_\mathbf{X}\tr\mathbf{X}\mathbf{B}\mathbf{X}^\top\mathbf{C}=\mathbf{C}\mathbf{X}\mathbf{B}+\mathbf{C}^\top\mathbf{X}\mathbf{B}^\top.\end{equation*}
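
Both trace identities can be checked entrywise with finite differences. A rough sketch, again assuming NumPy; the helper `num_grad_mat` and the matrix sizes below are arbitrary choices of mine.

```python
import numpy as np

def num_grad_mat(f, X, h=1e-6):
    """Central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(1)
m, n = 3, 4
X = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))    # for tr(XB); X B is m x m
B2 = rng.standard_normal((n, n))   # for tr(X B2 X^T C)
C = rng.standard_normal((m, m))

# grad_X tr(XB) = B^T
print(np.allclose(num_grad_mat(lambda Y: np.trace(Y @ B), X), B.T))
# grad_X tr(X B2 X^T C) = C X B2 + C^T X B2^T
print(np.allclose(num_grad_mat(lambda Y: np.trace(Y @ B2 @ Y.T @ C), X),
                  C @ X @ B2 + C.T @ X @ B2.T))
```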

Derivatives of determinants

Recall that near the identity the determinant behaves like the trace, that is,\begin{equation*}\det(\mathbf{I}+\epsilon\mathbf{A})=1+\epsilon\tr\mathbf{A}+\mathcal{O}(\epsilon^2).\end{equation*}We will show that, for a differentiable matrix-valued function \(\mathbf{A}(t)\) with \(\mathbf{A}(t)\) invertible,\begin{equation*}\frac{d}{dt}\det\mathbf{A}(t)=\det\mathbf{A}(t)\tr\left(\mathbf{A}(t)^{-1}\frac{d\mathbf{A}(t)}{dt}\right).\end{equation*} Indeed, \begin{align*}\frac{d}{dt}\det\mathbf{A}(t)&=\lim_{\epsilon\to0}\frac{\det\mathbf{A}(t+\epsilon)-\det\mathbf{A}(t)}{\epsilon}\\&=\det\mathbf{A}(t)\lim_{\epsilon\to0}\frac{\det\left(\mathbf{A}(t)^{-1}\mathbf{A}(t+\epsilon)\right)-1}{\epsilon}\\&=\det\mathbf{A}(t)\lim_{\epsilon\to0}\frac{\det\left(\mathbf{I}+\epsilon\mathbf{A}(t)^{-1}\frac{d\mathbf{A}(t)}{dt}+\mathcal{O}(\epsilon^2)\right)-1}{\epsilon}\\&=\det\mathbf{A}(t)\lim_{\epsilon\to0}\frac{1+\epsilon\tr\left(\mathbf{A}(t)^{-1}\frac{d\mathbf{A}(t)}{dt}\right)+\mathcal{O}(\epsilon^2)-1}{\epsilon}\\&=\det\mathbf{A}(t)\tr\left(\mathbf{A}(t)^{-1}\frac{d\mathbf{A}(t)}{dt}\right).\end{align*}
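
This identity is easy to test on a concrete matrix path, say \(\mathbf{A}(t)=\mathbf{A}_0+t\mathbf{A}_1\), so that \(d\mathbf{A}/dt=\mathbf{A}_1\). A sketch assuming NumPy; the matrices and the evaluation point are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A0 = rng.standard_normal((n, n))
A1 = rng.standard_normal((n, n))
A = lambda t: A0 + t * A1                      # so dA/dt = A1

t, h = 0.3, 1e-6
# Left-hand side: central finite difference of det A(t)
lhs = (np.linalg.det(A(t + h)) - np.linalg.det(A(t - h))) / (2 * h)
# Right-hand side: det A(t) * tr(A(t)^{-1} A1), using solve instead of an explicit inverse
rhs = np.linalg.det(A(t)) * np.trace(np.linalg.solve(A(t), A1))
print(lhs, rhs)                                # should agree to several significant figures
```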

Let us now consider the function \(f:\text{Mat}_n(\RR)\to\RR\) given by \(f(\mathbf{X})=\det\mathbf{X}\). Recall the cofactor expansion of the determinant along the \(i\)th row,\begin{equation*}\det\mathbf{X}=\sum_{j=1}^nX_{ij}(\adj\mathbf{X})_{ji},\end{equation*} where \(\adj\mathbf{X}\) is the adjugate, the transpose of the matrix of cofactors, which satisfies \(\mathbf{X}\adj\mathbf{X}=(\det\mathbf{X})\mathbf{I}\). Since \((\adj\mathbf{X})_{ji}\) does not depend on the entries of the \(i\)th row of \(\mathbf{X}\), we have \(\partial\det\mathbf{X}/\partial X_{ij}=(\adj\mathbf{X})_{ji}\), and so, for invertible \(\mathbf{X}\), \(\nabla f(\mathbf{X})=\det\mathbf{X}\left(\mathbf{X}^{-1}\right)^\top\), or, \begin{equation*}\nabla_\mathbf{X}\det\mathbf{X}=\det\mathbf{X}\left(\mathbf{X}^{-1}\right)^\top.\end{equation*}
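
The same finite-difference trick confirms the determinant gradient. A minimal NumPy sketch (the random \(4\times4\) matrix is almost surely invertible):

```python
import numpy as np

rng = np.random.default_rng(3)
n, h = 4, 1e-6
X = rng.standard_normal((n, n))

# Entrywise central finite differences of det X
G = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h
        G[i, j] = (np.linalg.det(X + E) - np.linalg.det(X - E)) / (2 * h)

# grad_X det X = det X * (X^{-1})^T
print(np.allclose(G, np.linalg.det(X) * np.linalg.inv(X).T))
```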

Linear least squares

Suppose we have a linear system \(\mathbf{A}\mathbf{x}=\mathbf{b}\) where \(\mathbf{A}\in\text{Mat}_{m,n}(\RR)\), \((m>n)\), \(\mathbf{x}\in\RR^n\) is unknown and \(\mathbf{b}\in\RR^m\). If \(\mathbf{b}\) does not belong to the image of \(\mathbf{A}\), or, equivalently, does not belong to the column space of \(\mathbf{A}\), this has no solution. However, it does have a best approximate solution in the least squares sense, namely, the vector \(\mathbf{x}\) which minimises the average, \(E\), of the squares of the errors (the \(1/2\) is simply for convenience), \begin{equation*}E=\frac{1}{2m}\sum_{i=1}^m\left(\sum_{j=1}^nA_{ij}x_j-b_i\right)^2.\end{equation*}This can be written more compactly as\begin{equation*}E(\mathbf{x})=\frac{1}{2m}\left(\mathbf{x}^\top\mathbf{A}^\top\mathbf{A}\mathbf{x}-2\mathbf{b}^\top\mathbf{A}\mathbf{x}+\mathbf{b}^\top\mathbf{b}\right).\end{equation*}We can now calculate the gradient as \(\nabla E(\mathbf{x})=\frac{1}{m}\left(\mathbf{A}^\top\mathbf{A}\mathbf{x}-\mathbf{A}^\top\mathbf{b}\right)\) and setting this to zero we obtain what are known as the normal equations, \(\mathbf{A}^\top\mathbf{A}\mathbf{x}=\mathbf{A}^\top\mathbf{b}\). Provided \(\mathbf{A}\) has full column rank, \(\mathbf{A}^\top\mathbf{A}\) is invertible and the solution is\begin{equation*}\mathbf{x}=(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}.\end{equation*}It's worth noting here that if we calculate the derivative of the gradient, in other words the Hessian of \(E(\mathbf{x})\), we obtain \(\frac{1}{m}\mathbf{A}^\top\mathbf{A}\). This is a positive semidefinite matrix and so \(E(\mathbf{x})\) is convex; in other words, any local minimum is a global minimum.
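
In code, the normal equations amount to a single linear solve, and the result can be compared with a library least-squares routine. A sketch assuming NumPy; the problem sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 3
A = rng.standard_normal((m, n))    # tall matrix, full column rank almost surely
b = rng.standard_normal(m)

# Normal equations A^T A x = A^T b, solved directly rather than by forming the inverse
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's built-in least-squares solver for comparison
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_lstsq))

# The Hessian (1/m) A^T A is positive semidefinite: all eigenvalues are >= 0
print(np.all(np.linalg.eigvalsh(A.T @ A / m) >= -1e-12))
```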

Lagrange multipliers

For a function \(f:\RR^n\to\RR\) the extrema are found by considering those points \(\mathbf{x}\) at which the gradient vanishes, \(\nabla f(\mathbf{x})=0\). The problem addressed by the technique of Lagrange multipliers is finding the extrema of \(f(\mathbf{x})\) subject to a constraint \(g(\mathbf{x})=0\). The key fact is that extrema of \(f(\mathbf{x})\) subject to the constraint \(g(\mathbf{x})=0\) satisfy \begin{equation*}\nabla f(\mathbf{x})=\lambda\nabla g(\mathbf{x})\end{equation*} with \(\lambda\) known as the Lagrange multiplier. The intuition behind this is as follows. Any increment \(\Delta\mathbf{x}\) on the surface \(g(\mathbf{x})=0\) is such that \(\Delta\mathbf{x}\cdot\nabla g(\mathbf{x})=0\). But the projection \(\Delta\mathbf{x}\cdot\nabla f(\mathbf{x})\) must also vanish in any direction \(\Delta\mathbf{x}\) preserving the constraint if \(f(\mathbf{x})\) is to have an extremum on the surface, so \(\nabla f(\mathbf{x})\) and \(\nabla g(\mathbf{x})\) must be parallel. The pair of equations,\begin{equation*}\nabla f(\mathbf{x})=\lambda\nabla g(\mathbf{x})\quad\text{and}\quad g(\mathbf{x})=0,\end{equation*} are precisely the equations we obtain when considering the extrema of the unconstrained function \(L:\RR^{n+1}\to\RR\) defined as \begin{equation*}L(\mathbf{x},\lambda)=f(\mathbf{x})-\lambda g(\mathbf{x}).\end{equation*}
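
To make the machinery concrete, here is a small worked example of my own (not one from the lectures). Consider the extrema of \(f(x,y)=x+2y\) subject to \(g(x,y)=x^2+y^2-1=0\). The conditions \(\nabla f=\lambda\nabla g\) and \(g=0\) read\begin{equation*}1=2\lambda x,\qquad 2=2\lambda y,\qquad x^2+y^2=1,\end{equation*}so \(x=1/(2\lambda)\), \(y=1/\lambda\), and the constraint gives \(5/(4\lambda^2)=1\), that is, \(\lambda=\pm\sqrt{5}/2\). The two stationary points are \(\pm(1/\sqrt{5},2/\sqrt{5})\), with maximum value \(f=\sqrt{5}\) and minimum value \(f=-\sqrt{5}\).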

If we have more than one constraint we need more Lagrange multipliers. Suppose we have \(k\) constraints, \(g_i(\mathbf{x})=0\), \(i=1,\dots,k\). By the natural extension of the intuition for one constraint, the projection of \(\nabla f(\mathbf{x})\) onto any direction \(\Delta\mathbf{x}\) consistent with all of the constraints must vanish. The direction \(\Delta\mathbf{x}\) must itself satisfy \(\Delta\mathbf{x}\cdot\nabla g_i(\mathbf{x})=0\) for \(i=1,\dots,k\). So we should set \begin{equation*}\nabla f(\mathbf{x})=\lambda_1\nabla g_1(\mathbf{x})+\cdots+\lambda_k\nabla g_k(\mathbf{x}).\end{equation*}The unconstrained function in this case is \begin{equation*}L(\mathbf{x},\lambda_1,\dots,\lambda_k)=f(\mathbf{x})-\lambda_1g_1(\mathbf{x})-\cdots-\lambda_kg_k(\mathbf{x}).\end{equation*}

An example that leads directly to eigenvalue/eigenvector analysis is the following. We seek the extrema of \(f(\mathbf{x})=\mathbf{x}^\top\mathbf{A}\mathbf{x}\), for \(\mathbf{A}\) real symmetric, subject to the constraint \(\mathbf{x}^\top\mathbf{x}=1\). Thus we form the function\begin{equation*}L(\mathbf{x},\lambda)= \mathbf{x}^\top\mathbf{A}\mathbf{x}-\lambda(\mathbf{x}^\top\mathbf{x}-1)\end{equation*} and find its unconstrained extrema. Setting the gradients with respect to \(\mathbf{x}\) and \(\lambda\) to zero, we are led to\begin{equation*}\mathbf{A}\mathbf{x}=\lambda\mathbf{x}\quad\text{and}\quad \mathbf{x}^\top\mathbf{x}=1.\end{equation*}
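
Numerically, the extreme values of \(\mathbf{x}^\top\mathbf{A}\mathbf{x}\) on the unit sphere are the extreme eigenvalues of \(\mathbf{A}\), attained at the corresponding eigenvectors. A quick sketch assuming NumPy; the random symmetric matrix and the sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                          # a random real symmetric matrix

vals, vecs = np.linalg.eigh(A)             # eigenvalues ascending, orthonormal eigenvectors
x = vecs[:, -1]                            # unit eigenvector for the largest eigenvalue

# Stationarity condition A x = lambda x, with lambda = x^T A x
print(np.allclose(A @ x, (x @ A @ x) * x))

# No random unit vector beats the largest eigenvalue
s = rng.standard_normal((10000, n))
s /= np.linalg.norm(s, axis=1, keepdims=True)
print(np.max(np.einsum('ij,jk,ik->i', s, A, s)) <= vals[-1] + 1e-12)
```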