Learning Machine Learning 3 - Some matrix calculus, least squares and Lagrange multipliers

From vector calculus we're familiar with the notion of the gradient of a function $$f:\RR^n\to\RR$$, $$\nabla f$$. Assuming Cartesian coordinates, it is the vector of partial derivatives, $$(\partial f/\partial x_1,\partial f/\partial x_2,\dots,\partial f/\partial x_n)^\top$$, and points in the direction of the greatest rate of increase of $$f$$. If we think of $$f$$ as a real-valued function of real $$m\times n$$ matrices, $$f:\text{Mat}_{m,n}(\RR)\to\RR$$, then $$\nabla f$$ is an $$m\times n$$ matrix, \begin{equation*}\nabla f=\begin{pmatrix}\frac{\partial f}{\partial x_{11}}&\frac{\partial f}{\partial x_{12}}&\cdots&\frac{\partial f}{\partial x_{1n}}\\\frac{\partial f}{\partial x_{21}}&\frac{\partial f}{\partial x_{22}}&\cdots&\frac{\partial f}{\partial x_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial f}{\partial x_{m1}}&\frac{\partial f}{\partial x_{m2}}&\cdots&\frac{\partial f}{\partial x_{mn}}\end{pmatrix}.\end{equation*}
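
To make the matrix form of the gradient concrete, here is a minimal numerical sketch in Python with NumPy. It approximates each entry $$\partial f/\partial x_{ij}$$ by a central difference and checks the result against a case where the answer is known in closed form. The helper name `numerical_gradient`, the step size `h` and the test function $$f(\mathbf{X})=\text{tr}(\mathbf{A}^\top\mathbf{X})$$ are our own illustrative choices; for this particular $$f$$ the gradient is just $$\mathbf{A}$$, since $$f=\sum_{ij}A_{ij}x_{ij}$$.

```python
import numpy as np


def numerical_gradient(f, X, h=1e-6):
    """Approximate the m x n matrix of partials df/dx_ij at X by central differences."""
    X = np.asarray(X, dtype=float)
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h  # perturb only the (i, j) entry
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return grad


# Illustrative test function (an assumption, not taken from the text):
# f(X) = tr(A^T X) = sum_ij A_ij x_ij, so the gradient should equal A.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((2, 3))


def f(M):
    return np.trace(A.T @ M)


G = numerical_gradient(f, X)

print(G.shape)                        # the gradient has the same shape as X
print(np.allclose(G, A, atol=1e-6))   # agrees with the closed form up to rounding
```

A finite-difference check like this is only a sanity test, but it is a handy way to catch sign and transpose mistakes when deriving matrix gradients by hand.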

Andrew Jacobs
Learning Machine Learning 2 - The multivariate Gaussian

If an $$N$$-dimensional random vector $$\mathbf{X}=(X_1,X_2,\dots,X_N)^\top$$ is distributed according to a multivariate Gaussian distribution with mean vector \begin{equation*}\boldsymbol{\mu}=(\Exp[X_1],\Exp[X_2],\dots,\Exp[X_N])^\top\end{equation*} and covariance matrix $$\mathbf{\Sigma}$$ such that $$\Sigma_{ij}=\cov(X_i,X_j)$$, we write $$\mathbf{X}\sim\mathcal{N}(\boldsymbol{\mu},\mathbf{\Sigma})$$ and the probability density function is given by \begin{equation*}p(\mathbf{x}|\boldsymbol{\mu},\mathbf{\Sigma})=\frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).\end{equation*}
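
The density formula translates almost term by term into code. The following is a minimal sketch in Python with NumPy, assuming $$\mathbf{\Sigma}$$ is symmetric positive definite; the function name `gaussian_pdf` and the example values of $$\boldsymbol{\mu}$$ and $$\mathbf{\Sigma}$$ are our own illustrative choices. Rather than forming $$\mathbf{\Sigma}^{-1}$$ explicitly, it solves the linear system $$\mathbf{\Sigma}\mathbf{z}=\mathbf{x}-\boldsymbol{\mu}$$, which is usually the better-behaved option numerically.

```python
import numpy as np


def gaussian_pdf(x, mu, Sigma):
    """Evaluate the N-dimensional Gaussian density p(x | mu, Sigma) at a single point x.

    Assumes Sigma is a symmetric positive-definite N x N matrix.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    N = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm


# Illustrative 2-dimensional example (values chosen by us, with det(Sigma) = 1).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.0],
                  [0.0, 0.5]])

p_at_mean = gaussian_pdf(mu, mu, Sigma)
print(p_at_mean)  # at the mean the exponent vanishes, leaving 1 / (2 * pi)
```

Because the example $$\mathbf{\Sigma}$$ is diagonal, the density factorises into a product of two one-dimensional Gaussians with variances 2 and 0.5, which gives an easy independent check of the implementation.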