Learning Machine Learning 2 - The multivariate Gaussian

 

Definition and normalisation

Bivariate Gaussian density with \(\mu_1=\mu_2=0\), \(\sigma_1=1.6\), \(\sigma_2=1.2\) and \(\rho=-0.5\)

If an \(N\)-dimensional random vector \(\mathbf{X}=(X_1,X_2,\dots,X_N)^\top\) is distributed according to a multivariate Gaussian distribution with mean vector \begin{equation*}\boldsymbol{\mu}=(\mu_1,\mu_2,\dots,\mu_N)^\top,\end{equation*}(\(\Exp[X_i]=\mu_i\)) and covariance matrix \(\mathbf{\Sigma}\) such that \(\Sigma_{ij}=\cov(X_i,X_j)\), we write \(\mathbf{X}\sim\mathcal{N}(\boldsymbol{\mu},\mathbf{\Sigma})\) and the probability density function is given by \begin{equation*}p(\mathbf{x}|\boldsymbol{\mu},\mathbf{\Sigma})=\frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).\end{equation*}We will assume here that the covariance matrix is positive definite; in other words, we assume the eigenvalues of \(\mathbf{\Sigma}\) are all positive. To check the normalisation we proceed as follows. Being real and symmetric, the matrix \(\mathbf{\Sigma}^{-1}\) is diagonalisable, so we can write \(\mathbf{P}^{-1}\mathbf{\Sigma}^{-1}\mathbf{P}=\diag(\lambda_i^{-1})\) where \(\mathbf{P}\) is an orthogonal matrix and \(\lambda_i\) are the eigenvalues (all positive) of the covariance matrix \(\mathbf{\Sigma}\). Therefore, we can write the exponent as \begin{equation*}-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{P}\diag(\lambda_i^{-1})\mathbf{P}^\top(\mathbf{x}-\boldsymbol{\mu}),\end{equation*} which suggests we consider the change of variables \(\mathbf{y}=\mathbf{P}^\top\mathbf{x}\) and define \(\boldsymbol{\nu}=\mathbf{P}^\top\boldsymbol{\mu}\). Noting that \begin{equation*}\left\lvert\det\left(\frac{\partial\mathbf{x}}{\partial\mathbf{y}}\right)\right\rvert=|\det\mathbf{P}|=1,\end{equation*} we can then rewrite the integral as \begin{align*}\int\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\,d\mathbf{x}&=\int\exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\nu})^\top\diag(\lambda_i^{-1})(\mathbf{y}-\boldsymbol{\nu})\right)\,d\mathbf{y}\\&=\int\exp\left(-\frac{1}{2}\sum_{i=1}^N\left(\frac{y_i-\nu_i}{\sqrt{\lambda_i}}\right)^2\right)\,d\mathbf{y}\\&=\int\prod_{i=1}^N\sqrt{2\pi}\sqrt{\lambda_i}\,p(y_i|\nu_i,\lambda_i)\,d\mathbf{y}\\&=\prod_{i=1}^N\sqrt{2\pi}\sqrt{\lambda_i}\\&=(2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}},\end{align*}where \(p(y_i|\nu_i,\lambda_i)\) denotes the univariate Gaussian density with mean \(\nu_i\) and variance \(\lambda_i\), and the last step uses \(\det\mathbf{\Sigma}=\prod_{i=1}^N\lambda_i\).
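
As a quick numerical sanity check of the density formula, here is a minimal sketch (assuming NumPy and SciPy are available; the parameters are just the bivariate example from the figure above, and the function name gaussian_pdf is my own) that evaluates the density directly and compares it with SciPy's multivariate_normal.

import numpy as np
from scipy.stats import multivariate_normal

# Parameters of the bivariate example from the figure.
mu = np.array([0.0, 0.0])
sigma1, sigma2, rho = 1.6, 1.2, -0.5
Sigma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at x using the formula above."""
    N = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

x = np.array([0.5, -1.0])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree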

Mean and variance

Let's confirm the interpretations of \(\boldsymbol{\mu}\) and \(\mathbf{\Sigma}\) as the mean vector and the covariance matrix respectively. To check the mean we need to compute\begin{equation*}\Exp[\mathbf{x}]=\frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}}}\int\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\mathbf{x}\,d\mathbf{x}.\end{equation*}This equation is to be understood component by component. As in the verification of the normalisation constant, we first exploit the fact that the covariance matrix is diagonalisable. Thus, using the same notation as above, we can rewrite the right-hand side as \begin{align*}&\int\prod_{i=1}^Np(y_i|\nu_i,\lambda_i)\mathbf{P}\mathbf{y}\,d\mathbf{y}\\&=\mathbf{P}\int\prod_{i=1}^Np(y_i|\nu_i,\lambda_i)\mathbf{y}\,d\mathbf{y}\\&=\mathbf{P}\boldsymbol{\nu}\\&=\boldsymbol{\mu}.\end{align*}That is, \(\Exp[\mathbf{x}]=\boldsymbol{\mu}\). To verify the covariance matrix we need to compute \(\Exp[\mathbf{x}\mathbf{x}^\top]-\boldsymbol{\mu}\boldsymbol{\mu}^\top\). In the same way as for the mean we have \begin{align*}\Exp[\mathbf{x}\mathbf{x}^\top]&=\frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}}}\int\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\mathbf{x}\mathbf{x}^\top\,d\mathbf{x}\\&=\int\prod_{i=1}^Np(y_i|\nu_i,\lambda_i)\mathbf{P}\mathbf{y}\mathbf{y}^\top\mathbf{P}^\top\,d\mathbf{y}\\&=\mathbf{P}\int\prod_{i=1}^Np(y_i|\nu_i,\lambda_i)\mathbf{y}\mathbf{y}^\top\,d\mathbf{y}\,\mathbf{P}^\top\\&=\mathbf{P}\left(\diag(\lambda_i)+\boldsymbol{\nu}\boldsymbol{\nu}^\top\right)\mathbf{P}^\top\\&=\mathbf{\Sigma}+\boldsymbol{\mu}\boldsymbol{\mu}^\top.\end{align*}Thus we have verified that \(\Exp[\mathbf{x}\mathbf{x}^\top]-\boldsymbol{\mu}\boldsymbol{\mu}^\top=\mathbf{\Sigma}\).
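
These moment calculations can also be checked by simulation. Here is a small sketch (assuming NumPy; the particular \(\boldsymbol{\mu}\) and \(\mathbf{\Sigma}\) are arbitrary illustrative choices) that draws samples and compares the sample mean and sample covariance with \(\boldsymbol{\mu}\) and \(\mathbf{\Sigma}\).

import numpy as np

rng = np.random.default_rng(0)

# An arbitrary mean vector and a random positive-definite covariance matrix.
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)

# Draw samples and compare the sample moments with mu and Sigma.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(samples.mean(axis=0))           # ~ mu
print(np.cov(samples, rowvar=False))  # ~ Sigma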

Conditional and marginal Gaussian distributions

Suppose we have an \(N\)-dimensional vector of random variables \(\mathbf{X}\sim\mathcal{N}(\boldsymbol{\mu},\mathbf{\Sigma})\). Let's consider this vector as being made up of two subvectors, \(\mathbf{X}=(\mathbf{X}_a,\mathbf{X}_b)^\top\), of dimensions \(d\) and \(N-d\) respectively. We will consider the conditional, \(p(\mathbf{x}_a|\mathbf{x}_b)\), and marginal, \(p(\mathbf{x}_b)\), probability density functions, finding them both to be Gaussian. Fixing notation, we will consider the mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\mathbf{\Sigma}\) to be partitioned as \(\boldsymbol{\mu}=(\boldsymbol{\mu}_a,\boldsymbol{\mu}_b)^\top\) and \begin{equation*}\mathbf{\Sigma}=\begin{pmatrix}\mathbf{\Sigma}_{aa}&\mathbf{\Sigma}_{ab}\\\mathbf{\Sigma}_{ba}&\mathbf{\Sigma}_{bb}\end{pmatrix}\end{equation*}where \(\mathbf{\Sigma}_{ba}=\mathbf{\Sigma}_{ab}^\top\), \(\mathbf{\Sigma}_{aa}\) is symmetric \(d\times d\) and \(\mathbf{\Sigma}_{bb}\) is symmetric \((N-d)\times (N-d)\). The inverse of the covariance matrix, \(\mathbf{\Sigma}^{-1}\), is called the precision matrix, \(\mathbf{\Lambda}\), the partitioned version of which is \begin{equation*}\mathbf{\Lambda}=\begin{pmatrix}\mathbf{\Lambda}_{aa}&\mathbf{\Lambda}_{ab}\\\mathbf{\Lambda}_{ba}&\mathbf{\Lambda}_{bb}\end{pmatrix}.\end{equation*}It is straightforward to check the general result for the inverse of a partitioned matrix, \begin{equation*}\begin{pmatrix}\mathbf{A}&\mathbf{B}\\\mathbf{C}&\mathbf{D}\end{pmatrix}^{-1}=\begin{pmatrix}\mathbf{M}&-\mathbf{M}\mathbf{B}\mathbf{D}^{-1}\\-\mathbf{D}^{-1}\mathbf{C}\mathbf{M}&\mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1}\end{pmatrix}\end{equation*}where \(\mathbf{M}=(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}\).
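
The partitioned-inverse identity is easy to check numerically. The following sketch (assuming NumPy; the test matrix and the choice \(d=2\), \(N=5\) are arbitrary illustrative values) builds the block inverse from the formula and compares it with a direct inverse.

import numpy as np

rng = np.random.default_rng(1)

# Build a random positive-definite covariance and partition it with d = 2, N = 5.
N, d = 5, 2
L = rng.normal(size=(N, N))
Sigma = L @ L.T + N * np.eye(N)
A, B = Sigma[:d, :d], Sigma[:d, d:]
C, D = Sigma[d:, :d], Sigma[d:, d:]

# Block formula for the inverse of the partitioned matrix.
M = np.linalg.inv(A - B @ np.linalg.inv(D) @ C)
Dinv = np.linalg.inv(D)
block_inverse = np.block([[M, -M @ B @ Dinv],
                          [-Dinv @ C @ M, Dinv + Dinv @ C @ M @ B @ Dinv]])

print(np.allclose(block_inverse, np.linalg.inv(Sigma)))  # True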

We will seek to express the joint (full) density as the product of the conditional and marginal densities, \(p(\mathbf{x}_a,\mathbf{x}_b)=p(\mathbf{x}_a|\mathbf{x}_b)p(\mathbf{x}_b)\). We'll focus first on the conditional factor. We can rewrite the quadratic form in the exponent of the full density as follows,\begin{align*}&(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\\&=(\mathbf{x}_a-\boldsymbol{\mu}_a)^\top\mathbf{\Lambda}_{aa}(\mathbf{x}_a-\boldsymbol{\mu}_a)+(\mathbf{x}_a-\boldsymbol{\mu}_a)^\top\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\\&+(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)+(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Lambda}_{bb}(\mathbf{x}_b-\boldsymbol{\mu}_b)\\&=\mathbf{x}_a^\top\mathbf{\Lambda}_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top(\mathbf{\Lambda}_{aa}\boldsymbol{\mu}_a-\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b))\\&+\boldsymbol{\mu}_a^\top\mathbf{\Lambda}_{aa}\boldsymbol{\mu}_a-2(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Lambda}_{ba}\boldsymbol{\mu}_a+(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Lambda}_{bb}(\mathbf{x}_b-\boldsymbol{\mu}_b).\end{align*} Now, comparing with the quadratic form in the standard multivariate Gaussian density, \begin{equation*}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})=\mathbf{x}^\top\mathbf{\Sigma}^{-1}\mathbf{x}-2\mathbf{x}^\top\mathbf{\Sigma}^{-1}\boldsymbol{\mu}+\text{constant},\end{equation*}we are led to identify the conditional covariance as \(\mathbf{\Sigma}_{a|b}=\mathbf{\Lambda}_{aa}^{-1}\). The conditional mean should then be \(\boldsymbol{\mu}_{a|b}=\boldsymbol{\mu}_a-\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\). These can also be expressed in terms of the blocks of the covariance matrix, \(\mathbf{\Sigma}\): applying the partitioned-inverse result above to \(\mathbf{\Lambda}=\mathbf{\Sigma}^{-1}\) gives \(\mathbf{\Lambda}_{aa}^{-1}=\mathbf{\Sigma}_{aa}-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba}\) and \(\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}=-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\), so that \(\mathbf{\Sigma}_{a|b}=\mathbf{\Sigma}_{aa}-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba}\) and \(\boldsymbol{\mu}_{a|b}=\boldsymbol{\mu}_a+\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b)\).
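
Both pairs of expressions for the conditional mean and covariance can be verified numerically. A minimal sketch (assuming NumPy; the covariance, mean and conditioning value are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(2)

# Random positive-definite covariance and mean, partitioned with d = 2, N = 5.
N, d = 5, 2
L = rng.normal(size=(N, N))
Sigma = L @ L.T + N * np.eye(N)
mu = rng.normal(size=N)
mu_a, mu_b = mu[:d], mu[d:]
S_aa, S_ab = Sigma[:d, :d], Sigma[:d, d:]
S_ba, S_bb = Sigma[d:, :d], Sigma[d:, d:]

# Precision matrix and its blocks.
Lam = np.linalg.inv(Sigma)
L_aa, L_ab = Lam[:d, :d], Lam[:d, d:]

x_b = rng.normal(size=N - d)  # an arbitrary value to condition on

# Conditional covariance: Lambda_aa^{-1} vs the Schur complement of Sigma_bb.
cond_cov_1 = np.linalg.inv(L_aa)
cond_cov_2 = S_aa - S_ab @ np.linalg.inv(S_bb) @ S_ba
print(np.allclose(cond_cov_1, cond_cov_2))  # True

# Conditional mean: precision form vs covariance form.
cond_mean_1 = mu_a - np.linalg.solve(L_aa, L_ab @ (x_b - mu_b))
cond_mean_2 = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
print(np.allclose(cond_mean_1, cond_mean_2))  # True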
Returning to the factorisation of the joint density, we need to consider \begin{equation*}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})-(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})^\top\mathbf{\Sigma}_{a|b}^{-1}(\mathbf{x}_a-\boldsymbol{\mu}_{a|b}).\end{equation*}Let's expand the 'conditional' quadratic form to see what's 'left over'.\begin{align*}(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})^\top\mathbf{\Sigma}_{a|b}^{-1}(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})&=\mathbf{x}_a^\top\mathbf{\Lambda}_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top(\mathbf{\Lambda}_{aa}\boldsymbol{\mu}_{a}-\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_{b}))\\&+(\boldsymbol{\mu}_a-\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b))^\top\mathbf{\Lambda}_{aa}(\boldsymbol{\mu}_a-\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b))\\&=\mathbf{x}_a^\top\mathbf{\Lambda}_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top(\mathbf{\Lambda}_{aa}\boldsymbol{\mu}_{a}-\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_{b}))\\&+\boldsymbol{\mu}_a^\top\mathbf{\Lambda}_{aa}\boldsymbol{\mu}_a-2(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Lambda}_{ba}\boldsymbol{\mu}_a\\&+(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Lambda}_{ba}\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b).\end{align*}We can now see that \begin{equation*}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})=(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})^\top\mathbf{\Sigma}_{a|b}^{-1}(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})+(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top(\mathbf{\Lambda}_{bb}-\mathbf{\Lambda}_{ba}\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab})(\mathbf{x}_b-\boldsymbol{\mu}_b).\end{equation*}But the partitioned-inverse result (applied with the roles of the two blocks swapped) tells us that \(\mathbf{\Lambda}_{bb}-\mathbf{\Lambda}_{ba}\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}=\mathbf{\Sigma}_{bb}^{-1}\) and so, finally, we obtain the pleasing result,\begin{equation*}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})=(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})^\top\mathbf{\Sigma}_{a|b}^{-1}(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})+(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Sigma}_{bb}^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b).\end{equation*}This shows us that the conditional density, \(p(\mathbf{x}_a|\mathbf{x}_b)\), is Gaussian with mean \begin{equation*}\boldsymbol{\mu}_{a|b}=\boldsymbol{\mu}_a-\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)=\boldsymbol{\mu}_a+\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b)\end{equation*}and covariance\begin{equation*}\mathbf{\Sigma}_{a|b}=\mathbf{\Lambda}_{aa}^{-1}=\mathbf{\Sigma}_{aa}-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba},\end{equation*}and that the marginal density \(p(\mathbf{x}_b)\) is Gaussian with mean \(\boldsymbol{\mu}_b\) and covariance \begin{equation*}\mathbf{\Sigma}_b=(\mathbf{\Lambda}_{bb}-\mathbf{\Lambda}_{ba}\mathbf{\Lambda}_{aa}^{-1}\mathbf{\Lambda}_{ab})^{-1}=\mathbf{\Sigma}_{bb}.\end{equation*}We should confirm the normalisations of the conditional and marginal densities.
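The decomposition of the quadratic form can likewise be checked at a random point. A short sketch under the same assumptions (NumPy available, arbitrary test values):

import numpy as np

rng = np.random.default_rng(3)

N, d = 5, 2
L = rng.normal(size=(N, N))
Sigma = L @ L.T + N * np.eye(N)
mu = rng.normal(size=N)
x = rng.normal(size=N)

mu_a, mu_b = mu[:d], mu[d:]
x_a, x_b = x[:d], x[d:]
S_aa, S_ab = Sigma[:d, :d], Sigma[:d, d:]
S_ba, S_bb = Sigma[d:, :d], Sigma[d:, d:]

def quad(v, M):
    """Compute v^T M^{-1} v."""
    return v @ np.linalg.solve(M, v)

# Conditional moments in covariance form.
cond_cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)
cond_mean = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)

# Full quadratic form vs conditional-plus-marginal split.
full = quad(x - mu, Sigma)
split = quad(x_a - cond_mean, cond_cov) + quad(x_b - mu_b, S_bb)
print(np.isclose(full, split))  # True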
Confirming the normalisations is best done by noting that we can write\begin{equation*}\begin{pmatrix}\mathbf{\Sigma}_{aa}&\mathbf{\Sigma}_{ab}\\\mathbf{\Sigma}_{ba}&\mathbf{\Sigma}_{bb}\end{pmatrix}\begin{pmatrix}\mathbf{I}&\mathbf{0}\\-\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba}&\mathbf{I}\end{pmatrix}=\begin{pmatrix}\mathbf{\Sigma}_{aa}-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba}&\mathbf{\Sigma}_{ab}\\\mathbf{0}&\mathbf{\Sigma}_{bb}\end{pmatrix}.\end{equation*}The second factor on the left is block lower triangular with identity diagonal blocks, and so has determinant one, while the right-hand side is block upper triangular. Taking determinants then leads to \begin{equation*}\det\mathbf{\Sigma}=\det(\mathbf{\Sigma}_{aa}-\mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba})\det\mathbf{\Sigma}_{bb}=\det\mathbf{\Sigma}_{a|b}\det\mathbf{\Sigma}_{bb},\end{equation*}which is exactly the relation needed for the normalisation constants to match: \((2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}}=(2\pi)^{d/2}\sqrt{\det\mathbf{\Sigma}_{a|b}}\,(2\pi)^{(N-d)/2}\sqrt{\det\mathbf{\Sigma}_{bb}}\). Thus we have found that \(p(\mathbf{x}_a,\mathbf{x}_b)=p(\mathbf{x}_a|\mathbf{x}_b)p(\mathbf{x}_b)\) where the full density is given by \begin{equation*}p(\mathbf{x}_a,\mathbf{x}_b)=p(\mathbf{x})=\frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),\end{equation*}the conditional density by\begin{equation*}p(\mathbf{x}_a|\mathbf{x}_b)=\frac{1}{(2\pi)^{d/2}\sqrt{\det\mathbf{\Sigma}_{a|b}}}\exp\left(-\frac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})^\top\mathbf{\Sigma}_{a|b}^{-1}(\mathbf{x}_a-\boldsymbol{\mu}_{a|b})\right),\end{equation*}and the marginal density by\begin{equation*}p(\mathbf{x}_b)=\frac{1}{(2\pi)^{(N-d)/2}\sqrt{\det\mathbf{\Sigma}_{b}}}\exp\left(-\frac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^\top\mathbf{\Sigma}_b^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b)\right).\end{equation*}
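
Finally, the determinant identity and the factorisation \(p(\mathbf{x}_a,\mathbf{x}_b)=p(\mathbf{x}_a|\mathbf{x}_b)p(\mathbf{x}_b)\) can themselves be checked numerically. A minimal sketch, again assuming NumPy and SciPy with arbitrary test values:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

N, d = 5, 2
L = rng.normal(size=(N, N))
Sigma = L @ L.T + N * np.eye(N)
mu = rng.normal(size=N)
x = rng.normal(size=N)

mu_a, mu_b = mu[:d], mu[d:]
x_a, x_b = x[:d], x[d:]
S_aa, S_ab = Sigma[:d, :d], Sigma[:d, d:]
S_ba, S_bb = Sigma[d:, :d], Sigma[d:, d:]

# Determinant identity: det(Sigma) = det(Schur complement) * det(Sigma_bb).
schur = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)
print(np.isclose(np.linalg.det(Sigma),
                 np.linalg.det(schur) * np.linalg.det(S_bb)))  # True

# Factorisation of the joint density: p(x_a, x_b) = p(x_a | x_b) p(x_b).
cond_mean = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
joint = multivariate_normal(mu, Sigma).pdf(x)
conditional = multivariate_normal(cond_mean, schur).pdf(x_a)
marginal = multivariate_normal(mu_b, S_bb).pdf(x_b)
print(np.isclose(joint, conditional * marginal))  # True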