From vector calculus we're familiar with the notion of the **gradient** of a function \(f:\RR^n\to\RR\), \(\nabla f\). Assuming Cartesian coordinates, it is the vector of partial derivatives, \((\partial f/\partial x_1,\partial f/\partial x_2,\dots,\partial f/\partial x_n)^\top\), and points in the direction of the greatest rate of increase of \(f\). If we think of \(f\) as a real-valued function of real \(m\times n\) matrices, \(f:\text{Mat}_{m,n}(\RR)\to\RR\), then \(\nabla f\) is the \(m\times n\) matrix of partial derivatives with respect to the entries \(x_{ij}\),\begin{equation*}\nabla f=\begin{pmatrix}\frac{\partial f}{\partial x_{11}}&\frac{\partial f}{\partial x_{12}}&\cdots&\frac{\partial f}{\partial x_{1n}}\\\frac{\partial f}{\partial x_{21}}&\frac{\partial f}{\partial x_{22}}&\cdots&\frac{\partial f}{\partial x_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial f}{\partial x_{m1}}&\frac{\partial f}{\partial x_{m2}}&\cdots&\frac{\partial f}{\partial x_{mn}}\end{pmatrix}.\end{equation*}
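To make the matrix form of the gradient concrete, here is a minimal sketch in Python with NumPy. The example function \(f(X)=\sum_{ij}x_{ij}^2\) and the helper name `numerical_gradient` are illustrative choices, not anything taken from the post. For this \(f\) each partial derivative is \(\partial f/\partial x_{ij}=2x_{ij}\), so the gradient should come out as \(2X\), which we check with central differences.

```python
import numpy as np


def f(X):
    """Example function: the sum of squared entries, f(X) = sum_ij x_ij^2."""
    return float(np.sum(X ** 2))


def numerical_gradient(func, X, h=1e-6):
    """Central-difference estimate of the m x n matrix of partials df/dx_ij."""
    grad = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X, dtype=float)
            E[i, j] = h  # perturb only the (i, j) entry
            grad[i, j] = (func(X + E) - func(X - E)) / (2 * h)
    return grad


X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
grad = numerical_gradient(f, X)

# The gradient has the same shape as X and should be close to 2 * X.
print(grad.shape)
print(np.allclose(grad, 2 * X, atol=1e-5))
```

Note that the result has the same \(m\times n\) shape as the argument, which is exactly the layout in the matrix above.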

If an \(N\)-dimensional random vector \(\mathbf{X}=(X_1,X_2,\dots,X_N)^\top\) is distributed according to a **multivariate Gaussian distribution** with mean vector \begin{equation*}\boldsymbol{\mu}=(\Exp[X_1],\Exp[X_2],\dots,\Exp[X_N])^\top\end{equation*} and covariance matrix \(\mathbf{\Sigma}\) such that \(\Sigma_{ij}=\cov(X_i,X_j)\), we write \(\mathbf{X}\sim\mathcal{N}(\boldsymbol{\mu},\mathbf{\Sigma})\) and, provided \(\mathbf{\Sigma}\) is invertible, the probability density function is given by \begin{equation*}p(\mathbf{x}|\boldsymbol{\mu},\mathbf{\Sigma})=\frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{\Sigma}}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).\end{equation*}
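The density formula translates almost line for line into code. The following is a rough sketch with NumPy, assuming \(\mathbf{\Sigma}\) is symmetric and positive definite; the function name `gaussian_pdf` and the test values are made up for illustration. In practice one would more likely work with the log-density for numerical stability.

```python
import numpy as np


def gaussian_pdf(x, mu, Sigma):
    """Evaluate the N-dimensional Gaussian density at the point x.

    Assumes Sigma is a symmetric positive definite N x N matrix.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    N = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed by solving
    # Sigma z = diff instead of forming the inverse explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm)


# Standard bivariate Gaussian evaluated at its mean: the exponent vanishes
# and det(Sigma) = 1, so the density is 1 / (2 * pi).
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
p0 = gaussian_pdf(mu, mu, Sigma)
print(p0, 1 / (2 * np.pi))
```

For \(N=1\) the same function reduces to the familiar univariate density with variance \(\Sigma_{11}\).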

By way of "warming up", in this and the next two posts we'll review some of the foundational material (mostly from probability and stats) we'll need. This review is in no way comprehensive. On the contrary it is highly selective and brief.

**Bayes' theorem**

Bayes' theorem is absolutely critical so let's kick off by recalling it, \begin{equation*}P(A|B)=\frac{P(B|A)P(A)}{P(B)}.\end{equation*}
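As a quick illustration, here is a small worked example in Python. The numbers are hypothetical (a condition with 1% prevalence, and a test with a 95% true-positive rate and a 5% false-positive rate), and the function name `posterior` is just a convenient label. The denominator \(P(B)\) is expanded using the law of total probability, \(P(B)=P(B|A)P(A)+P(B|\neg A)P(\neg A)\).

```python
from fractions import Fraction


def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
    return likelihood * prior / evidence


# Hypothetical numbers: P(A) = 1/100, P(B|A) = 95/100, P(B|not A) = 5/100.
p = posterior(Fraction(1, 100), Fraction(95, 100), Fraction(5, 100))
print(p, float(p))  # 19/118, roughly 0.161
```

Even with a fairly accurate test, the posterior probability is only about 16% because the prior \(P(A)\) is so small.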

Driving the huge demand for maths skills at the moment is data, big data. The rate of increase in the production of data over the last 20-30 years defies hyperbole. By 2020, it is estimated that our data production will be equivalent to every one of the 7.5 billion or so people on earth each producing 1.7 MB of data *every second*! Reassuringly, less than 1% of that data is currently being analysed, but given the scale of information being recorded it's no surprise that the "sexiest job of the 21st century", across a broad range of industries, is data scientist, a job that simply didn't exist when I was an undergraduate. Data science is a broad and, frankly, pretty dull term covering a host of sub-disciplines with super-cool, borderline dystopian sounding names such as artificial intelligence, machine learning, deep learning and neural networks.

40-50 years ago, if you knew enough maths to be able to teach, say, calculus and basic probability theory, and you weren't considering an academic pathway, then becoming a teacher was a pretty attractive prospect. Essentially, the alternative was to become an actuary. So plenty of people who really knew maths became maths teachers. More recently, first the finance industry and then, hot on its heels, tech have become veritable sponges for graduate maths talent. The result? Fewer people who really know maths are becoming maths teachers. So just as demand for higher maths skills is escalating, the talent going into teaching the crucial foundations for those skills is declining!
