# Learning Machine Learning 1 - Some probability and stats

By way of "warming up", in this and the next two posts we'll review some of the foundational material (mostly from probability and stats) we'll need. This review is in no way comprehensive. On the contrary it is highly selective and brief.

**Bayes' theorem**

Bayes' theorem is absolutely critical so let's kick off by recalling it, \begin{equation*}P(A|B)=\frac{P(B|A)P(A)}{P(B)}.\end{equation*}On the left hand side we have \(P(A|B)\) the probability of \(A\) given \(B\). In the numerator on the right hand side we have, \(P(B|A)P(A)\), the probability of \(B\) given \(A\) times \(P(A)\) the probability of \(A\). That's divided by \(P(B)\), the probability of \(B\). An important perspective on this relation is that it is telling how to update a prior probability \(P(A)\) given new information \(B\). In this context we could think of \(B\) as being some observations and \(A\) as being some model parameters. Then, up to normalisation (by \(P(B)\)), Bayes is saying that the updated probability of the model parameters given the observations \(B\), \(P(A|B)\), is proportional to the product of the **prior** probability \(P(A)\) and the **likelihood**, \(P(B|A)\), of the observed data given the model parameters. We'll have much more to say on this perspective soon. For now let's consider a couple of classic examples of Bayes in action.

**Example **[The base rate fallacy] From Wikipedia: *"A group of police officers have breathalysers displaying false drunkenness in 5% of the cases in which the driver is sober. However, the breathalyzers never fail to detect a truly drunk person. One in a thousand drivers is driving drunk. Suppose the police officers then stop a driver at random, and force the driver to take a breathalyzer test. It indicates that the driver is drunk. We assume you don't know anything else about him or her. How high is the probability he or she really is drunk?"* OK, so a common response is something like 95%. But let's use Bayes' theorem to see if this is even remotely accurate. Let's start turning our story into maths. We are told that \(P(+|D)=1\), in other words, the probability of a positive test given a drunk driver is 1. We're also told that \(P(D)=0.001\), one in a thousand drivers is drunk, and \(P(+|D')=0.05\), the probability of a positive test given a sober driver. What we want to know is \(P(D|+)\), the probability of driver being drunk given that he has tested positive. Bayes tells us that \[P(D|+)=\frac{P(+|D)P(D)}{P(+)}\] and to find \(P(+)\) we can use that \begin{align*}P(+)&=P(+|D)P(D)+P(+|D')P(D')\\&=1\times0.001+0.05\times0.999\\&=0.05095\end{align*} so that \begin{equation*}P(D|+)=\frac{0.001}{0.05095}\approx0.02.\end{equation*} So in fact the correct probability that a random driver who tests positive is actually drunk (given this particular test) is just 2%! I've no doubt that real breathalysers are far superior to the ones used to illustrate the story here. But the moral is a generally applicable one. It is difficult, even with a good test, to screening for rare diseases or, indeed, behaviours.

**Example** [The Monte Hall Problem] There is a prize behind one of three doors. The contestant picks a door and then one of the remaining doors swings open revealing no prize behind that one. The contestant is then asked if they prefer to stick with their original choice or to switch to the remaining (closed) door. The classic response is that there is a 50:50 chance so no point switching doors. Let's analyse this with Bayes. For convenience, let's label the three doors \(A\), \(B\) and \(C\) and suppose the contestant chooses \(A\) with door \(B\) swinging open, an event we'll label \(BO\). We then want to compute \(P(A|BO)\) and \P(C|BO)\) to decide whether to stick or switch. Now it should be clear that \(P(A)=P(B)=P(C)=1/3\). Consider \P(BO)\), the probability that door \(B\) swings open. To compute this we note that \begin{align*}P(BO)&=P(BO|A)P(A)+P(BO|B)P(B)+P(BO|C)P(C)\\&=\frac{1}{2}\times\frac{1}{3}+0\times\frac{1}{3}+1\times\frac{1}{3}\\&=\frac{1}{2}\end{align*}

Now we're ready to use Bayes to compute

\begin{equation*}P(A|BO)=\frac{P(BO|A)P(A)}{P(BO)}=\frac{\frac{1}{2}\times\frac{1}{3}}{\frac{1}{2}}=\frac{1}{3}\end{equation*}

and

\begin{equation*}P(C|BO)=\frac{P(BO|C)P(C)}{P(BO)}=\frac{1\times\frac{1}{3}}{\frac{1}{2}}=\frac{2}{3}\end{equation*}

The conclusion is clear. We must switch.

**The mean and variance and other moments**

We recall that the expected value, \(\Exp[X]\), of a random variable \(X\) is its **mean**, or (probability weighted) average, often denoted \(\langle X\rangle\) or with the greek letter, \(\mu_X\). For a discrete distribution with \(N\) outcomes \(X_i\) and corresponding probabilities \(p_i\) (such that \(\sum_{i=1}^Np_i=1\)) we might write \begin{equation*}\Exp[X]=\langle X\rangle=\mu_X=\sum_{i=1}^NX_ip_i.\end{equation*} For a random variable drawn from a continuous distribution with probability density \(p_X\) we'd have \begin{equation*}\Exp[X]=\int xp_X(x)\,dx\end{equation*} (Note that we'll use the notation \(p_X(x)\) in both discrete and continuous cases. In discrete cases this is typically known as the probability mass function.) The mean is just the first moment of a distribution. The \(n\)'th moment is given by \(\Exp[X^n]\) and the \(n\)th central moment is \(\Exp[(X-\mu_X)^n]]\). The second central moment is called the **variance**, often denoted \(\var(X)\) or \(\sigma_X^2\), with \(\sigma_X\) being the **standard deviation**. It's not difficult to see that \begin{equation*}\sigma_X^2=\langle X^2\rangle -\langle X\rangle^2\end{equation*}

The **cummulative distribution function** of a random variable drawn from a distribution with probability density \(p_X(x)\) is \begin{equation*}P(X\leq x)=\int_{-\infty}^xp_X(t)\,dt.\end{equation*}

**Joint distributions - covariance, correlation and independence**

If \(X\) and \(Y\) are two discrete random variables then we can consider their joint probability distribution \(p_{X,Y}(x,y)\), the collection of joint probabilities, which must of course be such that \(\sum_{x,y}p_{X,Y}(x,y)=1\). In the continuous random variable case this total probability rule would correspond to the statement that \begin{equation*}\int\int p_{X,Y}(x,y)\,dx\,dy=1.\end{equation*} Given a joint probability density \(p_{X,Y}(x,y)\) we obtain the **marginal densities**, \(p_X(x)\) and \(p_Y(y)\), as \begin{equation*}p_X(x)=\int p_{X,Y}(x,y)\,dy\end{equation*} and \begin{equation*}p_Y(y)=\int p_{X,Y}(x,y)\,dx\end{equation*}

The **covariance** of two random variables \(X\) and \(Y\), \(\cov(X,Y)\), is defined to be \begin{equation*}\cov(X,Y)=\Exp[(X-\mu_X)(Y-\mu_Y)]=\Exp[XY]-\mu_X\mu_Y\end{equation*}In particular, \(\cov(X,X)=\var(X)\) and we can combine the variances and covariances into a single (symmetric) **covariance matrix**, \(\Sigma(X,Y)\),\begin{equation*}\Sigma(X,Y)=\begin{pmatrix}\var(X)&\cov(X,Y)\\\cov(X,Y)&\var(Y)\end{pmatrix}\end{equation*}This generalises in the obvious way to an \(n\)-dimensional vector \(\mathbf{X}\in\RR^n\) of random variables.

Recall that two events, \(A\) and \(B\), are **independent** if and only if their joint probability is the product of their individual probabilities, \(P(A,B)=P(A\cap B)=P(A)P(B)\). In terms of distributions or probability densities, two random variables are independent if and only if \(p_{X,Y}(x,y)=p_X(x)p_Y(y)\). In this case \(\cov(X,Y)=0\) since \begin{align*}\cov(X,Y)&=\int\int(X-\mu_X)(Y-\mu_Y)p_{X,Y}(x,y)\,dx\,dy\\&=\int(X-\mu_X)p_X(x)\,dx\int(Y-\mu_Y)p_Y(y)\,dy=0.\end{align*} Thus, independence implies vanishing off-diagnonals of the covariance matrix but the converse is not true. As an example consider a random variable \(X\) which can take values \(\pm1\) with probability 0.5 and a second random variable \(Y\) which is such that \(Y=0\) if \(X=-1\) and \(Y=\pm1\) with probability 0.5 if \(X=1\). Then it's straightforward to check \(\Exp[X]=0\) and \(\Exp[Y]=0\) and \(\cov(X,Y)=\Exp[XY]=0\) but they are clearly dependent.

Given the covariance of two random variables, \(\cov(X,Y)\), the **correlation**, \(\rho(X,Y)\), is the dimensionless quantity taking values between \(-1\) and \(1\) defined as \begin{equation*}\rho(X,Y)=\frac{\cov(X,Y)}{\sqrt{\var(X)\var(Y)}}.\end{equation*} The covariance matrix can then be written as \begin{equation*}\begin{pmatrix}\var(X)&\rho(X,Y)\sqrt{\var(X)\var(Y)}\\\rho(X,Y)\sqrt{\var(X)\var(Y)}&\var(Y)\end{pmatrix}\end{equation*}

Let's now look at some specific distributions.

**Bernoulli distribution**

Suppose a random variable \(X\) can take just two values, \(X\in\{0,1\}\), and that the probability that \(X=1\) is \(p\) and so the probability of \(X=0\) is \(1-p\). This is often called a Bernoulli trial. The probability mass function is simply \(p_X(x)=p^x(1-p)^{(1-x)}\). It is straightforward to verify that the mean is \(p\) and the variance is given by \(p(1-p)\).

**Binomial distribution**

In a Bernoulli trial suppose we regard the outcome 1, which has probability \(p\) as a success and 0 as a fail. Now consider a sequence of \(n\) Bernoulli trials. We can consider the random variable which is the number of successes in this sequence of \(n\) trials. It's not difficult to see what the probability mass function must be in this case. The probability of a single outcome of \(k\) success, and hence \(n-k\) fails, is \(p^k(1-p)^{n-k}\). This must be multiplied by the total number of ways of selecting \(k\) distinct combinations of \(n\) objects, irrespective of order, that is, \(\binom{n}{k}\). Thus we have \begin{equation*}p(k;n,p)=\binom{n}{k}p^k(1-p)^{n-k}.\end{equation*}The binomial theorem, \begin{equation*}(x+y)^n=\sum_{k=0}^n\binom{n}{k}x^ky^{n-k},\end{equation*} confirms that the probabilities sum to 1. Indeed it is convenient to introduce \(q=1-p\). Then \begin{equation*}\sum_{k=0}^n\binom{n}{k}p^kq^{n-k}=(p+q)^n=1.\end{equation*} A neat way of computing the mean is then to notice that \begin{align*}\Exp[k]&=\sum_{k=0}^nk\binom{n}{k}p^kq^{n-k}\\&=p\frac{\partial}{\partial p}\sum_{k=0}^n\binom{n}{k}p^kq^{n-k}\\&=p\frac{\partial}{\partial p}(p+q)^n=pn.\end{align*} That is, the mean of the binomial distribution is \(pn\), \(\Exp[k]=pn\). We can play a similar game to compute the variance, \(\langle k^2\rangle-\langle k\rangle^2\). Thus, \begin{align*}\langle k^2\rangle&=p\frac{\partial}{\partial p}\left(p\frac{\partial}{\partial p}(p+q)^n\right)\\&=pn+p^2n^2-p^2n,\end{align*} so that \(\var(k)=p(1-p)n\).

**Poisson distribution**

The Poisson distribution is often used to model the number of times an event occurs in a specified time or space. The probability mass function is given by \begin{equation*}p(k;\lambda)=e^{-\lambda}\frac{\lambda^k}{k!}.\end{equation*} We can understand the Poisson distribution as a limit of the binomial distribution. Suppose we were considering the number of claims made to an insurance company in a certain timeframe, \(t\). Suppose we divide \(t\) up into a certain number, say \(n\), intervals and take the probability of a claim (assumed equally likely throughout the year) in an interval to be \(p\). Then the distribution of the number \(k\) of claims is binomial. Now let's consider dividing the time \(t\) into smaller and smaller intervals. The probability of a claim should be proportional to the size of the interval so that \(p=\lambda/n\) for some constant \(\lambda\) (I've absorbed the time interval into this constant). Note that this means \(\lambda=np\), in other words, \(\lambda\) is the average number of claims in the time \(t\). In the limit as \(n\to\infty\), the claims are thus distributed according to the probability mass function \begin{align*}&\lim_{n\to\infty}\frac{n(n-1)\cdots(n-k+1)}{k!}\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k}\\&=

\lim_{n\to\infty}\frac{n^k+\mathcal{O}(n^{k-1})}{k!}\frac{\lambda^k}{n^k}\left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-k}\\&=e^{-\lambda}\frac{\lambda^k}{k!}.\end{align*} It is easy to see that the total probability is 1 and is straightforward to check that both the mean and variance are \(\lambda\), \(\langle k\rangle=\lambda=\langle k^2\rangle-\langle k\rangle^2\).

**Uniform distribution**

The probability density function of the uniform distribution is \begin{equation*}p(x)=\begin{cases}\frac{1}{b-a}&a\leq x\leq b\\0&x<a\text{ or }x>b\end{cases}.\end{equation*}Clearly \(\int p(x)\,dx=1\), the mean is given by \begin{equation*}\Exp[x]=\int_a^b\frac{x}{b-a}\,dx=\frac{a+b}{2},\end{equation*}and the variance by \begin{equation*}\Exp[x^2]-\Exp[x]^2=\int_a^b\frac{x^2}{b-a}\,dx-\frac{(a+b)^2}{4}=\frac{1}{12}(b-a)^2.\end{equation*}

**Interlude (1) - The law of large numbers**

Suppose that \(\{X_1,\dots,X_n\}\) are drawn from a distribution with mean \(\mu\) and variance \(\sigma^2\). The **(strong) law of large numbers** says that** **if \(\bar{X}_n\) is the sample mean then with probability 1 the limit, \(\lim_{n\to\infty}\bar{X}_n\), is \(\mu\).

**The Gaussian (normal) distribution**

The probability density function of the Gaussian distribution of mean \(\mu\) and variance \(\sigma^2\) is denoted \(p(x|\mu,\sigma^2)\) and given by \begin{equation*}p(x|\mu,\sigma^2)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right).\end{equation*}This distribution occurs so often that it is also called the **normal distribution**. If a random variable \(X\) is distributed normally with mean \(\mu\) and variance \(\sigma^2\) it is common to express this fact as \(X\sim\mathcal{N}(\mu,\sigma^2)\). The **standard normal distribution** has mean 0 and variance 1 and it is convenient to work with this for the purposes of verifying normalisation and calculating moments. Indeed, the fact that \begin{equation*}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-\frac{1}{2}x^2}\,dx=1\end{equation*} can be seen as follows. Consider the square of the integral, \begin{align*}\left(\int_{-\infty}^\infty e^{-x^2}\,dx\right)^2&=\int_{-\infty}^\infty e^{-x^2}\,dx\int_{-\infty}^\infty e^{-y^2}\,dy\\&=\int_{-\infty}^\infty\int_{-\infty}^\infty e^{-(x^2+y^2)}\,dx\,dy.\end{align*} Considering this as an integral over the Cartesian plane we now switch to polar coordinates, \(x=r\cos\theta\) and \(y=r\sin\theta\), to obtain \begin{align*}\int_0^\infty dr\int_0^{2\pi}d\theta re^{-r^2}&=2\pi\int_0^\infty re^{-r^2}\,dr\\&=\pi\int_0^\infty e^{-u}\,du\\&=\pi.\end{align*}Thus we have obtained that \begin{equation*}\int_{-\infty}^\infty e^{-x^2}\,dx=\sqrt{\pi}\end{equation*}and the desired result now follows from a simple change of variable. For any odd \(n\) we have \begin{equation*}\int_{-\infty}^\infty x^ne^{-\frac{1}{2}x^2}\,dx=0\quad n=1,3\dots\end{equation*} since the integrand is a product of an even and odd function. In particular we confirm that the mean of \(X\sim\mathcal{N}(0,1)\) is zero, \(\Exp[X]=0\). To obtain the result for even \(n\) we first observe that \begin{equation*}\int_{-\infty}^\infty e^{-\alpha x^2}\,dx=\sqrt{\pi}\alpha^{-1/2}.\end{equation*}We can now differentiate this with respect to \(\alpha\) to obtain \begin{equation*}\int_{-\infty}^\infty x^2e^{-\alpha x^2}\,dx=\frac{1}{2}\sqrt{\pi}\alpha^{-3/2}.\end{equation*} Differentiating again we obtain \begin{equation*}\int_{-\infty}^\infty x^4e^{-\alpha x^2}=\frac{1\cdot3}{2^2}\sqrt{\pi}\alpha^{-5/2}.\end{equation*} The general result is now clear, \begin{equation*}\int_{-\infty}^\infty x^ne^{-\alpha x^2}=\frac{1\cdot3\cdots(n-1)}{2^{n/2}\alpha^{(n+1)/2}}\sqrt{\pi}\quad n=0,2,4,\dots\end{equation*}so that, in particular, \begin{equation*}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty x^ne^{-\frac{1}{2}x^2}=1\cdot3\cdots(n-1)\quad n=0,2,4,\dots\end{equation*}This confirms that the variance of \(X\sim\mathcal{N}(0,1)\) is 1 since \(\Exp[x^2]=1\).

Having checked that \(\int p(x|0,1)dx=1\), a simple change of variable can be used to check that \(\int p(x|\mu,\sigma^2)dx=1\). The cumulative distribution function of the standard normal distribution is typically denoted \(\Phi(x)\) with \begin{equation*}P(X\leq x)=\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^xe^{-\frac{1}{2}t^2}\,dt.\end{equation*}For \(X\sim\mathcal{N}(\mu,\sigma^2)\) we have \begin{equation*}P(X\leq x)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^x\exp\left(-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2\right)\,dt.\end{equation*}If \(X\sim\mathcal{N}(0,1)\) then consider the random variable \(Y=\sigma X+\mu\). We will verify that \(Y\sim\mathcal{N}(\mu,\sigma^2)\). \begin{align*}P(Y\leq y)&=P(\sigma X+\mu\leq y)\\&=P\left(X\leq\frac{y-\mu}{\sigma}\right)\\&=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{y-\mu}{\sigma}}e^{-\frac{1}{2}t^2}\,dt\\&=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^y\exp\left(-\frac{1}{2}\left(\frac{s-\mu}{\sigma}\right)^2\right)\,ds.\end{align*}

**Interlude (2) - The central limit theorem**

A reason for its ubiquity of the normal distribution is provided by the central limit theorem which says (continuing from the previous interlude) that for large \(n\) the sample mean is distributed approximately normally with mean \(\mu\) and variance \(\sigma^2\), that is, \(\bar{X}_n\sim\mathcal{N}(\mu,\sigma^2)\). More precisely, \begin{equation*}\lim_{n\to\infty}P\left(\frac{\bar{X}_n-\mu}{\sigma/\sqrt{n}}\leq x\right)=\Phi(x).\end{equation*}

**The gamma distribution**

The probability density function of the gamma distribution with shape \(\alpha\) and rate \(\beta\) is \begin{equation*}p(x|\alpha,\beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x},\end{equation*}with the gamma function defined by \begin{equation*}\Gamma(\alpha)=\int_0^\infty x^{\alpha-1}e^{-x}\,dx.\end{equation*}It is straightforward to verify that this density integrates to 1, that its mean is \(\alpha/\beta\) and its variance is \(\alpha/\beta^2\).

Considering the change of (random) variables, \(Y=1/X\), we find the probability density function for \(Y\) to be \begin{align*}p_Y(y|\alpha,\beta)&=p(x(y)|\alpha,\beta)\left\lvert\frac{dx}{dy}\right\rvert\\&=\frac{\beta^\alpha}{\Gamma(\alpha)}y^{-\alpha-1}e^{-\beta/y},\end{align*}the density function for the **inverse gamma distribution**.

**The beta distribution**

The probability density function of the beta distribution is defined as \begin{equation*}p(x|\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\end{equation*}where \begin{equation*}B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.\end{equation*}The mean is given by\begin{equation*}\Exp[X]=\frac{\alpha}{\alpha+\beta}\end{equation*} and the variance by\begin{equation*}\var[X]=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.\end{equation*}