Learning Machine Learning 4 - Linear regression, gradient descent and feature normalization


House price data from Portland - a first encounter with MatLab

The CS229 course kicks off with Andrew Ng introducing some data which will be used to illustrate different algorithms. If you sign up for his Coursera course, this data is provided as part of one of the early programming assignments. I've never used MatLab before but it proved really straightforward to load the data, convert the table to an array, extract the vectors of data and produce this plot of prices against living area:


In the context of machine learning a collection of data like this is called a training set, a set of pairs \(\{(x^{(i)},y^{(i)});i=1\dots m\}\) where the \(x^{(i)}\) are called features and the \(y^{(i)}\) are called targets. It will typically be the case that the lowercase \(m\) will be used for the size of the dataset. The total number of features (in the housing price example the dataset also contains number of rooms) will be denoted by the lowercase \(n\). A given input in a training set would be denoted \(x_i^{(j)}\) - the \(i\)th feature in the \(j\)th training set example. In supervised learning, such a training set is fed into a learning algorithm to learn an hypothesis function from the space of features to the space of targets. In the housing price example this would be a function \(h:\mathbb{R}\to\mathbb{R}\) which we should then be able to use to predict house values based on a given living area. When the target variable is continuous the learning problem is called a regression problem in contrast to the case of a discrete target variable in which case its called a classification problem.  

Linear regression

Let's suppose we have \(n\) features. In a linear regression learning algorithm we try to find a function \(h:\mathbb{R}^{n+1}\to\mathbb{R}\) given by \begin{equation*}h(\mathbf{x})=\sum_{i=0}^n\theta_i x_i=\boldsymbol{\theta}^\top\mathbf{x}=\mathbf{x}^\top\boldsymbol{\theta}\end{equation*}in which we take \(x_0=1\) so that both \(\boldsymbol{\theta}\) and \(\mathbf{x}\) are \((n+1)\)-dimensional vectors. Our job is to learn the vector of parameters, \(\boldsymbol{\theta}\). The natural approach is to demand that in some well defined sense \(h\) performs "as well as possible" on our training set. Defining  \(\mathbf{X}\in\text{Mat}_{m,n+1}(\mathbb{R})\) by \(X_{ij}=x^{(i)}_j\) and gathering the targets \(y^{(i)}\) into the obvious column vector \(\mathbf{y}\in\mathbb{R}^m\), the "perfect" solution is that \(\mathbf{X}\boldsymbol{\theta}=\mathbf{y}\). In other words, that \(h(\mathbf{x}^{(i)})=y^{(i)}\) for \(i=1\dots m\). This is of course not generally possible so, as discussed, we obtain the approximate solution through least squares by minimising the average of the squares of the errors, what we'll here call a cost function, \(J(\boldsymbol{\theta})\), defined as \begin{equation*}J(\boldsymbol{\theta})=\frac{1}{2m}\sum_{i=1}^m(h(\mathbf{x}^{(i)})-y^{(i)})^2=\frac{1}{2m}(\mathbf{X}\boldsymbol{\theta}-\mathbf{y})^\top(\mathbf{X}\boldsymbol{\theta}-\theta)=\frac{1}{2m}\left(\boldsymbol{\theta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\theta}-2\mathbf{y}^\top\mathbf{X}\boldsymbol{\theta}+\mathbf{y}^\top\mathbf{y}\right).\end{equation*}We can compute the gradient of \(J(\boldsymbol{\theta})\), \begin{equation*}\nabla J(\boldsymbol{\theta})=\frac{1}{m}\left(\mathbf{X}^\top\mathbf{X}\boldsymbol{\theta}-\mathbf{X}^\top\mathbf{y}\right),\end{equation*}and set it equal to zero to obtain the normal equations, \begin{equation*}\boldsymbol{\theta}=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}.\end{equation*}

In the CS229 lectures this analytic solution is in fact only considered later with different flavours of an approximate solution to the minimisation (known as gradient descent and which we'll consider presently) explored beforehand. One potential issue with the normal equations solution is that matrix inversion is computationally costly for large \(n\) (the matrix inversion here is of a \((n+1)\times(n+1)\) matrix). 

Gradient descent

A non-analytic approach to minimising \(J(\boldsymbol{\theta})\) is called the gradient descent algorithm and exploits the fact that, as previously noted, the gradient of a function, \(\nabla f\), points in the direction of greatest rate of change of \(f\). Thus, the gradient descent algorithm works by initialising the vector \(\boldsymbol{\theta}\) appropriately and then updating it, iteratively, according to \begin{equation*}\boldsymbol{\theta}:=\boldsymbol{\theta}-\alpha\nabla J(\boldsymbol{\theta}).\end{equation*}Notice here the notation \(:=\) meant to indicate (without the need for more indices) that \(\boldsymbol{\theta}\) is being updated based on its old value and that we have a minus sign here to ensure we are taking steps in the direction of steepest descent. The parameter \(\alpha\) is called the learning rate and controls the size of the step. The minimisation problem we're applying gradient descent to here is relatively straightforward since, as previously discussed, the function \(J(\boldsymbol{\theta})\) is convex and so we know that the local minimum is the global minimum. Even so, if the learning rate is too big the algorithm may never converge and if it is too small it may take too long to converge. Using the expression for the gradient we then have \begin{equation*}\boldsymbol{\theta}:=\boldsymbol{\theta}+\frac{\alpha}{m}\mathbf{X}^\top(\mathbf{y}-\mathbf{X}\boldsymbol{\theta}).\end{equation*}Note that the magnitude of the update is proportional to the error term, \(\mathbf{y}-\mathbf{X}\boldsymbol{\theta}\). The variant of gradient descent presented here is called batch gradient descent since the parameters are updated by considering all training examples on each iteration, with iterations continuing until some suitable measure of convergence is satisfied. Since at the minimum \(\nabla J(\boldsymbol{\theta})=0\) a natural criterion for convergence would be \(|\nabla J(\boldsymbol{\theta})|<\epsilon\) for some suitably small \(\epsilon\). In practice, I chose to record the values of the cost function at each iteration and confirm convergence by eyeballing the corresponding graph.

One thing I quickly realised was the importance of normalising the features. In the Coursera course this is mentioned, but only in the context of multiple features. In fact even with a single feature it can be important because of the scale discrepancy between the 'zeroth' dummmy feature, \(x_0^{(i)}=1\), and the actual feature, in this case, \(x_1^{(i)}\approx2500\). Here's a look at the results of running batch gradient descent without feature normalisation, with a learning rate of \(\alpha=10^{-9}\) (in this example, without feature normalisation it turned out that no convergence is possible with a learning rate bigger than \(\approx1\times10^{-7}\)!) and with the number of iterations set to 5000.


The first of these confirm that convergence was achieved. The surface plot makes clear the unbalanced nature of the \(\theta_0\) and \(\theta_1\) dependencies, itself a consequence of the fact that \(x_1\) is three orders of magnitude larger than the dummy feature \(x_0=1\). On the contour plots I've superimposed the path to convergence (red dots) starting from different theta initialisations. They reveal that once we're "in the valley" we're unable to progress towards the true minimum marked by the red cross. The gradient vector is too weak along the base of the valley and simultaneously strong in the perpendicular direction with the result that where we converge to depends strongly on how we initialise \(\theta\). The second contour plot shows the same procedure starting from \(\theta_0=\theta_1=100\):


Finally, here's the plot showing the data points with the hypothesis computed analytically through the normal equations as well as through the unnormalised batch gradient descent algorithm.

Normalising features can be achieved by subtracting the mean and dividing by the standard deviation. Thus, for the \(i\)th feature we'd replace the \(j\)th training set example by \begin{equation*}\tilde{x}_i^{(j)}=\frac{x_i^{(j)}-\mu_i}{\sigma_i},\end{equation*} where the mean \(\mu_i\) and standard deviation \(\sigma_i\) are calculated over the \(m\) training set examples for the \(i\)th feature.

Having normalised the data we can then re-run the batch gradient descent algorithm. It's worth noting that in this case we derive the parameters of an hypothesis function \(h(\tilde{x})=\theta_0+\theta_1\tilde{x}\) which takes normalised inputs \(\tilde{x}=(x-\mu)/\sigma\). From this we obtain an hypothesis function \begin{equation*}f(x)=h\left(\frac{x-\mu}{\sigma}\right)=\theta_0-\frac{\theta_1\mu}{\sigma}+\frac{\theta_1}{\sigma}x,\end{equation*} on unnormalised features. The performance improvement is dramatic - we require a learning rate of just 0.01 and only 500 iterations to achieve convergence to the analytic solution. Here are the corresponding charts.


Another variant of gradient descent is stochastic gradient descent. In this case, rather than update theta based on the entire training set at each iteration, we update theta on the basis of each training set example in turn within each iteration. Compare MatLab code for batch gradient descent:

function [theta,theta_history, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1); % we record the values of the cost function
theta_history=zeros(2,num_iters); % also record the values of theta
for iter = 1:num_iters
    theta=theta-alpha*(1/m)*X'*(X*theta-y); % theta update
    J_history(iter) = computeCost(X, y, theta);

and the corresponding code for stochastic gradient descent:

function [theta,theta_history, J_history] = stochgradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters*m, 1); % we record the values of the cost function
theta_history=zeros(2,num_iters*m); % also record the values of theta
XT=X'; % need to take transpose here since can't do X'(:,i)
for iter = 1:num_iters
   for i=1:m
       theta=theta-alpha*(1/m)*XT(:,i)*(X(i,:)*theta-y(i)); % theta update  
       J_history(count) = computeCost(X, y, theta);

Clearly, stochastic gradient descent is expected to be particularly useful if the number of training examples \(m\) is large. Though it may not converge to the global minimum, it will typically be good enough and indeed can be improved by allowing the learning rate to decrease to zero as the algorithm runs.

Applied to the housing price data I found it was possible to reduce the learning rate to 0.1 and the number of iterations to 50 and still obtain good results. Here's the contour plot and a zoomed-in version showing how convergence to the global minimum isn't quite achieved.


Matlab has a built in minimisation routine called fminunc which I’ll use in future examples instead of batch gradient descent. It is used as follows. Having defined a function to calculate the cost function and gradient, for example,

function [cost,grad] = cost_grad(X, y, theta)  [m,n] =size(X);   cost = 0;  grad=zeros(n,1);  cost=(1/(2*m))*(X*theta-y)'*(X*theta-y);  grad=(1/m)*X'*(X*theta-y); end

in the case of linear regression, we can call fminunc as follows:

options = optimoptions('fminunc',...
'SpecifyObjectiveGradient',true,'PlotFcn','optimplotfval');     [theta_fminunc, cost_fminunc] = ... 	fminunc(@(t)(cost_grad(X, y,t)), theta, options);


In the options object defined here I’m specifying that I’ll pass the gradient to fminunc and also that I’d like to see a plot of the cost function at each iteration. The cost_grad function is passed to the fminunc function as a function handle hence the @ symbol. The plot of the cost function shows that in this example just 4 iterations were needed!

Andrew JacobsComment