Least Squares and Normal Distributions
The method of least squares estimates the coefficients of a model function by minimizing the sum of the squared errors between the model and the observed values. In this post, I show the derivation of the parameter estimates for a linear model. In addition, I show that maximum likelihood estimation yields the same parameter estimates as least squares estimation when we assume the errors are normally distributed.
Least Squares Estimation
Suppose we have a set of two-dimensional data points that we observed by measuring some kind of phenomenon:
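One way to write down these observations is as a set of pairs, with the horizontal values denoted by $x$ and the measured vertical values denoted by $y$:

$$
(x_1, y_1), \, (x_2, y_2), \, \dots, \, (x_n, y_n)
$$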
Also, suppose that the measuring device is inaccurate. The data we observed contain errors for values on the vertical axis. Despite the errors, we know the correct readings fall somewhere on a line given by the following linear equation:
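Writing the unknown intercept as $\beta_0$ and the unknown slope as $\beta_1$, the line takes this form:

$$
y = \beta_0 + \beta_1 x
$$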
For each data point, we can compute the error as the difference between the observed value and the correct value according to the model function:
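$$
\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)
$$

Here $\varepsilon_i$ stands for the error associated with the $i$-th data point.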
The errors can be positive or negative. Taking the square of each error always yields a positive number. We can define the sum of the squared errors like this:
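$$
S = \sum_{i=1}^{n} \varepsilon_i^2
$$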
Since the coefficients are unknown variables, we can treat the sum of the squared errors as a function of the coefficients:
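$$
S(\beta_0, \beta_1) = \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2
$$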
To estimate the coefficients of the model function using the least squares method, we need to figure out what values for the coefficients give us the smallest value for the sum of the squared errors. We can find the minimum by first taking the partial derivative of the sum of squares function with respect to each of the coefficients, setting the derivative to zero, and then solving for the coefficient. Here are the derivatives with respect to each coefficient:
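$$
\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)
$$

$$
\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \big( y_i - (\beta_0 + \beta_1 x_i) \big)
$$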
Setting the derivative with respect to the first coefficient to zero, we get the following result:
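$$
\sum_{i=1}^{n} y_i - n \beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0
$$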
Rearranging the equation and solving for the coefficient:
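$$
\beta_0 = \frac{1}{n} \left( \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i \right)
$$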
Setting the derivative with respect to the second coefficient to zero, we get the following result:
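$$
\sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0
$$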
Rearranging the equation and solving for the coefficient:
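$$
\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2}
$$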
Each one of the coefficients is given in terms of the other. Since there are two equations and two unknowns, you can substitute one equation into the other to solve for both coefficients. Another way to do this might be to treat the results as a system of linear equations arranged as follows:
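$$
\begin{aligned}
n \beta_0 + \left( \sum_{i=1}^{n} x_i \right) \beta_1 &= \sum_{i=1}^{n} y_i
\\
\left( \sum_{i=1}^{n} x_i \right) \beta_0 + \left( \sum_{i=1}^{n} x_i^2 \right) \beta_1 &= \sum_{i=1}^{n} x_i y_i
\end{aligned}
$$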
The coefficients can then be found by solving the following matrix equation:
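$$
\begin{bmatrix}
n & \sum_{i=1}^{n} x_i
\\
\sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\ \beta_1
\end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i
\end{bmatrix}
$$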
This method is perhaps a cleaner approach. It can also work well for model functions with many coefficients, such as higher-order polynomials or multivariable functions.
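To make this concrete, here is a small sketch in Python that sets up and solves the matrix equation above with NumPy. The data points are made up purely for illustration:

```python
import numpy as np

# Hypothetical data points, invented purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Normal equations for the line y = b0 + b1 * x:
#   [ n       sum(x)   ] [ b0 ]   [ sum(y)   ]
#   [ sum(x)  sum(x^2) ] [ b1 ] = [ sum(x*y) ]
A = np.array([[n,       x.sum()],
              [x.sum(), (x * x).sum()]])
v = np.array([y.sum(), (x * y).sum()])

# Solve the matrix equation for the coefficients.
b0, b1 = np.linalg.solve(A, v)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

The same pattern extends to model functions with more coefficients; only the matrix of sums grows to match the number of unknowns.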
Maximum Likelihood Estimation
Now let’s assume the errors are normally distributed. That is to say, the observed values are normally distributed around the model. The probability density function for the normal distribution looks like this:
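$$
f(y) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( - \frac{(y - \mu)^2}{2 \sigma^2} \right)
$$

Here the mean is denoted by $\mu$ and the standard deviation by $\sigma$.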
Here we treat our model as the mean. We also consider the standard deviation, denoted by sigma, which measures the spread of the data around the mean. Given our observed data points, we want to figure out what the most likely values are for the mean and standard deviation. For a single data point, the likelihood function for a given mean and standard deviation is:
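$$
L(\mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( - \frac{(y_i - \mu)^2}{2 \sigma^2} \right)
$$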
The likelihood is equal to the probability density. For all data points combined, the likelihood function for a given mean and standard deviation is equal to the product of the density at each individual data point:
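$$
L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( - \frac{(y_i - \mu)^2}{2 \sigma^2} \right)
$$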
At this point, we just need to find the mean and standard deviation values that maximize the likelihood function. Similar to what we did in the previous section, we can find the maximum by taking the partial derivative of the likelihood function with respect to each parameter, setting the derivatives to zero, and then solving. This might be easier to do if we first take the natural logarithm of the likelihood function:
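$$
\ln L(\mu, \sigma) = -n \ln \sigma - \frac{n}{2} \ln (2 \pi) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2
$$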
Since we’re interested in finding the coefficients of the model function, we can replace the mean parameter with the body of the model function and treat the likelihood function as a function of the coefficients:
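$$
\ln L(\beta_0, \beta_1, \sigma) = -n \ln \sigma - \frac{n}{2} \ln (2 \pi) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2
$$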
Let’s call this the log-likelihood function. Since the natural logarithm function is a monotonically increasing function, we can maximize the log-likelihood function and get the same result we would get if we maximized the original likelihood function. Here are the partial derivatives of the log-likelihood function with respect to each of the coefficients:
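$$
\frac{\partial \ln L}{\partial \beta_0} = \frac{1}{\sigma^2} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)
$$

$$
\frac{\partial \ln L}{\partial \beta_1} = \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i \big( y_i - (\beta_0 + \beta_1 x_i) \big)
$$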
Setting the derivative with respect to the first coefficient to zero, we get the following result:
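$$
\sum_{i=1}^{n} y_i - n \beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0
$$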
Rearranging the equation and solving for the coefficient:
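$$
\beta_0 = \frac{1}{n} \left( \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i \right)
$$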
Setting the derivative with respect to the second coefficient to zero, we get the following result:
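$$
\sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0
$$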
Rearranging the equation and solving for the coefficient:
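$$
\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2}
$$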
As you can see, we arrive at the same results we got from using the method of least squares to estimate the coefficients. For completeness, the same procedure can be used to find the standard deviation:
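Taking the partial derivative of the log-likelihood function with respect to the standard deviation:

$$
\frac{\partial \ln L}{\partial \sigma} = - \frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2
$$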
Setting the derivative to zero, we get the following result for the standard deviation:
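$$
n \sigma^2 = \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2
$$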
Rearranging the equation and solving for sigma:
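$$
\sigma = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2 }
$$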
Note that this result yields a biased estimate of the standard deviation when it is computed from a limited number of samples. It might be more appropriate to use an unbiased estimator that takes the number of degrees of freedom into consideration. But that's out of scope for this post. Perhaps it's a topic I'll explore at another time.