Minimizing with Gradient Descent
The previous post demonstrates the use of a hill climbing algorithm to find a set of parameters that minimizes a cost function associated with a coin toss game. In this post, I want to explore the use of a gradient descent algorithm as an alternative. The two classes of algorithms are very similar in that they both iteratively update an estimated parameter set. But while the hill climbing algorithm only updates one parameter at a time, the gradient descent approach updates all parameters at once, moving in the direction of steepest descent.
The Algorithm
The gradient descent algorithm is a way of finding a local minimum of a differentiable function. For our example, we'll use the same cost function as in the previous post titled Visualizing the Climb up the Hill. Recall the following equations based on the model of the coin toss game:
The two equations above must hold true if we have a valid set of weights for a given target distribution. If we estimate a set of weights that do not satisfy these equations, then we know the estimate can be improved. The sum of squared errors makes for a convenient cost function:
Note the use of a parameter vector for the input. As you might expect, this is a convenient notation for multivariable functions. Once we have a cost function to work with, we also need its gradient:
The gradient is a vector containing the partial derivatives of the cost function, one for each of the variables. Each component is the slope of the function along the corresponding axis, and the vector as a whole points in the direction of steepest ascent. To apply the gradient descent algorithm, we subtract a step in the direction of the gradient from the current estimate to obtain an improved estimate. Here is the formula:
Starting with an initial guess, we can apply the above repeatedly until the gradient has a magnitude of zero. Once the gradient is zero, we know we have reached a local minimum. In practice, however, you might want to stop once the gradient is sufficiently small. In this example, we terminate once the magnitude of the gradient is below a predetermined step size:
To ensure each increment adjusts the estimate by a fixed step size, we need to multiply the gradient by a scalar factor called the learning rate:
In this instance, the learning rate is simply the step size divided by the magnitude of the gradient vector. The learning rate is recomputed each iteration.
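Putting these pieces together, here is a rough sketch of the procedure in Python. The function name, the default step size, and the quadratic cost in the usage example at the bottom are illustrative assumptions rather than details taken from the examples that follow; the gradient is supplied as a function that maps a parameter vector to its vector of partial derivatives.

```python
import math

def gradient_descent(gradient, initial, step_size=0.0001):
    """Step against the gradient until its magnitude drops below the step size."""
    estimate = list(initial)
    iterations = 0
    while True:
        grad = gradient(estimate)
        magnitude = math.sqrt(sum(g * g for g in grad))
        if magnitude < step_size:
            return estimate, iterations
        # Recompute the learning rate on every pass so that each update
        # moves the estimate by a fixed step size.
        learning_rate = step_size / magnitude
        estimate = [p - learning_rate * g for p, g in zip(estimate, grad)]
        iterations += 1

# Usage with a hypothetical quadratic cost C(p) = p1^2 + p2^2, whose
# gradient is (2 * p1, 2 * p2); this is not the coin toss cost function.
result, count = gradient_descent(lambda p: [2 * p[0], 2 * p[1]], [0.3, 0.6])
```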
Visualizations
I want to compare and contrast the gradient descent algorithm outlined above with the hill climbing algorithms used in the previous post. To do so, let’s use the same target distribution as we used before:
These values can be represented symbolically like this:
Plugging these values into the cost function gives us a concrete form to work with:
Just like with the hill climbing algorithms used in the previous post, to apply the gradient descent algorithm we need to start out with an initial estimate. Let’s use the following set of parameters as the initial guess:
Applying the gradient descent procedure with the above as the initial input, the output of the first iteration becomes the input to the second. The estimate is updated repeatedly until the process terminates. Here is the final result:
The algorithm terminates after 4,333 iterations. The following trace shows the path it takes from start to finish:
Using the same technique, we can run through several more examples, each with different starting points:
As you can see, the final result very much depends on the initial estimate. In each case, the path is a smooth, slightly curved line from start to finish, and the trajectory is similar to that of the stochastic hill climbing algorithm used in the previous post. When using the same step size, the gradient descent approach requires roughly 70% to 80% as many iterations as the hill climbing alternative.
Learning Rate
Each one of the examples illustrated above requires over a thousand iterations to complete, and some require over four thousand. The number of iterations depends largely on the learning rate. It's worth pointing out here that other learning rate schemes are possible, and some may yield much faster convergence than the one used here. Consider this alternative:
With the cost function used in the examples above, this alternative learning rate speeds up convergence by at least two orders of magnitude. In the small toy examples illustrated here, that isn't going to make any tangible difference; the code already executes fast enough. But in a larger application, a well-chosen learning rate can have a significant impact on the runtime performance of the gradient descent algorithm.
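One way to experiment with different schemes is to factor the learning rate computation out into its own function, as in the sketch below. This is only an assumed refactoring of the earlier sketch; the constant-rate scheme at the end is shown purely as an illustration of a different scheme, not as the specific alternative described above.

```python
import math

def descend(gradient, initial, learning_rate, tolerance=0.0001):
    """Gradient descent with a pluggable learning rate scheme."""
    estimate = list(initial)
    while True:
        grad = gradient(estimate)
        if math.sqrt(sum(g * g for g in grad)) < tolerance:
            return estimate
        # The learning rate is a function of the current gradient; it could
        # just as easily depend on the cost value or the iteration count.
        rate = learning_rate(grad)
        estimate = [p - rate * g for p, g in zip(estimate, grad)]

# The fixed-step scheme used in the examples above: step size divided by
# the magnitude of the gradient vector.
def fixed_step(step_size):
    return lambda grad: step_size / math.sqrt(sum(g * g for g in grad))

# A constant learning rate, included only as an illustrative alternative.
def constant(rate):
    return lambda grad: rate
```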