Finding the Roots with Newton’s Method
In the last post, we explored the use of gradient descent and other optimization methods to find the root of the gradient of a Lagrangian function. These optimization methods work by minimizing a cost function. In this post, I want to explore the multivariate form of Newton’s method as an alternative. Unlike optimization methods such as gradient descent, Newton’s method can find solutions that lie on a saddle point, eliminating the need for a cost function. This may or may not be a better approach.
The Method
Newton’s method is an iterative root-finding technique for solving equations. It is sometimes called the Newton–Raphson method. In the univariate form, you can use this method to find the root of an equation in the following form:
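$$f(x) = 0$$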
Starting with an initial guess, you can iteratively find successively closer and closer approximations to the root. If you’ve taken an undergraduate course in numerical methods, the following formula might look familiar:
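$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$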
You can apply the formula above repeatedly until you find a sufficiently close value. This method works for a single equation and a single unknown. But what happens if you have multiple equations and multiple unknowns? Consider the following:
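$$
\begin{aligned}
f_1(x_1, x_2, \ldots, x_n) &= 0 \\
f_2(x_1, x_2, \ldots, x_n) &= 0 \\
&\;\;\vdots \\
f_m(x_1, x_2, \ldots, x_n) &= 0
\end{aligned}
$$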
This is a system of equations that must be satisfied. And each equation has multiple unknowns. We can represent this more concisely using a vector function:
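$$
\mathbf{f}(\mathbf{x}) =
\begin{bmatrix}
f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x})
\end{bmatrix}
= \mathbf{0}
$$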
The multivariate form of Newton’s method works in very much the same way as the univariate form of the method. The main difference is that the initial guess and the revised estimate are vectors instead of scalar values. Here is the formula:
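$$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{J}_{\mathbf{f}}(\mathbf{x}_k)^{-1}\, \mathbf{f}(\mathbf{x}_k)$$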
Notice that instead of dividing the function by its derivative like we do in the univariate case, here we multiply the function by the inverse of its Jacobian matrix. I like to think of the Jacobian matrix as the multivariate equivalent of a derivative. The Jacobian matrix of a vector function is a matrix containing all of its partial derivatives:
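$$
\mathbf{J}_{\mathbf{f}} =
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \dfrac{\partial f_m}{\partial x_2} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$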
Since we are interested in the inverse of the Jacobian matrix, we need to consider what happens if the matrix is not invertible. A matrix is not invertible if it is not a square matrix. Our Jacobian might be a rectangular matrix. And even if it’s a square matrix, it might be a singular matrix with a zero determinant. This can happen if one of the elements of the vector function is a linear combination of the others. To get around these problems, we can use a Moore–Penrose pseudoinverse instead. There are a number of ways to compute the pseudoinverse. One method is to use singular value decomposition. The matrix can be factorized as follows:
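$$\mathbf{J}_{\mathbf{f}} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{\mathsf{T}}$$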
This is the singular value decomposition of the Jacobian matrix. It is broken down into the product of three distinct matrices. The individual components can then be recomposed in the following form to find the pseudoinverse:
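$$\mathbf{J}_{\mathbf{f}}^{+} = \mathbf{V} \boldsymbol{\Sigma}^{+} \mathbf{U}^{\mathsf{T}}$$

Here, $\boldsymbol{\Sigma}^{+}$ is obtained by taking the reciprocal of each nonzero singular value in $\boldsymbol{\Sigma}$ and transposing the result.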
The specific details regarding singular value decomposition are beyond the scope of this post. You can find plenty of resources online if you want to learn more about it. In the examples in the following sections, I am using a numerics library to compute the pseudoinverse. The library uses singular value decomposition under the hood.
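For illustration only, here is how the pseudoinverse could be computed in Python with NumPy, whose `numpy.linalg.pinv` routine is also based on singular value decomposition. The matrix values below are made up for the sake of the example, and the code is not tied to any particular library used in the examples that follow:

```python
import numpy as np

# A made-up rectangular Jacobian, just for illustration
J = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0]])

# Moore-Penrose pseudoinverse (computed internally via SVD)
J_pinv = np.linalg.pinv(J)

# The same result assembled by hand from the SVD factors,
# assuming all singular values are nonzero
U, s, Vt = np.linalg.svd(J, full_matrices=False)
J_pinv_manual = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(J_pinv, J_pinv_manual))  # True
```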
Solving the Coin Toss Problem
In the previous post, we created a Lagrangian function based on the model of a weighted coin toss game. We then used the gradient descent algorithm to find the point at which the gradient of the Lagrangian function is equal to zero. Now we want to use Newton’s method instead of gradient descent to solve the coin toss problem. The generic notation used to describe Newton’s method in the previous section is different from the notation used for the model of the coin toss game. Here is a mapping of the two notation schemes:
We will use the coin toss game notation in the following sections. The examples shown in the following sections reproduce the solutions to the coin toss problem presented in the previous post. Only this time around, we use Newton’s method instead of gradient descent to find the zeros of the gradient of the Lagrangian function for each example.
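To make the procedure concrete, here is a minimal sketch of the iteration in Python. It assumes the gradient of the Lagrangian is available as a callable `gradient(x)` and its Jacobian as `jacobian(x)` (both hypothetical names, not the code used to produce the results below), and it uses the pseudoinverse in place of the inverse:

```python
import numpy as np

def newton_solve(gradient, jacobian, x0, tolerance=1e-10, max_iterations=100):
    """Find a zero of `gradient` starting from the initial guess `x0`."""
    x = np.asarray(x0, dtype=float)
    for iteration in range(max_iterations):
        g = gradient(x)
        if np.linalg.norm(g) < tolerance:
            return x, iteration  # converged
        J = jacobian(x)
        # Multiply by the pseudoinverse of the Jacobian in case it is singular
        x = x - np.linalg.pinv(J) @ g
    return x, max_iterations  # tolerance not reached
```

Returning the iteration count alongside the solution makes it easy to compare against the iteration counts reported for gradient descent.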
Example with 3 Coin Tosses, Scoring Function A
Using the same target distribution and constraint functions as we did with the example in the previous post, let’s use Newton’s method to find the solution to this variation of the coin toss problem. We can construct a Lagrangian function with the following scoring function:
Once we have the Lagrangian function, we can compute the gradient and use it to apply Newton’s method. The following diagrams show the trace of each iteration of Newton’s method using four different starting points:
Regardless of the starting point, the final value is always the same. This is what we expect. And this matches the solution found via other methods. Now check out the number of iterations required for each trace:
Compare that to the number of iterations required using the gradient descent method. Newton’s method doesn’t require nearly as many iterations as gradient descent.
Example with 3 Coin Tosses, Scoring Function B
Using the same target distribution and constraint functions as we did with the example in the previous post, let’s use Newton’s method to find the solution to this variation of the coin toss problem. We can construct a Lagrangian function with the following scoring function:
Once we have the Lagrangian function, we can compute the gradient and use it to apply Newton’s method. The following diagrams show the trace of each iteration of Newton’s method using four different starting points:
Regardless of the starting point, the final value is always the same. This is what we expect. And this matches the solution found via other methods. Now check out the number of iterations required for each trace:
Compare that to the number of iterations required using the gradient descent method. Newton’s method doesn’t require nearly as many iterations as gradient descent.
Example with 4 Coin Tosses
Let’s consider another example using a model of the coin toss game with four flips per round. Like the examples above, we’ll use the same target distribution and constraint functions as we did with the example in the previous post. Suppose we start with the following initial guess:
Now suppose we plug the following scoring function into our Lagrangian function:
Applying Newton’s method, here is the result:
Now suppose we plug the following scoring function into our Lagrangian function:
Applying Newton’s method, here is the result:
As you can see, once again, we get the same results that we found in the previous post using gradient descent. Here are the iteration counts using Newton’s method:
Again, the number of iterations required to reach a solution using Newton’s method is significantly less than the number of iterations required using gradient descent.
Performance Considerations
We have demonstrated here that Newton’s method can find a solution to the coin toss problem in far fewer iterations than the gradient descent method used in the previous post. Newton’s method requires only a handful of iterations versus the tens of thousands or even hundreds of thousands of iterations required using gradient descent. But that doesn’t necessarily mean Newton’s method is a more efficient approach.
While both methods must evaluate a gradient in each iteration, Newton’s method must also fill in a Jacobian matrix whose number of entries is the square of the length of the gradient vector. Furthermore, each iteration of Newton’s method must compute the pseudoinverse of that Jacobian matrix, which can be a computationally expensive task. These two methods might have very different algorithmic complexity characteristics in a larger problem space. Imagine a model of the coin toss game with ten, twenty, or even fifty flips per round.
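As a rough comparison, assuming a dense problem with an $n$-dimensional gradient and ignoring the cost of evaluating the functions themselves, the per-iteration linear-algebra work scales roughly as

$$\underbrace{O(n)}_{\text{gradient descent update}} \qquad \text{versus} \qquad \underbrace{O(n^2) + O(n^3)}_{\text{Jacobian entries + pseudoinverse}}$$

so the cost of each Newton step grows much faster with problem size, even though the number of iterations stays small.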
Some of the other optimization and root-finding methods mentioned in the previous post might have better performance characteristics than either Newton’s method or gradient descent. The best way to find out might be to try out each method on larger problem sizes and empirically measure how well they perform.