Jim Killingsworth

Finding the Roots with Newton's Method

In the last post, we explored the use of gradient descent and other optimization methods to find the roots of the gradient of a Lagrangian function. These optimization methods work by finding the minimum of a cost function. In this post, I want to explore the multivariate form of Newton's method as an alternative. Unlike optimization methods such as gradient descent, Newton's method can find solutions that lie on a saddle point, eliminating the need for a cost function. This may or may not be a better approach.

The Method

Newton's method is an iterative root-finding technique for solving equations. It is sometimes called the Newton–Raphson method. In the univariate form, you can use this method to find the root of an equation in the following form:

Figure 1

Starting with an initial guess, you can iteratively find successively closer and closer approximations to the root. If you've taken an undergraduate course in numerical methods, the following formula might look familiar:

Figure 2
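For reference, the standard univariate Newton update, which is presumably what the figure above depicts, can be written as:

    x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}

Each application of this update takes the current approximation and produces the next, hopefully closer, one.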

You can apply the formula above repeatedly until you find a sufficiently close value. This method works for a single equation and a single unknown. But what happens if you have multiple equations and multiple unknowns? Consider the following:

Figure 3

This is a system of equations that must be satisfied. And each equation has multiple unknowns. We can represent this more concisely using a vector function:

Figure 4

The multivariate form of Newton's method works in very much the same way as the univariate form of the method. The main difference is that the initial guess and the revised estimate are vectors instead of scalar values. Here is the formula:

Figure 5
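Spelled out, the multivariate update has the same shape as the univariate one, with the inverse of the Jacobian matrix standing in for division by the derivative; the figure presumably shows something along these lines:

    \mathbf{x}_{n+1} = \mathbf{x}_n - J_F(\mathbf{x}_n)^{-1} \, F(\mathbf{x}_n)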

Notice that instead of dividing the function by its derivative like we do in the univariate case, here we multiply the function by the inverse of its Jacobian matrix. I like to think of the Jacobian matrix as the multivariate equivalent of a derivative. The Jacobian matrix of a vector function is a matrix containing all of its partial derivatives:

Figure 6
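Concretely, for a vector function F with m component functions of n variables, the Jacobian is the m-by-n matrix of first-order partial derivatives:

    J_F =
    \begin{bmatrix}
    \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_n} \\
    \vdots & \ddots & \vdots \\
    \frac{\partial F_m}{\partial x_1} & \cdots & \frac{\partial F_m}{\partial x_n}
    \end{bmatrix}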

Since we are interested in the inverse of the Jacobian matrix, we need to consider what happens if the matrix is not invertible. A matrix is not invertible if it is not a square matrix. Our Jacobian might be a rectangular matrix. And even if it's a square matrix, it might be a singular matrix with a zero determinant. This can happen if one of the elements of the vector function is a linear combination of the others. To get around these problems, we can use a Moore–Penrose pseudoinverse instead. There are a number of ways to compute the pseudoinverse. One method is to use singular value decomposition. The matrix can be factorized as follows:

Figure 7

This is the singular value decomposition of the Jacobian matrix. It is broken down into the product of three distinct matrices. The individual components can then be recomposed in the following form to find the pseudoinverse:

Figure 8
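In standard notation, the factorization and the recomposition are:

    J = U \Sigma V^{\top}, \qquad J^{+} = V \Sigma^{+} U^{\top}

where the matrix Σ⁺ is formed by taking the reciprocal of each nonzero singular value on the diagonal of Σ and transposing the result.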

The specific details regarding singular value decomposition are beyond the scope of this post. You can find plenty of resources online if you want to learn more about it. In the examples in the following sections, I am using a numerics library to compute the pseudoinverse. The library uses singular value decomposition under the hood.
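The post doesn't name the library here, but as an illustration of the same idea, NumPy's pinv function computes the Moore–Penrose pseudoinverse via singular value decomposition. Here is a minimal sketch showing that, for a full-rank rectangular matrix, the explicit SVD recomposition matches the library routine:

    import numpy as np

    # A rectangular Jacobian; an ordinary inverse does not exist for this shape.
    J = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

    # The library routine computes the pseudoinverse using SVD internally.
    J_pinv = np.linalg.pinv(J)

    # The same result, recomposed explicitly from the SVD factors.
    # This matrix has no zero singular values, so no thresholding is needed.
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    J_pinv_svd = Vt.T @ np.diag(1.0 / s) @ U.T

    print(np.allclose(J_pinv, J_pinv_svd))  # True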

Solving the Coin Toss Problem

In the previous post, we created a Lagrangian function based on the model of a weighted coin toss game. We then used the gradient descent algorithm to find the point at which the gradient of the Lagrangian function is equal to zero. Now we want to use Newton's method instead of gradient descent to solve the coin toss problem. The generic notation used to describe Newton's method in the previous section is different from the notation used for the model of the coin toss game. Here is a mapping of the two different notation schemes:

Figure 9

We will use the coin toss game notation in the sections below. The examples that follow reproduce the solutions to the coin toss problem presented in the previous post. Only this time around, we use Newton's method instead of gradient descent to find the zeros of the gradient of the Lagrangian function for each example.
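As a rough illustration of what each iteration does (a generic sketch, not the code from the accompanying repository), the update can be implemented with the pseudoinverse standing in for the inverse. In the coin toss examples, the vector function being driven to zero is the gradient of the Lagrangian function:

    import numpy as np

    def newton(F, jacobian, x0, tolerance=1e-12, max_iterations=100):
        # Multivariate Newton's method using the Moore-Penrose pseudoinverse.
        x = np.asarray(x0, dtype=float)
        for i in range(1, max_iterations + 1):
            step = np.linalg.pinv(jacobian(x)) @ F(x)
            x = x - step
            if np.linalg.norm(step) < tolerance:
                return x, i
        return x, max_iterations

    # Toy system: x^2 + y^2 = 4 and x = y, with roots at (+/- sqrt(2), +/- sqrt(2)).
    F = lambda v: np.array([v[0] ** 2 + v[1] ** 2 - 4.0, v[0] - v[1]])
    J = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]], [1.0, -1.0]])
    root, iterations = newton(F, J, x0=[1.0, 0.5])
    print(root, iterations)

The stopping rule here is a simple step-size tolerance; the actual convergence criterion used in the examples below is not spelled out in the post.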

Example with 3 Coin Tosses, Scoring Function A

Using the same target distribution and constraint functions as we did with the example in the previous post, let's use Newton's method to find the solution to this variation of the coin toss problem. We can construct a Lagrangian function with the following scoring function:

Figure 10

Once we have the Lagrangian function, we can compute the gradient and use it to apply Newton's method. The following diagrams show the trace of each iteration of Newton's method using four different starting points:

Figure 11
Figure 12
Figure 13
Figure 14

Regardless of the starting point, the final value is always the same. This is what we expect. And this matches the solution found via other methods. Now check out the number of iterations required for each trace:

Figure 15

Compare that to the number of iterations required using the gradient descent method. Newton's method doesn't require nearly as many iterations as gradient descent.

Example with 3 Coin Tosses, Scoring Function B

Using the same target distribution and constraint functions as we did with the example in the previous post, let's use Newton's method to find the solution to this variation of the coin toss problem. We can construct a Lagrangian function with the following scoring function:

Figure 16

Once we have the Lagrangian function, we can compute the gradient and use it to apply Newton's method. The following diagrams show the trace of each iteration of Newton's method using four different starting points:

Figure 17
Figure 18
Figure 19
Figure 20

Regardless of the starting point, the final value is always the same. This is what we expect. And this matches the solution found via other methods. Now check out the number of iterations required for each trace:

Figure 21

Compare that to the number of iterations required using the gradient descent method. Newton's method doesn't require nearly as many iterations as gradient descent.

Example with 4 Coin Tosses

Let's consider another example using a model of the coin toss game with four flips per round. Like the examples above, we'll use the same target distribution and constraint functions as we did with the example in the previous post. Suppose we start with the following initial guess:

Figure 22

Now suppose we plug the following scoring function into our Lagrangian function:

Figure 23

Applying Newton's method, here is the result:

Figure 24

Now suppose we plug the following scoring function into our Lagrangian function:

Figure 25

Applying Newton's method, here is the result:

Figure 26

As you can see, once again, we get the same results that we found in the previous post using gradient descent. Here is the number of iterations required using Newton's method:

Figure 27

Again, the number of iterations required to reach a solution using Newton's method is significantly less than the number of iterations required using gradient descent.

Performance Considerations

We have demonstrated here that Newton's method can find a solution to the coin toss problem in far fewer iterations than the gradient descent method used in the previous post. Newton's method requires only a handful of iterations versus the tens of thousands or even hundreds of thousands of iterations required using gradient descent. But that doesn't necessarily mean Newton's method is a more efficient approach.

While both methods must evaluate a gradient in each iteration, Newton's method must also compute the values of a Jacobian matrix whose number of elements is the square of the length of the gradient vector. Furthermore, each iteration of Newton's method must also compute the inverse of the Jacobian matrix, which might be a computationally expensive task. These two methods might have very different algorithmic complexity characteristics in a larger problem space. Imagine a model of the coin toss game with ten, twenty, or even fifty flips per round.

Some of the other optimization and root-finding methods mentioned in the previous post might have better performance characteristics than either Newton's method or gradient descent. The best way to find out might be to try out each method on larger problem sizes and empirically measure how well they perform.

Accompanying source code is available on GitHub.
