Jim Killingsworth

Minimizing with Gradient Descent

The previous post demonstrates the use of a hill climbing algorithm to find a set of parameters that minimize a cost function associated with a coin toss game. In this post, I want to explore the use of a gradient descent algorithm as an alternative. The two classes of algorithms are very similar in that they both iteratively update an estimated parameter set. But while the hill climbing algorithm only updates one parameter at a time, the gradient descent approach updates all parameters at once, moving in the direction of steepest descent.

The Algorithm

The gradient descent algorithm is a way of finding a local minimum of a differentiable function. In this case, we'll use the same cost function used in the previous post titled Visualizing the Climb up the Hill as our example. Recall the following equations based on the model of the coin toss game:

Figure 1

The two equations above must hold true if we have a valid set of weights for a given target distribution. If we estimate a set of weights that does not satisfy these equations, then we know the estimate can be improved. The sum of squared errors makes for a convenient cost function:

Figure 2
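The shape of this cost function can be sketched in code. The two residuals below are hypothetical stand-ins for the coin toss constraint equations shown above; the actual residuals depend on the model from the previous post.

```python
import numpy as np

def cost(p, residuals):
    """Sum of squared errors over a list of constraint residual functions."""
    return sum(r(p) ** 2 for r in residuals)

# Hypothetical residuals standing in for the two constraint equations;
# the real ones come from the coin toss model in the previous post.
residuals = [
    lambda p: p[0] + p[1] - 1.0,
    lambda p: p[0] * p[1] - 0.2,
]

print(cost(np.array([0.5, 0.5]), residuals))
```

A parameter set that satisfies every constraint drives each residual, and therefore the whole sum, to zero.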

Note the use of a parameter vector for the input. As you might expect, this is a convenient notation for multivariable functions. Once we have a cost function to work with, we also need to find the gradient of the cost function:

Figure 3
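When working out the partial derivatives by hand is tedious, the gradient can be approximated numerically. This is a sketch using central differences, not the analytical gradient shown above:

```python
import numpy as np

def gradient(f, p, h=1e-6):
    """Approximate the gradient of f at p using central differences."""
    g = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        step = np.zeros_like(p, dtype=float)
        step[i] = h
        # Slope along axis i: difference of f evaluated a small step
        # either side of p, divided by the width of the interval.
        g[i] = (f(p + step) - f(p - step)) / (2 * h)
    return g

# Example: f(x, y) = x^2 + 3y^2 has gradient (2x, 6y).
f = lambda p: p[0] ** 2 + 3 * p[1] ** 2
print(gradient(f, np.array([1.0, 2.0])))  # approximately [2, 12]
```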

The gradient is a vector containing the partial derivatives of the cost function, one for each of the variables. Each component is the slope of the function along the corresponding axis. The vector points in the direction of steepest ascent. To apply the gradient descent algorithm, we subtract a value in the direction of the gradient from an estimated value to obtain an improved value. Here is the formula:

Figure 4

Starting with an initial guess, we can apply the above repeatedly until the gradient has a magnitude of zero. Once the gradient is zero, we know we have reached a local minimum. In practice, however, you might want to stop once the gradient is sufficiently small. In this example, we terminate once the magnitude of the gradient is below a predetermined step size:

Figure 5

To ensure each increment adjusts the estimate by a fixed step size, we need to multiply the gradient by a scalar factor called the learning rate:

Figure 6

In this instance, the learning rate is simply the step size divided by the magnitude of the gradient vector. The learning rate is recomputed on each iteration.
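Putting the pieces together, here is a minimal sketch of the whole procedure: a fixed step size, a learning rate recomputed each iteration as the step size divided by the gradient magnitude, and termination once the gradient magnitude drops below the step size. The quadratic cost used in the demonstration is a stand-in for the coin toss cost function:

```python
import numpy as np

def gradient_descent(grad, p0, step=0.001, max_iters=100_000):
    """Fixed-step gradient descent with the termination rule described above."""
    p = np.asarray(p0, dtype=float)
    for i in range(max_iters):
        g = grad(p)
        magnitude = np.linalg.norm(g)
        if magnitude < step:          # stop once the gradient is sufficiently small
            return p, i
        rate = step / magnitude       # learning rate recomputed each iteration
        p = p - rate * g              # move a fixed step toward steepest descent
    return p, max_iters

# Stand-in cost: f(x, y) = (x - 0.3)^2 + (y - 0.7)^2, minimum at (0.3, 0.7).
grad = lambda p: 2 * (p - np.array([0.3, 0.7]))
p, iters = gradient_descent(grad, [0.9, 0.1])
print(p, iters)  # p lands close to [0.3, 0.7]
```

Because each update has a fixed magnitude, the iteration count grows with the distance between the initial guess and the minimum, which is why the examples below take thousands of iterations.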

Visualizations

I want to compare and contrast the gradient descent algorithm outlined above with the hill climbing algorithms used in the previous post. To do so, let's use the same target distribution as we used before:

Figure 7

These values can be represented symbolically like this:

Figure 8

These values can be used to produce a concrete form of the cost function:

Figure 9

Just like with the hill climbing algorithms used in the previous post, we need to start with an initial estimate to apply the gradient descent algorithm. Let's use the following set of parameters as the initial guess:

Figure 10

Applying the gradient descent procedure with the above as the initial input, the output of the first iteration becomes the input to the second iteration. The estimate is updated repeatedly until the process terminates. Here is the final result:

Figure 11

The algorithm terminates after 4,333 iterations. The following trace shows the path it takes from start to finish:

Figure 12

Using the same technique, we can run through several more examples, each with different starting points:

Figure 13
Figure 14
Figure 15

As you can see, the final result very much depends on the initial estimate. In each case, the path is a smooth and slightly curved line from start to finish. Notice how, in each case, the trajectory is similar to that of the stochastic hill climbing algorithm used in the previous post. When using the same step size, the gradient descent approach requires around 70% to 80% of the number of iterations needed by the hill climbing alternative.

Learning Rate

Each one of the examples illustrated above requires over a thousand iterations to complete. Some of them require over four thousand iterations. The number of iterations required depends largely on the learning rate. I think it's worth pointing out here that other learning rate schemes are possible. Some learning rates may yield a much faster convergence than the one used here. Consider this alternative:

Figure 16

With the cost function used in the examples above, this alternative learning rate speeds up the rate of convergence by at least two orders of magnitude. In the small toy examples illustrated here, this isn't going to make any tangible difference; it already executes fast enough. But in a larger application, an optimized learning rate can have a significant impact on the runtime performance of the gradient descent algorithm.

Accompanying source code is available on GitHub.
