Jim Killingsworth

Equality Constraints and Lagrange Multipliers

My last few posts have centered around a weighted coin toss game in which the weights of a set of biased coins are determined based on a known target distribution. And while multiple solutions are possible, the inclusion of a scoring function allowed for a unique solution to be found. Until now, I was not sure how to include the scoring function in such a way that I could solve the problem numerically for an arbitrary number of coin tosses. In this post, I show how to use the method of Lagrange multipliers to minimize the scoring function while conforming to the constraints of the coin toss problem.

Lagrangian Function

In previous posts, we focused primarily on finding arbitrary solutions to the weighted coin toss problem. The scoring function that ranked each possible solution was a secondary concern. Here we want to focus on minimizing the scoring function as our primary concern while still satisfying the conditions that yield a valid result for a given target distribution. We can frame the problem as a scoring function that we want to minimize along with a set of equality constraints that we need to satisfy:

Figure 1

Our objective is not to find the absolute minimum of the scoring function. Rather, we want to find a set of parameters that satisfy the given constraints while also yielding the smallest possible value when plugged into the scoring function. The key to understanding the method of Lagrange multipliers is knowing that the gradient of the objective function (the scoring function in this case) is equal to a linear combination of the gradients of the constraint functions at the point at which we find the optimal solution:

Figure 2

The lambda coefficients are the Lagrange multipliers. You can find plenty of resources online that explain this relationship in more detail. For our purposes here, let’s just take it as a given. If you were to expand out the gradients and include the equality constraints, you would have a system of equations that can be solved. We can package this system of equations up in an elegant fashion using the Lagrangian function:

Figure 3
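For reference, the Lagrangian combines the scoring function with a weighted sum of the constraint functions, using one multiplier per constraint. Writing p for the vector of coin weights, f for the scoring function, g_i for the constraint functions, and λ_i for the multipliers (these symbols are my own shorthand, and sign conventions vary):

$$
\mathcal{L}(p, \boldsymbol{\lambda}) = f(p) + \sum_{i=1}^{m} \lambda_i \, g_i(p)
$$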

Now, if we take the gradient of the Lagrangian function and set it equal to the zero vector, we have a very concise way to express the system of equations:

Figure 4

It might even be too concise. But it makes sense if you expand the gradient and unfold it a bit. Here is the complete system of equations:

Figure 5
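In general terms, the expanded system pairs one stationarity equation per unknown weight with one feasibility equation per constraint. With n weights and m constraints, and using the same shorthand symbols as above, it has the shape:

$$
\frac{\partial f}{\partial p_j} + \sum_{i=1}^{m} \lambda_i \frac{\partial g_i}{\partial p_j} = 0
\quad \text{for } j = 1, \dots, n,
\qquad
g_i(p) = 0
\quad \text{for } i = 1, \dots, m
$$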

Note that when using the method of Lagrange multipliers, not only do we need to find the weights of the biased coins, but we also need to solve for the Lagrange multipliers. This technique introduces more variables that we need to solve for. But that’s no problem. Once we have the complete system of equations, we can get on with the business of solving them.

Solving with Gradient Descent

There is more than one way to solve a system of equations. One way to do it is to use the gradient descent algorithm demonstrated in my previous post. We can use the magnitude of the gradient as the cost function. In the examples that follow, we’ll use the square of the magnitude of the gradient because it’s easier that way:

Figure 6

The square of the magnitude can be found by computing the sum of the squares of each element in the gradient vector. The equation above can be expanded out like this:

Figure 7
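In code, this cost is just the dot product of the gradient vector with itself. Here is a minimal sketch in Python; the lagrangian_gradient callable is a hypothetical stand-in for whatever routine computes the gradient of the Lagrangian:

```python
import numpy as np

def cost(lagrangian_gradient, v):
    """Squared magnitude of the Lagrangian's gradient at the point v.

    The gradient of the Lagrangian vanishes at the constrained optimum,
    so this cost is zero exactly where the solution lies.
    """
    g = np.asarray(lagrangian_gradient(v))  # vector of partial derivatives
    return float(np.dot(g, g))              # sum of the squares of the elements
```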

In the examples illustrated in the following sections, we’ll use the same learning rate calculation described in my previous post titled Minimizing with Gradient Descent. But instead of using a fixed step size, we’ll use a variable step size:

Figure 8

The step size remains constant for the first 10,000 iterations and then decreases asymptotically thereafter. Here is a visualization:

Figure 9

Decreasing the step size like this helps the algorithm converge to a solution faster. As we will see in the following sections, it still takes a lot of iterations to find a solution using the gradient descent method.
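Putting the step size schedule and the descent loop together, a minimal sketch might look like the following. The hold period matches the 10,000 iterations mentioned above, but the hyperbolic decay, the initial step size, the stopping tolerance, and the central-difference gradient are illustrative assumptions rather than the exact choices used here:

```python
import numpy as np

def step_size(iteration, initial=0.1, hold=10_000):
    """Constant step size for the first `hold` iterations, then an
    asymptotic decay toward zero (one plausible schedule)."""
    return initial if iteration < hold else initial * hold / iteration

def numerical_gradient(f, v, h=1e-6):
    """Central-difference approximation of the gradient of f at v."""
    grad = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        grad[i] = (f(v + e) - f(v - e)) / (2 * h)
    return grad

def gradient_descent(cost, v0, max_iterations=500_000, tolerance=1e-12):
    """Minimize the cost function starting from the initial guess v0."""
    v = np.array(v0, dtype=float)
    for i in range(max_iterations):
        grad = numerical_gradient(cost, v)
        if np.dot(grad, grad) < tolerance:
            break                       # gradient is effectively zero
        v = v - step_size(i) * grad     # take a step downhill
    return v
```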

Example with 3 Coin Tosses

Let’s consider a concrete example of using the method of Lagrange multipliers and the gradient descent algorithm to find the optimal solution to the coin toss problem. In this first example, we’ll use a model of the coin toss game with three flips per round. You can find a detailed description of this model in my post titled Visualizing the Climb up the Hill. Suppose we start with the following target distribution:

Figure 10

This illustration depicts the probability mass function of the expected outcome. These values can be represented using the following notation:

Figure 11

Based on the model of the coin toss game with three flips per round, the following constraint functions must equal zero when we have a valid solution for the weights:

Figure 12

Once we have the constraint functions, we can go ahead and incorporate them into the Lagrangian function. Here is what it looks like so far:

Figure 13

Now we just need to plug in a scoring function to produce a concrete Lagrangian function that we can work with. The next two sections demonstrate the results of the optimization procedure using two different scoring function variants.
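Before looking at the results, here is a minimal sketch of how a concrete Lagrangian might be assembled once the scoring and constraint functions are chosen. The score, g1, and g2 callables below are hypothetical placeholders standing in for the actual functions shown in the figures, and the unknowns are packed into a single vector, with the coin weights first and one Lagrange multiplier per constraint after them:

```python
import numpy as np

def make_lagrangian(score, constraints):
    """Build a Lagrangian callable from a scoring function and a list of
    constraint functions. The argument vector packs the coin weights first,
    followed by one Lagrange multiplier per constraint."""
    def lagrangian(v):
        n = len(v) - len(constraints)
        p, lam = v[:n], v[n:]
        return score(p) + sum(l * g(p) for l, g in zip(lam, constraints))
    return lagrangian

# Hypothetical placeholders; the actual scoring and constraint functions
# for the three-toss model are the ones given in the figures above.
def score(p): return np.sum((p - 0.5) ** 2)
def g1(p): return p[0] * p[1] - 0.25
def g2(p): return p[0] + p[1] - 1.0

lagrangian = make_lagrangian(score, [g1, g2])
print(lagrangian(np.array([0.5, 0.5, 1.0, 1.0])))  # evaluate at an arbitrary point
```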

Example with 3 Coin Tosses, Scoring Function A

Using the two constraint functions defined for the coin toss game with three flips per round, we can construct a Lagrangian function with the following scoring function:

Figure 14

Once we have a concrete Lagrangian function, we can use it to come up with a cost function as described earlier, by taking the square of the magnitude of the gradient. We can then use this cost function to apply the gradient descent method. The following diagrams show the trace of the gradient descent using four different starting points:

Figure 15
Figure 16
Figure 17
Figure 18

Not all dimensions are shown in these illustrations. The colors of the heatmap are based on the final values of the Lagrange multipliers. The important thing to notice here is that, regardless of the starting point, the final value is always the same. And the final value matches the value found by a different method.

Here is a breakdown of the number of iterations required for each trace:

Figure 19

Using gradient descent to solve the system of equations yields the correct result. But with such a large number of iterations, it doesn’t seem like a very efficient way to arrive at the solution. While not visible in the charts, the trace falls into a tight zigzagging pattern as it moves closer to the final solution, slowing down the performance of the algorithm.

Example with 3 Coin Tosses, Scoring Function B

Using the two constraint functions defined for the coin toss game with three flips per round, we can construct a Lagrangian function with the following scoring function:

Figure 20

Once we have a concrete Lagrangian function, we can use it to come up with a cost function as described earlier, by taking the square of the magnitude of the gradient. We can then use this cost function to apply the gradient descent method. The following diagrams show the trace of the gradient descent using four different starting points:

Figure 21
Figure 22
Figure 23
Figure 24

Not all dimensions are shown in these illustrations. The colors of the heatmap are based on the final values of the Lagrange multipliers. The important thing to notice here is that, regardless of the starting point, the final value is always the same. And the final value matches the value found by a different method.

Here is a breakdown of the number of iterations required for each trace:

Figure 25

Using gradient descent to solve the system of equations yields the correct result. But with such a large number of iterations, it doesn’t seem like a very efficient way to arrive at the solution. While not visible in the charts, the trace falls into a tight zigzagging pattern as it moves closer to the final solution, slowing down the performance of the algorithm.

Example with 4 Coin Tosses

Let’s consider another example of using the method of Lagrange multipliers and the gradient descent algorithm to find the optimal solution to the coin toss problem. In this example, we’ll use a model of the coin toss game with four flips per round. You can find a detailed description of this model in my post titled Estimating the Weights of Biased Coins. Suppose we start with the following target distribution:

Figure 26

This illustration depicts the probability mass function of the expected outcome. These values can be represented using the following notation:

Figure 27

Based on the model of the coin toss game with four flips per round, the following constraint functions must equal zero when we have a valid solution for the weights:

Figure 28

We can use the constraint functions above to create a Lagrangian function. Using the Lagrangian function, we can come up with a cost function, as described earlier. Once we have a cost function, we can use it to apply the gradient descent algorithm. In this example, we’ll start with the following initial guess:

Figure 29

Now suppose we plug the following scoring function into our Lagrangian function:

Figure 30

Applying the gradient descent method, here is the result:

Figure 31

Now suppose we plug the following scoring function into our Lagrangian function:

Figure 32

Applying the gradient descent method, here is the result:

Figure 33

You can compare these results with the results found via a different method and see that they are the same. Here is the number of iterations required for each:

Figure 34

Once again, we see that the gradient descent algorithm requires a large number of iterations. It’s an iterative method that falls into a slow zigzag as it moves closer to the final value.

Constraint Qualification and Dependent Equations

Using the optimization technique illustrated in the examples above, the solution always converges to the optimal set of values for the weights of the biased coins, regardless of the initial guess. There is a unique solution for a given scoring function and set of constraints. When using iterative optimization methods, the optimal point is the solution found upon reaching the final iteration:

Figure 35

This technique, however, does not necessarily converge to a unique set of Lagrange multipliers. The Lagrange multiplier theorem guarantees a unique set of Lagrange multipliers only if the constraint qualification assumption is satisfied. Maybe it doesn’t matter here, since we still get the right answer for the weights of the biased coins. But I think it’s worth pointing out nonetheless. So what is the constraint qualification assumption that guarantees a unique set of Lagrange multipliers? It depends on whether there are multiple constraints or only one constraint. Consider the gradient of the constraint functions at the optimal point:

Figure 36

In the case of multiple constraints, the gradient vectors of the constraint functions at the optimal point must be linearly independent of one another. Where there is only one constraint, the gradient must not equal the zero vector at the optimal point. None of the examples illustrated in the sections above satisfy the constraint qualification condition. But let’s take a closer look at the constraints. Consider the set of constraints for the coin toss game with three flips per round:

Figure 37

Remember, each constraint function must equal zero. And these two equations are not independent of each other. The second constraint can be expressed in terms of the first:

Figure 38

Since it is not an independent equation, we can drop the second constraint entirely. Using only the first constraint, we can apply the method of Lagrange multipliers in a way that satisfies the constraint qualification assumption. In this case, there is only one Lagrange multiplier, and it always converges to the same value using iterative methods, regardless of the initial guess. Now let’s consider the set of constraints for the coin toss game with four flips per round:

Figure 39

Again, each constraint function must equal zero. And like before, this is not an independent set of equations. We can express one constraint as a linear combination of the others:

Figure 40

If we exclude the dependent equation from the set of constraints, we can satisfy the constraint qualification assumption. As I mentioned earlier, satisfying the constraint qualification assumption may not be necessary if we only care about the optimal set of values for the weights of the biased coins. But this knowledge might come in handy when exploring other optimization techniques that might be more efficient than the gradient descent method used in the examples illustrated in the previous sections.
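As a practical aside, the constraint qualification can be checked numerically by testing whether the constraint gradients at the candidate solution are linearly independent. Here is a minimal sketch; the gradient vectors are assumed to have been computed already:

```python
import numpy as np

def constraint_qualification_holds(constraint_gradients):
    """Check the constraint qualification at a candidate point.

    `constraint_gradients` holds one gradient vector per constraint,
    evaluated at the optimal point. The qualification holds when the
    vectors are linearly independent, i.e. the matrix they form has full
    row rank. With a single constraint, this reduces to requiring that
    the gradient is not the zero vector."""
    matrix = np.vstack(constraint_gradients)
    return np.linalg.matrix_rank(matrix) == matrix.shape[0]

# Dependent gradients fail the check; independent gradients pass.
print(constraint_qualification_holds([[1.0, 2.0], [2.0, 4.0]]))  # False
print(constraint_qualification_holds([[1.0, 2.0], [0.0, 1.0]]))  # True
```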

Other Optimization and Root-Finding Methods

In the examples above, the gradient descent method requires tens of thousands of iterations to converge to a solution. In some cases, it even takes hundreds of thousands of iterations. While I tried to optimize it a bit using a variable step size, it still seems like a large number of iterations. The numerics library I am using comes with some alternative optimization methods that can be used in place of my gradient descent implementation. These alternatives can find the minimum of the cost function in far fewer iterations. See the tables below.

Example with 3 coin tosses, scoring function A:

Figure 41

Example with 3 coin tosses, scoring function B:

Figure 42

Example with 4 coin tosses, scoring function A:

Figure 43

Example with 4 coin tosses, scoring function B:

Figure 44

While one method might be able to arrive at a solution in fewer iterations than another method, it is important to remember that not all iterations are created equal. An iteration in one method might be computationally more expensive than an iteration in another. When taking runtime performance into consideration, the best method might not be the one with the fewest iterations. Also, finding the minimum of a cost function is not the only way to solve for the unknowns. Keep in mind, the ultimate goal is to find the parameters at which the gradient of the Lagrangian function is equal to the zero vector:

Figure 45

The critical points of the Lagrangian function may occur at saddle points instead of at minima or maxima. Some root-finding methods can solve this type of problem directly without relying on a cost function. With the numerics library I am using, I was able to apply Broyden’s method to the gradient of the Lagrangian function and solve the coin toss problem without ever forming a cost function.
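To illustrate the root-finding idea with a different toolchain, here is a sketch that uses SciPy’s broyden1 routine instead of the numerics library used in this post, applied to a small toy problem (minimize x² + y² subject to x + y = 1) rather than the coin toss model. Broyden’s method finds a root of the Lagrangian’s gradient directly, with no cost function involved:

```python
import numpy as np
from scipy.optimize import broyden1

def lagrangian_gradient(v):
    """Gradient of the Lagrangian for the toy problem: minimize x^2 + y^2
    subject to x + y - 1 = 0, with unknowns v = (x, y, lambda)."""
    x, y, lam = v
    return np.array([
        2 * x + lam,   # dL/dx
        2 * y + lam,   # dL/dy
        x + y - 1,     # dL/dlambda (the constraint itself)
    ])

# The expected root is x = y = 0.5 and lambda = -1, which is a saddle
# point of the Lagrangian rather than a minimum or maximum.
solution = broyden1(lagrangian_gradient, [0.0, 0.0, 0.0], f_tol=1e-10)
print(solution)
```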

Accompanying source code is available on GitHub.
