Jim Killingsworth

Performance Tuning for the Coin Toss Model

I wrapped up the last post expressing a desire to study the approximation technique using larger models of the coin toss game. Up until now, I was using a naive implementation of the computation method to perform the calculations—an implementation that was crude and too slow for larger models. In this post, I demonstrate an alternative approach that has a much better performance profile. I also describe a simple technique that can be used to reduce the number of iterations required when applying the hill climbing algorithm.

Optimized Computation Method

To get around the performance issues referenced above, I decided to implement the computation method using an entirely different approach. I think the best way to describe this new approach is to work through an example. Suppose we have a model of the coin toss game with four coin toss events per round:

Figure 1

Just like we did in some of the previous posts, we can create a graphical representation of the coin toss model using a state diagram:

Figure 2

This diagram illustrates the starting state and all of the possible state transitions. In this case, we use one set of variables to represent state transitions away from the initial state and another set of variables to represent state transitions towards the initial state. Here is the relationship between these two sets of variables:

Figure 3

We want to precompute these values ahead of time and look them up later instead of computing them on the fly. Since our model is symmetrical by definition, we can assume the coin in the initial state is always a fair coin:

Figure 4

For this example, let’s choose some arbitrary values for the remaining state transitions:

Figure 5

Now we want to create some lookup tables for our state transition values. We’ll use these lookup tables when computing the likelihood of landing on each one of the states after each toss of the coin. Let’s create two arrays and populate them with these values:

Figure 6

These are our lookup tables. Notice that these arrays are padded with three extra elements, one in the front and two in the back. You’ll see in a minute why this is necessary. We are using pointers to refer to the first value in each array. Now let’s do some pointer arithmetic:

Figure 7

One pointer is incremented, while the other is decremented. This effectively shifts these arrays, one to the right and one to the left. We want to align them in a way that makes it easy to perform our computations later on. Here is how our lookup tables appear after the shift:

Figure 8

By definition, every round of the coin toss game starts out in the zero state, so we know with 100% certainty what state we’re going to be in before the first coin toss. Thus, we can represent our initial state vector with the following array:

Figure 9

This array is allocated to the same size as the arrays used for the state transition lookup tables. And like before, we use a pointer to refer to the first value in the array. Now let’s do some more pointer arithmetic:

Figure 10

In this case, we create a pair of pointers that point to two different elements of the same array. We can treat these two pointers as if they were pointers to two different arrays, even though they’re really not. In essence, this is what our two arrays look like:

Figure 11

Now that we have arrays representing our state transition values and our initial state vector, along with pointers that properly align the data, we can compute the values of the state vector after each toss of the coin. Here is the algorithm:

Figure 12

Notice that we always double the value in the zero offset at the end of each iteration of the outer loop. This is necessary because we are only solving half the problem. Since we know our model is symmetrical, we don’t bother to calculate values for transitions into the negative states. They are always a mirror image of the values for the positive states. However, we do need to consider the negative state that transitions into the zero state. It is the same as the positive state that transitions into the zero state, hence the doubling.
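To make this concrete, here is a minimal C sketch of the procedure, assuming the conventions described above: one table holds the probability of stepping away from the zero state, the other the probability of stepping back toward it, and both are padded so the shifted pointers can read one element past either end of the live data. The bias values are arbitrary stand-ins for the ones chosen in Figure 5, and the exact padding arrangement is my own; it mirrors the figures in spirit rather than reproducing them exactly.

```c
#include <stdio.h>
#include <string.h>

#define N 4  /* coin toss events per round */

int main(void)
{
    /* Padded buffers: one extra slot in front and two in back, so the
       shifted pointers below can read past either end of the live data
       without special-casing the boundaries. */
    double away_buf[N + 3]   = { 0 };  /* chance of stepping away from zero */
    double toward_buf[N + 3] = { 0 };  /* chance of stepping back toward it */
    double v_buf[N + 3]      = { 0 };  /* state vector, nonnegative states  */
    double w_buf[N + 3]      = { 0 };  /* scratch for the next state vector */

    /* Hypothetical biases: state 0 holds the fair coin by definition. */
    double bias[N] = { 0.5, 0.6, 0.7, 0.8 };

    double *away = away_buf + 1, *toward = toward_buf + 1;
    for (int s = 0; s < N; s++) {
        away[s]   = bias[s];
        toward[s] = 1.0 - bias[s];
    }

    /* Shift one pointer up and the other down so that pa[s] reads
       away[s - 1] and pt[s] reads toward[s + 1]; the inner loop can
       then index all three arrays uniformly. */
    double *pa = away - 1;
    double *pt = toward + 1;
    double *v = v_buf + 1, *w = w_buf + 1;
    v[0] = 1.0;  /* every round starts in the zero state */

    for (int t = 1; t <= N; t++) {        /* outer loop: coin tosses */
        for (int s = 0; s <= t; s++)      /* inner loop: states      */
            w[s] = pa[s] * v[s - 1] + pt[s] * v[s + 1];
        w[0] *= 2.0;  /* mirror the flow arriving from state -1 */
        memcpy(v_buf, w_buf, sizeof v_buf);
    }

    for (int s = 0; s <= N; s++)
        printf("P(state %d) = %f\n", s, v[s]);
    return 0;
}
```

Swapping two buffer pointers at the end of each outer iteration would avoid the memcpy entirely; the copy is kept here only to make the sketch easier to follow.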

Here are the computed values after outer loop iteration #1:

Figure 13

Here are the computed values after outer loop iteration #2:

Figure 14

Here are the computed values after outer loop iteration #3:

Figure 15

Here are the computed values after outer loop iteration #4:

Figure 16

Once the loop terminates, we have the probabilities of landing on each state after the fourth and final toss of the coin. States with a zero value are never terminal states. The states with non-zero values are the terminal states. Here are the relevant values:

Figure 17

Is this the optimal computation method? Probably not. For even-numbered coin tosses, the odd-numbered states are always zero. For odd-numbered coin tosses, the even-numbered states are always zero. We could probably use this knowledge to optimize the inner loop even further, but it might make the algorithm a little more complicated.

Consider also that each iteration of the inner loop is independent of the others. This means that they can be run out of order or in parallel. Since we’re using pointers to reference elements of the state transition lookup tables and state vector arrays, we could easily modify our program to use hardware-specific SIMD intrinsics such as those for the SSE or AVX instruction sets. This would allow us to parallelize the computations in the inner loop.
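As a rough sketch of what that might look like, here is the inner loop from the example above rewritten with AVX intrinsics, processing four states per iteration. This is an illustration rather than a tuned kernel: it assumes the padded buffers from the scalar sketch so the loads at the edges stay in bounds, and alignment, FMA variants, and the doubling of the zero offset are left to the caller.

```c
#include <immintrin.h>

/* Vectorized inner loop: w[s] = pa[s] * v[s-1] + pt[s] * v[s+1],
   computed four states at a time with 256-bit AVX registers. */
static void inner_loop_avx(const double *pa, const double *pt,
                           const double *v, double *w, int count)
{
    int s = 0;
    for (; s + 4 <= count; s += 4) {
        __m256d a  = _mm256_loadu_pd(pa + s);     /* away(s-1)   */
        __m256d t  = _mm256_loadu_pd(pt + s);     /* toward(s+1) */
        __m256d vl = _mm256_loadu_pd(v + s - 1);  /* v(s-1)      */
        __m256d vr = _mm256_loadu_pd(v + s + 1);  /* v(s+1)      */
        __m256d r  = _mm256_add_pd(_mm256_mul_pd(a, vl),
                                   _mm256_mul_pd(t, vr));
        _mm256_storeu_pd(w + s, r);
    }
    for (; s < count; s++)  /* scalar tail for the remainder */
        w[s] = pa[s] * v[s - 1] + pt[s] * v[s + 1];
}
```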

Counting the Floating-Point Operations

The example we worked through in the computation method described above is small enough that we can easily count the number of floating-point operations required to compute the result. Here is a table showing the count of all addition and multiplication operations needed to complete each iteration of the outer loop:

Figure 18

We can add up the total number of operations for each iteration of the outer loop to arrive at the total number of operations necessary to reach the final result:

Figure 19

This tells us how many operations are required for a model with four coin toss events. But what if we’re using a much larger model of the coin toss game? The following table shows how to compute the number of operations required for any iteration of the outer loop, regardless of the size of the coin toss model:

Figure 20

Now we need to add up the number of operations used in each iteration to get the total number of operations required to compute the final result:

Figure 21

This formula tells us the total number of floating-point operations necessary to calculate the final result for a coin toss model with an arbitrary number of coin toss events. But I think it might be convenient to represent this in algebraic form instead of summation form. Consider the following relationship:

Figure 22

This is the formula for the triangular number sequence. We can use this relationship to replace the summation above and present our solution in the following algebraic form:

Figure 23
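For reference, the closed form being used here is the standard identity for triangular numbers:

```latex
\sum_{k=1}^{n} k = \frac{n(n+1)}{2} = \frac{1}{2}n^{2} + \frac{1}{2}n
```

So if each iteration of the outer loop costs a number of operations roughly proportional to its index, the running total for n coin toss events grows on the order of n squared.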

As you can see, this indicates that our algorithm has quadratic time complexity. But this doesn’t tell us everything. How we access memory and whether or not we make efficient use of the cache can have an impact on performance regardless of the number of floating-point operations required to complete a task.

Comparison to Matrix Multiplication

In my earlier post titled Generalizing the Coin Toss Markov Model, we investigated a computation method based on the product of state vectors and state transition matrices. I am curious how this computation method compares to the optimized computation method analyzed in the previous section. For this analysis, we’ll use the following notation:

Figure 24

We can think of the row vector as a matrix with a single row. Let’s start by first counting the number of operations needed to compute the product of two square matrices and the number of operations needed to compute the product of a row vector and a square matrix:

Figure 25
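If the figure isn’t handy, the schoolbook counts are easy to reconstruct: each element of a product takes m multiplications and m − 1 additions, so for dense m × m operands we would expect something like

```latex
\underbrace{m^{2}(2m-1)}_{\text{matrix} \times \text{matrix}}
\qquad
\underbrace{m(2m-1)}_{\text{row vector} \times \text{matrix}}
```

though the figure above may state the same counts in a different form.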

The number of operations required depends on the size of the matrix. And the size of the matrix depends on the number of coin toss events we are modeling. But the number of matrix operations we need to compute also depends on the number of coin toss events. Recall the following formula from our generalized coin toss Markov model:

Figure 26

If we are modeling a system with four coin toss events, we can expand the above as follows:

Figure 27

In this case, there are a total of four matrix operations. Each matrix operation contains many elementary operations. We want to count the number of elementary operations. Since matrix multiplication is associative, we’ll get the same result whether we evaluate the expression from left to right or right to left—assuming we don’t have any floating-point rounding errors. But the number of elementary operations required to evaluate this expression depends on the order in which we perform the evaluation. Here is the analysis for right-associative evaluation:

Figure 28
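In outline, the reasoning goes something like this, assuming a dense m × m transition matrix with m proportional to the number of coin toss events n: evaluating from the right means n − 1 full matrix–matrix products before the final vector–matrix product, so the elementary operation count is on the order of

```latex
(n-1)\,m^{2}(2m-1) + m(2m-1) = O(n^{4}) \quad \text{when } m \propto n
```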

Using this information, we can express the total number of floating-point operations as a function of the number of coin toss events:

Figure 29

Thus, our matrix product has quartic polynomial time complexity when using right-associative evaluation. We can use this formula to compute the total number of elementary operations needed for a model with four coin toss events:

Figure 30

This is about two orders of magnitude more than the number of operations required when using the optimized computation method. And the gap is even worse for models with a higher number of coin toss events. But the difference is not as bad if we evaluate the matrix product from left to right instead of right to left. Here is the analysis for left-associative evaluation:

Figure 31
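By the same accounting, evaluating from the left keeps the running result a row vector the whole way through, so we only ever pay for n vector–matrix products:

```latex
n\,m(2m-1) = O(n^{3}) \quad \text{when } m \propto n
```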

Using this information, we can express the total number of floating-point operations as a function of the number of coin toss events:

Figure 32

Accordingly, our matrix product has cubic polynomial time complexity when using left-associative evaluation. We can use this formula to compute the total number of elementary operations needed for a model with four coin toss events:

Figure 33

This is a better figure, but it’s still more than ten times the number of operations required when using the optimized computation method. And the gap still gets worse for models with a higher number of coin toss events. For each computation method, we can plot the number of operations required as a function of the number of coin toss events to get an idea of what the growth rate of each method looks like:

Figure 34

Keep in mind that the vertical axis has a logarithmic scale. As you can see, the optimized computation method scales much better than methods using matrix multiplication. And perhaps this is not surprising when you consider that, in the generalized coin toss model, the state transition matrix is a sparse matrix that contains mostly zeros. Performing addition and multiplication operations against those zero values is a waste of computational resources.

Comparison to Algebraic Manipulation

Suppose we have a set of algebraic formulas that we can use to compute the expected outcome of the coin toss game given a set of biases. We might be able to calculate the results with fewer operations than any of the methods described above. In an earlier post titled Estimating the Weights of Biased Coins, we derived a set of equations to compute the outcome for a model of the coin toss game with four coin toss events. Let’s do something similar here:

Figure 35

With four coin toss events, there are sixteen possible coin toss sequences. The table above shows the probability of each one, along with the terminal state after the final coin toss. We can express the chance of ending up in each one of the final states with the following:

Figure 36

Remember, the coin in the initial state is always a fair coin. The formulas above can be simplified to contain fewer operations:

Figure 37

You might want to stop here and check my work to make sure I did this correctly. It’s easy to make a mistake. With these formulas, we can now count all the addition and multiplication operations to get the total number of floating-point operations needed to compute the results:

Figure 38

Adding up the total number of operations for each result, we find that, at least in the case where there are four coin toss events, there are fewer operations required than with any of the computation methods examined in the previous sections:

Figure 39

It’s not clear to me how to generalize this for a model with an arbitrary number of coin toss events. However, it is clear to me that the probability of getting a sequence of all heads or all tails, regardless of the number of coin tosses, can be expressed like this:

Figure 40
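I’ll hazard a symbolic reconstruction of that idea. Assuming a run of n heads walks through states 0, 1, ..., n − 1, that the coin in state i has heads bias p_i, and that the coin in the zero state is fair, the expression presumably takes a product form along the lines of

```latex
P(\text{all heads}) = \frac{1}{2} \prod_{i=1}^{n-1} p_i
```

with the all-tails case mirroring it by symmetry.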

This computation has linear time complexity. The number of operations required is directly proportional to the number of coin toss events. Knowing this, we can assume that an approach using predetermined algebraic formulas has at least a linear growth rate. That’s a best-case scenario. But realistically, it’s probably not that good. Nonetheless, this approach still might have a better performance profile than the optimized computation method we detailed earlier. It might be worth exploring this idea further.
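As a sanity check on the linear growth claim, here is a small C sketch evaluating such a product in a single pass, one multiplication per coin toss event. The bias array and its indexing convention are hypothetical stand-ins for the model’s weights.

```c
/* Probability of a run of all heads: the fair coin in the zero state
   contributes the factor of one half, and each subsequent toss moves
   one state further from zero. Runs in O(n) time. */
double all_heads_probability(const double *bias, int n)
{
    double p = 0.5;            /* fair coin in the initial state   */
    for (int i = 1; i < n; i++)
        p *= bias[i];          /* heads bias one state further out */
    return p;
}
```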

Performance Bottleneck

The challenge with using the algebraic approach is coming up with the formulas for models with a high number of coin toss events. These formulas also need to be evaluated in a manner that has an acceptable performance profile. In some of the previous posts, I used a computer algebra library to build up expression trees representing the algebraic formulas. These expression trees were then mapped to a different expression tree format and compiled into executable functions.

This method worked beautifully for smaller models of the coin toss game. But the compilation step turned out to be one of the performance bottlenecks preventing this method from being used for larger models of the coin toss game. Furthermore, the executable functions generated by the compilation step didn’t run nearly as fast as the optimized computation method. I was also running into stack overflow errors when attempting to solve for larger models. It was unusable for models with more than about twenty coin toss events. I haven’t looked too deeply into it yet, but I think I might know what the problem is. Consider the following expression:

Figure 41

This is just a sum of four numbers. For this formula, the expression tree generated by the computer algebra library would look like this:

Figure 42

This remains a very flat tree structure regardless of how many numbers we are adding together. Ideally, the sum would be compiled as a loop with an accumulator. But that’s not what happens. In preparation for the compilation step, this expression tree gets mapped to a binary expression tree format that looks like this:

Figure 43

For complex expressions with many operands, this can be a very deeply nested tree structure. And that’s where I think the problem lies. This deep tree structure might be what was causing the compilation step to take a long time. It might also explain why the generated functions ran too slowly and why the evaluation of larger models was exhausting the call stack.
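A toy example makes the suspicion concrete. The node type below is a hypothetical stand-in for the mapped binary trees, but it shows the shape of the problem: recursively evaluating a right-leaning sum of n operands needs a call stack about n frames deep, while the loop-with-accumulator form the compiler ideally would have produced needs constant stack space.

```c
#include <stddef.h>

/* A leaf holds a value; an interior node represents an addition. */
typedef struct Node {
    double value;               /* used when left and right are NULL */
    struct Node *left, *right;  /* non-NULL for an addition node     */
} Node;

/* Evaluating a right-leaning binary tree recurses once per operand,
   which is what exhausts the stack for large expressions. */
static double eval(const Node *n)
{
    if (n->left == NULL)
        return n->value;
    return eval(n->left) + eval(n->right);
}

/* The equivalent loop with an accumulator: constant stack space. */
static double eval_flat(const double *xs, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += xs[i];
    return acc;
}
```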

Hill Climbing with Descending Step Sizes

In some of the examples we looked at in the previous posts, we used a hill climbing algorithm as an optimization technique to find parameters that minimize a cost function. In all of these examples, we used a fixed step size. In the last post, we used a step size that would deliver an accuracy of five decimal places:

Figure 44

Consider the examples illustrated for the linear polynomial method in the previous post. Applying the hill climbing algorithm while using this value as the step size, the optimization task took tens of thousands of iterations to complete. We can significantly reduce the number of iterations necessary by using a series of tiered step sizes arranged in descending order:

Figure 45

The objective here is to run the hill climbing algorithm to completion using the first step size. Once complete, the process is repeated with the next step size using the result of the previous run as the starting point. We keep repeating this process until we’ve run the algorithm to completion for the smallest step size. Reproducing the linear polynomial examples from the previous post, here are the paths taken by the hill climbing algorithm when using descending step sizes:

Figure 46
Figure 47
Figure 48
Figure 49

Except for the last one, which drifts off into a local minimum, all paths finish with the same result. And this result is the same one we found when using a fixed step size. But when using descending step sizes, the results converge in far fewer iterations. Here is a comparison:

Figure 50

The difference is off the charts. Beginning the hill climbing method with a large step size allows the process to move quickly towards the target area, while the smaller step sizes enable it to zero in on a precise result. The benefits are clear. And this technique might be useful for other optimization methods as well.
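To make the procedure concrete, here is a minimal C sketch of hill climbing with descending step sizes. The cost function, the step size tiers, and the two-parameter search space are all hypothetical placeholders; the structure of the outer loop over step sizes is the point.

```c
#include <stdio.h>

#define DIM 2  /* number of parameters being fitted */

/* Hypothetical stand-in for the cost function being minimized. */
static double cost(const double *x)
{
    double a = x[0] - 0.3, b = x[1] - 0.7;
    return a * a + b * b;
}

/* Basic hill climbing at a fixed step size: move to the best
   improving neighbor until no neighbor improves the cost. */
static long climb(double *x, double step)
{
    long iterations = 0;
    for (;;) {
        double best = cost(x);
        int best_i = -1, best_sign = 0;
        for (int i = 0; i < DIM; i++) {
            for (int sign = -1; sign <= 1; sign += 2) {
                x[i] += sign * step;   /* try a neighbor */
                double c = cost(x);
                x[i] -= sign * step;   /* undo the move  */
                if (c < best) {
                    best = c;
                    best_i = i;
                    best_sign = sign;
                }
            }
        }
        if (best_i < 0)
            return iterations;         /* local optimum reached */
        x[best_i] += best_sign * step; /* commit the best move  */
        iterations++;
    }
}

int main(void)
{
    /* Tiered step sizes in descending order; each run starts where
       the previous one finished. */
    double steps[] = { 0.1, 0.01, 0.001, 0.0001, 0.00001 };
    double x[DIM] = { 0.5, 0.5 };  /* initial guess */
    long total = 0;
    for (int k = 0; k < 5; k++)
        total += climb(x, steps[k]);
    printf("x = (%f, %f) after %ld iterations\n", x[0], x[1], total);
    return 0;
}
```

Each call to climb runs the basic algorithm to convergence at one step size, and the next call picks up from wherever the previous one stopped.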

Example with 20 Coin Tosses

With the performance enhancements outlined in the sections above, we can now apply the polynomial approximation technique to larger models of the coin toss game. Using a model of the coin toss game with twenty coin toss events, let’s use the quadratic polynomial approximation technique described in the previous post to find a set of weights that approximate the following target distribution:

Figure 51

Starting with a set of fair coins for the initial guess, we can apply the hill climbing algorithm to find an optimal set of weights for the biased coins. The process completes after 25 iterations. Here are the results:

Figure 52
Figure 53

The results are computed almost instantaneously. Without the optimized computation method, these calculations would have taken about twenty seconds to run on my current hardware. And without the descending step size optimization, this task would have taken about thirty minutes to complete.

Example with 50 Coin Tosses

Let’s do another example. In this one, we’ll use the cubic polynomial approximation technique on a model with fifty coin toss events. A model this size would be impractical or impossible to evaluate without the performance optimizations chronicled in this post. Here is the target distribution we want to approximate:

Figure 54

Starting with a set of fair coins for the initial guess, we can apply the hill climbing algorithm to find an optimal set of weights for the biased coins. The process completes after 1,746 iterations. Here are the results:

Figure 55
Figure 56

This is a pretty good approximation, but it is not the optimal result that can be found with a cubic polynomial. When using fair coins for the initial guess, the hill climbing method takes a route that appears to lead to a local minimum. Also, notice that the number of iterations required is two orders of magnitude more than the other examples in the last two sections. I have some ideas for improving this technique further, but I will save that discussion for another time.

Accompanying source code is available on GitHub.
