Performance Tuning for the Coin Toss Model
I wrapped up the last post expressing a desire to study the approximation technique using larger models of the coin toss game. Up until now, I was using a naive implementation of the computation method, one that was too crude and too slow for larger models. In this post, I demonstrate an alternative approach with a much better performance profile. I also describe a simple technique that can reduce the number of iterations required when applying the hill climbing algorithm.
Optimized Computation Method
To get around the performance issues referenced above, I decided to implement the computation method using an entirely different approach. I think the best way to describe this new approach is to work through an example. Suppose we have a model of the coin toss game with four coin toss events per round:
Just like we did in some of the previous posts, we can create a graphical representation of the coin toss model using a state diagram:
This diagram illustrates the starting state and all of the possible state transitions. In this case, we use one set of variables to represent state transitions away from the initial state and another set of variables to represent state transitions towards the initial state. Here is the relationship between these two sets of variables:
We want to precompute these values ahead of time and look them up later instead of computing them on the fly. Since our model is symmetrical by definition, we can assume the coin in the initial state is always a fair coin:
For this example, let’s choose some arbitrary values for the remaining state transitions:
Now we want to create some lookup tables for our state transition values. We’ll use these lookup tables when computing the likelihood of landing on each one of the states after each toss of the coin. Let’s create two arrays and populate them with these values:
These are our lookup tables. Notice that these arrays are padded with three extra elements, one in the front and two in the back. You’ll see in a minute why this is necessary. We are using pointers to refer to the first value in each array. Now let’s do some pointer arithmetic:
One pointer is incremented, while the other is decremented. This effectively shifts these arrays, one to the right and one to the left. We want to align them in a way that makes it easy to perform our computations later on. Here is how our lookup tables appear after the shift:
By definition, every round of the coin toss game starts out in the zero state, so we know with 100% certainty what state we’re going to be in before the first coin toss. Thus, we can represent our initial state vector with the following array:
This array is allocated to the same size as the arrays used for the state transition lookup tables. And like before, we use a pointer to refer to the first value in the array. Now let’s do some more pointer arithmetic:
In this case, we create a pair of pointers that point to two different elements of the same array. We can treat these two pointers as if they were pointers to two different arrays, even though they’re really not. In essence, this is what our two arrays look like:
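To make the pointer arithmetic concrete, here is a minimal C sketch of the layout described above. The array names, sizes, and offsets are my own stand-ins rather than the original code; the point is only the shifting of the two lookup-table pointers in opposite directions and the pair of offset pointers into the single state vector array.

```c
#include <stdio.h>

int main(void)
{
    /* Two padded lookup tables for a four-toss example; the pad slots stay
       zero so reads that fall off either end of the real data contribute
       nothing. */
    double away[8]   = { 0 };       /* transitions away from the zero state */
    double toward[8] = { 0 };       /* transitions toward the zero state    */

    /* Pointers to the first value in each table ... */
    const double *a = &away[1];
    const double *b = &toward[1];

    /* ... one decremented and one incremented, which shifts the two views
       in opposite directions so that a[s] and b[s] line up with the
       transitions into state s from below and from above. */
    a = a - 1;
    b = b + 1;

    /* The state vector starts with all of the probability mass in the zero
       state; two offset pointers into the same array behave like two
       separate arrays holding each state's left and right neighbors. */
    double state[8] = { 0 };
    state[1] = 1.0;
    const double *lo = &state[0];   /* lo[s]: old value of state s - 1 */
    const double *hi = &state[2];   /* hi[s]: old value of state s + 1 */

    printf("%f %f %f %f\n", a[0], b[0], lo[1], hi[0]);
    return 0;
}
```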
Now that we have arrays representing our state transition values and our initial state vector, along with pointers that properly align the data, we can compute the values of the state vector after each toss of the coin. Here is the algorithm:
Notice that we always double the value in the zero offset at the end of each iteration of the outer loop. This is necessary because we are only solving half the problem. Since we know our model is symmetrical, we don’t bother to calculate values for transitions into the negative states. They are always a mirror image of the values for the positive states. However, we do need to consider the negative state that transitions into the zero state. It is the same as the positive state that transitions into the zero state, hence the doubling.
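Putting the pieces together, here is a self-contained C sketch of the algorithm as I read the description above, using placeholder transition values for a four-toss model. The names, index layout, and ping-pong buffers are assumptions of my own; the essential parts are the shifted pointers in the inner loop and the doubling of the zero offset at the end of each outer iteration.

```c
#include <stdio.h>
#include <string.h>

#define TOSSES 4
#define SIZE   (TOSSES + 4)   /* TOSSES + 1 states, one pad in front, two in back */

int main(void)
{
    /* Padded lookup tables; the value for state s lives at index s + 1, and
       the pad slots stay zero so out-of-range reads contribute nothing. */
    double p_away[SIZE]   = { 0 };   /* chance of stepping away from state zero   */
    double p_toward[SIZE] = { 0 };   /* chance of stepping back toward state zero */

    p_away[1] = 0.5;                      /* state 0: always a fair coin     */
    p_away[2] = 0.6;  p_toward[2] = 0.4;  /* state 1: arbitrary example bias */
    p_away[3] = 0.7;  p_toward[3] = 0.3;  /* state 2: arbitrary example bias */
    p_away[4] = 0.8;  p_toward[4] = 0.2;  /* state 3: arbitrary example bias */

    /* Shifted lookup-table pointers: one decremented, one incremented, so
       away[s] reads the value for state s - 1 and toward[s] for state s + 1. */
    const double *away   = &p_away[1]   - 1;
    const double *toward = &p_toward[1] + 1;

    /* Two state vectors used ping-pong style; all of the probability mass
       starts in the zero state. */
    double bufA[SIZE] = { 0 }, bufB[SIZE] = { 0 };
    double *cur = &bufA[1], *nxt = &bufB[1];
    cur[0] = 1.0;

    for (int t = 1; t <= TOSSES; t++) {        /* outer loop: one pass per toss */
        const double *lo = cur - 1;            /* lo[s]: old value of state s - 1 */
        const double *hi = cur + 1;            /* hi[s]: old value of state s + 1 */

        for (int s = 0; s <= TOSSES; s++)      /* inner loop: one pass per state */
            nxt[s] = away[s] * lo[s] + toward[s] * hi[s];

        nxt[0] *= 2.0;                         /* mirror-image flow into state zero */

        double *tmp = cur; cur = nxt; nxt = tmp;      /* swap the buffers   */
        memset(nxt - 1, 0, SIZE * sizeof(double));    /* clear before reuse */
    }

    for (int s = 0; s <= TOSSES; s++)          /* non-negative states; negatives mirror */
        printf("state %d: %f\n", s, cur[s]);
    return 0;
}
```

Compiled and run, this prints the probabilities of the non-negative states after the final toss; the negative states are their mirror image.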
Here are the computed values after outer loop iteration #1:
Here are the computed values after outer loop iteration #2:
Here are the computed values after outer loop iteration #3:
Here are the computed values after outer loop iteration #4:
Once the loop terminates, we have the probabilities of landing on each state after the fourth and final toss of the coin. States with a zero value are never terminal states. The states with non-zero values are the terminal states. Here are the relevant values:
Is this the optimal computation method? Probably not. For even-numbered coin tosses, the odd-numbered states are always zero. For odd-numbered coin tosses, the even-numbered states are always zero. We could probably use this knowledge to optimize the inner loop even further, but it might make the algorithm a little more complicated.
Consider also that each iteration of the inner loop is independent of the others. This means that they can be run out of order or in parallel. Since we’re using pointers to reference elements of the state transition lookup tables and state vector arrays, we could easily modify our program to use hardware-specific SIMD intrinsics such as those for the SSE or AVX instruction sets. This would allow us to parallelize the computations in the inner loop.
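For what it's worth, here is a hedged sketch of how the inner loop could be expressed with AVX intrinsics for double-precision values. The function and parameter names are hypothetical, the pointers are assumed to reference non-overlapping memory, and the scalar loop at the end handles whatever elements are left over; this is one way the idea could look, not the original implementation.

```c
#include <immintrin.h>

/* Vectorized inner loop: next[i] = a[i] * lo[i] + b[i] * hi[i],
   processing four states per step with 256-bit AVX registers. */
void update_states_avx(const double *a, const double *lo,
                       const double *b, const double *hi,
                       double *next, int count)
{
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        __m256d x = _mm256_mul_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(lo + i));
        __m256d y = _mm256_mul_pd(_mm256_loadu_pd(b + i), _mm256_loadu_pd(hi + i));
        _mm256_storeu_pd(next + i, _mm256_add_pd(x, y));
    }
    for (; i < count; i++)                       /* scalar remainder */
        next[i] = a[i] * lo[i] + b[i] * hi[i];
}
```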
Counting the Floating-Point Operations
The example we worked through in the computation method described above is small enough that we can easily count the number of floating-point operations required to compute the result. Here is a table showing the count of all addition and multiplication operations needed to complete each iteration of the outer loop:
We can add up the total number of operations for each iteration of the outer loop to arrive at the total number of operations necessary to reach the final result:
This tells us how many operations are required for a model with four coin toss events. But what if we’re using a much larger model of the coin toss game? The following table shows how to compute the number of operations required for any iteration of the outer loop, regardless of the size of the coin toss model:
Now we need to add up the number of operations used in each iteration to get the total number of operations required to compute the final result:
This formula tells us the total number of floating-point operations necessary to calculate the final result for a coin toss model with an arbitrary number of coin toss events. But I think it might be convenient to represent this in algebraic form instead of summation form. Consider the following relationship:
This is the formula for the triangular number sequence. We can use this relationship to replace the summation above and present our solution in the following algebraic form:
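For reference, here is the triangular number identity, together with what it does to a generic per-iteration count that grows linearly with the iteration index ($\alpha$ and $\beta$ are stand-in constants of mine, not the exact figures from the table above):

$$\sum_{k=1}^{n} k = \frac{n(n+1)}{2}, \qquad \sum_{k=1}^{n} \left( \alpha k + \beta \right) = \alpha \, \frac{n(n+1)}{2} + \beta n$$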
As you can see, this indicates that our algorithm has quadratic time complexity. But this doesn’t tell us everything. How we access memory and whether or not we make efficient use of the cache can have an impact on performance regardless of the number of floating-point operations required to complete a task.
Comparison to Matrix Multiplication
In my earlier post titled Generalizing the Coin Toss Markov Model, we investigated a computation method based on the product of state vectors and state transition matrices. I am curious how this computation method compares to the optimized computation method analyzed in the previous section. For this analysis, we’ll use the following notation:
We can think of the row vector as a matrix with a single row. Let’s start by first counting the number of operations needed to compute the product of two square matrices and the number of operations needed to compute the product of a row vector and a square matrix:
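Using the standard schoolbook method, and writing $m$ for the matrix dimension (my notation here, not necessarily the one from the earlier post), each output element takes $m$ multiplications and $m - 1$ additions, which gives the following elementary operation counts:

$$\underbrace{m^2 \, (2m - 1)}_{\text{square matrix} \times \text{square matrix}} \qquad\qquad \underbrace{m \, (2m - 1)}_{\text{row vector} \times \text{square matrix}}$$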
The number of operations required depends on the size of the matrix. And the size of the matrix depends on the number of coin toss events we are modeling. But the number of matrix operations we need to compute also depends on the number of coin toss events. Recall the following formula from our generalized coin toss Markov model:
If we are modeling a system with four coin toss events, we can expand the above as follows:
In this case, there are a total of four matrix operations. Each matrix operation contains many elementary operations. We want to count the number of elementary operations. Since matrix multiplication is associative, we’ll get the same result whether we evaluate the expression from left to right or right to left—assuming we don’t have any floating-point rounding errors. But the number of elementary operations required to evaluate this expression depends on the order in which we perform the evaluation. Here is the analysis for right-associative evaluation:
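Sketching this in general terms (with $m$ for the matrix dimension, which itself grows linearly with the number of coin toss events $n$; this is my notation, not necessarily the earlier post's): grouping from the right means the first $n - 1$ products are matrix-matrix products, and only the last one involves the row vector:

$$(n - 1)\, m^2 (2m - 1) \;+\; m (2m - 1)$$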
Using this information, we can express the total number of floating-point operations as a function of the number of coin toss events:
Thus, our matrix product has a quartic polynomial time complexity when using right-associative evaluation. We can use this formula to compute the total number of elementary operations needed for a model with four coin toss events:
This is about two orders of magnitude more than the number of operations required when using the optimized computation method. And the gap is even worse for models with a higher number of coin toss events. But the difference is not as bad if we evaluate the matrix product from left to right instead of right to left. Here is the analysis for left-associative evaluation:
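In the same terms, grouping from the left keeps the running result as a row vector, so all $n$ products are vector-matrix products:

$$n \cdot m \, (2m - 1)$$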
Using this information, we can express the total number of floating-point operations as a function of the number of coin toss events:
Accordingly, our matrix product has cubic polynomial time complexity when using left-associative evaluation. We can use this formula to compute the total number of elementary operations needed for a model with four coin toss events:
This is a better figure, but it’s still more than ten times the number of operations required when using the optimized computation method. And the gap still gets worse for models with a higher number of coin toss events. For each computation method, we can plot the number of operations required as a function of the number of coin toss events to get an idea of what the growth rate of each method looks like:
Keep in mind that the vertical axis has a logarithmic scale. As you can see, the optimized computation method scales much better than methods using matrix multiplication. And perhaps this is not surprising when you consider that, in the generalized coin toss model, the state transition matrix is a sparse matrix that contains mostly zeros. Performing addition and multiplication operations against those zero values is a waste of computation resources.
Comparison to Algebraic Manipulation
Suppose we have a set of algebraic formulas that we can use to compute the expected outcome of the coin toss game given a set of biases. We might be able to calculate the results with fewer operations than any of the methods described above. In an earlier post titled Estimating the Weights of Biased Coins, we derived a set of equations to compute the outcome for a model of the coin toss game with four coin toss events. Let’s do something similar here:
With four coin toss events, there are sixteen possible coin toss sequences. The table above shows the probability of each one, along with the terminal state after the final coin toss. We can express the chance of ending up in each one of the final states with the following:
Remember, the coin in the initial state is always a fair coin. The formulas above can be simplified to contain fewer operations:
You might want to stop here and check my work to make sure I did this correctly. It’s easy to make a mistake. With these formulas, we can now count all the addition and multiplication operations to get the total number of floating-point operations needed to compute the results:
Adding up the total number of operations for each result, we find that, at least in the case where there are four coin toss events, there are fewer operations required than with any of the computation methods examined in the previous sections:
It’s not clear to me how to generalize this for a model with an arbitrary number of coin toss events. However, it is clear to me that the probability of getting a sequence of all heads or all tails, regardless of the number of coin tosses, can be expressed like this:
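As a sketch of what I mean, using my own notation (a fair coin in the initial state, and $p_i$ for the bias toward stepping further away from the initial state when the game is $i$ steps away from it), the probability of tossing $n$ straight heads is just a running product:

$$P(\text{all heads}) = \frac{1}{2}\, p_1\, p_2 \cdots p_{n-1}$$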
This computation has linear time complexity. The number of operations required is directly proportional to the number of coin toss events. Knowing this, we can assume that an approach using predetermined algebraic formulas has at least a linear growth rate. That’s a best-case scenario. But realistically, it’s probably not that good. Nonetheless, this approach still might have a better performance profile than the optimized computation method we detailed earlier. It might be worth exploring this idea further.
Performance Bottleneck
The challenge with using the algebraic approach is coming up with the formulas for models with a high number of coin toss events. These formulas also need to be evaluated in a manner that has an acceptable performance profile. In some of the previous posts, I used a computer algebra library to build up expression trees representing the algebraic formulas. These expression trees were then mapped to a different expression tree format and compiled into executable functions.
This method worked beautifully for smaller models of the coin toss game. But the compilation step turned out to be one of the performance bottlenecks preventing this method from being used for larger models of the coin toss game. Furthermore, the executable functions generated by the compilation step didn’t run nearly as fast as the optimized computation method. I was also running into stack overflow errors when attempting to solve for larger models. It was unusable for models with more than about twenty coin toss events. I haven’t looked too deeply into it yet, but I think I might know what the problem is. Consider the following expression:
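The particular numbers do not matter for the illustration; any four-term sum will do, for example:

$$x_1 + x_2 + x_3 + x_4$$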
This is just a sum of four numbers. For this formula, the expression tree generated by the computer algebra library would look like this:
This remains a very flat tree structure regardless of how many numbers we are adding together. Ideally, the sum would be compiled as a loop with an accumulator. But that’s not what happens. In preparation for the compilation step, this expression tree gets mapped to a binary expression tree format that looks like this:
For complex expressions with many operands, this can be a very deeply nested tree structure. And that’s where I think the problem lies. This deep tree structure might be what was causing the compilation step to take a long time. It might also explain why the generated functions ran too slowly and why the evaluation of larger models was exhausting the call stack.
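To make the contrast concrete (the nesting direction here is my own guess; the library may associate the other way): the flat form keeps all of the operands under a single sum node, while the binary form chains them pairwise,

$$x_1 + x_2 + x_3 + x_4 \;\;\longrightarrow\;\; \big( (x_1 + x_2) + x_3 \big) + x_4$$

so a sum of $N$ operands turns into a chain that is $N - 1$ levels deep, and the nesting depth grows linearly with the size of the formula.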
Hill Climbing with Descending Step Sizes
In some of the examples we looked at in the previous posts, we used a hill climbing algorithm as an optimization technique to find parameters that minimize a cost function. In all of these examples, we used a fixed step size. In the last post, we used a step size that would deliver an accuracy of five decimal places:
Consider the examples illustrated for the linear polynomial method in the previous post. When we applied the hill climbing algorithm with this value as the step size, the optimization task took tens of thousands of iterations to complete. We can significantly reduce the number of iterations required by using a series of tiered step sizes arranged in descending order:
The objective here is to run the hill climbing algorithm to completion using the first step size. Once complete, the process is repeated with the next step size using the result of the previous run as the starting point. We keep repeating this process until we’ve run the algorithm to completion for the smallest step size. Reproducing the linear polynomial examples from the previous post, here are the paths taken by the hill climbing algorithm when using descending step sizes:
Except for the last one, which drifts off into a local minimum, all paths finish with the same result. And this result is the same one we found when using a fixed step size. But when using descending step sizes, the results converge in far fewer iterations. Here is a comparison:
The difference is off the charts. Beginning the hill climbing method with a large step size allows the process to move quickly towards the target area, while the smaller step sizes enable it to zero in on a precise result. The benefits are clear. And this technique might be useful for other optimization methods as well.
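Here is a small, self-contained C sketch of the descending step size idea. The two-parameter cost function, the tier values, and the coordinate-wise hill climber are all placeholders of my own; the point is only the outer loop that reruns the climber with progressively smaller steps, each run starting from the previous result.

```c
#include <stdio.h>
#include <math.h>

#define DIM 2

/* Toy cost function standing in for the real one. */
static double cost(const double x[DIM])
{
    return pow(x[0] - 0.3, 2.0) + pow(x[1] - 0.7, 2.0);
}

/* One complete run of a basic coordinate-wise hill climber with a fixed
   step size; returns the number of passes it took to stop improving. */
static int hill_climb(double x[DIM], double step)
{
    int passes = 0;
    for (int improved = 1; improved; passes++) {
        improved = 0;
        for (int i = 0; i < DIM; i++) {
            const double original = x[i];
            const double base = cost(x);

            x[i] = original + step;                  /* try a step up   */
            const double up = cost(x);
            x[i] = original - step;                  /* try a step down */
            const double down = cost(x);

            if (up < base && up <= down)   { x[i] = original + step; improved = 1; }
            else if (down < base)          { x[i] = original - step; improved = 1; }
            else                             x[i] = original;
        }
    }
    return passes;
}

int main(void)
{
    /* Tiered step sizes in descending order; each run starts where the
       previous one finished. The specific tiers are examples only. */
    const double steps[] = { 0.1, 0.01, 0.001, 0.0001, 0.00001 };
    double x[DIM] = { 0.0, 0.0 };                    /* initial guess */

    for (size_t i = 0; i < sizeof steps / sizeof steps[0]; i++)
        printf("step %.5f: %d passes\n", steps[i], hill_climb(x, steps[i]));

    printf("result: (%f, %f)\n", x[0], x[1]);
    return 0;
}
```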
Example with 20 Coin Tosses
With the performance enhancements outlined in the sections above, we can now apply the polynomial approximation technique to larger models of the coin toss game. Using a model of the coin toss game with twenty coin toss events, let’s use the quadratic polynomial approximation technique described in the previous post to find a set of weights that approximate the following target distribution:
Starting with a set of fair coins for the initial guess, we can apply the hill climbing algorithm to find an optimal set of weights for the biased coins. The process completes after 25 iterations. Here are the results:
The results are computed almost instantaneously. Without the optimized computation method, these calculations would have taken about twenty seconds to run on my current hardware. And without the descending step size optimization, this task would have taken about thirty minutes to complete.
Example with 50 Coin Tosses
Let’s do another example. In this one, we’ll use the cubic polynomial approximation technique on a model with fifty coin toss events. A model this size would be impractical or impossible to evaluate without the performance optimizations chronicled in this post. Here is the target distribution we want to approximate:
Starting with a set of fair coins for the initial guess, we can apply the hill climbing algorithm to find an optimal set of weights for the biased coins. The process completes after 1,746 iterations. Here are the results:
This is a pretty good approximation, but it is not the best result that can be found with a cubic polynomial. When using fair coins for the initial guess, the hill climbing method takes a route that appears to lead to a local minimum. Also, notice that the number of iterations required is two orders of magnitude more than in the other examples in the last two sections. I have some ideas for improving this technique further, but I will save that discussion for another time.