Visualizing the Climb up the Hill

The hill climbing algorithm described in my previous post finds the weights of biased coins for a coin toss game in which the distribution of possible outcomes is known. In the example presented, there are many possible solutions. A cost function is used to find a valid solution, and a scoring function is used to narrow down the set of valid solutions to a single result. In this post, I want to look at some visualizations to get a better feel for how the algorithm works.

The Model

We’ll use the same model of the coin toss game described in an earlier post titled Estimating the Weights of Biased Coins. But to be able to plot the value of the cost function for every combination of weights on a two-dimensional illustration, we need to limit the number of coin tosses in each round so that there are only two different weights that can vary. Since the first coin toss always uses a fairly weighted coin, we can model the coin toss game using three flips per round. The Markov model looks like this:

$Figure 1$

In this model, every round starts in the zero state. The coin in the zero state is always a fair coin. The model can be represented in tabular form like this:

$Figure 2$

Since there are three coin tosses per round, with two possible outcomes for each flip, there are a total of eight unique coin toss sequences possible. The following table lists all possible outcomes along with the terminal state and probability formula for each combination:

$Figure 3$

While the model has a total of seven states, there are only four possible terminal states. And since the model is symmetrical, the probability of ending on one of the positive terminal states is equal to the probability of ending on the corresponding negative terminal state:

$Figure 4$

Plugging in the probability formulas, the equations above can be rewritten as follows:

$Figure 5$

Simplifying, these equations can be reduced to the following:

$Figure 6$

Based on the set of equations above, we can state a relationship between the weight of the coin in the +1 state and the weight of the coin in the +2 state:

$Figure 7$

The bias of each coin must be between a minimum value of zero and a maximum value of one. If we know the expected outcome of the coin toss game, the minimum and maximum values might be further constrained. Suppose the coin in the +2 state always lands on heads:

$Figure 8$

The case above shows the minimum value for the weight of the coin in the +1 state when the coin in the +2 state is at the maximum. The opposite extreme occurs when the coin in the +1 state always lands on heads:

$Figure 9$

As demonstrated in my two previous posts, there is a range of valid values that are possible for a given target distribution. There is no unique solution without including an additional constraint such as a scoring function.

Cost Function

A cost function assigns a value to a set of weights based on a target distribution. The value of the cost function determines how close an estimated set of weights is to a valid set of weights for a given probability mass function. Suppose the target distribution looks like this:

The probability mass function illustrated above is symmetrical. The values can also be represented like this:

$Figure 11$

Our cost function must compare the difference between the target distribution and the expected outcome of a given set of weights. In this example, as with the previous post, I want to use the sum of squared errors as the cost function:

$Figure 12$

A value of zero is the most ideal. A zero value means the given weights will have an expected outcome that matches the target distribution. A non-zero value indicates an error relative to the ideal. Here is what the cost function looks like on a three-dimensional surface plot:

This gives a visualization of where the high and low spots are. Since the cost function is used in the context of a hill climbing algorithm, it might be more intuitive to display the values on an inverted surface plot because it makes it look more like a hill:

A surface plot does a nice job of illustrating the curvature of the cost function. For overlaying additional information on top of the cost function, a heatmap might be a more useful alternative. Here is a visualization of the cost function as a heatmap:

In the chart above, the plateau of valid values is superimposed on top of the heatmap as a green line. These are inputs to the cost function that will yield a zero value. As you can see, there is a range of solutions that will yield an expected outcome that matches the target distribution.

Steepest Ascent Hill Climb

Given a target distribution, the steepest ascent hill climbing algorithm can be used to find a valid set of weights starting with an arbitrary initial guess. The steepest ascent method always chooses the most optimal move for every iteration until the cost function is minimized. My previous post describes this algorithm in depth. Let’s consider an initial guess that looks like this:

Starting with the initial guess above, the algorithm converges on the following set of values:

The algorithm terminates after 6,047 iterations. The following trace shows the path it takes from start to finish:

Using the same technique, we can run through several more examples, each with different starting points:

In each example above, the path follows a vertical, horizontal, or diagonal route towards an optimum set of values. Sometimes the route bends in a dog-leg fashion part of the way through. The final set of weights discovered by the hill climbing algorithm differs based on the starting position of the initial guess.

Stochastic Hill Climb

Unlike the steepest ascent hill climbing algorithm used in the previous section, a stochastic hill climbing algorithm does not always choose the best move on each iteration. Instead, it randomly chooses from any move that improves the cost function. In this example, the random decision is weighted based on how much each potential move improves the cost function. Consider the following initial guess:

Starting with the initial guess above, the algorithm converges on the following set of values:

The algorithm terminates after 6,070 iterations. The following trace shows the path it takes from start to finish:

Using the same technique, we can run through several more examples, each with different starting points:

Again, the final set of weights discovered by the hill climbing algorithm varies based on the starting position. But the final results are not the same as those found by the steepest ascent algorithm. In these examples, the trace follows a fairly direct route to the final set of values. Although slightly curved, the routes do not make any sharp changes in direction.

Scoring Function A

A scoring function can be used in addition to a cost function to constrain the valid set of weights to a single optimum value. One scoring function analyzed in my previous post computes the sum of squared differences between the weights and the centermost value. Here is the function:

$Figure 28$

Keep in mind we only want to consider inputs in which the cost function evaluates to zero. We can use a numerical method to find the optimum parameter values that minimize the scoring function when the cost function is zero:

$Figure 29$

Here is a visual representation of the minimization of the scoring function:

And here is the optimum set of weights for this scoring function:

Since the scoring function can compute a score for all combinations of weights, the scoring function can be represented on a surface plot or heatmap in much the same way as the cost function. Here is a visualization of the scoring function as a heatmap:

Notice the optimum value does not fall on the point with the best absolute score. This is because it is constrained to values where the cost function is zero. Here is the same depiction overlaid on a heatmap of the cost function:

In the two illustrations above, the intensity of the plateau line varies based on the score. A better score is depicted with a thicker and more brightly colored line.

Scoring Function B

The scoring function in the previous section is not the only one possible. Another scoring function analyzed in my previous post computes the sum of squared differences between each weight and the weight of its preceding neighbor. Here is the function:

$Figure 34$

$Figure 35$

Here is a visual representation of the minimization of the scoring function:

And here is the optimum set of weights for this scoring function:

In the two illustrations above, the intensity of the plateau line varies based on the score. A better score is depicted with a thicker and more brightly colored line.

Next Steps

Simplifying the model to three coin tosses per round makes it possible to visualize a cost function, scoring function, and path taken by a hill climbing algorithm. It also makes the model easier to reason about. But I still haven’t come up with a general solution to finding the optimum set of weights for a given cost function and scoring function in a model with an arbitrarily large number of coin tosses per round. I still have some ideas I want to explore. I am hoping these visualizations can help aid further analysis.

Accompanying source code is available on GitHub.

The Mod­el

Cost Func­tion

Steep­est As­cent Hill Climb

Sto­chas­tic Hill Climb

Scor­ing Func­tion A

Scor­ing Func­tion B