Solving Multi-Armed Bandits

Using the $\varepsilon$-greedy approach to solving the multi-armed bandit problem.

Code available on GitHub


Say you have a couple of different website designs and you want to figure out which one will lead to the most user engagement. The canonical, rigorous approach is A/B testing: show one subset of users the existing website, show another subset a candidate design, and use some pre-determined performance metric to decide whether the new design outperformed the existing one with statistical significance. If this sounds a bit like the scientific method, that's because it is; double-blind A/B testing is essentially the gold standard in medical research for evaluating the efficacy of a treatment, for example. It gives you a rigorous way to quantify whether a change is better than the baseline, how much better, and how certain you are of the result.

But for many applications you need to run the test online and compare multiple possible treatments (the various website designs, or different recommender algorithms if you're watching Netflix) to determine which is best. In that case, you really want to minimize the number of times the subjects (i.e. your users) see a bad treatment (i.e. a crappy website redesign). You don't know in advance how well each treatment will perform, you want to optimize some metric quickly, and you don't really care about getting reliable measurements on the sub-optimal treatments. You've stumbled into the world of multi-armed bandits.


The Bandit Problem

Suppose you walk into a casino and there’s a row of one-armed bandits (slot machines). You have a pocket full of money, and want to figure out which bandit is paying out best. This is the metaphor at the core of multi-armed bandits.

The simplest way to test out the bandits and figure out what pays the best is to play each of them one at a time, over and over, and average their payout per pull. This will give you total information — you’ll know how much each bandit pays — but you’ll also spend a lot of time pumping in quarters and pulling arms without regard for having the highest payout. This is what is done with A/B testing — test each of the bandits equally, and figure out what the payout rates are.

The core question of multi-armed bandits is thus:

How can I get the maximum payout out of the bandits?

In data science, bandits appear as a higher-level alternative to basic A/B testing, where we are concerned with the total payout. A/B testing (or A/B/C, or A/B/C/D…) is an effective way to compare multiple options, such as website designs or search algorithms, and find out which is best on a given performance metric. The trouble is that it's online, and your customers are subjects of an experiment. Less effective options may cost you goodwill or revenue, so you want to quickly throw out bad options while continuing to test the better ones. You also don't necessarily care about how bad the worst options are; you only want to find the best one. This is the heart of the multi-armed bandit problem.


Bootstrapping & Confidence Intervals

Much of the analysis here will rely on understanding the payout rate we’ve measured, and how confident we are in that rate, so we will rely on bootstrapping. Bootstrapping is an algorithmic way to compute the uncertainty in a value when we have only sampled a finite number of points from a distribution to compute that value.

The core idea is this: as we draw more points from a distribution, our data set starts to resemble the actual distribution. So if we want to estimate a statistical quantity, we can sample from our pool of data points and compute the quantity on those samples.

In practice, given a data set of $N$ points, we sample randomly from that set to build a sample set of size $N$, with replacement so that we don't just draw the same set over and over again. We then use that sample to compute some statistical property, such as the average of the data set, and we repeat this many times, recomputing the same property on each resample. The Central Limit Theorem tells us that if we do this, the distribution of the computed averages will converge on a normal distribution with mean $\overline{X}$ and standard deviation $\sigma$.

Once we know $\overline{X}$ and the spread $\sigma$ on that $\overline{X}$, we can use the properties of the normal distribution to construct confidence intervals. About 95% of a normal distribution lies within $2 \sigma$ of the mean, so we can say that $X = \overline{X} \pm 2 \sigma$ with 95% confidence.

The challenges, which I won’t spend too much time on here, are in whether the data set is actually representative of the real distribution. We see this a lot in political polls, where there may be some skew in who’s answering the phone, for example. Polling companies then have to unskew their sample by figuring out how the distribution of their respondents compares to the population at large, rescale everything, etc. As with most of data science, it’s 80% data cleaning.

Bootstrapping is pretty straightforward to implement in code. Here's a Python function I used to make the examples below:

import numpy as np

def bootstrap(data_set, n_bootstraps=100):
    """Estimate the mean of data_set and the uncertainty on that mean
    by resampling with replacement n_bootstraps times."""
    avgs = np.zeros(n_bootstraps)

    for i in range(n_bootstraps):
        # Resample the data set with replacement, keeping its original size
        sub_sample = np.random.choice(data_set, size=len(data_set))
        avgs[i] = np.mean(sub_sample)

    # Mean of the bootstrapped averages, and its spread
    return np.mean(avgs), np.std(avgs)

That function takes in a data set and uses bootstrapping to compute its average along with the uncertainty on that average; double the uncertainty to get the 95% confidence interval. It's that simple to implement.
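
As a quick usage example, we could bootstrap the payout rate of a single simulated bandit (the 0.08 rate here is an arbitrary value for illustration):

# Simulate 500 pulls of a bandit that pays $1 with probability 0.08
payouts = (np.random.rand(500) < 0.08).astype(float)

avg, sigma = bootstrap(payouts)
print(f"payout rate = {avg:.3f} +/- {2 * sigma:.3f} (95% confidence)")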

Now that we’ve covered that, we can take a look at a toy multi-armed bandit problem.


The Toy Bandit Problem

Throughout this post, we will use the same toy bandit problem: we create a set of bandits (ten, in this case) and give each a random payout rate between $0$ and $0.1$. We then try three strategies for maximizing our payout, which amounts to figuring out which bandit has the highest payout rate. We assume the bandits all pay \$1 when they pay at all; the adaptation to variable payout amounts is straightforward.
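
A minimal sketch of that setup (the names here are illustrative, not necessarily what the code on GitHub uses):

import numpy as np

n_bandits = 10
# Each bandit pays $1 with a fixed, hidden probability between 0 and 0.1
pay_rates = np.random.uniform(0., 0.1, size=n_bandits)

def pull(bandit):
    """Pull one arm and return the payout: $1 or $0."""
    return 1.0 if np.random.rand() < pay_rates[bandit] else 0.0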

The three strategies are: random sampling, a dropping method I made up as my own attempt at the problem, and the commonly used $\varepsilon$-greedy strategy (strictly speaking, according to Wikipedia, what I actually used was the $\varepsilon$-decreasing strategy).

We evaluate each strategy based on the total payout after five thousand pulls of the arm. Other ways to evaluate it might be the fewest pulls needed to reach a desired payout, or the regret, a measure of how much payout is missed relative to just playing the known highest-paying bandit for every pull.
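
For reference, regret is easy to compute in a simulation like this one, where we actually know the true rates; a minimal sketch:

def regret(payouts, best_rate):
    """Expected payout from always playing the best bandit, minus what
    we actually collected over the same number of pulls."""
    return len(payouts) * best_rate - sum(payouts)

# e.g. regret(all_recorded_payouts, max(pay_rates))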


The Brute Force Bandit Solution

The simplest and perhaps most obvious approach is to pull bandits at random, record the payouts, and use our bootstrapping to estimate the payout rates and their uncertainties as we keep pulling arms. Each bandit arm gets pulled roughly the same number of times, and as we make more pulls the bootstrapped estimates converge on each bandit's payout rate. We can see below that this works pretty well in our example, where the dark line is the bootstrapped average and the shaded area is $2 \sigma$ to either side of it, i.e. the 95% confidence interval. That range isn't particularly important for the random approach, but it will be relevant for the two other approaches we discuss later.
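
In code, this baseline is just a loop of uniformly random pulls, with the bootstrap function from earlier applied to each bandit's payout history; a rough sketch using the toy setup above:

n_pulls = 5000
history = [[] for _ in range(n_bandits)]   # payout record for each bandit
total_payout = 0.0

for _ in range(n_pulls):
    bandit = np.random.randint(n_bandits)  # pick a bandit uniformly at random
    reward = pull(bandit)
    history[bandit].append(reward)
    total_payout += reward

# Bootstrapped payout rate and uncertainty for each bandit
estimates = [bootstrap(np.array(h)) for h in history]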

Payout and uncertainty for our ten bandit example

If we do this, we expect our payout to be the average of all the individual payout rates, since we are selecting the bandits at random. This is indeed what we see. This establishes a baseline for performance — since this approach is just random guessing, anything we come up with needs to do better than this.

Total payout with each pull for the random approach, plotted against the expected payout taken by averaging all the bandit payouts

Remember, the goal of the multi-armed bandit problem is to maximize returns from pulling the levers of bandits with unknown payout rates. To check whether a clever method works well, we should make sure it at least does better than the simplest method, which here is to just keep pulling the bandit arms at random and track the averages. This is a great experimental design if you want to figure out the payout rates for each bandit, but for many applications it is overkill, and you only care about finding the best-paying bandit. Let's look at an approach meant to optimize payout rather than measure the payout rate of every bandit.


Dropping the Lowest Performer

I can't understand something until I try to solve the problem myself, so this is my own attempt at solving the multi-armed bandit problem. It is not standard, and it has a lot of problems, which I'll discuss at the end, but it did actually sorta work in some cases.

The idea here is simple: we want to play the highest-paying bandit most of the time, but we don't know which bandit that is. One thing we can try is to stop pulling the lowest-paying bandit once we're confident it really is the lowest-paying one. To do this, as we learn more about each bandit's payout, we stop playing any bandit whose payout we are pretty confident is lower than every other bandit's. Mathematically, we stop pulling bandit $i$ once we know that $\overline{r}_i + t \sigma_i < \overline{r}_j - t \sigma_j$ for all $j \neq i$, for some confidence threshold $t$ which, in this example, I set to $2$.
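
A sketch of that drop test, given bootstrapped estimates for each bandit still in play (means, sigmas, and active are illustrative names, with active being the set of bandits not yet dropped):

def should_drop(i, means, sigmas, active, t=2.0):
    """Drop bandit i if its optimistic payout (mean + t*sigma) is below
    the pessimistic payout (mean - t*sigma) of every other active bandit."""
    return all(means[i] + t * sigmas[i] < means[j] - t * sigmas[j]
               for j in active if j != i)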

Does this work? Well, yes.

Total payout comparison of dropping versus random selection

In this case, it does noticeably better than random pulling: about 20% better, which is pretty good! The dropping approach had a hard time at first sorting out the lowest-paying bandits, because a few of them are pretty close to each other, but once it started dropping bandits, around pull 3000 or so, the performance improved.

Computed pay rates using the dropping strategy

This approach did work, and in some of my test cases, with a few bandits that were well separated, it was actually the best-performing approach. But it does have a few problems. In my particular implementation, I tossed out the lowest-performing bandits one at a time. This meant we still spent a lot of time sorting out each individual payout rate, as with random selection. This could be ameliorated by checking each bandit's most optimistic payout rate against $\max_j (\overline{r}_j - t \sigma_j)$, so that anything well behind the best-looking bandit gets tossed, instead of discarding only the biggest laggard each time.

The biggest risk with this sort of discarding approach, though, is that a statistical anomaly may sometimes throw out the highest payer, and there’s no mechanism to bring it back. What we really want is an approach that keeps exploring, even if just a little bit, to make sure we haven’t prematurely decided on a bandit with a lucky run or thrown out the best bandit after an unlucky run. This balancing of exploration with exploitation is what $\varepsilon$-greedy optimization is built for.


$\varepsilon$-Greedy Optimization

As we said before, our goal with the multi-armed bandit problem is to maximize payout, not measure the payout rate for every bandit. It is sufficient for this problem to just figure out which bandit is paying out the most. In the random model above, we basically assumed we knew nothing about the highest-paying bandit. We therefore could not optimize our payout because we selected the bandit using no prior information.

However, we do have some prior information: the bootstrapped statistics on the payout rates, which converge (if slowly) to the actual rates. This suggests thinking like Bayesians, where we have priors and fold that information into our bandit selection. Ideally, if we knew the highest-paying bandit, we'd just keep pulling that lever. So we want an approach that will usually pull what we think is the highest-paying bandit. But we don't start off knowing which bandit pays best, so we need to pick some bandits at random to get more information about their payout rates and actually find the highest-paying one.

$\varepsilon$-greedy algorithms balance exploration with exploitation by preferring to pull the lever of what they think is the highest-paying bandit, while occasionally picking a bandit at random to find out whether another one pays better. To do this, we select some value $0 < \varepsilon < 1$ and with every pull draw a random number between $0$ and $1$: if that number is greater than $\varepsilon$, we pull the bandit with the highest expected payout (in my implementation, the one with the maximum optimistic rate $\overline{r} + 2 \sigma$, which is extra greedy; you could be less greedy and go for the maximum $\overline{r}$); otherwise we select a bandit at random.

For my implementation, I put an exploration schedule on $\varepsilon$: after an initial period of purely random pulls to get a feel for each bandit, the selection starts out as a coin toss between the best-looking bandit and a random one, and $\varepsilon$ then tapers off until we are only playing the apparent highest-payout bandit. So there is an initial period of strong exploration, and exploration then winds down once we think we've found the best bandit.
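
A sketch of one pull under this scheme; the linear taper and the warm-up length here are illustrative choices, not necessarily what the GitHub code does:

def epsilon_greedy_pull(pull_number, means, sigmas, n_pulls,
                        warmup=500, eps_start=0.5):
    """Explore with probability epsilon, otherwise exploit the bandit
    with the highest optimistic payout (mean + 2*sigma)."""
    if pull_number < warmup:
        # Initial period of purely random exploration
        return np.random.randint(len(means))

    # Taper epsilon linearly from eps_start down to zero
    eps = max(0.0, eps_start * (1 - (pull_number - warmup) / (n_pulls - warmup)))

    if np.random.rand() < eps:
        return np.random.randint(len(means))                               # explore
    return int(np.argmax(np.asarray(means) + 2 * np.asarray(sigmas)))      # exploit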

Does this work? You bet your ass it does.

$\varepsilon$-greedy wins the day

For this particular case, the greedy algorithm more than doubles the payout of random pulling. The exact performance will depend on the relative payout rates, how many bandits are in the system, etc., but in this case the $\varepsilon$-greedy strategy is much, much better than the other two approaches.

We achieve this much higher payout because the $\varepsilon$-greedy strategy stops playing suboptimal bandits quickly, as we can see from how the payout estimates accumulate:

Computed pay rates using the $\varepsilon$-greedy strategy

Notice how the algorithm very quickly stops pulling the suboptimal bandits, and only bandits 1 (the optimal, $r = .095$), 5 ($r = .081$), and 7 ($r = .087$) get much play at all, and we spend most of our time playing bandit 1. This is why $\varepsilon$-greedy strategies work so well — they tend to stick with the best paying bandits and only occasionally check that they are the best-paying bandits.

There are other approaches to improve the $\varepsilon$-greedy strategy, such as using an adaptive schedule that shrinks $\varepsilon$ towards zero as we become more confident we've found the best-paying bandit. One way of doing this might be to use the properties of the normal distribution, together with our bootstrapped estimates, to figure out the probability that any of the other bandits has a higher payout rate than the current maximum $\overline{r}_i$, and set $\varepsilon$ to that probability. As we get more confident, we then automatically reduce our exploration, which removes the need for human intervention or guessing at the best $\varepsilon$ schedule.
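
A rough sketch of that idea, using the normal CDF from scipy.stats to estimate the chance that some other bandit beats the current leader (one possible way to cash out the idea, not necessarily the only one):

import numpy as np
from scipy.stats import norm

def adaptive_epsilon(means, sigmas):
    """Set epsilon to the approximate probability that some other bandit's
    true payout rate exceeds the current leader's estimated rate."""
    best = int(np.argmax(means))
    # P(bandit j's true rate > leader's estimated rate), treating each
    # bootstrapped estimate as a normal distribution (small floor on the
    # scale to avoid a zero standard deviation)
    p_better = [1.0 - norm.cdf(means[best], loc=means[j], scale=sigmas[j] + 1e-12)
                for j in range(len(means)) if j != best]
    # Probability that at least one challenger beats the leader
    return 1.0 - np.prod([1.0 - p for p in p_better])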


Some Last Discussion

In the example we used, we found that the $\varepsilon$-greedy algorithm massively outperformed my own attempt at solving the problem. This was not always the case: in some instances my dropping approach worked best, namely when the bandits' payout rates were less clustered than here, so the worst-paying bandits were easy to locate quickly. In either case, we need an initial run of uniform random sampling of the bandits to build up statistics.

There are even more sophisticated approaches to balancing exploration with exploitation, such as Thompson sampling. But that's it! These are the basic approaches to dealing with the multi-armed bandit problem. There's a lot of nuance that can go into these optimizations, but the basic idea is to balance exploration with exploitation to make sure you find the maximum possible payout. Once again, the code used to generate these plots is available on GitHub.
