Reinforcement Learning Primer

2020-12-02

Let's say you're at a casino, and you see a slot machine with two levers, each with its own win percentage. You don't know either lever's win percentage, and you only have money for 20 pulls. What's a good strategy for maximizing your earnings?

Try it out below.

[Interactive demo: a two-lever slot machine that tracks your overall wins.]

This problem is known as the bandit problem. The primary dilemma is that you want to gain more information (i.e. learn the win rate of each lever) while also making the most of your 20 limited pulls by choosing the better lever. It turns out there are a few possible strategies.

First, you can pull a random lever for each of the first K moves (in this case, K could be 10). Then, once you have a better approximation of each lever's true win rate, you can choose the greedy option for the remaining pulls. This is known as explore-then-commit.
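
To make this concrete, here's a minimal Python sketch of explore-then-commit on a two-lever bandit. The win rates, pull counts, and helper function are invented for illustration.

```python
import random

# Illustrative two-lever bandit: the true win rates are hidden from the strategy.
TRUE_WIN_RATES = [0.4, 0.6]
TOTAL_PULLS = 20
EXPLORE_PULLS = 10  # K: pulls spent exploring before committing

def pull(lever):
    """Return 1 on a win, 0 on a loss."""
    return 1 if random.random() < TRUE_WIN_RATES[lever] else 0

wins = [0, 0]
counts = [0, 0]

# Explore: sample both levers for the first K pulls. (The post explores randomly;
# alternating is a simple variant that guarantees each lever gets sampled.)
for t in range(EXPLORE_PULLS):
    lever = t % 2
    wins[lever] += pull(lever)
    counts[lever] += 1

# Commit: pull the lever with the higher estimated win rate for the remaining pulls.
best = max(range(2), key=lambda i: wins[i] / counts[i])
for _ in range(TOTAL_PULLS - EXPLORE_PULLS):
    wins[best] += pull(best)
    counts[best] += 1

print(f"Total wins: {sum(wins)} out of {TOTAL_PULLS} pulls")
```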

Alternatively, you can choose a random lever with probability ϵ and the greedy lever otherwise. This is known as ϵ-greedy. With ϵ = 0.3, for example, there's a 70% chance you go with the greedy option and a 30% chance you choose a lever at random.
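
Here's the same illustrative setup with an ϵ-greedy strategy instead (ϵ = 0.3, as in the example above):

```python
import random

# Illustrative two-lever bandit, same setup as the explore-then-commit sketch.
TRUE_WIN_RATES = [0.4, 0.6]  # hidden from the strategy
EPSILON = 0.3                # probability of exploring a random lever
TOTAL_PULLS = 20

def pull(lever):
    """Return 1 on a win, 0 on a loss."""
    return 1 if random.random() < TRUE_WIN_RATES[lever] else 0

wins = [0, 0]
counts = [0, 0]

for _ in range(TOTAL_PULLS):
    if random.random() < EPSILON or 0 in counts:
        lever = random.randrange(2)  # explore: pick a lever at random
    else:
        lever = max(range(2), key=lambda i: wins[i] / counts[i])  # exploit: greedy lever
    wins[lever] += pull(lever)
    counts[lever] += 1

print(f"Total wins: {sum(wins)} out of {TOTAL_PULLS} pulls")
```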

There are many other approaches (many of which are better than these), but the goal is always to balance exploration (trying different levers to acquire more information) with exploitation (taking the action that currently looks best).

Real Bandit Problems

Let's say you're teaching an algorithm to play blackjack. The only information given is what constitutes a win and a loss. During training, how can the algorithm know what a bad move is? Eventually, it must learn that hitting when your hand is 17+ will result in a loss most of the time. So, you train it with thousands of games of blackjack, and the moves the algorithm makes depend on your exploration strategy. For example, for the first 500 moves, you could tell the algorithm to play completely random moves, collecting data on bad moves, or you could use ϵ-greedy or one of many other strategies.

Reinforcement Learning

The main idea behind reinforcement learning is that an agent (e.g. a self-driving car algorithm, a blackjack AI, etc.) takes actions, each of which moves it to a new state, and each state-action pair has a corresponding reward. If the action was beneficial, the agent is rewarded. If the action was harmful, the agent is "punished" (given a negative reward). Hence, the optimal actions are reinforced.
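
In code, this interaction is usually structured as a loop. The `env` and `agent` objects below are hypothetical placeholders (not any particular library's API); they're just meant to show the shape of the loop:

```python
# A sketch of the agent-environment loop: the agent picks an action, the
# environment returns the next state and a reward, and the agent learns from it.
# `env` and `agent` are hypothetical objects; any environment exposing
# reset() and step(action) -> (next_state, reward, done) fits this shape.

def run_episode(env, agent):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.choose_action(state)           # e.g. accelerate, brake, hit, stand
        next_state, reward, done = env.step(action)   # reward may be positive or negative
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward
```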

For example, if you're on an empty highway going 20 MPH, it would be beneficial (positive reward) to accelerate. If you're in bumper-to-bumper traffic, it would be very harmful to accelerate, as you'd crash into the car in front of you. So, for each state (empty highway vs. traffic), you may have a different optimal action.

In the real world, taking actions comes with uncertainty. If you're on the highway and you change lanes to the left, you're reasonably confident the car to your left will maintain its speed, but that's not guaranteed. The driver could accelerate, forcing you to abandon your lane change. They could also change lanes themselves, brake, or do any number of other things.

Likewise, if you have a hard 17 in blackjack (no Ace counted as 11) and you hit, there are scenarios in which you win and others in which you lose.

So, every state and action has a set of transition probabilities, and this is where our bandit problem comes in. Going back to the highway example, changing lanes to the left could come with a ~99% chance you end up in the left lane safely and a ~1% chance you crash. However, the algorithm has to learn those probabilities, so it must explore and take sub-optimal actions during training.

In reality, states contain much more information. A blackjack MDP (Markov decision process) could consider a state to be your hand total and the dealer's visible card. A highway MDP could include your position and velocity, along with the positions and velocities of all the other cars on the highway.

Therefore, a reinforcement learning problem consists of a set of states, actions, transition probabilities, and rewards. Once you collect enough data, you can mathematically determine the optimal action at each state.
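
As a rough sketch, those four pieces can be written down as plain data. The states, probabilities, and rewards below are invented for the highway example and aren't meant to be realistic:

```python
# States, actions, transition probabilities, and rewards for a toy highway MDP.
STATES = ["empty_highway", "traffic", "crashed"]
ACTIONS = ["accelerate", "maintain", "change_left"]

# T[(state, action)] maps each possible next state to its probability.
T = {
    ("empty_highway", "change_left"): {"empty_highway": 0.99, "crashed": 0.01},
    ("empty_highway", "accelerate"):  {"empty_highway": 1.0},
    ("traffic", "accelerate"):        {"crashed": 1.0},
    # ... remaining (state, action) pairs
}

# R[(state, action)] is the immediate reward for taking `action` in `state`.
R = {
    ("empty_highway", "accelerate"): 1.0,
    ("empty_highway", "change_left"): 0.0,
    ("traffic", "accelerate"): -100.0,
}
```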

Utility

How can you tell if you're in a good or a bad state? We have the concept of a reward, but most actions don't produce an immediate reward. In blackjack, you get a positive reward if you win and a negative reward if you lose, but for the average move you get no feedback at all. So, we need a way to propagate those rewards back and assign a utility to each state.

We use the concept of discounted return. A good analogy: receiving $1,000 in 5 years is worth significantly less than $1,000 today, and receiving $1,000 in 10 years is worth even less than both. So, we use a discount factor γ, a value between 0 and 1, typically around 0.9 or 0.95. With γ = 0.9, for example, a reward received 5 steps in the future is worth only 0.9^5 ≈ 0.59 of what it would be worth right now. The update equation for a state's utility is as follows:

U_{k+1}(s) = \max_a \left( R(s, a) + \gamma \sum_{s^{\prime}} T(s^{\prime} \mid s, a) \, U_k(s^{\prime}) \right)

There are a few things to unpack. First, we want to take the action that maximizes the utility. When considering the utility of a state (e.g. driving on a highway), we don't care about a suboptimal action (e.g. crashing into the car on the left).

Second, the summation gives the expected utility of the next state: each possible next state's utility is weighted by its transition probability.

Finally, we see the subscript k on U_k. This is because we must iteratively update the utilities until they converge (this iterative procedure is known as value iteration). Once the utilities converge, we take the best action at each state.
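
Here's a minimal sketch of that iterative update, reusing the toy dictionary layout from the highway example above. It applies the equation to every state until the utilities stop changing:

```python
# Value iteration: repeatedly apply
#   U_{k+1}(s) = max_a ( R(s, a) + gamma * sum_{s'} T(s' | s, a) * U_k(s') )
# until the utilities converge.

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        U_next = {}
        for s in states:
            values = []
            for a in actions:
                if (s, a) not in T:
                    continue  # action not available in this state
                expected = sum(p * U[s2] for s2, p in T[(s, a)].items())
                values.append(R.get((s, a), 0.0) + gamma * expected)
            U_next[s] = max(values) if values else 0.0
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next
```

Once the utilities converge, the greedy action at each state (the argmax inside the update) is the policy you follow.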

State Uncertainty

We've talked about transition probabilities, but real-world problems are even more complicated. For example, in a self-driving car, the sensors may have some inherent uncertainty or may be partially obscured. So, you may not even know what state you are in.

In that case, you add an additional layer of uncertainty: you have to make decisions based on a belief about your state. Your belief could be, say, a 0.3 probability you're in state A, a 0.5 probability you're in state B, and a 0.2 probability you're in state C. You have to choose an action based on that limited information.
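
As a sketch, updating that belief after taking an action and receiving a (possibly noisy) sensor reading can look like the following. The observation model O is a hypothetical addition here; the post itself only discusses transition probabilities.

```python
# Bayes-style belief update: predict where the action likely took us using the
# transition model T, then re-weight by how consistent each state is with the
# observation. O[(state, observation)] is a hypothetical sensor model, P(obs | state).

def update_belief(belief, action, observation, T, O):
    new_belief = {}
    for s2 in belief:
        # Probability of landing in s2, weighted by the current belief in each state s.
        predicted = sum(
            belief[s] * T.get((s, action), {}).get(s2, 0.0) for s in belief
        )
        # Re-weight by how likely the observation is if we were actually in s2.
        new_belief[s2] = O.get((s2, observation), 0.0) * predicted
    total = sum(new_belief.values())
    # Normalize so the belief sums to 1; fall back to the old belief if nothing matches.
    return {s: p / total for s, p in new_belief.items()} if total > 0 else belief
```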