Notes: Reinforcement Learning




January 13, 2017

Notes on Reinforcement learning.

Reinforcement learning is the problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment. It is appropriately thought of as a class of problems, rather than as a set of techniques.

There are two main strategies for solving Reinforcement Learning problems. The first is to search in the space for behaviours in order to find one that performs well in the environment. This approach has been taken by work in genetic algorithms and genetic programming. The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in state of the world.

In the standard RL model, on each step of interaction the agent receives as input i, some indication of the current state, S, of the environment. The agent chooses an action, a, to generate an output. The action changes the state of the environment and the value of this state transition is communicated to the agent through a scalar reinforcement signal, r. It should choose actions that tend to increase the long-run sum of values of the reinforcement signal. It can learn to do this overtime by systematic trial and error.

An intuitive way to understand the relation between the agent and its environment is with the following example:

Environment : You are in state 65. You have 4 possible actions. Agent: I’ll take action 2. Environment: You received a reinforcement of 7 units. You are in state 15. You have 2 possible actions.

The agent’s job is to find a policy Π, mapping states to actions, that maximizes some long run measure of reinforcement. We assume the environment is stationary.

Important ideas in Reinforcement Learning that came up –

Markov Decision Process

“Markov” generally means that given the present state, the future and the past are independent. This is just like search, where the successor function could only depend on the current state (not the history). In the deterministic single-agent search problems, we wanted an optimal plan or sequence of actions, from start to goal.

For MDPs we want an optimal policy Π* : S ⇒ A

A policy Π gives an action for each state. An optimal policy is one that maxmizes expected utility if followed. An explicit policy defines a reflex agent.


Its reasonable to maximize the sum of rewards and/or to prefer rewards now to rewards later

One solution: values of rewards decay exponentially.

How to discount? Each time we descend a level. we multiply in the discount once.

Why discount? Sooner rewards probably do have higher utility than later rewards. Also helps algorithm converge.

Example: discount of 0.5 U([1,2,3]) = 11 +20.5 + 3*o.25

What if the game lasts forever? Solution:

Absorbing state- guarantee that for every policy, a terminal state will eventually be reached.

How to solve MDPs?

The value (utility) of a state S:

V*(s) = expected utility starting in S and acting optimally. (tells us how good each state is)

Q*(s,a) = expected utility starting out having taken action ‘a’ from state ‘s’ and (thereafter) acting optimally. (‘s’ is fixed and we vary ‘a’)

The optimal policy: Π(s) = optimal action from state s = argmax Q(s,a) (its the action that achieves the maximum)

Values of States

Fundamental operation : compute the (expectimax) value of the state.

Recursive definition of value:

V(s) = max Q(s,a)

Q*(s,a) = max Σ T(s,a,s’)[R(s,a,s’) + V*(s)] — Bellman’s equation


Policy = map of states to actions. Utility = sum of discounted rewards. Values = expected future utility from a state (max node) Q-values = expected future utility from a q-state (chance node)

Both value iteration and policy iteration compute the same thing (all optimal values)

In Value iteration –

In policy iteration –

Unknown MDP : Model Based Learning

Unknown MDP : Model Free Learning

Passive Reinforcement Learning :

What’s good about direct evaluation?

What’s bad about it?

It wastes information about state connnection. Each state must be learned seperately. So, it takes log time to learn.

Sample of V(s) : sample = R(s,Π(s),s’) + ϒV(s’)

Update to V(s): V(s) ⇐ (1-α)V(s) + (α)sample Can also be written as : V(s) ⇐ V(s) + (α)[sample – V(s)]

Problems with TD value Learning: TD value learning is a model – free way to do policy evaluation, mimicking Bellman Updates with running sample averages. Idea : Learn Q-values, not values. Makes action selection model – free too!

Active Reinforcement Learning


How to explore?

Several schemes for forcing exploration:

Simplest: random action (Ε – greedy) - Every time step, flip a coin. - With (small) probability Ε, act normally. - With (large) probability 1-Ε, act on current policy.

Problems with random actions?


Example : random exploration and exploration functions both end up optimal, but random exploration has higher regret.

Approximate Q- learning

Generalizing across states:

Instead, we want to generalize:

Learn about some number of training states from experience. Generalize that experience to new, similar situations. This is a fundamental idea in machine learning.

Solution: describe a state using a vector of features (properties)

Linear Value Functions

Intuitive interpretation:

Example: if something unexpectedly bad happens, blame the features that were on : disprefer all states with that state’s features.

Policy Search

Problem: often the feature-based policies that work well aren’t the ones that approximate V/Q best.

Solution: Learn policies that minimize rewards, not the values that predict them.

Policy Search: start with an OK solution (eg: Q-learning) then fine-tune by hill climbing on feature weights.

Simplest policy search :

Start with an initial linear value function or Q-function. Nudge each feature weight up and down and see if your policy is better than before.


Better methods exploit look-ahead structure, sample wisely, change multiple parameters.