Evaluation Functions
Goal: To estimate the utility of the current state. 
Recap: Minimax Example
Alternation of max and min nodes.
Temporal Difference (TD) Learning 
TD learning is a machine learning algorithm:
- Use Monte Carlo to generate data. 
- Use this data to fit weights w in your function approximation. 
Function Approximation
For example, the evaluation function could be a weighted linear combination of some features: V(s; w) = w · φ(s), which is what we did for reinforcement learning.
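As a rough sketch of such a linear evaluation function (the state attributes and features below are made up for illustration, not taken from the notes):

```python
import numpy as np

def phi(state):
    """Hypothetical feature extractor for a game state (feature names are illustrative)."""
    return np.array([
        state.my_pieces - state.opp_pieces,   # material balance
        1.0 if state.king_is_safe else 0.0,   # indicator feature
        1.0,                                  # bias feature
    ])

def evaluate(state, w):
    """Linear evaluation function V(s; w) = w . phi(s)."""
    return np.dot(w, phi(state))
```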
Monte Carlo
Run Monte Carlo simulations using some fixed policies π_agent and π_opp, which are intended to be approximations of the minimax policies.
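A minimal sketch of this data-generation step, assuming hypothetical game helpers is_end(s), succ(s, a_agent, a_opp), and utility(s):

```python
def simulate(start_state, pi_agent, pi_opp, succ, is_end, utility):
    """Play one game with the fixed policies and record (s, r, s') transitions.

    pi_agent / pi_opp map a state to an action and stand in for approximations
    of the minimax policies; succ, is_end, utility are assumed game-rule helpers.
    """
    episode = []
    state = start_state
    while not is_end(state):
        next_state = succ(state, pi_agent(state), pi_opp(state))
        r = utility(next_state) if is_end(next_state) else 0.0  # reward only at the end of the game
        episode.append((state, r, next_state))
        state = next_state
    return episode
```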
Side Note
Features should be designed such that they each have the same range and the same inherent structure. For example, indicator features are good, but features that take values over a wide range are not as good.
Recap: Reinforcement Learning
Using an exploration policy based on just the current value function is a bad idea, because we can get stuck exploiting local optima and not exploring.
So we can use an epsilon-greedy policy that explores some of the time and otherwise acts greedily with respect to the current value function.
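A minimal epsilon-greedy sketch, assuming hypothetical helpers actions(s) and succ(s, a) plus the current value estimate V:

```python
import random

def epsilon_greedy(state, actions, succ, V, epsilon=0.1):
    """With probability epsilon take a random action (explore);
    otherwise take the action whose successor looks best under the current V (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions(state))
    return max(actions(state), key=lambda a: V(succ(state, a)))
```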
Machine Learning Algorithm
Machine learning has three parts:
1) Model
2) Data
3) Learning Algorithm (and associated objective function)
Recap: General Machine Learning Framework
If you can come up with a prediction function, parameterized by w, and a target function, then we can formulate the following:
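Concretely (following the standard squared-loss setup), the per-example objective and the stochastic gradient update are:

  objective (per example):  (1/2) (prediction(w) − target)²
  update:                   w ← w − η (prediction(w) − target) ∇_w prediction(w)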
Data
Data points of the form (s_0, a_1, r_1, s_1).
Learning Algorithm
Update weights as follows:
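Here the prediction is V̂_π(s; w) and the target is r + γ V̂_π(s′; w), giving the standard TD update:

  w ← w − η [V̂_π(s; w) − (r + γ V̂_π(s′; w))] ∇_w V̂_π(s; w)

For the linear model V̂_π(s; w) = w · φ(s), the gradient ∇_w V̂_π(s; w) is just φ(s).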
Model
We build a feature extractor φ(s) and parameterize by weights w to define our value function V̂_π(s; w) = w · φ(s).
Example
Given step size η = 0.5, discount γ = 1, and a reward equal to the end utility:
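A runnable sketch of a single update with these hyperparameters (the feature vectors and reward below are made-up numbers, not the original worked example):

```python
import numpy as np

def td_update(w, phi_s, r, phi_s_next, eta=0.5, gamma=1.0):
    """One TD step for a linear V(s; w) = w . phi(s): move w toward the target r + gamma * V(s')."""
    prediction = np.dot(w, phi_s)
    target = r + gamma * np.dot(w, phi_s_next)
    return w - eta * (prediction - target) * phi_s  # grad_w V(s; w) = phi(s)

# Made-up end-of-game transition: reward 1, successor is an end state with zero features.
w = np.zeros(2)
w = td_update(w, phi_s=np.array([1.0, 0.0]), r=1.0, phi_s_next=np.zeros(2))
print(w)  # [0.5 0. ]
```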
Difference between Q-learning and TD Learning
Q learning: 
• Operate on Q̂_opt(s, a; w), i.e. the estimated optimal Q-value
• Off-policy: value is based on estimate of optimal policy 
• To use, don’t need to know the MDP transitions T(s, a, s′)
TD learning:
• Operate on V̂_π(s; w)
• On-policy: value is based on exploration policy (usually based on V̂_π)
• To use, need to know rules of the game Succ(s, a)
Recap: Q-learning Algorithm
Link to CS221 Notes
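For comparison with the TD update above, the Q-learning update has the same prediction-minus-target form, with prediction Q̂_opt(s, a; w) and target r + γ max_{a′} Q̂_opt(s′, a′; w):

  w ← w − η [Q̂_opt(s, a; w) − (r + γ max_{a′} Q̂_opt(s′, a′; w))] ∇_w Q̂_opt(s, a; w)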

TD Learning

Simultaneous Games

Policy (Strategy) Types
Pure strategy; that is, for each state you have one fixed action.
Mixed strategy; that is, a probability distribution over several actions. 
Which is better? How do we deal with games with simultaneous moves?
For zero-sum games:
Mixed strategies - the order of play doesn't matter; the game value is the same either way (von Neumann's minimax theorem).
Pure strategies - going second is no worse, and often better (intuitively, the second player sees the first player's move and can best-respond), as checked numerically below.
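A small numeric check of the pure-strategy claim (the payoff matrix below is made up; entries are player A's utility, player B gets the negation):

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 1.0]])   # hypothetical zero-sum payoffs: rows = A's actions, cols = B's actions

# A commits to a pure action first and B best-responds (minimizes A's utility):
value_if_A_first = max(min(row) for row in A)    # max_a min_b A[a, b]
# B commits first and A best-responds (maximizes A's utility):
value_if_B_first = min(max(col) for col in A.T)  # min_b max_a A[a, b]

print(value_if_A_first, value_if_B_first)  # -1.0 1.0 -> the second mover does at least as well
```

If the first player instead announces a mixed strategy, the minimax theorem says these two values coincide.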

Zero-Sum Games

Non-Zero Sum Games
