Evaluation Functions
Goal: To estimate the utility of the current state. 
Recap: Minimax Example
Alternation of max and min nodes.
Temporal Difference (TD) Learning 
TD learning is a machine learning algorithm:
- Use Monte Carlo to generate data. 
- Use this data to fit weights w in your function approximation. 
Function Approximation
For example, the evaluation function could be a weighted linear combination of some features: V(s; w) = w · φ(s), which is what we did for reinforcement learning.
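As a rough sketch of such a linear evaluation function (the state attributes and features below are made up for illustration, not taken from the notes):

```python
import numpy as np

def phi(state):
    """Hypothetical feature extractor for a game state (feature names are illustrative)."""
    return np.array([
        state.my_pieces - state.opp_pieces,   # material balance
        1.0 if state.king_is_safe else 0.0,   # indicator feature
        1.0,                                  # bias feature
    ])

def evaluate(state, w):
    """Linear evaluation function V(s; w) = w . phi(s)."""
    return np.dot(w, phi(state))
```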
Monte Carlo
Run Monte Carlo simulations using some fixed policies π_agent and π_opp, which are intended to be approximations of the minimax policies.
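A minimal sketch of this data-generation step, assuming hypothetical game helpers is_end(s), succ(s, a_agent, a_opp), and utility(s):

```python
def simulate(start_state, pi_agent, pi_opp, succ, is_end, utility):
    """Play one game with the fixed policies and record (s, r, s') transitions.

    pi_agent / pi_opp map a state to an action and stand in for approximations
    of the minimax policies; succ, is_end, utility are assumed game-rule helpers.
    """
    episode = []
    state = start_state
    while not is_end(state):
        next_state = succ(state, pi_agent(state), pi_opp(state))
        r = utility(next_state) if is_end(next_state) else 0.0  # reward only at the end of the game
        episode.append((state, r, next_state))
        state = next_state
    return episode
```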
Side Note
Features should be designed such that they each have the same range and the same inherent structure. For example, indicator features are good, but features that take values over a wide range are not as good.
Recap: Reinforcement Learning
Using an exploration policy based on just the current value function is a bad idea, because we can get stuck exploiting local optima and not exploring.
So we can use an epsilon-greedy policy that explores some of the time and otherwise acts greedily with respect to the current value function.
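A minimal epsilon-greedy sketch, assuming hypothetical helpers actions(s) and succ(s, a) plus the current value estimate V:

```python
import random

def epsilon_greedy(state, actions, succ, V, epsilon=0.1):
    """With probability epsilon take a random action (explore);
    otherwise take the action whose successor looks best under the current V (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions(state))
    return max(actions(state), key=lambda a: V(succ(state, a)))
```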
Machine Learning Algorithm
Machine learning has three parts:
1) Model
2) Data
3) Learning Algorithm (and associated objective function)
Recap: General Machine Learning Framework
If you can come up with a prediction function, parameterized by w, and a target function, then we can formulate the following:
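Concretely (following the standard squared-loss setup), the per-example objective and the stochastic gradient update are:

  objective (per example):  (1/2) (prediction(w) − target)²
  update:                   w ← w − η (prediction(w) − target) ∇_w prediction(w)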
Data
Data points of the form (s_0, a_1, r_1, s_1).
Learning Algorithm
Update weights as follows:
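Here the prediction is V̂_π(s; w) and the target is r + γ V̂_π(s′; w), giving the standard TD update:

  w ← w − η [V̂_π(s; w) − (r + γ V̂_π(s′; w))] ∇_w V̂_π(s; w)

For the linear model V̂_π(s; w) = w · φ(s), the gradient ∇_w V̂_π(s; w) is just φ(s).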
Model
We build a feature extractor φ(s) and parameterize by weights w to define our value function V̂_π(s; w) = w · φ(s).
Example
Given step size η = 0.5, discount γ = 1, and a reward equal to the end utility:
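A runnable sketch of a single update with these hyperparameters (the feature vectors and reward below are made-up numbers, not the original worked example):

```python
import numpy as np

def td_update(w, phi_s, r, phi_s_next, eta=0.5, gamma=1.0):
    """One TD step for a linear V(s; w) = w . phi(s): move w toward the target r + gamma * V(s')."""
    prediction = np.dot(w, phi_s)
    target = r + gamma * np.dot(w, phi_s_next)
    return w - eta * (prediction - target) * phi_s  # grad_w V(s; w) = phi(s)

# Made-up end-of-game transition: reward 1, successor is an end state with zero features.
w = np.zeros(2)
w = td_update(w, phi_s=np.array([1.0, 0.0]), r=1.0, phi_s_next=np.zeros(2))
print(w)  # [0.5 0. ]
```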
Difference between Q-learning and TD Learning
Q learning: 
• Operate on Q̂_opt(s, a; w), i.e. the estimated optimal Q-value
• Off-policy: value is based on estimate of optimal policy 
• To use, don’t need to know the MDP transitions T(s, a, s′)
TD learning:
• Operate on V̂_π(s; w)
• On-policy: value is based on exploration policy (usually based on V̂_π)
• To use, need to know rules of the game Succ(s, a)
Recap: Q-learning Algorithm
Link to CS221 Notes
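For comparison with the TD update above, the Q-learning update has the same prediction-minus-target form, with prediction Q̂_opt(s, a; w) and target r + γ max_{a′} Q̂_opt(s′, a′; w):

  w ← w − η [Q̂_opt(s, a; w) − (r + γ max_{a′} Q̂_opt(s′, a′; w))] ∇_w Q̂_opt(s, a; w)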

TD Learning

Simultaneous Games

Policy (Strategy) Types
Pure strategy; that is, for each state you have one fixed action.
Mixed strategy; that is, a probability distribution over several actions. 
Which is better? How do we deal with games with simultaneous moves?
For zero-sum games:
Mixed strategies - the order of play doesn't matter; the game value is the same either way (von Neumann's minimax theorem).
Pure strategies - going second is no worse, and often better (intuitively, the second player sees the first player's move and can best-respond), as checked numerically below.
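A small numeric check of the pure-strategy claim (the payoff matrix below is made up; entries are player A's utility, player B gets the negation):

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 1.0]])   # hypothetical zero-sum payoffs: rows = A's actions, cols = B's actions

# A commits to a pure action first and B best-responds (minimizes A's utility):
value_if_A_first = max(min(row) for row in A)    # max_a min_b A[a, b]
# B commits first and A best-responds (maximizes A's utility):
value_if_B_first = min(max(col) for col in A.T)  # min_b max_a A[a, b]

print(value_if_A_first, value_if_B_first)  # -1.0 1.0 -> the second mover does at least as well
```

If the first player instead announces a mixed strategy, the minimax theorem says these two values coincide.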

Zero-Sum Games

Non-Zero Sum Games
