Q-learning:
• Operate on Q̂_opt(s, a; w), the estimate of the optimal Q-value (see the sketch after this list)
• Off-policy: the value is based on an estimate of the optimal policy
• To use, don’t need to know the MDP transitions T(s, a, s')
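A minimal sketch of one Q-learning update, assuming linear function approximation Q̂_opt(s, a; w) = w · phi(s, a). The names phi (feature extractor), eta (step size), gamma (discount), and actions (actions available in the successor state) are illustrative and not defined in these notes:

import numpy as np

def q_learning_update(w, phi, s, a, r, s_prime, actions, eta=0.1, gamma=1.0):
    # Current prediction: Q̂_opt(s, a; w) = w · phi(s, a).
    q_hat = np.dot(w, phi(s, a))
    # Off-policy target: r + gamma * max_{a'} Q̂_opt(s_prime, a'; w).
    # Only the observed sample (s, a, r, s_prime) is needed, not T(s, a, s').
    v_opt = max((np.dot(w, phi(s_prime, ap)) for ap in actions), default=0.0)
    target = r + gamma * v_opt
    # Gradient step on the squared prediction error; the gradient of Q̂_opt w.r.t. w is phi(s, a).
    return w - eta * (q_hat - target) * phi(s, a)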
TD learning:
• Operate on V̂_π(s; w) (see the sketch after this list)
• On-policy: the value is based on the exploration policy (usually based on V̂_π)
• To use, need to know rules of the game Succ(s, a)
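For contrast, a similar sketch of one TD-learning update, assuming V̂_π(s; w) = w · phi(s). Here the sample (s, r, s_prime) must come from following the exploration policy π, and producing s_prime in the first place requires knowing the rules of the game Succ(s, a). Again, phi, eta, and gamma are illustrative names:

import numpy as np

def td_learning_update(w, phi, s, r, s_prime, eta=0.1, gamma=1.0):
    # Current prediction: V̂_π(s; w) = w · phi(s).
    v_hat = np.dot(w, phi(s))
    # On-policy bootstrapped target: r + gamma * V̂_π(s_prime; w),
    # where s_prime was reached by following the exploration policy π.
    target = r + gamma * np.dot(w, phi(s_prime))
    # Gradient step on the squared prediction error; the gradient of V̂_π w.r.t. w is phi(s).
    return w - eta * (v_hat - target) * phi(s)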