MountainCar Off-Policy RL Pipeline
Collect experience with a behavior policy, initialize a Q-table, train off-policy with importance sampling, then evaluate the greedy target policy, step by step.
Step 1
Data Collection
Status: Idle
Behavior Policy
Actions: 0 = left · 1 = coast · 2 = right
State visitation heatmap (x = position bin, y = velocity bin)
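A heatmap like this needs each continuous (position, velocity) pair mapped to a grid cell. The following is a minimal sketch of one way to do that, using the standard MountainCar-v0 state bounds. The 20 x 20 grid and the names `discretize` and `visits` are assumptions for illustration; the page does not state its bin counts.

```python
import numpy as np

# MountainCar-v0 state bounds (position, velocity).
POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07
# Assumed grid resolution; not taken from the page.
N_POS_BINS, N_VEL_BINS = 20, 20

def discretize(position, velocity):
    """Map a continuous state to (position_bin, velocity_bin)."""
    pos_frac = (position - POS_MIN) / (POS_MAX - POS_MIN)
    vel_frac = (velocity - VEL_MIN) / (VEL_MAX - VEL_MIN)
    # Clamp so the upper bound lands in the last bin instead of out of range.
    pos_bin = min(max(int(pos_frac * N_POS_BINS), 0), N_POS_BINS - 1)
    vel_bin = min(max(int(vel_frac * N_VEL_BINS), 0), N_VEL_BINS - 1)
    return pos_bin, vel_bin

# Rows index velocity bins (y), columns index position bins (x).
visits = np.zeros((N_VEL_BINS, N_POS_BINS), dtype=int)
for position, velocity in [(-0.5, 0.0), (-0.5, 0.0), (0.6, 0.07)]:
    pos_bin, vel_bin = discretize(position, velocity)
    visits[vel_bin, pos_bin] += 1
```

Each visited state increments one cell, so the array can be drawn directly as an image with position on the x axis and velocity on the y axis.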
Collection Metrics
Episodes: 0
Successes: 0
Success rate: 0.00%
Avg return: 0.00
Avg length: 0.00
Cur step: 0
Cur return: 0.00
Last action: -
Behavior prob: -
Position: -
Velocity: -
Discrete state: -
Action 0 (←): 0
Action 1 (·): 0
Action 2 (→): 0
Status: Idle
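To make the collection step concrete, here is a hedged sketch of how episodes might be gathered while recording the behavior probability of each chosen action, which importance sampling needs later. The transition function mirrors the standard MountainCar-v0 dynamics (reward of -1 per step, 200-step limit). The epsilon-soft momentum policy in `behavior_probs` is purely an assumption; the page does not say which behavior policy it uses.

```python
import math
import random

def step(position, velocity, action):
    """One MountainCar-v0 transition (actions: 0 left, 1 coast, 2 right)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position = min(max(position + velocity, -1.2), 0.6)
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # the car stops when it hits the left wall
    done = position >= 0.5
    return position, velocity, -1.0, done

def behavior_probs(velocity, epsilon=0.3):
    """Assumed epsilon-soft policy: mostly push along the current velocity."""
    preferred = 2 if velocity >= 0 else 0
    probs = [epsilon / 3] * 3
    probs[preferred] += 1 - epsilon
    return probs

def collect_episode(rng, max_steps=200):
    """Return (steps, reached_goal); each step is
    (position, velocity, action, behavior_prob, reward)."""
    position, velocity = rng.uniform(-0.6, -0.4), 0.0
    steps = []
    for _ in range(max_steps):
        probs = behavior_probs(velocity)
        action = rng.choices([0, 1, 2], weights=probs)[0]
        next_position, next_velocity, reward, done = step(position, velocity, action)
        steps.append((position, velocity, action, probs[action], reward))
        position, velocity = next_position, next_velocity
        if done:
            return steps, True
    return steps, False

rng = random.Random(0)
runs = [collect_episode(rng) for _ in range(20)]
successes = sum(reached for _, reached in runs)
avg_length = sum(len(steps) for steps, _ in runs) / len(runs)
```

Because every action has nonzero probability under this policy, any action the greedy target policy might choose is covered, which is the condition importance sampling relies on.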
Step 2
Build Q-Table
Status: Waiting for Step 1
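As a rough illustration of this step, a tabular Q-function over discretized states can be a three-dimensional array indexed by position bin, velocity bin, and action. The grid size, the helper name `build_q_table`, and the initial value are assumptions, not details taken from the page.

```python
import numpy as np

def build_q_table(n_pos_bins=20, n_vel_bins=20, n_actions=3, initial_value=0.0):
    """Q[pos_bin, vel_bin, action] holds the estimated return for that pair."""
    return np.full((n_pos_bins, n_vel_bins, n_actions), initial_value, dtype=float)

Q = build_q_table()
# Greedy action for one discrete state; np.argmax breaks ties toward the lowest index.
greedy_action = int(np.argmax(Q[7, 10]))
```

With every entry equal, the greedy action is 0 (left) for every state until training changes the table.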
Step 3
Train Model
Status: Waiting for Step 2
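The page does not say which importance-sampling variant it trains with. One common choice is off-policy Monte Carlo control with weighted importance sampling, sketched below under that assumption. The function name `train_on_episode`, the cumulative-weight table `C`, the pessimistic initial value of -200, and the two-step episode are all illustrative.

```python
import numpy as np

def train_on_episode(Q, C, episode, gamma=1.0):
    """Update Q in place from one episode of
    (state, action, behavior_prob, reward) tuples, walking backward in time."""
    G, W = 0.0, 1.0  # return so far, importance-sampling weight
    for state, action, behavior_prob, reward in reversed(episode):
        G = gamma * G + reward
        key = state + (action,)
        C[key] += W
        Q[key] += (W / C[key]) * (G - Q[key])
        if action != int(np.argmax(Q[state])):
            # The greedy target policy would not take this action,
            # so the weight for all earlier steps is zero.
            break
        W /= behavior_prob  # target probability is 1 for the greedy action
    return Q

# Assumed pessimistic start: below any return reachable within 200 steps.
Q = np.full((20, 20, 3), -200.0)
C = np.zeros_like(Q)
episode = [((7, 10), 2, 0.8, -1.0), ((8, 11), 2, 0.8, -1.0)]
train_on_episode(Q, C, episode)
```

After this call, `Q[8, 11, 2]` is -1 and `Q[7, 10, 2]` is -2. The pessimistic start is a design choice for the sketch: with a zero-initialized table and rewards of -1, every untried action would look greedy, so the backward pass would stop after the final step of each episode.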
Step 4
Test Model
Status: Waiting for Step 3
Greedy Policy Playback
Trained greedy policy playback
Test Metrics
Episodes: 0
Successes: 0
Success rate: 0.00%
Avg return: 0.00
Avg length: 0.00
Cur step: 0
Cur return: 0.00
Status: Idle
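The test metrics above could be computed along the following lines. This is a sketch only: the 20-bin grid, the helper names, and the hand-built table standing in for a trained one are assumptions, and the transition function follows the standard MountainCar-v0 dynamics.

```python
import math
import random
import numpy as np

N_BINS = 20  # assumed grid resolution per state dimension

def discretize(position, velocity):
    """Map a state within the MountainCar-v0 bounds to (pos_bin, vel_bin)."""
    pos_bin = int((position + 1.2) / 1.8 * N_BINS)
    vel_bin = int((velocity + 0.07) / 0.14 * N_BINS)
    return min(pos_bin, N_BINS - 1), min(vel_bin, N_BINS - 1)

def step(position, velocity, action):
    """One MountainCar-v0 transition (actions: 0 left, 1 coast, 2 right)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position = min(max(position + velocity, -1.2), 0.6)
    if position == -1.2 and velocity < 0:
        velocity = 0.0
    return position, velocity, -1.0, position >= 0.5

def run_greedy_episode(Q, rng, max_steps=200):
    """Follow argmax-Q actions; return (total_return, length, reached_goal)."""
    position, velocity = rng.uniform(-0.6, -0.4), 0.0
    total_return = 0.0
    for t in range(1, max_steps + 1):
        action = int(np.argmax(Q[discretize(position, velocity)]))
        position, velocity, reward, done = step(position, velocity, action)
        total_return += reward
        if done:
            return total_return, t, True
    return total_return, max_steps, False

# Stand-in for a trained table: prefer pushing along the current velocity.
Q = np.zeros((N_BINS, N_BINS, 3))
Q[:, : N_BINS // 2, 0] = 1.0  # negative velocity bins: prefer left
Q[:, N_BINS // 2 :, 2] = 1.0  # non-negative velocity bins: prefer right

rng = random.Random(0)
results = [run_greedy_episode(Q, rng) for _ in range(10)]
success_rate = sum(ok for _, _, ok in results) / len(results)
avg_return = sum(ret for ret, _, _ in results) / len(results)
avg_length = sum(length for _, length, _ in results) / len(results)
```

Since every step yields a reward of -1, the average return is simply the negative of the average length, so shorter successful episodes show up as higher returns.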