MountainCar Off-Policy RL Pipeline

Collect experience with a behavior policy, initialize a Q-table, train with off-policy importance sampling, and then evaluate the greedy target policy, step by step.

Step 1

Data Collection

Idle

Behavior Policy

Policy: p_smart (mixture)
Target Episodes
Max Steps / Episode
Speed (steps/frame): 8
Seed (optional)

Action: 0 left · 1 coast · 2 right
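The page does not say what the p_smart mixture consists of. As a minimal sketch, assume it mixes a uniform random choice (with a hypothetical weight `epsilon`) with a momentum heuristic that pushes in the direction of travel. The function returns the probability of the sampled action as well, since off-policy training needs that value later as the behavior probability.

```python
import random

ACTIONS = (0, 1, 2)  # 0 = left, 1 = coast, 2 = right


def heuristic_action(velocity):
    """Hypothetical 'smart' component: push in the direction of motion."""
    return 2 if velocity >= 0 else 0


def behavior_policy(velocity, epsilon=0.3, rng=random):
    """Sample an action from the assumed mixture.

    Returns (action, prob), where prob is the probability that this
    mixture assigns to the sampled action.
    """
    smart = heuristic_action(velocity)
    if rng.random() < epsilon:
        action = rng.choice(ACTIONS)   # uniform exploration branch
    else:
        action = smart                 # heuristic branch
    # The uniform branch contributes epsilon / 3 to every action; the
    # heuristic branch adds (1 - epsilon) to the heuristic action only.
    prob = epsilon / len(ACTIONS)
    if action == smart:
        prob += 1.0 - epsilon
    return action, prob
```

Recording the probability at sampling time avoids having to reconstruct the behavior policy during training.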

State visitation heatmap (x = position bin, y = velocity bin)
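A heatmap like this can be built by mapping each continuous (position, velocity) pair to a bin pair and counting visits. The sketch below assumes uniform bins over the standard MountainCar ranges and the same hypothetical bin count `n_bins` on both axes; the page's actual binning is not shown.

```python
POS_MIN, POS_MAX = -1.2, 0.6     # standard MountainCar position range
VEL_MIN, VEL_MAX = -0.07, 0.07   # standard MountainCar velocity range


def to_bin(value, low, high, n_bins):
    """Map a value in [low, high] to a bin index in 0..n_bins-1."""
    frac = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(frac * n_bins)))


def discretize(position, velocity, n_bins=20):
    """Return the (position_bin, velocity_bin) pair for a continuous state."""
    return (to_bin(position, POS_MIN, POS_MAX, n_bins),
            to_bin(velocity, VEL_MIN, VEL_MAX, n_bins))


def visitation_counts(states, n_bins=20):
    """Count visits per cell; rows are velocity bins, columns position bins."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for position, velocity in states:
        p_bin, v_bin = discretize(position, velocity, n_bins)
        counts[v_bin][p_bin] += 1
    return counts
```

Indexing the grid as `counts[v_bin][p_bin]` matches the heatmap layout, with position on the x axis and velocity on the y axis.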

Collection Metrics

Episodes: 0
Successes: 0
Success rate: 0.00%
Avg return: 0.00
Avg length: 0.00
Cur step: 0
Cur return: 0.00
Last action: -
Behavior prob: -
Position: -
Velocity: -
Discrete state: -
Action 0 (←): 0
Action 1 (·): 0
Action 2 (→): 0
Status: Idle
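The aggregate figures above could be derived from the collected episodes roughly as follows. This is a sketch under assumptions: each episode is taken to be a dict with a per-step `rewards` list and a boolean `success` flag (a hypothetical record layout), and "return" is read as the undiscounted sum of rewards.

```python
def collection_metrics(episodes):
    """Summarize collected episodes (assumed layout: rewards + success)."""
    n = len(episodes)
    if n == 0:
        # Matches the idle display: every aggregate reads zero.
        return {"episodes": 0, "successes": 0, "success_rate": 0.0,
                "avg_return": 0.0, "avg_length": 0.0}
    successes = sum(1 for ep in episodes if ep["success"])
    return {
        "episodes": n,
        "successes": successes,
        "success_rate": 100.0 * successes / n,  # percent
        "avg_return": sum(sum(ep["rewards"]) for ep in episodes) / n,
        "avg_length": sum(len(ep["rewards"]) for ep in episodes) / n,
    }
```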

Step 2

Build Q-Table

Waiting for Step 1

Discretization Bins
Discount Factor γ
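These two settings determine the shape of the Q-table and how rewards are folded into returns. A minimal sketch, assuming one table entry per (position bin, velocity bin, action) and zero initialization; the helper names are illustrative and not taken from the page.

```python
def make_q_table(n_bins, n_actions=3, initial=0.0):
    """Build an n_bins x n_bins x n_actions table of independent lists."""
    return [[[initial] * n_actions for _ in range(n_bins)]
            for _ in range(n_bins)]


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

Computing returns backward from the final step takes one pass per episode, and the same backward sweep is convenient for accumulating importance weights during training.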

Step 3

Train Model

Waiting for Step 2

Pass 1: Ordinary MC (0%)

Pass 2: Weighted IS (0%)
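The two passes can be illustrated with a single backward sweep over each episode. Two assumptions apply: "Ordinary MC" is read here as Monte Carlo with ordinary importance sampling, and the target policy is a fixed deterministic function `target_action` (hypothetical), so the per-step ratio is 1 / b(a|s) when the logged action matches the target and 0 otherwise. How the page chains the two passes is not specified.

```python
from collections import defaultdict


def off_policy_mc(episodes, target_action, gamma=0.99, weighted=True):
    """Estimate Q for a deterministic target policy from behavior data.

    Each episode is a list of (state, action, reward, behavior_prob).
    """
    q = defaultdict(float)
    denom = defaultdict(float)  # weight sum (weighted) or visit count
    for episode in episodes:
        g, w = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            g = reward + gamma * g
            key = (state, action)
            if weighted:
                # Weighted IS: average returns in proportion to weights.
                denom[key] += w
                q[key] += (w / denom[key]) * (g - q[key])
            else:
                # Ordinary IS: plain average of weight-scaled returns.
                denom[key] += 1.0
                q[key] += (w * g - q[key]) / denom[key]
            # Fold this step's ratio into the weight for earlier steps.
            w = w / b_prob if action == target_action(state) else 0.0
            if weighted and w == 0.0:
                break  # zero weight cannot change weighted estimates
    return dict(q)
```

For one two-step episode with rewards of -1, behavior probabilities of 0.5, and γ = 1, the ordinary estimate for the first step is -4 while the weighted estimate is -2. This shows how ordinary importance sampling can leave the range of observed returns, while the weighted form stays within it at the cost of some bias.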

Step 4

Test Model

Waiting for Step 3

Greedy Policy Playback

Test Episodes
Max Steps / Episode
Speed (steps/frame): 8

Trained greedy policy playback
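Greedy playback amounts to picking the highest-valued action in the current cell at every step. The sketch below assumes the classic MountainCar dynamics (force 0.001, gravity 0.0025, goal at position 0.5, reward -1 per step) and uniform bins; the page's own simulator and table layout may differ, and `run_greedy_episode` is an illustrative name.

```python
import math

POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07
GOAL_POSITION = 0.5


def to_bin(value, low, high, n_bins):
    """Map a value in [low, high] to a bin index in 0..n_bins-1."""
    frac = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(frac * n_bins)))


def step(position, velocity, action):
    """Advance one step; returns (position, velocity, reward, done)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(VEL_MIN, min(VEL_MAX, velocity))
    position = max(POS_MIN, min(POS_MAX, position + velocity))
    if position == POS_MIN and velocity < 0:
        velocity = 0.0  # the car stops at the left wall
    return position, velocity, -1.0, position >= GOAL_POSITION


def run_greedy_episode(q_table, n_bins, start_position=-0.5, max_steps=200):
    """Play one episode greedily; returns (steps, total_return, success)."""
    position, velocity, total = start_position, 0.0, 0.0
    for t in range(1, max_steps + 1):
        p_bin = to_bin(position, POS_MIN, POS_MAX, n_bins)
        v_bin = to_bin(velocity, VEL_MIN, VEL_MAX, n_bins)
        q_values = q_table[p_bin][v_bin]
        # Greedy choice; ties resolve to the lowest action index.
        action = max(range(len(q_values)), key=q_values.__getitem__)
        position, velocity, reward, done = step(position, velocity, action)
        total += reward
        if done:
            return t, total, True
    return max_steps, total, False
```

Because every step costs -1, the episode return equals the negative step count, so shorter successful episodes score higher.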

Test Metrics