Skip to content

amcheste/pokemon-red-ai

Repository files navigation

Pokémon Red RL Toolkit

Pokémon Red: Reinforcement Learning Toolkit

A Gymnasium-compatible environment and training pipeline for Pokémon Red, built on PyBoy, Stable-Baselines3, and sb3-contrib.

Tests Python License


This repository provides everything needed to train RL agents to play Pokémon Red: an emulator-backed Gymnasium environment with three first-class observation treatments (pixel / symbolic / hybrid), RecurrentPPO training scripts, an event-flag-based reward calculator covering 18 critical-path milestones, live Streamlit monitoring dashboards, configurable alerts (desktop / Slack / email), and an analysis layer with bootstrap confidence intervals via rliable.


Repository at a glance

Path What's there
pokemon_red_ai/environment/ Gymnasium env wrapping PyBoy + 3 observation treatments
pokemon_red_ai/training/ RecurrentPPO trainer, callbacks (W&B, alerts, monitoring)
pokemon_red_ai/analysis/ Treatment-comparison logic (comparison.py)
scripts/train.py Primary training entry point
scripts/eval.py Deterministic evaluation harness
scripts/analyze.py rliable bootstrap analysis → publication-quality figures
scripts/compare.py Streamlit dashboard for side-by-side run comparison
scripts/monitor.py Streamlit dashboard for live single-run monitoring
scripts/run_pilots.sh Launch a multi-treatment / multi-seed run grid
docs/research_playbook.md Step-by-step operational guide for long-running experiments
tests/ 833 unit + integration tests (pytest)

Quick start

# 1. Install
git clone https://github.com/amcheste/pokemon-red-ai.git
cd pokemon-red-ai
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Generate save states (one-time; requires a legal Pokémon Red ROM)
python3 scripts/create_save_states.py --rom path/to/PokemonRed.gb

# 3. Smoke test (~5 min, verify the pipeline end-to-end)
python3 scripts/train.py \
    --rom path/to/PokemonRed.gb \
    --save-state states/s0_post_intro.state \
    --observation-type pixel --total-timesteps 50000 --seed 42 \
    --save-dir ./training_output/smoketest

# 4. Run a multi-treatment / multi-seed grid (3 treatments × 3 seeds)
scripts/run_pilots.sh --rom path/to/PokemonRed.gb --parallel 3

# 5. Generate publication-quality figures
python3 scripts/analyze.py --results-dir ./training_output \
    --output-dir ./figures --format pdf --reps 10000

For unattended overnight runs, configure desktop / Slack / email alerts:

cp configs/alerts.example.yaml configs/alerts.yaml   # then enable channels

The full operational playbook (including compute estimates and a parallel-run strategy on Apple Silicon) is in docs/research_playbook.md.


Observation treatments

Three encoder paths, all feeding into the same LSTM (hidden size 256) and PPO policy / value heads (pi=[256,128], vf=[256,128]). Selected via --observation-type on scripts/train.py.

Treatment Observation Encoder Params Feature dim
pixel 80×72×1 grayscale Game Boy screen NatureCNN (Mnih et al. 2015), features_dim=256 ~564K 256
symbolic Player position, party stats, 18-flag bit-vector, exploration counters (29 features) 3-layer MLP 29 → 640 → 640 → 256 ~594K 256
hybrid pixelsymbolic streams NatureCNN(256) + symbolic MLP(256), concatenated ~1.16M 512

The pixel and symbolic encoders are sized to within 10% on trainable parameter count to neutralize the encoder-capacity confound when comparing modalities (Henderson et al. 2018; Engstrom et al. 2020; Andrychowicz et al. 2021). Strict per-forward FLOP matching across CNN and MLP architectures distorts encoder design and is reported transparently rather than enforced. Per-condition learning rates are selected from a pre-registered log-uniform grid following Eimer et al. (2023).

Run scripts/check_encoder_capacity.py to print the exact parameter / FLOP table and assert the 10% match constraint (exits non-zero on violation).

Implementation: pokemon_red_ai/training/models.py; observation construction in pokemon_red_ai/environment/observations.py.

The package also ships three legacy observation types (multi_modal, screen_only, minimal) for backward compatibility with earlier scripts.

Reward function

The default events reward strategy uses a configurable set of 18 event flags between Pallet Town and the Boulder Badge. Each flag transition 0 → 1 awards a fixed positive reward exactly once per episode. A small per-step time penalty, a new-map discovery bonus, and a party-faint penalty are also active by default.

Four other reward strategies are available (standard / exploration / progress / sparse); see pokemon_red_ai/environment/rewards.py for the full menu and configuration knobs.

The flag list with bit offsets is in pokemon_red_ai/game/event_flags.py.

Statistical analysis

Following Agarwal et al. 2021, Deep Reinforcement Learning at the Edge of the Statistical Precipice, the analysis tooling reports:

  • Point estimate: interquartile mean (IQM) over per-seed scores. Robust to outlier seeds in either tail.
  • Uncertainty: 95% percentile bootstrap with 2,000 resamples.
  • Pairwise comparison: probability of improvement, Pr[score_A > score_B] via stratified bootstrap.

Implemented in scripts/analyze.py (post-hoc figures) and pokemon_red_ai/analysis/comparison.py (reusable backend for the live Streamlit comparison and any notebook work).

Live monitoring

Tool Use case
Weights & Biases (auto-enabled in train.py) Cloud telemetry; per-treatment run grouping; check from any device
streamlit run scripts/monitor.py Single-run live dashboard: reward curves, event flags, maps, level / party / money
streamlit run scripts/compare.py Multi-run comparison: IQM table, learning-curve overlays with 95% bands, milestone race
pokemon_red_ai.training.alerts Desktop / Slack / email alerts on first badge, reward plateau, training crash

Use as a library

The training pipeline is fully usable outside the bundled scripts:

from pokemon_red_ai.environment import PokemonRedGymEnv
from sb3_contrib import RecurrentPPO

env = PokemonRedGymEnv(
    rom_path="PokemonRed.gb",
    observation_type="hybrid",
    reward_strategy="events",
    max_episode_steps=15_000,
)
model = RecurrentPPO("MultiInputLstmPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)

Custom reward strategies, observation types, and callback chains are documented in DEVELOPER_GUIDE.md.

Running the tests

./venv/bin/python3 -m pytest                  # full suite (~17s)
./venv/bin/python3 -m pytest tests/unit/      # unit only
./venv/bin/python3 -m pytest -k comparison    # specific module

Acknowledgments

Built on PyBoy (Game Boy emulation), Stable-Baselines3 and sb3-contrib (RL algorithms), Gymnasium (RL interface), and rliable (statistics). Memory addresses verified against the pret/pokered disassembly.

License & ROM

MIT, see LICENSE.

You must own a legal copy of the Pokémon Red ROM. This repository does not distribute, link to, or facilitate acquisition of any copyrighted game data.

About

Reinforcement learning toolkit for training agents to play Pokémon Red. Built on PyBoy, Stable-Baselines3, and Gymnasium. Includes pixel/symbolic/hybrid observation treatments and rliable-based statistical analysis.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors