Remove reinforcement learning notebook from executable build #144

Merged
7 commits merged on Aug 29, 2022
4 changes: 0 additions & 4 deletions .circleci/config.yml
@@ -11,10 +11,6 @@ jobs:
steps:
- checkout

- run:
name: Install deps for building atari-py
command: sudo apt-get update && sudo apt-get install -y cmake ffmpeg

- run:
name: Install Python dependencies
command: |
3 changes: 1 addition & 2 deletions .github/workflows/conda.yml
@@ -14,8 +14,7 @@ jobs:

strategy:
matrix:
# NOTE: Gym/atari deps need to be solved for this to work on windows
os: [ubuntu, macos] #, windows]
os: [ubuntu, macos, windows]

defaults:
run:
@@ -77,31 +77,31 @@ You will train your Pong agent through an "on-policy" method using policy gradie

**1.** First, you should install OpenAI Gym (using `pip install gym[atari]` - this package is currently not available on conda), and import NumPy, Gym and the necessary modules:

```{code-cell}
```python
import numpy as np
import gym
```

Gym can monitor and save the output using the `Monitor` wrapper:

```{code-cell}
```python
from gym import wrappers
from gym.wrappers import Monitor
```

**2.** Instantiate a Gym environment for the game of Pong:

```{code-cell}
```python
env = gym.make("Pong-v0")
```

**3.** Let's review which actions are available in the `Pong-v0` environment:

```{code-cell}
```python
print(env.action_space)
```

```{code-cell}
```python
print(env.get_action_meanings())
```
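
As a quick sanity check, you can sample a random action from this space and step the environment once. This is only a hedged sketch: it assumes the classic four-value `step()` API of the `gym` 0.19-era release pinned elsewhere in this PR.

```python
env.reset()
random_action = env.action_space.sample()  # An integer drawn from the Discrete action space.
observation, reward, done, info = env.step(random_action)
print(random_action, reward, done)
```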

@@ -111,7 +111,7 @@ For simplicity, your policy network will have one output — a (log) probability

**4.** Gym can save videos of the agent's learning in an MP4 format — wrap `Monitor()` around the environment by running the following:

```{code-cell}
```python
env = Monitor(env, "./video", force=True)
```

@@ -127,7 +127,7 @@ Pong screen frames are 210x160 pixels over 3 color dimensions (red, green and bl

**1.** Check the Pong's observations:

```{code-cell}
```python
print(env.observation_space)
```
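
You can confirm the raw frame layout by resetting the environment and inspecting the array it returns. A minimal hedged sketch (the `(210, 160, 3)` shape follows from the description above; the `uint8` dtype is the usual Atari convention and is an assumption here):

```python
frame = env.reset()              # A raw RGB screen frame from the emulator.
print(frame.shape, frame.dtype)  # Typically (210, 160, 3) and uint8.
```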

@@ -143,7 +143,7 @@ In Gym, the agent's actions and observations can be part of the `Box` (n-dimensi

(You can refer to the OpenAI Gym core [API](https://github.com/openai/gym/blob/master/gym/core.py) for more information about Gym's core classes and methods.)

```{code-cell}
```python
import matplotlib.pyplot as plt

env.seed(42)
@@ -157,7 +157,7 @@ To feed the observations into the policy (neural) network, you need to convert t

**3.** Set up a helper function for frame (observation) preprocessing:

```{code-cell}
```python
def frame_preprocessing(observation_frame):
    # Crop the frame.
    observation_frame = observation_frame[35:195]
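    # Hedged sketch of the remaining steps (not shown in this diff hunk); the exact
    # values below are assumptions in the usual Karpathy-style preprocessing.
    # Downsample by a factor of 2 and keep a single color channel -> an 80x80 array.
    observation_frame = observation_frame[::2, ::2, 0]
    # Erase the two background shades and set the remaining pixels (ball, paddles) to 1.
    observation_frame[(observation_frame == 144) | (observation_frame == 109)] = 0
    observation_frame[observation_frame != 0] = 1
    return observation_frame.astype(float)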
@@ -173,7 +173,7 @@ def frame_preprocessing(observation_frame):

**4.** Preprocess the random frame from earlier to test the function — the input for the policy network is an 80x80 image with a single color channel:

```{code-cell}
```python
preprocessed_random_frame = frame_preprocessing(random_frame)
plt.imshow(preprocessed_random_frame, cmap="gray")
print(preprocessed_random_frame.shape)
@@ -193,42 +193,42 @@ Next, you will define the policy as a simple feedforward network that uses a gam
Start by creating a random number generator instance for the experiment
(seeded for reproducibility):

```{code-cell}
```python
rng = np.random.default_rng(seed=12288743)
```

Then:

- Set the input (observation) dimensionality - your preprocessed screen frames:

```{code-cell}
```python
D = 80 * 80
```

- Set the number of hidden layer neurons.

```{code-cell}
```python
H = 200
```

- Instantiate your policy (neural) network model as an empty dictionary.

```{code-cell}
```python
model = {}
```

In a neural network, _weights_ are important adjustable parameters that the network fine-tunes by forward and backward propagating the data.

**2.** Using a technique called [Xavier initialization](https://www.deeplearning.ai/ai-notes/initialization/#IV), set up the network model's initial weights with NumPy's [`Generator.standard_normal()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.standard_normal.html) that returns random numbers over a standard Normal distribution, as well as [`np.sqrt()`](https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html?highlight=numpy.sqrt#numpy.sqrt):

```{code-cell}
```python
model["W1"] = rng.standard_normal(size=(H, D)) / np.sqrt(D)
model["W2"] = rng.standard_normal(size=H) / np.sqrt(H)
```
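
If you want to confirm the Xavier-style scaling, the empirical standard deviation of each weight matrix should sit close to one over the square root of its fan-in. A quick hedged check (not part of the tutorial itself):

```python
print(model["W1"].std(), 1 / np.sqrt(D))  # Both values should be close to 0.0125.
print(model["W2"].std(), 1 / np.sqrt(H))  # Both values should be close to 0.07.
```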

**3.** Your policy network starts by randomly initializing the weights and feeds the input data (frames) forward from the input layer through a hidden layer to the output layers. This process is called the _forward pass_ or _forward propagation_, and is outlined in the function `policy_forward()`:

```{code-cell}
```python
def policy_forward(x, model):
    # Matrix-multiply the weights by the input in the one and only hidden layer.
    h = np.dot(model["W1"], x)
@@ -251,7 +251,7 @@ Note that there are two _activation functions_ for determining non-linear relati

**4.** Define the sigmoid function separately with NumPy's [`np.exp()`](https://numpy.org/doc/stable/reference/generated/numpy.exp.html?highlight=numpy.exp#numpy.exp) for computing exponentials:

```{code-cell}
```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
```
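
The diff above shows only the first lines of `policy_forward()`. For orientation, here is a minimal sketch of a complete two-layer forward pass, assuming the hidden non-linearity is a ReLU (the usual choice in this Karpathy-style setup) and the single output is squashed by the sigmoid; the function name and exact body are assumptions, not the tutorial's verbatim code:

```python
def policy_forward_sketch(x, model):
    # Hidden layer: weights times the preprocessed frame, then a ReLU non-linearity.
    h = np.dot(model["W1"], x)
    h[h < 0] = 0
    # Output layer: a single logit squashed by the sigmoid into a probability
    # of moving the paddle up.
    logit = np.dot(model["W2"], h)
    p = sigmoid(logit)
    # Return the probability and the hidden state (needed later for backpropagation).
    return p, h
```
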
@@ -262,7 +262,7 @@ During learning in your deep RL algorithm, you use the action log probabilities

**1.** Let's define the backward pass function (`policy_backward()`) with the help of NumPy's modules for array multiplication — [`np.dot()`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html?highlight=numpy.dot#numpy.dot) (matrix multiplication), [`np.outer()`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html) (outer product computation), and [`np.ravel()`](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html) (to flatten arrays into 1D arrays):

```{code-cell}
```python
def policy_backward(eph, epdlogp, model):
    dW2 = np.dot(eph.T, epdlogp).ravel()
    dh = np.outer(epdlogp, model["W2"])
@@ -276,7 +276,7 @@ Using the intermediate hidden "states" of the network (`eph`) and the gradients

**2.** When applying backpropagation during agent training, you will need to save several variables for each episode. Let's instantiate empty lists to store them:

```{code-cell}
```python
# All preprocessed observations for the episode.
xs = []
# All hidden "states" (from the network) for the episode.
@@ -292,21 +292,21 @@ You will reset these variables manually at the end of each episode during traini

**3.** Next, to perform a gradient ascent when optimizing the agent's policy, it is common to use deep learning _optimizers_ (you're performing optimization with gradients). In this example, you'll use [RMSProp](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp) — an adaptive optimization [method](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). Let's set a discounting factor — a decay rate — for the optimizer:

```{code-cell}
```python
decay_rate = 0.99
```

**4.** You will also need to store the gradients (with the help of NumPy's [`np.zeros_like()`](https://numpy.org/doc/stable/reference/generated/numpy.zeros_like.html)) for the optimization step during training:

- First, save the update buffers that add up gradients over a batch:

```{code-cell}
```python
grad_buffer = {k: np.zeros_like(v) for k, v in model.items()}
```

- Second, store the RMSProp memory for the optimizer for gradient ascent:

```{code-cell}
```python
rmsprop_cache = {k: np.zeros_like(v) for k, v in model.items()}
```
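
Together, `decay_rate`, `grad_buffer`, and `rmsprop_cache` drive the parameter update. A minimal sketch of how an RMSProp gradient-ascent step typically uses them (`step_size` and the `1e-5` stabilizer are placeholders here; the tutorial sets its own learning rate in the training section below):

```python
step_size = 1e-4  # Placeholder value, assumed for this sketch only.

for k, v in model.items():
    g = grad_buffer[k]  # Gradient summed over the batch of episodes.
    # Keep a decaying average of squared gradients...
    rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
    # ...and scale the ascent step by its square root.
    model[k] += step_size * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
    # Reset the buffer for the next batch.
    grad_buffer[k] = np.zeros_like(v)
```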

@@ -316,7 +316,7 @@ In this section, you will set up a function for computing discounted rewards (`d

To provide more weight to shorter-term rewards over longer-term ones, you will use a _discount factor_ (gamma) that is often a floating-point number between 0.9 and 0.99.

```{code-cell}
```python
gamma = 0.99
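
# Hedged sketch of the discounting helper this section builds (the real definition is
# outside this diff hunk). It assumes Karpathy-style Pong episodes, where a non-zero
# reward marks the end of an individual game and the running sum is reset there.
def discount_rewards_sketch(r, gamma):
    discounted_r = np.zeros_like(r, dtype=float)
    running_add = 0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running_add = 0  # Pong-specific reset at a game boundary (assumption).
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r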


@@ -363,48 +363,48 @@ You can stop the training at any time or/and check saved MP4 videos of saved pla

**1.** For demo purposes, let's limit the number of episodes for training to 3. If you have more compute available (faster CPUs or a GPU), you can increase the number to 1,000 or beyond. For comparison, Andrej Karpathy's original experiment took about 8,000 episodes.

```{code-cell}
```python
max_episodes = 3
```

**2.** Set the batch size and the learning rate values:
- The _batch size_ dictates how often (in episodes) the model performs a parameter update. It is the number of times your agent can collect the state-action trajectories. At the end of the collection, you can perform the maximization of action-probability multiples.
- The [_learning rate_](https://en.wikipedia.org/wiki/Learning_rate) helps limit the magnitude of weight updates to prevent them from overcorrecting.

```{code-cell}
```python
batch_size = 3
learning_rate = 1e-4
```
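
A sketch of the schedule these two values imply (assumed structure, not the tutorial's actual training loop): gradients are accumulated every episode, and the weights only change once per `batch_size` episodes, scaled by `learning_rate`:

```python
for episode in range(1, max_episodes + 1):
    # ... play one episode and add its gradients into grad_buffer ...
    if episode % batch_size == 0:
        for k, v in model.items():
            # Apply the ascent step (RMSProp scaling, sketched earlier, omitted here).
            model[k] += learning_rate * grad_buffer[k]
            grad_buffer[k] = np.zeros_like(v)
```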

**3.** Set the game rendering default variable for Gym's `render` method (it is used to display the observation and is optional but can be useful during debugging):

```{code-cell}
```python
render = False
```

**4.** Set the agent's initial (random) observation by calling `reset()`:

```{code-cell}
```python
observation = env.reset()
```

**5.** Initialize the previous observation:

```{code-cell}
```python
prev_x = None
```

**6.** Initialize the reward variables and the episode count:

```{code-cell}
```python
running_reward = None
reward_sum = 0
episode_number = 0
```
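
For reference, `running_reward` is usually maintained as an exponential moving average of the per-episode reward sums. A minimal hedged sketch of that bookkeeping (the 0.99/0.01 weights are an assumption, not taken from this diff):

```python
def update_running_reward(running_reward, reward_sum):
    # First episode: start the average at the episode's total reward.
    if running_reward is None:
        return reward_sum
    # Afterwards: a slowly moving exponential average (weights assumed).
    return running_reward * 0.99 + reward_sum * 0.01
```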

**7.** To simulate motion between the frames, set the single input frame (`x`) for the policy network as the difference between the current and previous preprocessed frames:

```{code-cell}
```python
def update_input(prev_x, cur_x, D):
    if prev_x is not None:
        x = cur_x - prev_x
@@ -415,7 +415,7 @@ def update_input(prev_x, cur_x, D):

**8.** Finally, start the training loop, using the functions you have predefined:

```{code-cell}
```python
:tags: [output_scroll]

while episode_number < max_episodes:
@@ -546,7 +546,7 @@ A few notes:

- If you have previously run an experiment and want to repeat it, your `Monitor` instance may still be running, which may throw an error the next time you try to train the agent. Therefore, you should first shut down `Monitor` by uncommenting and running the cell below, which calls `env.close()`:

```{code-cell}
```python
# env.close()
```

5 changes: 0 additions & 5 deletions environment.yml
@@ -8,14 +8,9 @@ dependencies:
- matplotlib
- pandas
- statsmodels
- pip
- imageio
- pooch
- ffmpeg # For gym/atari
# For building the site
- sphinx<5
- myst-nb
- sphinx-book-theme
- sphinx-copybutton
- pip:
- gym[atari]==0.19
3 changes: 0 additions & 3 deletions requirements.txt
@@ -5,8 +5,5 @@ matplotlib
pandas
statsmodels
imageio
gym==0.18.3
atari-py==0.2.5
pooch==1.5.1
# For supporting .md-based notebooks
jupytext