|
30 | 30 | then the correct output is [0, 0, 1], where the third column corresponds to action 3.
|
31 | 31 | 4. Save the data: the state, the 'correct' output, and the reward you got (a minimal sketch of steps 3-4 follows).
|
32 | 32 |
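A minimal sketch of steps 3 and 4, assuming the policy network outputs one probability per action and that `states`, `labels`, and `rewards` are plain Python lists serving as the batch buffer. The names and the `record_frame` helper are illustrative, not from the original file:

```python
import numpy as np

# batch buffer (illustrative names, not from the original file)
states, labels, rewards = [], [], []

def record_frame(state, sampled_action_index, reward, n_actions=3):
    """Store one frame of experience: the input state, a one-hot
    'correct' output for the action that was actually sampled,
    and the reward received on this frame."""
    label = np.zeros(n_actions)
    label[sampled_action_index] = 1.0   # e.g. action 3 -> [0, 0, 1]
    states.append(state)
    labels.append(label)
    rewards.append(reward)
```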
|
33 |
| -This is what you do after each frame. Now do this for a while, a few |
34 |
| -(or one, however you like) games to collect enough data for training. |
| 33 | +This is what you do after each frame. Now do this for a while (one or a few
| 34 | +games, however you like) to collect enough data for one training iteration.
35 | 35 |
|
36 | 36 | After you've collected enough data, do one iteration of training:
|
37 | 37 | 1. Construct the loss gradient vector. This step is the core of the gradient descent method (a minimal sketch follows).
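A minimal sketch of step 1 for a softmax output layer, assuming the standard REINFORCE form where the gradient at the logits is `(label - probs)` scaled by the discounted reward; the file's exact formulation is not shown in this diff:

```python
import numpy as np

def loss_gradient(probs, labels, discounted_rewards):
    """probs:  (N, 3) network outputs, one row per saved frame
    labels: (N, 3) one-hot 'correct' outputs from step 3
    discounted_rewards: (N,) reward assigned to each frame.
    For a softmax layer, d(log p[action]) / d(logits) = label - probs;
    weighting by the normalized reward pushes the network toward
    actions that preceded reward and away from the rest."""
    advantages = discounted_rewards - discounted_rewards.mean()
    advantages /= discounted_rewards.std() + 1e-8
    return (labels - probs) * advantages[:, None]
```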
|
|
52 | 52 | """
|
53 | 53 |
|
54 | 54 | # functions
|
55 |
| -def prep_observation(observation, zeros_and_ones): |
| 55 | +def prep_observation(observation, zeros_and_ones=False): |
56 | 56 | obs_2d = observation[:, :, 0] # from RGB to R
|
57 | 57 | obs_2d_cut = obs_2d[93:193, 8:152] # Specific to Breakout: whole space 33:193, 8:152; not including bricks 93:193, 8:152
|
58 | 58 | obs_2d_cut_ds = obs_2d_cut[::2, ::2] # downsample by 2: a b c d e f -> a c e
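The tail of `prep_observation` is not shown in this diff; here is a self-contained sketch of the whole pipeline, assuming `zeros_and_ones=True` binarizes the frame and the result is flattened to match `input_size = 3600` (50 x 72 after crop and downsample):

```python
import numpy as np

def prep_observation_sketch(observation, zeros_and_ones=False):
    """Assumed completion, not the original code: crop, downsample,
    optionally binarize, then flatten for the network input."""
    obs_2d = observation[:, :, 0]         # RGB -> R channel
    obs_2d_cut = obs_2d[93:193, 8:152]    # crop: play field below the bricks
    obs_2d_cut_ds = obs_2d_cut[::2, ::2]  # downsample by 2 -> 50 x 72 pixels
    if zeros_and_ones:
        obs_2d_cut_ds = obs_2d_cut_ds != 0  # 1 where any object is drawn
    return obs_2d_cut_ds.astype(np.float64).ravel()  # 50 * 72 = 3600 values
```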
|
@@ -103,7 +103,7 @@ def plot_pixels(observation): # this one is used to plot how input to nn looks l
|
103 | 103 | reward_shift = 10 # frames to shift rewards back, accounting for the lag between action and reward
|
104 | 104 | reward_discount = 0.99
|
105 | 105 | obs_discount = 0.8 # discount last frame to account for velocity of objects (used in running_frame)
|
106 |
| -training_batch_size = 5 # number of games to perform one optimization step |
| 106 | +training_batch_size = 5 # number of games to play before performing one optimization step |
107 | 107 | neurons = 32 # single hidden layer with that many neurons
|
108 | 108 | input_size = 3600 # hand-written number of pixels fed into the nn (50 * 72 after crop and downsample)
|
109 | 109 | actions = [0, 1, 2] # true actions are 1, 2, 3; this list is used for indexing
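A sketch of how these constants plausibly interact; both helpers below are assumptions about the elided `running_frame` and reward-propagation code, with defaults matching the constants above:

```python
import numpy as np

def discount_rewards(rewards, shift=10, gamma=0.99):
    """Credit each reward to the frame `shift` steps earlier (the paddle
    hit happens well before the brick breaks), then spread it backwards
    with discount factor gamma (reward_shift / reward_discount above)."""
    shifted = np.roll(np.asarray(rewards, dtype=np.float64), -shift)
    shifted[-shift:] = 0.0  # drop the values that wrapped around
    discounted = np.zeros_like(shifted)
    running = 0.0
    for t in reversed(range(len(shifted))):
        running = shifted[t] + gamma * running
        discounted[t] = running
    return discounted

def running_frame(current, previous, alpha=0.8):
    """Overlay a faded copy of the previous frame (obs_discount above)
    so a single input carries motion information for ball and paddle."""
    return np.maximum(current, alpha * previous)
```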
|
|