Add a tutorial notebook for Blackjack-v1 (#64)
Showing 4 changed files with 364 additions and 0 deletions.
@@ -0,0 +1,364 @@

""" | ||
Solving Blackjack with Q-Learning | ||
================================= | ||
""" | ||
|
||
|
||
# %% | ||
# .. image:: /_static/img/tutorials/blackjack_AE_loop.jpg | ||
# :width: 650 | ||
# :alt: agent-environment-diagram | ||
# | ||
# In this tutorial, we’ll explore and solve the *Blackjack-v1*
# environment.
#
# **Blackjack** is one of the most popular casino card games and is also
# infamous for being beatable under certain conditions. This version of
# the game uses an infinite deck (we draw the cards with replacement), so
# counting cards won’t be a viable strategy in our simulated game.
#
# **Objective**: To win, your card sum should be greater than the
# dealer's without exceeding 21.
#
# **Approach**: To solve this environment by yourself, you can pick your
# favorite discrete RL algorithm. The presented solution uses *Q-learning*
# (a model-free RL algorithm).
#


# %%
# Imports and Environment Setup
# ------------------------------
#

# Author: Till Zemann
# License: MIT License

from collections import defaultdict

import gym
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch

# Let's start by creating the blackjack environment.
# Note: We are going to follow the rules from Sutton & Barto.
# Other versions of the game can be found below for you to experiment with.

env = gym.make("Blackjack-v1", sab=True)


# %%
# .. code:: py
#
#    # Other possible environment configurations:
#
#    env = gym.make('Blackjack-v1', natural=True, sab=False)
#
#    env = gym.make('Blackjack-v1', natural=False, sab=False)
#


# %%
# Observing the environment
# ------------------------------
#
# First of all, we call ``env.reset()`` to start an episode. This function
# resets the environment to a starting position and returns an initial
# ``observation``. We usually also set ``done = False``. This variable
# will be useful later to check if a game is terminated. In this tutorial
# we will use the terms observation and state synonymously but in more
# complex problems a state might differ from the observation it is based
# on.
#

# reset the environment to get the first observation
done = False
observation, info = env.reset()

print(observation)


# %%
# Note that our observation is a 3-tuple consisting of 3 discrete values:
#
# - The player's current sum
# - The value of the dealer's face-up card
# - A boolean indicating whether the player holds a usable ace (an ace is
#   usable if it counts as 11 without busting)
#
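# For readability, you could unpack the observation into named variables
# (a small illustrative sketch; the variable names are arbitrary):

player_sum, dealer_card, usable_ace = observation
print(f"player sum: {player_sum}, dealer shows: {dealer_card}, usable ace: {usable_ace}")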


# %%
# Executing an action
# ------------------------------
#
# After receiving our first observation, we are only going to use the
# ``env.step(action)`` function to interact with the environment. This
# function takes an action as input and executes it in the environment.
# Because that action changes the state of the environment, it returns
# five useful variables to us. These are:
#
# - ``next_state``: This is the observation that the agent will receive
#   after taking the action.
# - ``reward``: This is the reward that the agent will receive after
#   taking the action.
# - ``terminated``: This is a boolean variable that indicates whether or
#   not the episode is over.
# - ``truncated``: This is a boolean variable that indicates whether the
#   episode ended by early truncation.
# - ``info``: This is a dictionary that might contain additional
#   information about the environment.
#
# The ``next_state``, ``reward``, ``terminated`` and ``truncated`` variables
# are self-explanatory, but the ``info`` variable requires some additional
# explanation. It is a dictionary that might have some extra information
# about the environment, but in the Blackjack-v1 environment you can
# ignore it. For example, in Atari environments the info dictionary has an
# ``ale.lives`` key that tells us how many lives the agent has left. If
# the agent has 0 lives, then the episode is over.
#
# Blackjack-v1 doesn’t have an ``env.render()`` function to render the
# environment, but in other environments you can use this function to
# watch the agent play. It is important to note that using ``env.render()``
# is optional - the environment is going to work even if you don’t render
# it, but it can be helpful to see an episode rendered out to get an idea
# of how the current policy behaves. Note that it is not a good idea to
# call this function in your training loop because rendering slows down
# training by a lot. Rather, try to build an extra loop to evaluate and
# showcase the agent after training (a sketch of such an evaluation loop
# appears after the training cell below).
#

# sample a random action from all valid actions
action = env.action_space.sample()

# execute the action in our environment and receive info from the environment
observation, reward, terminated, truncated, info = env.step(action)

print("observation:", observation)
print("reward:", reward)
print("terminated:", terminated)
print("truncated:", truncated)
print("info:", info)


# %%
# Once ``terminated = True`` or ``truncated = True``, we should stop the
# current episode and begin a new one with ``env.reset()``. If you
# continue executing actions without resetting the environment, it still
# responds, but the output won’t be useful for training (it might even be
# harmful if the agent learns on invalid data).
#
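# Putting the pieces together, one complete episode with a purely random
# policy could look like this (a minimal sketch for illustration):

observation, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated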


# %%
# Building an agent
# ------------------------------
#
# Let’s build a ``Q-learning agent`` to solve *Blackjack-v1*! We’ll need
# some functions for picking an action and updating the agent's action
# values. To ensure that the agent explores the environment, one possible
# solution is the ``epsilon-greedy`` strategy, where we pick a random
# action with probability ``epsilon`` and the greedy action (the one
# currently valued as the best) with probability ``1 - epsilon``.
#
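# For reference, the Q-learning update that the agent below implements can
# be written as (with learning rate :math:`\alpha` and discount factor
# :math:`\gamma`):
#
# .. math::
#
#    Q(s, a) \leftarrow Q(s, a) + \alpha \big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)
#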


class BlackjackAgent:
    def __init__(self, lr=1e-3, epsilon=0.1, epsilon_decay=1e-4, discount_factor=0.95):
        """
        Initialize a Reinforcement Learning agent with an empty dictionary
        of state-action values (q_values), a learning rate, an epsilon and
        a discount factor.
        """
        self.q_values = defaultdict(
            lambda: np.zeros(env.action_space.n)
        )  # maps a state to action values
        self.lr = lr
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.discount_factor = discount_factor  # gamma: how much we value future rewards

    def get_action(self, state):
        """
        Returns the best action with probability (1 - epsilon)
        and a random action with probability epsilon to ensure exploration.
        """
        # with probability epsilon return a random action to explore the environment
        if np.random.random() < self.epsilon:
            action = env.action_space.sample()

        # with probability (1 - epsilon) act greedily (exploit)
        else:
            action = np.argmax(self.q_values[state])
        return action

    def update(self, state, action, reward, next_state, done):
        """
        Updates the Q-value of an action.
        """
        old_q_value = self.q_values[state][action]
        max_future_q = np.max(self.q_values[next_state])
        # discount the future value; it contributes nothing once the episode is done
        target = reward + self.discount_factor * max_future_q * (1 - done)
        self.q_values[state][action] = (1 - self.lr) * old_q_value + self.lr * target

    def decay_epsilon(self):
        self.epsilon = self.epsilon - self.epsilon_decay


# %%
# To train the agent, we will let the agent play one episode (one complete
# game is called an episode) at a time and update its Q-values after each
# step. The agent will have to experience a lot of episodes to explore the
# environment sufficiently.
#
# Now we should be ready to build the training loop.
#

# hyperparameters
learning_rate = 1e-3
start_epsilon = 0.8
n_episodes = 200_000
epsilon_decay = start_epsilon / n_episodes  # less exploration over time

agent = BlackjackAgent(
    lr=learning_rate, epsilon=start_epsilon, epsilon_decay=epsilon_decay
)


def train(agent, n_episodes):
    for episode in range(n_episodes):

        # reset the environment
        state, info = env.reset()
        done = False

        # play one episode
        while not done:
            action = agent.get_action(state)
            next_state, reward, terminated, truncated, info = env.step(action)
            done = (
                terminated or truncated
            )  # if the episode terminated or was truncated early, set done to True
            agent.update(state, action, reward, next_state, done)
            state = next_state

        # reduce the exploration rate after each episode
        agent.decay_epsilon()


# %%
# Great, let’s train!
#

train(agent, n_episodes)
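

# %%
# As mentioned above, it is a good idea to evaluate the agent in a separate
# loop after training. The following is a minimal sketch of such an
# evaluation: it temporarily switches off exploration and estimates the win
# rate of the learned greedy policy (the number of evaluation episodes and
# the variable names are arbitrary choices for illustration).
#

n_eval_episodes = 1_000
saved_epsilon, agent.epsilon = agent.epsilon, 0.0  # act greedily during evaluation
wins = 0
for _ in range(n_eval_episodes):
    state, info = env.reset()
    done = False
    while not done:
        action = agent.get_action(state)
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    wins += reward > 0  # in Blackjack-v1, a win yields a positive reward
agent.epsilon = saved_epsilon  # restore epsilon in case we keep training
print(f"win rate over {n_eval_episodes} episodes: {wins / n_eval_episodes:.2f}")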


# %%
# Visualizing the results
# ------------------------------
#


def create_grids(agent, usable_ace=False):

    # convert our state-action values to state values
    # and build a policy dictionary that maps observations to actions
    V = defaultdict(float)
    policy = defaultdict(int)
    for obs, action_values in agent.q_values.items():
        V[obs] = np.max(action_values)
        policy[obs] = np.argmax(action_values)

    # X: player's count (12-21), Y: dealer's face-up card (Ace-10)
    X, Y = np.meshgrid(np.arange(12, 22), np.arange(1, 11))

    # create the value grid for plotting
    Z = np.apply_along_axis(
        lambda obs: V[(obs[0], obs[1], usable_ace)], axis=2, arr=np.dstack([X, Y])
    )
    value_grid = X, Y, Z

    # create the policy grid for plotting
    policy_grid = np.apply_along_axis(
        lambda obs: policy[(obs[0], obs[1], usable_ace)], axis=2, arr=np.dstack([X, Y])
    )
    return value_grid, policy_grid


def create_plots(value_grid, policy_grid, title="N/A"):

    # create a new figure with 2 subplots (left: state values, right: policy)
    X, Y, Z = value_grid
    fig = plt.figure(figsize=plt.figaspect(0.4))
    fig.suptitle(title, fontsize=16)

    # plot the state values
    ax1 = fig.add_subplot(1, 2, 1, projection="3d")
    ax1.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap="viridis", edgecolor="none")
    plt.xticks(range(12, 22), range(12, 22))
    plt.yticks(range(1, 11), ["A"] + list(range(2, 11)))
    ax1.set_title("State values: " + title)
    ax1.set_xlabel("Player sum")
    ax1.set_ylabel("Dealer showing")
    ax1.zaxis.set_rotate_label(False)
    ax1.set_zlabel("Value", fontsize=14, rotation=90)
    ax1.view_init(20, 220)

    # plot the policy
    fig.add_subplot(1, 2, 2)
    ax2 = sns.heatmap(policy_grid, linewidth=0, annot=True, cmap="Accent_r", cbar=False)
    ax2.set_title("Policy: " + title)
    ax2.set_xlabel("Player sum")
    ax2.set_ylabel("Dealer showing")
    ax2.set_xticklabels(range(12, 22))
    ax2.set_yticklabels(["A"] + list(range(2, 11)), fontsize=12)

    # add a legend
    legend_elements = [
        Patch(facecolor="lightgreen", edgecolor="black", label="Hit"),
        Patch(facecolor="grey", edgecolor="black", label="Stick"),
    ]
    ax2.legend(handles=legend_elements, bbox_to_anchor=(1.3, 1))
    return fig


# state values & policy with usable ace (ace counts as 11)
value_grid, policy_grid = create_grids(agent, usable_ace=True)
fig1 = create_plots(value_grid, policy_grid, title="With usable ace")
plt.show()


# %%
# .. image:: /_static/img/tutorials/blackjack_with_usable_ace.png
#

# state values & policy without usable ace (ace counts as 1)
value_grid, policy_grid = create_grids(agent, usable_ace=False)
fig2 = create_plots(value_grid, policy_grid, title="Without usable ace")
plt.show()


# %%
# .. image:: /_static/img/tutorials/blackjack_without_usable_ace.png
#
# It's good practice to call ``env.close()`` at the end of your script,
# so that any resources used by the environment are released.
#

env.close()


# %%
# Hopefully this tutorial helped you get a grip on how to interact with
# OpenAI Gym environments and set you on a journey to solve many more RL
# challenges.
#
# It is recommended that you solve this environment by yourself (project
# based learning is really effective!). You can apply your favorite
# discrete RL algorithm or give Monte Carlo ES a try (covered in `Sutton &
# Barto <http://incompleteideas.net/book/the-book-2nd.html>`__, section
# 5.3) - this way you can compare your results directly to the book.
#
# Have fun!
#