diff --git a/docs/content/basic_usage.md b/docs/content/basic_usage.md
index 2ff30477d..b4088bb2d 100644
--- a/docs/content/basic_usage.md
+++ b/docs/content/basic_usage.md
@@ -1,23 +1,29 @@
 ---
 layout: "contents"
-title: API
+title: Basic Usage
 firstpage:
 ---
 
 # Basic Usage
 
+Gymnasium is a project that provides an API for all single-agent reinforcement learning environments, and includes implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more.
+
+The API contains four key functions: ``make``, ``reset``, ``step`` and ``render``, which this basic usage guide will introduce you to. At the core of Gymnasium is ``Env``, a high-level Python class representing a Markov decision process (MDP) from reinforcement learning theory (note that this is not a perfect reconstruction, and it is missing several components of MDPs). Within Gymnasium, environments (MDPs) are implemented as ``Env`` classes, along with ``Wrappers`` that can modify the results passed to the user.
+
 ## Initializing Environments
 
-Initializing environments is very easy in Gymnasium and can be done via:
+Initializing environments is very easy in Gymnasium and can be done via the ``make`` function:
 
 ```python
 import gymnasium as gym
 env = gym.make('CartPole-v1')
 ```
 
+This will return an ``Env`` for users to interact with. To see all environments you can create, use ``gymnasium.envs.registry.keys()``. ``make`` also accepts a number of additional parameters for adding wrappers, passing keyword arguments to the environment, and more.
+
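+For illustration only (the keyword arguments shown here are just examples, and the set of accepted arguments depends on the environment), ``make`` can be used like this:
+
+```python
+import gymnasium as gym
+
+# List every environment id currently registered with Gymnasium
+print(sorted(gym.envs.registry.keys()))
+
+# Keyword arguments are forwarded to the environment, and `make` can also
+# apply wrappers for you, e.g. a TimeLimit wrapper via `max_episode_steps`
+env = gym.make("CartPole-v1", render_mode="rgb_array", max_episode_steps=100)
+```
+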
 ## Interacting with the Environment
 
-Gymnasium implements the classic "agent-environment loop":
+The classic "agent-environment loop" pictured below is a simplified representation of reinforcement learning that Gymnasium implements.
 
 ```{image} /_static/diagrams/AE_loop.png
 :width: 50%
 :align: center
 :class: only-light
 ```
@@ -31,29 +37,15 @@ Gymnasium implements the classic "agent-environment loop":
 ```{image} /_static/diagrams/AE_loop_dark.png
 :width: 50%
 :align: center
 :class: only-dark
 ```
 
-The agent performs some actions in the environment (usually by passing some control inputs to the environment, e.g. torque inputs of motors) and observes
-how the environment's state changes. One such action-observation exchange is referred to as a *timestep*.
-
-The goal in RL is to manipulate the environment in some specific way. For instance, we want the agent to navigate a robot
-to a specific point in space. If it succeeds in doing this (or makes some progress towards that goal), it will receive a positive reward
-alongside the observation for this timestep. The reward may also be negative or 0, if the agent did not yet succeed (or did not make any progress).
-The agent will then be trained to maximize the reward it accumulates over many timesteps.
-
-After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task. In that case, we want to reset the environment to a new initial state. The environment issues a terminated signal to the agent if it enters such a terminal state. Sometimes we also want to end the episode after a fixed number of timesteps, in this case, the environment issues a truncated signal.
-This is a new change in API (v0.26 onwards). Earlier a commonly done signal was issued for an episode ending via any means. This is now changed in favour of issuing two signals - terminated and truncated.
-
-Let's see what the agent-environment loop looks like in Gymnasium.
-This example will run an instance of `LunarLander-v2` environment for 1000 timesteps.
-Since we pass `render_mode="human"`, you should see a window pop up rendering the environment.
+This loop is implemented using the following Gymnasium code:
 
 ```python
 import gymnasium as gym
 
 env = gym.make("LunarLander-v2", render_mode="human")
-env.action_space.seed(42)
-
-observation, info = env.reset(seed=42)
+observation, info = env.reset()
 
 for _ in range(1000):
-    action = env.action_space.sample()
+    action = env.action_space.sample()  # agent policy that uses the observation and info
     observation, reward, terminated, truncated, info = env.step(action)
 
     if terminated or truncated:
@@ -69,112 +61,41 @@ The output should look something like this:
 :align: center
 ```
 
-Every environment specifies the format of valid actions by providing an `env.action_space` attribute. Similarly,
-the format of valid observations is specified by `env.observation_space`.
-In the example above we sampled random actions via `env.action_space.sample()`. Note that we need to seed the action space separately from the
-environment to ensure reproducible samples.
-
-
-### Change in env.step API
-
-Previously, the step method returned only one boolean - `done`. This is being deprecated in favour of returning two booleans `terminated` and `truncated` (v0.26 onwards).
+### Explaining the code
 
-`terminated` signal is set to `True` when the core environment terminates inherently because of task completion, failure etc. a condition defined in the MDP.
-`truncated` signal is set to `True` when the episode ends specifically because of a time-limit or a condition not inherent to the environment (not defined in the MDP).
-It is possible for `terminated=True` and `truncated=True` to occur at the same time when termination and truncation occur at the same step.
+First, an environment is created using ``make`` with the additional keyword ``"render_mode"``, which specifies how the environment should be visualised. See ``render`` for details on the default meaning of the different render modes. In this example, we use the ``"LunarLander"`` environment, where the agent controls a spaceship that needs to land safely.
 
-This is explained in detail in the `Handling Time Limits` section.
+After initializing the environment, we ``reset`` it to get the first observation. To initialize the environment with a particular random seed or with options (see the environment documentation for possible values), use the ``seed`` or ``options`` parameters of ``reset``.
 
-#### Backward compatibility
+Next, the agent performs an action in the environment with ``step``. This can be imagined as moving a robot or pressing a button on a game controller, causing a change within the environment. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. The reward could, for instance, be positive for destroying an enemy or negative for moving into lava. One such action-observation exchange is referred to as a *timestep*.
 
-Gym will retain support for the old API through compatibility wrappers.
+However, after some timesteps the environment may end; this is called the terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task, and the environment needs to stop because the agent cannot continue. In Gymnasium, whether the environment has terminated is returned by ``step``. Similarly, we may also want the environment to end after a fixed number of timesteps; in this case, the environment issues a truncated signal. If either ``terminated`` or ``truncated`` is `True`, then ``reset`` should be called next to restart the environment.
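+
+As a rough sketch only (rewards and the reasons an episode ends vary between environments), the loop above can be extended to seed the first ``reset`` and to distinguish the two ending signals:
+
+```python
+import gymnasium as gym
+
+env = gym.make("LunarLander-v2")
+
+# Seeding the first reset makes the sequence of episodes reproducible;
+# later resets reuse the same random number generator.
+observation, info = env.reset(seed=42)
+
+for _ in range(1000):
+    action = env.action_space.sample()  # replace with a trained policy
+    observation, reward, terminated, truncated, info = env.step(action)
+
+    if terminated or truncated:
+        # `terminated`: the episode reached a terminal state defined by the
+        # environment (e.g. the lander crashed or landed successfully).
+        # `truncated`: the episode was cut off externally, e.g. by a time limit.
+        observation, info = env.reset()
+
+env.close()
+```
+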
-Users can toggle the old API through `make` by setting `apply_api_compatibility=True`.
-
-```python
-env = gym.make("CartPole-v1", apply_api_compatibility=True)
-```
-This can also be done explicitly through a wrapper:
-```python
-from gymnasium.wrappers import StepAPICompatibility
-env = StepAPICompatibility(CustomEnv(), output_truncation_bool=False)
-```
-For more details see the wrappers section.
+## Action and observation spaces
 
+Every environment specifies the format of valid actions and observations with the ``env.action_space`` and ``env.observation_space`` attributes. This is helpful for knowing both the expected input and output of the environment, as all valid actions and observations should be contained within their respective spaces.
 
-## Checking API-Conformity
+In the example above, we sampled random actions via ``env.action_space.sample()`` instead of using an agent policy that maps observations to actions, which is what users will ultimately want to build. See one of the agent tutorials for an example of creating and training an agent policy.
 
-If you have implemented a custom environment and would like to perform a sanity check to make sure that it conforms to
-the API, you can run:
+Every environment should have the attributes ``action_space`` and ``observation_space``, both of which should be instances of classes that inherit from ``Space``. Gymnasium supports the majority of the spaces that users need:
 
-```python
->>> from gymnasium.utils.env_checker import check_env
->>> check_env(env)
-```
-
-This function will throw an exception if it seems like your environment does not follow the Gymnasium API. It will also produce
-warnings if it looks like you made a mistake or do not follow a best practice (e.g. if `observation_space` looks like
-an image but does not have the right dtype). Warnings can be turned off by passing `warn=False`. By default, `check_env` will
-not check the `render` method. To change this behavior, you can pass `skip_render_check=False`.
+- ``Box``: describes an n-dimensional continuous space. It's a bounded space where we can define the upper and lower
+  limits which describe the valid values our observations can take.
+- ``Discrete``: describes a discrete space where {0, 1, ..., n-1} are the possible values our observation or action can take.
+  Values can be shifted to {a, a+1, ..., a+n-1} using an optional argument.
+- ``Dict``: represents a dictionary of simple spaces.
+- ``Tuple``: represents a tuple of simple spaces.
+- ``MultiBinary``: creates an n-shaped binary space. Argument n can be a number or a list of numbers.
+- ``MultiDiscrete``: consists of a series of ``Discrete`` action spaces with a different number of actions in each element.
 
-> After running `check_env` on an environment, you should not reuse the instance that was checked, as it may have already
-been closed!
+For example usage of spaces, see their [documentation](/api/spaces) along with the [utility functions](/api/spaces/utils). There are also a couple of more niche spaces: ``Graph``, ``Sequence`` and ``Text``.
 
-## Spaces
+## Modifying the environment
 
-Spaces are usually used to specify the format of valid actions and observations.
-Every environment should have the attributes `action_space` and `observation_space`, both of which should be instances
-of classes that inherit from `Space`.
-There are multiple `Space` types available in Gymnasium: +Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Using wrappers will allow you to avoid a lot of boilerplate code and make your environment more modular. Wrappers can also be chained to combine their effects. Most environments that are generated via ``gymnasium.make`` will already be wrapped by default using the ``TimeLimit``, ``OrderEnforcing`` and ``PassiveEnvChecker``. -- `Box`: describes an n-dimensional continuous space. It's a bounded space where we can define the upper and lower limits which describe the valid values our observations can take. -- `Discrete`: describes a discrete space where {0, 1, ..., n-1} are the possible values our observation or action can take. Values can be shifted to {a, a+1, ..., a+n-1} using an optional argument. -- `Dict`: represents a dictionary of simple spaces. -- `Tuple`: represents a tuple of simple spaces. -- `MultiBinary`: creates a n-shape binary space. Argument n can be a number or a `list` of numbers. -- `MultiDiscrete`: consists of a series of `Discrete` action spaces with a different number of actions in each element. +In order to wrap an environment, you must first initialize a base environment. Then you can pass this environment along with (possibly optional) parameters to the wrapper's constructor: -```python ->>> from gymnasium.spaces import Box, Discrete, Dict, Tuple, MultiBinary, MultiDiscrete ->>> import numpy as np ->>> ->>> observation_space = Box(low=-1.0, high=2.0, shape=(3,), dtype=np.float32) ->>> observation_space.sample() -[ 1.6952509 -0.4399011 -0.7981693] ->>> ->>> observation_space = Discrete(4) ->>> observation_space.sample() -1 ->>> ->>> observation_space = Discrete(5, start=-2) ->>> observation_space.sample() --2 ->>> ->>> observation_space = Dict({"position": Discrete(2), "velocity": Discrete(3)}) ->>> observation_space.sample() -OrderedDict([('position', 0), ('velocity', 1)]) ->>> ->>> observation_space = Tuple((Discrete(2), Discrete(3))) ->>> observation_space.sample() -(1, 2) ->>> ->>> observation_space = MultiBinary(5) ->>> observation_space.sample() -[1 1 1 0 1] ->>> ->>> observation_space = MultiDiscrete([ 5, 2, 2 ]) ->>> observation_space.sample() -[3 0 0] - ``` - -## Wrappers - -Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. -Using wrappers will allow you to avoid a lot of boilerplate code and make your environment more modular. Wrappers can -also be chained to combine their effects. Most environments that are generated via `gymnasium.make` will already be wrapped by default. - -In order to wrap an environment, you must first initialize a base environment. Then you can pass this environment along -with (possibly optional) parameters to the wrapper's constructor: ```python >>> import gymnasium >>> from gymnasium.wrappers import RescaleAction @@ -186,19 +107,6 @@ Box([-1. -1. -1. -1.], [1. 1. 1. 1.], (4,), float32) Box([0. 0. 0. 0.], [1. 1. 1. 1.], (4,), float32) ``` - -There are three very common things you might want a wrapper to do: - -- Transform actions before applying them to the base environment -- Transform observations that are returned by the base environment -- Transform rewards that are returned by the base environment - -Such wrappers can be easily implemented by inheriting from `ActionWrapper`, `ObservationWrapper`, or `RewardWrapper` and implementing the -respective transformation. 
- -However, sometimes you might need to implement a wrapper that does some more complicated modifications (e.g. modify the -reward based on data in `info`). Such wrappers -can be implemented by inheriting from `Wrapper`. Gymnasium already provides many commonly used wrappers for you. Some examples: - `TimeLimit`: Issue a truncated signal if a maximum number of timesteps has been exceeded (or the base environment has issued a truncated signal). @@ -206,7 +114,9 @@ Gymnasium already provides many commonly used wrappers for you. Some examples: - `RescaleAction`: Rescale actions to lie in a specified interval - `TimeAwareObservation`: Add information about the index of timestep to observation. In some cases helpful to ensure that transitions are Markov. -If you have a wrapped environment, and you want to get the unwrapped environment underneath all of the layers of wrappers (so that you can manually call a function or change some underlying aspect of the environment), you can use the `.unwrapped` attribute. If the environment is already a base environment, the `.unwrapped` attribute will just return itself. +For a full list of implemented wrappers in gymnasium, see [wrappers](/api/wrappers). + +If you have a wrapped environment, and you want to get the unwrapped environment underneath all the layers of wrappers (so that you can manually call a function or change some underlying aspect of the environment), you can use the `.unwrapped` attribute. If the environment is already a base environment, the `.unwrapped` attribute will just return itself. ```python >>> wrapped_env @@ -215,46 +125,8 @@ If you have a wrapped environment, and you want to get the unwrapped environment ``` -## Playing within an environment - -You can also play the environment using your keyboard using the `play` function in `gymnasium.utils.play`. -```python -from gymnasium.utils.play import play -play(gymnasium.make('Pong-v0')) -``` -This opens a window of the environment and allows you to control the agent using your keyboard. +## More information -Playing using the keyboard requires a key-action map. This map should have type `dict[tuple[int], int | None]`, which maps the keys pressed to action performed. -For example, if pressing the keys `w` and `space` at the same time is supposed to perform action `2`, then the `key_to_action` dict should look like this: -```python -{ - # ... - (ord('w'), ord(' ')): 2, - # ... -} -``` -As a more complete example, let's say we wish to play with `CartPole-v0` using our left and right arrow keys. The code would be as follows: -```python -import gymnasium as gym -import pygame -from gymnasium.utils.play import play - -mapping = {(pygame.K_LEFT,): 0, (pygame.K_RIGHT,): 1} -play(gym.make("CartPole-v1",render_mode="rgb_array"), keys_to_action=mapping) -``` -where we obtain the corresponding key ID constants from pygame. If the `key_to_action` argument is not specified, then the default `key_to_action` mapping for that env is used, if provided. - -Furthermore, if you wish to plot real time statistics as you play, you can use `gymnasium.utils.play.PlayPlot`. 
Here's some sample code for plotting the reward for last 5 second of gameplay: -```python -import gymnasium as gym -import pygame -from gymnasium.utils.play import PlayPlot, play - -def callback(obs_t, obs_tp1, action, rew, terminated, truncated, info): - return [rew, ] - -plotter = PlayPlot(callback, 30 * 5, ["reward"]) -mapping = {(pygame.K_LEFT,): 0, (pygame.K_RIGHT,): 1} -env = gym.make("CartPole-v1", render_mode="rgb_array") -play(env, callback=plotter.callback, keys_to_action=mapping) -``` +* [Making a Custom environment using the Gymnasium API](/tutorials/environment_creation) +* [Training an agent to play blackjack](/tutorials/blackjack_tutorial) +* [Compatibility with OpenAI Gym](/content/gym_compatibility) diff --git a/docs/content/environment_creation.md b/docs/content/environment_creation.md deleted file mode 100644 index 1f0bd7f52..000000000 --- a/docs/content/environment_creation.md +++ /dev/null @@ -1,416 +0,0 @@ ---- -layout: "contents" -title: Environment Creation ---- - -# Make your own custom environment - -This documentation overviews creating new environments and relevant useful wrappers, utilities and tests included in Gymnasium designed for the creation of new environments. -You can clone gym-examples to play with the code that is presented here. We recommend that you use a virtual environment: - -```console -git clone https://github.com/Farama-Foundation/gym-examples -cd gym-examples -python -m venv .env -source .env/bin/activate -pip install -e . -``` - -## Subclassing gymnasium.Env - -Before learning how to create your own environment you should check out [the documentation of Gymnasium's API](/api/core). - -We will be concerned with a subset of gym-examples that looks like this: - -```sh -gym-examples/ - README.md - setup.py - gym_examples/ - __init__.py - envs/ - __init__.py - grid_world.py - wrappers/ - __init__.py - relative_position.py - reacher_weighted_reward.py - discrete_action.py - clip_reward.py - ``` - -To illustrate the process of subclassing `gymnasium.Env`, we will implement a very simplistic game, called `GridWorldEnv`. -We will write the code for our custom environment in `gym-examples/gym_examples/envs/grid_world.py`. -The environment consists of a 2-dimensional square grid of fixed size (specified via the `size` parameter during construction). -The agent can move vertically or horizontally between grid cells in each timestep. The goal of the agent is to navigate to a -target on the grid that has been placed randomly at the beginning of the episode. - -- Observations provide the location of the target and agent. -- There are 4 actions in our environment, corresponding to the movements "right", "up", "left", and "down". -- A done signal is issued as soon as the agent has navigated to the grid cell where the target is located. -- Rewards are binary and sparse, meaning that the immediate reward is always zero, unless the agent has reached the target, then it is 1. - -An episode in this environment (with `size=5`) might look like this: - - - -where the blue dot is the agent and the red square represents the target. - - -Let us look at the source code of `GridWorldEnv` piece by piece: - -### Declaration and Initialization - -Our custom environment will inherit from the abstract class `gymnasium.Env`. You shouldn't forget to add the `metadata` attribute to your class. -There, you should specify the render-modes that are supported by your environment (e.g. 
`"human"`, `"rgb_array"`, `"ansi"`) -and the framerate at which your environment should be rendered. Every environment should support `None` as render-mode; you don't need to add it in the metadata. -In `GridWorldEnv`, we will support the modes "rgb_array" and "human" and render at 4 FPS. - -The `__init__` method of our environment will accept the integer `size`, that determines the size of the square grid. -We will set up some variables for rendering and define `self.observation_space` and `self.action_space`. -In our case, observations should provide information about the location of the agent and target on the 2-dimensional grid. -We will choose to represent observations in the form of dictionaries with keys `"agent"` and `"target"`. An observation -may look like ` {"agent": array([1, 0]), "target": array([0, 3])}`. -Since we have 4 actions in our environment ("right", "up", "left", "down"), we will use `Discrete(4)` as an action space. -Here is the declaration of `GridWorldEnv` and the implementation of `__init__`: - -```python -import gymnasium as gym -from gymnasium import spaces -import pygame -import numpy as np - - -class GridWorldEnv(gym.Env): - metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 4} - - def __init__(self, render_mode=None, size=5): - self.size = size # The size of the square grid - self.window_size = 512 # The size of the PyGame window - - # Observations are dictionaries with the agent's and the target's location. - # Each location is encoded as an element of {0, ..., `size`}^2, i.e. MultiDiscrete([size, size]). - self.observation_space = spaces.Dict( - { - "agent": spaces.Box(0, size - 1, shape=(2,), dtype=int), - "target": spaces.Box(0, size - 1, shape=(2,), dtype=int), - } - ) - - # We have 4 actions, corresponding to "right", "up", "left", "down" - self.action_space = spaces.Discrete(4) - - """ - The following dictionary maps abstract actions from `self.action_space` to - the direction we will walk in if that action is taken. - I.e. 0 corresponds to "right", 1 to "up" etc. - """ - self._action_to_direction = { - 0: np.array([1, 0]), - 1: np.array([0, 1]), - 2: np.array([-1, 0]), - 3: np.array([0, -1]), - } - - assert render_mode is None or render_mode in self.metadata["render_modes"] - self.render_mode = render_mode - - """ - If human-rendering is used, `self.window` will be a reference - to the window that we draw to. `self.clock` will be a clock that is used - to ensure that the environment is rendered at the correct framerate in - human-mode. They will remain `None` until human-mode is used for the - first time. - """ - self.window = None - self.clock = None - -``` - -### Constructing Observations From Environment States - -Since we will need to compute observations both in `reset` and `step`, it is often convenient to have -a (private) method `_get_obs` that translates the environment's state into an observation. However, this is not mandatory -and you may as well compute observations in `reset` and `step` separately: -```python - def _get_obs(self): - return {"agent": self._agent_location, "target": self._target_location} -``` -We can also implement a similar method for the auxiliary information that is returned by `step` and `reset`. 
In our case, -we would like to provide the manhattan distance between the agent and the target: -```python - def _get_info(self): - return {"distance": np.linalg.norm(self._agent_location - self._target_location, ord=1)} -``` -Oftentimes, info will also contain some data that is only available inside the `step` method (e.g. individual reward -terms). In that case, we would have to update the dictionary that is returned by `_get_info` in `step`. - -### Reset - -The `reset` method will be called to initiate a new episode. You may assume that the `step` method will not -be called before `reset` has been called. Moreover, `reset` should be called whenever a done signal has been issued. -Users may pass the `seed` keyword to `reset` to initialize any random number generator that is used by the environment -to a deterministic state. It is recommended to use the random number generator `self.np_random` that is provided by the environment's -base class, `gymnasium.Env`. If you only use this RNG, you do not need to worry much about seeding, *but you need to remember to -call `super().reset(seed=seed)`* to make sure that `gymnasium.Env` correctly seeds the RNG. -Once this is done, we can randomly set the state of our environment. -In our case, we randomly choose the agent's location and the random sample target positions, until it does not coincide with the agent's position. - -The `reset` method should return a tuple of the initial observation -and some auxiliary information. We can use the methods `_get_obs` -and `_get_info` that we implemented earlier for that: - -```python - def reset(self, seed=None, options=None): - # We need the following line to seed self.np_random - super().reset(seed=seed) - - # Choose the agent's location uniformly at random - self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int) - - # We will sample the target's location randomly until it does not coincide with the agent's location - self._target_location = self._agent_location - while np.array_equal(self._target_location, self._agent_location): - self._target_location = self.np_random.integers( - 0, self.size, size=2, dtype=int - ) - - observation = self._get_obs() - info = self._get_info() - - if self.render_mode == "human": - self._render_frame() - - return observation, info -``` - -### Step - -The `step` method usually contains most of the logic of your environment. It accepts an `action`, computes the state of -the environment after applying that action and returns the 4-tuple `(observation, reward, done, info)`. -Once the new state of the environment has been computed, we can check whether it is a terminal state and we set `done` -accordingly. Since we are using sparse binary rewards in `GridWorldEnv`, computing `reward` is trivial once we know `done`. 
To gather -`observation` and `info`, we can again make use of `_get_obs` and `_get_info`: - -```python - def step(self, action): - # Map the action (element of {0,1,2,3}) to the direction we walk in - direction = self._action_to_direction[action] - # We use `np.clip` to make sure we don't leave the grid - self._agent_location = np.clip( - self._agent_location + direction, 0, self.size - 1 - ) - # An episode is done iff the agent has reached the target - terminated = np.array_equal(self._agent_location, self._target_location) - reward = 1 if terminated else 0 # Binary sparse rewards - observation = self._get_obs() - info = self._get_info() - - if self.render_mode == "human": - self._render_frame() - - return observation, reward, terminated, False, info -``` - -### Rendering - -Here, we are using PyGame for rendering. A similar approach to rendering is used in many environments that are included -with Gymnasium and you can use it as a skeleton for your own environments: - -```python - def render(self): - if self.render_mode == "rgb_array": - return self._render_frame() - - def _render_frame(self): - if self.window is None and self.render_mode == "human": - pygame.init() - pygame.display.init() - self.window = pygame.display.set_mode((self.window_size, self.window_size)) - if self.clock is None and self.render_mode == "human": - self.clock = pygame.time.Clock() - - canvas = pygame.Surface((self.window_size, self.window_size)) - canvas.fill((255, 255, 255)) - pix_square_size = ( - self.window_size / self.size - ) # The size of a single grid square in pixels - - # First we draw the target - pygame.draw.rect( - canvas, - (255, 0, 0), - pygame.Rect( - pix_square_size * self._target_location, - (pix_square_size, pix_square_size), - ), - ) - # Now we draw the agent - pygame.draw.circle( - canvas, - (0, 0, 255), - (self._agent_location + 0.5) * pix_square_size, - pix_square_size / 3, - ) - - # Finally, add some gridlines - for x in range(self.size + 1): - pygame.draw.line( - canvas, - 0, - (0, pix_square_size * x), - (self.window_size, pix_square_size * x), - width=3, - ) - pygame.draw.line( - canvas, - 0, - (pix_square_size * x, 0), - (pix_square_size * x, self.window_size), - width=3, - ) - - if self.render_mode == "human": - # The following line copies our drawings from `canvas` to the visible window - self.window.blit(canvas, canvas.get_rect()) - pygame.event.pump() - pygame.display.update() - - # We need to ensure that human-rendering occurs at the predefined framerate. - # The following line will automatically add a delay to keep the framerate stable. - self.clock.tick(self.metadata["render_fps"]) - else: # rgb_array - return np.transpose( - np.array(pygame.surfarray.pixels3d(canvas)), axes=(1, 0, 2) - ) -``` - -### Close - -The `close` method should close any open resources that were used by the environment. In many cases, -you don't actually have to bother to implement this method. However, in our example `render_mode` may -be `"human"` and we might need to close the window that has been opened: - -```python - def close(self): - if self.window is not None: - pygame.display.quit() - pygame.quit() -``` - -In other environments `close` might also close files that were opened -or release other resources. You shouldn't interact with the environment after having called `close`. - - -## Registering Envs - -In order for the custom environments to be detected by Gymnasium, they must be registered as follows. We will choose to put this code in `gym-examples/gym_examples/__init__.py`. 
- -```python -from gymnasium.envs.registration import register - -register( - id='gym_examples/GridWorld-v0', - entry_point='gym_examples.envs:GridWorldEnv', - max_episode_steps=300, -) -``` -The environment ID consists of three components, two of which are optional: an optional namespace (here: `gym_examples`), a mandatory name (here: `GridWorld`) and an optional but recommended version (here: v0). It might have also been registered as `GridWorld-v0` (the recommended approach), `GridWorld` or `gym_examples/GridWorld`, and the appropriate ID should then be used during environment creation. - -The keyword argument `max_episode_steps=300` will ensure that GridWorld environments that are instantiated via `gymnasium.make` -will be wrapped in a `TimeLimit` wrapper (see [the wrapper documentation](/api/wrappers) -for more information). A done signal will then be produced if the agent has reached the target *or* 300 steps have been -executed in the current episode. To distinguish truncation and termination, you can check `info["TimeLimit.truncated"]`. - -Apart from `id` and `entrypoint`, you may pass the following additional keyword arguments to `register`: - -| Name | Type | Default | Description | -|---------------------|----------|----------|-----------------------------------------------------------------------------------------------------------| -| `reward_threshold` | `float` | `None` | The reward threshold before the task is considered solved | -| `nondeterministic` | `bool` | `False` | Whether this environment is non-deterministic even after seeding | -| `max_episode_steps` | `int` | `None` | The maximum number of steps that an episode can consist of. If not `None`, a `TimeLimit` wrapper is added | -| `order_enforce` | `bool` | `True` | Whether to wrap the environment in an `OrderEnforcing` wrapper | -| `autoreset` | `bool` | `False` | Whether to wrap the environment in an `AutoResetWrapper` | -| `kwargs` | `dict` | `{}` | The default kwargs to pass to the environment class | - -Most of these keywords (except for `max_episode_steps`, `order_enforce` and `kwargs`) do not alter the behavior -of environment instances but merely provide some extra information about your environment. -After registration, our custom `GridWorldEnv` environment can be created with `env = gymnasium.make('gym_examples/GridWorld-v0')`. - -`gym-examples/gym_examples/envs/__init__.py` should have: - -```python -from gym_examples.envs.grid_world import GridWorldEnv -``` - -If your environment is not registered, you may optionally pass a module to import, that would register your environment before creating it like this - -`env = gymnasium.make('module:Env-v0')`, where `module` contains the registration code. For the GridWorld env, the registration code is run by importing `gym_examples` so if it were not possible to import gym_examples explicitly, you could register while making by `env = gymnasium.make('gym_examples:gym_examples/GridWorld-v0)`. This is especially useful when you're allowed to pass only the environment ID into a third-party codebase (eg. learning library). This lets you register your environment without needing to edit the library's source code. - -## Creating a Package - -The last step is to structure our code as a Python package. This involves configuring `gym-examples/setup.py`. 
A minimal example of how to do so is as follows: - -```python -from setuptools import setup - -setup( - name="gym_examples", - version="0.0.1", - install_requires=["gymnasium==0.26.2", "pygame==2.1.0"], -) -``` - -## Creating Environment Instances - -After you have installed your package locally with `pip install -e gym-examples`, you can create an instance of the environment via: - -```python -import gym_examples -env = gymnasium.make('gym_examples/GridWorld-v0') -``` - -You can also pass keyword arguments of your environment's constructor to `gymnasium.make` to customize the environment. -In our case, we could do: - -```python -env = gymnasium.make('gym_examples/GridWorld-v0', size=10) -``` - -Sometimes, you may find it more convenient to skip registration and call the environment's -constructor yourself. Some may find this approach more pythonic and environments that are instantiated like this are -also perfectly fine (but remember to add wrappers as well!). - -## Using Wrappers - -Oftentimes, we want to use different variants of a custom environment, or we want to -modify the behavior of an environment that is provided by Gymnasium or some other party. -Wrappers allow us to do this without changing the environment implementation or adding any boilerplate code. -Check out the [wrapper documentation](/api/wrappers/) for details on how to -use wrappers and instructions for implementing your own. -In our example, observations cannot be used directly in learning code because they are dictionaries. -However, we don't actually need to touch our environment implementation to fix this! We can simply add -a wrapper on top of environment instances to flatten observations into a single array: - -```python -import gym_examples -from gymnasium.wrappers import FlattenObservation - -env = gymnasium.make('gym_examples/GridWorld-v0') -wrapped_env = FlattenObservation(env) -print(wrapped_env.reset()) # E.g. [3 0 3 3], {} -``` - -Wrappers have the big advantage that they make environments highly modular. For instance, instead of flattening the -observations from GridWorld, you might only want to look at the relative position of the target and the agent. -In the section on [ObservationWrappers](/api/wrappers/#observationwrapper) we have implemented -a wrapper that does this job. This wrapper is also available in gym-examples: - -```python -import gym_examples -from gym_examples.wrappers import RelativePosition - -env = gymnasium.make('gym_examples/GridWorld-v0') -wrapped_env = RelativePosition(env) -print(wrapped_env.reset()) # E.g. [-3 3], {} -``` - diff --git a/docs/content/handling_timelimits.md b/docs/content/handling_timelimits.md deleted file mode 100644 index bd0d716d0..000000000 --- a/docs/content/handling_timelimits.md +++ /dev/null @@ -1,75 +0,0 @@ -# Handling Time Limits - -In using Gymnasium environments with reinforcement learning code, a common problem observed is how time limits are incorrectly handled. The `done` signal received (in previous versions of OpenAI Gym < 0.26) from `env.step` indicated whether an episode has ended. However, this signal did not distinguish whether the episode ended due to `termination` or `truncation`. - -In using Gymnasium environments with reinforcement learning code, a common problem observed is how time limits are -incorrectly handled. The `done` signal received (in previous versions of gymnasium < 0.26) from `env.step` indicated -whether an episode has ended. 
However, this signal did not distinguish whether the episode ended due to `termination` or `truncation`. - -## Termination - -Termination refers to the episode ending after reaching a terminal state that is defined as part of the environment -definition. Examples are - task success, task failure, robot falling down etc. Notably, this also includes episodes -ending in finite-horizon environments due to a time-limit inherent to the environment. Note that to preserve Markov -property, a representation of the remaining time must be present in the agent's observation in finite-horizon environments. -[(Reference)](https://arxiv.org/abs/1712.00378) - -## Truncation - -Truncation refers to the episode ending after an externally defined condition (that is outside the scope of the Markov -Decision Process). This could be a time-limit, a robot going out of bounds etc. - -An infinite-horizon environment is an obvious example of where this is needed. We cannot wait forever for the episode -to complete, so we set a practical time-limit after which we forcibly halt the episode. The last state in this case is -not a terminal state since it has a non-zero transition probability of moving to another state as per the Markov -Decision Process that defines the RL problem. This is also different from time-limits in finite horizon environments -as the agent in this case has no idea about this time-limit. - -## Importance in learning code - -Bootstrapping (using one or more estimated values of a variable to update estimates of the same variable) is a key -aspect of Reinforcement Learning. A value function will tell you how much discounted reward you will get from a -particular state if you follow a given policy. When an episode stops at any given point, by looking at the value of -the final state, the agent is able to estimate how much discounted reward could have been obtained if the episode has -continued. This is an example of handling truncation. - -More formally, a common example of bootstrapping in RL is updating the estimate of the Q-value function, - -```math -Q_{target}(o_t, a_t) = r_t + \gamma . \max_a(Q(o_{t+1}, a_{t+1})) -``` -In classical RL, the new `Q` estimate is a weighted average of the previous `Q` estimate and `Q_target` while in Deep -Q-Learning, the error between `Q_target` and the previous `Q` estimate is minimized. - -However, at the terminal state, bootstrapping is not done, - -```math -Q_{target}(o_t, a_t) = r_t -``` - -This is where the distinction between termination and truncation becomes important. When an episode ends due to -termination we don't bootstrap, when it ends due to truncation, we bootstrap. - -While using gymnasium environments, the `done` signal (default for < v0.26) is frequently used to determine whether to -bootstrap or not. However, this is incorrect since it does not differentiate between termination and truncation. - -A simple example of value functions is shown below. This is an illustrative example and not part of any specific algorithm. - -```python -# INCORRECT -vf_target = rew + gamma * (1-done)* vf_next_state -``` - -This is incorrect in the case of episode ending due to a truncation, where bootstrapping needs to happen but it doesn't. - -## Solution - -From v0.26 onwards, Gymnasium's `env.step` API returns both termination and truncation information explicitly. -In the previous version truncation information was supplied through the info key `TimeLimit.truncated`. 
-The correct way to handle terminations and truncations now is, - -```python -# terminated = done and 'TimeLimit.truncated' not in info # This was needed in previous versions. - -vf_target = rew + gamma*(1-terminated)*vf_next_state -``` diff --git a/docs/content/vectorising.md b/docs/content/vectorising.md deleted file mode 100644 index d2f66edb1..000000000 --- a/docs/content/vectorising.md +++ /dev/null @@ -1,340 +0,0 @@ ---- -layout: "contents" -title: Vectorising your environments ---- - -# Vectorizing your environments - -## Vectorized Environments - -*Vectorized environments* are environments that run multiple independent copies of the same environment in parallel using [multiprocessing](https://docs.python.org/3/library/multiprocessing.html). Vectorized environments take as input a batch of actions, and return a batch of observations. This is particularly useful, for example, when the policy is defined as a neural network that operates over a batch of observations. -Gymnasium provides two types of vectorized environments: - -- `gymnasium.vector.SyncVectorEnv`, where the different copies of the environment are executed sequentially. -- `gymnasium.vector.AsyncVectorEnv`, where the different copies of the environment are executed in parallel using [multiprocessing](https://docs.python.org/3/library/multiprocessing.html). This creates one process per copy. - - - -Similar to `gymnasium.make`, you can run a vectorized version of a registered environment using the `gymnasium.vector.make` function. This runs multiple copies of the same environment (in parallel, by default). - -The following example runs 3 copies of the ``CartPole-v1`` environment in parallel, taking as input a vector of 3 binary actions (one for each copy of the environment), and returning an array of 3 observations stacked along the first dimension, with an array of rewards returned by each copy, and an array of booleans indicating if the episode in each parallel environment has ended. - -```python ->>> import gymnasium as gym ->>> envs = gym.vector.make("CartPole-v1", num_envs=3) ->>> envs.reset() ->>> actions = np.array([1, 0, 1]) ->>> observations, rewards, termination, truncation, infos = envs.step(actions) - ->>> observations -array([[ 0.00122802, 0.16228443, 0.02521779, -0.23700266], - [ 0.00788269, -0.17490888, 0.03393489, 0.31735462], - [ 0.04918966, 0.19421194, 0.02938497, -0.29495203]], - dtype=float32) ->>> rewards -array([1., 1., 1.]) ->>> termination -array([False, False, False]) ->>> truncation -array([False, False, False]) ->>> infos -{} -``` - -The function `gymnasium.vector.make` is meant to be used only in basic cases (e.g. running multiple copies of the same registered environment). For any other use cases, please use either the `SyncVectorEnv` for sequential execution or `AsyncVectorEnv` for parallel execution. These use cases may include: - -- Running multiple instances of the same environment with different parameters (e.g. ``"Pendulum-v0"`` with different values for the gravity). -- Running multiple instances of an unregistered environment (e.g. a custom environment). -- Using a wrapper on some (but not all) environment copies. - - -### Creating a vectorized environment - -To create a vectorized environment that runs multiple environment copies, you can wrap your parallel environments inside `gymnasium.vector.SyncVectorEnv` (for sequential execution), or `gymnasium.vector.AsyncVectorEnv` (for parallel execution, with [multiprocessing](https://docs.python.org/3/library/multiprocessing.html)). 
These vectorized environments take as input a list of callables specifying how the copies are created. - -```python ->>> envs = gymnasium.vector.AsyncVectorEnv([ -... lambda: gymnasium.make("CartPole-v1"), -... lambda: gymnasium.make("CartPole-v1"), -... lambda: gymnasium.make("CartPole-v1") -... ]) -``` - -Alternatively, to create a vectorized environment of multiple copies of the same registered environment, you can use the function `gymnasium.vector.make()`. - -```python ->>> envs = gymnasium.vector.make("CartPole-v1", num_envs=3) # Equivalent -``` - -To enable automatic batching of actions and observations, all of the environment copies must share the same `action_space` and `observation_space`. However, all of the parallel environments are not required to be exact copies of one another. For example, you can run 2 instances of ``Pendulum-v1`` with different values for gravity in a vectorized environment with: - -```python ->>> env = gym.vector.AsyncVectorEnv([ -... lambda: gym.make("Pendulum-v1", g=9.81), -... lambda: gym.make("Pendulum-v1", g=1.62) -... ]) -``` - -See the `Observation & Action spaces` section for more information about automatic batching. - -When using `AsyncVectorEnv` with either the ``spawn`` or ``forkserver`` start methods, you must wrap your code containing the vectorized environment with ``if __name__ == "__main__":``. See [this documentation](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) for more information. - -```python -if __name__ == "__main__": - envs = gymnasium.vector.make("CartPole-v1", num_envs=3, context="spawn") -``` - -### Working with vectorized environments - -While standard Gymnasium environments take a single action and return a single observation (with a reward, and boolean indicating termination), vectorized environments take a *batch of actions* as input, and return a *batch of observations*, together with an array of rewards and booleans indicating if the episode ended in each environment copy. - - -```python ->>> envs = gymnasium.vector.make("CartPole-v1", num_envs=3) ->>> envs.reset() -(array([[-0.02792548, -0.04423395, 0.00026012, 0.04486719], - [-0.04906582, 0.02779809, 0.02881928, -0.04467649], - [ 0.0036706 , -0.00324916, 0.047668 , -0.02039891]], - dtype=float32), {}) - ->>> actions = np.array([1, 0, 1]) ->>> observations, rewards, termination, truncation, infos = envs.step(actions) - ->>> observations -array([[ 0.00187507, 0.18986781, -0.03168437, -0.301252 ], - [-0.02643229, -0.18816885, 0.04371385, 0.3034975 ], - [-0.02803041, 0.24251814, 0.02660446, -0.29707024]], - dtype=float32) ->>> rewards -array([1., 1., 1.]) ->>> termination -array([False, False, False]) ->>> truncation -array([False, False, False]) ->>> infos -{} -``` - -Vectorized environments are compatible with any environment, regardless of the action and observation spaces (e.g. container spaces like `gymnasium.spaces.Dict`, or any arbitrarily nested spaces). In particular, vectorized environments can automatically batch the observations returned by `VectorEnv.reset` and `VectorEnv.step` for any standard Gymnasium `Space` (e.g. `gymnasium.spaces.Box`, `gymnasium.spaces.Discrete`, `gymnasium.spaces.Dict`, or any nested structure thereof). Similarly, vectorized environments can take batches of actions from any standard Gymnasium `Space`. - -```python ->>> class DictEnv(gymnasium.Env): -... observation_space = gymnasium.spaces.Dict({ -... "position": gymnasium.spaces.Box(-1., 1., (3,), np.float32), -... 
"velocity": gymnasium.spaces.Box(-1., 1., (2,), np.float32) -... }) -... action_space = gymnasium.spaces.Dict({ -... "fire": gymnasium.spaces.Discrete(2), -... "jump": gymnasium.spaces.Discrete(2), -... "acceleration": gymnasium.spaces.Box(-1., 1., (2,), np.float32) -... }) -... -... def reset(self): -... return self.observation_space.sample() -... -... def step(self, action): -... observation = self.observation_space.sample() -... return observation, 0., False, False, {} - ->>> envs = gymnasium.vector.AsyncVectorEnv([lambda: DictEnv()] * 3) ->>> envs.observation_space -Dict(position:Box(-1.0, 1.0, (3, 3), float32), velocity:Box(-1.0, 1.0, (3, 2), float32)) ->>> envs.action_space -Dict(fire:MultiDiscrete([2 2 2]), jump:MultiDiscrete([2 2 2]), acceleration:Box(-1.0, 1.0, (3, 2), float32)) - ->>> envs.reset() ->>> actions = { -... "fire": np.array([1, 1, 0]), -... "jump": np.array([0, 1, 0]), -... "acceleration": np.random.uniform(-1., 1., size=(3, 2)) -... } ->>> observations, rewards, termination, truncation, infos = envs.step(actions) ->>> observations -{"position": array([[-0.5337036 , 0.7439302 , 0.41748118], - [ 0.9373266 , -0.5780453 , 0.8987405 ], - [-0.917269 , -0.5888639 , 0.812942 ]], dtype=float32), -"velocity": array([[ 0.23626241, -0.0616814 ], - [-0.4057572 , -0.4875375 ], - [ 0.26341468, 0.72282314]], dtype=float32)} -``` - -The environment copies inside a vectorized environment automatically call `gymnasium.Env.reset` at the end of an episode. In the following example, the episode of the 3rd copy ends after 2 steps (the agent fell in a hole), and the parallel environment gets reset (observation ``0``). - -```python ->>> envs = gymnasium.vector.make("FrozenLake-v1", num_envs=3, is_slippery=False) ->>> envs.reset() -(array([0, 0, 0]), {'prob': array([1, 1, 1]), '_prob': array([ True, True, True])}) ->>> observations, rewards, termination, truncation, infos = envs.step(np.array([1, 2, 2])) ->>> observations, rewards, termination, truncation, infos = envs.step(np.array([1, 2, 1])) ->>> observations -array([8, 2, 0]) ->>> termination -array([False, False, True]) -``` - -Vectorized environments will return `infos` in the form of a dictionary where each value is an array of length `num_envs` and the _i-th_ value of the array represents the info of the _i-th_ environment. -Each `key` of the info is paired with a boolean mask `_key` representing whether or not the _i-th_ environment has data. -If the _dtype_ of the returned info is whether `int`, `float`, `bool` or any _dtype_ inherited from `np.number`, an array of the same _dtype_ will be returned. Otherwise, the array will have _dtype_ `object`. - - -```python ->>> envs = gymnasium.vector.make("CartPole-v1", num_envs=3) ->>> observations, infos = envs.reset() - ->>> actions = np.array([1, 0, 1]) ->>> observations, rewards, termination, truncation, infos = envs.step(actions) - ->>> while not any(np.logical_or(termination, truncation)): -... observations, rewards, termination, truncation, infos = envs.step(actions) - ->>> termination -[False, True, False] - ->>> infos -{'final_observation': array([None, - array([-0.11350546, -1.8090094 , 0.23710881, 2.8017728 ], dtype=float32), - None], dtype=object), '_final_observation': array([False, True, False])} -``` - -## Observation & Action spaces - -Like any Gymnasium environment, vectorized environments contain the two properties `VectorEnv.observation_space` and `VectorEnv.action_space` to specify the observation and action spaces of the environments. 
Since vectorized environments operate on multiple environment copies, where the actions taken and observations returned by all of the copies are batched together, the observation and action *spaces* are batched as well so that the input actions are valid elements of `VectorEnv.action_space`, and the observations are valid elements of `VectorEnv.observation_space`. - -```python ->>> envs = gymnasium.vector.make("CartPole-v1", num_envs=3) ->>> envs.observation_space -Box([[-4.8 ...]], [[4.8 ...]], (3, 4), float32) ->>> envs.action_space -MultiDiscrete([2 2 2]) -``` - -In order to appropriately batch the observations and actions in vectorized environments, the observation and action spaces of all of the copies are required to be identical. - -```python ->>> envs = gymnasium.vector.AsyncVectorEnv([ -... lambda: gymnasium.make("CartPole-v1"), -... lambda: gymnasium.make("MountainCar-v0") -... ]) -RuntimeError: Some environments have an observation space different from `Box([-4.8 ...], [4.8 ...], (4,), float32)`. -In order to batch observations, the observation spaces from all environments must be equal. -``` -However, sometimes it may be handy to have access to the observation and action spaces of a particular copy, and not the batched spaces. You can access those with the properties `VectorEnv.single_observation_space` and `VectorEnv.single_action_space` of the vectorized environment. - -```python ->>> envs = gymnasium.vector.make("CartPole-v1", num_envs=3) ->>> envs.single_observation_space -Box([-4.8 ...], [4.8 ...], (4,), float32) ->>> envs.single_action_space -Discrete(2) -``` -This is convenient, for example, if you instantiate a policy. In the following example, we use `VectorEnv.single_observation_space` and `VectorEnv.single_action_space` to define the weights of a linear policy. Note that, thanks to the vectorized environment, we can apply the policy directly to the whole batch of observations with a single call to `policy`. - -```python ->>> from gymnasium.spaces.utils import flatdim ->>> from scipy.special import softmax - ->>> def policy(weights, observations): -... logits = np.dot(observations, weights) -... return softmax(logits, axis=1) - ->>> envs = gymnasium.vector.make("CartPole-v1", num_envs=3) ->>> weights = np.random.randn( -... flatdim(envs.single_observation_space), -... envs.single_action_space.n -... ) ->>> observations, infos = envs.reset() ->>> actions = policy(weights, observations).argmax(axis=1) ->>> observations, rewards, termination, truncation, infos = envs.step(actions) -``` - -## Intermediate Usage - -### Shared memory - -`AsyncVectorEnv` runs each environment copy inside an individual process. At each call to `AsyncVectorEnv.reset` or `AsyncVectorEnv.step`, the observations of all of the parallel environments are sent back to the main process. To avoid expensive transfers of data between processes, especially with large observations (e.g. images), `AsyncVectorEnv` uses a shared memory by default (``shared_memory=True``) that processes can write to and read from at minimal cost. This can increase the throughput of the vectorized environment. - -```python ->>> env_fns = [lambda: gymnasium.make("BreakoutNoFrameskip-v4")] * 5 - ->>> envs = gymnasium.vector.AsyncVectorEnv(env_fns, shared_memory=False) ->>> envs.reset() ->>> %timeit envs.step(envs.action_space.sample()) -2.23 ms ± 136 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each) - ->>> envs = gymnasium.vector.AsyncVectorEnv(env_fns, shared_memory=True) ->>> envs.reset() ->>> %timeit envs.step(envs.action_space.sample()) -1.36 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) -``` - -### Exception handling - -Because sometimes things may not go as planned, the exceptions raised in any given environment copy are re-raised in the vectorized environment, even when the copy runs in parallel with `AsyncVectorEnv`. This way, you can choose how to handle these exceptions yourself (with ``try ... except``). - -```python ->>> class ErrorEnv(gymnasium.Env): -... observation_space = gymnasium.spaces.Box(-1., 1., (2,), np.float32) -... action_space = gymnasium.spaces.Discrete(2) -... -... def reset(self): -... return np.zeros((2,), dtype=np.float32), {} -... -... def step(self, action): -... if action == 1: -... raise ValueError("An error occurred.") -... observation = self.observation_space.sample() -... return observation, 0., False, False, {} - ->>> envs = gymnasium.vector.AsyncVectorEnv([lambda: ErrorEnv()] * 3) ->>> observations, infos = envs.reset() ->>> observations, rewards, termination, termination, infos = envs.step(np.array([0, 0, 1])) -ERROR: Received the following error from Worker-2: ValueError: An error occurred. -ERROR: Shutting down Worker-2. -ERROR: Raising the last exception back to the main process. -ValueError: An error occurred. -``` - -## Advanced Usage - -### Custom spaces - -Vectorized environments will batch actions and observations if they are elements from standard Gymnasium spaces, such as `gymnasium.spaces.Box`, `gymnasium.spaces.Discrete`, or `gymnasium.spaces.Dict`. However, if you create your own environment with a custom action and/or observation space (inheriting from `gymnasium.Space`), the vectorized environment will not attempt to automatically batch the actions/observations, and instead, it will return the raw tuple of elements from all parallel environments. - -In the following example, we create a new environment `SMILESEnv`, whose observations are strings representing the [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) notation of a molecular structure, with a custom observation space `SMILES`. The observations returned by the vectorized environment are contained in a tuple of strings. - -```python ->>> class SMILES(gymnasium.Space): -... def __init__(self, symbols): -... super().__init__() -... self.symbols = symbols -... -... def __eq__(self, other): -... return self.symbols == other.symbols - ->>> class SMILESEnv(gymnasium.Env): -... observation_space = SMILES("][()CO=") -... action_space = gymnasium.spaces.Discrete(7) -... -... def reset(self): -... self._state = "[" -... return self._state -... -... def step(self, action): -... self._state += self.observation_space.symbols[action] -... reward = terminated = (action == 0) -... return self._state, float(reward), terminated, False, {} - ->>> envs = gymnasium.vector.AsyncVectorEnv( -... [lambda: SMILESEnv()] * 3, -... shared_memory=False -... ) ->>> envs.reset() ->>> observations, rewards, termination, truncation, infos = envs.step(np.array([2, 5, 4])) ->>> observations -('[(', '[O', '[C') -``` - -Custom observation and action spaces may inherit from the `gymnasium.Space` class. However, most use cases should be covered by the existing space classes (e.g. `gymnasium.spaces.Box`, `gymnasium.spaces.Discrete`, etc...), and container classes (`gymnasium.spaces.Tuple` and `gymnasium.spaces.Dict`). 
Moreover, some implementations of reinforcement learning algorithms might not handle custom spaces properly. Use custom spaces with care. - -If you use `AsyncVectorEnv` with a custom observation space, you must set ``shared_memory=False``, since shared memory and automatic batching are not compatible with custom spaces. In general, if you use custom spaces with `AsyncVectorEnv`, the elements of those spaces must be `pickleable`. - diff --git a/docs/environments/mujoco.md b/docs/environments/mujoco.md index dbe4ddaef..45b1557a6 100644 --- a/docs/environments/mujoco.md +++ b/docs/environments/mujoco.md @@ -35,7 +35,7 @@ pip install gymnasium[mujoco] These environments also require that the MuJoCo engine be installed. As of October 2021 DeepMind has acquired MuJoCo and is open-sourcing it in 2022, making it free for everyone. Instructions on installing the MuJoCo engine can be found on their [website](https://mujoco.org) and [GitHub repository](https://github.com/deepmind/mujoco). Using MuJoCo with Gymnasium also requires that the framework `mujoco` be installed (this dependency is installed with the above command). -For MuJoCo V3 enviroments and older the `mujoco-py` framework is required (`pip install mujoco-py`) which can be found in the [GitHub repository](https://github.com/openai/mujoco-py/tree/master/mujoco_py) +For MuJoCo V3 environments and older the `mujoco-py` framework is required (`pip install mujoco-py`) which can be found in the [GitHub repository](https://github.com/openai/mujoco-py/tree/master/mujoco_py) There are ten Mujoco environments: Ant, HalfCheetah, Hopper, Humanoid, HumanoidStandup, IvertedDoublePendulum, InvertedPendulum, Reacher, Swimmer, and Walker. All of these environments are stochastic in terms of their initial state, with a Gaussian noise added to a fixed initial state in order to add stochasticity. The state spaces for MuJoCo environments in Gymnasium consist of two parts that are flattened and concatenated together: a position of a body part ('*mujoco-py.mjsim.qpos*') or joint and its corresponding velocity ('*mujoco-py.mjsim.qvel*'). Often, some of the first positional elements are omitted from the state space since the reward is calculated based on their values, leaving it up to the algorithm to infer those hidden values indirectly.
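+As a small illustrative sketch (assuming `gymnasium[mujoco]` has been installed as shown above), a v4 MuJoCo environment is created like any other environment and exposes this flattened observation:
+
+```python
+import gymnasium as gym
+
+# v4 environments use the maintained `mujoco` bindings;
+# v3 environments and older require the legacy `mujoco-py` package.
+env = gym.make("HalfCheetah-v4")
+observation, info = env.reset(seed=0)
+
+# The observation concatenates the (partially omitted) joint positions and velocities
+print(env.observation_space.shape)  # (17,) for HalfCheetah-v4
+```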