Commit 1a25b35

Fix README to work.
1 parent b0059f0 commit 1a25b35

12 files changed (+132 −180 lines)

README.md (+48 −39)
@@ -27,69 +27,39 @@ There are 4 parts to the package:
 
 2) **Complex Environment Wrappers**: Similar to the toy environment, this is parameterised by a `config` dict which contains all the information needed to inject the dimensions into Atari or Mujoco environments. Please see [`example.py`](example.py) for some simple examples of how to use these. The Atari wrapper is in [`mdp_playground/envs/gym_env_wrapper.py`](mdp_playground/envs/gym_env_wrapper.py) and the Mujoco wrapper is in [`mdp_playground/envs/mujoco_env_wrapper.py`](mdp_playground/envs/mujoco_env_wrapper.py).
 
-3) **Experiments**: Experiments are launched using [`run_experiments.py`](run_experiments.py). Config files for experiments are located inside the [`experiments`](experiments) directory. Please read the [instructions](#running-experiments) below for details.
+3) **Experiments**: Experiments are launched using [`run_experiments.py`](run_experiments.py). Config files for experiments are located inside the [`experiments`](experiments) directory. Please read the [instructions](#running-experiments) below for details on how to launch experiments.
 
 4) **Analysis**: [`plot_experiments.ipynb`](plot_experiments.ipynb) contains code to plot the standard plots from the paper.
 
-## Installation
-
-### Production use
-We recommend using `conda` to manage environments. After setup of the environment, you can install MDP Playground in two ways:
-#### Manual
-To install MDP Playground manually, clone the repository and run:
-```bash
-pip install -e .[extras]
-```
-This might be the preferred way if you want easy access to the included experiments.
 
-#### From PyPI
-MDP Playground is also on PyPI. Just run:
-```bash
-pip install mdp_playground[extras]
-```
+## Running experiments from the main paper
+For reproducing experiments from the main paper, please continue reading.
 
+For general instructions, please see [here](#installation).
 
-### Reproducing results from the paper
-We recommend using `conda` environments to manage virtual `Python` environments to run the experiments. Unfortunately, you will have to maintain 2 environments - 1 for the "older" **discrete toy** experiments and 1 for the "newer" **continuous and complex** experiments from the paper. As mentioned in Appendix P in the paper, this is because of issues with Ray, the library that we used for our baseline agents.
+### Installation for running experiments from the main paper
+We recommend using `conda` environments to manage virtual `Python` environments to run the experiments. Unfortunately, you will have to maintain 2 environments - 1 for the "older" **discrete toy** experiments and 1 for the "newer" **continuous and complex** experiments from the paper. As mentioned in Appendix section **Tuned Hyperparameters** in the paper, this is because of issues with Ray, the library that we used for our baseline agents.
 
 Please follow the following commands to install for the discrete toy experiments:
 ```bash
 conda create -n py36_toy_rl_disc_toy python=3.6
 conda activate py36_toy_rl_disc_toy
 cd mdp-playground
+pip install -r requirements.txt
 pip install -e .[extras_disc]
 ```
 
-Please follow the following commands to install for the continuous and complex experiments:
+Please follow the following commands to install for the continuous and complex experiments. **IMPORTANT**: In case, you do not have MuJoCo, please ignore any mujoco-py related installation errors below:
 ```bash
 conda create -n py36_toy_rl_cont_comp python=3.6
 conda activate py36_toy_rl_cont_comp
 cd mdp-playground
+pip install -r requirements.txt
 pip install -e .[extras_cont]
 wget 'https://ray-wheels.s3-us-west-2.amazonaws.com/master/8d0c1b5e068853bf748f72b1e60ec99d240932c6/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl'
 pip install ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl[rllib,debug]
 ```
 
-## Running experiments
-For reproducing experiments from the main paper, please see [below](#running-experiments-from-the-main-paper).
-
-For general instructions, please continue reading.
-
-You can run experiments using:
-```
-run-mdpp-experiments -c <config_file> -e <exp_name> -n <config_num>
-```
-The `exp_name` is a prefix for the filenames of CSV files where stats for the experiments are recorded. The CSV stats files will be saved to the current directory.<br>
-Each of the command line arguments has defaults. Please refer to the documentation inside [`run_experiments.py`](run_experiments.py) for further details on the command line arguments. (Or run it with the `-h` flag to bring up help.)
-
-The config files for experiments from the [paper](https://arxiv.org/abs/1909.07750) are in the experiments directory.<br>
-The name of the file corresponding to an experiment is formed as: `<algorithm_name>_<dimension_names>.py`<br>
-Some sample `algorithm_name`s are: `dqn`, `rainbow`, `a3c`, `a3c_lstm`, `ddpg`, `td3` and `sac`<br>
-Some sample `dimension_name`s are: `seq_del` (for **delay** and **sequence length** varied together), `p_r_noises` (for **P** and **R noises** varied together),
-`target_radius` (for varying **target radius**) and `time_unit` (for varying **time unit**)<br>
-For example, for algorithm **DQN** when varying dimensions **delay** and **sequence length**, the corresponding experiment file is [`dqn_seq_del.py`](experiments/dqn_seq_del.py)
-
-## Running experiments from the main paper
 We list here the commands for the experiments from the main paper:
 ```bash
 # Discrete toy environments:
@@ -108,6 +78,8 @@ python run_experiments.py -c experiments/ddpg_move_to_a_point_irr_dims.py -e ddp
 python run_experiments.py -c experiments/ddpg_move_to_a_point_p_order_2.py -e ddpg_move_to_a_point_p_order_2
 
 # Complex environments:
+# The commands below run all configs serially.
+# In case, you want to parallelise on a cluster, please provide the CLI argument -n <config_number> at the end of the given commands. Please refer to the documentation for run_experiments.py for this.
 conda activate py36_toy_rl_cont_comp
 python run_experiments.py -c experiments/dqn_qbert_del.py -e dqn_qbert_del
 python run_experiments.py -c experiments/ddpg_halfcheetah_time_unit.py -e ddpg_halfcheetah_time_unit
@@ -121,6 +93,43 @@ python run_experiments.py -c experiments/ddpg_halfcheetah_time_unit.py -e ddpg_h
 
 The CSV stats files will be saved to the current directory and can be analysed in [`plot_experiments.ipynb`](plot_experiments.ipynb).
 
+
+## Installation
+For reproducing experiments from the main paper, please see [here](#running-experiments-from-the-main-paper).
+
+### Production use
+We recommend using `conda` to manage environments. After setup of the environment, you can install MDP Playground in two ways:
+#### Manual
+To install MDP Playground manually, clone the repository and run:
+```bash
+pip install -e .[extras]
+```
+This might be the preferred way if you want easy access to the included experiments.
+
+#### From PyPI
+MDP Playground is also on PyPI. Just run:
+```bash
+pip install mdp_playground[extras]
+```
+
+
+## Running experiments
+You can run experiments using:
+```
+run-mdpp-experiments -c <config_file> -e <exp_name> -n <config_num>
+```
+The `exp_name` is a prefix for the filenames of CSV files where stats for the experiments are recorded. The CSV stats files will be saved to the current directory.<br>
+Each of the command line arguments has defaults. Please refer to the documentation inside [`run_experiments.py`](run_experiments.py) for further details on the command line arguments. (Or run it with the `-h` flag to bring up help.)
+
+The config files for experiments from the [paper](https://arxiv.org/abs/1909.07750) are in the experiments directory.<br>
+The name of the file corresponding to an experiment is formed as: `<algorithm_name>_<dimension_names>.py`<br>
+Some sample `algorithm_name`s are: `dqn`, `rainbow`, `a3c`, `a3c_lstm`, `ddpg`, `td3` and `sac`<br>
+Some sample `dimension_name`s are: `seq_del` (for **delay** and **sequence length** varied together), `p_r_noises` (for **P** and **R noises** varied together),
+`target_radius` (for varying **target radius**) and `time_unit` (for varying **time unit**)<br>
+For example, for algorithm **DQN** when varying dimensions **delay** and **sequence length**, the corresponding experiment file is [`dqn_seq_del.py`](experiments/dqn_seq_del.py)
+
+The CSV stats files will be saved to the current directory and can be analysed in [`plot_experiments.ipynb`](plot_experiments.ipynb).
+
 ## Plotting
 To plot results from experiments, run `jupyter-notebook` and open [`plot_experiments.ipynb`](plot_experiments.ipynb) in Jupyter. There are instructions within each of the cells on how to generate and save plots.
 
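The experiment-file naming convention stated in the README (`<algorithm_name>_<dimension_names>.py`) and the launch commands listed above follow a simple pattern. The tiny helper below is hypothetical (not part of the package) and only illustrates how the filename and the corresponding `run_experiments.py` invocation are composed:

```python
# Hypothetical helper (not in MDP Playground) illustrating the naming
# convention "<algorithm_name>_<dimension_names>.py" from the README.
def experiment_file(algorithm, dimension_names):
    """Compose the config filename for an algorithm/dimension combination."""
    return f"{algorithm}_{dimension_names}.py"


def launch_command(algorithm, dimension_names):
    """Compose the launch command shown in the README, using the same
    "<algorithm>_<dimensions>" string as the -e experiment-name prefix."""
    exp_name = f"{algorithm}_{dimension_names}"
    return (
        f"python run_experiments.py -c experiments/{experiment_file(algorithm, dimension_names)}"
        f" -e {exp_name}"
    )


# DQN varying delay and sequence length together ("seq_del"):
print(launch_command("dqn", "seq_del"))
# python run_experiments.py -c experiments/dqn_seq_del.py -e dqn_seq_del
```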
experiments/ddpg_halfcheetah_action_max.py (+1)

@@ -4,6 +4,7 @@
 from ray import tune
 from collections import OrderedDict
 num_seeds = 5
+timesteps_total = 3000000
 
 
 var_env_configs = OrderedDict(

experiments/ddpg_halfcheetah_time_unit.py (+1)

@@ -4,6 +4,7 @@
 from ray import tune
 from collections import OrderedDict
 num_seeds = 5
+timesteps_total = 3000000
 
 
 var_env_configs = OrderedDict(

experiments/sac_halfcheetah_action_max.py (+1)

@@ -4,6 +4,7 @@
 from ray import tune
 from collections import OrderedDict
 num_seeds = 5
+timesteps_total = 3000000
 
 
 var_env_configs = OrderedDict(

experiments/sac_halfcheetah_time_unit.py (+1)

@@ -4,6 +4,7 @@
 from ray import tune
 from collections import OrderedDict
 num_seeds = 5
+timesteps_total = 3000000
 
 
 var_env_configs = OrderedDict(

experiments/sac_halfcheetah_time_unit_config_processor.py (+1)

@@ -4,6 +4,7 @@
 from collections import OrderedDict
 from mdp_playground.config_processor import *
 num_seeds = 5
+timesteps_total = 3000000
 
 
 var_env_configs = OrderedDict(

experiments/td3_halfcheetah_action_max.py (+1)

@@ -4,6 +4,7 @@
 from ray import tune
 from collections import OrderedDict
 num_seeds = 5
+timesteps_total = 3000000
 
 
 var_env_configs = OrderedDict(

experiments/td3_halfcheetah_time_unit.py (+1)

@@ -4,6 +4,7 @@
 from ray import tune
 from collections import OrderedDict
 num_seeds = 5
+timesteps_total = 3000000
 
 
 var_env_configs = OrderedDict(
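All seven experiment config files above gain the same `timesteps_total = 3000000` line. Schematically, the top of such a config module now looks like the sketch below, restricted to the lines visible in the diffs (the real files also import `ray.tune` and fill `var_env_configs` with the dimensions to vary):

```python
# Sketch of the top of an experiment config module after this commit.
# Only lines visible in the diffs are reproduced; the OrderedDict body is
# left empty here, whereas the real files list the varied dimensions.
from collections import OrderedDict

num_seeds = 5
timesteps_total = 3000000  # line added by this commit: total training timesteps per run

var_env_configs = OrderedDict()
```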

mdp_playground/config_processor/config_processor.py (+69 −54)

@@ -86,8 +86,57 @@ def process_configs(
         *variable_configs, overwrite=False
     )
 
+    varying_configs = []
+    separate_var_configs = []
+    # ###IMP Currently num_configs has to be equal for all 3 cases below:
+    # grid (i.e. var), random and sobol #TODO Not sure how to solve this #config
+    # setup problem. Could take Cartesian product of all 3 but that may lead to
+    # too many configs and Cartesian product of dicts is a pain.
+    if "var_configs" in dir(config):
+        separate_var_configs.append(
+            get_list_of_varying_configs(config.var_configs, mode="grid")
+        )
+    if "sobol_configs" in dir(config):
+        separate_var_configs.append(
+            get_list_of_varying_configs(
+                config.sobol_configs, mode="sobol", num_configs=config.num_configs
+            )
+        )
+    if "random_configs" in dir(config):
+        separate_var_configs.append(
+            get_list_of_varying_configs(
+                config.random_configs, mode="random", num_configs=config.num_configs
+            )
+        )
+    # print("VARYING_CONFIGS:", varying_configs)
+
+    num_configs_ = max(
+        [len(separate_var_configs[i]) for i in range(len(separate_var_configs))]
+    )
+    for i in range(num_configs_):
+        to_combine = [
+            separate_var_configs[j][i] for j in range(len(separate_var_configs))
+        ]
+        # overwrite = False because the keys in different modes of
+        # config generation need to be disjoint
+        varying_configs.append(deepmerge_multiple_dicts(*to_combine, overwrite=False))
+
+    # #hack ####TODO Remove extra pre-processing done here and again below:
+    pre_final_configs = combined_processing(
+        config.env_config,
+        config.agent_config,
+        config.model_config,
+        config.eval_config,
+        varying_configs=copy.deepcopy(varying_configs),
+        framework=framework,
+        algorithm=config.algorithm,
+    )
+
+
     if "timesteps_total" in dir(config):
         hacky_timesteps_total = config.timesteps_total  # hack
+    else:
+        hacky_timesteps_total = pre_final_configs[-1]["timesteps_total"]
 
     config_algorithm = config.algorithm  # hack
     # sys.exit(0)
@@ -137,40 +186,6 @@ def process_configs(
             + ". Available options are: ray and stable_baselines."
         )
 
-    varying_configs = []
-    separate_var_configs = []
-    # ###IMP Currently num_configs has to be equal for all 3 cases below:
-    # grid (i.e. var), random and sobol #TODO Not sure how to solve this #config
-    # setup problem. Could take Cartesian product of all 3 but that may lead to
-    # too many configs and Cartesian product of dicts is a pain.
-    if "var_configs" in dir(config):
-        separate_var_configs.append(
-            get_list_of_varying_configs(config.var_configs, mode="grid")
-        )
-    if "sobol_configs" in dir(config):
-        separate_var_configs.append(
-            get_list_of_varying_configs(
-                config.sobol_configs, mode="sobol", num_configs=config.num_configs
-            )
-        )
-    if "random_configs" in dir(config):
-        separate_var_configs.append(
-            get_list_of_varying_configs(
-                config.random_configs, mode="random", num_configs=config.num_configs
-            )
-        )
-    # print("VARYING_CONFIGS:", varying_configs)
-
-    num_configs_ = max(
-        [len(separate_var_configs[i]) for i in range(len(separate_var_configs))]
-    )
-    for i in range(num_configs_):
-        to_combine = [
-            separate_var_configs[j][i] for j in range(len(separate_var_configs))
-        ]
-        # overwrite = False because the keys in different modes of
-        # config generation need to be disjoint
-        varying_configs.append(deepmerge_multiple_dicts(*to_combine, overwrite=False))
 
     # varying_configs is a list of dict of dicts with a specific structure.
     final_configs = combined_processing(
@@ -876,28 +891,28 @@ def combined_processing(*static_configs, varying_configs, framework="ray", algor
                     "fcnet_activation": "relu",
                 }
 
-                # TODO Find a better way to enforce these?? Especially problematic for TD3
-                # because then more values for target_noise_clip are witten to CSVs than
-                # actually used during HPO but for normal (non-HPO) runs this needs to be
-                # not done.
-                if (algorithm == "DDPG"):
-                    if key == "critic_lr":
-                        final_configs[i]["actor_lr"] = value
-                    if key == "critic_hiddens":
-                        final_configs[i]["actor_hiddens"] = value
-                if algorithm == "TD3":
-                    if key == "target_noise_clip_relative":
-                        final_configs[i]["target_noise_clip"] = (
-                            final_configs[i]["target_noise_clip_relative"]
-                            * final_configs[i]["target_noise"]
-                        )
-                        del final_configs[i][
-                            "target_noise_clip_relative"
-                        ]  # hack have to delete it otherwise Ray will crash for unknown config param.
+            # TODO Find a better way to enforce these?? Especially problematic for TD3
+            # because then more values for target_noise_clip are witten to CSVs than
+            # actually used during HPO but for normal (non-HPO) runs this needs to be
+            # not done.
+            if (algorithm == "DDPG"):
+                if key == "critic_lr":
+                    final_configs[i]["actor_lr"] = value
+                if key == "critic_hiddens":
+                    final_configs[i]["actor_hiddens"] = value
+            if algorithm == "TD3":
+                if key == "target_noise_clip_relative":
+                    final_configs[i]["target_noise_clip"] = (
+                        final_configs[i]["target_noise_clip_relative"]
+                        * final_configs[i]["target_noise"]
+                    )
+                    del final_configs[i][
+                        "target_noise_clip_relative"
+                    ]  # hack have to delete it otherwise Ray will crash for unknown config param.
 
-            elif key == "model":
+            if key == "model":
                 for key_2 in final_configs[i][key]:
-                    if key_2 == "use_lstm":
+                    if key_2 == "use_lstm" and final_configs[i][key][key_2]:
                         final_configs[i][key]["max_seq_len"] = (
                             final_configs[i]["env_config"]["delay"]
                             + final_configs[i]["env_config"]["sequence_length"]

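The block moved to the top of `process_configs` merges the per-mode lists of varying configs (grid / sobol / random) index-wise, relying on the keys produced by the different modes being disjoint; the new `else` branch then lets `timesteps_total` fall back to a value taken from the pre-processed configs when the config module does not define it. The following is a minimal, self-contained sketch of just that merge step — `deepmerge_multiple_dicts` here is an illustrative stand-in for the package's own helper, and the config dicts are invented for the example:

```python
# Sketch of the index-wise merge of per-mode varying-config lists, mirroring
# the block moved earlier in process_configs by this commit. The helper below
# is an illustrative stand-in, not the package's actual deepmerge.
def deepmerge_multiple_dicts(*dicts, overwrite=False):
    """Merge dicts left to right; with overwrite=False, duplicate keys raise,
    enforcing that keys from different config-generation modes are disjoint."""
    merged = {}
    for d in dicts:
        for k, v in d.items():
            if not overwrite and k in merged:
                raise ValueError(f"duplicate key across config modes: {k}")
            merged[k] = v
    return merged


def merge_varying_configs(separate_var_configs):
    """Combine the i-th config of every mode into one dict, for each index i.
    As the source comments note, all modes are assumed to yield equally many
    configs."""
    num_configs_ = max(len(lst) for lst in separate_var_configs)
    varying_configs = []
    for i in range(num_configs_):
        to_combine = [lst[i] for lst in separate_var_configs]
        varying_configs.append(deepmerge_multiple_dicts(*to_combine, overwrite=False))
    return varying_configs


# Two modes (e.g. grid and random), two configs each, disjoint keys:
grid = [{"delay": 0}, {"delay": 4}]
random_ = [{"reward_noise": 0.1}, {"reward_noise": 0.5}]
merged = merge_varying_configs([grid, random_])
print(merged)  # [{'delay': 0, 'reward_noise': 0.1}, {'delay': 4, 'reward_noise': 0.5}]
```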