
Version 1.1.0 Classical / Modern RL Models #18


Open
wants to merge 31 commits into base: master
60 changes: 45 additions & 15 deletions README.md
@@ -108,45 +108,75 @@ OpenAI environments.

## RoadMap

- [ ] 1.1.0 More Traditional RL models
- [X] Add Cross Entropy Method (CEM)
- [X] N-step experience replay
- [X] Gaussian and Factored Gaussian Noise exploration replacement
- [X] Add Distributional DQN
- [X] Add RAINBOW DQN (Note warnings, will require refactor / re-testing)
- [ ] **Working on** Add REINFORCE
- [ ] **Working on** Add PPO
- [ ] **Working on** Add TRPO
- [ ] Add D4PG
- [ ] Add A2C
- [ ] Add A3C
- [ ] Add SAC
- [ ] 2.0.0 Mass refactor / performance update
- [ ] Environments need to be faster. Beat the OpenAI baseline of 350 frames per second.
- Comparing against the performance of https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On
- [ ] fastrl needs to handle RAM better
- [ ] Use Pong as "expensive computation" benchmark for all compatible models (discrete).
- [ ] 2 Runs image space
- [ ] Use Cartpole as "cheap computation" benchmark for all compatible models (discrete).
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use Mountain car as "far distance goal" benchmark for all compatible models (discrete)
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use Ant as "expensive computation" benchmark for all compatible models (continuous).
- [ ] 2 Runs image space
- [ ] Use Pendulum as "cheap computation" benchmark for all compatible models (continuous).
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use Mountain car continuous as a "cheap computation", "far distance goal" benchmark for all compatible models (continuous).
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use `yield` instead of `return` for the `MDPDataset` object
- [ ] Unify common code pieces shared in all models
- [ ] Transition entire project to [nbdev](https://github.com/fastai/nbdev)
- Make documentation easier / more expansive. Current method is tedious.
- [ ] 2.1.0 HRL models *Might change version to 2.0 depending on SMDP issues*
- [ ] Add SMDP
- [ ] Add goal-oriented MDPs. Will require a new "Step"
- [ ] Add FeUdal Network
- [ ] Add storage-based DataBunch memory management. This can prevent RAM from being used up by episode image frames
that may not serve any purpose for the agent and are kept only for logging.
- [ ] 2.2.0
- [ ] Add HAC
- [ ] Add MAXQ
- [ ] Add HIRO
- [ ] 2.3.0
- [ ] Add h-DQN
- [ ] Add Modulated Policy Hierarchies
- [ ] Add Meta Learning Shared Hierarchies
- [ ] 2.4.0
- [ ] Add STRategic Attentive Writer (STRAW)
- [ ] Add H-DRLN
- [ ] Add Abstract Markov Decision Process (AMDP)
- [ ] Add conda integration so that installation can be truly one step.
- [ ] 2.5.0 HRL Options models *May already be implemented in a previous model*
- [ ] Options augmentation to DQN based models
- [ ] Options augmentation to actor critic models
- [ ] Options augmentation to async actor critic models
- [ ] 2.6.0 HRL Skills
- [ ] Skills augmentation to DQN based models
- [ ] Skills augmentation to actor critic models
- [ ] Skills augmentation to async actor critic models
- [ ] 2.7.0 Add PyBullet Fetch Environments
- [ ] Envs need to subclass the OpenAI `gym.GoalEnv`
- [ ] Add HER
- [ ] 3.0.0 Breaking refactor of all methods
- [ ] Move to fastai 2.0


## Contribution
55 changes: 44 additions & 11 deletions ROADMAP.md
@@ -3,39 +3,72 @@
- [X] 0.9.0 Notebook demonstrations of basic model usage.
- [X] **1.0.0** Base version is completed with working model visualizations proving performance / expected failure. At
this point, all models should have environments in which they are guaranteed to succeed.
- [ ] 1.1.0 More Traditional RL models
- [X] Add Cross Entropy Method (CEM)
- [X] N-step experience replay
- [X] Gaussian and Factored Gaussian Noise exploration replacement
- [X] Add Distributional DQN
- [X] Add RAINBOW DQN (Note warnings, will require refactor / re-testing)
- [X] Add REINFORCE
- [ ] **Working on** Add PPO
- [ ] **Working on** Add TRPO
- [ ] Add D4PG
- [ ] Add A2C
- [ ] Add A3C
- [ ] Add SAC
- [ ] 2.0.0 Mass refactor / performance update
- [ ] Environments need to be faster. Beat the OpenAI baseline of 350 frames per second.
- Comparing against the performance of https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On
- [ ] fastrl needs to handle RAM better
- [ ] Use Pong as "expensive computation" benchmark for all compatible models (discrete).
- [ ] 2 Runs image space
- [ ] Use Cartpole as "cheap computation" benchmark for all compatible models (discrete).
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use Mountain car as "far distance goal" benchmark for all compatible models (discrete)
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use Ant as "expensive computation" benchmark for all compatible models (continuous).
- [ ] 2 Runs image space
- [ ] Use Pendulum as "cheap computation" benchmark for all compatible models (continuous).
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use Mountain car continuous as a "cheap computation", "far distance goal" benchmark for all compatible models (continuous).
- [ ] 5 Runs state space
- [ ] 2 Runs image space
- [ ] Use `yield` instead of `return` for the `MDPDataset` object (see the generator sketch after this roadmap)
- [ ] Unify common code pieces shared in all models
- [ ] Transition entire project to [nbdev](https://github.com/fastai/nbdev)
- Make documentation easier / more expansive. Current method is tedious.
- [ ] 2.1.0 HRL models *Might change version to 2.0 depending on SMDP issues*
- [ ] Add SMDP
- [ ] Add goal-oriented MDPs. Will require a new "Step"
- [ ] Add FeUdal Network
- [ ] Add storage-based DataBunch memory management. This can prevent RAM from being used up by episode image frames
that may not serve any purpose for the agent and are kept only for logging.
- [ ] 2.2.0
- [ ] Add HAC
- [ ] Add MAXQ
- [ ] Add HIRO
- [ ] 2.3.0
- [ ] Add h-DQN
- [ ] Add Modulated Policy Hierarchies
- [ ] Add Meta Learning Shared Hierarchies
- [ ] 2.4.0
- [ ] Add STRategic Attentive Writer (STRAW)
- [ ] Add H-DRLN
- [ ] Add Abstract Markov Decision Process (AMDP)
- [ ] Add conda integration so that installation can be truly one step.
- [ ] 2.5.0 HRL Options models *May already be implemented in a previous model*
- [ ] Options augmentation to DQN based models
- [ ] Options augmentation to actor critic models
- [ ] Options augmentation to async actor critic models
- [ ] 2.6.0 HRL Skills
- [ ] Skills augmentation to DQN based models
- [ ] Skills augmentation to actor critic models
- [ ] Skills augmentation to async actor critic models
- [ ] 2.7.0 Add PyBullet Fetch Environments
- [ ] Envs need to subclass the OpenAI `gym.GoalEnv`
- [ ] Add HER
- [ ] 3.0.0 Breaking refactor of all methods
- [ ] Move to fastai 2.0
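
As an illustration of the "use `yield` instead of `return`" roadmap item above, here is a minimal generator sketch of what a lazily produced episode stream could look like. The environment name, the random policy, and the `episode_steps` helper are placeholder assumptions for illustration only; they are not the actual `MDPDataset` API.

```python
import gym

def episode_steps(env_name: str = "CartPole-v1", max_steps: int = 200):
    """Yield (state, action, reward, done) tuples one at a time instead of
    returning a fully materialized list, so frames never pile up in RAM."""
    env = gym.make(env_name)
    state = env.reset()
    for _ in range(max_steps):
        action = env.action_space.sample()             # placeholder random policy
        next_state, reward, done, _ = env.step(action)
        yield state, action, reward, done               # produced lazily, not stored
        state = next_state
        if done:
            break
    env.close()

# Consume the steps lazily, e.g. inside a dataloader-style loop.
for s, a, r, d in episode_steps():
    pass  # hand each transition to the learner / replay buffer here
```

The point of the change is that a generator only ever holds the current transition, which fits the storage-based memory-management item above.
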
61 changes: 61 additions & 0 deletions docs_src/rl.agents.cem.ipynb
@@ -0,0 +1,61 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Can't import one of these: No module named 'pybullet'\n",
"pygame 2.0.0.dev6 (SDL 2.0.10, python 3.6.7)\n",
"Hello from the pygame community. https://www.pygame.org/contribute.html\n",
"Can't import one of these: No module named 'gym_minigrid'\n"
]
}
],
"source": [
"from fastai.tabular.data import emb_sz_rule\n",
"from fast_rl.agents.cem import CEMLearner, CEMTrainer\n",
"from fast_rl.agents.cem_models import CEMModel\n",
"from fast_rl.core.data_block import MDPDataBunch\n",
"import numpy as np\n",
"from fast_rl.core.metrics import RewardMetric, RollingRewardMetric"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
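
For context on the notebook above, the following is a minimal, self-contained sketch of the cross-entropy method (CEM) on a linear CartPole policy using plain numpy and gym. It is only an illustration of the algorithm; it does not use the fast_rl `CEMLearner` / `CEMTrainer` / `MDPDataBunch` API imported in the notebook.

```python
import gym
import numpy as np

def evaluate(env, theta, episodes=1):
    """Return the mean episode reward of a linear threshold policy."""
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = int(obs @ theta > 0)              # linear policy: push left/right
            obs, reward, done, _ = env.step(action)
            total += reward
    return total / episodes

def cross_entropy_method(env_name="CartPole-v1", iters=20, pop=50, elite_frac=0.2):
    """Sample candidate policies, keep the elite fraction, refit the sampler."""
    env = gym.make(env_name)
    dim = env.observation_space.shape[0]
    mu, sigma = np.zeros(dim), np.ones(dim)            # Gaussian over policy params
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = np.random.randn(pop, dim) * sigma + mu       # candidate policies
        rewards = np.array([evaluate(env, s) for s in samples])
        elites = samples[rewards.argsort()[-n_elite:]]          # top performers
        mu, sigma = elites.mean(axis=0), elites.std(axis=0)     # refit distribution
    env.close()
    return mu

# theta = cross_entropy_method()   # parameters of a decent CartPole policy
```

CEM is gradient-free: it only needs episode returns, which is why it tends to serve as the simplest sanity-check baseline among the models on the roadmap.
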
96 changes: 96 additions & 0 deletions docs_src/rl.agents.trpo.ipynb
@@ -0,0 +1,96 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"pycharm": {
"is_executing": false
}
},
"source": [
"\n",
"## TRPO\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from fastai.gen_doc.nbdoc import show_doc\n",
"from fast_rl.agents.trpo_models import *"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"<h2 id=\"TRPOModule\" class=\"doc_header\"><code>class</code> <code>TRPOModule</code><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#TRPOModule-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h2>\n",
"\n",
"> <code>TRPOModule</code>(**`ni`**:`int`, **`na`**:`int`, **`discount`**:`float`, **`fc_layers`**:`List`\\[`int`\\]=***`None`***, **`conv_filters`**:`List`\\[`int`\\]=***`None`***, **`nc`**=***`3`***, **`bn`**=***`False`***, **`q_lr`**=***`0.001`***, **`v_lr`**=***`0.0001`***, **`ks`**:`List`\\[`int`\\]=***`None`***, **`stride`**:`List`\\[`int`\\]=***`None`***) :: [`PrePostInitMeta`](/core.html#PrePostInitMeta) :: [`Module`](/torch_core.html#Module)\n",
"\n",
"<div class=\"collapse\" id=\"TRPOModule-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#TRPOModule-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>No tests found for <code>TRPOModule</code>. To contribute a test please refer to <a href=\"/dev/test.html\">this guide</a> and <a href=\"https://forums.fast.ai/t/improving-expanding-functional-tests/32929\">this discussion</a>.</p></div></div>\n",
"\n",
"Implementation of the TRPO (Trust Region Policy Optimization) algorithm. Policy Gradient based algorithm for reinforcement learning in discrete\n",
" and continuous state and action spaces. Details of the algorithm's mathematical background can be found in [1].\n",
"\n",
"References:\n",
" [1] (Schulman et al., 2017) Trust Region Policy Optimization.\n",
"\n",
"Args:\n",
" ni: na: discount: fc_layers: conv_filters:\n",
" nc:\n",
" bn:\n",
" q_lr:\n",
" v_lr:\n",
" ks:\n",
" stride: "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(TRPOModule)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
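
For reference alongside the `TRPOModule` documentation above, this is the standard constrained objective from the TRPO paper (Schulman et al., "Trust Region Policy Optimization"); it is quoted here as background, not as code from this repository.

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,
       A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}
\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s)
\,\Vert\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
```

In practice the constraint is what separates TRPO from REINFORCE and PPO on the roadmap: each policy update is limited by a KL-divergence budget delta rather than by clipping or a fixed step size.
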