diff --git a/_data/navigation.yml b/_data/navigation.yml
index 6600d2a4..ae308fd2 100644
--- a/_data/navigation.yml
+++ b/_data/navigation.yml
@@ -181,6 +181,17 @@ wiki:
         url: /wiki/machine-learning/mediapipe-live-ml-anywhere.md/
       - title: NLP for robotics
         url: /wiki/machine-learning/nlp_for_robotics.md/
+  - title: Reinforcement Learning
+    url: /wiki/reinforcement-learning/
+    children:
+      - title: Key Concepts in Reinforcement Learning (RL)
+        url: /wiki/reinforcement-learning/key-concepts-in-rl/
+      - title: Reinforcement Learning Algorithms
+        url: /wiki/reinforcement-learning/reinforcement-learning-algorithms/
+      - title: Policy Gradient Methods
+        url: /wiki/reinforcement-learning/intro-to-policy-gradient-methods/
+      - title: Foundations of Value-Based Reinforcement Learning
+        url: /wiki/reinforcement-learning/value-based-reinforcement-learning/
   - title: State Estimation
     url: /wiki/state-estimation/
     children:
diff --git a/assets/images/Humanoid robot.drawio.png b/assets/images/Humanoid robot.drawio.png
new file mode 100644
index 00000000..a698662e
Binary files /dev/null and b/assets/images/Humanoid robot.drawio.png differ
diff --git a/assets/images/multi_contact_planning.png b/assets/images/multi_contact_planning.png
new file mode 100644
index 00000000..1e501fd7
Binary files /dev/null and b/assets/images/multi_contact_planning.png differ
diff --git a/wiki/reinforcement-learning/intro-to-policy-gradient-methods.md b/wiki/reinforcement-learning/intro-to-policy-gradient-methods.md
new file mode 100644
index 00000000..b4857545
--- /dev/null
+++ b/wiki/reinforcement-learning/intro-to-policy-gradient-methods.md
@@ -0,0 +1,152 @@
---
date: 2025-05-04
title: "Proximal Policy Optimization (PPO): Concepts, Theory, and Insights"
---

Proximal Policy Optimization (PPO) is one of the most widely used algorithms in modern reinforcement learning. It combines the benefits of policy gradient methods with a set of improvements that make training more stable, more sample-efficient, and easier to implement. PPO is used extensively in robotics, gaming, and simulated environments such as MuJoCo and OpenAI Gym. This article explains PPO from the ground up: its motivation, theory, algorithmic structure, and practical considerations.

## Motivation

Traditional policy gradient methods suffer from instability because their policy updates are large and unconstrained. Although they optimize the expected return directly, a single oversized update can cause performance to collapse catastrophically.

Trust Region Policy Optimization (TRPO) addressed this by constraining the size of each policy update with a KL-divergence trust region. However, TRPO is relatively complex to implement because it requires solving a constrained optimization problem with second-order methods.

PPO was designed to simplify this: it introduces a clipped surrogate objective that limits how much the policy can change during each update while retaining the benefits of trust-region-like behavior.

## PPO Objective

Let the old policy be $\pi_{\theta_{\text{old}}}$ and the new policy be $\pi_\theta$.
PPO maximizes the following clipped surrogate objective: + +$$ +L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ +\min\left( +r_t(\theta) \hat{A}_t, +\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t +\right) +\right] +$$ + +where: + +- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio, +- $\hat{A}_t$ is the advantage estimate at time step $t$, +- $\epsilon$ is a small hyperparameter (e.g., 0.1 or 0.2). + +### Why Clipping? + +Without clipping, large changes in the policy could lead to very large or small values of $r_t(\theta)$, resulting in destructive updates. The **clip** operation ensures that updates do not push the new policy too far from the old one. + +This introduces a **soft trust region**: when $r_t(\theta)$ is within $[1 - \epsilon, 1 + \epsilon]$, the update proceeds normally. If $r_t(\theta)$ exceeds this range, the objective is "flattened", preventing further change. + +## Full PPO Objective + +In practice, PPO uses a combination of multiple objectives: + +- **Clipped policy loss** (as above) +- **Value function loss**: typically a mean squared error between predicted value and empirical return +- **Entropy bonus**: to encourage exploration + +The full loss function is: + +$$ +L^{\text{PPO}}(\theta) = +\mathbb{E}_t \left[ +L^{\text{CLIP}}(\theta) +- c_1 \cdot (V_\theta(s_t) - \hat{V}_t)^2 ++ c_2 \cdot \mathcal{H}[\pi_\theta](s_t) +\right] +$$ + +where: + +- $c_1$ and $c_2$ are weighting coefficients, +- $\hat{V}_t$ is an empirical return (or bootstrapped target), +- $\mathcal{H}[\pi_\theta]$ is the entropy of the policy at state $s_t$. + +## Advantage Estimation + +PPO relies on high-quality advantage estimates $\hat{A}_t$ to guide policy updates. The most popular technique is **Generalized Advantage Estimation (GAE)**: + +$$ +\hat{A}_t = \sum_{l=0}^{T - t - 1} (\gamma \lambda)^l \delta_{t+l} +$$ + +with: + +$$ +\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) +$$ + +GAE balances the bias-variance trade-off via the $\lambda$ parameter (typically 0.95). + +## PPO Training Loop Overview + +At a high level, PPO training proceeds in the following way: + +1. **Collect rollouts** using the current policy for a fixed number of steps. +2. **Compute advantages** using GAE. +3. **Compute returns** for value function targets. +4. **Optimize the PPO objective** with multiple minibatch updates (typically using Adam). +5. **Update the old policy** to match the new one. + +Unlike TRPO, PPO allows **multiple passes through the same data**, improving sample efficiency. + +## Practical Tips + +- **Clip epsilon**: Usually 0.1 or 0.2. Too large allows harmful updates; too small restricts learning. +- **Number of epochs**: PPO uses multiple SGD epochs (3–10) per batch. +- **Batch size**: Typical values range from 2048 to 8192. +- **Value/policy loss scales**: The constants $c_1$ and $c_2$ are often 0.5 and 0.01 respectively. +- **Normalize advantages**: Empirically improves stability. + +> **Entropy Bonus**: Without sufficient entropy, the policy may prematurely converge to a suboptimal deterministic strategy. + +## Why PPO Works Well + +- **Stable updates**: Clipping constrains updates to a trust region without expensive computations. +- **On-policy training**: Leads to high-quality updates at the cost of sample reuse. +- **Good performance across domains**: PPO performs well in continuous control, discrete games, and real-world robotics. +- **Simplicity**: Easy to implement and debug compared to TRPO, ACER, or DDPG. 
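The clipped objective and the combined loss above translate almost directly into code. Below is a minimal, self-contained PyTorch sketch of the loss computation only, not a full training loop; the tensor names (`logp_new`, `logp_old`, `adv`, `values`, `returns`, `entropy`) are assumed to come from a rollout buffer and the current policy and value networks.

```python
import torch

def ppo_loss(logp_new, logp_old, adv, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate loss with value and entropy terms (illustrative sketch)."""
    # Probability ratio r_t(theta), computed in log space for numerical stability.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate objective: take the pessimistic minimum of the two terms.
    # (Advantages are commonly normalized per minibatch before this point.)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()  # maximize objective = minimize its negative

    # Value-function regression toward empirical or bootstrapped returns.
    value_loss = (values - returns).pow(2).mean()

    # Entropy bonus, subtracted because the optimizer minimizes this total.
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```

Minimizing this quantity with Adam over several minibatch epochs per batch of rollouts corresponds to step 4 of the training loop above.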
+ +## PPO vs TRPO + +| Feature | PPO | TRPO | +|---------------------------|--------------------------------------|--------------------------------------| +| Optimizer | First-order (SGD/Adam) | Second-order (constrained step) | +| Trust region enforcement | Clipping | Explicit KL constraint | +| Sample efficiency | Moderate | Low | +| Stability | High | Very high | +| Implementation | Simple | Complex | + +## Limitations + +- **On-policy nature** means PPO discards data after each update. +- **Entropy decay** can hurt long-term exploration unless tuned carefully. +- **Not optimal for sparse-reward environments** without additional exploration strategies (e.g., curiosity, count-based bonuses). + +## PPO in Robotics + +PPO has become a standard in sim-to-real training workflows: + +- Robust to partial observability +- Easy to stabilize on real robots +- Compatible with parallel simulation (e.g., Isaac Gym, MuJoCo) + +## Summary + +PPO offers a clean and reliable solution for training RL agents using policy gradient methods. Its clipping objective balances the need for learning speed with policy stability. PPO is widely regarded as a default choice for continuous control tasks, and has been proven to work well across a broad range of applications. + + +## Further Reading +- [Proximal Policy Optimization Algorithms – Schulman et al. (2017)](https://arxiv.org/abs/1707.06347) +- [Spinning Up PPO Overview – OpenAI](https://spinningup.openai.com/en/latest/algorithms/ppo.html) +- [CleanRL PPO Implementation](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py) +- [RL Course Lecture on PPO – UC Berkeley CS285](https://rail.eecs.berkeley.edu/deeprlcourse/) +- [OpenAI Gym PPO Examples](https://github.com/openai/baselines/tree/master/baselines/ppo2) +- [Generalized Advantage Estimation (GAE) – Schulman et al.](https://arxiv.org/abs/1506.02438) +- [PPO Implementation from Scratch – Andriy Mulyar](https://github.com/awjuliani/DeepRL-Agents) +- [Deep Reinforcement Learning Hands-On (PPO chapter)](https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On) +- [Stable Baselines3 PPO Documentation](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) +- [OpenReview: PPO vs TRPO Discussion](https://openreview.net/forum?id=r1etN1rtPB) +- [Reinforcement Learning: State-of-the-Art Survey (2019)](https://arxiv.org/abs/1701.07274) +- [RL Algorithms by Difficulty – RL Book Companion](https://huggingface.co/learn/deep-rl-course/unit2/ppo) diff --git a/wiki/reinforcement-learning/key-concepts-in-rl.md b/wiki/reinforcement-learning/key-concepts-in-rl.md new file mode 100644 index 00000000..6335929e --- /dev/null +++ b/wiki/reinforcement-learning/key-concepts-in-rl.md @@ -0,0 +1,85 @@ +--- +date: 2025-03-11 # YYYY-MM-DD +title: Key Concepts of Reinforcement Learning +--- + +This tutorial provides an introduction to the fundamental concepts of Reinforcement Learning (RL). RL involves an agent interacting with an environment to learn optimal behaviors through trial and feedback. The main objective of RL is to maximize cumulative rewards over time. + +## Main Components of Reinforcement Learning + +### Agent and Environment +The agent is the learner or decision-maker, while the environment represents everything the agent interacts with. The agent receives observations from the environment and takes actions that influence the environment's state. + +### States and Observations +- A **state** (s) fully describes the world at a given moment. 
+- An **observation** (o) is a partial view of the state. +- Environments can be **fully observed** (complete information) or **partially observed** (limited information). + +### Action Spaces +- The **action space** defines all possible actions an agent can take. +- **Discrete action spaces** (e.g., Atari, Go) have a finite number of actions. +- **Continuous action spaces** (e.g., robotics control) allow real-valued actions. + +## Policies +A **policy** determines how an agent selects actions based on states: + +- **Deterministic policy**: Always selects the same action for a given state. + + $a_t = \mu(s_t)$ + +- **Stochastic policy**: Samples actions from a probability distribution. + + $a_t \sim \pi(\cdot | s_t)$ + + +### Example: Deterministic Policy in PyTorch +```python +import torch.nn as nn + +pi_net = nn.Sequential( + nn.Linear(obs_dim, 64), + nn.Tanh(), + nn.Linear(64, 64), + nn.Tanh(), + nn.Linear(64, act_dim) +) +``` + +## Trajectories +A **trajectory (\tau)** is a sequence of states and actions: +```math +\tau = (s_0, a_0, s_1, a_1, ...) +``` +State transitions follow deterministic or stochastic rules: +```math +s_{t+1} = f(s_t, a_t) +``` +or +```math +s_{t+1} \sim P(\cdot|s_t, a_t) +``` + +## Reward and Return +The **reward function (R)** determines the agent's objective: +```math +r_t = R(s_t, a_t, s_{t+1}) +``` +### Types of Return +1. **Finite-horizon undiscounted return**: + ```math + R(\tau) = \sum_{t=0}^T r_t + ``` +2. **Infinite-horizon discounted return**: + ```math + R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t + ``` + where \( \gamma \) (discount factor) balances immediate vs. future rewards. + +## Summary +This tutorial introduced fundamental RL concepts, including agents, environments, policies, action spaces, trajectories, and rewards. These components are essential for designing RL algorithms. + +## Further Reading +- Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction*. + +## References +- [Reinforcement Learning Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning) diff --git a/wiki/reinforcement-learning/reinforcement-learning-algorithms.md b/wiki/reinforcement-learning/reinforcement-learning-algorithms.md new file mode 100644 index 00000000..4187bbbb --- /dev/null +++ b/wiki/reinforcement-learning/reinforcement-learning-algorithms.md @@ -0,0 +1,177 @@ +--- +date: 2025-05-04 +title: A Taxonomy of Reinforcement Learning Algorithms +--- + +Reinforcement Learning (RL) is a foundational paradigm in artificial intelligence where agents learn to make decisions through trial and error, guided by rewards. Over the years, a rich variety of RL algorithms have been developed, each differing in the way they represent knowledge, interact with the environment, and generalize from data. This article presents a high-level taxonomy of RL algorithms with an emphasis on design trade-offs, learning objectives, and algorithmic categories. The goal is to provide a structured guide to the RL landscape for students and practitioners. + +## Model-Based vs Model-Free Reinforcement Learning + +One of the most fundamental distinctions among RL algorithms is whether or not the algorithm uses a model of the environment's dynamics. + +### Model-Free RL + +Model-free algorithms do not attempt to learn or use an internal model of the environment. Instead, they learn policies or value functions directly from experience. These methods are typically simpler to implement and tune, making them more widely adopted in practice. 
+ +**Key Advantages:** +- Easier to apply when the environment is complex or high-dimensional. +- No need for a simulator or model-learning pipeline. + +**Drawbacks:** +- High sample complexity: requires many interactions with the real or simulated environment. +- Cannot perform planning or imagination-based reasoning. + +**Examples:** +- **DQN (Deep Q-Networks)**: First to combine Q-learning with deep networks for Atari games. +- **PPO (Proximal Policy Optimization)**: A robust policy gradient method widely used in robotics and games. + +### Model-Based RL + +In contrast, model-based algorithms explicitly learn or use a model of the environment that predicts future states and rewards. The agent can then plan ahead by simulating trajectories using this model. + +**Key Advantages:** +- Better sample efficiency through planning and simulation. +- Can separate learning from data collection, enabling "dream-based" training. + +**Challenges:** +- Learning accurate models is difficult. +- Errors in the model can lead to compounding errors during planning. + +**Use Cases:** +- High-stakes environments where sample efficiency is critical. +- Scenarios requiring imagination or foresight (e.g., robotics, strategic games). + +**Examples:** +- **MBVE (Model-Based Value Expansion)**: Uses a learned model to expand the value estimate of real transitions. +- **AlphaZero**: Combines MCTS with learned value/policy networks to dominate board games. + +## What to Learn: Policy, Value, Q, or Model? + +RL algorithms also differ based on what the agent is trying to learn: + +- **Policy** $\pi_\theta(a|s)$: A mapping from state to action, either deterministic or stochastic. +- **Value function** $V^\pi(s)$: The expected return starting from state $s$ under policy $\pi$. +- **Action-Value (Q) function** $Q^\pi(s, a)$: The expected return starting from state $s$ taking action $a$, then following $\pi$. +- **Model**: A transition function $f(s, a) \rightarrow s'$ and reward predictor $r(s, a)$. + +### Model-Free Learning Strategies + +#### 1. Policy Optimization + +These algorithms directly optimize the parameters of a policy using gradient ascent on a performance objective: + +$$ +J(\pi_\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t r_t \right] +$$ + +They often require estimating the advantage function or value function to reduce variance. + +**Characteristics:** +- **On-policy**: Data must come from the current policy. +- **Stable and robust**: Optimizes directly for performance. + +**Popular Methods:** +- **A2C / A3C (Asynchronous Advantage Actor-Critic)**: Learns both policy and value function in parallel. +- **PPO (Proximal Policy Optimization)**: Ensures stable updates with clipped surrogate objectives. +- **TRPO (Trust Region Policy Optimization)**: Uses trust regions to prevent catastrophic policy changes. + +#### 2. Q-Learning + +Instead of learning a policy directly, Q-learning methods aim to learn the optimal action-value function: + +$$ +Q^*(s, a) = \mathbb{E} \left[ r + \gamma \max_{a'} Q^*(s', a') \right] +$$ + +Once $Q^*(s, a)$ is known, the policy is derived via: + +$$ +\pi(s) = \arg\max_a Q^*(s, a) +$$ + +**Characteristics:** +- **Off-policy**: Can use data from any past policy. +- **Data-efficient**, but prone to instability. + +**Variants:** +- **DQN**: Introduced experience replay and target networks. +- **C51 / QR-DQN**: Learn a distribution over returns, not just the mean. 
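To ground these equations, here is a minimal tabular sketch of the Q-learning update and the greedy policy derived from it; the state and action encodings and the surrounding environment loop are assumptions, and deep variants such as DQN replace the table with a neural network.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q-table: unseen (state, action) pairs default to 0.0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One off-policy TD update toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def act(s, actions, epsilon=0.1):
    """Epsilon-greedy version of pi(s) = argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Because the target takes a `max` over actions rather than using the action the behavior policy actually chose next, the transitions can come from any past policy, which is what makes Q-learning off-policy.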
+ +> **Trade-Off**: Policy gradient methods are more stable and principled; Q-learning methods are more sample-efficient but harder to stabilize due to the "deadly triad": function approximation, bootstrapping, and off-policy updates. + +#### Hybrid Algorithms + +Some methods blend policy optimization and Q-learning: + +- **DDPG (Deep Deterministic Policy Gradient)**: Actor-Critic method with off-policy Q-learning and deterministic policies. +- **TD3 (Twin Delayed DDPG)**: Addresses overestimation bias in DDPG. +- **SAC (Soft Actor-Critic)**: Adds entropy regularization to encourage exploration and stabilize learning. + +### Model-Based Learning Strategies + +Model-based RL allows a variety of architectures and learning techniques. + +#### 1. Pure Planning (e.g., MPC) + +The agent uses a learned or known model to plan a trajectory and execute the first action, then replan. No policy is explicitly learned. + +#### 2. Expert Iteration (ExIt) + +Combines planning and learning. Planning (e.g., via MCTS) provides strong actions ("experts"), which are used to train a policy via supervised learning. + +- **AlphaZero**: Exemplifies this method by using MCTS and neural nets in self-play. + +#### 3. Data Augmentation + +The learned model is used to synthesize additional training data. + +- **MBVE**: Augments true experiences with simulated rollouts. +- **World Models**: Trains entirely on imagined data ("dreaming"). + +#### 4. Imagination-Augmented Agents (I2A) + +Here, planning is embedded as a subroutine inside the policy network. The policy learns when and how to use imagination. + +> This technique can mitigate model bias because the policy can learn to ignore poor planning results. + +## Summary + +The landscape of RL algorithms is broad and evolving, but organizing them into categories based on model usage and learning targets helps build intuition: + +| Dimension | Model-Free RL | Model-Based RL | +|------------------------|--------------------------|------------------------------------------| +| Sample Efficiency | Low | High | +| Stability | High (Policy Gradient) | Variable (depends on model quality) | +| Planning Capability | None | Yes (MPC, MCTS, ExIt) | +| Real-World Deployment | Slower | Faster (if model is accurate) | +| Representative Methods | DQN, PPO, A2C | AlphaZero, MBVE, World Models, I2A | + +Understanding these trade-offs is key to selecting or designing an RL algorithm for your application. 
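As a concrete illustration of the pure-planning (MPC) strategy described above, the sketch below implements random-shooting MPC: sample candidate action sequences, score them with a model, execute only the first action, then replan. The `model` and `reward_fn` callables are placeholders for a learned dynamics model and reward predictor, not a specific library API.

```python
import numpy as np

def random_shooting_mpc(state, model, reward_fn, action_dim,
                        horizon=15, num_candidates=500, rng=None):
    """Return the first action of the best random action sequence under the model."""
    rng = rng if rng is not None else np.random.default_rng()
    # Candidate action sequences sampled uniformly in [-1, 1]^action_dim.
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))

    best_return, best_first_action = -np.inf, None
    for seq in candidates:
        s, total = state, 0.0
        for a in seq:
            total += reward_fn(s, a)  # predicted reward for this step
            s = model(s, a)           # roll the learned dynamics forward
        if total > best_return:
            best_return, best_first_action = total, seq[0]

    # Receding horizon: execute only the first action, then replan at the next step.
    return best_first_action
```

Model errors compound over the planning horizon, which is why short horizons and frequent replanning are common in practice.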
## Further Reading
- [Spinning Up in Deep RL – OpenAI](https://spinningup.openai.com/en/latest/)
- [RL Course by David Silver](https://www.davidsilver.uk/teaching/)
- [RL Book – Sutton and Barto (2nd ed.)](http://incompleteideas.net/book/the-book-2nd.html)
- [CS285: Deep Reinforcement Learning – UC Berkeley (Sergey Levine)](https://rail.eecs.berkeley.edu/deeprlcourse/)
- [Deep RL Bootcamp (2017) – Stanford](https://sites.google.com/view/deep-rl-bootcamp/lectures)
- [Lil’Log – Reinforcement Learning Series by Lilian Weng](https://lilianweng.github.io/lil-log/)
- [RL Algorithms – Denny Britz’s GitHub](https://github.com/dennybritz/reinforcement-learning)
- [Reinforcement Learning Zoo – A curated collection of RL papers and code](https://github.com/instillai/reinforcement-learning-zoo)
- [Distill: Visualizing Reinforcement Learning](https://distill.pub/2019/visual-exploration/)
- [Deep Reinforcement Learning Nanodegree – Udacity](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893)
- [Reinforcement Learning: State-of-the-Art (2019) – Arulkumaran et al.](https://arxiv.org/abs/1701.07274)
- [The RL Baselines3 Zoo – PyTorch Implementations of Popular RL Algorithms](https://github.com/DLR-RM/rl-baselines3-zoo)

## References
- [2] V. Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning,” ICML, 2016.
- [3] J. Schulman et al., “Proximal Policy Optimization Algorithms,” arXiv:1707.06347, 2017.
- [5] T. Lillicrap et al., “Continuous Control with Deep Reinforcement Learning,” ICLR, 2016.
- [7] T. Haarnoja et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL,” ICML, 2018.
- [8] V. Mnih et al., “Playing Atari with Deep Reinforcement Learning,” NIPS Deep Learning Workshop, 2013.
- [9] M. Bellemare et al., “A Distributional Perspective on Reinforcement Learning,” ICML, 2017.
- [12] D. Ha and J. Schmidhuber, “World Models,” arXiv:1803.10122, 2018.
- [13] T. Weber et al., “Imagination-Augmented Agents,” NIPS, 2017.
- [14] A. Nagabandi et al., “Neural Network Dynamics for Model-Based Deep RL,” CoRL, 2017.
- [16] D. Silver et al., “Mastering the Game of Go without Human Knowledge,” Nature, 2017.
diff --git a/wiki/reinforcement-learning/value-based-reinforcement-learning.md b/wiki/reinforcement-learning/value-based-reinforcement-learning.md
new file mode 100644
index 00000000..05d6d1fb
--- /dev/null
+++ b/wiki/reinforcement-learning/value-based-reinforcement-learning.md
@@ -0,0 +1,127 @@
---
date: 2025-05-04
title: "Deep Q-Networks (DQN): A Foundation of Value-Based Reinforcement Learning"
---

Deep Q-Networks (DQN) introduced the integration of Q-learning with deep neural networks, enabling reinforcement learning to scale to high-dimensional environments. Originally developed by DeepMind to play Atari games from raw pixels, DQN laid the groundwork for many modern value-based algorithms. This article explores the motivation, mathematical structure, algorithmic details, and practical insights for implementing and improving DQN.

## Motivation

Before DQN, classic Q-learning worked well in small, discrete environments. However, it could not generalize to high-dimensional or continuous state spaces.

DQN addressed this by using a deep neural network as a function approximator for the Q-function, $Q(s, a; \theta)$. This allowed it to learn directly from visual input and approximate optimal action-values across thousands of states.
+ +The core idea: learn a parameterized Q-function that satisfies the Bellman optimality equation. + +## Q-Learning Recap + +Q-learning is a model-free, off-policy algorithm. It aims to learn the **optimal action-value function**: + +$$ +Q^*(s, a) = \mathbb{E} \left[ r + \gamma \max_{a'} Q^*(s', a') \middle| s, a \right] +$$ + +The Q-learning update rule is: + +$$ +Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) +$$ + +DQN replaces the tabular $Q(s, a)$ with a neural network $Q(s, a; \theta)$, trained to minimize: + +$$ +L(\theta) = \mathbb{E}_{(s, a, r, s')} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] +$$ + +where $\theta^-$ is the parameter set of a **target network** that is held fixed for several steps. + +## Core Components of DQN + +### 1. Experience Replay + +Instead of learning from consecutive experiences (which are highly correlated), DQN stores them in a **replay buffer** and samples random minibatches. This reduces variance and stabilizes updates. + +### 2. Target Network + +DQN uses a separate target network $Q(s, a; \theta^-)$ whose parameters are updated less frequently (e.g., every 10,000 steps). This decouples the moving target in the loss function and improves convergence. + +### 3. $\epsilon$-Greedy Exploration + +To balance exploration and exploitation, DQN uses an $\epsilon$-greedy policy: + +- With probability $\epsilon$, choose a random action. +- With probability $1 - \epsilon$, choose $\arg\max_a Q(s, a; \theta)$. + +$\epsilon$ is typically decayed over time. + +## DQN Algorithm Overview + +1. Initialize Q-network with random weights $\theta$. +2. Initialize target network $\theta^- \leftarrow \theta$. +3. Initialize replay buffer $\mathcal{D}$. +4. For each step: + - Observe state $s_t$. + - Select action $a_t$ via $\epsilon$-greedy. + - Take action, observe reward $r_t$ and next state $s_{t+1}$. + - Store $(s_t, a_t, r_t, s_{t+1})$ in buffer. + - Sample random minibatch from $\mathcal{D}$. + - Compute targets: $y_t = r + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$. + - Perform gradient descent on $(y_t - Q(s_t, a_t; \theta))^2$. + - Every $C$ steps, update $\theta^- \leftarrow \theta$. + +## Key Strengths + +- **Off-policy**: Allows experience reuse, increasing sample efficiency. +- **Stable with CNNs**: Effective in high-dimensional visual environments. +- **Simple to implement**: Core components are modular. + +## DQN Enhancements + +Several follow-up works improved on DQN: + +- **Double DQN**: Reduces overestimation bias in Q-learning. + + $$ + y_t = r + \gamma Q(s', \arg\max_a Q(s', a; \theta); \theta^-) + $$ + +- **Dueling DQN**: Splits Q-function into state-value and advantage function: + + $$ + Q(s, a) = V(s) + A(s, a) + $$ + +- **Prioritized Experience Replay**: Samples transitions with high temporal-difference (TD) error more frequently. +- **Rainbow DQN**: Combines all the above + distributional Q-learning into a single framework. + +## Limitations + +- **Not suited for continuous actions**: Requires discretization or replacement with actor-critic methods. +- **Sample inefficiency**: Still requires many environment steps to learn effectively. +- **Hard to tune**: Sensitive to learning rate, replay buffer size, etc. + +## DQN in Robotics + +DQN is less commonly used in robotics due to continuous control challenges, but: + +- Can be used in discretized navigation tasks. +- Serves as a baseline in hybrid planning-learning pipelines. 
- Inspires off-policy learning architectures in real-time control.

## Summary

DQN is a foundational deep RL algorithm that brought deep learning to Q-learning. By integrating function approximation, experience replay, and target networks, it opened the door to using RL in complex visual and sequential tasks. Understanding DQN provides a solid base for learning more advanced value-based and off-policy algorithms. A minimal PyTorch sketch of the target and loss computation follows the reading list below.


## Further Reading
- [Playing Atari with Deep Reinforcement Learning – Mnih et al. (2013)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
- [Human-level Control through Deep Reinforcement Learning – Nature 2015](https://www.nature.com/articles/nature14236)
- [Double Q-Learning – van Hasselt et al.](https://arxiv.org/abs/1509.06461)
- [Dueling Network Architectures – Wang et al.](https://arxiv.org/abs/1511.06581)
- [Rainbow: Combining Improvements in Deep RL – Hessel et al.](https://arxiv.org/abs/1710.02298)
- [Prioritized Experience Replay – Schaul et al.](https://arxiv.org/abs/1511.05952)
- [RL Course Lecture: Value-Based Methods – Berkeley CS285](https://rail.eecs.berkeley.edu/deeprlcourse/)
- [Deep RL Bootcamp – Value Iteration & DQN](https://sites.google.com/view/deep-rl-bootcamp/lectures)
- [CleanRL DQN Implementation](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py)
- [Spinning Up: Why Use Value-Based Methods](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)
- [Reinforcement Learning: An Introduction – Sutton & Barto (2nd ed.)](http://incompleteideas.net/book/the-book-2nd.html)
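## Appendix: A Minimal DQN Loss Sketch

The function below is a minimal PyTorch sketch of the target and loss computation from the algorithm overview, shown for illustration only; `q_net`, `target_net`, and the contents of `batch` are assumptions rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss on a sampled minibatch: (r + gamma * max_a' Q(s',a'; theta^-) - Q(s,a; theta))^2."""
    states, actions, rewards, next_states, dones = batch  # tensors sampled from the replay buffer

    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the frozen target network; terminal states contribute only r.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_sa, targets)
```

Every `C` gradient steps the target network is synchronized with `target_net.load_state_dict(q_net.state_dict())`, matching the final step of the algorithm overview.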