Commit adf6ed6

Merge pull request #168 from codeharborhub/dev-1
ml docs content added
2 parents 0b7573c + 837c5a2 commit adf6ed6

File tree

10 files changed: +1223, -0 lines changed

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@

---
title: Actor-Critic Methods
sidebar_label: Actor-Critic
description: "Combining value-based and policy-based methods for stable and efficient reinforcement learning."
tags: [machine-learning, reinforcement-learning, actor-critic, a2c, a3c]
---

**Actor-Critic** methods are a hybrid architecture in Reinforcement Learning that combines the best of both worlds: **Policy Gradients** and **Value-Based** learning.

In this setup, we use two neural networks:
1. **The Actor:** Learns the strategy (Policy). It decides which action to take.
2. **The Critic:** Learns to evaluate the action. It tells the Actor how "good" the action was by estimating the Value function.
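
For concreteness, here is a minimal sketch of what these two networks might look like in PyTorch. The class names, the hidden-layer width of 128, and the Softmax output for the Actor are illustrative assumptions, not details prescribed by this page:

```python
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a probability distribution over actions (the Policy)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Maps a state to a single scalar estimate of its value V(s)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)
```

The Actor outputs a probability distribution over actions, while the Critic outputs one scalar value estimate; both are used together in the learning loop described below.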

## 1. Why use Actor-Critic?

* **Policy Gradients (Actor only):** Have high variance and can be slow to converge because they rely on full episode returns.
* **Q-Learning (Critic only):** Can be biased and struggles with continuous action spaces.
* **Actor-Critic:** Uses the Critic to reduce the variance of the Actor, leading to faster and more stable learning.

## 2. How it Works: The Advantage

The Critic doesn't just predict the reward; it predicts the **Advantage** ($A$). The Advantage tells us if an action was better than the average action expected from that state.

$$
A(s, a) = Q(s, a) - V(s)
$$

Where:

* **$Q(s, a)$:** The value of taking a specific action.
* **$V(s)$:** The average value of the state (the baseline).

If $A > 0$, the Actor is encouraged to take that action more often. If $A < 0$, the Actor is discouraged.
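
As a quick worked example (with made-up numbers): suppose the Critic estimates $Q(s, a) = 5$ and $V(s) = 3$. Then

$$
A(s, a) = 5 - 3 = 2 > 0,
$$

so the Actor is nudged to pick $a$ more often in that state; had $Q(s, a)$ been $2$, the Advantage would be $-1$ and the action would be discouraged.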

## 3. The Learning Loop

```mermaid
graph TD
    S[State] --> Actor(Actor: Policy)
    S --> Critic(Critic: Value)
    Actor --> A[Action]
    A --> E[Environment]
    E --> R[Reward]
    E --> NS[Next State]
    R --> TD[TD Error / Advantage]
    NS --> TD
    TD -->|Feedback| Actor
    TD -->|Feedback| Critic

    style Actor fill:#e1f5fe,stroke:#01579b,color:#333
    style Critic fill:#fff3e0,stroke:#ef6c00,color:#333
    style TD fill:#fce4ec,stroke:#d81b60,color:#333
```

## 4. Popular Variations

### A2C (Advantage Actor-Critic)

A synchronous version where multiple agents run in parallel environments. The "Master" agent waits for all workers to finish their steps before updating the global network.

### A3C (Asynchronous Advantage Actor-Critic)

Introduced by DeepMind, this version is asynchronous. Each worker updates the global network independently without waiting for others, making it extremely fast.

### PPO (Proximal Policy Optimization)

A modern, state-of-the-art Actor-Critic algorithm used by OpenAI. It ensures that updates to the policy aren't "too large," preventing the model from collapsing during training.
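
To make the "not too large" constraint concrete, PPO clips the ratio between the new and old action probabilities. Below is a minimal sketch of that clipped surrogate loss; the function name, arguments, and the `clip_eps=0.2` default are illustrative assumptions:

```python
import torch

def ppo_actor_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Ratio of the new policy's probability to the old policy's probability
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clip the ratio so a single update cannot push the policy too far
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (minimum) objective, negated because optimizers minimize
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```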

## 5. Implementation Logic (Pseudo-code)

```python
# 1. Get an action from the Actor (a probability distribution over actions)
probs = actor(state)
dist = torch.distributions.Categorical(probs)
action = dist.sample()

# 2. Interact with the Environment
next_state, reward = env.step(action.item())

# 3. Get value estimates from the Critic
value = critic(state)
with torch.no_grad():
    next_value = critic(next_state)  # bootstrapped target, no gradient

# 4. Calculate the Advantage (TD Error)
# Advantage = (r + gamma * V(s')) - V(s)
advantage = reward + gamma * next_value - value

# 5. Backpropagate
# The Actor is trained on the detached Advantage; the Critic minimizes the squared TD error.
actor_loss = -dist.log_prob(action) * advantage.detach()
critic_loss = advantage.pow(2)

(actor_loss + critic_loss).backward()
```

## 6. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Lower Variance:** Much more stable than pure Policy Gradients. | **Complexity:** Harder to tune because you are training two networks at once. |
| **Online Learning:** Can update after every step (doesn't need to wait for the end of an episode). | **Sample Inefficient:** Can still require millions of interactions for complex games. |
| **Continuous Actions:** Handles continuous movement smoothly. | **Sensitive to Hyperparameters:** Learning rates for the Actor and Critic must be balanced. |

## References

* **DeepMind's A3C Paper:** "Asynchronous Methods for Deep Reinforcement Learning."
* **OpenAI Spinning Up:** Documentation on PPO and Actor-Critic variants.
* **Reinforcement Learning with David Silver:** Lecture 7 (Policy Gradient and Actor-Critic).
* **Sutton & Barto's "Reinforcement Learning: An Introduction":** Chapter on Actor-Critic Methods.

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@

---
title: "Deep Q-Networks (DQN)"
sidebar_label: Deep Q-Networks
description: "Scaling Reinforcement Learning with Deep Learning using Experience Replay and Target Networks."
tags: [machine-learning, reinforcement-learning, dqn, deep-learning, neural-networks]
---

**Deep Q-Networks (DQN)** represent the fusion of Reinforcement Learning and Deep Neural Networks. While standard [Q-Learning](/tutorial/machine-learning/machine-learning-core/reinforcement-learning/q-learning) uses a table to store values, DQN uses a **Neural Network** to approximate the Q-value function.

This advancement allowed RL agents to handle environments with high-dimensional state spaces, such as raw pixels from a video game screen.

## 1. Why Deep Learning for Q-Learning?

In a complex environment, the number of possible states is astronomical.

* **Atari 2600:** A $210 \times 160$ pixel screen with 128 colors has more possible states than there are atoms in the universe.
* **The Solution:** Instead of a table, we use a Neural Network ($Q_\theta$) that takes a **State** as input and outputs the predicted **Q-values** for all possible actions.

## 2. The Two "Secret Ingredients" of DQN

Standard neural networks struggle with RL because the data is highly correlated (sequential frames in a game are nearly identical). To fix this, DQN introduced two revolutionary concepts:

### A. Experience Replay

Instead of learning from the current experience immediately, the agent saves its experiences $(s, a, r, s')$ in a **Replay Buffer**. During training, we sample a **random batch** of these experiences.

* **Benefit:** It breaks the correlation between consecutive samples and allows the model to "re-learn" from past successes and failures multiple times.
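
A replay buffer is simple to sketch in plain Python; the class and method names below are illustrative assumptions, not part of the original description:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions up to a fixed capacity."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive frames
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # Conversion to torch tensors is left to the training loop
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```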

### B. Target Networks

In standard Q-Learning, the "target" we are chasing changes every time we update the weights. This is like a dog chasing its own tail.

* **The Fix:** We maintain two networks:
  1. **Policy Network:** The one we are constantly training.
  2. **Target Network:** A frozen copy of the Policy Network used to calculate the "target" value. We only update this copy every few thousand steps.
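
The "frozen copy" boils down to a periodic hard copy of weights. A minimal PyTorch sketch (the 10,000-step interval and function name are illustrative assumptions):

```python
TARGET_UPDATE_EVERY = 10_000  # illustrative interval, tuned per environment in practice

def maybe_sync_target(step, policy_net, target_net):
    # Hard update: copy the Policy Network's weights into the frozen Target Network
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
```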

## 3. The DQN Mathematical Objective

The loss function for DQN is the squared difference between the **Target Q-value** and the **Predicted Q-value**:

$$
L(\theta) = \mathbb{E} \left[ \left( \underbrace{r + \gamma \max_{a'} Q_{\theta^{-}}(s', a')}_{\text{Target (Target Network)}} - \underbrace{Q_{\theta}(s, a)}_{\text{Prediction (Policy Network)}} \right)^2 \right]
$$

Where:

* **$\theta$**: Weights of the Policy Network.
* **$\theta^{-}$**: Weights of the Target Network (frozen).
* **$r$**: Reward received after taking action $a$ in state $s$.
* **$\gamma$**: Discount factor for future rewards.

## 4. The DQN Workflow

```mermaid
graph LR
    ENV["$$\text{Environment}$$"]

    ENV --> S["$$s_t$$<br/>$$\text{Current State}$$"]

    S --> NET["$$Q(s,a;\theta)$$<br/>$$\text{Online Q-Network}$$"]

    NET --> ACT["$$\varepsilon\text{-greedy Policy}$$<br/>$$a_t=\begin{cases} \text{random action} & \varepsilon \\ \arg\max_a Q(s_t,a;\theta) & 1-\varepsilon \end{cases}$$"]

    ACT --> ENV

    ENV --> R["$$r_t,\ s_{t+1}$$"]

    R --> MEM["$$\text{Replay Buffer } \mathcal{D}$$"]

    MEM --> SAMPLE["$$\text{Sample Mini-batch}$$"]

    SAMPLE --> TARGET["$$y_t = r_t + \gamma \max_a Q(s_{t+1},a;\theta^-)$$"]

    TARGET --> LOSS["$$\mathcal{L}(\theta) = \mathbb{E}\left[(y_t - Q(s_t,a_t;\theta))^2\right]$$"]

    LOSS --> GRAD["$$\nabla_\theta \mathcal{L}$$"]

    GRAD --> UPDATE["$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$$"]

    UPDATE --> NET

    NET -.->|"$$\text{Periodically Copy}$$"| TNET["$$\theta^-$$<br/>$$\text{Target Network}$$"]
```
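
The $\varepsilon$-greedy branch shown in the diagram above can be sketched as a small helper; the function name and arguments are illustrative assumptions:

```python
import random
import torch

def select_action(state, policy_net, epsilon, action_dim):
    # Explore: with probability epsilon, pick a uniformly random action
    if random.random() < epsilon:
        return random.randrange(action_dim)
    # Exploit: otherwise take the greedy action from the online Q-network
    with torch.no_grad():
        return policy_net(state).argmax(dim=-1).item()
```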

## 5. Implementation Logic (PyTorch-style)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The DQN Model
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

# Training Step
def train_step():
    # 1. Sample a random batch from the replay buffer
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # 2. Get current Q-values from the Policy Network
    # (actions has shape [batch_size, 1] so it can index the action dimension)
    current_q = policy_net(states).gather(1, actions)

    # 3. Get maximum Q-values for next states from the Target Network
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        target_q = rewards + (gamma * next_q * (1 - dones))

    # 4. Minimize the Loss
    loss = F.mse_loss(current_q, target_q.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

## 6. Beyond DQN

While DQN was a massive breakthrough, it has been improved by:

* **Double DQN:** Reduces the tendency to overestimate Q-values.
* **Dueling DQN:** Separates the calculation of state value and action advantage.
* **Prioritized Experience Replay:** Samples "important" experiences (those with high error) more frequently.

```mermaid
graph LR
    ENV["$$\text{Atari Environment}$$"]

    ENV --> S["$$s_t$$<br/>$$\text{Game State}$$"]

    %% Standard DQN
    S --> DQN["Standard DQN"]

    DQN --> Q1["$$Q(s,a;\theta)$$"]
    Q1 --> T1["$$y = r + \gamma \max_a Q(s',a;\theta^-)$$"]
    T1 --> O1["$$\text{Overestimation Bias}$$"]
    O1 --> P1["$$\text{Unstable Learning}$$"]

    %% Double DQN
    S --> DDQN["Double DQN"]

    DDQN --> Q2["$$Q(s,a;\theta)$$"]
    Q2 --> T2["$$y = r + \gamma Q(s', \arg\max_a Q(s',a;\theta);\theta^-)$$"]
    T2 --> O2["$$\text{Reduced Overestimation}$$"]
    O2 --> P2["$$\text{More Stable Q-Values}$$"]

    %% Dueling DQN
    S --> DUEL["Dueling DQN"]

    DUEL --> V["$$V(s;\theta_v)$$<br/>$$\text{State Value}$$"]
    DUEL --> A["$$A(s,a;\theta_a)$$<br/>$$\text{Action Advantage}$$"]

    V --> Q3["$$Q(s,a)=V(s)+A(s,a)-\frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$"]
    A --> Q3

    Q3 --> P3["$$\text{Better State Representation}$$"]
    P3 --> G3["$$\text{Faster Learning on Atari}$$"]

    %% Experience Replay Enhancement
    ENV --> MEM["$$\text{Replay Buffer}$$"]

    MEM --> PER["$$\text{Prioritized Experience Replay}$$"]
    PER --> ERR["$$p_i \propto |\delta_i|$$<br/>$$\text{TD Error-Based Sampling}$$"]
    ERR --> UPD["$$\text{Faster Convergence}$$"]

    %% Comparison Links
    P1 -.->|"$$\text{Beyond DQN}$$"| O2
    O2 -.->|"$$\text{Combined}$$"| G3
    UPD -.->|"$$\text{Boosts All}$$"| G3
```
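
As a concrete illustration of the first improvement, the Double DQN target selects the best next action with the online network but evaluates it with the target network. A minimal sketch (tensor names and shapes are assumptions):

```python
import torch

def double_dqn_target(rewards, next_states, dones, policy_net, target_net, gamma=0.99):
    with torch.no_grad():
        # Action *selection* uses the online (policy) network...
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # ...while action *evaluation* uses the frozen target network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    # Terminal states (done == 1) contribute no bootstrapped future value
    return rewards + gamma * next_q * (1 - dones)
```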

## References

* **Mnih et al. (2015):** "Human-level control through deep reinforcement learning" (the original Nature paper).
* **DeepLizard RL Series:** Excellent visual tutorials on DQN mechanics.

---

**DQN is great for discrete actions (like buttons on a controller). But how do we handle continuous actions, like the pressure applied to a gas pedal?**
