
Commit 5c4fa7b

Merge pull request #180 from codeharborhub/dev-1
added ml-deep-learning docs
2 parents a7a411a + c396190 commit 5c4fa7b

File tree: 5 files changed, +562 −0 lines changed

Lines changed: 110 additions & 0 deletions

---
title: "Multi-Head Attention: Parallelizing Insight"
sidebar_label: Multi-Head Attention
description: "Understanding how multiple attention 'heads' allow Transformers to capture diverse linguistic and spatial relationships simultaneously."
tags: [deep-learning, attention, multi-head-attention, transformers, nlp]
---

While [Self-Attention](./self-attention) is powerful, a single attention head often averages out the relationships between words. **Multi-Head Attention** solves this by running multiple self-attention operations in parallel, allowing the model to focus on different aspects of the input simultaneously.

## 1. The Concept: Why Multiple Heads?

If we use only one attention head, the model might focus entirely on the strongest relationship (e.g., the subject of a sentence). However, a word often has multiple relationships:

* **Head 1:** Might focus on the **Grammar** (Subject-Verb agreement).
* **Head 2:** Might focus on the **Context** (What does "it" refer to?).
* **Head 3:** Might focus on the **Visual/Spatial** relations (Is the object "on" or "under" the table?).

By using multiple heads, we allow the model to "attend" to these different representation subspaces at once.

## 2. How it Works: Split, Attend, Concatenate

The process of Multi-Head Attention follows four distinct steps (a code sketch follows the list):

1. **Linear Projection (Split):** The input Query ($Q$), Key ($K$), and Value ($V$) are projected into $h$ different, lower-dimensional versions using learned weight matrices.
2. **Parallel Attention:** We apply the [Scaled Dot-Product Attention](./self-attention#3-the-calculation-process) to each of the $h$ heads independently.
3. **Concatenation:** The outputs from all heads are concatenated back into a single vector.
4. **Final Linear Projection:** A final weight matrix ($W^O$) is applied to the concatenated vector to bring it back to the expected output dimension.
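
The four steps can be sketched directly in PyTorch. This is a minimal illustration of the split-attend-concatenate logic with toy sizes, not the library implementation; helper names such as `split_heads` and the chosen dimensions are purely for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h  # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)

# 1. Linear projection: one learned matrix per Q/K/V, then a split into h heads
W_q, W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
W_o = nn.Linear(d_model, d_model)

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
    return t.view(batch, seq_len, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# 2. Scaled dot-product attention, applied to all heads in parallel
scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)   # (batch, h, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
heads = weights @ V                               # (batch, h, seq_len, d_k)

# 3. Concatenate the heads back into one vector per token
concat = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

# 4. Final linear projection with W^O
output = W_o(concat)
print(output.shape)  # torch.Size([2, 10, 512])
```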

## 3. Mathematical Representation

For each head $i$, the attention is calculated as:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

The final output is the concatenation of these heads multiplied by an output weight matrix:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$
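
As a concrete example of the dimensions involved (using the settings from the original paper), with $d_{model} = 512$ and $h = 8$ heads each head works in a reduced subspace:

$$
d_k = d_v = \frac{d_{model}}{h} = \frac{512}{8} = 64, \qquad W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{512 \times 64}, \qquad W^O \in \mathbb{R}^{512 \times 512}
$$

Because each head is smaller, the total computational cost stays comparable to a single attention head operating on the full dimensionality.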

## 4. Advanced Logic Flow (Mermaid)

The following diagram visualizes how the model splits a single high-dimensional embedding into multiple "heads" to process information in parallel.

```mermaid
graph TD
    Input["Input Q, K, V"] --> Split{"Linear Split into 'h' Heads"}

    subgraph Parallel_Heads [Parallel Processing]
        Head1["Head 1: Scaled Dot-Product"]
        Head2["Head 2: Scaled Dot-Product"]
        HeadN["Head 'h': Scaled Dot-Product"]
    end

    Split --> Head1
    Split --> Head2
    Split --> HeadN

    Head1 --> Concat["Concatenate Results"]
    Head2 --> Concat
    HeadN --> Concat

    Concat --> FinalLinear["Final Linear Projection W^O"]
    FinalLinear --> Output["Multi-Head Output"]
```

## 5. Key Advantages

* **Ensemble Effect:** It acts like an ensemble of models, where each head learns something unique.
* **Stable Training:** By dividing the model dimension by the number of heads, the internal dimensionality of each head stays manageable, preventing the dot products from growing too large.
* **Resolution:** It improves the "resolution" of the attention map, making it less likely that one dominant word will "wash out" the influence of others.

## 6. Implementation with PyTorch

Using the `nn.MultiheadAttention` module is the standard way to implement this in production.

```python
import torch
import torch.nn as nn

# Parameters
embed_dim = 128  # Dimension of the model
num_heads = 8    # Number of parallel attention heads
# Note: embed_dim must be divisible by num_heads (128 / 8 = 16 per head)

mha_layer = nn.MultiheadAttention(embed_dim, num_heads)

# Input shape: (sequence_length, batch_size, embed_dim)
query = torch.randn(20, 1, 128)
key = torch.randn(20, 1, 128)
value = torch.randn(20, 1, 128)

# attn_output: the projected result; attn_weights: the attention map
attn_output, attn_weights = mha_layer(query, key, value)

print(f"Output size: {attn_output.shape}")        # [20, 1, 128]
print(f"Attention weights: {attn_weights.shape}") # [1, 20, 20]
```
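
By default, `nn.MultiheadAttention` expects inputs shaped `(seq_len, batch, embed_dim)` and returns attention weights averaged across the heads, which is why the weights above have shape `[1, 20, 20]`. Recent PyTorch versions accept flags that change both behaviours; the snippet below is a small sketch assuming a version where `batch_first` and `average_attn_weights` are available.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 128, 8

# batch_first=True lets us pass (batch, seq_len, embed_dim) tensors directly
mha_bf = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, 20, embed_dim)  # (batch, seq_len, embed_dim)

# average_attn_weights=False requests one attention map per head
out, per_head_weights = mha_bf(x, x, x, average_attn_weights=False)

print(out.shape)               # [1, 20, 128]
print(per_head_weights.shape)  # [1, 8, 20, 20] -> (batch, heads, tgt_len, src_len)
```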

## References

* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
* **Visualizing Attention:** [A Survey of Attention Mechanisms](https://arxiv.org/abs/2101.02257)

---

**Multi-Head Attention is the engine. But how do we organize these engines into a structure that can actually translate languages or generate text?**

Lines changed: 110 additions & 0 deletions

---
title: "Self-Attention: The Core of Transformers"
sidebar_label: Self-Attention
description: "Understanding how models weigh the importance of different parts of an input sequence using Queries, Keys, and Values."
tags: [deep-learning, attention, transformers, nlp, self-attention]
---

**Self-Attention** (also known as Intra-Attention) is the mechanism that allows a model to look at other words in an input sequence to get a better encoding for the word it is currently processing.

Unlike [RNNs](../rnn/rnn-basics), which process words one by one, Self-Attention allows every word to "talk" to every other word simultaneously, regardless of their distance.

## 1. Why do we need Self-Attention?

Consider the sentence: *"The animal didn't cross the street because **it** was too tired."*

When a model processes the word **"it"**, it needs to know what "it" refers to. Is it the animal or the street?

* In a standard RNN, if the sentence is long, the model might "forget" about the animal by the time it reaches "it".
* In **Self-Attention**, the model calculates a score that links "it" strongly to "animal" and weakly to "street".

## 2. The Three Vectors: Query, Key, and Value

To calculate self-attention, we create three vectors from every input word (embedding) by multiplying it by three weight matrices ($W^Q, W^K, W^V$) that are learned during training.

| Vector | Analogy (The Library) | Purpose |
| :--- | :--- | :--- |
| **Query ($Q$)** | The topic you are searching for. | Represents the current word looking at other words. |
| **Key ($K$)** | The label on the spine of the book. | Represents the "relevance" tag of all other words. |
| **Value ($V$)** | The information inside the book. | Represents the actual content of the word. |
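
A minimal sketch of these projections in code. The sizes below are illustrative, and in a real model the three `nn.Linear` layers would be trained rather than left at their random initialization.

```python
import torch
import torch.nn as nn

d_model, d_k = 512, 64           # embedding size and projection size (illustrative)
x = torch.randn(6, d_model)      # embeddings for a 6-token sentence

# The learned projection matrices W^Q, W^K, W^V
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # torch.Size([6, 64]) for each
```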

## 3. The Calculation Process

The attention score is calculated through a series of matrix operations (a runnable sketch follows the formula):

1. **Dot Product:** We multiply the Query of the current word by the Keys of all other words.
2. **Scaling:** We divide by the square root of the dimension of the key ($\sqrt{d_k}$) to keep gradients stable.
3. **Softmax:** We apply a Softmax function to turn scores into probabilities (weights) that sum to 1.
4. **Weighted Sum:** We multiply the weights by the Value vectors to get the final output for that word.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
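
Putting the four steps together, here is a minimal, non-optimized sketch of the formula above; the function name and toy tensors are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # 1. dot product, 2. scaling
    weights = F.softmax(scores, dim=-1)                # 3. softmax over the keys
    return weights @ V, weights                        # 4. weighted sum of the values

Q = torch.randn(6, 64)  # queries for 6 tokens
K = torch.randn(6, 64)
V = torch.randn(6, 64)

output, weights = attention(Q, K, V)
print(output.shape)        # torch.Size([6, 64])
print(weights.sum(dim=-1)) # every row of weights sums to 1
```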

## 4. Advanced Flow Logic (Mermaid)

The following diagram represents how an input embedding is transformed into an Attention output.

```mermaid
graph TD
    Input[Input Embedding $$\ X$$] --> WQ[Weight Matrix $$\ W^Q$$]
    Input --> WK[Weight Matrix $$\ W^K$$]
    Input --> WV[Weight Matrix $$\ W^V$$]

    WQ --> Q[Query $$\ Q$$]
    WK --> K[Key $$\ K$$]
    WV --> V[Value $$\ V$$]

    Q --> Dot[Dot Product $$\ Q·K$$]
    K --> Dot

    Dot --> Scale["Scale by $$\ 1/\sqrt {d_k}$$"]
    Scale --> Softmax[Softmax Layer]

    Softmax --> WeightSum[Weighted Sum with $$\ V$$]
    V --> WeightSum

    WeightSum --> Final[Attention Output]
```

## 5. Multi-Head Attention

In practice, we don't just use one self-attention mechanism. We use **Multi-Head Attention**. This involves running several self-attention calculations (heads) in parallel.

* One head might focus on the **subject-verb** relationship.
* Another head might focus on **adjectives**.
* Another head might focus on **contextual references**.

By combining these, the model gets a much richer understanding of the text.

## 6. Implementation with PyTorch

Modern deep learning frameworks provide highly optimized modules for this.

```python
import torch
import torch.nn as nn

# Embedding dim = 512, Number of heads = 8
multihead_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)

# Input shape: (sequence_length, batch_size, embed_dim)
query = torch.randn(10, 1, 512)
key = torch.randn(10, 1, 512)
value = torch.randn(10, 1, 512)

attn_output, attn_weights = multihead_attn(query, key, value)

print(f"Output shape: {attn_output.shape}")  # [10, 1, 512]
```

## References

* **Original Paper:** [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762)
* **The Illustrated Transformer:** [Jay Alammar's Blog](https://jalammar.github.io/illustrated-transformer/)
* **Harvard NLP:** [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

---

**Self-Attention allows the model to understand the context of a sequence. But how do we stack these layers to build the most powerful models in AI today?**

Lines changed: 119 additions & 0 deletions

---
title: "Transformer Architecture: The Foundation of Modern AI"
sidebar_label: Transformers
description: "A comprehensive deep dive into the Transformer architecture, including Encoder-Decoder stacks and Positional Encoding."
tags: [deep-learning, transformers, nlp, attention, gpt, bert]
---

Introduced in the 2017 paper *"Attention Is All You Need"*, the **Transformer** shifted the paradigm of sequence modeling. By removing recurrence (RNNs) and convolutions (CNNs) entirely and relying solely on [Self-Attention](./self-attention), Transformers allowed for massive parallelization and state-of-the-art performance in NLP and beyond.

## 1. High-Level Architecture

The Transformer follows an **Encoder-Decoder** structure:

* **The Encoder (Left):** Maps an input sequence to a sequence of continuous representations.
* **The Decoder (Right):** Uses the encoder's representation and previous outputs to generate an output sequence, one element at a time.

## 2. The Encoder Stack

An encoder consists of a stack of identical layers (typically 6). Each layer has two sub-layers:

1. **Multi-Head Self-Attention:** Allows the encoder to look at other words in the input sentence as it encodes a specific word.
2. **Position-wise Feed-Forward Network (FFN):** A simple fully connected network applied to each position independently and identically.

:::info Key Feature
Each sub-layer uses **Residual Connections** (Add) followed by **Layer Normalization** (Norm). This is often abbreviated as `Add & Norm`.
:::
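
The `Add & Norm` wiring of one encoder layer can be sketched in a few lines. This is a minimal post-norm sketch with toy sizes (dropout and masking omitted), not the full `nn.TransformerEncoderLayer` implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048
x = torch.randn(10, 2, d_model)  # (seq_len, batch, d_model)

self_attn = nn.MultiheadAttention(d_model, n_heads)
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

# Sub-layer 1: multi-head self-attention wrapped in Add & Norm
attn_out, _ = self_attn(x, x, x)
x = norm1(x + attn_out)

# Sub-layer 2: position-wise feed-forward network wrapped in Add & Norm
x = norm2(x + ffn(x))

print(x.shape)  # torch.Size([10, 2, 512])
```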

## 3. The Decoder Stack

The decoder also has a stack of identical layers, but it includes a third sub-layer:

1. **Masked Multi-Head Attention:** Ensures that the prediction for a specific position can only depend on the known outputs at positions before it, preventing the model from "cheating" by looking ahead (see the mask sketch after this list).
2. **Encoder-Decoder Attention:** Performs attention over the encoder's output. This helps the decoder focus on relevant parts of the input sequence.
3. **Feed-Forward Network (FFN):** Similar to the encoder's FFN.

## 4. Positional Encoding

Since Transformers do not use RNNs, they have no inherent sense of the **order** of words. To fix this, we add **Positional Encodings** to the input embeddings. These are vectors that follow a specific mathematical pattern (often sine and cosine functions) to give the model information about the relative or absolute position of words.

$$
PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})
$$

$$
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})
$$
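
A minimal sketch of this sinusoidal scheme; computing the $10000^{2i/d_{model}}$ term via `exp`/`log` is a common trick, and the sizes here are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even indices: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # torch.Size([50, 512])

# The encoding is simply added to the token embeddings:
# embeddings = token_embeddings + pe[:seq_len]
```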

## 5. Transformer Data Flow (Mermaid)

This diagram visualizes how a single token moves through the Transformer stack.

```mermaid
graph TD
    Input[Input Tokens] --> Embed[Input Embedding]
    Pos[Positional Encoding] --> Embed
    Embed --> EncStack[Encoder Stack]

    subgraph EncoderLayer [Encoder Layer]
        SelfAttn[Multi-Head Self-Attention] --> AddNorm1["Add & Norm"]
        AddNorm1 --> FFN[Feed Forward]
        FFN --> AddNorm2["Add & Norm"]
    end

    EncStack --> DecStack[Decoder Stack]

    subgraph DecoderLayer [Decoder Layer]
        MaskAttn[Masked Self-Attention] --> AddNorm3["Add & Norm"]
        AddNorm3 --> CrossAttn[Encoder-Decoder Attention]
        CrossAttn --> AddNorm4["Add & Norm"]
        AddNorm4 --> DecFFN[Feed Forward]
        DecFFN --> AddNorm5["Add & Norm"]
    end

    DecStack --> Linear[Linear Layer]
    Linear --> Softmax[Softmax]
    Softmax --> Output[Predicted Token]
```

## 6. Why Transformers Won

| Feature | RNNs / LSTMs | Transformers |
| --- | --- | --- |
| **Processing** | Sequential (Slow) | Parallel (Fast on GPUs) |
| **Long-range Dependencies** | Difficult (Vanishing Gradient) | Easy (Direct Attention) |
| **Scaling** | Hard to scale to massive data | Designed for massive data & parameters |
| **Example Models** | ELMo | BERT, GPT-4, Llama 3 |

## 7. Simple Implementation (PyTorch)

PyTorch provides a high-level `nn.Transformer` module, but you can also access the individual components:

```python
import torch
import torch.nn as nn

# Parameters
d_model = 512
nhead = 8
num_encoder_layers = 6

# Define Encoder Layer
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
# Define Transformer Encoder
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)

# Input shape: (S, N, E) where S is seq_length, N is batch, E is d_model
src = torch.randn(10, 32, 512)
out = transformer_encoder(src)

print(f"Output shape: {out.shape}")  # [10, 32, 512]
```

## References

* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
* **Visual Guide:** [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
* **DeepLearning.AI:** [Transformer Network (C5W4L06)](https://www.youtube.com/watch?v=AFkGPmU16QA)

---

**The Transformer architecture is the engine. But how do we train it? Does it read the whole internet at once?**
