---
title: "Word Embeddings: Mapping Meaning to Vectors"
sidebar_label: Word Embeddings
description: "How to represent words as dense vectors where geometric distance corresponds to semantic similarity."
tags: [nlp, machine-learning, embeddings, word2vec, glove, fasttext]
---

In previous steps like [Stemming](./stemming), we treated words as discrete symbols. However, a machine doesn't know that "Apple" is closer to "Orange" than it is to "Airplane."

**Word Embeddings** solve this by representing words as **dense vectors** of real numbers in a high-dimensional space. The core philosophy is the **Distributional Hypothesis**: *"A word is characterized by the company it keeps."*

## 1. Why Not Use One-Hot Encoding?

Before embeddings, we used One-Hot Encoding (a vector of 0s with a single 1).
* **The Problem:** It creates massive, sparse vectors (with a 50,000-word vocabulary, each vector has 50,000 dimensions).
* **The Fatal Flaw:** All vectors are equidistant. The dot product between "King" and "Queen" is the same as between "King" and "Potato" (zero), so the model sees no relationship between any pair of words (the sketch below makes this concrete).
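
A minimal NumPy sketch of that flaw, using a toy three-word vocabulary (the words and sizes are purely illustrative):

```python
import numpy as np

# Toy vocabulary; a real system would have tens of thousands of entries.
vocab = ["king", "queen", "potato"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Every pair of distinct one-hot vectors has a dot product of exactly 0,
# so "king" looks just as unrelated to "queen" as it does to "potato".
print(one_hot["king"] @ one_hot["queen"])   # 0.0
print(one_hot["king"] @ one_hot["potato"])  # 0.0
```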

## 2. The Vector Space: King - Man + Woman = Queen

The most famous property of embeddings is their ability to capture **analogies** through vector arithmetic. Because words with similar meanings are placed close together, the distance and direction between vectors represent semantic relationships (a toy numeric example follows the list).

* **Gender:** $\vec{King} - \vec{Man} + \vec{Woman} \approx \vec{Queen}$
* **Verb Tense:** $\vec{Walking} - \vec{Walk} + \vec{Swim} \approx \vec{Swimming}$
* **Capital Cities:** $\vec{Paris} - \vec{France} + \vec{Germany} \approx \vec{Berlin}$
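
Here is a contrived two-dimensional illustration; the numbers are hand-picked so the relation holds exactly, which real high-dimensional embeddings only approximate:

```python
import numpy as np

# Hand-picked toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "maleness".
# Real embeddings have hundreds of dimensions with no such clean interpretation.
king  = np.array([0.9, 0.9])
man   = np.array([0.1, 0.9])
woman = np.array([0.1, 0.1])
queen = np.array([0.9, 0.1])

result = king - man + woman
print(result)                      # [0.9 0.1]
print(np.allclose(result, queen))  # True -> King - Man + Woman lands on Queen
```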

## 3. Major Embedding Algorithms

### A. Word2Vec (Google)
Uses a shallow neural network to learn word associations. It has two architectures, both selectable in the training sketch below:
1. **CBOW (Continuous Bag of Words):** Predicts a target word from its surrounding context words.
2. **Skip-gram:** Predicts the surrounding context words from a single target word (works better for rare words).
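
A minimal Gensim training sketch (the tiny corpus is made up, real training needs far more text, and the hyperparameters here are illustrative):

```python
from gensim.models import Word2Vec

# Each "document" is a list of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "farmer", "grows", "a", "potato"],
]

# sg=1 -> Skip-gram (predict context from the center word);
# sg=0 -> CBOW (predict the center word from its context).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("king", topn=2))
```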

### B. GloVe (Stanford)
Short for "Global Vectors." Unlike Word2Vec, which iterates over local context windows, GloVe is fit to the **global co-occurrence matrix** of the entire corpus.
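
GloVe then learns vectors whose dot products (plus bias terms) approximate the logarithm of these counts; the real implementation also down-weights counts by the distance between the two words, which is omitted here. The sketch below only builds the raw co-occurrence counts GloVe starts from, using a hypothetical window size of 2:

```python
from collections import defaultdict

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]
window = 2  # symmetric context window (illustrative choice)

# cooc[(w, c)] counts how often word c appears within `window` tokens of word w.
cooc = defaultdict(float)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[(word, sentence[j])] += 1.0

print(cooc[("king", "rules")])  # 1.0 -- "rules" appears once within 2 tokens of "king"
```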

### C. FastText (Facebook)
An extension of Word2Vec that treats each word as a bag of **character n-grams**. This lets it build embeddings for out-of-vocabulary (OOV) words by composing the vectors of their sub-parts.
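
A small Gensim sketch of that OOV behaviour (again on a made-up corpus; `min_n`/`max_n` set the character n-gram lengths):

```python
from gensim.models import FastText

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

# Character n-grams of length 3 to 6 are learned alongside the full words.
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=6)

# "kingdoms" never appears in the corpus, but FastText can still assemble a
# vector for it from shared n-grams such as "kin", "ing", "ngd", ...
oov_vector = model.wv["kingdoms"]
print(oov_vector.shape)  # (50,)
```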

## 4. Advanced Logic: Skip-gram Architecture (Mermaid)

The following diagram illustrates how the Skip-gram model uses a center word to predict its neighbors, thereby learning a dense representation in its hidden layer. The code sketch after the diagram shows how the (center, context) training pairs it consumes are generated.

```mermaid
graph LR
    Input["Input Word: 'King'"] --> Hidden["Hidden Layer / Embedding"]
    Hidden --> Out1["Context Word 1: 'Queen'"]
    Hidden --> Out2["Context Word 2: 'Throne'"]
    Hidden --> Out3["Context Word 3: 'Rule'"]

    style Input fill:#e1f5fe,stroke:#01579b,color:#333
    style Hidden fill:#ffecb3,stroke:#ffa000,stroke-width:2px,color:#333
    style Out1 fill:#c8e6c9,stroke:#2e7d32,color:#333
    style Out2 fill:#c8e6c9,stroke:#2e7d32,color:#333
    style Out3 fill:#c8e6c9,stroke:#2e7d32,color:#333
```
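
A minimal sketch of the pair-extraction step, with an illustrative window size of 2:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs for the Skip-gram objective."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                yield center, tokens[j]

sentence = ["the", "king", "sits", "on", "the", "throne"]
print(list(skipgram_pairs(sentence))[:4])
# [('the', 'king'), ('the', 'sits'), ('king', 'the'), ('king', 'sits')]
```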

## 5. Measuring Similarity: Cosine Similarity

To find how similar two words are in an embedding space, we don't use Euclidean distance (which is affected by vector length). Instead, we use **Cosine Similarity**, which measures the angle between two vectors; a direct NumPy version follows the list below.

$$
\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
$$

* **1.0:** Vectors point in the same direction (near-synonyms or closely related words).
* **0.0:** Vectors are orthogonal (unrelated).
* **-1.0:** Vectors point in opposite directions (rare in practice; even antonyms tend to share contexts and score well above -1).
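
A direct NumPy translation of the formula (Gensim's `model.similarity` in the next section computes the same quantity):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (||A|| * ||B||)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 1.0  (same direction)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0  (orthogonal)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0 (opposite)
```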

## 6. Implementation with Gensim

Gensim is the go-to Python library for using pre-trained embeddings or training your own.

```python
import gensim.downloader as api

# 1. Load pre-trained GloVe embeddings (100-dimensional, trained on Wikipedia + Gigaword)
model = api.load("glove-wiki-gigaword-100")

# 2. Find the most similar word: King - Man + Woman
result = model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"King - Man + Woman = {result[0][0]}")
# Output: queen

# 3. Compute a similarity score
score = model.similarity('apple', 'banana')
print(f"Similarity between apple and banana: {score:.4f}")
```
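
Note that the first `api.load` call downloads the vectors (on the order of a hundred megabytes for this model) and caches them under `~/gensim-data`, so subsequent runs load from disk.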

## References

* **Original Word2Vec Paper:** [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
* **Stanford NLP:** [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
* **Gensim:** [Official Documentation and Tutorials](https://radimrehurek.com/gensim/auto_examples/index.html)

---

**Static embeddings like Word2Vec are great, but they have a flaw: the word "Bank" has the same vector whether it's a river bank or a financial bank. How do we make embeddings context-aware?**