
Commit b995045

Merge pull request #169 from codeharborhub/dev-1
ml docs add
2 parents adf6ed6 + 2349def

9 files changed: +913 -0 lines changed
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
---
title: "Accuracy: The Intuitive Metric"
sidebar_label: Accuracy
description: "Understanding the most common evaluation metric, its formula, and its fatal flaws in imbalanced datasets."
tags: [machine-learning, model-evaluation, metrics, classification]
---

**Accuracy** is the most basic and intuitive metric used to evaluate a classification model. In simple terms, it answers the question: *"Out of all the predictions made, how many were correct?"*

## 1. The Mathematical Formula

Accuracy is calculated by dividing the number of correct predictions by the total number of input samples.

Using the components of a [Confusion Matrix](./confusion-matrix), the formula is:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Where:

* **TP (True Positives):** Correctly predicted positive samples.
* **TN (True Negatives):** Correctly predicted negative samples.
* **FP (False Positives):** Incorrectly predicted as positive.
* **FN (False Negatives):** Incorrectly predicted as negative.

**Example:**

Imagine you have a dataset of 100 emails, where 80 are spam and 20 are not spam. Your model makes the following predictions:

| Actual \ Predicted | Spam | Not Spam |
| --- | --- | --- |
| **Spam** | 70 (TP) | 10 (FN) |
| **Not Spam** | 5 (FP) | 15 (TN) |

Using the formula:

$$
\text{Accuracy} = \frac{70 + 15}{70 + 15 + 5 + 10} = \frac{85}{100} = 0.85 \text{ or } 85\%
$$

This means your model correctly identified 85% of the emails.
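
As a quick sanity check, here is a minimal sketch of the same arithmetic in plain Python, plugging in the counts from the spam-email example above:

```python
# Counts taken from the spam-email example above
tp, tn, fp, fn = 70, 15, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 85.00%
```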

## 2. When Accuracy Works Best

Accuracy is a reliable metric **only** when your dataset is **balanced**.

* **Example:** You are building a model to classify images as either "Cats" or "Dogs." Your dataset has 500 cats and 500 dogs.
* If your model gets an accuracy of 90%, you can be confident that it is performing well across both categories.
## 3. The "Accuracy Paradox" (Imbalanced Data)
52+
53+
Accuracy becomes highly misleading when one class significantly outweighs the other. This is known as the **Accuracy Paradox**.
54+
55+
### The Scenario:
56+
57+
Imagine a Rare Disease test where only **1%** of the population is actually sick.
58+
59+
1. If a "lazy" model is programmed to simply say **"Healthy"** for every single patient...
60+
2. It will be **99% accurate**.
61+
62+
```mermaid
63+
graph LR
64+
POP["$$\text{Population (100\%)}$$"]
65+
66+
POP --> H["$$99\% \ \text{Healthy}$$"]
67+
POP --> S["$$1\% \ \text{Sick (Rare Disease)}$$"]
68+
69+
%% Lazy Model
70+
H --> PH["$$\text{Predicted: Healthy}$$"]
71+
S --> PS["$$\text{Predicted: Healthy}$$"]
72+
73+
PH --> ACC1["$$\text{True Negatives (99\%)}$$"]
74+
PS --> ERR1["$$\text{False Negatives (1\%)}$$"]
75+
76+
ACC1 --> MET["$$\text{Accuracy} = \frac{99}{100} = 99\%$$"]
77+
78+
ERR1 --> FAIL["$$\text{❌ All Sick Patients Missed}$$"]
79+
80+
MET -.->|"$$\text{Accuracy Paradox}$$"| FAIL
81+
82+
```
83+
84+
**The problem?** Even though the accuracy is 99%, the model failed to find the 1% of people who actually need help. In high-stakes fields like medicine or fraud detection, accuracy is often the least important metric.
85+
86+
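
To see the paradox in code, here is a minimal, illustrative sketch (the 99% / 1% split and the 1,000-patient sample size are assumptions for the example) using scikit-learn's `DummyClassifier` as the "lazy" model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic "rare disease" labels: 990 healthy (0), 10 sick (1)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant for this baseline

# A "lazy" baseline that always predicts the majority class ("Healthy")
lazy_model = DummyClassifier(strategy="most_frequent")
lazy_model.fit(X, y)
y_pred = lazy_model.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")  # 99.00%
print(f"Recall:   {recall_score(y, y_pred):.2%}")    # 0.00% -- every sick patient is missed
```

The accuracy looks excellent, yet the 0% recall reveals that the model never identifies a single sick patient.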

## 4. Implementation with Scikit-Learn

```python
from sklearn.metrics import accuracy_score

# Actual target values
y_true = [0, 1, 1, 0, 1, 1]

# Model predictions
y_pred = [0, 1, 0, 0, 1, 1]

# Calculate Accuracy
score = accuracy_score(y_true, y_pred)

print(f"Accuracy: {score * 100:.2f}%")
# Output: Accuracy: 83.33%
```

## 5. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Simple to understand:** Easy to explain to non-technical stakeholders. | **Useless for Imbalance:** Can hide poor performance on minority classes. |
| **Single Number:** Provides a quick, high-level overview of model health. | **Ignores Probability:** Doesn't tell you how confident the model was in its choice. |
| **Standardized:** Used across almost every classification project. | **Cost Blind:** Treats "False Positives" and "False Negatives" as equally bad. |

## 6. How to Move Beyond Accuracy?

To get a true picture of your model's performance, especially if your data is "skewed," you should look at Accuracy alongside the following metrics (a short scikit-learn example follows the list):

* **Precision:** How many of the predicted positives were actually positive?
* **Recall:** How many of the actual positives did we successfully find?
* **F1-Score:** The harmonic mean of Precision and Recall.
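
As a quick illustration, here is a minimal sketch reusing the `y_true` / `y_pred` arrays from the implementation section above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# Precision: of the predicted positives, how many were truly positive?
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 1.00
# Recall: of the actual positives, how many did we find?
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
# F1-Score: the harmonic mean of the two
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")         # 0.86
```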

## References

* **Google Developers:** [Classification: Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)
* **StatQuest:** [Accuracy, Precision, and Recall](https://www.youtube.com/watch?v=Kdsp6soqA7o)

---

**If Accuracy isn't enough to catch rare diseases or credit card fraud, what is?** Stay tuned for our next chapter on **Precision & Recall** to find out!
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
---
title: The Confusion Matrix
sidebar_label: Confusion Matrix
description: "The foundation of classification evaluation: True Positives, False Positives, True Negatives, and False Negatives."
tags: [machine-learning, model-evaluation, metrics, classification, confusion-matrix]
---

A **Confusion Matrix** is a table used to describe the performance of a classification model. While "Accuracy" tells you how often the model is correct, the Confusion Matrix tells you exactly **how** it is failing and which classes are being swapped.

## 1. The 2x2 Layout

For a binary classification (Yes/No, Spam/Ham), the matrix consists of four quadrants:

| | Predicted: **Negative** | Predicted: **Positive** |
| :--- | :--- | :--- |
| **Actual: Negative** | **True Negative (TN)** | **False Positive (FP)** |
| **Actual: Positive** | **False Negative (FN)** | **True Positive (TP)** |

### Breaking Down the Quadrants:

* **True Positive (TP):** You predicted positive, and it was true. (e.g., You predicted a patient has cancer, and they do).
* **True Negative (TN):** You predicted negative, and it was true. (e.g., You predicted a patient is healthy, and they are).
* **False Positive (FP):** You predicted positive, but it was false. (Also known as a **Type I Error** or a "False Alarm").
* **False Negative (FN):** You predicted negative, but it was positive. (Also known as a **Type II Error** or a "Miss").
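
Concretely, these four counts can be pulled straight out of scikit-learn's `confusion_matrix`. The minimal sketch below reuses the toy `y_true` / `y_pred` arrays from the implementation example later in this page; for binary 0/1 labels the matrix unpacks, row by row, as TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes (labels sorted: 0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# TN=3, FP=1, FN=1, TP=3
```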

## 2. Type I vs. Type II Errors

The "cost" of these errors depends entirely on your specific problem.

```mermaid
graph TB
    TITLE["$$\text{Type I vs. Type II Errors}$$"]

    %% Ground Truth
    TITLE --> TRUTH["$$\text{Actual Condition}$$"]
    TRUTH --> POS["$$\text{Positive (Condition Present)}$$"]
    TRUTH --> NEG["$$\text{Negative (Condition Absent)}$$"]

    %% Model Decisions
    POS --> TP["$$\text{True Positive}$$"]
    POS --> FN["$$\text{Type II Error}$$<br/>$$\text{False Negative}$$"]

    NEG --> TN["$$\text{True Negative}$$"]
    NEG --> FP["$$\text{Type I Error}$$<br/>$$\text{False Positive}$$"]

    %% Costs
    FP --> COST1["$$\text{Cost Depends on Context}$$"]
    FN --> COST2["$$\text{Cost Depends on Context}$$"]

    %% Examples
    COST1 --> EX1["$$\text{Example: Spam Filter}$$<br/>$$\text{Important Email Blocked}$$"]
    COST2 --> EX2["$$\text{Example: Medical Test}$$<br/>$$\text{Disease Missed}$$"]

    %% Emphasis
    EX1 -.->|"$$\text{Type I Cost High}$$"| FP
    EX2 -.->|"$$\text{Type II Cost High}$$"| FN
```

* **In Cancer Detection:** A **Type II Error (FN)** is much worse because a sick patient goes untreated.
* **In Spam Filtering:** A **Type I Error (FP)** is worse because an important work email is hidden in the trash.

## 3. Implementation with Scikit-Learn

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Actual values and Model predictions
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# 1. Generate the matrix
cm = confusion_matrix(y_true, y_pred)

# 2. Visualize it
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap=plt.cm.Blues)
plt.show()
```

## 4. Multi-Class Confusion Matrices

The matrix isn't just for binary problems. If you are classifying "Cat," "Dog," and "Bird," your matrix will be 3x3. The diagonal line from top-left to bottom-right represents correct predictions. Any numbers off that diagonal show you which animals the model is confusing.

```mermaid
graph TB
    TITLE["$$\text{Multi-Class Confusion Matrix (3×3)}$$"]

    %% Axes
    TITLE --> ACT["$$\text{Actual Class}$$"]
    TITLE --> PRED["$$\text{Predicted Class}$$"]

    ACT --> CAT_A["$$\text{Cat}$$"]
    ACT --> DOG_A["$$\text{Dog}$$"]
    ACT --> BIRD_A["$$\text{Bird}$$"]

    PRED --> CAT_P["$$\text{Cat}$$"]
    PRED --> DOG_P["$$\text{Dog}$$"]
    PRED --> BIRD_P["$$\text{Bird}$$"]

    %% Diagonal (Correct Predictions)
    CAT_A --> CAT_P["$$\text{Cat → Cat}$$<br/>$$\text{Correct}$$"]
    DOG_A --> DOG_P["$$\text{Dog → Dog}$$<br/>$$\text{Correct}$$"]
    BIRD_A --> BIRD_P["$$\text{Bird → Bird}$$<br/>$$\text{Correct}$$"]

    %% Off-Diagonal (Confusions)
    CAT_A --> DOG_P["$$\text{Cat → Dog}$$<br/>$$\text{Confusion}$$"]
    CAT_A --> BIRD_P["$$\text{Cat → Bird}$$<br/>$$\text{Confusion}$$"]

    DOG_A --> CAT_P["$$\text{Dog → Cat}$$<br/>$$\text{Confusion}$$"]
    DOG_A --> BIRD_P["$$\text{Dog → Bird}$$<br/>$$\text{Confusion}$$"]

    BIRD_A --> CAT_P["$$\text{Bird → Cat}$$<br/>$$\text{Confusion}$$"]
    BIRD_A --> DOG_P["$$\text{Bird → Dog}$$<br/>$$\text{Confusion}$$"]

    %% Emphasis
    CAT_P -.->|"$$\text{Diagonal}$$"| GOOD["$$\text{Correct Predictions}$$"]
    DOG_P -.->|"$$\text{Diagonal}$$"| GOOD
    BIRD_P -.->|"$$\text{Diagonal}$$"| GOOD

    DOG_P -.->|"$$\text{Off-Diagonal}$$"| BAD["$$\text{Model Confusion}$$"]
    BIRD_P -.->|"$$\text{Off-Diagonal}$$"| BAD
```

## 5. Summary: What can we calculate from here?

The Confusion Matrix is the "mother" of all classification metrics. From these four numbers, we derive the following (a quick sketch follows the list):

* **Accuracy:** The overall share of correct predictions, (TP + TN) / (TP + TN + FP + FN).
* **Precision:** Of everything predicted positive, how much was actually positive, TP / (TP + FP).
* **Recall:** Of everything actually positive, how much was found, TP / (TP + FN).
* **F1-Score:** The balance between Precision and Recall.
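
As a minimal sketch, plugging in the counts from the binary example above (TN=3, FP=1, FN=1, TP=3) gives:

```python
# Deriving the core metrics by hand from the four counts
tn, fp, fn, tp = 3, 1, 1, 3

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")
# Accuracy=0.75, Precision=0.75, Recall=0.75, F1=0.75
```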

## References

* **StatQuest:** [Confusion Matrices Explained](https://www.youtube.com/watch?v=Kdsp6soqA7o)
* **Scikit-Learn:** [Confusion Matrix API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

---

**Now that you can see where the model is making mistakes, let's learn how to turn those mistakes into a single score.**
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
---
title: "F1-Score: The Balanced Metric"
sidebar_label: F1-Score
description: "Mastering the harmonic mean of Precision and Recall to evaluate models on imbalanced datasets."
tags: [machine-learning, model-evaluation, metrics, f1-score, classification]
---

The **F1-Score** combines [Precision](./precision) and [Recall](./recall) into a single value. It is particularly useful when you have an imbalanced dataset and you need to find an optimal balance between "False Positives" and "False Negatives."

## 1. The Mathematical Formula

The F1-Score is the **harmonic mean** of Precision and Recall. Unlike a simple average, the harmonic mean punishes extreme values. If either Precision or Recall is very low, the F1-Score will also be low.

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

### Why use the Harmonic Mean?

If we used a standard arithmetic average, a model with 1.0 Precision and 0.0 Recall would have a "decent" score of 0.5. However, such a model is useless. The harmonic mean ensures that if one metric is 0, the total score is 0.
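
A quick numeric sketch makes the difference concrete (the near-zero Recall of 0.01 is an assumed, illustrative value; with a Recall of exactly 0 the F1 formula simply evaluates to 0):

```python
precision, recall = 1.0, 0.01  # extreme, illustrative values

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic_mean:.3f}")  # 0.505 -- looks "decent"
print(f"F1 (harmonic):   {f1:.3f}")               # 0.020 -- exposes the useless recall
```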

## 2. When to Use the F1-Score

F1-Score is the best choice when:

1. **Imbalanced Classes:** You have a large number of "Negative" samples and few "Positive" ones (e.g., Fraud detection).
2. **Equal Importance:** You care equally about minimizing False Positives (Precision) and False Negatives (Recall).

## 3. Visualizing the Balance

Think of the F1-Score as a "balance scale." If you tilt too far toward catching everyone (Recall), your precision drops. If you tilt too far toward being perfectly accurate (Precision), you miss people. The F1-Score is highest when these two are in equilibrium.

```mermaid
graph TB
    SCALE["$$\text{F1-Score}$$<br/>$$\text{Balance Scale}$$"]

    %% Precision Side
    SCALE --> P["$$\text{Precision}$$"]
    P --> P1["$$\text{Few False Positives}$$"]
    P1 --> P2["$$\text{Strict Threshold}$$"]
    P2 --> P3["$$\text{Misses True Positives}$$"]
    P3 --> P4["$$\text{Low Recall}$$"]

    %% Recall Side
    SCALE --> R["$$\text{Recall}$$"]
    R --> R1["$$\text{Few False Negatives}$$"]
    R1 --> R2["$$\text{Loose Threshold}$$"]
    R2 --> R3["$$\text{Many False Positives}$$"]
    R3 --> R4["$$\text{Low Precision}$$"]

    %% Balance Point
    P4 -.->|"$$\text{Too Strict}$$"| UNBAL["$$\text{Unbalanced Model}$$"]
    R4 -.->|"$$\text{Too Loose}$$"| UNBAL

    P --> BAL["$$\text{Equilibrium}$$"]
    R --> BAL

    BAL --> F1["$$\text{F1} = 2 \cdot \frac{P \cdot R}{P + R}$$"]
    F1 --> OPT["$$\text{Maximum F1-Score}$$"]
```

## 4. Implementation with Scikit-Learn

```python
from sklearn.metrics import f1_score

# Actual target values
y_true = [0, 1, 1, 0, 1, 1, 0]

# Model predictions
y_pred = [0, 1, 0, 0, 1, 1, 1]

# Calculate F1-Score
score = f1_score(y_true, y_pred)

print(f"F1-Score: {score:.2f}")
# Output: F1-Score: 0.75
```

## 5. Summary Table: Which Metric to Trust?

| Scenario | Best Metric | Why? |
| --- | --- | --- |
| **Balanced Data** | **Accuracy** | Simple and representative. |
| **Spam Filter** | **Precision** | False Positives (real mail in spam) are very bad. |
| **Cancer Screen** | **Recall** | False Negatives (missing a sick patient) are fatal. |
| **Fraud Detection** | **F1-Score** | Need to catch thieves (Recall) without blocking everyone (Precision). |

## 6. Beyond Binary: Macro vs. Weighted F1

If you have more than two classes (multi-class classification), you'll see these options (a short example follows the list):

* **Macro F1:** Calculates F1 for each class and takes the unweighted average. Treats all classes as equal.
* **Weighted F1:** Calculates F1 for each class but weights them by the number of samples in that class.
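
A minimal sketch with made-up labels for a 3-class problem (0, 1, 2) shows how the `average` parameter of `f1_score` controls this choice:

```python
from sklearn.metrics import f1_score

# Imbalanced 3-class example: class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

print(f"Macro F1:    {f1_score(y_true, y_pred, average='macro'):.2f}")     # 0.73 -- every class counts equally
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.2f}")  # 0.78 -- the large class 0 pulls the score up
```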

## References

* **Scikit-Learn:** [F1 Score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
* **Towards Data Science:** [The F1 Score Paradox](https://towardsdatascience.com/the-f1-score-2236378a31)

---

**The F1-Score gives us a snapshot at a single threshold. But how do we evaluate a model's performance across ALL possible thresholds?**
