Commit b76f85d

Merge pull request #177 from codeharborhub/dev-1
added docs for ml
2 parents 2662159 + e6565f9 commit b76f85d

7 files changed (+524, -0 lines changed)
Lines changed: 85 additions & 0 deletions
---
title: Image Classification
sidebar_label: Image Classification
description: "How to train neural networks to categorize images into predefined classes using CNNs."
tags: [deep-learning, cnn, image-classification, computer-vision, transfer-learning]
---

**Image Classification** is the task of assigning a label or a category to an entire input image. It is the most fundamental task in Computer Vision and serves as the building block for more complex tasks like Object Detection and Image Segmentation.

## 1. The Workflow: From Pixels to Labels

An image classification model follows a linear pipeline where spatial information is gradually transformed into a semantic category (see the sketch after this list).

1. **Input Layer:** Raw pixel data (e.g., $224 \times 224 \times 3$ for an RGB image).
2. **Feature Extraction:** Multiple [Convolution](../cnn/convolution) and [Pooling](../cnn/pooling) layers identify edges, shapes, and complex patterns.
3. **Flattening:** The 2D feature maps are converted into a 1D vector.
4. **Classification:** [Fully Connected Layers](https://www.youtube.com/watch?v=rxSmwM7z0_4) act as a traditional MLP to interpret the features.
5. **Output Layer:** Uses a **Softmax** function to provide probabilities for each class.
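As a sketch of this pipeline, here is a minimal Keras model for a hypothetical 10-class problem; the layer sizes are illustrative, not a tuned architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 10  # hypothetical number of categories

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),                 # 1. Input: raw RGB pixels
    layers.Conv2D(32, (3, 3), activation="relu"),      # 2. Feature extraction
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                  # 3. Flattening: 2D maps -> 1D vector
    layers.Dense(128, activation="relu"),              # 4. Fully connected classifier
    layers.Dense(num_classes, activation="softmax"),   # 5. Output: class probabilities
])

model.summary()
```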
## 2. Binary vs. Multi-Class Classification

| Type | Output Neurons | Activation | Loss Function |
| :--- | :--- | :--- | :--- |
| **Binary** (Cat or Not) | 1 | Sigmoid | Binary Cross-Entropy |
| **Multi-Class** (Cat, Dog, Bird) | $N$ (Number of classes) | Softmax | Categorical Cross-Entropy |
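As a minimal sketch of how the two rows of this table map onto Keras code (the 128-dimensional feature input and the class count are placeholder values):

```python
from tensorflow.keras import layers, models

num_classes = 5  # hypothetical number of classes for the multi-class case

# Binary: 1 sigmoid neuron + binary cross-entropy
binary_model = models.Sequential([
    layers.Input(shape=(128,)),              # e.g., a flattened feature vector
    layers.Dense(1, activation="sigmoid"),
])
binary_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Multi-class: N softmax neurons + categorical cross-entropy (one-hot labels)
multi_class_model = models.Sequential([
    layers.Input(shape=(128,)),
    layers.Dense(num_classes, activation="softmax"),
])
multi_class_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```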
## 3. Transfer Learning: Standing on the Shoulders of Giants

Training a CNN from scratch requires thousands of images and massive computing power. Instead, most developers use **Transfer Learning**.

This involves taking a model pre-trained on a massive dataset (like **ImageNet**, which has 1.4 million images across 1,000 classes) and repurposing it for a specific task.

* **Freezing:** We keep the "Feature Extractor" weights fixed because they already know how to "see" shapes.
* **Fine-Tuning:** We replace the final classification head with one for our specific labels and train it (optionally unfreezing a few of the top layers later, with a very low learning rate).

## 4. Implementation with Keras (Transfer Learning)

This example shows how to use the **MobileNetV2** architecture to classify custom images.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# 1. Load a pre-trained model without the top (classification) layer
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights='imagenet'
)

# 2. Freeze the base model
base_model.trainable = False

# 3. Add custom classification head
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid')  # Binary: e.g., 'Mask' or 'No Mask'
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```
## 5. Challenges in Classification

1. **Intra-class Variation:** A "Chair" can look very different depending on its design.
2. **Scale Variation:** An object may occupy the entire frame or just a tiny corner.
3. **Viewpoint Variation:** A model must recognize a car from the front, side, and top.
4. **Occlusion:** Only part of the object might be visible (e.g., a dog behind a fence).

## 6. Popular Architectures for Classification

* **ResNet (Residual Networks):** Introduced "Skip Connections" to allow training of very deep networks (100+ layers).
* **VGG-16:** A very deep but simple architecture built from stacked $3 \times 3$ convolutions and max-pooling layers.
* **Inception (GoogLeNet):** Uses different kernel sizes in parallel within the same layer to capture features at different scales.
* **EfficientNet:** Optimized for the best balance between accuracy and computational cost.

## References

* **ImageNet:** [The Benchmark Dataset](https://www.image-net.org/)
* **TensorFlow Tutorials:** [Image Classification for Beginners](https://www.tensorflow.org/tutorials/images/classification)
* **PyTorch Tutorials:** [Transfer Learning for Computer Vision](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html)

---

**Classifying an entire image is great, but what if you need to know *where* the object is or if there are multiple objects?**
Lines changed: 91 additions & 0 deletions
---
title: Image Segmentation
sidebar_label: Image Segmentation
description: "Going beyond bounding boxes: How to classify every single pixel in an image."
tags: [deep-learning, cnn, computer-vision, segmentation, u-net, mask-rcnn]
---

While [Image Classification](./image-classification) tells us **what** is in an image, and **Object Detection** tells us **where** it is, **Image Segmentation** provides a pixel-perfect understanding of the scene.

It is the process of partitioning a digital image into multiple segments (sets of pixels) so that its representation becomes more meaningful and easier to analyze.

## 1. Types of Segmentation

Not all segmentation tasks are the same. We generally categorize them into three levels of complexity:

### A. Semantic Segmentation
Every pixel is assigned a class label (e.g., "Road," "Sky," "Car"). However, it does **not** differentiate between multiple instances of the same class. Two cars parked next to each other will appear as a single connected "blob."

### B. Instance Segmentation
This goes a step further by detecting and delineating each distinct object of interest. If there are five people in a photo, instance segmentation will give each person a unique color/ID.

### C. Panoptic Segmentation
The "holy grail" of segmentation. It combines semantic and instance segmentation to provide a total understanding of the scene—identifying individual objects (cars, people) and background textures (sky, grass).

## 2. The Architecture: Encoder-Decoder (U-Net)

Traditional CNNs lose spatial resolution through pooling. To get back to an image output of the same size as the input, we use an **Encoder-Decoder** architecture (see the sketch after this list).

1. **Encoder (The "What"):** A standard CNN that downsamples the image to extract high-level features.
2. **Bottleneck:** The compressed representation of the image.
3. **Decoder (The "Where"):** Uses **Transposed Convolutions** (Upsampling) to recover the spatial dimensions.
4. **Skip Connections:** These are the "secret sauce" of the **U-Net** architecture. They pass high-resolution information from the encoder directly to the decoder to help refine the boundaries of the mask.
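A minimal PyTorch sketch of this encoder-decoder idea is shown below; the `TinyUNet` name, channel counts, two-class output, and single skip connection are illustrative simplifications, not the real U-Net.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy encoder-decoder with one skip connection (a sketch, not the full U-Net)."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Encoder: extracts features ("what")
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        # Bottleneck: compressed representation
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder: transposed convolution recovers the spatial size ("where")
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # After concatenating the skip connection (16 + 16 channels), predict per-pixel classes
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, num_classes, 1))

    def forward(self, x):
        skip = self.enc(x)                    # high-resolution features for the skip connection
        x = self.bottleneck(self.pool(skip))  # downsampled, deeper features
        x = self.up(x)                        # upsample back to the input resolution
        x = torch.cat([x, skip], dim=1)       # skip connection: concatenate encoder features
        return self.dec(x)                    # per-pixel class scores

model = TinyUNet(num_classes=2)
out = model(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```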
## 3. Loss Functions for Segmentation

Because we are classifying every pixel, standard accuracy can be misleading (especially if 90% of the image is just background). We therefore use specialized overlap-based measures, both for evaluation and, in differentiable form, as loss functions:

* **Intersection over Union (IoU) / Jaccard Index:** Measures the overlap between the predicted mask and the ground truth.
* **Dice Coefficient:** Similar to IoU, it measures the similarity between two sets of data and is more robust to class imbalance.

$$
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
$$
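A minimal sketch of computing both measures for binary masks; the `iou_and_dice` helper and the toy 4×4 masks are hypothetical:

```python
import torch

def iou_and_dice(pred_mask: torch.Tensor, true_mask: torch.Tensor, eps: float = 1e-7):
    """IoU and Dice for binary masks (values 0 or 1) of the same shape."""
    pred = pred_mask.float()
    true = true_mask.float()
    intersection = (pred * true).sum()               # area of overlap
    union = pred.sum() + true.sum() - intersection   # area of union
    iou = intersection / (union + eps)
    dice = 2 * intersection / (pred.sum() + true.sum() + eps)
    return iou.item(), dice.item()

# Hypothetical example: two overlapping 4x4 masks
pred = torch.tensor([[0, 1, 1, 0]] * 4)
true = torch.tensor([[0, 0, 1, 1]] * 4)
print(iou_and_dice(pred, true))  # IoU = 4/12 ≈ 0.33, Dice = 8/16 = 0.5
```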
## 4. Real-World Applications

* **Medical Imaging:** Identifying tumors or mapping organs in MRI and CT scans.
* **Self-Driving Cars:** Identifying the exact boundaries of lanes, sidewalks, and drivable space.
* **Satellite Imagery:** Mapping land use, deforestation, or urban development.
* **Portrait Mode:** Separating the person (subject) from the background to apply a "bokeh" blur effect.

## 5. Popular Models

| Model | Type | Best For |
| :--- | :--- | :--- |
| **U-Net** | Semantic | Medical imaging and biomedical research. |
| **Mask R-CNN** | Instance | Detecting objects and generating masks (e.g., counting individual cells). |
| **DeepLabV3+** | Semantic | State-of-the-art results using Atrous (Dilated) Convolutions. |
| **SegNet** | Semantic | Efficient scene understanding for autonomous driving. |

## 6. Implementation Sketch (PyTorch)

Using a pre-trained segmentation model from `torchvision`:

```python
import torch
from torchvision import models

# Load a pre-trained DeepLabV3 model
model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()

# Input: (Batch, Channels, Height, Width)
dummy_input = torch.randn(1, 3, 224, 224)

# Output: Returns a dictionary containing 'out' - the pixel-wise class predictions
with torch.no_grad():
    output = model(dummy_input)['out']

print(f"Output shape: {output.shape}")
# Shape will be [1, 21, 224, 224] (for 21 Pascal VOC classes)
```

## References

* **ArXiv:** [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
* **Facebook Research:** [Mask R-CNN Paper](https://arxiv.org/abs/1703.06870)

---

**Segmentation provides a high level of detail, but it's computationally expensive. How do we make these models faster for real-time applications?**
Lines changed: 88 additions & 0 deletions
---
title: The Convolution Operation
sidebar_label: Convolution
description: "Understanding kernels, filters, and how feature maps are created in Convolutional Neural Networks."
tags: [deep-learning, cnn, computer-vision, convolution, kernels]
---

**Convolution** is the heart of Computer Vision. Unlike standard neural networks that treat every pixel as an independent feature, convolution allows the network to preserve the **spatial relationship** between pixels, enabling it to recognize shapes, edges, and textures.

## 1. What is a Convolution?

At its simplest, a convolution is a mathematical operation where a small matrix (called a **Kernel** or **Filter**) slides across an input image and performs element-wise multiplication with the part of the input it is currently hovering over.

The results are summed up to create a single value in a new matrix called a **Feature Map** (or Activation Map).

## 2. The Anatomy of a Kernel

A kernel is a grid of weights. Different weights allow the kernel to detect different types of features (a small demonstration follows the list):

* **Vertical Edge Detector:** A kernel with high values on the left and low values on the right.
* **Horizontal Edge Detector:** A kernel with high values on the top and low values on the bottom.
* **Sharpening Kernel:** A kernel that emphasizes the central pixel relative to its neighbors.
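As a sketch of the vertical edge detector described above, here is a hand-crafted kernel applied with PyTorch's `torch.nn.functional.conv2d`; the toy image and the exact kernel values are illustrative:

```python
import torch
import torch.nn.functional as F

# Hand-crafted vertical edge detector: high values on the left, low (negative) on the right
kernel = torch.tensor([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]]).reshape(1, 1, 3, 3)  # (out_ch, in_ch, H, W)

# Toy grayscale image: dark on the left half, bright on the right half
image = torch.zeros(1, 1, 6, 6)
image[:, :, :, 3:] = 1.0

feature_map = F.conv2d(image, kernel)  # 'valid' convolution, stride 1
print(feature_map)  # strong (negative) responses along the vertical boundary, zeros elsewhere
```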
## 3. Key Hyperparameters

When performing a convolution, there are three main settings that determine the size and behavior of the output:

### A. Stride
Stride is the number of pixels the kernel moves at a time.
* **Stride 1:** Moves one pixel at a time (larger output).
* **Stride 2:** Jumps two pixels at a time (smaller, downsampled output).

### B. Padding
Since the kernel cannot "hang off" the edge of an image, the pixels on the borders contribute to fewer output values than the pixels in the center. To fix this, we add a border of zeros around the image.
* **Valid Padding:** No padding (output is smaller than input).
* **Same Padding:** Zeros are added so the output is the same size as the input.

### C. Depth (Channels)
If you are processing a color image, your input has 3 channels (Red, Green, Blue). Your kernel will also have a depth of 3 to match.

## 4. The Math of Output Size

To calculate the dimensions of the resulting Feature Map, we use the following formula:

$$
O = \frac{W - K + 2P}{S} + 1
$$

* **$W$**: Input width/height
* **$K$**: Kernel size
* **$P$**: Padding
* **$S$**: Stride
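As a quick worked example with a $32 \times 32$ input and a $3 \times 3$ kernel (the same values used in the PyTorch snippet below):

$$
\text{No padding, stride 1: } O = \frac{32 - 3 + 0}{1} + 1 = 30, \qquad \text{Padding 1, stride 1: } O = \frac{32 - 3 + 2}{1} + 1 = 32
$$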
## 5. Why Convolution?

1. **Sparse Connectivity:** Instead of every input pixel connecting to every output neuron, neurons only look at a small "receptive field." This massively reduces the number of parameters.
2. **Parameter Sharing:** The same kernel (weights) is used across the entire image. If a filter learns to detect a "circle," it can find that circle in the top-left corner or the bottom-right corner using the same weights.

## 6. Implementation with PyTorch

```python
import torch
import torch.nn as nn

# Create a sample input: (Batch, Channels, Height, Width)
input_image = torch.randn(1, 3, 32, 32)

# Define a Convolutional Layer
# 3 input channels (RGB), 16 output filters, 3x3 kernel size
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# Apply convolution
output = conv_layer(input_image)

print(f"Input shape: {input_image.shape}")
print(f"Output shape: {output.shape}")
# Output: [1, 16, 32, 32] because of 'Same' padding
```

## References

* **Stanford CS231n:** [Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/convolutional-networks/)
* **Setosa.io:** [Image Kernels Visualizer](http://setosa.io/ev/image-kernels/)

---

**Convolution extracts the features, but the resulting maps are often too large and computationally heavy. How do we shrink them down without losing the important information?**
Lines changed: 91 additions & 0 deletions
---
title: Padding in CNNs
sidebar_label: Padding
description: "How padding prevents data loss at the edges and controls the output size of convolutional layers."
tags: [deep-learning, cnn, computer-vision, padding, zero-padding]
---

When we slide a kernel over an image in a [Convolutional Layer](./convolution), two problems occur:
1. **Shrinking Output:** The image gets smaller with every layer.
2. **Loss of Border Info:** Pixels at the corners are only "touched" by the kernel once, whereas central pixels are processed many times.

**Padding** solves both by adding a border of extra pixels (usually zeros) around the input image.

## 1. The Border Problem

Imagine a $3 \times 3$ kernel sliding over a $5 \times 5$ image. The center pixel is involved in 9 different multiplications, but the corner pixel is only involved in 1. This means the network effectively "ignores" information at the edges of your images.

## 2. Types of Padding

There are two primary ways to handle padding in deep learning frameworks:

### A. Valid Padding (No Padding)
In "Valid" padding, we add zero extra pixels. The kernel stays strictly within the boundaries of the original image.
* **Result:** The output is always smaller than the input.
* **Formula:** $O = (W - K + 1)$ (for a stride of 1)

### B. Same Padding (Zero Padding)
In "Same" padding, we add enough pixels (usually zeros) around the edges so that the output size is **exactly the same** as the input size (assuming a stride of 1).
* **Result:** Spatial dimensions are preserved.
* **Common use:** Deep architectures where we want to stack dozens of layers without the image disappearing.

## 3. Mathematical Formula with Padding

When we include padding ($P$), the formula for the output dimension becomes:

$$
O = \frac{W - K + 2P}{S} + 1
$$

* **$W$**: Input dimension
* **$K$**: Kernel size
* **$P$**: Padding amount (number of pixels added to one side)
* **$S$**: Stride

:::note
For "Same" padding with a stride of 1, the required padding is usually $P = \frac{K-1}{2}$. This is why kernel sizes are almost always odd numbers ($3 \times 3, 5 \times 5$).
:::
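For example (with illustrative numbers), a $5 \times 5$ kernel needs $P = \frac{5-1}{2} = 2$ to preserve the input size, so a $224 \times 224$ input stays $224 \times 224$:

$$
O = \frac{224 - 5 + 2 \times 2}{1} + 1 = 224
$$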
## 4. Other Padding Techniques

While **Zero Padding** is the standard, other methods exist for specific cases (see the sketch after this list):
* **Reflection Padding:** Mirrors the pixels from inside the image. This is often used in style transfer or image generation to prevent "border artifacts."
* **Constant Padding:** Fills the border with a specific constant value (e.g., gray or white).
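A minimal sketch comparing these padding modes with PyTorch's `torch.nn.functional.pad`; the toy tensor and the 0.5 "gray" value are arbitrary:

```python
import torch
import torch.nn.functional as F

x = torch.arange(9.).reshape(1, 1, 3, 3)  # toy single-channel 3x3 "image"

# pad = (left, right, top, bottom), applied to the last two dimensions
zero_pad    = F.pad(x, pad=(1, 1, 1, 1), mode="constant", value=0.0)  # standard zero padding
reflect_pad = F.pad(x, pad=(1, 1, 1, 1), mode="reflect")              # mirrors interior pixels
gray_pad    = F.pad(x, pad=(1, 1, 1, 1), mode="constant", value=0.5)  # constant (non-zero) padding

print(zero_pad.shape)  # torch.Size([1, 1, 5, 5]) -- same for all three
```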
## 5. Implementation

### TensorFlow / Keras
Keras simplifies this by using strings:

```python
from tensorflow.keras.layers import Conv2D

# Output size will be smaller than input
valid_conv = Conv2D(32, (3, 3), padding='valid')

# Output size will be identical to input
same_conv = Conv2D(32, (3, 3), padding='same')
```

### PyTorch

In PyTorch, you specify the exact number of pixels:

```python
import torch.nn as nn

# For a 3x3 kernel, padding=1 gives 'same' output
# (3-1)/2 = 1
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
```

## References

* **CS231n:** [Spatial Arrangement of Layers](https://cs231n.github.io/convolutional-networks/#spatial)
* **PyTorch Docs:** [Conv2d Layer Specifications](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)

---

**Padding keeps the image size consistent, but what if we want to move across the image faster or purposely reduce the size?**
