---
title: Image Segmentation
sidebar_label: Image Segmentation
description: "Going beyond bounding boxes: How to classify every single pixel in an image."
tags: [deep-learning, cnn, computer-vision, segmentation, u-net, mask-rcnn]
---

While [Image Classification](./image-classification) tells us **what** is in an image, and **Object Detection** tells us **where** it is, **Image Segmentation** provides a pixel-perfect understanding of the scene.

It is the process of partitioning a digital image into multiple segments (sets of pixels), turning the raw pixel grid into a representation that is more meaningful and easier to analyze.

## 1. Types of Segmentation

Not all segmentation tasks are the same. We generally categorize them into three levels of complexity:

### A. Semantic Segmentation
Every pixel is assigned a class label (e.g., "Road," "Sky," "Car"). However, it does **not** differentiate between multiple instances of the same class. Two cars parked next to each other will appear as a single connected "blob."

### B. Instance Segmentation
This goes a step further by detecting and delineating each distinct object of interest. If there are five people in a photo, instance segmentation will give each person a unique color/ID.

### C. Panoptic Segmentation
The "holy grail" of segmentation. It combines semantic and instance segmentation to provide a total understanding of the scene: identifying individual objects (cars, people) and background textures (sky, grass).
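
The semantic-versus-instance distinction can be made concrete with a toy example. The masks below are a hypothetical one-row "image" invented for illustration, not from any dataset:

```python
import numpy as np

# A 1x6 strip of pixels containing two cars side by side.
# Semantic segmentation: both cars share the same class ID (1 = "car").
semantic_mask = np.array([0, 1, 1, 1, 1, 0])  # 0 = background

# Instance segmentation: each car gets its own instance ID.
instance_mask = np.array([0, 1, 1, 2, 2, 0])  # 1 = car #1, 2 = car #2

# The semantic mask cannot tell the two cars apart...
print(np.unique(semantic_mask[semantic_mask > 0]))  # [1]
# ...but the instance mask can.
print(np.unique(instance_mask[instance_mask > 0]))  # [1 2]
```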

## 2. The Architecture: Encoder-Decoder (U-Net)

Traditional CNNs lose spatial resolution through pooling. To get back to an image output of the same size as the input, we use an **Encoder-Decoder** architecture.

1. **Encoder (The "What"):** A standard CNN that downsamples the image to extract high-level features.
2. **Bottleneck:** The compressed representation of the image.
3. **Decoder (The "Where"):** Uses **Transposed Convolutions** (upsampling) to recover the spatial dimensions.
4. **Skip Connections:** These are the "secret sauce" of the **U-Net** architecture. They pass high-resolution information from the encoder directly to the decoder to help refine the boundaries of the mask.
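
The four pieces above can be sketched as a tiny PyTorch module. This is a one-level "U" with illustrative layer sizes, not a full U-Net:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection (toy sizes)."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # Encoder: downsample
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # Decoder: upsample
        # 16 upsampled + 16 skip channels are concatenated before the 1x1 head
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        skip = self.enc(x)                  # high-resolution features
        x = self.bottleneck(self.pool(skip))
        x = self.up(x)                      # back to input resolution
        x = torch.cat([x, skip], dim=1)     # skip connection
        return self.head(x)                 # per-pixel class scores

model = TinyUNet()
out = model(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64]) - same H x W as the input
```

Note how the output keeps the input's spatial size: whatever the pooling removes, the transposed convolution and the skip connection restore.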

## 3. Metrics and Loss Functions for Segmentation

Because we are classifying every pixel, standard accuracy can be misleading (especially if 90% of the image is just background). We use specialized overlap measures instead:

* **Intersection over Union (IoU) / Jaccard Index:** Measures the overlap between the predicted mask and the ground truth.
* **Dice Coefficient:** Similar to IoU, it measures the similarity between two sets of data and is more robust to class imbalance. It is also used directly as a training objective in the form of the **Dice loss** ($1 - \text{Dice}$).

$$
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
$$
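
Both measures are a few lines of NumPy on binary masks. The 4×4 masks below are toy values invented for illustration:

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union

def dice(pred, target):
    """Dice coefficient: 2|A and B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    return 2 * inter / (pred.sum() + target.sum())

# Toy 4x4 masks: the prediction overlaps the target on 2 of 3 foreground pixels.
target = np.zeros((4, 4), dtype=bool)
target[0, 0:3] = True   # 3 foreground pixels
pred = np.zeros((4, 4), dtype=bool)
pred[0, 1:4] = True     # 3 foreground pixels, 2 of them overlapping

print(iou(pred, target))   # 2 / 4  = 0.5
print(dice(pred, target))  # 2*2 / 6 ~ 0.667
```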

## 4. Real-World Applications

* **Medical Imaging:** Identifying tumors or mapping organs in MRI and CT scans.
* **Self-Driving Cars:** Identifying the exact boundaries of lanes, sidewalks, and drivable space.
* **Satellite Imagery:** Mapping land use, deforestation, or urban development.
* **Portrait Mode:** Separating the person (subject) from the background to apply a "bokeh" blur effect.

## 5. Popular Models

| Model | Type | Best For |
| :--- | :--- | :--- |
| **U-Net** | Semantic | Medical imaging and biomedical research. |
| **Mask R-CNN** | Instance | Detecting objects and generating masks (e.g., counting individual cells). |
| **DeepLabV3+** | Semantic | State-of-the-art results using Atrous (Dilated) Convolutions. |
| **SegNet** | Semantic | Efficient scene understanding for autonomous driving. |

## 6. Implementation Sketch (PyTorch)

Using a pre-trained segmentation model from `torchvision`:

```python
import torch
from torchvision import models

# Load a pre-trained DeepLabV3 model
# (the older `pretrained=True` flag is deprecated in recent torchvision)
model = models.segmentation.deeplabv3_resnet101(weights="DEFAULT").eval()

# Input: (Batch, Channels, Height, Width)
dummy_input = torch.randn(1, 3, 224, 224)

# Output: a dictionary whose 'out' entry holds the pixel-wise class scores
with torch.no_grad():
    output = model(dummy_input)["out"]

print(f"Output shape: {output.shape}")
# Shape will be [1, 21, 224, 224] (for the 21 Pascal VOC classes)
```
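
To turn the raw class scores into a usable segmentation map, take the `argmax` over the class dimension. A self-contained sketch, using a random tensor as a stand-in for the real model output:

```python
import torch

# Stand-in for the model output above: (Batch, Classes, Height, Width)
scores = torch.randn(1, 21, 224, 224)

# Pick the highest-scoring class for every pixel
mask = scores.argmax(dim=1)  # shape: [1, 224, 224]

print(mask.shape)  # torch.Size([1, 224, 224]); each entry is a class ID in [0, 21)
```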

## References

* **ArXiv:** [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
* **Facebook Research:** [Mask R-CNN Paper](https://arxiv.org/abs/1703.06870)

---

**Segmentation provides a high level of detail, but it's computationally expensive. How do we make these models faster for real-time applications?**