Final Project: Advanced Computer Vision (2025)
A professional, PhD-level implementation of a multimodal Retrieval-Augmented Generation (RAG) pipeline, designed to "see" and "reason" by grounding Large Language Models in visual data.
The system operates in two distinct phases: 1. One-Time Indexing (offline) and 2. Real-Time Inference (online).
This process is run once to build the vector database. All images from the Flickr30k dataset are converted into high-dimensional vectors using the CLIP encoder and stored in a FAISS index for rapid lookup.
```mermaid
graph TD
    subgraph "Indexing Pipeline (Offline)"
        A[Input: Flickr30k Image Folder] --> B[Encoder: CLIP ViT-L/14];
        B --> C{Generate 768-dim Vectors};
        C --> D[Vector DB: FAISS Index];
        D --> E[Save: flickr30k_large.index];
        F(Image Filenames) --> G[Save: metadata_large.json];
    end
```
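For concreteness, here is a minimal sketch of what the offline indexing cells could look like: each image is embedded with CLIP ViT-L/14, the 768-dim vector is L2-normalized so that inner product in `IndexFlatIP` equals cosine similarity, and both output files are written. The one-image-at-a-time loop and variable names are illustrative assumptions; the notebook is the authoritative implementation.

```python
import json, os
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

IMAGES_DIR = "../Dataset/Images"
filenames = sorted(f for f in os.listdir(IMAGES_DIR) if f.endswith(".jpg"))

index = faiss.IndexFlatIP(768)  # inner product over normalized vectors = cosine similarity
for name in filenames:
    image = Image.open(os.path.join(IMAGES_DIR, name)).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model.get_image_features(**inputs)   # shape (1, 768) for ViT-L/14
    emb = emb / emb.norm(dim=-1, keepdim=True)     # L2-normalize for IndexFlatIP
    index.add(emb.cpu().numpy())

faiss.write_index(index, "flickr30k_large.index")
with open("metadata_large.json", "w") as f:
    json.dump(filenames, f)                        # index ID -> filename mapping
```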
This is the live workflow executed when a user submits a query. The system retrieves relevant images, translates them into text context, and generates a final answer.
```mermaid
graph LR
    subgraph "Inference Pipeline (Real-Time RAG)"
        direction LR
        U["User Text Query"] --> R1["Encoder: CLIP ViT-L/14"]
        R1 --> R2["Query Vector"]
        R2 --> R3["Search FAISS Index"]
        R3 --> G1["Retrieve Top-K Image Paths"]
        G1 --> G2["Visual Bridge: BLIP-2"]
        G2 --> G3["Generated Visual Context (Text)"]
        U --> L1["Prompt Template"]
        G3 --> L1
        L1 --> L2["LLM: Llama 3"]
        L2 --> O["Final Generated Answer"]
        O --> UI["Display in Streamlit UI"]
    end
```
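The end-to-end inference path can be sketched as below, mirroring the diagram: CLIP text encoding, FAISS search, BLIP-2 captioning as the visual bridge, and Llama 3 via the `ollama` Python client. The function name `answer`, the prompt wording, and the generation settings are assumptions for illustration, not the app's actual code.

```python
import json
import faiss
import ollama  # pip install ollama; assumes `ollama serve` is running locally
import torch
from PIL import Image
from transformers import (Blip2ForConditionalGeneration, Blip2Processor,
                          CLIPModel, CLIPProcessor)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to(device)  # float16 assumes a CUDA GPU

index = faiss.read_index("flickr30k_large.index")
filenames = json.load(open("metadata_large.json"))

def answer(query: str, k: int = 5) -> str:
    # 1. Embed the text query with the same CLIP encoder used at index time.
    inputs = clip_proc(text=[query], return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        q = clip.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    # 2. Retrieve the top-k most similar images from the FAISS index.
    _, ids = index.search(q.cpu().numpy(), k)
    # 3. Visual bridge: caption each retrieved image with BLIP-2.
    captions = []
    for i in ids[0]:
        image = Image.open(f"../Dataset/Images/{filenames[i]}").convert("RGB")
        pix = blip_proc(images=image, return_tensors="pt").to(device, torch.float16)
        out = blip.generate(**pix, max_new_tokens=40)
        captions.append(blip_proc.decode(out[0], skip_special_tokens=True).strip())
    # 4. Inject the visual context into the prompt and let Llama 3 reason over it.
    prompt = ("Visual context:\n" + "\n".join(f"- {c}" for c in captions)
              + f"\n\nQuestion: {query}\nAnswer:")
    resp = ollama.generate(model="llama3", prompt=prompt)
    return resp["response"]
```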
This system uses a retrieve-then-generate architecture over both modalities (text and images). The workflow is divided into two main stages: image retrieval and a generative/reasoning component.
```mermaid
graph TD
    subgraph "I. Image Retrieval (Search)"
        A[User Input: Text Query] -->|Encode| B(CLIP Text Encoder)
        DB[(Flickr30k Dataset)] -->|Pre-compute| C(CLIP Image Encoder)
        C -->|Vectors| D{FAISS Vector DB}
        B -->|Search Vector| D
        D -->|Top-K Results| E[Retrieval: 5 Relevant Images]
    end
    subgraph "II. Generative & Reasoning"
        E -->|Input Image| F[BLIP-2 Model]
        F -->|Image Captioning| G[Context: Visual Text Descriptions]
        A -->|Combined Prompt| H(Llama-3 Generator)
        G -->|Context Injection| H
        H -->|Reasoning| I[Final Output: AI Answer]
    end
```
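The "Combined Prompt" / context-injection step amounts to string templating. A hypothetical template follows (the project's exact wording may differ; the `captions` and `query` values here are made-up examples):

```python
# Hypothetical template and variable names, shown for illustration only.
PROMPT_TEMPLATE = """You are a visual assistant. Answer using only the visual context
below, which was produced by captioning the retrieved images.

Visual context:
{context}

Question: {question}
Answer:"""

captions = ["a dog leaps over a fallen log", "two dogs run through tall grass"]  # example BLIP-2 outputs
query = "What are the dogs doing?"

prompt = PROMPT_TEMPLATE.format(
    context="\n".join(f"- {c}" for c in captions),
    question=query,
)
print(prompt)
```

Keeping the retrieved captions as an explicit bulleted context block makes it easy for the LLM to ground its answer and for the UI to show where that answer came from.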
This project uses a specific set of state-of-the-art (SOTA) models and libraries. The table below details the choices made for each component of the pipeline, as per the final project requirements.
| Component | Technology Chosen | Alternative(s) Considered | Status |
|---|---|---|---|
| Dataset | Flickr30k (`captions.txt`) | COCO, Fashion-MNIST | ✅ |
| Embedding | `openai/clip-vit-large-patch14` | `clip-vit-base-patch32`, ResNet50 | ✅ |
| Vector DB | FAISS (`IndexFlatIP`) | Milvus, ChromaDB | ✅ |
| Visual Bridge | `Salesforce/blip2-opt-2.7b` | `blip-image-captioning-large` | ✅ |
| Reasoning LLM | Llama 3 (via Ollama) | GPT-4, Flan-T5 | ✅ |
| Web UI | Streamlit | Gradio, Hugging Face Spaces | ✅ |
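A plausible minimal `requirements.txt` for this stack is sketched below; treat it as an assumption and defer to the file shipped in the repository. Ollama itself is installed separately, outside pip.

```text
torch>=2.0
transformers
faiss-gpu       # or faiss-cpu on machines without CUDA
pillow
streamlit
ollama          # Python client for the local Ollama server
```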
This project has a two-step execution flow: first, you must build the database index, then you can run the interactive application.
- Python 3.10+
- PyTorch 2.0+
- NVIDIA GPU with CUDA 11.8+ (for GPU-accelerated inference)
- Ollama installed and running locally (a quick sanity check follows this list)
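The snippet below checks the assumptions above: a CUDA build of PyTorch and an Ollama server on its default port 11434.

```python
import torch
import urllib.request

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Verify the local Ollama server responds (requires `ollama serve`).
with urllib.request.urlopen("http://localhost:11434") as r:
    print("Ollama:", r.read().decode())  # prints "Ollama is running"
```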
This step populates the FAISS vector database. You only need to do this once.
- Open the Jupyter Notebook (e.g., `FinalProject_Multimodal_RAG.ipynb`).
- Ensure all paths in the Config cells are correct for your system (e.g., `IMAGES_DIR = "../Dataset/Images"`).
- Execute all cells from top to bottom.

This will create two files in your root directory:

- `flickr30k_large.index` (the FAISS database)
- `metadata_large.json` (the mapping of index IDs to filenames)
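As a quick consistency check after the notebook finishes (a small assumed snippet, not part of the notebook), the index size should equal the number of filenames in the metadata:

```python
import json
import faiss

index = faiss.read_index("flickr30k_large.index")
filenames = json.load(open("metadata_large.json"))

# Every vector in the index must map back to exactly one filename.
assert index.ntotal == len(filenames), (index.ntotal, len(filenames))
print(f"Indexed {index.ntotal} images at dimension {index.d}")
```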
This step runs the interactive demo application.
- Ensure Ollama is running: `ollama serve` (leave this terminal running in the background).
- Open a new terminal and navigate to the User Interface directory: `cd User_Interface`
- Run the Streamlit app: `streamlit run app.py`
- Open the provided `http://localhost:8501` link in your web browser.
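For orientation, a hypothetical skeleton of what `User_Interface/app.py` might contain; the real file in the repository is the source of truth, and the `rag_pipeline` module name is assumed here as a wrapper around the inference sketch shown earlier.

```python
import streamlit as st

from rag_pipeline import answer  # hypothetical module wrapping the earlier RAG sketch

st.title("Multimodal RAG over Flickr30k")
query = st.text_input("Ask about the image collection:")

if query:
    with st.spinner("Retrieving images and reasoning..."):
        result = answer(query, k=5)  # retrieve top-5 images, caption, then generate
    st.write(result)
```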
```text
.
├── Dataset/
│   ├── Images/
│   │   ├── 1000092795.jpg
│   │   └── ... (all other .jpg files)
│   └── captions.txt
├── User_Interface/
│   └── app.py
├── FinalProject_Multimodal_RAG.ipynb   <-- (Run this first)
├── flickr30k_large.index               <-- (Generated by Notebook)
├── metadata_large.json                 <-- (Generated by Notebook)
├── README.md
├── requirements.txt
└── structure.txt
```
This project is licensed under the MIT License. It can be freely used for academic, research, and commercial purposes with proper attribution.
This repository is an academic submission for the Advanced Computer Vision course. For suggestions, critiques, or future collaboration, please contact the project team.
Disclaimer: This system is built for academic and research purposes. All visual data is sourced from the public Flickr30k dataset. Generated responses are illustrative, and their accuracy is not guaranteed for real-world applications.