Fine-Tuned Vision Support & Advanced RAG Metrics
This release introduces a major upgrade to the Multimodal RAG System, focusing on Domain Adaptation for the Vision Encoder and a comprehensive Evaluation Framework.
🔥 Key Features
- Fine-Tuned BLIP-2 Support: Implemented a QLoRA parameter-efficient fine-tuning (PEFT) pipeline for BLIP-2 on the Flickr30k dataset. The system now automatically detects and loads `peft` adapters for domain-specific captioning (a loading sketch follows this list).
- RAG Evaluation Triad:
  - Retrieval: GT Match Rate (Recall@K).
  - Perception: BLEU-4 and ROUGE-L metrics to evaluate caption quality against Ground Truth.
  - Reasoning: Implemented Unsupervised RAG Metrics (Faithfulness & Answer Relevance) using CLIP latent-space embeddings to audit Llama-3's outputs for hallucinations.
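As a rough illustration of the adapter auto-detection described above, the sketch below loads the base BLIP-2 model and wraps it with the fine-tuned adapter when the folder is present. The base checkpoint name and function structure are assumptions for illustration, not necessarily the exact code in the repository.

```python
import os
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import PeftModel

ADAPTER_DIR = "fine_tuned_blip2_adapter"   # expected in the project root
BASE_MODEL = "Salesforce/blip2-opt-2.7b"   # assumed base checkpoint

def load_captioner():
    """Load BLIP-2 and, if a fine-tuned PEFT adapter is present, attach it."""
    processor = Blip2Processor.from_pretrained(BASE_MODEL)
    model = Blip2ForConditionalGeneration.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    # Auto-detect the domain-adapted adapter produced by the training notebook.
    if os.path.isdir(ADAPTER_DIR):
        model = PeftModel.from_pretrained(model, ADAPTER_DIR)
    return processor, model
```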
🛠 Technical Improvements
- Optimized Training Loop: Added Gradient Accumulation logic to enable fine-tuning on consumer GPUs (8 GB VRAM); a minimal pattern sketch follows this list.
- Smart UI Updates: The Streamlit interface now displays the academic metrics in real time with visual status indicators (🔴🟡🟢).
- Auto-Dependency: Added automatic installation of the metric libraries (`nltk`, `rouge-score`).
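The gradient-accumulation pattern referenced above can be sketched as follows; the `accum_steps` value, optimizer, and hyperparameters are illustrative placeholders, and `model` / `dataloader` are assumed to be the PEFT-wrapped BLIP-2 and a Flickr30k `DataLoader`, not objects defined here.

```python
import torch

def train_with_grad_accum(model, dataloader, accum_steps=8, lr=1e-4, epochs=1):
    """Fine-tune with gradient accumulation so an 8 GB GPU can emulate a larger batch."""
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        for step, batch in enumerate(dataloader):
            # Scale the loss so the accumulated gradients match one large-batch update.
            loss = model(**batch).loss / accum_steps
            loss.backward()
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
```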
📦 How to Use
- Run the training notebook (`FinalProject_Multimodal_RAG.ipynb`) to generate the adapter.
- Ensure the `fine_tuned_blip2_adapter` folder is in the root directory.
- Run `streamlit run app.py`.
This milestone marks the completion of the system's evaluation methodology and domain adaptation phase.
🧪 Methodology Updates
- Domain Adaptation:
  - Implemented LoRA (Low-Rank Adaptation) fine-tuning on the BLIP-2 OPT-2.7b model.
  - Objective: to align visual feature extraction with the specific linguistic style of the Flickr30k dataset.
- Quantitative Evaluation Framework (a computation sketch follows this list):
  - Lexical Metrics: Integrated BLEU-4 and ROUGE-L to measure the syntactic alignment of generated captions.
  - Semantic Metrics: Utilized CLIP-based cosine similarity to measure:
    - Answer Relevance: semantic similarity between the user query and the generated answer.
    - Faithfulness: semantic similarity between the visual evidence and the generated answer.
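For reference, here is a minimal sketch of how these lexical and semantic metrics can be computed with `nltk`, `rouge-score`, and a CLIP checkpoint from `transformers`. The helper names and the `openai/clip-vit-base-patch32` checkpoint are assumptions and may differ from the notebook's implementation.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from transformers import CLIPModel, CLIPProcessor

_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def lexical_scores(reference: str, candidate: str) -> dict:
    """BLEU-4 and ROUGE-L of a generated caption against its ground-truth caption."""
    bleu4 = sentence_bleu(
        [reference.split()], candidate.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    rouge_l = _rouge.score(reference, candidate)["rougeL"].fmeasure
    return {"BLEU-4": bleu4, "ROUGE-L": rouge_l}

@torch.no_grad()
def answer_relevance(query: str, answer: str) -> float:
    """Cosine similarity of query and answer in CLIP's text latent space."""
    emb = _clip.get_text_features(
        **_clip_proc(text=[query, answer], return_tensors="pt",
                     padding=True, truncation=True))
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])

@torch.no_grad()
def faithfulness(image, answer: str) -> float:
    """Cosine similarity between the visual evidence and the generated answer."""
    img = _clip.get_image_features(**_clip_proc(images=image, return_tensors="pt"))
    txt = _clip.get_text_features(
        **_clip_proc(text=[answer], return_tensors="pt",
                     padding=True, truncation=True))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```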
📊 Results
Initial benchmarks show significant improvement in caption relevance after fine-tuning, with the RAG module demonstrating high faithfulness scores (>0.5) in auditing tasks.
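As a closing illustration, a hypothetical helper of the kind that could map a metric score to the 🔴🟡🟢 indicators shown in the UI; the 0.3 / 0.5 thresholds are illustrative and not confirmed from the app code.

```python
def status_indicator(score: float, warn: float = 0.3, good: float = 0.5) -> str:
    """Map a [0, 1] metric score to a traffic-light emoji for the Streamlit UI.
    The 0.3 / 0.5 thresholds are illustrative defaults, not the app's exact values."""
    if score >= good:
        return "🟢"
    if score >= warn:
        return "🟡"
    return "🔴"
```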