Fine-Tuned Vision Support & Advanced RAG Metrics

This release introduces a major upgrade to the Multimodal RAG System, focusing on Domain Adaptation for the Vision Encoder and a comprehensive Evaluation Framework.

🔥 Key Features

  • Fine-Tuned BLIP-2 Support: Implemented a QLoRA parameter-efficient fine-tuning (PEFT) pipeline for BLIP-2 on the Flickr30k dataset. The system now automatically detects and loads PEFT adapters for domain-specific captioning (a loading sketch follows this list).
  • RAG Evaluation Triad:
    • Retrieval: Ground-Truth (GT) Match Rate (Recall@K).
    • Perception: BLEU-4 and ROUGE-L metrics to evaluate caption quality against Ground Truth.
    • Reasoning: Implemented unsupervised RAG metrics (Faithfulness & Answer Relevance) using CLIP latent-space embeddings to audit Llama-3's answers for hallucinations.
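
A minimal sketch of the adapter auto-detection step, assuming the adapter was saved with PEFT's `save_pretrained` into the `fine_tuned_blip2_adapter` folder and that `Salesforce/blip2-opt-2.7b` is the base checkpoint; the actual loading code in `app.py` may differ:

```python
import os

import torch
from peft import PeftModel
from transformers import Blip2ForConditionalGeneration, Blip2Processor

ADAPTER_DIR = "fine_tuned_blip2_adapter"  # folder expected in the repo root
BASE_MODEL = "Salesforce/blip2-opt-2.7b"  # assumed base checkpoint

processor = Blip2Processor.from_pretrained(BASE_MODEL)
model = Blip2ForConditionalGeneration.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Auto-detect the fine-tuned adapter: load it if present,
# otherwise fall back to the base (zero-shot) captioner.
if os.path.isdir(ADAPTER_DIR):
    model = PeftModel.from_pretrained(model, ADAPTER_DIR)
    print("Loaded fine-tuned BLIP-2 adapter.")
else:
    print("No adapter found; using the base BLIP-2 model.")
```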

🛠 Technical Improvements

  • Optimized Training Loop: Added gradient-accumulation logic to enable fine-tuning on consumer GPUs (8 GB VRAM); a training sketch follows this list.
  • Smart UI Updates: The Streamlit interface now displays the evaluation metrics in real time with visual status indicators (🔴🟡🟢).
  • Auto-Dependency: Added automatic installation for metric libraries (nltk, rouge-score).
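
A minimal sketch of the QLoRA setup and the accumulation loop, assuming a 4-bit quantized `Salesforce/blip2-opt-2.7b` base; the LoRA rank, target modules, `accum_steps`, and `train_loader` are illustrative placeholders rather than the exact values used in FinalProject_Multimodal_RAG.ipynb:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration

# QLoRA setup: 4-bit quantized base model plus low-rank adapters (values are illustrative).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
)


def train_with_accumulation(model, train_loader, accum_steps=8, lr=1e-4):
    """Fine-tune with gradient accumulation so an 8 GB GPU can emulate a larger batch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    optimizer.zero_grad()

    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / accum_steps  # scale so summed gradients match one large batch
        loss.backward()

        # Update the weights only every `accum_steps` micro-batches.
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```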

📦 How to Use

  1. Run the training notebook (FinalProject_Multimodal_RAG.ipynb) to generate the adapter.
  2. Ensure the fine_tuned_blip2_adapter folder is in the repository root.
  3. Run streamlit run app.py.

This milestone marks the completion of the system's evaluation methodology and domain adaptation phase.

🧪 Methodology Updates

  1. Domain Adaptation:

    • Implemented LoRA (Low-Rank Adaptation) fine-tuning on the BLIP-2 OPT-2.7b model.
    • Objective: To align the generated captions with the specific linguistic style of the Flickr30k dataset.
  2. Quantitative Evaluation Framework:

    • Lexical Metrics: Integrated BLEU-4 and ROUGE-L to measure the lexical (n-gram) overlap between generated captions and the Ground Truth captions (both metric families are sketched after this list).
    • Semantic Metrics: Utilized CLIP-based Cosine Similarity to measure:
      • Answer Relevance: Semantic similarity between the User Query and the Generated Answer.
      • Faithfulness: Semantic similarity between the Visual Evidence and the Generated Answer.
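
A minimal sketch of both metric families, assuming captions and answers are compared as plain strings and that `openai/clip-vit-base-patch32` is the CLIP checkpoint used for the latent-space embeddings (the repository may use a different one); function and variable names here are illustrative:

```python
import torch
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def lexical_metrics(reference: str, hypothesis: str) -> dict:
    """BLEU-4 and ROUGE-L between a generated caption and its ground-truth caption."""
    bleu4 = sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(
        reference, hypothesis
    )["rougeL"].fmeasure
    return {"bleu4": bleu4, "rougeL": rouge_l}


@torch.no_grad()
def clip_text_embedding(text: str) -> torch.Tensor:
    inputs = clip_processor(text=text, return_tensors="pt", truncation=True)
    return torch.nn.functional.normalize(clip_model.get_text_features(**inputs), dim=-1)


@torch.no_grad()
def clip_image_embedding(image) -> torch.Tensor:  # image: a PIL.Image
    inputs = clip_processor(images=image, return_tensors="pt")
    return torch.nn.functional.normalize(clip_model.get_image_features(**inputs), dim=-1)


def answer_relevance(query: str, answer: str) -> float:
    """Cosine similarity between the user query and the generated answer."""
    return float(clip_text_embedding(query) @ clip_text_embedding(answer).T)


def faithfulness(image, answer: str) -> float:
    """Cosine similarity between the retrieved visual evidence and the generated answer."""
    return float(clip_image_embedding(image) @ clip_text_embedding(answer).T)
```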

📊 Results

Initial benchmarks show significant improvement in caption relevance after fine-tuning, with the RAG module demonstrating high faithfulness scores (>0.5) in auditing tasks.