Fine-Tuned Vision Support & Advanced RAG Metrics

This release introduces a major upgrade to the Multimodal RAG System, focusing on Domain Adaptation for the Vision Encoder and a comprehensive Evaluation Framework.

🔥 Key Features

  • Fine-Tuned BLIP-2 Support: Implemented a QLoRA parameter-efficient fine-tuning (PEFT) pipeline for BLIP-2 on the Flickr30k dataset. The system now automatically detects and loads PEFT adapters for domain-specific captioning (a loading sketch follows this list).
  • RAG Evaluation Triad:
    • Retrieval: Ground-Truth (GT) Match Rate (Recall@K).
    • Perception: BLEU-4 and ROUGE-L metrics to evaluate caption quality against Ground Truth.
    • Reasoning: Implemented unsupervised RAG metrics (Faithfulness & Answer Relevance) using CLIP latent-space embeddings to audit Llama-3's answers for hallucinations.
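
A minimal sketch of the adapter auto-detection step, assuming the adapter was saved with PEFT's `save_pretrained` into the `fine_tuned_blip2_adapter` folder and that `Salesforce/blip2-opt-2.7b` is the base checkpoint; the actual loading code in `app.py` may differ:

```python
import os

import torch
from peft import PeftModel
from transformers import Blip2ForConditionalGeneration, Blip2Processor

ADAPTER_DIR = "fine_tuned_blip2_adapter"  # folder expected in the repo root
BASE_MODEL = "Salesforce/blip2-opt-2.7b"  # assumed base checkpoint

processor = Blip2Processor.from_pretrained(BASE_MODEL)
model = Blip2ForConditionalGeneration.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Auto-detect the fine-tuned adapter: load it if present,
# otherwise fall back to the base (zero-shot) captioner.
if os.path.isdir(ADAPTER_DIR):
    model = PeftModel.from_pretrained(model, ADAPTER_DIR)
    print("Loaded fine-tuned BLIP-2 adapter.")
else:
    print("No adapter found; using the base BLIP-2 model.")
```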

🛠 Technical Improvements

  • Optimized Training Loop: Added gradient-accumulation logic to enable fine-tuning on consumer GPUs (8 GB VRAM); a training sketch follows this list.
  • Smart UI Updates: The Streamlit interface now displays the evaluation metrics in real time with visual status indicators (🔴🟡🟢).
  • Auto-Dependency: Added automatic installation for metric libraries (nltk, rouge-score).
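
A minimal sketch of the QLoRA setup and the accumulation loop, assuming a 4-bit quantized `Salesforce/blip2-opt-2.7b` base; the LoRA rank, target modules, `accum_steps`, and `train_loader` are illustrative placeholders rather than the exact values used in FinalProject_Multimodal_RAG.ipynb:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration

# QLoRA setup: 4-bit quantized base model plus low-rank adapters (values are illustrative).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
)


def train_with_accumulation(model, train_loader, accum_steps=8, lr=1e-4):
    """Fine-tune with gradient accumulation so an 8 GB GPU can emulate a larger batch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    optimizer.zero_grad()

    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / accum_steps  # scale so summed gradients match one large batch
        loss.backward()

        # Update the weights only every `accum_steps` micro-batches.
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```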

📦 How to Use

  1. Run the training notebook (FinalProject_Multimodal_RAG.ipynb) to generate the adapter.
  2. Ensure the fine_tuned_blip2_adapter folder is in the repository root.
  3. Run streamlit run app.py.

This milestone marks the completion of the system's evaluation methodology and domain adaptation phase.

🧪 Methodology Updates

  1. Domain Adaptation:

    • Implemented LoRA (Low-Rank Adaptation) fine-tuning on the BLIP-2 OPT-2.7b model.
    • Objective: To align the generated captions with the specific linguistic style of the Flickr30k dataset.
  2. Quantitative Evaluation Framework:

    • Lexical Metrics: Integrated BLEU-4 and ROUGE-L to measure the lexical (n-gram) overlap between generated captions and the Ground Truth captions (both metric families are sketched after this list).
    • Semantic Metrics: Utilized CLIP-based Cosine Similarity to measure:
      • Answer Relevance: Semantic similarity between the User Query and the Generated Answer.
      • Faithfulness: Semantic similarity between the Visual Evidence and the Generated Answer.
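
A minimal sketch of both metric families, assuming captions and answers are compared as plain strings and that `openai/clip-vit-base-patch32` is the CLIP checkpoint used for the latent-space embeddings (the repository may use a different one); function and variable names here are illustrative:

```python
import torch
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def lexical_metrics(reference: str, hypothesis: str) -> dict:
    """BLEU-4 and ROUGE-L between a generated caption and its ground-truth caption."""
    bleu4 = sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(
        reference, hypothesis
    )["rougeL"].fmeasure
    return {"bleu4": bleu4, "rougeL": rouge_l}


@torch.no_grad()
def clip_text_embedding(text: str) -> torch.Tensor:
    inputs = clip_processor(text=text, return_tensors="pt", truncation=True)
    return torch.nn.functional.normalize(clip_model.get_text_features(**inputs), dim=-1)


@torch.no_grad()
def clip_image_embedding(image) -> torch.Tensor:  # image: a PIL.Image
    inputs = clip_processor(images=image, return_tensors="pt")
    return torch.nn.functional.normalize(clip_model.get_image_features(**inputs), dim=-1)


def answer_relevance(query: str, answer: str) -> float:
    """Cosine similarity between the user query and the generated answer."""
    return float(clip_text_embedding(query) @ clip_text_embedding(answer).T)


def faithfulness(image, answer: str) -> float:
    """Cosine similarity between the retrieved visual evidence and the generated answer."""
    return float(clip_image_embedding(image) @ clip_text_embedding(answer).T)
```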

📊 Results

Initial benchmarks show significant improvement in caption relevance after fine-tuning, with the RAG module demonstrating high faithfulness scores (>0.5) in auditing tasks.