A multi-agent retrieval-augmented generation (RAG) system with specialized agents for HR, Finance, and Tech support queries. Includes full observability with Langfuse for debugging and monitoring routing decisions.
- 🤖 Specialized Agents: Separate RAG agents for HR, Finance, and Tech domains
- 🎯 Orchestrator: Intelligent routing to the appropriate specialist agent(s)
- 🔀 Hybrid Ambiguity Handling: Multi-agent queries for cross-domain ambiguous questions; clarification requests for extremely vague queries
- 🛡️ Hallucination Prevention: Enforced tool usage means the orchestrator must query the specialist agents and cannot answer from its own knowledge (see the routing sketch after this list)
- 📦 Vector Stores: FAISS-based semantic search for each domain
- 📊 Observability: Full tracing with Langfuse to debug misrouted questions and track agent performance
- ⭐ Auto-Evaluation: Automatic quality scoring (1-10) for every response using LLM-as-a-judge, tracked in Langfuse. Evaluations run asynchronously in background threads so responses return immediately
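To make the hallucination-prevention guarantee concrete, routing can be enforced at the API level by requiring a tool call on every turn. Below is a minimal sketch assuming the OpenAI Python SDK; the tool names (`ask_hr`, `ask_finance`, `ask_tech`, `request_clarification`) and model are illustrative and may differ from the actual implementation:

```python
# Sketch: the orchestrator must pick a specialist tool; it cannot reply directly.
# Tool names and model are illustrative, not the project's exact definitions.
from openai import OpenAI

client = OpenAI()

SPECIALIST_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
    for name, description in [
        ("ask_hr", "Answer HR questions (policies, benefits, leave)."),
        ("ask_finance", "Answer finance questions (expenses, budgets, reimbursement)."),
        ("ask_tech", "Answer technical support questions."),
        ("request_clarification", "Ask the user to clarify a query too vague to route."),
    ]
]

def route(user_query: str):
    """Return the tool call(s) the orchestrator chose for this query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Route the user's question to the right specialist tool(s)."},
            {"role": "user", "content": user_query},
        ],
        tools=SPECIALIST_TOOLS,
        tool_choice="required",  # forces a tool call; the orchestrator never answers unaided
    )
    return response.choices[0].message.tool_calls
```

Because `tool_choice="required"` forbids free-form replies, an ambiguous cross-domain question surfaces as multiple tool calls, and a genuinely vague one as a `request_clarification` call.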
- Python 3.12 or higher
- uv package manager
- OpenAI API key
- Langfuse account
- Clone the repository

  ```bash
  git clone <your-repo-url>
  cd multi-agent-rag
  ```

- Install dependencies with uv

  ```bash
  uv sync
  ```

- Set up environment variables

  ```bash
  cp .env.example .env
  ```

  Edit `.env` and add your API keys:
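  The exact variable names are defined in `.env.example`; a typical set for this stack looks like the following (the values are placeholders):

  ```bash
  OPENAI_API_KEY=sk-...
  LANGFUSE_PUBLIC_KEY=pk-lf-...
  LANGFUSE_SECRET_KEY=sk-lf-...
  LANGFUSE_HOST=https://cloud.langfuse.com
  ```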
Run the CLI in interactive mode to ask questions:

```bash
uv run python src/multi_agent_system.py
```

Example session:

```
You: What is the vacation policy?
Assistant: According to our HR policy, employees receive...

You: How do I submit an expense report?
Assistant: To submit an expense report, you need to...

You: exit
👋 Goodbye!
```
Enable detailed logging to see agent routing decisions:

```bash
uv run python src/multi_agent_system.py --verbose
```

- Navigate to https://cloud.langfuse.com
- Select your project
- Click on "Traces" to see all queries
- Click on any trace to see:
  - Orchestrator routing decision
  - Which specialist agent(s) were called
  - Document retrieval results
  - LLM calls with prompts and responses
  - Tool invocations
  - Quality scores with reasoning
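Trace nesting like this typically comes from instrumenting the entry points. Here is a minimal sketch using the Langfuse Python SDK's `@observe` decorator; the import path varies by SDK version, and the function names are illustrative rather than the ones in `src/multi_agent_system.py`:

```python
# Sketch: nest Langfuse spans so each user query becomes one trace containing
# the routing decision, retrievals, and specialist answers. Names are illustrative.
from langfuse.decorators import observe  # import path differs in newer SDK versions

@observe()
def retrieve_documents(domain: str, query: str) -> list[str]:
    # The domain's FAISS similarity search would run here.
    return [f"[{domain}] placeholder chunk for: {query}"]

@observe()
def run_specialist(domain: str, query: str) -> str:
    docs = retrieve_documents(domain, query)  # recorded as a child span
    # The specialist's LLM call (also traced) would ground its answer in these docs.
    return f"Answer grounded in {len(docs)} retrieved document(s)."

@observe()
def handle_query(query: str) -> str:
    # The orchestrator's routing decision would select the domain(s) here.
    return run_specialist("hr", query)

print(handle_query("What is the vacation policy?"))
```

Each decorated call becomes a span nested under its caller, which is why a single query shows up in Langfuse as one trace with routing, retrieval, and LLM children.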
Every response receives an automatic quality score (1-10) based on:
- Accuracy: Correctness of information
- Relevance: How well it addresses the question
- Completeness: Whether all aspects are answered
- Clarity: Readability and structure
Special handling: When the system correctly refuses out-of-scope questions, it receives a high score (8-10) because this is the intended behavior.
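A minimal sketch of how such a judge could run in a background thread and attach its verdict to the trace as a Langfuse score; the prompt, model, and client calls are assumptions rather than the project's exact implementation (Langfuse scoring method names differ between SDK versions):

```python
# Sketch: grade a response 1-10 with an LLM judge in a background thread and
# attach the result to the Langfuse trace. Prompt and method names are illustrative.
import json
import threading

from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
client = OpenAI()

JUDGE_PROMPT = (
    "Rate the assistant's answer from 1 to 10 for accuracy, relevance, "
    "completeness, and clarity. A correct refusal of an out-of-scope question "
    'scores 8-10. Reply as JSON: {"score": <int>, "reasoning": "<string>"}.'
)

def evaluate_async(trace_id: str, question: str, answer: str) -> None:
    """Kick off the judge without blocking the response to the user."""
    def _run() -> None:
        result = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
            ],
        )
        verdict = json.loads(result.choices[0].message.content)
        # SDK v2 style; newer versions expose an equivalent create_score method.
        langfuse.score(
            trace_id=trace_id,
            name="response_quality",
            value=float(verdict["score"]),
            comment=verdict["reasoning"],
        )

    # Daemon thread so the CLI returns the answer immediately.
    threading.Thread(target=_run, daemon=True).start()
```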
To view scores:
- Open any trace in Langfuse
- Click the "Scores" tab
- See the `response_quality` score and its reasoning
For single-domain queries:
- Find the trace in Langfuse
- Examine the orchestrator's routing decision
- Check which tool was called
- Review the agent selection reasoning
For ambiguous queries:
- Look for traces where multiple specialist agents were called
- Verify the orchestrator correctly identified the ambiguity
- Check if all relevant specialists were consulted
- Review how responses were synthesized
For clarification requests:
- Identify traces where `request_clarification` was used
- Verify the query was genuinely too vague to route
- Check if the clarification question was helpful
- Consider if multiple specialists would have been better
- No Conversation History: The system processes each query independently without maintaining conversation context. Users cannot ask follow-up questions like "What about for managers?" or "Tell me more", reference previous answers, or build on earlier context within a session.
- No Real-Time Document Updates: The knowledge base is frozen at startup. If HR updates the vacation policy document or any other source document, the system won't reflect those changes until you manually rebuild the vector store by deleting the existing FAISS index and restarting.
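For context on what that rebuild involves, here is a minimal sketch of per-domain FAISS indexing with OpenAI embeddings; the file names, embedding model, and chunking are illustrative and may not match the project's actual code:

```python
# Sketch: build, persist, and query a per-domain FAISS index.
# File names, embedding model, and chunking are illustrative.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data], dtype="float32")

def build_index(chunks: list[str]) -> faiss.IndexFlatL2:
    vectors = embed(chunks)
    index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over embedding dims
    index.add(vectors)
    return index

def search(index: faiss.IndexFlatL2, chunks: list[str], query: str, k: int = 3) -> list[str]:
    _, ids = index.search(embed([query]), k)
    return [chunks[i] for i in ids[0] if i != -1]

# Persisting the index; deleting this file forces a rebuild on the next run.
# faiss.write_index(index, "indexes/hr.faiss")
# index = faiss.read_index("indexes/hr.faiss")
```

Deleting the persisted index file (or directory) simply forces this build step to run again on the next startup, which is how updated source documents get picked up.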