# How to Evaluate Rerankers for Retrieval Optimization in RAG Pipelines

## Overview

This guide explains the role of reranking models in retrieval-augmented generation (RAG) pipelines, how they compare to the current “retrieve many chunks and send them all to the large language model (LLM)” approach, and the key trade-offs to consider before adopting rerankers.

## Prerequisites

Before using or evaluating rerankers, you should:

- Have an existing RAG or retrieval pipeline in place (e.g., vector search returning multiple document chunks).
- Understand:
- What a “chunk” is in your system (e.g., document fragment, paragraph, section).
- How many chunks are currently passed into the final LLM call.
- Your latency and cost constraints for:
- Retrieval (vector search or similar)
- LLM calls
- Any additional model calls (such as a reranker)
- Be able to modify your pipeline to:
- Add an intermediate step (reranking API/model call).
- Adjust how many chunks are passed to the final LLM call.

## Explanation: How Rerankers Fit into a Two-Stage Retrieval Pipeline

### Current Pattern (Single-Stage Retrieval)

1. **Initial retrieval**
- A vector database or search system returns a relatively large number of chunks (documents or passages) based on similarity to the query.

2. **Final LLM call**
- Many or all of these chunks are passed directly into the LLM.
- The LLM is expected to:
- Identify the most relevant information.
- Answer the user’s question based on that context.

3. **Observed issue**
   - To ensure that relevant information is included, the system tends to retrieve and pass *too many* chunks.
   - Consequences:
     - Increased latency for the LLM call (more tokens to process).
     - Higher cost (more tokens).
     - Context dilution (the LLM has to sift through a large amount of content).
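
The sketch below illustrates this single-stage pattern in code. The `retrieve` and `call_llm` callables are hypothetical stand-ins for your own vector search and LLM client; the point is that every retrieved chunk ends up in one large prompt.

```python
from typing import Callable, Sequence

def answer_single_stage(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],  # your vector search (hypothetical)
    call_llm: Callable[[str], str],                 # your LLM client (hypothetical)
    top_k: int = 30,
) -> str:
    """Single-stage pattern: retrieve many chunks and send them all to the LLM."""
    chunks = retrieve(query, top_k)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # one large (and therefore slower, costlier) call
```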

### Proposed Pattern (Two-Stage Retrieval with Reranking)

1. **Stage 1: Broad retrieval**
- Use your existing retrieval mechanism (e.g., vector search) to fetch a larger set of candidate chunks (e.g., top 50–100).

2. **Stage 2: Reranking**
- Pass the query and the retrieved chunks to a **reranking model**.
- The reranker scores each chunk for relevance to the query.
- Select only the top *N* chunks (e.g., top 5–10) based on reranker scores.

3. **Final LLM call with fewer, higher-quality chunks**
- Pass only these top *N* chunks into the LLM.
- Expected benefits:
- Reduced context size → lower LLM latency and cost.
- Maintained or improved answer quality, because the LLM sees the most relevant chunks.
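
A minimal sketch of the two-stage pattern, again with hypothetical `retrieve`, `rerank`, and `call_llm` callables so the structure is visible without committing to a particular vendor or model:

```python
from typing import Callable, Sequence

def answer_two_stage(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],            # stage 1: broad vector search
    rerank: Callable[[str, Sequence[str]], Sequence[float]],  # stage 2: one relevance score per chunk
    call_llm: Callable[[str], str],                           # final LLM call
    candidate_k: int = 75,  # broad retrieval (e.g., top 50-100)
    final_n: int = 8,       # chunk budget for the LLM (e.g., top 5-10)
) -> str:
    """Two-stage pattern: broad retrieval, rerank, then a small, focused LLM call."""
    candidates = list(retrieve(query, candidate_k))
    scores = rerank(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    top_chunks = [chunk for chunk, _ in ranked[:final_n]]
    context = "\n\n".join(top_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # far fewer tokens than the single-stage version
```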

### Core Trade-Off

- **With reranking**:
- **Pros**:
- Fewer chunks in the final LLM call.
- Potentially faster and cheaper LLM calls.
- Better focus on the most relevant documents.
- **Cons**:
- Additional latency from the reranking model/API call.
- Reranker is not perfect; may occasionally drop a useful chunk.

- **Without reranking**:
- **Pros**:
- Simpler pipeline (no extra model call).
- No additional reranker latency.
- “Maximum recall” if you send everything to the LLM.
- **Cons**:
- Larger LLM context → higher latency and cost.
- Risk of overwhelming the LLM with too many chunks.

In other words, you are **trading the latency of an additional reranking call** against **the latency and cost of a much larger LLM call**.

## Suggested Evaluation Steps

Use these steps to decide whether reranking is appropriate for your use case.

1. **Measure your current baseline**
- Record:
- Average number of chunks sent to the LLM per query.
- Average LLM latency per query.
- Average total pipeline latency per query.
- Quality metrics (e.g., answer accuracy, user satisfaction, or internal evaluation scores).
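
A small instrumentation sketch for collecting these baseline numbers. The `run_pipeline`, `count_chunks`, and `count_tokens` hooks are hypothetical stand-ins for your own system; quality metrics are left to whatever evaluation process you already use.

```python
import statistics
import time

def measure_pipeline(queries, run_pipeline, count_chunks, count_tokens):
    """Collect per-query latency, chunk count, and token usage for one pipeline variant."""
    latencies, chunk_counts, token_counts = [], [], []
    for query in queries:
        start = time.perf_counter()
        result = run_pipeline(query)  # returns whatever your pipeline returns
        latencies.append(time.perf_counter() - start)
        chunk_counts.append(count_chunks(result))
        token_counts.append(count_tokens(result))
    return {
        "avg_latency_s": statistics.mean(latencies),
        "avg_chunks": statistics.mean(chunk_counts),
        "avg_tokens": statistics.mean(token_counts),
    }
```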

2. **Define a target chunk budget**
- Decide how many chunks you *ideally* want to pass to the LLM (e.g., 5–10).
- This should be based on:
- LLM context limits.
- Desired latency and cost.
- Empirical tests of how many chunks the LLM can handle effectively.

3. **Prototype a reranking step**
- Integrate a reranking model between retrieval and the LLM call.
- Pipeline:
1. Retrieve a larger set of chunks (e.g., top 50–100).
2. Call the reranker with the query and these chunks.
3. Select the top *N* chunks (your target chunk budget).
4. Pass only these *N* chunks to the LLM.
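
One way to prototype the reranking stage locally is a cross-encoder from the `sentence-transformers` library, as sketched below. The model name is an illustrative choice, not a recommendation from the original discussion; a hosted reranking API would fill the same role.

```python
from sentence_transformers import CrossEncoder

# Illustrative model choice; swap in whichever reranker you are evaluating.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_chunks(query: str, chunks: list[str], top_n: int = 8) -> list[str]:
    """Score each (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```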

4. **Compare latency and cost**
- Measure:
- Additional latency from the reranker call.
- Reduction in LLM latency due to fewer chunks.
- Net effect on total pipeline latency.
- Any change in token usage and cost.
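
Reusing the `measure_pipeline` helper sketched earlier, the comparison can be as simple as running the same query set through both variants. The `eval_queries`, `run_without_reranker`, and `run_with_reranker` names are hypothetical.

```python
baseline = measure_pipeline(eval_queries, run_without_reranker, count_chunks, count_tokens)
candidate = measure_pipeline(eval_queries, run_with_reranker, count_chunks, count_tokens)

print(f"Latency delta: {candidate['avg_latency_s'] - baseline['avg_latency_s']:+.3f} s/query")
print(f"Token delta:   {candidate['avg_tokens'] - baseline['avg_tokens']:+.0f} tokens/query")
```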

5. **Compare quality**
- Evaluate:
- Does answer quality stay the same, improve, or degrade?
- Are there cases where the reranker drops critical chunks that the LLM previously used?

6. **Decide on adoption**
   - If total latency is acceptable or improved **and** quality is acceptable (even if not 100% of the “send everything” baseline), reranking may be a good fit.
   - If the latency increase is too high **or** the quality loss is unacceptable, reranking may not be right for your current requirements.

## Important Notes and Caveats

- **Latency concerns are primary**
  The main concern is that adding a reranking model may introduce too much latency. Any evaluation must quantify the reranker’s added latency against the LLM latency saved by the smaller context.

- **Rerankers are not perfect**
  While rerankers often work very well, they are not 100% accurate compared to feeding everything to the model, and they can occasionally exclude relevant chunks.

- **Fit for your use case is uncertain**
It is not yet clear whether reranking is the right solution for your specific system:
- It “seems to be solving the problem that we have” (too many chunks in the final LLM call).
- However, there is still uncertainty about whether the trade-offs are acceptable in practice.

- **Additional information needed for a final decision**
To move from discussion to decision, you would need:
- Concrete latency benchmarks for:
- Current pipeline (no reranker).
- Prototype pipeline (with reranker).
- Quality evaluation results:
- Human or automated assessments comparing answers with and without reranking.
- Cost analysis:
- Token usage and API costs for both approaches.

## Troubleshooting and Evaluation Tips

- **If total latency increases significantly**
- Check:
- Whether you are retrieving too many initial chunks before reranking.
- Whether the reranking model is overpowered (and slower than needed) for your use case.
- Possible mitigations:
- Reduce the number of initial retrieved chunks.
- Use a lighter or faster reranking model.
- Cache reranker results for repeated or similar queries.
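
For the caching mitigation, a simple in-process cache over the hypothetical `rerank_chunks` helper from the prototype above can help with repeated queries, assuming the candidate set for a given query is stable:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_rerank(query: str, chunks: tuple[str, ...], top_n: int = 8) -> tuple[str, ...]:
    """Cache reranker output per (query, candidate set); chunks must be passed as a tuple."""
    return tuple(rerank_chunks(query, list(chunks), top_n=top_n))
```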

- **If answer quality drops noticeably**
- Investigate:
- Whether the top-*N* cutoff is too aggressive (e.g., using top 3 instead of top 10).
- Whether the reranker is misaligned with your domain (e.g., specialized jargon or formats).
- Possible mitigations:
- Increase *N* (allow more chunks into the LLM).
- Fine-tune or choose a domain-appropriate reranker.
- Add a fallback path:
- If confidence is low, send more chunks or bypass reranking.
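
A sketch of the fallback idea, assuming the cross-encoder `reranker` from the prototype above. The `min_score` threshold is illustrative and should be calibrated against your reranker’s actual score distribution:

```python
def rerank_with_fallback(
    query: str,
    chunks: list[str],
    top_n: int = 8,
    fallback_n: int = 20,
    min_score: float = 0.0,  # illustrative threshold; calibrate for your reranker
) -> list[str]:
    """Keep the top_n chunks, but widen the context when reranker confidence is low."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        # Low confidence: send more chunks rather than trusting an aggressive cutoff.
        return [chunk for chunk, _ in ranked[:fallback_n]]
    return [chunk for chunk, _ in ranked[:top_n]]
```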

- **If costs do not improve**
- Verify:
- That the reduction in LLM tokens is substantial enough to offset reranker costs.
- Possible mitigations:
- Further reduce the number of chunks passed to the LLM.
- Use a cheaper reranking model or provider.
- Apply reranking only to high-value or complex queries.

- **If it is unclear whether reranking is “right for us”**
- Run a time-boxed experiment:
- Implement reranking for a subset of traffic or a test environment.
- Collect metrics over a defined period.
- Use the data to make a decision rather than relying on intuition alone.
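
For the time-boxed experiment, a deterministic traffic split is usually enough. The sketch below routes a fixed percentage of queries (keyed by a hypothetical `query_id`) through the reranked path so both variants can be measured on live traffic:

```python
import hashlib

def use_reranked_path(query_id: str, rollout_pct: int = 10) -> bool:
    """Deterministically route roughly rollout_pct% of queries through the reranked pipeline."""
    bucket = int(hashlib.sha256(query_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_pct
```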

---
*Source: [Original Slack thread](https://distylai.slack.com/archives/impl-tower-infobot/p1739987575183209)*