diff --git a/docs/evals/how-to-evaluate-rerankers-for-retrieval-optimization-in-rag-.md b/docs/evals/how-to-evaluate-rerankers-for-retrieval-optimization-in-rag-.md
new file mode 100644
index 0000000..82ec417
--- /dev/null
+++ b/docs/evals/how-to-evaluate-rerankers-for-retrieval-optimization-in-rag-.md
@@ -0,0 +1,192 @@
# How to Evaluate Rerankers for Retrieval Optimization in RAG Pipelines

## Overview

This guide explains the role of reranking models in retrieval-augmented generation (RAG) pipelines, how they compare to the current “retrieve many chunks and send them all to the large language model (LLM)” approach, and the key trade-offs to consider before adopting rerankers.

## Prerequisites

Before using or evaluating rerankers, you should:

- Have an existing RAG or retrieval pipeline in place (e.g., vector search returning multiple document chunks).
- Understand:
  - What a “chunk” is in your system (e.g., document fragment, paragraph, section).
  - How many chunks are currently passed into the final LLM call.
  - Your latency and cost constraints for:
    - Retrieval (vector search or similar)
    - LLM calls
    - Any additional model calls (such as a reranker)
- Be able to modify your pipeline to:
  - Add an intermediate step (reranking API/model call).
  - Adjust how many chunks are passed to the final LLM call.

## Explanation: How Rerankers Fit into a Two-Stage Retrieval Pipeline

### Current Pattern (Single-Stage Retrieval)

1. **Initial retrieval**
   - A vector database or search system returns a relatively large number of chunks (documents or passages) based on similarity to the query.

2. **Final LLM call**
   - Many or all of these chunks are passed directly into the LLM.
   - The LLM is expected to:
     - Identify the most relevant information.
     - Answer the user’s question based on that context.

3. **Observed issue**
   - To ensure relevant information is included, the system may:
     - Retrieve and pass *too many* chunks.
   - Consequences:
     - Increased latency for the LLM call (more tokens to process).
     - Higher cost (more tokens).
     - Potential context dilution (the LLM has to sift through a lot of content).

### Proposed Pattern (Two-Stage Retrieval with Reranking)

1. **Stage 1: Broad retrieval**
   - Use your existing retrieval mechanism (e.g., vector search) to fetch a larger set of candidate chunks (e.g., top 50–100).

2. **Stage 2: Reranking**
   - Pass the query and the retrieved chunks to a **reranking model**.
   - The reranker scores each chunk for relevance to the query.
   - Select only the top *N* chunks (e.g., top 5–10) based on reranker scores.

3. **Final LLM call with fewer, higher-quality chunks**
   - Pass only these top *N* chunks into the LLM.
   - Expected benefits:
     - Reduced context size → lower LLM latency and cost.
     - Maintained or improved answer quality, because the LLM sees the most relevant chunks.
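The sketch below shows one way to wire this two-stage pattern together. It is a minimal outline, not a specific vendor integration: the `retrieve`, `rerank`, and `generate` callables are hypothetical stand-ins for your existing vector search, your chosen reranking model/API, and your final LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    text: str
    score: float = 0.0  # relevance score assigned by the reranker


def answer_with_reranking(
    query: str,
    retrieve: Callable[[str, int], List[Chunk]],          # existing vector search: (query, top_k) -> chunks
    rerank: Callable[[str, Sequence[str]], List[float]],  # reranker: (query, chunk texts) -> one score per chunk
    generate: Callable[[str, str], str],                  # final LLM call: (query, context) -> answer
    retrieve_k: int = 50,  # Stage 1: broad retrieval budget
    final_n: int = 8,      # Stage 2: chunk budget for the final LLM call
) -> str:
    """Two-stage retrieval: broad retrieval, rerank, then a small final LLM context."""
    # Stage 1: fetch a larger candidate set from the existing retrieval system.
    candidates = retrieve(query, retrieve_k)

    # Stage 2: score every candidate against the query with the reranker.
    scores = rerank(query, [c.text for c in candidates])
    for chunk, score in zip(candidates, scores):
        chunk.score = score

    # Keep only the top-N chunks by reranker score.
    top_chunks = sorted(candidates, key=lambda c: c.score, reverse=True)[:final_n]

    # The final LLM call sees a much smaller, more focused context.
    context = "\n\n".join(c.text for c in top_chunks)
    return generate(query, context)
```

The only structural change from the single-stage pattern is the reranking step; `retrieve_k` and `final_n` are the two knobs that the evaluation steps below are meant to tune.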
### Core Trade-Off

- **With reranking**:
  - **Pros**:
    - Fewer chunks in the final LLM call.
    - Potentially faster and cheaper LLM calls.
    - Better focus on the most relevant documents.
  - **Cons**:
    - Additional latency from the reranking model/API call.
    - The reranker is not perfect; it may occasionally drop a useful chunk.

- **Without reranking**:
  - **Pros**:
    - Simpler pipeline (no extra model call).
    - No additional reranker latency.
    - “Maximum recall” if you send everything to the LLM.
  - **Cons**:
    - Larger LLM context → higher latency and cost.
    - Risk of overwhelming the LLM with too many chunks.

In other words, you are **trading the latency of an additional reranking call** against **the latency and cost of a much larger LLM call**.

## Suggested Evaluation Steps

Use these steps to decide whether reranking is appropriate for your use case.

1. **Measure your current baseline**
   - Record:
     - Average number of chunks sent to the LLM per query.
     - Average LLM latency per query.
     - Average total pipeline latency per query.
     - Quality metrics (e.g., answer accuracy, user satisfaction, or internal evaluation scores).

2. **Define a target chunk budget**
   - Decide how many chunks you *ideally* want to pass to the LLM (e.g., 5–10).
   - This should be based on:
     - LLM context limits.
     - Desired latency and cost.
     - Empirical tests of how many chunks the LLM can handle effectively.

3. **Prototype a reranking step**
   - Integrate a reranking model between retrieval and the LLM call.
   - Pipeline:
     1. Retrieve a larger set of chunks (e.g., top 50–100).
     2. Call the reranker with the query and these chunks.
     3. Select the top *N* chunks (your target chunk budget).
     4. Pass only these *N* chunks to the LLM.

4. **Compare latency and cost**
   - Measure (a minimal measurement sketch follows these steps):
     - Additional latency from the reranker call.
     - Reduction in LLM latency due to fewer chunks.
     - Net effect on total pipeline latency.
     - Any change in token usage and cost.

5. **Compare quality**
   - Evaluate:
     - Does answer quality stay the same, improve, or degrade?
     - Are there cases where the reranker drops critical chunks that the LLM previously used?

6. **Decide on adoption**
   - If total latency is acceptable or improved, and quality is acceptable (even if not 100% of the “send everything” baseline), then reranking may be a good fit.
   - If the latency increase is too high, or the quality loss is unacceptable, then reranking may not be right for your current requirements.
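To make the baseline and comparison steps concrete, the sketch below times a pipeline variant over the same set of evaluation queries and collects its answers for a separate quality review. It is a minimal harness under the same assumptions as the earlier sketch: `run_baseline` and `run_with_reranker` in the usage comment are hypothetical wrappers around your current pipeline and the reranked prototype.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple


def benchmark_pipeline(
    run_query: Callable[[str], str],  # one pipeline variant: query -> final answer
    queries: List[str],
) -> Tuple[Dict[str, float], List[str]]:
    """Measure end-to-end latency per query and collect answers for later quality review."""
    assert queries, "need at least one evaluation query"
    latencies: List[float] = []
    answers: List[str] = []
    for q in queries:
        start = time.perf_counter()
        answers.append(run_query(q))
        latencies.append(time.perf_counter() - start)
    stats = {
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
    }
    return stats, answers


# Example usage (run_baseline / run_with_reranker are hypothetical wrappers
# around the current pipeline and the reranked prototype):
#
# eval_queries = [...]  # a representative sample of real queries
# baseline_stats, baseline_answers = benchmark_pipeline(run_baseline, eval_queries)
# reranked_stats, reranked_answers = benchmark_pipeline(run_with_reranker, eval_queries)
# print("baseline:", baseline_stats)
# print("reranked:", reranked_stats)
```

Token usage and cost can be logged inside each wrapper in the same way, so that the latency, cost, and quality comparisons in steps 4–5 all come from the same run.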
## Important Notes and Caveats

- **Latency concerns are primary**
  There is explicit concern that adding a reranking model may introduce too much latency. Any evaluation must quantify:
  - Reranker latency vs. LLM latency savings.

- **Rerankers are not perfect**
  While rerankers often work very well, they are:
  - Not 100% accurate compared to “feed everything to the model.”
  - Capable of occasionally excluding relevant chunks.

- **Fit for your use case is uncertain**
  It is not yet clear whether reranking is the right solution for your specific system:
  - It “seems to be solving the problem that we have” (too many chunks in the final LLM call).
  - However, there is still uncertainty about whether the trade-offs are acceptable in practice.

- **Additional information needed for a final decision**
  To move from discussion to decision, you would need:
  - Concrete latency benchmarks for:
    - Current pipeline (no reranker).
    - Prototype pipeline (with reranker).
  - Quality evaluation results:
    - Human or automated assessments comparing answers with and without reranking.
  - Cost analysis:
    - Token usage and API costs for both approaches.

## Troubleshooting and Evaluation Tips

- **If total latency increases significantly**
  - Check:
    - Whether you are retrieving too many initial chunks before reranking.
    - Whether the reranking model is overpowered (and slower than needed) for your use case.
  - Possible mitigations:
    - Reduce the number of initial retrieved chunks.
    - Use a lighter or faster reranking model.
    - Cache reranker results for repeated or similar queries.

- **If answer quality drops noticeably**
  - Investigate:
    - Whether the top-*N* cutoff is too aggressive (e.g., using top 3 instead of top 10).
    - Whether the reranker is misaligned with your domain (e.g., specialized jargon or formats).
  - Possible mitigations:
    - Increase *N* (allow more chunks into the LLM).
    - Fine-tune or choose a domain-appropriate reranker.
    - Add a fallback path:
      - If confidence is low, send more chunks or bypass reranking.

- **If costs do not improve**
  - Verify:
    - That the reduction in LLM tokens is substantial enough to offset reranker costs.
  - Possible mitigations:
    - Further reduce the number of chunks passed to the LLM.
    - Use a cheaper reranking model or provider.
    - Apply reranking only to high-value or complex queries.

- **If it is unclear whether reranking is “right for us”**
  - Run a time-boxed experiment:
    - Implement reranking for a subset of traffic or a test environment.
    - Collect metrics over a defined period.
    - Use the data to make a decision rather than relying on intuition alone.

---
*Source: [Original Slack thread](https://distylai.slack.com/archives/impl-tower-infobot/p1739987575183209)*