# How to Evaluate Rerankers for Retrieval Optimization in RAG Pipelines

## Overview

This guide explains the role of reranking models in retrieval-augmented generation (RAG) pipelines, how they compare to the current “retrieve many chunks and send them all to the large language model (LLM)” approach, and the key trade-offs to consider before adopting rerankers.

## Prerequisites

Before using or evaluating rerankers, you should:

- Have an existing RAG or retrieval pipeline in place (e.g., vector search returning multiple document chunks).
- Understand:
- What a “chunk” is in your system (e.g., document fragment, paragraph, section).
- How many chunks are currently passed into the final LLM call.
- Your latency and cost constraints for:
- Retrieval (vector search or similar)
- LLM calls
- Any additional model calls (such as a reranker)
- Be able to modify your pipeline to:
- Add an intermediate step (reranking API/model call).
- Adjust how many chunks are passed to the final LLM call.

## Explanation: How Rerankers Fit into a Two-Stage Retrieval Pipeline

### Current Pattern (Single-Stage Retrieval)

1. **Initial retrieval**
- A vector database or search system returns a relatively large number of chunks (documents or passages) based on similarity to the query.

2. **Final LLM call**
- Many or all of these chunks are passed directly into the LLM.
- The LLM is expected to:
- Identify the most relevant information.
- Answer the user’s question based on that context.

3. **Observed issue**
   - To ensure that relevant information is included, the system tends to retrieve and pass *too many* chunks.
   - Consequences:
     - Increased latency for the LLM call (more tokens to process).
     - Higher cost (more tokens).
     - Context dilution (the LLM has to sift through a large amount of content).
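
The sketch below illustrates this single-stage pattern in code. The `retrieve` and `call_llm` callables are hypothetical stand-ins for your own vector search and LLM client; the point is that every retrieved chunk ends up in one large prompt.

```python
from typing import Callable, Sequence

def answer_single_stage(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],  # your vector search (hypothetical)
    call_llm: Callable[[str], str],                 # your LLM client (hypothetical)
    top_k: int = 30,
) -> str:
    """Single-stage pattern: retrieve many chunks and send them all to the LLM."""
    chunks = retrieve(query, top_k)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # one large (and therefore slower, costlier) call
```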

### Proposed Pattern (Two-Stage Retrieval with Reranking)

1. **Stage 1: Broad retrieval**
- Use your existing retrieval mechanism (e.g., vector search) to fetch a larger set of candidate chunks (e.g., top 50–100).

2. **Stage 2: Reranking**
- Pass the query and the retrieved chunks to a **reranking model**.
- The reranker scores each chunk for relevance to the query.
- Select only the top *N* chunks (e.g., top 5–10) based on reranker scores.

3. **Final LLM call with fewer, higher-quality chunks**
- Pass only these top *N* chunks into the LLM.
- Expected benefits:
- Reduced context size → lower LLM latency and cost.
- Maintained or improved answer quality, because the LLM sees the most relevant chunks.
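
A minimal sketch of the two-stage pattern, again with hypothetical `retrieve`, `rerank`, and `call_llm` callables so the structure is visible without committing to a particular vendor or model:

```python
from typing import Callable, Sequence

def answer_two_stage(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],            # stage 1: broad vector search
    rerank: Callable[[str, Sequence[str]], Sequence[float]],  # stage 2: one relevance score per chunk
    call_llm: Callable[[str], str],                           # final LLM call
    candidate_k: int = 75,  # broad retrieval (e.g., top 50-100)
    final_n: int = 8,       # chunk budget for the LLM (e.g., top 5-10)
) -> str:
    """Two-stage pattern: broad retrieval, rerank, then a small, focused LLM call."""
    candidates = list(retrieve(query, candidate_k))
    scores = rerank(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    top_chunks = [chunk for chunk, _ in ranked[:final_n]]
    context = "\n\n".join(top_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # far fewer tokens than the single-stage version
```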

### Core Trade-Off

- **With reranking**:
- **Pros**:
- Fewer chunks in the final LLM call.
- Potentially faster and cheaper LLM calls.
- Better focus on the most relevant documents.
- **Cons**:
- Additional latency from the reranking model/API call.
- Reranker is not perfect; may occasionally drop a useful chunk.

- **Without reranking**:
- **Pros**:
- Simpler pipeline (no extra model call).
- No additional reranker latency.
- “Maximum recall” if you send everything to the LLM.
- **Cons**:
- Larger LLM context → higher latency and cost.
- Risk of overwhelming the LLM with too many chunks.

In other words, you are **trading the latency of an additional reranking call** against **the latency and cost of a much larger LLM call**.

## Suggested Evaluation Steps

Use these steps to decide whether reranking is appropriate for your use case.

1. **Measure your current baseline**
- Record:
- Average number of chunks sent to the LLM per query.
- Average LLM latency per query.
- Average total pipeline latency per query.
- Quality metrics (e.g., answer accuracy, user satisfaction, or internal evaluation scores).
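
A small instrumentation sketch for collecting these baseline numbers. The `run_pipeline`, `count_chunks`, and `count_tokens` hooks are hypothetical stand-ins for your own system; quality metrics are left to whatever evaluation process you already use.

```python
import statistics
import time

def measure_pipeline(queries, run_pipeline, count_chunks, count_tokens):
    """Collect per-query latency, chunk count, and token usage for one pipeline variant."""
    latencies, chunk_counts, token_counts = [], [], []
    for query in queries:
        start = time.perf_counter()
        result = run_pipeline(query)  # returns whatever your pipeline returns
        latencies.append(time.perf_counter() - start)
        chunk_counts.append(count_chunks(result))
        token_counts.append(count_tokens(result))
    return {
        "avg_latency_s": statistics.mean(latencies),
        "avg_chunks": statistics.mean(chunk_counts),
        "avg_tokens": statistics.mean(token_counts),
    }
```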

2. **Define a target chunk budget**
- Decide how many chunks you *ideally* want to pass to the LLM (e.g., 5–10).
- This should be based on:
- LLM context limits.
- Desired latency and cost.
- Empirical tests of how many chunks the LLM can handle effectively.

3. **Prototype a reranking step**
- Integrate a reranking model between retrieval and the LLM call.
- Pipeline:
1. Retrieve a larger set of chunks (e.g., top 50–100).
2. Call the reranker with the query and these chunks.
3. Select the top *N* chunks (your target chunk budget).
4. Pass only these *N* chunks to the LLM.
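
One way to prototype the reranking stage locally is a cross-encoder from the `sentence-transformers` library, as sketched below. The model name is an illustrative choice, not a recommendation from the original discussion; a hosted reranking API would fill the same role.

```python
from sentence_transformers import CrossEncoder

# Illustrative model choice; swap in whichever reranker you are evaluating.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_chunks(query: str, chunks: list[str], top_n: int = 8) -> list[str]:
    """Score each (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```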

4. **Compare latency and cost**
- Measure:
- Additional latency from the reranker call.
- Reduction in LLM latency due to fewer chunks.
- Net effect on total pipeline latency.
- Any change in token usage and cost.
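
Reusing the `measure_pipeline` helper sketched earlier, the comparison can be as simple as running the same query set through both variants. The `eval_queries`, `run_without_reranker`, and `run_with_reranker` names are hypothetical.

```python
baseline = measure_pipeline(eval_queries, run_without_reranker, count_chunks, count_tokens)
candidate = measure_pipeline(eval_queries, run_with_reranker, count_chunks, count_tokens)

print(f"Latency delta: {candidate['avg_latency_s'] - baseline['avg_latency_s']:+.3f} s/query")
print(f"Token delta:   {candidate['avg_tokens'] - baseline['avg_tokens']:+.0f} tokens/query")
```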

5. **Compare quality**
- Evaluate:
- Does answer quality stay the same, improve, or degrade?
- Are there cases where the reranker drops critical chunks that the LLM previously used?

6. **Decide on adoption**
   - If total latency is acceptable or improved **and** quality is acceptable (even if not 100% of the “send everything” baseline), reranking may be a good fit.
   - If the latency increase is too high **or** the quality loss is unacceptable, reranking may not be right for your current requirements.

## Important Notes and Caveats

- **Latency concerns are primary**
  The main concern is that adding a reranking model may introduce too much latency. Any evaluation must quantify the reranker’s added latency against the LLM latency saved by the smaller context.

- **Rerankers are not perfect**
  While rerankers often work very well, they are not 100% accurate compared to feeding everything to the model, and they can occasionally exclude relevant chunks.

- **Fit for your use case is uncertain**
It is not yet clear whether reranking is the right solution for your specific system:
- It “seems to be solving the problem that we have” (too many chunks in the final LLM call).
- However, there is still uncertainty about whether the trade-offs are acceptable in practice.

- **Additional information needed for a final decision**
To move from discussion to decision, you would need:
- Concrete latency benchmarks for:
- Current pipeline (no reranker).
- Prototype pipeline (with reranker).
- Quality evaluation results:
- Human or automated assessments comparing answers with and without reranking.
- Cost analysis:
- Token usage and API costs for both approaches.

## Troubleshooting and Evaluation Tips

- **If total latency increases significantly**
- Check:
- Whether you are retrieving too many initial chunks before reranking.
- Whether the reranking model is overpowered (and slower than needed) for your use case.
- Possible mitigations:
- Reduce the number of initial retrieved chunks.
- Use a lighter or faster reranking model.
- Cache reranker results for repeated or similar queries.
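
For the caching mitigation, a simple in-process cache over the hypothetical `rerank_chunks` helper from the prototype above can help with repeated queries, assuming the candidate set for a given query is stable:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_rerank(query: str, chunks: tuple[str, ...], top_n: int = 8) -> tuple[str, ...]:
    """Cache reranker output per (query, candidate set); chunks must be passed as a tuple."""
    return tuple(rerank_chunks(query, list(chunks), top_n=top_n))
```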

- **If answer quality drops noticeably**
- Investigate:
- Whether the top-*N* cutoff is too aggressive (e.g., using top 3 instead of top 10).
- Whether the reranker is misaligned with your domain (e.g., specialized jargon or formats).
- Possible mitigations:
- Increase *N* (allow more chunks into the LLM).
- Fine-tune or choose a domain-appropriate reranker.
- Add a fallback path:
- If confidence is low, send more chunks or bypass reranking.
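
A sketch of the fallback idea, assuming the cross-encoder `reranker` from the prototype above. The `min_score` threshold is illustrative and should be calibrated against your reranker’s actual score distribution:

```python
def rerank_with_fallback(
    query: str,
    chunks: list[str],
    top_n: int = 8,
    fallback_n: int = 20,
    min_score: float = 0.0,  # illustrative threshold; calibrate for your reranker
) -> list[str]:
    """Keep the top_n chunks, but widen the context when reranker confidence is low."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        # Low confidence: send more chunks rather than trusting an aggressive cutoff.
        return [chunk for chunk, _ in ranked[:fallback_n]]
    return [chunk for chunk, _ in ranked[:top_n]]
```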

- **If costs do not improve**
- Verify:
- That the reduction in LLM tokens is substantial enough to offset reranker costs.
- Possible mitigations:
- Further reduce the number of chunks passed to the LLM.
- Use a cheaper reranking model or provider.
- Apply reranking only to high-value or complex queries.

- **If it is unclear whether reranking is “right for us”**
- Run a time-boxed experiment:
- Implement reranking for a subset of traffic or a test environment.
- Collect metrics over a defined period.
- Use the data to make a decision rather than relying on intuition alone.
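
For the time-boxed experiment, a deterministic traffic split is usually enough. The sketch below routes a fixed percentage of queries (keyed by a hypothetical `query_id`) through the reranked path so both variants can be measured on live traffic:

```python
import hashlib

def use_reranked_path(query_id: str, rollout_pct: int = 10) -> bool:
    """Deterministically route roughly rollout_pct% of queries through the reranked pipeline."""
    bucket = int(hashlib.sha256(query_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_pct
```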

---
*Source: [Original Slack thread](https://distylai.slack.com/archives/impl-tower-infobot/p1739987575183209)*