# How to Prototype LLM-Based Rewrites of Scraped Documentation for Token Reduction

## Overview

This guide describes a proposed approach for reducing token usage by rewriting scraped documentation with a Large Language Model (LLM) before inserting it into MongoDB. It also outlines key concerns about information distortion, collaboration with OpenAI on preprocessing, and how to share relevant examples.

## Prerequisites

- Access to:
- The MongoDB instance where scraped documentation is stored.
- The Distyl application environment (e.g., `https://distillery.distyl.dev/app/stage/tower/...`).
- Linear issues (e.g., `TOW-216`) and Coffey traces, if needed for examples.
- Ability to:
- Run or modify preprocessing pipelines for scraped documentation.
- Call the `o3-high` model (or an equivalent LLM) from your environment.
- Understanding of:
- How the current “general lookup” flow and document chunking work.
- Basic concerns around hallucinations and information distortion in LLM outputs.

> Note: The exact implementation details of the scraping pipeline, MongoDB schema, and LLM invocation are not specified in the original discussion and will need to be clarified before implementation.

## Proposed Workflow

### 1. Identify the Problem

1. A user query such as “Where is my order?” currently:
- Routes to a general lookup flow.
- Retrieves document chunks that contain “all kinds of junk” (irrelevant or noisy content).
2. This leads to:
- Higher token usage.
- Potentially lower answer quality due to noisy context.

### 2. Define the LLM Rewrite Objective

Before implementing, clarify the goals of the rewrite step:

- **Primary goals**
- Remove irrelevant or low-value content (“junk”) from scraped documentation.
- Normalize and compress text to reduce token count.
- **Constraints**
- Preserve factual accuracy and critical details.
- Avoid introducing hallucinations or changing the meaning of the source documentation.

### 3. Insert an LLM Rewrite Step Before MongoDB Ingestion

1. **Locate the ingestion pipeline**
Identify where scraped documents are currently processed and inserted into MongoDB.

2. **Add a preprocessing stage using an LLM (e.g., `o3-high`)**
For each scraped document (a minimal sketch follows this list):
- Send the raw text to the LLM with instructions such as:
- Summarize and normalize the content.
- Remove navigation, boilerplate, and irrelevant sections.
- Preserve all user-impacting details (e.g., order status logic, error conditions, policy details).
- Receive the rewritten text and store that version in MongoDB instead of (or alongside) the raw text.
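
The source discussion does not specify the pipeline code or how `o3-high` is invoked, so the following is only a minimal sketch. It assumes the OpenAI Python SDK (mapping `o3-high` to the `o3` model at high reasoning effort) and `pymongo`; the prompt wording, collection name, and helper functions are illustrative assumptions, not the actual pipeline.

```python
# Hypothetical rewrite stage inserted before MongoDB ingestion.
# Model name, prompt, and collection layout are assumptions.
from openai import OpenAI
from pymongo import MongoClient

llm = OpenAI()  # reads OPENAI_API_KEY from the environment
docs = MongoClient("mongodb://localhost:27017")["kb"]["scraped_docs"]

REWRITE_PROMPT = (
    "Rewrite the following scraped documentation page. Remove navigation "
    "menus, headers/footers, and boilerplate. Compress and normalize the "
    "prose, but preserve every user-impacting detail: order status logic, "
    "error conditions, and policy details. Do not add any information."
)

def rewrite_document(raw_text: str) -> str:
    """Ask the LLM for a compressed, junk-free rewrite of one page."""
    resp = llm.chat.completions.create(
        model="o3",               # assumption: `o3-high` = o3 + high effort
        reasoning_effort="high",
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": raw_text},
        ],
    )
    return resp.choices[0].message.content

def ingest(url: str, raw_text: str) -> None:
    """Store both versions so verification and fallback remain possible."""
    docs.insert_one({
        "url": url,
        "raw_text": raw_text,
        "rewritten_text": rewrite_document(raw_text),
    })
```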

3. **Use a looping question–answer (QA) verification step (if available)**
- Apply an internal QA loop (not customer-facing) to:
- Compare the original and rewritten versions.
- Ask the LLM to verify that all key facts and constraints are preserved.
- Reject or flag rewrites that appear to drop important information or introduce contradictions (one possible shape is sketched after this list).
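
The referenced looping QA step is not described in the discussion; one plausible shape, reusing the `llm` client and assumptions from the sketch above, is to ask the model to list facts present in the original but missing or contradicted in the rewrite:

```python
import json

VERIFY_PROMPT = (
    "You are comparing two versions of the same documentation page. List "
    "every fact, condition, or exception present in the ORIGINAL that is "
    "missing or contradicted in the REWRITE. Respond with a JSON object: "
    '{"missing": ["..."]}.'
)

def verify_rewrite(raw_text: str, rewritten_text: str) -> list[str]:
    """Return facts the rewrite appears to drop; an empty list is a pass."""
    resp = llm.chat.completions.create(
        model="o3",
        reasoning_effort="high",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": VERIFY_PROMPT},
            {"role": "user",
             "content": f"ORIGINAL:\n{raw_text}\n\nREWRITE:\n{rewritten_text}"},
        ],
    )
    return json.loads(resp.choices[0].message.content).get("missing", [])
```

A non-empty `missing` list could trigger a retry with a stricter prompt, or flag the document for human review instead of ingestion.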

### 4. Prototype and Evaluate

1. **Prototype quickly**
- A team member has offered to “prototype this” to make the concept easier to evaluate.
- Start with a small subset of documents (e.g., those that are frequently hit by “Where is my order?” queries).

2. **Measure performance**
- Compare:
- Token usage before vs. after the rewrite step (see the token-counting sketch after this step).
- Retrieval quality (e.g., reduction in irrelevant chunks).
- Monitor for:
- Any evidence of information distortion.
- Changes in answer quality for common queries.
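
Token reduction can be measured offline with any tokenizer matched to the serving model; the sketch below assumes `tiktoken` with the `o200k_base` encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumption: tokenizer family

def token_reduction(raw_text: str, rewritten_text: str) -> float:
    """Fraction of tokens saved by the rewrite (0.42 means 42% fewer)."""
    before = len(enc.encode(raw_text))
    after = len(enc.encode(rewritten_text))
    return 1 - after / max(before, 1)
```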

3. **Iterate based on findings**
- Adjust prompts, filtering rules, or the scope of rewriting based on observed issues.

### 5. Coordinate with OpenAI on Related Preprocessing

1. **Spanish translation and preprocessing**
- James from OpenAI is prototyping preprocessing rewrites for Spanish translation.
- He is interested in:
- Examples of “extra junk” in documentation that should be removed.
- Other preprocessing transformations that could be applied similarly.

2. **Share examples and access**
- Determine whether James has:
- Access to Linear tickets (e.g., `TOW-216`).
- Access to Coffey traces (e.g., `.../traces/a495db2b-1045-40dc-a408-d8d647a2564d`).
- If not, decide:
- What anonymized or redacted examples can be shared.
- The best channel and format to share:
- Problematic chunks.
- Before/after rewrite examples.
- Representative traces of noisy retrievals.

## Important Notes and Caveats

- **Risk of information distortion**
- There is explicit concern that LLM rewrites could distort or omit critical information.
- This could “get us into real trouble” if customer-facing answers rely on altered documentation.
- Mitigation:
- Use high-quality models (e.g., `o3-high`).
- Keep the rewrite step internal and non-customer-facing.
- Implement QA verification loops and spot checks.

- **Scope of deployment**
- Initial work should be treated as a prototype.
- Do not rely on rewritten documents for production-critical flows until:
- Accuracy has been validated.
- Failure modes are understood and mitigated.

- **Access and privacy**
- Confirm what data can be shared with external partners (e.g., OpenAI) and under what agreements.
- Ensure that any traces or tickets shared are compliant with privacy and security policies.

## Troubleshooting and Open Questions

### Common Issues

1. **Rewritten documents lose important details**
- Symptoms:
- Answers become more generic or incomplete.
- Edge cases or special conditions disappear from the documentation.
- Actions:
- Tighten the rewrite prompt to explicitly require preservation of all conditions, parameters, and exceptions.
- Use the QA loop to compare specific facts between original and rewritten text.
- Consider storing both raw and rewritten versions and falling back to raw when needed (a retrieval sketch follows this item).
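
If both versions are stored, as in the ingestion sketch earlier, the fallback can live in the retrieval layer. A hypothetical example, where `qa_flagged` is an assumed field set by the QA verification loop:

```python
def fetch_text(url: str) -> str | None:
    """Prefer the rewritten version; fall back to raw if it was flagged."""
    doc = docs.find_one({"url": url}) or {}
    if doc.get("rewritten_text") and not doc.get("qa_flagged"):
        return doc["rewritten_text"]
    return doc.get("raw_text")
```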

2. **Rewritten documents still contain “junk”**
- Symptoms:
- Retrieval still returns navigation text, boilerplate, or irrelevant sections.
- Actions:
- Add explicit instructions to remove:
- Headers/footers.
- Navigation menus.
- Legal boilerplate (if not needed for answers).
- Provide the LLM with concrete examples of what should be removed (illustrated below).
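
For instance, the illustrative `REWRITE_PROMPT` from the ingestion sketch could be extended with explicit junk categories and a concrete example (the wording is hypothetical):

```python
JUNK_REMOVAL_PROMPT = REWRITE_PROMPT + (
    "\nRemove, specifically:\n"
    "- Site headers, footers, and cookie banners\n"
    "- Navigation menus and breadcrumb trails\n"
    "- Legal boilerplate not needed to answer customer questions\n"
    "Example of junk to strip: 'Home > Support > Orders | Sign in'"
)
```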

3. **Difficulty evaluating the prototype**
- Symptoms:
- Stakeholders are unsure whether the rewrite step is beneficial.
- Actions:
- Define clear metrics:
- Token reduction per document.
- Reduction in irrelevant chunks for key queries.
- Qualitative rating of answer quality on a test set.
- Run A/B tests on internal queries where possible.

### Additional Information Needed

To fully implement this guide, the following details must be clarified:

- Exact structure and entry points of the current scraping and ingestion pipeline.
- How “general lookup” is implemented and how chunks are currently generated.
- The precise interface and configuration for calling `o3-high` (or the chosen LLM) in this environment.
- Data access rules for sharing Linear tickets and Coffey traces with external collaborators.
- The existing implementation (if any) of the “looping QA step” referenced in the discussion.

---
*Source: [Original Slack thread](https://distylai.slack.com/archives/impl-tower-infobot/p1741808756956959)*