# How to Prototype LLM-Based Rewrites of Scraped Documentation for Token Reduction

## Overview

This guide describes a proposed approach for reducing token usage by rewriting scraped documentation with a Large Language Model (LLM) before inserting it into MongoDB. It also outlines key concerns about information distortion, collaboration with OpenAI on preprocessing, and how to share relevant examples.

## Prerequisites

- Access to:
- The MongoDB instance where scraped documentation is stored.
- The Distyl application environment (e.g., `https://distillery.distyl.dev/app/stage/tower/...`).
- Linear issues (e.g., `TOW-216`) and Coffey traces, if needed for examples.
- Ability to:
- Run or modify preprocessing pipelines for scraped documentation.
- Call the `o3-high` model (or an equivalent LLM) from your environment.
- Understanding of:
- How the current “general lookup” flow and document chunking work.
- Basic concerns around hallucinations and information distortion in LLM outputs.

> Note: The exact implementation details of the scraping pipeline, MongoDB schema, and LLM invocation are not specified in the original discussion and will need to be clarified before implementation.

## Proposed Workflow

### 1. Identify the Problem

1. A user query such as “Where is my order?” currently:
- Routes to a general lookup flow.
- Retrieves document chunks that contain “all kinds of junk” (irrelevant or noisy content).
2. This leads to:
- Higher token usage.
- Potentially lower answer quality due to noisy context.

### 2. Define the LLM Rewrite Objective

Before implementing, clarify the goals of the rewrite step:

- **Primary goals**
- Remove irrelevant or low-value content (“junk”) from scraped documentation.
- Normalize and compress text to reduce token count.
- **Constraints**
- Preserve factual accuracy and critical details.
- Avoid introducing hallucinations or changing the meaning of the source documentation.

### 3. Insert an LLM Rewrite Step Before MongoDB Ingestion

1. **Locate the ingestion pipeline**
Identify where scraped documents are currently processed and inserted into MongoDB.

2. **Add a preprocessing stage using an LLM (e.g., `o3-high`)**
For each scraped document (a minimal sketch follows this list):
- Send the raw text to the LLM with instructions such as:
- Summarize and normalize the content.
- Remove navigation, boilerplate, and irrelevant sections.
- Preserve all user-impacting details (e.g., order status logic, error conditions, policy details).
- Receive the rewritten text and store that version in MongoDB instead of (or alongside) the raw text.
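
The source discussion does not specify the pipeline code or how `o3-high` is invoked, so the following is only a minimal sketch. It assumes the OpenAI Python SDK (mapping `o3-high` to the `o3` model at high reasoning effort) and `pymongo`; the prompt wording, collection name, and helper functions are illustrative assumptions, not the actual pipeline.

```python
# Hypothetical rewrite stage inserted before MongoDB ingestion.
# Model name, prompt, and collection layout are assumptions.
from openai import OpenAI
from pymongo import MongoClient

llm = OpenAI()  # reads OPENAI_API_KEY from the environment
docs = MongoClient("mongodb://localhost:27017")["kb"]["scraped_docs"]

REWRITE_PROMPT = (
    "Rewrite the following scraped documentation page. Remove navigation "
    "menus, headers/footers, and boilerplate. Compress and normalize the "
    "prose, but preserve every user-impacting detail: order status logic, "
    "error conditions, and policy details. Do not add any information."
)

def rewrite_document(raw_text: str) -> str:
    """Ask the LLM for a compressed, junk-free rewrite of one page."""
    resp = llm.chat.completions.create(
        model="o3",               # assumption: `o3-high` = o3 + high effort
        reasoning_effort="high",
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": raw_text},
        ],
    )
    return resp.choices[0].message.content

def ingest(url: str, raw_text: str) -> None:
    """Store both versions so verification and fallback remain possible."""
    docs.insert_one({
        "url": url,
        "raw_text": raw_text,
        "rewritten_text": rewrite_document(raw_text),
    })
```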

3. **Use a looping question–answer (QA) verification step (if available)**
- Apply an internal QA loop (not customer-facing) to:
- Compare the original and rewritten versions.
- Ask the LLM to verify that all key facts and constraints are preserved.
- Reject or flag rewrites that appear to drop important information or introduce contradictions (one possible shape is sketched after this list).
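
The referenced looping QA step is not described in the discussion; one plausible shape, reusing the `llm` client and assumptions from the sketch above, is to ask the model to list facts present in the original but missing or contradicted in the rewrite:

```python
import json

VERIFY_PROMPT = (
    "You are comparing two versions of the same documentation page. List "
    "every fact, condition, or exception present in the ORIGINAL that is "
    "missing or contradicted in the REWRITE. Respond with a JSON object: "
    '{"missing": ["..."]}.'
)

def verify_rewrite(raw_text: str, rewritten_text: str) -> list[str]:
    """Return facts the rewrite appears to drop; an empty list is a pass."""
    resp = llm.chat.completions.create(
        model="o3",
        reasoning_effort="high",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": VERIFY_PROMPT},
            {"role": "user",
             "content": f"ORIGINAL:\n{raw_text}\n\nREWRITE:\n{rewritten_text}"},
        ],
    )
    return json.loads(resp.choices[0].message.content).get("missing", [])
```

A non-empty `missing` list could trigger a retry with a stricter prompt, or flag the document for human review instead of ingestion.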

### 4. Prototype and Evaluate

1. **Prototype quickly**
- A team member has offered to “prototype this” to make the concept easier to evaluate.
- Start with a small subset of documents (e.g., those that are frequently hit by “Where is my order?” queries).

2. **Measure performance**
- Compare:
- Token usage before vs. after the rewrite step (see the token-counting sketch after this step).
- Retrieval quality (e.g., reduction in irrelevant chunks).
- Monitor for:
- Any evidence of information distortion.
- Changes in answer quality for common queries.
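
Token reduction can be measured offline with any tokenizer matched to the serving model; the sketch below assumes `tiktoken` with the `o200k_base` encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumption: tokenizer family

def token_reduction(raw_text: str, rewritten_text: str) -> float:
    """Fraction of tokens saved by the rewrite (0.42 means 42% fewer)."""
    before = len(enc.encode(raw_text))
    after = len(enc.encode(rewritten_text))
    return 1 - after / max(before, 1)
```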

3. **Iterate based on findings**
- Adjust prompts, filtering rules, or the scope of rewriting based on observed issues.

### 5. Coordinate with OpenAI on Related Preprocessing

1. **Spanish translation and preprocessing**
- James from OpenAI is prototyping preprocessing rewrites for Spanish translation.
- He is interested in:
- Examples of “extra junk” in documentation that should be removed.
- Other preprocessing transformations that could be applied similarly.

2. **Share examples and access**
- Determine whether James has:
- Access to Linear tickets (e.g., `TOW-216`).
- Access to Coffey traces (e.g., `.../traces/a495db2b-1045-40dc-a408-d8d647a2564d`).
- If not, decide:
- What anonymized or redacted examples can be shared.
- The best channel and format to share:
- Problematic chunks.
- Before/after rewrite examples.
- Representative traces of noisy retrievals.

## Important Notes and Caveats

- **Risk of information distortion**
- There is explicit concern that LLM rewrites could distort or omit critical information.
- This could “get us into real trouble” if customer-facing answers rely on altered documentation.
- Mitigation:
- Use high-quality models (e.g., `o3-high`).
- Keep the rewrite step internal and non-customer-facing.
- Implement QA verification loops and spot checks.

- **Scope of deployment**
- Initial work should be treated as a prototype.
- Do not rely on rewritten documents for production-critical flows until:
- Accuracy has been validated.
- Failure modes are understood and mitigated.

- **Access and privacy**
- Confirm what data can be shared with external partners (e.g., OpenAI) and under what agreements.
- Ensure that any traces or tickets shared are compliant with privacy and security policies.

## Troubleshooting and Open Questions

### Common Issues

1. **Rewritten documents lose important details**
- Symptoms:
- Answers become more generic or incomplete.
- Edge cases or special conditions disappear from the documentation.
- Actions:
- Tighten the rewrite prompt to explicitly require preservation of all conditions, parameters, and exceptions.
- Use the QA loop to compare specific facts between original and rewritten text.
- Consider storing both raw and rewritten versions and falling back to raw when needed (a retrieval sketch follows this item).
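
If both versions are stored, as in the ingestion sketch earlier, the fallback can live in the retrieval layer. A hypothetical example, where `qa_flagged` is an assumed field set by the QA verification loop:

```python
def fetch_text(url: str) -> str | None:
    """Prefer the rewritten version; fall back to raw if it was flagged."""
    doc = docs.find_one({"url": url}) or {}
    if doc.get("rewritten_text") and not doc.get("qa_flagged"):
        return doc["rewritten_text"]
    return doc.get("raw_text")
```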

2. **Rewritten documents still contain “junk”**
- Symptoms:
- Retrieval still returns navigation text, boilerplate, or irrelevant sections.
- Actions:
- Add explicit instructions to remove:
- Headers/footers.
- Navigation menus.
- Legal boilerplate (if not needed for answers).
- Provide the LLM with concrete examples of what should be removed (illustrated below).
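
For instance, the illustrative `REWRITE_PROMPT` from the ingestion sketch could be extended with explicit junk categories and a concrete example (the wording is hypothetical):

```python
JUNK_REMOVAL_PROMPT = REWRITE_PROMPT + (
    "\nRemove, specifically:\n"
    "- Site headers, footers, and cookie banners\n"
    "- Navigation menus and breadcrumb trails\n"
    "- Legal boilerplate not needed to answer customer questions\n"
    "Example of junk to strip: 'Home > Support > Orders | Sign in'"
)
```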

3. **Difficulty evaluating the prototype**
- Symptoms:
- Stakeholders are unsure whether the rewrite step is beneficial.
- Actions:
- Define clear metrics:
- Token reduction per document.
- Reduction in irrelevant chunks for key queries.
- Qualitative rating of answer quality on a test set.
- Run A/B tests on internal queries where possible.

### Additional Information Needed

To fully implement this guide, the following details must be clarified:

- Exact structure and entry points of the current scraping and ingestion pipeline.
- How “general lookup” is implemented and how chunks are currently generated.
- The precise interface and configuration for calling `o3-high` (or the chosen LLM) in this environment.
- Data access rules for sharing Linear tickets and Coffey traces with external collaborators.
- The existing implementation (if any) of the “looping QA step” referenced in the discussion.

---
*Source: [Original Slack thread](https://distylai.slack.com/archives/impl-tower-infobot/p1741808756956959)*