diff --git a/docs/debugging/how-to-prototype-llm-based-rewrites-of-scraped-documentation.md b/docs/debugging/how-to-prototype-llm-based-rewrites-of-scraped-documentation.md
new file mode 100644
index 0000000..8be308b
--- /dev/null
+++ b/docs/debugging/how-to-prototype-llm-based-rewrites-of-scraped-documentation.md
@@ -0,0 +1,163 @@
+# How to Prototype LLM-Based Rewrites of Scraped Documentation for Token Reduction
+
+## Overview
+
+This guide describes a proposed approach for reducing token usage by rewriting scraped documentation with a Large Language Model (LLM) before inserting it into MongoDB. It also outlines key concerns about information distortion, collaboration with OpenAI on preprocessing, and how to share relevant examples.
+
+## Prerequisites
+
+- Access to:
+  - The MongoDB instance where scraped documentation is stored.
+  - The Distyl application environment (e.g., `https://distillery.distyl.dev/app/stage/tower/...`).
+  - Linear issues (e.g., `TOW-216`) and Coffey traces, if needed for examples.
+- Ability to:
+  - Run or modify preprocessing pipelines for scraped documentation.
+  - Call the `o3-high` LLM model (or equivalent) from your environment.
+- Understanding of:
+  - How the current “general lookup” flow and document chunking work.
+  - Basic concerns around hallucinations and information distortion in LLM outputs.
+
+> Note: The exact implementation details of the scraping pipeline, MongoDB schema, and LLM invocation are not specified in the original discussion and will need to be clarified before implementation.
+
+## Proposed Workflow
+
+### 1. Identify the Problem
+
+1. A user query such as “Where is my order?” currently:
+   - Routes to a general lookup flow.
+   - Retrieves document chunks that contain “all kinds of junk” (irrelevant or noisy content).
+2. This leads to:
+   - Higher token usage.
+   - Potentially lower answer quality due to noisy context.
+
+### 2. Define the LLM Rewrite Objective
+
+Before implementing, clarify the goals of the rewrite step:
+
+- **Primary goals**
+  - Remove irrelevant or low-value content (“junk”) from scraped documentation.
+  - Normalize and compress text to reduce token count.
+- **Constraints**
+  - Preserve factual accuracy and critical details.
+  - Avoid introducing hallucinations or changing the meaning of the source documentation.
+
+### 3. Insert an LLM Rewrite Step Before MongoDB Ingestion
+
+1. **Locate the ingestion pipeline**
+   Identify where scraped documents are currently processed and inserted into MongoDB.
+
+2. **Add a preprocessing stage using an LLM (e.g., `o3-high`)**
+   For each scraped document:
+   - Send the raw text to the LLM with instructions such as:
+     - Summarize and normalize the content.
+     - Remove navigation, boilerplate, and irrelevant sections.
+     - Preserve all user-impacting details (e.g., order status logic, error conditions, policy details).
+   - Receive the rewritten text and store that version in MongoDB instead of (or alongside) the raw text.
+
+3. **Use a looping question–answer (QA) verification step (if available)**
+   - Apply an internal QA loop (not customer-facing) to:
+     - Compare the original and rewritten versions.
+     - Ask the LLM to verify that all key facts and constraints are preserved.
+     - Reject or flag rewrites that appear to drop important information or introduce contradictions.
+   (A sketch of steps 2 and 3 follows this list.)
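+
+A minimal sketch of this stage is shown below; it is illustrative, not a confirmed implementation. It assumes the OpenAI Python client and `pymongo` are available; the model name `o3-high` is taken from the discussion and may map to a different identifier in this environment, and the prompts, connection string, and field names are placeholders.
+
+```python
+# Sketch: rewrite a scraped document with an LLM, verify the rewrite with an
+# internal (non-customer-facing) QA check, and store both versions in MongoDB.
+# All names below (prompts, URI, collection and field names) are assumptions.
+from openai import OpenAI
+from pymongo import MongoClient
+
+llm = OpenAI()
+docs = MongoClient("mongodb://localhost:27017")["docs"]["scraped"]  # hypothetical collection
+
+REWRITE_PROMPT = (
+    "Rewrite the following scraped documentation. Remove navigation menus, "
+    "headers/footers, and boilerplate. Compress the wording, but preserve "
+    "every user-impacting detail (order status logic, error conditions, "
+    "policy details). Do not add information that is not in the source."
+)
+
+VERIFY_PROMPT = (
+    "Compare the ORIGINAL and REWRITTEN documents. Reply PASS if the rewrite "
+    "preserves all key facts and constraints; otherwise reply FAIL and list "
+    "what was lost or changed."
+)
+
+def call_llm(system: str, user: str) -> str:
+    """Single LLM call; 'o3-high' is the model named in the discussion."""
+    resp = llm.chat.completions.create(
+        model="o3-high",
+        messages=[{"role": "system", "content": system},
+                  {"role": "user", "content": user}],
+    )
+    return resp.choices[0].message.content
+
+def ingest(raw_text: str, source_url: str) -> None:
+    """Rewrite, QA-check, and store one scraped document."""
+    rewritten = call_llm(REWRITE_PROMPT, raw_text)
+    verdict = call_llm(VERIFY_PROMPT, f"ORIGINAL:\n{raw_text}\n\nREWRITTEN:\n{rewritten}")
+    docs.insert_one({
+        "source_url": source_url,
+        "raw_text": raw_text,  # keep the raw version as a fallback
+        "rewritten_text": rewritten,
+        "qa_passed": verdict.strip().upper().startswith("PASS"),  # flag suspect rewrites
+    })
+```
+
+Storing both versions side by side supports the fallback strategy described under Troubleshooting below.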
+
+### 4. Prototype and Evaluate
+
+1. **Prototype quickly**
+   - A team member has offered to “prototype this” to make the concept easier to evaluate.
+   - Start with a small subset of documents (e.g., those that are frequently hit by “Where is my order?” queries).
+
+2. **Measure performance**
+   - Compare:
+     - Token usage before vs. after the rewrite step.
+     - Retrieval quality (e.g., reduction in irrelevant chunks).
+   - Monitor for:
+     - Any evidence of information distortion.
+     - Changes in answer quality for common queries.
+
+3. **Iterate based on findings**
+   - Adjust prompts, filtering rules, or the scope of rewriting based on observed issues.
+
+### 5. Coordinate with OpenAI on Related Preprocessing
+
+1. **Spanish translation and preprocessing**
+   - James from OpenAI is prototyping preprocessing rewrites for Spanish translation.
+   - He is interested in:
+     - Examples of “extra junk” in documentation that should be removed.
+     - Other preprocessing transformations that could be applied similarly.
+
+2. **Share examples and access**
+   - Determine whether James has:
+     - Access to Linear tickets (e.g., `TOW-216`).
+     - Access to Coffey traces (e.g., `.../traces/a495db2b-1045-40dc-a408-d8d647a2564d`).
+   - If not, decide:
+     - What anonymized or redacted examples can be shared.
+     - The best channel and format for sharing:
+       - Problematic chunks.
+       - Before/after rewrite examples.
+       - Representative traces of noisy retrievals.
+
+## Important Notes and Caveats
+
+- **Risk of information distortion**
+  - There is explicit concern that LLM rewrites could distort or omit critical information.
+  - This could “get us into real trouble” if customer-facing answers rely on altered documentation.
+  - Mitigation:
+    - Use high-quality models (e.g., `o3-high`).
+    - Keep the rewrite step internal and non-customer-facing.
+    - Implement QA verification loops and spot checks.
+
+- **Scope of deployment**
+  - Treat the initial work as a prototype.
+  - Do not rely on rewritten documents for production-critical flows until:
+    - Accuracy has been validated.
+    - Failure modes are understood and mitigated.
+
+- **Access and privacy**
+  - Confirm what data can be shared with external partners (e.g., OpenAI) and under what agreements.
+  - Ensure that any traces or tickets shared comply with privacy and security policies.
+
+## Troubleshooting and Open Questions
+
+### Common Issues
+
+1. **Rewritten documents lose important details**
+   - Symptoms:
+     - Answers become more generic or incomplete.
+     - Edge cases or special conditions disappear from the documentation.
+   - Actions:
+     - Tighten the rewrite prompt to explicitly require preservation of all conditions, parameters, and exceptions.
+     - Use the QA loop to compare specific facts between the original and rewritten text.
+     - Consider storing both raw and rewritten versions and falling back to the raw version when needed.
+
+2. **Rewritten documents still contain “junk”**
+   - Symptoms:
+     - Retrieval still returns navigation text, boilerplate, or irrelevant sections.
+   - Actions:
+     - Add explicit instructions to remove:
+       - Headers/footers.
+       - Navigation menus.
+       - Legal boilerplate (if not needed for answers).
+     - Provide the LLM with concrete examples of what should be removed.
+
+3. **Difficulty evaluating the prototype**
+   - Symptoms:
+     - Stakeholders are unsure whether the rewrite step is beneficial.
+   - Actions:
+     - Define clear metrics (a measurement sketch follows this list):
+       - Token reduction per document.
+       - Reduction in irrelevant chunks for key queries.
+       - Qualitative rating of answer quality on a test set.
+     - Run A/B tests on internal queries where possible.
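+
+To make the token-reduction metric concrete, the sketch below counts tokens for the raw/rewritten pairs stored by the ingestion sketch above. Using `tiktoken` with the `cl100k_base` encoding is an assumption (match the tokenizer to the production model), and the field names are the same illustrative ones as before.
+
+```python
+# Sketch: report per-document token savings for rewrites that passed the QA loop.
+import tiktoken
+from pymongo import MongoClient
+
+enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is an assumption
+docs = MongoClient("mongodb://localhost:27017")["docs"]["scraped"]  # hypothetical collection
+
+def token_reduction(raw_text: str, rewritten_text: str) -> float:
+    """Fraction of tokens removed by the rewrite (0.35 means 35% fewer)."""
+    raw = len(enc.encode(raw_text))
+    new = len(enc.encode(rewritten_text))
+    return 1.0 - (new / raw) if raw else 0.0
+
+# Spot-check a small sample of QA-passed documents.
+for doc in docs.find({"qa_passed": True}).limit(20):
+    saved = token_reduction(doc["raw_text"], doc["rewritten_text"])
+    print(f"{doc['source_url']}: {saved:.1%} fewer tokens")
+```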
+
+### Additional Information Needed
+
+To fully implement this guide, the following details must be clarified:
+
+- Exact structure and entry points of the current scraping and ingestion pipeline.
+- How “general lookup” is implemented and how chunks are currently generated.
+- The precise interface and configuration for calling `o3-high` (or the chosen LLM) in this environment.
+- Data access rules for sharing Linear tickets and Coffey traces with external collaborators.
+- The existing implementation (if any) of the “looping QA step” referenced in the discussion.
+
+---
+*Source: [Original Slack thread](https://distylai.slack.com/archives/impl-tower-infobot/p1741808756956959)*