|
| 1 | +# GenAI IDP Accelerator - Active Context |
| 2 | + |
| 3 | +## Current Task Focus |
| 4 | + |
| 5 | +**User Question**: Understanding OCR processing architecture for large PDFs (500+ pages) in the IDP accelerator, specifically: |
| 6 | +1. Is OCR processing sequential or distributed by page? |
| 7 | +2. How does Bedrock-only OCR deployment differ? |
| 8 | +3. What parts of the system run sequentially vs distributed? |
| 9 | +4. How should massive PDFs with hundreds of forms and no clear page boundaries be handled? |
| 10 | + |
| 11 | +## Key Findings |
| 12 | + |
| 13 | +### OCR Processing Models |
| 14 | + |
| 15 | +The IDP accelerator uses **different processing models depending on the pattern**: |
| 16 | + |
| 17 | +#### Pattern 1 (BDA): Sequential Internal Processing |
| 18 | +- **OCR Approach**: Bedrock Data Automation handles everything internally |
| 19 | +- **Processing**: Entire document processed as single unit by BDA service |
| 20 | +- **Concurrency**: Not user-controllable, managed by BDA |
| 21 | +- **Large Documents**: Subject to BDA service limits and timeouts |
| 22 | + |
| 23 | +#### Pattern 2/3 (Textract + Bedrock): Distributed Page Processing |
| 24 | +- **OCR Approach**: AWS Textract with concurrent page processing |
| 25 | +- **Processing**: **Pages processed in parallel** using ThreadPoolExecutor |
| 26 | +- **Concurrency**: Configurable (default: 20 concurrent workers) |
| 27 | +- **Large Documents**: Well suited to 500+ page documents, since pages are OCR'd concurrently |
| 28 | + |
| 29 | +### Sequential vs Distributed Components |
| 30 | + |
| 31 | +#### Sequential Processing: |
| 32 | +1. **Step Functions Workflow**: OCR → Classification → Extraction → Assessment → Summarization |
| 33 | +2. **Classification**: Analyzes all pages to create document boundaries |
| 34 | +3. **BDA Internal Processing**: Everything handled as single unit |
| 35 | + |
| 36 | +#### Distributed Processing: |
| 37 | +1. **OCR Pages (Pattern 2/3)**: Up to 20 pages processed simultaneously |
| 38 | +2. **Extraction Sections**: Up to 10 document sections processed in parallel |
| 39 | +3. **Independent API Calls**: Each page makes separate Textract calls |
| 40 | + |
| 41 | +## Customer Scenario Analysis |
| 42 | + |
| 43 | +### 500+ Page PDF with Multiple Forms |
| 44 | + |
| 45 | +**Challenge**: Single PDF containing hundreds of forms without clear page boundaries |
| 46 | + |
| 47 | +**Recommended Approach**: Pattern 2 or 3 for optimal performance |
| 48 | + |
| 49 | +**Why Pattern 2/3 is Better**: |
| 50 | +- **Page-Level Parallelism**: 500 pages processed 20 at a time |
| 51 | +- **Memory Efficiency**: Individual pages loaded, not entire document |
| 52 | +- **Fault Tolerance**: Page failures don't stop entire processing |
| 53 | +- **Granular Control**: Can optimize per-page processing |
| 54 | + |
| 55 | +**Classification Strategy**: |
| 56 | +- Use "holistic" classification method to analyze entire document |
| 57 | +- Creates logical sections grouping related pages |
| 58 | +- Handles form boundaries that don't align with page boundaries |
| 59 | + |
| 60 | +## Technical Implementation Details |
| 61 | + |
| 62 | +### OCR Service Configuration for Large Documents |
| 63 | + |
| 64 | +```yaml |
| 65 | +ocr: |
| 66 | + backend: "textract" |
| 67 | + max_workers: 20 # Increase for more parallelism |
| 68 | + image: |
| 69 | + dpi: 150 # Balance quality vs processing time |
| 70 | + target_width: 1024 |
| 71 | + target_height: 1024 |
| 72 | + features: |
| 73 | + - name: "LAYOUT" |
| 74 | + - name: "TABLES" |
| 75 | + - name: "FORMS" |
| 76 | +``` |
| 77 | + |
| 78 | +### Processing Flow for Large PDFs |
| 79 | + |
| 80 | +1. **Document Load**: PyMuPDF loads PDF structure |
| 81 | +2. **Page Distribution**: ThreadPoolExecutor creates 20 concurrent workers |
| 82 | +3. **Parallel OCR**: Each page processed independently via Textract |
| 83 | +4. **Result Assembly**: Pages sorted and combined into document structure |
| 84 | +5. **Classification**: Holistic analysis creates logical document sections |
| 85 | +6. **Parallel Extraction**: Sections processed concurrently (MaxConcurrency: 10) |
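
Steps 2 to 4 of the flow above can be sketched as follows. This is a minimal illustration, not the accelerator's actual code: `ocr_page_stub` is a hypothetical stand-in for a per-page Textract call, and `ocr_document` is an assumed helper name. It shows the general pattern of submitting one task per page to a `ThreadPoolExecutor`, tolerating individual page failures, and reassembling results in page order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def ocr_page_stub(page_number: int) -> str:
    """Hypothetical stand-in for a per-page Textract call."""
    return f"text of page {page_number}"


def ocr_document(
    page_numbers: List[int],
    ocr_page: Callable[[int], str] = ocr_page_stub,
    max_workers: int = 20,
) -> Dict[str, object]:
    """OCR pages concurrently, then reassemble results in page order."""
    results: Dict[int, str] = {}
    failures: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per page; at most max_workers run at the same time.
        futures = {pool.submit(ocr_page, n): n for n in page_numbers}
        for future in as_completed(futures):
            n = futures[future]
            try:
                results[n] = future.result()
            except Exception as exc:
                # A failed page is recorded; the other pages keep going.
                failures[n] = str(exc)
    # Tasks complete in arbitrary order, so sort by page number.
    ordered = [results[n] for n in sorted(results)]
    return {"pages": ordered, "failed": failures}
```

Because completion order is not guaranteed, the sort on page number is what restores the document structure before classification runs.
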
| 86 | + |
| 87 | +## Performance Implications |
| 88 | + |
| 89 | +### For 500-Page Document: |
| 90 | +- **Pattern 1 (BDA)**: Single job, BDA-managed processing |
| 91 | +- **Pattern 2/3**: 500 pages spread over 20 workers, roughly 25 pages per worker (500 / 20), with OCR calls running concurrently |
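
The arithmetic behind the Pattern 2/3 figure can be made explicit. The sketch below is an idealized estimate only: it assumes every page takes the same time and ignores throttling and retries. The function names and the 2.0 second per-page latency are illustrative assumptions, not measured values.

```python
import math


def ocr_waves(total_pages: int, max_workers: int = 20) -> int:
    """Number of 'waves' needed if max_workers pages run at a time."""
    return math.ceil(total_pages / max_workers)


def estimated_ocr_seconds(
    total_pages: int, seconds_per_page: float, max_workers: int = 20
) -> float:
    """Idealized wall-clock estimate: waves times per-page latency."""
    return ocr_waves(total_pages, max_workers) * seconds_per_page


print(ocr_waves(500))                   # 25
print(estimated_ocr_seconds(500, 2.0))  # 50.0 (2.0 s/page is a placeholder)
```
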
| 92 | + |
| 93 | +### Bottlenecks to Consider: |
| 94 | +1. **Textract Rate Limits**: May need to lower max_workers if Textract throttles requests |
| 95 | +2. **Memory Usage**: 20 concurrent pages require significant memory |
| 96 | +3. **S3 Operations**: Parallel uploads/downloads for page results |
| 97 | +4. **Lambda Timeouts**: Ensure sufficient timeout for large documents |
| 98 | + |
| 99 | +## Next Steps and Considerations |
| 100 | + |
| 101 | +### For Customer Implementation: |
| 102 | +1. **Choose Pattern 2 or 3** for large document processing |
| 103 | +2. **Configure max_workers** based on Textract limits and memory |
| 104 | +3. **Use holistic classification** to handle form boundaries |
| 105 | +4. **Monitor memory usage** during processing |
| 106 | +5. **Consider document splitting** if single PDF approach is problematic |
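
One way to reason about step 2 is to cap `max_workers` by both the request-rate limit and the memory budget. The helper below is a hypothetical sketch: `choose_max_workers` and all of its parameters are assumptions for illustration, and the actual Textract quota and per-page memory footprint have to be looked up or measured for the target account and documents. The rate bound uses the rule of thumb that sustainable concurrency is about requests per second times seconds per request.

```python
def choose_max_workers(
    total_pages: int,
    textract_tps_limit: float,
    seconds_per_call: float,
    memory_budget_mb: int,
    mb_per_page: int,
    hard_cap: int = 20,
) -> int:
    """Pick a worker count bounded by rate limit, memory, and page count."""
    # Workers that can stay busy without exceeding the request rate.
    by_rate = max(1, int(textract_tps_limit * seconds_per_call))
    # Workers whose in-flight page images fit in the memory budget.
    by_memory = max(1, memory_budget_mb // mb_per_page)
    return max(1, min(hard_cap, total_pages, by_rate, by_memory))


# Placeholder inputs: 5 requests/s quota, 2 s per call, 2 GB budget, 64 MB/page.
print(choose_max_workers(500, 5, 2.0, 2048, 64))  # 10
```

In this example the rate limit is the binding constraint, so the default of 20 workers would be reduced to 10.
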
| 107 | +
|
| 108 | +### Optimization Opportunities: |
| 109 | +- **Adaptive Concurrency**: Adjust workers based on document size |
| 110 | +- **Progressive Processing**: Start classification while OCR continues |
| 111 | +- **Caching Strategy**: Cache page images for reprocessing |
| 112 | +- **Error Recovery**: Implement page-level retry with exponential backoff |
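
The error-recovery idea above could look roughly like this. It is a generic sketch of retry with exponential backoff and jitter, not code from the accelerator: `retry_with_backoff` and its parameters are hypothetical, and a real version would likely catch only throttling or transient errors rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying failures with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

Wrapping each per-page OCR call in such a helper keeps a throttled page from failing the whole document, at the cost of a longer worst-case runtime per page.
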