Commit df09fde

Author: Bob Strahan
Update lending_package.pdf sample with realistic driver's license image
1 parent 9fb331b commit df09fde

2 files changed: +116 -0 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+### Changed
+
+- Updated lending_package.pdf sample with more realistic driver's license image
+
 ### Added
 
 ## [0.3.8]

memory-bank/activeContext.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# GenAI IDP Accelerator - Active Context

## Current Task Focus

**User Question**: Understanding OCR processing architecture for large PDFs (500+ pages) in the IDP accelerator, specifically:
1. Is OCR processing sequential or distributed by page?
2. How does Bedrock-only OCR deployment differ?
3. What parts of the system run sequentially vs distributed?
4. Handling massive PDFs with hundreds of forms without clear page boundaries

## Key Findings

### OCR Processing Models

The IDP accelerator uses **different processing models depending on the pattern**:

#### Pattern 1 (BDA): Sequential Internal Processing
- **OCR Approach**: Bedrock Data Automation handles everything internally
- **Processing**: Entire document processed as single unit by BDA service
- **Concurrency**: Not user-controllable, managed by BDA
- **Large Documents**: Subject to BDA service limits and timeouts

#### Pattern 2/3 (Textract + Bedrock): Distributed Page Processing
- **OCR Approach**: AWS Textract with concurrent page processing
- **Processing**: **Pages processed in parallel** using ThreadPoolExecutor
- **Concurrency**: Configurable (default: 20 concurrent workers)
- **Large Documents**: Optimal for 500+ page documents

### Sequential vs Distributed Components

#### Sequential Processing:
1. **Step Functions Workflow**: OCR → Classification → Extraction → Assessment → Summarization
2. **Classification**: Analyzes all pages to create document boundaries
3. **BDA Internal Processing**: Everything handled as single unit

#### Distributed Processing:
1. **OCR Pages (Pattern 2/3)**: Up to 20 pages processed simultaneously
2. **Extraction Sections**: Up to 10 document sections processed in parallel
3. **Independent API Calls**: Each page makes separate Textract calls

## Customer Scenario Analysis

### 500+ Page PDF with Multiple Forms

**Challenge**: Single PDF containing hundreds of forms without clear page boundaries

**Recommended Approach**: Pattern 2 or 3 for optimal performance

**Why Pattern 2/3 is Better**:
- **Page-Level Parallelism**: 500 pages processed 20 at a time
- **Memory Efficiency**: Individual pages loaded, not entire document
- **Fault Tolerance**: Page failures don't stop entire processing
- **Granular Control**: Can optimize per-page processing

**Classification Strategy**:
- Use "holistic" classification method to analyze entire document
- Creates logical sections grouping related pages
- Handles form boundaries that don't align with page boundaries

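To make "logical sections grouping related pages" concrete, here is a minimal sketch of the grouping step only. It assumes a classifier has already labelled each page; the tuple layout `(page_number, doc_type, starts_new_document)`, the `Section` class, and the `group_pages` helper are hypothetical and are not taken from the accelerator's code. It also works at page granularity, so a form that begins mid-page would need finer-grained boundaries than this shows.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """A run of consecutive pages that belong to one logical document."""
    doc_type: str
    pages: list[int] = field(default_factory=list)


def group_pages(page_labels: list[tuple[int, str, bool]]) -> list[Section]:
    """Group (page_number, doc_type, starts_new_document) labels into sections.

    The label format is an assumption made for this sketch.
    """
    sections: list[Section] = []
    for page_number, doc_type, starts_new in sorted(page_labels):
        # Open a new section on the first page, on an explicit boundary,
        # or when the document type changes.
        if not sections or starts_new or sections[-1].doc_type != doc_type:
            sections.append(Section(doc_type))
        sections[-1].pages.append(page_number)
    return sections


labels = [
    (1, "loan_application", True),
    (2, "loan_application", False),
    (3, "loan_application", True),  # a second form of the same type starts here
    (4, "drivers_license", True),
]
sections = group_pages(labels)
print([(s.doc_type, s.pages) for s in sections])
```

The explicit `starts_new_document` flag is what lets two back-to-back forms of the same type end up in separate sections; grouping on the type label alone would merge them.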
## Technical Implementation Details

### OCR Service Configuration for Large Documents

```yaml
ocr:
  backend: "textract"
  max_workers: 20  # Increase for more parallelism
  image:
    dpi: 150  # Balance quality vs processing time
    target_width: 1024
    target_height: 1024
  features:
    - name: "LAYOUT"
    - name: "TABLES"
    - name: "FORMS"
```

### Processing Flow for Large PDFs

1. **Document Load**: PyMuPDF loads PDF structure
2. **Page Distribution**: ThreadPoolExecutor creates 20 concurrent workers
3. **Parallel OCR**: Each page processed independently via Textract
4. **Result Assembly**: Pages sorted and combined into document structure
5. **Classification**: Holistic analysis creates logical document sections
6. **Parallel Extraction**: Sections processed concurrently (MaxConcurrency: 10)

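Steps 2 to 4 of the flow above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the accelerator's implementation: `ocr_page` is a hypothetical stand-in for the real per-page Textract call, and it fails on page 7 on purpose to show that one bad page does not stop the rest.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ocr_page(page_number: int) -> dict:
    """Hypothetical stand-in for a per-page Textract call."""
    if page_number == 7:  # simulate a single failing page
        raise RuntimeError("simulated OCR failure")
    return {"page": page_number, "text": f"text of page {page_number}"}


def ocr_document(page_count: int, max_workers: int = 20) -> tuple[list[dict], dict[int, str]]:
    """OCR pages 1..page_count concurrently.

    Returns the successful results sorted by page, plus errors keyed by page.
    """
    results: list[dict] = []
    errors: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_page, n): n for n in range(1, page_count + 1)}
        for future in as_completed(futures):
            page_number = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # a failed page does not stop the others
                errors[page_number] = str(exc)
    # Completion order is arbitrary, so restore page order before assembly.
    results.sort(key=lambda r: r["page"])
    return results, errors


pages, failed = ocr_document(page_count=50)
print(len(pages), sorted(failed))
```

Collecting failures in a separate mapping keeps the successful pages usable and leaves a clear list of pages to retry.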
## Performance Implications

### For 500-Page Document:
- **Pattern 1 (BDA)**: Single job, BDA-managed processing
- **Pattern 2/3**: Roughly 25 waves of 20 pages each (500 / 20) with the default 20 workers; ThreadPoolExecutor starts the next page as soon as a worker is free, so these are not strict batches

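The Pattern 2/3 estimate is simple ceiling division, which the sketch below spells out. The helper names `ocr_waves` and `effective_workers` are hypothetical, and the wave count is only a rough lower bound because real page latencies vary.

```python
import math


def ocr_waves(page_count: int, max_workers: int = 20) -> int:
    """Approximate number of 'waves' if every worker handles one page at a time."""
    return math.ceil(page_count / max_workers)


def effective_workers(page_count: int, max_workers: int = 20) -> int:
    """Never start more workers than there are pages (and always at least one)."""
    return max(1, min(page_count, max_workers))


print(ocr_waves(500))        # 25
print(ocr_waves(510))        # 26
print(effective_workers(8))  # 8
```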
### Bottlenecks to Consider:
1. **Textract Rate Limits**: May need to adjust max_workers
2. **Memory Usage**: 20 concurrent pages require significant memory
3. **S3 Operations**: Parallel uploads/downloads for page results
4. **Lambda Timeouts**: Ensure sufficient timeout for large documents

## Next Steps and Considerations

### For Customer Implementation:
1. **Choose Pattern 2 or 3** for large document processing
2. **Configure max_workers** based on Textract limits and memory
3. **Use holistic classification** to handle form boundaries
4. **Monitor memory usage** during processing
5. **Consider document splitting** if single PDF approach is problematic

### Optimization Opportunities:
- **Adaptive Concurrency**: Adjust workers based on document size
- **Progressive Processing**: Start classification while OCR continues
- **Caching Strategy**: Cache page images for reprocessing
- **Error Recovery**: Implement page-level retry with exponential backoff
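
As one possible shape for the page-level retry mentioned under Error Recovery, the sketch below wraps a single-page call in exponential backoff with full jitter. The `retry_with_backoff` helper, its defaults, and the `flaky_page_ocr` function are all hypothetical. It catches every exception for brevity; real code would more likely retry only throttling or transient errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is reached.

    The delay ceiling doubles after each failure, is capped at max_delay,
    and the actual wait is drawn uniformly from [0, ceiling] (full jitter).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_page_ocr() -> str:
    """Hypothetical page call that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return "page text"


# Pass a no-op sleep so the example runs instantly.
result = retry_with_backoff(flaky_page_ocr, sleep=lambda _: None)
print(result, attempts["count"])
```

Injecting `sleep` as a parameter keeps the helper easy to test, and the jitter spreads out retries when many pages are throttled at the same moment.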
