
Commit fc66539

Merge branch 'fix/cache-page-classes-on-throttle-failure' into 'develop'
add caching of successful page classes on thread (page) failure to classifier,...

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!153
2 parents 26fa535 + f644386 commit fc66539

18 files changed: +586 -94 lines changed


CHANGELOG.md

Lines changed: 14 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -5,6 +5,20 @@ SPDX-License-Identifier: MIT-0
55

66
## [Unreleased]
77

8+
### Added
9+
10+
- **DynamoDB Caching for Resilient Classification**
11+
- Added optional DynamoDB caching to the multimodal page-level classification service to improve efficiency and resilience
12+
- Cache successful page classification results to avoid redundant processing during retries when some pages fail due to throttling
13+
- Exception-safe caching preserves successful work even when individual threads or the overall process fails
14+
- Configurable via `cache_table` parameter or `CLASSIFICATION_CACHE_TABLE` environment variable
15+
- Cache entries scoped to document ID and workflow execution ARN with automatic TTL cleanup (24 hours)
16+
- Significant cost reduction and improved retry performance for large multi-page documents
17+
18+
### Fixed
19+
- "Use as Evaluation Baseline" incorrectly sets document status back to QUEUED. It should remain as COMPLETED.
20+
21+
822
## [0.3.1]
923

1024
### Added

config_library/pattern-2/few_shot_example_with_multimodal_page_classification/config.yaml

Lines changed: 7 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -74,7 +74,7 @@ classes:
7474
"signature": "Will E. Clark",
7575
"cc": null,
7676
"reference_number": "TNJB 0008497"
77-
imagePath: config_library/pattern-2/few_shot_example/example-images/letter1.jpg
77+
imagePath: config_library/pattern-2/few_shot_example_with_multimodal_page_classification/example-images/letter1.jpg
7878
- classPrompt: This is an example of the class 'letter'
7979
name: Letter2
8080
attributesPrompt: |-
@@ -89,7 +89,7 @@ classes:
8989
"signature": "Bill",
9090
"cc": null,
9191
"reference_number": null
92-
imagePath: config_library/pattern-2/few_shot_example/example-images/letter2.png
92+
imagePath: config_library/pattern-2/few_shot_example_with_multimodal_page_classification/example-images/letter2.png
9393
- name: form
9494
description: >-
9595
A structured document with labeled fields, checkboxes, or blanks requiring
@@ -464,7 +464,7 @@ classes:
464464
"priority": null,
465465
"thread_id": null,
466466
"message_id": null
467-
imagePath: config_library/pattern-2/few_shot_example/example-images/email1.jpg
467+
imagePath: config_library/pattern-2/few_shot_example_with_multimodal_page_classification/example-images/email1.jpg
468468
- name: questionnaire
469469
description: >-
470470
A survey instrument containing numbered questions with multiple choice,
@@ -636,7 +636,7 @@ classes:
636636
"account_name": ["Checking", "Savings"],
637637
"account_number": ["003525801543","352580154336"],
638638
"transactions": [{"Date": "2/6/2020", "Description": "Food Purchase - AnyCompany Restaurant - 1194989245", "Amount": "-171"}]
639-
imagePath: config_library/pattern-2/few_shot_example/example-images/bank-statement-pages/
639+
imagePath: config_library/pattern-2/few_shot_example_with_multimodal_page_classification/example-images/bank-statement-pages/
640640

641641
classification:
642642
classificationMethod: multimodalPageLevelClassification
@@ -672,11 +672,11 @@ classification:
672672
{CLASS_NAMES_AND_DESCRIPTIONS}
673673
674674
675-
<few_shot_examples>
675+
<few_shot_example_with_multimodal_page_classifications>
676676
677-
{FEW_SHOT_EXAMPLES}
677+
{few_shot_example_with_multimodal_page_classificationS}
678678
679-
</few_shot_examples>
679+
</few_shot_example_with_multimodal_page_classifications>
680680
681681
682682
<<CACHEPOINT>>

lib/idp_common_pkg/idp_common/classification/README.md

Lines changed: 131 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -16,6 +16,7 @@ This module provides document classification capabilities for the IDP Accelerato
1616
- Structured data models for results
1717
- Grouping of pages into sections by classification
1818
- Comprehensive error handling and retry mechanisms
19+
- **DynamoDB caching for resilient page-level classification**
1920

2021
## Usage Example
2122

@@ -226,6 +227,136 @@ def handler(event, context):
226227
- `ClassificationResult`: Overall result of a classification operation
227228
- `Document`: Core document data model used throughout the IDP pipeline
228229

230+
## DynamoDB Caching for Resilient Classification
231+
232+
The classification service now supports optional DynamoDB caching to improve efficiency and resilience when processing documents with multiple pages. This feature addresses throttling scenarios where some pages succeed while others fail, avoiding the need to reclassify already successful pages on retry.
233+
234+
### How It Works
235+
236+
1. **Cache Check**: Before processing, the service checks for cached classification results for the document
237+
2. **Selective Processing**: Only pages without cached results are classified
238+
3. **Exception-Safe Caching**: Successful page results are cached even when other pages fail
239+
4. **Retry Efficiency**: Subsequent retries only process previously failed pages
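
The four steps above can be sketched in a few lines. This is a minimal illustration, not the service's actual implementation: `classify_with_cache`, the `classify_page` callable, and the plain `dict` standing in for the DynamoDB table are all hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def classify_with_cache(page_ids, classify_page, cache):
    """Classify only uncached pages; keep successes even if some pages fail.

    `cache` is a plain dict used here as a stand-in for the DynamoDB table.
    """
    results = dict(cache)                                  # 1. cache check
    pending = [p for p in page_ids if p not in results]    # 2. selective processing
    errors = {}

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(classify_page, p): p for p in pending}
        for future in as_completed(futures):
            page_id = futures[future]
            try:
                results[page_id] = future.result()
            except Exception as exc:                       # e.g. a throttling error
                errors[page_id] = exc

    cache.update(results)                                  # 3. exception-safe caching
    if errors:
        # Raised only after successful pages have been written to the cache,
        # so a retry (step 4) has to reprocess just the failed pages.
        raise RuntimeError(f"pages failed: {sorted(errors)}")
    return results
```

Calling the function a second time with the same `cache` submits only the pages that failed earlier, which is the retry behavior described in step 4.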
240+
241+
### Configuration
242+
243+
#### Via Constructor Parameter
244+
```python
245+
from idp_common import classification, get_config
246+
247+
config = get_config()
248+
service = classification.ClassificationService(
249+
region="us-east-1",
250+
config=config,
251+
backend="bedrock",
252+
cache_table="classification-cache-table" # Enable caching
253+
)
254+
```
255+
256+
#### Via Environment Variable
257+
```bash
258+
export CLASSIFICATION_CACHE_TABLE=classification-cache-table
259+
```
260+
261+
```python
262+
# Cache table will be automatically detected from environment
263+
service = classification.ClassificationService(
264+
region="us-east-1",
265+
config=config,
266+
backend="bedrock"
267+
)
268+
```
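
One way the two configuration paths could be resolved is sketched below. The helper name `resolve_cache_table` is hypothetical, and the precedence (explicit parameter first, environment variable second) is an assumption; the text above only states that both options exist.

```python
import os


def resolve_cache_table(cache_table=None, environ=None):
    """Return the cache table name, or None when caching is not configured.

    Assumption: an explicit `cache_table` argument wins over the
    CLASSIFICATION_CACHE_TABLE environment variable.
    """
    environ = os.environ if environ is None else environ
    return cache_table or environ.get("CLASSIFICATION_CACHE_TABLE") or None
```

When neither source is set, the helper returns `None`, which matches the feature being optional.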
269+
270+
### DynamoDB Table Schema
271+
272+
The cache uses the following DynamoDB table structure:
273+
274+
- **Primary Key (PK)**: `classcache#{document_id}#{workflow_execution_arn}`
275+
- **Sort Key (SK)**: `none`
276+
- **Attributes**:
277+
- `page_classifications` (String): JSON-encoded successful page results
278+
- `cached_at` (String): Unix timestamp of cache creation
279+
- `document_id` (String): Document identifier
280+
- `workflow_execution_arn` (String): Workflow execution ARN
281+
- `ExpiresAfter` (Number): TTL attribute for automatic cleanup (24 hours)
282+
283+
#### Example DynamoDB Item
284+
```json
285+
{
286+
"PK": "classcache#doc-123#arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
287+
"SK": "none",
288+
"page_classifications": "{\"1\":{\"doc_type\":\"invoice\",\"confidence\":1.0,\"metadata\":{\"metering\":{...}},\"image_uri\":\"s3://...\",\"text_uri\":\"s3://...\",\"raw_text_uri\":\"s3://...\"},\"2\":{...}}",
289+
"cached_at": "1672531200",
290+
"document_id": "doc-123",
291+
"workflow_execution_arn": "arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
292+
"ExpiresAfter": 1672617600
293+
}
294+
```
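
An item of the shape shown above could be assembled as follows. This is a sketch under stated assumptions: `build_cache_item` and `CACHE_TTL_SECONDS` are illustrative names, and the item is returned as a plain dictionary rather than written to DynamoDB.

```python
import json
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as described in the schema


def build_cache_item(document_id, workflow_execution_arn, page_classifications, now=None):
    """Build a cache item scoped to one document and one workflow execution."""
    now = int(time.time()) if now is None else int(now)
    return {
        "PK": f"classcache#{document_id}#{workflow_execution_arn}",
        "SK": "none",
        # Successful page results are stored as a JSON-encoded string
        "page_classifications": json.dumps(page_classifications),
        "cached_at": str(now),
        "document_id": document_id,
        "workflow_execution_arn": workflow_execution_arn,
        # Epoch seconds after which DynamoDB TTL may delete the item
        "ExpiresAfter": now + CACHE_TTL_SECONDS,
    }
```

With `now=1672531200`, the `ExpiresAfter` value is `1672617600`, the same pair of timestamps used in the example item.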
295+
296+
### Benefits
297+
298+
- **Cost Reduction**: Avoids redundant API calls to Bedrock/SageMaker for already-classified pages
299+
- **Improved Resilience**: Handles partial failures gracefully during concurrent processing
300+
- **Faster Retries**: Subsequent attempts only process failed pages, not the entire document
301+
- **Automatic Cleanup**: TTL ensures cache entries don't accumulate indefinitely
302+
- **Thread Safety**: Safe for concurrent page processing within the same document
303+
304+
### Example: Resilient Processing Flow
305+
306+
```python
307+
from idp_common import classification, get_config
308+
from idp_common.models import Document
309+
310+
config = get_config()
311+
service = classification.ClassificationService(
312+
region="us-east-1",
313+
config=config,
314+
backend="bedrock",
315+
cache_table="classification-cache-table"
316+
)
317+
318+
# Create document with 5 pages
319+
document = Document(
320+
id="doc-123",
321+
workflow_execution_arn="arn:aws:states:us-east-1:123456789012:execution:MyWorkflow:abc-123",
322+
pages={
323+
"1": {...},
324+
"2": {...},
325+
"3": {...},
326+
"4": {...},
327+
"5": {...}
328+
}
329+
)
330+
331+
try:
332+
# First attempt: pages 1,2,4 succeed, pages 3,5 fail due to throttling
333+
document = service.classify_document(document)
334+
except Exception as e:
335+
# Pages 1,2,4 are cached automatically before exception is raised
336+
print(f"Classification failed: {e}")
337+
338+
try:
339+
# Retry: only pages 3,5 are processed (1,2,4 loaded from cache)
340+
document = service.classify_document(document)
341+
print("Document classified successfully on retry")
342+
except Exception as e:
343+
print(f"Retry failed: {e}")
344+
```
345+
346+
### Cache Lifecycle
347+
348+
1. **Creation**: Cache entries are created when `classify_document()` completes successfully or encounters exceptions
349+
2. **Retrieval**: Cache is checked at the start of each `classify_document()` call
350+
3. **Update**: Cache entries are updated with new successful results from each processing attempt
351+
4. **Expiration**: Entries automatically expire after 24 hours via DynamoDB TTL
352+
353+
### Important Notes
354+
355+
- Caching only applies to the `classify_document()` method, not individual `classify_page()` calls
356+
- Cache entries are scoped to specific document and workflow execution combinations
357+
- Only successful page classifications (without errors in metadata) are cached
358+
- The cache is transparent - existing code continues to work without modifications
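
The rule that only successful pages are cached, combined with the update step of the cache lifecycle, might look like the sketch below. The function name `merge_successful_pages` is hypothetical, and treating an `"error"` key in `metadata` as the failure marker is an assumption; the notes above do not name the exact key.

```python
def merge_successful_pages(cached, new_results, error_key="error"):
    """Merge new page results into the cached ones, skipping failed pages.

    Assumption: a failed page is marked by `error_key` inside its metadata.
    """
    merged = dict(cached)
    for page_id, result in new_results.items():
        metadata = result.get("metadata") or {}
        if error_key not in metadata:
            merged[page_id] = result
    return merged
```

Pages that failed in the current attempt are left out of the merged mapping, so the next call still sees them as uncached.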
359+
229360
## Backend Options
230361

231362
### Bedrock Backend
