Commit 82fd994

Merge branch 'develop' v0.3.5
Author: Bob Strahan
2 parents 0e86392 + 134674f, commit 82fd994

106 files changed: +25058 -1545 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -15,4 +15,5 @@ build/
 __pycache__
 *.code-workspace
 .ruff_cache
+.kiro
 rvl_cdip_*

.gitlab-ci.yml

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,8 @@ developer_tests:
     - apt-get update -y
     - apt-get install make -y
     - pip install ruff
+    # Install test dependencies
+    - cd lib/idp_common_pkg && pip install -e ".[test]" && cd ../..
 
   script:
     - make lint-cicd

1751146101381_classification_state.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

CHANGELOG.md

Lines changed: 70 additions & 0 deletions
@@ -5,6 +5,76 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+## [0.3.5]
+
+### Added
+- **Human-in-the-Loop (HITL) Support - Pattern 1**
+  - Added comprehensive Human-in-the-Loop review capabilities using Amazon SageMaker Augmented AI (A2I)
+  - **Key Features**:
+    - Automatic triggering when extraction confidence falls below configurable threshold
+    - Integration with SageMaker A2I Review Portal for human validation and correction
+    - Configurable confidence threshold through Web UI Portal Configuration tab (0.0-1.0 range)
+    - Seamless result integration with human-verified data automatically updating source results
+  - **Workflow Integration**:
+    - HITL tasks created automatically when confidence thresholds are not met
+    - Reviewers can validate correct extractions or make necessary corrections through the Review Portal
+    - Document processing continues with human-verified data after review completion
+  - **Configuration Management**:
+    - `EnableHITL` parameter for feature toggle
+    - Confidence threshold configurable via Web UI without stack redeployment
+    - Support for existing private workforce work teams via input parameter
+  - **CloudFormation Output**: Added `SageMakerA2IReviewPortalURL` for easy access to review portal
+  - **Known Limitations**: Current A2I version cannot provide direct hyperlinks to specific document tasks; template updates require resource recreation
+- **Document Compression for Large Documents - all patterns**
+  - Added automatic compression support to handle large documents and avoid exceeding Step Functions payload limits (256KB)
+  - **Key Features**:
+    - Automatic compression (default trigger threshold of 0KB enables compression by default)
+    - Transparent handling of both compressed and uncompressed documents in Lambda functions
+    - Temporary S3 storage for compressed document state with automatic cleanup via lifecycle policies
+  - **New Utility Methods**:
+    - `Document.load_document()`: Automatically detects and decompresses document input from Lambda events
+    - `Document.serialize_document()`: Automatically compresses large documents for Lambda responses
+    - `Document.compress()` and `Document.decompress()`: Compression/decompression methods
+  - **Lambda Function Integration**: All relevant Lambda functions updated to use compression utilities
+  - **Resolves Step Functions Errors**: Eliminates "result with a size exceeding the maximum number of bytes service limit" errors for large multi-page documents
+- **Multi-Backend OCR Support - Pattern 2 and 3**
+  - Textract Backend (default): Existing AWS Textract functionality
+  - Bedrock Backend: New LLM-based OCR using Claude/Nova models
+  - None Backend: Image-only processing without OCR
+- **Bedrock OCR Integration - Pattern 2 and 3**
+  - Customizable system and task prompts for OCR optimization
+  - Better handling of complex documents, tables, and forms
+  - Layout preservation capabilities
+- **Image Preprocessing - Pattern 2 and 3**
+  - Adaptive Binarization: Improves OCR accuracy on documents with:
+    - Uneven lighting or shadows
+    - Low contrast text
+    - Background noise or gradients
+  - Optional feature with configurable enable/disable
+- **YAML Parsing Support for LLM Responses - Pattern 2 and 3**
+  - Added comprehensive YAML parsing capabilities to complement existing JSON parsing functionality
+  - New `extract_yaml_from_text()` function with robust multi-strategy YAML extraction:
+    - YAML in ```yaml and ```yml code blocks
+    - YAML with document markers (---)
+    - Pattern-based YAML detection using indentation and key indicators
+  - New `detect_format()` function for automatic format detection returning 'json', 'yaml', or 'unknown'
+  - New unified `extract_structured_data_from_text()` wrapper function that automatically detects and parses both JSON and YAML formats
+  - **Token Efficiency**: YAML typically uses 10-30% fewer tokens than equivalent JSON due to more compact syntax
+  - **Service Integration**: Updated classification service to use the new unified parsing function with automatic fallback between formats
+  - **Comprehensive Testing**: Added 39 new unit tests covering all YAML extraction strategies, format detection, and edge cases
+  - **Backward Compatibility**: All existing JSON functionality preserved unchanged, new functionality is purely additive
+  - **Intelligent Fallback**: Robust fallback mechanism handles cases where preferred format fails (e.g., JSON requested as YAML falls back to JSON)
+  - **Production Ready**: Handles malformed content gracefully, comprehensive error handling and logging
+  - **Example Notebook**: Added `notebooks/examples/step3_extraction_using_yaml.ipynb` demonstrating YAML-based extraction with automatic format detection and token efficiency benefits
+
+### Fixed
+- **Enhanced JSON Extraction from LLM Responses (Issue #16)**
+  - Modularized duplicate `_extract_json()` functions across classification, extraction, summarization, and assessment services into a common `extract_json_from_text()` utility function
+  - Improved multi-line JSON handling with literal newlines in string values that previously caused parsing failures
+  - Added robust JSON validation and multiple fallback strategies for better extraction reliability
+  - Enhanced string parsing with proper escape sequence handling for quotes and newlines
+  - Added comprehensive unit tests covering various JSON formats including multi-line scenarios
+
 ## [0.3.4]
 
 ### Added
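
The format detection described in the 0.3.5 changelog entry (a function returning 'json', 'yaml', or 'unknown') could look roughly like the sketch below. This is not the code from `idp_common`: the name `detect_format_sketch`, the regular expressions, and the order of the checks are all assumptions made for illustration, and the heuristics are deliberately simple (a prose line such as "Note: ..." would be misread as YAML).

```python
import json
import re

# Fenced blocks such as ```json, ```yaml or ```yml (hypothetical pattern).
_FENCE = re.compile(r"```(json|yaml|yml)\s*\n(.*?)```", re.DOTALL)
# A line that starts with "key:" is treated as a YAML indicator (assumption).
_YAML_KEY = re.compile(r"^[A-Za-z_][\w\- ]*:(\s|$)", re.MULTILINE)


def detect_format_sketch(text: str) -> str:
    """Return 'json', 'yaml', or 'unknown' for an LLM response (illustrative only)."""
    fenced = _FENCE.search(text)
    if fenced:
        # Trust an explicit language tag on a fenced block.
        return "json" if fenced.group(1) == "json" else "yaml"

    stripped = text.strip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass  # Not valid JSON; fall through to the YAML checks.

    # A leading document marker or a key-like line suggests YAML.
    if stripped.startswith("---") or _YAML_KEY.search(stripped):
        return "yaml"
    return "unknown"


if __name__ == "__main__":
    print(detect_format_sketch('{"class": "invoice"}'))
    print(detect_format_sketch("class: invoice\npages: 3"))
    print(detect_format_sketch("no structured data here"))
```

A wrapper in the spirit of `extract_structured_data_from_text()` would call a detector like this first and then hand the text to the matching parser, falling back to the other format if parsing fails.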

Makefile

Lines changed: 1 addition & 1 deletion
@@ -46,4 +46,4 @@ commit: lint test
 	export COMMIT_MESSAGE="$(shell q chat --no-interactive --trust-all-tools "Understand pending local git change and changes to be committed, then infer a commit message. Return this commit message only" | tail -n 1 | sed 's/\x1b\[[0-9;]*m//g')" && \
 	git add . && \
 	git commit -am "$${COMMIT_MESSAGE}" && \
-	git push
+	git push

README.md

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ After deployment, you can quickly process a document and view results:
 
 2. **Use Sample Documents**:
    - For Pattern 1 (BDA): Use [samples/lending_package.pdf](./samples/lending_package.pdf)
-   - For Patterns 2 and 3: Use [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
+   - For Patterns 2 and 3: Use [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
 
 3. **Monitor Processing**:
    - **Via Web UI**: Track document status on the dashboard

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.4
+0.3.5-delta

config_library/pattern-1/default/config.yaml

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 # SPDX-License-Identifier: MIT-0
 
 notes: Processing configuration in BDA project.
+assessment:
+  default_confidence_threshold: '0.8'
 summarization:
   top_p: '0.1'
   max_tokens: '4096'
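
The new `assessment.default_confidence_threshold` value is stored as a string, and the changelog says HITL review is triggered when extraction confidence falls below it (valid range 0.0-1.0). A minimal sketch of that comparison follows. The function name `fields_below_threshold`, the shape of the `confidences` mapping, and the strict `<` comparison are assumptions; the actual Pattern 1 logic is not shown in this commit excerpt.

```python
from typing import Dict, List


def fields_below_threshold(confidences: Dict[str, float], config: dict) -> List[str]:
    """Return the field names whose confidence is below the configured threshold.

    Hypothetical helper: `confidences` maps a field name to a score in [0, 1].
    """
    # The config library stores the value as a quoted string, e.g. '0.8'.
    raw = config.get("assessment", {}).get("default_confidence_threshold", "0.8")
    threshold = float(raw)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"Confidence threshold must be in [0.0, 1.0], got {threshold}")

    # Any field under the threshold would be a candidate for human review.
    return sorted(name for name, score in confidences.items() if score < threshold)


if __name__ == "__main__":
    cfg = {"assessment": {"default_confidence_threshold": "0.8"}}
    scores = {"borrower_name": 0.95, "loan_amount": 0.62, "ssn": 0.80}
    print(fields_below_threshold(scores, cfg))
```

Under these assumptions a non-empty result would mean a review task is created; an empty list would let processing continue without human input.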

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 7 additions & 0 deletions
@@ -3,8 +3,15 @@
 
 notes: Default settings
 ocr:
+  backend: "textract" # Default to Textract for backward compatibility
+  model_id: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
+  system_prompt: "You are an expert OCR system. Extract all text from the provided image accurately, preserving layout where possible."
+  task_prompt: "Extract all text from this document image. Preserve the layout, including paragraphs, tables, and formatting."
   features:
     - name: LAYOUT
+  image:
+    target_width: '951'
+    target_height: '1268'
 classes:
   - name: Bank Statement
     description: Monthly bank account statement

config_library/pattern-2/default/config.yaml

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,10 @@
 
 notes: Default settings
 ocr:
+  backend: "textract" # Default to Textract for backward compatibility
+  model_id: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
+  system_prompt: "You are an expert OCR system. Extract all text from the provided image accurately, preserving layout where possible."
+  task_prompt: "Extract all text from this document image. Preserve the layout, including paragraphs, tables, and formatting."
   features:
     - name: LAYOUT
     - name: TABLES
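
The new `ocr` keys suggest that a consumer picks a backend and, for the LLM-based one, also reads the model ID and prompts. The sketch below shows one way such a lookup might be written once the YAML has been loaded into a dict. Everything here is hypothetical: `OcrSettings` and `resolve_ocr_settings` are illustrative names, and the backend spellings `"bedrock"` and `"none"` are assumed from the changelog wording, since only `"textract"` appears in the configuration files in this commit.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spellings; only "textract" is confirmed by the config in this commit.
SUPPORTED_BACKENDS = {"textract", "bedrock", "none"}


@dataclass(frozen=True)
class OcrSettings:
    backend: str
    model_id: Optional[str] = None
    system_prompt: Optional[str] = None
    task_prompt: Optional[str] = None


def resolve_ocr_settings(config: dict) -> OcrSettings:
    """Validate the `ocr` section of an already-parsed config dict (illustrative)."""
    ocr = config.get("ocr") or {}
    # Default to Textract, matching the comment in the config file.
    backend = str(ocr.get("backend", "textract")).lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported OCR backend: {backend!r}")

    if backend != "bedrock":
        # Model ID and prompts are only relevant to the LLM-based backend.
        return OcrSettings(backend=backend)

    if not ocr.get("model_id"):
        raise ValueError("The bedrock backend requires ocr.model_id")
    return OcrSettings(
        backend=backend,
        model_id=ocr["model_id"],
        system_prompt=ocr.get("system_prompt"),
        task_prompt=ocr.get("task_prompt"),
    )


if __name__ == "__main__":
    print(resolve_ocr_settings({"ocr": {"backend": "textract"}}))
```

Keeping the default at `"textract"` means a config written before this commit, with no `backend` key, resolves the same way as before.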
