
Commit 5ae96bc

Author: Taniya Mathur
Commit message: resolving merge conflict and improve logging
2 parents 9240ac6 + 9a17ed5

File tree

93 files changed (+8699, -2628 lines)


.gitignore

Lines changed: 1 addition & 0 deletions

@@ -22,3 +22,4 @@ rvl_cdip_*
 notebooks/examples/data
 .idea/
 .dsr/
+*tmp-dev-assets*

CHANGELOG.md

Lines changed: 26 additions & 59 deletions

@@ -7,74 +7,41 @@ SPDX-License-Identifier: MIT-0
 
 ### Added
 
-- **Modern Python-Based Publishing System**
-  - **Complete Rewrite**: Replaced legacy `publish.sh` bash script (583 lines) with modern `publish.py` Python script (1,294 lines)
-  - **Enhanced User Experience**: Rich console interface with progress bars, spinners, and colored output using Rich library
-  - **Concurrent Processing**: Multi-threaded artifact building and uploading for significantly improved performance
-  - **Cross-Platform Support**: Native support for Linux, macOS, and Windows environments
-  - **Intelligent Caching**: Advanced checksum-based build optimization to skip unnecessary rebuilds
-  - **Robust Error Handling**: Comprehensive error handling with detailed error messages and recovery suggestions
-  - **Resource Management**: Automatic cleanup of temporary files and proper resource management
-  - **CLI Interface**: Modern command-line interface using Typer with clear help documentation and parameter validation
-  - **Key Features**:
-    - Parallel S3 uploads with thread-safe progress tracking
-    - CloudFormation template validation and processing
-    - Automatic bucket creation with proper permissions and lifecycle policies
-    - Build artifact optimization with dependency tracking
-    - Comprehensive logging and debugging capabilities
-    - Memory-efficient processing for large artifacts
-
-- **Comprehensive Unit Test Suite for Publishing System**
-  - **Extensive Test Coverage**: 1,621 lines of unit tests covering 95%+ of publish.py functionality
-  - **Testing Framework**: Uses pytest with proper unit test markers (`@pytest.mark.unit`) following project testing standards
-  - **Mock Integration**: Comprehensive mocking of AWS services (S3, CloudFormation) for isolated unit testing without external dependencies
-  - **Cross-Platform Testing**: Tests for Linux, macOS, and Windows-specific functionality and edge cases
-  - **Error Scenario Coverage**: Tests for network failures, permission errors, malformed templates, and concurrent access scenarios
-  - **Performance Testing**: Tests for concurrent operations, memory usage, and resource cleanup
-  - **Key Test Categories**:
-    - IDPPublisher initialization and configuration
-    - S3 operations (bucket creation, uploads, permissions)
-    - CloudFormation template processing and validation
-    - Build system integration and checksum validation
-    - Error handling and recovery mechanisms
-    - Concurrent publishing workflows and thread safety
-
-- **Windows Development Environment Setup**
-  - **Automated Setup Script**: New `scripts/dev_setup.bat` (570 lines) for complete Windows development environment configuration
-  - **Comprehensive Installation**: Automated installation of all required development tools and dependencies
-  - **Tool Installation**:
-    - AWS CLI with proper configuration
-    - Python 3.13 (required for compatibility with latest dependencies)
-    - Node.js for React UI development
-    - Git for version control
-    - Docker for containerized development
-    - AWS SAM CLI for serverless application development
-    - Python dependencies (boto3, numpy 2.3.2, typer, rich)
-  - **Environment Configuration**: Automatic AWS credentials setup and project cloning with dependency installation
-  - **Administrator Privileges**: Proper handling of Windows administrator requirements for system-level installations
-  - **Error Handling**: Comprehensive error checking and user guidance throughout the setup process
+- **AWS GovCloud Support with Automated Template Generation**
+  - Added GovCloud compatibility through `scripts/generate_govcloud_template.py` script
+  - **ARN Partition Compatibility**: All templates updated to use `arn:${AWS::Partition}:` for both commercial and GovCloud regions
+  - **Headless Operation**: Automatically removes UI-related resources (CloudFront, AppSync, Cognito, WAF) for GovCloud deployment
+  - **Core Functionality Preserved**: All 3 processing patterns and complete 6-step pipeline (OCR, Classification, Extraction, Assessment, Summarization, Evaluation) remain fully functional
+  - **Automated Workflow**: Single script orchestrates build + GovCloud template generation + S3 upload with deployment URLs
+  - **Enterprise Ready**: Enables headless document processing for government and enterprise environments requiring GovCloud compliance
+  - **Documentation**: New `docs/govcloud-deployment.md` with deployment guide, architecture differences, and access methods
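The partition rewrite that this entry attributes to `scripts/generate_govcloud_template.py` can be sketched as follows. This is an illustrative assumption about the mechanism, not the script's actual implementation; the function name is hypothetical.

```python
# Sketch of an ARN partition rewrite for CloudFormation template text.
# Hardcoded commercial-partition prefixes become the partition-agnostic
# form, so the same template works in both commercial and GovCloud regions.
def rewrite_partitions(template_text: str) -> str:
    # Lines that already use arn:${AWS::Partition}: contain no literal
    # "arn:aws:" substring, so they are left untouched.
    return template_text.replace("arn:aws:", "arn:${AWS::Partition}:")

print(rewrite_partitions("Resource: arn:aws:s3:::my-bucket/*"))
# Resource: arn:${AWS::Partition}:s3:::my-bucket/*
```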
+
+- **Pattern-2 and Pattern-3 Assessment now generate geometry (bounding boxes) for visualization in UI 'Visual Editor' (parity with Pattern-1)**
+  - Added comprehensive spatial localization capabilities to both regular and granular assessment services
+  - **Automatic Processing**: When LLM provides bbox coordinates, automatically converts to UI-compatible (Visual Edit) geometry format without any configuration
+  - **Universal Support**: Works with all attribute types - simple attributes, nested group attributes (e.g., CompanyAddress.State), and list attributes
+  - **Enhanced Prompts**: Updated assessment task prompts with spatial-localization-guidelines requesting bbox coordinates in normalized 0-1000 scale
+  - **Demo Notebooks**: Assessment notebooks now showcase automatic bounding box processing
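The bbox-to-geometry conversion this entry describes can be sketched as below. The output keys (`left`, `top`, `width`, `height`) are illustrative assumptions, not the actual Visual Edit schema; only the normalized 0-1000 input scale is stated in the changelog.

```python
def bbox_to_geometry(bbox, page, page_width_px, page_height_px):
    """Convert an LLM-provided bbox on the normalized 0-1000 scale into
    pixel coordinates for one page. Output keys are illustrative; the
    actual UI geometry format may differ."""
    x1, y1, x2, y2 = bbox
    if not (0 <= x1 < x2 <= 1000 and 0 <= y1 < y2 <= 1000):
        raise ValueError(f"invalid normalized bbox: {bbox}")
    # Scale normalized units to pixels for this page's dimensions.
    scale_x = page_width_px / 1000.0
    scale_y = page_height_px / 1000.0
    return {
        "page": page,
        "left": x1 * scale_x,
        "top": y1 * scale_y,
        "width": (x2 - x1) * scale_x,
        "height": (y2 - y1) * scale_y,
    }

# A 1700x2200 px page (a typical 200 dpi letter-size scan):
print(bbox_to_geometry([100, 200, 300, 250], page=1,
                       page_width_px=1700, page_height_px=2200))
```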
+
+- **New Python-Based Publishing System**
+  - Replaced `publish.sh` bash script with new `publish.py` Python script
+  - Rich console interface with progress bars, spinners, and colored output using Rich library
+  - Multi-threaded artifact building and uploading for significantly improved performance
+  - Native support for Linux, macOS, and Windows environments
+
+- **Windows Development Environment Setup Guide and Helper Script**
+  - New `scripts/dev_setup.bat` (570 lines) for complete Windows development environment configuration
 
 - **OCR Service Default Image Sizing for Resource Optimization**
   - Implemented automatic default image size limits (951×1268) when no image sizing configuration is provided
   - **Key Benefits**: Reduction in vision model token consumption, prevents OutOfMemory errors during concurrent processing, improves processing speed and reduces bandwidth usage
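The default-sizing behavior described in that entry can be sketched as an aspect-preserving downscale to the 951×1268 limits. This is a hedged sketch of the stated behavior, not the actual OCR service code; the function name is hypothetical.

```python
def fit_within(width, height, max_w=951, max_h=1268):
    """Downscale (never upscale) an image's dimensions to fit inside the
    default 951x1268 limit, preserving aspect ratio. Sketch only."""
    # Clamp the scale factor at 1.0 so small images are left unchanged.
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)

print(fit_within(3000, 2000))  # (951, 634)
print(fit_within(800, 1000))   # unchanged: already within limits
```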

 ### Changed
 
-- **Publishing Workflow Modernization**
-  - Migrated from bash-based to Python-based publishing system for better maintainability and cross-platform support
-  - Improved build performance through intelligent caching and concurrent processing
-  - Enhanced developer experience with rich console output and clear progress indicators
-  - Better error diagnostics and troubleshooting capabilities
-
 - **Reverted to python3.12 runtime to resolve build package dependency problems**
 
-### Technical Improvements
-
-- **Build System Optimization**: Checksum-based incremental builds reduce unnecessary processing time
-- **Memory Management**: Efficient handling of large artifacts and proper resource cleanup
-- **Thread Safety**: Concurrent operations with proper synchronization and error handling
-- **Code Quality**: Comprehensive unit testing ensures reliability and maintainability of the publishing system
-
+### Fixed
+- **Improved Visual Edit bounding box position when using image zoom or pan**
 
Makefile

Lines changed: 29 additions & 6 deletions

@@ -14,7 +14,7 @@ test:
 	$(MAKE) -C lib/idp_common_pkg test
 
 # Run both linting and formatting in one command
-lint: ruff-lint format
+lint: ruff-lint format check-arn-partitions
 
 # Run linting checks and fix issues automatically
 ruff-lint:
@@ -29,16 +29,39 @@ format:
 lint-cicd:
 	@echo "Running code quality checks..."
 	@if ! ruff check; then \
-		echo "$(RED)ERROR: Ruff linting failed!$(NC)"; \
-		echo "$(YELLOW)Please run 'make ruff-lint' locally to fix these issues.$(NC)"; \
+		echo -e "$(RED)ERROR: Ruff linting failed!$(NC)"; \
+		echo -e "$(YELLOW)Please run 'make ruff-lint' locally to fix these issues.$(NC)"; \
 		exit 1; \
 	fi
 	@if ! ruff format --check; then \
-		echo "$(RED)ERROR: Code formatting check failed!$(NC)"; \
-		echo "$(YELLOW)Please run 'make format' locally to fix these issues.$(NC)"; \
+		echo -e "$(RED)ERROR: Code formatting check failed!$(NC)"; \
+		echo -e "$(YELLOW)Please run 'make format' locally to fix these issues.$(NC)"; \
+		exit 1; \
+	fi
+	@echo -e "$(GREEN)All code quality checks passed!$(NC)"
+
+# Check CloudFormation templates for hardcoded AWS partition ARNs
+check-arn-partitions:
+	@echo "Checking CloudFormation templates for hardcoded ARN partitions..."
+	@FOUND_ISSUES=0; \
+	for template in template.yaml patterns/*/template.yaml patterns/*/sagemaker_classifier_endpoint.yaml options/*/template.yaml; do \
+		if [ -f "$$template" ]; then \
+			echo "Checking $$template..."; \
+			MATCHES=$$(grep -n "arn:aws:" "$$template" | grep -v "arn:\$${AWS::Partition}:" || true); \
+			if [ -n "$$MATCHES" ]; then \
+				echo -e "$(RED)ERROR: Found hardcoded 'arn:aws:' references in $$template:$(NC)"; \
+				echo "$$MATCHES" | sed 's/^/  /'; \
+				echo -e "$(YELLOW)  These should use 'arn:\$${AWS::Partition}:' instead for GovCloud compatibility$(NC)"; \
+				FOUND_ISSUES=1; \
+			fi; \
+		fi; \
+	done; \
+	if [ $$FOUND_ISSUES -eq 0 ]; then \
+		echo -e "$(GREEN)✅ No hardcoded ARN partition references found!$(NC)"; \
+	else \
+		echo -e "$(RED)❌ Found hardcoded ARN partition references that need to be fixed$(NC)"; \
 		exit 1; \
 	fi
-	@echo "$(GREEN)All code quality checks passed!$(NC)"

 # A convenience Makefile target that runs
 commit: lint test
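The scan that the `check-arn-partitions` target performs with `grep` can be expressed in a few lines of Python, which may be easier to reuse in tests. `scan_hardcoded_arns` is a hypothetical helper written for illustration, not part of the repository.

```python
def scan_hardcoded_arns(template_text: str):
    """Return (line_number, line) pairs containing a hardcoded commercial
    partition. Mirrors the Makefile's grep logic: lines already written as
    arn:${AWS::Partition}: contain no literal "arn:aws:" and never match."""
    return [
        (lineno, line)
        for lineno, line in enumerate(template_text.splitlines(), start=1)
        if "arn:aws:" in line
    ]

template = (
    "Role: arn:aws:iam::123456789012:role/MyRole\n"
    "Bucket: arn:${AWS::Partition}:s3:::my-bucket\n"
)
print(scan_hardcoded_arns(template))
# [(1, 'Role: arn:aws:iam::123456789012:role/MyRole')]
```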

README.md

Lines changed: 1 addition & 0 deletions

@@ -124,6 +124,7 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
 - [Deployment](./docs/deployment.md) - Build, publish, deploy, and test instructions
 - [Web UI](./docs/web-ui.md) - Web interface features and usage
 - [Agent Analysis](./docs/agent-analysis.md) - Natural language analytics and data visualization feature
+- [Custom MCP Agent](./docs/custom-MCP-agent.md) - Integrating external MCP servers for custom tools and capabilities
 - [Configuration](./docs/configuration.md) - Configuration and customization options
 - [Classification](./docs/classification.md) - Customizing document classification
 - [Extraction](./docs/extraction.md) - Customizing information extraction

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 67 additions & 50 deletions

@@ -1,7 +1,7 @@
 # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 # SPDX-License-Identifier: MIT-0
 
-notes: Default settings
+notes: Default settings for bank statement sample configuration
 ocr:
   backend: "textract" # Default to Textract for backward compatibility
   model_id: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
@@ -368,6 +368,7 @@ summarization:
   model: us.anthropic.claude-3-7-sonnet-20250219-v1:0
   system_prompt: >-
     You are a document summarization expert who can analyze and summarize documents from various domains including medical, financial, legal, and general business documents. Your task is to create a summary that captures the key information, main points, and important details from the document. Your output must be in valid JSON format. \nSummarization Style: Balanced\\nCreate a balanced summary that provides a moderate level of detail. Include the main points and key supporting information, while maintaining the document's overall structure. Aim for a comprehensive yet concise summary.\n Your output MUST be in valid JSON format with markdown content. You MUST strictly adhere to the output format specified in the instructions.
+
 assessment:
   enabled: true
   image:
@@ -383,130 +384,146 @@ assessment:
   max_tokens: '10000'
   top_k: '5'
   temperature: '0.0'
-  model: us.amazon.nova-pro-v1:0
+  model: us.amazon.nova-lite-v1:0
   system_prompt: >-
-    You are a document analysis assessment expert. Your task is to evaluate the confidence of extraction results by analyzing the source document evidence. Respond only with JSON containing confidence scores for each extracted attribute.
+    You are a document analysis assessment expert. Your role is to evaluate the confidence and accuracy of data extraction results by analyzing them against source documents.
+
+    Provide accurate confidence scores for each assessment.
+    When bounding boxes are requested, provide precise coordinate locations where information appears in the document.
   task_prompt: >-
     <background>
-
-    You are an expert document analysis assessment system. Your task is to evaluate the confidence of extraction results for a document of class {DOCUMENT_CLASS}.
-
+    You are an expert document analysis assessment system. Your task is to evaluate the confidence of extraction results for a document of class {DOCUMENT_CLASS} and provide precise spatial localization for each field.
     </background>
 
     <task>
-
-    Analyze the extraction results against the source document and provide confidence assessments for each extracted attribute. Consider factors such as:
-
-    1. Text clarity and OCR quality in the source regions
-    2. Alignment between extracted values and document content
-    3. Presence of clear evidence supporting the extraction
-    4. Potential ambiguity or uncertainty in the source material
+    Analyze the extraction results against the source document and provide confidence assessments AND bounding box coordinates for each extracted attribute. Consider factors such as:
+    1. Text clarity and OCR quality in the source regions
+    2. Alignment between extracted values and document content
+    3. Presence of clear evidence supporting the extraction
+    4. Potential ambiguity or uncertainty in the source material
     5. Completeness and accuracy of the extracted information
-
+    6. Precise spatial location of each field in the document
     </task>
 
     <assessment-guidelines>
-
-    For each attribute, provide:
-    A confidence score between 0.0 and 1.0 where:
+    For each attribute, provide:
+    - A confidence score between 0.0 and 1.0 where:
     - 1.0 = Very high confidence, clear and unambiguous evidence
     - 0.8-0.9 = High confidence, strong evidence with minor uncertainty
     - 0.6-0.7 = Medium confidence, reasonable evidence but some ambiguity
     - 0.4-0.5 = Low confidence, weak or unclear evidence
     - 0.0-0.3 = Very low confidence, little to no supporting evidence
-
-    Guidelines:
-    - Base assessments on actual document content and OCR quality
-    - Consider both text-based evidence and visual/layout clues
-    - Account for OCR confidence scores when provided
-    - Be objective and specific in reasoning
+    - A clear explanation of the confidence reasoning
+    - Precise spatial coordinates where the field appears in the document
+
+    Guidelines:
+    - Base assessments on actual document content and OCR quality
+    - Consider both text-based evidence and visual/layout clues
+    - Account for OCR confidence scores when provided
+    - Be objective and specific in reasoning
     - If an extraction appears incorrect, score accordingly with explanation
-
+    - Provide tight, accurate bounding boxes around the actual text
     </assessment-guidelines>
 
-    <final-instructions>
+    <spatial-localization-guidelines>
+    For each field, provide bounding box coordinates:
+    - bbox: [x1, y1, x2, y2] coordinates in normalized 0-1000 scale
+    - page: Page number where the field appears (starting from 1)
+
+    Coordinate system:
+    - Use normalized scale 0-1000 for both x and y axes
+    - x1, y1 = top-left corner of bounding box
+    - x2, y2 = bottom-right corner of bounding box
+    - Ensure x2 > x1 and y2 > y1
+    - Make bounding boxes tight around the actual text content
+    - If a field spans multiple lines, create a bounding box that encompasses all relevant text
+    </spatial-localization-guidelines>
 
-    Analyze the extraction results against the source document and provide confidence assessments. Return a JSON object with the following structure based on the attribute type:
+    <final-instructions>
+    Analyze the extraction results against the source document and provide confidence assessments with spatial localization. Return a JSON object with the following structure based on the attribute type:
 
-    For SIMPLE attributes:
+    For SIMPLE attributes:
     {
       "simple_attribute_name": {
         "confidence": 0.85,
+        "bbox": [100, 200, 300, 250],
+        "page": 1
       }
     }
 
-    For GROUP attributes (nested object structure):
+    For GROUP attributes (nested object structure):
     {
       "group_attribute_name": {
         "sub_attribute_1": {
           "confidence": 0.90,
+          "bbox": [150, 300, 250, 320],
+          "page": 1
         },
         "sub_attribute_2": {
          "confidence": 0.75,
+          "bbox": [150, 325, 280, 345],
+          "page": 1
         }
       }
     }
 
-    For LIST attributes (array of assessed items):
+    For LIST attributes (array of assessed items):
    {
       "list_attribute_name": [
         {
           "item_attribute_1": {
             "confidence": 0.95,
+            "bbox": [100, 400, 200, 420],
+            "page": 1
           },
           "item_attribute_2": {
             "confidence": 0.88,
+            "bbox": [250, 400, 350, 420],
+            "page": 1
           }
         },
         {
           "item_attribute_1": {
             "confidence": 0.92,
+            "bbox": [100, 425, 200, 445],
+            "page": 1
           },
           "item_attribute_2": {
             "confidence": 0.70,
+            "bbox": [250, 425, 350, 445],
+            "page": 1
           }
         }
       ]
     }
 
-    IMPORTANT:
-    - For LIST attributes like "Transactions", assess EACH individual item in the list separately
-    - Each transaction should be assessed as a separate object in the array
-    - Do NOT provide aggregate assessments for list items - assess each one individually
-    - Include assessments for ALL attributes present in the extraction results
+    IMPORTANT:
+    - For LIST attributes like "Transactions", assess EACH individual item in the list separately with individual bounding boxes
+    - Each transaction should be assessed as a separate object in the array with its own spatial coordinates
+    - Do NOT provide aggregate assessments for list items - assess each one individually with precise locations
+    - Include assessments AND bounding boxes for ALL attributes present in the extraction results
     - Match the exact structure of the extracted data
-
+    - Provide page numbers for all bounding boxes (starting from 1)
     </final-instructions>
 
-    <attributes-definitions>
-
-    {ATTRIBUTE_NAMES_AND_DESCRIPTIONS}
-
-    </attributes-definitions>
-
     <<CACHEPOINT>>
 
     <document-image>
-
     {DOCUMENT_IMAGE}
-
     </document-image>
 
     <ocr-text-confidence-results>
-
     {OCR_TEXT_CONFIDENCE}
-
     </ocr-text-confidence-results>
 
     <<CACHEPOINT>>
 
-    <extraction-results>
+    <attributes-definitions>
+    {ATTRIBUTE_NAMES_AND_DESCRIPTIONS}
+    </attributes-definitions>
 
+    <extraction-results>
     {EXTRACTION_RESULTS}
-
     </extraction-results>
 
 evaluation:
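The JSON shapes the updated task_prompt requests (per-attribute confidence plus optional bbox and page) can be checked with a small recursive validator. This is a sketch based only on the structures shown in the prompt above; the repository's assessment service may enforce different or additional rules.

```python
def validate_assessment(node):
    """Recursively validate the assessment JSON shapes from the prompt:
    every leaf needs a confidence in [0, 1]; when a bbox is present it
    must be [x1, y1, x2, y2] on the 0-1000 scale with x2 > x1, y2 > y1,
    and a 1-based page number. Sketch only."""
    if isinstance(node, list):
        # LIST attributes: every item must validate individually.
        return all(validate_assessment(item) for item in node)
    if isinstance(node, dict):
        if "confidence" in node:
            ok = 0.0 <= node["confidence"] <= 1.0
            if "bbox" in node:
                x1, y1, x2, y2 = node["bbox"]
                ok = ok and 0 <= x1 < x2 <= 1000 and 0 <= y1 < y2 <= 1000
                ok = ok and node.get("page", 1) >= 1
            return ok
        # SIMPLE/GROUP containers: every child must validate.
        return all(validate_assessment(v) for v in node.values())
    return False

simple = {"simple_attribute_name": {"confidence": 0.85,
                                    "bbox": [100, 200, 300, 250], "page": 1}}
print(validate_assessment(simple))  # True
```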
