aws-solutions-library-samples
diff --git a/.gitattributes b/.gitattributes
Lines changed: 5 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 0 deletions
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
Lines changed: 7 additions & 4 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 124 additions & 0 deletions
diff --git a/Makefile b/Makefile
Lines changed: 37 additions & 6 deletions
diff --git a/README.md b/README.md
Lines changed: 5 additions & 0 deletions
diff --git a/VERSION b/VERSION
Lines changed: 1 addition & 1 deletion
diff --git a/config_library/pattern-1/lending-package-sample/config.yaml b/config_library/pattern-1/lending-package-sample/config.yaml
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,5 @@
+* text=auto eol=lf
+*.py text eol=lf
+*.sh text eol=lf
+*.yaml text eol=lf
+*.yml text eol=lf
@@ -4,6 +4,8 @@ build.toml
 model.tar.gz
 .checksum
 .checksums/
+.build_checksum
+.lib_checksum
 .vscode/
 .DS_Store
 dist/
@@ -20,3 +22,4 @@ rvl_cdip_*
 notebooks/examples/data
 .idea/
 .dsr/
+*tmp-dev-assets*
@@ -28,6 +28,8 @@ developer_tests:
     - apt-get update -y
     - apt-get install make -y
     - pip install ruff
+    # Install dependencies needed by publish.py for test imports
+    - pip install typer rich boto3
     # Install test dependencies
     - cd lib/idp_common_pkg && pip install -e ".[test]" && cd ../..
 
@@ -54,14 +56,15 @@ integration_tests:
   #   AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION}
   #   IDP_ACCOUNT_ID: ${IDP_ACCOUNT_ID}
 
-  # Add rules to only run on develop branch
+ # Add rules to only run on develop branch
+ # Add rules to only run on develop branch
   rules:
     - if: $CI_COMMIT_BRANCH == "develop"
-      when: manual # always # When idp-accelerator CICD is reconfigured
+      when: always # always # When idp-accelerator CICD is reconfigured
     - if: $CI_COMMIT_BRANCH =~ /^feature\/.*/
-      when: manual
+      when: always
     - if: $CI_COMMIT_BRANCH =~ /^fix\/.*/
-      when: manual
+      when: always
     - if: $CI_COMMIT_BRANCH =~ /^hotfix\/.*/
       when: manual
     - if: $CI_COMMIT_BRANCH =~ /^release\/.*/
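Review note: GitLab evaluates `rules` in order and uses the first match, so the effect of the hunk above is that `develop`, `feature/*`, and `fix/*` pipelines now start automatically while `hotfix/*` stays manual. A minimal Python sketch of that first-match behaviour follows. The `RULES` table and `resolve_when` helper are illustrative names, not part of the repository, and the `release/*` rule is omitted because its `when` value is cut off in the hunk.

```python
import re

# Ordered (pattern, when) pairs mirroring the rules visible in the hunk above.
# The release/* rule is truncated in the hunk, so it is deliberately left out.
RULES = [
    (re.compile(r"^develop$"), "always"),
    (re.compile(r"^feature/.*"), "always"),
    (re.compile(r"^fix/.*"), "always"),
    (re.compile(r"^hotfix/.*"), "manual"),
]


def resolve_when(branch: str):
    """Return the `when` value of the first matching rule, or None if none match."""
    for pattern, when in RULES:
        if pattern.search(branch):
            return when
    return None


print(resolve_when("feature/discovery"))
print(resolve_when("hotfix/urgent"))
```

Because the patterns are anchored with `^`, a `hotfix/...` branch does not accidentally match the `fix/` rule.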
 
@@ -5,8 +5,132 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+## [0.3.15]
+
+### Added
+
+- **Intelligent Document Discovery Module for Automated Configuration Generation**
+  - Added Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
+  - **Pattern-Neutral Design**: Works across all processing patterns (1, 2, 3) with unified discovery process and pattern-specific implementations
+  - **Dual Discovery Methods**: Discovery without ground truth (exploratory analysis) and with ground truth (optimization using labeled data)
+  - **Automated Blueprint Creation**: Pattern 1 includes zero-touch BDA blueprint generation with intelligent change detection and version management
+  - **Web UI Integration**: Real-time discovery job monitoring, interactive results review, and seamless configuration integration
+  - **Advanced Features**: Multi-model support (Nova, Claude), customizable prompts, configurable parameters, ground truth processing, schema conversion, and lifecycle management
+  - **Key Benefits**: Rapid new document type onboarding, reduced time-to-production, configuration optimization, and automated workflow bootstrapping
+  - **Use Cases**: New document exploration, configuration improvement, rapid prototyping, and document understanding
+  - **Documentation**: Guide in `docs/discovery.md` with architecture details, best practices, and troubleshooting
+
+- **Optional Pattern-2 Regex-Based Classification for Enhanced Performance**
+  - Added support for optional regex patterns in document class definitions for performance optimization
+  - **Document Name Regex**: Match against document ID/name to classify all pages without LLM processing when all pages should be the same class
+  - **Document Page Content Regex**: Match against page text content during multi-modal page-level classification for fast page classification
+  - **Key Benefits**: Significant performance improvements and cost savings by bypassing LLM calls for pattern-matched documents, deterministic classification results for known document patterns, seamless fallback to existing LLM classification when regex patterns don't match
+  - **Configuration**: Optional `document_name_regex` and `document_page_content_regex` fields in class definitions with automatic regex compilation and validation
+  - **Logging**: Comprehensive info-level logging when regex patterns match for observability and debugging
+  - **CloudFormation Integration**: Updated Pattern-2 schema to support regex configuration through the Web UI
+  - **Demonstration**: New `step2_classification_with_regex.ipynb` notebook showcasing regex configuration and performance comparisons
+  - **Documentation**: Enhanced classification module README and main documentation with regex usage examples and best practices
+  
+- **Windows WSL Development Environment Setup Guide**
+  - Added WSL-based development environment setup guide for Windows developers in `docs/setup-development-env-WSL.md`
+  - **Key Features**: Automated setup script (`wsl_setup.sh`) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI
+  - **Integrated Workflow**: Development setup combining Windows tools (VS Code, browsers) with native Linux environment
+  - **Target Use Cases**: Windows developers needing Linux compatibility without Docker Desktop or VM overhead
+
+### Fixed
+- **Throttling Error Detection and Retry Logic for Assessment Functions** - [GitHub Issue #45](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/45)
+  - **Assessment Function**: Enhanced throttling detection to check for throttling errors returned in `document.errors` field in addition to thrown exceptions, raising `ThrottlingException` to trigger Step Functions retry when throttling is detected
+  - **Granular Assessment Task Caching**: Fixed caching logic to properly cache successful assessment tasks when there are ANY failed tasks (both exception-based and result-based failures), enabling efficient retry optimization by only reprocessing failed tasks while preserving successful results
+  - **Impact**: Improved resilience for throttling scenarios, reduced redundant processing during retries, and better Step Functions retry behavior
+
+- **Security Vulnerability Mitigation - Package Updates**
+
+- **GovCloud Compatibility - Hardcoded Service Domain References**
+  - Fixed hardcoded `amazonaws.com` references in CloudFormation templates that prevented GovCloud deployment
+  - Updated all service principals and endpoints to use dynamic `${AWS::URLSuffix}` expressions for automatic region-based resolution
+  - **Templates Updated**: `template.yaml` (main template), `patterns/pattern-3/sagemaker_classifier_endpoint.yaml`
+  - **Services Fixed**: EventBridge, Cognito, SageMaker, ECR, CloudFront, CodeBuild, AppSync, Lambda, DynamoDB, CloudWatch Logs, Glue
+  - Resolves GitHub Issue #50 - templates now deploy correctly in both standard AWS and GovCloud regions
+
+- **Bug Fixes and Code Improvements**
+  - Fixed HITL processing errors in both Pattern-1 (DynamoDB validation with empty strings) and Pattern-2 (string indices error in A2I output processing)
+  - Fixed Step Function UI issues including auto-refresh button auto-disable and fetch failures for failed executions with datetime serialization errors
+  - Cleaned up unused Step Function subscription infrastructure and removed duplicate code in Pattern-2 HITL function
+  - Expanded UI Visual Editor bounding box size with padding for better visibility and user interaction
+  - Fixed bug in the list of models supporting cache points: Claude Sonnet 4 and Claude Opus 4 had previously been excluded.
+  - Added validation at the assessment step to check for a valid JSON response. Validation fails after extraction/assessment completes if JSON parsing issues are encountered.
+
+
+## [0.3.14]
+
+### Added
+- Support for 1M-token context for Claude Sonnet 4
+- Video demo of "Chat with Document" in [./docs/web-ui.md](./docs/web-ui.md)
+- **Human-in-the-Loop (HITL) Support Extended to Pattern-2**
+  - Added HITL review capabilities for Pattern-2 (Textract + Bedrock processing) using Amazon SageMaker Augmented AI (A2I)
+  - Enables human validation and correction when extraction confidence falls below configurable threshold
+  - Includes same features as Pattern-1 HITL: automatic triggering, review portal integration, and seamless result updates
+  - Documentation and video demo in [./docs/human-review.md](./docs/human-review.md)
+
+### Removed
+- Removed the Windows development environment guide and setup script because they proved insufficiently robust
+
+### Fixed
+- Fix 1-click Launch URL output from the GovCloud template generation script
+- Add Agent Analytics to architecture diagram
+- Fix various UX and error reporting issues with the new Python publish script
+- Simplify UDOP model path construction and avoid invalid default for regions other than us-east-1 and us-west-2
+- Permission regression from previous release affecting "Chat with Document"
+
+
+## [0.3.13]
+
 ### Added
 
+- **External MCP Agent Integration for Custom Tool Extension**
+  - Added External MCP (Model Context Protocol) Agent support that enables integration with custom MCP servers to extend IDP capabilities
+  - **Cross-Account Integration**: Host MCP servers in separate AWS accounts or external infrastructure with secure OAuth authentication using AWS Cognito
+  - **Dynamic Tool Discovery**: Automatically discovers and integrates available tools from MCP servers through the IDP web interface
+  - **Secure Authentication Flow**: Uses AWS Cognito User Pools for OAuth bearer token authentication with proper token validation
+  - **Configuration Management**: JSON array configuration in AWS Secrets Manager supporting multiple MCP server connections with optional custom agent names and descriptions
+  - **Real-time Integration**: Tools become immediately available through the IDP web interface after configuration
+
+- **AWS GovCloud Support with Automated Template Generation**
+  - Added GovCloud compatibility through `scripts/generate_govcloud_template.py` script
+  - **ARN Partition Compatibility**: All templates updated to use `arn:${AWS::Partition}:` for both commercial and GovCloud regions
+  - **Headless Operation**: Automatically removes UI-related resources (CloudFront, AppSync, Cognito, WAF) for GovCloud deployment
+  - **Core Functionality Preserved**: All 3 processing patterns and complete 6-step pipeline (OCR, Classification, Extraction, Assessment, Summarization, Evaluation) remain fully functional
+  - **Automated Workflow**: Single script orchestrates build + GovCloud template generation + S3 upload with deployment URLs
+  - **Enterprise Ready**: Enables headless document processing for government and enterprise environments requiring GovCloud compliance
+  - **Documentation**: New `docs/govcloud-deployment.md` with deployment guide, architecture differences, and access methods
+
+- **Pattern-2 and Pattern-3 Assessment now generate geometry (bounding boxes) for visualization in UI 'Visual Editor' (parity with Pattern-1)**
+  - Added comprehensive spatial localization capabilities to both regular and granular assessment services
+  - **Automatic Processing**: When LLM provides bbox coordinates, automatically converts to UI-compatible (Visual Edit) geometry format without any configuration
+  - **Universal Support**: Works with all attribute types - simple attributes, nested group attributes (e.g., CompanyAddress.State), and list attributes
+  - **Enhanced Prompts**: Updated assessment task prompts with spatial-localization-guidelines requesting bbox coordinates in normalized 0-1000 scale
+  - **Demo Notebooks**: Assessment notebooks now showcase automatic bounding box processing
+
+- **New Python-Based Publishing System**
+  - Replaced `publish.sh` bash script with new `publish.py` Python script
+  - Rich console interface with progress bars, spinners, and colored output using Rich library
+  - Multi-threaded artifact building and uploading for significantly improved performance
+  - Native support for Linux, macOS, and Windows environments
+
+- **Windows Development Environment Setup Guide and Helper Script**
+  - New `scripts/dev_setup.bat` (570 lines) for complete Windows development environment configuration
+
+- **OCR Service Default Image Sizing for Resource Optimization**
+  - Implemented automatic default image size limits (951×1268) when no image sizing configuration is provided
+  - **Key Benefits**: Reduction in vision model token consumption, prevents OutOfMemory errors during concurrent processing, improves processing speed and reduces bandwidth usage
+
+### Changed
+
+- **Reverted to python3.12 runtime to resolve build package dependency problems**
+
+### Fixed
+- **Improved Visual Edit bounding box position when using image zoom or pan**
+
 
 
 ## [0.3.12]
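Review note: the optional regex classification described in the 0.3.15 entry above can be pictured with a short sketch. Only the field names `document_name_regex` and `document_page_content_regex` come from the changelog. The class list, the helper names, the use of `re.search`, and the `llm_classify` callback are assumptions made for illustration and may differ from the actual Pattern-2 implementation.

```python
import re
from typing import Callable, List, Optional

# Hypothetical class definitions; only the two regex field names are from the changelog.
CLASSES = [
    {
        "name": "Payslip",
        "document_name_regex": r"(?i)payslip",
        "document_page_content_regex": r"(?i)net pay",
    },
    {
        "name": "Invoice",
        "document_page_content_regex": r"(?i)\binvoice\s+number\b",
    },
]


def match_class(field: str, text: str) -> Optional[str]:
    """Return the first class whose regex in `field` matches `text`, else None."""
    for cls in CLASSES:
        pattern = cls.get(field)
        if pattern and re.search(pattern, text):
            return cls["name"]
    return None


def classify(document_id: str, pages: List[str],
             llm_classify: Callable[[str], str]) -> List[str]:
    """Classify pages, calling the LLM only where no regex matches."""
    # A document-name match labels every page and skips the LLM entirely.
    doc_class = match_class("document_name_regex", document_id)
    if doc_class is not None:
        return [doc_class] * len(pages)
    # Otherwise try the page-content regex per page, then fall back to the LLM.
    return [
        match_class("document_page_content_regex", text) or llm_classify(text)
        for text in pages
    ]
```

With a stub such as `lambda text: "Other"` standing in for the LLM, a document named `2024-payslip.pdf` is labelled without any model call, while unmatched pages in other documents still reach the fallback.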
 
@@ -14,7 +14,7 @@ test:
 	$(MAKE) -C lib/idp_common_pkg test
 
 # Run both linting and formatting in one command
-lint: ruff-lint format
+lint: ruff-lint format check-arn-partitions
 
 # Run linting checks and fix issues automatically
 ruff-lint:
@@ -29,16 +29,47 @@ format:
 lint-cicd:
 	@echo "Running code quality checks..."
 	@if ! ruff check; then \
-		echo "$(RED)ERROR: Ruff linting failed!$(NC)"; \
-		echo "$(YELLOW)Please run 'make ruff-lint' locally to fix these issues.$(NC)"; \
+		echo -e "$(RED)ERROR: Ruff linting failed!$(NC)"; \
+		echo -e "$(YELLOW)Please run 'make ruff-lint' locally to fix these issues.$(NC)"; \
 		exit 1; \
 	fi
 	@if ! ruff format --check; then \
-		echo "$(RED)ERROR: Code formatting check failed!$(NC)"; \
-		echo "$(YELLOW)Please run 'make format' locally to fix these issues.$(NC)"; \
+		echo -e "$(RED)ERROR: Code formatting check failed!$(NC)"; \
+		echo -e "$(YELLOW)Please run 'make format' locally to fix these issues.$(NC)"; \
+		exit 1; \
+	fi
+	@echo -e "$(GREEN)All code quality checks passed!$(NC)"
+
+# Check CloudFormation templates for hardcoded AWS partition ARNs and service principals
+check-arn-partitions:
+	@echo "Checking CloudFormation templates for hardcoded ARN partitions and service principals..."
+	@FOUND_ISSUES=0; \
+	for template in template.yaml patterns/*/template.yaml patterns/*/sagemaker_classifier_endpoint.yaml options/*/template.yaml; do \
+		if [ -f "$$template" ]; then \
+			echo "Checking $$template..."; \
+			ARN_MATCHES=$$(grep -n "arn:aws:" "$$template" | grep -v "arn:\$${AWS::Partition}:" || true); \
+			if [ -n "$$ARN_MATCHES" ]; then \
+				echo -e "$(RED)ERROR: Found hardcoded 'arn:aws:' references in $$template:$(NC)"; \
+				echo "$$ARN_MATCHES" | sed 's/^/  /'; \
+				echo -e "$(YELLOW)  These should use 'arn:\$${AWS::Partition}:' instead for GovCloud compatibility$(NC)"; \
+				FOUND_ISSUES=1; \
+			fi; \
+			SERVICE_MATCHES=$$(grep -n "\.amazonaws\.com" "$$template" | grep -v "\$${AWS::URLSuffix}" | grep -v "^[[:space:]]*#" | grep -v "Description:" | grep -v "Comment:" | grep -v "cognito" | grep -v "ContentSecurityPolicy" || true); \
+			if [ -n "$$SERVICE_MATCHES" ]; then \
+				echo -e "$(RED)ERROR: Found hardcoded service principal references in $$template:$(NC)"; \
+				echo "$$SERVICE_MATCHES" | sed 's/^/  /'; \
+				echo -e "$(YELLOW)  These should use '\$${AWS::URLSuffix}' instead of 'amazonaws.com' for GovCloud compatibility$(NC)"; \
+				echo -e "$(YELLOW)  Example: 'lambda.amazonaws.com' should be 'lambda.\$${AWS::URLSuffix}'$(NC)"; \
+				FOUND_ISSUES=1; \
+			fi; \
+		fi; \
+	done; \
+	if [ $$FOUND_ISSUES -eq 0 ]; then \
+		echo -e "$(GREEN)✅ No hardcoded ARN partition or service principal references found!$(NC)"; \
+	else \
+		echo -e "$(RED)❌ Found hardcoded references that need to be fixed for GovCloud compatibility$(NC)"; \
 		exit 1; \
 	fi
-	@echo "$(GREEN)All code quality checks passed!$(NC)"
 
 # A convenience Makefile target that runs 
 commit: lint test
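Review note: the `check-arn-partitions` target above is a per-line filter chain, which can be easier to reason about (and unit test) when written out as a function. The Python sketch below is an illustrative equivalent, not code from the repository; `find_hardcoded_references` and its constants are hypothetical names. One difference worth verifying in the shell version: `grep -n` prefixes each match with `N:`, so the later `grep -v "^[[:space:]]*#"` filter may never see a line that begins with `#`. The sketch tests the raw line instead.

```python
ARN_HARDCODED = "arn:aws:"
ARN_DYNAMIC = "arn:${AWS::Partition}:"
DOMAIN_HARDCODED = ".amazonaws.com"
DOMAIN_DYNAMIC = "${AWS::URLSuffix}"
# Same exclusions as the grep -v chain in the Makefile target.
IGNORED_SUBSTRINGS = ("Description:", "Comment:", "cognito", "ContentSecurityPolicy")


def find_hardcoded_references(template_text: str):
    """Return (line_number, kind, line) tuples for hardcoded ARNs and domains."""
    issues = []
    for number, line in enumerate(template_text.splitlines(), start=1):
        # Hardcoded partition: 'arn:aws:' on a line that lacks the dynamic form.
        if ARN_HARDCODED in line and ARN_DYNAMIC not in line:
            issues.append((number, "arn", line.strip()))
        # Hardcoded domain: skip comment lines and the allow-listed substrings.
        if (
            DOMAIN_HARDCODED in line
            and DOMAIN_DYNAMIC not in line
            and not line.lstrip().startswith("#")
            and not any(token in line for token in IGNORED_SUBSTRINGS)
        ):
            issues.append((number, "domain", line.strip()))
    return issues
```

An empty result corresponds to the success branch of the Makefile target; any entry corresponds to `FOUND_ISSUES=1`.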
 
@@ -39,6 +39,8 @@ White-glove customization, deployment, and integration support for production us
 - **Cost Optimization**: Pay-per-use pricing model with built-in controls
 - **Comprehensive Monitoring**: Rich CloudWatch dashboard with detailed metrics and logs
 - **Web User Interface**: Modern UI for inspecting document workflow status and results
+- **Human-in-the-Loop (HITL)**: Amazon A2I integration for human review workflows (Pattern 1 & Pattern 2)
+  - **Note**: When deploying multiple patterns with HITL, reuse existing private workteam ARN due to AWS account limits
 - **AI-Powered Evaluation**: Framework to assess accuracy against baseline data
 - **Extraction Confidence Assessment**: LLM-powered assessment of extraction confidence with multimodal document analysis
 - **Document Knowledge Base Query**: Ask questions about your processed documents
@@ -124,9 +126,12 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
 - [Deployment](./docs/deployment.md) - Build, publish, deploy, and test instructions
 - [Web UI](./docs/web-ui.md) - Web interface features and usage
 - [Agent Analysis](./docs/agent-analysis.md) - Natural language analytics and data visualization feature
+- [Custom MCP Agent](./docs/custom-MCP-agent.md) - Integrating external MCP servers for custom tools and capabilities
 - [Configuration](./docs/configuration.md) - Configuration and customization options
+- [Discovery](./docs/discovery.md) - Pattern-neutral discovery process and BDA blueprint automation
 - [Classification](./docs/classification.md) - Customizing document classification
 - [Extraction](./docs/extraction.md) - Customizing information extraction
+- [Human-in-the-Loop Review](./docs/human-review.md) - Human review workflows with Amazon A2I
 - [Assessment](./docs/assessment.md) - Extraction confidence evaluation using LLMs
 - [Evaluation Framework](./docs/evaluation.md) - Accuracy assessment system with analytics database and reporting
 - [Knowledge Base](./docs/knowledge-base.md) - Document knowledge base query feature
 
@@ -1 +1 @@
-0.3.12
+0.3.15
@@ -185,6 +185,16 @@ pricing:
         price: '3.0E-7'
       - name: cacheWriteInputTokens
         price: '3.75E-6'
+  - name: bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0:1m
+    units:
+      - name: inputTokens
+        price: '6.0E-6'
+      - name: outputTokens
+        price: '2.25E-5'
+      - name: cacheReadInputTokens
+        price: '6.0E-7'
+      - name: cacheWriteInputTokens
+        price: '7.5E-6'
   - name: bedrock/us.anthropic.claude-opus-4-20250514-v1:0
     units:
       - name: inputTokens
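Review note: the new `:1m` pricing entry above doubles the input-side rates relative to the values visible for the preceding entry (`6.0E-7` vs `3.0E-7` for cache reads, `7.5E-6` vs `3.75E-6` for cache writes). A small sketch of how such an entry could be turned into a cost estimate follows. It assumes each price is a per-token amount and that costs are simple sums of price times count; `SONNET_4_1M_PRICES` and `estimate_cost` are illustrative names, not the solution's actual metering code.

```python
from decimal import Decimal

# Prices copied from the hunk above. They are quoted strings in the YAML,
# so they are parsed explicitly rather than treated as floats.
SONNET_4_1M_PRICES = {
    "inputTokens": "6.0E-6",
    "outputTokens": "2.25E-5",
    "cacheReadInputTokens": "6.0E-7",
    "cacheWriteInputTokens": "7.5E-6",
}


def estimate_cost(usage: dict) -> Decimal:
    """Sum price * count per unit, assuming each price applies to one token."""
    total = Decimal("0")
    for unit, count in usage.items():
        total += Decimal(SONNET_4_1M_PRICES[unit]) * count
    return total


print(estimate_cost({"inputTokens": 200_000, "outputTokens": 10_000}))
```

`Decimal` is used so that values such as `6.0E-6` are not subject to binary floating-point rounding when multiplied by large token counts.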