# Modular IDP Pipeline Notebooks

This directory contains a complete set of modular Jupyter notebooks that demonstrate the Intelligent Document Processing (IDP) pipeline using the `idp_common` library. Each notebook represents a distinct step in the IDP workflow and can be run independently or sequentially.

## 🏗️ Architecture Overview

The modular approach breaks down the IDP pipeline into discrete, manageable steps:

```
Step 0: Setup → Step 1: OCR → Step 2: Classification → Step 3: Extraction → Step 4: Assessment → Step 5: Summarization → Step 6: Evaluation
```

### Key Benefits

- **Independent Execution**: Each step can be run and tested independently
- **Modular Configuration**: Separate YAML configuration files for different components
- **Data Persistence**: Each step saves results for the next step to consume
- **Easy Experimentation**: Modify configurations without changing code
- **Comprehensive Evaluation**: Professional-grade evaluation with the EvaluationService
- **Debugging Friendly**: Isolate issues to specific processing steps

## 📁 Directory Structure

```
notebooks/examples/
├── README.md                    # This file
├── step0_setup.ipynb            # Environment setup and document initialization
├── step1_ocr.ipynb              # OCR processing using Amazon Textract
├── step2_classification.ipynb   # Document classification
├── step3_extraction.ipynb       # Structured data extraction
├── step4_assessment.ipynb       # Confidence assessment and explainability
├── step5_summarization.ipynb    # Content summarization
├── step6_evaluation.ipynb       # Final evaluation and reporting
├── config/                      # Modular configuration files
│   ├── main.yaml                # Main pipeline configuration
│   ├── classes.yaml             # Document classification definitions
│   ├── ocr.yaml                 # OCR service configuration
│   ├── classification.yaml      # Classification method configuration
│   ├── extraction.yaml          # Extraction method configuration
│   ├── assessment.yaml          # Assessment method configuration
│   ├── summarization.yaml       # Summarization method configuration
│   └── evaluation.yaml          # Evaluation method configuration
└── data/                        # Step-by-step processing results
    ├── step0_setup/             # Setup outputs
    ├── step1_ocr/               # OCR results
    ├── step2_classification/    # Classification results
    ├── step3_extraction/        # Extraction results
    ├── step4_assessment/        # Assessment results
    ├── step5_summarization/     # Summarization results
    └── step6_evaluation/        # Final evaluation results
```

## 🚀 Quick Start

### Prerequisites

1. **AWS Credentials**: Ensure your AWS credentials are configured
2. **Required Libraries**: Install the `idp_common` package
3. **Sample Document**: Place a PDF file in the project samples directory

### Running the Complete Pipeline

Execute the notebooks in sequence:

```bash
# Step 0: Set up the environment and document
jupyter notebook step0_setup.ipynb

# Step 1: Run OCR
jupyter notebook step1_ocr.ipynb

# Step 2: Classify document sections
jupyter notebook step2_classification.ipynb

# Step 3: Extract structured data
jupyter notebook step3_extraction.ipynb

# Step 4: Assess confidence and explainability
jupyter notebook step4_assessment.ipynb

# Step 5: Generate summaries
jupyter notebook step5_summarization.ipynb

# Step 6: Evaluate results and generate reports
jupyter notebook step6_evaluation.ipynb
```
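
To run the whole sequence unattended (for example in CI), the notebooks can also be executed headlessly with `jupyter nbconvert`. A minimal sketch, assuming the notebooks are run from this directory:

```bash
# Execute each notebook in place, in pipeline order (requires jupyter + nbconvert)
for nb in step0_setup step1_ocr step2_classification step3_extraction \
          step4_assessment step5_summarization step6_evaluation; do
  jupyter nbconvert --to notebook --execute --inplace "${nb}.ipynb"
done
```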

### Running Individual Steps

Each notebook can also be run on its own, provided the required input data from the previous step exists:

```python
# Each notebook loads its inputs from the previous step's data directory.
# The braces are placeholders, e.g. data/step2_classification before step 3.
from pathlib import Path

previous_step_dir = Path("data/step{n-1}_{previous_step_name}")
```
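
When jumping into the middle of the pipeline, it helps to confirm the expected inputs exist first. A minimal sketch (`check_step_inputs` is a hypothetical helper, not part of `idp_common`; the file names follow the artifact layout described under Data Flow below):

```python
from pathlib import Path

def check_step_inputs(step_dir: str) -> bool:
    """Return True if step_dir contains the artifacts the next notebook loads."""
    required = ["document.json", "config.json"]
    missing = [f for f in required if not (Path(step_dir) / f).exists()]
    if missing:
        print(f"Missing inputs in {step_dir}: {missing}")
        return False
    return True

# Example: confirm step 2 outputs exist before running step3_extraction.ipynb
check_step_inputs("data/step2_classification")
```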

## ⚙️ Configuration Management

### Modular Configuration Files

Configuration is split across multiple YAML files for better organization:

- **`config/main.yaml`**: Overall pipeline settings and AWS configuration
- **`config/classes.yaml`**: Document type definitions and attributes to extract
- **`config/ocr.yaml`**: Textract features and OCR-specific settings
- **`config/classification.yaml`**: Classification model and method configuration
- **`config/extraction.yaml`**: Extraction model and prompting configuration
- **`config/assessment.yaml`**: Assessment model and confidence thresholds
- **`config/summarization.yaml`**: Summarization models and output formats
- **`config/evaluation.yaml`**: Evaluation metrics and reporting settings

### Configuration Loading

Each notebook automatically merges all configuration files:

```python
# Automatic configuration loading in each notebook
CONFIG = load_and_merge_configs("config/")
```
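
If you want to reproduce this behavior outside the notebooks, a shallow merge over the YAML files is enough to start with. A minimal sketch, assuming later files simply override earlier top-level keys (the notebooks' actual helper may merge more carefully):

```python
from pathlib import Path

import yaml

def load_and_merge_configs(config_dir: str) -> dict:
    """Load every YAML file in config_dir into one dict.

    Sketch only: a shallow merge in which later files override
    earlier top-level keys."""
    merged: dict = {}
    for path in sorted(Path(config_dir).glob("*.yaml")):
        with open(path) as f:
            merged.update(yaml.safe_load(f) or {})
    return merged
```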

### Experimentation with Configurations

To experiment with different settings:

1. **Backup Current Config**: Copy the config directory
2. **Modify Settings**: Edit the relevant YAML files
3. **Run Specific Steps**: Execute only the affected notebooks
4. **Compare Results**: Review outputs in the data directories

## 📊 Data Flow

### Input/Output Structure

Each step follows a consistent pattern:

```python
import json
from pathlib import Path

from idp_common.models import Document  # exact import path may vary by version

# Input (from the previous step; braces are placeholders)
input_data_dir = Path("data/step{n-1}_{previous_name}")
document = Document.from_json((input_data_dir / "document.json").read_text())
with open(input_data_dir / "config.json") as f:
    config = json.load(f)

# Processing
# ... step-specific processing ...

# Output (for the next step)
output_data_dir = Path("data/step{n}_{current_name}")
output_data_dir.mkdir(parents=True, exist_ok=True)
(output_data_dir / "document.json").write_text(document.to_json())
with open(output_data_dir / "config.json", "w") as f:
    json.dump(config, f)
```

### Serialized Artifacts

Each step produces:
- **`document.json`**: Updated Document object with step results
- **`config.json`**: Complete merged configuration
- **`environment.json`**: Environment settings and metadata
- **Step-specific result files**: Detailed processing outputs

## 🔬 Detailed Step Descriptions

### Step 0: Setup (`step0_setup.ipynb`)
- **Purpose**: Initialize the Document object and prepare the processing environment
- **Inputs**: PDF file path, configuration files
- **Outputs**: Document object with pages and metadata
- **Key Features**: Multi-page PDF support, metadata extraction

### Step 1: OCR (`step1_ocr.ipynb`)
- **Purpose**: Extract text and analyze document structure using Amazon Textract
- **Inputs**: Document object with PDF pages
- **Outputs**: OCR results with text blocks, tables, and forms
- **Key Features**: Textract API integration, feature selection, result caching
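
For reference, the raw Textract call this step wraps looks roughly like the following boto3 sketch (the page image path is hypothetical; the notebook's OCR service adds paging, caching, and result handling on top):

```python
import boto3

textract = boto3.client("textract")

# Analyze one page for text plus table and form structure
with open("page1.png", "rb") as f:  # hypothetical page image
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],
    )

# Blocks carry the detected lines, words, tables, and key-value pairs
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print(f"Detected {len(lines)} text lines")
```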

### Step 2: Classification (`step2_classification.ipynb`)
- **Purpose**: Identify document types and create logical sections
- **Inputs**: Document with OCR results
- **Outputs**: Classified sections with confidence scores
- **Key Features**: Multi-modal classification, few-shot prompting, custom classes

### Step 3: Extraction (`step3_extraction.ipynb`)
- **Purpose**: Extract structured data from each classified section
- **Inputs**: Document with classified sections
- **Outputs**: Structured data for each section based on class definitions
- **Key Features**: Class-specific extraction, JSON schema validation
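
The schema validation mentioned above can be reproduced standalone with the `jsonschema` package; the schema and payload here are hypothetical examples, not the pipeline's real class definitions:

```python
from jsonschema import ValidationError, validate

# Hypothetical schema mirroring a class definition in config/classes.yaml
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number"],
}

extracted = {"invoice_number": "INV-001", "total_amount": 123.45}

try:
    validate(instance=extracted, schema=invoice_schema)
    print("Extraction matches the schema")
except ValidationError as e:
    print(f"Schema violation: {e.message}")
```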

### Step 4: Assessment (`step4_assessment.ipynb`)
- **Purpose**: Evaluate extraction confidence and provide explainability
- **Inputs**: Document with extraction results
- **Outputs**: Confidence scores and reasoning for each extracted attribute
- **Key Features**: Confidence assessment, hallucination detection, explainability

### Step 5: Summarization (`step5_summarization.ipynb`)
- **Purpose**: Generate human-readable summaries of processing results
- **Inputs**: Document with assessed extractions
- **Outputs**: Section and document-level summaries in multiple formats
- **Key Features**: Multi-format output (JSON, Markdown, HTML), customizable templates

### Step 6: Evaluation (`step6_evaluation.ipynb`)
- **Purpose**: Comprehensive evaluation of pipeline performance and accuracy
- **Inputs**: Document with complete processing results
- **Outputs**: Evaluation reports, accuracy metrics, performance analysis
- **Key Features**: EvaluationService integration, ground truth comparison, detailed reporting

## 🧪 Experimentation Guide

### Modifying Document Classes

To add new document types or modify existing ones:

1. **Edit `config/classes.yaml`**:
```yaml
classes:
  new_document_type:
    description: "Description of the new document type"
    attributes:
      - name: "attribute_name"
        description: "What this attribute represents"
        type: "string"  # or "number", "date", etc.
```

2. **Run from Step 2 onwards**: Re-run classification and the later steps so the new classes take effect (a quick validation sketch follows)
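
Before re-running, you can sanity-check that the edited file still parses and that the new class is picked up. A minimal sketch, assuming the structure shown above:

```python
import yaml

# Quick sanity check after editing config/classes.yaml
with open("config/classes.yaml") as f:
    classes_config = yaml.safe_load(f)

for name, spec in classes_config["classes"].items():
    attrs = [a["name"] for a in spec.get("attributes", [])]
    print(f"{name}: {len(attrs)} attribute(s) -> {attrs}")
```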

### Changing Models

To experiment with different AI models:

1. **Edit the relevant config files**:
```yaml
# In config/extraction.yaml
llm_method:
  model: "anthropic.claude-3-5-sonnet-20241022-v2:0"  # Change model
  temperature: 0.1  # Adjust parameters
```

2. **Run the affected steps**: Re-run only the steps that use the changed configuration
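
To smoke-test a candidate model before touching the pipeline, a direct call through the boto3 `bedrock-runtime` Converse API works; this sketch reuses the model ID from the example above and is independent of the pipeline's own invocation code:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Reply with the word OK."}]}],
    inferenceConfig={"temperature": 0.1, "maxTokens": 16},
)

print(response["output"]["message"]["content"][0]["text"])
```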

### Adjusting Confidence Thresholds

To experiment with confidence thresholds:

1. **Edit `config/assessment.yaml`**:
```yaml
assessment:
  confidence_threshold: 0.7  # Lower threshold = more permissive
```

2. **Run Steps 4-6**: Assessment, Summarization, and Evaluation
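
To preview the effect of a new threshold without re-running anything, you can filter a previous assessment result by hand. The payload below is hypothetical; adapt the field names to the actual files in `data/step4_assessment/`:

```python
THRESHOLD = 0.7  # mirrors confidence_threshold in config/assessment.yaml

# Hypothetical assessment payload; real step 4 output files may differ
assessment = {
    "invoice_number": {"value": "INV-001", "confidence": 0.95},
    "total_amount": {"value": 123.45, "confidence": 0.55},
}

flagged = {k: v for k, v in assessment.items() if v["confidence"] < THRESHOLD}
print(f"{len(flagged)} attribute(s) below {THRESHOLD}: {list(flagged)}")
```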

### Performance Optimization

- **Parallel Processing**: Modify extraction/assessment to process sections in parallel
- **Caching**: Results are automatically cached between steps
- **Batch Processing**: Process multiple documents by running the pipeline multiple times

## 🐛 Troubleshooting

### Common Issues

1. **AWS Credentials**: Ensure proper AWS configuration
```bash
aws configure list
```

2. **Missing Dependencies**: Install required packages
```bash
pip install boto3 jupyter ipython
```

3. **Memory Issues**: For large documents, consider processing sections individually

4. **Configuration Errors**: Validate YAML syntax
```bash
python -c "import yaml; yaml.safe_load(open('config/main.yaml'))"
```
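
To validate every configuration file at once rather than one at a time, a short loop works (assuming all configs sit in `config/` with a `.yaml` extension):

```python
from pathlib import Path

import yaml

for path in sorted(Path("config").glob("*.yaml")):
    try:
        yaml.safe_load(path.read_text())
        print(f"OK    {path}")
    except yaml.YAMLError as e:
        print(f"ERROR {path}: {e}")
```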

### Debug Mode

Enable detailed logging in any notebook:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

### Data Inspection

Each step saves detailed results that can be inspected:
```python
# Inspect intermediate results
import json

with open("data/step3_extraction/extraction_summary.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))
```

## 📈 Performance Monitoring

### Metrics Tracked

Each step automatically tracks:
- **Processing Time**: Total time for the step
- **Throughput**: Pages per second
- **Memory Usage**: Peak memory consumption
- **API Calls**: Number of service calls made
- **Error Rates**: Failed operations
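
If you add steps of your own and want comparable numbers, a simple wall-clock wrapper is enough to start with (a sketch, not the pipeline's built-in instrumentation; `timing.json` is a hypothetical output file):

```python
import json
import time
from pathlib import Path

def timed_step(name, func, *args, **kwargs):
    """Run a step function and persist its wall-clock duration."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    out_dir = Path(f"data/{name}")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "timing.json").write_text(
        json.dumps({"step": name, "seconds": round(elapsed, 3)})
    )
    return result
```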

### Performance Analysis

The evaluation step provides comprehensive performance analysis:
- Step-by-step timing breakdown
- Bottleneck identification
- Resource utilization metrics
- Cost analysis (for AWS services)

## 🔒 Security and Best Practices

### AWS Security
- Use IAM roles with minimal required permissions
- Enable CloudTrail for API logging
- Store sensitive data in S3 with appropriate encryption

### Data Privacy
- Documents are processed in your AWS account
- No data is sent to external services (except the configured AI models)
- Temporary files are cleaned up automatically

### Configuration Management
- Version control your configuration files
- Use environment-specific configurations for different deployments
- Document any custom modifications

## 🤝 Contributing

To extend or modify the notebooks:

1. **Follow the Pattern**: Maintain the input/output structure for compatibility
2. **Update Configurations**: Add new configuration options to the appropriate YAML files
3. **Document Changes**: Update this README and add inline documentation
4. **Test Thoroughly**: Verify that changes work across the entire pipeline

## 📚 Additional Resources

- [IDP Common Library Documentation](../docs/using-notebooks-with-idp-common.md)
- [Configuration Guide](../docs/configuration.md)
- [Evaluation Methods](../docs/evaluation.md)
- [AWS Textract Documentation](https://docs.aws.amazon.com/textract/)
- [Amazon Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)

---

**Happy Document Processing! 🚀**

For questions or support, refer to the main project documentation or create an issue in the project repository.