Commit 1eb420a

Merge branch 'feature/modular-notebooks' into 'develop'
Feature/modular notebooks
See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!180
2 parents aab8673 + 871dd3f commit 1eb420a

66 files changed, +6428 -310 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ SPDX-License-Identifier: MIT-0

## [Unreleased]

### Added

- New example notebooks with improved clarity, modularity, and documentation.
- Added confidence threshold to evaluation outputs to enable prioritizing accuracy results for attributes with higher confidence thresholds.
- Comprehensive Metering Data: The system now captures and stores detailed metering data for analytics, including:
  - Which services were used (Textract, Bedrock, etc.)

notebooks/examples/README.md

Lines changed: 352 additions & 0 deletions
@@ -0,0 +1,352 @@
# Modular IDP Pipeline Notebooks

This directory contains a complete set of modular Jupyter notebooks that demonstrate the Intelligent Document Processing (IDP) pipeline using the `idp_common` library. Each notebook represents a distinct step in the IDP workflow and can be run independently or sequentially.
## 🏗️ Architecture Overview

The modular approach breaks down the IDP pipeline into discrete, manageable steps:

```
Step 0: Setup → Step 1: OCR → Step 2: Classification → Step 3: Extraction → Step 4: Assessment → Step 5: Summarization → Step 6: Evaluation
```
### Key Benefits

- **Independent Execution**: Each step can be run and tested independently
- **Modular Configuration**: Separate YAML configuration files for different components
- **Data Persistence**: Each step saves results for the next step to consume
- **Easy Experimentation**: Modify configurations without changing code
- **Comprehensive Evaluation**: Professional-grade evaluation with the EvaluationService
- **Debugging Friendly**: Isolate issues to specific processing steps
## 📁 Directory Structure

```
notebooks/examples/
├── README.md                    # This file
├── step0_setup.ipynb            # Environment setup and document initialization
├── step1_ocr.ipynb              # OCR processing using Amazon Textract
├── step2_classification.ipynb   # Document classification
├── step3_extraction.ipynb       # Structured data extraction
├── step4_assessment.ipynb       # Confidence assessment and explainability
├── step5_summarization.ipynb    # Content summarization
├── step6_evaluation.ipynb       # Final evaluation and reporting
├── config/                      # Modular configuration files
│   ├── main.yaml                # Main pipeline configuration
│   ├── classes.yaml             # Document classification definitions
│   ├── ocr.yaml                 # OCR service configuration
│   ├── classification.yaml      # Classification method configuration
│   ├── extraction.yaml          # Extraction method configuration
│   ├── assessment.yaml          # Assessment method configuration
│   ├── summarization.yaml       # Summarization method configuration
│   └── evaluation.yaml          # Evaluation method configuration
└── data/                        # Step-by-step processing results
    ├── step0_setup/             # Setup outputs
    ├── step1_ocr/               # OCR results
    ├── step2_classification/    # Classification results
    ├── step3_extraction/        # Extraction results
    ├── step4_assessment/        # Assessment results
    ├── step5_summarization/     # Summarization results
    └── step6_evaluation/        # Final evaluation results
```
## 🚀 Quick Start

### Prerequisites

1. **AWS Credentials**: Ensure your AWS credentials are configured
2. **Required Libraries**: Install the `idp_common` package
3. **Sample Document**: Place a PDF file in the project samples directory
### Running the Complete Pipeline

Execute the notebooks in sequence:

```bash
# 1. Setup environment and document
jupyter notebook step0_setup.ipynb

# 2. Process OCR
jupyter notebook step1_ocr.ipynb

# 3. Classify document sections
jupyter notebook step2_classification.ipynb

# 4. Extract structured data
jupyter notebook step3_extraction.ipynb

# 5. Assess confidence and explainability
jupyter notebook step4_assessment.ipynb

# 6. Generate summaries
jupyter notebook step5_summarization.ipynb

# 7. Evaluate results and generate reports
jupyter notebook step6_evaluation.ipynb
```
### Running Individual Steps

Each notebook can be run independently by ensuring the required input data exists:

```python
from pathlib import Path

# Each notebook loads its inputs from the previous step's data directory,
# following the pattern data/step{n-1}_{previous_step_name}. For example:
previous_step_dir = Path("data/step2_classification")
```
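When a step is run in isolation, it also helps to fail fast if the previous step's artifacts are missing. A minimal guard, assuming the artifact layout described under Serialized Artifacts below:

```python
import json
from pathlib import Path

previous_step_dir = Path("data/step2_classification")
document_path = previous_step_dir / "document.json"

# Fail fast if the previous notebook has not been run yet
if not document_path.exists():
    raise FileNotFoundError(f"{document_path} missing; run the previous notebook first.")

# Peek at the serialized document before continuing
document_data = json.loads(document_path.read_text())
print(f"Loaded document with top-level keys: {sorted(document_data)}")
```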
## ⚙️ Configuration Management

### Modular Configuration Files

Configuration is split across multiple YAML files for better organization:

- **`config/main.yaml`**: Overall pipeline settings and AWS configuration
- **`config/classes.yaml`**: Document type definitions and attributes to extract
- **`config/ocr.yaml`**: Textract features and OCR-specific settings
- **`config/classification.yaml`**: Classification model and method configuration
- **`config/extraction.yaml`**: Extraction model and prompting configuration
- **`config/assessment.yaml`**: Assessment model and confidence thresholds
- **`config/summarization.yaml`**: Summarization models and output formats
- **`config/evaluation.yaml`**: Evaluation metrics and reporting settings

### Configuration Loading

Each notebook automatically merges all configuration files:

```python
# Automatic configuration loading in each notebook
CONFIG = load_and_merge_configs("config/")
```
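The notebooks define this helper themselves; a minimal sketch of what the merge could look like, assuming each YAML file contributes distinct top-level keys so a shallow merge suffices:

```python
from pathlib import Path

import yaml

def load_and_merge_configs(config_dir: str) -> dict:
    """Shallow-merge every YAML file in config_dir into one dict (sketch)."""
    merged: dict = {}
    for config_path in sorted(Path(config_dir).glob("*.yaml")):
        section = yaml.safe_load(config_path.read_text()) or {}
        merged.update(section)  # last file wins on any key clash
    return merged

CONFIG = load_and_merge_configs("config/")
```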
### Experimentation with Configurations

To experiment with different settings:

1. **Backup Current Config**: Copy the config directory
2. **Modify Settings**: Edit the relevant YAML files
3. **Run Specific Steps**: Execute only the affected notebooks
4. **Compare Results**: Review outputs in the data directories
## 📊 Data Flow

### Input/Output Structure

Each step follows a consistent pattern:

```python
import json
from pathlib import Path
# Document comes from the idp_common library

# Input (from previous step); {n-1} and {previous_name} stand for the
# concrete step number and name, e.g. data/step2_classification
input_data_dir = Path("data/step{n-1}_{previous_name}")
document = Document.from_json((input_data_dir / "document.json").read_text())
with open(input_data_dir / "config.json") as f:
    config = json.load(f)

# Processing
# ... step-specific processing ...

# Output (for next step)
output_data_dir = Path("data/step{n}_{current_name}")
output_data_dir.mkdir(parents=True, exist_ok=True)
(output_data_dir / "document.json").write_text(document.to_json())
with open(output_data_dir / "config.json", "w") as f:
    json.dump(config, f)
```
### Serialized Artifacts

Each step produces:

- **`document.json`**: Updated Document object with step results
- **`config.json`**: Complete merged configuration
- **`environment.json`**: Environment settings and metadata
- **Step-specific result files**: Detailed processing outputs
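A quick way to see which artifacts each completed step has actually written:

```python
from pathlib import Path

# List the artifacts produced by every step run so far
for step_dir in sorted(Path("data").glob("step*_*")):
    artifacts = sorted(p.name for p in step_dir.iterdir())
    print(f"{step_dir.name}: {artifacts}")
```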
## 🔬 Detailed Step Descriptions

### Step 0: Setup (`step0_setup.ipynb`)
- **Purpose**: Initialize the Document object and prepare the processing environment
- **Inputs**: PDF file path, configuration files
- **Outputs**: Document object with pages and metadata
- **Key Features**: Multi-page PDF support, metadata extraction
### Step 1: OCR (`step1_ocr.ipynb`)
- **Purpose**: Extract text and analyze document structure using Amazon Textract
- **Inputs**: Document object with PDF pages
- **Outputs**: OCR results with text blocks, tables, and forms
- **Key Features**: Textract API integration, feature selection, result caching
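The notebook drives Textract through the `idp_common` OCR service; for orientation only, a bare-bones boto3 equivalent for a single page image might look like this (a sketch, not the notebook's actual code):

```python
import boto3

textract = boto3.client("textract")

# Analyze one page image for raw text plus table and form structure
with open("page_1.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],
    )

# LINE blocks carry recognized text with a per-line confidence score
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(f"{block['Confidence']:5.1f}  {block['Text']}")
```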
### Step 2: Classification (`step2_classification.ipynb`)
- **Purpose**: Identify document types and create logical sections
- **Inputs**: Document with OCR results
- **Outputs**: Classified sections with confidence scores
- **Key Features**: Multi-modal classification, few-shot prompting, custom classes
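Under the hood this is an LLM call via Amazon Bedrock; a rough illustration using the Converse API (the class names and prompt here are placeholders, not the shipped configuration):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

page_text = "..."  # OCR text for the pages being classified
prompt = (
    "Classify this document as one of: invoice, bank_statement, other. "
    "Respond with only the class name.\n\n" + page_text
)

response = bedrock.converse(
    modelId="us.amazon.nova-pro-v1:0",  # model id used elsewhere in this commit
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"temperature": 0.0, "maxTokens": 50},
)
print(response["output"]["message"]["content"][0]["text"])
```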
### Step 3: Extraction (`step3_extraction.ipynb`)
- **Purpose**: Extract structured data from each classified section
- **Inputs**: Document with classified sections
- **Outputs**: Structured data for each section based on class definitions
- **Key Features**: Class-specific extraction, JSON schema validation
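The JSON schema validation amounts to checking the model's JSON output against the attribute definitions in `config/classes.yaml`; conceptually (a simplified sketch with illustrative attribute names):

```python
import json

# Attributes an "invoice" class might define in classes.yaml (illustrative)
expected_attributes = {"invoice_number", "invoice_date", "total_amount"}

raw_output = '{"invoice_number": "INV-001", "invoice_date": "2024-01-15"}'
extracted = json.loads(raw_output)

missing = expected_attributes - extracted.keys()
unexpected = extracted.keys() - expected_attributes
if missing or unexpected:
    print(f"Schema mismatch: missing={missing}, unexpected={unexpected}")
```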
### Step 4: Assessment (`step4_assessment.ipynb`)
- **Purpose**: Evaluate extraction confidence and provide explainability
- **Inputs**: Document with extraction results
- **Outputs**: Confidence scores and reasoning for each extracted attribute
- **Key Features**: Confidence assessment, hallucination detection, explainability
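The assessment model returns JSON with a `confidence_score` and `confidence_reason` per attribute (see the `task_prompt` in `config/assessment.yaml`, included later in this commit), so flagging attributes for review is a simple pass over that structure:

```python
# Shape of the assessment model's JSON response (per config/assessment.yaml)
assessment = {
    "invoice_number": {
        "confidence_score": 0.95,
        "confidence_reason": "Clear text evidence in document header.",
    },
    "total_amount": {
        "confidence_score": 0.55,
        "confidence_reason": "Poor scan quality in the amount region.",
    },
}

threshold = 0.9  # matches default_confidence_threshold in config/assessment.yaml
for attribute, result in assessment.items():
    if result["confidence_score"] < threshold:
        print(f"REVIEW {attribute}: {result['confidence_reason']}")
```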
### Step 5: Summarization (`step5_summarization.ipynb`)
- **Purpose**: Generate human-readable summaries of processing results
- **Inputs**: Document with assessed extractions
- **Outputs**: Section and document-level summaries in multiple formats
- **Key Features**: Multi-format output (JSON, Markdown, HTML), customizable templates
### Step 6: Evaluation (`step6_evaluation.ipynb`)
- **Purpose**: Comprehensive evaluation of pipeline performance and accuracy
- **Inputs**: Document with complete processing results
- **Outputs**: Evaluation reports, accuracy metrics, performance analysis
- **Key Features**: EvaluationService integration, ground truth comparison, detailed reporting
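The EvaluationService computes far richer metrics, but the heart of ground truth comparison is per-attribute matching; an illustrative reduction:

```python
# Illustrative ground truth vs. extracted values for one section
ground_truth = {"invoice_number": "INV-001", "total_amount": "123.45"}
extracted = {"invoice_number": "INV-001", "total_amount": "123.46"}

matches = sum(1 for key, value in ground_truth.items() if extracted.get(key) == value)
accuracy = matches / len(ground_truth)
print(f"Exact-match accuracy: {accuracy:.0%}")  # prints 50%
```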
## 🧪 Experimentation Guide

### Modifying Document Classes

To add new document types or modify existing ones:

1. **Edit `config/classes.yaml`**:
   ```yaml
   classes:
     new_document_type:
       description: "Description of the new document type"
       attributes:
         - name: "attribute_name"
           description: "What this attribute represents"
           type: "string"  # or "number", "date", etc.
   ```

2. **Run from Step 2 onwards**: Re-run Classification and the later steps to process with the new classes
### Changing Models

To experiment with different AI models:

1. **Edit relevant config files**:
   ```yaml
   # In config/extraction.yaml
   llm_method:
     model: "anthropic.claude-3-5-sonnet-20241022-v2:0"  # Change model
     temperature: 0.1                                    # Adjust parameters
   ```

2. **Run affected steps**: Only the steps that use the changed configuration
### Adjusting Confidence Thresholds

To experiment with confidence thresholds:

1. **Edit `config/assessment.yaml`**:
   ```yaml
   assessment:
     confidence_threshold: 0.7  # Lower threshold = more permissive
   ```

2. **Run Steps 4-6**: Assessment, Summarization, and Evaluation
### Performance Optimization

- **Parallel Processing**: Modify extraction/assessment to process sections in parallel
- **Caching**: Results are automatically cached between steps
- **Batch Processing**: Process multiple documents by running the pipeline multiple times
## 🐛 Troubleshooting

### Common Issues

1. **AWS Credentials**: Ensure proper AWS configuration
   ```bash
   aws configure list
   ```

2. **Missing Dependencies**: Install required packages
   ```bash
   pip install boto3 jupyter ipython
   ```

3. **Memory Issues**: For large documents, consider processing sections individually

4. **Configuration Errors**: Validate YAML syntax
   ```bash
   python -c "import yaml; yaml.safe_load(open('config/main.yaml'))"
   ```
### Debug Mode

Enable detailed logging in any notebook:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
### Data Inspection

Each step saves detailed results that can be inspected:

```python
# Inspect intermediate results
import json

with open("data/step3_extraction/extraction_summary.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))
```
## 📈 Performance Monitoring

### Metrics Tracked

Each step automatically tracks:

- **Processing Time**: Total time for the step
- **Throughput**: Pages per second
- **Memory Usage**: Peak memory consumption
- **API Calls**: Number of service calls made
- **Error Rates**: Failed operations
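For ad-hoc timing on top of the built-in tracking, a small context manager around a step's main call is enough (`run_ocr` below is a hypothetical stand-in for whatever the notebook actually invokes):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(step_name: str, num_pages: int):
    """Print elapsed wall-clock time and throughput for the wrapped block."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{step_name}: {elapsed:.2f}s ({num_pages / elapsed:.2f} pages/s)")

# Usage, with run_ocr as a hypothetical stand-in for a step's main call:
# with timed("step1_ocr", num_pages=page_count):
#     run_ocr(document, CONFIG)
```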
### Performance Analysis

The evaluation step provides comprehensive performance analysis:

- Step-by-step timing breakdown
- Bottleneck identification
- Resource utilization metrics
- Cost analysis (for AWS services)
## 🔒 Security and Best Practices

### AWS Security
- Use IAM roles with minimal required permissions
- Enable CloudTrail for API logging
- Store sensitive data in S3 with appropriate encryption

### Data Privacy
- Documents are processed in your AWS account
- No data is sent to external services (except configured AI models)
- Temporary files are cleaned up automatically

### Configuration Management
- Version control your configuration files
- Use environment-specific configurations for different deployments
- Document any custom modifications
## 🤝 Contributing

To extend or modify the notebooks:

1. **Follow the Pattern**: Maintain the input/output structure for compatibility
2. **Update Configurations**: Add new configuration options to appropriate YAML files
3. **Document Changes**: Update this README and add inline documentation
4. **Test Thoroughly**: Verify that changes work across the entire pipeline
## 📚 Additional Resources

- [IDP Common Library Documentation](../docs/using-notebooks-with-idp-common.md)
- [Configuration Guide](../docs/configuration.md)
- [Evaluation Methods](../docs/evaluation.md)
- [AWS Textract Documentation](https://docs.aws.amazon.com/textract/)
- [Amazon Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)

---

**Happy Document Processing! 🚀**

For questions or support, refer to the main project documentation or create an issue in the project repository.
notebooks/examples/config/assessment.yaml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@

# Assessment Service Configuration
assessment:
  default_confidence_threshold: "0.9"
  top_p: "0.1"
  max_tokens: "4096"
  top_k: "5"
  temperature: "0.0"
  model: "us.amazon.nova-pro-v1:0"
  system_prompt: "You are a document analysis assessment expert. Your task is to evaluate the confidence and accuracy of extraction results by analyzing the source document evidence. Respond only with JSON containing confidence scores and reasoning for each extracted attribute."
  task_prompt: "<background>\nYou are an expert document analysis assessment system. Your task is to evaluate the confidence and accuracy of extraction results for a document of class {DOCUMENT_CLASS}.\n</background>\n\n<task>\nAnalyze the extraction results against the source document and provide confidence assessments for each extracted attribute. Consider factors such as:\n1. Text clarity and OCR quality in the source regions 2. Alignment between extracted values and document content 3. Presence of clear evidence supporting the extraction 4. Potential ambiguity or uncertainty in the source material 5. Completeness and accuracy of the extracted information\n</task>\n\n<assessment-guidelines>\nFor each attribute, provide: 1. A confidence score between 0.0 and 1.0 where:\n - 1.0 = Very high confidence, clear and unambiguous evidence\n - 0.8-0.9 = High confidence, strong evidence with minor uncertainty\n - 0.6-0.7 = Medium confidence, reasonable evidence but some ambiguity\n - 0.4-0.5 = Low confidence, weak or unclear evidence\n - 0.0-0.3 = Very low confidence, little to no supporting evidence\n\n2. A clear reason explaining the confidence score, including:\n - What evidence supports or contradicts the extraction\n - Any OCR quality issues that affect confidence\n - Clarity of the source document in relevant areas\n - Any ambiguity or uncertainty factors\n\nGuidelines: - Base assessments on actual document content and OCR quality - Consider both text-based evidence and visual/layout clues - Account for OCR confidence scores when provided - Be objective and specific in reasoning - If an extraction appears incorrect, score accordingly with explanation\n</assessment-guidelines>\n<attributes-definitions>\n{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}\n</attributes-definitions>\n\n<<CACHEPOINT>>\n\n<extraction-results>\n{EXTRACTION_RESULTS}\n</extraction-results>\n\n<document-image>\n{DOCUMENT_IMAGE}\n</document-image>\n\n<ocr-text-confidence-results>\n{OCR_TEXT_CONFIDENCE}\n</ocr-text-confidence-results>\n\n<final-instructions>\nAnalyze the extraction results against the source document and provide confidence assessments. Return a JSON object with the following structure:\n\n {\n \"attribute_name_1\": {\n \"confidence_score\": 0.85,\n \"confidence_reason\": \"Clear text evidence found in document header with high OCR confidence (0.98). Value matches exactly.\"\n },\n \"attribute_name_2\": {\n \"confidence_score\": 0.65,\n \"confidence_reason\": \"Text is partially unclear due to poor scan quality. OCR confidence low (0.72) in this region.\"\n }\n }\n\nInclude assessments for ALL attributes present in the extraction results.\n</final-instructions>"
