Commit 20114bd

Bob Strahan committed
feat(notebooks): Refactor example notebooks for better modularity and organization
1 parent 9d5081a commit 20114bd


49 files changed, +3612 -830 lines changed

notebooks/examples/README.md

Lines changed: 352 additions & 0 deletions
@@ -0,0 +1,352 @@
# Modular IDP Pipeline Notebooks

This directory contains a complete set of modular Jupyter notebooks that demonstrate the Intelligent Document Processing (IDP) pipeline using the `idp_common` library. Each notebook represents a distinct step in the IDP workflow and can be run independently or sequentially.

## 🏗️ Architecture Overview

The modular approach breaks down the IDP pipeline into discrete, manageable steps:

```
Step 0: Setup → Step 1: OCR → Step 2: Classification → Step 3: Extraction → Step 4: Assessment → Step 5: Summarization → Step 6: Evaluation
```
### Key Benefits

- **Independent Execution**: Each step can be run and tested independently
- **Modular Configuration**: Separate YAML configuration files for different components
- **Data Persistence**: Each step saves results for the next step to consume
- **Easy Experimentation**: Modify configurations without changing code
- **Comprehensive Evaluation**: Professional-grade evaluation with the EvaluationService
- **Debugging Friendly**: Isolate issues to specific processing steps
## 📁 Directory Structure

```
notebooks/examples/
├── README.md                     # This file
├── step0_setup.ipynb             # Environment setup and document initialization
├── step1_ocr.ipynb               # OCR processing using Amazon Textract
├── step2_classification.ipynb    # Document classification
├── step3_extraction.ipynb        # Structured data extraction
├── step4_assessment.ipynb        # Confidence assessment and explainability
├── step5_summarization.ipynb     # Content summarization
├── step6_evaluation.ipynb        # Final evaluation and reporting
├── config/                       # Modular configuration files
│   ├── main.yaml                 # Main pipeline configuration
│   ├── classes.yaml              # Document classification definitions
│   ├── ocr.yaml                  # OCR service configuration
│   ├── classification.yaml       # Classification method configuration
│   ├── extraction.yaml           # Extraction method configuration
│   ├── assessment.yaml           # Assessment method configuration
│   ├── summarization.yaml        # Summarization method configuration
│   └── evaluation.yaml           # Evaluation method configuration
└── data/                         # Step-by-step processing results
    ├── step0_setup/              # Setup outputs
    ├── step1_ocr/                # OCR results
    ├── step2_classification/     # Classification results
    ├── step3_extraction/         # Extraction results
    ├── step4_assessment/         # Assessment results
    ├── step5_summarization/      # Summarization results
    └── step6_evaluation/         # Final evaluation results
```
## 🚀 Quick Start

### Prerequisites

1. **AWS Credentials**: Ensure your AWS credentials are configured
2. **Required Libraries**: Install the `idp_common` package
3. **Sample Document**: Place a PDF file in the project samples directory

### Running the Complete Pipeline

Execute the notebooks in sequence:

```bash
# 1. Setup environment and document
jupyter notebook step0_setup.ipynb

# 2. Process OCR
jupyter notebook step1_ocr.ipynb

# 3. Classify document sections
jupyter notebook step2_classification.ipynb

# 4. Extract structured data
jupyter notebook step3_extraction.ipynb

# 5. Assess confidence and explainability
jupyter notebook step4_assessment.ipynb

# 6. Generate summaries
jupyter notebook step5_summarization.ipynb

# 7. Evaluate results and generate reports
jupyter notebook step6_evaluation.ipynb
```
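To run the whole sequence non-interactively, a small driver script along the following lines can execute each notebook in place with `jupyter nbconvert`. This is a sketch only; it assumes `nbconvert` is installed and that you run it from this directory.

```python
# run_all_steps.py - minimal sketch for executing the step notebooks in order.
# Assumes jupyter + nbconvert are installed and the notebooks live in this directory.
import subprocess

NOTEBOOKS = [
    "step0_setup.ipynb",
    "step1_ocr.ipynb",
    "step2_classification.ipynb",
    "step3_extraction.ipynb",
    "step4_assessment.ipynb",
    "step5_summarization.ipynb",
    "step6_evaluation.ipynb",
]

for nb in NOTEBOOKS:
    print(f"Executing {nb} ...")
    # --inplace writes the executed cells back into the notebook file
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )
```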
### Running Individual Steps

Each notebook can be run independently by ensuring the required input data exists:

```python
from pathlib import Path

# Each notebook loads its inputs from the previous step's data directory
previous_step_dir = Path("data/step{n-1}_{previous_step_name}")
```
## ⚙️ Configuration Management

### Modular Configuration Files

Configuration is split across multiple YAML files for better organization:

- **`config/main.yaml`**: Overall pipeline settings and AWS configuration
- **`config/classes.yaml`**: Document type definitions and attributes to extract
- **`config/ocr.yaml`**: Textract features and OCR-specific settings
- **`config/classification.yaml`**: Classification model and method configuration
- **`config/extraction.yaml`**: Extraction model and prompting configuration
- **`config/assessment.yaml`**: Assessment model and confidence thresholds
- **`config/summarization.yaml`**: Summarization models and output formats
- **`config/evaluation.yaml`**: Evaluation metrics and reporting settings

### Configuration Loading

Each notebook automatically merges all configuration files:

```python
# Automatic configuration loading in each notebook
CONFIG = load_and_merge_configs("config/")
```
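`load_and_merge_configs` is a helper defined in the notebooks themselves. A minimal sketch of the idea, assuming each YAML file holds a mapping and later files simply override earlier keys, could look like this; the real helper may merge more carefully:

```python
# Sketch of a config merge helper (illustrative; not the notebooks' exact implementation).
from pathlib import Path
import yaml

def load_and_merge_configs(config_dir: str) -> dict:
    merged: dict = {}
    for path in sorted(Path(config_dir).glob("*.yaml")):
        with open(path) as f:
            section = yaml.safe_load(f) or {}
        merged.update(section)  # shallow merge; later files win on key conflicts
    return merged
```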
### Experimentation with Configurations

To experiment with different settings:

1. **Backup Current Config**: Copy the config directory
2. **Modify Settings**: Edit the relevant YAML files
3. **Run Specific Steps**: Execute only the affected notebooks
4. **Compare Results**: Review outputs in the data directories (see the sketch below)
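A backup-and-compare workflow might look like the sketch below. The result file name follows the `extraction_summary.json` convention used later in this README, and the `previous_run/` path is just a placeholder for wherever you kept the earlier output; adjust both to what you actually want to compare.

```python
# Sketch: back up the current config, then (after a re-run) diff two result files.
import json
import shutil
from datetime import datetime

# 1. Backup the current config before editing it
backup_dir = f"config_backup_{datetime.now():%Y%m%d_%H%M%S}"
shutil.copytree("config", backup_dir)

# 4. Compare results from two runs (assumes you copied the earlier output to previous_run/)
with open("data/step3_extraction/extraction_summary.json") as f:
    new_results = json.load(f)
with open("previous_run/extraction_summary.json") as f:
    old_results = json.load(f)
print("Changed keys:", {k for k in new_results if new_results.get(k) != old_results.get(k)})
```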
## 📊 Data Flow

### Input/Output Structure

Each step follows a consistent pattern:

```python
import json
from pathlib import Path

# Input (from previous step)
input_data_dir = Path("data/step{n-1}_{previous_name}")
document = Document.from_json((input_data_dir / "document.json").read_text())
with open(input_data_dir / "config.json") as f:
    config = json.load(f)

# Processing
# ... step-specific processing ...

# Output (for next step)
output_data_dir = Path("data/step{n}_{current_name}")
output_data_dir.mkdir(parents=True, exist_ok=True)
(output_data_dir / "document.json").write_text(document.to_json())
with open(output_data_dir / "config.json", "w") as f:
    json.dump(config, f)
```
### Serialized Artifacts

Each step produces:

- **`document.json`**: Updated Document object with step results
- **`config.json`**: Complete merged configuration
- **`environment.json`**: Environment settings and metadata
- **Step-specific result files**: Detailed processing outputs
## 🔬 Detailed Step Descriptions

### Step 0: Setup (`step0_setup.ipynb`)
- **Purpose**: Initialize the Document object and prepare the processing environment
- **Inputs**: PDF file path, configuration files
- **Outputs**: Document object with pages and metadata
- **Key Features**: Multi-page PDF support, metadata extraction

### Step 1: OCR (`step1_ocr.ipynb`)
- **Purpose**: Extract text and analyze document structure using Amazon Textract
- **Inputs**: Document object with PDF pages
- **Outputs**: OCR results with text blocks, tables, and forms
- **Key Features**: Textract API integration, feature selection, result caching

### Step 2: Classification (`step2_classification.ipynb`)
- **Purpose**: Identify document types and create logical sections
- **Inputs**: Document with OCR results
- **Outputs**: Classified sections with confidence scores
- **Key Features**: Multi-modal classification, few-shot prompting, custom classes

### Step 3: Extraction (`step3_extraction.ipynb`)
- **Purpose**: Extract structured data from each classified section
- **Inputs**: Document with classified sections
- **Outputs**: Structured data for each section based on class definitions
- **Key Features**: Class-specific extraction, JSON schema validation

### Step 4: Assessment (`step4_assessment.ipynb`)
- **Purpose**: Evaluate extraction confidence and provide explainability
- **Inputs**: Document with extraction results
- **Outputs**: Confidence scores and reasoning for each extracted attribute
- **Key Features**: Confidence assessment, hallucination detection, explainability

### Step 5: Summarization (`step5_summarization.ipynb`)
- **Purpose**: Generate human-readable summaries of processing results
- **Inputs**: Document with assessed extractions
- **Outputs**: Section and document-level summaries in multiple formats
- **Key Features**: Multi-format output (JSON, Markdown, HTML), customizable templates

### Step 6: Evaluation (`step6_evaluation.ipynb`)
- **Purpose**: Comprehensive evaluation of pipeline performance and accuracy
- **Inputs**: Document with complete processing results
- **Outputs**: Evaluation reports, accuracy metrics, performance analysis
- **Key Features**: EvaluationService integration, ground truth comparison, detailed reporting
## 🧪 Experimentation Guide

### Modifying Document Classes

To add new document types or modify existing ones:

1. **Edit `config/classes.yaml`**:

   ```yaml
   classes:
     new_document_type:
       description: "Description of the new document type"
       attributes:
         - name: "attribute_name"
           description: "What this attribute represents"
           type: "string"  # or "number", "date", etc.
   ```

2. **Run from Step 2 onwards**: Re-run classification and the later steps so the new classes take effect (a quick way to sanity-check the file first is shown below)
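After editing, a quick sanity check can confirm that the file parses and lists the classes and attributes you expect. This sketch assumes the structure shown in the example above (a `classes` mapping keyed by document type, each entry carrying an `attributes` list):

```python
# Sanity-check config/classes.yaml (assumes the structure shown in the example above)
import yaml

with open("config/classes.yaml") as f:
    classes_cfg = yaml.safe_load(f)

for class_name, spec in classes_cfg.get("classes", {}).items():
    attrs = [a["name"] for a in spec.get("attributes", [])]
    print(f"{class_name}: {len(attrs)} attributes -> {attrs}")
```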
### Changing Models

To experiment with different AI models:

1. **Edit relevant config files**:

   ```yaml
   # In config/extraction.yaml
   llm_method:
     model: "anthropic.claude-3-5-sonnet-20241022-v2:0"  # Change model
     temperature: 0.1  # Adjust parameters
   ```

2. **Run affected steps**: Only the steps that use the changed configuration
### Adjusting Confidence Thresholds

To experiment with confidence thresholds:

1. **Edit `config/assessment.yaml`**:

   ```yaml
   assessment:
     confidence_threshold: 0.7  # Lower threshold = more permissive
   ```

2. **Run Steps 4-6**: Assessment, Summarization, and Evaluation
### Performance Optimization

- **Parallel Processing**: Modify extraction/assessment to process sections in parallel (see the sketch after this list)
- **Caching**: Results are automatically cached between steps
- **Batch Processing**: Process multiple documents by running the pipeline multiple times
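A minimal sketch of fanning out per-section work with a thread pool is shown below. `process_section` is a placeholder for whatever a single section needs (for example, the extraction call), not an `idp_common` API, and `document.sections` is assumed here as the container of classified sections:

```python
# Sketch: process classified sections concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def process_section(section):
    # Placeholder: run the step's per-section work here (e.g. extraction for one section)
    ...
    return section

with ThreadPoolExecutor(max_workers=4) as pool:
    processed_sections = list(pool.map(process_section, document.sections))
```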
## 🐛 Troubleshooting

### Common Issues

1. **AWS Credentials**: Ensure proper AWS configuration

   ```bash
   aws configure list
   ```

2. **Missing Dependencies**: Install required packages

   ```bash
   pip install boto3 jupyter ipython
   ```

3. **Memory Issues**: For large documents, consider processing sections individually

4. **Configuration Errors**: Validate YAML syntax

   ```bash
   python -c "import yaml; yaml.safe_load(open('config/main.yaml'))"
   ```
### Debug Mode

Enable detailed logging in any notebook:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
### Data Inspection

Each step saves detailed results that can be inspected:

```python
# Inspect intermediate results
import json

with open("data/step3_extraction/extraction_summary.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))
```
## 📈 Performance Monitoring

### Metrics Tracked

Each step automatically tracks:

- **Processing Time**: Total time for the step (see the timing sketch below)
- **Throughput**: Pages per second
- **Memory Usage**: Peak memory consumption
- **API Calls**: Number of service calls made
- **Error Rates**: Failed operations
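The processing-time and throughput numbers amount to something like the following illustration; the notebooks record these values alongside their outputs, and `document.pages` is assumed here as the page container on the Document object:

```python
# Illustrative timing/throughput measurement for a single step
import time

start = time.perf_counter()
# ... run the step's processing here ...
elapsed = time.perf_counter() - start

num_pages = len(document.pages)  # assumption: the Document object exposes its pages
print(f"Processing time: {elapsed:.1f}s | Throughput: {num_pages / elapsed:.2f} pages/sec")
```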
### Performance Analysis

The evaluation step provides comprehensive performance analysis:

- Step-by-step timing breakdown
- Bottleneck identification
- Resource utilization metrics
- Cost analysis (for AWS services)
## 🔒 Security and Best Practices

### AWS Security

- Use IAM roles with minimal required permissions
- Enable CloudTrail for API logging
- Store sensitive data in S3 with appropriate encryption

### Data Privacy

- Documents are processed in your AWS account
- No data is sent to external services (except configured AI models)
- Temporary files are cleaned up automatically

### Configuration Management

- Version control your configuration files
- Use environment-specific configurations for different deployments
- Document any custom modifications
## 🤝 Contributing

To extend or modify the notebooks:

1. **Follow the Pattern**: Maintain the input/output structure for compatibility
2. **Update Configurations**: Add new configuration options to appropriate YAML files
3. **Document Changes**: Update this README and add inline documentation
4. **Test Thoroughly**: Verify that changes work across the entire pipeline
## 📚 Additional Resources

- [IDP Common Library Documentation](../docs/using-notebooks-with-idp-common.md)
- [Configuration Guide](../docs/configuration.md)
- [Evaluation Methods](../docs/evaluation.md)
- [AWS Textract Documentation](https://docs.aws.amazon.com/textract/)
- [Amazon Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)

---

**Happy Document Processing! 🚀**

For questions or support, refer to the main project documentation or create an issue in the project repository.
