# Modular IDP Pipeline Notebooks

This directory contains a complete set of modular Jupyter notebooks that demonstrate the Intelligent Document Processing (IDP) pipeline using the `idp_common` library. Each notebook represents a distinct step in the IDP workflow and can be run independently or sequentially.

## 🏗️ Architecture Overview

The modular approach breaks down the IDP pipeline into discrete, manageable steps:

```
Step 0: Setup → Step 1: OCR → Step 2: Classification → Step 3: Extraction → Step 4: Assessment → Step 5: Summarization → Step 6: Evaluation
```

### Key Benefits

- **Independent Execution**: Each step can be run and tested independently
- **Modular Configuration**: Separate YAML configuration files for different components
- **Data Persistence**: Each step saves results for the next step to consume
- **Easy Experimentation**: Modify configurations without changing code
- **Comprehensive Evaluation**: Professional-grade evaluation with the EvaluationService
- **Debugging Friendly**: Isolate issues to specific processing steps

## 📁 Directory Structure

```
notebooks/examples/
├── README.md                    # This file
├── step0_setup.ipynb            # Environment setup and document initialization
├── step1_ocr.ipynb              # OCR processing using Amazon Textract
├── step2_classification.ipynb   # Document classification
├── step3_extraction.ipynb       # Structured data extraction
├── step4_assessment.ipynb       # Confidence assessment and explainability
├── step5_summarization.ipynb    # Content summarization
├── step6_evaluation.ipynb       # Final evaluation and reporting
├── config/                      # Modular configuration files
│   ├── main.yaml                # Main pipeline configuration
│   ├── classes.yaml             # Document classification definitions
│   ├── ocr.yaml                 # OCR service configuration
│   ├── classification.yaml      # Classification method configuration
│   ├── extraction.yaml          # Extraction method configuration
│   ├── assessment.yaml          # Assessment method configuration
│   ├── summarization.yaml       # Summarization method configuration
│   └── evaluation.yaml          # Evaluation method configuration
└── data/                        # Step-by-step processing results
    ├── step0_setup/             # Setup outputs
    ├── step1_ocr/               # OCR results
    ├── step2_classification/    # Classification results
    ├── step3_extraction/        # Extraction results
    ├── step4_assessment/        # Assessment results
    ├── step5_summarization/     # Summarization results
    └── step6_evaluation/        # Final evaluation results
```

## 🚀 Quick Start

### Prerequisites

1. **AWS Credentials**: Ensure your AWS credentials are configured
2. **Required Libraries**: Install the `idp_common` package
3. **Sample Document**: Place a PDF file in the project samples directory

### Running the Complete Pipeline

Execute the notebooks in sequence:

```bash
# Step 0: Set up the environment and document
jupyter notebook step0_setup.ipynb

# Step 1: Run OCR
jupyter notebook step1_ocr.ipynb

# Step 2: Classify document sections
jupyter notebook step2_classification.ipynb

# Step 3: Extract structured data
jupyter notebook step3_extraction.ipynb

# Step 4: Assess confidence and explainability
jupyter notebook step4_assessment.ipynb

# Step 5: Generate summaries
jupyter notebook step5_summarization.ipynb

# Step 6: Evaluate results and generate reports
jupyter notebook step6_evaluation.ipynb
```
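
To run the whole sequence unattended (for example in CI), the notebooks can also be executed headlessly with `jupyter nbconvert`. A minimal sketch, assuming the notebooks are run from this directory:

```bash
# Execute each notebook in place, in pipeline order (requires jupyter + nbconvert)
for nb in step0_setup step1_ocr step2_classification step3_extraction \
          step4_assessment step5_summarization step6_evaluation; do
  jupyter nbconvert --to notebook --execute --inplace "${nb}.ipynb"
done
```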

### Running Individual Steps

Each notebook can also be run on its own, provided the required input data from the previous step exists:

```python
# Each notebook loads its inputs from the previous step's data directory.
# The braces are placeholders, e.g. data/step2_classification before step 3.
from pathlib import Path

previous_step_dir = Path("data/step{n-1}_{previous_step_name}")
```
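
When jumping into the middle of the pipeline, it helps to confirm the expected inputs exist first. A minimal sketch (`check_step_inputs` is a hypothetical helper, not part of `idp_common`; the file names follow the artifact layout described under Data Flow below):

```python
from pathlib import Path

def check_step_inputs(step_dir: str) -> bool:
    """Return True if step_dir contains the artifacts the next notebook loads."""
    required = ["document.json", "config.json"]
    missing = [f for f in required if not (Path(step_dir) / f).exists()]
    if missing:
        print(f"Missing inputs in {step_dir}: {missing}")
        return False
    return True

# Example: confirm step 2 outputs exist before running step3_extraction.ipynb
check_step_inputs("data/step2_classification")
```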

## ⚙️ Configuration Management

### Modular Configuration Files

Configuration is split across multiple YAML files for better organization:

- **`config/main.yaml`**: Overall pipeline settings and AWS configuration
- **`config/classes.yaml`**: Document type definitions and attributes to extract
- **`config/ocr.yaml`**: Textract features and OCR-specific settings
- **`config/classification.yaml`**: Classification model and method configuration
- **`config/extraction.yaml`**: Extraction model and prompting configuration
- **`config/assessment.yaml`**: Assessment model and confidence thresholds
- **`config/summarization.yaml`**: Summarization models and output formats
- **`config/evaluation.yaml`**: Evaluation metrics and reporting settings

### Configuration Loading

Each notebook automatically merges all configuration files:

```python
# Automatic configuration loading in each notebook
CONFIG = load_and_merge_configs("config/")
```
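
If you want to reproduce this behavior outside the notebooks, a shallow merge over the YAML files is enough to start with. A minimal sketch, assuming later files simply override earlier top-level keys (the notebooks' actual helper may merge more carefully):

```python
from pathlib import Path

import yaml

def load_and_merge_configs(config_dir: str) -> dict:
    """Load every YAML file in config_dir into one dict.

    Sketch only: a shallow merge in which later files override
    earlier top-level keys."""
    merged: dict = {}
    for path in sorted(Path(config_dir).glob("*.yaml")):
        with open(path) as f:
            merged.update(yaml.safe_load(f) or {})
    return merged
```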

### Experimentation with Configurations

To experiment with different settings:

1. **Backup Current Config**: Copy the config directory
2. **Modify Settings**: Edit the relevant YAML files
3. **Run Specific Steps**: Execute only the affected notebooks
4. **Compare Results**: Review outputs in the data directories

## 📊 Data Flow

### Input/Output Structure

Each step follows a consistent pattern:

```python
import json
from pathlib import Path

from idp_common.models import Document  # exact import path may vary by version

# Input (from the previous step; braces are placeholders)
input_data_dir = Path("data/step{n-1}_{previous_name}")
document = Document.from_json((input_data_dir / "document.json").read_text())
with open(input_data_dir / "config.json") as f:
    config = json.load(f)

# Processing
# ... step-specific processing ...

# Output (for the next step)
output_data_dir = Path("data/step{n}_{current_name}")
output_data_dir.mkdir(parents=True, exist_ok=True)
(output_data_dir / "document.json").write_text(document.to_json())
with open(output_data_dir / "config.json", "w") as f:
    json.dump(config, f)
```

### Serialized Artifacts

Each step produces:
- **`document.json`**: Updated Document object with step results
- **`config.json`**: Complete merged configuration
- **`environment.json`**: Environment settings and metadata
- **Step-specific result files**: Detailed processing outputs

## 🔬 Detailed Step Descriptions

### Step 0: Setup (`step0_setup.ipynb`)
- **Purpose**: Initialize the Document object and prepare the processing environment
- **Inputs**: PDF file path, configuration files
- **Outputs**: Document object with pages and metadata
- **Key Features**: Multi-page PDF support, metadata extraction

### Step 1: OCR (`step1_ocr.ipynb`)
- **Purpose**: Extract text and analyze document structure using Amazon Textract
- **Inputs**: Document object with PDF pages
- **Outputs**: OCR results with text blocks, tables, and forms
- **Key Features**: Textract API integration, feature selection, result caching
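
For reference, the raw Textract call this step wraps looks roughly like the following boto3 sketch (the page image path is hypothetical; the notebook's OCR service adds paging, caching, and result handling on top):

```python
import boto3

textract = boto3.client("textract")

# Analyze one page for text plus table and form structure
with open("page1.png", "rb") as f:  # hypothetical page image
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],
    )

# Blocks carry the detected lines, words, tables, and key-value pairs
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print(f"Detected {len(lines)} text lines")
```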

### Step 2: Classification (`step2_classification.ipynb`)
- **Purpose**: Identify document types and create logical sections
- **Inputs**: Document with OCR results
- **Outputs**: Classified sections with confidence scores
- **Key Features**: Multi-modal classification, few-shot prompting, custom classes

### Step 3: Extraction (`step3_extraction.ipynb`)
- **Purpose**: Extract structured data from each classified section
- **Inputs**: Document with classified sections
- **Outputs**: Structured data for each section based on class definitions
- **Key Features**: Class-specific extraction, JSON schema validation
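
The schema validation mentioned above can be reproduced standalone with the `jsonschema` package; the schema and payload here are hypothetical examples, not the pipeline's real class definitions:

```python
from jsonschema import ValidationError, validate

# Hypothetical schema mirroring a class definition in config/classes.yaml
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number"],
}

extracted = {"invoice_number": "INV-001", "total_amount": 123.45}

try:
    validate(instance=extracted, schema=invoice_schema)
    print("Extraction matches the schema")
except ValidationError as e:
    print(f"Schema violation: {e.message}")
```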

### Step 4: Assessment (`step4_assessment.ipynb`)
- **Purpose**: Evaluate extraction confidence and provide explainability
- **Inputs**: Document with extraction results
- **Outputs**: Confidence scores and reasoning for each extracted attribute
- **Key Features**: Confidence assessment, hallucination detection, explainability

### Step 5: Summarization (`step5_summarization.ipynb`)
- **Purpose**: Generate human-readable summaries of processing results
- **Inputs**: Document with assessed extractions
- **Outputs**: Section and document-level summaries in multiple formats
- **Key Features**: Multi-format output (JSON, Markdown, HTML), customizable templates

### Step 6: Evaluation (`step6_evaluation.ipynb`)
- **Purpose**: Comprehensive evaluation of pipeline performance and accuracy
- **Inputs**: Document with complete processing results
- **Outputs**: Evaluation reports, accuracy metrics, performance analysis
- **Key Features**: EvaluationService integration, ground truth comparison, detailed reporting

## 🧪 Experimentation Guide

### Modifying Document Classes

To add new document types or modify existing ones:

1. **Edit `config/classes.yaml`**:
```yaml
classes:
  new_document_type:
    description: "Description of the new document type"
    attributes:
      - name: "attribute_name"
        description: "What this attribute represents"
        type: "string"  # or "number", "date", etc.
```

2. **Run from Step 2 onwards**: Re-run classification and the later steps so the new classes take effect (a quick validation sketch follows)
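
Before re-running, you can sanity-check that the edited file still parses and that the new class is picked up. A minimal sketch, assuming the structure shown above:

```python
import yaml

# Quick sanity check after editing config/classes.yaml
with open("config/classes.yaml") as f:
    classes_config = yaml.safe_load(f)

for name, spec in classes_config["classes"].items():
    attrs = [a["name"] for a in spec.get("attributes", [])]
    print(f"{name}: {len(attrs)} attribute(s) -> {attrs}")
```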

### Changing Models

To experiment with different AI models:

1. **Edit the relevant config files**:
```yaml
# In config/extraction.yaml
llm_method:
  model: "anthropic.claude-3-5-sonnet-20241022-v2:0"  # Change model
  temperature: 0.1  # Adjust parameters
```

2. **Run the affected steps**: Re-run only the steps that use the changed configuration
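
To smoke-test a candidate model before touching the pipeline, a direct call through the boto3 `bedrock-runtime` Converse API works; this sketch reuses the model ID from the example above and is independent of the pipeline's own invocation code:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Reply with the word OK."}]}],
    inferenceConfig={"temperature": 0.1, "maxTokens": 16},
)

print(response["output"]["message"]["content"][0]["text"])
```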

### Adjusting Confidence Thresholds

To experiment with confidence thresholds:

1. **Edit `config/assessment.yaml`**:
```yaml
assessment:
  confidence_threshold: 0.7  # Lower threshold = more permissive
```

2. **Run Steps 4-6**: Assessment, Summarization, and Evaluation
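
To preview the effect of a new threshold without re-running anything, you can filter a previous assessment result by hand. The payload below is hypothetical; adapt the field names to the actual files in `data/step4_assessment/`:

```python
THRESHOLD = 0.7  # mirrors confidence_threshold in config/assessment.yaml

# Hypothetical assessment payload; real step 4 output files may differ
assessment = {
    "invoice_number": {"value": "INV-001", "confidence": 0.95},
    "total_amount": {"value": 123.45, "confidence": 0.55},
}

flagged = {k: v for k, v in assessment.items() if v["confidence"] < THRESHOLD}
print(f"{len(flagged)} attribute(s) below {THRESHOLD}: {list(flagged)}")
```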

### Performance Optimization

- **Parallel Processing**: Modify extraction/assessment to process sections in parallel
- **Caching**: Results are automatically cached between steps
- **Batch Processing**: Process multiple documents by running the pipeline multiple times

## 🐛 Troubleshooting

### Common Issues

1. **AWS Credentials**: Ensure proper AWS configuration
```bash
aws configure list
```

2. **Missing Dependencies**: Install required packages
```bash
pip install boto3 jupyter ipython
```

3. **Memory Issues**: For large documents, consider processing sections individually

4. **Configuration Errors**: Validate YAML syntax
```bash
python -c "import yaml; yaml.safe_load(open('config/main.yaml'))"
```
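
To validate every configuration file at once rather than one at a time, a short loop works (assuming all configs sit in `config/` with a `.yaml` extension):

```python
from pathlib import Path

import yaml

for path in sorted(Path("config").glob("*.yaml")):
    try:
        yaml.safe_load(path.read_text())
        print(f"OK    {path}")
    except yaml.YAMLError as e:
        print(f"ERROR {path}: {e}")
```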

### Debug Mode

Enable detailed logging in any notebook:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

### Data Inspection

Each step saves detailed results that can be inspected:
```python
# Inspect intermediate results
import json

with open("data/step3_extraction/extraction_summary.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))
```

## 📈 Performance Monitoring

### Metrics Tracked

Each step automatically tracks:
- **Processing Time**: Total time for the step
- **Throughput**: Pages per second
- **Memory Usage**: Peak memory consumption
- **API Calls**: Number of service calls made
- **Error Rates**: Failed operations
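
If you add steps of your own and want comparable numbers, a simple wall-clock wrapper is enough to start with (a sketch, not the pipeline's built-in instrumentation; `timing.json` is a hypothetical output file):

```python
import json
import time
from pathlib import Path

def timed_step(name, func, *args, **kwargs):
    """Run a step function and persist its wall-clock duration."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    out_dir = Path(f"data/{name}")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "timing.json").write_text(
        json.dumps({"step": name, "seconds": round(elapsed, 3)})
    )
    return result
```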

### Performance Analysis

The evaluation step provides comprehensive performance analysis:
- Step-by-step timing breakdown
- Bottleneck identification
- Resource utilization metrics
- Cost analysis (for AWS services)

## 🔒 Security and Best Practices

### AWS Security
- Use IAM roles with minimal required permissions
- Enable CloudTrail for API logging
- Store sensitive data in S3 with appropriate encryption

### Data Privacy
- Documents are processed in your AWS account
- No data is sent to external services (except the configured AI models)
- Temporary files are cleaned up automatically

### Configuration Management
- Version control your configuration files
- Use environment-specific configurations for different deployments
- Document any custom modifications

## 🤝 Contributing

To extend or modify the notebooks:

1. **Follow the Pattern**: Maintain the input/output structure for compatibility
2. **Update Configurations**: Add new configuration options to the appropriate YAML files
3. **Document Changes**: Update this README and add inline documentation
4. **Test Thoroughly**: Verify that changes work across the entire pipeline

## 📚 Additional Resources

- [IDP Common Library Documentation](../docs/using-notebooks-with-idp-common.md)
- [Configuration Guide](../docs/configuration.md)
- [Evaluation Methods](../docs/evaluation.md)
- [AWS Textract Documentation](https://docs.aws.amazon.com/textract/)
- [Amazon Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)

---

**Happy Document Processing! 🚀**

For questions or support, refer to the main project documentation or create an issue in the project repository.