| 1 | +# Benchmark Utilities |
| 2 | + |
| 3 | +This directory contains utility scripts for working with benchmark and evaluation datasets. |
| 4 | + |
| 5 | +## prep_baseline_data.py |
| 6 | + |
| 7 | +Convert ground truth data from JSONL format to IDP Accelerator evaluation baseline format. |
| 8 | + |
| 9 | +### Purpose |
| 10 | + |
| 11 | +This script processes JSONL files containing document ground truth labels and converts them into the directory structure required by the IDP Accelerator's evaluation framework. |
| 12 | + |
| 13 | +### Input Format |
| 14 | + |
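| 15 | +Each line of the input JSONL file is a single JSON object with the fields below (pretty-printed here for readability; in the file each record occupies one line). Note that `labels` is a JSON-encoded string, not a nested object: |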
| 16 | +```json |
| 17 | +{ |
| 18 | + "document_path": "path/to/document.pdf", |
| 19 | + "labels": "{\"field1\": \"value1\", \"field2\": \"value2\", ...}" |
| 20 | +} |
| 21 | +``` |
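Because `labels` is itself a JSON string, reading a record takes two decoding steps. The following is a minimal sketch of that idea, not the script's actual code; the helper name `parse_record` is hypothetical.

```python
import json


def parse_record(line):
    """Parse one JSONL line into (document_path, labels dict).

    Hypothetical helper for illustration; the real script may differ.
    """
    record = json.loads(line)  # first decode: the record itself
    document_path = record["document_path"]
    labels = record["labels"]
    if isinstance(labels, str):
        labels = json.loads(labels)  # second decode: labels is a JSON string
    return document_path, labels


# Build a sample line shaped like the input format shown above.
sample = json.dumps({
    "document_path": "path/to/document.pdf",
    "labels": json.dumps({"field1": "value1", "field2": "value2"}),
})
document_path, labels = parse_record(sample)
print(document_path, labels)
```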
| 22 | + |
| 23 | +### Output Format |
| 24 | + |
| 25 | +Creates the following directory structure: |
| 26 | +``` |
| 27 | +<output_base_path>/ |
| 28 | +├── document1.pdf/ |
| 29 | +│ └── sections/ |
| 30 | +│ └── 1/ |
| 31 | +│ └── result.json |
| 32 | +├── document2.pdf/ |
| 33 | +│ └── sections/ |
| 34 | +│ └── 1/ |
| 35 | +│ └── result.json |
| 36 | +... |
| 37 | +``` |
| 38 | + |
| 39 | +Where each `result.json` contains: |
| 40 | +```json |
| 41 | +{ |
| 42 | + "inference_result": { |
| 43 | + "field1": "value1", |
| 44 | + "field2": "value2", |
| 45 | + ... |
| 46 | + } |
| 47 | +} |
| 48 | +``` |
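The layout above can be reproduced with a few lines of `pathlib`. This is a sketch under one stated assumption: that the per-document directory is named after the file name of `document_path`. The helper names `baseline_path` and `write_baseline` are hypothetical and not taken from the script.

```python
import json
import tempfile
from pathlib import Path


def baseline_path(output_base, document_path):
    # Assumption: the directory name is the file name of document_path,
    # e.g. "document1.pdf". The real script may derive the ID differently.
    doc_id = Path(document_path).name
    return Path(output_base) / doc_id / "sections" / "1" / "result.json"


def write_baseline(output_base, document_path, labels):
    """Write labels as <output_base>/<doc>/sections/1/result.json."""
    target = baseline_path(output_base, document_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = {"inference_result": labels}  # wrapper key shown above
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    written = write_baseline(tmp, "path/to/document1.pdf", {"field1": "value1"})
    relative = written.relative_to(tmp).as_posix()
    content = json.loads(written.read_text(encoding="utf-8"))
    print(relative)
```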
| 49 | + |
| 50 | +### Usage |
| 51 | + |
| 52 | +#### Basic Usage (Default Paths) |
| 53 | +```bash |
| 54 | +python prep_baseline_data.py |
| 55 | +``` |
| 56 | + |
| 57 | +Default paths: |
| 58 | +- **Input**: `scratch/fcc_invoices_reann_standardized_val_fixed_v0.jsonl` |
| 59 | +- **Output**: `scratch/accelerator/fcc_invoices/evaluation_baseline/` |
| 60 | + |
| 61 | +#### Dry Run (Preview Only) |
| 62 | +```bash |
| 63 | +python prep_baseline_data.py --dry-run |
| 64 | +``` |
| 65 | + |
| 66 | +#### Custom Paths |
| 67 | +```bash |
| 68 | +python prep_baseline_data.py \ |
| 69 | + --input path/to/your/ground_truth.jsonl \ |
| 70 | + --output path/to/output/baseline/ |
| 71 | +``` |
| 72 | + |
| 73 | +#### Overwrite Existing Files |
| 74 | +```bash |
| 75 | +python prep_baseline_data.py --overwrite |
| 76 | +``` |
| 77 | + |
| 78 | +#### Skip Validation |
| 79 | +```bash |
| 80 | +python prep_baseline_data.py --no-validate |
| 81 | +``` |
| 82 | + |
| 83 | +### Command-Line Options |
| 84 | + |
| 85 | +| Option | Description | Default | |
| 86 | +|--------|-------------|---------| |
| 87 | +| `--input PATH` | Path to input JSONL file | `scratch/fcc_invoices_reann_standardized_val_fixed_v0.jsonl` | |
| 88 | +| `--output PATH` | Base path for output baseline files | `scratch/accelerator/fcc_invoices/evaluation_baseline` | |
| 89 | +| `--dry-run` | Simulate processing without creating files | False | |
| 90 | +| `--overwrite` | Overwrite existing baseline files | False | |
| 91 | +| `--validate` | Validate created files after processing | True | |
| 92 | +| `--no-validate` | Skip validation of created files | - | |
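For readers who want to see how such a paired `--validate` / `--no-validate` switch can be wired up, here is an illustrative `argparse` sketch that mirrors the table. It is an assumption about the shape of the parser, not a copy of the script's code; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options table above."""
    parser = argparse.ArgumentParser(description="Prepare evaluation baseline data")
    parser.add_argument(
        "--input", metavar="PATH",
        default="scratch/fcc_invoices_reann_standardized_val_fixed_v0.jsonl",
    )
    parser.add_argument(
        "--output", metavar="PATH",
        default="scratch/accelerator/fcc_invoices/evaluation_baseline",
    )
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    # Both flags write to the same destination; validation is on by default.
    parser.add_argument("--validate", dest="validate", action="store_true")
    parser.add_argument("--no-validate", dest="validate", action="store_false")
    parser.set_defaults(validate=True)
    return parser


defaults = build_parser().parse_args([])
custom = build_parser().parse_args(["--dry-run", "--no-validate"])
print(defaults.validate, custom.validate)
```

Sharing one `dest` between the two flags keeps a single boolean in the parsed namespace, which is why `--no-validate` has no default of its own in the table.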
| 93 | + |
| 94 | +### Features |
| 95 | + |
| 96 | +- **Error Handling**: Gracefully handles malformed JSON, missing fields, and file system errors |
| 97 | +- **Duplicate Detection**: Warns about duplicate document IDs in the input file |
| 98 | +- **Progress Tracking**: Shows progress every 100 documents processed |
| 99 | +- **Validation**: Automatically validates a sample of created files |
| 100 | +- **Statistics**: Provides a detailed summary of processing results |
| 101 | +- **Dry Run Mode**: Preview what would be created without writing files |
| 102 | + |
| 103 | +### Output Summary |
| 104 | + |
| 105 | +After processing, the script displays a summary including: |
| 106 | +- Total documents processed |
| 107 | +- Successfully created files |
| 108 | +- Skipped files (if not overwriting) |
| 109 | +- Failed operations |
| 110 | +- Duplicate document IDs |
| 111 | +- Error details |
| 112 | +- Success rate |
| 113 | + |
| 114 | +Example output: |
| 115 | +``` |
| 116 | +================================================================================ |
| 117 | +PROCESSING SUMMARY |
| 118 | +================================================================================ |
| 119 | +Total documents in file: 150 |
| 120 | +Successfully processed: 148 |
| 121 | +Skipped (already exist): 0 |
| 122 | +Failed: 2 |
| 123 | +Unique doc_ids: 148 |
| 124 | + |
| 125 | +Success rate: 98.7% |
| 126 | +================================================================================ |
| 127 | +``` |
| 128 | + |
| 129 | +### Error Handling |
| 130 | + |
| 131 | +The script handles various error scenarios: |
| 132 | +- **Missing input file**: Exits with a clear error message |
| 133 | +- **Malformed JSON**: Logs line number and continues processing |
| 134 | +- **Missing required fields**: Logs error and skips document |
| 135 | +- **File system errors**: Logs error and continues with remaining documents |
| 136 | +- **Duplicate document IDs**: Warns but continues processing |
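The log-and-continue behavior listed above can be sketched as a single pass over the input lines. This is an illustration only: the function name `scan_records`, the error message wording, and the use of `document_path` as the document ID are assumptions, not details confirmed by the script.

```python
import json


def scan_records(lines):
    """Collect valid records and per-line errors without stopping on bad input."""
    records, errors, duplicates = [], [], []
    seen = set()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Malformed JSON: record the line number and keep going.
            errors.append(f"line {line_number}: malformed JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {line_number}: expected a JSON object")
            continue
        missing = [key for key in ("document_path", "labels") if key not in record]
        if missing:
            # Missing required fields: skip this document.
            errors.append(f"line {line_number}: missing {', '.join(missing)}")
            continue
        doc_id = record["document_path"]  # assumption: path doubles as the ID
        if doc_id in seen:
            duplicates.append(doc_id)  # warn, but still process the record
        seen.add(doc_id)
        records.append(record)
    return records, errors, duplicates


sample_lines = [
    '{"document_path": "a.pdf", "labels": "{}"}',
    '{not valid json',
    '{"document_path": "b.pdf"}',
    '{"document_path": "a.pdf", "labels": "{}"}',
]
records, errors, duplicates = scan_records(sample_lines)
print(len(records), errors, duplicates)
```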
| 137 | + |
| 138 | +### Exit Codes |
| 139 | + |
| 140 | +- `0`: Success (all documents processed without errors) |
| 141 | +- `1`: Failure (fatal error or some documents failed) |
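As a rough illustration of how the summary figures relate to the exit code, the sketch below assumes the success rate is the number of successfully processed documents divided by the total, which matches the example output above (148 of 150 gives 98.7%). The function name `summarize` is hypothetical.

```python
def summarize(total, succeeded, failed):
    """Return (success_rate_percent, exit_code) for a finished run.

    Assumption: rate = succeeded / total; the script may count skips differently.
    """
    success_rate = 100.0 * succeeded / total if total else 0.0
    exit_code = 0 if failed == 0 else 1  # any failed document means exit code 1
    return success_rate, exit_code


rate, exit_code = summarize(total=150, succeeded=148, failed=2)
print(f"Success rate: {rate:.1f}%  exit code: {exit_code}")
```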
| 142 | + |
| 143 | +### Examples |
| 144 | + |
| 145 | +#### Process with default paths and see detailed output |
| 146 | +```bash |
| 147 | +python prep_baseline_data.py |
| 148 | +``` |
| 149 | + |
| 150 | +#### Test the script without creating files |
| 151 | +```bash |
| 152 | +python prep_baseline_data.py --dry-run |
| 153 | +``` |
| 154 | + |
| 155 | +#### Process a different dataset |
| 156 | +```bash |
| 157 | +python prep_baseline_data.py \ |
| 158 | + --input data/invoice_labels.jsonl \ |
| 159 | + --output baseline/invoices/ |
| 160 | +``` |
| 161 | + |
| 162 | +#### Force overwrite of existing baseline files |
| 163 | +```bash |
| 164 | +python prep_baseline_data.py --overwrite |
| 165 | +``` |
| 166 | + |
| 167 | +### Integration with IDP Accelerator |
| 168 | + |
| 169 | +Once baseline files are created, use them with the IDP Accelerator evaluation framework: |
| 170 | + |
| 171 | +1. Upload the baseline directory to your evaluation S3 bucket |
| 172 | +2. Configure the evaluation framework to use this baseline |
| 173 | +3. Process documents through the IDP pipeline |
| 174 | +4. View evaluation reports comparing results to baseline |
| 175 | + |
| 176 | +See `docs/evaluation.md` for more details on the evaluation framework. |
| 177 | + |
| 178 | +### Troubleshooting |
| 179 | + |
| 180 | +**Problem**: Script fails with "Input file not found" |
| 181 | +- **Solution**: Verify the input file path is correct |
| 182 | + |
| 183 | +**Problem**: Permission denied when creating files |
| 184 | +- **Solution**: Ensure you have write permissions to the output directory |
| 185 | + |
| 186 | +**Problem**: Out of memory errors |
| 187 | +- **Solution**: The script processes the input line by line, so large files should not exhaust memory. If issues persist, split the input file into smaller chunks. |
| 188 | + |
| 189 | +**Problem**: Validation fails |
| 190 | +- **Solution**: Check the error messages for specific files, then inspect the `result.json` files manually |
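When inspecting files by hand, a small standalone check can help narrow down which baseline files are malformed. The sketch below only verifies the documented shape (a top-level `inference_result` object); it is not the script's own validation logic, and the helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_result_file(path):
    """Return a list of problems found in one result.json file."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"{path}: cannot read as JSON ({exc})"]
    if not isinstance(data, dict) or "inference_result" not in data:
        return [f"{path}: missing top-level 'inference_result'"]
    if not isinstance(data["inference_result"], dict):
        return [f"{path}: 'inference_result' is not an object"]
    return []


def check_baseline(output_base):
    """Check every <doc>/sections/1/result.json under output_base."""
    problems = []
    for path in sorted(Path(output_base).glob("*/sections/1/result.json")):
        problems.extend(check_result_file(path))
    return problems


# Demonstrate with one well-formed and one malformed file.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    good = base / "document1.pdf" / "sections" / "1" / "result.json"
    bad = base / "document2.pdf" / "sections" / "1" / "result.json"
    for path in (good, bad):
        path.parent.mkdir(parents=True)
    good.write_text(json.dumps({"inference_result": {"field1": "value1"}}), encoding="utf-8")
    bad.write_text(json.dumps({"field1": "value1"}), encoding="utf-8")
    problems = check_baseline(base)
    print(problems)
```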
| 191 | + |
| 192 | +### License |
| 193 | + |
| 194 | +Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. |
| 195 | +SPDX-License-Identifier: MIT-0 |