
Commit 66d43b6

Author: Bob Strahan (committed)

Merge branch 'develop' of ssh.gitlab.aws.dev:genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator into develop

2 parents 32e6b26 + 4bfea83, commit 66d43b6

File tree: 8 files changed, +38 / -74 lines changed


CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ SPDX-License-Identifier: MIT-0
   - Added e2e-example-with-assessment.ipynb notebook for testing assessment workflow
 
 - **Enhanced Evaluation Framework with Confidence Integration**
-  - Added expected_confidence and actual_confidence fields to evaluation reports for quality analysis
+  - Added confidence fields to evaluation reports for quality analysis
   - Automatic extraction and display of confidence scores from assessment explainability_info
   - Enhanced JSON and Markdown evaluation reports with confidence columns
   - Backward compatible integration - shows "N/A" when confidence data unavailable
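A note on where these scores originate: the assessment step emits per-attribute confidence inside `explainability_info`, which the evaluation framework flattens into a single lookup. A minimal sketch of that flattening, with the payload shape assumed for illustration (the accelerator's real structure may nest differently):

```python
from typing import Any, Dict

def collect_confidence_scores(explainability_info: Dict[str, Any]) -> Dict[str, float]:
    """Flatten per-attribute confidence out of an assessment payload.

    The {"attr": {"confidence": 0.92, ...}} shape is an assumption for
    illustration, not taken from this commit.
    """
    scores: Dict[str, float] = {}
    for attr_name, details in explainability_info.items():
        if isinstance(details, dict) and "confidence" in details:
            scores[attr_name] = float(details["confidence"])
    return scores

print(collect_confidence_scores({
    "invoice_number": {"confidence": 0.92},
    "vendor_name": {"confidence": 0.75},
}))  # {'invoice_number': 0.92, 'vendor_name': 0.75}
```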

docs/evaluation.md

Lines changed: 6 additions & 8 deletions
@@ -49,26 +49,24 @@ The evaluation framework automatically integrates with the assessment feature to
 
 The evaluation framework automatically extracts confidence scores from the `explainability_info` section of assessment results and displays them in both JSON and Markdown evaluation reports:
 
-- **Expected Confidence**: Confidence score for baseline/ground truth data (if assessed)
-- **Actual Confidence**: Confidence score for extraction results being evaluated
+- **Confidence**: Confidence score for extraction results being evaluated
 
 ### Enhanced Evaluation Reports
 
 When confidence data is available, evaluation reports include additional columns:
 
 ```
-| Status | Attribute | Expected | Actual | Expected Confidence | Actual Confidence | Score | Method | Reason |
-| :----: | --------- | -------- | ------ | :-----------------: | :---------------: | ----- | ------ | ------ |
-| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.95 | 0.92 | 1.00 | EXACT | Exact match |
-| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.88 | 0.75 | 0.00 | EXACT | Values do not match |
+| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
+| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |
+| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.92 | 1.00 | EXACT | Exact match |
+| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.75 | 0.00 | EXACT | Values do not match |
 ```
 
 ### Quality Analysis Benefits
 
 The combination of evaluation accuracy and confidence scores provides deeper insights:
 
-1. **Baseline Quality Assessment**: Low expected confidence may indicate questionable ground truth data that needs review
-2. **Extraction Quality Assessment**: Low actual confidence highlights extraction results requiring human verification
+2. **Extraction Quality Assessment**: Low confidence highlights extraction results requiring human verification
 3. **Quality Prioritization**: Focus improvement efforts on attributes with both low confidence and low accuracy
 4. **Pattern Identification**: Analyze relationships between confidence levels and evaluation outcomes
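To make the "Extraction Quality Assessment" point concrete, here is a small sketch (not part of this commit) that flags records needing human verification; it assumes reporting-style rows with the attribute_name, score, and confidence keys that appear elsewhere in this diff:

```python
from typing import Any, Dict, List

def needs_review(rows: List[Dict[str, Any]], threshold: float = 0.8) -> List[str]:
    """Flag attributes whose extraction confidence is missing or below threshold."""
    flagged = []
    for row in rows:
        confidence = row.get("confidence")  # None when no assessment data (the "N/A" case)
        if confidence is None or confidence < threshold:
            flagged.append(row["attribute_name"])
    return flagged

rows = [
    {"attribute_name": "invoice_number", "score": 1.0, "confidence": 0.92},
    {"attribute_name": "vendor_name", "score": 0.0, "confidence": 0.75},
    {"attribute_name": "legacy_attr", "score": 1.0, "confidence": None},
]
print(needs_review(rows))  # ['vendor_name', 'legacy_attr']
```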

lib/idp_common_pkg/idp_common/evaluation/README.md

Lines changed: 10 additions & 13 deletions
@@ -21,7 +21,7 @@ The Evaluation Service component provides functionality to evaluate document ext
 - Applies default comparison method (LLM) for unconfigured attributes with clear indication
 - **Assessment Confidence Integration**:
   - Automatically extracts and displays confidence scores from assessment results
-  - Shows expected confidence (baseline/ground truth confidence) and actual confidence (extraction confidence)
+  - Shows confidence (extraction confidence)
   - Integrates with explainability_info from the assessment feature
   - Provides insights into data quality for both baseline and extraction results
 - Calculates key metrics including:
@@ -260,8 +260,7 @@ The evaluation service automatically integrates with the assessment feature to d
 
 ### Confidence Score Types
 
-- **Expected Confidence**: Confidence score for the baseline/ground truth data (if assessed)
-- **Actual Confidence**: Confidence score for the extraction results being evaluated
+- **Confidence**: Confidence score for the extraction results being evaluated
 
 ### Enhanced Report Format
 
@@ -275,8 +274,7 @@ The evaluation service automatically integrates with the assessment feature to d
       "actual": "INV-2024-001",
       "matched": true,
       "score": 1.0,
-      "expected_confidence": 0.95,
-      "actual_confidence": 0.92,
+      "confidence": 0.92,
       "evaluation_method": "EXACT"
     }
   ]
@@ -285,20 +283,19 @@ The evaluation service automatically integrates with the assessment feature to d
 
 #### Markdown Table with Confidence
 ```
-| Status | Attribute | Expected | Actual | Expected Confidence | Actual Confidence | Score | Method | Reason |
-| :----: | --------- | -------- | ------ | :-----------------: | :---------------: | ----- | ------ | ------ |
-| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.95 | 0.92 | 1.00 | EXACT | Exact match |
-| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.88 | 0.75 | 0.00 | EXACT | Values do not match |
+| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
+| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |
+| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.92 | 1.00 | EXACT | Exact match |
+| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.75 | 0.00 | EXACT | Values do not match |
 ```
 
 ### Quality Analysis Benefits
 
 Confidence scores provide additional insights for evaluation analysis:
 
-1. **Baseline Quality Assessment**: Low expected confidence may indicate questionable ground truth data
-2. **Extraction Quality Assessment**: Low actual confidence highlights extraction results needing review
-3. **Confidence-Accuracy Correlation**: Compare confidence levels with evaluation accuracy to identify patterns
-4. **Quality Prioritization**: Focus improvement efforts on low-confidence, low-accuracy results
+1. **Extraction Quality Assessment**: Low confidence highlights extraction results needing review
+2. **Confidence-Accuracy Correlation**: Compare confidence levels with evaluation accuracy to identify patterns
+3. **Quality Prioritization**: Focus improvement efforts on low-confidence, low-accuracy results
 
 ### Backward Compatibility
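The "Confidence-Accuracy Correlation" benefit can be sketched in a few lines; the bucketing cutoff and record shape below are illustrative assumptions, not accelerator code:

```python
from statistics import mean
from typing import Any, Dict, List

def accuracy_by_confidence(rows: List[Dict[str, Any]], cutoff: float = 0.8) -> Dict[str, float]:
    """Mean evaluation score for low- vs high-confidence attributes."""
    low = [r["score"] for r in rows if (r.get("confidence") or 0.0) < cutoff]
    high = [r["score"] for r in rows if (r.get("confidence") or 0.0) >= cutoff]
    return {
        "low_confidence_accuracy": mean(low) if low else float("nan"),
        "high_confidence_accuracy": mean(high) if high else float("nan"),
    }

rows = [
    {"attribute_name": "invoice_number", "score": 1.0, "confidence": 0.92},
    {"attribute_name": "vendor_name", "score": 0.0, "confidence": 0.75},
]
print(accuracy_by_confidence(rows))
```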

lib/idp_common_pkg/idp_common/evaluation/models.py

Lines changed: 7 additions & 18 deletions
@@ -48,10 +48,7 @@ class AttributeEvaluationResult:
     evaluation_method: str = "EXACT"
     evaluation_threshold: Optional[float] = None
     comparator_type: Optional[str] = None  # Used for HUNGARIAN methods
-    expected_confidence: Optional[float] = (
-        None  # Confidence score from assessment for expected values
-    )
-    actual_confidence: Optional[float] = (
+    confidence: Optional[float] = (
         None  # Confidence score from assessment for actual values
     )
 
@@ -104,8 +101,7 @@ def to_dict(self) -> Dict[str, Any]:
                     "evaluation_method": ar.evaluation_method,
                     "evaluation_threshold": ar.evaluation_threshold,
                     "comparator_type": ar.comparator_type,
-                    "expected_confidence": ar.expected_confidence,
-                    "actual_confidence": ar.actual_confidence,
+                    "confidence": ar.confidence,
                 }
                 for ar in sr.attributes
             ],
@@ -243,8 +239,8 @@ def to_markdown(self) -> str:
 
             # Attribute results
             sections.append("### Attributes")
-            attr_table = "| Status | Attribute | Expected | Actual | Expected Confidence | Actual Confidence | Score | Method | Reason |\n"
-            attr_table += "| :----: | --------- | -------- | ------ | :-----------------: | :---------------: | ----- | ------ | ------ |\n"
+            attr_table = "| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |\n"
+            attr_table += "| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |\n"
             for ar in sr.attributes:
                 expected = str(ar.expected).replace("\n", " ")
                 actual = str(ar.actual).replace("\n", " ")
@@ -287,18 +283,11 @@ def to_markdown(self) -> str:
                     status_symbol = "❌"
 
                 # Format confidence values
-                expected_confidence_str = (
-                    f"{ar.expected_confidence:.2f}"
-                    if ar.expected_confidence is not None
-                    else "N/A"
-                )
-                actual_confidence_str = (
-                    f"{ar.actual_confidence:.2f}"
-                    if ar.actual_confidence is not None
-                    else "N/A"
+                confidence_str = (
+                    f"{ar.confidence:.2f}" if ar.confidence is not None else "N/A"
                 )
 
-                attr_table += f"| {status_symbol} | {ar.name} | {expected} | {actual} | {expected_confidence_str} | {actual_confidence_str} | {ar.score:.2f} | {method_display} | {reason} |\n"
+                attr_table += f"| {status_symbol} | {ar.name} | {expected} | {actual} | {confidence_str} | {ar.score:.2f} | {method_display} | {reason} |\n"
             sections.append(attr_table)
             sections.append("")
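The net effect of the to_markdown() change is a single confidence cell with an "N/A" fallback. A self-contained sketch of that pattern (Row is a stand-in for AttributeEvaluationResult, not the real class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:  # illustrative stand-in for AttributeEvaluationResult
    name: str
    score: float
    confidence: Optional[float] = None

def confidence_cell(row: Row) -> str:
    # Same fallback the model uses: "N/A" keeps reports backward compatible
    # when assessment confidence data is unavailable.
    return f"{row.confidence:.2f}" if row.confidence is not None else "N/A"

print(confidence_cell(Row("invoice_number", 1.0, 0.92)))  # 0.92
print(confidence_cell(Row("legacy_attr", 1.0)))           # N/A
```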

lib/idp_common_pkg/idp_common/evaluation/service.py

Lines changed: 9 additions & 25 deletions
@@ -395,8 +395,7 @@ def evaluate_section(
         section: Section,
         expected_results: Dict[str, Any],
         actual_results: Dict[str, Any],
-        expected_confidence_scores: Dict[str, float] = None,
-        actual_confidence_scores: Dict[str, float] = None,
+        confidence_scores: Dict[str, float] = None,
     ) -> SectionEvaluationResult:
         """
         Evaluate extraction results for a document section.
@@ -405,8 +404,7 @@ def evaluate_section(
             section: Document section
             expected_results: Expected extraction results
             actual_results: Actual extraction results
-            expected_confidence_scores: Confidence scores for expected values from assessment
-            actual_confidence_scores: Confidence scores for actual values from assessment
+            confidence_scores: Confidence scores for actual values from assessment
 
         Returns:
             Evaluation results for the section
@@ -508,12 +506,8 @@ def evaluate_section(
                 )
 
                 # Set confidence scores if available
-                if expected_confidence_scores:
-                    attribute_result.expected_confidence = (
-                        expected_confidence_scores.get(task["attr_name"])
-                    )
-                if actual_confidence_scores:
-                    attribute_result.actual_confidence = actual_confidence_scores.get(
+                if confidence_scores:
+                    attribute_result.confidence = confidence_scores.get(
                         task["attr_name"]
                     )
 
@@ -569,13 +563,9 @@ def evaluate_section(
                     None,
                 )
                 if task:
-                    if expected_confidence_scores:
-                        attribute_result.expected_confidence = (
-                            expected_confidence_scores.get(task["attr_name"])
-                        )
-                    if actual_confidence_scores:
-                        attribute_result.actual_confidence = (
-                            actual_confidence_scores.get(task["attr_name"])
+                    if confidence_scores:
+                        attribute_result.confidence = confidence_scores.get(
+                            task["attr_name"]
                         )
 
                 # Add to attribute results
@@ -636,20 +626,14 @@ def _process_section(
             # Return empty result
             return None, {}
 
-        actual_results, actual_confidence_scores = self._load_extraction_results(
-            actual_uri
-        )
-        expected_results, expected_confidence_scores = self._load_extraction_results(
-            expected_uri
-        )
+        actual_results, confidence_scores = self._load_extraction_results(actual_uri)
 
         # Evaluate section
         section_result = self.evaluate_section(
            section=actual_section,
            expected_results=expected_results,
            actual_results=actual_results,
-            expected_confidence_scores=expected_confidence_scores,
-            actual_confidence_scores=actual_confidence_scores,
+            confidence_scores=confidence_scores,
        )
 
        # Count matches and mismatches in the attributes
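The plumbing after this change: _load_extraction_results returns a (results, confidence_scores) pair for the actual output only, and evaluate_section takes a single confidence_scores map. A toy sketch of the call pattern, with stand-in functions rather than the real service:

```python
from typing import Any, Dict, Tuple

# Toy stand-ins for EvaluationService internals; the real methods live in
# idp_common.evaluation.service and do far more.
def load_extraction_results(uri: str) -> Tuple[Dict[str, Any], Dict[str, float]]:
    # Per this commit, only the actual side's confidence scores are loaded.
    results = {"invoice_number": "INV-2024-001"}
    confidence_scores = {"invoice_number": 0.92}
    return results, confidence_scores

def evaluate_section(expected_results: Dict[str, Any],
                     actual_results: Dict[str, Any],
                     confidence_scores: Dict[str, float] = None) -> None:
    # One confidence map instead of the old expected/actual pair.
    for name, value in actual_results.items():
        conf = (confidence_scores or {}).get(name)
        print(f"{name}: actual={value!r}, confidence={conf}")

actual_results, confidence_scores = load_extraction_results("s3://bucket/output/section_1.json")
evaluate_section({"invoice_number": "INV-2024-001"}, actual_results, confidence_scores)
```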

notebooks/evaluation_reporting_analytics.ipynb

Lines changed: 2 additions & 2 deletions
@@ -310,10 +310,10 @@
     "==================================================\n",
     "✓ Successfully queried attribute_evaluations\n",
     "Table has 17 columns\n",
-    "Columns: ['document_id', 'section_id', 'section_type', 'attribute_name', 'expected', 'actual', 'matched', 'score', 'reason', 'evaluation_method', 'expected_confidence', 'actual_confidence', 'evaluation_date', 'year', 'month', 'day', 'document']\n",
+    "Columns: ['document_id', 'section_id', 'section_type', 'attribute_name', 'expected', 'actual', 'matched', 'score', 'reason', 'evaluation_method', 'expected_confidence', 'confidence', 'evaluation_date', 'year', 'month', 'day', 'document']\n",
     "\n",
     "Sample data:\n",
-    " document_id section_id section_type attribute_name expected actual matched score reason evaluation_method expected_confidence actual_confidence evaluation_date year month day document\n",
+    " document_id section_id section_type attribute_name expected actual matched score reason evaluation_method expected_confidence confidence evaluation_date year month day document\n",
     "0 rvl_cdip_package.pdf 1 letter cc true 1.0 Both actual and expected values are missing, so they are matched. LLM 0.0 0.0 2025-06-10 22:08:58.185 2025 06 10 rvl_cdip_package.pdf\n",
     "1 rvl_cdip_package.pdf 1 letter date 10/31/1995 10/31/1995 true 1.0 The expected and actual values for the 'date' attribute are identical, representing the same date of 10/31/1995. The formatting and representation are exactly the same, so there is a perfect match. LLM 0.85 0.85 2025-06-10 22:08:58.185 2025 06 10 rvl_cdip_package.pdf\n",
     "2 rvl_cdip_package.pdf 1 letter letter_type Opposition Opposition true 1.0 The expected value 'Opposition' and the actual value 'Opposition' are an exact match in meaning, taking into account formatting, word order, and semantic equivalence. LLM 0.9 0.9 2025-06-10 22:08:58.185 2025 06 10 rvl_cdip_package.pdf\n",

src/lambda/evaluation_function/save_to_reporting.py

Lines changed: 2 additions & 4 deletions
@@ -122,8 +122,7 @@ def save_evaluation_to_reporting_bucket(document, reporting_bucket: str) -> None
         ('score', pa.float64()),
         ('reason', pa.string()),
         ('evaluation_method', pa.string()),
-        ('expected_confidence', pa.string()),
-        ('actual_confidence', pa.string()),
+        ('confidence', pa.string()),
         ('evaluation_date', pa.timestamp('ms'))
     ])
     logger.info(f"Writing evaluation results to ReportingBucket s3://{reporting_bucket}/evaluation_metrics/document_metrics")
@@ -210,8 +209,7 @@ def save_evaluation_to_reporting_bucket(document, reporting_bucket: str) -> None
             'score': getattr(attr, 'score', 0.0),
             'reason': _serialize_value(getattr(attr, 'reason', '')),
             'evaluation_method': _serialize_value(getattr(attr, 'evaluation_method', '')),
-            'expected_confidence': _serialize_value(getattr(attr, 'expected_confidence', None)),
-            'actual_confidence': _serialize_value(getattr(attr, 'actual_confidence', None)),
+            'confidence': _serialize_value(getattr(attr, 'confidence', None)),
             'evaluation_date': now,  # Use datetime object directly
         }
         attribute_records.append(attribute_record)
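An abridged, runnable sketch of the writer path with the consolidated column; the field list and output path are trimmed, and the record is invented for illustration:

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Abridged schema: the full writer above also carries document/section fields.
schema = pa.schema([
    ("attribute_name", pa.string()),
    ("score", pa.float64()),
    ("confidence", pa.string()),  # serialized to string, matching the writer
    ("evaluation_date", pa.timestamp("ms")),
])

records = [{
    "attribute_name": "invoice_number",
    "score": 1.0,
    "confidence": "0.92",
    "evaluation_date": datetime.now(),
}]

table = pa.Table.from_pylist(records, schema=schema)
pq.write_table(table, "attribute_evaluations.parquet")
print(table.schema)
```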

template.yaml

Lines changed: 1 addition & 3 deletions
@@ -1412,9 +1412,7 @@ Resources:
             Type: string
           - Name: evaluation_method
             Type: string
-          - Name: expected_confidence
-            Type: string
-          - Name: actual_confidence
+          - Name: confidence
             Type: string
           - Name: evaluation_date
             Type: timestamp
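To confirm a deployed stack picked up the renamed Glue column, a hypothetical boto3 check; the database and table names are placeholders, not taken from this template:

```python
import boto3

# Placeholders: substitute the database/table names from your stack outputs.
glue = boto3.client("glue")
response = glue.get_table(DatabaseName="idp_reporting", Name="attribute_evaluations")

columns = [c["Name"] for c in response["Table"]["StorageDescriptor"]["Columns"]]
print(columns)
assert "confidence" in columns, f"renamed column missing: {columns}"
```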
