
Commit 66d43b6

Author: Bob Strahan (committed)

Merge branch 'develop' of ssh.gitlab.aws.dev:genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator into develop

2 parents 32e6b26 + 4bfea83, commit 66d43b6

File tree: 8 files changed, +38 / -74 lines changed


CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ SPDX-License-Identifier: MIT-0
   - Added e2e-example-with-assessment.ipynb notebook for testing assessment workflow
 
 - **Enhanced Evaluation Framework with Confidence Integration**
-  - Added expected_confidence and actual_confidence fields to evaluation reports for quality analysis
+  - Added confidence fields to evaluation reports for quality analysis
   - Automatic extraction and display of confidence scores from assessment explainability_info
   - Enhanced JSON and Markdown evaluation reports with confidence columns
   - Backward compatible integration - shows "N/A" when confidence data unavailable
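A note on where these scores originate: the assessment step emits per-attribute confidence inside `explainability_info`, which the evaluation framework flattens into a single lookup. A minimal sketch of that flattening, with the payload shape assumed for illustration (the accelerator's real structure may nest differently):

```python
from typing import Any, Dict

def collect_confidence_scores(explainability_info: Dict[str, Any]) -> Dict[str, float]:
    """Flatten per-attribute confidence out of an assessment payload.

    The {"attr": {"confidence": 0.92, ...}} shape is an assumption for
    illustration, not taken from this commit.
    """
    scores: Dict[str, float] = {}
    for attr_name, details in explainability_info.items():
        if isinstance(details, dict) and "confidence" in details:
            scores[attr_name] = float(details["confidence"])
    return scores

print(collect_confidence_scores({
    "invoice_number": {"confidence": 0.92},
    "vendor_name": {"confidence": 0.75},
}))  # {'invoice_number': 0.92, 'vendor_name': 0.75}
```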

docs/evaluation.md

Lines changed: 6 additions & 8 deletions
@@ -49,26 +49,24 @@ The evaluation framework automatically integrates with the assessment feature to
 
 The evaluation framework automatically extracts confidence scores from the `explainability_info` section of assessment results and displays them in both JSON and Markdown evaluation reports:
 
-- **Expected Confidence**: Confidence score for baseline/ground truth data (if assessed)
-- **Actual Confidence**: Confidence score for extraction results being evaluated
+- **Confidence**: Confidence score for extraction results being evaluated
 
 ### Enhanced Evaluation Reports
 
 When confidence data is available, evaluation reports include additional columns:
 
 ```
-| Status | Attribute | Expected | Actual | Expected Confidence | Actual Confidence | Score | Method | Reason |
-| :----: | --------- | -------- | ------ | :-----------------: | :---------------: | ----- | ------ | ------ |
-| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.95 | 0.92 | 1.00 | EXACT | Exact match |
-| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.88 | 0.75 | 0.00 | EXACT | Values do not match |
+| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
+| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |
+| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.92 | 1.00 | EXACT | Exact match |
+| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.75 | 0.00 | EXACT | Values do not match |
 ```
 
 ### Quality Analysis Benefits
 
 The combination of evaluation accuracy and confidence scores provides deeper insights:
 
-1. **Baseline Quality Assessment**: Low expected confidence may indicate questionable ground truth data that needs review
-2. **Extraction Quality Assessment**: Low actual confidence highlights extraction results requiring human verification
+2. **Extraction Quality Assessment**: Low confidence highlights extraction results requiring human verification
 3. **Quality Prioritization**: Focus improvement efforts on attributes with both low confidence and low accuracy
 4. **Pattern Identification**: Analyze relationships between confidence levels and evaluation outcomes
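To make the "Extraction Quality Assessment" point concrete, here is a small sketch (not part of this commit) that flags records needing human verification; it assumes reporting-style rows with the attribute_name, score, and confidence keys that appear elsewhere in this diff:

```python
from typing import Any, Dict, List

def needs_review(rows: List[Dict[str, Any]], threshold: float = 0.8) -> List[str]:
    """Flag attributes whose extraction confidence is missing or below threshold."""
    flagged = []
    for row in rows:
        confidence = row.get("confidence")  # None when no assessment data (the "N/A" case)
        if confidence is None or confidence < threshold:
            flagged.append(row["attribute_name"])
    return flagged

rows = [
    {"attribute_name": "invoice_number", "score": 1.0, "confidence": 0.92},
    {"attribute_name": "vendor_name", "score": 0.0, "confidence": 0.75},
    {"attribute_name": "legacy_attr", "score": 1.0, "confidence": None},
]
print(needs_review(rows))  # ['vendor_name', 'legacy_attr']
```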

lib/idp_common_pkg/idp_common/evaluation/README.md

Lines changed: 10 additions & 13 deletions
@@ -21,7 +21,7 @@ The Evaluation Service component provides functionality to evaluate document ext
 - Applies default comparison method (LLM) for unconfigured attributes with clear indication
 - **Assessment Confidence Integration**:
   - Automatically extracts and displays confidence scores from assessment results
-  - Shows expected confidence (baseline/ground truth confidence) and actual confidence (extraction confidence)
+  - Shows confidence (extraction confidence)
   - Integrates with explainability_info from the assessment feature
   - Provides insights into data quality for both baseline and extraction results
 - Calculates key metrics including:
@@ -260,8 +260,7 @@ The evaluation service automatically integrates with the assessment feature to d
 
 ### Confidence Score Types
 
-- **Expected Confidence**: Confidence score for the baseline/ground truth data (if assessed)
-- **Actual Confidence**: Confidence score for the extraction results being evaluated
+- **Confidence**: Confidence score for the extraction results being evaluated
 
 ### Enhanced Report Format
 
@@ -275,8 +274,7 @@ The evaluation service automatically integrates with the assessment feature to d
       "actual": "INV-2024-001",
       "matched": true,
       "score": 1.0,
-      "expected_confidence": 0.95,
-      "actual_confidence": 0.92,
+      "confidence": 0.92,
       "evaluation_method": "EXACT"
     }
   ]
@@ -285,20 +283,19 @@ The evaluation service automatically integrates with the assessment feature to d
 
 #### Markdown Table with Confidence
 ```
-| Status | Attribute | Expected | Actual | Expected Confidence | Actual Confidence | Score | Method | Reason |
-| :----: | --------- | -------- | ------ | :-----------------: | :---------------: | ----- | ------ | ------ |
-| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.95 | 0.92 | 1.00 | EXACT | Exact match |
-| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.88 | 0.75 | 0.00 | EXACT | Values do not match |
+| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
+| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |
+| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.92 | 1.00 | EXACT | Exact match |
+| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.75 | 0.00 | EXACT | Values do not match |
 ```
 
 ### Quality Analysis Benefits
 
 Confidence scores provide additional insights for evaluation analysis:
 
-1. **Baseline Quality Assessment**: Low expected confidence may indicate questionable ground truth data
-2. **Extraction Quality Assessment**: Low actual confidence highlights extraction results needing review
-3. **Confidence-Accuracy Correlation**: Compare confidence levels with evaluation accuracy to identify patterns
-4. **Quality Prioritization**: Focus improvement efforts on low-confidence, low-accuracy results
+1. **Extraction Quality Assessment**: Low confidence highlights extraction results needing review
+2. **Confidence-Accuracy Correlation**: Compare confidence levels with evaluation accuracy to identify patterns
+3. **Quality Prioritization**: Focus improvement efforts on low-confidence, low-accuracy results
 
 ### Backward Compatibility
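The "Confidence-Accuracy Correlation" benefit can be sketched in a few lines; the bucketing cutoff and record shape below are illustrative assumptions, not accelerator code:

```python
from statistics import mean
from typing import Any, Dict, List

def accuracy_by_confidence(rows: List[Dict[str, Any]], cutoff: float = 0.8) -> Dict[str, float]:
    """Mean evaluation score for low- vs high-confidence attributes."""
    low = [r["score"] for r in rows if (r.get("confidence") or 0.0) < cutoff]
    high = [r["score"] for r in rows if (r.get("confidence") or 0.0) >= cutoff]
    return {
        "low_confidence_accuracy": mean(low) if low else float("nan"),
        "high_confidence_accuracy": mean(high) if high else float("nan"),
    }

rows = [
    {"attribute_name": "invoice_number", "score": 1.0, "confidence": 0.92},
    {"attribute_name": "vendor_name", "score": 0.0, "confidence": 0.75},
]
print(accuracy_by_confidence(rows))
```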

lib/idp_common_pkg/idp_common/evaluation/models.py

Lines changed: 7 additions & 18 deletions
@@ -48,10 +48,7 @@ class AttributeEvaluationResult:
     evaluation_method: str = "EXACT"
     evaluation_threshold: Optional[float] = None
     comparator_type: Optional[str] = None  # Used for HUNGARIAN methods
-    expected_confidence: Optional[float] = (
-        None  # Confidence score from assessment for expected values
-    )
-    actual_confidence: Optional[float] = (
+    confidence: Optional[float] = (
         None  # Confidence score from assessment for actual values
     )
 
@@ -104,8 +101,7 @@ def to_dict(self) -> Dict[str, Any]:
                     "evaluation_method": ar.evaluation_method,
                     "evaluation_threshold": ar.evaluation_threshold,
                     "comparator_type": ar.comparator_type,
-                    "expected_confidence": ar.expected_confidence,
-                    "actual_confidence": ar.actual_confidence,
+                    "confidence": ar.confidence,
                 }
                 for ar in sr.attributes
             ],
@@ -243,8 +239,8 @@ def to_markdown(self) -> str:
 
             # Attribute results
             sections.append("### Attributes")
-            attr_table = "| Status | Attribute | Expected | Actual | Expected Confidence | Actual Confidence | Score | Method | Reason |\n"
-            attr_table += "| :----: | --------- | -------- | ------ | :-----------------: | :---------------: | ----- | ------ | ------ |\n"
+            attr_table = "| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |\n"
+            attr_table += "| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |\n"
             for ar in sr.attributes:
                 expected = str(ar.expected).replace("\n", " ")
                 actual = str(ar.actual).replace("\n", " ")
@@ -287,18 +283,11 @@ def to_markdown(self) -> str:
                     status_symbol = "❌"
 
                 # Format confidence values
-                expected_confidence_str = (
-                    f"{ar.expected_confidence:.2f}"
-                    if ar.expected_confidence is not None
-                    else "N/A"
-                )
-                actual_confidence_str = (
-                    f"{ar.actual_confidence:.2f}"
-                    if ar.actual_confidence is not None
-                    else "N/A"
+                confidence_str = (
+                    f"{ar.confidence:.2f}" if ar.confidence is not None else "N/A"
                 )
 
-                attr_table += f"| {status_symbol} | {ar.name} | {expected} | {actual} | {expected_confidence_str} | {actual_confidence_str} | {ar.score:.2f} | {method_display} | {reason} |\n"
+                attr_table += f"| {status_symbol} | {ar.name} | {expected} | {actual} | {confidence_str} | {ar.score:.2f} | {method_display} | {reason} |\n"
             sections.append(attr_table)
             sections.append("")
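The net effect of the to_markdown() change is a single confidence cell with an "N/A" fallback. A self-contained sketch of that pattern (Row is a stand-in for AttributeEvaluationResult, not the real class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:  # illustrative stand-in for AttributeEvaluationResult
    name: str
    score: float
    confidence: Optional[float] = None

def confidence_cell(row: Row) -> str:
    # Same fallback the model uses: "N/A" keeps reports backward compatible
    # when assessment confidence data is unavailable.
    return f"{row.confidence:.2f}" if row.confidence is not None else "N/A"

print(confidence_cell(Row("invoice_number", 1.0, 0.92)))  # 0.92
print(confidence_cell(Row("legacy_attr", 1.0)))           # N/A
```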

lib/idp_common_pkg/idp_common/evaluation/service.py

Lines changed: 9 additions & 25 deletions
@@ -395,8 +395,7 @@ def evaluate_section(
         section: Section,
         expected_results: Dict[str, Any],
         actual_results: Dict[str, Any],
-        expected_confidence_scores: Dict[str, float] = None,
-        actual_confidence_scores: Dict[str, float] = None,
+        confidence_scores: Dict[str, float] = None,
     ) -> SectionEvaluationResult:
         """
         Evaluate extraction results for a document section.
@@ -405,8 +404,7 @@ def evaluate_section(
             section: Document section
             expected_results: Expected extraction results
             actual_results: Actual extraction results
-            expected_confidence_scores: Confidence scores for expected values from assessment
-            actual_confidence_scores: Confidence scores for actual values from assessment
+            confidence_scores: Confidence scores for actual values from assessment
 
         Returns:
             Evaluation results for the section
@@ -508,12 +506,8 @@ def evaluate_section(
                 )
 
                 # Set confidence scores if available
-                if expected_confidence_scores:
-                    attribute_result.expected_confidence = (
-                        expected_confidence_scores.get(task["attr_name"])
-                    )
-                if actual_confidence_scores:
-                    attribute_result.actual_confidence = actual_confidence_scores.get(
+                if confidence_scores:
+                    attribute_result.confidence = confidence_scores.get(
                         task["attr_name"]
                     )
 
@@ -569,13 +563,9 @@ def evaluate_section(
                     None,
                 )
                 if task:
-                    if expected_confidence_scores:
-                        attribute_result.expected_confidence = (
-                            expected_confidence_scores.get(task["attr_name"])
-                        )
-                    if actual_confidence_scores:
-                        attribute_result.actual_confidence = (
-                            actual_confidence_scores.get(task["attr_name"])
+                    if confidence_scores:
+                        attribute_result.confidence = confidence_scores.get(
+                            task["attr_name"]
                         )
 
                 # Add to attribute results
@@ -636,20 +626,14 @@ def _process_section(
             # Return empty result
             return None, {}
 
-        actual_results, actual_confidence_scores = self._load_extraction_results(
-            actual_uri
-        )
-        expected_results, expected_confidence_scores = self._load_extraction_results(
-            expected_uri
-        )
+        actual_results, confidence_scores = self._load_extraction_results(actual_uri)
 
         # Evaluate section
         section_result = self.evaluate_section(
            section=actual_section,
            expected_results=expected_results,
            actual_results=actual_results,
-            expected_confidence_scores=expected_confidence_scores,
-            actual_confidence_scores=actual_confidence_scores,
+            confidence_scores=confidence_scores,
        )
 
        # Count matches and mismatches in the attributes
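The plumbing after this change: _load_extraction_results returns a (results, confidence_scores) pair for the actual output only, and evaluate_section takes a single confidence_scores map. A toy sketch of the call pattern, with stand-in functions rather than the real service:

```python
from typing import Any, Dict, Tuple

# Toy stand-ins for EvaluationService internals; the real methods live in
# idp_common.evaluation.service and do far more.
def load_extraction_results(uri: str) -> Tuple[Dict[str, Any], Dict[str, float]]:
    # Per this commit, only the actual side's confidence scores are loaded.
    results = {"invoice_number": "INV-2024-001"}
    confidence_scores = {"invoice_number": 0.92}
    return results, confidence_scores

def evaluate_section(expected_results: Dict[str, Any],
                     actual_results: Dict[str, Any],
                     confidence_scores: Dict[str, float] = None) -> None:
    # One confidence map instead of the old expected/actual pair.
    for name, value in actual_results.items():
        conf = (confidence_scores or {}).get(name)
        print(f"{name}: actual={value!r}, confidence={conf}")

actual_results, confidence_scores = load_extraction_results("s3://bucket/output/section_1.json")
evaluate_section({"invoice_number": "INV-2024-001"}, actual_results, confidence_scores)
```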

notebooks/evaluation_reporting_analytics.ipynb

Lines changed: 2 additions & 2 deletions
@@ -310,10 +310,10 @@
     "==================================================\n",
     "✓ Successfully queried attribute_evaluations\n",
     "Table has 17 columns\n",
-    "Columns: ['document_id', 'section_id', 'section_type', 'attribute_name', 'expected', 'actual', 'matched', 'score', 'reason', 'evaluation_method', 'expected_confidence', 'actual_confidence', 'evaluation_date', 'year', 'month', 'day', 'document']\n",
+    "Columns: ['document_id', 'section_id', 'section_type', 'attribute_name', 'expected', 'actual', 'matched', 'score', 'reason', 'evaluation_method', 'expected_confidence', 'confidence', 'evaluation_date', 'year', 'month', 'day', 'document']\n",
     "\n",
     "Sample data:\n",
-    " document_id section_id section_type attribute_name expected actual matched score reason evaluation_method expected_confidence actual_confidence evaluation_date year month day document\n",
+    " document_id section_id section_type attribute_name expected actual matched score reason evaluation_method expected_confidence confidence evaluation_date year month day document\n",
     "0 rvl_cdip_package.pdf 1 letter cc true 1.0 Both actual and expected values are missing, so they are matched. LLM 0.0 0.0 2025-06-10 22:08:58.185 2025 06 10 rvl_cdip_package.pdf\n",
     "1 rvl_cdip_package.pdf 1 letter date 10/31/1995 10/31/1995 true 1.0 The expected and actual values for the 'date' attribute are identical, representing the same date of 10/31/1995. The formatting and representation are exactly the same, so there is a perfect match. LLM 0.85 0.85 2025-06-10 22:08:58.185 2025 06 10 rvl_cdip_package.pdf\n",
     "2 rvl_cdip_package.pdf 1 letter letter_type Opposition Opposition true 1.0 The expected value 'Opposition' and the actual value 'Opposition' are an exact match in meaning, taking into account formatting, word order, and semantic equivalence. LLM 0.9 0.9 2025-06-10 22:08:58.185 2025 06 10 rvl_cdip_package.pdf\n",

src/lambda/evaluation_function/save_to_reporting.py

Lines changed: 2 additions & 4 deletions
@@ -122,8 +122,7 @@ def save_evaluation_to_reporting_bucket(document, reporting_bucket: str) -> None
         ('score', pa.float64()),
         ('reason', pa.string()),
         ('evaluation_method', pa.string()),
-        ('expected_confidence', pa.string()),
-        ('actual_confidence', pa.string()),
+        ('confidence', pa.string()),
         ('evaluation_date', pa.timestamp('ms'))
     ])
     logger.info(f"Writing evaluation results to ReportingBucket s3://{reporting_bucket}/evaluation_metrics/document_metrics")
@@ -210,8 +209,7 @@ def save_evaluation_to_reporting_bucket(document, reporting_bucket: str) -> None
             'score': getattr(attr, 'score', 0.0),
             'reason': _serialize_value(getattr(attr, 'reason', '')),
             'evaluation_method': _serialize_value(getattr(attr, 'evaluation_method', '')),
-            'expected_confidence': _serialize_value(getattr(attr, 'expected_confidence', None)),
-            'actual_confidence': _serialize_value(getattr(attr, 'actual_confidence', None)),
+            'confidence': _serialize_value(getattr(attr, 'confidence', None)),
             'evaluation_date': now,  # Use datetime object directly
         }
         attribute_records.append(attribute_record)
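An abridged, runnable sketch of the writer path with the consolidated column; the field list and output path are trimmed, and the record is invented for illustration:

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Abridged schema: the full writer above also carries document/section fields.
schema = pa.schema([
    ("attribute_name", pa.string()),
    ("score", pa.float64()),
    ("confidence", pa.string()),  # serialized to string, matching the writer
    ("evaluation_date", pa.timestamp("ms")),
])

records = [{
    "attribute_name": "invoice_number",
    "score": 1.0,
    "confidence": "0.92",
    "evaluation_date": datetime.now(),
}]

table = pa.Table.from_pylist(records, schema=schema)
pq.write_table(table, "attribute_evaluations.parquet")
print(table.schema)
```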

template.yaml

Lines changed: 1 addition & 3 deletions
@@ -1412,9 +1412,7 @@ Resources:
             Type: string
           - Name: evaluation_method
             Type: string
-          - Name: expected_confidence
-            Type: string
-          - Name: actual_confidence
+          - Name: confidence
             Type: string
           - Name: evaluation_date
             Type: timestamp
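To confirm a deployed stack picked up the renamed Glue column, a hypothetical boto3 check; the database and table names are placeholders, not taken from this template:

```python
import boto3

# Placeholders: substitute the database/table names from your stack outputs.
glue = boto3.client("glue")
response = glue.get_table(DatabaseName="idp_reporting", Name="attribute_evaluations")

columns = [c["Name"] for c in response["Table"]["StorageDescriptor"]["Columns"]]
print(columns)
assert "confidence" in columns, f"renamed column missing: {columns}"
```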
