Skip to content

Commit 0813f8a

Browse files
author
Bob Strahan
committed
refactor: convert OCR text confidence format from JSON to markdown table
1 parent 00a558b commit 0813f8a

File tree

4 files changed

+79
-72
lines changed

4 files changed

+79
-72
lines changed

CHANGELOG.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -8,14 +8,17 @@ SPDX-License-Identifier: MIT-0
88
### Added
99

1010
- **Text Confidence View for Document Pages**
11-
- Added support for displaying OCR text confidence data through new `TextConfidenceUri` field
12-
- New "Text Confidence View" option in the UI pages panel alongside existing Markdown and Text views
13-
- Fixed issues with view persistence - Text Confidence View button now always visible with appropriate messaging when content unavailable
14-
- Fixed view toggle behavior - switching between views no longer closes the viewer window
15-
- Reordered view buttons to: Markdown View, Text Confidence View, Text View for better user experience
11+
- Added support for displaying OCR text confidence data in the UI
12+
13+
### Changed
14+
- **Converted text confidence data format from JSON to markdown table for improved readability and reduced token usage**
15+
- Removed unnecessary "page_count" field
16+
- Changed "text_blocks" array to "text" field containing a markdown table with Text and Confidence columns
17+
- Reduces prompt size for assessment service while improving UI readability
1618
- OCR confidence values now rounded to 1 decimal point (e.g., 99.1, 87.3) for cleaner display
1719

1820

21+
1922
### Fixed
2023

2124

lib/idp_common_pkg/idp_common/ocr/README.md

Lines changed: 23 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -15,21 +15,21 @@ The service supports three OCR backends, each with different capabilities and us
1515

1616
### 1. Textract Backend (Default - Recommended for Assessment)
1717
- **Technology**: AWS Textract OCR service
18-
- **Confidence Data**: ✅ Full granular confidence scores per text block
18+
- **Confidence Data**: ✅ Full granular confidence scores per text line (displayed as markdown table)
1919
- **Features**: Basic text detection + enhanced document analysis (tables, forms, signatures, layout)
2020
- **Assessment Quality**: ⭐⭐⭐ Optimal - Real OCR confidence enables accurate assessment
2121
- **Use Cases**: Standard document processing, when assessment is enabled, production workflows
2222

2323
### 2. Bedrock Backend (LLM-based OCR)
2424
- **Technology**: Amazon Bedrock LLMs (Claude, Nova) for text extraction
25-
- **Confidence Data**: ❌ No confidence data (empty text_blocks array)
25+
- **Confidence Data**: ❌ No confidence data (displays "No confidence data available from LLM OCR")
2626
- **Features**: Advanced text understanding, better handling of challenging/degraded documents
2727
- **Assessment Quality**: ❌ No confidence data for assessment
2828
- **Use Cases**: Challenging documents where traditional OCR fails, specialized text extraction needs
2929

3030
### 3. None Backend (Image-only)
3131
- **Technology**: No OCR processing
32-
- **Confidence Data**: ❌ Empty confidence data
32+
- **Confidence Data**: ❌ No confidence data (displays "No OCR performed")
3333
- **Features**: Image extraction and storage only
3434
- **Assessment Quality**: ❌ No text confidence for assessment
3535
- **Use Cases**: Image-only workflows, custom OCR integration
@@ -104,36 +104,37 @@ The format varies by OCR backend:
104104
**Textract Backend (with confidence data):**
105105
```json
106106
{
107-
"page_count": 1,
108-
"text_blocks": [
109-
{
110-
"text": "WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION",
111-
"confidence": 99.35,
112-
"type": "PRINTED"
113-
},
114-
{
115-
"text": "206 Maple Street",
116-
"confidence": 91.41,
117-
"type": "PRINTED"
118-
}
119-
]
107+
"text": "| Text | Confidence |\n|------|------------|\n| WESTERN DARK FIRED TOBACCO GROWERS' ASSOCIATION | 99.4 |\n| 206 Maple Street | 91.4 |\n| Murray, KY 42071 | 98.7 |"
108+
}
109+
```
110+
111+
The `text` field contains a markdown table with two columns:
112+
- **Text**: The extracted text content (with pipe characters escaped as `\|`)
113+
- **Confidence**: OCR confidence score rounded to 1 decimal point
114+
- Handwriting is indicated with "(HANDWRITING)" suffix in the text column
115+
116+
**Bedrock Backend (no confidence data):**
117+
```json
118+
{
119+
"text": "| Text | Confidence |\n|------|------------|\n| *No confidence data available from LLM OCR* | N/A |"
120120
}
121121
```
122122

123-
**Bedrock/None Backend (no confidence data):**
123+
**None Backend (no OCR):**
124124
```json
125125
{
126-
"page_count": 1,
127-
"text_blocks": []
126+
"text": "| Text | Confidence |\n|------|------------|\n| *No OCR performed* | N/A |"
128127
}
129128
```
130129

131130
### Benefits
132131

133-
- **80-90% token reduction** compared to raw Textract output
134-
- **Preserved assessment data**: Text content, OCR confidence scores, text type (PRINTED/HANDWRITING)
135-
- **Removed overhead**: Geometric data, relationships, block IDs, and verbose metadata
132+
- **85-95% token reduction** compared to raw Textract output (markdown table format is more compact than JSON)
133+
- **Preserved assessment data**: Text content, OCR confidence scores (rounded to 1 decimal), text type (PRINTED/HANDWRITING)
134+
- **Removed overhead**: Geometric data, relationships, block IDs, verbose metadata, and unnecessary JSON syntax
135+
- **Improved readability**: Markdown table format is human-readable in both UI and assessment prompts
136136
- **Cost efficiency**: Significantly reduced LLM inference costs for assessment workflows
137+
- **UI compatibility**: Displays beautifully in the Text Confidence View using existing markdown rendering
137138
- **Automated generation**: Created during initial OCR processing, not repeatedly during assessment
138139

139140
### Usage in Assessment Prompts

lib/idp_common_pkg/idp_common/ocr/service.py

Lines changed: 32 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -622,10 +622,9 @@ def _process_single_page_bedrock(
622622
)
623623

624624
# Generate and store text confidence data
625-
# For Bedrock, we use empty confidence data since LLM OCR doesn't provide real confidence scores
625+
# For Bedrock, we use empty markdown table since LLM OCR doesn't provide real confidence scores
626626
text_confidence_data = {
627-
"page_count": 1,
628-
"text_blocks": [], # Empty - no confidence data available from LLM OCR
627+
"text": "| Text | Confidence |\n|------|------------|\n| *No confidence data available from LLM OCR* | N/A |"
629628
}
630629

631630
text_confidence_key = f"{prefix}/pages/{page_id}/textConfidence.json"
@@ -703,8 +702,10 @@ def _process_single_page_none(
703702
content_type="application/json",
704703
)
705704

706-
# Generate minimal text confidence data (empty)
707-
text_confidence_data = {"page_count": 1, "text_blocks": []}
705+
# Generate minimal text confidence data (empty markdown table)
706+
text_confidence_data = {
707+
"text": "| Text | Confidence |\n|------|------------|\n| *No OCR performed* | N/A |"
708+
}
708709

709710
text_confidence_key = f"{prefix}/pages/{page_id}/textConfidence.json"
710711
s3.write_content(
@@ -807,11 +808,9 @@ def _generate_text_confidence_data(
807808
"""
808809
Generate text confidence data from raw OCR to reduce token usage while preserving essential information.
809810
810-
This method transforms verbose Textract output into a minimal format containing:
811+
This method transforms verbose Textract output into a markdown table format containing:
811812
- Essential text content (LINE blocks only)
812813
- OCR confidence scores (rounded to 1 decimal point)
813-
- Text type (PRINTED/HANDWRITING)
814-
- Page count
815814
816815
Removes geometric data, relationships, block IDs, and other verbose metadata
817816
that aren't needed for assessment purposes.
@@ -820,29 +819,30 @@ def _generate_text_confidence_data(
820819
raw_ocr_data: Raw Textract API response
821820
822821
Returns:
823-
Text confidence data with ~80-90% token reduction
822+
Text confidence data as markdown table with ~80-90% token reduction
824823
"""
825-
text_confidence_data = {
826-
"page_count": raw_ocr_data.get("DocumentMetadata", {}).get("Pages", 1),
827-
"text_blocks": [],
828-
}
824+
# Start building the markdown table
825+
markdown_lines = ["| Text | Confidence |", "|------|------------|"]
829826

830827
blocks = raw_ocr_data.get("Blocks", [])
831828

832829
for block in blocks:
833830
if block.get("BlockType") == "LINE" and block.get("Text"):
834-
text_block = {
835-
"text": block.get("Text", ""),
836-
"confidence": round(block.get("Confidence", 0.0), 1),
837-
}
831+
text = block.get("Text", "").replace(
832+
"|", "\\|"
833+
) # Escape pipe characters
834+
confidence = round(block.get("Confidence", 0.0), 1)
838835

839-
# Include text type if available (PRINTED vs HANDWRITING)
840-
if "TextType" in block:
841-
text_block["type"] = block["TextType"]
836+
# Add text type indicator if it's handwriting
837+
if block.get("TextType") == "HANDWRITING":
838+
markdown_lines.append(f"| {text} (HANDWRITING) | {confidence} |")
839+
else:
840+
markdown_lines.append(f"| {text} | {confidence} |")
842841

843-
text_confidence_data["text_blocks"].append(text_block)
842+
# Join all lines into a single markdown string
843+
markdown_table = "\n".join(markdown_lines)
844844

845-
return text_confidence_data
845+
return {"text": markdown_table}
846846

847847
def _parse_textract_response(
848848
self, response: Dict[str, Any], page_id: int = None
@@ -1070,15 +1070,16 @@ def _process_converted_page(
10701070
content_type="application/json",
10711071
)
10721072

1073-
# Generate text confidence data
1074-
text_confidence_data = {
1075-
"page_count": 1,
1076-
"text_blocks": [
1077-
{"text": line, "confidence": 99.0, "type": "PRINTED"}
1078-
for line in page_text.split("\n")
1079-
if line.strip()
1080-
],
1081-
}
1073+
# Generate text confidence data as markdown table
1074+
markdown_lines = ["| Text | Confidence |", "|------|------------|"]
1075+
for line in page_text.split("\n"):
1076+
if line.strip():
1077+
# Escape pipe characters in text
1078+
escaped_line = line.replace("|", "\\|")
1079+
markdown_lines.append(f"| {escaped_line} | 99.0 |")
1080+
1081+
markdown_table = "\n".join(markdown_lines)
1082+
text_confidence_data = {"text": markdown_table}
10821083

10831084
text_confidence_key = f"{prefix}/pages/{page_id}/textConfidence.json"
10841085
s3.write_content(

lib/idp_common_pkg/tests/unit/ocr/test_ocr_service.py

Lines changed: 16 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -528,20 +528,22 @@ def test_generate_text_confidence_data(self, mock_textract_response):
528528
service = OcrService()
529529
result = service._generate_text_confidence_data(mock_textract_response)
530530

531-
# Verify structure
532-
assert "page_count" in result
533-
assert "text_blocks" in result
534-
assert result["page_count"] == 1
535-
assert len(result["text_blocks"]) == 2 # Two LINE blocks
536-
537-
# Verify text blocks
538-
assert result["text_blocks"][0]["text"] == "Sample text line 1"
539-
assert result["text_blocks"][0]["confidence"] == 98.5
540-
assert result["text_blocks"][0]["type"] == "PRINTED"
541-
542-
assert result["text_blocks"][1]["text"] == "Sample text line 2"
543-
assert result["text_blocks"][1]["confidence"] == 97.2
544-
assert result["text_blocks"][1]["type"] == "PRINTED"
531+
# Verify structure - now returns markdown table in 'text' field
532+
assert "text" in result
533+
assert "page_count" not in result # Removed in new format
534+
assert "text_blocks" not in result # Replaced with markdown table
535+
536+
# Verify markdown table content
537+
markdown_table = result["text"]
538+
lines = markdown_table.split("\n")
539+
540+
# Check header
541+
assert lines[0] == "| Text | Confidence |"
542+
assert lines[1] == "|------|------------|"
543+
544+
# Check data rows
545+
assert lines[2] == "| Sample text line 1 | 98.5 |"
546+
assert lines[3] == "| Sample text line 2 | 97.2 |"
545547

546548
def test_parse_textract_response_markdown_success(self):
547549
"""Test parsing Textract response to markdown successfully."""

0 commit comments

Comments
 (0)