Commit 9e352f7

Add configurable section splitting strategies for document segmentation control. Resolves #146.
1 parent c620ea2 commit 9e352f7

12 files changed, +754 -78 lines changed

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -5,6 +5,17 @@ SPDX-License-Identifier: MIT-0

 ## [Unreleased]

+### Added
+
+- **Configurable Section Splitting Strategies for Enhanced Document Segmentation Control**
+  - Added new `sectionSplitting` configuration option to control how classified pages are grouped into document sections
+  - **Three Strategies Available**:
+    - `disabled`: Entire document treated as single section with first detected class (simplest case)
+    - `page`: One section per page preventing automatic joining of same-type documents (deterministic, solves Issue #146)
+    - `llm_determined`: Uses LLM boundary detection with "Start"/"Continue" indicators (default, maintains existing behavior)
+  - **Key Benefits**: Deterministic splitting for long documents with multiple same-type forms (e.g., multiple W-2s, multiple invoices), eliminates LLM boundary detection failures for critical government form processing, provides flexibility across simple to complex document scenarios
+  - Resolves #146
+
 ### Changed
 - Removed page image limit entirely across all IDP services (classification, extraction, assessment) following Amazon Bedrock API removal of image count restrictions. The system now processes all document pages without artificial truncation, with info logging to track image counts for monitoring purposes.
   - Resolves #147

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 1 addition & 0 deletions
@@ -238,6 +238,7 @@ classification:
   system_prompt: >-
     You are a document classification expert who can analyze and classify multiple documents and their page boundaries within a document package from various domains. Your task is to determine the document type based on its content and structure, using the provided document type definitions. Your output must be valid JSON according to the requested format.
   classificationMethod: textbasedHolisticClassification
+  sectionSplitting: llm_determined
 extraction:
   image:
     target_height: ""

config_library/pattern-2/lending-package-sample/config.yaml

Lines changed: 1 addition & 0 deletions
@@ -1188,6 +1188,7 @@ classes:
 classification:
   classificationMethod: multimodalPageLevelClassification
   maxPagesForClassification: "ALL"
+  sectionSplitting: llm_determined
   image:
     target_height: ""
     target_width: ""

config_library/pattern-2/rvl-cdip-package-sample-with-few-shot-examples/config.yaml

Lines changed: 1 addition & 0 deletions
@@ -813,6 +813,7 @@ classification:
     target_height: ""
     target_width: ""
   classificationMethod: multimodalPageLevelClassification
+  sectionSplitting: llm_determined
   model: us.amazon.nova-pro-v1:0
   temperature: "0.0"
   top_p: "0.1"

config_library/pattern-2/rvl-cdip-package-sample/config.yaml

Lines changed: 1 addition & 0 deletions
@@ -905,6 +905,7 @@ classification:
   system_prompt: >-
     You are a document classification expert who can analyze and classify multiple documents and their page boundaries within a document package from various domains. Your task is to determine the document type based on its content and structure, using the provided document type definitions. Your output must be valid JSON according to the requested format.
   classificationMethod: textbasedHolisticClassification
+  sectionSplitting: llm_determined
 extraction:
   image:
     target_width: ""

docs/classification.md

Lines changed: 148 additions & 0 deletions
@@ -148,6 +148,154 @@ Despite its strengths in handling full-document context, this method has several
 - Performs multi-modal page-level classification (classifies each page based on OCR data and page image)
 - Not configurable inside the GenAIIDP solution

+## Section Splitting Strategies
+
+The `sectionSplitting` configuration controls how classified pages are grouped into document sections. This setting works with both classification methods and provides three strategies:
+
+### Available Strategies
+
+#### 1. `disabled` - No Splitting (Entire Document = One Section)
+
+**Behavior:**
+- All pages are assigned to a single section
+- Uses the first detected document class for the entire document
+- Ignores any page-level classification boundaries
+
+**Use Cases:**
+- Documents known to be single-type with no internal divisions
+- Simplified processing where granular section splitting isn't needed
+- When you want to force all pages to be treated as one cohesive document
+
+**Configuration Example:**
+```yaml
+classification:
+  sectionSplitting: disabled
+  classificationMethod: multimodalPageLevelClassification
+```
+
+**Result:**
+- Document with 10 pages → 1 section containing all 10 pages
+- All pages assigned the first detected class
+
+#### 2. `page` - Per-Page Splitting (Each Page = Own Section)
+
+**Behavior:**
+- Every page becomes an independent section
+- Each page keeps its individually classified document type
+- **Prevents automatic joining of same-type documents**
+
+**Use Cases:**
+- **Critical for long documents with multiple same-type forms** (e.g., multiple W-2 forms, multiple invoices)
+- When LLM boundary detection is unreliable or fails frequently
+- Government form processing where each form must be processed independently
+- Scenarios where deterministic splitting is required
+
+**Configuration Example:**
+```yaml
+classification:
+  sectionSplitting: page
+  classificationMethod: multimodalPageLevelClassification
+```
+
+**Result:**
+- Document with 10 pages → 10 sections (one per page)
+- Each page maintains its individual classification
+
+**GitHub Issue Reference:**
+This strategy directly addresses [Issue #146](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/146) where long documents with multiple same-type forms were being incorrectly joined together.
+
+#### 3. `llm_determined` - LLM Boundary Detection (Default)
+
+**Behavior:**
+- Uses "Start"/"Continue" boundary indicators from LLM responses
+- Automatically groups related pages into logical sections
+- Implements BIO-like tagging for sophisticated document segmentation
+
+**Use Cases:**
+- Complex multi-document packets requiring intelligent boundary detection
+- When LLM boundary detection works reliably
+- Default behavior that works well for most use cases
+
+**Configuration Example:**
+```yaml
+classification:
+  sectionSplitting: llm_determined # This is the default
+  classificationMethod: multimodalPageLevelClassification
+```
+
+**Result:**
+- Document with 10 pages → Variable number of sections based on LLM boundary detection
+- Pages grouped according to document boundaries and type changes
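
As a rough illustration of the `llm_determined` behavior described above, the following Python sketch groups pages using "Start"/"Continue" indicators. The `Page` and `Section` types, the `boundary` field name, and the `group_by_boundary` function are hypothetical names chosen for this example; the commit's actual implementation is not shown in this excerpt and may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    page_id: int
    doc_class: str
    boundary: str  # "Start" or "Continue" (hypothetical field name)


@dataclass
class Section:
    doc_class: str
    page_ids: List[int] = field(default_factory=list)


def group_by_boundary(pages: List[Page]) -> List[Section]:
    """Group pages into sections, BIO-style.

    A new section begins on the first page, on any page tagged "Start",
    or whenever the class differs from the current section's class.
    """
    sections: List[Section] = []
    for page in pages:
        starts_new = (
            not sections
            or page.boundary == "Start"
            or page.doc_class != sections[-1].doc_class
        )
        if starts_new:
            sections.append(Section(page.doc_class, [page.page_id]))
        else:
            sections[-1].page_ids.append(page.page_id)
    return sections
```

Under these assumptions, two consecutive W-2 forms are separated only if the model tags the first page of the second form as "Start"; if it emits "Continue" instead, the forms are joined, which is the failure mode the `page` strategy avoids.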
+
+### Strategy Comparison Table
+
+| Strategy | Sections Created | Boundary Detection | Same-Type Handling | Deterministic | Performance |
+|----------|-----------------|-------------------|-------------------|---------------|-------------|
+| `disabled` | 1 section always | None | All joined | Yes | Fastest |
+| `page` | N sections (N pages) | Per-page | Never joined | Yes | Fast |
+| `llm_determined` | Variable | LLM boundaries | May join | No | Standard |
+
+### Configuration Placement
+
+The `sectionSplitting` setting is placed in the classification configuration section:
+
+```yaml
+classification:
+  model: us.amazon.nova-pro-v1:0
+  classificationMethod: multimodalPageLevelClassification
+  sectionSplitting: page # Options: disabled, page, llm_determined
+  maxPagesForClassification: "ALL"
+  temperature: "0.0"
+  # ... other classification settings
+```
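
A minimal sketch of how a consumer of this configuration might read the setting, assuming the YAML has already been parsed into a plain dictionary. The function name `resolve_section_splitting` and the choice to raise `ValueError` on an unknown value are assumptions made for this example; the excerpt only states that `llm_determined` is the default and does not say how invalid values are handled.

```python
VALID_STRATEGIES = ("disabled", "page", "llm_determined")
DEFAULT_STRATEGY = "llm_determined"  # documented default


def resolve_section_splitting(config: dict) -> str:
    """Return the configured strategy, falling back to the default.

    `config` is the parsed configuration; the setting is looked up
    under the top-level `classification` key.
    """
    classification = config.get("classification") or {}
    value = classification.get("sectionSplitting", DEFAULT_STRATEGY)
    if value not in VALID_STRATEGIES:
        # Assumption: fail loudly rather than silently falling back.
        raise ValueError(
            f"Unknown sectionSplitting value {value!r}; "
            f"expected one of {VALID_STRATEGIES}"
        )
    return value
```

With this sketch, a configuration that omits `sectionSplitting` resolves to `llm_determined`, matching the documented default.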
+
+### Interaction with Classification Methods
+
+The `sectionSplitting` setting works with both classification methods:
+
+**With `multimodalPageLevelClassification`:**
+- `disabled`: First page's class applies to all pages in one section
+- `page`: Each page's individual classification preserved in separate sections
+- `llm_determined`: Pages grouped by class + boundary metadata
+
+**With `textbasedHolisticClassification`:**
+- `disabled`: First segment's class applies to all pages in one section
+- `page`: Each page gets its own section with the class assigned by holistic method
+- `llm_determined`: LLM-determined segments used as sections (default behavior)
+
+### Real-World Example: Multiple W-2 Forms
+
+Consider a 6-page document containing three W-2 forms (2 pages each):
+
+**With `sectionSplitting: llm_determined` (may work or may fail):**
+```
+Result depends on LLM boundary detection accuracy
+Best case: 3 sections (one per W-2)
+Worst case: 1 section (all W-2s incorrectly joined)
+```
+
+**With `sectionSplitting: page` (deterministic solution):**
+```
+Page 1 → Section 1 (W-2)
+Page 2 → Section 2 (W-2)
+Page 3 → Section 3 (W-2)
+Page 4 → Section 4 (W-2)
+Page 5 → Section 5 (W-2)
+Page 6 → Section 6 (W-2)
+
+Result: 6 independent sections
+Each W-2 page processed separately
+No risk of incorrect joining
+```
+
+**With `sectionSplitting: disabled` (simplest case):**
+```
+All 6 pages → Section 1 (W-2)
+
+Result: Single section
+Entire document treated as one unit
+```
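
The two deterministic outcomes in the W-2 example can be reproduced with a short Python sketch. The function `split_sections` and its tuple-based return shape are hypothetical and chosen only for illustration; `llm_determined` is deliberately left out because it depends on boundary metadata that a plain list of class labels does not carry.

```python
from typing import List, Tuple


def split_sections(page_classes: List[str], strategy: str) -> List[Tuple[str, List[int]]]:
    """Group 1-based page numbers into (class, pages) sections.

    `page_classes` holds one class label per page, in page order.
    """
    if not page_classes:
        return []
    if strategy == "disabled":
        # One section for the whole document, labeled with the first class.
        return [(page_classes[0], list(range(1, len(page_classes) + 1)))]
    if strategy == "page":
        # One section per page, each keeping its own class.
        return [(cls, [number]) for number, cls in enumerate(page_classes, start=1)]
    raise ValueError(f"Strategy {strategy!r} is not modeled in this sketch")


w2_document = ["W-2"] * 6
per_page = split_sections(w2_document, "page")
single = split_sections(w2_document, "disabled")
print(len(per_page), "sections with 'page';", len(single), "section with 'disabled'")
```

Note that with `page`, each two-page W-2 ends up as two single-page sections, so any logic that needs both pages of one form together has to recombine them downstream.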
+
 ## Choosing Between Classification Methods

 When deciding between Text-Based Holistic Classification and MultiModal Page-Level Classification with Sequence Segmentation, consider these factors:
