Commit a5d7390

Author: Bob Strahan (committed)
Merge branch 'main' of ssh.gitlab.aws.dev:genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator
2 parents: d7575c2 + b419a3c (commit a5d7390)

File tree

58 files changed: +5004 additions, -1887 deletions


CHANGELOG.md

Lines changed: 32 additions & 0 deletions

@@ -5,6 +5,38 @@ SPDX-License-Identifier: MIT-0

## [Unreleased]

### Added

## [0.3.16]

### Added

- **S3 Vectors Support for Cost-Optimized Knowledge Base Storage**
  - Added S3 Vectors as an alternative vector store option to OpenSearch Serverless for the Bedrock Knowledge Base, offering lower storage costs
  - Custom resource Lambda implementation for S3 vector bucket and index management (using the boto3 s3vectors client), with proper IAM permissions and resource cleanup
  - Unified Knowledge Base interface supporting both vector store types, with automatic resource provisioning based on the user's selection
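The custom resource flow described above can be sketched roughly as follows. This is a hypothetical illustration only, not the repo's implementation; the boto3 `s3vectors` operation names and the `VectorBucketName`/`IndexName`/`Dimension` resource properties are assumptions:

```python
# Hypothetical sketch of a CloudFormation custom-resource handler for an
# S3 vector bucket and index. Operation names are assumptions, not the repo's code.

def physical_id(bucket: str, index: str) -> str:
    """Stable PhysicalResourceId so CloudFormation can track the index."""
    return f"{bucket}/{index}"

def handle_event(event: dict) -> dict:
    """Dispatch a custom-resource event to create/delete the vector resources."""
    props = event["ResourceProperties"]
    bucket, index = props["VectorBucketName"], props["IndexName"]
    if event["RequestType"] == "Create":
        import boto3  # imported lazily; available in the Lambda runtime
        s3v = boto3.client("s3vectors")
        s3v.create_vector_bucket(vectorBucketName=bucket)
        s3v.create_index(vectorBucketName=bucket, indexName=index,
                         dataType="float32",
                         dimension=int(props["Dimension"]),
                         distanceMetric="cosine")
    elif event["RequestType"] == "Delete":
        import boto3
        s3v = boto3.client("s3vectors")
        s3v.delete_index(vectorBucketName=bucket, indexName=index)
        s3v.delete_vector_bucket(vectorBucketName=bucket)
    # "Update" falls through: nothing to change for a fixed bucket/index pair
    return {"PhysicalResourceId": physical_id(bucket, index)}
```

A real handler would also signal success/failure back to CloudFormation and handle partially created resources during cleanup.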
- **Page Limit Configuration for Classification Control**
  - Added a `maxPagesForClassification` configuration option to control how many pages are used during document classification
  - **Default Behavior**: `"ALL"` uses all pages for classification (the existing behavior)
  - **Limited Page Classification**: set a numeric value (e.g., `"1"`, `"2"`, `"3"`) to classify only the first N pages
  - **Important**: when a numeric limit is used, the classification result from the first N pages is applied to ALL pages in the document, effectively forcing the entire document into a single class with one section
  - **Use Cases**: performance optimization for large documents, cost reduction for documents with consistent classification patterns, and simplified processing for homogeneous document types
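For example, a configuration might limit classification to the first two pages. A sketch only; the `classification` key placement follows the sample configs updated in this commit:

```yaml
classification:
  # Classify using only the first 2 pages; the resulting class is then
  # applied to every page, producing a single class and one section.
  maxPagesForClassification: "2"
  # Default: use every page (the pre-0.3.16 behavior).
  # maxPagesForClassification: "ALL"
```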
- **CloudFormation Service Role for Delegated Deployment Access**
  - Added an example CloudFormation service role template that lets non-administrator users deploy and maintain IDP stacks without ongoing administrator permissions
  - Administrators provision the service role once with elevated privileges, then delegate deployment capabilities to developer/DevOps teams
  - Includes comprehensive documentation and cross-referenced deployment guides explaining the security model and setup process
### Fixed

- Fixed an issue where CloudFront policy statements still appeared in generated GovCloud templates even though the CloudFront resources had been removed
- Fixed duplicate Glue tables being created when a document class contains a dash (-); resolved by replacing dashes in section types with underscores when creating the table, aligning with the table name the Glue crawler generates later - resolves #57
- Fixed an occasional UI error, "Failed to get document details - please try again later" - resolves #58
- Fixed UI zipfile creation to exclude .aws-sam directories and .env files from the deployment package
- Added a security recommendation to set the LogLevel parameter to WARN or ERROR (not INFO) for production deployments, to prevent logging of sensitive information including PII, document contents, and S3 presigned URLs
- Hardened several aspects of the new Discovery feature
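The dash-to-underscore fix for #57 amounts to normalizing the section type before creating the table, so the pre-created name matches the one the Glue crawler derives later. A minimal sketch (the helper name is hypothetical):

```python
def glue_safe_table_name(section_type: str) -> str:
    """Replace dashes with underscores so the table created for a section type
    matches the table name the Glue crawler generates later (which also uses
    underscores), avoiding a duplicate table per document class."""
    return section_type.replace("-", "_")
```

With this, a `bank-statement` class and the crawler agree on a single `bank_statement` table.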
## [0.3.15]

### Added

VERSION

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-0.3.15
+0.3.16

config_library/pattern-1/lending-package-sample/config.yaml

Lines changed: 102 additions & 0 deletions

@@ -215,3 +215,105 @@ pricing:
      price: '1.5E-6'
    - name: cacheWriteInputTokens
      price: '1.875E-5'

The following `discovery` section is appended:

discovery:
  output_format:
    sample_json: |-
      {
        "document_class" : "Form-1040",
        "document_description" : "Brief summary of the document",
        "groups" : [
          {
            "name" : "PersonalInformation",
            "description" : "Personal information of the taxpayer",
            "attributeType" : "group",
            "groupAttributes" : [
              {
                "name": "FirstName",
                "dataType" : "string",
                "description" : "First name of the taxpayer"
              },
              {
                "name": "Age",
                "dataType" : "number",
                "description" : "Age of the taxpayer"
              }
            ]
          },
          {
            "name" : "Dependents",
            "description" : "Dependents of the taxpayer",
            "attributeType" : "list",
            "listItemTemplate": {
              "itemAttributes" : [
                {
                  "name": "FirstName",
                  "dataType" : "string",
                  "description" : "Dependent's first name"
                },
                {
                  "name": "Age",
                  "dataType" : "number",
                  "description" : "Dependent's age"
                }
              ]
            }
          }
        ]
      }
  with_ground_truth:
    top_p: '0.1'
    temperature: '1.0'
    user_prompt: >-
      This image contains unstructured data. Analyze the data line by line, using the provided ground truth as a reference.
      <GROUND_TRUTH_REFERENCE>
      {ground_truth_json}
      </GROUND_TRUTH_REFERENCE>
      The ground truth reference JSON contains the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
      The image may contain multiple pages; process all pages.
      Extract all field names, including those without values.
      Do not change the group names and field names from the ground truth in the extracted data JSON.
      Add a field_description field for every field, containing an instruction to the LLM on how to extract the field's data from the image/document. Add a data_type field for every field.
      Add two fields: document_class and document_description.
      For document_class, generate a short name based on the document content, such as W4, I-9, or Paystub.
      For document_description, generate a description of the document in fewer than 50 words.
      If a group repeats and follows a table format, set its attributeType to "list".
      Do not extract the values.
      Format the extracted groups and fields using the below JSON format:
    model_id: us.amazon.nova-pro-v1:0
    system_prompt: >-
      You are an expert in processing forms and extracting data from images and
      documents. Use the provided ground truth data as a reference to optimize
      field extraction and ensure consistency with the expected document
      structure and field definitions.
    max_tokens: '10000'
  without_ground_truth:
    top_p: '0.1'
    temperature: '1.0'
    user_prompt: >-
      This image contains forms data. Analyze the form line by line.
      The image may contain multiple pages; process all pages.
      A form may contain multiple name-value pairs on one line.
      Extract all the names in the form, including name-value pairs that have no value.
      Organize them into groups, extracting field_name, data_type, and field_description.
      field_name should be fewer than 60 characters and should not contain spaces; use '-' instead of a space.
      field_description is a brief description of the field and its location in the form, such as the box number or line number and the section of the form.
      field_name should be unique within the group.
      Add two fields: document_class and document_description.
      For document_class, generate a short name based on the document content, such as W4, I-9, or Paystub.
      For document_description, generate a description of the document in fewer than 50 words.

      Group the fields based on the section of the form they appear in. A group should have attributeType "group".
      If a group repeats and follows a table format, set its attributeType to "list".
      Do not extract the values.
      Return the extracted data in JSON format.
      Format the extracted groups and fields using the below JSON format:
    model_id: us.amazon.nova-pro-v1:0
    system_prompt: >-
      You are an expert in processing forms and extracting data from images and
      documents. Analyze forms line by line to identify field names, data types,
      and organizational structure. Focus on creating comprehensive blueprints
      for document processing without extracting actual values.
    max_tokens: '10000'
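As a rough illustration of how the `{ground_truth_json}` placeholder in the `with_ground_truth` user prompt might be filled at runtime (the `render_prompt` helper below is hypothetical, not part of the accelerator):

```python
import json

def render_prompt(template: str, ground_truth: dict) -> str:
    """Substitute the ground-truth JSON into the discovery prompt template.

    The placeholder name matches the {ground_truth_json} token in the config;
    everything else here is an assumption for illustration.
    """
    return template.replace("{ground_truth_json}",
                            json.dumps(ground_truth, indent=2))
```

The rendered string would then be sent as the user message alongside the document image.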

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 104 additions & 0 deletions

@@ -68,6 +68,7 @@ classes:
      description: List of all transactions in the statement period
      attributeType: list
classification:
+  maxPagesForClassification: "ALL"
  image:
    target_height: ''
    target_width: ''

@@ -371,6 +372,7 @@ summarization:

assessment:
  enabled: true
+  validation_enabled: false
  image:
    target_height: ''
    target_width: ''

@@ -691,3 +693,105 @@ pricing:
      price: '1.5E-6'
    - name: cacheWriteInputTokens
      price: '1.875E-5'

The following `discovery` section is appended:

discovery:
  output_format:
    sample_json: |-
      {
        "document_class" : "Form-1040",
        "document_description" : "Brief summary of the document",
        "groups" : [
          {
            "name" : "PersonalInformation",
            "description" : "Personal information of the taxpayer",
            "attributeType" : "group",
            "groupAttributes" : [
              {
                "name": "FirstName",
                "dataType" : "string",
                "description" : "First name of the taxpayer"
              },
              {
                "name": "Age",
                "dataType" : "number",
                "description" : "Age of the taxpayer"
              }
            ]
          },
          {
            "name" : "Dependents",
            "description" : "Dependents of the taxpayer",
            "attributeType" : "list",
            "listItemTemplate": {
              "itemAttributes" : [
                {
                  "name": "FirstName",
                  "dataType" : "string",
                  "description" : "Dependent's first name"
                },
                {
                  "name": "Age",
                  "dataType" : "number",
                  "description" : "Dependent's age"
                }
              ]
            }
          }
        ]
      }
  with_ground_truth:
    top_p: '0.1'
    temperature: '1.0'
    user_prompt: >-
      This image contains unstructured data. Analyze the data line by line, using the provided ground truth as a reference.
      <GROUND_TRUTH_REFERENCE>
      {ground_truth_json}
      </GROUND_TRUTH_REFERENCE>
      The ground truth reference JSON contains the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
      The image may contain multiple pages; process all pages.
      Extract all field names, including those without values.
      Do not change the group names and field names from the ground truth in the extracted data JSON.
      Add a field_description field for every field, containing an instruction to the LLM on how to extract the field's data from the image/document. Add a data_type field for every field.
      Add two fields: document_class and document_description.
      For document_class, generate a short name based on the document content, such as W4, I-9, or Paystub.
      For document_description, generate a description of the document in fewer than 50 words.
      If a group repeats and follows a table format, set its attributeType to "list".
      Do not extract the values.
      Format the extracted groups and fields using the below JSON format:
    model_id: us.amazon.nova-pro-v1:0
    system_prompt: >-
      You are an expert in processing forms and extracting data from images and
      documents. Use the provided ground truth data as a reference to optimize
      field extraction and ensure consistency with the expected document
      structure and field definitions.
    max_tokens: '10000'
  without_ground_truth:
    top_p: '0.1'
    temperature: '1.0'
    user_prompt: >-
      This image contains forms data. Analyze the form line by line.
      The image may contain multiple pages; process all pages.
      A form may contain multiple name-value pairs on one line.
      Extract all the names in the form, including name-value pairs that have no value.
      Organize them into groups, extracting field_name, data_type, and field_description.
      field_name should be fewer than 60 characters and should not contain spaces; use '-' instead of a space.
      field_description is a brief description of the field and its location in the form, such as the box number or line number and the section of the form.
      field_name should be unique within the group.
      Add two fields: document_class and document_description.
      For document_class, generate a short name based on the document content, such as W4, I-9, or Paystub.
      For document_description, generate a description of the document in fewer than 50 words.

      Group the fields based on the section of the form they appear in. A group should have attributeType "group".
      If a group repeats and follows a table format, set its attributeType to "list".
      Do not extract the values.
      Return the extracted data in JSON format.
      Format the extracted groups and fields using the below JSON format:
    model_id: us.amazon.nova-pro-v1:0
    system_prompt: >-
      You are an expert in processing forms and extracting data from images and
      documents. Analyze forms line by line to identify field names, data types,
      and organizational structure. Focus on creating comprehensive blueprints
      for document processing without extracting actual values.
    max_tokens: '10000'

config_library/pattern-2/criteria-validation/config.yaml

Lines changed: 105 additions & 0 deletions

@@ -2,6 +2,9 @@
# SPDX-License-Identifier: MIT-0

notes: Criteria validation configuration for healthcare/insurance prior authorization
+assessment:
+  enabled: true
+  validation_enabled: false
criteria_validation:
  model: us.anthropic.claude-3-5-sonnet-20240620-v1:0
  temperature: 0.0

@@ -209,3 +212,105 @@ pricing:
      price: 0.0000032
    - name: cacheReadInputTokens
      price: 0.0000002

The following `discovery` section is appended:

discovery:
  output_format:
    sample_json: |-
      {
        "document_class" : "Form-1040",
        "document_description" : "Brief summary of the document",
        "groups" : [
          {
            "name" : "PersonalInformation",
            "description" : "Personal information of the taxpayer",
            "attributeType" : "group",
            "groupAttributes" : [
              {
                "name": "FirstName",
                "dataType" : "string",
                "description" : "First name of the taxpayer"
              },
              {
                "name": "Age",
                "dataType" : "number",
                "description" : "Age of the taxpayer"
              }
            ]
          },
          {
            "name" : "Dependents",
            "description" : "Dependents of the taxpayer",
            "attributeType" : "list",
            "listItemTemplate": {
              "itemAttributes" : [
                {
                  "name": "FirstName",
                  "dataType" : "string",
                  "description" : "Dependent's first name"
                },
                {
                  "name": "Age",
                  "dataType" : "number",
                  "description" : "Dependent's age"
                }
              ]
            }
          }
        ]
      }
  with_ground_truth:
    top_p: '0.1'
    temperature: '1.0'
    user_prompt: >-
      This image contains unstructured data. Analyze the data line by line, using the provided ground truth as a reference.
      <GROUND_TRUTH_REFERENCE>
      {ground_truth_json}
      </GROUND_TRUTH_REFERENCE>
      The ground truth reference JSON contains the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
      The image may contain multiple pages; process all pages.
      Extract all field names, including those without values.
      Do not change the group names and field names from the ground truth in the extracted data JSON.
      Add a field_description field for every field, containing an instruction to the LLM on how to extract the field's data from the image/document. Add a data_type field for every field.
      Add two fields: document_class and document_description.
      For document_class, generate a short name based on the document content, such as W4, I-9, or Paystub.
      For document_description, generate a description of the document in fewer than 50 words.
      If a group repeats and follows a table format, set its attributeType to "list".
      Do not extract the values.
      Format the extracted groups and fields using the below JSON format:
    model_id: us.amazon.nova-pro-v1:0
    system_prompt: >-
      You are an expert in processing forms and extracting data from images and
      documents. Use the provided ground truth data as a reference to optimize
      field extraction and ensure consistency with the expected document
      structure and field definitions.
    max_tokens: '10000'
  without_ground_truth:
    top_p: '0.1'
    temperature: '1.0'
    user_prompt: >-
      This image contains forms data. Analyze the form line by line.
      The image may contain multiple pages; process all pages.
      A form may contain multiple name-value pairs on one line.
      Extract all the names in the form, including name-value pairs that have no value.
      Organize them into groups, extracting field_name, data_type, and field_description.
      field_name should be fewer than 60 characters and should not contain spaces; use '-' instead of a space.
      field_description is a brief description of the field and its location in the form, such as the box number or line number and the section of the form.
      field_name should be unique within the group.
      Add two fields: document_class and document_description.
      For document_class, generate a short name based on the document content, such as W4, I-9, or Paystub.
      For document_description, generate a description of the document in fewer than 50 words.

      Group the fields based on the section of the form they appear in. A group should have attributeType "group".
      If a group repeats and follows a table format, set its attributeType to "list".
      Do not extract the values.
      Return the extracted data in JSON format.
      Format the extracted groups and fields using the below JSON format:
    model_id: us.amazon.nova-pro-v1:0
    system_prompt: >-
      You are an expert in processing forms and extracting data from images and
      documents. Analyze forms line by line to identify field names, data types,
      and organizational structure. Focus on creating comprehensive blueprints
      for document processing without extracting actual values.
    max_tokens: '10000'

0 commit comments