Skip to content

Conversation

@redpheonixx
Copy link
Contributor

@redpheonixx redpheonixx commented Mar 24, 2025

This pull request addresses the handling of decimal physical type matching in Parquet. It implements rules such that:
For precision ≤ 9, values are stored as int32.
For precision ≤ 18, values are stored as int64.
For higher precision, values are stored as a FIXED_LEN_BYTE_ARRAY.

Closes #1789

@redpheonixx
Copy link
Contributor Author

Hi @kevinjqliu @Fokko
I have updated the code as per previous PR #1799 comments
please note that

  1. We need Decimal as to create pa.array of decimal 128 type it accepts only integer and decimal values and not float values.
  2. I tried without this parameter store_decimal_as_integer=True (ParquetWriter) but then the physical type for decimal(5, 2) will be FIXED_LEN_BYTE_ARRAY.

@Fokko Fokko added this to the PyIceberg 0.9.1 milestone Mar 26, 2025
@Fokko Fokko self-requested a review March 26, 2025 16:58
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @redpheonixx for working on this! 🙌 I think we're very close, I left one comment to simplify it a bit further. Let me know what you think!

Comment on lines 2355 to 2363
precision = stats_col.iceberg_type.precision
scale = stats_col.iceberg_type.scale
decimal_type = pa.decimal128(precision, scale)
col_aggs[field_id].update_min(
pa.array([Decimal(statistics.min_raw) / (10**scale)], decimal_type)[0].as_py()
)
col_aggs[field_id].update_max(
pa.array([Decimal(statistics.max_raw) / (10**scale)], decimal_type)[0].as_py()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this into:

Suggested change
precision = stats_col.iceberg_type.precision
scale = stats_col.iceberg_type.scale
decimal_type = pa.decimal128(precision, scale)
col_aggs[field_id].update_min(
pa.array([Decimal(statistics.min_raw) / (10**scale)], decimal_type)[0].as_py()
)
col_aggs[field_id].update_max(
pa.array([Decimal(statistics.max_raw) / (10**scale)], decimal_type)[0].as_py()
)
scale = stats_col.iceberg_type.scale
col_aggs[field_id].update_min(unscaled_to_decimal(statistics.min_raw, scale))
col_aggs[field_id].update_max(unscaled_to_decimal(statistics.max_raw, scale))

We already have the unscale_to_decimal that we can import from pyiceberg.utils.decimal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Fokko ,
thanks for the comment
I have updated the code and used unscale_to_decimal now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that looks great 👍

@Fokko Fokko changed the title decimal physicial type mapping Fix decimal physicial type mapping Mar 31, 2025
@Fokko Fokko merged commit 1a5e32a into apache:main Mar 31, 2025
7 checks passed
@Fokko
Copy link
Contributor

Fokko commented Mar 31, 2025

Thanks for fixing this @redpheonixx 🙌

@redpheonixx redpheonixx deleted the decimal_physicial_type_mapping branch March 31, 2025 14:27
Fokko pushed a commit that referenced this pull request Apr 17, 2025
This pull request addresses the handling of decimal physical type
matching in Parquet. It implements rules such that:
For precision ≤ 9, values are stored as `int32`.
For precision ≤ 18, values are stored as `int64`.
For higher precision, values are stored as a `FIXED_LEN_BYTE_ARRAY`.

Closes #1789

---------

Co-authored-by: redpheonixx <amitsingh@192.168.1.5>
Co-authored-by: redpheonixx <amitsingh@192.168.1.10>
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
This pull request addresses the handling of decimal physical type
matching in Parquet. It implements rules such that:
For precision ≤ 9, values are stored as `int32`.
For precision ≤ 18, values are stored as `int64`.
For higher precision, values are stored as a `FIXED_LEN_BYTE_ARRAY`.

Closes apache#1789

---------

Co-authored-by: redpheonixx <amitsingh@192.168.1.5>
Co-authored-by: redpheonixx <amitsingh@192.168.1.10>
kevinjqliu pushed a commit that referenced this pull request Oct 26, 2025
<!--
Thanks for opening a pull request!
-->

<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->

# Rationale for this change
Fix for #2057. It looks
like the initial fix #1839
might have missed updating here to handle. I could use feedback on if
this is the best fix, it is at least simple.

## Are these changes tested?
- [x] added unit tests

## Are there any user-facing changes?
No

<!-- In the case of user-facing changes, please add the changelog label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Correctly handle decimal physicial type mapping

2 participants