
Conversation

@kris-gaudel
Contributor

@kris-gaudel kris-gaudel commented Jul 24, 2025

Closes #2123

Rationale for this change

Fixing sanitization behaviour to match specification and Java implementation

Are these changes tested?

Yes - Unit and integration tests

Are there any user-facing changes?

Yes - Field names will be sanitized to be Avro compatible if not already

@kris-gaudel
Contributor Author

@kevinjqliu When you have a moment could you review this?

@kevinjqliu kevinjqliu changed the title from "#2123 Sanitize invalid Avro field names" to "fix: sanitize invalid Avro field names in manifest file" on Jul 31, 2025
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM this is great, thanks for fixing this bug!

I left a few nit comments, mostly about the general structure of some of the tests.
If you like, we can also add a test for duckdb reading from pyiceberg (like the original issue):
https://github.com/kevinjqliu/iceberg-python/pull/16/files#diff-7f3dd1244d08ce27c003cd091da10aa049f7bb0c7d5397acb4ec69767036accdR1204-R1244
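
Roughly, such a test could look like the sketch below. This is only an illustration under assumptions, not the test from the linked diff: the `catalog` pytest fixture and the table name are hypothetical, the partition-spec step is there so the column name ends up in manifest metadata, and it assumes DuckDB's iceberg extension can reach the warehouse path directly (e.g. a local filesystem).

```python
import duckdb
import pyarrow as pa


def test_duckdb_reads_pyiceberg_table_with_invalid_avro_name(catalog) -> None:
    # "😎" is not a valid Avro identifier, so PyIceberg must sanitize it in the
    # Avro metadata it writes (for example as a partition field name in manifests).
    data = pa.table({"😎": ["a", "b", "c"]})
    table = catalog.create_table("default.sanitized_names", schema=data.schema)

    # Partition by the offending column so its name shows up in manifest files.
    with table.update_spec() as update:
        update.add_identity("😎")

    table.append(data)

    # Read the table back with DuckDB's iceberg extension, straight from the
    # current metadata file, and check the row count.
    con = duckdb.connect()
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    count = con.execute(
        f"SELECT count(*) FROM iceberg_scan('{table.metadata_location}')"
    ).fetchone()[0]
    assert count == 3
```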

@kevinjqliu
Contributor

Following up on #2123 (comment) and #2123 (comment):

I see that Spark produces _xD83D_xDE0E; pyiceberg previously produced \uD83D\uDE0E but now produces _x1F60E.
It might be a good idea to add integration tests for pyiceberg reading a Spark-written table and Spark reading a pyiceberg-written table.
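
For reference, a minimal sketch of the code-point-based escaping described above (not PyIceberg's actual implementation; the function name and regex are illustrative, and the extra rule for names that start with a digit is left out):

```python
import re

# Avro names must match [A-Za-z_][A-Za-z0-9_]*; anything else has to be escaped.
_VALID_AVRO_CHAR = re.compile(r"[A-Za-z0-9_]")


def sanitize_avro_name(name: str) -> str:
    """Replace characters that are invalid in Avro names with _x<HEX> escapes."""
    out = []
    for ch in name:
        if _VALID_AVRO_CHAR.match(ch):
            out.append(ch)
        else:
            # Escape the full Unicode code point, e.g. U+1F60E -> _x1F60E.
            out.append(f"_x{ord(ch):X}")
    return "".join(out)


# Java/Spark escape each UTF-16 code unit instead, so the same emoji (the
# surrogate pair D83D/DE0E) comes out as _xD83D_xDE0E there.
print(sanitize_avro_name("😎"))  # _x1F60E
```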

@kris-gaudel
Contributor Author

@kevinjqliu Thanks for the feedback! Regarding adding the Spark integration test: I've been running into some issues running integration tests locally that use Spark:

sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

Do you know what could be causing this?

@kris-gaudel kris-gaudel requested a review from kevinjqliu August 2, 2025 03:13
@kevinjqliu
Contributor

Do you know what could be causing this?

Looks like CI passed, so it could be an issue with your local setup. I put that into ChatGPT; it says:

🧠 Root Causes:
Running on a restricted JVM (like GraalVM Native Image or a non-Oracle/OpenJDK distribution)

Using a Java version where these internals are restricted (like Java 17+ with strong encapsulation)

Running in a context that limits reflection or unsafe access (e.g., with --illegal-access=deny)

Spark compiled with a newer Java version but running on an incompatible JVM

what java version are you using?

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for working on this.

This is quite close. I added a comment about actually using duckdb to read, and also a nit about consolidating some of the tests so that they're easier to maintain in the future.

Contributor

@Fokko Fokko left a comment

Thanks @kris-gaudel for working on this! I agree with @kevinjqliu wrt maybe consolidating some of the tests.

@kris-gaudel
Contributor Author

kris-gaudel commented Aug 5, 2025

what java version are you using?

I have Java 22 active, but I will try downgrading my version and see if that fixes it. Thank you for looking into it!

@kevinjqliu
Contributor

I didn't even know there's a Java 22 already, haha.
That's likely the reason; Iceberg Java only supports up to 21:
https://github.com/apache/iceberg/blob/7ffc718d2857c1f4e4e7e1d70eebc8662020d6bd/build.gradle#L81

@kris-gaudel
Contributor Author

kris-gaudel commented Aug 6, 2025

OK, so I downgraded my Java version and the integration tests run now, but I'm still hitting some failures that look Spark-related.

Edit: Turns out some Apache Arrow settings I had been tweaking were to blame; the tests pass locally now!

@kevinjqliu
Contributor

nice! @kris-gaudel I think we're almost good to go

Could you address these 2 comments when you get a chance?
#2245 (comment)
#2245 (comment)

@kris-gaudel
Contributor Author

kris-gaudel commented Aug 6, 2025

@kevinjqliu Done! Lmk if I missed anything 😄

@kris-gaudel kris-gaudel requested review from Fokko and kevinjqliu August 6, 2025 17:34
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thank you for cleaning up and adding the additional integration tests

@kevinjqliu kevinjqliu merged commit b6a45ed into apache:main Aug 6, 2025
10 checks passed
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025

Closes apache#2123

# Rationale for this change
Fixing sanitization behaviour to match specification and Java
implementation

# Are these changes tested?
Yes - Unit and integration tests

# Are there any user-facing changes?
Yes - Field names will be sanitized to be Avro compatible if not already

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[bug] Schema validation should reject field names that are invalid Avro identifiers
