
Conversation

@kris-gaudel
Contributor

@kris-gaudel kris-gaudel commented Jul 24, 2025

Closes #2123

Rationale for this change

Fixing sanitization behaviour to match specification and Java implementation

Are these changes tested?

Yes - Unit and integration tests

Are there any user-facing changes?

Yes - Field names will be sanitized to be Avro compatible if not already

@kris-gaudel
Contributor Author

@kevinjqliu When you have a moment could you review this?

@kevinjqliu kevinjqliu changed the title from "#2123 Sanitize invalid Avro field names" to "fix: sanitize invalid Avro field names in manifest file" on Jul 31, 2025
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM this is great, thanks for fixing this bug!

I left a few nit comments, mostly about the general structure of some of the tests.
If you like, we can also add a test for duckdb reading from pyiceberg (like the original issue):
https://github.com/kevinjqliu/iceberg-python/pull/16/files#diff-7f3dd1244d08ce27c003cd091da10aa049f7bb0c7d5397acb4ec69767036accdR1204-R1244
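
Roughly, such a test could look like the sketch below. This is only an illustration under assumptions, not the test from the linked diff: the `catalog` pytest fixture and the table name are hypothetical, the partition-spec step is there so the column name ends up in manifest metadata, and it assumes DuckDB's iceberg extension can reach the warehouse path directly (e.g. a local filesystem).

```python
import duckdb
import pyarrow as pa


def test_duckdb_reads_pyiceberg_table_with_invalid_avro_name(catalog) -> None:
    # "😎" is not a valid Avro identifier, so PyIceberg must sanitize it in the
    # Avro metadata it writes (for example as a partition field name in manifests).
    data = pa.table({"😎": ["a", "b", "c"]})
    table = catalog.create_table("default.sanitized_names", schema=data.schema)

    # Partition by the offending column so its name shows up in manifest files.
    with table.update_spec() as update:
        update.add_identity("😎")

    table.append(data)

    # Read the table back with DuckDB's iceberg extension, straight from the
    # current metadata file, and check the row count.
    con = duckdb.connect()
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    count = con.execute(
        f"SELECT count(*) FROM iceberg_scan('{table.metadata_location}')"
    ).fetchone()[0]
    assert count == 3
```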

@kevinjqliu
Contributor

Following up on #2123 (comment) and #2123 (comment):

I see that Spark produces _xD83D_xDE0E; pyiceberg previously produced \uD83D\uDE0E but now produces _x1F60E.
It might be a good idea to add integration tests for pyiceberg reading a Spark-written table and Spark reading a pyiceberg-written table.
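
For reference, a minimal sketch of the code-point-based escaping described above (not PyIceberg's actual implementation; the function name and regex are illustrative, and the extra rule for names that start with a digit is left out):

```python
import re

# Avro names must match [A-Za-z_][A-Za-z0-9_]*; anything else has to be escaped.
_VALID_AVRO_CHAR = re.compile(r"[A-Za-z0-9_]")


def sanitize_avro_name(name: str) -> str:
    """Replace characters that are invalid in Avro names with _x<HEX> escapes."""
    out = []
    for ch in name:
        if _VALID_AVRO_CHAR.match(ch):
            out.append(ch)
        else:
            # Escape the full Unicode code point, e.g. U+1F60E -> _x1F60E.
            out.append(f"_x{ord(ch):X}")
    return "".join(out)


# Java/Spark escape each UTF-16 code unit instead, so the same emoji (the
# surrogate pair D83D/DE0E) comes out as _xD83D_xDE0E there.
print(sanitize_avro_name("😎"))  # _x1F60E
```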

@kris-gaudel
Contributor Author

@kevinjqliu Thanks for the feedback! Regarding adding the Spark integration test: I've been running into some issues running integration tests locally that use Spark:

sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

Do you know what could be causing this?

@kris-gaudel kris-gaudel requested a review from kevinjqliu August 2, 2025 03:13
@kevinjqliu
Contributor

Do you know what could be causing this?

Looks like CI passed, so it could be an issue with your local setup. I put that into ChatGPT; it says:

🧠 Root Causes:
Running on a restricted JVM (like GraalVM Native Image or a non-Oracle/OpenJDK distribution)

Using a Java version where these internals are restricted (like Java 17+ with strong encapsulation)

Running in a context that limits reflection or unsafe access (e.g., with --illegal-access=deny)

Spark compiled with a newer Java version but running on an incompatible JVM

what java version are you using?

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for working on this.

This is quite close. I added a comment about actually using duckdb to read, and also a nit about consolidating some of the tests so that they're easier to maintain in the future.

Contributor

@Fokko Fokko left a comment

Thanks @kris-gaudel for working on this! I agree with @kevinjqliu wrt maybe consolidating some of the tests.

@kris-gaudel
Contributor Author

kris-gaudel commented Aug 5, 2025

what java version are you using?

I have Java 22 active, but I will try downgrading my version and see if that fixes it. Thank you for looking into it!

@kevinjqliu
Contributor

I didn't even know there's a Java 22 already, haha.
That's likely the reason; Iceberg Java only supports up to 21:
https://github.com/apache/iceberg/blob/7ffc718d2857c1f4e4e7e1d70eebc8662020d6bd/build.gradle#L81

@kris-gaudel
Contributor Author

kris-gaudel commented Aug 6, 2025

OK, so I downgraded my Java version and the integration tests run now, but I'm still hitting some failures that look Spark-related.

Edit: Turns out some Apache Arrow settings I had been tweaking were to blame; the tests pass locally now!

@kevinjqliu
Contributor

nice! @kris-gaudel I think we're almost good to go

Could you address these 2 comments when you get a chance?
#2245 (comment)
#2245 (comment)

@kris-gaudel
Contributor Author

kris-gaudel commented Aug 6, 2025

@kevinjqliu Done! Lmk if I missed anything 😄

@kris-gaudel kris-gaudel requested review from Fokko and kevinjqliu August 6, 2025 17:34
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thank you for cleaning up and adding the additional integration tests

@kevinjqliu kevinjqliu merged commit b6a45ed into apache:main Aug 6, 2025
10 checks passed
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025

Closes apache#2123

# Rationale for this change
Fixing sanitization behaviour to match specification and Java
implementation

# Are these changes tested?
Yes - Unit and integration tests

# Are there any user-facing changes?
Yes - Field names will be sanitized to be Avro compatible if not already

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[bug] Schema validation should reject field names that are invalid Avro identifiers
