partition field names validation against schema field conflicts #2305

rutb327 · 2025-08-11T22:22:08Z

Closes #2272
Collaborator: @geruh

Rationale for this change

Implements the validation logic described in #2272 to match Java and Rust behavior for partition field name conflicts with schema fields.
This mirrors Java's checkAndAddPartitionName() method:
https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L392-L416

Identity transforms (sourceColumnID != null): allow a schema field name conflict only when the partition field is sourced from the same field.
Non-identity transforms (sourceColumnID == null): disallow any schema field name conflict.

In this PR, isinstance(transform, (IdentityTransform, VoidTransform)) is used to achieve the same logic as Java’s sourceColumnID check.
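The rule above can be sketched without any pyiceberg imports. This is a minimal illustration, not the code in this PR: `PartitionFieldSketch`, `validate_partition_name_sketch`, the string transform names, and the plain name-to-id dict standing in for a schema are all hypothetical, and the wording of the second error message is illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PartitionFieldSketch:
    """Hypothetical stand-in for a partition field; not a pyiceberg class."""

    name: str
    source_id: int
    transform: str  # e.g. "identity", "void", "day", "bucket[16]"


def validate_partition_name_sketch(
    field: PartitionFieldSketch, schema_ids_by_name: Dict[str, int]
) -> None:
    """Raise ValueError if the partition name clashes with a schema field name."""
    schema_field_id = schema_ids_by_name.get(field.name)
    if schema_field_id is None:
        return  # No conflict if the name does not exist in the schema

    if field.transform in ("identity", "void"):
        # Identity and void transforms may reuse a schema field name,
        # but only when they are sourced from that very field.
        if schema_field_id != field.source_id:
            raise ValueError(
                f"Cannot create identity partition sourced from different field in schema: {field.name}"
            )
    else:
        # Any other transform may not reuse a schema field name at all.
        # (Illustrative message; the real wording may differ.)
        raise ValueError(f"Partition name conflicts with a schema field: {field.name}")


schema = {"id": 1, "ts": 2}
validate_partition_name_sketch(PartitionFieldSketch("id", 1, "identity"), schema)  # allowed
validate_partition_name_sketch(PartitionFieldSketch("ts_day", 2, "day"), schema)  # allowed
```

With the same `schema`, a `day` transform named `ts` or an identity partition named `id` sourced from field 2 would both raise `ValueError`.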

Are these changes tested?

Yes. All existing tests pass, and I added a test covering the validation scenarios.

Are there any user-facing changes?

Yes. Non-identity transforms can no longer use schema field names as partition field names.

rutb327 · 2025-08-12T18:09:47Z

In Java, all partition-schema validation goes through https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L392-L416 during table creation with partition specs, partition spec updates, and schema evolution.
In Python, the validation in https://github.com/apache/iceberg-python/blob/d1c6005ad05166ab0fb08d3c15ccdfd7568e8013/pyiceberg/table/update/spec.py only covered partition spec updates.
So, I've added the validation to:

Are these the correct locations for the validation logic, or should they be placed elsewhere?

dingo4dev

Thanks for your work on this!

To improve readability and keep related code together, what are your thoughts on placing all the partition validation logic inside the partitioning.py file? Centralizing it there could make the validation process easier for future contributors to find and understand.

Let me know what you think! @kevinjqliu

kevinjqliu

Thank you for the PR! I left a few comments. I like how we check for conflicts on both changes to the PartitionSpec and changes to the Schema.

I've double-checked that there are only 2 places that modify PartitionSpec, assign_fresh_partition_spec_ids and UpdateSpec._apply, and we covered both with tests :)
Similarly, we cover the 1 place that modifies Schema, in UpdateSchema._apply.

I think both Java and Rust lack a test that checks PartitionSpec for conflicts when the Schema is changed.
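To make the schema-side check concrete, here is a rough sketch of re-checking every existing partition spec against an evolved schema. It assumes a simplified model: `find_conflicts` is a hypothetical helper, a schema is a plain name-to-field-id dict, and a spec is a list of (name, source id, transform name) tuples rather than pyiceberg objects.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape: (partition name, source field id, transform name)
PartitionFieldTuple = Tuple[str, int, str]


def find_conflicts(
    specs: Iterable[List[PartitionFieldTuple]], new_schema: Dict[str, int]
) -> List[str]:
    """Return partition names from any spec that conflict with new_schema."""
    conflicts: List[str] = []
    for spec in specs:
        for name, source_id, transform in spec:
            schema_id = new_schema.get(name)
            if schema_id is None:
                continue  # Name is not a schema field, so no conflict
            # Only identity/void partitions sourced from the same field may share a name.
            same_source = transform in ("identity", "void") and schema_id == source_id
            if not same_source:
                conflicts.append(name)
    return conflicts


specs = [[("ts_day", 2, "day")]]
before = {"id": 1, "ts": 2}
after = {"id": 1, "ts": 2, "ts_day": 3}  # a new column reuses the partition name
print(find_conflicts(specs, before))
print(find_conflicts(specs, after))
```

The point of the sketch is that a spec that was valid when it was created can become invalid purely through a schema change, which is why the check is also needed on the schema update path.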

kevinjqliu · 2025-08-16T21:04:01Z

tests/integration/test_partition_evolution.py

+def _create_table_with_schema(
+    catalog: Catalog, schema: Schema, format_version: str, partition_spec: Optional[PartitionSpec] = None
+) -> Table:


following other create table helpers in tests, for example

iceberg-python/tests/integration/test_register_table.py

Lines 40 to 59 in 8013545

def _create_table(
    session_catalog: Catalog,
    identifier: str,
    format_version: int,
    location: str,
    partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
    schema: Schema = TABLE_SCHEMA,
) -> Table:
    try:
        session_catalog.drop_table(identifier=identifier)
    except NoSuchTableError:
        pass

    return session_catalog.create_table(
        identifier=identifier,
        schema=schema,
        location=location,
        properties={"format-version": str(format_version)},
        partition_spec=partition_spec,
    )

Suggested change

-def _create_table_with_schema(
-    catalog: Catalog, schema: Schema, format_version: str, partition_spec: Optional[PartitionSpec] = None
-) -> Table:
+def _create_table_with_schema(
+    catalog: Catalog, schema: Schema, format_version: str, partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC
+) -> Table:

kevinjqliu · 2025-08-16T21:04:55Z

tests/integration/test_partition_evolution.py

+    if partition_spec:
+        return catalog.create_table(
+            identifier=tbl_name, schema=schema, partition_spec=partition_spec, properties={"format-version": format_version}
+        )
    return catalog.create_table(identifier=tbl_name, schema=schema, properties={"format-version": format_version})


and then we can just do this

Suggested change

-    if partition_spec:
-        return catalog.create_table(
-            identifier=tbl_name, schema=schema, partition_spec=partition_spec, properties={"format-version": format_version}
-        )
-    return catalog.create_table(identifier=tbl_name, schema=schema, properties={"format-version": format_version})
+    return catalog.create_table(
+        identifier=tbl_name, schema=schema, partition_spec=partition_spec, properties={"format-version": format_version}
+    )

kevinjqliu · 2025-08-16T22:18:15Z

pyiceberg/partitioning.py

+        return  # No conflict if field doesn't exist in schema
+
+    if isinstance(partition_transform, (IdentityTransform, VoidTransform)):
+        # For identity transforms, allow conflict only if sourced from the same schema field


Suggested change

-        # For identity transforms, allow conflict only if sourced from the same schema field
+        # For identity and void transforms, allow conflict only if sourced from the same schema field

kevinjqliu · 2025-08-16T22:36:10Z

pyiceberg/partitioning.py

+            raise ValueError(f"Cannot create identity partition from a different source field in the schema: {field_name}")
+    else:


match java error message

Suggested change

-            raise ValueError(f"Cannot create identity partition from a different source field in the schema: {field_name}")
-    else:
+            raise ValueError(f"Cannot create identity partition sourced from different field in schema: {field_name}")
+    else:

kevinjqliu · 2025-08-16T22:38:28Z

pyiceberg/table/update/spec.py

+            from pyiceberg.partitioning import validate_partition_name
+
+            validate_partition_name(name, transform, source_id, schema)
            if not name:


wdyt about moving L183-L186 into the validate_partition_name to mirror the java impl

https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L412-L414

We can do that

kevinjqliu · 2025-08-16T22:47:30Z

pyiceberg/table/update/spec.py

+            _check_and_add_partition_name(
+                self._transaction.table_metadata.schema(),
+                added_field.name,
+                added_field.source_id,
+                added_field.transform,
+                partition_names,
+            )


good catch. just to confirm this covers the newly added partition fields?

yes, that's correct
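For readers following the partition_names argument in the hunk above, a minimal sketch of a check-and-add step is shown here. It assumes the step rejects empty names and names already used within the same spec before recording the name; `check_and_add_name_sketch` is a hypothetical helper and the error messages are illustrative, not copied from pyiceberg or Java.

```python
from typing import Set


def check_and_add_name_sketch(name: str, seen_names: Set[str]) -> None:
    """Reject empty or repeated partition names, then record the name."""
    if not name:
        raise ValueError("Cannot use empty partition name")
    if name in seen_names:
        raise ValueError(f"Cannot use partition name more than once: {name}")
    seen_names.add(name)


partition_names: Set[str] = set()
check_and_add_name_sketch("ts_day", partition_names)
check_and_add_name_sketch("id_bucket", partition_names)
```

Passing the same set through every call is what lets newly added fields be checked against both the existing fields and each other.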

kevinjqliu · 2025-08-16T22:53:36Z

pyiceberg/table/update/schema.py

+        if self._transaction is not None:
+            from pyiceberg.partitioning import validate_partition_name
+
+            for spec in self._transaction.table_metadata.partition_specs:
+                for partition_field in spec.fields:
+                    validate_partition_name(
+                        partition_field.name, partition_field.transform, partition_field.source_id, new_schema
+                    )


i think there should always be a self._transaction

Suggested change

-        if self._transaction is not None:
-            from pyiceberg.partitioning import validate_partition_name
-
-            for spec in self._transaction.table_metadata.partition_specs:
-                for partition_field in spec.fields:
-                    validate_partition_name(
-                        partition_field.name, partition_field.transform, partition_field.source_id, new_schema
-                    )
+        from pyiceberg.partitioning import validate_partition_name
+
+        for spec in self._transaction.table_metadata.partition_specs:
+            for partition_field in spec.fields:
+                validate_partition_name(
+                    partition_field.name, partition_field.transform, partition_field.source_id, new_schema
+                )

okay, I'll do the suggested changes

Some tests show that transaction can be None: after removing the check, tests in test_schema.py fail. They use UpdateSchema(transaction=None, schema=Schema()):
https://github.com/rutb327/iceberg-python/blob/24b12ddd8fdab4a62650786a2c3cdd56a53f8719/tests/test_schema.py#L933

looks like everywhere else in the codebase we include transaction in UpdateSchema.

Maybe we can update the tests like this

def test_add_top_level_primitives(primitive_fields: List[NestedField], table_v2: Table) -> None:
    for primitive_field in primitive_fields:
        new_schema = Schema(primitive_field)
        applied = UpdateSchema(transaction=Transaction(table_v2), schema=Schema()).union_by_name(new_schema)._apply()  # type: ignore
        assert applied == new_schema

kevinjqliu · 2025-08-16T23:22:38Z

I opened apache/iceberg#13833 and apache/iceberg-rust#1609 for checking for name conflict during schema update

kevinjqliu · 2025-08-19T05:39:56Z

pyiceberg/table/update/schema.py

+        if self._transaction is not None:
+            from pyiceberg.partitioning import validate_partition_name
+
+            for spec in self._transaction.table_metadata.partition_specs:
+                for partition_field in spec.fields:
+                    validate_partition_name(
+                        partition_field.name, partition_field.transform, partition_field.source_id, new_schema
+                    )


looks like everywhere else in the codebase we include transaction in UpdateSchema.

Maybe we can update the tests like this

def test_add_top_level_primitives(primitive_fields: List[NestedField], table_v2: Table) -> None:
    for primitive_field in primitive_fields:
        new_schema = Schema(primitive_field)
        applied = UpdateSchema(transaction=Transaction(table_v2), schema=Schema()).union_by_name(new_schema)._apply()  # type: ignore
        assert applied == new_schema

Fokko · 2025-08-20T20:38:25Z

Let's move this forward, thanks @rutb327 for working on this, and thanks @kevinjqliu and @dingo4dev for the review 🙌

partition field names validation against schema field conflicts

92a29e8

dingo4dev reviewed Aug 13, 2025

partition-schema name conflict validation function added

284250b

rutb327 force-pushed the issue2272 branch from 7f530ba to 284250b Compare August 14, 2025 23:02

kevinjqliu reviewed Aug 16, 2025

kevinjqliu requested a review from Fokko August 16, 2025 23:01

kevinjqliu mentioned this pull request Aug 16, 2025

bug: validate schema and partition field name conflicts during updates apache/iceberg#13833

Open

3 tasks

validate_partition_name function update

6cf4a51

kevinjqliu reviewed Aug 19, 2025

kevinjqliu and others added 2 commits August 19, 2025 06:27

fix test_schema

e63bedf

Update tests/integration/test_partition_evolution.py

61b1b6d

Fokko reviewed Aug 20, 2025

Fokko reviewed Aug 20, 2025

rutb327 and others added 2 commits August 20, 2025 08:41

Update pyiceberg/partitioning.py

d0b9053

Co-authored-by: Fokko Driesprong <fokko@apache.org>

tests update

252a4e6

Fokko approved these changes Aug 20, 2025

Fokko merged commit 5a781df into apache:main Aug 20, 2025
10 checks passed

zyd14 mentioned this pull request Oct 19, 2025

[dagster-iceberg] in pyiceberg 0.10.0 partition field name cannot match existing schema field name dagster-io/community-integrations#240

Closed
