Feature Request / Improvement
Currently, create_table accepts either a pyiceberg.schema.Schema or a pyarrow.Schema. The API ignores the field_ids provided in the schema and assigns a new set of IDs instead.
Not only does this cause confusion for users, because pyiceberg.schema.Schema requires field_ids, but it also means that a table cannot be created with the guarantee that its schema will keep the field IDs of the provided schema. This prevents the API from being used for table migrations (discussion on example use case), where a user would want to take the following steps:
- catalog.load_table to load the pyiceberg.table.Table of an existing Iceberg table.
- Get the pyiceberg.schema.Schema of the loaded table.
- Create a new table in the target catalog using catalog.create_table with the existing table's schema.
- Copy the files over into the new table using add_files (this is not possible yet, but there is a discussion that would allow add_files to work on a table whose schema has matching field_ids).
The above procedure will not work unless we enhance create_table so that the new table keeps the field_ids of the provided Schema.
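For concreteness, here is a rough sketch of that migration flow. The catalog names and the table identifier are made up, and the final step assumes the add_files enhancement discussed above lands:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog names and table identifier, purely for illustration.
source_catalog = load_catalog("source")
target_catalog = load_catalog("target")

# 1. Load the existing Iceberg table.
source_table = source_catalog.load_table("db.events")

# 2. Get its pyiceberg.schema.Schema, field IDs included.
schema = source_table.schema()

# 3. Create the table in the target catalog. Today create_table reassigns
#    the field IDs, which is exactly what breaks this flow.
target_table = target_catalog.create_table("db.events", schema=schema)

# 4. Copy the existing data files over. This step assumes add_files can work
#    against a table whose schema has matching field IDs.
existing_files = [task.file.file_path for task in source_table.scan().plan_files()]
target_table.add_files(file_paths=existing_files)
```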
One way to address this issue would be to support two ways of representing the Iceberg table's Schema:
- If a schema without field_ids is passed into the API, we should create a new pyiceberg.schema.Schema with newly assigned field_ids and use it to create the table.
- If a schema with field_ids is passed into the API, we should use that exact pyiceberg.schema.Schema to create the table.
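Purely as a behavior sketch (today field_id cannot actually be left unset, which is what the rest of this issue is about), the dispatch could look roughly like this, treating a falsy field_id as "not provided" and reusing assign_fresh_schema_ids from pyiceberg.schema:

```python
from pyiceberg.schema import Schema, assign_fresh_schema_ids


def resolve_schema(schema: Schema) -> Schema:
    # Hypothetical: treat a missing/zero field_id as "caller did not choose IDs".
    # A real implementation would also have to inspect nested struct/list/map fields.
    if all(field.field_id for field in schema.fields):
        return schema  # caller-provided field IDs are kept as-is
    return assign_fresh_schema_ids(schema)  # assign a fresh, consistent set
```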
I discuss a few ideas for achieving this below, each with its own pros and cons:
Create a subclass of pyiceberg.schema.Schema without field_id
This sounds like the best approach, but once explored we quickly realize it may be impossible. The main challenge is that pyiceberg.schema.Schema describes its fields using NestedField, a nested structure of pydantic BaseModels with field_id as a required field, so there is no way to create a subclass of pyiceberg.schema.Schema without field_id.
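A quick illustration of why the subclass route falls apart; the exact error text varies by pyiceberg/pydantic version:

```python
from pydantic import ValidationError

from pyiceberg.types import NestedField, StringType

# field_id is a required attribute of the pydantic model, so a subclass cannot
# simply drop it: constructing a field without it fails validation.
try:
    NestedField(name="event", field_type=StringType(), required=True)
except ValidationError as exc:
    print(exc)
```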
Create a variant class of pyiceberg.schema.Schema without field_id
This is a bit different from the above approach: it requires us to create variant classes of pyiceberg.schema.Schema that are not subclassed from it. This is not ideal, because we would have to maintain field_id-less copies of NestedField, StructType, MapType, ListType, and Schema, plus methods that build a field_id'ed Schema from its field_id-less variant. It is possible, but it would be hard and messy to manage.
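To make the maintenance burden concrete, here is a minimal sketch of what just one of those variants might look like. UnboundNestedField and bind_schema are made-up names, and a real version would also have to recurse through struct, list, and map types:

```python
from typing import List, Optional

from pydantic import BaseModel

from pyiceberg.schema import Schema
from pyiceberg.types import IcebergType, NestedField


class UnboundNestedField(BaseModel):
    """Hypothetical field_id-less mirror of NestedField; StructType, ListType,
    MapType, and Schema would all need similar copies."""

    name: str
    field_type: IcebergType
    required: bool = True
    doc: Optional[str] = None


def bind_schema(fields: List[UnboundNestedField]) -> Schema:
    """Assign sequential field IDs while converting to a real Schema
    (top-level fields only; the nested types are the messy part)."""
    bound = [
        NestedField(field_id=i, name=f.name, field_type=f.field_type, required=f.required, doc=f.doc)
        for i, f in enumerate(fields, start=1)
    ]
    return Schema(*bound)
```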
Make field_id an optional attribute of NestedField and the field_id'd Iceberg types
This would allow us to create a pyiceberg.schema.Schema with or without field_ids. However, it opens the door to issues that are currently prevented by NestedField's attributes matching the REST catalog spec exactly. With field_id optional, we would need many more validations across the code base to ensure that field_id is set on every nested field of a schema before it is used.
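For example, every consumer of a schema would need some guard along these lines; ensure_field_ids_assigned is illustrative, and a real check would have to walk nested types too:

```python
from pyiceberg.schema import Schema


def ensure_field_ids_assigned(schema: Schema) -> None:
    """Illustrative validation that would have to run before any code path
    consumes a schema, if field_id were ever allowed to be Optional."""
    missing = [field.name for field in schema.fields if field.field_id is None]
    if missing:
        raise ValueError(f"Schema fields are missing field IDs: {missing}")
```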
Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema
I think this may be the safest approach, and it stays user-friendly: pyiceberg.schema.Schema requires field_ids, and pyarrow.Schema does not. pyarrow.Schema is also a completely different class, so users do not expect the field_ids within a pyarrow.Schema to be carried over into the resulting pyiceberg.schema.Schema (although that is an enhancement we could introduce in the future). If and when we introduce new schema representations as alternate inputs to the API, we can evaluate case by case whether it makes sense to keep the field IDs or assign new ones.
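From the user's point of view, the proposed behavior would look something like this (the catalog calls are shown as comments because they are the part being proposed, not current behavior):

```python
import pyarrow as pa

from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# A pyiceberg Schema: the caller chose these field IDs deliberately, so under
# the proposal the created table would keep IDs 10 and 11.
iceberg_schema = Schema(
    NestedField(field_id=10, name="id", field_type=LongType(), required=True),
    NestedField(field_id=11, name="name", field_type=StringType(), required=False),
)

# A pyarrow Schema: it has no notion of Iceberg field IDs, so the created
# table would get freshly assigned IDs (1, 2, ...), as it does today.
arrow_schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False),
    pa.field("name", pa.string()),
])

# Hypothetical calls, assuming a `catalog` object:
#   catalog.create_table("db.t1", schema=iceberg_schema)  # keeps IDs 10, 11
#   catalog.create_table("db.t2", schema=arrow_schema)    # assigns fresh IDs
```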
I am personally in favor of the last approach: revert to keeping the field_ids of the provided pyiceberg.schema.Schema when the input is of that type, and create a new Schema with freshly assigned IDs when the input is a pyarrow.Schema. The API's behavior would then feel more consistent with how our users are using it and with what they expect the field_ids of the created table's Schema to be in each scenario.
I'd love to hear the thoughts of our community members on this topic before jumping into an implementation.