Feature Request / Improvement
Currently, create_table accepts either a pyiceberg.schema.Schema or a pyarrow.Schema. The API ignores the field_ids provided in the schema and assigns a new set of IDs instead.
Not only does this cause confusion for users, because pyiceberg.schema.Schema requires field_ids, but it also means that a table cannot be created with the guarantee that its schema will keep the field IDs of the provided schema. This prevents the API from being used for table migrations (discussion on example use case), where a user would want to take the following steps:
- catalog.load_table to load the pyiceberg.table.Table of an existing Iceberg table.
- Get the pyiceberg.schema.Schema of the loaded table.
- Create a new table in the target catalog using catalog.create_table with the existing table's schema.
- Copy the files over into the new table using add_files (this is not possible yet, but there is a discussion that would allow add_files to work on a table whose schema has matching field_ids).
The above procedure will not work unless we enhance create_table so that the new table keeps the field_ids of the provided Schema.
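For concreteness, here is a rough sketch of that migration flow. The catalog names and the table identifier are made up, and the final step assumes the add_files enhancement discussed above lands:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog names and table identifier, purely for illustration.
source_catalog = load_catalog("source")
target_catalog = load_catalog("target")

# 1. Load the existing Iceberg table.
source_table = source_catalog.load_table("db.events")

# 2. Get its pyiceberg.schema.Schema, field IDs included.
schema = source_table.schema()

# 3. Create the table in the target catalog. Today create_table reassigns
#    the field IDs, which is exactly what breaks this flow.
target_table = target_catalog.create_table("db.events", schema=schema)

# 4. Copy the existing data files over. This step assumes add_files can work
#    against a table whose schema has matching field IDs.
existing_files = [task.file.file_path for task in source_table.scan().plan_files()]
target_table.add_files(file_paths=existing_files)
```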
One way to address this issue would be to support two ways of representing the Iceberg table's Schema:
- If a schema without field_ids is passed into the API, we should create a new pyiceberg.schema.Schema with newly assigned field_ids and use it to create the table.
- If a schema with field_ids is passed into the API, we should use that exact pyiceberg.schema.Schema to create the table.
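Purely as a behavior sketch (today field_id cannot actually be left unset, which is what the rest of this issue is about), the dispatch could look roughly like this, treating a falsy field_id as "not provided" and reusing assign_fresh_schema_ids from pyiceberg.schema:

```python
from pyiceberg.schema import Schema, assign_fresh_schema_ids


def resolve_schema(schema: Schema) -> Schema:
    # Hypothetical: treat a missing/zero field_id as "caller did not choose IDs".
    # A real implementation would also have to inspect nested struct/list/map fields.
    if all(field.field_id for field in schema.fields):
        return schema  # caller-provided field IDs are kept as-is
    return assign_fresh_schema_ids(schema)  # assign a fresh, consistent set
```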
I discuss a few ideas for achieving this below, each with its own pros and cons:
Create a subclass of pyiceberg.schema.Schema without field_id
This sounds like the best approach, but once explored we quickly realize it may be impossible. The main challenge is that pyiceberg.schema.Schema describes its fields using NestedField, a nested structure of pydantic BaseModels with field_id as a required field, so there is no way to create a subclass of pyiceberg.schema.Schema without field_id.
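A quick illustration of why the subclass route falls apart; the exact error text varies by pyiceberg/pydantic version:

```python
from pydantic import ValidationError

from pyiceberg.types import NestedField, StringType

# field_id is a required attribute of the pydantic model, so a subclass cannot
# simply drop it: constructing a field without it fails validation.
try:
    NestedField(name="event", field_type=StringType(), required=True)
except ValidationError as exc:
    print(exc)
```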
Create a variant class of pyiceberg.schema.Schema without field_id
This is a bit different from the above approach: it requires us to create variant classes of pyiceberg.schema.Schema that are not subclassed from it. This is not ideal, because we would have to maintain field_id-less copies of NestedField, StructType, MapType, ListType, and Schema, plus methods that build a field_id'ed Schema from its field_id-less variant. It is possible, but it would be hard and messy to manage.
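To make the maintenance burden concrete, here is a minimal sketch of what just one of those variants might look like. UnboundNestedField and bind_schema are made-up names, and a real version would also have to recurse through struct, list, and map types:

```python
from typing import List, Optional

from pydantic import BaseModel

from pyiceberg.schema import Schema
from pyiceberg.types import IcebergType, NestedField


class UnboundNestedField(BaseModel):
    """Hypothetical field_id-less mirror of NestedField; StructType, ListType,
    MapType, and Schema would all need similar copies."""

    name: str
    field_type: IcebergType
    required: bool = True
    doc: Optional[str] = None


def bind_schema(fields: List[UnboundNestedField]) -> Schema:
    """Assign sequential field IDs while converting to a real Schema
    (top-level fields only; the nested types are the messy part)."""
    bound = [
        NestedField(field_id=i, name=f.name, field_type=f.field_type, required=f.required, doc=f.doc)
        for i, f in enumerate(fields, start=1)
    ]
    return Schema(*bound)
```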
Make field_id an optional attribute of NestedField and the field_id'd Iceberg types
This would allow us to create a pyiceberg.schema.Schema with or without field_ids. However, it opens the door to issues that are currently prevented by NestedField's attributes matching the REST catalog spec exactly. With field_id optional, we would need many more validations across the code base to ensure that field_id is set on every nested field of a schema before it is used.
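For example, every consumer of a schema would need some guard along these lines; ensure_field_ids_assigned is illustrative, and a real check would have to walk nested types too:

```python
from pyiceberg.schema import Schema


def ensure_field_ids_assigned(schema: Schema) -> None:
    """Illustrative validation that would have to run before any code path
    consumes a schema, if field_id were ever allowed to be Optional."""
    missing = [field.name for field in schema.fields if field.field_id is None]
    if missing:
        raise ValueError(f"Schema fields are missing field IDs: {missing}")
```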
Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema
I think this may be the safest approach, and it stays user-friendly: pyiceberg.schema.Schema requires field_ids, and pyarrow.Schema does not. pyarrow.Schema is also a completely different class, so users do not expect the field_ids within a pyarrow.Schema to be carried over into the resulting pyiceberg.schema.Schema (although that is an enhancement we could introduce in the future). If and when we introduce new schema representations as alternate inputs to the API, we can evaluate case by case whether it makes sense to keep the field IDs or assign new ones.
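From the user's point of view, the proposed behavior would look something like this (the catalog calls are shown as comments because they are the part being proposed, not current behavior):

```python
import pyarrow as pa

from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# A pyiceberg Schema: the caller chose these field IDs deliberately, so under
# the proposal the created table would keep IDs 10 and 11.
iceberg_schema = Schema(
    NestedField(field_id=10, name="id", field_type=LongType(), required=True),
    NestedField(field_id=11, name="name", field_type=StringType(), required=False),
)

# A pyarrow Schema: it has no notion of Iceberg field IDs, so the created
# table would get freshly assigned IDs (1, 2, ...), as it does today.
arrow_schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False),
    pa.field("name", pa.string()),
])

# Hypothetical calls, assuming a `catalog` object:
#   catalog.create_table("db.t1", schema=iceberg_schema)  # keeps IDs 10, 11
#   catalog.create_table("db.t2", schema=arrow_schema)    # assigns fresh IDs
```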
I am personally in favor of the last approach: revert to keeping the field_ids of the provided pyiceberg.schema.Schema when the input is of that type, and create a new Schema with freshly assigned IDs when the input is a pyarrow.Schema. The API's behavior would then feel more consistent with how our users are using it and with what they expect the field_ids of the created table's Schema to be in each scenario.
I'd love to hear the thoughts of our community members on this topic before jumping into an implementation.