# PyIceberg (Python types) ⇄ PyArrow Type Mapping

This document lists **PyIceberg Python type classes** and their corresponding **PyArrow** types based on the provided visitor implementation.
This version uses concrete PyIceberg type names (e.g., `IntegerType`, `TimestampType`) rather than SQL-style tokens (e.g., `INT`, `TIMESTAMP`).

---

## PyIceberg (Python types) → PyArrow

| PyIceberg type class  | PyArrow type                                             |
|-----------------------|----------------------------------------------------------|
| `BooleanType`         | `pa.bool_()`                                             |
| `IntegerType`         | `pa.int32()`                                             |
| `LongType`            | `pa.int64()`                                             |
| `FloatType`           | `pa.float32()`                                           |
| `DoubleType`          | `pa.float64()`                                           |
| `DecimalType(p, s)`   | `pa.decimal128(p, s)`                                    |
| `DateType`            | `pa.date32()`                                            |
| `TimeType`            | `pa.time64("us")`                                        |
| `TimestampType`       | `pa.timestamp("us")`                                     |
| `TimestampNanoType`   | `pa.timestamp("ns")`                                     |
| `TimestamptzType`     | `pa.timestamp("us", tz="UTC")`                           |
| `TimestamptzNanoType` | `pa.timestamp("ns", tz="UTC")`                           |
| `StringType`          | `pa.large_string()`                                      |
| `UUIDType`            | `pa.uuid()`                                              |
| `BinaryType`          | `pa.large_binary()`                                      |
| `FixedType(L)`        | `pa.binary(L)`                                           |
| `StructType`          | `pa.struct([...])` *(fields via `pa.field`)*             |
| `ListType(e)`         | `pa.large_list(value_type=<element field>)`              |
| `MapType(k, v)`       | `pa.map_(key_type=<key field>, item_type=<value field>)` |
| `UnknownType`         | `pa.null()`                                              |

- **Field construction:** `pa.field(name, type, nullable=field.optional, metadata={...})`
- **Metadata:** `parquet.field.id` (when `include_field_ids=True`) and `doc` if present.

No other types are supported by the visitor.

---

## PyArrow → PyIceberg (Python types)

| PyArrow type                                         | PyIceberg type class  |
|------------------------------------------------------|-----------------------|
| `pa.bool_()`                                         | `BooleanType`         |
| `pa.int32()`                                         | `IntegerType`         |
| `pa.int64()`                                         | `LongType`            |
| `pa.float32()`                                       | `FloatType`           |
| `pa.float64()`                                       | `DoubleType`          |
| `pa.decimal128(p, s)`                                | `DecimalType(p, s)`   |
| `pa.date32()`                                        | `DateType`            |
| `pa.time64("us")`                                    | `TimeType`            |
| `pa.timestamp("us")`                                 | `TimestampType`       |
| `pa.timestamp("ns")`                                 | `TimestampNanoType`   |
| `pa.timestamp("us", tz="UTC")`                       | `TimestamptzType`     |
| `pa.timestamp("ns", tz="UTC")`                       | `TimestamptzNanoType` |
| `pa.large_string()` or `pa.string()`                 | `StringType`          |
| `pa.uuid()`                                          | `UUIDType`            |
| `pa.large_binary()` or variable-length `pa.binary()` | `BinaryType`          |
| fixed-size `pa.binary(L)`                            | `FixedType(L)`        |
| `pa.struct([...])`                                   | `StructType`          |
| `pa.large_list(<element>)` or `pa.list_(<element>)`  | `ListType(e)`         |
| `pa.map_(key_type, item_type)`                       | `MapType(k, v)`       |
| `pa.null()`                                          | `UnknownType`         |

No other types are supported by the visitor.
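
As a small worked example, the four timestamp rows of the table can be read as a lookup keyed on `(unit, tz)`. The sketch below is library-free and only encodes the rows listed here; `timestamp_class_name` is a hypothetical name, and a real visitor may accept other spellings of UTC or handle other units differently.

```python
# Timestamp rows copied from the table above: (unit, tz) -> PyIceberg class name.
_TIMESTAMP_CLASSES = {
    ("us", None): "TimestampType",
    ("ns", None): "TimestampNanoType",
    ("us", "UTC"): "TimestamptzType",
    ("ns", "UTC"): "TimestamptzNanoType",
}


def timestamp_class_name(unit, tz=None):
    """Return the class name listed for a pa.timestamp(unit, tz) combination."""
    try:
        return _TIMESTAMP_CLASSES[(unit, tz)]
    except KeyError:
        # Combinations not in the table are treated as unsupported in this sketch.
        raise TypeError(f"unsupported timestamp: unit={unit!r}, tz={tz!r}") from None
```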

---

## Notes and Caveats

- **Strings & binaries:** The visitor emits `large_string` and `large_binary` (64-bit offsets). The 32-bit `string` and `binary` types still map to `StringType` and `BinaryType` when converting back.
- **Timestamps:** Default precision is microseconds (`"us"`); nanosecond precision requires the explicit `*NanoType` classes. Time-zone-aware timestamps are always emitted with `tz="UTC"`.
- **Decimals:** Always mapped to `decimal128` in Arrow, which avoids relying on the partial support for `decimal32`/`decimal64`.
- **Lists & Maps:** Element, key, and value types are wrapped in `pa.field`s so that nullability (`field.optional`) is preserved.
- **Field IDs & docs:** Preserved in Arrow field metadata (`parquet.field.id`, `doc`).
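
One consequence of the two tables is that an Arrow → PyIceberg → Arrow round trip is not always the identity: 32-bit-offset types come back as their `large_*` counterparts. The following library-free sketch models this with type names as plain strings; the dictionaries only restate rows from the tables, and `round_trip` is a hypothetical helper.

```python
# Arrow -> PyIceberg: several Arrow types collapse into one PyIceberg class.
ARROW_TO_ICEBERG = {
    "string": "StringType",
    "large_string": "StringType",
    "binary": "BinaryType",
    "large_binary": "BinaryType",
    "list": "ListType",
    "large_list": "ListType",
}

# PyIceberg -> Arrow: the visitor always emits the large variant.
ICEBERG_TO_ARROW = {
    "StringType": "large_string",
    "BinaryType": "large_binary",
    "ListType": "large_list",
}


def round_trip(arrow_type_name):
    """Map an Arrow type name to PyIceberg and back again."""
    return ICEBERG_TO_ARROW[ARROW_TO_ICEBERG[arrow_type_name]]
```

Code that compares schemas before and after conversion should therefore expect `string` to come back as `large_string`, and likewise for `binary` and `list`.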

---