Commit 710f22d

docs: added pyiceberg to pyarrow conversion documentation
Author: Rashampreet Singh
1 parent a56795f commit 710f22d

1 file changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# PyIceberg (Python types) ⇄ PyArrow Type Mapping

This document lists **PyIceberg Python type classes** and their corresponding **PyArrow** types, based on the provided visitor implementation.
This version uses concrete PyIceberg type names (e.g., `IntegerType`, `TimestampType`) rather than SQL-style tokens (e.g., `INT`, `TIMESTAMP`).

---

## PyIceberg (Python types) → PyArrow

| PyIceberg type class            | PyArrow type                                             |
|---------------------------------|----------------------------------------------------------|
| `BooleanType`                   | `pa.bool_()`                                             |
| `IntegerType`                   | `pa.int32()`                                             |
| `LongType`                      | `pa.int64()`                                             |
| `FloatType`                     | `pa.float32()`                                           |
| `DoubleType`                    | `pa.float64()`                                           |
| `DecimalType(p, s)`             | `pa.decimal128(p, s)`                                    |
| `DateType`                      | `pa.date32()`                                            |
| `TimeType`                      | `pa.time64("us")`                                        |
| `TimestampType`                 | `pa.timestamp("us")`                                     |
| `TimestampNanoType`             | `pa.timestamp("ns")`                                     |
| `TimestamptzType`               | `pa.timestamp("us", tz="UTC")`                           |
| `TimestamptzNanoType`           | `pa.timestamp("ns", tz="UTC")`                           |
| `StringType`                    | `pa.large_string()`                                      |
| `UUIDType`                      | `pa.uuid()`                                              |
| `BinaryType`                    | `pa.large_binary()`                                      |
| `FixedType(L)`                  | `pa.binary(L)`                                           |
| `StructType`                    | `pa.struct([...])` *(fields via `pa.field`)*             |
| `ListType(e)`                   | `pa.large_list(value_type=<element field>)`              |
| `MapType(k, v)`                 | `pa.map_(key_type=<key field>, item_type=<value field>)` |
| `UnknownType`                   | `pa.null()`                                              |

- **Field construction:** `pa.field(name, type, nullable=field.optional, metadata={...})`
- **Metadata:** `PARQUET:field_id` (set when `include_field_ids=True`) and `doc` (set when the field has a doc string).

No other types are supported by the visitor.

---

## PyArrow → PyIceberg (Python types)

| PyArrow type                                  | PyIceberg type class        |
|-----------------------------------------------|-----------------------------|
| `pa.bool_()`                                  | `BooleanType`               |
| `pa.int32()`                                  | `IntegerType`               |
| `pa.int64()`                                  | `LongType`                  |
| `pa.float32()`                                | `FloatType`                 |
| `pa.float64()`                                | `DoubleType`                |
| `pa.decimal128(p, s)`                         | `DecimalType(p, s)`         |
| `pa.date32()`                                 | `DateType`                  |
| `pa.time64("us")`                             | `TimeType`                  |
| `pa.timestamp("us")`                          | `TimestampType`             |
| `pa.timestamp("ns")`                          | `TimestampNanoType`         |
| `pa.timestamp("us", tz="UTC")`                | `TimestamptzType`           |
| `pa.timestamp("ns", tz="UTC")`                | `TimestamptzNanoType`       |
| `pa.large_string()` or `pa.string()`          | `StringType`                |
| `pa.uuid()`                                   | `UUIDType`                  |
| `pa.large_binary()` or variable `pa.binary()` | `BinaryType`                |
| fixed-size `pa.binary(L)`                     | `FixedType(L)`              |
| `pa.struct([...])`                            | `StructType`                |
| `pa.large_list(<element>)` or `pa.list_()`    | `ListType(e)`               |
| `pa.map_(key_type, item_type)`                | `MapType(k, v)`             |
| `pa.null()`                                   | `UnknownType`               |

No other types are supported by the visitor.

---

## Notes and Caveats

- **Strings & binaries:** The visitor emits `pa.large_string()` and `pa.large_binary()` (64-bit offsets). In the reverse direction, `pa.string()` and `pa.binary()` (32-bit offsets) also map to `StringType` and `BinaryType`.
- **Timestamps:** `TimestampType` and `TimestamptzType` use microsecond precision (`"us"`); nanosecond precision requires the explicit `TimestampNanoType` and `TimestamptzNanoType`. Zoned timestamps are emitted with `tz="UTC"`.
- **Decimals:** Always emitted as `pa.decimal128(p, s)` to avoid relying on the partial support for `decimal32`/`decimal64`.
- **Lists & Maps:** Element, key, and value types are wrapped in `pa.field`s, which preserves their nullability (`field.optional`).
- **Field IDs & docs:** Preserved in Arrow field metadata (`PARQUET:field_id`, `doc`).

---
