Commit bdf19ab
authored
fix: allow reading pyarrow timestamp as iceberg timestamptz (#2333)
<!--
Thanks for opening a pull request!
-->
<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->
# Rationale for this change
This PR fix reading pyarrow timestamp as Iceberg timestamptz type. It
mirrors the pyarrow logic for dealing with pyarrow timestamp types
[here](https://github.com/apache/iceberg-python/blob/8b43eb88fcc80b4980ce16b71852d211d7e93b13/pyiceberg/io/pyarrow.py#L1330-L1353)
Two changes were made to `ArrowProjectionVisitor._cast_if_needed`
1. reorder the logic so that we handle dealing with timestamp first.
Otherwise, it will try to `promote()` timestamp to timestamptz and fail.
2. allow casting when the pyarrow's value has `None` timezone. This is
allowed because we gate on the target type has "UTC" timezone. It
mirrors the java logic for reading with default UTC timezone
([1](https://github.com/apache/iceberg/blob/856cbf6eb8a85dee01c65ae6291274b700f76746/core/src/main/java/org/apache/iceberg/data/avro/GenericReaders.java#L107-L116),
[2](https://github.com/apache/iceberg/blob/856cbf6eb8a85dee01c65ae6291274b700f76746/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java#L35))
### Context
I ran into an interesting edge case while testing metadata
virtualization between delta and iceberg.
Delta has both [TIMESTAMP and TIMESTAMP_NTZ data
types](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-datatypes).
TIMESTAMP has a timezone while TIMESTAMP_NTZ has no timezone.
While Iceberg has [timestamp and
timestamptz](https://iceberg.apache.org/spec/#primitive-types).
timestamp has no timezone and timestamptz has a timezone.
So Delta's TIMESTAMP -> Iceberg timestamptz and Delta's TIMESTAMP_NTZ ->
Iceberg timestamp.
Regardless of delta or iceberg, the [parquet file stores timestamp
without the timezone
information](https://github.com/apache/parquet-format/blob/1dbc814b97c9307687a2e4bee55545ab6a2ef106/LogicalTypes.md#timestamp)
So I end up a parquet file with timestamp column, and an iceberg table
with timestamptz column, and pyiceberg cannot read this table.
Its hard to recreate the scenario but i did trace it to the
`_to_requested_schema` function. I added a unit test for this case.
The issue is that `ArrowProjectionVisitor._cast_if_needed` will try to
promote `timestamp` to `timstamptz` and this is not a valid promotion.
```
E pyiceberg.exceptions.ResolveError: Cannot promote timestamp to timestamptz
```
https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L1779-L1782
The `elif` case below that can handle this case
https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L1800-L1806
So maybe we just need to switch the order of execution...
This was also an interesting read..
https://arrow.apache.org/docs/python/timestamps.html
# Are these changes tested?
# Are there any user-facing changes?
<!-- In the case of user-facing changes, please add the changelog label.
-->1 parent 5acca48 commit bdf19ab
2 files changed
+50
-9
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1788 | 1788 | | |
1789 | 1789 | | |
1790 | 1790 | | |
1791 | | - | |
1792 | | - | |
1793 | | - | |
1794 | | - | |
1795 | | - | |
1796 | | - | |
1797 | | - | |
1798 | | - | |
| 1791 | + | |
1799 | 1792 | | |
1800 | 1793 | | |
1801 | 1794 | | |
| |||
1814 | 1807 | | |
1815 | 1808 | | |
1816 | 1809 | | |
1817 | | - | |
| 1810 | + | |
1818 | 1811 | | |
1819 | 1812 | | |
1820 | 1813 | | |
1821 | 1814 | | |
1822 | 1815 | | |
1823 | 1816 | | |
| 1817 | + | |
| 1818 | + | |
| 1819 | + | |
| 1820 | + | |
| 1821 | + | |
| 1822 | + | |
| 1823 | + | |
| 1824 | + | |
| 1825 | + | |
1824 | 1826 | | |
1825 | 1827 | | |
1826 | 1828 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2550 | 2550 | | |
2551 | 2551 | | |
2552 | 2552 | | |
| 2553 | + | |
| 2554 | + | |
| 2555 | + | |
| 2556 | + | |
| 2557 | + | |
| 2558 | + | |
| 2559 | + | |
| 2560 | + | |
| 2561 | + | |
| 2562 | + | |
| 2563 | + | |
| 2564 | + | |
| 2565 | + | |
| 2566 | + | |
| 2567 | + | |
| 2568 | + | |
| 2569 | + | |
| 2570 | + | |
| 2571 | + | |
| 2572 | + | |
| 2573 | + | |
| 2574 | + | |
| 2575 | + | |
| 2576 | + | |
| 2577 | + | |
| 2578 | + | |
| 2579 | + | |
| 2580 | + | |
| 2581 | + | |
| 2582 | + | |
| 2583 | + | |
| 2584 | + | |
| 2585 | + | |
| 2586 | + | |
| 2587 | + | |
| 2588 | + | |
| 2589 | + | |
| 2590 | + | |
| 2591 | + | |
2553 | 2592 | | |
2554 | 2593 | | |
2555 | 2594 | | |
| |||
0 commit comments