Commit 7a56ddb
Arrow: Infer the types when reading (#1669)
### Rationale for this change
Time to give this another go 😆
When reading a Parquet file using PyArrow, there is some metadata stored
in the Parquet file that makes a column either a large type (e.g. `large_string`)
or a normal type (`string`). The difference is that the large types use
a 64-bit offset to encode their arrays. This is not always needed, and
we could first check the types with which the data was stored, and
let PyArrow decide here:
https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/io/pyarrow.py#L1579
In PyIceberg today we just bump everything to a large type, which might
lead to additional memory consumption because it allocates an int64
array to hold the offsets, instead of an int32 array.
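
To make the offset overhead concrete, here is a minimal sketch in plain Python. It is not PyArrow or PyIceberg code: the helper `offsets_buffer` is hypothetical and only mimics how a variable-length Arrow string array keeps one offset per value plus a trailing one, packed as int32 for `string` or int64 for `large_string`.

```python
import struct


def offsets_buffer(values, fmt):
    """Pack Arrow-style offsets (n + 1 entries) for a list of strings.

    `fmt` is a struct type code: "i" for int32 offsets, "q" for int64 offsets.
    This is an illustrative helper, not part of any library.
    """
    offsets = [0]
    for value in values:
        # Each offset marks where the next value starts in the data buffer.
        offsets.append(offsets[-1] + len(value.encode("utf-8")))
    # "<" selects little-endian with standard sizes and no padding.
    return struct.pack(f"<{len(offsets)}{fmt}", *offsets)


values = ["iceberg", "arrow", "parquet"]
small = offsets_buffer(values, "i")  # int32 offsets, as in `string`
large = offsets_buffer(values, "q")  # int64 offsets, as in `large_string`

print(len(small), len(large))  # offset buffer sizes in bytes
```

The data bytes are identical in both cases; only the offsets buffer doubles in size, which matters most for columns with many short values.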
I thought we would be good to go for this now that the lower bound
of PyArrow has been raised to 17. But it looks like we still have to wait
for Arrow 18 to fix the issue with the `date` types:
apache/arrow#43183
Fixes: #1049
### Are these changes tested?
Yes, existing tests :)
### Are there any user-facing changes?
Before, PyIceberg would always return the large Arrow types (eg,
`large_string` instead of `string`). After this change, it will return
the type it was written with.

1 parent 62191ee commit 7a56ddb
File tree
6 files changed, +72 -69 lines changed
- pyiceberg
  - io
  - table
- tests
  - integration
    - test_writes
  - io