Commit 672efe2

Merge remote-tracking branch 'origin/main' into kevinjqliu/fix-schema-comparison
2 parents d05a3fb + 36b56eb commit 672efe2

62 files changed: +5528 -1400 lines changed

.asf.yaml

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -44,6 +44,7 @@ github:
4444
projects: true
4545
collaborators: # Note: the number of collaborators is limited to 10
4646
- ajantha-bhat
47+
- syun64
4748
ghp_branch: gh-pages
4849
ghp_path: /
4950

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
name: Check Markdown links
2+
3+
on:
4+
push:
5+
paths:
6+
- mkdocs/**
7+
8+
jobs:
9+
markdown-link-check:
10+
runs-on: ubuntu-latest
11+
steps:
12+
- uses: actions/checkout@master
13+
- uses: gaurav-nelson/github-action-markdown-link-check@v1

.github/workflows/python-release.yml

Lines changed: 5 additions & 5 deletions
@@ -34,7 +34,7 @@ jobs:
3434
runs-on: ${{ matrix.os }}
3535
strategy:
3636
matrix:
37-
os: [ ubuntu-22.04, windows-2022, macos-11 ]
37+
os: [ ubuntu-22.04, windows-2022, macos-11, macos-12, macos-13, macos-14 ]
3838

3939
steps:
4040
- uses: actions/checkout@v4
@@ -43,7 +43,7 @@ jobs:
4343

4444
- uses: actions/setup-python@v5
4545
with:
46-
python-version: '3.8'
46+
python-version: '3.11'
4747

4848
- name: Install poetry
4949
run: pip install poetry
@@ -59,15 +59,15 @@ jobs:
5959
if: startsWith(matrix.os, 'ubuntu')
6060

6161
- name: Build wheels
62-
uses: pypa/cibuildwheel@v2.16.3
62+
uses: pypa/cibuildwheel@v2.16.5
6363
with:
6464
output-dir: wheelhouse
6565
config-file: "pyproject.toml"
6666
env:
6767
# Ignore 32 bit architectures
6868
CIBW_ARCHS: "auto64"
6969
CIBW_PROJECT_REQUIRES_PYTHON: ">=3.8,<3.12"
70-
CIBW_TEST_REQUIRES: "pytest==7.4.2 moto==4.2.2"
70+
CIBW_TEST_REQUIRES: "pytest==7.4.2 moto==5.0.1"
7171
CIBW_TEST_EXTRAS: "s3fs,glue"
7272
CIBW_TEST_COMMAND: "pytest {project}/tests/avro/test_decoder.py"
7373
# There is an upstream issue with installing on MacOSX
@@ -80,7 +80,7 @@ jobs:
8080
if: startsWith(matrix.os, 'ubuntu')
8181
run: ls -lah dist/* && cp dist/* wheelhouse/
8282

83-
- uses: actions/upload-artifact@v4
83+
- uses: actions/upload-artifact@v3
8484
with:
8585
name: "release-${{ github.event.inputs.version }}"
8686
path: ./wheelhouse/*

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -36,13 +36,13 @@ repos:
3636
- id: ruff-format
3737
args: [ --preview ]
3838
- repo: https://github.com/pre-commit/mirrors-mypy
39-
rev: v1.6.1
39+
rev: v1.8.0
4040
hooks:
4141
- id: mypy
4242
args:
4343
[--install-types, --non-interactive, --config=pyproject.toml]
4444
- repo: https://github.com/hadialqattan/pycln
45-
rev: v2.3.0
45+
rev: v2.4.0
4646
hooks:
4747
- id: pycln
4848
args: [--config=pyproject.toml]

Makefile

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ install-poetry:
1919
pip install poetry==1.7.1
2020

2121
install-dependencies:
22-
poetry install -E pyarrow -E hive -E s3fs -E glue -E adlfs -E duckdb -E ray -E sql-postgres -E gcsfs -E sql-sqlite
22+
poetry install -E pyarrow -E hive -E s3fs -E glue -E adlfs -E duckdb -E ray -E sql-postgres -E gcsfs -E sql-sqlite -E daft
2323

2424
install: | install-poetry install-dependencies
2525

NOTICE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11

22
Apache Iceberg
3-
Copyright 2017-2022 The Apache Software Foundation
3+
Copyright 2017-2024 The Apache Software Foundation
44

55
This product includes software developed at
66
The Apache Software Foundation (http://www.apache.org/).

mkdocs/docs/SUMMARY.md

Lines changed: 2 additions & 1 deletion
@@ -17,11 +17,12 @@
1717

1818
<!-- prettier-ignore-start -->
1919

20-
- [Home](index.md)
20+
- [Getting started](index.md)
2121
- [Configuration](configuration.md)
2222
- [CLI](cli.md)
2323
- [API](api.md)
2424
- [Contributing](contributing.md)
25+
- [Community](community.md)
2526
- Releases
2627
- [Verify a release](verify-release.md)
2728
- [How to release](how-to-release.md)

mkdocs/docs/api.md

Lines changed: 110 additions & 0 deletions
@@ -418,6 +418,63 @@ with table.update_schema(allow_incompatible_changes=True) as update:
418418
update.delete_column("some_field")
419419
```
420420

421+
## Partition evolution
422+
423+
PyIceberg supports partition evolution. See the [partition evolution](https://iceberg.apache.org/spec/#partition-evolution)
424+
section of the Iceberg spec for more details.
425+
426+
The API to use when evolving partitions is the `update_spec` API on the table.
427+
428+
```python
429+
with table.update_spec() as update:
430+
    update.add_field("id", BucketTransform(16), "bucketed_id")
431+
    update.add_field("event_ts", DayTransform(), "day_ts")
432+
```
433+
434+
Updating the partition spec can also be done as part of a transaction with other operations.
435+
436+
```python
437+
with table.transaction() as transaction:
438+
    with transaction.update_spec() as update_spec:
439+
        update_spec.add_field("id", BucketTransform(16), "bucketed_id")
440+
        update_spec.add_field("event_ts", DayTransform(), "day_ts")
441+
    # ... Update properties etc
442+
```
443+
444+
### Add fields
445+
446+
New partition fields can be added via the `add_field` API which takes in the field name to partition on,
447+
the partition transform, and an optional partition name. If the partition name is not specified,
448+
one will be created.
449+
450+
```python
451+
with table.update_spec() as update:
452+
    update.add_field("id", BucketTransform(16), "bucketed_id")
453+
    update.add_field("event_ts", DayTransform(), "day_ts")
454+
    # identity is a shortcut API for adding an IdentityTransform
455+
    update.identity("some_field")
456+
```
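
To make the transforms in these examples more concrete, here is a minimal pure-Python sketch of what a day transform and a bucket transform compute. It is not PyIceberg code: `day_transform` and `toy_bucket` are hypothetical helpers, and `toy_bucket` uses CRC32 as a stand-in hash. The Iceberg spec defines bucketing with a 32-bit Murmur3 hash, so real bucket numbers will differ.

```python
import zlib
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Return the number of days between the Unix epoch and the timestamp's date."""
    return (ts.date() - EPOCH).days


def toy_bucket(value: int, num_buckets: int) -> int:
    """Map an integer to a bucket in [0, num_buckets).

    ASSUMPTION: CRC32 is only a stand-in hash for illustration;
    Iceberg itself specifies a 32-bit Murmur3 hash.
    """
    digest = zlib.crc32(value.to_bytes(8, "little", signed=True))
    # Clear the sign bit, then take the remainder, mirroring the spec's shape.
    return (digest & 0x7FFFFFFF) % num_buckets


print(day_transform(datetime(2009, 1, 1, 0, 34, 31)))  # 14245
print(toy_bucket(34, 16))  # some bucket number between 0 and 15
```

Rows whose source values map to the same transform result land in the same partition, which is why a query filtering on `event_ts` can skip whole days of data.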
457+
458+
### Remove fields
459+
460+
Partition fields can also be removed via the `remove_field` API if it no longer makes sense to partition on those fields.
461+
462+
```python
463+
with table.update_spec() as update:
464+
    # Remove the partition field with the name some_partition_name
465+
    update.remove_field("some_partition_name")
466+
```
467+
468+
### Rename fields
469+
470+
Partition fields can also be renamed via the `rename_field` API.
471+
472+
```python
473+
with table.update_spec() as update:
474+
    # Rename the partition field with the name bucketed_id to sharded_id
475+
    update.rename_field("bucketed_id", "sharded_id")
476+
```
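
The `with` blocks above can be pictured with a small toy model. This is a sketch under stated assumptions, not PyIceberg's implementation: `ToySpecUpdate` is a hypothetical class, the spec is modeled as a plain dict, and it assumes that changes are staged inside the block and applied only when the block exits without an error.

```python
class ToySpecUpdate:
    """Hypothetical stand-in for the object yielded by `update_spec()`."""

    def __init__(self, spec):
        self.spec = spec           # committed: partition name -> (source column, transform)
        self._staged = dict(spec)  # working copy, edited inside the `with` block

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Apply the staged changes only if the block finished cleanly.
        if exc_type is None:
            self.spec.clear()
            self.spec.update(self._staged)
        return False  # never swallow exceptions

    def add_field(self, source, transform, name):
        if name in self._staged:
            raise ValueError(f"partition field already exists: {name}")
        self._staged[name] = (source, transform)
        return self

    def remove_field(self, name):
        del self._staged[name]
        return self

    def rename_field(self, old_name, new_name):
        self._staged[new_name] = self._staged.pop(old_name)
        return self


spec = {}
with ToySpecUpdate(spec) as update:
    update.add_field("id", "bucket[16]", "bucketed_id")
    update.add_field("event_ts", "day", "day_ts")

with ToySpecUpdate(spec) as update:
    update.rename_field("bucketed_id", "sharded_id")
    update.remove_field("day_ts")

print(spec)  # {'sharded_id': ('id', 'bucket[16]')}
```

Staging the edits in a working copy means a failure halfway through the block leaves the committed spec untouched.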
477+
421478
## Table properties
422479

423480
Set and remove properties through the `Transaction` API:
@@ -636,3 +693,56 @@ print(ray_dataset.take(2))
636693
},
637694
]
638695
```
696+
697+
### Daft
698+
699+
PyIceberg interfaces closely with Daft Dataframes (see also: [Daft integration with Iceberg](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/iceberg.html)), which provide a full, lazily optimized query engine interface on top of PyIceberg tables.
700+
701+
<!-- prettier-ignore-start -->
702+
703+
!!! note "Requirements"
704+
This requires [Daft to be installed](index.md).
705+
706+
<!-- prettier-ignore-end -->
707+
708+
A table can be read easily into a Daft Dataframe:
709+
710+
```python
711+
df = table.to_daft() # equivalent to `daft.read_iceberg(table)`
712+
df = df.where(df["trip_distance"] >= 10.0)
713+
df = df.select("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime")
714+
```
715+
716+
This returns a Daft Dataframe which is lazily materialized. Printing `df` will display the schema:
717+
718+
```
719+
╭──────────┬───────────────────────────────┬───────────────────────────────╮
720+
│ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime │
721+
│ --- ┆ --- ┆ --- │
722+
│ Int64 ┆ Timestamp(Microseconds, None) ┆ Timestamp(Microseconds, None) │
723+
╰──────────┴───────────────────────────────┴───────────────────────────────╯
724+
725+
(No data to display: Dataframe not materialized)
726+
```
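
Lazy materialization, as described above, can be illustrated with a toy model. This is only a sketch and not how Daft is implemented: `ToyLazyFrame` and its `collect` method are hypothetical names, and the rows are plain dicts. Each call records a step in a plan, and nothing is computed until the result is requested.

```python
class ToyLazyFrame:
    """Hypothetical lazy frame: records operations, runs them on demand."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (operation, argument) pairs

    def where(self, predicate):
        # Record the filter; do not touch the data yet.
        return ToyLazyFrame(self._rows, self._plan + (("where", predicate),))

    def select(self, *columns):
        # Record the projection; do not touch the data yet.
        return ToyLazyFrame(self._rows, self._plan + (("select", columns),))

    def collect(self):
        # Only now is the recorded plan executed, step by step.
        rows = self._rows
        for op, arg in self._plan:
            if op == "where":
                rows = [row for row in rows if arg(row)]
            else:
                rows = [{col: row[col] for col in arg} for row in rows]
        return rows


rows = [
    {"VendorID": 2, "trip_distance": 12.5},
    {"VendorID": 1, "trip_distance": 3.0},
]
df = ToyLazyFrame(rows).where(lambda row: row["trip_distance"] >= 10.0).select("VendorID")
print(df.collect())  # [{'VendorID': 2}]
```

Because the whole plan is known before execution, a real engine can reorder or push down steps, for example by skipping files whose statistics rule out a match.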
727+
728+
We can execute the Dataframe to preview the first few rows of the query with `df.show()`.
729+
730+
The query is optimized to take advantage of Iceberg features such as hidden partitioning and file-level statistics for efficient reads.
731+
732+
```python
733+
df.show(2)
734+
```
735+
736+
```
737+
╭──────────┬───────────────────────────────┬───────────────────────────────╮
738+
│ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime │
739+
│ --- ┆ --- ┆ --- │
740+
│ Int64 ┆ Timestamp(Microseconds, None) ┆ Timestamp(Microseconds, None) │
741+
╞══════════╪═══════════════════════════════╪═══════════════════════════════╡
742+
│ 2 ┆ 2008-12-31T23:23:50.000000 ┆ 2009-01-01T00:34:31.000000 │
743+
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
744+
│ 2 ┆ 2008-12-31T23:05:03.000000 ┆ 2009-01-01T16:10:18.000000 │
745+
╰──────────┴───────────────────────────────┴───────────────────────────────╯
746+
747+
(Showing first 2 rows)
748+
```

mkdocs/docs/cli.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ Options:
3636
--catalog TEXT
3737
--verbose BOOLEAN
3838
--output [text|json]
39+
--ugi TEXT
3940
--uri TEXT
4041
--credential TEXT
4142
--help Show this message and exit.

mkdocs/docs/community.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
---
2+
hide:
3+
- navigation
4+
---
5+
6+
<!--
7+
- Licensed to the Apache Software Foundation (ASF) under one
8+
- or more contributor license agreements. See the NOTICE file
9+
- distributed with this work for additional information
10+
- regarding copyright ownership. The ASF licenses this file
11+
- to you under the Apache License, Version 2.0 (the
12+
- "License"); you may not use this file except in compliance
13+
- with the License. You may obtain a copy of the License at
14+
-
15+
- http://www.apache.org/licenses/LICENSE-2.0
16+
-
17+
- Unless required by applicable law or agreed to in writing,
18+
- software distributed under the License is distributed on an
19+
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
20+
- KIND, either express or implied. See the License for the
21+
- specific language governing permissions and limitations
22+
- under the License.
23+
-->
24+
25+
# Join the community
26+
27+
Apache Iceberg tracks issues in GitHub and prefers to receive contributions as pull requests.
28+
29+
Community discussions happen primarily on the [dev mailing list](https://lists.apache.org/list.html?dev@iceberg.apache.org), on the [Apache Iceberg Slack workspace](https://join.slack.com/t/apache-iceberg/shared_invite/zt-287g3akar-K9Oe_En5j1UL7Y_Ikpai3A) in the #python channel, and on specific [GitHub issues](https://github.com/apache/iceberg-python/issues).
30+
31+
## Iceberg Community Events
32+
33+
The PyIceberg community sync is on the last Tuesday of every month. To join, make sure to subscribe to the [iceberg-python-sync Google group](https://groups.google.com/g/iceberg-python-sync).
34+
35+
## Community Guidelines
36+
37+
### Apache Iceberg Community Guidelines
38+
39+
The Apache Iceberg community is built on the principles described in the [Apache Way](https://www.apache.org/theapacheway/index.html)
40+
and all who engage with the community are expected to be respectful and open, to act with the best interests of the community in mind,
41+
and abide by the Apache Foundation [Code of Conduct](https://www.apache.org/foundation/policies/conduct.html).
42+
43+
### Participants with Corporate Interests
44+
45+
A wide range of corporate entities have interests that overlap in both features and frameworks related to Iceberg, and while we
46+
encourage engagement and contributions, the community is not a venue for marketing, solicitation, or recruitment.
47+
48+
Any vendor who wants to participate in the Apache Iceberg community Slack workspace should create a dedicated vendor channel
49+
for their organization prefixed by `vendor-`.
50+
51+
This space can be used to discuss features and integration with Iceberg related to the vendor offering. This space should not
52+
be used to promote competing vendor products/services or disparage other vendor offerings. Discussion should be focused on
53+
questions asked by the community, not on expanding, introducing, or redirecting users to alternate offerings.
54+
55+
### Marketing / Solicitation / Recruiting
56+
57+
The Apache Iceberg community is a space for everyone to operate free of influence. The development lists, Slack workspace,
58+
and GitHub should not be used to market products or services. Solicitation or overt promotion should not be performed in common
59+
channels or through direct messages.
60+
61+
Recruitment of community members should not be conducted through direct messages or community channels, but opportunities
62+
related to contributing to or using Iceberg can be posted to the `#jobs` channel.
63+
64+
For questions regarding any of the guidelines above, please contact a PMC member.
