
Commit 3fbae72

Merge remote-tracking branch 'upstream/main'
Merge with upstream

2 parents 8944dd2 + 5a781df

40 files changed: +1159 −498 lines

.github/workflows/check-md-link.yml

Lines changed: 1 addition & 1 deletion
@@ -36,4 +36,4 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@master
-      - uses: gaurav-nelson/github-action-markdown-link-check@v1
+      - uses: tcort/github-action-markdown-link-check@v1

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions
@@ -19,21 +19,21 @@ exclude: ^vendor/

 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v5.0.0
+    rev: v6.0.0
     hooks:
       - id: trailing-whitespace
       - id: end-of-file-fixer
       - id: debug-statements
       - id: check-yaml
       - id: check-ast
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.11.13
+    rev: v0.12.9
     hooks:
       - id: ruff
         args: [ --fix, --exit-non-zero-on-fix ]
       - id: ruff-format
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.16.0
+    rev: v1.17.1
     hooks:
       - id: mypy
         args:

mkdocs/docs/api.md

Lines changed: 53 additions & 8 deletions
@@ -1004,6 +1004,34 @@ To show only data files or delete files in the current snapshot, use `table.insp

 Expert Iceberg users may choose to commit existing parquet files to the Iceberg table as data files, without rewriting them.

+<!-- prettier-ignore-start -->
+
+!!! note "Name Mapping"
+    Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg table's schema, it requires the table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (the Name Mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet files' metadata, and it creates a new Name Mapping based on the table's current schema if the table doesn't already have one.
+
+!!! note "Partitions"
+    `add_files` only requires the client to read the existing parquet files' metadata footers to infer the partition value of each file. It also supports adding files to Iceberg tables with partition transforms like `MonthTransform` and `TruncateTransform`, which preserve the order of the values after the transformation (any transform whose `preserves_order` property is `True` is supported). Please note that if the column statistics of the `PartitionField`'s source column are not present in the parquet metadata, the partition value is inferred as `None`.
+
+!!! warning "Maintenance Operations"
+    Because `add_files` commits the existing parquet files to the Iceberg table like any other data file, destructive maintenance operations like expiring snapshots will remove them.
+
+!!! warning "Check Duplicate Files"
+    The `check_duplicate_files` parameter determines whether the method validates that the specified `file_paths` do not already exist in the Iceberg table. When set to `True` (the default), the method validates against the table's current data files to prevent accidental duplication, helping to maintain data consistency by ensuring the same file is not added multiple times. While this check is important for data integrity, it can introduce performance overhead for tables with a large number of files. Setting `check_duplicate_files=False` can improve performance but increases the risk of duplicate files, which may lead to data inconsistencies or table corruption. It is strongly recommended to keep this parameter enabled unless duplicate file handling is strictly enforced elsewhere.
+
+<!-- prettier-ignore-end -->
+
+### Usage
+
+| Parameter               | Required? | Type           | Description                                                              |
+| ----------------------- | --------- | -------------- | ------------------------------------------------------------------------ |
+| `file_paths`            | ✔️        | List[str]      | The list of full file paths to be added as data files to the table      |
+| `snapshot_properties`   |           | Dict[str, str] | Properties to set for the new snapshot. Defaults to an empty dictionary |
+| `check_duplicate_files` |           | bool           | Whether to check for duplicate files. Defaults to `True`                |
+
+### Example
+
+Add files to the Iceberg table:
+
 ```python
 # Given that these parquet files have schema consistent with the Iceberg table

@@ -1019,18 +1047,35 @@ tbl.add_files(file_paths=file_paths)
 # A new snapshot is committed to the table with manifests pointing to the existing parquet files
 ```

-<!-- prettier-ignore-start -->
+Add files to the Iceberg table with custom snapshot properties:

-!!! note "Name Mapping"
-    Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg table's schema, it requires the table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (the Name Mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet files' metadata, and it creates a new Name Mapping based on the table's current schema if the table doesn't already have one.
+```python
+# Assume an existing Iceberg table object `tbl`

-!!! note "Partitions"
-    `add_files` only requires the client to read the existing parquet files' metadata footers to infer the partition value of each file. It also supports adding files to Iceberg tables with partition transforms like `MonthTransform` and `TruncateTransform`, which preserve the order of the values after the transformation (any transform whose `preserves_order` property is `True` is supported). Please note that if the column statistics of the `PartitionField`'s source column are not present in the parquet metadata, the partition value is inferred as `None`.
+file_paths = [
+    "s3a://warehouse/default/existing-1.parquet",
+    "s3a://warehouse/default/existing-2.parquet",
+]

-!!! warning "Maintenance Operations"
-    Because `add_files` commits the existing parquet files to the Iceberg table like any other data file, destructive maintenance operations like expiring snapshots will remove them.
+# Custom snapshot properties
+snapshot_properties = {"abc": "def"}

-<!-- prettier-ignore-end -->
+# Enable duplicate file checking
+check_duplicate_files = True
+
+# Add the parquet files to the Iceberg table without rewriting them
+tbl.add_files(
+    file_paths=file_paths,
+    snapshot_properties=snapshot_properties,
+    check_duplicate_files=check_duplicate_files,
+)
+
+# A Name Mapping must have been set to enable reads
+assert tbl.name_mapping() is not None
+
+# Verify that the snapshot property was set correctly
+assert tbl.metadata.snapshots[-1].summary["abc"] == "def"
 ```

 ## Schema evolution
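To make the "Partitions" note added above concrete, here is a hedged sketch of `add_files` against a table partitioned by an order-preserving transform. The catalog name, table identifier, schema, and file path are illustrative assumptions, not part of this commit.

```python
# Hypothetical illustration only: names below are assumptions for the sketch.
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import MonthTransform
from pyiceberg.types import NestedField, StringType, TimestampType

schema = Schema(
    NestedField(field_id=1, name="event_ts", field_type=TimestampType(), required=False),
    NestedField(field_id=2, name="payload", field_type=StringType(), required=False),
)

# MonthTransform preserves ordering, so add_files can infer each file's
# partition value from the parquet footer statistics of `event_ts`.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=MonthTransform(), name="event_ts_month")
)

catalog = load_catalog("default")
tbl = catalog.create_table("default.events", schema=schema, partition_spec=spec)

# Files whose `event_ts` column statistics are missing land in the `None` partition
tbl.add_files(file_paths=["s3a://warehouse/default/events-2024-01.parquet"])
```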

mkdocs/docs/configuration.md

Lines changed: 21 additions & 2 deletions
@@ -127,6 +127,7 @@ For the FileIO there are several configuration options available:

 | s3.request-timeout          | 60.0  | Configure socket read timeouts on Windows and macOS, in seconds. |
 | s3.force-virtual-addressing | False | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |
 | s3.retry-strategy-impl      | None  | Ability to set a custom S3 retry strategy. A full path to a class needs to be given that extends the [S3RetryStrategy](https://github.com/apache/arrow/blob/639201bfa412db26ce45e73851432018af6c945e/python/pyarrow/_s3fs.pyx#L110) base class. |
+| s3.anonymous                | True  | Configure whether to use an anonymous connection. If False (the default), uses the key/secret if configured, or boto's credential resolver. |

 <!-- markdown-link-check-enable-->

@@ -161,6 +162,7 @@ For the FileIO there are several configuration options available:

 | adls.dfs-storage-authority | .dfs.core.windows.net | The hostname[:port] of the Data Lake Gen 2 Service. Defaults to `.dfs.core.windows.net`. Useful for connecting to a local emulator, like [azurite](https://github.com/azure/azurite). See [AzureFileSystem](https://arrow.apache.org/docs/python/filesystems.html#azure-storage-file-system) for reference |
 | adls.blob-storage-scheme   | https | Either `http` or `https`. Defaults to `https`. Useful for connecting to a local emulator, like [azurite](https://github.com/azure/azurite). See [AzureFileSystem](https://arrow.apache.org/docs/python/filesystems.html#azure-storage-file-system) for reference |
 | adls.dfs-storage-scheme    | https | Either `http` or `https`. Defaults to `https`. Useful for connecting to a local emulator, like [azurite](https://github.com/azure/azurite). See [AzureFileSystem](https://arrow.apache.org/docs/python/filesystems.html#azure-storage-file-system) for reference |
+| adls.token                 | eyJ0eXAiOiJKV1QiLCJhbGci... | Static access token for authenticating with ADLS. Used for OAuth2 flows. |

 <!-- markdown-link-check-enable-->

@@ -197,6 +199,7 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya

 | s3.secret-access-key        | password       | Configure the static secret access key used to access the FileIO. |
 | s3.session-token            | AQoDYXdzEJr... | Configure the static session token used to access the FileIO. |
 | s3.force-virtual-addressing | True           | Whether to use virtual addressing of buckets. This is set to `True` by default as OSS can only be accessed with virtual hosted style address. |
+| s3.anonymous                | True           | Configure whether to use an anonymous connection. If False (the default), uses the key/secret if configured, or standard AWS configuration methods. |

 <!-- markdown-link-check-enable-->

@@ -388,6 +391,7 @@ The RESTCatalog supports pluggable authentication via the `auth` configuration b

 - `noop`: No authentication (no Authorization header sent).
 - `basic`: HTTP Basic authentication.
+- `oauth2`: OAuth2 client credentials flow.
 - `custom`: Custom authentication manager (requires `auth.impl`).
 - `google`: Google Authentication support

@@ -411,9 +415,10 @@ catalog:

 | Property      | Required            | Description |
 |---------------|---------------------|-------------|
-| `auth.type`   | Yes                 | The authentication type to use (`noop`, `basic`, or `custom`). |
+| `auth.type`   | Yes                 | The authentication type to use (`noop`, `basic`, `oauth2`, or `custom`). |
 | `auth.impl`   | Conditionally       | The fully qualified class path for a custom AuthManager. Required if `auth.type` is `custom`. |
 | `auth.basic`  | If type is `basic`  | Block containing `username` and `password` for HTTP Basic authentication. |
+| `auth.oauth2` | If type is `oauth2` | Block containing OAuth2 configuration (see below). |
 | `auth.custom` | If type is `custom` | Block containing configuration for the custom AuthManager. |
 | `auth.google` | If type is `google` | Block containing `credentials_path` to a service account file (if using). Will default to using Application Default Credentials. |

@@ -436,6 +441,20 @@ auth:
     password: mypass
 ```

+OAuth2 Authentication:
+
+```yaml
+auth:
+  type: oauth2
+  oauth2:
+    client_id: my-client-id
+    client_secret: my-client-secret
+    token_url: https://auth.example.com/oauth/token
+    scope: read
+    refresh_margin: 60 # (optional) seconds before expiry to refresh
+    expires_in: 3600 # (optional) fallback if the server does not provide one
+```
+
 Custom Authentication:

 ```yaml

@@ -451,7 +470,7 @@ auth:

 - If `auth.type` is `custom`, you **must** specify `auth.impl` with the full class path to your custom AuthManager.
 - If `auth.type` is not `custom`, specifying `auth.impl` is not allowed.
-- The configuration block under each type (e.g., `basic`, `custom`) is passed as keyword arguments to the corresponding AuthManager.
+- The configuration block under each type (e.g., `basic`, `oauth2`, `custom`) is passed as keyword arguments to the corresponding AuthManager.

 <!-- markdown-link-check-enable-->
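As a usage sketch for the new `s3.anonymous` property (the new `adls.token` property is passed the same way, as a FileIO property), here is a hedged example of supplying it through `load_catalog`; the catalog name, URI, and endpoint are illustrative assumptions, not part of this commit.

```python
# Hypothetical illustration only: the catalog name and endpoints are assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        # Read from a public bucket without credentials; when False (the
        # default), configured keys or boto's credential resolver are used.
        "s3.anonymous": "true",
    },
)
```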

mkdocs/docs/expression-dsl.md

Lines changed: 2 additions & 0 deletions
@@ -60,6 +60,8 @@ age_greater_than_18 = GreaterThan("age", 18)

 # Greater than or equal to
 age_greater_than_or_equal_18 = GreaterThanOrEqual("age", 18)
+
+
 ```

 #### Set Predicates

mkdocs/docs/row-filter-syntax.md

Lines changed: 9 additions & 0 deletions
@@ -100,6 +100,15 @@ column NOT LIKE 'prefix%'

 !!! important
     The `%` wildcard is only supported at the end of the pattern. Using it in the middle or beginning of the pattern will raise an error.

+## BETWEEN
+
+The BETWEEN operator filters a numeric value against an inclusive range, e.g. `a between 1 and 2` is equivalent to `a >= 1 and a <= 2`.
+
+```sql
+column BETWEEN 1 AND 2
+column BETWEEN 1.0 AND 2.0
+```
+
 ## Logical Operations

 Combine multiple conditions using logical operators:
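To see the new `BETWEEN` operator end to end, here is a hedged sketch that passes it as a `row_filter` string to a table scan; the catalog name, table identifier, and column are illustrative assumptions, not part of this commit.

```python
# Hypothetical illustration only: the table and column names are assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
tbl = catalog.load_table("default.taxis")

# Inclusive range: equivalent to trip_distance >= 1.0 AND trip_distance <= 2.0
batch = tbl.scan(row_filter="trip_distance BETWEEN 1.0 AND 2.0").to_arrow()
```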
