Skip to content

Conversation

@jayceslesar
Copy link
Contributor

@jayceslesar jayceslesar commented Jun 14, 2025

In pursuit of #813

@jayceslesar
Copy link
Contributor Author

@kevinjqliu @Fokko thoughts here? I think would be pretty light work to set up a common test suite for all catalogs

Some immediate open questions:

  1. Hive and Rest are like a pure mock, but I think they can be worked in (open to ideas)
  2. What's up with this? https://github.com/apache/iceberg-python/blob/main/pyiceberg/catalog/sql.py#L100 NAMESPACE_MINIMAL_PROPERTIES Are those used anywhere?

@Fokko
Copy link
Contributor

Fokko commented Jun 14, 2025

Good call @jayceslesar 🙌

Hive and Rest are like a pure mock, but I think they can be worked in (open to ideas)

I'm not sure if I understand the question correctly, but let me give it a try. I'd prefer to avoid mock and rely on the integration tests that we have with the rest and hive image. The rest container is updated daily from the Java repository, and will let us know if something incompatible changes.

@jayceslesar
Copy link
Contributor Author

I'm not sure if I understand the question correctly, but let me give it a try. I'd prefer to avoid mock and rely on the integration tests that we have with the rest and hive image. The rest container is updated daily from the Java repository, and will let us know if something incompatible changes.

Amazing -- I will take a stab at consolidating the testing of all main catalog behavior in an integration test

@jayceslesar
Copy link
Contributor Author

jayceslesar commented Jun 15, 2025

Ok check out the latest commit. All integration tests. Looks like unable to run locally because im missing AWS_TEST_BUCKET... Possible to use minio for that?

Edit: Also the aws creds but I trust that this is all configured in CI and I just can't see it which is fine, so if we are happy with the little setup I have I can expand it to cover all common behavior

@jayceslesar
Copy link
Contributor Author

@kevinjqliu can you please trigger tests again? Just want to see everything hopefully work

@kevinjqliu
Copy link
Contributor

oops, im late to the party.

I'd prefer to avoid mock and rely on the integration tests that we have with the rest and hive image.

+1 for these integration tests, it would be great to use the actual catalog implementations instead of mocks.

What's up with this? https://github.com/apache/iceberg-python/blob/main/pyiceberg/catalog/sql.py#L100 NAMESPACE_MINIMAL_PROPERTIES Are those used anywhere?

thats an implementation detail :)

Just want to see everything hopefully work

i think we can start with rest, hive, and sqlite. Those 3 are fairly straight forward. glue and dynamodb catalogs are difficult to setup as integration tests. We can first expand to more tests and then add more catalog implementations. WDYT?

@jayceslesar jayceslesar changed the title maint: catalog implementation roundtripping tests maint: common catalog integration test suite Jul 19, 2025
@jayceslesar
Copy link
Contributor Author

@kevinjqliu I have implemented that change!

Also, this testing immediately exposes some inconsistencies with the rest catalog...

  1. test_create_table_with_invalid_database no exceptions are raised
  2. test_drop_namespace gives pyiceberg.exceptions.BadRequestError: NamespaceNotEmptyException: Namespace my_iceberg_database-dqlcspylaukmozvevwik is not empty. 1 tables exist. even though we are dropping the table before dropping the namespace

These both seem like issues with the backend rest catalog and not pyiceberg at first glance

@kevinjqliu
Copy link
Contributor

Also, this testing immediately exposes some inconsistencies with the rest catalog...

thats what we want to see :)

lets check in this general test suite and then add more tests to fix these specific issues

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! super excited for this.

@jayceslesar
Copy link
Contributor Author

jayceslesar commented Jul 19, 2025

Also just mostly fixed hive tests I believe?? Super unsure what the two errors I am still getting locally are

FAILED tests/integration/test_catalog.py::test_update_namespace_properties[hive_catalog] - TypeError: encoding without a string argument
ERROR tests/integration/test_catalog.py::test_update_namespace_properties[hive_catalog] - thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

Happy to exclude hive from that test but we should probably see if its fixable

Edit:

It looks like updating namespace properties (removing properties) on a hive catalog corrupts the namespace???

@jayceslesar
Copy link
Contributor Author

jayceslesar commented Jul 19, 2025

Fixed the issue with removing properties from hive catalog! Looks like that has existed since apache/iceberg#5391

All of these new tests are passing except for the two rest catalog issues mentioned above

@jayceslesar jayceslesar marked this pull request as ready for review July 19, 2025 18:32
@jayceslesar
Copy link
Contributor Author

Yeah the response from the rest catalog is a 400 and it should be a 404 https://github.com/apache/iceberg/blob/77f1f5ba47f0e5250b8af4e16cbe6febc8244ee5/open-api/rest-catalog-open-api.yaml#L580

Comment on lines 2327 to 2333
try:
test_catalog.purge_table(identifier)
except Exception as e:
if drop_if_cannot_purge:
test_catalog.drop_table(identifier)
else:
raise e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just default to drop_table here? and worry about purge_table later. its not yet implemented in pyiceberg's hive client

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I will modify to drop here instead of purge -- should be fine as integration tests are transient? I dont think we will end up writing dangling data right?

@jayceslesar
Copy link
Contributor Author

jayceslesar commented Jul 19, 2025

okay actually im pretty confused because running the following for the first time

curl -X POST http://localhost:8181/v1/namespaces/new_uncreated_namespace/tables \
  -H "Content-Type: application/json" \
  -d '{
    "name": "some_table",
    "schema": {
      "type": "struct",
      "fields": [
        {"id": 1, "name": "foo", "type": "string", "required": false},
        {"id": 2, "name": "bar", "type": "int", "required": true},
        {"id": 3, "name": "baz", "type": "boolean", "required": false},
        {"id": 4, "name": "qux", "type": {"type": "list", "element-id": 8, "element": "string", "element-required": true}, "required": true},
        {"id": 5, "name": "quux", "type": {
          "type": "map", "key-id": 9, "key": "string", "value-id": 10,
          "value": {
            "type": "map", "key-id": 11, "key": "string", "value-id": 12,
            "value": "int", "value-required": true
          },
          "value-required": true
        }, "required": true},
        {"id": 6, "name": "location", "type": {
          "type": "list", "element-id": 13,
          "element": {
            "type": "struct",
            "fields": [
              {"id": 14, "name": "latitude", "type": "float", "required": false},
              {"id": 15, "name": "longitude", "type": "float", "required": false}
            ]
          },
          "element-required": true
        }, "required": true},
        {"id": 7, "name": "person", "type": {
          "type": "struct",
          "fields": [
            {"id": 16, "name": "name", "type": "string", "required": false},
            {"id": 17, "name": "age", "type": "int", "required": true}
          ]
        }, "required": false}
      ],
      "schema-id": 0,
      "identifier-field-ids": [2]
    },
    "partition-spec": {
      "spec-id": 0,
      "fields": []
    },
    "write-order": {
      "order-id": 0, 
      "fields": []
    },
    "stage-create": false,
    "properties": {}
  }'

results in the namespace and table being created??

Running rest server via docker compose -f dev/docker-compose-integration.yml up

@kevinjqliu
Copy link
Contributor

Looks like we found another inconsistent behavior in the catalogs. This ones in the iceberg-rest-fixture which uses the JDBC catalog implementation under the hood. For createTable, it defaults to creating the table without checking if the namespace exists.

We can align the behavior by adding this to the rest section of docker-compose-integration.yml

      - CATALOG_JDBC_STRICT__MODE=true

we might also want to make the same change in upstream https://github.com/kevinjqliu/iceberg/blob/master/docker/iceberg-rest-fixture/Dockerfile

@jayceslesar
Copy link
Contributor Author

Thank you for finding and explaining! I think I will make a PR upstream

@kevinjqliu
Copy link
Contributor

We can fix it here first to unblock this PR

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great @jayceslesar, it already caught some bugs :)

Comment on lines 90 to 102


@pytest.mark.integration
@pytest.mark.parametrize(
"test_catalog",
[
pytest.lazy_fixture("memory_catalog"),
pytest.lazy_fixture("sqlite_catalog_memory"),
pytest.lazy_fixture("sqlite_catalog_file"),
pytest.lazy_fixture("rest_catalog"),
pytest.lazy_fixture("hive_catalog"),
],
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving this to a common variable instead of repeating this for each of the tests?

Suggested change
@pytest.mark.integration
@pytest.mark.parametrize(
"test_catalog",
[
pytest.lazy_fixture("memory_catalog"),
pytest.lazy_fixture("sqlite_catalog_memory"),
pytest.lazy_fixture("sqlite_catalog_file"),
pytest.lazy_fixture("rest_catalog"),
pytest.lazy_fixture("hive_catalog"),
],
)
CATALOGS = [
pytest.lazy_fixture("memory_catalog"),
pytest.lazy_fixture("sqlite_catalog_memory"),
pytest.lazy_fixture("sqlite_catalog_file"),
pytest.lazy_fixture("rest_catalog"),
pytest.lazy_fixture("hive_catalog"),
]
@pytest.mark.integration
@pytest.mark.parametrize("test_catalog", CATALOGS)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, added

@kevinjqliu
Copy link
Contributor

      - CATALOG_JDBC_STRICT__MODE=true

is also being fixed upstream in apache/iceberg#13599

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! We caught 1 bug in the pyiceberg hive client and another in the pyiceberg rest client. And a bug in the apache/iceberg-rest-fixture!

This is going to be really useful. Looking forward to extending this with more tests and more catalogs.

@kevinjqliu
Copy link
Contributor

FAILED tests/integration/test_writes/test_writes.py::test_table_v1_with_null_nested_namespace - pyiceberg.exceptions.NoSuchNamespaceError: NoSuchNamespaceException: Cannot create table default.lower.table_v1_with_null_nested_namespace in catalog rest_backend. Namespace default.lower does not exist

eca9870#diff-7f3dd1244d08ce27c003cd091da10aa049f7bb0c7d5397acb4ec69767036accdR1302

i dont think default.lower.table_v1_with_null_nested_namespace is intended, change it to default.table_v1_with_null_nested_namespace

@kevinjqliu kevinjqliu merged commit fb78904 into apache:main Jul 20, 2025
10 checks passed
@kevinjqliu
Copy link
Contributor

Thanks @jayceslesar for the PR and @Fokko for the review

gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
In pursuit of apache#813

---------

Co-authored-by: Kevin Liu <kevin.jq.liu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants