-
Notifications
You must be signed in to change notification settings - Fork 412
Feat/json serialize or expression #2565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/json serialize or expression #2565
Conversation
1d481ad to
26722cf
Compare
| def __init__(self, left: BooleanExpression, right: BooleanExpression, *rest: BooleanExpression) -> None: | ||
| if isinstance(self, Or) and not hasattr(self, "left") and not hasattr(self, "right"): | ||
| super().__init__(left=left, right=right) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This weird __init__ is to make sure that if the class has been already initialized in __new__, it is not initialized again.
If this is not made, multiple things can happen:
- If *rest is given, although
_build_balanced_treebuild anOrinstance correctly, Python would call againOrinitializer with the original parametersleftandrightresulting inOr(left, right), ignoring the createdOrinstance from_build_balanced_tree - If left or right is
AlwaysTrueit will get overriden and the instance will be initialized asOr(left, right) - If left and right is
AlwaysFalseit will get overriden and the instance will be initialized asOr(left, right)
I do not know why exactly but it gotta be something from Pydantic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent several hours debugging this, and it is pretty weird. Pydantic seems not to work well with __new__. I came up with the following:
class Or(IcebergBaseModel, BooleanExpression):
"""OR operation expression - logical disjunction."""
- model_config = ConfigDict(arbitrary_types_allowed=True)
+ model_config = ConfigDict(arbitrary_types_allowed=True, frozen=False)
- type: TypingLiteral["str"] = Field(default="or", alias="type")
- left: BooleanExpression
- right: BooleanExpression
+ type: TypingLiteral["or"] = Field(default="or")
+ left: BooleanExpression = Field()
+ right: BooleanExpression = Field()
def __init__(self, left: BooleanExpression, right: BooleanExpression, *rest: BooleanExpression) -> None:
- if isinstance(self, Or) and not hasattr(self, "left") and not hasattr(self, "right"):
- super().__init__(left=left, right=right)
+ super().__init__(left=self.left, right=self.right)
def __new__(cls, left: BooleanExpression, right: BooleanExpression, *rest: BooleanExpression) -> BooleanExpression: # type: ignore
if rest:
@@ -326,20 +325,11 @@ class Or(IcebergBaseModel, BooleanExpression):
return left
else:
obj = super().__new__(cls)
+ obj.__pydantic_fields_set__ = set()
+ obj.left = left
+ obj.right = right
return objI see what direction you were going, but if we leave out the obj.left = left then we don't have the attributes and we have an empty Or object. The test is still failing, but I think that's because the UnaryPredicates are not yet serializable. If you update the tests to use LiteralPredicates, then it should work 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot test it now, but I am pretty sure that I left out obj.left = left because of how Python works with __new__. If you return a class of the same type that happens:
If new() is invoked during object construction and it returns an instance of cls, then the new instance’s init() method will be invoked like init(self[, ...]), where self is the new instance and the remaining arguments are the same as were passed to the object constructor.
Ref: https://docs.python.org/3.12/reference/datamodel.html#object.__new__
So that is why I purposely left the object uninitialized, so that when the raw obj is returned it is initialized by __init__.
If the object has already being initialized by _build_balanced_tree I just leave it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing, the tests are passing in my implementation, as far as I know, so you know that Or is being instantiated correctly, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier today I had failing tests, but I was still at an older commit of the branch. Let me check again
pyiceberg/expressions/__init__.py
Outdated
|
|
||
| model_config = ConfigDict(arbitrary_types_allowed=True) | ||
|
|
||
| type: TypingLiteral["str"] = Field(default="or", alias="type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| type: TypingLiteral["str"] = Field(default="or", alias="type") | |
| type: TypingLiteral["or"] = Field(default="or", alias="type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| assert ( | ||
| or_.model_dump_json() | ||
| == '{"type":"or","left":"IsNull(term=Reference(name=\'a\'))","right":"IsNaN(term=Reference(name=\'b\'))"}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is invalid, it should return something like:
| == '{"type":"or","left":"IsNull(term=Reference(name=\'a\'))","right":"IsNaN(term=Reference(name=\'b\'))"}' | |
| == '{"type":"or","left":{"type":"is-null","term":"a"},"right":{"type":"is-nan","term":"b"}}' |
But the unary predicates are still WIP: #2598
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I build an "easier" test for now? Something like Or("a", "b") and then check if {"type":"or","left":"a","right":"b"}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you replace it with EqualTo("a", 10) then it should produce a valid JSON encoded expression. On top of that, we can also remove these: https://github.com/apache/iceberg-python/pull/2565/files#diff-4d0ae566a5a0bf0ceba68c8119e95c3d2d71d92e70e0dd9b48589d3379c7f1caR331-R341
These will override the serializer when it encounters a non-Pydantic object in left/right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a seperate test would be best :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
e58863b to
417678a
Compare
|
|
CI failing at PS: I see that you solved it in #2654 |
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @jaimeferj
|
@jaimeferj Also feel free to fix the CI in this branch |
417678a to
1b18573
Compare
|
Sorry for pushing again, I though that the linting error was fixed by #2654 but it still errors: |
|
@jaimeferj It can be that it is due some dependency or something. Maybe |
I just reinstalled the whole thing again and it keeps giving the same error while linting. I'll try to sort it out later and if I cannot, I'll open an issue. Thanks |
|
@jaimeferj Maybe it is a caching issue, maybe run |
|
Thanks for working on this @jaimeferj 🙌 |
Closes #2519
Rationale for this change
Are these changes tested?
yes
Are there any user-facing changes?