-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[python] Fix filter on data evolution table not working issue #7211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[python] Fix filter on data evolution table not working issue #7211
Conversation
4354588 to
b8e709d
Compare
c694ad0 to
867ca01
Compare
867ca01 to
653f23e
Compare
| simple_null = self._filter_batch_simple_null(batch) | ||
| if simple_null is not None: | ||
| return simple_null | ||
| if not self.predicate.has_null_check(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just use Predicate.to_arrow, it is same to other table modes.
| schema=result.schema, | ||
| ) | ||
| except (TypeError, ValueError, pa.ArrowInvalid) as e: | ||
| logger.debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exception here?
| return False | ||
| if self.partition_key_predicate and not self.partition_key_predicate.test(entry.partition): | ||
| return False | ||
| if self.deletion_vectors_enabled and entry.file.level == 0: # do not read level 0 file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can refactor this, this if should be in is_primary_key_table.
| if not self.predicate: | ||
| return True | ||
| if self.predicate_for_stats is None: | ||
| if self.predicate_for_stats is None or self.data_evolution: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create a separate if. And we should add comments to this if, explain why there is no filtering done here.
| super().__init__(table, predicate, read_type, actual_split, row_tracking_enabled) | ||
|
|
||
| def _push_down_predicate(self) -> Optional[Predicate]: | ||
| # Do not push predicate to file readers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detailed comments, why not push predicate.
| return ConcatBatchReader(suppliers) | ||
| merge_reader = ConcatBatchReader(suppliers) | ||
| if self.predicate is not None: | ||
| # Only apply filter when all predicate columns are in read_type (e.g. projected schema). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we are returning here is complete row, right? So this check should be applicable to all table modes? That shouldn't be added here, it should be verified in SplitRead.init.
Purpose/Problem
Filter not working on data evolution read: when a predicate is provided, all rows are returned.
Tests
API and Format
Documentation