Add implementation of SQL Except operation #135
demianw wants to merge 2 commits into dask-contrib:main from
Conversation
nils-braun left a comment
Thank you so much @demianw! That is a very nice add-on!
If you want, you can also add it to the documentation in select.rst (e.g. below the UNION section).
I have a few comments on the changes, but in general I am already quite happy with it.
rel_string = str(generator.getRelationalAlgebraString(rel))
logger.debug(
    f"Non optimised query plan: \n "
    f"{str(generator.getRelationalAlgebraString(nonOptimizedRelNode))}"
Ok, makes sense to do that. Good idea.
class LogicalMinusPlugin(BaseRelPlugin):
    """
    LogicalUnion is used on EXCEPT clauses.
I guess you still would like to update that docstring :-)
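For example, something along these lines could work - the wording is only a suggestion, and the import path is an assumption about where BaseRelPlugin lives:

```python
# Suggested docstring wording only; adjust to the plugin's actual behaviour.
# The import path is assumed, not taken from the PR.
from dask_sql.physical.rel.base import BaseRelPlugin


class LogicalMinusPlugin(BaseRelPlugin):
    """
    LogicalMinus is used on EXCEPT clauses: it keeps the rows of the
    first input that do not appear in the second input.
    """
```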
second_df = second_dc.df
second_cc = second_dc.column_container

# For concatenating, they should have exactly the same fields
Suggested change:
- # For concatenating, they should have exactly the same fields
+ # For subtracting, they should have exactly the same fields
second_df = second_dc.assign()

self.check_columns_from_row_type(first_df, rel.getExpectedInputRowType(0))
self.check_columns_from_row_type(second_df, rel.getExpectedInputRowType(1))
There is now a lot of code duplication between this and the LogicalUnion plugin. I think it would make sense to extract the basic functionality (the column name cleaning) into a function in utils.py and then reuse it here - or what do you think @demianw?
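A rough sketch of what such a shared helper in utils.py could look like - the name, signature and renaming strategy are only assumptions and would need to mirror what LogicalUnionPlugin actually does today:

```python
from typing import List

import dask.dataframe as dd


def rename_columns_to_match(df: dd.DataFrame, target_columns: List[str]) -> dd.DataFrame:
    """Positionally rename the columns of ``df`` to ``target_columns``.

    Both UNION and EXCEPT need their two inputs to end up with exactly the
    same fields, so this cleanup step could be shared by both plugins.
    """
    if len(df.columns) != len(target_columns):
        raise ValueError("Inputs must have the same number of columns")
    return df.rename(columns=dict(zip(df.columns, target_columns)))
```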
    indicator=True,
)

df = df[df.iloc[:, -1] == "left_only"].iloc[:, :-1]
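As a standalone illustration (plain pandas, independent of the PR's exact code): the indicator=True merge performs an anti-join, which is the core of EXCEPT - keep only the left rows that have no match on the right:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"a": [2]})

# A left merge with indicator=True adds a "_merge" column marking each row as
# "both", "left_only" or "right_only"; filtering on "left_only" keeps only the
# rows of `left` with no match in `right`.
merged = left.merge(right, how="left", indicator=True)
result = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(result)  # rows with a == 1 and a == 3
```

Note that full SQL EXCEPT semantics would additionally drop duplicate rows (e.g. via drop_duplicates()); the anti-join alone does not do that.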
)
result_df = result_df.compute()
assert result_df.columns == "a"
assert set(result_df["a"]) == set([1, 3])
Would you mind adding a test with NaNs? You can also use one of the prepared tables (e.g. user_table_nan) if it makes sense.
It might also make sense to test this functionality against sqlite - I am just worried that, especially for NULL (NaN), pandas/dask and SQL have different opinions (as I have seen so often, unfortunately...)
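A rough sketch of the kind of test meant here - it assumes the existing c context fixture and that the prepared user_table_nan table has a nullable column named c; both the query and the expected result are only illustrative and should ideally be cross-checked against sqlite:

```python
def test_except_nan(c):
    result_df = c.sql(
        """
        SELECT c FROM user_table_nan
        EXCEPT
        SELECT c FROM user_table_nan WHERE c IS NOT NULL
        """
    )
    result_df = result_df.compute()

    # Whatever rows survive should only be the NULL (NaN) ones - this is
    # exactly the case where pandas/dask NaN handling and SQL NULL handling
    # tend to disagree, hence the suggested cross-check against sqlite.
    assert result_df["c"].isna().all()
```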
@@ -0,0 +1,29 @@
def test_except_empty(c, df):
I think you do not need the df parameter here
Oh, and if you want to get rid of the style errors: make sure you use black version 19.10 (I have seen that this can make a difference).
@demianw - would you still like to work on the PR? I think your changes are absolutely worth being included and just need a small amount of tweaking!
There is also an added log line to show non-optimised query plans.