Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix join with sort push down #13560

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

haohuaijin
Copy link
Contributor

@haohuaijin haohuaijin commented Nov 25, 2024

Which issue does this PR close?

Closes #13559

Rationale for this change

What changes are included in this PR?

Are these changes tested?

test by existed test and add a sqllogicaltest test

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 25, 2024
@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

Thank you @haohuaijin

This appears to be code that was introduced in #12559 from @berkaysynnada

@berkaysynnada is there any way you can help review this PR?

@berkaysynnada
Copy link
Contributor

Hi @haohuaijin, and sorry for the delayed response. I have been very busy over the past few days. I have reviewed your fix and have some comments about the problem and the solution.

The issue originates from handle_custom_pushdown(), which overfits to operators that simply concatenate the input children's schemas side by side. In your example, the join is a right-semi join, which exposed this bug.

Your fix refers to column names when propagating the order. However, if there are multiple columns in the child schema, this fix may still be insufficient. To address this, I suggest two potential solutions:

  1. You can write a specific handler for hash joins. This approach would not rely on a general mapping of input_fields to output_fields but would instead use a join type parameter to propagate the ordering. This solution should work for any type of join in our project and is my recommended approach. You could copy the current logic from this function and adapt it for different join types.
  2. You could generalize the logic further to include a mapping of input children fields to output children fields and propagate the ordering accordingly. However, implementing this solution would be non-trivial and would require extensive testing.
    If you go with the first solution, I suggest adding a note to the function’s header to clarify its purpose:
    "The function can be used for operators that simply concatenate the input children's schemas side by side."

By the way, @alamb, isn’t it risky to have an index_of() API in the Schema implementation? Doesn’t this create the potential for mismatches when columns are matched based solely on their names?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An error occurred when the sort push down rule pushed sort below join
3 participants