Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34698: [C++][Acero] Add node that emits explicit ordering after asserting order #44738

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Nov 15, 2024

Rationale for this change

An acero node that turns an implicit ordering into an explicit ordering (rows sorted by some columns) is useful to re-use order that already exists in the data.

What changes are included in this PR?

This PR adds the AssertOrderNode that implements this logic. The Scanner employs that node to turn the implicit ordering of the ScanNode into an explicit order as defined by user code via ScanBuilder.Ordering.

Are these changes tested?

There are unit tests for the AssertOrderNode as well as for the ScanNode and ScanBuilder.

Are there any user-facing changes?

The following options has been added:

  • the ordering option added to ScanOptions
  • the Ordering option added to ScannerBuilder
  • the AssertNodeOptions class

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@EnricoMi EnricoMi changed the title GH-#34698: [C++] Add node that emits explicit ordering after asserting order GH-#34698: [C++][Acero] Add node that emits explicit ordering after asserting order Nov 15, 2024
@EnricoMi EnricoMi changed the title GH-#34698: [C++][Acero] Add node that emits explicit ordering after asserting order GH-34698: [C++][Acero] Add node that emits explicit ordering after asserting order Nov 15, 2024
Copy link

⚠️ GitHub issue #34698 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably not qualified to review this thoroughly but the CI failure is related:


[  FAILED  ] TestScannerThreading/TestScanner.ScanOrderingFail/0Serial1d1b1024r, where GetParam() = serial-1d-1b-1024i (4 ms)
[ RUN      ] TestScannerThreading/TestScanner.ScanOrderingFail/1Serial2d16b1024r
/arrow/cpp/src/arrow/dataset/scanner_test.cc:1040: Failure
Expected equality of these values:
  next.status().message()
    Which is: "Data is not ordered\n/arrow/cpp/src/arrow/acero/assert_order_node.cc:190  AssertInBatchOrder(*record_batch, ordering_, comparators_)\n/arrow/cpp/src/arrow/acero/source_node.cc:158  output_->InputReceived(this, std::move(batch))"
  "Data is not ordered"
With diff:
@@ -1,3 @@
 Data is not ordered
-/arrow/cpp/src/arrow/acero/assert_order_node.cc:190  AssertInBatchOrder(*record_batch, ordering_, comparators_)
-/arrow/cpp/src/arrow/acero/source_node.cc:158  output_->InputReceived(this, std::move(batch))

There seems to be some stray output on that specific job.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Nov 19, 2024
@EnricoMi
Copy link
Contributor Author

@raulcd thanks, assertions adjusted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants