Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: Invalid comparison operation: Utf8 == Utf8View error during LEFT ANTI JOIN #13510

Open
sergiimk opened this issue Nov 21, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@sergiimk
Copy link
Contributor

Describe the bug

Between 42.2.0 and 43.0.0 there appears to have been a regression that introduced an error:

External(ArrowError(InvalidArgumentError("Invalid comparison operation: Utf8 == Utf8View"), None))

Note that the error happens at the plan execution phase, i.e. plan validation passes successfully.

As Arrow / DF errors don't have any trace info I have no visibility on which operator this error originates from. I will proceed to patch Arrow to try to dump the traceback and produce a minimal reproduce steps.

To Reproduce

I'm still struggling to isolate a minimal repro case. Can only share the full optimized physical plan for now:

DataSinkExec: sink=ParquetSink(file_groups=[])
  CoalescePartitionsExec
    ProjectionExec: expr=[CAST(row_number() ORDER BY [date ASC NULLS FIRST] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW@5 AS Int64) + 2 as offset, op@0 as op, system_time@4 as system_time, date@1 as date, city@2 as city, population@3 as population]
      RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
        BoundedWindowAggExec: wdw=[row_number() ORDER BY [date ASC NULLS FIRST] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Ok(Field { name: "row_number() ORDER BY [date ASC NULLS FIRST] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW", data_type: UInt64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: CurrentRow, is_causal: true }], mode=[Sorted]
          SortPreservingMergeExec: [date@1 ASC]
            SortExec: expr=[date@1 ASC], preserve_partitioning=[true]
              ProjectionExec: expr=[0 as op, CASE WHEN date@0 IS NULL THEN 2524651200000 ELSE date@0 END as date, city@1 as city, population@2 as population, 2524651200000 as system_time]
                CoalesceBatchesExec: target_batch_size=8192
                  HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(date@0, date@0), (city@1, city@1)]
                    CoalesceBatchesExec: target_batch_size=8192
                      RepartitionExec: partitioning=Hash([date@0, city@1], 16), input_partitions=16
                        ProjectionExec: expr=[CAST(date@0 AS Timestamp(Millisecond, Some("UTC"))) as date, city@1 as city, population@2 as population]
                          RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
                            CsvExec: file_groups={1 group: [[tmp/.tmpBsbmOs/data.csv]]}, projection=[date, city, population], has_header=true
                    CoalesceBatchesExec: target_batch_size=8192
                      RepartitionExec: partitioning=Hash([date@0, city@1], 16), input_partitions=1
                        ParquetExec: file_groups={1 group: [[tmp/.tmpBsbmOs/datasets/foo.bar/data/f1620e5d7c1414a901e3c1e2324870a568f07ca633701de7358c4c50b2785be745060]]}, projection=[date, city]

The error happens deep in our system when merging two dataframes, one read from Parquet (and having Utf8View columns) and another from CSV (having Utf8 columns). The key operation is to de-duplicate records in CSV with those in Parquet using LeftAnti JOIN.

Expected behavior

No error / error during planning if some operation is invalid

Additional context

No response

@sergiimk sergiimk added the bug Something isn't working label Nov 21, 2024
@ttencate
Copy link

Related to #13568? I filed that as a separate issue, but it seems like they might have the same root cause.

@findepi
Copy link
Member

findepi commented Nov 26, 2024

@sergiimk are you able to simplify the failing query so that we have the simplest possible reproducer?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants