[SPARK-49992][SQL] Session level collation should not impact DDL queries #48436
base: master
Conversation
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, ImplicitStringType, StringType}

object ResolveImplicitStringTypes extends Rule[LogicalPlan] {
@stefankandic Not all commands have been ported to the V2 API yet, so V1 commands like ClearCacheCommand live in the sql package. You cannot access them from catalyst, and therefore cannot handle them here.
If needed, we could at least create logical nodes for them, see https://issues.apache.org/jira/browse/SPARK-33392. Let me know which commands you need to handle that are not ported to the DS V2 API.
I think we just need create/alter for table/view/function/schema. Is that doable?
Please take a look when you have the time @mihailom-db @cloud-fan
Looks OK; just please make sure all cases are tested properly so we do not miss anything.
 * we can still differentiate it from a regular string type, because in some places default string
 * is not the one with the session collation (e.g. in DDL commands).
 */
private[spark] class DefaultStringType extends StringType(0) {
How is this different from case object StringType extends StringType(0)?
case object StringType will have collationId = _collationId, whereas DefaultStringType will have SQLConf.get.defaultStringType.collationId.
Basically, DefaultStringType will behave the same as StringType(sessionCollation.id), while StringType(0) is just UTF8_BINARY.
OK, so this is a kind of StringType(0) with a special marker to indicate that it should be replaced by a default collation later. Can we use a special collation id for this purpose?
Exactly, it should be replaced in the analyzer for DDL operations; for others it should just behave as the underlying session collation. I spent some time thinking about it and haven't been able to figure out a better solution for this, but let me know if you think something else might work better.
I added the special id as requested.
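The marker pattern discussed above can be sketched with a minimal, self-contained toy model. Names here (SessionConf, baseCollationId) are simplified stand-ins for Spark's SQLConf and StringType internals, not the actual PR code: the point is only that DefaultStringType subclasses StringType(0) yet resolves its collation from the session at read time, so the analyzer can still tell the two apart.

```scala
// Toy model of the DefaultStringType marker pattern (assumed names).
object CollationDemo {
  // Stand-in for SQLConf holding the session-level default collation id.
  object SessionConf {
    var defaultCollationId: Int = 0 // 0 models UTF8_BINARY
  }

  // Plain string type: collation is fixed at construction.
  class StringType(val baseCollationId: Int) {
    def collationId: Int = baseCollationId
  }

  // Extends StringType(0) but defers to the session config, so it behaves
  // like StringType(sessionCollation.id) while remaining distinguishable
  // (e.g. for the analyzer to replace it in DDL plans).
  class DefaultStringType extends StringType(0) {
    override def collationId: Int = SessionConf.defaultCollationId
  }
}
```

With this shape, changing SessionConf.defaultCollationId changes what a DefaultStringType reports, while any plain StringType keeps the collation it was built with.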
case _: V2CreateTablePlan =>
  transform(plan, StringType)

case addCols: AddColumns =>
So we assume this rule ReplaceDefaultStringType will run before ResolveSessionCatalog? Otherwise these commands may be converted to v1 and can't be matched here.
Yes, however I am running into some issues there. For example, create view can be converted here: https://github.com/stefankandic/spark/blob/b1ff7672cba12750d41d803f0faeb3487d934601/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1563. I talked with Max about this issue in the comment above; do you think there is a good way to catch the V1 commands as well?
After some more digging I think it's best we handle views separately, as they need to be treated differently; it's not enough to just replace create/alter view as we still need to resolve default types correctly once the view is queried.
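The rule's overall shape, matching only DDL plans and rewriting the default-string marker there, can be modeled with another self-contained sketch. CreateTable and Project below are toy stand-ins for Spark plan nodes like V2CreateTablePlan and ordinary query operators; the real rule pattern-matches catalyst plans, not these classes.

```scala
// Toy sketch of a ReplaceDefaultStringType-style rule (assumed names).
object ReplaceDefaultStringTypesDemo {
  sealed trait DataType
  case class StringType(collationId: Int) extends DataType
  case object DefaultStringType extends DataType // unresolved marker
  case object IntegerType extends DataType

  sealed trait Plan
  case class CreateTable(columnTypes: Seq[DataType]) extends Plan // "DDL"
  case class Project(columnTypes: Seq[DataType]) extends Plan     // regular query

  val UTF8_BINARY = 0

  // In DDL plans the default string marker resolves to UTF8_BINARY rather
  // than the session collation; non-DDL plans are left untouched here.
  def apply(plan: Plan): Plan = plan match {
    case CreateTable(cols) =>
      CreateTable(cols.map {
        case DefaultStringType => StringType(UTF8_BINARY)
        case other             => other
      })
    case other => other
  }
}
```

Ordering matters for the same reason raised in the review: the rewrite has to fire while the plan is still in the form the rule matches (here, before anything turns CreateTable into a different node).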
What changes were proposed in this pull request?
This PR proposes not using session-level collation in DDL commands, specifically in CREATE TABLE v2 and ALTER TABLE ADD/REPLACE COLUMNS commands.
Why are the changes needed?
The default collation for DDL commands should be tied to the object being created or altered (e.g., table, view, schema) rather than the session-level setting. Since object-level collations are not yet supported, we will assume the UTF8_BINARY collation by default for now.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.