Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49992][SQL] Session level collation should not impact DDL queries #48436

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

stefankandic
Copy link
Contributor

@stefankandic stefankandic commented Oct 12, 2024

What changes were proposed in this pull request?

This PR proposes not using session-level collation in DDL commands, specifically in CREATE TABLE v2 and ALTER TABLE ADD/REPLACE COLUMNS commands.

Why are the changes needed?

The default collation for DDL commands should be tied to the object being created or altered (e.g., table, view, schema) rather than the session-level setting. Since object-level collations are not yet supported, we will assume the UTF8_BINARY collation by default for now.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 12, 2024
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, ImplicitStringType, StringType}

object ResolveImplicitStringTypes extends Rule[LogicalPlan] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefankandic Not all commands have been ported on V2 API yet, so, V1 commands like ClearCacheCommand are in the sql package. And you cannot access to them from catalyst, and cannot handle them.

If it is need, we could at least create a logical nodes for them, see https://issues.apache.org/jira/browse/SPARK-33392. Let me know which commands you need to handle that are not ported to DS V2 API.

Copy link
Contributor Author

@stefankandic stefankandic Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need create/alter for table/view/function/schema. Is that doable?

@stefankandic
Copy link
Contributor Author

Please take a look when you have the time @mihailom-db @cloud-fan

@stefankandic stefankandic changed the title [SQL] Fix session level collation [SPARK-49992][SQL] Session level collation should not impact DDL queries Oct 16, 2024
@stefankandic stefankandic marked this pull request as ready for review October 16, 2024 12:50
Copy link
Contributor

@mihailom-db mihailom-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok, just please make sure all cases are tested properly so we do not miss something

* we can still differentiate it from a regular string type, because in some places default string
* is not the one with the session collation (e.g. in DDL commands).
*/
private[spark] class DefaultStringType extends StringType(0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this different from case object StringType extends StringType(0)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case object StringType will have collationId = _collationId, where DefaultStringType will have SQLconf.get.defaultStringType.collationId

Copy link
Contributor Author

@stefankandic stefankandic Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically DefaultStringType will behave same as the StringType(sessionCollation.id), while StringType(0) is just UTF8_BINARY

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so this is kind of StringType(0) with a special marker to indicate that it should be replaced by a default collation later. Can we use a special collation id for this purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, it should be replaced in the analyzer for DDL operations; for others it should just behave as the underlying session collation. I spent some time thinking about it and haven't been able to figure out a better solution for this, but let me know if you think something else might work better.

I added the special id as requested.

case _: V2CreateTablePlan =>
transform(plan, StringType)

case addCols: AddColumns =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we assume this rule ReplaceDefaultStringType will run before ResolveSessionCatalog? Otherwise these commands may be converted to v1 and can't be matched here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, however I am having some issues there. For example, create view can be converted here: https://github.com/stefankandic/spark/blob/b1ff7672cba12750d41d803f0faeb3487d934601/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1563. I talked with Max a bit about this issue in the comment above, do you think there is a good way to catch v1 stuff as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more digging I think it's best we handle views separately, as they need to be treated differently; it's not enough to just replace create/alter view as we still need to resolve default types correctly once the view is queried.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants