[SPARK-49992][SQL] Session level collation should not impact DDL queries #48436
base: master
Conversation
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, ImplicitStringType, StringType}

object ResolveImplicitStringTypes extends Rule[LogicalPlan] {
@stefankandic Not all commands have been ported to the V2 API yet, so V1 commands like ClearCacheCommand live in the sql package. You cannot access them from catalyst, and therefore cannot handle them here.
If needed, we could at least create logical nodes for them, see https://issues.apache.org/jira/browse/SPARK-33392. Let me know which commands you need to handle that are not ported to the DS V2 API.
I think we just need create/alter for table/view/function/schema. Is that doable?
Please take a look when you have the time @mihailom-db @cloud-fan
Looks OK; just please make sure all cases are tested properly so we do not miss anything.
 * we can still differentiate it from a regular string type, because in some places default string
 * is not the one with the session collation (e.g. in DDL commands).
 */
private[spark] class DefaultStringType extends StringType(0) {
How is this different from case object StringType extends StringType(0)?
case object StringType will have collationId = _collationId, whereas DefaultStringType will have SQLConf.get.defaultStringType.collationId.
Basically, DefaultStringType will behave the same as StringType(sessionCollation.id), while StringType(0) is just UTF8_BINARY.
OK, so this is a kind of StringType(0) with a special marker to indicate that it should be replaced by a default collation later. Can we use a special collation id for this purpose?
Exactly, it should be replaced in the analyzer for DDL operations; for others it should just behave as the underlying session collation. I spent some time thinking about it and haven't been able to figure out a better solution for this, but let me know if you think something else might work better.
I added the special id as requested.
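The marker pattern discussed above can be sketched with a minimal, self-contained toy model. Names here (SessionConf, baseCollationId) are simplified stand-ins for Spark's SQLConf and StringType internals, not the actual PR code: the point is only that DefaultStringType subclasses StringType(0) yet resolves its collation from the session at read time, so the analyzer can still tell the two apart.

```scala
// Toy model of the DefaultStringType marker pattern (assumed names).
object CollationDemo {
  // Stand-in for SQLConf holding the session-level default collation id.
  object SessionConf {
    var defaultCollationId: Int = 0 // 0 models UTF8_BINARY
  }

  // Plain string type: collation is fixed at construction.
  class StringType(val baseCollationId: Int) {
    def collationId: Int = baseCollationId
  }

  // Extends StringType(0) but defers to the session config, so it behaves
  // like StringType(sessionCollation.id) while remaining distinguishable
  // (e.g. for the analyzer to replace it in DDL plans).
  class DefaultStringType extends StringType(0) {
    override def collationId: Int = SessionConf.defaultCollationId
  }
}
```

With this shape, changing SessionConf.defaultCollationId changes what a DefaultStringType reports, while any plain StringType keeps the collation it was built with.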
case _: V2CreateTablePlan =>
  transform(plan, StringType)

case addCols: AddColumns =>
So we assume this rule ReplaceDefaultStringType will run before ResolveSessionCatalog? Otherwise these commands may be converted to v1 and can't be matched here.
Yes, however I am running into some issues there. For example, create view can be converted here: https://github.com/stefankandic/spark/blob/b1ff7672cba12750d41d803f0faeb3487d934601/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1563. I talked with Max about this issue in the comment above; do you think there is a good way to catch the V1 commands as well?
After some more digging I think it's best we handle views separately, as they need to be treated differently; it's not enough to just replace create/alter view as we still need to resolve default types correctly once the view is queried.
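The rule's overall shape, matching only DDL plans and rewriting the default-string marker there, can be modeled with another self-contained sketch. CreateTable and Project below are toy stand-ins for Spark plan nodes like V2CreateTablePlan and ordinary query operators; the real rule pattern-matches catalyst plans, not these classes.

```scala
// Toy sketch of a ReplaceDefaultStringType-style rule (assumed names).
object ReplaceDefaultStringTypesDemo {
  sealed trait DataType
  case class StringType(collationId: Int) extends DataType
  case object DefaultStringType extends DataType // unresolved marker
  case object IntegerType extends DataType

  sealed trait Plan
  case class CreateTable(columnTypes: Seq[DataType]) extends Plan // "DDL"
  case class Project(columnTypes: Seq[DataType]) extends Plan     // regular query

  val UTF8_BINARY = 0

  // In DDL plans the default string marker resolves to UTF8_BINARY rather
  // than the session collation; non-DDL plans are left untouched here.
  def apply(plan: Plan): Plan = plan match {
    case CreateTable(cols) =>
      CreateTable(cols.map {
        case DefaultStringType => StringType(UTF8_BINARY)
        case other             => other
      })
    case other => other
  }
}
```

Ordering matters for the same reason raised in the review: the rewrite has to fire while the plan is still in the form the rule matches (here, before anything turns CreateTable into a different node).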
What changes were proposed in this pull request?
This PR proposes not using session-level collation in DDL commands, specifically in CREATE TABLE v2 and ALTER TABLE ADD/REPLACE COLUMNS commands.
Why are the changes needed?
The default collation for DDL commands should be tied to the object being created or altered (e.g., table, view, schema) rather than the session-level setting. Since object-level collations are not yet supported, we will assume the UTF8_BINARY collation by default for now.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.