write.metadata.metrics.max-inferred-column-defaults doesn't respect nested columns #11253

Open
2 of 3 tasks
jkolash opened this issue Oct 3, 2024 · 0 comments
Labels
bug Something isn't working

Comments

Contributor

jkolash commented Oct 3, 2024

Apache Iceberg version

1.5.1

Query engine

Spark

Please describe the bug 🐞

write.metadata.metrics.max-inferred-column-defaults only considers top-level columns, not nested columns.

Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
This only looks at the top-level struct, not at all nested structs. We could simply add a counter while we iterate through the fields with TypeUtil.getProjectedIds.
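To make the gap concrete, here is a minimal, self-contained sketch. It does not use the Iceberg API: the Field class and every method name below are hypothetical stand-ins, and the "current behavior" method only models what is described above (keep the first N top-level columns, then let every nested leaf under them receive metrics). The second method sketches the counter idea, stopping once the limit of leaf fields is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NestedMetricsLimit {
  /** Hypothetical stand-in for a schema field; a field with no children is a leaf. */
  static final class Field {
    final String name;
    final List<Field> children;

    Field(String name, Field... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }
  }

  /** Counts the leaf fields under a field (a leaf counts as one). */
  static int countLeaves(Field field) {
    if (field.children.isEmpty()) {
      return 1;
    }
    int total = 0;
    for (Field child : field.children) {
      total += countLeaves(child);
    }
    return total;
  }

  /** Models the reported behavior: keep the first `limit` top-level columns whole. */
  static int leavesAfterTopLevelTruncation(List<Field> columns, int limit) {
    int total = 0;
    for (Field column : columns.subList(0, Math.min(limit, columns.size()))) {
      total += countLeaves(column);
    }
    return total;
  }

  /** Sketch of the counter idea: walk depth-first and stop after `limit` leaves. */
  static List<String> firstLeaves(List<Field> columns, int limit) {
    List<String> selected = new ArrayList<>();
    for (Field column : columns) {
      collect(column, "", limit, selected);
    }
    return selected;
  }

  private static void collect(Field field, String prefix, int limit, List<String> selected) {
    if (selected.size() >= limit) {
      return; // budget exhausted, ignore the remaining fields
    }
    String path = prefix.isEmpty() ? field.name : prefix + "." + field.name;
    if (field.children.isEmpty()) {
      selected.add(path);
      return;
    }
    for (Field child : field.children) {
      collect(child, path, limit, selected);
    }
  }

  /** Example schema: one primitive column plus one struct with 200 leaf fields. */
  static List<Field> exampleSchema() {
    Field[] nested = new Field[200];
    for (int i = 0; i < nested.length; i++) {
      nested[i] = new Field("f" + i);
    }
    return Arrays.asList(new Field("id"), new Field("payload", nested));
  }
}
```

With this example schema and a limit of 100, the top-level truncation model still lets 201 leaf fields through, while the counting traversal stops at 100.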

Simply fixing this would be a behavior change that affects anyone upgrading Iceberg: nested struct fields that previously had metrics would suddenly stop getting them on new writes after the upgrade. We have basically two choices:

  1. Just change the behavior and document it, with a workaround that roughly preserves the existing behavior (for example, setting the value to 1K).
  2. Leave the existing behavior alone, introduce a new config, and deprecate the old one.

Ideally, I think I'd want top-level columns to get metrics first, and then the first fields of each struct, with the structs weighted equally. That way, if we have multiple nested structs, each one gets metrics for its first fields, rather than one large struct consuming the whole budget.
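One possible reading of that prioritization, as a rough sketch rather than a proposed implementation: spend the budget on top-level primitive columns first, then hand out the remainder round-robin, one field per struct per round. The class, the method, and the map-based schema model are all hypothetical, and the model is simplified to a single level of nesting (an empty field list marks a primitive column).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FairMetricsBudget {
  /**
   * Picks which columns get metrics under a fixed budget.
   * Keys are top-level column names in schema order; the value lists a struct's
   * field names, or is empty for a primitive column (simplified, hypothetical model).
   */
  static List<String> select(LinkedHashMap<String, List<String>> columns, int budget) {
    List<String> selected = new ArrayList<>();

    // Pass 1: top-level primitive columns come first, in schema order.
    for (Map.Entry<String, List<String>> column : columns.entrySet()) {
      if (selected.size() >= budget) {
        return selected;
      }
      if (column.getValue().isEmpty()) {
        selected.add(column.getKey());
      }
    }

    // Pass 2: round-robin over the structs, taking each struct's next field per round,
    // so that no single large struct can consume the whole remaining budget.
    boolean addedInRound = true;
    for (int round = 0; addedInRound && selected.size() < budget; round++) {
      addedInRound = false;
      for (Map.Entry<String, List<String>> column : columns.entrySet()) {
        if (selected.size() >= budget) {
          break;
        }
        List<String> fields = column.getValue();
        if (round < fields.size()) {
          selected.add(column.getKey() + "." + fields.get(round));
          addedInRound = true;
        }
      }
    }
    return selected;
  }
}
```

For columns id, a {a1, a2, a3}, ts, b {b1} and a budget of 5, this selects id, ts, a.a1, b.b1, a.a2: both primitives first, then one field from each struct before a takes its second.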

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
jkolash added the bug label Oct 3, 2024
jkolash added a commit to jkolash/iceberg that referenced this issue Oct 4, 2024
Show that we do not bound the number of column statistics if
the nested struct had 200 columns we would end up with over 200
statistics.