Spark vectorized read of Parquet produces incorrect result for a decimal column #11221
Status: Open · Labels: bug
In Iceberg 1.1, a different bug occurs when reading the Iceberg table; the read fails altogether due to:
As far as I can tell, this has either never worked or never worked correctly.

Please rename the file (I used a .txt extension just to work around my Mac preventing me from uploading it).

cc @nastra
Apache Iceberg version: main (development)
Query engine: Spark
Please describe the bug 🐞
The bug is present in Iceberg 1.2 and later (and is in main).
A customer uses Impala to write Parquet data into an Iceberg table. We have a sample Parquet file written by Impala. It contains a single decimal(38, 0) column.
Spark is able to read the Parquet file correctly (note this is Spark's own Parquet read path, not Spark Iceberg support):
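For reference, reading the file through Spark's native Parquet source can be sketched as follows (the file path is a placeholder, not the actual path from the report):

```sql
-- Spark's own Parquet read path (not the Iceberg reader); path is illustrative
SELECT * FROM parquet.`/path/to/impala_file.parquet`;
```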
However, when we create an Iceberg table, add the file to it (using the Spark `add_files` procedure), and then query the table, we get incorrect results: the values do show up, but each value has 15 extra zeros appended.
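The reproduction can be sketched in Spark SQL like this; the catalog, database, table, and path names are assumptions, and the column type follows the decimal(38, 0) schema described above:

```sql
-- Illustrative names; schema matches the single decimal(38, 0) column
CREATE TABLE my_catalog.db.decimal_test (d decimal(38, 0)) USING iceberg;

-- Register the existing Impala-written file without rewriting it
CALL my_catalog.system.add_files(
  table => 'db.decimal_test',
  source_table => '`parquet`.`/path/to/impala_file.parquet`'
);

-- Under the vectorized reader, values come back with 15 extra zeros
SELECT * FROM my_catalog.db.decimal_test;
```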
This goes through the vectorized Parquet read path, as `read.parquet.vectorization.enabled` is true by default. When I set it to false in the table properties and query the table again, the results are correct, so the bug is in the vectorized read path.
Note that when I write the DataFrame from reading the original Parquet file back out into another Iceberg table (with the same schema), the file written by Spark has a different encoding:
Querying this other table (with Parquet written by Spark) works fine, even using vectorized read.
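The round-trip that produces a correctly readable table can be sketched as a CTAS (names and path again illustrative):

```sql
-- Rewriting the data through Spark produces a different (working) decimal encoding
CREATE TABLE my_catalog.db.decimal_rewritten USING iceberg AS
SELECT * FROM parquet.`/path/to/impala_file.parquet`;

-- Correct results, even with vectorized read enabled
SELECT * FROM my_catalog.db.decimal_rewritten;
```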
Willingness to contribute