Apache Iceberg version
1.6.1 (latest release)
Query engine
None
Please describe the bug 🐞
For a binary column, the Iceberg equality delete loader, `BaseDeleteLoader`, reuses the underlying `ByteBuffer` used to read the delete records. In some situations this overwrites previously read records from the delete file, which in turn produces duplicates in the output.
Unfortunately, I can't share the exact code to reproduce the bug, but here is a table that exhibits the issue: 9a264a21-cc2f-4f13-8e7d-4d899a31ca2f 2.zip. A read from that table will contain duplicates, because the equality delete reader misses one of the 2 records present in the delete file.
In principle, it should be easy to reproduce with a table having a binary column and doing an overwrite on it. Unfortunately, I am new to Iceberg and couldn't find working instructions on how to set up a local environment and do an overwrite using equality deletes. If someone helps me with a local setup, I will be happy to provide a more actionable example.
Here are the pointers in the code to show how/why the overwrite happens:
The Parquet reader configuration for delete files is set up to reuse the `ByteBuffer`: iceberg/data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java, lines 197 to 203 in 09370dd
`ParquetValueReaders` contains the reader that reuses the same `ByteBuffer`: iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java, lines 366 to 380 in 09370dd
`InternalRecordWrapper` copies the pointer to that `ByteBuffer` (instead of its content), leading to the overwrite: iceberg/data/src/main/java/org/apache/iceberg/data/InternalRecordWrapper.java, lines 76 to 78 in 09370dd
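To illustrate the aliasing described in the three pointers above, here is a minimal standalone sketch. It is not the Iceberg code: `ReusedBufferDemo`, `ReusingReader`, `loadByReference`, and `loadWithCopy` are hypothetical names, and the reader is a simplified stand-in that assumes each value fits in a fixed 64-byte buffer. It only shows how keeping a reference to a reused `ByteBuffer` loses earlier records, and how copying the bytes avoids that.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReusedBufferDemo {

  // Hypothetical stand-in for a value reader that reuses one ByteBuffer
  // for every record. Assumes each value fits in 64 bytes.
  private static class ReusingReader {
    private final ByteBuffer reused = ByteBuffer.allocate(64);

    ByteBuffer read(String value) {
      reused.clear();
      reused.put(value.getBytes(StandardCharsets.UTF_8));
      reused.flip();
      return reused; // the same object is returned on every call
    }
  }

  // Keeps only the reference returned by the reader, so every stored
  // "record" points at the same buffer and shows the last value read.
  public static List<ByteBuffer> loadByReference(String... values) {
    ReusingReader reader = new ReusingReader();
    List<ByteBuffer> records = new ArrayList<>();
    for (String value : values) {
      records.add(reader.read(value));
    }
    return records;
  }

  // Copies the buffer content before storing it, so later reads cannot
  // overwrite records that were already loaded.
  public static List<ByteBuffer> loadWithCopy(String... values) {
    ReusingReader reader = new ReusingReader();
    List<ByteBuffer> records = new ArrayList<>();
    for (String value : values) {
      ByteBuffer source = reader.read(value);
      ByteBuffer copy = ByteBuffer.allocate(source.remaining());
      copy.put(source.duplicate()); // duplicate() leaves source's position untouched
      copy.flip();
      records.add(copy);
    }
    return records;
  }

  // Decodes without consuming the buffer, so it can be called repeatedly.
  public static String decode(ByteBuffer buffer) {
    return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
  }

  public static void main(String[] args) {
    List<ByteBuffer> aliased = loadByReference("key-1", "key-2");
    // prints "key-2, key-2": the first record was overwritten
    System.out.println(decode(aliased.get(0)) + ", " + decode(aliased.get(1)));

    List<ByteBuffer> copied = loadWithCopy("key-1", "key-2");
    // prints "key-1, key-2": both records survive
    System.out.println(decode(copied.get(0)) + ", " + decode(copied.get(1)));
  }
}
```

In the aliased case, a set of equality delete keys built from these records would effectively contain only the last key, which matches the symptom of one of the 2 delete records being missed.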
Willingness to contribute