However, if I try to read the file back with DuckDB, it fails.
$ duckdb --version
v1.1.1 af39bd0dcf
$ duckdb -c 'SELECT * FROM "repro-uint8.parquet"'
Invalid Input Error: Failed to cast value: Type UINT32 with value 4294967295 can't be cast because the value is out of range for the destination type UINT8
Likewise with Tad, which doesn't show any kind of error, just a blank screen.
Repro-Tad.mp4
As implied by the DuckDB error message, it seems the uint8 value is the root cause, since if I stick to the default integer, the problem doesn't appear.
Of course, even though I can show multiple tools choking on it, the reader at https://parquetreader.com loads the file with the uint8 without complaint, so it's hard for me to guess whether this is a bug or whether some tools just have spotty type coverage.
That said, DuckDB has no problem with its own Parquet-that-contains-uint8.
$ duckdb my.db
v1.1.1 af39bd0dcf
Enter ".help" for usage hints.
D CREATE TABLE repro (device_floor UTINYINT);
D INSERT INTO repro (device_floor) values(3);
D select * from repro;
┌──────────────┐
│ device_floor │
│ uint8 │
├──────────────┤
│ 3 │
└──────────────┘
D copy repro to "uint8-from-duckdb.parquet" (FORMAT PARQUET);
$ duckdb -c 'SELECT * FROM "uint8-from-duckdb.parquet"'
┌──────────────┐
│ device_floor │
│ uint8 │
├──────────────┤
│ 3 │
└──────────────┘
Tad loads that one ok too.
philrz changed the title from "Parquet output that can't be read by other tools" to "Parquet output with uint8 can't be read by some other tools" on Oct 8, 2024
After reviewing these symptoms in a team meeting, @nwt said it's likely this would be a bug in the Arrow Parquet writer that Zed's Parquet writer depends on.
This does look like an Arrow Parquet bug. Note the bogus min value in the page statistics here. That's the same value DuckDB complains about.
$ echo '{device_floor: 3 (uint8)}' | super -f parquet -o repro-uint8.parquet -
$ parquet pages repro-uint8.parquet
Column: device_floor
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict _ _ 1 4.00 B 4 B
0-1 data _ R 1 9.00 B 9 B 0 "4294967295" / "3"
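To make it concrete why that min value is suspicious, here is a minimal Python sketch of the arithmetic only. It does not reproduce or diagnose the Arrow writer; the names (`UINT8_MAX`, `fits_uint8`, `bogus_min`, and so on) are made up for illustration. It shows that 4294967295 is the largest unsigned 32-bit value, that it is the same bit pattern as -1 in a signed 32-bit integer, that it cannot fit in a uint8, and that the reported min is larger than the reported max. Whether the writer actually produces it through a signed/unsigned mix-up or some other path is an assumption this sketch does not settle.

```python
import struct

UINT8_MAX = 2**8 - 1      # 255, largest value a uint8 can hold
UINT32_MAX = 2**32 - 1    # 4294967295, largest value a uint32 can hold

# Values copied from the "min / max" column of the `parquet pages` output.
bogus_min = 4294967295
reported_max = 3


def fits_uint8(value: int) -> bool:
    """Return True if value is representable as an unsigned 8-bit integer."""
    return 0 <= value <= UINT8_MAX


# Reinterpret the 4 bytes of a signed 32-bit -1 as an unsigned 32-bit integer.
minus_one_as_uint32 = struct.unpack("<I", struct.pack("<i", -1))[0]

print(bogus_min == UINT32_MAX)           # the min is exactly the uint32 maximum
print(minus_one_as_uint32 == bogus_min)  # same bit pattern as int32 -1
print(fits_uint8(bogus_min))             # out of range for uint8, as DuckDB reports
print(fits_uint8(reported_max))          # the real value 3 is fine
print(bogus_min > reported_max)          # statistics are inconsistent: min > max
```

A reader that checks or casts the statistics against the declared uint8 logical type would reject this file, while a reader that ignores page statistics could load it without complaint, which would be consistent with the mixed results across tools described in this issue.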
tl;dr
The file generated by the following can be read back in with Zed tools but not some other tools.
Details
Repro is with Zed commit b05e70b.
The Parquet file generated is readable by zq itself. However, if I try to read the file back with DuckDB, it fails.