Parquet output with uint8 can't be read by some other tools #5326

Open
philrz opened this issue Oct 8, 2024 · 2 comments

philrz commented Oct 8, 2024

tl;dr

The Parquet file generated by the following command can be read back with Zed tools, but not with some other tools.

$ echo '{device_floor: 3 (uint8)}' | zq -f parquet -o repro-uint8.parquet -

Details

Repro is with Zed commit b05e70b.

The Parquet file generated is readable by zq itself.

$ zq -version
Version: v1.18.0-18-gb05e70bd

$ echo '{device_floor: 3 (uint8)}' | zq -f parquet -o repro-uint8.parquet -

$ zq -z -i parquet repro-uint8.parquet
{device_floor:3(uint8)}

However, if I try to read the file back with DuckDB, it fails.

$ duckdb --version
v1.1.1 af39bd0dcf

$ duckdb -c 'SELECT * FROM "repro-uint8.parquet"'
Invalid Input Error: Failed to cast value: Type UINT32 with value 4294967295 can't be cast because the value is out of range for the destination type UINT8

Tad fails on it as well, though it doesn't show any kind of error, just a blank screen.

Repro-Tad.mp4

As implied by the DuckDB error message, it seems the uint8 value is the root cause, since if I stick to the default integer, the problem doesn't appear.
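To make the DuckDB complaint concrete, here is a minimal Python sketch of the range check the error describes. The helper name `fits_uint8` is hypothetical and only for illustration; this is not DuckDB's actual code. It shows that the value written (3) fits in a uint8, while the value from the error message does not and happens to equal the largest unsigned 32-bit integer.

```python
# Value quoted in the DuckDB error message above.
reported = 4294967295

UINT8_MAX = 2**8 - 1    # largest unsigned 8-bit value
UINT32_MAX = 2**32 - 1  # largest unsigned 32-bit value


def fits_uint8(value: int) -> bool:
    """Return True if value is representable as an unsigned 8-bit integer."""
    return 0 <= value <= UINT8_MAX


print(fits_uint8(3))           # the value that was actually written
print(fits_uint8(reported))    # the value DuckDB tripped over
print(reported == UINT32_MAX)  # whether it is exactly the uint32 maximum
```

So the failing value never appeared in the input data; it has to come from somewhere else in the file.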

$ echo '{device_floor: 3}' | zq -f parquet -o repro-default.parquet -

$ duckdb -c 'SELECT * FROM "repro-default.parquet"'
┌──────────────┐
│ device_floor │
│    int64     │
├──────────────┤
│            3 │
└──────────────┘
Tad-Working.mp4

Of course, even though I can show multiple tools choking on it, the reader at https://parquetreader.com loads the file with the uint8 without complaint, so it's hard for me to guess whether this is a bug or whether some tools just have spotty type coverage.

That said, DuckDB has no problem with its own Parquet-that-contains-uint8.

$ duckdb my.db
v1.1.1 af39bd0dcf
Enter ".help" for usage hints.
D CREATE TABLE repro (device_floor UTINYINT);
D INSERT INTO repro (device_floor) values(3);
D select * from repro;
┌──────────────┐
│ device_floor │
│    uint8     │
├──────────────┤
│            3 │
└──────────────┘
D copy repro to "uint8-from-duckdb.parquet" (FORMAT PARQUET);

$ duckdb -c 'SELECT * FROM "uint8-from-duckdb.parquet"'
┌──────────────┐
│ device_floor │
│    uint8     │
├──────────────┤
│            3 │
└──────────────┘

Tad loads that one ok too.

@philrz philrz changed the title Parquet output that can't be read by other tools Parquet output with uint8 can't be read by some other tools Oct 8, 2024

philrz commented Oct 8, 2024

After reviewing these symptoms in a team meeting, @nwt said this is likely a bug in the Arrow Parquet writer that Zed's Parquet writer depends on.

nwt commented Oct 23, 2024

This does look like an Arrow Parquet bug. Note the bogus min value in the page statistics here. That's the same value DuckDB complains about.

$ echo '{device_floor: 3 (uint8)}' | super -f parquet -o repro-uint8.parquet -
$ parquet pages repro-uint8.parquet

Column: device_floor
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  1       4.00 B     4 B
  0-1    data  _ R  1       9.00 B     9 B                 0       "4294967295" / "3"
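
As a rough illustration of why that min looks bogus, the Python sketch below checks the min/max pair from the page listing above. The helper `stats_ordered` is hypothetical. The reinterpretation as a signed 32-bit integer is only an observation about the bit pattern (Parquet stores uint8 columns in its 32-bit integer physical type), not a confirmed diagnosis of the Arrow bug.

```python
import struct


def stats_ordered(min_value: int, max_value: int) -> bool:
    """Return True if the statistics satisfy the basic invariant min <= max."""
    return min_value <= max_value


# min / max as printed by `parquet pages` for page 0-1.
page_min, page_max = 4294967295, 3

print(stats_ordered(page_min, page_max))  # invariant check on the reported stats
print(stats_ordered(3, 3))                # what a single-row page holding 3 would give

# Reinterpret the same 32 bits as a signed integer (assumption: this may or
# may not be related to how the writer produced the value).
as_signed = struct.unpack("<i", struct.pack("<I", page_min))[0]
print(as_signed)
```

A reader that trusts the statistics and tries to cast the min to the column's uint8 type would hit exactly the kind of out-of-range error DuckDB reports.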
