[Python] Interchange pa.Table's Column.null_count doesn't count NaNs #34774

Closed
honno opened this issue Mar 29, 2023 · 2 comments

Comments

honno commented Mar 29, 2023

Describe the bug, including details regarding any error messages, version, and platform.

It seems the interchange protocol's Column.null_count (relevant spec) behaves incorrectly for a pa.Table:

>>> import pyarrow as pa
>>> pa.__version__
'12.0.0.dev304'  # from https://pypi.fury.io/arrow-nightlies/
>>> df = pa.table([pa.array([float("nan")], type=pa.float64())], ["foo"])
>>> dfi = df.__dataframe__()
>>> col = dfi.get_column(0)
>>> col.null_count
0  # should be 1

I assume this is because Arrow does not treat NaNs as nulls, which semantically makes sense, but in the interchange protocol it should—see vaexio/vaex#2120 for a related discussion.

See pandas for expected behaviour

>>> import pandas as pd
>>> df = pd.DataFrame({"foo": [float("nan")]})
>>> dfi = df.__dataframe__()
>>> col = dfi.get_column(0)
>>> col.null_count
1

cc @AlenkaF (let me know if not to tag you on things! coincidentally I was working on data-apis/dataframe-interchange-tests#20 today when Ralf commented heh.)

Component(s)

Python

jorisvandenbossche changed the title from "Interchange pa.Table's Column.null_count doesn't count NaNs" to "[Python] Interchange pa.Table's Column.null_count doesn't count NaNs" on Mar 29, 2023

AlenkaF (Member) commented Mar 30, 2023

> cc @AlenkaF (let me know if not to tag you on things! coincidentally I was working on data-apis/dataframe-interchange-tests#20 today when Ralf commented heh.)

Do tag me please! =)
Very cool you both jumped on it so quickly hehe.

> I assume this is because Arrow does not treat NaNs as nulls, which semantically makes sense, but in the interchange protocol it should—see vaexio/vaex#2120 for a related discussion.

But we agree with Maarten 😊 vaexio/vaex#2120 (comment)
There would be loss of information for all the libraries that actually differentiate between nan and null, right?
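
To illustrate the loss-of-information concern, here is a minimal standard-library sketch. The two helper functions are hypothetical and are not part of pyarrow or the interchange protocol; plain Python lists stand in for columns, with None playing the role of a null.

```python
import math


def null_count_strict(values):
    """Count only missing entries (None); NaN stays a valid float value."""
    return sum(v is None for v in values)


def null_count_nan_as_null(values):
    """Count missing entries and NaNs together, as if NaN meant 'missing'."""
    return sum(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )


with_nan = [1.0, float("nan"), 3.0]   # contains a NaN, no null
with_null = [1.0, None, 3.0]          # contains a null, no NaN

# Treating them separately keeps the two columns distinguishable.
strict = (null_count_strict(with_nan), null_count_strict(with_null))

# Folding NaN into null makes both columns report the same count.
merged = (null_count_nan_as_null(with_nan), null_count_nan_as_null(with_null))

print(strict, merged)
```

With the merged count, a consumer can no longer tell from null_count alone which column held a NaN and which held a genuine null.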

AlenkaF (Member) commented Apr 2, 2023

Closing this issue as we have received confirmation from the protocol side that nan and null can be treated separately, see discussion in data-apis/dataframe-api#126

AlenkaF closed this as completed Apr 2, 2023