Columns with bit/bytemask null representation should be able to return None for the validity buffer when there are no missing values #90

Open
AlenkaF opened this issue Nov 16, 2022 · 2 comments


AlenkaF commented Nov 16, 2022

I am currently working on the implementation of the dataframe interchange protocol for PyArrow. While testing the current PyArrow implementation (which produces the __dataframe__ object) against the pandas implementation (which consumes it), I noticed that columns that use a bit/bytemask null representation but have no missing values raise an error.

The reason is that Apache Arrow does not create a mask buffer when no missing values are present. As a result, calling .get_buffers()["validity"] on a column of the PyArrow __dataframe__ object that has no missing values returns None, which the protocol specification does not currently handle. See:
https://github.com/pandas-dev/pandas/blob/5c66e65d7b9fef47ccb585ce2fd0b3ea18dc82ea/pandas/core/interchange/from_dataframe.py#L502

For now, we check for columns without missing values and, in that case, describe the column as non-nullable. But we think nullable columns with a bit/bytemask null representation should have the option to return None instead of a buffer.
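
To make the proposal concrete, here is a minimal sketch of how a consumer could treat a missing validity buffer as "no missing values". It is not the pandas implementation: the null_positions helper is hypothetical, the enum is a local stand-in that mirrors the protocol's ColumnNullType values, and the mask is simplified to a plain sequence of byte-mask values rather than a real buffer.

```python
from enum import IntEnum
from typing import List, Optional, Sequence


class ColumnNullType(IntEnum):
    # Local stand-in mirroring the values of the protocol's enum.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def null_positions(
    null_kind: ColumnNullType,
    validity: Optional[Sequence[int]],
    length: int,
    null_value: int = 0,
) -> List[bool]:
    """Return a list of booleans that is True where a value is missing.

    `validity` is a hypothetical simplification: a sequence of byte-mask
    values, or None when the producer did not allocate a mask.
    `null_value` is the mask value (0 or 1) that marks a missing entry.
    """
    if null_kind in (ColumnNullType.USE_BITMASK, ColumnNullType.USE_BYTEMASK):
        if validity is None:
            # Proposed behavior: no mask buffer means nothing is missing.
            return [False] * length
        return [value == null_value for value in validity]
    # Other null kinds are out of scope for this sketch.
    return [False] * length


no_mask = null_positions(ColumnNullType.USE_BYTEMASK, None, 3)
with_mask = null_positions(ColumnNullType.USE_BYTEMASK, [1, 0, 1], 3)
```

With this handling, the column can keep describing itself as nullable even when the producer skipped the mask allocation.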

Member

honno commented Nov 16, 2022

> For now, we check for columns without missing values and, in that case, describe the column as non-nullable. But we think nullable columns with a bit/bytemask null representation should have the option to return None instead of a buffer.

If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? It may even be incorrect to describe an interchange column as having a bit/byte mask when it doesn't have one.


For onlookers, here are the relevant docs on what buf, dtype = Column.get_buffers()["validity"] should currently contain:

- "validity": a two-element tuple whose first element is a buffer
containing mask values indicating missing data and
whose second element is the mask value buffer's
associated dtype. None if the null representation is
not a bit or byte mask.
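
As a small illustration of why a None entry can trip up a consumer, the sketch below unpacks the "validity" entry unconditionally, the same way the buf, dtype = ... snippet above does. The buffers dict is a hypothetical stand-in for the result of Column.get_buffers(), with plain bytes and strings in place of real buffer and dtype objects.

```python
def unpack_validity(buffers):
    """Unpack the "validity" entry without checking for None first."""
    buf, dtype = buffers["validity"]
    return buf, dtype


# Hypothetical stand-in for get_buffers() on a column with no mask allocated.
buffers_without_mask = {
    "data": (b"\x01\x02\x03", "int8"),
    "validity": None,
    "offsets": None,
}

try:
    unpack_validity(buffers_without_mask)
    unpack_failed = False
except TypeError:
    # None cannot be unpacked into two names, so Python raises TypeError.
    unpack_failed = True
```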

@jorisvandenbossche
Member

> > For now, we check for columns without missing values and, in that case, describe the column as non-nullable. But we think nullable columns with a bit/bytemask null representation should have the option to return None instead of a buffer.
>
> If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? It may even be incorrect to describe an interchange column as having a bit/byte mask when it doesn't have one.

That's certainly a possible solution, but personally I find that it feels a bit wrong. The column is nullable in the sense that it "can" have nulls (that's typically how "nullable" is interpreted, I think). The null count just happens to be 0, in which case Arrow can optimize by not allocating the bitmask.
Also, for a datetime64 column, you probably wouldn't change the null type from USE_SENTINEL to NON_NULLABLE when no nulls (NaT) are present (although of course in that case it has no impact on the memory layout).

One corner case where this fallback to non-nullable doesn't necessarily work well is a column with multiple chunks: in pyarrow, one chunk might have a null bitmap while the next chunk might not.
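
The chunk corner case can be sketched without pyarrow. The Chunk class and fallback_null_kind helper below are hypothetical, and the two constants only mirror the protocol's ColumnNullType values. The point is that applying the "no nulls means non-nullable" fallback chunk by chunk yields different null descriptions for the same column.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values mirror the protocol's ColumnNullType enum.
NON_NULLABLE = 0
USE_BITMASK = 3


@dataclass
class Chunk:
    """Hypothetical stand-in for one chunk of a chunked column."""

    values: List[int]
    validity: Optional[List[int]]  # None when no bitmap was allocated

    @property
    def null_count(self) -> int:
        # A 0 in the mask marks a missing value in this sketch.
        return 0 if self.validity is None else self.validity.count(0)


def fallback_null_kind(chunk: Chunk) -> int:
    """Describe a chunk without nulls as non-nullable (the fallback)."""
    return NON_NULLABLE if chunk.null_count == 0 else USE_BITMASK


chunks = [
    Chunk(values=[1, 0], validity=[1, 0]),  # second value is missing
    Chunk(values=[3, 4], validity=None),    # no nulls, no bitmap
]
kinds = [fallback_null_kind(chunk) for chunk in chunks]
# The two chunks of one column now disagree about its null representation.
consistent = len(set(kinds)) == 1
```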
