Columns with bit/bytemask null representation should be able to return None for validity buffer when there is no missing values #90

AlenkaF · 2022-11-16T07:46:42Z

I am currently working on the implementation of the dataframe interchange protocol for PyArrow. While testing the current PyArrow implementation as a producer of the __dataframe__ object against the pandas implementation as a consumer, I noticed that columns that use a bit/bytemask null representation but contain no missing values raise an error.

The reason for this is that Apache Arrow does not create a mask buffer when there are no missing values present. Therefore, calling .get_buffers()["validity"] on a column of the PyArrow __dataframe__ object that has no missing values returns None, which is currently not handled by the protocol specification. See:
https://github.com/pandas-dev/pandas/blob/5c66e65d7b9fef47ccb585ce2fd0b3ea18dc82ea/pandas/core/interchange/from_dataframe.py#L502

For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemasks null representation to return None instead of a buffer.
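
The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the PyArrow implementation: `describe_null` here is a hypothetical standalone helper operating on a plain Python sequence, and only the `ColumnNullType` values are taken from the protocol specification.

```python
from enum import IntEnum
from typing import Any, Optional, Sequence, Tuple


class ColumnNullType(IntEnum):
    # Values as defined in the interchange protocol specification.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def describe_null(
    values: Sequence[Optional[Any]],
) -> Tuple[ColumnNullType, Optional[int]]:
    """Hypothetical producer-side helper mirroring the workaround."""
    null_count = sum(v is None for v in values)
    if null_count == 0:
        # No validity bitmap exists, so the column is reported as
        # non-nullable even though its type could hold nulls.
        return ColumnNullType.NON_NULLABLE, None
    # Arrow validity bitmaps use a 0 bit to mark a missing value.
    return ColumnNullType.USE_BITMASK, 0
```

With this fallback the reported null type depends on the data rather than on the column's type, which is the behaviour the issue asks to avoid.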

honno · 2022-11-16T10:19:29Z

> For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemasks null representation to return None instead of a buffer.

If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? Like even it may be incorrect to describe an interchange column as having a bit/byte-mask when it doesn't have a bit/byte-mask.

For onlookers, the relevant docs for what buf, dtype = Column.get_buffers()["validity"] currently should contain:

dataframe-api/protocol/dataframe_protocol.py, lines 353 to 357 in aa6fe7d:

    - "validity": a two-element tuple whose first element is a buffer
                  containing mask values indicating missing data and
                  whose second element is the mask value buffer's
                  associated dtype. None if the null representation is
                  not a bit or byte mask.
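
To make the proposal concrete, here is a rough sketch of how a consumer might treat a missing validity entry. It is an assumption-laden illustration only: `validity_to_bools` is a hypothetical helper, the buffer is simplified to raw `bytes` rather than the protocol's `(buffer, dtype)` tuple, and the bit layout assumed is Arrow's (least-significant bit first, 1 = valid).

```python
from typing import List, Optional


def validity_to_bools(
    validity: Optional[bytes], length: int, offset: int = 0
) -> List[bool]:
    """Hypothetical consumer helper: True means the value is present."""
    if validity is None:
        # Proposed behaviour: no validity buffer means no missing values.
        return [True] * length
    # Assumed Arrow-style bitmap: LSB first, 1 = valid, 0 = null.
    return [
        bool((validity[(offset + i) // 8] >> ((offset + i) % 8)) & 1)
        for i in range(length)
    ]
```

Under the current wording of the spec, the `None` branch would only be reachable for columns whose null representation is not a bit or byte mask.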

jorisvandenbossche · 2022-11-16T12:52:37Z

> > For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemasks null representation to return None instead of a buffer.
>
> If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? Like even it may be incorrect to describe an interchange column as having a bit/byte-mask when it doesn't have a bit/byte-mask.

That's certainly a possible solution, but I personally find that it feels a bit wrong. The column is nullable, in the sense that it "can" have nulls (that's typically how "nullable" is interpreted, I think). The null count just happens to be 0, in which case Arrow can optimize this by not allocating the bitmask.
Also for a datetime64 column, you probably won't change the null type from USE_SENTINEL to NON_NULLABLE if there are no nulls (NaT) present (although of course here it has no impact on the memory layout).

One corner case where this fallback to non-nullable doesn't necessarily work well is a column with multiple chunks: in pyarrow, one chunk might have a null bitmap while the next chunk might not have one.
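
The chunked corner case can be illustrated with a small sketch. It assumes the fallback is applied per chunk and models chunks as plain Python lists; `chunk_null_types` is a hypothetical helper, and only the numeric null-type values come from the protocol specification.

```python
from typing import Any, List, Optional, Sequence

# ColumnNullType values from the protocol specification.
NON_NULLABLE = 0
USE_BITMASK = 3


def chunk_null_types(chunks: Sequence[Sequence[Optional[Any]]]) -> List[int]:
    """Hypothetical: null type each chunk would report under the fallback."""
    return [
        USE_BITMASK if any(v is None for v in chunk) else NON_NULLABLE
        for chunk in chunks
    ]


# Two chunks of the same logical column: only the first contains a null.
null_types = chunk_null_types([[1, None, 3], [4, 5, 6]])
```

Here the two chunks of one column would describe themselves differently, so a consumer could not rely on a single null representation for the whole column.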

rgommers added the interchange-protocol label Dec 8, 2022
