Meaning of Column.offset? #67

Open
maartenbreddels opened this issue Sep 16, 2021 · 5 comments

@maartenbreddels

Is its use similar to Arrow's, where slicing a string array keeps the same backing buffers, and the offset and length of the column convey which part of the buffer should be used?
If that is the case, it can always be 0 for NumPy and primitive Arrow arrays (except Arrow booleans, since they are bit-packed), since we can always slice them, right?
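
To make the boolean exception concrete, here is a minimal sketch of why a bit-packed column needs a separate offset. It assumes Arrow-style packing (least-significant bit first); `pack_bits` and `read_bits` are made-up helper names, not part of any library.

```python
# Sketch only: bit-packed booleans, LSB first (an assumption matching Arrow's layout).
# pack_bits / read_bits are hypothetical helpers for illustration.

def pack_bits(values):
    """Pack a list of bools into bytes, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def read_bits(buf, offset, length):
    """Read `length` bools starting at logical element `offset`."""
    return [
        bool((buf[(offset + i) // 8] >> ((offset + i) % 8)) & 1)
        for i in range(length)
    ]


values = [True, False, True, True, False, False, True, False, True, True]
buf = pack_bits(values)

# Element 3 does not start on a byte boundary, so this slice cannot be
# expressed as a byte-level view of `buf`; the offset must travel with it.
sliced = read_bits(buf, offset=3, length=5)
```

For byte-aligned types a producer can hand out a sliced view and report an offset of 0; for bit-packed data that is only possible when the slice happens to start on a multiple of 8.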

@jorisvandenbossche
Member

An alternative meaning could be that, if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?

(personally I don't really like that we use the same class for both ..)

@jorisvandenbossche
Member

The docstring seems to indicate it's indeed for chunks:

@property
def offset(self) -> int:
    """
    Offset of first element.
    May be > 0 if using chunks; for example for a column with N chunks of
    equal size M (only the last chunk may be shorter),
    ``offset = n * M``, ``n = 0 .. N-1``.
    """
    pass
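
The arithmetic in that docstring can be sketched as follows. This is only an illustration of `offset = n * M`; the function name `chunk_bounds` and the choice of `M = ceil(total_len / n_chunks)` are assumptions, not something the spec prescribes.

```python
import math


def chunk_bounds(total_len, n_chunks):
    """Return (offset, length) per chunk, with M = ceil(total_len / n_chunks).

    Hypothetical helper: the spec only states offset = n * M for n = 0 .. N-1.
    """
    m = math.ceil(total_len / n_chunks)
    bounds = []
    for n in range(n_chunks):
        offset = n * m
        # Only the last chunk may be shorter than M.
        length = max(0, min(m, total_len - offset))
        bounds.append((offset, length))
    return bounds


bounds = chunk_bounds(10, 3)  # [(0, 4), (4, 4), (8, 2)]
```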

@maartenbreddels
Author

So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?

@jorisvandenbossche
Member

Ah, sorry, I wasn't thinking about the case where your original data isn't chunked but you could return it in chunks.

Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer? Given that it says "may be > 0 if using chunks", it might actually be the second (your interpretation).

@rgommers
Member

It was indeed meant for supporting an offset into a data buffer - this could be for chunking, or perhaps for other reasons like returning a subset of rows from the original dataframe/buffer and not wanting to create a new buffer.

> So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?

Yes, indeed, although in practice I think chunks normally come from different buffers, because if all the data fits in a single buffer then chunking isn't necessary.
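
A minimal sketch of that single-buffer approach, using only the standard library. `ToyColumn` is a hypothetical stand-in whose attribute and method names loosely mirror the protocol; it is not the real API surface, and `to_list` is an assumed consumer-side helper.

```python
from array import array


class ToyColumn:
    """Hypothetical stand-in for a protocol Column (illustration only)."""

    def __init__(self, buffer, offset, size):
        self._buffer = buffer  # shared between chunks, never copied
        self.offset = offset   # index of the first logical element in the buffer
        self.size = size       # number of logical elements

    def get_buffer(self):
        return self._buffer

    def get_chunks(self, n_chunks):
        m = -(-self.size // n_chunks)  # ceiling division
        for n in range(n_chunks):
            start = n * m
            length = min(m, self.size - start)
            if length > 0:
                # Same buffer every time; only offset and size differ.
                yield ToyColumn(self._buffer, self.offset + start, length)


def to_list(col):
    """Consumer side: offset and size select which part of the buffer to read."""
    view = memoryview(col.get_buffer())
    return view[col.offset:col.offset + col.size].tolist()


data = array("q", range(10))
column = ToyColumn(data, offset=0, size=len(data))
chunks = list(column.get_chunks(3))
```

Each chunk here reports offsets 0, 4 and 8 against the same `data` object, so the offset determines how the buffer is interpreted rather than just describing the chunk's position.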

> Is its use similar to Arrow's, where slicing a string array keeps the same backing buffers, and the offset and length of the column convey which part of the buffer should be used?

Same basic principle, but Column.offset is just a single value. Column.get_buffers returns an "offsets" buffer that's for variable-length data:

            - "offsets": a two-element tuple whose first element is a buffer
                         containing the offset values for variable-size binary
                         data (e.g., variable-length strings) and whose second
                         element is the offsets buffer's associated dtype. None
                         if the data buffer does not have an associated offsets
                         buffer.
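
As a rough sketch of how the two kinds of offset could combine for strings: the example below assumes Arrow-style semantics, where `Column.offset` counts logical elements and therefore indexes into the "offsets" buffer, whose entries are byte positions in the data buffer. The thread does not spell this out, and `decode_strings` is a made-up helper.

```python
# Sketch under assumptions: Column.offset is in logical elements and indexes the
# "offsets" buffer; the offsets buffer holds byte positions into the data buffer.

def decode_strings(data, offsets, col_offset, length):
    """Decode `length` UTF-8 strings starting at logical element `col_offset`."""
    out = []
    for i in range(col_offset, col_offset + length):
        start, end = offsets[i], offsets[i + 1]
        out.append(data[start:end].decode("utf-8"))
    return out


data = "abbcccdddd".encode("utf-8")  # "a", "bb", "ccc", "dddd" back to back
offsets = [0, 1, 3, 6, 10]           # one more entry than there are strings

# A column with offset 1 and length 2 over the same two buffers.
sliced = decode_strings(data, offsets, col_offset=1, length=2)
```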

> An alternative meaning could be that, if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?

That's not it; I hope the docstring is clear enough. If not, we should extend it.

> (personally I don't really like that we use the same class for both ..)

Agreed, it was a bit of a compromise between "I want one class per concept" and "I want as few classes as possible" opinions.

> Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer?

The latter.
