FEAT-#4144: Implement dataframe exchange protocol for pandas storage format #4150

Merged on Mar 14, 2022 (34 commits)
Commits
f144a23
FEAT-#4144: Implement dataframe exchange protocol
YarShev Feb 4, 2022
866856a
Fix some dostrings, and some renamings
YarShev Feb 4, 2022
70e2ddf
Move the protocol to lower layer
YarShev Feb 7, 2022
d8aca3f
Implement methods for DataFrame
YarShev Feb 7, 2022
db884cd
Move the protocol in df.pandas; impl some column methods
YarShev Feb 8, 2022
56b631e
Some more fixes, impls, moves
YarShev Feb 10, 2022
3ac61a2
Some fixes
YarShev Feb 11, 2022
5bcdfbc
Some fixes
YarShev Feb 14, 2022
88f6d6a
Apply comments
YarShev Feb 14, 2022
53c5592
Some fixes taking into account some tests
YarShev Feb 21, 2022
9232312
Some fixes
YarShev Feb 22, 2022
115bbd9
Refactor
YarShev Feb 22, 2022
b927b5d
Some fixes
YarShev Feb 22, 2022
a71b6c3
Some fixes
YarShev Feb 22, 2022
f11fd7a
fix comments
YarShev Feb 24, 2022
4c938af
Rebase on master and refactor
YarShev Feb 25, 2022
eb6c9aa
Refactor
YarShev Mar 1, 2022
32249bd
Add general tests, some fixes
YarShev Mar 3, 2022
fb1eda3
Fix lgtm warning
YarShev Mar 3, 2022
c79c1ee
Remove from_dataframe impl
YarShev Mar 4, 2022
c7477ad
Change metadata return value
YarShev Mar 4, 2022
8105e3f
Simplfy new_lengths computation
YarShev Mar 4, 2022
202300b
Use DTypeKind directly
YarShev Mar 4, 2022
4f2bc6e
Add a comment on pandas.RangeIndex(1) usage
YarShev Mar 4, 2022
60477dd
Fix metadata for cat dtype
YarShev Mar 5, 2022
b2fa39b
Return offset that is always equal to 0
YarShev Mar 5, 2022
86bb363
Fix describe_categorical
YarShev Mar 5, 2022
65fea9e
Use specific exceptions for unsuitable buffers
YarShev Mar 5, 2022
5e385b8
Fix lgtm warning
YarShev Mar 5, 2022
44dfef8
Add docstrings for the exceptions
YarShev Mar 5, 2022
c6574f7
Apply suggestions from code review
YarShev Mar 5, 2022
4ad1703
Address comments
YarShev Mar 11, 2022
b3beb79
Address comments
YarShev Mar 14, 2022
831e8a4
Address a comment
YarShev Mar 14, 2022
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -466,6 +466,7 @@ jobs:
- run: python -m pytest modin/pandas/test/test_io.py --verbose
- run: python -m pytest modin/experimental/pandas/test/test_io_exp.py
- run: pip install "dfsql>=0.4.2" "pyparsing<=2.4.7" && pytest modin/experimental/sql/test/test_sql.py
- run: pytest modin/test/exchange/dataframe_protocol/test_general.py
- uses: codecov/codecov-action@v2

test-experimental:
1 change: 1 addition & 0 deletions .github/workflows/push.yml
@@ -168,6 +168,7 @@ jobs:
- run: python -m pytest -n 2 modin/pandas/test/test_general.py
- run: python -m pytest modin/pandas/test/test_io.py
- run: python -m pytest modin/experimental/pandas/test/test_io_exp.py
- run: pytest modin/test/exchange/dataframe_protocol/test_general.py
- uses: codecov/codecov-action@v2

test-windows:
2 changes: 1 addition & 1 deletion modin/conftest.py
@@ -234,7 +234,7 @@ def from_arrow(cls, at, data_cls):
def free(self):
pass

-def to_dataframe(self, nan_as_null: bool = False, allow_copy: bool = True) -> dict:
+def to_dataframe(self, nan_as_null: bool = False, allow_copy: bool = True):
raise NotImplementedError(
"The selected execution does not implement the DataFrame exchange protocol."
)
@@ -16,7 +16,3 @@

See more in https://data-apis.org/dataframe-protocol/latest/index.html.
"""

-from .dataframe import ProtocolDataframe
-
-__all__ = ["ProtocolDataframe"]
@@ -330,6 +330,10 @@ def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["ProtocolColumn
------
DataFrame
A ``DataFrame`` object(s).

Raises
------
RuntimeError
    If ``n_chunks`` is not a multiple of ``self.num_chunks()``.
"""
pass

@@ -539,5 +543,9 @@ def get_chunks(
------
ProtocolDataframe
A ``ProtocolDataframe`` object(s).

Raises
------
RuntimeError
    If ``n_chunks`` is not a multiple of ``self.num_chunks()``.
"""
pass
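
The `Raises` note added to both `get_chunks` docstrings states one rule: a consumer may ask for a finer subdivision, but only by an integer multiple of the existing chunk count. A minimal sketch of that check over plain Python lists follows. `split_chunks` is a hypothetical helper written for illustration; it is not part of Modin or of the protocol, and the even split with a shorter last piece is an assumption.

```python
from typing import Iterator, List, Optional, Sequence


def split_chunks(
    chunks: Sequence[List[int]], n_chunks: Optional[int] = None
) -> Iterator[List[int]]:
    """Yield ``n_chunks`` pieces by subdividing every existing chunk.

    Hypothetical helper: it mirrors only the documented rule that
    ``n_chunks`` must be a multiple of the current number of chunks.
    """
    current = len(chunks)
    if n_chunks is None or n_chunks == current:
        # Nothing to subdivide: hand back the chunks as they are.
        yield from chunks
        return
    if n_chunks <= 0 or n_chunks % current != 0:
        raise RuntimeError(
            f"n_chunks={n_chunks} is not a multiple of num_chunks={current}"
        )
    pieces_per_chunk = n_chunks // current
    for chunk in chunks:
        # Ceiling division: earlier pieces are full, the last may be shorter.
        step = -(-len(chunk) // pieces_per_chunk)
        for i in range(pieces_per_chunk):
            yield chunk[i * step : (i + 1) * step]


existing = [[1, 2, 3, 4], [5, 6]]
finer = list(split_chunks(existing, 4))
```

Because the function is a generator, the `RuntimeError` surfaces when the result is first iterated, not when `split_chunks` is called.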
33 changes: 33 additions & 0 deletions modin/core/dataframe/pandas/dataframe/dataframe.py
@@ -2826,3 +2826,36 @@ def finalize(self):
that were used to build it.
"""
self._partition_mgr_cls.finalize(self._partitions)

def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
"""
Get a Modin DataFrame that implements the dataframe exchange protocol.

See more about the protocol in https://data-apis.org/dataframe-protocol/latest/index.html.

Parameters
----------
nan_as_null : bool, default: False
A keyword intended for the consumer to tell the producer
to overwrite null values in the data with ``NaN`` (or ``NaT``).
This currently has no effect; once support for nullable extension
dtypes is added, this value should be propagated to columns.
allow_copy : bool, default: True
A keyword that defines whether or not the library is allowed
to make a copy of the data. For example, copying data would be necessary
if a library supports strided buffers, given that this protocol
specifies contiguous buffers. Currently, if the flag is set to ``False``
and a copy is needed, a ``RuntimeError`` will be raised.

Returns
-------
ProtocolDataframe
A dataframe object following the dataframe protocol specification.
"""
from modin.core.dataframe.pandas.exchange.dataframe_protocol.dataframe import (
PandasProtocolDataframe,
)

return PandasProtocolDataframe(
self, nan_as_null=nan_as_null, allow_copy=allow_copy
)
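
The `__dataframe__` method above is the entry point a consumer library calls to obtain an exchange object. The sketch below shows that handshake with toy classes. `ToyFrame`, `ToyProtocolFrame` and `describe` are hypothetical names invented for this example; only `__dataframe__`, `num_columns` and `column_names` come from the protocol, and the real `PandasProtocolDataframe` does considerably more.

```python
from typing import Any, Dict, Iterable, List


class ToyProtocolFrame:
    """Hypothetical stand-in for an exchange object."""

    def __init__(
        self,
        data: Dict[str, List[Any]],
        nan_as_null: bool = False,
        allow_copy: bool = True,
    ) -> None:
        self._data = data
        # Both flags are only stored here; a real producer would pass them
        # down to its columns and buffers.
        self._nan_as_null = nan_as_null
        self._allow_copy = allow_copy

    def num_columns(self) -> int:
        return len(self._data)

    def column_names(self) -> Iterable[str]:
        return list(self._data)


class ToyFrame:
    """Hypothetical producer exposing the protocol entry point."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return ToyProtocolFrame(
            self._data, nan_as_null=nan_as_null, allow_copy=allow_copy
        )


def describe(obj: Any, allow_copy: bool = True) -> Dict[str, Any]:
    """Consumer side: depend only on the protocol, not on the producer type."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the dataframe exchange protocol")
    exchange = obj.__dataframe__(allow_copy=allow_copy)
    return {
        "num_columns": exchange.num_columns(),
        "column_names": list(exchange.column_names()),
    }


summary = describe(ToyFrame({"a": [1, 2], "b": [3.0, 4.0]}))
```

The consumer never imports the producer's classes, which is the point of the protocol: any object with a conforming `__dataframe__` can be passed to `describe`.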
14 changes: 14 additions & 0 deletions modin/core/dataframe/pandas/exchange/__init__.py
@@ -0,0 +1,14 @@
# Licensed to Modin Development Team under one or more contributor license agreements.
# See the NOTICE file distributed with this work for additional information regarding
# copyright ownership. The Modin Development Team licenses this file to you under the
# Apache License, Version 2.0 (the "License"); you may not use this file except in
# compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under
# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.

"""Base Modin Dataframe functionality related to data exchange protocols and optimized for pandas storage format."""
@@ -0,0 +1,18 @@
# Licensed to Modin Development Team under one or more contributor license agreements.
# See the NOTICE file distributed with this work for additional information regarding
# copyright ownership. The Modin Development Team licenses this file to you under the
# Apache License, Version 2.0 (the "License"); you may not use this file except in
# compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under
# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.

"""
Base Modin Dataframe functionality related to the dataframe exchange protocol and optimized for pandas storage format.

See more in https://data-apis.org/dataframe-protocol/latest/index.html.
"""
116 changes: 116 additions & 0 deletions modin/core/dataframe/pandas/exchange/dataframe_protocol/buffer.py
@@ -0,0 +1,116 @@
# Licensed to Modin Development Team under one or more contributor license agreements.
# See the NOTICE file distributed with this work for additional information regarding
# copyright ownership. The Modin Development Team licenses this file to you under the
# Apache License, Version 2.0 (the "License"); you may not use this file except in
# compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under
# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.

"""
Dataframe exchange protocol implementation.

See more in https://data-apis.org/dataframe-protocol/latest/index.html.

Notes
-----
- Interpreting a raw pointer (as in ``Buffer.ptr``) is annoying and unsafe to
do in pure Python. It's more general but definitely less friendly than having
``to_arrow`` and ``to_numpy`` methods. So for the buffers which lack
``__dlpack__`` (e.g., because the column dtype isn't supported by DLPack),
this is worth looking at again.
"""

import enum
import numpy as np
from typing import Tuple

from modin.core.dataframe.base.exchange.dataframe_protocol.dataframe import (
ProtocolBuffer,
)
from modin.utils import _inherit_docstrings


@_inherit_docstrings(ProtocolBuffer)
class PandasProtocolBuffer(ProtocolBuffer):
"""
Data in the buffer is guaranteed to be contiguous in memory.

Note that there is no dtype attribute present; a buffer can be thought of
as simply a block of memory. However, if the column that the buffer is
attached to has a dtype that's supported by DLPack and ``__dlpack__`` is
implemented, then that dtype information will be contained in the return
value from ``__dlpack__``.

This distinction is useful to support both (a) data exchange via DLPack on a
buffer and (b) dtypes like variable-length strings which do not have a
fixed number of bytes per element.

Parameters
----------
x : np.ndarray
Data to be held by ``Buffer``.
allow_copy : bool, default: True
A keyword that defines whether or not the library is allowed
to make a copy of the data. For example, copying data would be necessary
if a library supports strided buffers, given that this protocol
specifies contiguous buffers. Currently, if the flag is set to ``False``
and a copy is needed, a ``RuntimeError`` will be raised.
"""

def __init__(self, x: np.ndarray, allow_copy: bool = True) -> None:
if x.strides != (x.dtype.itemsize,):
# The protocol does not support strided buffers, so a copy is
# necessary. If that's not allowed, we need to raise an exception.
if allow_copy:
x = x.copy()
else:
raise RuntimeError(
"Exports cannot be zero-copy in the case "
+ "of a non-contiguous buffer"
)

# Store the numpy array in which the data resides as a private
# attribute, so we can use it to retrieve the public attributes
self._x = x

@property
def bufsize(self) -> int:
return self._x.size * self._x.dtype.itemsize

@property
def ptr(self) -> int:
return self._x.__array_interface__["data"][0]

def __dlpack__(self):
raise NotImplementedError("__dlpack__")

def __dlpack_device__(self) -> Tuple[enum.IntEnum, int]:
class Device(enum.IntEnum):
CPU = 1

# The device ID is not meaningful for CPU memory, so it is reported as None.
return (Device.CPU, None)

def __repr__(self) -> str:
"""
Return a string representation for a particular ``PandasProtocolBuffer``.

Returns
-------
str
"""
return (
"Buffer("
+ str(
{
"bufsize": self.bufsize,
"ptr": self.ptr,
"device": self.__dlpack_device__()[0].name,
}
)
+ ")"
)
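
Two ideas in `PandasProtocolBuffer` can be tried without NumPy: the contiguity check guarded by `allow_copy`, and reading memory back from a raw `ptr` and `bufsize`. The sketch below swaps in the standard-library `memoryview`, `array` and `ctypes` modules for `np.ndarray`, so it is an analogue of the class above and not the Modin implementation. `contiguous_view` is a hypothetical helper.

```python
import array
import ctypes


def contiguous_view(mv: memoryview, allow_copy: bool = True) -> memoryview:
    """Return a contiguous 1-D view, copying only when that is permitted."""
    if mv.strides == (mv.itemsize,):
        # Already contiguous: zero-copy export is possible.
        return mv
    if not allow_copy:
        raise RuntimeError(
            "Exports cannot be zero-copy in the case of a non-contiguous buffer"
        )
    # tobytes() packs the strided elements into a fresh contiguous block,
    # which is then reinterpreted with the original element format.
    return memoryview(mv.tobytes()).cast(mv.format)


values = array.array("q", range(6))
strided = memoryview(values)[::2]  # every second element, so not contiguous
packed = contiguous_view(strided)

# Analogue of ``ptr`` and ``bufsize``: an address plus a size in bytes.
address, length = values.buffer_info()
bufsize = length * values.itemsize

# Reading through a raw address is only safe while ``values`` is alive and
# has not been resized; ctypes performs no bounds checking here.
raw = ctypes.string_at(address, bufsize)
roundtrip = array.array("q")
roundtrip.frombytes(raw)
```

The last step illustrates the caveat in the module notes: a consumer holding only `ptr` and `bufsize` has to know the element type and the owner's lifetime by other means, which is why higher-level exports such as `__dlpack__` are friendlier where they are available.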