cryptochecktool changed the title from "ENH: Add a safe Option to pandas_hash_object with Default Value Set to True" to "ENH: Add a safe Option to hash_pandas_object with Default Value Set to True" on Nov 27, 2024
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
The current implementation of hash_pandas_object is not collision resistant. The developers are aware of this, but it is not prominently documented, and the function is already widely used in downstream AI platforms such as MLflow, AutoGluon, and others. These platforms use hash_pandas_object to reduce a DataFrame to hash values and then apply MD5 or SHA-256 to the result for uniqueness checks, which enables caching and related functionality. This makes these platforms more vulnerable to maliciously crafted datasets.
Therefore, I propose adding a safe option that defaults to True. This would directly improve the security of a large number of downstream applications. If that is not adopted, the documentation should at least state explicitly that the function does not provide collision resistance and should not be used for caching or similar tasks.
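For illustration, the downstream pattern described above looks roughly like the following. This is a minimal sketch, not code taken from any of the named platforms; the helper name dataframe_digest is hypothetical.

```python
import hashlib

import pandas as pd


def dataframe_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: digest a DataFrame via hash_pandas_object + SHA-256."""
    # hash_pandas_object returns a Series of uint64 values, one per row.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    # SHA-256 only sees the 64-bit row hashes, so any collision at that
    # stage carries straight through to the final digest.
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
digest = dataframe_digest(df)
```

The outer SHA-256 call does not add collision resistance here: two DataFrames whose row hashes are identical produce the same digest regardless of the cryptographic hash applied afterwards.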
Feature Description
Alternative Solutions
Alternatively, users who need to work around this themselves can serialize the DataFrame with to_pickle() and hash the serialized bytes.
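A minimal sketch of that workaround, assuming a pandas version whose to_pickle() accepts a file-like object. The helper name pickle_digest is hypothetical.

```python
import hashlib
import io

import pandas as pd


def pickle_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: SHA-256 over the pickled bytes of a DataFrame."""
    buffer = io.BytesIO()
    # Serialize the whole DataFrame into an in-memory buffer.
    df.to_pickle(buffer)
    # Hash the full serialized content instead of 64-bit row hashes.
    return hashlib.sha256(buffer.getvalue()).hexdigest()


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
changed = pd.DataFrame({"a": [1, 2, 4], "b": ["x", "y", "z"]})

digest = pickle_digest(df)
changed_digest = pickle_digest(changed)
```

Note that pickle output can vary with the pandas version, Python version, and pickle protocol, so digests computed this way are probably only comparable within one environment.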
Additional Context
AutoGluon code:
https://github.com/autogluon/autogluon/blob/082d8bae7343f02e9dc9ce3db76bc3f305027b10/common/src/autogluon/common/utils/utils.py#L176
MLflow code:
https://github.com/mlflow/mlflow/blob/615c4cbafd616e818ff17bfcd964e8366a5cd3ed/mlflow/data/digest_utils.py#L39
graphistry code:
https://github.com/graphistry/pygraphistry/blob/52ea49afbea55291c41962f79a90d74d76c721b9/graphistry/util.py#L84
Developer discussion on pandas functionality: #16372 (comment)
Documentation link for hash_pandas_object: https://pandas.pydata.org/docs/reference/api/pandas.util.hash_pandas_object.html#pandas.util.hash_pandas_object
One demo: