
BUG: numerical inconsistency in calculating rolling std when the same data is sliced from a different beginning #60053

Open
tunkill opened this issue Oct 16, 2024 · 7 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@tunkill

tunkill commented Oct 16, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
test = pd.read_csv('std_problem.csv', index_col=0, parse_dates=True)

print(test.rolling(1000).std().iloc[-1])
data    0.0
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-35785:].rolling(1000).std().iloc[-1])
data    0.0
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-35784:].rolling(1000).std().iloc[-1])
data    1.230596
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-35781:].rolling(1000).std().iloc[-1])
data    0.959358
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-1000:].rolling(1000).std().iloc[-1])
data    0.701844
Name: 2018-01-03 08:45:00, dtype: float64
print(np.std(test.iloc[-1000:], ddof=1))
data    0.701844
dtype: float64

Issue Description

I have a data Series with a length of 93230. I want to calculate a rolling std, but I got 0 for the last window, which is almost impossible. So I checked the result using only the last 1000 values; that one matches numpy.std. It seems that starting from a particular position, the rolling std gives different results!

Expected Behavior

I expect them all to give the same result, 0.701844, no matter where the slice begins, because rolling(1000) should only use the latest 1000 numbers for the last std.
std_problem.csv

Installed Versions

I tested with pandas = 2.1.2 and pandas = 2.2.3; both have the same problem.

pd.show_versions()
INSTALLED VERSIONS

commit : a60ad39
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 6.8.0-45-generic
Version : #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.5.1
pip : 24.0
lxml.etree : 5.2.2
jinja2 : 3.1.4
IPython : 8.20.0
pandas_datareader : 0.10.0
bs4 : 4.12.3
bottleneck : 1.3.7
fsspec : 2024.6.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.1
pyarrow : 16.1.0
s3fs : 2024.6.1
scipy : 1.12.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1

INSTALLED VERSIONS

commit : 0691c5c
python : 3.11.10
python-bits : 64
OS : Linux
OS-release : 6.8.0-45-generic
Version : #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.1.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
IPython : 8.28.0
tzdata : 2024.2

@tunkill tunkill added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 16, 2024
@auderson
Contributor

Pandas uses online algorithms for some rolling functions like std, skew, etc. When your data contains extreme values, these algorithms may yield inaccurate results.
Note the location of the peaks:
[image: plot of the series with the extreme peaks marked]
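The effect above can be reproduced without the reporter's CSV. The sketch below uses synthetic data and an assumed outlier magnitude of 1e12: one extreme value early in the series can leave residual rounding error in the incrementally updated rolling sums, affecting much later windows that no longer contain it.

```python
import numpy as np
import pandas as pd

# Illustrative sketch with synthetic data (not the reporter's CSV): one
# extreme value early in the series can leave residual rounding error in
# the incrementally updated rolling sums, even for much later windows
# that no longer contain it.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=5000))
s.iloc[10] = 1e12  # extreme value, far outside the final window

rolling_last = s.rolling(100).std().iloc[-1]  # incremental algorithm
exact_last = s.iloc[-100:].std(ddof=1)        # recomputed from scratch
print(rolling_last, exact_last)               # these can differ noticeably
```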

@Harshal19t

Harshal19t commented Oct 16, 2024

The values are enormous in the stretch from 2016-07 to 2016-11 (roughly on the order of 10^7–10^10). If most of the values in a rolling window are very close to each other and much smaller than the outlier/extremum, the outlier skews the mean towards itself, so the small values have almost no impact on it; the computed variance can then become extremely small, leading to a near-zero standard deviation. It would help to understand how this function handles such outliers.

@yuanx749
Contributor

Note that there is a warning in the documentation:
https://pandas.pydata.org/docs/dev/user_guide/window.html#overview

Some windowing aggregation, mean, sum, var and std methods may suffer from numerical imprecision due to the underlying windowing algorithms accumulating sums. When values differ greatly in magnitude, this results in truncation. It must be noted that large values may have an impact on windows which do not include these values. Kahan summation is used to compute the rolling sums to preserve accuracy as much as possible.
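The truncation the documentation warns about is plain float64 behavior, shown in a minimal sketch:

```python
# Float64 truncation when magnitudes differ greatly: adding and then
# removing a huge value silently loses the small one, because 1.0 is
# below the spacing between representable doubles near 1e16.
big = 1e16
small = 1.0
print((big + small) - big)  # 0.0, not 1.0: `small` was truncated away
```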

@tunkill
Author

tunkill commented Oct 18, 2024

Pandas uses online algorithms for some rolling functions like std, skew, etc. When your data contains extreme values, these algorithms may yield inaccurate results. Note the location of the peaks.

I got it, thank you very much! But if we don't drop or modify the extreme values, is there any way to get the real std? I tried df.rolling().apply(np.std), but it's too slow for large data.

@auderson
Contributor


I got it, thank you very much! But if we don't drop or modify the extreme values, is there any way to get the real std? I tried df.rolling().apply(np.std), but it's too slow for large data.

You could try numba with engine='numba'.
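If numba is not an option, a recompute-from-scratch alternative (an editor's sketch, not from the thread) is a strided view over the array: exact per-window results, vectorized in NumPy, but with O(n × window) work, so still slower than the incremental rolling kernel.

```python
import numpy as np
import pandas as pd

# Sketch: recompute every window's std from scratch with a strided view.
# Exact (no accumulated sums), but O(n * window) work and memory traffic.
def exact_rolling_std(s: pd.Series, window: int, ddof: int = 1) -> pd.Series:
    x = s.to_numpy(dtype=np.float64)
    out = np.full(x.shape, np.nan)
    if len(x) >= window:
        views = np.lib.stride_tricks.sliding_window_view(x, window)
        out[window - 1:] = views.std(axis=1, ddof=ddof)
    return pd.Series(out, index=s.index)
```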

@tunkill
Author

tunkill commented Oct 18, 2024


You could try numba with engine='numba'.


Thank you again!
If I use engine='numba', it's faster than test.rolling().apply(np.std) but slower than test.rolling().std().
But there is another problem: I want to calculate np.std(ddof=1). If I use test.rolling(1000).apply(np.std, raw=True, engine='numba', engine_kwargs={'nopython': False}, kwargs={'ddof': 1}), the np.std parameter ddof seems to just be ignored!

print(test.rolling(1000).apply(np.std, raw=True, engine='numba', engine_kwargs={'nopython': False}, kwargs={'ddof': 1}).iloc[-1])
print(np.std(test.iloc[-1000:], ddof=0))
print(np.std(test.iloc[-1000:], ddof=1))
data 0.701493
Name: 2018-01-03 08:45:00, dtype: float64
data 0.701493
dtype: float64
data 0.701844
dtype: float64

So if nopython=True you can't pass function params, and if nopython=False the params in kwargs are ignored anyway. I wrote a function:


def std(x: np.ndarray, ddof: int = 0):
    n = len(x)
    if n == 0:
        return np.nan
    mean = np.mean(x)
    variance = np.sum((x - mean) ** 2) / (n - ddof)  # use ddof
    return np.sqrt(variance)


but ddof still can't be passed into std via kwargs; it's as if numba just ignores all the function params!

@auderson
Contributor

auderson commented Oct 18, 2024

This is a limitation for now: see #58712.
Currently you can only pass in a single-argument numba function.
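One way to live with the single-argument limitation (an editor's sketch, not from the thread) is to bake ddof into a closure, so the function handed to rolling().apply takes only the window array. The inner function uses only operations numba's nopython mode is documented to support, avoiding np.std's unsupported ddof keyword.

```python
import numpy as np

# Workaround sketch: capture ddof in a closure so the rolling function
# itself is single-argument; no kwargs need to pass through the engine.
def make_std(ddof: int):
    def std(x):
        n = x.size
        m = x.mean()
        return np.sqrt(((x - m) ** 2).sum() / (n - ddof))
    return std

# usage (requires numba): test.rolling(1000).apply(make_std(1), raw=True, engine='numba')
```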
