
BUG: numerical inconsistency in calculating rolling std when the same data is sliced from a different beginning #60053

Open
tunkill opened this issue Oct 16, 2024 · 7 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@tunkill

tunkill commented Oct 16, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
test = pd.read_csv('std_problem.csv', index_col=0, parse_dates=True)

print(test.rolling(1000).std().iloc[-1])
data    0.0
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-35785:].rolling(1000).std().iloc[-1])
data    0.0
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-35784:].rolling(1000).std().iloc[-1])
data    1.230596
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-35781:].rolling(1000).std().iloc[-1])
data    0.959358
Name: 2018-01-03 08:45:00, dtype: float64
print(test.iloc[-1000:].rolling(1000).std().iloc[-1])
data    0.701844
Name: 2018-01-03 08:45:00, dtype: float64
print(np.std(test.iloc[-1000:], ddof=1))
data    0.701844
dtype: float64

Issue Description

I have a data Series with a length of 93230. I want to calculate a rolling std, but I got 0 for the last window, which is almost impossible. So I checked the result using only the last 1000 values; that one matches numpy.std. It seems that starting from a particular position, the rolling std gives different results!

Expected Behavior

I expect them all to give the same result, 0.701844, no matter where the slice begins, because rolling(1000) should only use the latest 1000 numbers for the last std.
std_problem.csv

Installed Versions

I tested with pandas = 2.1.2 and pandas = 2.2.3; both have the same problem.

pd.show_versions()
INSTALLED VERSIONS

commit : a60ad39
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 6.8.0-45-generic
Version : #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.5.1
pip : 24.0
lxml.etree : 5.2.2
jinja2 : 3.1.4
IPython : 8.20.0
pandas_datareader : 0.10.0
bs4 : 4.12.3
bottleneck : 1.3.7
fsspec : 2024.6.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.1
pyarrow : 16.1.0
s3fs : 2024.6.1
scipy : 1.12.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1

INSTALLED VERSIONS

commit : 0691c5c
python : 3.11.10
python-bits : 64
OS : Linux
OS-release : 6.8.0-45-generic
Version : #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.1.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
IPython : 8.28.0
tzdata : 2024.2

@tunkill tunkill added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 16, 2024
@auderson
Contributor

Pandas uses online algorithms for some rolling functions like std, skew, etc. When your data contains extreme values, these algorithms may yield inaccurate results.
Note the location of the peaks:
[image: plot of the series with the extreme peaks marked]
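The effect above can be reproduced without the reporter's CSV. The sketch below uses synthetic data and an assumed outlier magnitude of 1e12: one extreme value early in the series can leave residual rounding error in the incrementally updated rolling sums, affecting much later windows that no longer contain it.

```python
import numpy as np
import pandas as pd

# Illustrative sketch with synthetic data (not the reporter's CSV): one
# extreme value early in the series can leave residual rounding error in
# the incrementally updated rolling sums, even for much later windows
# that no longer contain it.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=5000))
s.iloc[10] = 1e12  # extreme value, far outside the final window

rolling_last = s.rolling(100).std().iloc[-1]  # incremental algorithm
exact_last = s.iloc[-100:].std(ddof=1)        # recomputed from scratch
print(rolling_last, exact_last)               # these can differ noticeably
```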

@Harshal19t

Harshal19t commented Oct 16, 2024

The values are enormous in the stretch from 2016-07 to 2016-11 (roughly on the order of 10^7–10^10). If most of the values in a rolling window are very close to each other and much smaller than the outlier/extremum, the outlier skews the mean towards itself, so the small values have almost no impact on it; the computed variance can then become extremely small, leading to a near-zero standard deviation. It would help to understand how this function handles such outliers.

@yuanx749
Contributor

Note that there is a warning in the documentation:
https://pandas.pydata.org/docs/dev/user_guide/window.html#overview

Some windowing aggregation, mean, sum, var and std methods may suffer from numerical imprecision due to the underlying windowing algorithms accumulating sums. When values differ greatly in magnitude, this results in truncation. It must be noted that large values may have an impact on windows which do not include these values. Kahan summation is used to compute the rolling sums to preserve accuracy as much as possible.
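The truncation the documentation warns about is plain float64 behavior, shown in a minimal sketch:

```python
# Float64 truncation when magnitudes differ greatly: adding and then
# removing a huge value silently loses the small one, because 1.0 is
# below the spacing between representable doubles near 1e16.
big = 1e16
small = 1.0
print((big + small) - big)  # 0.0, not 1.0: `small` was truncated away
```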

@tunkill
Author

tunkill commented Oct 18, 2024

Pandas uses online algorithms for some rolling functions like std, skew, etc. When your data contains extreme values, these algorithms may yield inaccurate results. Note the location of the peaks.

I got it, thank you very much! But if we don't drop or modify the extreme values, is there any way to get the real std? I tried df.rolling().apply(np.std), but it's too slow for large data.

@auderson
Contributor


I got it, thank you very much! But if we don't drop or modify the extreme values, is there any way to get the real std? I tried df.rolling().apply(np.std), but it's too slow for large data.

You could try numba with engine='numba'.
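If numba is not an option, a recompute-from-scratch alternative (an editor's sketch, not from the thread) is a strided view over the array: exact per-window results, vectorized in NumPy, but with O(n × window) work, so still slower than the incremental rolling kernel.

```python
import numpy as np
import pandas as pd

# Sketch: recompute every window's std from scratch with a strided view.
# Exact (no accumulated sums), but O(n * window) work and memory traffic.
def exact_rolling_std(s: pd.Series, window: int, ddof: int = 1) -> pd.Series:
    x = s.to_numpy(dtype=np.float64)
    out = np.full(x.shape, np.nan)
    if len(x) >= window:
        views = np.lib.stride_tricks.sliding_window_view(x, window)
        out[window - 1:] = views.std(axis=1, ddof=ddof)
    return pd.Series(out, index=s.index)
```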

@tunkill
Author

tunkill commented Oct 18, 2024


You could try numba with engine='numba'.


Thank you again!
If I use engine='numba', it's faster than test.rolling().apply(np.std) but slower than test.rolling().std().
But there is another problem: I want to calculate np.std(ddof=1). If I use test.rolling(1000).apply(np.std, raw=True, engine='numba', engine_kwargs={'nopython': False}, kwargs={'ddof': 1}), the np.std parameter ddof seems to just be ignored!

print(test.rolling(1000).apply(np.std, raw=True, engine='numba', engine_kwargs={'nopython': False}, kwargs={'ddof': 1}).iloc[-1])
print(np.std(test.iloc[-1000:], ddof=0))
print(np.std(test.iloc[-1000:], ddof=1))
data 0.701493
Name: 2018-01-03 08:45:00, dtype: float64
data 0.701493
dtype: float64
data 0.701844
dtype: float64

So if nopython=True you can't pass function params, and if nopython=False the params in kwargs are ignored anyway. I wrote a function:


def std(x: np.ndarray, ddof: int = 0):
    n = len(x)
    if n == 0:
        return np.nan
    mean = np.mean(x)
    variance = np.sum((x - mean) ** 2) / (n - ddof)  # use ddof
    return np.sqrt(variance)


but ddof still can't be passed into std via kwargs; it's as if numba just ignores all the function params!

@auderson
Contributor

auderson commented Oct 18, 2024

This is a limitation for now: see #58712.
Currently you can only pass in a single-argument numba function.
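One way to live with the single-argument limitation (an editor's sketch, not from the thread) is to bake ddof into a closure, so the function handed to rolling().apply takes only the window array. The inner function uses only operations numba's nopython mode is documented to support, avoiding np.std's unsupported ddof keyword.

```python
import numpy as np

# Workaround sketch: capture ddof in a closure so the rolling function
# itself is single-argument; no kwargs need to pass through the engine.
def make_std(ddof: int):
    def std(x):
        n = x.size
        m = x.mean()
        return np.sqrt(((x - m) ** 2).sum() / (n - ddof))
    return std

# usage (requires numba): test.rolling(1000).apply(make_std(1), raw=True, engine='numba')
```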
