Why does pandas sum() give wrong answers for a sparse DataFrame?


In a sparse DataFrame, calling sum() on the whole DataFrame gives wrong results, while sum() applied to a single column or to a subset of the DataFrame works.

It looks like an overflow issue when sum() is applied to the whole DataFrame, since the dtype Sparse[int8, 0] is chosen for the result. However, why isn't that the case in the other two scenarios?
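As a quick check of the overflow hypothesis, the sketch below wraps the correct total from the example (24954) into a signed 8-bit integer. It uses only the standard library; the names `true_sum`, `wrapped_mod`, and `wrapped_c` are illustrative. The result matches the 122 reported by sdf.sum(axis=0), which is consistent with the total being stored as int8, although it does not prove that this is what pandas does internally.

```python
import ctypes

true_sum = 24954  # correct column total from the dense DataFrame in the example

# Two's-complement wraparound into the int8 range [-128, 127].
wrapped_mod = (true_sum + 128) % 256 - 128

# Same wraparound using a real C int8; ctypes does no overflow checking.
wrapped_c = ctypes.c_int8(true_sum).value

print(wrapped_mod, wrapped_c)  # 122 122
```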

Note: Strangely, when run in an Anaconda terminal, each scenario gives the correct result, while in PyCharm I see the error.

>>> import numpy as np
>>> import pandas as pd

>>> # Generate standard and sparse DF with binary variable.
>>> # Use int8 to minimize memory usage.
>>> df = pd.DataFrame(np.random.randint(low=0, high=2, size=(50_000, 1)))
>>> sdf = df.astype(pd.SparseDtype(dtype='int8', fill_value=0))
>>> print(df.sum(axis=0))
0    24954
dtype: int64

>>> # Why does this give a wrong answer while the other two work?
>>> print(sdf.sum(axis=0))
0    122
dtype: Sparse[int8, 0]

>>> # Works
>>> print(sdf[0].sum())
24954

>>> # Works
>>> print(sdf[sdf==1].sum())
0    24954.0
dtype: float64

Finally, what's a safe way for summing Sparse df columns without going dense or changing the dtype? I currently iterate over each column and save the sum() result in a dictionary (similar to Scenario 2 in this example), then transform to dataframe, which seems a bit cumbersome.

Corralien On BEST ANSWER

Unfortunately, I think there is probably no good answer to your question. I would rather use scipy if I had to deal with sparse matrices:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame(np.random.randint(low=0, high=2, size=(50_000, 3)))
sdf = csr_matrix(df, dtype='int8')
>>> sdf 
<50000x3 sparse matrix of type '<class 'numpy.int8'>'
    with 75298 stored elements in Compressed Sparse Row format>

>>> sdf.sum(axis=0)
matrix([[24963, 25202, 25133]])

>>> pd.DataFrame(sdf.sum(axis=0), columns=df.columns)
       0      1      2
0  24963  25202  25133

However, note the issue opened by a pandas member proposing the deprecation of SparseDtype: DEPR: SparseDtype #56518
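If the data already lives in a sparse pandas DataFrame, one possible way to sum all columns without densifying or looping over columns is to hand the frame to scipy through the DataFrame.sparse.to_coo() accessor and sum there. This is only a sketch: it assumes scipy is installed and that every column has fill_value=0 (to_coo() requires that), and the names `rng`, `coo`, and `col_sums` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild a sparse int8 frame like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(low=0, high=2, size=(50_000, 3)))
sdf = df.astype(pd.SparseDtype(dtype="int8", fill_value=0))

# Convert to a scipy COO matrix; the data stays sparse and int8.
coo = sdf.sparse.to_coo()

# scipy accumulates integer sums in a wider integer type, so int8 does not overflow here.
col_sums = pd.Series(np.asarray(coo.sum(axis=0)).ravel(), index=sdf.columns)
print(col_sums)
```

The result is an ordinary integer Series indexed by the original column labels, so it can be compared directly with df.sum(axis=0).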