Add two Pandas Series or DataFrame objects in-place?


I have a dataset where we record the electrical power demand from each individual appliance in the home. The dataset is quite large (2 years of data; 1 sample every 6 seconds; 50 appliances). The data is in a compressed HDF file.

We need to add the power demand for every appliance to get the total aggregate power demand over time. Each individual meter might have a different start and end time.

The naive approach (using a simple model of our data) is to do something like this:

LENGTH = 2**25
N = 30
cumulator = pd.Series()

for i in range(N):
    # change the index for each new_entry to mimic the fact
    # that our appliance meters have different start and end times.
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    cumulator = cumulator.add(new_entry, fill_value=0)

This works fine for small amounts of data. It also works OK with large amounts of data as long as every new_entry has exactly the same index.

But, with large amounts of data where each new_entry has a different start and end index, Python quickly gobbles up all the available RAM. I suspect this is a memory fragmentation issue. If I use multiprocessing to fire up a new process for each meter (load the meter's data from disk, load the cumulator from disk, do the addition in memory, save the cumulator back to disk, and exit the process), then the memory behaviour is fine but, of course, all that disk IO slows us down a lot.

So, I think what I want is an in-place Pandas add function. The plan would be to initialise cumulator to have an index which is the union of all the meters' indices, then allocate memory once for that cumulator. Hence no more fragmentation issues.
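The plan can be sketched like this (a toy illustration: `meter_indices` is hypothetical, and in practice the indices would come from the HDF file; `reindex` aligns each entry to the pre-allocated index so the `+=` writes into the existing buffer):

```python
import numpy as np
import pandas as pd

LENGTH = 2**10  # small, for illustration
N = 5

# Hypothetical per-meter indices; each meter starts/ends at a different time
meter_indices = [np.arange(i, LENGTH + i) for i in range(N)]

# Union of all meters' indices, allocated once
union_index = meter_indices[0]
for idx in meter_indices[1:]:
    union_index = np.union1d(union_index, idx)
cumulator = pd.Series(0.0, index=union_index)

for idx in meter_indices:
    new_entry = pd.Series(1.0, index=idx)
    # align to the pre-allocated index; missing readings count as 0
    cumulator += new_entry.reindex(union_index, fill_value=0).values
```

Whether the `+=` avoids an intermediate copy depends on the pandas version, but the cumulator's index (and hence its buffer size) never changes.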

I have tried two approaches but neither is satisfactory.

I tried using numpy.add to allow me to set the out argument:

# Allocate enough space for the cumulator
cumulator = pd.Series(0, index=np.arange(0, LENGTH+N))

for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    cumulator, aligned_new_entry = cumulator.align(new_entry, copy=False, fill_value=0)
    del new_entry
    np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values)
    del aligned_new_entry

But this gobbles up all my RAM too and doesn't seem to do the addition. If I change the penultimate line to cumulator.values = np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values) then I get an error about not being able to assign to cumulator.values.

This second approach appears to have the correct memory behaviour but is far too slow to run:

for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    for index in cumulator.index:
        try:
            cumulator[index] += new_entry[index]
        except KeyError:
            pass

I suppose I could write this function in Cython. But I'd rather not have to do that.

So: is there any way to do an 'inplace add' in Pandas?

Update

In response to comments below, here is a toy example of our meter data and the sum we want. All values are watts.

 time     meter1  meter2  meter3  sum
09:00:00    10                     10
09:00:06    10      20             30
09:00:12    10      20             30
09:00:18    10      20      30     60
09:00:24    10      20      30     60
09:00:30    10              30     40

If you want to see more details then here's the file format description of our data logger, and here's the 4TByte archive of our entire dataset.
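The toy table can be reproduced with a short sketch; `pd.concat` with `axis=1` does an outer join on the time index, which handles the differing start and end times (note 10+20+30 = 60):

```python
import pandas as pd

times = pd.date_range('09:00:00', periods=6, freq='6s')
meter1 = pd.Series(10, index=times)        # present for the whole period
meter2 = pd.Series(20, index=times[1:5])   # starts late, stops early
meter3 = pd.Series(30, index=times[3:])    # starts later still

# Outer join on the time index; NaN marks "no reading"
df = pd.concat({'meter1': meter1, 'meter2': meter2, 'meter3': meter3}, axis=1)
df['sum'] = df.sum(axis=1)  # sum skips NaN, so missing readings count as 0
```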

2 Answers


After messing around a lot with multiprocessing, I think I've found a fairly simple and efficient way to do an in-place add without using multiprocessing:

import numpy as np
import pandas as pd

LENGTH = 2**26
N = 10
DTYPE = np.int64  # np.int has since been removed from NumPy

# Allocate memory *once* for a Series which will hold our cumulator
cumulator = pd.Series(0, index=np.arange(0, N+LENGTH), dtype=DTYPE)

# Get a numpy array from the Series' buffer
# (Series.data worked on the pandas of the time; it has since been removed)
cumulator_arr = np.frombuffer(cumulator.data, dtype=DTYPE)

# Create lots of dummy data. Each new_entry has a different start
# and end index.
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i), dtype=DTYPE)
    aligned_new_entry = np.pad(new_entry.values, pad_width=((i, N-i)), 
                               mode='constant', constant_values=((0, 0)))
    # np.pad could be replaced by new_entry.reindex(index, fill_value=0)
    # but np.pad is faster and more memory efficient than reindex

    del new_entry
    np.add(cumulator_arr, aligned_new_entry, out=cumulator_arr)
    del aligned_new_entry

del cumulator_arr

print(cumulator.head(N*2))

which prints:

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    10
11    10
12    10
13    10
14    10
15    10
16    10
17    10
18    10
19    10
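On current NumPy/pandas the snippet above needs changes: np.int and Series.data no longer exist. A version-robust variant of the same trick is to allocate the NumPy buffer yourself and wrap it in a Series (a sketch; whether the Series shares the buffer can depend on pandas' copy-on-write settings, so all the in-place work here goes through the array):

```python
import numpy as np
import pandas as pd

LENGTH = 2**16   # smaller than the original, for illustration
N = 10
DTYPE = np.int64  # np.int was removed in NumPy 1.24

# Allocate the buffer ourselves *once*, then wrap it in a Series without copying
cumulator_arr = np.zeros(N + LENGTH, dtype=DTYPE)
cumulator = pd.Series(cumulator_arr, index=np.arange(0, N + LENGTH), copy=False)

for i in range(N):
    new_entry = np.ones(LENGTH, dtype=DTYPE)
    # pad each entry out to the full length instead of reindexing
    aligned = np.pad(new_entry, pad_width=(i, N - i),
                     mode='constant', constant_values=(0, 0))
    np.add(cumulator_arr, aligned, out=cumulator_arr)  # in-place add
```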

Assuming that your dataframe looks something like:

df.index.names == ['time']
df.columns == ['meter1', 'meter2', ..., 'meterN']

then all you need to do is:

df['total'] = df.sum(axis=1)

(df.sum(axis=1) skips NaN by default, so meters with no reading count as zero. Note that df.fillna(0, inplace=True) returns None, so .sum(1) cannot be chained onto it.)
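If the whole frame is too big to sum in one go, the same idea can be applied column by column into a pre-allocated accumulator, which keeps peak memory down (a sketch with hypothetical data; np.nan_to_num stands in for fillna):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows are timestamps, columns are meters, NaN = no reading
df = pd.DataFrame({'meter1': [10.0, 10.0, np.nan],
                   'meter2': [np.nan, 20.0, 20.0]})

total = np.zeros(len(df))  # allocated once
for col in df.columns:
    # treat missing readings as 0 and accumulate in place
    np.add(total, np.nan_to_num(df[col].to_numpy()), out=total)

df['total'] = total
```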