Understanding pandas HDFStore data retrieval performance


At first I thought I had run into another bug, but by the end of all my tests I was no longer sure that it's really a bug, or that I understand how to build a data processing pipeline with pandas HDF at all.

Take a seat, let's ride together. Hopefully you can clarify something for me at the end.



Preparing

import numpy as np
import pandas as pd

SIZE_0 = 10**6
SIZE_1 = 4
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
print(df.head())
          0         1         2         3
0  0.327362  0.084638  0.124322  0.116745
1  0.606545  0.484079  0.977239  0.120613
2  0.014407  0.973912  0.464409  0.959907
3  0.357551  0.641503  0.889408  0.776769
4  0.770845  0.548562  0.587054  0.569719

Putting it into the store in 2 parts

cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], format='t')  # 't' is shorthand for 'table'
    store.put('df2', df[cols2], format='t')
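
As a quick sanity check (my addition, not in the original post), we can confirm that both frames landed in the store as table-format nodes, which is what select_as_multiple requires:

# list the store's contents; both entries should show as frame_table
# since we wrote them with format 't' (= 'table')
with pd.HDFStore('test.h5') as store:
    print(store.info())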


Now to the problem. Reading the whole df from the HDFStore with select_as_multiple is much slower than with select:

%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)
(1000000, 4)
CPU times: user 24.3 s, sys: 38.6 ms, total: 24.3 s
Wall time: 24.3 s

And plain select:

%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1')
    df2 = store.select('df2')
    out = pd.concat([df1, df2], axis=1)
print (out.shape)
(1000000, 4)
CPU times: user 48.1 ms, sys: 23.9 ms, total: 72 ms
Wall time: 68.3 ms

At this point I was going to report it as a performance issue, but after some initial 'dumb-checking' (as I thought it was) I got surprising (at least for me) results.



Let's increase the number of columns and see what happens.

SIZE_1 = 8
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))

cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], format='t')
    store.put('df2', df[cols2], format='t')

Now the same select_as_multiple code gives this output:

(1000000, 8)
CPU times: user 14.7 s, sys: 87.3 ms, total: 14.8 s
Wall time: 14.8 s

Strange. We doubled the data size, but the wall time is now 10 s lower.

At the same time, retrieving the df with select runs a bit slower:

(1000000, 8)
CPU times: user 90.6 ms, sys: 27.9 ms, total: 119 ms
Wall time: 115 ms


After that I couldn't stop my curiosity and took another test shot :). This time with SIZE_1 = 16 (again, all other lines of code unchanged - I won't copy them here again for brevity's sake).

Now select_as_multiple runs even faster:

(1000000, 16)
CPU times: user 8.27 s, sys: 184 ms, total: 8.45 s
Wall time: 8.45 s

But for plain select everything is as expected - execution time increased:

(1000000, 16)
CPU times: user 181 ms, sys: 124 ms, total: 306 ms
Wall time: 302 ms

Still, at the same time, select remains much, much faster.



Finally to the questions:

1. Why does `select_as_multiple` perform so poorly compared with `select`?
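
For reference, my understanding (a simplified sketch based on skimming the pandas 0.24 HDFStore source, so treat it as an assumption rather than the actual implementation) is that select_as_multiple first resolves row coordinates on the selector table and then reads every table by those coordinates:

# rough equivalent of store.select_as_multiple(['df1', 'df2'])
# (simplified sketch, not the real pandas code path)
with pd.HDFStore('test.h5') as store:
    coords = store.select_as_coordinates('df1')  # row numbers matching where (here: all rows)
    parts = [store.select(k, where=coords) for k in ('df1', 'df2')]
    out = pd.concat(parts, axis=1)

If that's right, reading by an explicit list of coordinates instead of one contiguous slab could explain the overhead.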

By the way, this is not only a problem for selections without a where condition:

%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'], where='index < 500000')
print (out.shape)
(500000, 16)
CPU times: user 4.65 s, sys: 56.7 ms, total: 4.7 s
Wall time: 4.69 s

And for select:

%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1', where='index < 500000')
    df2 = store.select('df2', where='index < 500000')
    out = pd.concat([df1, df2], axis=1)
print (out.shape)
(500000, 16)
CPU times: user 871 ms, sys: 89 ms, total: 960 ms
Wall time: 927 ms

Still much faster. The decrease in time for select_as_multiple with this particular where is the more expected outcome, since we only have to read half the rows. But one may note that where reduces time for select_as_multiple while at the same time increasing it for select. So here is another question:


2. Why does specifying a `where` clause _reduce_ time for `select_as_multiple` while _increasing_ it for `select`?

The expected behavior would be for both to increase or both to decrease for a given where, not to move in opposite directions.
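
One diagnostic that might separate query overhead from read volume (my own addition, not part of the original tests): read the same half of the rows once with start/stop, which skips the query machinery entirely, and once with a where clause:

# same rows, two code paths: a pure positional slice vs. an index
# query that has to be evaluated first
store = pd.HDFStore('test.h5')
%timeit store.select('df1', start=0, stop=500000)    # plain slice read
%timeit store.select('df1', where='index < 500000')  # query + read
store.close()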


3. Why does increasing the data size along axis=1 _decrease_ read time for `select_as_multiple`?

We increase the data size, yet selection runs several times faster? That's strange. Maybe it's a design 'feature' that says: don't use HDF storage unless your df has a really big column count? But I can't recall anything like that from the docs. Only the opposite use case - exactly in the section on select_as_multiple, which suggests splitting your data into 'query' columns and 'others' (so decreasing the column count of the stored dfs) in order to speed up queries (a sketch of that pattern follows below).
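
For completeness, this is roughly the pattern the docs describe (my paraphrase; the file name split.h5 and the column names c0, c1, ... are made up for the example):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10**6, 16),
                  columns=['c%d' % i for i in range(16)])
with pd.HDFStore('split.h5') as store:
    # small selector table holding only the columns we query on,
    # declared as data_columns so they are usable in `where`
    store.put('df_query', df[['c0', 'c1']], format='t', data_columns=True)
    # the wide remainder
    store.put('df_rest', df[df.columns[2:]], format='t')
    # the query runs against df_query; matching rows come from both tables
    out = store.select_as_multiple(['df_query', 'df_rest'],
                                   where='c0 < 0.5', selector='df_query')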


Let's do some more tests.

SIZE_0 = 10**6 and SIZE_1 = 16:

%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)
(1000000, 16)
CPU times: user 8.39 s, sys: 232 ms, total: 8.62 s
Wall time: 8.64 s

Doubling the df size along axis=0 (SIZE_0 = 2*10**6, SIZE_1 = 16):

%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)
(2000000, 16)
CPU times: user 32.3 s, sys: 370 ms, total: 32.6 s
Wall time: 32.6 s

Versus doubling the df size along axis=1 (SIZE_0 = 10**6, SIZE_1 = 2*16):

%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)
(1000000, 32)
CPU times: user 9.05 s, sys: 384 ms, total: 9.43 s
Wall time: 9.43 s

So roughly 32 s versus 9 s.

4. Does this mean that for HDF storage it is much more efficient to append columns rather than rows?!

That is really confusing. Isn't that backwards? As far as I understand, PyTables, and hence pandas HDF, is 'row oriented'?
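
One way to look at the actual on-disk layout (my own addition): peek at the table description with PyTables. If I read it correctly, pandas packs all same-typed non-index columns into a single values_block_0 field per row, so a wider frame means one fatter field per row rather than more fields:

import tables

# inspect how pandas laid the frame out on disk; expect an 'index'
# field plus one 'values_block_0' bundling all the float columns
with tables.open_file('test.h5') as h5:
    print(h5.root.df1.table.description)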


One may have noticed from the last test that expanding the data along axis=1 does eventually increase execution time. Let's find out exactly when that starts being true:

SIZE_0 = 10**6
SIZE_1s = list(range(4, 40)) # List of SIZE_1 to iterate

# Iterating
timings = []
for SIZE_1 in SIZE_1s:
    df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))

    cols1 = list(df.columns[:SIZE_1//2])
    cols2 = list(df.columns[SIZE_1//2:])
    with pd.HDFStore('test.h5') as store:
        # Put to store
        store.put('df1', df[cols1], format='t')
        store.put('df2', df[cols2], format='t')
        # Read from store, note the time
        start = pd.Timestamp.now()
        out = store.select_as_multiple(['df1', 'df2'])
        # Appending timings
        timings.append((pd.Timestamp.now()-start).total_seconds())
# Plotting
to_plot = pd.DataFrame(timings,
                       index=pd.Index(SIZE_1s, name='column count'),
                       columns=['read time'])
_ = to_plot.plot(figsize = (14, 6),
                 title = 'DF read time from HDF store by DFs column count',
                 color = 'blue')
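
One caveat with this loop (my own note, not from the original tests): HDF5 files don't reclaim space when a node is overwritten, so re-putting into the same keys on each iteration keeps growing the file on disk. Opening the store with mode='w' recreates the file each time and rules out any effect of that on the timings:

# recreate the file on each iteration instead of overwriting nodes in place
with pd.HDFStore('test.h5', mode='w') as store:
    store.put('df1', df[cols1], format='t')
    store.put('df2', df[cols2], format='t')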

[Plot output: 'DF read time from HDF store by DFs column count' - read time falls as the column count grows, bottoming out around 23-24 columns, then rises.]

So the initial dynamic (up to 23-24 columns) is straightforward: more columns = faster reading.

5. Is 24 columns (which should be divided between the 2 dfs, i.e. about 12 columns each) some kind of design threshold? And should one only think about using an HDF store after reaching it?



Some system info:

pd.__version__
'0.24.2'
tables.__version__
'3.5.1'

I also have 24 GB of RAM installed on 64-bit Ubuntu 19.04, while the largest df used in these tests was about 300 MB, so memory shouldn't have been a problem.

EDIT: since no explanation was given, I've opened an issue.
