HDF + pandas: how can I use a where mask with multindex?

480 Views Asked by At

I would like to select from a HUGE hdf5 a subset of the data, day by day. It would be perfect to use a where mask, but I can't make it work with a multiindex (since I have to have a where with two conditions). can't use a where mask with a multiindex:

import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = (0, 1, 2)
idx = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
                            names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 7), index=idx,
              columns=('I', 'II', 'III', 'IV', 'V', 'VI', 'VII'))

enter image description here

Ok, now I put it in a hdf store

from pandas.io.pytables import HDFStore
store =HDFStore('cancella.h5', 'w')
store.append('df_mask',df)

But if I read it again I have

c = store.select_column('df_mask','index')
print c

this index which is WRONG.

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

So I can't use the where mask. Can you help me?

4

There are 4 best solutions below

0
On

This may be orthogonal to your question, but I don't see a column or an index in your DataFrame called 'index' so I don't know what you're actually selecting with

c = store.select_column('df_mask','index')

I've always used the read_hdf wrapper to deal with PyTables and have found it very flexible. The syntax for it is exceedingly flexible, e.g.

c = pd.read_hdf('/home/Jian/Downloads/temp.h5', 'df_mask', where="Alpha='A' | Alpha='B' & Int=0")

There's a firm limit to the number of conditions in your where clause, but if I've read your question correctly, that'd be a second order issue for you.

1
On

Have you tried using Pandas to_hdf (after installing pytables)?

df.to_hdf('/Users/Alexander/Downloads/hdf.h5', key='data', format='t')

>>> pd.read_hdf('/Users/Alexander/Downloads/hdf.h5', 'data', columns='index')
Empty DataFrame
Columns: []
Index: [(A, 0), (A, 1), (A, 2), (B, 0), (B, 1), (B, 2)]

Choosing the format='t' option:

"Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data"

This documentation provides details of selecting data with a where mask.

1
On

I guess this is because pandas might reset_index before putting table into HDF5 (reason might be that it tries to avoid any potential duplicated index, which is allowed in dataframe but very bad for database), and use the integer auto-incremented primary key as the index in HDF5 table. So the code you wrote in the post selects these auto-incremented primary key.

There might be some other more elegant approaches, but I found the following work. (a consequence if pandas indeed reset_index, so now that multi-level index becomes columns...)

import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = (0, 1, 2)
idx = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
                            names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 7), index=idx,
              columns=('I', 'II', 'III', 'IV', 'V', 'VI', 'VII'))
print(df)

store = pd.HDFStore('/home/Jian/Downloads/temp.h5')
store.append('df_mask',df)

store.select('df_mask', columns=['Alpha','Int'])

enter image description here

0
On

In current pandas version 0.24.2, you may use direct querying without intermediate MultiIndex retrieval:

idx = pd.MultiIndex.from_product([('A', 'B'), range(3)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx,
                  columns=('I', 'II', 'III'))
df

>>>                    I           II         III
>>> Alpha   Int             
>>>     A   0    0.547846   -1.630704    0.456686
>>>         1   -0.057442   -0.232721    0.349548
>>>         2    1.495230    0.661166    1.309862
>>>     B   0    2.116818    0.477923   -0.252352
>>>         1    1.001081    2.578723   -0.040782
>>>         2   -1.286964    0.357969    0.002976

Just use built-in where mask:

with pd.HDFStore('test.h5') as store:
    store.put('df_mask', df, format='t')
    store.select('df_mask', where = 'Alpha = A & Int in [0,2]')

>>>                         I          II        III
>>> Alpha   Int             
>>>     A     0      0.621453   -1.757883   0.494433
>>>           2     -0.689012   -1.254884   0.655424

For different functions allowed in where mask visit docs.

But if your desired filters are very complicated and you want to retrieve MultiIndex and build mask 'by hands', you may select it like so:

with pd.HDFStore('test.h5') as store:
    store.select('df_mask', columns=['index'])

>>> Alpha   Int
>>>     A     0
>>>           1
>>>           2
>>>     B     0
>>>           1
>>>           2