With pandas, every time I use the table format instead of the fixed format, my memory consumption explodes.
import numpy as np
import pandas as pd
np.random.seed(seed=10)
# 20 million rows: a string column, an integer column and a float column
df = pd.DataFrame({'ID': ['foo', 'bar'] * 10000000,
                   'ORDER': np.arange(20000000),
                   'VAL': np.random.randn(20000000)})
Case #1: fixed format
df.to_hdf('test.h5', 'df', append=False, format='fixed')
Now I read the df ten times, and it works without high memory consumption:
for a in range(10):
    df1 = pd.read_hdf('test.h5', 'df')
Case #2: table format
df.to_hdf('test.h5', 'df', append=False, format='table')
Now I read the df ten times, but memory is not released on each iteration and memory consumption gets too high:
for a in range(10):
    df1 = pd.read_hdf('test.h5', 'df')
Any suggestions?
Windows 64-bit, Python 3.4, pandas 0.15.1
Using a smaller file (n=1MM), the table format allocates about 2x the memory, then collects it. This is mainly a function of the storage format.
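If the extra allocation is eventually collected, as described above, one thing you could try is to drop each frame and force a collection before the next read. This is only a rough sketch, not a confirmed fix: the helper name `read_repeatedly` and its `read_fn` argument are made up for illustration, and whether it lowers the peak on your machine is an assumption you would need to check.

```python
import gc


def read_repeatedly(read_fn, repeats=10):
    """Call read_fn() `repeats` times, releasing each result before the next call.

    read_fn is a hypothetical zero-argument callable that returns the loaded
    object, e.g. a lambda wrapping pd.read_hdf. Returns the length of each
    result so the caller can confirm every read succeeded.
    """
    lengths = []
    for _ in range(repeats):
        df1 = read_fn()
        lengths.append(len(df1))
        # Drop the only reference, then ask the collector to run now
        # instead of waiting for its next automatic pass.
        del df1
        gc.collect()
    return lengths
```

With the file from the question this would be called as read_repeatedly(lambda: pd.read_hdf('test.h5', 'df')). Passing the reader in as a callable keeps the loop independent of the storage format, so the same loop can be run against the fixed and the table file for comparison.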