Memory leak when using hdf in table format?


Under pandas, each time I use the table format instead of the fixed format, my memory consumption explodes.

import numpy as np
import pandas as pd

np.random.seed(seed=10)
df = pd.DataFrame({'ID': ['foo', 'bar'] * 10000000,
                   'ORDER': np.arange(20000000),
                   'VAL': np.random.randn(20000000)})

case #1 : fixed format

df.to_hdf('test.h5', 'df', append=False, format='fixed')

Now I read the df ten times; it works without high memory consumption:

for a in range(10):
    df1 = pd.read_hdf('test.h5','df')

case #2 : table format

df.to_hdf('test.h5', 'df', append=False, format='table')

Now I read the df ten times; memory is not released on each iteration, and consumption gets too high:

for a in range(10):
    df1 = pd.read_hdf('test.h5','df')

Any suggestion?

Windows 64-bit, Python 3.4, pandas 0.15.1

1 Answer
Using a smaller file, n = 1MM (1 million rows):

The table format allocates about 2x the memory and then collects it. This is mainly a function of the storage format.
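For reference, here is a sketch of the test.py harness implied by the profiles below. The file names (test.h5 for fixed, test2.h5 for table) and the functions f/f2 follow the listings; the exact setup is an assumption, not the answer's original script.

```python
import gc

import numpy as np
import pandas as pd

# Build a 1MM-row frame and store it in both formats
# (column names mirror the question's example).
n = 1000000
np.random.seed(seed=10)
df = pd.DataFrame({'ID': ['foo', 'bar'] * (n // 2),
                   'ORDER': np.arange(n),
                   'VAL': np.random.randn(n)})
df.to_hdf('test.h5', 'df', append=False, format='fixed')
df.to_hdf('test2.h5', 'df', append=False, format='table')

def f():
    # read the fixed-format store, then force a collection
    pd.read_hdf('test.h5', 'df')
    gc.collect()

def f2():
    # read the table-format store, then force a collection
    pd.read_hdf('test2.h5', 'df')
    gc.collect()
```

The profiles would then be produced in IPython with `%load_ext memory_profiler` followed by `%mprun -f f f()` and `%mprun -f f2 f2()`.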

In [12]: %mprun -f f f()
Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5    115.1 MiB      0.0 MiB   def f():
     6    125.8 MiB     10.7 MiB       pd.read_hdf('test.h5','df')
     7    125.8 MiB      0.0 MiB       gc.collect()
('',)

In [13]: %mprun -f f2 f2()
Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     9    125.8 MiB      0.0 MiB   def f2():
    10    228.5 MiB    102.7 MiB       pd.read_hdf('test2.h5','df')
    11    115.0 MiB   -113.5 MiB       gc.collect()
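If peak memory is the concern, one upside of the table format is that it can be read back in chunks via `chunksize` (fixed-format stores cannot). A minimal sketch, with an illustrative chunk size:

```python
import numpy as np
import pandas as pd

# Write a small table-format store (names are illustrative).
df = pd.DataFrame({'ORDER': np.arange(1000000),
                   'VAL': np.random.randn(1000000)})
df.to_hdf('test2.h5', 'df', append=False, format='table')

# Stream it back 100000 rows at a time instead of loading
# the whole frame at once; only format='table' supports this.
total = 0
for chunk in pd.read_hdf('test2.h5', 'df', chunksize=100000):
    total += len(chunk)
print(total)  # 1000000
```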