performance of appending data into a bcolz table


I'm just getting started with the bcolz package and am running through the tutorial on ctables. Creating a table using the fromiter function, i.e.:

import bcolz
import numpy as np

N = 100*1000
ct = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N, rootdir='mydir', mode="w")

is fast, taking about 30ms on my computer (2.7GHz Core i7 with SSD storage), however the second example:

with bcolz.zeros(0, dtype="i4,f8", rootdir='mydir', mode="w") as ct:
    for i in range(N):
        ct.append((i, i**2))

is very slow (45 seconds). I can get it closer to the fromiter time by not writing to disk (i.e. removing rootdir='mydir', mode="w"), but it's still around 2 seconds.

This example performs a lot of very small appends, and I'm wondering whether that is a recommended usage pattern when one has lots of data. I can't find any hard numbers on how long these operations should take, just lots of suggestions that the library is fast.
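The cost difference is not specific to bcolz: every single-row append pays a fixed per-call overhead (and, in on-disk mode, I/O bookkeeping), so batching amortizes it. A numpy-only sketch of the same principle, growing an array one element at a time versus in large blocks (illustrative only; this does not use bcolz APIs):

    import time
    import numpy as np

    # Grow one element at a time: pays per-call overhead on every
    # iteration and reallocates the whole array each time.
    t0 = time.perf_counter()
    out = np.empty(0, dtype="i4")
    for i in range(1000):  # only 1000 iterations; the full 100k would crawl
        out = np.concatenate([out, np.array([i], dtype="i4")])
    t_small = time.perf_counter() - t0

    # Same data in ten large blocks: a handful of calls total.
    t0 = time.perf_counter()
    blocks = [np.arange(10000, dtype="i4") + 10000 * i for i in range(10)]
    out_blocked = np.concatenate(blocks)
    t_blocked = time.perf_counter() - t0

    print(len(out_blocked))  # 100000 elements from just 11 numpy calls

On a typical machine the blocked version is orders of magnitude faster per element, which matches the gap between the two bcolz snippets above.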

I tried modifying the code to write the data in blocks:

with bcolz.zeros(0, dtype="i4,f8", rootdir="mydir", mode='w') as ct:
    for i in range(10):
        ii = np.arange(10000) + 10000*i
        ct.append((ii,ii**2))

and this now takes 45ms—down to 6ms if I don't write to disk. This seems more in line with the suggested bcolz use cases I've seen.
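If the data only arrives as a Python iterator rather than as ready-made arrays, the same blocked pattern can be kept by buffering rows and flushing them as column arrays. The helper below is hypothetical (chunked_append and iter_chunks are not bcolz functions); it only assumes the table exposes an append that accepts a tuple of column arrays, as in the snippets above. A tiny stand-in table is used here so the sketch is self-contained:

    import itertools
    import numpy as np

    def iter_chunks(iterable, size):
        """Yield successive lists of up to `size` items from `iterable`."""
        it = iter(iterable)
        while True:
            chunk = list(itertools.islice(it, size))
            if not chunk:
                return
            yield chunk

    def chunked_append(table, rows, size=10000):
        """Append (int, float) row tuples to `table` in blocks of `size` rows."""
        for chunk in iter_chunks(rows, size):
            cols = list(zip(*chunk))  # transpose rows -> columns
            table.append((np.asarray(cols[0], dtype="i4"),
                          np.asarray(cols[1], dtype="f8")))

    # Stand-in for a two-column ctable: just records what was appended.
    class _FakeTable:
        def __init__(self):
            self.cols = ([], [])
        def append(self, cols):
            self.cols[0].extend(cols[0].tolist())
            self.cols[1].extend(cols[1].tolist())

    t = _FakeTable()
    chunked_append(t, ((i, i * i) for i in range(100000)))
    print(len(t.cols[0]))  # 100000

This keeps the number of append calls at N/size instead of N, which is where the 45s-to-45ms difference above comes from.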

I can't find much documentation about needing to write in blocks, so could this be something specific to my system?

1 Answer
%%timeit
with bcolz.zeros(0, dtype="i4,f8", rootdir="mydir", mode='w') as ct:
    ct.append([np.arange(10000)+10000*i for i in range(10)])

100 loops, best of 3: 3.82 ms per loop