Writing larger-than-memory data into bcolz


So I have a big tick data file (one day is 60 GB uncompressed) that I want to put into bcolz. I plan to read the file chunk by chunk and append the chunks into bcolz.

As far as I know, bcolz only supports appending columns, not rows. However, I would say tick data is more row-wise than column-wise. For instance:

    0  ACTX.IV         0  13.6316 2016-09-26 03:45:00.846     ARCA        66   
    1  ACWF.IV         0  23.9702 2016-09-26 03:45:00.846     ARCA        66   
    2  ACWV.IV         0  76.4004 2016-09-26 03:45:00.846     ARCA        66   
    3  ALTY.IV         0  15.5851 2016-09-26 03:45:00.846     ARCA        66   
    4  AMLP.IV         0  12.5845 2016-09-26 03:45:00.846     ARCA        66   
  1. Does anyone have any suggestions on how to do this?
  2. And is there any suggestion on which compression level I should choose when using bcolz? I'm more concerned about later query speed than size. (I'm asking because, as shown in the chart below, a level-1 compressed bcolz ctable actually seems to have better query speed than an uncompressed one, so my guess is that query speed is not a monotonic function of compression level.) Reference: http://nbviewer.jupyter.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb

Thanks in advance!

[chart: query speed for bcolz]

1 Answer

  1. You could do something like:

    import blaze as bz
    import bcolz
    import pandas as pd

    # write each chunk as its own on-disk ctable under a common root directory
    ds = bz.data('my_file.csv')
    for i, chunk in enumerate(bz.odo(ds, bz.chunks(pd.DataFrame),
                                     chunksize=1000000)):
        bcolz.ctable.fromdataframe(chunk,
                                   rootdir='%s/chunk_%d' % (root_dir, i),
                                   mode='w',
                                   cparams=your_compression_params)


    and then use bcolz.walk to iterate over the chunks (see the sketch below).
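
    For instance, assuming the chunks were written under a common root directory (the 'ticks_root' path below is just a placeholder, not something from the original setup):

    import bcolz

    # bcolz.walk yields every carray/ctable it finds below the given
    # directory, so each per-chunk ctable can be processed in turn
    # without loading the whole dataset into memory.
    for ct in bcolz.walk('ticks_root', classname='ctable'):
        print(ct.rootdir, len(ct))
        # e.g. pull the chunk back into pandas, or query it in place with ct.where(...)
        df = ct.todataframe()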

  2. The default (Blosc compression level 5) is appropriate in most cases. If you want to squeeze out every bit of performance, you'll have to create a sample file of around 1-2 GB from your real data and test query performance with different compression parameters; a rough benchmarking sketch follows below.
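
    A sketch of such a test, assuming df holds the 1-2 GB sample as a pandas DataFrame (the 'price' column used in the query is a made-up name for illustration):

    import time
    import bcolz

    # Write the sample with different Blosc codecs and compression levels,
    # then time a simple query on each variant and report the compression ratio.
    for cname in ('blosclz', 'lz4', 'zstd'):   # zstd availability depends on your blosc build
        for clevel in (0, 1, 5, 9):
            ct = bcolz.ctable.fromdataframe(
                df, cparams=bcolz.cparams(clevel=clevel, cname=cname))
            t0 = time.time()
            hits = sum(1 for _ in ct.where('price > 20'))   # 'price' is hypothetical
            ratio = float(ct.nbytes) / max(ct.cbytes, 1)
            print(cname, clevel, round(ratio, 2), round(time.time() - t0, 3), hits)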