In my data processing application, around 80% of the processing time is spent in the function pandas.HDFStore.put. Although there are various SO questions about similar problems, I haven't found a definitive guide on how to use HDFStore in the most efficient way.
What options do I have to reduce the writing time?
My data consists only of float64 columns and a few int columns; it may contain duplicated indices and/or column names, and it is a priori not sorted. It will be data collected over decades (at second-to-minute resolution), so the solution should be scalable.
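For reference, a newly collected chunk looks roughly like this (column names, number of columns and chunk size are made up, just to illustrate the shape of the data):

import numpy as np
import pandas as pd

# hypothetical chunk: one day of 1-second float64 measurements
idx = pd.date_range('2021-01-01', periods=86400, freq='S')
data = pd.DataFrame(np.random.randn(len(idx), 10), index=idx,
                    columns=['ch{}'.format(i) for i in range(10)])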
My basic use cases are the following:
# 1. Store creation
store = pd.HDFStore(pro['hdf_path'], complevel=7,
                    complib='blosc', fletcher32=True)

# 2. Iterative addition of new data
store.put('/table/T1', data, format='table', data_columns=True,
          append=True, index=False)

# 3. Basic queries of certain columns (I only need 'index' in 'where')
store.select('/table/T1', columns=['A', 'B', ...],
             where='index>="{}" & index<"{}"'.format(_t1, _t2))

# 4. Retrieving a tree with all tables and all column
#    names in each table (without loading the data)
for path, groups, leaves in store.walk():
    ...
    for lv in sorted(leaves):
        _item_path = '/'.join([path, lv])
        columns = store.get_node('{}/table'.format(_item_path)).description._v_names
Concretely, I would be interested in how to change the following parameters to optimize the writing time:

- 'complib' and 'complevel' in the store creation
- making the indexing more efficient (maybe by calling create_table_index only once at the end? see the sketch below)
- the parameters of store.put/store.append: I read something about index levels, like ('medium', 6); would that have an influence?
- Could I reduce the index from datetime (which is then stored as Int64) to something more efficient, e.g. by dropping milliseconds?
(Reading is not so much of a problem, as store.select with where=... is quite efficient.)
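To make the indexing points more concrete, this is roughly what I have in mind (untested sketch; 'chunks' stands for my incoming data, and the compressor, optlevel and kind values are just examples):

# try a different compressor / lower compression level at store creation
store = pd.HDFStore(pro['hdf_path'], complevel=1,
                    complib='blosc:lz4', fletcher32=True)

# write the chunks without building the table index on every append ...
for chunk in chunks:
    # drop sub-second precision before writing (is this worthwhile?)
    chunk.index = chunk.index.floor('S')
    store.put('/table/T1', chunk, format='table', data_columns=True,
              append=True, index=False)

# ... and create the index only once, after all data has been written
store.create_table_index('/table/T1', optlevel=6, kind='medium')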
Thank you for any help, it is very much appreciated!